Hollywood Movie Gross Income
Author: Nicholas Bogarin
Course Project, UC Irvine, Math 10, F22
Introduction#
This project takes an in-depth look at a dataset describing the highest-grossing Hollywood movies. I will mainly reorganize the dataset so that it is easier to understand, and then write code that helps answer questions about what drives the highest sales. The goal is a general picture of which movies sold the most, not only in the domestic market where they were created and released, but also in worldwide sales. I also want to identify which distributor released the most popular movies and predict what to expect from its future releases :)
Import Library:#
import numpy as np
import pandas as pd
import altair as alt
import plotly.express as px
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
Cleansing of the dataset#
In this section, I will clean part of the dataset and, step by step, create sub-DataFrames that help us further analyze the data and get an idea of which movies are set to be popular and financially successful. This is the main part of the project, where I use a number of pandas topics to prepare the data for the machine learning in the next section :)
Highest Hollywood Grossing Movies dataset#
df_movies = pd.read_csv("/work/archive (6)/Highest Holywood Grossing Movies.csv")
df_movies
| | Unnamed: 0 | Title | Movie Info | Distributor | Release Date | Domestic Sales (in $) | International Sales (in $) | World Sales (in $) | Genre | Movie Runtime | License |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Star Wars: Episode VII - The Force Awakens (2015) | As a new threat to the galaxy rises, Rey, a de... | Walt Disney Studios Motion Pictures | December 16, 2015 | 936662225 | 1132859475 | 2069521700 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 18 min | PG-13 |
1 | 1 | Avengers: Endgame (2019) | After the devastating events of Avengers: Infi... | Walt Disney Studios Motion Pictures | April 24, 2019 | 858373000 | 1939128328 | 2797501328 | ['Action', 'Adventure', 'Drama', 'Sci-Fi'] | 3 hr 1 min | PG-13 |
2 | 2 | Avatar (2009) | A paraplegic Marine dispatched to the moon Pan... | Twentieth Century Fox | December 16, 2009 | 760507625 | 2086738578 | 2847246203 | ['Action', 'Adventure', 'Fantasy', 'Sci-Fi'] | 2 hr 42 min | PG-13 |
3 | 3 | Black Panther (2018) | T'Challa, heir to the hidden but advanced king... | Walt Disney Studios Motion Pictures | NaN | 700426566 | 647171407 | 1347597973 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 14 min | NaN |
4 | 4 | Avengers: Infinity War (2018) | The Avengers and their allies must be willing ... | Walt Disney Studios Motion Pictures | NaN | 678815482 | 1369544272 | 2048359754 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 29 min | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
913 | 913 | The Notebook (2004) | A poor yet passionate young man falls in love ... | New Line Cinema | June 25, 2004 | 81001787 | 36813370 | 117815157 | ['Drama', 'Romance'] | 2 hr 3 min | PG-13 |
914 | 914 | Jimmy Neutron: Boy Genius (2001) | An eight-year-old boy genius and his friends m... | Paramount Pictures | December 21, 2001 | 80936232 | 22056304 | 102992536 | ['Action', 'Adventure', 'Animation', 'Comedy',... | 1 hr 22 min | NaN |
915 | 915 | Eat Pray Love (2010) | A married woman realizes how unhappy her marri... | Sony Pictures Entertainment (SPE) | August 13, 2010 | 80574010 | 124020006 | 204594016 | ['Biography', 'Drama', 'Romance'] | 2 hr 13 min | PG-13 |
916 | 916 | The Texas Chainsaw Massacre (2003) | After picking up a traumatized young hitchhike... | New Line Cinema | October 17, 2003 | 80571655 | 26792250 | 107363905 | ['Crime', 'Horror'] | 1 hr 38 min | R |
917 | 917 | Zookeeper (2011) | A group of zoo animals decide to break their c... | Sony Pictures Entertainment (SPE) | July 6, 2011 | 80360843 | 89491916 | 169852759 | ['Comedy', 'Family', 'Fantasy', 'Romance'] | 1 hr 42 min | PG |
918 rows × 11 columns
df_movies.shape
(918, 11)
This dataset is from Kaggle and contains many of the highest-grossing Hollywood movies of the past century. It has 918 rows and 11 columns describing each movie: its info, distributor, release date, genre, runtime, license, and its domestic, international, and world sales. As we can see, some values are missing and some columns are not useful, so we are going to remove them to make the dataset cleaner and shorter. (I will not remove the rows with missing values yet, because it will be cleaner to do that later.) :)
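Before dropping anything, it can help to count the missing entries per column. The sketch below uses a small hypothetical frame (`toy`, with made-up rows that only borrow three of the column names above) rather than the real CSV, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_movies: made-up rows, real column names.
toy = pd.DataFrame({
    "Title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "Release Date": ["December 16, 2015", np.nan, "June 25, 2004", np.nan],
    "License": ["PG-13", np.nan, "PG-13", "R"],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Keep only the rows that have no missing value in any column.
toy_complete = toy.dropna(axis=0)
print(toy_complete.shape)
```

Running the same two calls on `df_movies` would show which columns are responsible for the rows that get dropped later.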
# Drop the leftover index column and the free-text description
df_movies = df_movies.drop(columns=['Unnamed: 0', 'Movie Info'])
df_movies
| | Title | Distributor | Release Date | Domestic Sales (in $) | International Sales (in $) | World Sales (in $) | Genre | Movie Runtime | License |
---|---|---|---|---|---|---|---|---|---|
0 | Star Wars: Episode VII - The Force Awakens (2015) | Walt Disney Studios Motion Pictures | December 16, 2015 | 936662225 | 1132859475 | 2069521700 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 18 min | PG-13 |
1 | Avengers: Endgame (2019) | Walt Disney Studios Motion Pictures | April 24, 2019 | 858373000 | 1939128328 | 2797501328 | ['Action', 'Adventure', 'Drama', 'Sci-Fi'] | 3 hr 1 min | PG-13 |
2 | Avatar (2009) | Twentieth Century Fox | December 16, 2009 | 760507625 | 2086738578 | 2847246203 | ['Action', 'Adventure', 'Fantasy', 'Sci-Fi'] | 2 hr 42 min | PG-13 |
3 | Black Panther (2018) | Walt Disney Studios Motion Pictures | NaN | 700426566 | 647171407 | 1347597973 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 14 min | NaN |
4 | Avengers: Infinity War (2018) | Walt Disney Studios Motion Pictures | NaN | 678815482 | 1369544272 | 2048359754 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 29 min | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
913 | The Notebook (2004) | New Line Cinema | June 25, 2004 | 81001787 | 36813370 | 117815157 | ['Drama', 'Romance'] | 2 hr 3 min | PG-13 |
914 | Jimmy Neutron: Boy Genius (2001) | Paramount Pictures | December 21, 2001 | 80936232 | 22056304 | 102992536 | ['Action', 'Adventure', 'Animation', 'Comedy',... | 1 hr 22 min | NaN |
915 | Eat Pray Love (2010) | Sony Pictures Entertainment (SPE) | August 13, 2010 | 80574010 | 124020006 | 204594016 | ['Biography', 'Drama', 'Romance'] | 2 hr 13 min | PG-13 |
916 | The Texas Chainsaw Massacre (2003) | New Line Cinema | October 17, 2003 | 80571655 | 26792250 | 107363905 | ['Crime', 'Horror'] | 1 hr 38 min | R |
917 | Zookeeper (2011) | Sony Pictures Entertainment (SPE) | July 6, 2011 | 80360843 | 89491916 | 169852759 | ['Comedy', 'Family', 'Fantasy', 'Romance'] | 1 hr 42 min | PG |
918 rows × 9 columns
Now the dataset has 9 columns, since we removed the unnecessary index column and the movie information, which we could not graph or otherwise use. In the Genre column, all of a movie's genres are combined in a single list-like string. It will be more useful to split that string so that we can categorize movies by genre and see which genre tends to sell best :)
df_movies["Genre"] = df_movies["Genre"].astype(str)
df_movies['Genre']
0 ['Action', 'Adventure', 'Sci-Fi']
1 ['Action', 'Adventure', 'Drama', 'Sci-Fi']
2 ['Action', 'Adventure', 'Fantasy', 'Sci-Fi']
3 ['Action', 'Adventure', 'Sci-Fi']
4 ['Action', 'Adventure', 'Sci-Fi']
...
913 ['Drama', 'Romance']
914 ['Action', 'Adventure', 'Animation', 'Comedy',...
915 ['Biography', 'Drama', 'Romance']
916 ['Crime', 'Horror']
917 ['Comedy', 'Family', 'Fantasy', 'Romance']
Name: Genre, Length: 918, dtype: object
df_genres =pd.DataFrame(df_movies["Genre"].str.split(',', expand=True).values,
columns=['Genre1', 'Genre2','Genre3','Genre4','Genre5','Genre6', 'Genre7','Genre8'])
df_genres
| | Genre1 | Genre2 | Genre3 | Genre4 | Genre5 | Genre6 | Genre7 | Genre8 |
---|---|---|---|---|---|---|---|---|
0 | ['Action' | 'Adventure' | 'Sci-Fi'] | None | None | None | None | None |
1 | ['Action' | 'Adventure' | 'Drama' | 'Sci-Fi'] | None | None | None | None |
2 | ['Action' | 'Adventure' | 'Fantasy' | 'Sci-Fi'] | None | None | None | None |
3 | ['Action' | 'Adventure' | 'Sci-Fi'] | None | None | None | None | None |
4 | ['Action' | 'Adventure' | 'Sci-Fi'] | None | None | None | None | None |
... | ... | ... | ... | ... | ... | ... | ... | ... |
913 | ['Drama' | 'Romance'] | None | None | None | None | None | None |
914 | ['Action' | 'Adventure' | 'Animation' | 'Comedy' | 'Family' | 'Sci-Fi'] | None | None |
915 | ['Biography' | 'Drama' | 'Romance'] | None | None | None | None | None |
916 | ['Crime' | 'Horror'] | None | None | None | None | None | None |
917 | ['Comedy' | 'Family' | 'Fantasy' | 'Romance'] | None | None | None | None |
918 rows × 8 columns
df_genres = df_genres.fillna("")
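Splitting on commas leaves the brackets and quotes attached to each piece, as the table above shows. One possible alternative, sketched here on a hypothetical two-row series rather than the real column, is to parse each string back into a Python list with `ast.literal_eval` and then take the first entry. This is only an option and is not what the rest of this notebook does.

```python
import ast

import pandas as pd

# Hypothetical sample of the "Genre" column: each entry is a string that looks like a list.
genre_strings = pd.Series([
    "['Action', 'Adventure', 'Sci-Fi']",
    "['Drama', 'Romance']",
])

# Parse each string into a real Python list of genre names.
genre_lists = genre_strings.apply(ast.literal_eval)

# The first listed genre, with no brackets or quotes left over.
first_genre = genre_lists.apply(lambda genres: genres[0])
print(first_genre.tolist())

# One row per (movie, genre) pair, useful for counting every genre and not only the first.
all_genres = genre_lists.explode()
print(all_genres.value_counts())
```

With this approach no later clean-up of values such as `['Action'` is needed, because the genre names never carry the list punctuation.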
Now that we have a DataFrame that separates each movie's genres into separate columns, we are going to combine it with the movies dataset so that they can be graphed together. Many of the columns, for example Genre5 or Genre7, are empty for most movies, so we are mainly going to focus on Genre1, since every movie has at least one genre. We also remove all rows that contain missing values in the movie DataFrame :)
df_movies1 = pd.concat([df_movies,df_genres], axis= 1)
df_movies1
| | Title | Distributor | Release Date | Domestic Sales (in $) | International Sales (in $) | World Sales (in $) | Genre | Movie Runtime | License | Genre1 | Genre2 | Genre3 | Genre4 | Genre5 | Genre6 | Genre7 | Genre8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Star Wars: Episode VII - The Force Awakens (2015) | Walt Disney Studios Motion Pictures | December 16, 2015 | 936662225 | 1132859475 | 2069521700 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 18 min | PG-13 | ['Action' | 'Adventure' | 'Sci-Fi'] | |||||
1 | Avengers: Endgame (2019) | Walt Disney Studios Motion Pictures | April 24, 2019 | 858373000 | 1939128328 | 2797501328 | ['Action', 'Adventure', 'Drama', 'Sci-Fi'] | 3 hr 1 min | PG-13 | ['Action' | 'Adventure' | 'Drama' | 'Sci-Fi'] | ||||
2 | Avatar (2009) | Twentieth Century Fox | December 16, 2009 | 760507625 | 2086738578 | 2847246203 | ['Action', 'Adventure', 'Fantasy', 'Sci-Fi'] | 2 hr 42 min | PG-13 | ['Action' | 'Adventure' | 'Fantasy' | 'Sci-Fi'] | ||||
3 | Black Panther (2018) | Walt Disney Studios Motion Pictures | NaN | 700426566 | 647171407 | 1347597973 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 14 min | NaN | ['Action' | 'Adventure' | 'Sci-Fi'] | |||||
4 | Avengers: Infinity War (2018) | Walt Disney Studios Motion Pictures | NaN | 678815482 | 1369544272 | 2048359754 | ['Action', 'Adventure', 'Sci-Fi'] | 2 hr 29 min | NaN | ['Action' | 'Adventure' | 'Sci-Fi'] | |||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
913 | The Notebook (2004) | New Line Cinema | June 25, 2004 | 81001787 | 36813370 | 117815157 | ['Drama', 'Romance'] | 2 hr 3 min | PG-13 | ['Drama' | 'Romance'] | ||||||
914 | Jimmy Neutron: Boy Genius (2001) | Paramount Pictures | December 21, 2001 | 80936232 | 22056304 | 102992536 | ['Action', 'Adventure', 'Animation', 'Comedy',... | 1 hr 22 min | NaN | ['Action' | 'Adventure' | 'Animation' | 'Comedy' | 'Family' | 'Sci-Fi'] | ||
915 | Eat Pray Love (2010) | Sony Pictures Entertainment (SPE) | August 13, 2010 | 80574010 | 124020006 | 204594016 | ['Biography', 'Drama', 'Romance'] | 2 hr 13 min | PG-13 | ['Biography' | 'Drama' | 'Romance'] | |||||
916 | The Texas Chainsaw Massacre (2003) | New Line Cinema | October 17, 2003 | 80571655 | 26792250 | 107363905 | ['Crime', 'Horror'] | 1 hr 38 min | R | ['Crime' | 'Horror'] | ||||||
917 | Zookeeper (2011) | Sony Pictures Entertainment (SPE) | July 6, 2011 | 80360843 | 89491916 | 169852759 | ['Comedy', 'Family', 'Fantasy', 'Romance'] | 1 hr 42 min | PG | ['Comedy' | 'Family' | 'Fantasy' | 'Romance'] |
918 rows × 17 columns
df_movies1 = df_movies1.dropna(axis = 0)
df_movies1 = df_movies1.drop('Genre', axis = 1)
df_movies1
| | Title | Distributor | Release Date | Domestic Sales (in $) | International Sales (in $) | World Sales (in $) | Movie Runtime | License | Genre1 | Genre2 | Genre3 | Genre4 | Genre5 | Genre6 | Genre7 | Genre8 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Star Wars: Episode VII - The Force Awakens (2015) | Walt Disney Studios Motion Pictures | December 16, 2015 | 936662225 | 1132859475 | 2069521700 | 2 hr 18 min | PG-13 | ['Action' | 'Adventure' | 'Sci-Fi'] | |||||
1 | Avengers: Endgame (2019) | Walt Disney Studios Motion Pictures | April 24, 2019 | 858373000 | 1939128328 | 2797501328 | 3 hr 1 min | PG-13 | ['Action' | 'Adventure' | 'Drama' | 'Sci-Fi'] | ||||
2 | Avatar (2009) | Twentieth Century Fox | December 16, 2009 | 760507625 | 2086738578 | 2847246203 | 2 hr 42 min | PG-13 | ['Action' | 'Adventure' | 'Fantasy' | 'Sci-Fi'] | ||||
6 | Titanic (1997) | Paramount Pictures | December 19, 1997 | 659363944 | 1542283320 | 2201647264 | 3 hr 14 min | PG-13 | ['Drama' | 'Romance'] | ||||||
7 | Jurassic World (2015) | Universal Pictures | June 10, 2015 | 652385625 | 1018130819 | 1670516444 | 2 hr 4 min | PG-13 | ['Action' | 'Adventure' | 'Sci-Fi'] | |||||
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
911 | While You Were Sleeping (1995) | Walt Disney Studios Motion Pictures | April 21, 1995 | 81057016 | 101000000 | 182057016 | 1 hr 43 min | PG | ['Comedy' | 'Drama' | 'Romance'] | |||||
913 | The Notebook (2004) | New Line Cinema | June 25, 2004 | 81001787 | 36813370 | 117815157 | 2 hr 3 min | PG-13 | ['Drama' | 'Romance'] | ||||||
915 | Eat Pray Love (2010) | Sony Pictures Entertainment (SPE) | August 13, 2010 | 80574010 | 124020006 | 204594016 | 2 hr 13 min | PG-13 | ['Biography' | 'Drama' | 'Romance'] | |||||
916 | The Texas Chainsaw Massacre (2003) | New Line Cinema | October 17, 2003 | 80571655 | 26792250 | 107363905 | 1 hr 38 min | R | ['Crime' | 'Horror'] | ||||||
917 | Zookeeper (2011) | Sony Pictures Entertainment (SPE) | July 6, 2011 | 80360843 | 89491916 | 169852759 | 1 hr 42 min | PG | ['Comedy' | 'Family' | 'Fantasy' | 'Romance'] |
744 rows × 16 columns
df_movies1["Genre1"].value_counts()
['Action' 343
['Adventure' 128
['Comedy' 103
['Drama' 47
['Comedy'] 31
['Biography' 30
['Crime' 26
['Horror' 21
['Drama'] 5
['Fantasy' 3
['Animation' 3
['Mystery' 2
['Horror'] 1
['Documentary' 1
Name: Genre1, dtype: int64
# Strip the leftover brackets and quotes, so "['Action'" and "['Comedy']" become "Action" and "Comedy"
df_movies1["Genre1"] = df_movies1["Genre1"].str.strip("[]'")
df_movies1["Genre1"].value_counts()
Action 343
Comedy 134
Adventure 128
Drama 52
Biography 30
Crime 26
Horror 22
Animation 3
Fantasy 3
Mystery 2
Documentary 1
Name: Genre1, dtype: int64
df_movies1 = df_movies1.drop(['Genre2','Genre3','Genre4','Genre5','Genre6','Genre7','Genre8'], axis = 1)
df_movies1.shape
(744, 9)
Now, with this shortened sub-DataFrame, we can dive into the details and visualizations to learn how each film relates to its distributor and its sales :)
Graphing the Dataset#
Visualizing which Distributor will have the highest World Sale Earnings#
Here we first use Altair to compare three charts that show, for each distributor, the mean domestic, international, and world sales in dollars. After identifying which distributor has the highest gross income across its movies, we will predict the world sales of the movies it has released to anticipate future results :)
c = alt.Chart(df_movies1).mark_bar().encode(
x= alt.X("Distributor", scale=alt.Scale(zero=False)),
y="mean(Domestic Sales (in $))",
color=alt.Color("License", scale=alt.Scale(scheme="redpurple")),
tooltip = ['Title', "Movie Runtime"]
)
c1 = alt.Chart(df_movies1).mark_bar().encode(
x= alt.X("Distributor", scale=alt.Scale(zero=False)),
y="mean(International Sales (in $))",
color=alt.Color("License", scale=alt.Scale(scheme="redpurple")),
tooltip = ['Title', "Movie Runtime"]
)
c2 =alt.Chart(df_movies1).mark_bar().encode(
x= alt.X("Distributor", scale=alt.Scale(zero=False)),
y="mean(World Sales (in $))",
color=alt.Color("License", scale=alt.Scale(scheme="redpurple")),
tooltip = ['Title',"Movie Runtime"]
)
alt.hconcat(c,c1,c2)
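The per-distributor means that these bar charts are meant to show can also be computed directly with `groupby`, which gives a numeric cross-check of what the bars display. The frame below is a hypothetical four-row example with made-up sales figures; only the column names are taken from the dataset, and world sales are assumed to be the sum of the other two columns.

```python
import pandas as pd

# Hypothetical rows: the studio names and sales figures are made up.
toy = pd.DataFrame({
    "Distributor": ["Studio A", "Studio A", "Studio B", "Studio B"],
    "Domestic Sales (in $)": [100, 300, 50, 150],
    "International Sales (in $)": [200, 400, 150, 50],
})
# Assumption for this sketch: world sales = domestic + international.
toy["World Sales (in $)"] = (
    toy["Domestic Sales (in $)"] + toy["International Sales (in $)"]
)

sales_cols = [
    "Domestic Sales (in $)",
    "International Sales (in $)",
    "World Sales (in $)",
]

# Mean of each sales column per distributor, highest mean world sales first.
mean_sales = (
    toy.groupby("Distributor")[sales_cols]
    .mean()
    .sort_values("World Sales (in $)", ascending=False)
)
print(mean_sales)
```

Applying the same `groupby` to `df_movies1` would give one row per distributor, which is easier to rank than three side-by-side charts.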
As we see in the charts, three or more distributors have high values, and there is a large gap between licenses: PG-13 movies have higher sales than movies with any other rating. Because each type of sale is shown in a separate chart, it can be hard to see which values are highest. So, for my extra component, I used Plotly in this section to show the values in 3D :)
import plotly.express as px
fig = px.scatter_3d(df_movies1, x='Domestic Sales (in $)', y='International Sales (in $)', z='World Sales (in $)',
symbol='Genre1', color = "Distributor")
fig.show()
After seeing this diagram, we can go further in depth. I want to explore World Sales, since it is the largest of the three sales figures for each movie :)
fig = px.bar(df_movies1, x="Genre1", y="World Sales (in $)", color="Distributor", title="Genre 1")
fig.show()
fig = px.sunburst(df_movies1, path=['Distributor', 'Genre1',"Title"], values='World Sales (in $)', color= "World Sales (in $)")
fig.show()
From both graphs, the bar chart and the sunburst, we can see that the most common genre in this dataset is Action, and it contains almost all of the highest-grossing films. More importantly, these graphs show that Walt Disney accounts for the majority of the high-grossing movies, so we will look more closely at Disney as a distributor and predict the outcomes of its films :)
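The two observations read off the graphs (which genre is most common, and how much of the total one distributor accounts for) can be put into numbers with a short aggregation. This is a sketch on a hypothetical four-row frame with made-up studios and figures, not a result from the real data.

```python
import pandas as pd

# Hypothetical rows: studio names and sales figures are made up.
toy = pd.DataFrame({
    "Distributor": ["Studio A", "Studio A", "Studio B", "Studio C"],
    "Genre1": ["Action", "Action", "Comedy", "Action"],
    "World Sales (in $)": [500, 300, 100, 100],
})

# Number of movies and total world sales for each first-listed genre.
by_genre = toy.groupby("Genre1")["World Sales (in $)"].agg(["count", "sum"])
print(by_genre)

# Each distributor's share of the total world sales.
total_world = toy["World Sales (in $)"].sum()
share = toy.groupby("Distributor")["World Sales (in $)"].sum() / total_world
print(share.sort_values(ascending=False))
```

On `df_movies1`, the `share` series would state directly how large a fraction of world sales belongs to each distributor.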
Disney Distributor World Sales#
df_Disney = df_movies1[df_movies1["Distributor"] == "Walt Disney Studios Motion Pictures"].copy()
df_Disney
| | Title | Distributor | Release Date | Domestic Sales (in $) | International Sales (in $) | World Sales (in $) | Movie Runtime | License | Genre1 |
---|---|---|---|---|---|---|---|---|---|
0 | Star Wars: Episode VII - The Force Awakens (2015) | Walt Disney Studios Motion Pictures | December 16, 2015 | 936662225 | 1132859475 | 2069521700 | 2 hr 18 min | PG-13 | Action |
1 | Avengers: Endgame (2019) | Walt Disney Studios Motion Pictures | April 24, 2019 | 858373000 | 1939128328 | 2797501328 | 3 hr 1 min | PG-13 | Action |
8 | The Avengers (2012) | Walt Disney Studios Motion Pictures | April 25, 2012 | 623357910 | 895457605 | 1518815515 | 2 hr 23 min | PG-13 | Action |
9 | Star Wars: Episode VIII - The Last Jedi (2017) | Walt Disney Studios Motion Pictures | December 13, 2017 | 620181382 | 712517448 | 1332698830 | 2 hr 32 min | PG-13 | Action |
11 | The Lion King (2019) | Walt Disney Studios Motion Pictures | July 11, 2019 | 543638043 | 1119261396 | 1662899439 | 1 hr 58 min | PG | Adventure |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
862 | Atlantis: The Lost Empire (2001) | Walt Disney Studios Motion Pictures | June 8, 2001 | 84056472 | 101997253 | 186053725 | 1 hr 35 min | PG | Action |
877 | Saving Mr. Banks (2013) | Walt Disney Studios Motion Pictures | November 29, 2013 | 83301580 | 34566404 | 117867984 | 2 hr 5 min | PG-13 | Biography |
902 | Eight Below (2006) | Walt Disney Studios Motion Pictures | February 17, 2006 | 81612565 | 38843429 | 120455994 | 2 hr | PG | Adventure |
909 | Snow Dogs (2002) | Walt Disney Studios Motion Pictures | January 18, 2002 | 81172560 | 33862530 | 115035090 | 1 hr 39 min | PG | Adventure |
911 | While You Were Sleeping (1995) | Walt Disney Studios Motion Pictures | April 21, 1995 | 81057016 | 101000000 | 182057016 | 1 hr 43 min | PG | Comedy |
98 rows × 9 columns
Here I followed a student project example, listed in the References, to predict World Sales for Disney films using linear regression :)
cols = [c for c in df_Disney.columns if is_numeric_dtype(df_Disney[c])]
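# Convert the date strings to integer timestamps; "int64" avoids platform-dependent integer sizes
df_Disney["Release Date"] = pd.to_datetime(df_Disney["Release Date"]).astype("int64")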
reg = LinearRegression()
cols=['Domestic Sales (in $)','International Sales (in $)','World Sales (in $)']
reg.fit(df_Disney[cols],df_Disney["World Sales (in $)"])
pd.Series(reg.coef_,index=cols)
Domestic Sales (in $) 0.333333
International Sales (in $) 0.333333
World Sales (in $) 0.666667
dtype: float64
reg.coef_
array([0.33333333, 0.33333333, 0.66666667])
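The coefficients 1/3, 1/3 and 2/3 printed above have a simple explanation when world sales equal domestic plus international sales, as they do in the rows displayed earlier: the predictors then include the target itself, many coefficient vectors fit exactly, and a least-squares solver typically returns the one with the smallest norm. The sketch below illustrates this with NumPy's `np.linalg.lstsq` on made-up numbers and without an intercept; it is a stand-in for illustration, not a reproduction of the scikit-learn fit.

```python
import numpy as np

# Made-up sales figures (hypothetical, not taken from the dataset).
domestic = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
international = np.array([2.0, 1.0, 4.0, 3.0, 9.0])
# Assumption: world sales are exactly domestic + international.
world = domestic + international

# The predictors include the target column itself, as in the fit above.
X = np.column_stack([domestic, international, world])

# The third column is the sum of the first two, so X has rank 2 and
# infinitely many coefficient vectors fit exactly. lstsq returns the
# minimum-norm one; rcond discards the numerically zero singular value.
coef, _, rank, _ = np.linalg.lstsq(X, world, rcond=1e-10)
print(rank)
print(coef)

# Check that (1/3)*domestic + (1/3)*international + (2/3)*world gives world back.
reconstructed = domestic / 3 + international / 3 + 2 * world / 3
print(np.allclose(reconstructed, world))
```

Read this way, the coefficients restate the identity between the three sales columns and say nothing about movies that have not been released yet.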
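Here we see the coefficients of the fitted model for all three types of sales. Note that World Sales is used both as a predictor and as the target, and in the rows shown above it equals Domestic plus International Sales, so the model can reproduce it exactly: (1/3) * Domestic + (1/3) * International + (2/3) * World = World. The coefficients therefore describe this identity rather than a forecast. We are going to incorporate the predictions into our graph and use only Domestic and World Sales, to show how the two differ and how much World Sales adds to a film's income :)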
df_Disney["Pred"] = reg.predict(df_Disney[cols])
c_train,c_test,d_train,d_test = train_test_split(df_Disney[cols],df_Disney["World Sales (in $)"],test_size = 0.5, random_state=10)
reg.fit(c_train,d_train)
pred= reg.predict(c_train)
reg.score(c_train, d_train)
1.0
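The score above is computed on the same rows the model was trained on. A common way to get a more informative number is to score on the held-out half and to leave the target out of the predictors. The sketch below does this on synthetic data (made-up figures in millions of dollars, generated with a fixed seed) with domestic sales as the only predictor; the variable names are illustrative and nothing here comes from the Disney rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): sales in millions of dollars.
domestic = rng.uniform(80, 900, size=100)
international = 1.2 * domestic + rng.normal(0, 50, size=100)
world = domestic + international

X = domestic.reshape(-1, 1)  # single predictor that does not contain the target
y = world

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=10
)

model = LinearRegression().fit(X_train, y_train)

# R^2 on the rows used for fitting, and on the rows the model has not seen.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```

A test score that stays close to the training score suggests the fit generalizes; a score of exactly 1.0 on real data is usually a sign that the target has leaked into the predictors.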
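# Restrict to numeric columns so that corr() does not fail on text columns in newer pandas versions
corr = df_Disney.select_dtypes("number").corr()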
corr.sort_values(["World Sales (in $)"], ascending = False, inplace = True)
print(corr.Pred)
World Sales (in $) 1.000000
Pred 1.000000
International Sales (in $) 0.986928
Domestic Sales (in $) 0.951843
Release Date 0.486140
Name: Pred, dtype: float64
Prediction Graph#
sel = alt.selection_single(fields=["Genre1"], empty="none")
base = alt.Chart(df_Disney).mark_circle().encode(
x="Domestic Sales (in $)",
y="World Sales (in $)",
tooltip=["Title", "Movie Runtime"],
size=alt.condition(sel, alt.value(80),alt.value(20)),
color=alt.Color("Genre1", scale=alt.Scale(scheme="Paired")),
opacity=alt.condition(sel, alt.value(1), alt.value(0.5))
).add_selection(sel)
text = alt.Chart(df_Disney).mark_text(y=20, size=20).encode(
text="Genre1",
opacity=alt.condition(sel, alt.value(1), alt.value(0))
)
n1=alt.Chart(df_Disney).mark_line(color="lightgray").encode(
x='Domestic Sales (in $)',
y='Pred',
)
n1
n = base+text
n+n1
fig = px.sunburst(df_Disney, path=['Genre1',"Title"], values='World Sales (in $)', color= "World Sales (in $)")
fig.show()
Summary#
Overall, I found a dataset on Kaggle with information about Hollywood's highest-grossing movies and explored in depth what makes a movie popular and gives it the highest gross sales. Whether it is as simple as choosing a genre or making the movie PG-13 rather than PG, or as complex as choosing a distributor to boost credibility, we were able to identify these factors and get an understanding of basic movie preferences. We were also able to predict values for the domestic and world sales of Walt Disney's films.
References#
Plotly 3D Scatter Plot: [https://plotly.com/python/3d-scatter-plots/]
Plotly Bar Chart: [https://plotly.com/python/bar-charts/]
Plotly Sun Chart: [https://plotly.com/python/sunburst-charts/]
What is the source of your dataset(s)?
Kaggle: [https://www.kaggle.com/code/yasirnikozaiofficial/highest-grossing-movies-of-hollywood-eda/data]
List any other references that you found helpful.
Student Example: [https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/KehanLi.html]