Movie’s Gross¶
Author: Dana Albakri
Course Project, UC Irvine, Math 10, W22
Introduction¶
In this project we explore the Movies dataset, focusing on Rotten Tomatoes ratings with respect to the highest gross. We use both the US gross and the worldwide gross to get a better sense of how popular these movies are around the world. After determining which Rotten Tomatoes ratings correspond to the biggest gross, we will be able to see whether the rating affects gross differently for the US versus the world.
Main portion of the project¶
import numpy as np
import pandas as pd
import vega_datasets
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("/work/movies.csv", na_values=" ")
df
| | Unnamed: 0 | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | The Land Girls | 146083.0 | 146083.0 | NaN | 8000000.0 | Jun 12 1998 | R | NaN | Gramercy | NaN | NaN | NaN | NaN | NaN | 6.1 | 1071.0 |
1 | 1 | First Love, Last Rites | 10876.0 | 10876.0 | NaN | 300000.0 | Aug 07 1998 | R | NaN | Strand | NaN | Drama | NaN | NaN | NaN | 6.9 | 207.0 |
2 | 2 | I Married a Strange Person | 203134.0 | 203134.0 | NaN | 250000.0 | Aug 28 1998 | NaN | NaN | Lionsgate | NaN | Comedy | NaN | NaN | NaN | 6.8 | 865.0 |
3 | 3 | Let's Talk About Sex | 373615.0 | 373615.0 | NaN | 300000.0 | Sep 11 1998 | NaN | NaN | Fine Line | NaN | Comedy | NaN | NaN | 13.0 | NaN | NaN |
4 | 4 | Slam | 1009819.0 | 1087521.0 | NaN | 1000000.0 | Oct 09 1998 | R | NaN | Trimark | Original Screenplay | Drama | Contemporary Fiction | NaN | 62.0 | 3.4 | 165.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3196 | 3196 | Zack and Miri Make a Porno | 31452765.0 | 36851125.0 | 21240321.0 | 24000000.0 | Oct 31 2008 | R | 101.0 | Weinstein Co. | Original Screenplay | Comedy | Contemporary Fiction | Kevin Smith | 65.0 | 7.0 | 55687.0 |
3197 | 3197 | Zodiac | 33080084.0 | 83080084.0 | 20983030.0 | 85000000.0 | Mar 02 2007 | R | 157.0 | Paramount Pictures | Based on Book/Short Story | Thriller/Suspense | Dramatization | David Fincher | 89.0 | NaN | NaN |
3198 | 3198 | Zoom | 11989328.0 | 12506188.0 | 6679409.0 | 35000000.0 | Aug 11 2006 | PG | NaN | Sony Pictures | Based on Comic/Graphic Novel | Adventure | Super Hero | Peter Hewitt | 3.0 | 3.4 | 7424.0 |
3199 | 3199 | The Legend of Zorro | 45575336.0 | 141475336.0 | NaN | 80000000.0 | Oct 28 2005 | PG | 129.0 | Sony Pictures | Remake | Adventure | Historical Fiction | Martin Campbell | 26.0 | 5.7 | 21161.0 |
3200 | 3200 | The Mask of Zorro | 93828745.0 | 233700000.0 | NaN | 65000000.0 | Jul 17 1998 | PG-13 | 136.0 | Sony Pictures | Remake | Adventure | Historical Fiction | Martin Campbell | 82.0 | 6.7 | 4789.0 |
3201 rows × 17 columns
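The `na_values=" "` argument passed to `read_csv` above tells pandas to treat cells containing a single space character as missing. A minimal sketch of that behavior on a tiny inline CSV (hypothetical values, not taken from movies.csv):

```python
import io
import pandas as pd

# A tiny CSV where one gross value is a single space character
csv_text = "Title,US_Gross\nSlam, \nZodiac,33080084\n"

# Without na_values, the lone space survives as an ordinary string
plain = pd.read_csv(io.StringIO(csv_text))
# With na_values=" ", the same cell is read as NaN instead
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(plain["US_Gross"].isna().sum())      # no missing values detected
print(cleaned["US_Gross"].isna().tolist())  # first row is now NaN
```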
Cleaning the data by dropping every row with a missing value, to get more reliable results:
df.dropna(inplace=True)
df
| | Unnamed: 0 | Title | US_Gross | Worldwide_Gross | US_DVD_Sales | Production_Budget | Release_Date | MPAA_Rating | Running_Time_min | Distributor | Source | Major_Genre | Creative_Type | Director | Rotten_Tomatoes_Rating | IMDB_Rating | IMDB_Votes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1064 | 1064 | 12 Rounds | 12234694.0 | 18184083.0 | 8283859.0 | 20000000.0 | Mar 27 2009 | PG-13 | 108.0 | 20th Century Fox | Original Screenplay | Action | Contemporary Fiction | Renny Harlin | 28.0 | 5.4 | 8914.0 |
1074 | 1074 | 2012 | 166112167.0 | 766812167.0 | 50736023.0 | 200000000.0 | Nov 13 2009 | PG-13 | 158.0 | Sony Pictures | Original Screenplay | Action | Science Fiction | Roland Emmerich | 39.0 | 6.2 | 396.0 |
1090 | 1090 | 300 | 210614939.0 | 456068181.0 | 261252400.0 | 60000000.0 | Mar 09 2007 | R | 117.0 | Warner Bros. | Based on Comic/Graphic Novel | Action | Historical Fiction | Zack Snyder | 60.0 | 7.8 | 235508.0 |
1095 | 1095 | 3:10 to Yuma | 53606916.0 | 69791889.0 | 51359371.0 | 48000000.0 | Sep 02 2007 | R | 117.0 | Lionsgate | Remake | Western | Historical Fiction | James Mangold | 89.0 | 7.9 | 98355.0 |
1107 | 1107 | 88 Minutes | 16930884.0 | 32955399.0 | 11385055.0 | 30000000.0 | Apr 18 2008 | R | 106.0 | Sony Pictures | Original Screenplay | Thriller/Suspense | Contemporary Fiction | Jon Avnet | 5.0 | 5.9 | 31205.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3158 | 3158 | The Wrestler | 26236603.0 | 43236603.0 | 11912450.0 | 6000000.0 | Dec 17 2008 | R | 109.0 | Fox Searchlight | Original Screenplay | Drama | Contemporary Fiction | Darren Aronofsky | 98.0 | 8.2 | 93301.0 |
3181 | 3181 | Year One | 43337279.0 | 57604723.0 | 14813995.0 | 60000000.0 | Jun 19 2009 | PG-13 | 97.0 | Sony Pictures | Original Screenplay | Comedy | Historical Fiction | Harold Ramis | 14.0 | 5.0 | 23091.0 |
3183 | 3183 | Yes Man | 97690976.0 | 225990976.0 | 26601131.0 | 50000000.0 | Dec 19 2008 | PG-13 | 104.0 | Warner Bros. | Based on Book/Short Story | Comedy | Contemporary Fiction | Peyton Reed | 43.0 | 7.0 | 62150.0 |
3195 | 3195 | Zombieland | 75590286.0 | 98690286.0 | 28281155.0 | 23600000.0 | Oct 02 2009 | R | 87.0 | Sony Pictures | Original Screenplay | Comedy | Fantasy | Ruben Fleischer | 89.0 | 7.8 | 81629.0 |
3196 | 3196 | Zack and Miri Make a Porno | 31452765.0 | 36851125.0 | 21240321.0 | 24000000.0 | Oct 31 2008 | R | 101.0 | Weinstein Co. | Original Screenplay | Comedy | Contemporary Fiction | Kevin Smith | 65.0 | 7.0 | 55687.0 |
174 rows × 17 columns
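Note that `dropna()` with no arguments removes a row if *any* of its 17 columns is missing, which is why the table shrinks from 3201 rows to 174. Since the analysis below only uses the gross columns and the Rotten Tomatoes rating, a `subset` argument would keep more rows; a sketch of the difference on a toy frame (hypothetical values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "US_Gross": [146083.0, np.nan, 203134.0],
    "Rotten_Tomatoes_Rating": [62.0, 89.0, 3.0],
    "US_DVD_Sales": [np.nan, np.nan, np.nan],  # a column that is always missing
})

# dropna() discards every row, because US_DVD_Sales is never present
print(len(toy.dropna()))

# Restricting the check to the columns actually used keeps the complete rows
print(len(toy.dropna(subset=["US_Gross", "Rotten_Tomatoes_Rating"])))
```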
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174 entries, 1064 to 3196
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 174 non-null int64
1 Title 174 non-null object
2 US_Gross 174 non-null float64
3 Worldwide_Gross 174 non-null float64
4 US_DVD_Sales 174 non-null float64
5 Production_Budget 174 non-null float64
6 Release_Date 174 non-null object
7 MPAA_Rating 174 non-null object
8 Running_Time_min 174 non-null float64
9 Distributor 174 non-null object
10 Source 174 non-null object
11 Major_Genre 174 non-null object
12 Creative_Type 174 non-null object
13 Director 174 non-null object
14 Rotten_Tomatoes_Rating 174 non-null float64
15 IMDB_Rating 174 non-null float64
16 IMDB_Votes 174 non-null float64
dtypes: float64(8), int64(1), object(8)
memory usage: 24.5+ KB
df.dtypes
Unnamed: 0 int64
Title object
US_Gross float64
Worldwide_Gross float64
US_DVD_Sales float64
Production_Budget float64
Release_Date object
MPAA_Rating object
Running_Time_min float64
Distributor object
Source object
Major_Genre object
Creative_Type object
Director object
Rotten_Tomatoes_Rating float64
IMDB_Rating float64
IMDB_Votes float64
dtype: object
print(f"The number of rows in this dataset is {df.shape[0]}")
The number of rows in this dataset is 174
Now we use groupby to analyze whether every (US_Gross, Worldwide_Gross) pair in the data has a Rotten Tomatoes rating.
df.groupby(["US_Gross", "Worldwide_Gross"])["Rotten_Tomatoes_Rating"].count()
US_Gross Worldwide_Gross
2223293.0 1.321803e+08 1
3005605.0 3.502586e+07 1
3688560.0 3.203061e+07 1
5463019.0 1.476302e+07 1
5755286.0 6.521829e+06 1
..
370782930.0 6.118994e+08 1
373524485.0 7.837050e+08 1
402111870.0 8.363037e+08 1
423315812.0 1.065660e+09 1
533345358.0 1.022345e+09 1
Name: Rotten_Tomatoes_Rating, Length: 174, dtype: int64
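The groupby/count pattern above can be sketched on a toy frame: for each (US_Gross, Worldwide_Gross) pair, `count()` tallies only the non-null ratings (hypothetical numbers):

```python
import pandas as pd

toy = pd.DataFrame({
    "US_Gross": [100.0, 100.0, 250.0],
    "Worldwide_Gross": [300.0, 300.0, 900.0],
    "Rotten_Tomatoes_Rating": [62.0, None, 89.0],
})

# count() skips the missing rating within each gross pair
counts = toy.groupby(["US_Gross", "Worldwide_Gross"])["Rotten_Tomatoes_Rating"].count()
print(counts.loc[(100.0, 300.0)])  # 1 - the None is not counted
```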
Making a graph to get a better understanding of how the data looks.
c1= alt.Chart(df).mark_circle().encode(
x = "US_Gross",
y = "Worldwide_Gross",
color="Rotten_Tomatoes_Rating"
).properties(
title= "Gross Depending on Rotten Tomato Rating",
width=700,
height=100,
)
c1
The graph above shows that as worldwide gross increases, US gross also increases. Additionally, the color encoding shows that points with a higher Rotten Tomatoes rating appear as a darker blue.
Now I am going to use the K-Nearest Neighbors Regressor and the K-Nearest Neighbors Classifier to see which model gives the best results.
Testing with K-Nearest Neighbors:
First I am going to rescale the data:
scaler = StandardScaler()
scaler.fit(df[["US_Gross","Worldwide_Gross"]])
StandardScaler()
df[["US_Gross","Worldwide_Gross"]]= scaler.transform(df[["US_Gross","Worldwide_Gross"]])
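After `StandardScaler`, each rescaled column should have mean ≈ 0 and standard deviation ≈ 1, which keeps one gross column from dominating the KNN distance computation. A quick self-contained check on synthetic gross-like values (not the movies data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 1e8, size=(174, 2))  # synthetic values on a gross-like scale

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column is now centered at 0 with unit (population) standard deviation
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```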
Next I am going to use train_test_split to divide the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(df[["US_Gross","Worldwide_Gross"]],df["Rotten_Tomatoes_Rating"],test_size=0.5)
print(f"The number of rows in this dataset is {y_train.shape[0]}")
The number of rows in this dataset is 87
print(f"The shape of this dataset using Train {X_train.shape}")
The shape of this dataset using Train (87, 2)
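One caveat with the split above: without a `random_state`, every run produces a different split, so the error numbers below will vary slightly between runs. A sketch of a reproducible split on synthetic data of the same shape (174 rows, two features):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 2))
y = rng.uniform(0, 100, size=174)  # rating-like targets on a 0-100 scale

# test_size=0.5 halves the 174 rows: 87 train, 87 test;
# random_state makes the split identical on every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)
print(X_train.shape, X_test.shape)  # (87, 2) (87, 2)
```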
K_reg = KNeighborsRegressor(n_neighbors=10)
K_reg.fit(X_train,y_train)
KNeighborsRegressor(n_neighbors=10)
mean_absolute_error(K_reg.predict(X_train), y_train)
22.03218390804598
mean_absolute_error(K_reg.predict(X_test), y_test)
23.81379310344828
The train and test mean absolute errors are close to each other, which suggests the model is not overfitting when we use n_neighbors=10.
Now we will examine which k value gives us the smallest test error for the K-Nearest Neighbors Regressor.
def get_scores(k):
K_reg = KNeighborsRegressor(n_neighbors=k)
K_reg.fit(X_train, y_train)
train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
return (train_error, test_error)
K_reg_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in K_reg_scores.index:
K_reg_scores.loc[i,["train_error","test_error"]] = get_scores(K_reg_scores.loc[i,"k"])
K_reg_scores
| | k | train_error | test_error |
|---|---|---|---|
0 | 1 | 0.000000 | 31.666667 |
1 | 2 | 15.804598 | 30.752874 |
2 | 3 | 16.762452 | 27.455939 |
3 | 4 | 18.658046 | 27.922414 |
4 | 5 | 19.629885 | 26.650575 |
... | ... | ... | ... |
82 | 83 | 25.061903 | 23.652403 |
83 | 84 | 25.078955 | 23.682813 |
84 | 85 | 25.106288 | 23.667478 |
85 | 86 | 25.129644 | 23.689522 |
86 | 87 | 25.113489 | 23.702999 |
87 rows × 3 columns
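The loop that fills `K_reg_scores` can also be written as a list comprehension that builds the frame in one step. A self-contained sketch on synthetic data (the `get_scores` helper is re-created here under the same assumptions as above):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 2))
y = rng.uniform(0, 100, size=174)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

def get_scores(k):
    reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    return (mean_absolute_error(reg.predict(X_train), y_train),
            mean_absolute_error(reg.predict(X_test), y_test))

# Build the whole scores frame in one expression instead of a fill-in loop
scores = pd.DataFrame(
    [(k, *get_scores(k)) for k in range(1, 88)],
    columns=["k", "train_error", "test_error"],
)
print(scores.shape)  # (87, 3)
```

With k=1 the regressor memorizes the training points, so the first row's train error is exactly 0, matching the table above.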
Analyzing for the smallest test error:
(K_reg_scores["test_error"]).min()
23.155172413793107
(K_reg_scores["test_error"]<25).sum()
81
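Beyond the minimum value itself, `idxmin` recovers *which* k achieves it; a sketch on a small hypothetical scores frame:

```python
import pandas as pd

scores = pd.DataFrame({
    "k": [1, 5, 10, 50],
    "test_error": [31.7, 26.7, 23.8, 23.2],  # hypothetical error values
})

# idxmin gives the row label of the smallest test error
best_row = scores.loc[scores["test_error"].idxmin()]
print(int(best_row["k"]))  # 50 - the k with the smallest test error
```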
This means that choosing k=10 was a good choice: its test error was around 23.8, close to the smallest test error of about 23.2.
Now we will plot the test error curve to get a better understanding of the flexibility and variance as k changes.
K_reg_scores["kinv"] = 1/K_reg_scores.k
K_regtest = alt.Chart(K_reg_scores).mark_line(color="green").encode(
x = "kinv",
y = "test_error"
)
K_regtrain = alt.Chart(K_reg_scores).mark_line().encode(
x = "kinv",
y = "train_error"
).properties(
title= "Error"
)
K_regtest+K_regtrain
Looking at the graph (with 1/k on the x-axis), the right side corresponds to small k: high flexibility and high variance, where the train error drops toward zero while the test error stays high (overfitting). Moving left, k grows and flexibility decreases, so the two curves come together and eventually underfit.
K-Nearest Neighbors Classifier
clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)
mean_absolute_error(clf.predict(X_train), y_train)
32.55172413793103
mean_absolute_error(clf.predict(X_test), y_test)
39.06896551724138
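Worth noting: `KNeighborsClassifier` treats every distinct rating value as its own class, and `mean_absolute_error` on its predictions measures how far the predicted label is from the true one numerically, rather than a usual classification score. A minimal self-contained sketch with synthetic integer-valued labels (not the movies data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_absolute_error, accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = rng.integers(0, 100, size=40).astype(float)  # rating-like class labels

# With k=1 the classifier memorizes the training labels exactly
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))       # perfect accuracy on the training set
print(mean_absolute_error(pred, y))  # zero numeric distance as well
```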
The gap between the train and test mean absolute errors is moderate, suggesting only mild overfitting when we use n_neighbors=6.
Now we will examine which k value gives us the smallest test error for the K-Nearest Neighbors Classifier.
def get_clf_scores(k):
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_train, y_train)
train_error = mean_absolute_error(clf.predict(X_train), y_train)
test_error = mean_absolute_error(clf.predict(X_test), y_test)
return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])
clf_scores
| | k | train_error | test_error |
|---|---|---|---|
0 | 1 | 0.000000 | 31.666667 |
1 | 2 | 15.597701 | 32.701149 |
2 | 3 | 23.471264 | 35.137931 |
3 | 4 | 27.919540 | 38.528736 |
4 | 5 | 30.344828 | 38.735632 |
... | ... | ... | ... |
82 | 83 | 33.022989 | 31.045977 |
83 | 84 | 32.816092 | 30.862069 |
84 | 85 | 32.609195 | 30.655172 |
85 | 86 | 32.402299 | 30.643678 |
86 | 87 | 34.264368 | 31.597701 |
87 rows × 3 columns
Analyzing for the smallest test error:
clf_scores["test_error"].min()
27.42528735632184
(clf_scores["test_error"]< 30).sum()
26
This means that choosing k=6 was a bad choice: its test error was around 39.1, far from the smallest test error of about 27.4.
Now we will plot the test error curve to get a better understanding of the flexibility and variance as k changes.
clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
x = "kinv",
y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="green").encode(
x = "kinv",
y = "test_error"
).properties(
title= "Error",
)
clftrain+clftest
Looking at the graph, we again see overfitting on the right (small k), where the train error sits far below the test error, and lower flexibility on the left (large k), where the two curves come together and the model underfits.
The whole reason we looked at the test error is to analyze whether this data supports the previous claim, and to find the best-fitting k that gives us the smallest test error.
Summary¶
Throughout our analysis of the data using the K-Nearest Neighbors Regressor and Classifier, we can see that the regressor is the better choice for our data. Overall, the K-Nearest Neighbors Regressor gave a test mean absolute error of roughly 25; that is, predicting the Rotten Tomatoes rating from US gross and worldwide gross is off by about 25 points on the 0-100 rating scale on average. We can conclude that the Rotten Tomatoes rating has some, but limited, relationship with US gross and worldwide gross.
References¶
The dataset was found in Deepnote, from Thursday (Week 8), under the Vega files.