Movie’s Gross

Author: Dana Albakri

Course Project, UC Irvine, Math 10, W22

Introduction

The main aspect we are going to explore in the Movies dataset is the Rotten Tomatoes rating with respect to the highest gross. We will use both the US gross and the worldwide gross to get a better understanding of how popular these movies are around the world. After determining which Rotten Tomatoes-rated movies have the biggest gross, we will be able to see whether the rating affects the gross for the US versus the world.

Main portion of the project

import numpy as np
import pandas as pd
import vega_datasets
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("/work/movies.csv", na_values=" ")
df
Unnamed: 0 Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy NaN NaN NaN NaN NaN 6.1 1071.0
1 1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand NaN Drama NaN NaN NaN 6.9 207.0
2 2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 NaN NaN Lionsgate NaN Comedy NaN NaN NaN 6.8 865.0
3 3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 NaN NaN Fine Line NaN Comedy NaN NaN 13.0 NaN NaN
4 4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction NaN 62.0 3.4 165.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3196 3196 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0
3197 3197 Zodiac 33080084.0 83080084.0 20983030.0 85000000.0 Mar 02 2007 R 157.0 Paramount Pictures Based on Book/Short Story Thriller/Suspense Dramatization David Fincher 89.0 NaN NaN
3198 3198 Zoom 11989328.0 12506188.0 6679409.0 35000000.0 Aug 11 2006 PG NaN Sony Pictures Based on Comic/Graphic Novel Adventure Super Hero Peter Hewitt 3.0 3.4 7424.0
3199 3199 The Legend of Zorro 45575336.0 141475336.0 NaN 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0
3200 3200 The Mask of Zorro 93828745.0 233700000.0 NaN 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0

3201 rows × 17 columns

Cleaning the Data to get more accurate results:
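Before dropping rows, it can help to see how many values are missing in each column; dropna removes any row with at least one missing value. A minimal sketch on a made-up frame (the real notebook applies the same calls to the movies data):

```python
import pandas as pd

# Tiny stand-in for the movies data, with deliberate gaps
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "US_Gross": [146083.0, None, 203134.0],
    "Rotten_Tomatoes_Rating": [None, 62.0, 82.0],
})

print(df.isna().sum())       # missing values per column
print(df.dropna().shape[0])  # rows that survive dropna: only "C"
```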

df.dropna(inplace=True)
df
Unnamed: 0 Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
1064 1064 12 Rounds 12234694.0 18184083.0 8283859.0 20000000.0 Mar 27 2009 PG-13 108.0 20th Century Fox Original Screenplay Action Contemporary Fiction Renny Harlin 28.0 5.4 8914.0
1074 1074 2012 166112167.0 766812167.0 50736023.0 200000000.0 Nov 13 2009 PG-13 158.0 Sony Pictures Original Screenplay Action Science Fiction Roland Emmerich 39.0 6.2 396.0
1090 1090 300 210614939.0 456068181.0 261252400.0 60000000.0 Mar 09 2007 R 117.0 Warner Bros. Based on Comic/Graphic Novel Action Historical Fiction Zack Snyder 60.0 7.8 235508.0
1095 1095 3:10 to Yuma 53606916.0 69791889.0 51359371.0 48000000.0 Sep 02 2007 R 117.0 Lionsgate Remake Western Historical Fiction James Mangold 89.0 7.9 98355.0
1107 1107 88 Minutes 16930884.0 32955399.0 11385055.0 30000000.0 Apr 18 2008 R 106.0 Sony Pictures Original Screenplay Thriller/Suspense Contemporary Fiction Jon Avnet 5.0 5.9 31205.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3158 3158 The Wrestler 26236603.0 43236603.0 11912450.0 6000000.0 Dec 17 2008 R 109.0 Fox Searchlight Original Screenplay Drama Contemporary Fiction Darren Aronofsky 98.0 8.2 93301.0
3181 3181 Year One 43337279.0 57604723.0 14813995.0 60000000.0 Jun 19 2009 PG-13 97.0 Sony Pictures Original Screenplay Comedy Historical Fiction Harold Ramis 14.0 5.0 23091.0
3183 3183 Yes Man 97690976.0 225990976.0 26601131.0 50000000.0 Dec 19 2008 PG-13 104.0 Warner Bros. Based on Book/Short Story Comedy Contemporary Fiction Peyton Reed 43.0 7.0 62150.0
3195 3195 Zombieland 75590286.0 98690286.0 28281155.0 23600000.0 Oct 02 2009 R 87.0 Sony Pictures Original Screenplay Comedy Fantasy Ruben Fleischer 89.0 7.8 81629.0
3196 3196 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0

174 rows × 17 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174 entries, 1064 to 3196
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              174 non-null    int64  
 1   Title                   174 non-null    object 
 2   US_Gross                174 non-null    float64
 3   Worldwide_Gross         174 non-null    float64
 4   US_DVD_Sales            174 non-null    float64
 5   Production_Budget       174 non-null    float64
 6   Release_Date            174 non-null    object 
 7   MPAA_Rating             174 non-null    object 
 8   Running_Time_min        174 non-null    float64
 9   Distributor             174 non-null    object 
 10  Source                  174 non-null    object 
 11  Major_Genre             174 non-null    object 
 12  Creative_Type           174 non-null    object 
 13  Director                174 non-null    object 
 14  Rotten_Tomatoes_Rating  174 non-null    float64
 15  IMDB_Rating             174 non-null    float64
 16  IMDB_Votes              174 non-null    float64
dtypes: float64(8), int64(1), object(8)
memory usage: 24.5+ KB
df.dtypes
Unnamed: 0                  int64
Title                      object
US_Gross                  float64
Worldwide_Gross           float64
US_DVD_Sales              float64
Production_Budget         float64
Release_Date               object
MPAA_Rating                object
Running_Time_min          float64
Distributor                object
Source                     object
Major_Genre                object
Creative_Type              object
Director                   object
Rotten_Tomatoes_Rating    float64
IMDB_Rating               float64
IMDB_Votes                float64
dtype: object
print(f"The number of rows in this dataset is {df.shape[0]}")
The number of rows in this dataset is 174

Now we will use groupby to analyze whether, for every movie's gross, there is a Rotten Tomatoes rating.

df.groupby(["US_Gross", "Worldwide_Gross"])["Rotten_Tomatoes_Rating"].count()
US_Gross     Worldwide_Gross
2223293.0    1.321803e+08       1
3005605.0    3.502586e+07       1
3688560.0    3.203061e+07       1
5463019.0    1.476302e+07       1
5755286.0    6.521829e+06       1
                               ..
370782930.0  6.118994e+08       1
373524485.0  7.837050e+08       1
402111870.0  8.363037e+08       1
423315812.0  1.065660e+09       1
533345358.0  1.022345e+09       1
Name: Rotten_Tomatoes_Rating, Length: 174, dtype: int64

Making a graph to get a better understanding of how the data looks.

c1 = alt.Chart(df).mark_circle().encode(
    x = "US_Gross",
    y = "Worldwide_Gross",
    color = "Rotten_Tomatoes_Rating"
).properties(
    title = "Gross Depending on Rotten Tomato Rating",
    width = 700,
    height = 100,
)
c1

The graph above displays that as worldwide gross increases, US gross also increases. Additionally, the color encoding shows that higher Rotten Tomatoes ratings appear as a darker blue.
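The visual trend can also be checked numerically with a correlation coefficient. A sketch on made-up gross figures; the real check would call .corr() on the cleaned frame's two gross columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "US_Gross":        [10.0, 20.0, 30.0, 40.0],
    "Worldwide_Gross": [15.0, 35.0, 55.0, 90.0],
})

# Pearson correlation: a value close to 1 means the two grosses rise together
r = toy["US_Gross"].corr(toy["Worldwide_Gross"])
print(round(r, 3))
```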

Now I am going to use K-Nearest Neighbors Regressor and K-Nearest Neighbors Classifier to see which model gives the best results.

Testing with K-Nearest Neighbors:

First I am going to rescale the data:

scaler = StandardScaler()
scaler.fit(df[["US_Gross","Worldwide_Gross"]])
StandardScaler()
df[["US_Gross","Worldwide_Gross"]]= scaler.transform(df[["US_Gross","Worldwide_Gross"]])

Next I am going to use train_test_split to train the data.

X_train, X_test, y_train, y_test = train_test_split(df[["US_Gross","Worldwide_Gross"]],df["Rotten_Tomatoes_Rating"],test_size=0.5)
 print(f"The number of rows in this dataset is {y_train.shape[0]}")
The number of rows in this dataset is 87
 print(f"The shape of this dataset using Train {X_train.shape}")
The shape of this dataset using Train (87, 2)
K_reg = KNeighborsRegressor(n_neighbors=10)
K_reg.fit(X_train,y_train)
KNeighborsRegressor(n_neighbors=10)
mean_absolute_error(K_reg.predict(X_train), y_train)
22.03218390804598
mean_absolute_error(K_reg.predict(X_test), y_test)
23.81379310344828

Computing the mean absolute error shows that the model is not overfitting when we use n_neighbors=10, since the train and test errors are close.

Now we will examine which k value gives us the least test error for the K-Nearest Neighbors regressor.

def get_scores(k):
    K_reg = KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)
K_reg_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in K_reg_scores.index:
    K_reg_scores.loc[i,["train_error","test_error"]] = get_scores(K_reg_scores.loc[i,"k"])
K_reg_scores
k train_error test_error
0 1 0.000000 31.666667
1 2 15.804598 30.752874
2 3 16.762452 27.455939
3 4 18.658046 27.922414
4 5 19.629885 26.650575
... ... ... ...
82 83 25.061903 23.652403
83 84 25.078955 23.682813
84 85 25.106288 23.667478
85 86 25.129644 23.689522
86 87 25.113489 23.702999

87 rows × 3 columns

Analyzing for the least test error:

(K_reg_scores["test_error"]).min()
23.155172413793107
(K_reg_scores["test_error"]<25).sum()
81

This means that choosing k=10 was a good choice: its test error was around 23.8, close to the least test error of about 23.2.
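The k with the smallest test error can be read off the scores table directly with idxmin. A sketch using made-up error values in the same layout as K_reg_scores:

```python
import pandas as pd

# Hypothetical (k, test_error) pairs in the same layout as K_reg_scores
scores = pd.DataFrame({
    "k":          [1,    5,    10,   50],
    "test_error": [31.7, 26.7, 23.8, 23.2],
})

# idxmin gives the row label of the smallest test error; .loc recovers its k
best_k = scores.loc[scores["test_error"].idxmin(), "k"]
print(best_k)  # k with the least test error: 50
```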

Now we will plot the train and test error curves to get a better understanding of how flexibility and variance change with k.

K_reg_scores["kinv"] = 1/K_reg_scores.k
K_regtest = alt.Chart(K_reg_scores).mark_line(color="green").encode(
    x = "kinv",
    y = "test_error"
)
K_regtrain = alt.Chart(K_reg_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
).properties(
    title = "Error"
)
K_regtest+K_regtrain

Looking at the graph, the right side (small k, so large 1/k) has high flexibility and high variance: the training error drops toward zero while the test error stays high, which is overfitting. Toward the left (large k) the two curves come together and flatten as the model loses flexibility and begins to underfit.

K-Nearest Neighbors Classifier

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)
mean_absolute_error(clf.predict(X_train), y_train)
32.55172413793103
mean_absolute_error(clf.predict(X_test), y_test)
39.06896551724138

Computing the mean absolute error shows that the model is not overfitting badly when we use n_neighbors=6, although the test error is noticeably higher than the train error.
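One reason the classifier fares worse here is that it treats every distinct rating value as a separate class and must predict one of the labels it has seen, while the regressor averages the neighbors' ratings. A minimal sketch with made-up points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([10, 20, 80, 90])  # ratings, treated as class labels by the classifier

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
clf = KNeighborsClassifier(n_neighbors=2).fit(X, y)

print(reg.predict([[0.5]]))  # averages the two nearest ratings: [15.]
print(clf.predict([[0.5]]))  # must return one of the seen labels (10 or 20)
```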

Now we will examine which k value gives us the least test error for the K-Nearest Neighbors classifier.

def get_clf_scores(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
    clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])
clf_scores
k train_error test_error
0 1 0.000000 31.666667
1 2 15.597701 32.701149
2 3 23.471264 35.137931
3 4 27.919540 38.528736
4 5 30.344828 38.735632
... ... ... ...
82 83 33.022989 31.045977
83 84 32.816092 30.862069
84 85 32.609195 30.655172
85 86 32.402299 30.643678
86 87 34.264368 31.597701

87 rows × 3 columns

Analyzing for the least test error:

clf_scores["test_error"].min()
27.42528735632184
(clf_scores["test_error"]< 30).sum()
26

This means that choosing k=6 was a bad choice: its test error was around 39.1, far from the least test error of about 27.4.

Now we will plot the train and test error curves to get a better understanding of how flexibility and variance change with k.

clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="green").encode(
    x = "kinv",
    y = "test_error"
).properties(
    title = "Error"
)
clftrain+clftest

Looking at the graph, we again see overfitting on the right (small k), where the train and test curves are far apart, and underfitting on the left (large k), where the model has lower flexibility.

The whole reason we looked at the test error was to find the best-fitting k, the one that gives the least test error, and to analyze whether this data supports the earlier claim.

Summary

Throughout our analysis of the data using the K-Nearest Neighbors regressor and classifier, we can see that the regressor is the better choice for our data. Overall, our test error is around 25 rating points using the K-Nearest Neighbors regressor, meaning that predicting the Rotten Tomatoes rating from US gross and worldwide gross leaves an average error of about 25 points. We can conclude that Rotten Tomatoes ratings have only some impact on US gross and worldwide gross.

References

The dataset was found in Deepnote from Thursday (Week 8) under the Vega files.

Pandas Groupby Link
