Movie’s Gross

Author: Dana Albakri

Course Project, UC Irvine, Math 10, W22

Introduction

The main aspect we are going to explore in the Movies dataset is the Rotten Tomatoes rating with respect to the highest gross. We will use both the US gross and the worldwide gross to get a better understanding of how popular these movies are around the world. After determining which Rotten Tomatoes-rated movies have the biggest gross, we will be able to see whether the rating affects the gross for the US versus the world.

Main portion of the project

import numpy as np
import pandas as pd
import vega_datasets
import altair as alt
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error,mean_absolute_error,log_loss
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("/work/movies.csv", na_values=" ")
df
Unnamed: 0 Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
0 0 The Land Girls 146083.0 146083.0 NaN 8000000.0 Jun 12 1998 R NaN Gramercy NaN NaN NaN NaN NaN 6.1 1071.0
1 1 First Love, Last Rites 10876.0 10876.0 NaN 300000.0 Aug 07 1998 R NaN Strand NaN Drama NaN NaN NaN 6.9 207.0
2 2 I Married a Strange Person 203134.0 203134.0 NaN 250000.0 Aug 28 1998 NaN NaN Lionsgate NaN Comedy NaN NaN NaN 6.8 865.0
3 3 Let's Talk About Sex 373615.0 373615.0 NaN 300000.0 Sep 11 1998 NaN NaN Fine Line NaN Comedy NaN NaN 13.0 NaN NaN
4 4 Slam 1009819.0 1087521.0 NaN 1000000.0 Oct 09 1998 R NaN Trimark Original Screenplay Drama Contemporary Fiction NaN 62.0 3.4 165.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3196 3196 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0
3197 3197 Zodiac 33080084.0 83080084.0 20983030.0 85000000.0 Mar 02 2007 R 157.0 Paramount Pictures Based on Book/Short Story Thriller/Suspense Dramatization David Fincher 89.0 NaN NaN
3198 3198 Zoom 11989328.0 12506188.0 6679409.0 35000000.0 Aug 11 2006 PG NaN Sony Pictures Based on Comic/Graphic Novel Adventure Super Hero Peter Hewitt 3.0 3.4 7424.0
3199 3199 The Legend of Zorro 45575336.0 141475336.0 NaN 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0
3200 3200 The Mask of Zorro 93828745.0 233700000.0 NaN 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0

3201 rows × 17 columns

Cleaning the Data to get more accurate results:
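Before dropping rows, it can help to see how many values are missing in each column; dropna removes any row with at least one missing value. A minimal sketch on a made-up frame (the real notebook applies the same calls to the movies data):

```python
import pandas as pd

# Tiny stand-in for the movies data, with deliberate gaps
df = pd.DataFrame({
    "Title": ["A", "B", "C"],
    "US_Gross": [146083.0, None, 203134.0],
    "Rotten_Tomatoes_Rating": [None, 62.0, 82.0],
})

print(df.isna().sum())       # missing values per column
print(df.dropna().shape[0])  # rows that survive dropna: only "C"
```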

df.dropna(inplace=True)
df
Unnamed: 0 Title US_Gross Worldwide_Gross US_DVD_Sales Production_Budget Release_Date MPAA_Rating Running_Time_min Distributor Source Major_Genre Creative_Type Director Rotten_Tomatoes_Rating IMDB_Rating IMDB_Votes
1064 1064 12 Rounds 12234694.0 18184083.0 8283859.0 20000000.0 Mar 27 2009 PG-13 108.0 20th Century Fox Original Screenplay Action Contemporary Fiction Renny Harlin 28.0 5.4 8914.0
1074 1074 2012 166112167.0 766812167.0 50736023.0 200000000.0 Nov 13 2009 PG-13 158.0 Sony Pictures Original Screenplay Action Science Fiction Roland Emmerich 39.0 6.2 396.0
1090 1090 300 210614939.0 456068181.0 261252400.0 60000000.0 Mar 09 2007 R 117.0 Warner Bros. Based on Comic/Graphic Novel Action Historical Fiction Zack Snyder 60.0 7.8 235508.0
1095 1095 3:10 to Yuma 53606916.0 69791889.0 51359371.0 48000000.0 Sep 02 2007 R 117.0 Lionsgate Remake Western Historical Fiction James Mangold 89.0 7.9 98355.0
1107 1107 88 Minutes 16930884.0 32955399.0 11385055.0 30000000.0 Apr 18 2008 R 106.0 Sony Pictures Original Screenplay Thriller/Suspense Contemporary Fiction Jon Avnet 5.0 5.9 31205.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3158 3158 The Wrestler 26236603.0 43236603.0 11912450.0 6000000.0 Dec 17 2008 R 109.0 Fox Searchlight Original Screenplay Drama Contemporary Fiction Darren Aronofsky 98.0 8.2 93301.0
3181 3181 Year One 43337279.0 57604723.0 14813995.0 60000000.0 Jun 19 2009 PG-13 97.0 Sony Pictures Original Screenplay Comedy Historical Fiction Harold Ramis 14.0 5.0 23091.0
3183 3183 Yes Man 97690976.0 225990976.0 26601131.0 50000000.0 Dec 19 2008 PG-13 104.0 Warner Bros. Based on Book/Short Story Comedy Contemporary Fiction Peyton Reed 43.0 7.0 62150.0
3195 3195 Zombieland 75590286.0 98690286.0 28281155.0 23600000.0 Oct 02 2009 R 87.0 Sony Pictures Original Screenplay Comedy Fantasy Ruben Fleischer 89.0 7.8 81629.0
3196 3196 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0

174 rows × 17 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174 entries, 1064 to 3196
Data columns (total 17 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Unnamed: 0              174 non-null    int64  
 1   Title                   174 non-null    object 
 2   US_Gross                174 non-null    float64
 3   Worldwide_Gross         174 non-null    float64
 4   US_DVD_Sales            174 non-null    float64
 5   Production_Budget       174 non-null    float64
 6   Release_Date            174 non-null    object 
 7   MPAA_Rating             174 non-null    object 
 8   Running_Time_min        174 non-null    float64
 9   Distributor             174 non-null    object 
 10  Source                  174 non-null    object 
 11  Major_Genre             174 non-null    object 
 12  Creative_Type           174 non-null    object 
 13  Director                174 non-null    object 
 14  Rotten_Tomatoes_Rating  174 non-null    float64
 15  IMDB_Rating             174 non-null    float64
 16  IMDB_Votes              174 non-null    float64
dtypes: float64(8), int64(1), object(8)
memory usage: 24.5+ KB
df.dtypes
Unnamed: 0                  int64
Title                      object
US_Gross                  float64
Worldwide_Gross           float64
US_DVD_Sales              float64
Production_Budget         float64
Release_Date               object
MPAA_Rating                object
Running_Time_min          float64
Distributor                object
Source                     object
Major_Genre                object
Creative_Type              object
Director                   object
Rotten_Tomatoes_Rating    float64
IMDB_Rating               float64
IMDB_Votes                float64
dtype: object
print(f"The number of rows in this dataset is {df.shape[0]}")
The number of rows in this dataset is 174

Now we will use groupby to analyze whether, for every movie's gross, there is a Rotten Tomatoes rating.

df.groupby(["US_Gross", "Worldwide_Gross"])["Rotten_Tomatoes_Rating"].count()
US_Gross     Worldwide_Gross
2223293.0    1.321803e+08       1
3005605.0    3.502586e+07       1
3688560.0    3.203061e+07       1
5463019.0    1.476302e+07       1
5755286.0    6.521829e+06       1
                               ..
370782930.0  6.118994e+08       1
373524485.0  7.837050e+08       1
402111870.0  8.363037e+08       1
423315812.0  1.065660e+09       1
533345358.0  1.022345e+09       1
Name: Rotten_Tomatoes_Rating, Length: 174, dtype: int64

Making a graph to get a better understanding of how the data looks.

c1 = alt.Chart(df).mark_circle().encode(
    x = "US_Gross",
    y = "Worldwide_Gross",
    color = "Rotten_Tomatoes_Rating"
).properties(
    title = "Gross Depending on Rotten Tomato Rating",
    width = 700,
    height = 100,
)
c1

The graph above displays that as worldwide gross increases, US gross also increases. Additionally, the color encoding shows that higher Rotten Tomatoes ratings appear as a darker blue.
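The visual trend can also be checked numerically with a correlation coefficient. A sketch on made-up gross figures; the real check would call .corr() on the cleaned frame's two gross columns:

```python
import pandas as pd

toy = pd.DataFrame({
    "US_Gross":        [10.0, 20.0, 30.0, 40.0],
    "Worldwide_Gross": [15.0, 35.0, 55.0, 90.0],
})

# Pearson correlation: a value close to 1 means the two grosses rise together
r = toy["US_Gross"].corr(toy["Worldwide_Gross"])
print(round(r, 3))
```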

Now I am going to use K-Nearest Neighbors Regressor and K-Nearest Neighbors Classifier to see which model gives the best results.

Testing with K-Nearest Neighbors:

First I am going to rescale the data:

scaler = StandardScaler()
scaler.fit(df[["US_Gross","Worldwide_Gross"]])
StandardScaler()
df[["US_Gross","Worldwide_Gross"]]= scaler.transform(df[["US_Gross","Worldwide_Gross"]])

Next I am going to use train_test_split to train the data.

X_train, X_test, y_train, y_test = train_test_split(df[["US_Gross","Worldwide_Gross"]],df["Rotten_Tomatoes_Rating"],test_size=0.5)
 print(f"The number of rows in this dataset is {y_train.shape[0]}")
The number of rows in this dataset is 87
 print(f"The shape of this dataset using Train {X_train.shape}")
The shape of this dataset using Train (87, 2)
K_reg = KNeighborsRegressor(n_neighbors=10)
K_reg.fit(X_train,y_train)
KNeighborsRegressor(n_neighbors=10)
mean_absolute_error(K_reg.predict(X_train), y_train)
22.03218390804598
mean_absolute_error(K_reg.predict(X_test), y_test)
23.81379310344828

Computing the mean absolute error shows that the model is not overfitting when we use n_neighbors=10, since the train and test errors are close.

Now we will examine which k value gives us the least test error for the K-Nearest Neighbors regressor.

def get_scores(k):
    K_reg = KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error = mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error = mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)
K_reg_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in K_reg_scores.index:
    K_reg_scores.loc[i,["train_error","test_error"]] = get_scores(K_reg_scores.loc[i,"k"])
K_reg_scores
k train_error test_error
0 1 0.000000 31.666667
1 2 15.804598 30.752874
2 3 16.762452 27.455939
3 4 18.658046 27.922414
4 5 19.629885 26.650575
... ... ... ...
82 83 25.061903 23.652403
83 84 25.078955 23.682813
84 85 25.106288 23.667478
85 86 25.129644 23.689522
86 87 25.113489 23.702999

87 rows × 3 columns

Analyzing for the least test error:

(K_reg_scores["test_error"]).min()
23.155172413793107
(K_reg_scores["test_error"]<25).sum()
81

This means that choosing k=10 was a good choice: its test error was around 23.8, close to the least test error of about 23.2.
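The k with the smallest test error can be read off the scores table directly with idxmin. A sketch using made-up error values in the same layout as K_reg_scores:

```python
import pandas as pd

# Hypothetical (k, test_error) pairs in the same layout as K_reg_scores
scores = pd.DataFrame({
    "k":          [1,    5,    10,   50],
    "test_error": [31.7, 26.7, 23.8, 23.2],
})

# idxmin gives the row label of the smallest test error; .loc recovers its k
best_k = scores.loc[scores["test_error"].idxmin(), "k"]
print(best_k)  # k with the least test error: 50
```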

Now we will plot the train and test error curves to get a better understanding of how flexibility and variance change with k.

K_reg_scores["kinv"] = 1/K_reg_scores.k
K_regtest = alt.Chart(K_reg_scores).mark_line(color="green").encode(
    x = "kinv",
    y = "test_error"
)
K_regtrain = alt.Chart(K_reg_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
).properties(
    title = "Error"
)
K_regtest+K_regtrain

Looking at the graph, the right side (small k, so large 1/k) has high flexibility and high variance: the training error drops toward zero while the test error stays high, which is overfitting. Toward the left (large k) the two curves come together and flatten as the model loses flexibility and begins to underfit.

K-Nearest Neighbors Classifier

clf = KNeighborsClassifier(n_neighbors=6)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=6)
mean_absolute_error(clf.predict(X_train), y_train)
32.55172413793103
mean_absolute_error(clf.predict(X_test), y_test)
39.06896551724138

Computing the mean absolute error shows that the model is not overfitting badly when we use n_neighbors=6, although the test error is noticeably higher than the train error.
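One reason the classifier fares worse here is that it treats every distinct rating value as a separate class and must predict one of the labels it has seen, while the regressor averages the neighbors' ratings. A minimal sketch with made-up points:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([10, 20, 80, 90])  # ratings, treated as class labels by the classifier

reg = KNeighborsRegressor(n_neighbors=2).fit(X, y)
clf = KNeighborsClassifier(n_neighbors=2).fit(X, y)

print(reg.predict([[0.5]]))  # averages the two nearest ratings: [15.]
print(clf.predict([[0.5]]))  # must return one of the seen labels (10 or 20)
```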

Now we will examine which k value gives us the least test error for the K-Nearest Neighbors classifier.

def get_clf_scores(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
clf_scores = pd.DataFrame({"k":range(1,88),"train_error":np.nan,"test_error":np.nan})
for i in clf_scores.index:
    clf_scores.loc[i,["train_error","test_error"]] = get_clf_scores(clf_scores.loc[i,"k"])
clf_scores
k train_error test_error
0 1 0.000000 31.666667
1 2 15.597701 32.701149
2 3 23.471264 35.137931
3 4 27.919540 38.528736
4 5 30.344828 38.735632
... ... ... ...
82 83 33.022989 31.045977
83 84 32.816092 30.862069
84 85 32.609195 30.655172
85 86 32.402299 30.643678
86 87 34.264368 31.597701

87 rows × 3 columns

Analyzing for the least test error:

clf_scores["test_error"].min()
27.42528735632184
(clf_scores["test_error"]< 30).sum()
26

This means that choosing k=6 was a bad choice: its test error was around 39.1, far from the least test error of about 27.4.

Now we will plot the train and test error curves to get a better understanding of how flexibility and variance change with k.

clf_scores["kinv"] = 1/clf_scores.k
clftrain = alt.Chart(clf_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
clftest = alt.Chart(clf_scores).mark_line(color="green").encode(
    x = "kinv",
    y = "test_error"
).properties(
    title = "Error"
)
clftrain+clftest

Looking at the graph, we again see overfitting on the right (small k), where the train and test curves are far apart, and underfitting on the left (large k), where the model has lower flexibility.

The whole reason we looked at the test error was to find the best-fitting k, the one that gives the least test error, and to analyze whether this data supports the earlier claim.

Summary

Throughout our analysis of the data using the K-Nearest Neighbors regressor and classifier, we can see that the regressor is the better choice for our data. Overall, our test error is around 25 rating points using the K-Nearest Neighbors regressor, meaning that predicting the Rotten Tomatoes rating from US gross and worldwide gross leaves an average error of about 25 points. We can conclude that Rotten Tomatoes ratings have only some impact on US gross and worldwide gross.

References

The dataset was found in Deepnote from Thursday (Week 8) under the Vega files.

Pandas Groupby Link
