What contributes to a movie’s commercial success?#

Author: Cece Sun

Course Project, UC Irvine, Math 10, F22

Introduction#

This project examines several features of theatrically released movies and how each relates to a movie’s commercial success, measured here by its revenue. It also ranks the top movies by revenue and by popularity, and looks at which directors appear most often among the highest-revenue movies. How does revenue relate to other features such as a movie’s budget, popularity, and ratings? Let’s take a look together.

Getting Ready#

Cleaning data and preparing it for future analysis and operations.

import pandas as pd

This project uses two datasets. Merge them on each movie’s id number.

df_c = pd.read_csv("tmdb_5000_credits.csv")
df_c.head(3)

	movie_id	title	cast	crew
0	19995	Avatar	[{"cast_id": 242, "character": "Jake Sully", "...	[{"credit_id": "52fe48009251416c750aca23", "de...
1	285	Pirates of the Caribbean: At World's End	[{"cast_id": 4, "character": "Captain Jack Spa...	[{"credit_id": "52fe4232c3a36847f800b579", "de...
2	206647	Spectre	[{"cast_id": 1, "character": "James Bond", "cr...	[{"credit_id": "54805967c3a36829b5002c41", "de...

df_m = pd.read_csv("tmdb_5000_movies.csv")
df_m.head(3)

	budget	genres	homepage	id	keywords	original_language	original_title	overview	popularity	production_companies	production_countries	release_date	revenue	runtime	spoken_languages	status	tagline	title	vote_average	vote_count
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466

df_m=df_m.rename(columns={"id" : "movie_id"})

df = df_m.merge(df_c, on="movie_id")
df.head(3)

	budget	genres	homepage	movie_id	keywords	original_language	original_title	overview	popularity	production_companies	...	runtime	spoken_languages	status	tagline	title_x	vote_average	vote_count	title_y	cast	crew
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.avatarmovie.com/	19995	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	In the 22nd century, a paraplegic Marine is di...	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	...	162.0	[{"iso_639_1": "en", "name": "English"}, {"iso...	Released	Enter the World of Pandora.	Avatar	7.2	11800	Avatar	[{"cast_id": 242, "character": "Jake Sully", "...	[{"credit_id": "52fe48009251416c750aca23", "de...
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	http://disney.go.com/disneypictures/pirates/	285	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	Captain Barbossa, long believed to be dead, ha...	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	...	169.0	[{"iso_639_1": "en", "name": "English"}]	Released	At the end of the world, the adventure begins.	Pirates of the Caribbean: At World's End	6.9	4500	Pirates of the Caribbean: At World's End	[{"cast_id": 4, "character": "Captain Jack Spa...	[{"credit_id": "52fe4232c3a36847f800b579", "de...
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	http://www.sonypictures.com/movies/spectre/	206647	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	A cryptic message from Bond’s past sends him o...	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	...	148.0	[{"iso_639_1": "fr", "name": "Fran\u00e7ais"},...	Released	A Plan No One Escapes	Spectre	6.3	4466	Spectre	[{"cast_id": 1, "character": "James Bond", "cr...	[{"credit_id": "54805967c3a36829b5002c41", "de...

3 rows × 23 columns
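Before relying on a merge like the one above, it can help to check that the key really is unique on both sides and to see how many rows matched. The following is only a sketch on made-up toy tables (the frames `movies` and `credits` and their values are assumptions, not the TMDB data); `validate` and `indicator` are standard options of pandas `merge`.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
movies = pd.DataFrame({"movie_id": [1, 2, 3], "budget": [100, 200, 300]})
credits = pd.DataFrame({"movie_id": [1, 2, 4], "title": ["A", "B", "D"]})

# validate="one_to_one" raises an error if movie_id repeats on either side;
# indicator=True adds a "_merge" column recording where each row came from.
merged = movies.merge(
    credits, on="movie_id", how="outer", validate="one_to_one", indicator=True
)

# Count rows found in both tables versus only one of them.
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

With the default inner join used in this project, only the rows marked "both" would be kept.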

df.columns

Index(['budget', 'genres', 'homepage', 'movie_id', 'keywords',
       'original_language', 'original_title', 'overview', 'popularity',
       'production_companies', 'production_countries', 'release_date',
       'revenue', 'runtime', 'spoken_languages', 'status', 'tagline',
       'title_x', 'vote_average', 'vote_count', 'title_y', 'cast', 'crew'],
      dtype='object')

Drop the duplicated columns and the columns we are not interested in.

df2 = df.drop(columns = ["movie_id","homepage", "title_x", "title_y", "overview", "status", "tagline","spoken_languages"])
df2.head(3)

	budget	genres	keywords	original_language	original_title	popularity	production_companies	production_countries	release_date	revenue	runtime	vote_average	vote_count	cast	crew
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	7.2	11800	[{"cast_id": 242, "character": "Jake Sully", "...	[{"credit_id": "52fe48009251416c750aca23", "de...
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	6.9	4500	[{"cast_id": 4, "character": "Captain Jack Spa...	[{"credit_id": "52fe4232c3a36847f800b579", "de...
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	6.3	4466	[{"cast_id": 1, "character": "James Bond", "cr...	[{"credit_id": "54805967c3a36829b5002c41", "de...

Drop all rows that contain missing (NA) values.

df3 = df2.dropna()
df3.head(3)

	budget	genres	keywords	original_language	original_title	popularity	production_companies	production_countries	release_date	revenue	runtime	vote_average	vote_count	cast	crew
0	237000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	[{"id": 1463, "name": "culture clash"}, {"id":...	en	Avatar	150.437577	[{"name": "Ingenious Film Partners", "id": 289...	[{"iso_3166_1": "US", "name": "United States o...	2009-12-10	2787965087	162.0	7.2	11800	[{"cast_id": 242, "character": "Jake Sully", "...	[{"credit_id": "52fe48009251416c750aca23", "de...
1	300000000	[{"id": 12, "name": "Adventure"}, {"id": 14, "...	[{"id": 270, "name": "ocean"}, {"id": 726, "na...	en	Pirates of the Caribbean: At World's End	139.082615	[{"name": "Walt Disney Pictures", "id": 2}, {"...	[{"iso_3166_1": "US", "name": "United States o...	2007-05-19	961000000	169.0	6.9	4500	[{"cast_id": 4, "character": "Captain Jack Spa...	[{"credit_id": "52fe4232c3a36847f800b579", "de...
2	245000000	[{"id": 28, "name": "Action"}, {"id": 12, "nam...	[{"id": 470, "name": "spy"}, {"id": 818, "name...	en	Spectre	107.376788	[{"name": "Columbia Pictures", "id": 5}, {"nam...	[{"iso_3166_1": "GB", "name": "United Kingdom"...	2015-10-26	880674609	148.0	6.3	4466	[{"cast_id": 1, "character": "James Bond", "cr...	[{"credit_id": "54805967c3a36829b5002c41", "de...

Rename some columns to simplify later code.

df4=df3.rename(columns = {"original_language":"language", "original_title":"title", "production_companies":"companies", "production_countries":"countries", "release_date":"date"})

Prepare the categorical data for later use. In particular, extract the director’s name for each movie from the “crew” column.

from ast import literal_eval
features = ["genres", "keywords", "companies", "countries", "cast", "crew"]
for feature in features:
    df4[feature] = df4[feature].apply(literal_eval)

import numpy as np

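def get_director(x):
    # x is the parsed "crew" list; return the first member whose job is "Director"
    for i in x:
        if i["job"] == "Director":
            return i["name"]
    return np.nan  # no director listed

def get_list(x):
    # Return up to the first three "name" values from a list of dicts
    if isinstance(x, list):
        names = [i["name"] for i in x]
        if len(names) > 3:
            names = names[:3]
        return names
    return []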

df4["director"] = df4["crew"].apply(get_director)

features = ["cast", "keywords", "genres","companies", "countries"]
for feature in features:
    df4[feature] = df4[feature].apply(get_list)
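To make the parsing step above concrete, here is a minimal sketch on a single made-up string shaped like an entry of the “crew” column (the names in `raw_crew` are invented). It shows what `literal_eval` returns and how the director and the first few names can be pulled out of the result.

```python
from ast import literal_eval

# A made-up string in the same shape as the "crew" column entries.
raw_crew = '[{"job": "Editor", "name": "Pat Doe"}, {"job": "Director", "name": "Sam Roe"}]'

# literal_eval turns the text into real Python objects (a list of dicts).
crew = literal_eval(raw_crew)

# First crew member whose job is "Director", or None if there is none.
director = next((p["name"] for p in crew if p["job"] == "Director"), None)

# Keep at most the first three names, as get_list does.
names = [p["name"] for p in crew][:3]

print(director, names)
```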

df5 = df4.drop(columns = ["crew"]).dropna()
df5.head(3)

	budget	genres	keywords	language	title	popularity	companies	countries	date	revenue	runtime	vote_average	vote_count	cast	director
0	237000000	[Action, Adventure, Fantasy]	[culture clash, future, space war]	en	Avatar	150.437577	[Ingenious Film Partners, Twentieth Century Fo...	[United States of America, United Kingdom]	2009-12-10	2787965087	162.0	7.2	11800	[Sam Worthington, Zoe Saldana, Sigourney Weaver]	James Cameron
1	300000000	[Adventure, Fantasy, Action]	[ocean, drug abuse, exotic island]	en	Pirates of the Caribbean: At World's End	139.082615	[Walt Disney Pictures, Jerry Bruckheimer Films...	[United States of America]	2007-05-19	961000000	169.0	6.9	4500	[Johnny Depp, Orlando Bloom, Keira Knightley]	Gore Verbinski
2	245000000	[Action, Adventure, Crime]	[spy, based on novel, secret agent]	en	Spectre	107.376788	[Columbia Pictures, Danjaq, B24]	[United Kingdom, United States of America]	2015-10-26	880674609	148.0	6.3	4466	[Daniel Craig, Christoph Waltz, Léa Seydoux]	Sam Mendes

df5["date"] = pd.to_datetime(df5["date"])

Reorder the columns into the order I want.

ordered_col = ["title", "date", "revenue","budget", "popularity","runtime", "vote_average", "vote_count", "countries", "language", "genres", "keywords", "director", "cast"]

df6=df5[ordered_col]

Movie Ranking#

What are the top 5 movies based on their revenue? Popularity?

df6.sort_values(by="revenue", ascending=False).head(5)

	title	date	revenue	budget	popularity	runtime	vote_average	vote_count	countries	language	genres	keywords	director	cast
0	Avatar	2009-12-10	2787965087	237000000	150.437577	162.0	7.2	11800	[United States of America, United Kingdom]	en	[Action, Adventure, Fantasy]	[culture clash, future, space war]	James Cameron	[Sam Worthington, Zoe Saldana, Sigourney Weaver]
25	Titanic	1997-11-18	1845034188	200000000	100.025899	194.0	7.5	7562	[United States of America]	en	[Drama, Romance, Thriller]	[shipwreck, iceberg, ship]	James Cameron	[Kate Winslet, Leonardo DiCaprio, Frances Fisher]
16	The Avengers	2012-04-25	1519557910	220000000	144.448633	143.0	7.4	11776	[United States of America]	en	[Science Fiction, Action, Adventure]	[new york, shield, marvel comic]	Joss Whedon	[Robert Downey Jr., Chris Evans, Mark Ruffalo]
28	Jurassic World	2015-06-09	1513528810	150000000	418.708552	124.0	6.5	8662	[United States of America]	en	[Action, Adventure, Science Fiction]	[monster, dna, tyrannosaurus rex]	Colin Trevorrow	[Chris Pratt, Bryce Dallas Howard, Irrfan Khan]
44	Furious 7	2015-04-01	1506249360	190000000	102.322217	137.0	7.3	4176	[Japan, United States of America]	en	[Action]	[car race, speed, revenge]	James Wan	[Vin Diesel, Paul Walker, Dwayne Johnson]

pop = df6.sort_values('revenue', ascending=False)
import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))

plt.barh(pop['title'].head(8),pop['revenue'].head(8), align='center',
        color='indianred')
plt.gca().invert_yaxis()
plt.xlabel("Revenue")
plt.title("Profitable Movies")

Text(0.5, 1.0, 'Profitable Movies')

From this bar chart we can see that “Avatar”, “Titanic”, and “The Avengers” (the first one) are the top 3 movies by revenue.
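The ranking above sorts by revenue, which is not quite the same thing as profit. As a rough sketch (with invented numbers, and assuming profit simply means revenue minus production budget), the two rankings can come out differently:

```python
import pandas as pd

# Made-up figures, not taken from the dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "revenue": [900, 500, 700],
    "budget": [800, 100, 200],
})

# Assumption: profit = revenue - budget (marketing and other costs are ignored).
toy["profit"] = toy["revenue"] - toy["budget"]

by_revenue = toy.sort_values("revenue", ascending=False)["title"].tolist()
by_profit = toy.sort_values("profit", ascending=False)["title"].tolist()

print(by_revenue)
print(by_profit)
```

Here movie A has the highest revenue but the lowest profit, so which measure is used matters for the ranking.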

df6.sort_values(by="popularity", ascending=False).head(5)

	title	date	revenue	budget	popularity	runtime	vote_average	vote_count	countries	language	genres	keywords	director	cast
546	Minions	2015-06-17	1156730962	74000000	875.581305	91.0	6.4	4571	[United States of America]	en	[Family, Animation, Adventure]	[assistant, aftercreditsstinger, duringcredits...	Kyle Balda	[Sandra Bullock, Jon Hamm, Michael Keaton]
95	Interstellar	2014-11-05	675120017	165000000	724.247784	169.0	8.1	10867	[Canada, United States of America, United King...	en	[Adventure, Drama, Science Fiction]	[saving the world, artificial intelligence, fa...	Christopher Nolan	[Matthew McConaughey, Jessica Chastain, Anne H...
788	Deadpool	2016-02-09	783112979	58000000	514.569956	108.0	7.4	10995	[United States of America]	en	[Action, Adventure, Comedy]	[anti hero, mercenary, marvel comic]	Tim Miller	[Ryan Reynolds, Morena Baccarin, Ed Skrein]
94	Guardians of the Galaxy	2014-07-30	773328629	170000000	481.098624	121.0	7.9	9742	[United Kingdom, United States of America]	en	[Action, Science Fiction, Adventure]	[marvel comic, spaceship, space]	James Gunn	[Chris Pratt, Zoe Saldana, Dave Bautista]
127	Mad Max: Fury Road	2015-05-13	378858340	150000000	434.278564	120.0	7.2	9427	[Australia, United States of America]	en	[Action, Adventure, Science Fiction]	[future, chase, post-apocalyptic]	George Miller	[Tom Hardy, Charlize Theron, Nicholas Hoult]

pop2 = df6.sort_values('popularity', ascending=False)
import matplotlib.pyplot as plt
plt.figure(figsize=(12,4))

plt.barh(pop2['title'].head(8),pop2['popularity'].head(8), align='center',
        color='skyblue')
plt.gca().invert_yaxis()
plt.xlabel("Popularity")
plt.title("Popular Movies")

Text(0.5, 1.0, 'Popular Movies')

From this bar chart we can see that “Minions”, “Interstellar”, and “Deadpool” are the top 3 most popular movies, which differs from the top 3 movies by revenue.

Computing a movie_score for each movie#

What are IMDb movie ratings?#

IMDb registered users can cast a vote (from 1 to 10) on every released title in the database. Individual votes are then aggregated and summarized as a single IMDb rating, visible on the title’s main page. By “released title” we mean that the movie (or TV show) must have been shown publicly at least once (including festival screening).

Users can update their votes as often as they’d like, but any new vote on the same title will overwrite the previous one, so it is one vote per title per user.

How are the ratings calculated?#

IMDb takes all the individual ratings cast by its registered users and uses them to calculate a single rating. It does not use the arithmetic mean (i.e. the sum of all votes divided by the number of votes), although it does display the mean and median votes on the votes breakdown page; instead, the rating displayed on a title’s page is a weighted average.

The formula for calculating the Top Rated 250 Titles gives a true Bayesian estimate:

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:

R = average rating for the movie (mean)
v = number of votes for the movie
m = minimum votes required to be listed in the Top 250 (currently 25000)
C = the mean vote across the whole report (currently 7.0)

In this project, C is instead the mean of “vote_average” over the dataset, and m is the 90th percentile of “vote_count”.
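To see what the formula above does, here is a small worked sketch with hypothetical values (m = 1000 and C = 6.0 are chosen only for illustration and are not the values used in this project). Two movies share the same mean rating of 9.0 but have very different vote counts.

```python
def weighted_rating(R, v, m, C):
    """Weighted rating: blends the movie's mean R with the overall mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values chosen only for illustration.
m, C = 1000, 6.0

few_votes = weighted_rating(R=9.0, v=100, m=m, C=C)       # about 6.27
many_votes = weighted_rating(R=9.0, v=100_000, m=m, C=C)  # about 8.97

print(round(few_votes, 2), round(many_votes, 2))
```

With few votes the score is pulled strongly toward C, while with many votes it stays close to the movie’s own mean R.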

C = df6["vote_average"].mean()
C

6.110207503667994

m = df6["vote_count"].quantile(0.9)
m

1862.0

movies2 = df6.copy().loc[df6["vote_count"] >= m]
len(movies2)

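def weighted_rating(x, m=m, C=C):
    v = x["vote_count"]      # number of votes for this movie
    R = x["vote_average"]    # mean rating of this movie
    # Blend the movie's own mean R with the overall mean C, as in the formula above
    return (v / (v + m) * R) + (m / (v + m) * C)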

movies2["score"] = movies2.apply(weighted_rating, axis=1)
df6["movie_scores"] = movies2["score"].round(2)

df6.sort_values(by="movie_scores", ascending=False).head(5)

	title	date	revenue	budget	popularity	runtime	vote_average	vote_count	countries	language	genres	keywords	director	cast	movie_scores
1881	The Shawshank Redemption	1994-09-23	28341469	25000000	136.747729	142.0	8.5	8205	[United States of America]	en	[Drama, Crime]	[prison, corruption, police brutality]	Frank Darabont	[Tim Robbins, Morgan Freeman, Bob Gunton]	8.5
3337	The Godfather	1972-03-14	245066411	6000000	143.659698	175.0	8.4	5893	[United States of America]	en	[Drama, Crime]	[italy, love at first sight, loss of father]	Francis Ford Coppola	[Marlon Brando, Al Pacino, James Caan]	8.4
2294	千と千尋の神隠し	2001-07-20	274925095	15000000	118.968562	125.0	8.3	3840	[Japan]	ja	[Fantasy, Adventure, Animation]	[witch, parents kids relationship, magic]	Hayao Miyazaki	[Rumi Hiiragi, Miyu Irino, Mari Natsuki]	8.3
662	Fight Club	1999-10-15	100853753	63000000	146.757391	139.0	8.3	9413	[Germany, United States of America]	en	[Drama]	[support group, dual identity, nihilism]	David Fincher	[Edward Norton, Brad Pitt, Meat Loaf]	8.3
1818	Schindler's List	1993-11-29	321365567	22000000	104.469351	195.0	8.3	4329	[United States of America]	en	[Drama, History, War]	[factory, concentration camp, hero]	Steven Spielberg	[Liam Neeson, Ben Kingsley, Ralph Fiennes]	8.3

Visualize movies’ revenue, budget, and ratings#

Use Altair to display charts of movies’ revenue, budget, and ratings.

import altair as alt

df9 = df6[df6["revenue"] != 0]
df9

	title	date	revenue	budget	popularity	runtime	vote_average	vote_count	countries	language	genres	keywords	director	cast	movie_scores
0	Avatar	2009-12-10	2787965087	237000000	150.437577	162.0	7.2	11800	[United States of America, United Kingdom]	en	[Action, Adventure, Fantasy]	[culture clash, future, space war]	James Cameron	[Sam Worthington, Zoe Saldana, Sigourney Weaver]	7.2
1	Pirates of the Caribbean: At World's End	2007-05-19	961000000	300000000	139.082615	169.0	6.9	4500	[United States of America]	en	[Adventure, Fantasy, Action]	[ocean, drug abuse, exotic island]	Gore Verbinski	[Johnny Depp, Orlando Bloom, Keira Knightley]	6.9
2	Spectre	2015-10-26	880674609	245000000	107.376788	148.0	6.3	4466	[United Kingdom, United States of America]	en	[Action, Adventure, Crime]	[spy, based on novel, secret agent]	Sam Mendes	[Daniel Craig, Christoph Waltz, Léa Seydoux]	6.3
3	The Dark Knight Rises	2012-07-16	1084939099	250000000	112.312950	165.0	7.6	9106	[United States of America]	en	[Action, Crime, Drama]	[dc comics, crime fighter, terrorist]	Christopher Nolan	[Christian Bale, Michael Caine, Gary Oldman]	7.6
4	John Carter	2012-03-07	284139100	260000000	43.926995	132.0	6.1	2124	[United States of America]	en	[Action, Adventure, Science Fiction]	[based on novel, mars, medallion]	Andrew Stanton	[Taylor Kitsch, Lynn Collins, Samantha Morton]	6.1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4775	Funny Ha Ha	2002-09-20	76901	0	0.362633	85.0	6.3	8	[United States of America]	en	[Drama, Comedy]	[mumblecore]	Andrew Bujalski	[Kate Dollenmayer, Mark Herlehy, Christian Rud...	NaN
4788	Pink Flamingos	1972-03-12	6000000	12000	4.553644	93.0	6.2	110	[United States of America]	en	[Horror, Comedy, Crime]	[gay, trailer park, pop culture]	John Waters	[Divine, David Lochary, Mary Vivian Pearce]	NaN
4792	キュア	1997-11-06	99000	20000	0.212443	111.0	7.4	63	[Japan]	ja	[Crime, Horror, Mystery]	[japan, prostitute, hotel]	Kiyoshi Kurosawa	[Koji Yakusho, Masato Hagiwara, Tsuyoshi Ujiki]	NaN
4796	Primer	2004-10-08	424760	7000	23.307949	77.0	6.9	658	[United States of America]	en	[Science Fiction, Drama, Thriller]	[distrust, garage, identity crisis]	Shane Carruth	[Shane Carruth, David Sullivan, Casey Gooden]	NaN
4798	El Mariachi	1992-09-04	2040920	220000	14.269792	81.0	6.6	238	[Mexico, United States of America]	es	[Action, Crime, Thriller]	[united states–mexico barrier, legs, arms]	Robert Rodriguez	[Carlos Gallardo, Jaime de Hoyos, Peter Marqua...	NaN

3374 rows × 15 columns

alt.Chart(df9).mark_circle().encode(
    x="budget",
    y="revenue",
    color="language",
    tooltip=("title", "budget", "revenue")
)

This chart suggests that, in general, the higher the budget, the higher the revenue. However, revenue is not determined by budget alone: the highest-budget movie, “Pirates of the Caribbean: On Stranger Tides”, is not even among the top 10 movies by revenue, and “Avatar”, which generated the highest revenue, did not have the largest production budget.

alt.Chart(df9).mark_circle().encode(
    alt.X("movie_scores",
        scale=alt.Scale(zero=False)
    ),
    y="revenue",
    size="popularity",
    color="title",
    tooltip=("title", "revenue", "popularity", "movie_scores")
)

From this chart, we can see that a movie’s rating does not necessarily correlate with its commercial success (revenue). Most of the highest-revenue movies fall in the middle range of movie ratings (6.0 to 8.0).
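The visual impressions from the charts can also be summarized with numbers. One possible approach, sketched here on a tiny invented table rather than the real data, is the Pearson correlation matrix from pandas `DataFrame.corr()`, where values near 1 or -1 indicate a strong linear relationship and values near 0 a weak one.

```python
import pandas as pd

# Invented numbers: revenue is exactly 2 * budget + 5, rating is unrelated.
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [25, 45, 65, 85, 105],
    "rating": [7.0, 5.5, 8.0, 6.0, 7.5],
})

# Pairwise Pearson correlations between all numeric columns.
corr = toy.corr()
print(corr.round(2))
```

On the project’s data, the analogous call would be on the numeric columns of interest, for example revenue, budget, popularity, and the rating columns.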

Who are the top directors?#

Make a sub-dataframe containing the top 100 movies by “revenue”. Look for the top 3 directors in this sub-dataframe by counting how many of the top 100 movies each one directed. Then draw a separate chart of all the movies made by each of the three directors selected below.

df_top100=df6.sort_values("revenue", ascending=False).head(100)
df_top100["director"].value_counts().sort_values(ascending=False).head(3)

Peter Jackson    6
Michael Bay      4
George Lucas     4
Name: director, dtype: int64

df_topDirectors = df6[df6["director"].isin(["Peter Jackson", "Michael Bay", "Christopher Nolan"])]

alt.Chart(df_topDirectors).mark_circle().encode(
    x=alt.X("movie_scores", scale=alt.Scale(zero=False)),
    y="revenue",
    size="popularity",
    color="title:N",
    tooltip=("director", "title", "revenue", "movie_scores")
).facet("director").resolve_scale(
    x='independent'
)

These charts let us compare the three directors. All three have directed some high box-office movies. Michael Bay’s movies have the lowest range of movie ratings among the three, and Nolan directed “Interstellar”, the most popular of all the movies directed by these three directors.

Relationship between a movie’s revenue and its budget#

Use linear regression to fit and plot the regression line of “revenue” against “budget”.

from sklearn.linear_model import LinearRegression

Create and fit the model

reg = LinearRegression()
reg.fit(df6[["budget"]], df6["revenue"])

LinearRegression()

Making Predictions

df6["pred"] = reg.predict(df6[["budget"]])

base = alt.Chart(df6).mark_circle().encode(
    x="budget",
    y="revenue"
)
base

c1 = alt.Chart(df6).mark_line().encode(
    x="budget",
    y="pred"
)
base+c1

reg.intercept_

-2632810.1895543337

reg.coef_

array([2.9228099])
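Read literally, the fitted coefficient above says that each additional dollar of budget goes with roughly 2.92 additional dollars of revenue under this linear fit (an association, not a causal claim). The sketch below, on synthetic points that lie exactly on a line, shows how the slope, the intercept, and the R² value from `score` can be read off a fitted `LinearRegression`; the numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on the line y = 3x - 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0] - 2.0

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # change in y per one-unit change in x
intercept = model.intercept_  # predicted y when x is 0
r2 = model.score(X, y)        # R^2: share of variance explained by the fit

print(slope, intercept, r2)
```

On real data R² will be below 1, and it gives a sense of how much of the variation in revenue the budget alone accounts for.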

Coefficients of “popularity” and “runtime”#

Use linear regression and feature_names_in_ to get the coefficients of the features “popularity” and “runtime”.

cols2 = ["popularity", "runtime"]

reg2 = LinearRegression()
reg2.fit(df6[cols2], df6["revenue"])

LinearRegression()

pd.Series(reg2.coef_, index=reg2.feature_names_in_)

popularity    3.168460e+06
runtime       8.436382e+05
dtype: float64
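The two coefficients above are in different units (revenue per popularity point versus revenue per minute of runtime), so their sizes are not directly comparable. One common option, sketched here on synthetic data with scikit-learn’s `StandardScaler`, is to standardize the inputs first so that each coefficient describes a one-standard-deviation change. The data and the true weights below are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features on different scales (stand-ins for popularity and runtime).
X = np.column_stack([rng.normal(20, 30, 200), rng.normal(110, 20, 200)])
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 1, 200)

# Rescale each column to mean 0 and standard deviation 1, then refit.
X_scaled = StandardScaler().fit_transform(X)
std_coefs = LinearRegression().fit(X_scaled, y).coef_

print(std_coefs)
```

After scaling, the larger coefficient belongs to the feature whose typical variation moves the target more.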

Which genre is more likely to bring a movie high revenue?#

Use one-hot encoding to convert the categorical genre data into numeric columns that a machine learning algorithm can use, then estimate the contribution that each genre makes to a movie’s revenue.

Each movie might be classified under more than one genre. We use the first genre listed in each movie’s “genres” feature as the one that best describes the movie.

g2 = [c[:1] for c in df6["genres"].tolist()]

mystring = [str(c) for c in g2]

df6["g2"] = mystring

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
encoder.fit(df6[["g2"]])

OneHotEncoder()

new_cols = list(encoder.get_feature_names_out())

df7 = df6.copy()
df7[new_cols] = encoder.transform(df6[["g2"]]).toarray()

encoder.fit_transform(df6[["g2"]])

<4771x21 sparse matrix of type '<class 'numpy.float64'>'
	with 4771 stored elements in Compressed Sparse Row format>

encoder.fit_transform(df6[["g2"]]).toarray()

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 0., 0., 0.]])

reg4 = LinearRegression(fit_intercept = False)
reg4.fit(df7[new_cols], df7["revenue"])

LinearRegression(fit_intercept=False)

pd.Series(reg4.coef_, index = reg4.feature_names_in_).sort_values(ascending=False)

g2_['Animation']          2.437383e+08
g2_['Adventure']          2.109611e+08
g2_['Science Fiction']    1.685224e+08
g2_['Family']             1.653226e+08
g2_['Fantasy']            1.475592e+08
g2_['Action']             1.220232e+08
g2_['History']            7.277602e+07
g2_['Mystery']            6.928847e+07
g2_['Romance']            6.637681e+07
g2_['War']                6.471706e+07
g2_['Thriller']           6.048303e+07
g2_['Comedy']             5.198028e+07
g2_['Western']            4.957033e+07
g2_['Crime']              4.832888e+07
g2_['Drama']              4.555130e+07
g2_['Horror']             4.463309e+07
g2_['Music']              3.170353e+07
g2_['Documentary']        9.057394e+06
g2_['Foreign']            5.565000e+04
g2_['TV Movie']           0.000000e+00
g2_[]                     0.000000e+00
dtype: float64
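One way to read these coefficients: when a linear regression is fit without an intercept on one-hot columns where each row belongs to exactly one category, each coefficient is the mean of the target within that category. The sketch below checks this on a toy table (invented numbers), using `pd.get_dummies` in place of `OneHotEncoder` for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data: one genre label per movie.
toy = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama", "Drama"],
    "revenue": [100.0, 300.0, 10.0, 20.0, 30.0],
})

# One indicator column per genre (get_dummies is used here instead of OneHotEncoder).
dummies = pd.get_dummies(toy["genre"], dtype=float)

reg = LinearRegression(fit_intercept=False).fit(dummies, toy["revenue"])
coefs = pd.Series(reg.coef_, index=dummies.columns)

# Plain per-genre averages for comparison.
means = toy.groupby("genre")["revenue"].mean()

print(coefs)
print(means)
```

Under that reading, a coefficient of 0 for a genre would mean that the movies in that group have an average recorded revenue of 0.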

Use a movie’s budget, popularity, revenue, rating, and runtime information to predict its language.#

Use train_test_split to split the data into training and test sets, with budget, popularity, revenue, rating, and runtime as the input variables and language as the target. Fit a DecisionTreeClassifier, compare the train and test scores (to check for overfitting), and display the classification tree with matplotlib.

from sklearn.model_selection import train_test_split

df8 = df6.dropna()

len(df8["language"].unique())

input_cols = ["budget", "popularity", "revenue", "movie_scores", "runtime"]

X = df8[input_cols]
y = df8["language"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.9, random_state=0)

from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_leaf_nodes=6)

clf.fit(X_train, y_train)

DecisionTreeClassifier(max_leaf_nodes=6)

clf.score(X_train, y_train)

0.9930232558139535

clf.score(X_test, y_test)

0.9791666666666666

The test score is only slightly lower than the train score, suggesting that the model is not overfitting.
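When judging accuracy scores like these, it helps to compare them with a baseline that always predicts the most common class, since a very unbalanced target can make high accuracy easy to reach. The sketch below uses scikit-learn’s `DummyClassifier` on an invented label array (95 “en” and 5 “fr”); the class mix is an assumption for illustration, not the project’s actual distribution.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented, heavily unbalanced labels.
y = np.array(["en"] * 95 + ["fr"] * 5)
X = np.zeros((len(y), 1))  # the features are ignored by this baseline

# Always predict the most frequent class seen during fitting.
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_acc = baseline.score(X, y)

print(baseline_acc)
```

If the decision tree’s test score is close to such a baseline, the tree may be learning little beyond “predict the majority language”.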

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig = plt.figure()
_ = plot_tree(clf,
                    feature_names = clf.feature_names_in_,
                    class_names = clf.classes_,
                    filled = True)

What words appear in movies’ titles most frequently?#

Use wordcloud and matplotlib to display an image of the words that appear most frequently in movie titles. The bigger the word, the more frequently it appears.

from wordcloud import WordCloud

plt.figure(figsize = (12, 12))
token_title = ' '.join(df8['title'].values) 

wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1200).generate(token_title)
plt.imshow(wordcloud)
plt.title('Top words from movie titles ')
plt.axis("off") 
plt.show()
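The word cloud gives a visual impression only. To get actual counts, one simple option is `collections.Counter` from the standard library, sketched here on three example titles. The tokenizing rule (lowercase letters and apostrophes) is an assumption for the example, and unlike WordCloud’s default behavior this sketch does not filter out common stopwords.

```python
from collections import Counter
import re

# A few example titles; in the project this would be the "title" column.
titles = ["The Dark Knight", "The Dark Knight Rises", "Man of Steel"]

# Lowercase everything and split into words made of letters and apostrophes.
words = re.findall(r"[a-z']+", " ".join(titles).lower())

counts = Counter(words)
top = counts.most_common(3)

print(top)
```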

Summary#

In this project, I investigated several important features of a movie and how they relate to its commercial success. The charts suggest that a highly popular movie is more likely to generate high revenue than a highly rated movie. I also presented some rankings of movies and directors.

References#


What is the source of your dataset(s)?

https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata

List any other references that you found helpful.

https://help.imdb.com/article/imdb/track-movies-tv/ratings-faq/G67Y87TFYYP6TWAV

https://medium.com/analytics-vidhya/how-to-use-machine-learning-approach-to-predict-movie-box-office-revenue-success-e2e688669972

https://www.analyticsvidhya.com/blog/2021/05/how-to-build-word-cloud-in-python/

https://www.kaggle.com/code/ibtesama/getting-started-with-a-movie-recommendation-system
