Video Game Sales
Contents
Video Game Sales#
Author: Aditya Krishnan Radhakrishnan
Course Project, UC Irvine, Math 10, F22
Introduction#
The goal of this project is to analyze what factors affect the sales of popular video games in the past couple decades. The project also aims to explore regional preferences of games and create a model to predict the release platform of a game. This analysis will incorporate data manipulation using the Pandas library, plotting graphs using Altair and some machine learning algorithms.
import pandas as pd
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
Exploring and cleaning the data#
df = pd.read_csv("vgsales.csv")
df.dropna(axis=0, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16291 entries, 0 to 16597
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 16291 non-null int64
1 Name 16291 non-null object
2 Platform 16291 non-null object
3 Year 16291 non-null float64
4 Genre 16291 non-null object
5 Publisher 16291 non-null object
6 NA_Sales 16291 non-null float64
7 EU_Sales 16291 non-null float64
8 JP_Sales 16291 non-null float64
9 Other_Sales 16291 non-null float64
10 Global_Sales 16291 non-null float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.5+ MB
df.head()
Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
There are still some games in the dataset with “Unknown” publishers so we will use boolean indexing to delete those rows.
df = df[df["Publisher"] != 'Unknown'].copy()
The ‘Ranked’ column is quite redundant when we already have index.
df.drop(['Rank'], axis=1, inplace=True)
The ‘Year’ column should be converted to datetime values rather than floats.
df['Year'] = df['Year'].map(lambda x: int(x)).copy()
df['Year'] = df['Year'].map(lambda x: f"{x}-01-01").copy()
df['Year'] = pd.to_datetime(df['Year']).copy()
Our dataset is still quite large. We can shrink it by removing the rows closer to the end as those games are not so popular anyways.
df = df[:5000].copy()
df
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006-01-01 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
1 | Super Mario Bros. | NES | 1985-01-01 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
2 | Mario Kart Wii | Wii | 2008-01-01 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
3 | Wii Sports Resort | Wii | 2009-01-01 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
4 | Pokemon Red/Pokemon Blue | GB | 1996-01-01 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5082 | Solaris | 2600 | 1986-01-01 | Shooter | Atari | 0.35 | 0.02 | 0.00 | 0.00 | 0.37 |
5083 | Puppy Palace | DS | 2008-01-01 | Simulation | Ubisoft | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 |
5084 | Super Dragon Ball Z | PS2 | 2006-01-01 | Fighting | Atari | 0.09 | 0.07 | 0.19 | 0.02 | 0.37 |
5085 | Paws & Claws: Pampered Pets | DS | 2009-01-01 | Simulation | THQ | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 |
5086 | ESPN: Sports Connection | WiiU | 2012-01-01 | Sports | Ubisoft | 0.21 | 0.12 | 0.00 | 0.04 | 0.37 |
5000 rows × 10 columns
We are also changing the genre “Platform” to “Platformer” for the future.
df['Genre'] = df['Genre'].map(lambda x: "Platformer" if x == "Platform" else x).copy()
Now we will graph the sales of the video games over time, grouped by genre. From this graph, we can tell what genres are dominated by different companies.
sel = alt.selection_single(fields=['Genre'], bind="legend")
c1 = alt.Chart(df).mark_circle().encode(
x='Year',
y='Global_Sales',
color='Genre:N',
tooltip=['Name', 'Platform', 'Publisher']
).add_selection(sel)
c2 = alt.Chart(df).mark_bar().encode(
x=alt.X('Publisher', sort='-y'),
y='sum(Global_Sales)'
).transform_filter(sel)
c1|c2
Japan’s Preferences (Regression)#
We will use scikit-learn’s LinearRegression to figure out what factors affect the popularity of video games in Japan. We will consider the year, top 5 publishers, and genres. Since we are using regression (not classification), we must first make sure the variables we are using are numerical and not categorical. We will use OneHotEncoder for this. Also, we will only be considering the top 5 publishers to not make the data too complex.
df2 = df[(df['Publisher'] == 'Nintendo') | (df['Publisher'] == 'Electronic Arts') | (df['Publisher'] == 'Activision') | (df['Publisher'] == 'Sony Computer Entertainment') | (df['Publisher'] == 'Ubisoft')].copy()
We convert the release years from datetime values to int values.
df2['Year2'] = df2['Year'].dt.year
We make series of boolean values for the top 5 publishers using OneHotEncoder.
encoder_pub = OneHotEncoder()
encoder_pub.fit(df2[["Publisher"]])
OneHotEncoder()
arr_pub = encoder_pub.transform(df2[["Publisher"]]).toarray()
arr_pub
array([[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0.],
...,
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 1.]])
pub_list = [encoder_pub.get_feature_names_out()[i][10:] for i in range(5)]
pub_list
['Activision',
'Electronic Arts',
'Nintendo',
'Sony Computer Entertainment',
'Ubisoft']
df2[pub_list] = arr_pub
df2
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | Year2 | Activision | Electronic Arts | Nintendo | Sony Computer Entertainment | Ubisoft | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006-01-01 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 | 2006 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | Super Mario Bros. | NES | 1985-01-01 | Platformer | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | 1985 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | Mario Kart Wii | Wii | 2008-01-01 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 | 2008 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | Wii Sports Resort | Wii | 2009-01-01 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 | 2009 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | Pokemon Red/Pokemon Blue | GB | 1996-01-01 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | 1996 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5079 | Harry Potter and the Prisoner of Azkaban | GC | 2004-01-01 | Action | Electronic Arts | 0.29 | 0.08 | 0.00 | 0.01 | 0.38 | 2004 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5080 | Quake 4 | X360 | 2005-01-01 | Shooter | Activision | 0.32 | 0.03 | 0.00 | 0.03 | 0.38 | 2005 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5081 | Petz Rescue: Ocean Patrol | DS | 2008-01-01 | Adventure | Ubisoft | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 | 2008 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5083 | Puppy Palace | DS | 2008-01-01 | Simulation | Ubisoft | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 | 2008 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
5086 | ESPN: Sports Connection | WiiU | 2012-01-01 | Sports | Ubisoft | 0.21 | 0.12 | 0.00 | 0.04 | 0.37 | 2012 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
2316 rows × 16 columns
We then repeat the same process for the genres.
encoder_g = OneHotEncoder()
encoder_g.fit(df2[["Genre"]])
OneHotEncoder()
arr_g = encoder_g.transform(df2[["Genre"]]).toarray()
arr_g
array([[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.]])
g_list = [encoder_g.get_feature_names_out()[i][6:] for i in range(12)]
g_list
['Action',
'Adventure',
'Fighting',
'Misc',
'Platformer',
'Puzzle',
'Racing',
'Role-Playing',
'Shooter',
'Simulation',
'Sports',
'Strategy']
df2[g_list] = arr_g
df2
Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | ... | Fighting | Misc | Platformer | Puzzle | Racing | Role-Playing | Shooter | Simulation | Sports | Strategy | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Wii Sports | Wii | 2006-01-01 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | Super Mario Bros. | NES | 1985-01-01 | Platformer | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | Mario Kart Wii | Wii | 2008-01-01 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | Wii Sports Resort | Wii | 2009-01-01 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
4 | Pokemon Red/Pokemon Blue | GB | 1996-01-01 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
5079 | Harry Potter and the Prisoner of Azkaban | GC | 2004-01-01 | Action | Electronic Arts | 0.29 | 0.08 | 0.00 | 0.01 | 0.38 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5080 | Quake 4 | X360 | 2005-01-01 | Shooter | Activision | 0.32 | 0.03 | 0.00 | 0.03 | 0.38 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
5081 | Petz Rescue: Ocean Patrol | DS | 2008-01-01 | Adventure | Ubisoft | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5083 | Puppy Palace | DS | 2008-01-01 | Simulation | Ubisoft | 0.35 | 0.00 | 0.00 | 0.03 | 0.37 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5086 | ESPN: Sports Connection | WiiU | 2012-01-01 | Sports | Ubisoft | 0.21 | 0.12 | 0.00 | 0.04 | 0.37 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2316 rows × 28 columns
Below is a list of the columns we want to use for the regressor.
cols = list(df2.columns)[10:]
cols
['Year2',
'Activision',
'Electronic Arts',
'Nintendo',
'Sony Computer Entertainment',
'Ubisoft',
'Action',
'Adventure',
'Fighting',
'Misc',
'Platformer',
'Puzzle',
'Racing',
'Role-Playing',
'Shooter',
'Simulation',
'Sports',
'Strategy']
We now instantiate the regressor object, fit the data, and get the coefficients for the predicted values.
reg = LinearRegression(fit_intercept=False)
reg.fit(df2[cols], df2['JP_Sales'])
LinearRegression(fit_intercept=False)
df2['Pred'] = reg.predict(df2[cols])
pd.Series(reg.coef_, index=cols)
Year2 -0.011049
Activision 15.598151
Electronic Arts 15.604179
Nintendo 16.389373
Sony Computer Entertainment 15.737958
Ubisoft 15.638160
Action 6.540390
Adventure 6.371640
Fighting 6.638727
Misc 6.567025
Platformer 6.662443
Puzzle 6.476242
Racing 6.603110
Role-Playing 6.917231
Shooter 6.562349
Simulation 6.626960
Sports 6.567868
Strategy 6.433834
dtype: float64
We can see from the coefficients that the coefficient for year is negative; meaning video games got less popular over time in Japan. We can also see that Japanese sales are higher for Nintendo games. In terms of genre, we can see that role-playing games are popular in Japan (likely due to the immense success of Pokemon).
We can graph the prediction line on top of the data.
sel2 = alt.selection_single(fields=["Publisher"], bind="legend", empty="none")
c3 = alt.Chart(df2).mark_circle().encode(
x='Year',
y='JP_Sales',
color='Publisher',
tooltip = ["Name", "Publisher", "Genre"],
opacity=alt.condition(sel2, alt.value(1), alt.value(0.2)),
size=alt.condition(sel2, alt.value(60), alt.value(20)),
).add_selection(sel2)
c4 = alt.Chart(df2).mark_line(color='black').encode(
x='Year',
y='Pred'
).transform_filter(
sel2
)
c3+c4
We can see from the graph that Nintendo is dominating the Japanese video game market (along with Sony to a certain extent) while other American publishers like Electronic Arts and Activision are not doing as well.
Platforms used by publishers (Classification)#
We will use the K-Nearest Neighbors Classifier to predict the platforms used by the top 5 publishers. Other data we will use to help train the model includes the sales, genre and release year.
X = df2[['Activision',
'Electronic Arts',
'Nintendo',
'Sony Computer Entertainment',
'Ubisoft','NA_Sales',
'EU_Sales',
'JP_Sales',
'Other_Sales',
'Global_Sales',
'Action',
'Adventure',
'Fighting',
'Misc',
'Platformer',
'Puzzle',
'Racing',
'Role-Playing',
'Shooter',
'Simulation',
'Sports',
'Strategy',
'Year2']]
y = df2["Platform"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=2)
Similar to before, we will instantiate the classifier object, fit the data, and obtain the scores.
clf = KNeighborsClassifier(n_neighbors=30)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=30)
clf.score(X_train, y_train)
0.48758099352051837
clf.score(X_test, y_test)
0.44612068965517243
From the classifier’s score on the training set and the test set, we can see that they are fairly close (within 5%). From this we know that we are not overfitting the data with our parameters as the training score is not significantly better than the test score. Although we can get a higher accuracy on the training data by using less neighbors, we will be sacrificing accuracy on the test set which is much more important because it evaluates how well our model can perform on more data.
Looking at the actual values however, we can see that the model is not very accurate, with an accuracy of only 44%. Thus, we can say that it is hard to predict the platform of a game from sales in different regions, the publishers of the game, and its genre.
Summary#
To summarize, we analyzed the sales of video games over time using data on publishers, genres and platforms and cleaned and explored the data using Pandas and Altair respectively. We used OneHotEncoder to allow us to use categorical variables in scikitlearn’s linear regression and reached conclusions on Japan’s taste in video games. We then used a KNeighbors Classifier to try and predict the platform a game was released on, but ended up with a not-so-accurate model meaning it is difficult to predict the platform.
References#
Your code above should include references. Here is some additional space for references.
The sorted bar graph was adapted from https://altair-viz.github.io/gallery/bar_chart_sorted.html Using if and else in lambda functions from https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/
What is the source of your dataset(s)? This dataset called “Video Game Sales” was taken from Kaggle and was uploaded by GREGORYSMITH. https://www.kaggle.com/datasets/gregorut/videogamesales?resource=download
List any other references that you found helpful.
Submission#
Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.
Created in Deepnote