Video Game Sales#

Author: Aditya Krishnan Radhakrishnan

Course Project, UC Irvine, Math 10, F22

Introduction#

The goal of this project is to analyze what factors affect the sales of popular video games in the past couple decades. The project also aims to explore regional preferences of games and create a model to predict the release platform of a game. This analysis will incorporate data manipulation using the Pandas library, plotting graphs using Altair and some machine learning algorithms.

import pandas as pd
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

Exploring and cleaning the data#

df = pd.read_csv("vgsales.csv")
df.dropna(axis=0, inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16291 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16291 non-null  int64  
 1   Name          16291 non-null  object 
 2   Platform      16291 non-null  object 
 3   Year          16291 non-null  float64
 4   Genre         16291 non-null  object 
 5   Publisher     16291 non-null  object 
 6   NA_Sales      16291 non-null  float64
 7   EU_Sales      16291 non-null  float64
 8   JP_Sales      16291 non-null  float64
 9   Other_Sales   16291 non-null  float64
 10  Global_Sales  16291 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.5+ MB
df.head()
Rank Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 1 Wii Sports Wii 2006.0 Sports Nintendo 41.49 29.02 3.77 8.46 82.74
1 2 Super Mario Bros. NES 1985.0 Platform Nintendo 29.08 3.58 6.81 0.77 40.24
2 3 Mario Kart Wii Wii 2008.0 Racing Nintendo 15.85 12.88 3.79 3.31 35.82
3 4 Wii Sports Resort Wii 2009.0 Sports Nintendo 15.75 11.01 3.28 2.96 33.00
4 5 Pokemon Red/Pokemon Blue GB 1996.0 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37

There are still some games in the dataset with “Unknown” publishers so we will use boolean indexing to delete those rows.

df = df[df["Publisher"] != 'Unknown'].copy()

The ‘Ranked’ column is quite redundant when we already have index.

df.drop(['Rank'], axis=1, inplace=True)

The ‘Year’ column should be converted to datetime values rather than floats.

df['Year'] = df['Year'].map(lambda x: int(x)).copy()
df['Year'] = df['Year'].map(lambda x: f"{x}-01-01").copy()
df['Year'] = pd.to_datetime(df['Year']).copy()

Our dataset is still quite large. We can shrink it by removing the rows closer to the end as those games are not so popular anyways.

df = df[:5000].copy()
df
Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales
0 Wii Sports Wii 2006-01-01 Sports Nintendo 41.49 29.02 3.77 8.46 82.74
1 Super Mario Bros. NES 1985-01-01 Platform Nintendo 29.08 3.58 6.81 0.77 40.24
2 Mario Kart Wii Wii 2008-01-01 Racing Nintendo 15.85 12.88 3.79 3.31 35.82
3 Wii Sports Resort Wii 2009-01-01 Sports Nintendo 15.75 11.01 3.28 2.96 33.00
4 Pokemon Red/Pokemon Blue GB 1996-01-01 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37
... ... ... ... ... ... ... ... ... ... ...
5082 Solaris 2600 1986-01-01 Shooter Atari 0.35 0.02 0.00 0.00 0.37
5083 Puppy Palace DS 2008-01-01 Simulation Ubisoft 0.35 0.00 0.00 0.03 0.37
5084 Super Dragon Ball Z PS2 2006-01-01 Fighting Atari 0.09 0.07 0.19 0.02 0.37
5085 Paws & Claws: Pampered Pets DS 2009-01-01 Simulation THQ 0.35 0.00 0.00 0.03 0.37
5086 ESPN: Sports Connection WiiU 2012-01-01 Sports Ubisoft 0.21 0.12 0.00 0.04 0.37

5000 rows × 10 columns

We are also changing the genre “Platform” to “Platformer” for the future.

df['Genre'] = df['Genre'].map(lambda x: "Platformer" if x == "Platform" else x).copy()

Now we will graph the sales of the video games over time, grouped by genre. From this graph, we can tell what genres are dominated by different companies.

sel = alt.selection_single(fields=['Genre'], bind="legend")

c1 = alt.Chart(df).mark_circle().encode(
    x='Year',
    y='Global_Sales',
    color='Genre:N',
    tooltip=['Name', 'Platform', 'Publisher']
).add_selection(sel)

c2 = alt.Chart(df).mark_bar().encode(
    x=alt.X('Publisher', sort='-y'),
    y='sum(Global_Sales)'
).transform_filter(sel)

c1|c2

Japan’s Preferences (Regression)#

We will use scikit-learn’s LinearRegression to figure out what factors affect the popularity of video games in Japan. We will consider the year, top 5 publishers, and genres. Since we are using regression (not classification), we must first make sure the variables we are using are numerical and not categorical. We will use OneHotEncoder for this. Also, we will only be considering the top 5 publishers to not make the data too complex.

df2 = df[(df['Publisher'] == 'Nintendo') | (df['Publisher'] == 'Electronic Arts') | (df['Publisher'] == 'Activision') | (df['Publisher'] == 'Sony Computer Entertainment') | (df['Publisher'] == 'Ubisoft')].copy()

We convert the release years from datetime values to int values.

df2['Year2'] = df2['Year'].dt.year

We make series of boolean values for the top 5 publishers using OneHotEncoder.

encoder_pub = OneHotEncoder()
encoder_pub.fit(df2[["Publisher"]])
OneHotEncoder()
arr_pub = encoder_pub.transform(df2[["Publisher"]]).toarray()
arr_pub
array([[0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])
pub_list = [encoder_pub.get_feature_names_out()[i][10:] for i in range(5)]
pub_list
['Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft']
df2[pub_list] = arr_pub
df2
Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Year2 Activision Electronic Arts Nintendo Sony Computer Entertainment Ubisoft
0 Wii Sports Wii 2006-01-01 Sports Nintendo 41.49 29.02 3.77 8.46 82.74 2006 0.0 0.0 1.0 0.0 0.0
1 Super Mario Bros. NES 1985-01-01 Platformer Nintendo 29.08 3.58 6.81 0.77 40.24 1985 0.0 0.0 1.0 0.0 0.0
2 Mario Kart Wii Wii 2008-01-01 Racing Nintendo 15.85 12.88 3.79 3.31 35.82 2008 0.0 0.0 1.0 0.0 0.0
3 Wii Sports Resort Wii 2009-01-01 Sports Nintendo 15.75 11.01 3.28 2.96 33.00 2009 0.0 0.0 1.0 0.0 0.0
4 Pokemon Red/Pokemon Blue GB 1996-01-01 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37 1996 0.0 0.0 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5079 Harry Potter and the Prisoner of Azkaban GC 2004-01-01 Action Electronic Arts 0.29 0.08 0.00 0.01 0.38 2004 0.0 1.0 0.0 0.0 0.0
5080 Quake 4 X360 2005-01-01 Shooter Activision 0.32 0.03 0.00 0.03 0.38 2005 1.0 0.0 0.0 0.0 0.0
5081 Petz Rescue: Ocean Patrol DS 2008-01-01 Adventure Ubisoft 0.35 0.00 0.00 0.03 0.37 2008 0.0 0.0 0.0 0.0 1.0
5083 Puppy Palace DS 2008-01-01 Simulation Ubisoft 0.35 0.00 0.00 0.03 0.37 2008 0.0 0.0 0.0 0.0 1.0
5086 ESPN: Sports Connection WiiU 2012-01-01 Sports Ubisoft 0.21 0.12 0.00 0.04 0.37 2012 0.0 0.0 0.0 0.0 1.0

2316 rows × 16 columns

We then repeat the same process for the genres.

encoder_g = OneHotEncoder()
encoder_g.fit(df2[["Genre"]])
OneHotEncoder()
arr_g = encoder_g.transform(df2[["Genre"]]).toarray()
arr_g
array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])
g_list = [encoder_g.get_feature_names_out()[i][6:] for i in range(12)]
g_list
['Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy']
df2[g_list] = arr_g
df2
Name Platform Year Genre Publisher NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales ... Fighting Misc Platformer Puzzle Racing Role-Playing Shooter Simulation Sports Strategy
0 Wii Sports Wii 2006-01-01 Sports Nintendo 41.49 29.02 3.77 8.46 82.74 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 Super Mario Bros. NES 1985-01-01 Platformer Nintendo 29.08 3.58 6.81 0.77 40.24 ... 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Mario Kart Wii Wii 2008-01-01 Racing Nintendo 15.85 12.88 3.79 3.31 35.82 ... 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
3 Wii Sports Resort Wii 2009-01-01 Sports Nintendo 15.75 11.01 3.28 2.96 33.00 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 Pokemon Red/Pokemon Blue GB 1996-01-01 Role-Playing Nintendo 11.27 8.89 10.22 1.00 31.37 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5079 Harry Potter and the Prisoner of Azkaban GC 2004-01-01 Action Electronic Arts 0.29 0.08 0.00 0.01 0.38 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5080 Quake 4 X360 2005-01-01 Shooter Activision 0.32 0.03 0.00 0.03 0.38 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0
5081 Petz Rescue: Ocean Patrol DS 2008-01-01 Adventure Ubisoft 0.35 0.00 0.00 0.03 0.37 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5083 Puppy Palace DS 2008-01-01 Simulation Ubisoft 0.35 0.00 0.00 0.03 0.37 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
5086 ESPN: Sports Connection WiiU 2012-01-01 Sports Ubisoft 0.21 0.12 0.00 0.04 0.37 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0

2316 rows × 28 columns

Below is a list of the columns we want to use for the regressor.

cols = list(df2.columns)[10:]
cols
['Year2',
 'Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft',
 'Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy']

We now instantiate the regressor object, fit the data, and get the coefficients for the predicted values.

reg = LinearRegression(fit_intercept=False)
reg.fit(df2[cols], df2['JP_Sales'])
LinearRegression(fit_intercept=False)
df2['Pred'] = reg.predict(df2[cols])
pd.Series(reg.coef_, index=cols)
Year2                          -0.011049
Activision                     15.598151
Electronic Arts                15.604179
Nintendo                       16.389373
Sony Computer Entertainment    15.737958
Ubisoft                        15.638160
Action                          6.540390
Adventure                       6.371640
Fighting                        6.638727
Misc                            6.567025
Platformer                      6.662443
Puzzle                          6.476242
Racing                          6.603110
Role-Playing                    6.917231
Shooter                         6.562349
Simulation                      6.626960
Sports                          6.567868
Strategy                        6.433834
dtype: float64

We can see from the coefficients that the coefficient for year is negative; meaning video games got less popular over time in Japan. We can also see that Japanese sales are higher for Nintendo games. In terms of genre, we can see that role-playing games are popular in Japan (likely due to the immense success of Pokemon).

We can graph the prediction line on top of the data.

sel2 = alt.selection_single(fields=["Publisher"], bind="legend", empty="none")

c3 = alt.Chart(df2).mark_circle().encode(
    x='Year',
    y='JP_Sales',
    color='Publisher',
    tooltip = ["Name", "Publisher", "Genre"],
    opacity=alt.condition(sel2, alt.value(1), alt.value(0.2)),
    size=alt.condition(sel2, alt.value(60), alt.value(20)),
).add_selection(sel2)

c4 = alt.Chart(df2).mark_line(color='black').encode(
    x='Year',
    y='Pred'
).transform_filter(
    sel2
)

c3+c4

We can see from the graph that Nintendo is dominating the Japanese video game market (along with Sony to a certain extent) while other American publishers like Electronic Arts and Activision are not doing as well.

Platforms used by publishers (Classification)#

We will use the K-Nearest Neighbors Classifier to predict the platforms used by the top 5 publishers. Other data we will use to help train the model includes the sales, genre and release year.

X = df2[['Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft','NA_Sales',
 'EU_Sales',
 'JP_Sales',
 'Other_Sales',
 'Global_Sales',
 'Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy',
 'Year2']]
y = df2["Platform"]
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=2)

Similar to before, we will instantiate the classifier object, fit the data, and obtain the scores.

clf = KNeighborsClassifier(n_neighbors=30)
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=30)
clf.score(X_train, y_train)
0.48758099352051837
clf.score(X_test, y_test)
0.44612068965517243

From the classifier’s score on the training set and the test set, we can see that they are fairly close (within 5%). From this we know that we are not overfitting the data with our parameters as the training score is not significantly better than the test score. Although we can get a higher accuracy on the training data by using less neighbors, we will be sacrificing accuracy on the test set which is much more important because it evaluates how well our model can perform on more data.

Looking at the actual values however, we can see that the model is not very accurate, with an accuracy of only 44%. Thus, we can say that it is hard to predict the platform of a game from sales in different regions, the publishers of the game, and its genre.

Summary#

To summarize, we analyzed the sales of video games over time using data on publishers, genres and platforms and cleaned and explored the data using Pandas and Altair respectively. We used OneHotEncoder to allow us to use categorical variables in scikitlearn’s linear regression and reached conclusions on Japan’s taste in video games. We then used a KNeighbors Classifier to try and predict the platform a game was released on, but ended up with a not-so-accurate model meaning it is difficult to predict the platform.

References#

Your code above should include references. Here is some additional space for references.

The sorted bar graph was adapted from https://altair-viz.github.io/gallery/bar_chart_sorted.html Using if and else in lambda functions from https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/

  • List any other references that you found helpful.

Submission#

Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.

Created in deepnote.com Created in Deepnote