Video Game Sales#

Author: Aditya Krishnan Radhakrishnan

Course Project, UC Irvine, Math 10, F22

Introduction#

The goal of this project is to analyze what factors affect the sales of popular video games in the past couple decades. The project also aims to explore regional preferences of games and create a model to predict the release platform of a game. This analysis will incorporate data manipulation using the Pandas library, plotting graphs using Altair and some machine learning algorithms.

import pandas as pd
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression

Exploring and cleaning the data#

df = pd.read_csv("vgsales.csv")
df.dropna(axis=0, inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16291 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16291 non-null  int64  
 1   Name          16291 non-null  object 
 2   Platform      16291 non-null  object 
 3   Year          16291 non-null  float64
 4   Genre         16291 non-null  object 
 5   Publisher     16291 non-null  object 
 6   NA_Sales      16291 non-null  float64
 7   EU_Sales      16291 non-null  float64
 8   JP_Sales      16291 non-null  float64
 9   Other_Sales   16291 non-null  float64
 10  Global_Sales  16291 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.5+ MB

df.head()

	Rank	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	1	Wii Sports	Wii	2006.0	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	2	Super Mario Bros.	NES	1985.0	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	3	Mario Kart Wii	Wii	2008.0	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	4	Wii Sports Resort	Wii	2009.0	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	5	Pokemon Red/Pokemon Blue	GB	1996.0	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37

There are still some games in the dataset with “Unknown” publishers so we will use boolean indexing to delete those rows.

df = df[df["Publisher"] != 'Unknown'].copy()

The ‘Ranked’ column is quite redundant when we already have index.

df.drop(['Rank'], axis=1, inplace=True)

The ‘Year’ column should be converted to datetime values rather than floats.

df['Year'] = df['Year'].map(lambda x: int(x)).copy()

df['Year'] = df['Year'].map(lambda x: f"{x}-01-01").copy()

df['Year'] = pd.to_datetime(df['Year']).copy()

Our dataset is still quite large. We can shrink it by removing the rows closer to the end as those games are not so popular anyways.

df = df[:5000].copy()
df

	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales
0	Wii Sports	Wii	2006-01-01	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74
1	Super Mario Bros.	NES	1985-01-01	Platform	Nintendo	29.08	3.58	6.81	0.77	40.24
2	Mario Kart Wii	Wii	2008-01-01	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82
3	Wii Sports Resort	Wii	2009-01-01	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00
4	Pokemon Red/Pokemon Blue	GB	1996-01-01	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37
...	...	...	...	...	...	...	...	...	...	...
5082	Solaris	2600	1986-01-01	Shooter	Atari	0.35	0.02	0.00	0.00	0.37
5083	Puppy Palace	DS	2008-01-01	Simulation	Ubisoft	0.35	0.00	0.00	0.03	0.37
5084	Super Dragon Ball Z	PS2	2006-01-01	Fighting	Atari	0.09	0.07	0.19	0.02	0.37
5085	Paws & Claws: Pampered Pets	DS	2009-01-01	Simulation	THQ	0.35	0.00	0.00	0.03	0.37
5086	ESPN: Sports Connection	WiiU	2012-01-01	Sports	Ubisoft	0.21	0.12	0.00	0.04	0.37

5000 rows × 10 columns

We are also changing the genre “Platform” to “Platformer” for the future.

df['Genre'] = df['Genre'].map(lambda x: "Platformer" if x == "Platform" else x).copy()

Now we will graph the sales of the video games over time, grouped by genre. From this graph, we can tell what genres are dominated by different companies.

sel = alt.selection_single(fields=['Genre'], bind="legend")

c1 = alt.Chart(df).mark_circle().encode(
    x='Year',
    y='Global_Sales',
    color='Genre:N',
    tooltip=['Name', 'Platform', 'Publisher']
).add_selection(sel)

c2 = alt.Chart(df).mark_bar().encode(
    x=alt.X('Publisher', sort='-y'),
    y='sum(Global_Sales)'
).transform_filter(sel)

c1|c2

Japan’s Preferences (Regression)#

We will use scikit-learn’s LinearRegression to figure out what factors affect the popularity of video games in Japan. We will consider the year, top 5 publishers, and genres. Since we are using regression (not classification), we must first make sure the variables we are using are numerical and not categorical. We will use OneHotEncoder for this. Also, we will only be considering the top 5 publishers to not make the data too complex.

df2 = df[(df['Publisher'] == 'Nintendo') | (df['Publisher'] == 'Electronic Arts') | (df['Publisher'] == 'Activision') | (df['Publisher'] == 'Sony Computer Entertainment') | (df['Publisher'] == 'Ubisoft')].copy()

We convert the release years from datetime values to int values.

df2['Year2'] = df2['Year'].dt.year

We make series of boolean values for the top 5 publishers using OneHotEncoder.

encoder_pub = OneHotEncoder()

encoder_pub.fit(df2[["Publisher"]])

OneHotEncoder()

arr_pub = encoder_pub.transform(df2[["Publisher"]]).toarray()
arr_pub

array([[0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0.],
       ...,
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 1.]])

pub_list = [encoder_pub.get_feature_names_out()[i][10:] for i in range(5)]
pub_list

['Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft']

df2[pub_list] = arr_pub
df2

	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	Year2	Activision	Electronic Arts	Nintendo	Sony Computer Entertainment	Ubisoft
0	Wii Sports	Wii	2006-01-01	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74	2006	0.0	0.0	1.0	0.0	0.0
1	Super Mario Bros.	NES	1985-01-01	Platformer	Nintendo	29.08	3.58	6.81	0.77	40.24	1985	0.0	0.0	1.0	0.0	0.0
2	Mario Kart Wii	Wii	2008-01-01	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82	2008	0.0	0.0	1.0	0.0	0.0
3	Wii Sports Resort	Wii	2009-01-01	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00	2009	0.0	0.0	1.0	0.0	0.0
4	Pokemon Red/Pokemon Blue	GB	1996-01-01	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	1996	0.0	0.0	1.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5079	Harry Potter and the Prisoner of Azkaban	GC	2004-01-01	Action	Electronic Arts	0.29	0.08	0.00	0.01	0.38	2004	0.0	1.0	0.0	0.0	0.0
5080	Quake 4	X360	2005-01-01	Shooter	Activision	0.32	0.03	0.00	0.03	0.38	2005	1.0	0.0	0.0	0.0	0.0
5081	Petz Rescue: Ocean Patrol	DS	2008-01-01	Adventure	Ubisoft	0.35	0.00	0.00	0.03	0.37	2008	0.0	0.0	0.0	0.0	1.0
5083	Puppy Palace	DS	2008-01-01	Simulation	Ubisoft	0.35	0.00	0.00	0.03	0.37	2008	0.0	0.0	0.0	0.0	1.0
5086	ESPN: Sports Connection	WiiU	2012-01-01	Sports	Ubisoft	0.21	0.12	0.00	0.04	0.37	2012	0.0	0.0	0.0	0.0	1.0

2316 rows × 16 columns

We then repeat the same process for the genres.

encoder_g = OneHotEncoder()

encoder_g.fit(df2[["Genre"]])

OneHotEncoder()

arr_g = encoder_g.transform(df2[["Genre"]]).toarray()
arr_g

array([[0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.]])

g_list = [encoder_g.get_feature_names_out()[i][6:] for i in range(12)]
g_list

['Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy']

df2[g_list] = arr_g
df2

	Name	Platform	Year	Genre	Publisher	NA_Sales	EU_Sales	JP_Sales	Other_Sales	Global_Sales	...	Fighting	Misc	Platformer	Puzzle	Racing	Role-Playing	Shooter	Simulation	Sports	Strategy
0	Wii Sports	Wii	2006-01-01	Sports	Nintendo	41.49	29.02	3.77	8.46	82.74	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
1	Super Mario Bros.	NES	1985-01-01	Platformer	Nintendo	29.08	3.58	6.81	0.77	40.24	...	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	Mario Kart Wii	Wii	2008-01-01	Racing	Nintendo	15.85	12.88	3.79	3.31	35.82	...	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0	0.0
3	Wii Sports Resort	Wii	2009-01-01	Sports	Nintendo	15.75	11.01	3.28	2.96	33.00	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0
4	Pokemon Red/Pokemon Blue	GB	1996-01-01	Role-Playing	Nintendo	11.27	8.89	10.22	1.00	31.37	...	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
5079	Harry Potter and the Prisoner of Azkaban	GC	2004-01-01	Action	Electronic Arts	0.29	0.08	0.00	0.01	0.38	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5080	Quake 4	X360	2005-01-01	Shooter	Activision	0.32	0.03	0.00	0.03	0.38	...	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0	0.0
5081	Petz Rescue: Ocean Patrol	DS	2008-01-01	Adventure	Ubisoft	0.35	0.00	0.00	0.03	0.37	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
5083	Puppy Palace	DS	2008-01-01	Simulation	Ubisoft	0.35	0.00	0.00	0.03	0.37	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0	0.0
5086	ESPN: Sports Connection	WiiU	2012-01-01	Sports	Ubisoft	0.21	0.12	0.00	0.04	0.37	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	1.0	0.0

2316 rows × 28 columns

Below is a list of the columns we want to use for the regressor.

cols = list(df2.columns)[10:]
cols

['Year2',
 'Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft',
 'Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy']

We now instantiate the regressor object, fit the data, and get the coefficients for the predicted values.

reg = LinearRegression(fit_intercept=False)

reg.fit(df2[cols], df2['JP_Sales'])

LinearRegression(fit_intercept=False)

df2['Pred'] = reg.predict(df2[cols])

pd.Series(reg.coef_, index=cols)

Year2                          -0.011049
Activision                     15.598151
Electronic Arts                15.604179
Nintendo                       16.389373
Sony Computer Entertainment    15.737958
Ubisoft                        15.638160
Action                          6.540390
Adventure                       6.371640
Fighting                        6.638727
Misc                            6.567025
Platformer                      6.662443
Puzzle                          6.476242
Racing                          6.603110
Role-Playing                    6.917231
Shooter                         6.562349
Simulation                      6.626960
Sports                          6.567868
Strategy                        6.433834
dtype: float64

We can see from the coefficients that the coefficient for year is negative; meaning video games got less popular over time in Japan. We can also see that Japanese sales are higher for Nintendo games. In terms of genre, we can see that role-playing games are popular in Japan (likely due to the immense success of Pokemon).

We can graph the prediction line on top of the data.

sel2 = alt.selection_single(fields=["Publisher"], bind="legend", empty="none")

c3 = alt.Chart(df2).mark_circle().encode(
    x='Year',
    y='JP_Sales',
    color='Publisher',
    tooltip = ["Name", "Publisher", "Genre"],
    opacity=alt.condition(sel2, alt.value(1), alt.value(0.2)),
    size=alt.condition(sel2, alt.value(60), alt.value(20)),
).add_selection(sel2)

c4 = alt.Chart(df2).mark_line(color='black').encode(
    x='Year',
    y='Pred'
).transform_filter(
    sel2
)

c3+c4

We can see from the graph that Nintendo is dominating the Japanese video game market (along with Sony to a certain extent) while other American publishers like Electronic Arts and Activision are not doing as well.

Platforms used by publishers (Classification)#

We will use the K-Nearest Neighbors Classifier to predict the platforms used by the top 5 publishers. Other data we will use to help train the model includes the sales, genre and release year.

X = df2[['Activision',
 'Electronic Arts',
 'Nintendo',
 'Sony Computer Entertainment',
 'Ubisoft','NA_Sales',
 'EU_Sales',
 'JP_Sales',
 'Other_Sales',
 'Global_Sales',
 'Action',
 'Adventure',
 'Fighting',
 'Misc',
 'Platformer',
 'Puzzle',
 'Racing',
 'Role-Playing',
 'Shooter',
 'Simulation',
 'Sports',
 'Strategy',
 'Year2']]
y = df2["Platform"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=2)

Similar to before, we will instantiate the classifier object, fit the data, and obtain the scores.

clf = KNeighborsClassifier(n_neighbors=30)

clf.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=30)

clf.score(X_train, y_train)

0.48758099352051837

clf.score(X_test, y_test)

0.44612068965517243

From the classifier’s score on the training set and the test set, we can see that they are fairly close (within 5%). From this we know that we are not overfitting the data with our parameters as the training score is not significantly better than the test score. Although we can get a higher accuracy on the training data by using less neighbors, we will be sacrificing accuracy on the test set which is much more important because it evaluates how well our model can perform on more data.

Looking at the actual values however, we can see that the model is not very accurate, with an accuracy of only 44%. Thus, we can say that it is hard to predict the platform of a game from sales in different regions, the publishers of the game, and its genre.

Summary#

To summarize, we analyzed the sales of video games over time using data on publishers, genres and platforms and cleaned and explored the data using Pandas and Altair respectively. We used OneHotEncoder to allow us to use categorical variables in scikitlearn’s linear regression and reached conclusions on Japan’s taste in video games. We then used a KNeighbors Classifier to try and predict the platform a game was released on, but ended up with a not-so-accurate model meaning it is difficult to predict the platform.

References#

Your code above should include references. Here is some additional space for references.

The sorted bar graph was adapted from https://altair-viz.github.io/gallery/bar_chart_sorted.html Using if and else in lambda functions from https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/

What is the source of your dataset(s)? This dataset called “Video Game Sales” was taken from Kaggle and was uploaded by GREGORYSMITH. https://www.kaggle.com/datasets/gregorut/videogamesales?resource=download

List any other references that you found helpful.

Submission#

Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.

Created in Deepnote

UC Irvine Math 10, Fall 2022

Video Game Sales

Contents