Project Title

Author: Siyu Sun

Course Project, UC Irvine, Math 10, W22

Introduction

I import a dataset about sales of PS4 games around the world. I want to investigate basic information about the dataset, for example which publisher is the most popular and which genre has the highest sales. I also ask whether it is possible to predict a game's genre from its sales in each region of the world.

Main portion of the project

import numpy as np
import pandas as pd
import altair as alt

This is the dataset that I am investigating

df = pd.read_csv('PS4_GamesSales.csv',encoding='unicode_escape')
df.dropna(inplace=True)
df.head()
Game Year Genre Publisher North America Europe Japan Rest of World Global
0 Grand Theft Auto V 2014.0 Action Rockstar Games 6.06 9.71 0.60 3.02 19.39
1 Call of Duty: Black Ops 3 2015.0 Shooter Activision 6.18 6.05 0.41 2.44 15.09
2 Red Dead Redemption 2 2018.0 Action-Adventure Rockstar Games 5.26 6.21 0.21 2.26 13.94
3 Call of Duty: WWII 2017.0 Shooter Activision 4.67 6.21 0.40 2.12 13.40
4 FIFA 18 2017.0 Sports EA Sports 1.27 8.64 0.15 1.73 11.80

Some basic information about the dataset

df.shape
(825, 9)
print(f"The number of rows in this dataset is {df.shape[0]}")
The number of rows in this dataset is 825
df.dtypes
Game              object
Year             float64
Genre             object
Publisher         object
North America    float64
Europe           float64
Japan            float64
Rest of World    float64
Global           float64
dtype: object

The chart shows the global sales of each genre. From the chart, we can tell that Action and Shooter have the highest global sales.

alt.Chart(df).mark_bar().encode(
    x="Genre",
    y="Global",
    color=alt.Color("Genre", title="Genre type"),
).properties(
    title="Global sales of each genre",
    width=1000,
    height=200,
)

How many games sold more in North America than in Europe? 345 games sold more in North America, while the other 480 sold at least as much in Europe (the False count includes ties).

(df.loc[:,"North America"] > df.loc[:,"Europe"]).value_counts()
False    480
True     345
dtype: int64
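Because a single True/False comparison folds ties into the False group, it can help to count the three outcomes separately. A minimal sketch, assuming a small stand-in table: the first three rows are copied from the output above, the last two are hypothetical ties, and the name sales is illustrative.

```python
import pandas as pd

# Stand-in table with the same column names as the dataset.
# The first three rows come from the head() output; the last two are hypothetical ties.
sales = pd.DataFrame({
    "North America": [6.06, 6.18, 1.27, 0.10, 0.00],
    "Europe":        [9.71, 6.05, 8.64, 0.10, 0.00],
})

na, eu = sales["North America"], sales["Europe"]

# Count each outcome separately instead of a single True/False split.
comparison = pd.Series({
    "NA higher": int((na > eu).sum()),
    "Europe higher": int((na < eu).sum()),
    "Tie": int((na == eu).sum()),
})
print(comparison)
```

On the real data, the same three counts would show how many of the 480 False rows are actual European leads and how many are ties.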

Find the most frequent genre in the dataset. The most frequent genre is “Action”.

df['Genre'].value_counts().idxmax()
'Action'

Global total sales of each genre. (The genre with the most global sales is “Action”.)

A = df.groupby('Genre')['Global'].sum()
A
Genre
Action              136.82
Action-Adventure     61.86
Adventure            15.22
Fighting             19.36
MMO                   3.52
Misc                 12.47
Music                 5.03
Party                 0.65
Platform             17.85
Puzzle                0.52
Racing               25.29
Role-Playing         62.73
Shooter             134.99
Simulation            4.52
Sports               92.85
Strategy              1.28
Visual Novel          0.46
Name: Global, dtype: float64

Global average sales of each genre. (The genre with the highest global average sales is Shooter.)

B = df.groupby('Genre')['Global'].mean()
B
Genre
Action              0.667415
Action-Adventure    1.627895
Adventure           0.214366
Fighting            0.605000
MMO                 0.440000
Misc                0.226727
Music               0.279444
Party               0.325000
Platform            0.540909
Puzzle              0.052000
Racing              0.526875
Role-Playing        0.586262
Shooter             1.799867
Simulation          0.215238
Sports              1.345652
Strategy            0.051200
Visual Novel        0.057500
Name: Global, dtype: float64
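The totals and averages can also be computed in one pass, which makes it easier to see why a genre with many titles can lead in total sales but not in average sales. A minimal sketch, assuming a hypothetical five-row table (toy and summary are illustrative names; the values are taken from the rows shown earlier):

```python
import pandas as pd

# Hypothetical mini-table; values are copied from rows shown earlier in the notebook.
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter", "Shooter", "Sports"],
    "Global": [19.39, 10.33, 15.09, 13.40, 11.80],
})

# Select the numeric column first, then compute several statistics at once.
summary = (
    toy.groupby("Genre")["Global"]
    .agg(["sum", "mean", "count"])
    .sort_values("sum", ascending=False)
)
print(summary)
```

Selecting the Global column before aggregating keeps text columns such as Game and Publisher out of the computation.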

Predict genre from sales of the game from each region of the world

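In this case, I use KNeighborsClassifier to predict the genre, since this is a classification problem. I chose n_neighbors=3 because I found that the prediction accuracy was higher for small values of n_neighbors. Note that accuracy measured on the same data used for fitting tends to favor small values of n_neighbors, so this choice does not by itself show that the model generalizes better.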

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
clf = KNeighborsClassifier(n_neighbors=3)
X=df[['North America','Europe','Japan','Rest of World']].copy()
y=df['Genre']
clf.fit(X,y)
KNeighborsClassifier(n_neighbors=3)
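The choice of n_neighbors can also be compared with cross-validation, which scores each candidate on rows held out from fitting. A minimal sketch, assuming synthetic data from make_classification in place of the regional sales columns; X_demo, y_demo, and the list of candidate values are illustrative and not taken from the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in: 4 features (like the 4 regions) and 3 classes (like genres).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Mean 5-fold cross-validated accuracy for each candidate n_neighbors.
cv_scores = {}
for k in (1, 3, 5, 11, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    cv_scores[k] = scores.mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores, best_k)
```

With the real data, X and y from the cell above would replace X_demo and y_demo; genres with very few games may trigger a warning about small classes during the stratified split.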

The “pred” column shows the genre predicted by KNeighborsClassifier.

X["pred"] = clf.predict(X)
X.head()
North America Europe Japan Rest of World pred
0 6.06 9.71 0.60 3.02 Action
1 6.18 6.05 0.41 2.44 Shooter
2 5.26 6.21 0.21 2.26 Shooter
3 4.67 6.21 0.40 2.12 Shooter
4 1.27 8.64 0.15 1.73 Sports

Find the accuracy of the prediction on the data used for fitting. The result shows that it is not a good estimate.

np.count_nonzero(X['pred'] == df['Genre'])/len(df)
0.42424242424242425
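The manual ratio above is the same quantity that scikit-learn reports through the classifier's score method. A minimal sketch that checks this on random stand-in data (X_demo, y_demo, and clf_demo are hypothetical names, and the random labels carry no real signal):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random stand-in data: 60 rows, 4 features, 3 genre-like labels.
rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
y_demo = rng.choice(["Action", "Shooter", "Sports"], size=60)

clf_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Fraction of correct predictions, computed by hand and via the built-in method.
manual = np.count_nonzero(clf_demo.predict(X_demo) == y_demo) / len(y_demo)
builtin = clf_demo.score(X_demo, y_demo)
print(manual, builtin)
```

Using score avoids hard-coding the number of rows in the denominator.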

I want to test whether the model is over-fitting or under-fitting in this case.

del X["pred"]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
0.41515151515151516
np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
0.46060606060606063

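The train score is 0.41 and the test score is 0.46. Because clf was fit on all of X before the split, both subsets were already seen during fitting, so the gap between the two scores does not measure how well the model generalizes. Both scores are low, which is more consistent with under-fitting than with over-fitting.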
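One way to get scores that separate seen rows from unseen rows is to fit the classifier only on the training part after the split. A minimal sketch, assuming synthetic data from make_classification in place of the sales columns; the _demo names, X_tr, X_te, y_tr, y_te, and random_state=0 are illustrative choices, not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the four regional sales columns and the genre labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Split first, then fit on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
clf_demo = KNeighborsClassifier(n_neighbors=3)
clf_demo.fit(X_tr, y_tr)

train_acc = clf_demo.score(X_tr, y_tr)  # accuracy on rows the model has seen
test_acc = clf_demo.score(X_te, y_te)   # accuracy on held-out rows
print(train_acc, test_acc)
```

A train score far above the test score would point toward over-fitting, while low scores on both would point toward under-fitting.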

Create a new DataFrame df2 that contains only the Action and Shooter genres.

df2 = df[df["Genre"].isin(["Action","Shooter"])]
df2.head()
Game Year Genre Publisher North America Europe Japan Rest of World Global
0 Grand Theft Auto V 2014.0 Action Rockstar Games 6.06 9.71 0.60 3.02 19.39
1 Call of Duty: Black Ops 3 2015.0 Shooter Activision 6.18 6.05 0.41 2.44 15.09
3 Call of Duty: WWII 2017.0 Shooter Activision 4.67 6.21 0.40 2.12 13.40
6 Uncharted (PS4) 2016.0 Action Sony Interactive Entertainment 4.49 3.93 0.21 1.70 10.33
8 Call of Duty: Infinite Warfare 2016.0 Shooter Activision 3.11 3.83 0.19 1.36 8.48
c = alt.Chart(df2).mark_circle().encode(
    x="North America",
    y="Global",
    color=alt.Color("Genre", title="Genre type"),
).properties(
    title="North America vs. global sales of Action and Shooter games",
    width=1000,
    height=200,
)
c

Will the prediction become more accurate if I include only two genres in the dataset? Note that the code below builds X2 and y2 from df rather than df2, so it refits on the full dataset and reproduces the original accuracy of 0.42. It therefore does not yet test the two-genre case.

X2=df[['North America','Europe','Japan','Rest of World']].copy()
y2=df['Genre']
clf.fit(X2,y2)
clf.score(X2,y2)
0.42424242424242425
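To restrict the fit to two genres, the features and labels would be taken from the filtered frame rather than from df. A minimal sketch using seven rows copied from the tables above (toy, toy2, regions, and the _demo names are hypothetical); with so few rows the score itself is not meaningful, and the point is only that the classifier sees two classes.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Seven rows copied from the head() outputs shown earlier.
toy = pd.DataFrame({
    "Genre": ["Action", "Shooter", "Action-Adventure", "Shooter",
              "Sports", "Action", "Shooter"],
    "North America": [6.06, 6.18, 5.26, 4.67, 1.27, 4.49, 3.11],
    "Europe":        [9.71, 6.05, 6.21, 6.21, 8.64, 3.93, 3.83],
    "Japan":         [0.60, 0.41, 0.21, 0.40, 0.15, 0.21, 0.19],
    "Rest of World": [3.02, 2.44, 2.26, 2.12, 1.73, 1.70, 1.36],
})
regions = ["North America", "Europe", "Japan", "Rest of World"]

# Filter first, then build features and labels from the filtered frame.
toy2 = toy[toy["Genre"].isin(["Action", "Shooter"])]
X2_demo = toy2[regions]
y2_demo = toy2["Genre"]

clf_demo = KNeighborsClassifier(n_neighbors=3).fit(X2_demo, y2_demo)
print(clf_demo.classes_, clf_demo.score(X2_demo, y2_demo))
```

In the notebook, the equivalent change would be to build X2 and y2 from df2.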

Summary

I imported the data and gathered some basic information about it: I drew a chart of the global sales of each genre and found the most frequent genre and the genres with the highest total and average sales. I tried to predict a game's genre from its sales in each region, but the prediction only has an accuracy of 0.42. I wondered whether the accuracy would improve if I included only the two most popular genres (Action and Shooter); however, the final fit still used the full dataset, so it reproduced the same accuracy of 0.42 and the two-genre question remains untested.

References

The extra tool I used is pandas.DataFrame.groupby: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html

I found the “PS4_Games Sales” dataset on Kaggle.

Created in Deepnote