Project Title
Author: Siyu Sun
Course Project, UC Irvine, Math 10, W22
Introduction
I import a dataset about sales of PS4 games around the world. I want to investigate basic information about the dataset, for example, which publisher is the most popular and which genre has the highest sales. I also ask whether it is possible to predict a game's genre from its sales around the world.
Main portion of the project
import numpy as np
import pandas as pd
import altair as alt
This is the dataset that I am investigating. Rows with missing values are dropped after loading.
df = pd.read_csv('PS4_GamesSales.csv',encoding='unicode_escape')
df.dropna(inplace=True)
df.head()
| | Game | Year | Genre | Publisher | North America | Europe | Japan | Rest of World | Global |
---|---|---|---|---|---|---|---|---|---|
0 | Grand Theft Auto V | 2014.0 | Action | Rockstar Games | 6.06 | 9.71 | 0.60 | 3.02 | 19.39 |
1 | Call of Duty: Black Ops 3 | 2015.0 | Shooter | Activision | 6.18 | 6.05 | 0.41 | 2.44 | 15.09 |
2 | Red Dead Redemption 2 | 2018.0 | Action-Adventure | Rockstar Games | 5.26 | 6.21 | 0.21 | 2.26 | 13.94 |
3 | Call of Duty: WWII | 2017.0 | Shooter | Activision | 4.67 | 6.21 | 0.40 | 2.12 | 13.40 |
4 | FIFA 18 | 2017.0 | Sports | EA Sports | 1.27 | 8.64 | 0.15 | 1.73 | 11.80 |
Some basic information about the dataset
df.shape
(825, 9)
print(f"The number of rows in this dataset is {df.shape[0]}")
The number of rows in this dataset is 825
df.dtypes
Game object
Year float64
Genre object
Publisher object
North America float64
Europe float64
Japan float64
Rest of World float64
Global float64
dtype: object
The chart shows the global sales of each genre. From the chart, we can tell that Action and Shooter have the highest global sales.
alt.Chart(df).mark_bar().encode(
x="Genre",
y="Global",
color=alt.Color("Genre", title="Genre type"),
).properties(
title="Global sales of each genre",
width=1000,
height=200,
)
How many games sold more in North America than in Europe? 345 games sold more in North America, while the remaining 480 sold at least as much in Europe (the comparison below counts ties as False).
(df.loc[:,"North America"] > df.loc[:,"Europe"]).value_counts()
False 480
True 345
dtype: int64
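The boolean comparison above puts ties in the same bucket as games that sold more in Europe. A minimal sketch of a three-way count follows; the `toy` frame and its numbers are made up for illustration and are not taken from the PS4 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from the PS4 dataset); the third row is a tie.
toy = pd.DataFrame({
    "North America": [3.0, 1.0, 0.5, 0.2],
    "Europe":        [2.0, 4.0, 0.5, 0.1],
})

diff = toy["North America"] - toy["Europe"]

# np.sign maps positive differences to 1, zero to 0, negative to -1.
labels = np.sign(diff).map({1.0: "North America", 0.0: "Tie", -1.0: "Europe"})
counts = labels.value_counts()
print(counts)
```

On the real data, the same three lines applied to `df` would show how many of the 480 "False" rows are actual ties.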
Find the most frequent genre in the dataset. The most frequent genre is “Action”.
df['Genre'].value_counts().idxmax()
'Action'
Total global sales of each genre. (The genre with the highest total global sales is “Action”.)
A = df.groupby('Genre')['Global'].sum()  # select the column first so only numeric data is summed
A
Genre
Action 136.82
Action-Adventure 61.86
Adventure 15.22
Fighting 19.36
MMO 3.52
Misc 12.47
Music 5.03
Party 0.65
Platform 17.85
Puzzle 0.52
Racing 25.29
Role-Playing 62.73
Shooter 134.99
Simulation 4.52
Sports 92.85
Strategy 1.28
Visual Novel 0.46
Name: Global, dtype: float64
Average global sales of each genre. (The genre with the highest average global sales is “Shooter”.)
B = df.groupby('Genre')['Global'].mean()  # select the column first; averaging the text columns fails in recent pandas
B
Genre
Action 0.667415
Action-Adventure 1.627895
Adventure 0.214366
Fighting 0.605000
MMO 0.440000
Misc 0.226727
Music 0.279444
Party 0.325000
Platform 0.540909
Puzzle 0.052000
Racing 0.526875
Role-Playing 0.586262
Shooter 1.799867
Simulation 0.215238
Sports 1.345652
Strategy 0.051200
Visual Novel 0.057500
Name: Global, dtype: float64
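The two tables above rank the genres differently: Action leads on total sales while Shooter leads on average, which is what one would expect if Action simply has many more titles. A minimal sketch of putting total, average and title count side by side in one pass is shown below; the `toy` frame is made-up illustrative data, and the column names `total`, `average` and `games` are my own choices.

```python
import pandas as pd

# Hypothetical toy data (not from the PS4 dataset).
toy = pd.DataFrame({
    "Genre":  ["Action", "Action", "Action", "Shooter", "Shooter"],
    "Global": [1.0, 2.0, 3.0, 4.0, 6.0],
})

# Named aggregation: each keyword becomes an output column.
summary = (
    toy.groupby("Genre")["Global"]
       .agg(total="sum", average="mean", games="count")
       .sort_values("total", ascending=False)
)
print(summary)
```

Replacing `toy` with `df` should give the same totals and averages as `A` and `B`, plus the number of games per genre.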
Predict genre from a game's sales in each region of the world
I use KNeighborsClassifier to predict genre, since this is a classification problem. I chose n_neighbors=3 because I found that the prediction accuracy was higher for small values of n_neighbors.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
clf = KNeighborsClassifier(n_neighbors=3)
X=df[['North America','Europe','Japan','Rest of World']].copy()
y=df['Genre']
clf.fit(X,y)
KNeighborsClassifier(n_neighbors=3)
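Accuracy measured on the same rows used for fitting tends to rise as n_neighbors shrinks (with n_neighbors=1 each point is its own nearest neighbor), so it is not a reliable guide for choosing the value. A minimal sketch of comparing several values with 5-fold cross-validation follows. It uses synthetic stand-in data from `make_classification`, not the PS4 dataset, and the candidate values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

cv_scores = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 held-out folds for this value of k.
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best k:", best_k)
```

With the real data, `X_demo` and `y_demo` would be replaced by `X` and `y`; genres with fewer than five games may trigger a warning from the stratified splitter.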
The “pred” column shows the genre predicted by KNeighborsClassifier.
X["pred"] = clf.predict(X)
X.head()
| | North America | Europe | Japan | Rest of World | pred |
---|---|---|---|---|---|
0 | 6.06 | 9.71 | 0.60 | 3.02 | Action |
1 | 6.18 | 6.05 | 0.41 | 2.44 | Shooter |
2 | 5.26 | 6.21 | 0.21 | 2.26 | Shooter |
3 | 4.67 | 6.21 | 0.40 | 2.12 | Shooter |
4 | 1.27 | 8.64 | 0.15 | 1.73 | Sports |
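Compute the accuracy of the prediction, that is, the fraction of games whose predicted genre matches the true genre. The result shows that it is not a good estimate.
np.count_nonzero(X['pred'] == df['Genre'])/len(df)  # len(df) avoids hard-coding the row count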
0.42424242424242425
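An accuracy figure is easier to judge against a baseline. One common baseline, which the original analysis does not use, is to always predict the most frequent class. A minimal sketch with scikit-learn's `DummyClassifier` follows; the labels are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels (not from the PS4 dataset): 5 Action, 3 Shooter, 2 Sports.
y_demo = np.array(["Action"] * 5 + ["Shooter"] * 3 + ["Sports"] * 2)
X_demo = np.zeros((len(y_demo), 1))  # this baseline ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Always predicting "Action" is right for 5 of the 10 rows.
baseline_accuracy = baseline.score(X_demo, y_demo)
print(baseline_accuracy)
```

Running the same baseline on `X` and `y` gives the number that 0.42 should be compared with.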
To check whether the model is over-fitting or under-fitting, I split the data into a training set and a test set and compare the accuracy on each.
del X["pred"]
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)
np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
0.41515151515151516
np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
0.46060606060606063
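The test score (0.46) is slightly higher than the train score (0.41), so there is no sign of over-fitting; with both scores this low, the model looks under-fit. Note, however, that clf was fit on the full dataset before the split, so both subsets were seen during fitting and the gap between the two scores mostly reflects the random split.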
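In the cells above, clf was fit on all rows before `train_test_split` was called and was not refit afterwards. A more conventional check fits on the training rows only, so that the test rows are genuinely unseen. A minimal sketch follows, using synthetic stand-in data rather than the PS4 dataset; `random_state` and `stratify` are my additions for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)  # fit on the training split only

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"train: {train_score:.3f}, test: {test_score:.3f}")
```

A train score far above the test score would point to over-fitting; two similarly low scores would point to under-fitting.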
A new dataframe df2 which contains only the genres Action and Shooter
df2 = df[df["Genre"].isin(["Action","Shooter"])]
df2.head()
| | Game | Year | Genre | Publisher | North America | Europe | Japan | Rest of World | Global |
---|---|---|---|---|---|---|---|---|---|
0 | Grand Theft Auto V | 2014.0 | Action | Rockstar Games | 6.06 | 9.71 | 0.60 | 3.02 | 19.39 |
1 | Call of Duty: Black Ops 3 | 2015.0 | Shooter | Activision | 6.18 | 6.05 | 0.41 | 2.44 | 15.09 |
3 | Call of Duty: WWII | 2017.0 | Shooter | Activision | 4.67 | 6.21 | 0.40 | 2.12 | 13.40 |
6 | Uncharted (PS4) | 2016.0 | Action | Sony Interactive Entertainment | 4.49 | 3.93 | 0.21 | 1.70 | 10.33 |
8 | Call of Duty: Infinite Warfare | 2016.0 | Shooter | Activision | 3.11 | 3.83 | 0.19 | 1.36 | 8.48 |
c = alt.Chart(df2).mark_circle().encode(
x="North America",
y="Global",
color=alt.Color("Genre", title="Genre type"),
).properties(
title="North America vs. global sales of Action and Shooter games",
width=1000,
height=200,
)
c
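Will the prediction become more accurate if I include only two genres in the dataset? The score below is identical to the one for the original dataset, 0.42. Note that the code below builds X2 and y2 from df rather than df2, so it refits on all genres; the identical score is therefore expected and does not yet answer the question.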
X2=df[['North America','Europe','Japan','Rest of World']].copy()
y2=df['Genre']
clf.fit(X2,y2)
clf.score(X2,y2)
0.42424242424242425
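The cell above selects its columns from df, not from the filtered frame. A minimal sketch of the two-genre experiment follows, run on a made-up `toy` frame so that it is self-contained; the numbers are illustrative only and say nothing about the real result.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

regions = ["North America", "Europe", "Japan", "Rest of World"]

# Hypothetical toy data (not from the PS4 dataset).
toy = pd.DataFrame({
    "Genre": ["Action", "Shooter", "Sports", "Action",
              "Shooter", "Sports", "Action", "Shooter"],
    "North America": [2.0, 3.1, 0.9, 1.8, 2.9, 1.1, 2.2, 3.3],
    "Europe":        [2.5, 2.0, 3.0, 2.4, 2.1, 3.2, 2.6, 1.9],
    "Japan":         [0.3, 0.1, 0.1, 0.4, 0.1, 0.0, 0.3, 0.2],
    "Rest of World": [0.8, 0.9, 0.6, 0.7, 1.0, 0.7, 0.8, 1.0],
})

# Filter first, then take features and labels from the filtered frame.
toy2 = toy[toy["Genre"].isin(["Action", "Shooter"])]
X2 = toy2[regions]
y2 = toy2["Genre"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X2, y2)
print(list(model.classes_), model.score(X2, y2))
```

On the real data this corresponds to building X2 and y2 from df2. The resulting score would still be a training score, so a held-out split would be needed for a fair comparison.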
Summary
I imported the data and explored some basic information about it: I drew a chart of global sales by genre and found the best-selling genre. I then tried to predict genre from a game's sales in each region, but the prediction only reached an accuracy of 0.42. I wondered whether the accuracy would improve if I kept only the two best-selling genres (Action and Shooter); the score I obtained was the same as for the original dataset, but the final fit used df rather than df2, so this question remains open.
References
The extra tool I used is pandas.DataFrame.groupby: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html
I found the “PS4_Games Sales” dataset on Kaggle.