Chess Games Analysis
Chess Games Analysis¶
Author: Josh Tseng
Course Project, UC Irvine, Math 10, S22
Introduction¶
Chess is a game of pattern recognition, which is also what machine learning is all about. In this project, we use about 20,000 games (20,058 rows before cleaning) from Lichess, one of the most popular chess websites, to analyze the best winning strategies in a game of chess. The dataset comes from Kaggle, where someone used the Lichess API to create it: Chess Dataset.
Data Cleaning and Feature Engineering¶
import pandas as pd
import altair as alt
pd.set_option('mode.chained_assignment', None)
df = pd.read_csv('games.csv')
df
| | id | rated | created_at | last_move_at | turns | victory_status | winner | increment_code | white_id | white_rating | black_id | black_rating | moves | opening_eco | opening_name | opening_ply |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TZJHLljE | False | 1.504210e+12 | 1.504210e+12 | 13 | outoftime | white | 15+2 | bourgris | 1500 | a-00 | 1191 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | D10 | Slav Defense: Exchange Variation | 5 |
1 | l1NXvwaE | True | 1.504130e+12 | 1.504130e+12 | 16 | resign | black | 5+10 | a-00 | 1322 | skinnerua | 1261 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | B00 | Nimzowitsch Defense: Kennedy Variation | 4 |
2 | mIICvQHh | True | 1.504130e+12 | 1.504130e+12 | 61 | mate | white | 5+10 | ischia | 1496 | a-00 | 1500 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | C20 | King's Pawn Game: Leonardis Variation | 3 |
3 | kWKvrqYL | True | 1.504110e+12 | 1.504110e+12 | 61 | mate | white | 20+0 | daniamurashov | 1439 | adivanov2009 | 1454 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | D02 | Queen's Pawn Game: Zukertort Variation | 3 |
4 | 9tXo1AUZ | True | 1.504030e+12 | 1.504030e+12 | 95 | mate | white | 30+3 | nik221107 | 1523 | adivanov2009 | 1469 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | C41 | Philidor Defense | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20053 | EfqH7VVH | True | 1.499791e+12 | 1.499791e+12 | 24 | resign | white | 10+10 | belcolt | 1691 | jamboger | 1220 | d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5... | A80 | Dutch Defense | 2 |
20054 | WSJDhbPl | True | 1.499698e+12 | 1.499699e+12 | 82 | mate | black | 10+0 | jamboger | 1233 | farrukhasomiddinov | 1196 | d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd... | A41 | Queen's Pawn | 2 |
20055 | yrAas0Kj | True | 1.499698e+12 | 1.499698e+12 | 35 | mate | white | 10+0 | jamboger | 1219 | schaaksmurf3 | 1286 | d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd... | D00 | Queen's Pawn Game: Mason Attack | 3 |
20056 | b0v4tRyF | True | 1.499696e+12 | 1.499697e+12 | 109 | resign | white | 10+0 | marcodisogno | 1360 | jamboger | 1227 | e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb... | B07 | Pirc Defense | 4 |
20057 | N8G2JHGG | True | 1.499643e+12 | 1.499644e+12 | 78 | mate | black | 10+0 | jamboger | 1235 | ffbob | 1339 | d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3... | D00 | Queen's Pawn Game: Mason Attack | 3 |
20058 rows × 16 columns
# Check for null values
df.isna().any().any()
False
There are no null values in the dataset. However, I do want to drop the games that have fewer than 6 turns, since such short games carry little meaning.
df = df[df['turns']>5].copy()
Converting “created_at” and “last_move_at” to Pandas Datetime¶
The original times are stored as Unix timestamps (in milliseconds), which are not human-readable, so I converted them into pandas datetimes.
df['created_at'] = pd.to_datetime(df['created_at'], unit='ms')
df['last_move_at'] = pd.to_datetime(df['last_move_at'], unit='ms')
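To make the `unit='ms'` argument concrete, here is a minimal standard-library sketch of the same conversion for a single value. The sample timestamp is the rounded `created_at` value shown for game 0 in the table above; `ms_to_datetime` is a hypothetical helper name used only for illustration, not part of pandas.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    # Unix timestamps count from 1970-01-01 UTC; divide by 1000 to get seconds.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# 1.504210e+12 ms is the (rounded) created_at value displayed for game 0.
stamp = ms_to_datetime(1.504210e12)
print(stamp.isoformat())
```

The printed date and time should agree with the converted `created_at` value that appears for game 0 once the column has been passed through `pd.to_datetime`.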
Creating a New Column called Format¶
Here I want to create some extra columns that will make the machine learning process easier and more accurate.
The ‘increment_code’ column is not very useful on its own. Online chess games are generally separated into three formats by starting time: less than 3 minutes is called bullet, 3 to 14 minutes is called blitz, and 15 minutes or more is called rapid. The number before the plus sign in the ‘increment_code’ column gives the starting minutes, so I create a new column called ‘format’ with three values: ‘bullet’, ‘blitz’, and ‘rapid’.
df['increment_code1'] = df['increment_code'].str.split('+').str[0].astype(int)
df['format'] = 'bullet'
df.loc[(df['increment_code1'] >= 3),'format'] = 'blitz'
df.loc[df['increment_code1'] >= 15,'format'] = 'rapid'
df.drop('increment_code1', inplace=True, axis=1)
(df['format']).value_counts()
blitz 13624
rapid 5937
bullet 115
Name: format, dtype: int64
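The same thresholds can be written as a small plain-Python function, which makes the boundary cases easy to check. This is only a sketch of the rule described above; `classify_format` is a hypothetical name, and like the notebook code it ignores the increment after the plus sign.

```python
def classify_format(increment_code):
    """Map a 'minutes+increment' code to 'bullet', 'blitz', or 'rapid'."""
    # Only the starting minutes (the part before '+') are used.
    minutes = int(increment_code.split('+')[0])
    if minutes >= 15:
        return 'rapid'
    if minutes >= 3:
        return 'blitz'
    return 'bullet'

for code in ['15+2', '5+10', '2+1']:
    print(code, classify_format(code))
```

A function like this could be applied to the column with `df['increment_code'].map(classify_format)` as an alternative to the `df.loc` assignments.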
Creating New Columns for Machine Learning¶
Here I want to add 2 new numerical columns, which will help me perform linear regression. The two columns are ‘black_win’ and ‘white_win’. Each column has 3 possible values: 0 for a loss, 0.5 for a draw, and 1 for a win.
df['black_win'] = 0.0
df.loc[df['winner'] == 'black', 'black_win'] = 1
df.loc[df['winner'] == 'draw', 'black_win'] = 0.5
df['white_win'] = 1 - df['black_win']
Creating New Data Frames for the Chess Library¶
I want to create a new data frame that holds the opening name and the first 6 moves of each game.
moves = ['move ' + str(i) for i in range(1,7)]
df2 = df['opening_name']
df2 = pd.DataFrame(df2)
df2[moves] = 0
for i in df.index:
    temp = df.loc[i, 'moves'].split()
    for j in range(1,7):
        df2.loc[i, 'move ' + str(j)] = temp[j-1]
I store the moves in a separate data frame because adding these columns to df makes it too large to plot some Altair charts. df2 is used with the chess library.
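As a small illustration of what the loop extracts, the sketch below splits one move string and keeps its first six entries. Each entry in the `moves` string is a single move by one side, so six entries cover three moves by white and three by black. `first_moves` is a hypothetical helper name, and the sample string is the visible part of the `moves` value for game 0.

```python
def first_moves(moves, n=6):
    """Return the first n entries of a space-separated move string."""
    return moves.split()[:n]

# Visible prefix of the moves string for game 0 in the table above.
sample = 'd4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5'
print(first_moves(sample))
```

Because games with fewer than 6 turns were dropped earlier, every remaining game has at least six entries to take.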
Creating a New Data Frame for Machine Learning Part 2¶
This new data frame, called df3, is for the second part of the machine learning, which uses the chess library for feature engineering.
df3 = pd.DataFrame(df['moves'])
df3['white_win'] = False
df3.loc[df['winner'] == 'white', 'white_win'] = True
Basic Statistics¶
In this section, I will use Altair charts to analyze some basic statistics of chess.
for i in ['white', 'black']:
    print(f'The winning percentage for {i} is ' + str(df['winner'].value_counts()[i]/len(df)))
print("And the drawing percentage is " + str(df['winner'].value_counts()['draw']/len(df)))
The winning percentage for white is 0.49801788981500306
The winning percentage for black is 0.45461475909737753
And the drawing percentage is 0.047367351087619435
# Remove Altair's default limit on the number of data points it can plot
alt.data_transformers.enable('default', max_rows=None)
alt.Chart(df[['winner']]).mark_bar().encode(
x = 'winner:O',
y = 'count(winner)',
color = 'winner:N'
)
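From the two results above, we can see that white has a clear advantage over black across the whole dataset, regardless of format or moves: white wins about 49.8% of the games, compared with about 45.5% for black.
Now let’s see whether white’s advantage holds in both rated and unrated games.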
alt.Chart(df[['winner', 'rated']]).mark_bar().encode(
x = 'winner:O',
y = 'count(winner)',
color = 'winner:N',
column = 'rated'
)
As we can see, white still has a noticeable advantage over black in both rated and unrated games.
I am interested in how ‘rated’ correlates with ‘victory_status’. The ‘victory_status’ column has 4 different values: resign, mate, outoftime, and draw.
df['victory_status'].value_counts()
resign 10872
mate 6307
outoftime 1609
draw 888
Name: victory_status, dtype: int64
alt.Chart(df[['victory_status', 'winner', 'rated']]).mark_bar().encode(
x = 'victory_status',
y = 'count(victory_status)',
color = 'winner',
column = 'rated',
tooltip = ['winner','count(victory_status)']
)
From the plot, it is hard to see the differences. Let’s calculate the proportions numerically.
not_rated = df.loc[~df['rated'], 'victory_status'].value_counts(normalize=True)
rated = df.loc[df['rated'], 'victory_status'].value_counts(normalize=True)
not_rated # share of each victory status among non-rated games
victory_status
resign 0.554764
mate 0.310108
outoftime 0.078121
draw 0.057007
Name: victory_status, dtype: float64
rated # share of each victory status among rated games
victory_status
resign 0.552024
mate 0.323031
outoftime 0.082646
draw 0.042299
Name: victory_status, dtype: float64
As we can see, there is not much difference between rated and unrated games. I expected people to resign more easily in unrated games, but that does not seem to be the case.
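The per-group shares compared above come from counting each victory status within a group and dividing by the group size. Here is a minimal standard-library sketch of that calculation; `status_shares` is a hypothetical helper, and the toy rows are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict

def status_shares(rows):
    """Return {rated: {victory_status: share}} for (rated, victory_status) pairs."""
    counts = defaultdict(Counter)
    for rated, status in rows:
        counts[rated][status] += 1
    # Divide each count by the size of its own group so shares sum to 1 per group.
    return {
        rated: {status: n / sum(c.values()) for status, n in c.items()}
        for rated, c in counts.items()
    }

# Toy rows for illustration only; these are not real games.
toy = [(True, 'resign'), (True, 'mate'), (True, 'resign'),
       (False, 'mate'), (False, 'draw')]
shares = status_shares(toy)
print(shares)
```

In pandas, `df.groupby('rated')['victory_status'].value_counts(normalize=True)` should produce the same kind of per-group shares in a single call.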
Machine Learning Part 1¶
After the basic statistics, let’s get into linear regression.
df
| | id | rated | created_at | last_move_at | turns | victory_status | winner | increment_code | white_id | white_rating | black_id | black_rating | moves | opening_eco | opening_name | opening_ply | format | black_win | white_win |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TZJHLljE | False | 2017-08-31 20:06:40.000000000 | 2017-08-31 20:06:40.000000000 | 13 | outoftime | white | 15+2 | bourgris | 1500 | a-00 | 1191 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | D10 | Slav Defense: Exchange Variation | 5 | rapid | 0.0 | 1.0 |
1 | l1NXvwaE | True | 2017-08-30 21:53:20.000000000 | 2017-08-30 21:53:20.000000000 | 16 | resign | black | 5+10 | a-00 | 1322 | skinnerua | 1261 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | B00 | Nimzowitsch Defense: Kennedy Variation | 4 | blitz | 1.0 | 0.0 |
2 | mIICvQHh | True | 2017-08-30 21:53:20.000000000 | 2017-08-30 21:53:20.000000000 | 61 | mate | white | 5+10 | ischia | 1496 | a-00 | 1500 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | C20 | King's Pawn Game: Leonardis Variation | 3 | blitz | 0.0 | 1.0 |
3 | kWKvrqYL | True | 2017-08-30 16:20:00.000000000 | 2017-08-30 16:20:00.000000000 | 61 | mate | white | 20+0 | daniamurashov | 1439 | adivanov2009 | 1454 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | D02 | Queen's Pawn Game: Zukertort Variation | 3 | rapid | 0.0 | 1.0 |
4 | 9tXo1AUZ | True | 2017-08-29 18:06:40.000000000 | 2017-08-29 18:06:40.000000000 | 95 | mate | white | 30+3 | nik221107 | 1523 | adivanov2009 | 1469 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | C41 | Philidor Defense | 5 | rapid | 0.0 | 1.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20053 | EfqH7VVH | True | 2017-07-11 16:35:14.342000128 | 2017-07-11 16:40:36.076000000 | 24 | resign | white | 10+10 | belcolt | 1691 | jamboger | 1220 | d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5... | A80 | Dutch Defense | 2 | blitz | 0.0 | 1.0 |
20054 | WSJDhbPl | True | 2017-07-10 14:48:09.760000000 | 2017-07-10 15:00:33.979000064 | 82 | mate | black | 10+0 | jamboger | 1233 | farrukhasomiddinov | 1196 | d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd... | A41 | Queen's Pawn | 2 | blitz | 1.0 | 0.0 |
20055 | yrAas0Kj | True | 2017-07-10 14:44:37.492999936 | 2017-07-10 14:47:30.327000064 | 35 | mate | white | 10+0 | jamboger | 1219 | schaaksmurf3 | 1286 | d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd... | D00 | Queen's Pawn Game: Mason Attack | 3 | blitz | 0.0 | 1.0 |
20056 | b0v4tRyF | True | 2017-07-10 14:15:27.019000064 | 2017-07-10 14:31:13.718000128 | 109 | resign | white | 10+0 | marcodisogno | 1360 | jamboger | 1227 | e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb... | B07 | Pirc Defense | 4 | blitz | 0.0 | 1.0 |
20057 | N8G2JHGG | True | 2017-07-09 23:32:32.648999936 | 2017-07-09 23:44:49.348000000 | 78 | mate | black | 10+0 | jamboger | 1235 | ffbob | 1339 | d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3... | D00 | Queen's Pawn Game: Mason Attack | 3 | blitz | 1.0 | 0.0 |
19676 rows × 19 columns
Linear Regression¶
This part uses ‘turns’, ‘opening_ply’, and either ‘black_win’ or ‘white_win’ to predict the rating of each player.
First, let’s split the data into a training set and a test set for each of the two models.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Let’s start by predicting black’s rating.
cols1 = ['turns', 'opening_ply', 'black_win']
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df['black_rating'], test_size=0.2, random_state=0)
reg_black = LinearRegression()
reg_black.fit(X_train, y_train)
LinearRegression()
Let’s use both mean squared error and mean absolute error to measure the accuracy of the predictions and to check whether the model is overfitting.
from sklearn.metrics import mean_squared_error, mean_absolute_error
mean_squared_error(reg_black.predict(X_train), y_train)
74518.74574075406
mean_squared_error(reg_black.predict(X_test), y_test)
73644.84351354155
mean_absolute_error(reg_black.predict(X_train), y_train)
217.59747560753172
mean_absolute_error(reg_black.predict(X_test), y_test)
215.41370670550344
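Mean squared error is measured in squared rating points, which makes it hard to compare directly with mean absolute error. Taking the square root (the RMSE) puts both on the same scale. The short sketch below does this for the test-set numbers printed above; it only reuses those outputs and does not refit anything.

```python
import math

# Test-set errors for the black-rating model, copied from the outputs above.
mse_test = 73644.84351354155
mae_test = 215.41370670550344

# The square root of the MSE is in rating points, like the MAE.
rmse_test = math.sqrt(mse_test)
print(f'RMSE: {rmse_test:.1f} rating points, MAE: {mae_test:.1f} rating points')
```

An RMSE noticeably larger than the MAE suggests that some predictions miss by far more than the typical error, since squaring weights large misses more heavily.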
Let’s do the same thing for white and see if there is a difference.
cols2 = ['turns', 'opening_ply', 'white_win']
X_train, X_test, y_train, y_test = train_test_split(df[cols2], df['white_rating'], test_size=0.2, random_state=0)
reg_white = LinearRegression()
reg_white.fit(X_train, y_train)
LinearRegression()
mean_squared_error(reg_white.predict(X_train), y_train)
74223.69960957882
mean_squared_error(reg_white.predict(X_test), y_test)
74970.70006569766
mean_absolute_error(reg_white.predict(X_train), y_train)
217.1967432196025
mean_absolute_error(reg_white.predict(X_test), y_test)
217.96851240583175
Before the conclusion, let’s make an Altair chart that plots the predicted values.
# First, predict on the whole dataset
df['black_pred'] = reg_black.predict(df[cols1])
df['white_pred'] = reg_white.predict(df[cols2])
df
| | id | rated | created_at | last_move_at | turns | victory_status | winner | increment_code | white_id | white_rating | ... | black_rating | moves | opening_eco | opening_name | opening_ply | format | black_win | white_win | black_pred | white_pred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TZJHLljE | False | 2017-08-31 20:06:40.000000000 | 2017-08-31 20:06:40.000000000 | 13 | outoftime | white | 15+2 | bourgris | 1500 | ... | 1191 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | D10 | Slav Defense: Exchange Variation | 5 | rapid | 0.0 | 1.0 | 1492.716943 | 1593.759787 |
1 | l1NXvwaE | True | 2017-08-30 21:53:20.000000000 | 2017-08-30 21:53:20.000000000 | 16 | resign | black | 5+10 | a-00 | 1322 | ... | 1261 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | B00 | Nimzowitsch Defense: Kennedy Variation | 4 | blitz | 1.0 | 0.0 | 1570.218338 | 1488.518709 |
2 | mIICvQHh | True | 2017-08-30 21:53:20.000000000 | 2017-08-30 21:53:20.000000000 | 61 | mate | white | 5+10 | ischia | 1496 | ... | 1500 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | C20 | King's Pawn Game: Leonardis Variation | 3 | blitz | 0.0 | 1.0 | 1494.786898 | 1582.650804 |
3 | kWKvrqYL | True | 2017-08-30 16:20:00.000000000 | 2017-08-30 16:20:00.000000000 | 61 | mate | white | 20+0 | daniamurashov | 1439 | ... | 1454 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | D02 | Queen's Pawn Game: Zukertort Variation | 3 | rapid | 0.0 | 1.0 | 1494.786898 | 1582.650804 |
4 | 9tXo1AUZ | True | 2017-08-29 18:06:40.000000000 | 2017-08-29 18:06:40.000000000 | 95 | mate | white | 30+3 | nik221107 | 1523 | ... | 1469 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | C41 | Philidor Defense | 5 | rapid | 0.0 | 1.0 | 1585.167206 | 1672.563933 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20053 | EfqH7VVH | True | 2017-07-11 16:35:14.342000128 | 2017-07-11 16:40:36.076000000 | 24 | resign | white | 10+10 | belcolt | 1691 | ... | 1220 | d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5... | A80 | Dutch Defense | 2 | blitz | 0.0 | 1.0 | 1427.047899 | 1518.473716 |
20054 | WSJDhbPl | True | 2017-07-10 14:48:09.760000000 | 2017-07-10 15:00:33.979000064 | 82 | mate | black | 10+0 | jamboger | 1233 | ... | 1196 | d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd... | A41 | Queen's Pawn | 2 | blitz | 1.0 | 0.0 | 1592.582253 | 1494.708198 |
20055 | yrAas0Kj | True | 2017-07-10 14:44:37.492999936 | 2017-07-10 14:47:30.327000064 | 35 | mate | white | 10+0 | jamboger | 1219 | ... | 1286 | d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd... | D00 | Queen's Pawn Game: Mason Attack | 3 | blitz | 0.0 | 1.0 | 1465.473400 | 1557.664123 |
20056 | b0v4tRyF | True | 2017-07-10 14:15:27.019000064 | 2017-07-10 14:31:13.718000128 | 109 | resign | white | 10+0 | marcodisogno | 1360 | ... | 1227 | e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb... | B07 | Pirc Defense | 4 | blitz | 0.0 | 1.0 | 1574.927761 | 1657.399180 |
20057 | N8G2JHGG | True | 2017-07-09 23:32:32.648999936 | 2017-07-09 23:44:49.348000000 | 78 | mate | black | 10+0 | jamboger | 1235 | ... | 1339 | d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3... | D00 | Queen's Pawn Game: Mason Attack | 3 | blitz | 1.0 | 0.0 | 1614.096120 | 1519.483213 |
19676 rows × 21 columns
use_cols = ['white_id','white_rating', 'white_pred','created_at','black_rating','format', 'black_pred']
sel = alt.selection_single(fields=['white_id'])
sel1 = alt.selection_single(fields=['white_id','white_rating', 'white_pred','created_at'])
a = alt.Chart(df[use_cols]).mark_circle().encode(
x = 'black_rating',
y = 'white_rating',
color = 'format',
tooltip = ['white_id', 'white_rating', 'white_pred']
).properties(
title='Real'
)
b = alt.Chart(df[use_cols]).mark_circle().encode(
x = 'black_pred',
y = 'white_pred',
color = 'format',
tooltip = ['white_id', 'white_rating', 'white_pred', 'created_at' ]
).properties(
title='Predict'
)
a.add_selection(sel)&b.transform_filter(sel).add_selection(sel1)
Click a few points in the first chart until multiple points appear in the second chart. You will see that the model predicts points in the middle of the cluster much better than the ones on the edge.
Conclusion for Using ‘opening_ply’ and ‘turns’ to Predict Ratings¶
Judging by the mean absolute error alone, before plotting anything, the model does not appear to be overfitting, since the training and test errors are nearly identical. However, an average error of about 217 rating points is large, and when we plot the predictions next to the actual ratings, the shape alone shows that the predictions are very poor. The predictions look like straight lines, because I used linear regression, and all of the values are concentrated between about 1300 and 2200. This suggests that the number of opening moves and the total number of turns are poor predictors of rating.
Chess Library¶
!pip install chess==1.9.1
import chess
board = chess.Board()
Let’s see what the board looks like after the first 6 moves of one of the games, along with its opening name.
board = chess.Board()
for i in range(1,7):
    board.push_san(df2.iloc[0,i])
print(df2.iloc[0,0])
board
Slav Defense: Exchange Variation
We could only get the first six moves because that is all that is stored in df2. If we want the final position of a game, we can do something like this.
board = chess.Board()
all_move = df.loc[0,'moves'].split()
for i in all_move:
    board.push_san(i)
board
We can check whether the position is a stalemate or a checkmate by using is_stalemate() or is_checkmate().
board.is_stalemate()
False
board.is_checkmate()
False
Here I am going to define a function that takes a board as input and outputs each side’s total piece value: a pawn is 1, a knight and a bishop are both 3, a rook is 5, and a queen is 9. These are the conventional values of the chess pieces. Some may argue that a bishop is worth 3.5, but here we will just use 3. The approach I am going to take is to check every single square and use piece_type_at() and color_at() to find out which piece, if any, is on it.
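def board_value(board):
    # Conventional material values by piece type; the king is not counted.
    piece_values = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                    chess.ROOK: 5, chess.QUEEN: 9}
    white = 0
    black = 0
    for square in chess.SQUARES:
        # Empty squares have piece type None and contribute 0.
        value = piece_values.get(board.piece_type_at(square), 0)
        if board.color_at(square) == chess.WHITE:
            white += value
        else:
            black += value
    return white, black
We will use this function to help us manipulate some raw data. Be careful to write += here: writing =+ instead would overwrite the running total with the value of the most recent piece rather than add to it. The tables and scores shown in the next part were generated with the =+ form, so rerunning the notebook with the function above will produce different numbers.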
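One way to sanity-check a material counter such as board_value is to compare it against a position whose total is known. In the starting position, each side has 8 pawns, 2 knights, 2 bishops, 2 rooks, and a queen, for a total of 39. The sketch below counts material directly from the piece-placement field of a FEN string using only the standard library; `material_from_fen` is a hypothetical helper, and python-chess can produce this field with `board.board_fen()`.

```python
PIECE_VALUES = {'p': 1, 'n': 3, 'b': 3, 'r': 5, 'q': 9}

def material_from_fen(board_fen):
    """Sum material for (white, black) from the piece-placement field of a FEN."""
    white = 0
    black = 0
    for ch in board_fen:
        # Digits (empty squares), '/' separators, and kings all count as 0.
        value = PIECE_VALUES.get(ch.lower(), 0)
        if ch.isupper():
            white += value   # uppercase letters are white pieces
        else:
            black += value   # lowercase letters are black pieces
    return white, black

START = 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR'
print(material_from_fen(START))
```

If a board-based counter and this FEN-based counter disagree on the same position, one of them is miscounting.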
Machine Learning Part 2¶
For this part of the machine learning, I will use the final board’s value to predict the winner of the game with logistic regression. But first, I need to turn the moves of each game into the final board position and use board_value to calculate its value; then the machine learning part can start.
def final_board(moves):
    board = chess.Board()
    all_move = moves.split()
    for i in all_move:
        board.push_san(i)
    return board
df3['board'] = final_board(df3.iloc[0,0])
I found something I did not expect: a pandas DataFrame can actually store boards. Thinking about it, this makes sense, since a board is an object and a DataFrame can store all sorts of objects.
So let’s convert all the moves in df3 into boards using final_board.
df3['board'] = df3['moves'].map(final_board)
Instead of storing white’s and black’s values separately, I will store the difference: positive means white has more material, and negative means black has more.
df3['values'] = df3['board'].map(board_value)
df3['values'] = df3['values'].map(lambda x: x[0]- x[1])
df3
| | moves | white_win | board | values |
---|---|---|---|---|
0 | d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5... | True | r n b q k . n r\np p . . . . p p\n. . p . p . ... | -2 |
1 | d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6... | False | r . b q k . n r\np p p p . . p p\n. . . . . . ... | -4 |
2 | e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc... | True | . . . . . . . .\n. . . . . P . .\n. . P . . . ... | 0 |
3 | d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O... | True | . . . . . . . r\nR . . . . . Q p\n. . p . . . ... | 4 |
4 | e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N... | True | . . . . . . Q .\n. . . . . . . R\n. . . . . . ... | 8 |
... | ... | ... | ... | ... |
20053 | d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5... | True | r n . q . k . r\np b p . b . . .\n. p . . p n ... | -2 |
20054 | d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd... | False | . Q . . . . . .\n. . . . . N k .\n. p . . . . ... | 8 |
20055 | d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd... | True | r . b . q r . .\np . p n b k . Q\n. p . . p p ... | 4 |
20056 | e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb... | True | . . R . . . . .\n. Q . . . . . .\n. . . . . . ... | 4 |
20057 | d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3... | False | . . . K q . . .\n. . P . . k . .\n. . . . p . ... | -8 |
19676 rows × 4 columns
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(df3[['values']], df3['white_win'], test_size=0.2, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
Let’s check the score on both train and test set.
clf.score(X_train, y_train)
0.6550190597204575
clf.score(X_test, y_test)
0.6557418699186992
With an accuracy of about 65.5%, we get roughly 34 wrong predictions for every 100 predictions we make. So the prediction is neither great nor awful, but it is definitely not overfitting: the training score and the test score are very similar.
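To put the test score in context, it helps to convert the accuracy into an error rate and to compare it with a trivial baseline that always predicts the more common label. The sketch below reuses numbers printed earlier in this notebook. One assumption to note: the white win rate was computed on the whole dataset rather than on the test split, so the baseline is only approximate.

```python
# Fractions copied from outputs earlier in this notebook.
white_win_rate = 0.49801788981500306   # share of all games won by white
test_accuracy = 0.6557418699186992     # clf.score on the test set

# Baseline: always predict the more common label (here, "white does not win").
baseline_accuracy = max(white_win_rate, 1 - white_win_rate)
errors_per_100 = 100 * (1 - test_accuracy)

print(f'baseline accuracy: {baseline_accuracy:.3f}')
print(f'model errors per 100 predictions: {errors_per_100:.1f}')
```

The gap between the model's accuracy and the baseline is a rough measure of how much the single ‘values’ feature contributes.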
Summary¶
First, we looked at some basic statistics of chess, such as which side wins more often. We also found that the number of opening moves and the total number of turns are poor predictors of rating. Finally, we used the chess library to convert the raw data into something I was interested in, and then used logistic regression to predict whether white wins from the final board’s total value. After this project, I understand that the most difficult part of machine learning is not fitting or predicting, since we can just use a library like sklearn; the most challenging part is turning raw data into something that can be used in the machine learning process.
References¶
What is the source of your dataset(s)?
From Kaggle: Chess Dataset
Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.
Converting Unix timestamps to datetime: Stack Overflow
Interactive graph: course notes
Chess library documentation
List other references that you found helpful.
Altair graph documentation
Created in Deepnote