Chess Games Analysis¶

Author: Josh Tseng

Course Project, UC Irvine, Math 10, S22

Introduction¶

Chess is a game of pattern recognition, which is also what machine learning is all about. In this project, we use about 20,000 games from Lichess, one of the most popular chess websites, to analyze what contributes to winning a game of chess. The dataset comes from Kaggle, where it was created with the Lichess API: Chess Dataset

Data Cleaning and Feature Engineering¶

import pandas as pd
import altair as alt
pd.set_option('mode.chained_assignment', None)

df = pd.read_csv('games.csv')

df

	id	rated	created_at	last_move_at	turns	victory_status	winner	increment_code	white_id	white_rating	black_id	black_rating	moves	opening_eco	opening_name	opening_ply
0	TZJHLljE	False	1.504210e+12	1.504210e+12	13	outoftime	white	15+2	bourgris	1500	a-00	1191	d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5...	D10	Slav Defense: Exchange Variation	5
1	l1NXvwaE	True	1.504130e+12	1.504130e+12	16	resign	black	5+10	a-00	1322	skinnerua	1261	d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6...	B00	Nimzowitsch Defense: Kennedy Variation	4
2	mIICvQHh	True	1.504130e+12	1.504130e+12	61	mate	white	5+10	ischia	1496	a-00	1500	e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc...	C20	King's Pawn Game: Leonardis Variation	3
3	kWKvrqYL	True	1.504110e+12	1.504110e+12	61	mate	white	20+0	daniamurashov	1439	adivanov2009	1454	d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O...	D02	Queen's Pawn Game: Zukertort Variation	3
4	9tXo1AUZ	True	1.504030e+12	1.504030e+12	95	mate	white	30+3	nik221107	1523	adivanov2009	1469	e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N...	C41	Philidor Defense	5
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20053	EfqH7VVH	True	1.499791e+12	1.499791e+12	24	resign	white	10+10	belcolt	1691	jamboger	1220	d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5...	A80	Dutch Defense	2
20054	WSJDhbPl	True	1.499698e+12	1.499699e+12	82	mate	black	10+0	jamboger	1233	farrukhasomiddinov	1196	d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd...	A41	Queen's Pawn	2
20055	yrAas0Kj	True	1.499698e+12	1.499698e+12	35	mate	white	10+0	jamboger	1219	schaaksmurf3	1286	d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd...	D00	Queen's Pawn Game: Mason Attack	3
20056	b0v4tRyF	True	1.499696e+12	1.499697e+12	109	resign	white	10+0	marcodisogno	1360	jamboger	1227	e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb...	B07	Pirc Defense	4
20057	N8G2JHGG	True	1.499643e+12	1.499644e+12	78	mate	black	10+0	jamboger	1235	ffbob	1339	d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3...	D00	Queen's Pawn Game: Mason Attack	3

20058 rows × 16 columns

# check for null values
df.isna().any().any()

False

There are no null values in the dataset. However, I do want to drop the games that have fewer than 6 turns, since such short games carry little meaning.

df = df[df['turns']>5].copy()

Converting “created_at” and “last_move_at” to Pandas Datetime¶

The original times are stored as Unix timestamps in milliseconds, which are not human-readable, so I converted them into pandas datetimes.

df['created_at'] = pd.to_datetime(df['created_at'], unit='ms')

df['last_move_at'] = pd.to_datetime(df['last_move_at'], unit='ms')
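As a quick sanity check on the unit, the same conversion can be sketched with the standard library alone. This is only an illustration: `ms_to_datetime` is a hypothetical helper, and the input is the rounded `created_at` value displayed for the first game in the table above.

```python
from datetime import datetime, timezone

def ms_to_datetime(ms):
    """Convert a Unix timestamp in milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# rounded created_at value shown for game 0
first_game = ms_to_datetime(1.504210e12)
print(first_game.strftime('%Y-%m-%d %H:%M:%S'))  # 2017-08-31 20:06:40
```

If the unit were seconds instead of milliseconds, the resulting date would land tens of thousands of years in the future, so a plausible 2017 date is a good sign that `unit='ms'` is the right choice.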

Creating a New Column called Format¶

Here I want to create some extra columns that will make the machine learning process easier and more accurate.

The ‘increment_code’ column is not very useful on its own. Online chess games are generally separated into 3 formats: less than 3 minutes is called bullet, 3 to 14 minutes is called blitz, and 15 minutes or more is called rapid. The number before the plus sign in ‘increment_code’ is the starting time in minutes, so I create a new column called ‘format’ with three values: ‘bullet’, ‘blitz’, and ‘rapid’.

# starting minutes: the number before the '+' sign
df['increment_code1'] = df['increment_code'].str.split('+').str[0].astype(int)
df['format'] = 'bullet'
df.loc[(df['increment_code1'] >= 3),'format'] = 'blitz' 
df.loc[df['increment_code1'] >= 15,'format'] = 'rapid'
df.drop('increment_code1', inplace=True, axis=1)

(df['format']).value_counts()

blitz     13624
rapid      5937
bullet      115
Name: format, dtype: int64
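The thresholds can also be written as a small standalone function, which makes the boundary cases easy to check by hand. This is a sketch rather than the code used above: `classify_format` is a hypothetical name, and '2+1' is a made-up code added only to exercise the bullet branch.

```python
def classify_format(increment_code):
    """Map a code such as '15+2' to 'bullet', 'blitz', or 'rapid'."""
    minutes = int(increment_code.split('+')[0])  # starting minutes before the '+'
    if minutes >= 15:
        return 'rapid'
    if minutes >= 3:
        return 'blitz'
    return 'bullet'

for code in ['15+2', '5+10', '2+1']:
    print(code, classify_format(code))
```

Checking the larger threshold first means each game gets exactly one label, which mirrors the order of the two `df.loc` assignments.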

Creating New Columns for Machine Learning¶

Here I want to add 2 new numerical columns, ‘black_win’ and ‘white_win’, which will help me perform linear regression. Each column takes one of 3 values: 0 for a loss, 0.5 for a draw, and 1 for a win.

df['black_win'] = 0.0
df.loc[df['winner'] == 'black', 'black_win'] = 1
df.loc[df['winner'] == 'draw', 'black_win'] = 0.5
df['white_win'] = 1 - df['black_win']

Creating New Data Frames for the Chess Library¶

I want to create a new data frame that holds the opening name and the first 6 moves of each game.

moves = ['move ' + str(i) for i in range(1,7)]
df2 = df['opening_name']
df2 = pd.DataFrame(df2)
df2[moves] = 0
for i in df.index:
    temp = df.loc[i, 'moves'].split()
    for j in range(1,7):
        df2.loc[i, 'move ' + str(j)] = temp[j-1]

The reason I store the moves in a separate data frame is that when they are added to df, it becomes too large for some of the Altair charts to plot. df2 is used for the chess library.

Creating a New Data Frame for Machine Learning Part 2¶

This new data frame, df3, is for the second part of the machine learning, which uses the chess library for feature engineering.
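For a single game, the splitting step amounts to the following plain-Python sketch. `first_moves` is a hypothetical helper, and the sample string is the visible part of the first game's moves from the table at the top.

```python
def first_moves(moves, n=6):
    """Return the first n moves of a space-separated move string, keyed like the df2 columns."""
    tokens = moves.split()
    return {'move ' + str(i + 1): tokens[i] for i in range(n)}

sample = 'd4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5'
opening = first_moves(sample)
print(opening)
```

This would raise an IndexError for a game with fewer than 6 moves, which is one reason the very short games were dropped earlier.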

df3 = pd.DataFrame(df['moves'])
df3['white_win'] = False
df3.loc[df['winner'] == 'white', 'white_win'] = True

Basic Statistics¶

In this section, I will use Altair charts to analyze some basic statistics of chess.

for i in ['white', 'black']:
    print(f'The winning percentage for {i} is ' + str(df['winner'].value_counts()[i]/len(df)))
print("And the drawing percentage is " + str(df['winner'].value_counts()['draw']/len(df)))

The winning percentage for white is 0.49801788981500306
The winning percentage for black is 0.45461475909737753
And the drawing percentage is 0.047367351087619435

# remove the limit on the number of data points Altair can plot
alt.data_transformers.enable('default', max_rows=None)
alt.Chart(df[['winner']]).mark_bar().encode(
    x = 'winner:O',
    y = 'count(winner)',
    color = 'winner:N'
)

From the two outputs above we can see that white has a distinct advantage over black, regardless of the format or moves.

Now let’s see if white has an advantage in both rated and unrated games.
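One way to put a single number on this advantage is the average score per game, using the same scoring as the ‘white_win’ and ‘black_win’ columns (1 for a win, 0.5 for a draw, 0 for a loss). The sketch below only reuses the three fractions printed above; the variable names are my own.

```python
# fractions printed above
white = 0.49801788981500306
black = 0.45461475909737753
draw = 0.047367351087619435

# average points per game for each side
white_score = 1 * white + 0.5 * draw + 0 * black
black_score = 1 * black + 0.5 * draw + 0 * white
print(round(white_score, 4), round(black_score, 4))  # 0.5217 0.4783
```

So white scores a little over 52% of the available points in this dataset, about 4 percentage points more than black.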

alt.Chart(df[['winner', 'rated']]).mark_bar().encode(
    x = 'winner:O',
    y = 'count(winner)',
    color = 'winner:N',
    column = 'rated'
)

As we can see, in both rated and unrated games white still has a noticeable advantage over black.

I am interested in how ‘rated’ correlates with ‘victory_status’. The column ‘victory_status’ has 4 different values: resign, mate, outoftime, and draw.

df['victory_status'].value_counts()

resign       10872
mate          6307
outoftime     1609
draw           888
Name: victory_status, dtype: int64

alt.Chart(df[['victory_status', 'winner', 'rated']]).mark_bar().encode(
    x = 'victory_status',
    y = 'count(victory_status)',
    color = 'winner',
    column = 'rated',
    tooltip = ['winner','count(victory_status)']
)

From the plot, it is hard to see the differences. Let’s calculate the proportions.

# proportion of each victory status within the non-rated and the rated games
proportions = df.groupby('rated')['victory_status'].value_counts(normalize=True)
not_rated = proportions.loc[False]
rated = proportions.loc[True]
not_rated # non-rated games: proportion of each victory status

victory_status
resign       0.554764
mate         0.310108
outoftime    0.078121
draw         0.057007
Name: victory_status, dtype: float64

rated # rated games: proportion of each victory status

victory_status
resign       0.552024
mate         0.323031
outoftime    0.082646
draw         0.042299
Name: victory_status, dtype: float64

As we can see, there is not much difference between rated and non-rated games. I was expecting people to resign more easily in non-rated games, but that does not seem to be the case.
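To make "not much difference" concrete, the gaps can be computed directly from the two outputs above. This is a small sketch that copies the printed proportions into dictionaries; nothing here is recomputed from the raw data.

```python
# proportions copied from the two outputs above
not_rated = {'resign': 0.554764, 'mate': 0.310108, 'outoftime': 0.078121, 'draw': 0.057007}
rated = {'resign': 0.552024, 'mate': 0.323031, 'outoftime': 0.082646, 'draw': 0.042299}

# rated minus non-rated, per victory status
gaps = {status: rated[status] - not_rated[status] for status in rated}
largest = max(gaps, key=lambda status: abs(gaps[status]))

for status, gap in gaps.items():
    print(f'{status:<10} {gap * 100:+.2f} percentage points')
print('largest gap:', largest)
```

Every gap is under 2 percentage points, and the resign rate in particular differs by less than half a point.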

Machine Learning Part 1¶

After the basic statistics, let’s get into linear regression.

df

	id	rated	created_at	last_move_at	turns	victory_status	winner	increment_code	white_id	white_rating	black_id	black_rating	moves	opening_eco	opening_name	opening_ply	format	black_win	white_win
0	TZJHLljE	False	2017-08-31 20:06:40.000000000	2017-08-31 20:06:40.000000000	13	outoftime	white	15+2	bourgris	1500	a-00	1191	d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5...	D10	Slav Defense: Exchange Variation	5	rapid	0.0	1.0
1	l1NXvwaE	True	2017-08-30 21:53:20.000000000	2017-08-30 21:53:20.000000000	16	resign	black	5+10	a-00	1322	skinnerua	1261	d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6...	B00	Nimzowitsch Defense: Kennedy Variation	4	blitz	1.0	0.0
2	mIICvQHh	True	2017-08-30 21:53:20.000000000	2017-08-30 21:53:20.000000000	61	mate	white	5+10	ischia	1496	a-00	1500	e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc...	C20	King's Pawn Game: Leonardis Variation	3	blitz	0.0	1.0
3	kWKvrqYL	True	2017-08-30 16:20:00.000000000	2017-08-30 16:20:00.000000000	61	mate	white	20+0	daniamurashov	1439	adivanov2009	1454	d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O...	D02	Queen's Pawn Game: Zukertort Variation	3	rapid	0.0	1.0
4	9tXo1AUZ	True	2017-08-29 18:06:40.000000000	2017-08-29 18:06:40.000000000	95	mate	white	30+3	nik221107	1523	adivanov2009	1469	e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N...	C41	Philidor Defense	5	rapid	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20053	EfqH7VVH	True	2017-07-11 16:35:14.342000128	2017-07-11 16:40:36.076000000	24	resign	white	10+10	belcolt	1691	jamboger	1220	d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5...	A80	Dutch Defense	2	blitz	0.0	1.0
20054	WSJDhbPl	True	2017-07-10 14:48:09.760000000	2017-07-10 15:00:33.979000064	82	mate	black	10+0	jamboger	1233	farrukhasomiddinov	1196	d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd...	A41	Queen's Pawn	2	blitz	1.0	0.0
20055	yrAas0Kj	True	2017-07-10 14:44:37.492999936	2017-07-10 14:47:30.327000064	35	mate	white	10+0	jamboger	1219	schaaksmurf3	1286	d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd...	D00	Queen's Pawn Game: Mason Attack	3	blitz	0.0	1.0
20056	b0v4tRyF	True	2017-07-10 14:15:27.019000064	2017-07-10 14:31:13.718000128	109	resign	white	10+0	marcodisogno	1360	jamboger	1227	e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb...	B07	Pirc Defense	4	blitz	0.0	1.0
20057	N8G2JHGG	True	2017-07-09 23:32:32.648999936	2017-07-09 23:44:49.348000000	78	mate	black	10+0	jamboger	1235	ffbob	1339	d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3...	D00	Queen's Pawn Game: Mason Attack	3	blitz	1.0	0.0

19676 rows × 19 columns

Linear Regression¶

This part uses ‘turns’, ‘opening_ply’, and ‘black_win’ or ‘white_win’ to predict the rating of each player.

First, let’s split the data into a training set and a test set for each of the two predictions.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Let’s start by predicting black’s rating.

cols1 = ['turns', 'opening_ply', 'black_win']
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df['black_rating'], test_size=0.2, random_state=0)
reg_black = LinearRegression()
reg_black.fit(X_train, y_train)

LinearRegression()

Let’s use both mean squared error and mean absolute error to measure the accuracy of the prediction and to see if the model is overfitting.

from sklearn.metrics import mean_squared_error, mean_absolute_error

mean_squared_error(reg_black.predict(X_train), y_train)

74518.74574075406

mean_squared_error(reg_black.predict(X_test), y_test)

73644.84351354155

mean_absolute_error(reg_black.predict(X_train), y_train)

217.59747560753172

mean_absolute_error(reg_black.predict(X_test), y_test)

215.41370670550344
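The two metrics are on different scales: mean squared error is in squared rating points, while mean absolute error is in rating points. The sketch below writes both formulas out by hand on three made-up ratings (the toy numbers are assumptions, not data from the project) and then takes the square root of the test MSE reported above to bring it back to rating points.

```python
import math

def mean_squared(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# made-up ratings and predictions, only to show the arithmetic
y_true = [1500, 1200, 1800]
y_pred = [1550, 1300, 1700]
toy_mse = mean_squared(y_true, y_pred)
toy_mae = mean_absolute(y_true, y_pred)

# root of the reported test MSE for black's rating, in rating points
reported_test_mse = 73644.84351354155
rmse = math.sqrt(reported_test_mse)
print(toy_mse, round(toy_mae, 2), round(rmse, 1))
```

A root mean squared error of roughly 271 rating points next to a mean absolute error of about 215 suggests that a number of predictions miss by much more than the typical amount.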

Let’s do the same thing for white and see if there is a difference.

cols2 = ['turns', 'opening_ply', 'white_win']
X_train, X_test, y_train, y_test = train_test_split(df[cols2], df['white_rating'], test_size=0.2, random_state=0)
reg_white = LinearRegression()
reg_white.fit(X_train, y_train)

LinearRegression()

mean_squared_error(reg_white.predict(X_train), y_train)

74223.69960957882

mean_squared_error(reg_white.predict(X_test), y_test)

74970.70006569766

mean_absolute_error(reg_white.predict(X_train), y_train)

217.1967432196025

mean_absolute_error(reg_white.predict(X_test), y_test)

217.96851240583175

Before the conclusion, let’s make an Altair chart that plots the predicted values.

# first predict on the whole dataset
df['black_pred'] = reg_black.predict(df[cols1])
df['white_pred'] = reg_white.predict(df[cols2])

df

	id	rated	created_at	last_move_at	turns	victory_status	winner	increment_code	white_id	white_rating	...	black_rating	moves	opening_eco	opening_name	opening_ply	format	black_win	white_win	black_pred	white_pred
0	TZJHLljE	False	2017-08-31 20:06:40.000000000	2017-08-31 20:06:40.000000000	13	outoftime	white	15+2	bourgris	1500	...	1191	d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5...	D10	Slav Defense: Exchange Variation	5	rapid	0.0	1.0	1492.716943	1593.759787
1	l1NXvwaE	True	2017-08-30 21:53:20.000000000	2017-08-30 21:53:20.000000000	16	resign	black	5+10	a-00	1322	...	1261	d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6...	B00	Nimzowitsch Defense: Kennedy Variation	4	blitz	1.0	0.0	1570.218338	1488.518709
2	mIICvQHh	True	2017-08-30 21:53:20.000000000	2017-08-30 21:53:20.000000000	61	mate	white	5+10	ischia	1496	...	1500	e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc...	C20	King's Pawn Game: Leonardis Variation	3	blitz	0.0	1.0	1494.786898	1582.650804
3	kWKvrqYL	True	2017-08-30 16:20:00.000000000	2017-08-30 16:20:00.000000000	61	mate	white	20+0	daniamurashov	1439	...	1454	d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O...	D02	Queen's Pawn Game: Zukertort Variation	3	rapid	0.0	1.0	1494.786898	1582.650804
4	9tXo1AUZ	True	2017-08-29 18:06:40.000000000	2017-08-29 18:06:40.000000000	95	mate	white	30+3	nik221107	1523	...	1469	e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N...	C41	Philidor Defense	5	rapid	0.0	1.0	1585.167206	1672.563933
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
20053	EfqH7VVH	True	2017-07-11 16:35:14.342000128	2017-07-11 16:40:36.076000000	24	resign	white	10+10	belcolt	1691	...	1220	d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5...	A80	Dutch Defense	2	blitz	0.0	1.0	1427.047899	1518.473716
20054	WSJDhbPl	True	2017-07-10 14:48:09.760000000	2017-07-10 15:00:33.979000064	82	mate	black	10+0	jamboger	1233	...	1196	d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd...	A41	Queen's Pawn	2	blitz	1.0	0.0	1592.582253	1494.708198
20055	yrAas0Kj	True	2017-07-10 14:44:37.492999936	2017-07-10 14:47:30.327000064	35	mate	white	10+0	jamboger	1219	...	1286	d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd...	D00	Queen's Pawn Game: Mason Attack	3	blitz	0.0	1.0	1465.473400	1557.664123
20056	b0v4tRyF	True	2017-07-10 14:15:27.019000064	2017-07-10 14:31:13.718000128	109	resign	white	10+0	marcodisogno	1360	...	1227	e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb...	B07	Pirc Defense	4	blitz	0.0	1.0	1574.927761	1657.399180
20057	N8G2JHGG	True	2017-07-09 23:32:32.648999936	2017-07-09 23:44:49.348000000	78	mate	black	10+0	jamboger	1235	...	1339	d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3...	D00	Queen's Pawn Game: Mason Attack	3	blitz	1.0	0.0	1614.096120	1519.483213

19676 rows × 21 columns

use_cols = ['white_id','white_rating', 'white_pred','created_at','black_rating','format', 'black_pred']

sel = alt.selection_single(fields=['white_id'])
sel1 = alt.selection_single(fields=['white_id','white_rating', 'white_pred','created_at'])
a = alt.Chart(df[use_cols]).mark_circle().encode(
    x = 'black_rating',
    y = 'white_rating',
    color = 'format',
    tooltip = ['white_id', 'white_rating', 'white_pred']
).properties(
    title='Real'
)
b = alt.Chart(df[use_cols]).mark_circle().encode(
    x = 'black_pred',
    y = 'white_pred',
    color = 'format',
    tooltip = ['white_id', 'white_rating', 'white_pred', 'created_at' ]
).properties(
    title='Predict'
)
a.add_selection(sel)&b.transform_filter(sel).add_selection(sel1)

Click a point in the first chart a few times until several points show up in the second chart. You will see that the model predicts points in the middle of the cluster much better than the ones on the edge.

Conclusion for Using ‘opening_ply’ and ‘turns’ to Predict Ratings¶

Judging only by the mean absolute error before plotting, the prediction seems reasonable and not overfit, since the errors on the test set and the training set are nearly the same. However, when we plot the predictions next to the actual ratings, we can see from the shape alone that the prediction is actually very poor. The predictions fall along a line, because I used linear regression, and all the values are concentrated between 1300 and 2200. This shows that the number of opening moves and the total number of moves carry very little information about ratings.

Chess Library¶

!pip install chess==1.9.1
import chess
board = chess.Board()


Let’s see what the position looks like after the first 6 moves of one of the games, along with its opening name.

board = chess.Board()
for i in range(1,7):
    board.push_san(df2.iloc[0,i])
print(df2.iloc[0,0])
board

Slav Defense: Exchange Variation

We could only get the first six moves because that is how they were split in the data frame. If we want the final position of a game, we can do something like this.

board = chess.Board()
all_move = df.loc[0,'moves'].split()
for i in all_move:
    board.push_san(i)
board

We can check if the position is a stalemate or a checkmate by using is_stalemate() or is_checkmate().

board.is_stalemate()

False

board.is_checkmate()

False

Here I am going to define a function that takes a board as input and outputs each side’s total piece value: a pawn is 1, a knight and a bishop are both 3, a rook is 5, and a queen is 9. These are the conventional piece values; some may argue that a bishop is worth 3.5, but here we will just use 3. The approach I am going to take is to check every single square and use piece_type_at() and color_at() to find which piece, if any, is on it.

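def board_value(board):
    white = 0
    black = 0
    for i in range(0,64):
        # piece types: 1 pawn, 2 knight, 3 bishop, 4 rook, 5 queen (the king is not counted)
        x = board.piece_type_at(i)
        if board.color_at(i) == True:   # True means a white piece
            if x == 1:
                white += 1
            elif x == 2 or x == 3:
                white += 3
            elif x == 4:
                white += 5
            elif x == 5:
                white += 9
        else:   # black piece, or an empty square where x is None
            if x == 1:
                black += 1
            elif x == 2 or x == 3:
                black += 3
            elif x == 4:
                black += 5
            elif x == 5:
                black += 9
    return white, black

We will use this function to help us manipulate some raw data. Note that the values and scores shown in Machine Learning Part 2 were produced by an earlier version of this function that wrote =+ (plain assignment) where += was intended, so those numbers reflect only the last piece found for each side rather than the full material count.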
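As a cross-check that does not need the chess library, the same material-count idea can be sketched on the piece-placement part of a FEN string, where uppercase letters are white pieces and lowercase letters are black pieces. `material_from_fen` and `PIECE_VALUES` are hypothetical names; I believe the chess library can produce such a string with `board.board_fen()`, but treat that as an assumption and check the documentation.

```python
# conventional piece values, keyed by lowercase FEN letter (the king is not counted)
PIECE_VALUES = {'p': 1, 'n': 3, 'b': 3, 'r': 5, 'q': 9}

def material_from_fen(placement):
    """Return (white_total, black_total) for a FEN piece-placement string."""
    white = 0
    black = 0
    for ch in placement:
        value = PIECE_VALUES.get(ch.lower(), 0)  # digits, '/', and kings add 0
        if ch.isupper():
            white += value
        else:
            black += value
    return white, black

start = 'rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR'
print(material_from_fen(start))  # (39, 39)
```

The starting position is a handy test case: 8 pawns, 2 knights, 2 bishops, 2 rooks, and a queen add up to 39 for each side, so any counting function should return (39, 39) for it.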

Machine Learning Part 2¶

For this part of the machine learning, I will use the final board’s value to predict the winner of the game with logistic regression. But first, I need to turn the moves of each game into the final board position and use board_value to calculate its value. Then the machine learning part can start.

def final_board(moves):
    board = chess.Board()
    all_move = moves.split()
    for i in all_move:
        board.push_san(i)
    return board

df3['board'] = final_board(df3.iloc[0,0])

I found something unexpected: a pandas DataFrame can actually store a board in it. Thinking about it, this makes sense, since a board is an object and a DataFrame can store all sorts of objects.

So let’s convert all the moves in df3 into boards using final_board.

df3['board'] = df3['moves'].map(final_board)

Instead of storing white’s and black’s values separately, I will store the difference: positive means white has more material and negative means black has more.

df3['values'] = df3['board'].map(board_value)

df3['values'] = df3['values'].map(lambda x: x[0]- x[1])

df3

	moves	white_win	board	values
0	d4 d5 c4 c6 cxd5 e6 dxe6 fxe6 Nf3 Bb4+ Nc3 Ba5...	True	r n b q k . n r\np p . . . . p p\n. . p . p . ...	-2
1	d4 Nc6 e4 e5 f4 f6 dxe5 fxe5 fxe5 Nxe5 Qd4 Nc6...	False	r . b q k . n r\np p p p . . p p\n. . . . . . ...	-4
2	e4 e5 d3 d6 Be3 c6 Be2 b5 Nd2 a5 a4 c5 axb5 Nc...	True	. . . . . . . .\n. . . . . P . .\n. . P . . . ...	0
3	d4 d5 Nf3 Bf5 Nc3 Nf6 Bf4 Ng4 e3 Nc6 Be2 Qd7 O...	True	. . . . . . . r\nR . . . . . Q p\n. . p . . . ...	4
4	e4 e5 Nf3 d6 d4 Nc6 d5 Nb4 a3 Na6 Nc3 Be7 b4 N...	True	. . . . . . Q .\n. . . . . . . R\n. . . . . . ...	8
...	...	...	...	...
20053	d4 f5 e3 e6 Nf3 Nf6 Nc3 b6 Be2 Bb7 O-O Be7 Ne5...	True	r n . q . k . r\np b p . b . . .\n. p . . p n ...	-2
20054	d4 d6 Bf4 e5 Bg3 Nf6 e3 exd4 exd4 d5 c3 Bd6 Bd...	False	. Q . . . . . .\n. . . . . N k .\n. p . . . . ...	8
20055	d4 d5 Bf4 Nc6 e3 Nf6 c3 e6 Nf3 Be7 Bd3 O-O Nbd...	True	r . b . q r . .\np . p n b k . Q\n. p . . p p ...	4
20056	e4 d6 d4 Nf6 e5 dxe5 dxe5 Qxd1+ Kxd1 Nd5 c4 Nb...	True	. . R . . . . .\n. Q . . . . . .\n. . . . . . ...	4
20057	d4 d5 Bf4 Na6 e3 e6 c3 Nf6 Nf3 Bd7 Nbd2 b5 Bd3...	False	. . . K q . . .\n. . P . . k . .\n. . . . p . ...	-8

19676 rows × 4 columns

from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(df3[['values']], df3['white_win'], test_size=0.2, random_state=0)
clf = LogisticRegression()
clf.fit(X_train, y_train)

LogisticRegression()
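With a single feature, logistic regression passes a weighted material difference through the logistic function to get a probability that white wins. The sketch below shows only the shape of that mapping: `win_probability` is a hypothetical helper, and the weight and bias are made-up values, not the coefficients fitted by `clf`.

```python
import math

def win_probability(material_diff, weight=0.3, bias=0.0):
    """Logistic model: P(white wins) = 1 / (1 + exp(-(weight * x + bias)))."""
    return 1 / (1 + math.exp(-(weight * material_diff + bias)))

# weight and bias are illustrative assumptions, not fitted values
for diff in [-8, 0, 8]:
    print(diff, round(win_probability(diff), 3))
```

With a positive weight, a larger material lead for white always gives a higher predicted probability, and the model predicts a white win whenever that probability is above 0.5.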

Let’s check the score on both train and test set.

clf.score(X_train, y_train)

0.6550190597204575

clf.score(X_test, y_test)

0.6557418699186992

We get about 34 wrong predictions out of every 100 we make. So the prediction is neither great nor awful, but one thing we know is that it is definitely not overfitting: the train score and the test score are very similar.
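The score returned by `clf.score` is the accuracy, so the error rate and a simple baseline follow from numbers already printed in this notebook. In the sketch below the baseline is approximate, since it uses the share of white wins in the whole dataset rather than in the test set alone.

```python
test_accuracy = 0.6557418699186992     # clf.score(X_test, y_test) from above
white_win_rate = 0.49801788981500306   # share of games won by white, from Basic Statistics

# wrong predictions per 100 games
errors_per_100 = (1 - test_accuracy) * 100

# accuracy of always guessing the more common class (approximate baseline)
baseline_accuracy = max(white_win_rate, 1 - white_win_rate)
improvement = test_accuracy - baseline_accuracy
print(round(errors_per_100, 1), round(baseline_accuracy, 3), round(improvement, 3))
```

Always guessing the more common outcome would be right only about half the time, so the model adds roughly 15 percentage points of accuracy over that baseline.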

Summary¶

First, we looked at some basic statistics of chess, such as which side wins more. We also found that the number of opening moves and the total number of turns do not really correlate with ratings. Finally, we used the chess library to convert the raw data into something more interesting, and used logistic regression to predict whether white wins from the final board’s material value. After this, I understand that the most difficult part of machine learning is not fitting or predicting, since we can just use a library like sklearn; the most challenging part is turning raw data into something that can be used in the machine learning process.

References¶

What is the source of your dataset(s)?

From Kaggle: Chess Dataset

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

Converting Unix time to datetime: Stack Overflow

Interactive graph: course notes

Chess library documentation

List other references that you found helpful.

Altair chart documentation

Created in Deepnote
