Predicting match outcomes in professional Counter-Strike: Global Offensive

Author: Ryan Wei

Course Project, UC Irvine, Math 10, W22

Introduction

Counter-Strike: Global Offensive (CS:GO) is a tactical 5v5 first-person shooter video game with a thriving esports scene. Professional matches are typically played as a Best of 3 series, where teams take turns picking and banning maps from a predetermined pool. Each map has a T side and a CT side, and teams switch sides halfway through. The first team to win 16 rounds takes the map and earns a point in the Best of 3 series.

The goal of this project is to see if I can predict the outcome of matches based on a number of variables: the global rank of each team, their preference for the maps picked, and their strength on those maps.

Main portion of the project

import pandas as pd
import numpy as np
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
!kaggle datasets download -d viniciusromanosilva/csgo-hltv --unzip
Downloading csgo-hltv.zip to /work
df = pd.read_csv("cs_hltv_data.csv")
df
Unnamed: 0 date time first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won ... first_pick_by_first_team ban 1 ban 2 pick 1 pick 2 ban 3 ban 4 pick 3 url timestamp
0 0 2021-06-02 10:00 Fiend K23 28 27 0 2 0 ... 0 Mirage Vertigo Overpass Nuke Inferno Train Dust2 https://www.hltv.org/matches/2349155/fiend-vs-... 1.622628e+09
1 1 2021-06-02 10:00 Entropiq BIG 15 8 0 2 0 ... 1 Inferno Overpass Mirage Vertigo Train Nuke Dust2 https://www.hltv.org/matches/2348945/entropiq-... 1.622628e+09
2 2 2021-06-02 07:00 Entropiq Nemiga 15 52 2 0 1 ... 0 Train Inferno Vertigo Dust2 Overpass Nuke Mirage https://www.hltv.org/matches/2349184/entropiq-... 1.622617e+09
3 3 2021-06-02 07:00 OG Fiend 20 28 2 0 1 ... 0 Nuke Vertigo Mirage Inferno Dust2 Train Overpass https://www.hltv.org/matches/2348944/og-vs-fie... 1.622617e+09
4 4 2021-06-01 14:15 BIG Wisla Krakow 8 49 2 0 1 ... 1 Ancient Vertigo Dust2 Nuke Inferno Overpass Mirage https://www.hltv.org/matches/2349144/big-vs-wi... 1.622557e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 3485 2017-12-16 20:35 FaZe Cloud9 2 5 2 0 1 ... 0 Nuke Cobblestone Mirage Overpass Inferno Train Cache https://www.hltv.org/matches/2317944/faze-vs-c... 1.513456e+09
3486 3486 2017-12-16 18:05 OpTic mousesports 12 15 0 2 0 ... 0 Inferno Cobblestone Nuke Train Mirage Cache Overpass https://www.hltv.org/matches/2317943/optic-vs-... 1.513448e+09
3487 3487 2017-12-16 15:15 Cloud9 Liquid 5 10 2 0 1 ... 1 Nuke Train Mirage Inferno Cache Overpass Cobblestone https://www.hltv.org/matches/2317942/cloud9-vs... 1.513437e+09
3488 3488 2017-12-16 13:00 GODSENT AGO 28 21 0 2 0 ... 1 Nuke Cache Mirage Cobblestone Overpass Inferno Train https://www.hltv.org/matches/2318180/godsent-v... 1.513429e+09
3489 3489 2017-12-16 12:15 Luminosity mousesports 24 15 0 2 0 ... 1 Mirage Cache Train Nuke Cobblestone Inferno Overpass https://www.hltv.org/matches/2317941/luminosit... 1.513426e+09

3490 rows × 157 columns

I realized that the data in score_second_team_t2_M1 (through M3) are all truncated for some reason: instead of showing the full score range, the values are limited to 0-9. This data will need to be reverse engineered from the map totals.
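A quick way to confirm the truncation (a sketch; the "-" placeholders for unplayed maps are coerced to NaN first):

#if these columns really are truncated, every per-column max will be at most 9
t2_cols = [c for c in df.columns if c.startswith("score_second_team_t2")]
print(df[t2_cols].apply(pd.to_numeric, errors="coerce").max())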

#df for the outcomes of respective best of 3 matches
match_df = df.iloc[:,3:37].copy()
match_df
first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won M1 first_team_score_M1 second_team_score_M1 ... score_first_team_t1_M2 score_first_team_t2_M2 score_second_team_t1_M2 score_second_team_t2_M2 side_first_team_M3 side_second_team_M3 score_first_team_t1_M3 score_first_team_t2_M3 score_second_team_t1_M3 score_second_team_t2_M3
0 Fiend K23 28 27 0 2 0 Overpass 16 19 ... 7 3 8 8 - - - - - -
1 Entropiq BIG 15 8 0 2 0 Mirage 10 16 ... 9 2 6 1 - - - - - -
2 Entropiq Nemiga 15 52 2 0 1 Vertigo 16 5 ... 10 5 5 1 - - - - - -
3 OG Fiend 20 28 2 0 1 Mirage 16 11 ... 10 6 5 4 - - - - - -
4 BIG Wisla Krakow 8 49 2 0 1 Dust2 16 8 ... 8 8 7 5 - - - - - -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 FaZe Cloud9 2 5 2 0 1 Mirage 16 8 ... 11 5 4 7 - - - - - -
3486 OpTic mousesports 12 15 0 2 0 Nuke 7 16 ... 3 9 12 4 - - - - - -
3487 Cloud9 Liquid 5 10 2 0 1 Mirage 16 14 ... 6 10 9 1 - - - - - -
3488 GODSENT AGO 28 21 0 2 0 Mirage 14 16 ... 10 3 5 1 - - - - - -
3489 Luminosity mousesports 24 15 0 2 0 Train 9 16 ... 2 1 13 3 - - - - - -

3490 rows × 34 columns

#build a new df consisting of just pure game data, isolating individual maps
map_data_cols = ["T1_name", "T2_name", "Map", "T1_score", "T2_score", "T1_side", "T2_side", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]

#pick out wanted columns, then rename
map_data = match_df.iloc[:, np.r_[0:2, 7:10, 16:22]].copy()
map_data.columns = map_data_cols
#employ np.r_ to slice out the desired columns, then append to map_data after aligning column names
map_data_temp = match_df.iloc[:, np.r_[0:2, 10:13, 22:28]].copy()
map_data_temp.columns = map_data_cols
map_data = map_data.append(map_data_temp, ignore_index = True)

map_data_temp2 = match_df.iloc[:, np.r_[0:2, 13:16, 28:34]].copy()
map_data_temp2.columns = map_data_cols
map_data = map_data.append(map_data_temp2, ignore_index = True)
#T1 = first team, T2 = second team, H1 = first half, H2 = second half
map_data
T1_name T2_name Map T1_score T2_score T1_side T2_side T1_score_H1 T1_score_H2 T2_score_H1 T2_score_H2
0 Fiend K23 Overpass 16 19 CT T 9 6 6 9
1 Entropiq BIG Mirage 10 16 CT T 7 3 8 8
2 Entropiq Nemiga Vertigo 16 5 CT T 10 6 5 0
3 OG Fiend Mirage 16 11 CT T 9 7 6 5
4 BIG Wisla Krakow Dust2 16 8 T CT 13 3 2 6
... ... ... ... ... ... ... ... ... ... ... ...
10465 FaZe Cloud9 Overpass 16 11 CT T 11 5 4 7
10466 OpTic mousesports Train 12 16 CT T 3 9 12 4
10467 Cloud9 Liquid Inferno 16 10 T CT 6 10 9 1
10468 GODSENT AGO Cobblestone 13 16 CT T 10 3 5 1
10469 Luminosity mousesports Nuke 3 16 CT T 2 1 13 3

10470 rows × 11 columns
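Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer pandas the stacking above can be done with pd.concat instead. A minimal equivalent sketch:

import pandas as pd
import numpy as np

#build the same three column blocks as above, then stack them in one pd.concat call
parts = []
for cols in (np.r_[0:2, 7:10, 16:22], np.r_[0:2, 10:13, 22:28], np.r_[0:2, 13:16, 28:34]):
    part = match_df.iloc[:, cols].copy()
    part.columns = map_data_cols
    parts.append(part)
map_data = pd.concat(parts, ignore_index=True)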

#the score columns mix integers with "-" placeholders (Map 3 is not played when a team wins 2-0),
    #so filter out the "-" rows, then cast the score columns to int
map_data = map_data[map_data["T1_score"].astype(str) != "-"].copy()
score_cols = ["T1_score", "T2_score", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]
map_data[score_cols] = map_data[score_cols].astype(int)
#reconstruct T2_score_H2 from the map totals, since the raw column is truncated
#if the second team won in overtime, regulation ended 15-15, so cap their regulation rounds at 15
for x in map_data.index:
    if (map_data["T1_score"][x] + map_data["T2_score"][x] > 30) and (map_data["T2_score"][x] > 16):
        map_data.at[x, "T2_score_H2"] = 15 - map_data["T2_score_H1"][x]
    else:
        map_data.at[x, "T2_score_H2"] = map_data["T2_score"][x] - map_data["T2_score_H1"][x]
map_data
T1_name T2_name Map T1_score T2_score T1_side T2_side T1_score_H1 T1_score_H2 T2_score_H1 T2_score_H2
0 Fiend K23 Overpass 16 19 CT T 9 6 6 9
1 Entropiq BIG Mirage 10 16 CT T 7 3 8 8
2 Entropiq Nemiga Vertigo 16 5 CT T 10 6 5 0
3 OG Fiend Mirage 16 11 CT T 9 7 6 5
4 BIG Wisla Krakow Dust2 16 8 T CT 13 3 2 6
... ... ... ... ... ... ... ... ... ... ... ...
10465 FaZe Cloud9 Overpass 16 11 CT T 11 5 4 7
10466 OpTic mousesports Train 12 16 CT T 3 9 12 4
10467 Cloud9 Liquid Inferno 16 10 T CT 6 10 9 1
10468 GODSENT AGO Cobblestone 13 16 CT T 10 3 5 1
10469 Luminosity mousesports Nuke 3 16 CT T 2 1 13 3

10470 rows × 11 columns
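A quick sanity check on the adjustment (a sketch): with overtime rounds removed, a team can be credited with at most 16 regulation rounds.

#every adjusted half-score pair should sum to at most 16 regulation rounds
assert ((map_data["T2_score_H1"] + map_data["T2_score_H2"]) <= 16).all()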

After fixing that column of data, we can proceed to calculating whether each map is T- or CT-sided. As a check, row 0's Overpass game totals 16 + 19 = 35 rounds, so it went to overtime, and the second team's regulation second half is reconstructed as 15 - 6 = 9, matching the table.

map_names = map_data.Map.unique()
map_side_data = {"Map": map_names,
    "CT_wins": 0,
    "T_wins": 0
}
map_side = pd.DataFrame(map_side_data, columns = ["Map", "CT_wins", "T_wins"])
#fill out map_side df with raw wins/losses
CT = "CT_wins"
T = "T_wins"
for x in map_data.index:
    m_name = map_data["Map"][x]
    if (map_data["T1_side"][x] == "CT"):
        map_side.loc[map_side.Map == m_name, f"{CT}"] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
        map_side.loc[map_side.Map == m_name, f"{T}"] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
    else:
        map_side.loc[map_side.Map == m_name, f"{CT}"] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
        map_side.loc[map_side.Map == m_name, f"{T}"] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
#calculate win rate percentages
map_side["CT_winrate"] = map_side["CT_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
map_side["T_winrate"] = map_side["T_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
#flag maps as T or CT
map_side["T_or_CT"] = ""

for x in map_side.index:
    if (map_side.iloc[x]["CT_winrate"] >= 0.5):
        map_side.at[x,"T_or_CT"] = "CT"
    else:
        map_side.at[x,"T_or_CT"] = "T"
map_side
Map CT_wins T_wins CT_winrate T_winrate T_or_CT
0 Overpass 16828 15633 0.518407 0.481593 CT
1 Mirage 24299 22449 0.519787 0.480213 CT
2 Vertigo 7827 8158 0.489647 0.510353 T
3 Dust2 19014 20090 0.486242 0.513758 T
4 Inferno 24701 24724 0.499767 0.500233 T
5 Nuke 20403 17510 0.538153 0.461847 CT
6 Train 17161 14301 0.545452 0.454548 CT
7 Cache 3877 4229 0.478288 0.521712 T
8 Cobblestone 1746 1826 0.488802 0.511198 T
alt.Chart(map_side).mark_bar().encode(
    x = "Map",
    y = alt.Y("CT_winrate:Q",
        scale = alt.Scale(domain = (.45, .55)))
).properties(
    title = "Map CT win rates"
)

From the data above, we can see that maps like Nuke, Train, Mirage, and Overpass are more heavily CT-sided, while maps like Vertigo, Dust2, Cache, and Cobblestone are more T-sided.

Inferno is quite balanced, with almost a perfectly even 50% win rate.

With the data on which side is more likely to win on each map, we now want to calculate how each team performs on each side of each map. That can be used to determine how proficient a team is on a given map, which in turn serves as another metric for predicting the outcome of a match.

#create an array with all unique team names
team_names = np.concatenate([map_data.T1_name.unique(), map_data.T2_name.unique()])
team_names = pd.unique(team_names)

#dictionary to convert into df for individual teams and their success rate on each map
team_map_data = {"Team": team_names}
for x in map_names:
    team_map_data[x + "_CT_wins"] = 0
    team_map_data[x + "_CT_losses"] = 0
    team_map_data[x + "_T_wins"] = 0
    team_map_data[x + "_T_losses"] = 0
    team_map_data[x + "_CT_winrate"] = 0
    team_map_data[x + "_T_winrate"] = 0
    team_map_data[x + "_game_count"] = 0

#df of teams and their map data, empty
team_map = pd.DataFrame(team_map_data)
#define a function to compute the desired raw stats
team_map_elements = ["_CT_wins", "_CT_losses", "_T_wins", "_T_losses"]

def map_data_updater(team_name, score_headers, map_title, index):
    #pair each wins/losses column suffix with the matching half-score column
    for a, b in zip(team_map_elements, score_headers):
        team_map.loc[team_map.Team == team_name, map_title + a] += map_data[b][index]
    team_map.loc[team_map.Team == team_name, map_title + "_game_count"] += 1
for x in map_data.index:
    team1_name = map_data["T1_name"][x]
    team2_name = map_data["T2_name"][x]
    map_name = map_data["Map"][x]

    #call on map_data_updater to fill out scores as desired
    if (map_data["T1_side"][x] == "CT"):
        score_lst = ["T1_score_H1", "T2_score_H1", "T1_score_H2", "T2_score_H2"]
    else:
        score_lst = ["T1_score_H2", "T2_score_H2", "T1_score_H1", "T2_score_H1"]
    map_data_updater(team1_name, score_lst, map_name, x)
    map_data_updater(team2_name, score_lst[::-1], map_name, x)
#calculating each team's win rate on each map
#each map owns a block of 7 columns (4 raw counts, 2 win rates, 1 game count),
#so walk the blocks with a stride of 7 and fill in the two win-rate columns
var = 1
for x in range(9):
    team_map.iloc[:,var + 4] = team_map.iloc[:,var]/(team_map.iloc[:,var] + team_map.iloc[:,var + 1])
    team_map.iloc[:,var + 5] = team_map.iloc[:,var + 2]/(team_map.iloc[:,var + 2] + team_map.iloc[:,var + 3])
    var = var + 7

team_map = team_map.fillna(0)
team_map
Team Overpass_CT_wins Overpass_CT_losses Overpass_T_wins Overpass_T_losses Overpass_CT_winrate Overpass_T_winrate Overpass_game_count Mirage_CT_wins Mirage_CT_losses ... Cache_CT_winrate Cache_T_winrate Cache_game_count Cobblestone_CT_wins Cobblestone_CT_losses Cobblestone_T_wins Cobblestone_T_losses Cobblestone_CT_winrate Cobblestone_T_winrate Cobblestone_game_count
0 Fiend 43 32 34 39 0.573333 0.465753 5 51 35 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
1 Entropiq 98 85 126 127 0.535519 0.498024 17 90 110 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
2 OG 243 189 161 169 0.562500 0.487879 30 343 332 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
3 BIG 498 525 426 449 0.486804 0.486857 72 668 519 ... 0.503356 0.436364 22 26 26 28 47 0.500000 0.373333 5
4 fnatic 483 493 457 503 0.494877 0.476042 76 795 630 ... 0.500000 0.537688 16 59 49 59 49 0.546296 0.546296 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
280 HOLLYWOOD 6 7 6 9 0.461538 0.400000 1 16 14 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
281 GoodJob 0 0 0 0 0.000000 0.000000 0 0 0 ... 0.000000 0.000000 0 6 9 8 7 0.400000 0.533333 1
282 l4nd0dg3 0 0 0 0 0.000000 0.000000 0 9 1 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
283 MANS NOT HOT 1 6 5 10 0.142857 0.333333 1 16 10 ... 0.733333 0.200000 2 2 13 0 3 0.133333 0.000000 1
284 eXtatus 0 0 0 0 0.000000 0.000000 0 6 9 ... 0.200000 0.428571 1 12 2 20 10 0.857143 0.666667 2

285 rows × 64 columns

With this new table of data/win rates, I want a way to quantify a team’s success rate on each respective map. This new quantifier will be called “map proficiency”, a score defined by:

map proficiency = (map CT win rate + map T win rate) × games played

This will give me a rough idea of how good each team is on a given map, as well as include the team’s experience in some way. In a tactical game like CS:GO, map experience can be a big determining factor in how a team performs.
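As a worked example, using BIG's Overpass numbers from the table further below:

#BIG on Overpass: CT win rate 0.486804, T win rate 0.486857, 72 games played
ct_wr, t_wr, games = 0.486804, 0.486857, 72
print((ct_wr + t_wr) * games)  #~70.10, matching BIG's Overpass_proficiency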

#map proficiency score calculation
for x in map_names:
    team_map[x + "_aggregate_percentage"] = (team_map[x + "_CT_winrate"] + team_map[x + "_T_winrate"])

for x in map_names:
    team_map[x + "_proficiency"] = team_map[x + "_aggregate_percentage"] * team_map[x + "_game_count"]
team_map
Team Overpass_CT_wins Overpass_CT_losses Overpass_T_wins Overpass_T_losses Overpass_CT_winrate Overpass_T_winrate Overpass_game_count Mirage_CT_wins Mirage_CT_losses ... Cobblestone_aggregate_percentage Overpass_proficiency Mirage_proficiency Vertigo_proficiency Dust2_proficiency Inferno_proficiency Nuke_proficiency Train_proficiency Cache_proficiency Cobblestone_proficiency
0 Fiend 43 32 34 39 0.573333 0.465753 5 51 35 ... 0.000000 5.195434 9.544186 1.422222 1.704762 4.757543 1.478788 0.000000 0.000000 0.000000
1 Entropiq 98 85 126 127 0.535519 0.498024 17 90 110 ... 0.000000 17.570228 13.300000 10.435231 16.613333 0.000000 3.378378 2.786047 0.000000 0.000000
2 OG 243 189 161 169 0.562500 0.487879 30 343 332 ... 0.000000 31.511364 58.661478 0.000000 62.494161 88.417969 54.718422 40.566433 0.000000 0.000000
3 BIG 498 525 426 449 0.486804 0.486857 72 668 519 ... 0.873333 70.103568 100.677380 43.323614 171.005794 104.796536 122.914967 39.999217 20.673826 4.366667
4 fnatic 483 493 457 503 0.494877 0.476042 76 795 630 ... 1.092593 73.789822 115.963039 28.208878 53.843385 163.420422 72.651557 76.657212 16.603015 8.740741
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
280 HOLLYWOOD 6 7 6 9 0.461538 0.400000 1 16 14 ... 0.000000 0.861538 1.266667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
281 GoodJob 0 0 0 0 0.000000 0.000000 0 0 0 ... 0.933333 0.000000 0.000000 0.000000 0.000000 1.866667 0.000000 0.000000 0.000000 0.933333
282 l4nd0dg3 0 0 0 0 0.000000 0.000000 0 9 1 ... 0.000000 0.000000 1.366667 0.000000 0.000000 0.000000 0.000000 2.742857 0.000000 0.000000
283 MANS NOT HOT 1 6 5 10 0.142857 0.333333 1 16 10 ... 0.133333 0.476190 1.764103 0.000000 0.000000 0.000000 0.000000 0.000000 1.866667 0.133333
284 eXtatus 0 0 0 0 0.000000 0.000000 0 6 9 ... 1.523810 0.000000 0.900000 0.000000 0.000000 2.000000 0.000000 0.000000 0.628571 3.047619

285 rows × 82 columns

Now I want to chart proficiency ratings against T or CT win rates, depending on which side a map favors (note: Inferno is incredibly even, so the choice of T/CT makes almost no difference there). This gives a visual indication of the approximate range in which the best teams, as identified by their higher proficiency scores, perform.

In these graphs I am electing to put either the T or the CT win rate along the x-axis according to each map's bias, since this gives the best indication of how a team performs when it has the advantage. I tried the opposite, but since those win rates are lower, the plot is much more scattered and difficult to read.

def make_proficiency_chart(maps):
    counter = map_side.index[map_side["Map"] == maps]
    if(map_side.at[counter[0],"T_or_CT"] == "CT"):
        phrase = "_CT_winrate"
    else: 
        phrase = "_T_winrate"

    c = alt.Chart(team_map).mark_circle().encode(
        x = alt.X(maps + phrase + ":Q",
            scale = alt.Scale(domain = (0.0, 1.0))),
        y = alt.Y(maps + "_proficiency:Q",
            scale = alt.Scale(domain = (0, 250))) 

    ).properties(
        title = f"{maps} winrate vs proficiency"
    )
    return c
alt.vconcat(*[make_proficiency_chart(k) for k in map_names])

A general trend can be observed where better teams tend to perform in accordance with each map's bias, whether that is T- or CT-sided. Inferno lines up right down the middle, illustrating its 50/50 win percentages. Cache and Cobblestone lack data, most likely because these maps were rotated out of the pool early in the time frame captured by this data set.

With the analysis of individual maps completed, along with the performance of each team on any given map, it is now time to augment a copy of the match_df dataframe so the newly constructed variables can be used to analyze the outcome of matches.

T1_pref_score: preference score for a given match, incremented by one for each of the match's maps on which the first team has an aggregate percentage over 1 (so 0 up to 3).
T1_proficiency: the first team's proficiency scores, as calculated earlier, summed over the match's maps.
T2_pref_score and T2_proficiency are the same quantities for the second team.
#initialize new columns
match_df_final = match_df.copy()
m_df_add = ["T1_pref_score", "T2_pref_score", "T1_proficiency", "T2_proficiency"]

for x in m_df_add:
    match_df_final[x] = 0
def map_pref_proficiency(team, map_name_lst):
    preference = 0
    proficiency = 0

    for x in map_name_lst:
        if (team_map.loc[team_map["Team"] == team, x + "_aggregate_percentage"].values[0] > 1):
            preference += 1
        proficiency += team_map.loc[team_map["Team"] == team, x + "_proficiency"].values[0]
    ret_arr = [preference, proficiency]
    return ret_arr
#this entry was causing an issue, M3 wasn't filled out correctly
match_df_final.loc[3119, "M3"] = "Cache"

#pd.options.mode.chained_assignment = None

for x in match_df_final.index:
    maps = [match_df_final["M1"][x], match_df_final["M2"][x], match_df_final["M3"][x]]
    team1 = match_df_final["first_team"][x]
    team2 = match_df_final["second_team"][x]
    T1_vals = []
    T2_vals = []

    T1_vals = map_pref_proficiency(team1, maps)
    T2_vals = map_pref_proficiency(team2, maps)
    
    match_df_final.at[x, "T1_pref_score"] = T1_vals[0]
    match_df_final.at[x, "T1_proficiency"] = T1_vals[1]
    match_df_final.at[x, "T2_pref_score"] = T2_vals[0]
    match_df_final.at[x, "T2_proficiency"] = T2_vals[1]
match_df_final
first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won M1 first_team_score_M1 second_team_score_M1 ... side_first_team_M3 side_second_team_M3 score_first_team_t1_M3 score_first_team_t2_M3 score_second_team_t1_M3 score_second_team_t2_M3 T1_pref_score T2_pref_score T1_proficiency T2_proficiency
0 Fiend K23 28 27 0 2 0 Overpass 16 19 ... - - - - - - 1 0 8 28
1 Entropiq BIG 15 8 0 2 0 Mirage 10 16 ... - - - - - - 2 3 40 315
2 Entropiq Nemiga 15 52 2 0 1 Vertigo 16 5 ... - - - - - - 2 1 40 60
3 OG Fiend 20 28 2 0 1 Mirage 16 11 ... - - - - - - 2 2 178 19
4 BIG Wisla Krakow 8 49 2 0 1 Dust2 16 8 ... - - - - - - 3 0 394 22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 FaZe Cloud9 2 5 2 0 1 Mirage 16 8 ... - - - - - - 3 1 249 129
3486 OpTic mousesports 12 15 0 2 0 Nuke 7 16 ... - - - - - - 0 3 125 271
3487 Cloud9 Liquid 5 10 2 0 1 Mirage 16 14 ... - - - - - - 1 3 120 326
3488 GODSENT AGO 28 21 0 2 0 Mirage 14 16 ... - - - - - - 3 0 89 103
3489 Luminosity mousesports 24 15 0 2 0 Train 9 16 ... - - - - - - 0 3 18 271

3490 rows × 38 columns

Below: calculating how likely a team is to win based purely on their global rank

#how likely will a team win based on their global rank?
wr_arr = ["first_team_world_rank_#", "second_team_world_rank_#", "first_team_won"]
wr_df = match_df_final[wr_arr].copy()
scaler = StandardScaler()

wr_df = wr_df[wr_df["first_team_world_rank_#"] != "Unranked"]
wr_df = wr_df[wr_df["second_team_world_rank_#"] != "Unranked"]
wr_df2 = wr_df.astype(int)

dfrnk = wr_df2[["first_team_world_rank_#", "second_team_world_rank_#"]].copy()
scaler.fit(dfrnk)
X_scaled = scaler.transform(dfrnk)
y = wr_df2["first_team_won"]

clf = KNeighborsClassifier(n_neighbors = 7)
clf.fit(X_scaled, y)
dfrnk["pred"] = clf.predict(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.2)
from sklearn.linear_model import LogisticRegression
cmf = LogisticRegression()
cmf.fit(X_train, y_train)
cmf.predict(X_test) == y_test
1862     True
918      True
1123    False
1252    False
2109     True
        ...  
75      False
1078     True
428      True
1762     True
1995     True
Name: first_team_won, Length: 689, dtype: bool
np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
0.7111756168359942
np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
0.7176043557168784

It appears that global ranking is a pretty good indicator of winning, as expected since the two should go hand in hand. (Note that clf was fit on all of X_scaled, so the test rows above were seen during training and these accuracies are likely optimistic.) This result is not very interesting on its own, but it can serve as a baseline.
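For comparison, a non-ML baseline on the same data is simply "the higher-ranked team wins" (a sketch, assuming a lower rank number means a stronger team):

#predict a first-team win whenever the first team has the better (smaller) world rank
naive_pred = (wr_df2["first_team_world_rank_#"] < wr_df2["second_team_world_rank_#"]).astype(int)
print((naive_pred == wr_df2["first_team_won"]).mean())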

#creating df to analyze with machine learning
analysis_df = match_df_final.iloc[:, np.r_[2:4, 34:38]].copy()
analysis_df["first_team_won"] = match_df_final["first_team_won"].copy()
aScaler = StandardScaler()

#remove rows w/ no world rank for a given team, only removes about 50 entries 
analysis_df = analysis_df[analysis_df["first_team_world_rank_#"] != "Unranked"]
analysis_df = analysis_df[analysis_df["second_team_world_rank_#"] != "Unranked"]

analysis_df_X = analysis_df.iloc[:,0:6]
analysis_df_X
first_team_world_rank_# second_team_world_rank_# T1_pref_score T2_pref_score T1_proficiency T2_proficiency
0 28 27 1 0 8 28
1 15 8 2 3 40 315
2 15 52 2 1 40 60
3 20 28 2 2 178 19
4 8 49 3 0 394 22
... ... ... ... ... ... ...
3485 2 5 3 1 249 129
3486 12 15 0 3 125 271
3487 5 10 1 3 120 326
3488 28 21 3 0 89 103
3489 24 15 0 3 18 271

3444 rows × 6 columns

from sklearn.metrics import mean_absolute_error
#scaling values of analysis_df_X
aScaler.fit(analysis_df_X)
X_scaled = aScaler.transform(analysis_df_X)
y = analysis_df["first_team_won"]
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.15)

Below is mostly code found in class, used to graph the errors for different K values in KNeighborsClassifier. I like this code since it gives a good representation of which values of K would do best, provided the data set is good.

def score_generator(k):
    cmf = KNeighborsClassifier(n_neighbors = k)
    cmf.fit(X_scaled, y)  #note: fit on the full data set, not just the training split (see discussion below)
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)
#define a df with one row per k value (1 through 199) and empty error columns
df_scores = pd.DataFrame({"k":range(1,200),"train_error":np.nan,"test_error":np.nan})
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = score_generator(df_scores.loc[i,"k"])
#plot against 1/k rather than k, and create graphs for the training and test errors
df_scores["1/k"] = 1/df_scores.k

KNtrain = alt.Chart(df_scores).mark_line().encode(
    x = "1/k",
    y = "train_error"
)
KNtest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "1/k",
    y = "test_error"
)
#combine the graphs
KNtrain + KNtest

Looking at the graph, it doesn't seem to bode well for this data set. Test error and training error both start at near zero; this is because score_generator fits the classifier on the full data set (X_scaled, y) rather than just the training split, so every test point was seen during training and is its own nearest neighbor at k = 1. What is more concerning is that both sets increase in error as K increases.
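For reference, here is a variant of score_generator that avoids that leakage by fitting only on the training split (a sketch; its curves would differ from the graph above):

def score_generator_no_leak(k):
    cmf = KNeighborsClassifier(n_neighbors = k)
    cmf.fit(X_train, y_train)  #fit on the training rows only
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)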

Now, I want to try using a different machine learning method, this time with random forests.

from sklearn.ensemble import RandomForestClassifier

Defining another score generator so I can graph the outputs for different numbers of trees (here k is n_estimators, not a neighbor count).

def score_generator_forest(k):
    rfc = RandomForestClassifier(n_estimators = k, max_features=None, max_depth=None, min_samples_split=2)
    rfc.fit(X_train, y_train)
    train_error = mean_absolute_error(rfc.predict(X_train), y_train)
    test_error = mean_absolute_error(rfc.predict(X_test), y_test)
    return (train_error, test_error)
df_scores_forest = pd.DataFrame({"k":range(1,50),"train_error":np.nan,"test_error":np.nan})
for i in df_scores_forest.index:
    df_scores_forest.loc[i,["train_error","test_error"]] = score_generator_forest(df_scores_forest.loc[i,"k"])

This time I’m opting not to scale the values of K

RFtrain = alt.Chart(df_scores_forest).mark_line().encode(
    x = "k",
    y = "train_error"
)
RFtest = alt.Chart(df_scores_forest).mark_line(color="orange").encode(
    x = "k",
    y = "test_error"
)
RFtrain + RFtest

This graph more closely resembles the results I would expect, as the test error is always higher than the training error. The random forest model seems to level off at around 20 trees, with the test error plateauing around 35% (so roughly 65% accuracy on average at 20+ trees). It appears that neither machine learning method fared especially well for the match analysis I wanted to do, as the feature-engineered columns seem to introduce more noise than clarity.
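One way to double-check the roughly 65% accuracy figure without relying on a single train/test split is the forest's built-in out-of-bag estimate (a sketch; n_estimators = 50 is an arbitrary choice past the plateau):

#each tree is evaluated on the rows left out of its bootstrap sample
rfc = RandomForestClassifier(n_estimators = 50, oob_score = True, random_state = 0)
rfc.fit(X_train, y_train)
print(rfc.oob_score_)  #out-of-bag accuracy estimate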

Summary

I set out to try and predict the match outcomes of professional CS:GO matches, and I started by feature engineering columns of data to analyze.

First was calculating the raw match data, so I could then compare that to each respective team’s performance. With that comparison, I could determine if a team was “proficient” or not to some degree on a given map, and that was assigned a score based on their combined win/loss percentage multiplied by their number of games on that given map.

I thought multiplying those values would give a more accurate reflection of how well a team would perform on that map. I finally funneled all of that data back into a copy of the original dataframe.

The machine learning results were not as good as expected, seeing as the random forest was around 65% accurate while depending on a team's world rank alone was around 70% accurate. I think the training set size wasn't the issue, but rather the data itself. The metrics I chose could be good predictors of how teams would perform in a best-of-one series, but at the highest echelons of professional play, CS:GO is played in best-of-three series.

This means that one of my feature-engineered columns, the map preference score, grossly oversimplifies the complexity of how teams pick and ban maps, and there may be a better way to compute a total aggregate map proficiency value than what I did. Finding a way to incorporate the teams themselves into the machine learning could also have been better than relying on world ranking alone, since world rank on its own doesn't account for each team's particular strengths and weaknesses.
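As a sketch of that last idea (not something tried in this project), team identities could be one-hot encoded alongside the existing features:

#join team names back onto the numeric features, then one-hot encode them
feat = analysis_df_X.join(match_df_final[["first_team", "second_team"]])
X_teams = pd.get_dummies(feat, columns = ["first_team", "second_team"])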

References

Dataset for CS:GO found on Kaggle, which was imported from HLTV

Kaggle integration on DeepNote: https://deepnote.com/@dhruvildave/Kaggle-heouwNORROiS3aTQFvbklg

Map pick/ban order in competitive CSGO (BO3s): https://help.challengermode.com/en/articles/684985-how-to-ban-and-pick-cs-go-maps

Using np.r_ indexer: https://stackoverflow.com/questions/45985877/slicing-multiple-column-ranges-from-a-dataframe-using-iloc

Graphical representation of different KNeighborsRegressor values: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
