Predicting match outcomes in professional Counter-Strike: Global Offensive

Author: Ryan Wei

Course Project, UC Irvine, Math 10, W22

Introduction

Counter-Strike: Global Offensive (CS:GO) is a tactical 5v5 first-person shooter video game with a thriving esports scene. Professional matches are typically played as a Best of 3 series, where teams take turns picking and banning maps from a predetermined pool. Each map has a T side and a CT side, and teams switch sides halfway through. The first team to win 16 rounds takes the map and earns a point in the Best of 3 series.

The goal of this project is to see if I can predict the outcome of matches based on a number of variables: the global rank of each team, their preference for the maps picked, and their strength on those maps.

Main portion of the project

import pandas as pd
import numpy as np
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
!kaggle datasets download -d viniciusromanosilva/csgo-hltv --unzip
Downloading csgo-hltv.zip to /work
df = pd.read_csv("cs_hltv_data.csv")
df
Unnamed: 0 date time first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won ... first_pick_by_first_team ban 1 ban 2 pick 1 pick 2 ban 3 ban 4 pick 3 url timestamp
0 0 2021-06-02 10:00 Fiend K23 28 27 0 2 0 ... 0 Mirage Vertigo Overpass Nuke Inferno Train Dust2 https://www.hltv.org/matches/2349155/fiend-vs-... 1.622628e+09
1 1 2021-06-02 10:00 Entropiq BIG 15 8 0 2 0 ... 1 Inferno Overpass Mirage Vertigo Train Nuke Dust2 https://www.hltv.org/matches/2348945/entropiq-... 1.622628e+09
2 2 2021-06-02 07:00 Entropiq Nemiga 15 52 2 0 1 ... 0 Train Inferno Vertigo Dust2 Overpass Nuke Mirage https://www.hltv.org/matches/2349184/entropiq-... 1.622617e+09
3 3 2021-06-02 07:00 OG Fiend 20 28 2 0 1 ... 0 Nuke Vertigo Mirage Inferno Dust2 Train Overpass https://www.hltv.org/matches/2348944/og-vs-fie... 1.622617e+09
4 4 2021-06-01 14:15 BIG Wisla Krakow 8 49 2 0 1 ... 1 Ancient Vertigo Dust2 Nuke Inferno Overpass Mirage https://www.hltv.org/matches/2349144/big-vs-wi... 1.622557e+09
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 3485 2017-12-16 20:35 FaZe Cloud9 2 5 2 0 1 ... 0 Nuke Cobblestone Mirage Overpass Inferno Train Cache https://www.hltv.org/matches/2317944/faze-vs-c... 1.513456e+09
3486 3486 2017-12-16 18:05 OpTic mousesports 12 15 0 2 0 ... 0 Inferno Cobblestone Nuke Train Mirage Cache Overpass https://www.hltv.org/matches/2317943/optic-vs-... 1.513448e+09
3487 3487 2017-12-16 15:15 Cloud9 Liquid 5 10 2 0 1 ... 1 Nuke Train Mirage Inferno Cache Overpass Cobblestone https://www.hltv.org/matches/2317942/cloud9-vs... 1.513437e+09
3488 3488 2017-12-16 13:00 GODSENT AGO 28 21 0 2 0 ... 1 Nuke Cache Mirage Cobblestone Overpass Inferno Train https://www.hltv.org/matches/2318180/godsent-v... 1.513429e+09
3489 3489 2017-12-16 12:15 Luminosity mousesports 24 15 0 2 0 ... 1 Mirage Cache Train Nuke Cobblestone Inferno Overpass https://www.hltv.org/matches/2317941/luminosit... 1.513426e+09

3490 rows × 157 columns

I realized that the data in score_second_team_t2_M1 (through M3) are all truncated for some reason: instead of showing the full score range, the values are limited to 0-9. This data will need to be reverse engineered from the map totals.
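A quick way to confirm the truncation (a sketch; the "-" placeholders for unplayed maps are coerced to NaN first):

#if these columns really are truncated, every per-column max will be at most 9
t2_cols = [c for c in df.columns if c.startswith("score_second_team_t2")]
print(df[t2_cols].apply(pd.to_numeric, errors="coerce").max())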

#df for the outcomes of respective best of 3 matches
match_df = df.iloc[:,3:37].copy()
match_df
first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won M1 first_team_score_M1 second_team_score_M1 ... score_first_team_t1_M2 score_first_team_t2_M2 score_second_team_t1_M2 score_second_team_t2_M2 side_first_team_M3 side_second_team_M3 score_first_team_t1_M3 score_first_team_t2_M3 score_second_team_t1_M3 score_second_team_t2_M3
0 Fiend K23 28 27 0 2 0 Overpass 16 19 ... 7 3 8 8 - - - - - -
1 Entropiq BIG 15 8 0 2 0 Mirage 10 16 ... 9 2 6 1 - - - - - -
2 Entropiq Nemiga 15 52 2 0 1 Vertigo 16 5 ... 10 5 5 1 - - - - - -
3 OG Fiend 20 28 2 0 1 Mirage 16 11 ... 10 6 5 4 - - - - - -
4 BIG Wisla Krakow 8 49 2 0 1 Dust2 16 8 ... 8 8 7 5 - - - - - -
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 FaZe Cloud9 2 5 2 0 1 Mirage 16 8 ... 11 5 4 7 - - - - - -
3486 OpTic mousesports 12 15 0 2 0 Nuke 7 16 ... 3 9 12 4 - - - - - -
3487 Cloud9 Liquid 5 10 2 0 1 Mirage 16 14 ... 6 10 9 1 - - - - - -
3488 GODSENT AGO 28 21 0 2 0 Mirage 14 16 ... 10 3 5 1 - - - - - -
3489 Luminosity mousesports 24 15 0 2 0 Train 9 16 ... 2 1 13 3 - - - - - -

3490 rows × 34 columns

#build a new df consisting of just pure game data, isolating individual maps
map_data_cols = ["T1_name", "T2_name", "Map", "T1_score", "T2_score", "T1_side", "T2_side", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]

#pick out wanted columns, then rename
map_data = match_df.iloc[:, np.r_[0:2, 7:10, 16:22]].copy()
map_data.columns = map_data_cols
#employ np.r_ to slice out the desired columns, then append to map_data after aligning column names
map_data_temp = match_df.iloc[:, np.r_[0:2, 10:13, 22:28]].copy()
map_data_temp.columns = map_data_cols
map_data = map_data.append(map_data_temp, ignore_index = True)

map_data_temp2 = match_df.iloc[:, np.r_[0:2, 13:16, 28:34]].copy()
map_data_temp2.columns = map_data_cols
map_data = map_data.append(map_data_temp2, ignore_index = True)
#T1 = first team, T2 = second team, H1 = first half, H2 = second half
map_data
T1_name T2_name Map T1_score T2_score T1_side T2_side T1_score_H1 T1_score_H2 T2_score_H1 T2_score_H2
0 Fiend K23 Overpass 16 19 CT T 9 6 6 9
1 Entropiq BIG Mirage 10 16 CT T 7 3 8 8
2 Entropiq Nemiga Vertigo 16 5 CT T 10 6 5 0
3 OG Fiend Mirage 16 11 CT T 9 7 6 5
4 BIG Wisla Krakow Dust2 16 8 T CT 13 3 2 6
... ... ... ... ... ... ... ... ... ... ... ...
10465 FaZe Cloud9 Overpass 16 11 CT T 11 5 4 7
10466 OpTic mousesports Train 12 16 CT T 3 9 12 4
10467 Cloud9 Liquid Inferno 16 10 T CT 6 10 9 1
10468 GODSENT AGO Cobblestone 13 16 CT T 10 3 5 1
10469 Luminosity mousesports Nuke 3 16 CT T 2 1 13 3

10470 rows × 11 columns
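Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on newer pandas the stacking above can be done with pd.concat instead. A minimal equivalent sketch:

import pandas as pd
import numpy as np

#build the same three column blocks as above, then stack them in one pd.concat call
parts = []
for cols in (np.r_[0:2, 7:10, 16:22], np.r_[0:2, 10:13, 22:28], np.r_[0:2, 13:16, 28:34]):
    part = match_df.iloc[:, cols].copy()
    part.columns = map_data_cols
    parts.append(part)
map_data = pd.concat(parts, ignore_index=True)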

#the score columns mix integers with "-" placeholders (Map 3 is not played when a team wins 2-0),
    #so filter out the "-" rows, then cast the score columns to int
map_data = map_data[map_data["T1_score"].astype(str) != "-"].copy()
score_cols = ["T1_score", "T2_score", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]
map_data[score_cols] = map_data[score_cols].astype(int)
#reconstruct T2_score_H2 from the map totals, since the raw column is truncated
#if the second team won in overtime, regulation ended 15-15, so cap their regulation rounds at 15
for x in map_data.index:
    if (map_data["T1_score"][x] + map_data["T2_score"][x] > 30) and (map_data["T2_score"][x] > 16):
        map_data.at[x, "T2_score_H2"] = 15 - map_data["T2_score_H1"][x]
    else:
        map_data.at[x, "T2_score_H2"] = map_data["T2_score"][x] - map_data["T2_score_H1"][x]
map_data
T1_name T2_name Map T1_score T2_score T1_side T2_side T1_score_H1 T1_score_H2 T2_score_H1 T2_score_H2
0 Fiend K23 Overpass 16 19 CT T 9 6 6 9
1 Entropiq BIG Mirage 10 16 CT T 7 3 8 8
2 Entropiq Nemiga Vertigo 16 5 CT T 10 6 5 0
3 OG Fiend Mirage 16 11 CT T 9 7 6 5
4 BIG Wisla Krakow Dust2 16 8 T CT 13 3 2 6
... ... ... ... ... ... ... ... ... ... ... ...
10465 FaZe Cloud9 Overpass 16 11 CT T 11 5 4 7
10466 OpTic mousesports Train 12 16 CT T 3 9 12 4
10467 Cloud9 Liquid Inferno 16 10 T CT 6 10 9 1
10468 GODSENT AGO Cobblestone 13 16 CT T 10 3 5 1
10469 Luminosity mousesports Nuke 3 16 CT T 2 1 13 3

10470 rows × 11 columns
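A quick sanity check on the adjustment (a sketch): with overtime rounds removed, a team can be credited with at most 16 regulation rounds.

#every adjusted half-score pair should sum to at most 16 regulation rounds
assert ((map_data["T2_score_H1"] + map_data["T2_score_H2"]) <= 16).all()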

After fixing that column of data, we can proceed to calculating whether each map is T- or CT-sided. As a check, row 0's Overpass game totals 16 + 19 = 35 rounds, so it went to overtime, and the second team's regulation second half is reconstructed as 15 - 6 = 9, matching the table.

map_names = map_data.Map.unique()
map_side_data = {"Map": map_names,
    "CT_wins": 0,
    "T_wins": 0
}
map_side = pd.DataFrame(map_side_data, columns = ["Map", "CT_wins", "T_wins"])
#fill out map_side df with raw wins/losses
CT = "CT_wins"
T = "T_wins"
for x in map_data.index:
    m_name = map_data["Map"][x]
    if (map_data["T1_side"][x] == "CT"):
        map_side.loc[map_side.Map == m_name, f"{CT}"] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
        map_side.loc[map_side.Map == m_name, f"{T}"] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
    else:
        map_side.loc[map_side.Map == m_name, f"{CT}"] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
        map_side.loc[map_side.Map == m_name, f"{T}"] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
#calculate win rate percentages
map_side["CT_winrate"] = map_side["CT_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
map_side["T_winrate"] = map_side["T_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
#flag maps as T or CT
map_side["T_or_CT"] = ""

for x in map_side.index:
    if (map_side.iloc[x]["CT_winrate"] >= 0.5):
        map_side.at[x,"T_or_CT"] = "CT"
    else:
        map_side.at[x,"T_or_CT"] = "T"
map_side
Map CT_wins T_wins CT_winrate T_winrate T_or_CT
0 Overpass 16828 15633 0.518407 0.481593 CT
1 Mirage 24299 22449 0.519787 0.480213 CT
2 Vertigo 7827 8158 0.489647 0.510353 T
3 Dust2 19014 20090 0.486242 0.513758 T
4 Inferno 24701 24724 0.499767 0.500233 T
5 Nuke 20403 17510 0.538153 0.461847 CT
6 Train 17161 14301 0.545452 0.454548 CT
7 Cache 3877 4229 0.478288 0.521712 T
8 Cobblestone 1746 1826 0.488802 0.511198 T
alt.Chart(map_side).mark_bar().encode(
    x = "Map",
    y = alt.Y("CT_winrate:Q",
        scale = alt.Scale(domain = (.45, .55)))
).properties(
    title = "Map CT win rates"
)

From the data above, we can see that maps like Nuke, Train, Mirage, and Overpass are more heavily CT-sided, while maps like Vertigo, Dust2, Cache, and Cobblestone are more T-sided.

Inferno is quite balanced, with almost a perfectly even 50% win rate.

With the data on which side is more likely to win on each map, we now want to calculate how each team performs on each side of each map. That can be used to determine how proficient a team is on a given map, which in turn serves as another metric for predicting the outcome of a match.

#create an array with all unique team names
team_names = np.concatenate([map_data.T1_name.unique(), map_data.T2_name.unique()])
team_names = pd.unique(team_names)

#dictionary to convert into df for individual teams and their success rate on each map
team_map_data = {"Team": team_names}
for x in map_names:
    team_map_data[x + "_CT_wins"] = 0
    team_map_data[x + "_CT_losses"] = 0
    team_map_data[x + "_T_wins"] = 0
    team_map_data[x + "_T_losses"] = 0
    team_map_data[x + "_CT_winrate"] = 0
    team_map_data[x + "_T_winrate"] = 0
    team_map_data[x + "_game_count"] = 0

#df of teams and their map data, empty
team_map = pd.DataFrame(team_map_data)
#define a function to compute the desired raw stats
team_map_elements = ["_CT_wins", "_CT_losses", "_T_wins", "_T_losses"]

def map_data_updater(team_name, score_headers, map_title, index):
    #pair each wins/losses column suffix with the matching half-score column
    for a, b in zip(team_map_elements, score_headers):
        team_map.loc[team_map.Team == team_name, map_title + a] += map_data[b][index]
    team_map.loc[team_map.Team == team_name, map_title + "_game_count"] += 1
for x in map_data.index:
    team1_name = map_data["T1_name"][x]
    team2_name = map_data["T2_name"][x]
    map_name = map_data["Map"][x]

    #call on map_data_updater to fill out scores as desired
    if (map_data["T1_side"][x] == "CT"):
        score_lst = ["T1_score_H1", "T2_score_H1", "T1_score_H2", "T2_score_H2"]
    else:
        score_lst = ["T1_score_H2", "T2_score_H2", "T1_score_H1", "T2_score_H1"]
    map_data_updater(team1_name, score_lst, map_name, x)
    map_data_updater(team2_name, score_lst[::-1], map_name, x)
#calculating each team's win rate on each map
#each map owns a block of 7 columns (4 raw counts, 2 win rates, 1 game count),
#so walk the blocks with a stride of 7 and fill in the two win-rate columns
var = 1
for x in range(9):
    team_map.iloc[:,var + 4] = team_map.iloc[:,var]/(team_map.iloc[:,var] + team_map.iloc[:,var + 1])
    team_map.iloc[:,var + 5] = team_map.iloc[:,var + 2]/(team_map.iloc[:,var + 2] + team_map.iloc[:,var + 3])
    var = var + 7

team_map = team_map.fillna(0)
team_map
Team Overpass_CT_wins Overpass_CT_losses Overpass_T_wins Overpass_T_losses Overpass_CT_winrate Overpass_T_winrate Overpass_game_count Mirage_CT_wins Mirage_CT_losses ... Cache_CT_winrate Cache_T_winrate Cache_game_count Cobblestone_CT_wins Cobblestone_CT_losses Cobblestone_T_wins Cobblestone_T_losses Cobblestone_CT_winrate Cobblestone_T_winrate Cobblestone_game_count
0 Fiend 43 32 34 39 0.573333 0.465753 5 51 35 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
1 Entropiq 98 85 126 127 0.535519 0.498024 17 90 110 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
2 OG 243 189 161 169 0.562500 0.487879 30 343 332 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
3 BIG 498 525 426 449 0.486804 0.486857 72 668 519 ... 0.503356 0.436364 22 26 26 28 47 0.500000 0.373333 5
4 fnatic 483 493 457 503 0.494877 0.476042 76 795 630 ... 0.500000 0.537688 16 59 49 59 49 0.546296 0.546296 8
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
280 HOLLYWOOD 6 7 6 9 0.461538 0.400000 1 16 14 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
281 GoodJob 0 0 0 0 0.000000 0.000000 0 0 0 ... 0.000000 0.000000 0 6 9 8 7 0.400000 0.533333 1
282 l4nd0dg3 0 0 0 0 0.000000 0.000000 0 9 1 ... 0.000000 0.000000 0 0 0 0 0 0.000000 0.000000 0
283 MANS NOT HOT 1 6 5 10 0.142857 0.333333 1 16 10 ... 0.733333 0.200000 2 2 13 0 3 0.133333 0.000000 1
284 eXtatus 0 0 0 0 0.000000 0.000000 0 6 9 ... 0.200000 0.428571 1 12 2 20 10 0.857143 0.666667 2

285 rows × 64 columns

With this new table of data/win rates, I want a way to quantify a team’s success rate on each respective map. This new quantifier will be called “map proficiency”, a score defined by:

map proficiency = (map CT win rate + map T win rate) × games played

This will give me a rough idea of how good each team is on a given map, as well as include the team’s experience in some way. In a tactical game like CS:GO, map experience can be a big determining factor in how a team performs.
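As a worked example, using BIG's Overpass numbers from the table further below:

#BIG on Overpass: CT win rate 0.486804, T win rate 0.486857, 72 games played
ct_wr, t_wr, games = 0.486804, 0.486857, 72
print((ct_wr + t_wr) * games)  #~70.10, matching BIG's Overpass_proficiency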

#map proficiency score calculation
for x in map_names:
    team_map[x + "_aggregate_percentage"] = (team_map[x + "_CT_winrate"] + team_map[x + "_T_winrate"])

for x in map_names:
    team_map[x + "_proficiency"] = team_map[x + "_aggregate_percentage"] * team_map[x + "_game_count"]
team_map
Team Overpass_CT_wins Overpass_CT_losses Overpass_T_wins Overpass_T_losses Overpass_CT_winrate Overpass_T_winrate Overpass_game_count Mirage_CT_wins Mirage_CT_losses ... Cobblestone_aggregate_percentage Overpass_proficiency Mirage_proficiency Vertigo_proficiency Dust2_proficiency Inferno_proficiency Nuke_proficiency Train_proficiency Cache_proficiency Cobblestone_proficiency
0 Fiend 43 32 34 39 0.573333 0.465753 5 51 35 ... 0.000000 5.195434 9.544186 1.422222 1.704762 4.757543 1.478788 0.000000 0.000000 0.000000
1 Entropiq 98 85 126 127 0.535519 0.498024 17 90 110 ... 0.000000 17.570228 13.300000 10.435231 16.613333 0.000000 3.378378 2.786047 0.000000 0.000000
2 OG 243 189 161 169 0.562500 0.487879 30 343 332 ... 0.000000 31.511364 58.661478 0.000000 62.494161 88.417969 54.718422 40.566433 0.000000 0.000000
3 BIG 498 525 426 449 0.486804 0.486857 72 668 519 ... 0.873333 70.103568 100.677380 43.323614 171.005794 104.796536 122.914967 39.999217 20.673826 4.366667
4 fnatic 483 493 457 503 0.494877 0.476042 76 795 630 ... 1.092593 73.789822 115.963039 28.208878 53.843385 163.420422 72.651557 76.657212 16.603015 8.740741
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
280 HOLLYWOOD 6 7 6 9 0.461538 0.400000 1 16 14 ... 0.000000 0.861538 1.266667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
281 GoodJob 0 0 0 0 0.000000 0.000000 0 0 0 ... 0.933333 0.000000 0.000000 0.000000 0.000000 1.866667 0.000000 0.000000 0.000000 0.933333
282 l4nd0dg3 0 0 0 0 0.000000 0.000000 0 9 1 ... 0.000000 0.000000 1.366667 0.000000 0.000000 0.000000 0.000000 2.742857 0.000000 0.000000
283 MANS NOT HOT 1 6 5 10 0.142857 0.333333 1 16 10 ... 0.133333 0.476190 1.764103 0.000000 0.000000 0.000000 0.000000 0.000000 1.866667 0.133333
284 eXtatus 0 0 0 0 0.000000 0.000000 0 6 9 ... 1.523810 0.000000 0.900000 0.000000 0.000000 2.000000 0.000000 0.000000 0.628571 3.047619

285 rows × 82 columns

Now I want to chart proficiency ratings against T or CT win rates, depending on which side a map favors (note: Inferno is incredibly even, so the choice of T/CT makes almost no difference there). This gives a visual indication of the approximate range in which the best teams, as identified by their higher proficiency scores, perform.

In these graphs I am electing to put either the T or the CT win rate along the x-axis according to each map's bias, since this gives the best indication of how a team performs when it has the advantage. I tried the opposite, but since those win rates are lower, the plot is much more scattered and difficult to read.

def make_proficiency_chart(maps):
    counter = map_side.index[map_side["Map"] == maps]
    if(map_side.at[counter[0],"T_or_CT"] == "CT"):
        phrase = "_CT_winrate"
    else: 
        phrase = "_T_winrate"

    c = alt.Chart(team_map).mark_circle().encode(
        x = alt.X(maps + phrase + ":Q",
            scale = alt.Scale(domain = (0.0, 1.0))),
        y = alt.Y(maps + "_proficiency:Q",
            scale = alt.Scale(domain = (0, 250))) 

    ).properties(
        title = f"{maps} winrate vs proficiency"
    )
    return c
alt.vconcat(*[make_proficiency_chart(k) for k in map_names])

A general trend can be observed where better teams tend to perform in accordance with each map's bias, whether that is T- or CT-sided. Inferno lines up right down the middle, illustrating its 50/50 win percentages. Cache and Cobblestone lack data, most likely because these maps were rotated out of the pool early in the time frame captured by this data set.

With the analysis of individual maps completed, along with the performance of each team on any given map, it is now time to augment a copy of the match_df dataframe so the newly constructed variables can be used to analyze the outcome of matches.

T1_pref_score: preference score for a given match, incremented by one for each of the match's maps on which the first team has an aggregate percentage over 1 (so 0 up to 3).
T1_proficiency: the first team's proficiency scores, as calculated earlier, summed over the match's maps.
T2_pref_score and T2_proficiency are the same quantities for the second team.
#initialize new columns
match_df_final = match_df.copy()
m_df_add = ["T1_pref_score", "T2_pref_score", "T1_proficiency", "T2_proficiency"]

for x in m_df_add:
    match_df_final[x] = 0
def map_pref_proficiency(team, map_name_lst):
    preference = 0
    proficiency = 0

    for x in map_name_lst:
        if (team_map.loc[team_map["Team"] == team, x + "_aggregate_percentage"].values[0] > 1):
            preference += 1
        proficiency += team_map.loc[team_map["Team"] == team, x + "_proficiency"].values[0]
    ret_arr = [preference, proficiency]
    return ret_arr
#this entry was causing an issue, M3 wasn't filled out correctly
match_df_final.loc[3119, "M3"] = "Cache"

#pd.options.mode.chained_assignment = None

for x in match_df_final.index:
    maps = [match_df_final["M1"][x], match_df_final["M2"][x], match_df_final["M3"][x]]
    team1 = match_df_final["first_team"][x]
    team2 = match_df_final["second_team"][x]
    T1_vals = []
    T2_vals = []

    T1_vals = map_pref_proficiency(team1, maps)
    T2_vals = map_pref_proficiency(team2, maps)
    
    match_df_final.at[x, "T1_pref_score"] = T1_vals[0]
    match_df_final.at[x, "T1_proficiency"] = T1_vals[1]
    match_df_final.at[x, "T2_pref_score"] = T2_vals[0]
    match_df_final.at[x, "T2_proficiency"] = T2_vals[1]
match_df_final
first_team second_team first_team_world_rank_# second_team_world_rank_# first_team_total_score second_team_total_score first_team_won M1 first_team_score_M1 second_team_score_M1 ... side_first_team_M3 side_second_team_M3 score_first_team_t1_M3 score_first_team_t2_M3 score_second_team_t1_M3 score_second_team_t2_M3 T1_pref_score T2_pref_score T1_proficiency T2_proficiency
0 Fiend K23 28 27 0 2 0 Overpass 16 19 ... - - - - - - 1 0 8 28
1 Entropiq BIG 15 8 0 2 0 Mirage 10 16 ... - - - - - - 2 3 40 315
2 Entropiq Nemiga 15 52 2 0 1 Vertigo 16 5 ... - - - - - - 2 1 40 60
3 OG Fiend 20 28 2 0 1 Mirage 16 11 ... - - - - - - 2 2 178 19
4 BIG Wisla Krakow 8 49 2 0 1 Dust2 16 8 ... - - - - - - 3 0 394 22
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3485 FaZe Cloud9 2 5 2 0 1 Mirage 16 8 ... - - - - - - 3 1 249 129
3486 OpTic mousesports 12 15 0 2 0 Nuke 7 16 ... - - - - - - 0 3 125 271
3487 Cloud9 Liquid 5 10 2 0 1 Mirage 16 14 ... - - - - - - 1 3 120 326
3488 GODSENT AGO 28 21 0 2 0 Mirage 14 16 ... - - - - - - 3 0 89 103
3489 Luminosity mousesports 24 15 0 2 0 Train 9 16 ... - - - - - - 0 3 18 271

3490 rows × 38 columns

Below: calculating how likely a team is to win based purely on their global rank

#how likely will a team win based on their global rank?
wr_arr = ["first_team_world_rank_#", "second_team_world_rank_#", "first_team_won"]
wr_df = match_df_final[wr_arr].copy()
scaler = StandardScaler()

wr_df = wr_df[wr_df["first_team_world_rank_#"] != "Unranked"]
wr_df = wr_df[wr_df["second_team_world_rank_#"] != "Unranked"]
wr_df2 = wr_df.astype(int)

dfrnk = wr_df2[["first_team_world_rank_#", "second_team_world_rank_#"]].copy()
scaler.fit(dfrnk)
X_scaled = scaler.transform(dfrnk)
y = wr_df2["first_team_won"]

clf = KNeighborsClassifier(n_neighbors = 7)
clf.fit(X_scaled, y)
dfrnk["pred"] = clf.predict(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.2)
from sklearn.linear_model import LogisticRegression
cmf = LogisticRegression()
cmf.fit(X_train, y_train)
cmf.predict(X_test) == y_test
1862     True
918      True
1123    False
1252    False
2109     True
        ...  
75      False
1078     True
428      True
1762     True
1995     True
Name: first_team_won, Length: 689, dtype: bool
np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
0.7111756168359942
np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
0.7176043557168784

It appears that global ranking is a pretty good indicator of winning, as expected since the two should go hand in hand. (Note that clf was fit on all of X_scaled, so the test rows above were seen during training and these accuracies are likely optimistic.) This result is not very interesting on its own, but it can serve as a baseline.
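For comparison, a non-ML baseline on the same data is simply "the higher-ranked team wins" (a sketch, assuming a lower rank number means a stronger team):

#predict a first-team win whenever the first team has the better (smaller) world rank
naive_pred = (wr_df2["first_team_world_rank_#"] < wr_df2["second_team_world_rank_#"]).astype(int)
print((naive_pred == wr_df2["first_team_won"]).mean())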

#creating df to analyze with machine learning
analysis_df = match_df_final.iloc[:, np.r_[2:4, 34:38]].copy()
analysis_df["first_team_won"] = match_df_final["first_team_won"].copy()
aScaler = StandardScaler()

#remove rows w/ no world rank for a given team, only removes about 50 entries 
analysis_df = analysis_df[analysis_df["first_team_world_rank_#"] != "Unranked"]
analysis_df = analysis_df[analysis_df["second_team_world_rank_#"] != "Unranked"]

analysis_df_X = analysis_df.iloc[:,0:6]
analysis_df_X
first_team_world_rank_# second_team_world_rank_# T1_pref_score T2_pref_score T1_proficiency T2_proficiency
0 28 27 1 0 8 28
1 15 8 2 3 40 315
2 15 52 2 1 40 60
3 20 28 2 2 178 19
4 8 49 3 0 394 22
... ... ... ... ... ... ...
3485 2 5 3 1 249 129
3486 12 15 0 3 125 271
3487 5 10 1 3 120 326
3488 28 21 3 0 89 103
3489 24 15 0 3 18 271

3444 rows × 6 columns

from sklearn.metrics import mean_absolute_error
#scaling values of analysis_df_X
aScaler.fit(analysis_df_X)
X_scaled = aScaler.transform(analysis_df_X)
y = analysis_df["first_team_won"]
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.15)

Below is mostly code found in class, used to graph the errors for different K values in KNeighborsClassifier. I like this code since it gives a good representation of which values of K would do best, provided the data set is good.

def score_generator(k):
    cmf = KNeighborsClassifier(n_neighbors = k)
    cmf.fit(X_scaled, y)  #note: fit on the full data set, not just the training split (see discussion below)
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)
#define a df with one row per k value (1 through 199) and empty error columns
df_scores = pd.DataFrame({"k":range(1,200),"train_error":np.nan,"test_error":np.nan})
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = score_generator(df_scores.loc[i,"k"])
#plot against 1/k rather than k, and create graphs for the training and test errors
df_scores["1/k"] = 1/df_scores.k

KNtrain = alt.Chart(df_scores).mark_line().encode(
    x = "1/k",
    y = "train_error"
)
KNtest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "1/k",
    y = "test_error"
)
#combine the graphs
KNtrain + KNtest

Looking at the graph, it doesn't seem to bode well for this data set. Test error and training error both start at near zero; this is because score_generator fits the classifier on the full data set (X_scaled, y) rather than just the training split, so every test point was seen during training and is its own nearest neighbor at k = 1. What is more concerning is that both sets increase in error as K increases.
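For reference, here is a variant of score_generator that avoids that leakage by fitting only on the training split (a sketch; its curves would differ from the graph above):

def score_generator_no_leak(k):
    cmf = KNeighborsClassifier(n_neighbors = k)
    cmf.fit(X_train, y_train)  #fit on the training rows only
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)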

Now, I want to try using a different machine learning method, this time with random forests.

from sklearn.ensemble import RandomForestClassifier

Defining another score generator so I can graph the outputs for different numbers of trees (here k is n_estimators, not a neighbor count).

def score_generator_forest(k):
    rfc = RandomForestClassifier(n_estimators = k, max_features=None, max_depth=None, min_samples_split=2)
    rfc.fit(X_train, y_train)
    train_error = mean_absolute_error(rfc.predict(X_train), y_train)
    test_error = mean_absolute_error(rfc.predict(X_test), y_test)
    return (train_error, test_error)
df_scores_forest = pd.DataFrame({"k":range(1,50),"train_error":np.nan,"test_error":np.nan})
for i in df_scores_forest.index:
    df_scores_forest.loc[i,["train_error","test_error"]] = score_generator_forest(df_scores_forest.loc[i,"k"])

This time I’m opting not to scale the values of K

RFtrain = alt.Chart(df_scores_forest).mark_line().encode(
    x = "k",
    y = "train_error"
)
RFtest = alt.Chart(df_scores_forest).mark_line(color="orange").encode(
    x = "k",
    y = "test_error"
)
RFtrain + RFtest

This graph more closely resembles the results I would expect, as the test error is always higher than the training error. The random forest model seems to level off at around 20 trees, with the test error plateauing around 35% (so roughly 65% accuracy on average at 20+ trees). It appears that neither machine learning method fared especially well for the match analysis I wanted to do, as the feature-engineered columns seem to introduce more noise than clarity.
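One way to double-check the roughly 65% accuracy figure without relying on a single train/test split is the forest's built-in out-of-bag estimate (a sketch; n_estimators = 50 is an arbitrary choice past the plateau):

#each tree is evaluated on the rows left out of its bootstrap sample
rfc = RandomForestClassifier(n_estimators = 50, oob_score = True, random_state = 0)
rfc.fit(X_train, y_train)
print(rfc.oob_score_)  #out-of-bag accuracy estimate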

Summary

I set out to try and predict the match outcomes of professional CS:GO matches, and I started by feature engineering columns of data to analyze.

First was calculating the raw match data, so I could then compare that to each respective team’s performance. With that comparison, I could determine if a team was “proficient” or not to some degree on a given map, and that was assigned a score based on their combined win/loss percentage multiplied by their number of games on that given map.

I thought multiplying those values would give a more accurate reflection of how well a team would perform on that map. I finally funneled all of that data back into a copy of the original dataframe.

The machine learning results were not as good as expected, seeing as the random forest was around 65% accurate while depending on a team's world rank alone was around 70% accurate. I think the training set size wasn't the issue, but rather the data itself. The metrics I chose could be good predictors of how teams would perform in a best-of-one series, but at the highest echelons of professional play, CS:GO is played in best-of-three series.

This means that one of my feature-engineered columns, the map preference score, grossly oversimplifies the complexity of how teams pick and ban maps, and there may be a better way to compute a total aggregate map proficiency value than what I did. Finding a way to incorporate the teams themselves into the machine learning could also have been better than relying on world ranking alone, since world rank on its own doesn't account for each team's particular strengths and weaknesses.
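As a sketch of that last idea (not something tried in this project), team identities could be one-hot encoded alongside the existing features:

#join team names back onto the numeric features, then one-hot encode them
feat = analysis_df_X.join(match_df_final[["first_team", "second_team"]])
X_teams = pd.get_dummies(feat, columns = ["first_team", "second_team"])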

References

Dataset for CS:GO found on Kaggle, which was imported from HLTV

Kaggle integration on DeepNote: https://deepnote.com/@dhruvildave/Kaggle-heouwNORROiS3aTQFvbklg

Map pick/ban order in competitive CSGO (BO3s): https://help.challengermode.com/en/articles/684985-how-to-ban-and-pick-cs-go-maps

Using np.r_ indexer: https://stackoverflow.com/questions/45985877/slicing-multiple-column-ranges-from-a-dataframe-using-iloc

Graphical representation of different KNeighborsRegressor values: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
