Predicting match outcomes in professional Counter-Strike: Global Offensive¶
Author: Ryan Wei
Course Project, UC Irvine, Math 10, W22
Introduction¶
Counter-Strike: Global Offensive (CSGO) is a tactical 5v5 first-person shooter video game with a thriving esports scene. Matches are played as a Best of 3 series, where teams take turns picking and banning maps from a predetermined pool. Each map has a T side and a CT side, with teams switching sides halfway through. The first team to 16 rounds wins that map and takes a point in the Best of 3 series.
The goal of this project is to see whether I can predict the outcome of matches based on a number of variables: the global rank of each team, their preference for the maps picked, and their strength on those maps.
Main portion of the project¶
import pandas as pd
import numpy as np
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
!kaggle datasets download -d viniciusromanosilva/csgo-hltv --unzip
Downloading csgo-hltv.zip to /work
df = pd.read_csv("cs_hltv_data.csv")
df
Unnamed: 0 | date | time | first_team | second_team | first_team_world_rank_# | second_team_world_rank_# | first_team_total_score | second_team_total_score | first_team_won | ... | first_pick_by_first_team | ban 1 | ban 2 | pick 1 | pick 2 | ban 3 | ban 4 | pick 3 | url | timestamp | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 2021-06-02 | 10:00 | Fiend | K23 | 28 | 27 | 0 | 2 | 0 | ... | 0 | Mirage | Vertigo | Overpass | Nuke | Inferno | Train | Dust2 | https://www.hltv.org/matches/2349155/fiend-vs-... | 1.622628e+09 |
1 | 1 | 2021-06-02 | 10:00 | Entropiq | BIG | 15 | 8 | 0 | 2 | 0 | ... | 1 | Inferno | Overpass | Mirage | Vertigo | Train | Nuke | Dust2 | https://www.hltv.org/matches/2348945/entropiq-... | 1.622628e+09 |
2 | 2 | 2021-06-02 | 07:00 | Entropiq | Nemiga | 15 | 52 | 2 | 0 | 1 | ... | 0 | Train | Inferno | Vertigo | Dust2 | Overpass | Nuke | Mirage | https://www.hltv.org/matches/2349184/entropiq-... | 1.622617e+09 |
3 | 3 | 2021-06-02 | 07:00 | OG | Fiend | 20 | 28 | 2 | 0 | 1 | ... | 0 | Nuke | Vertigo | Mirage | Inferno | Dust2 | Train | Overpass | https://www.hltv.org/matches/2348944/og-vs-fie... | 1.622617e+09 |
4 | 4 | 2021-06-01 | 14:15 | BIG | Wisla Krakow | 8 | 49 | 2 | 0 | 1 | ... | 1 | Ancient | Vertigo | Dust2 | Nuke | Inferno | Overpass | Mirage | https://www.hltv.org/matches/2349144/big-vs-wi... | 1.622557e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3485 | 3485 | 2017-12-16 | 20:35 | FaZe | Cloud9 | 2 | 5 | 2 | 0 | 1 | ... | 0 | Nuke | Cobblestone | Mirage | Overpass | Inferno | Train | Cache | https://www.hltv.org/matches/2317944/faze-vs-c... | 1.513456e+09 |
3486 | 3486 | 2017-12-16 | 18:05 | OpTic | mousesports | 12 | 15 | 0 | 2 | 0 | ... | 0 | Inferno | Cobblestone | Nuke | Train | Mirage | Cache | Overpass | https://www.hltv.org/matches/2317943/optic-vs-... | 1.513448e+09 |
3487 | 3487 | 2017-12-16 | 15:15 | Cloud9 | Liquid | 5 | 10 | 2 | 0 | 1 | ... | 1 | Nuke | Train | Mirage | Inferno | Cache | Overpass | Cobblestone | https://www.hltv.org/matches/2317942/cloud9-vs... | 1.513437e+09 |
3488 | 3488 | 2017-12-16 | 13:00 | GODSENT | AGO | 28 | 21 | 0 | 2 | 0 | ... | 1 | Nuke | Cache | Mirage | Cobblestone | Overpass | Inferno | Train | https://www.hltv.org/matches/2318180/godsent-v... | 1.513429e+09 |
3489 | 3489 | 2017-12-16 | 12:15 | Luminosity | mousesports | 24 | 15 | 0 | 2 | 0 | ... | 1 | Mirage | Cache | Train | Nuke | Cobblestone | Inferno | Overpass | https://www.hltv.org/matches/2317941/luminosit... | 1.513426e+09 |
3490 rows × 157 columns
I realized that the data in score_second_team_t2_M1 (through M3) are all truncated for some reason, so instead of showing the proper score range they are limited to 0-9. This data will need to be reverse engineered.
#df for the outcomes of respective best of 3 matches
match_df = df.iloc[:,3:37].copy()
match_df
first_team | second_team | first_team_world_rank_# | second_team_world_rank_# | first_team_total_score | second_team_total_score | first_team_won | M1 | first_team_score_M1 | second_team_score_M1 | ... | score_first_team_t1_M2 | score_first_team_t2_M2 | score_second_team_t1_M2 | score_second_team_t2_M2 | side_first_team_M3 | side_second_team_M3 | score_first_team_t1_M3 | score_first_team_t2_M3 | score_second_team_t1_M3 | score_second_team_t2_M3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | K23 | 28 | 27 | 0 | 2 | 0 | Overpass | 16 | 19 | ... | 7 | 3 | 8 | 8 | - | - | - | - | - | - |
1 | Entropiq | BIG | 15 | 8 | 0 | 2 | 0 | Mirage | 10 | 16 | ... | 9 | 2 | 6 | 1 | - | - | - | - | - | - |
2 | Entropiq | Nemiga | 15 | 52 | 2 | 0 | 1 | Vertigo | 16 | 5 | ... | 10 | 5 | 5 | 1 | - | - | - | - | - | - |
3 | OG | Fiend | 20 | 28 | 2 | 0 | 1 | Mirage | 16 | 11 | ... | 10 | 6 | 5 | 4 | - | - | - | - | - | - |
4 | BIG | Wisla Krakow | 8 | 49 | 2 | 0 | 1 | Dust2 | 16 | 8 | ... | 8 | 8 | 7 | 5 | - | - | - | - | - | - |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3485 | FaZe | Cloud9 | 2 | 5 | 2 | 0 | 1 | Mirage | 16 | 8 | ... | 11 | 5 | 4 | 7 | - | - | - | - | - | - |
3486 | OpTic | mousesports | 12 | 15 | 0 | 2 | 0 | Nuke | 7 | 16 | ... | 3 | 9 | 12 | 4 | - | - | - | - | - | - |
3487 | Cloud9 | Liquid | 5 | 10 | 2 | 0 | 1 | Mirage | 16 | 14 | ... | 6 | 10 | 9 | 1 | - | - | - | - | - | - |
3488 | GODSENT | AGO | 28 | 21 | 0 | 2 | 0 | Mirage | 14 | 16 | ... | 10 | 3 | 5 | 1 | - | - | - | - | - | - |
3489 | Luminosity | mousesports | 24 | 15 | 0 | 2 | 0 | Train | 9 | 16 | ... | 2 | 1 | 13 | 3 | - | - | - | - | - | - |
3490 rows × 34 columns
#build a new df consisting of just pure game data, isolating individual maps
map_data_cols = ["T1_name", "T2_name", "Map", "T1_score", "T2_score", "T1_side", "T2_side", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]
#pick out the wanted columns, then rename
map_data = match_df.iloc[:, np.r_[0:2, 7:10, 16:22]].copy()
map_data.columns = map_data_cols
#employ np.r_ to splice out the desired columns, then concatenate onto map_data after matching up column names
map_data_temp = match_df.iloc[:, np.r_[0:2, 10:13, 22:28]].copy()
map_data_temp.columns = map_data_cols
map_data = pd.concat([map_data, map_data_temp], ignore_index = True)
map_data_temp2 = match_df.iloc[:, np.r_[0:2, 13:16, 28:34]].copy()
map_data_temp2.columns = map_data_cols
map_data = pd.concat([map_data, map_data_temp2], ignore_index = True)
#T1 = first team, T2 = second team, H1 = first half, H2 = second half
map_data
T1_name | T2_name | Map | T1_score | T2_score | T1_side | T2_side | T1_score_H1 | T1_score_H2 | T2_score_H1 | T2_score_H2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | K23 | Overpass | 16 | 19 | CT | T | 9 | 6 | 6 | 9 |
1 | Entropiq | BIG | Mirage | 10 | 16 | CT | T | 7 | 3 | 8 | 8 |
2 | Entropiq | Nemiga | Vertigo | 16 | 5 | CT | T | 10 | 6 | 5 | 0 |
3 | OG | Fiend | Mirage | 16 | 11 | CT | T | 9 | 7 | 6 | 5 |
4 | BIG | Wisla Krakow | Dust2 | 16 | 8 | T | CT | 13 | 3 | 2 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10465 | FaZe | Cloud9 | Overpass | 16 | 11 | CT | T | 11 | 5 | 4 | 7 |
10466 | OpTic | mousesports | Train | 12 | 16 | CT | T | 3 | 9 | 12 | 4 |
10467 | Cloud9 | Liquid | Inferno | 16 | 10 | T | CT | 6 | 10 | 9 | 1 |
10468 | GODSENT | AGO | Cobblestone | 13 | 16 | CT | T | 10 | 3 | 5 | 1 |
10469 | Luminosity | mousesports | Nuke | 3 | 16 | CT | T | 2 | 1 | 13 | 3 |
10470 rows × 11 columns
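An aside on np.r_, since it does the heavy lifting in the slicing above: it concatenates its slice arguments into one flat index array, which is what lets .iloc grab several separate column ranges at once. A quick illustration:

np.r_[0:2, 7:10]
array([0, 1, 7, 8, 9])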
#rows with "-" scores are Map 3 entries: when a Best of 3 ends 2-0, the third map isn't played
map_data = map_data[map_data["T1_score"] != "-"].copy()
#with those rows gone, the score columns can be cast to integers for the arithmetic below
score_cols = ["T1_score", "T2_score", "T1_score_H1", "T1_score_H2", "T2_score_H1", "T2_score_H2"]
map_data[score_cols] = map_data[score_cols].astype(int)
#reverse engineer the truncated T2_score_H2 column from the map totals
#if a map went to overtime (more than 30 rounds played), regulation ended 15-15,
#so the second team's regulation second-half score is 15 minus its first-half score;
#otherwise it is simply the team's total minus its first-half score
for x in map_data.index:
    if map_data["T1_score"][x] + map_data["T2_score"][x] > 30:
        map_data.at[x, "T2_score_H2"] = 15 - map_data["T2_score_H1"][x]
    else:
        map_data.at[x, "T2_score_H2"] = map_data["T2_score"][x] - map_data["T2_score_H1"][x]
map_data
T1_name | T2_name | Map | T1_score | T2_score | T1_side | T2_side | T1_score_H1 | T1_score_H2 | T2_score_H1 | T2_score_H2 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | K23 | Overpass | 16 | 19 | CT | T | 9 | 6 | 6 | 9 |
1 | Entropiq | BIG | Mirage | 10 | 16 | CT | T | 7 | 3 | 8 | 8 |
2 | Entropiq | Nemiga | Vertigo | 16 | 5 | CT | T | 10 | 6 | 5 | 0 |
3 | OG | Fiend | Mirage | 16 | 11 | CT | T | 9 | 7 | 6 | 5 |
4 | BIG | Wisla Krakow | Dust2 | 16 | 8 | T | CT | 13 | 3 | 2 | 6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
10465 | FaZe | Cloud9 | Overpass | 16 | 11 | CT | T | 11 | 5 | 4 | 7 |
10466 | OpTic | mousesports | Train | 12 | 16 | CT | T | 3 | 9 | 12 | 4 |
10467 | Cloud9 | Liquid | Inferno | 16 | 10 | T | CT | 6 | 10 | 9 | 1 |
10468 | GODSENT | AGO | Cobblestone | 13 | 16 | CT | T | 10 | 3 | 5 | 1 |
10469 | Luminosity | mousesports | Nuke | 3 | 16 | CT | T | 2 | 1 | 13 | 3 |
10470 rows × 11 columns
After fixing that column of data, we can proceed to calculating whether each map is T- or CT-sided.
map_names = map_data.Map.unique()
map_side_data = {"Map": map_names,
"CT_wins": 0,
"T_wins": 0
}
map_side = pd.DataFrame(map_side_data, columns = ["Map", "CT_wins", "T_wins"])
#fill out map_side df with raw wins/losses (rounds won on each side)
CT = "CT_wins"
T = "T_wins"
for x in map_data.index:
    m_name = map_data["Map"][x]
    if map_data["T1_side"][x] == "CT":
        map_side.loc[map_side.Map == m_name, CT] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
        map_side.loc[map_side.Map == m_name, T] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
    else:
        map_side.loc[map_side.Map == m_name, CT] += map_data["T1_score_H2"][x] + map_data["T2_score_H1"][x]
        map_side.loc[map_side.Map == m_name, T] += map_data["T1_score_H1"][x] + map_data["T2_score_H2"][x]
#calculate win rate percentages
map_side["CT_winrate"] = map_side["CT_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
map_side["T_winrate"] = map_side["T_wins"] / (map_side["CT_wins"] + map_side["T_wins"])
#flag maps as T or CT
map_side["T_or_CT"] = ""
for x in map_side.index:
    if map_side.at[x, "CT_winrate"] >= 0.5:
        map_side.at[x, "T_or_CT"] = "CT"
    else:
        map_side.at[x, "T_or_CT"] = "T"
map_side
Map | CT_wins | T_wins | CT_winrate | T_winrate | T_or_CT | |
---|---|---|---|---|---|---|
0 | Overpass | 16828 | 15633 | 0.518407 | 0.481593 | CT |
1 | Mirage | 24299 | 22449 | 0.519787 | 0.480213 | CT |
2 | Vertigo | 7827 | 8158 | 0.489647 | 0.510353 | T |
3 | Dust2 | 19014 | 20090 | 0.486242 | 0.513758 | T |
4 | Inferno | 24701 | 24724 | 0.499767 | 0.500233 | T |
5 | Nuke | 20403 | 17510 | 0.538153 | 0.461847 | CT |
6 | Train | 17161 | 14301 | 0.545452 | 0.454548 | CT |
7 | Cache | 3877 | 4229 | 0.478288 | 0.521712 | T |
8 | Cobblestone | 1746 | 1826 | 0.488802 | 0.511198 | T |
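As an aside, the row-by-row loop above can be vectorized. The sketch below (reusing the column names from above; side_totals is a new name) should reproduce the same CT/T round totals with np.where and a groupby:

#tally CT and T rounds per row, depending on which side the first team started
t1_ct = map_data["T1_side"] == "CT"
ct_rounds = np.where(t1_ct,
                     map_data["T1_score_H1"] + map_data["T2_score_H2"],
                     map_data["T1_score_H2"] + map_data["T2_score_H1"])
t_rounds = np.where(t1_ct,
                    map_data["T1_score_H2"] + map_data["T2_score_H1"],
                    map_data["T1_score_H1"] + map_data["T2_score_H2"])
#sum the per-row tallies by map
side_totals = (pd.DataFrame({"Map": map_data["Map"], "CT_wins": ct_rounds, "T_wins": t_rounds})
               .groupby("Map", as_index = False).sum())

This avoids the per-row .loc updates, which are the slow part of the loop.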
alt.Chart(map_side).mark_bar().encode(
x = "Map",
y = alt.Y("CT_winrate:Q",
scale = alt.Scale(domain = (.45, .55)))
).properties(
title = "Map CT win rates"
)
From the data above, we can see that maps like Nuke, Train, Mirage, and Overpass are more heavily CT-sided, while maps like Vertigo, Dust2, Cache, and Cobblestone are more T-sided.
Inferno is quite balanced, with an almost perfectly even 50% win rate.
With the data on which side is more likely to win each map, we now want to calculate how each team performs on each side of each map. That can be used to gauge whether a team is proficient on a given map, which in turn serves as another metric for predicting the outcome of a match.
#create an array with all unique team names
team_names = np.concatenate([map_data.T1_name.unique(), map_data.T2_name.unique()])
team_names = pd.unique(team_names)
#dictionary to convert into df for individual teams and their success rate on each map
team_map_data = {"Team": team_names}
for x in map_names:
    team_map_data[x + "_CT_wins"] = 0
    team_map_data[x + "_CT_losses"] = 0
    team_map_data[x + "_T_wins"] = 0
    team_map_data[x + "_T_losses"] = 0
    team_map_data[x + "_CT_winrate"] = 0
    team_map_data[x + "_T_winrate"] = 0
    team_map_data[x + "_game_count"] = 0
#df of teams and their map data, empty
team_map = pd.DataFrame(team_map_data)
#define a function to compute the desired raw stats
team_map_elements = ["_CT_wins", "_CT_losses", "_T_wins", "_T_losses"]
def map_data_updater(team_name, score_headers, map_title, index):
    for a, b in zip(team_map_elements, score_headers):
        team_map.loc[team_map.Team == team_name, map_title + a] += map_data[b][index]
    team_map.loc[team_map.Team == team_name, map_title + "_game_count"] += 1
for x in map_data.index:
    team1_name = map_data["T1_name"][x]
    team2_name = map_data["T2_name"][x]
    map_name = map_data["Map"][x]
    #order the half scores as CT wins, CT losses, T wins, T losses for the first team,
    #then call map_data_updater for both teams (the order is reversed for the second team)
    if map_data["T1_side"][x] == "CT":
        score_lst = ["T1_score_H1", "T2_score_H1", "T1_score_H2", "T2_score_H2"]
    else:
        score_lst = ["T1_score_H2", "T2_score_H2", "T1_score_H1", "T2_score_H1"]
    map_data_updater(team1_name, score_lst, map_name, x)
    map_data_updater(team2_name, score_lst[::-1], map_name, x)
#calculating each team's win rate on each map
#each map occupies a block of 7 columns (CT_wins, CT_losses, T_wins, T_losses,
#CT_winrate, T_winrate, game_count), so the winrate columns sit at offsets 4 and 5
var = 1
for x in range(9):
    team_map.iloc[:, var + 4] = team_map.iloc[:, var] / (team_map.iloc[:, var] + team_map.iloc[:, var + 1])
    team_map.iloc[:, var + 5] = team_map.iloc[:, var + 2] / (team_map.iloc[:, var + 2] + team_map.iloc[:, var + 3])
    var = var + 7
team_map = team_map.fillna(0)
team_map
Team | Overpass_CT_wins | Overpass_CT_losses | Overpass_T_wins | Overpass_T_losses | Overpass_CT_winrate | Overpass_T_winrate | Overpass_game_count | Mirage_CT_wins | Mirage_CT_losses | ... | Cache_CT_winrate | Cache_T_winrate | Cache_game_count | Cobblestone_CT_wins | Cobblestone_CT_losses | Cobblestone_T_wins | Cobblestone_T_losses | Cobblestone_CT_winrate | Cobblestone_T_winrate | Cobblestone_game_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | 43 | 32 | 34 | 39 | 0.573333 | 0.465753 | 5 | 51 | 35 | ... | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 |
1 | Entropiq | 98 | 85 | 126 | 127 | 0.535519 | 0.498024 | 17 | 90 | 110 | ... | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 |
2 | OG | 243 | 189 | 161 | 169 | 0.562500 | 0.487879 | 30 | 343 | 332 | ... | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 |
3 | BIG | 498 | 525 | 426 | 449 | 0.486804 | 0.486857 | 72 | 668 | 519 | ... | 0.503356 | 0.436364 | 22 | 26 | 26 | 28 | 47 | 0.500000 | 0.373333 | 5 |
4 | fnatic | 483 | 493 | 457 | 503 | 0.494877 | 0.476042 | 76 | 795 | 630 | ... | 0.500000 | 0.537688 | 16 | 59 | 49 | 59 | 49 | 0.546296 | 0.546296 | 8 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
280 | HOLLYWOOD | 6 | 7 | 6 | 9 | 0.461538 | 0.400000 | 1 | 16 | 14 | ... | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 |
281 | GoodJob | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 0 | 0 | ... | 0.000000 | 0.000000 | 0 | 6 | 9 | 8 | 7 | 0.400000 | 0.533333 | 1 |
282 | l4nd0dg3 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 9 | 1 | ... | 0.000000 | 0.000000 | 0 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 |
283 | MANS NOT HOT | 1 | 6 | 5 | 10 | 0.142857 | 0.333333 | 1 | 16 | 10 | ... | 0.733333 | 0.200000 | 2 | 2 | 13 | 0 | 3 | 0.133333 | 0.000000 | 1 |
284 | eXtatus | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 6 | 9 | ... | 0.200000 | 0.428571 | 1 | 12 | 2 | 20 | 10 | 0.857143 | 0.666667 | 2 |
285 rows × 64 columns
With this new table of data/win rates, I want a way to quantify a team’s success rate on each respective map. This new quantifier will be called “map proficiency”, a score defined by:
(Map CT win rate + Map T win rate) * games played
This will give me a rough idea of how good each team is on a given map, as well as include the team’s experience in some way. In a tactical game like CS:GO, map experience can be a big determining factor in how a team performs.
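For example, a team with a 0.55 CT win rate and a 0.48 T win rate over 20 games on a map would get a proficiency of (0.55 + 0.48) * 20 = 20.6 there, while a team with identical win rates over only 5 games would score (0.55 + 0.48) * 5 = 5.15.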
#map proficiency score calculation
for x in map_names:
    team_map[x + "_aggregate_percentage"] = team_map[x + "_CT_winrate"] + team_map[x + "_T_winrate"]
for x in map_names:
    team_map[x + "_proficiency"] = team_map[x + "_aggregate_percentage"] * team_map[x + "_game_count"]
team_map
Team | Overpass_CT_wins | Overpass_CT_losses | Overpass_T_wins | Overpass_T_losses | Overpass_CT_winrate | Overpass_T_winrate | Overpass_game_count | Mirage_CT_wins | Mirage_CT_losses | ... | Cobblestone_aggregate_percentage | Overpass_proficiency | Mirage_proficiency | Vertigo_proficiency | Dust2_proficiency | Inferno_proficiency | Nuke_proficiency | Train_proficiency | Cache_proficiency | Cobblestone_proficiency | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | 43 | 32 | 34 | 39 | 0.573333 | 0.465753 | 5 | 51 | 35 | ... | 0.000000 | 5.195434 | 9.544186 | 1.422222 | 1.704762 | 4.757543 | 1.478788 | 0.000000 | 0.000000 | 0.000000 |
1 | Entropiq | 98 | 85 | 126 | 127 | 0.535519 | 0.498024 | 17 | 90 | 110 | ... | 0.000000 | 17.570228 | 13.300000 | 10.435231 | 16.613333 | 0.000000 | 3.378378 | 2.786047 | 0.000000 | 0.000000 |
2 | OG | 243 | 189 | 161 | 169 | 0.562500 | 0.487879 | 30 | 343 | 332 | ... | 0.000000 | 31.511364 | 58.661478 | 0.000000 | 62.494161 | 88.417969 | 54.718422 | 40.566433 | 0.000000 | 0.000000 |
3 | BIG | 498 | 525 | 426 | 449 | 0.486804 | 0.486857 | 72 | 668 | 519 | ... | 0.873333 | 70.103568 | 100.677380 | 43.323614 | 171.005794 | 104.796536 | 122.914967 | 39.999217 | 20.673826 | 4.366667 |
4 | fnatic | 483 | 493 | 457 | 503 | 0.494877 | 0.476042 | 76 | 795 | 630 | ... | 1.092593 | 73.789822 | 115.963039 | 28.208878 | 53.843385 | 163.420422 | 72.651557 | 76.657212 | 16.603015 | 8.740741 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
280 | HOLLYWOOD | 6 | 7 | 6 | 9 | 0.461538 | 0.400000 | 1 | 16 | 14 | ... | 0.000000 | 0.861538 | 1.266667 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
281 | GoodJob | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 0 | 0 | ... | 0.933333 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.866667 | 0.000000 | 0.000000 | 0.000000 | 0.933333 |
282 | l4nd0dg3 | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 9 | 1 | ... | 0.000000 | 0.000000 | 1.366667 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.742857 | 0.000000 | 0.000000 |
283 | MANS NOT HOT | 1 | 6 | 5 | 10 | 0.142857 | 0.333333 | 1 | 16 | 10 | ... | 0.133333 | 0.476190 | 1.764103 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.866667 | 0.133333 |
284 | eXtatus | 0 | 0 | 0 | 0 | 0.000000 | 0.000000 | 0 | 6 | 9 | ... | 1.523810 | 0.000000 | 0.900000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.628571 | 3.047619 |
285 rows × 82 columns
Now I want to chart proficiency ratings against T or CT win rates, depending on which side a map favors (note: Inferno is so even that the choice makes almost no difference there). This gives a visual indication of the win-rate range in which the best teams, as identified by their higher proficiency scores, tend to perform.
For these graphs I am electing to put either the T or the CT win rate on the x-axis according to each map's bias, since this best indicates how a team performs when it has an advantage. I tried the opposite, but since those win rates are lower the points are much more scattered and harder to read.
def make_proficiency_chart(maps):
    counter = map_side.index[map_side["Map"] == maps]
    #pick the win rate column matching the side the map favors
    if map_side.at[counter[0], "T_or_CT"] == "CT":
        phrase = "_CT_winrate"
    else:
        phrase = "_T_winrate"
    c = alt.Chart(team_map).mark_circle().encode(
        x = alt.X(maps + phrase + ":Q",
                  scale = alt.Scale(domain = (0.0, 1.0))),
        y = alt.Y(maps + "_proficiency:Q",
                  scale = alt.Scale(domain = (0, 250)))
    ).properties(
        title = f"{maps} winrate vs proficiency"
    )
    return c
alt.vconcat(*[make_proficiency_chart(k) for k in map_names])
A general trend can be observed where better teams tend to perform in accordance with each map's bias, whether that is T- or CT-sided. Inferno lines up right down the middle, illustrating its 50/50 win percentages. Cache and Cobblestone have little data, most likely because those maps were rotated out of the pool early in the time frame captured by this data set.
With the analysis of individual maps completed, along with the performance of each team on any given map, it is now time to augment a copy of the match_df dataframe so the newly constructed variables can be used to analyze the outcome of matches.
T1_pref_score: preference score for a given match, incremented by one for each map of the series on which the first team's aggregate percentage is over 1 (up to 3); T2_pref_score is the analogue for the second team.
T1_proficiency: the sum of team 1's proficiency scores across the maps of the match, as calculated earlier; likewise T2_proficiency.
#initialize new columns
match_df_final = match_df.copy()
m_df_add = ["T1_pref_score", "T2_pref_score", "T1_proficiency", "T2_proficiency"]
for x in m_df_add:
    match_df_final[x] = 0
def map_pref_proficiency(team, map_name_lst):
    preference = 0
    proficiency = 0
    for x in map_name_lst:
        #a map counts toward preference when the team's aggregate percentage tops 1
        if team_map.loc[team_map["Team"] == team, x + "_aggregate_percentage"].values[0] > 1:
            preference += 1
        #proficiency accumulates across every map of the series
        proficiency += team_map.loc[team_map["Team"] == team, x + "_proficiency"].values[0]
    ret_arr = [preference, proficiency]
    return ret_arr
#this entry was causing an issue, M3 wasn't filled out correctly
match_df_final.loc[3119, "M3"] = "Cache"
for x in match_df_final.index:
    maps = [match_df_final["M1"][x], match_df_final["M2"][x], match_df_final["M3"][x]]
    team1 = match_df_final["first_team"][x]
    team2 = match_df_final["second_team"][x]
    T1_vals = map_pref_proficiency(team1, maps)
    T2_vals = map_pref_proficiency(team2, maps)
    match_df_final.at[x, "T1_pref_score"] = T1_vals[0]
    match_df_final.at[x, "T1_proficiency"] = T1_vals[1]
    match_df_final.at[x, "T2_pref_score"] = T2_vals[0]
    match_df_final.at[x, "T2_proficiency"] = T2_vals[1]
match_df_final
first_team | second_team | first_team_world_rank_# | second_team_world_rank_# | first_team_total_score | second_team_total_score | first_team_won | M1 | first_team_score_M1 | second_team_score_M1 | ... | side_first_team_M3 | side_second_team_M3 | score_first_team_t1_M3 | score_first_team_t2_M3 | score_second_team_t1_M3 | score_second_team_t2_M3 | T1_pref_score | T2_pref_score | T1_proficiency | T2_proficiency | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fiend | K23 | 28 | 27 | 0 | 2 | 0 | Overpass | 16 | 19 | ... | - | - | - | - | - | - | 1 | 0 | 8 | 28 |
1 | Entropiq | BIG | 15 | 8 | 0 | 2 | 0 | Mirage | 10 | 16 | ... | - | - | - | - | - | - | 2 | 3 | 40 | 315 |
2 | Entropiq | Nemiga | 15 | 52 | 2 | 0 | 1 | Vertigo | 16 | 5 | ... | - | - | - | - | - | - | 2 | 1 | 40 | 60 |
3 | OG | Fiend | 20 | 28 | 2 | 0 | 1 | Mirage | 16 | 11 | ... | - | - | - | - | - | - | 2 | 2 | 178 | 19 |
4 | BIG | Wisla Krakow | 8 | 49 | 2 | 0 | 1 | Dust2 | 16 | 8 | ... | - | - | - | - | - | - | 3 | 0 | 394 | 22 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3485 | FaZe | Cloud9 | 2 | 5 | 2 | 0 | 1 | Mirage | 16 | 8 | ... | - | - | - | - | - | - | 3 | 1 | 249 | 129 |
3486 | OpTic | mousesports | 12 | 15 | 0 | 2 | 0 | Nuke | 7 | 16 | ... | - | - | - | - | - | - | 0 | 3 | 125 | 271 |
3487 | Cloud9 | Liquid | 5 | 10 | 2 | 0 | 1 | Mirage | 16 | 14 | ... | - | - | - | - | - | - | 1 | 3 | 120 | 326 |
3488 | GODSENT | AGO | 28 | 21 | 0 | 2 | 0 | Mirage | 14 | 16 | ... | - | - | - | - | - | - | 3 | 0 | 89 | 103 |
3489 | Luminosity | mousesports | 24 | 15 | 0 | 2 | 0 | Train | 9 | 16 | ... | - | - | - | - | - | - | 0 | 3 | 18 | 271 |
3490 rows × 38 columns
Below: calculating how likely a team is to win based purely on its global rank.
#how likely will a team win based on its global rank alone?
wr_arr = ["first_team_world_rank_#", "second_team_world_rank_#", "first_team_won"]
wr_df = match_df_final[wr_arr].copy()
#drop matches where either team is unranked, then cast the remaining ranks to integers
wr_df = wr_df[wr_df["first_team_world_rank_#"] != "Unranked"]
wr_df = wr_df[wr_df["second_team_world_rank_#"] != "Unranked"]
wr_df2 = wr_df.astype(int)
dfrnk = wr_df2[["first_team_world_rank_#", "second_team_world_rank_#"]].copy()
#standardize the two rank columns before handing them to KNN
scaler = StandardScaler()
scaler.fit(dfrnk)
X_scaled = scaler.transform(dfrnk)
y = wr_df2["first_team_won"]
#note: clf is fit on the full data set, so the accuracies computed on X_test
#further below are somewhat optimistic, since the test points were seen during fitting
clf = KNeighborsClassifier(n_neighbors = 7)
clf.fit(X_scaled, y)
dfrnk["pred"] = clf.predict(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.2)
from sklearn.linear_model import LogisticRegression
#a logistic regression fit on the training split, for comparison
cmf = LogisticRegression()
cmf.fit(X_train, y_train)
cmf.predict(X_test) == y_test
1862 True
918 True
1123 False
1252 False
2109 True
...
75 False
1078 True
428 True
1762 True
1995 True
Name: first_team_won, Length: 689, dtype: bool
np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
0.7111756168359942
np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
0.7176043557168784
It appears that global ranking is a fairly good predictor of winning, as expected, since the two should go hand in hand. This result is not very interesting on its own, but it can serve as a baseline.
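As an even simpler reference point, here is a quick model-free sketch (rank_pred is a new name; it uses wr_df2 from above): always predict a win for the better-ranked team, i.e. the one with the lower rank number, giving ties to the first team.

#model-free baseline: predict that the better-ranked (lower #) team wins
rank_pred = (wr_df2["first_team_world_rank_#"] <= wr_df2["second_team_world_rank_#"]).astype(int)
#fraction of matches where the naive prediction matches the actual result
(rank_pred == wr_df2["first_team_won"]).mean()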
#creating df to analyze with machine learning
analysis_df = match_df_final.iloc[:, np.r_[2:4, 34:38]].copy()
analysis_df["first_team_won"] = match_df_final["first_team_won"].copy()
aScaler = StandardScaler()
#remove rows w/ no world rank for a given team, only removes about 50 entries
analysis_df = analysis_df[analysis_df["first_team_world_rank_#"] != "Unranked"]
analysis_df = analysis_df[analysis_df["second_team_world_rank_#"] != "Unranked"]
analysis_df_X = analysis_df.iloc[:,0:6]
analysis_df_X
first_team_world_rank_# | second_team_world_rank_# | T1_pref_score | T2_pref_score | T1_proficiency | T2_proficiency | |
---|---|---|---|---|---|---|
0 | 28 | 27 | 1 | 0 | 8 | 28 |
1 | 15 | 8 | 2 | 3 | 40 | 315 |
2 | 15 | 52 | 2 | 1 | 40 | 60 |
3 | 20 | 28 | 2 | 2 | 178 | 19 |
4 | 8 | 49 | 3 | 0 | 394 | 22 |
... | ... | ... | ... | ... | ... | ... |
3485 | 2 | 5 | 3 | 1 | 249 | 129 |
3486 | 12 | 15 | 0 | 3 | 125 | 271 |
3487 | 5 | 10 | 1 | 3 | 120 | 326 |
3488 | 28 | 21 | 3 | 0 | 89 | 103 |
3489 | 24 | 15 | 0 | 3 | 18 | 271 |
3444 rows × 6 columns
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsClassifier
#scaling values of analysis_df_X
aScaler.fit(analysis_df_X)
X_scaled = aScaler.transform(analysis_df_X)
y = analysis_df["first_team_won"]
X_train, X_test, y_train, y_test = train_test_split(X_scaled,y,test_size=0.15)
Below is mostly code found in class, used to graph error against different k values for KNeighborsClassifier. I liked this code since it gives a good representation of which values of k would do best, provided the data set is good.
def score_generator(k):
    cmf = KNeighborsClassifier(n_neighbors = k)
    #note: this fits on the full scaled data set, so the test points are part of
    #the fitted model; see the discussion below the graph
    cmf.fit(X_scaled, y)
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)
#df with one row per value of k (1 through 199) and empty error columns to fill in
df_scores = pd.DataFrame({"k":range(1,200),"train_error":np.nan,"test_error":np.nan})
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = score_generator(df_scores.loc[i,"k"])
#scale the values of K and create graphs for the training and test sets
df_scores["1/k"] = 1/df_scores.k
KNtrain = alt.Chart(df_scores).mark_line().encode(
x = "1/k",
y = "train_error"
)
KNtest = alt.Chart(df_scores).mark_line(color="orange").encode(
x = "1/k",
y = "test_error"
)
#combine the graphs
KNtrain + KNtest
Looking at the graph, things don't bode well for this data set. Both the training and test error start near zero at k = 1, which at first seems strange; the cause is that score_generator fits on the full scaled data set rather than on the training split alone, so each test point is its own nearest neighbor and gets classified perfectly, a form of data leakage. What is more concerning is that both curves climb toward higher error as k increases.
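For reference, a leakage-free variant (a sketch; score_generator_split is a new name, reusing the train/test split from above) would fit only on the training data:

def score_generator_split(k):
    #fit on the training split only, so the test error is an honest estimate
    cmf = KNeighborsClassifier(n_neighbors = k)
    cmf.fit(X_train, y_train)
    train_error = mean_absolute_error(cmf.predict(X_train), y_train)
    test_error = mean_absolute_error(cmf.predict(X_test), y_test)
    return (train_error, test_error)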
Now, I want to try using a different machine learning method, this time with random forests.
from sklearn.ensemble import RandomForestClassifier
Defining another score generator so I can graph the outputs with different values of k as well; for the random forest, k is the number of trees (n_estimators).
def score_generator_forest(k):
    rfc = RandomForestClassifier(n_estimators = k, max_features = None, max_depth = None, min_samples_split = 2)
    rfc.fit(X_train, y_train)
    train_error = mean_absolute_error(rfc.predict(X_train), y_train)
    test_error = mean_absolute_error(rfc.predict(X_test), y_test)
    return (train_error, test_error)
df_scores_forest = pd.DataFrame({"k":range(1,50),"train_error":np.nan,"test_error":np.nan})
for i in df_scores_forest.index:
    df_scores_forest.loc[i,["train_error","test_error"]] = score_generator_forest(df_scores_forest.loc[i,"k"])
This time I'm plotting k directly on the x-axis rather than 1/k.
RFtrain = alt.Chart(df_scores_forest).mark_line().encode(
x = "k",
y = "train_error"
)
RFtest = alt.Chart(df_scores_forest).mark_line(color="orange").encode(
x = "k",
y = "test_error"
)
RFtrain + RFtest
This graph more closely resembles the results I would expect, as the test error is always higher than the training error. The random forest model seems to level off at around 20 trees, with the test set plateauing around 35% error; since the labels are 0/1, mean absolute error here is just the misclassification rate, so that is roughly 65% accuracy at 20+ trees. It appears that neither machine learning method fared especially well for the match analysis I wanted to do, as the feature-engineered columns seem to introduce more noise than clarity.
Summary¶
I set out to try to predict the outcomes of professional CS:GO matches, and I started by feature engineering new columns of data to analyze.
First I computed the raw per-map data so I could compare it to each team's performance. From that comparison I could gauge to what degree a team was "proficient" on a given map, assigning a score equal to the team's combined win/loss percentages multiplied by its number of games on that map.
I thought multiplying by the game count would give a more accurate reflection of how well a team would perform on that map. Finally, I funneled all of that data back into a copy of the original dataframe.
The machine learning did not perform as well as hoped: the random forest was around 65% accurate, while depending on a team's world rank alone was around 70% accurate. I don't think the training set size was the issue, but rather the data itself. The metrics I chose could be good predictors of how teams would perform in a best-of-one series, but at the highest echelons of professional play, CS:GO is played in best-of-three series.
This means that one of my feature-engineered columns, the map preference value, grossly oversimplifies the complexity of how teams pick and ban maps, and there may be a better way to compute a total aggregate map proficiency value than the one I used. Incorporating the teams themselves into the machine learning could also have worked better than a world ranking alone, since world rank on its own doesn't account for each team's particular strengths and weaknesses; a rough sketch of that idea follows.
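As a sketch of that last idea (not part of the analysis above; team_dummies and X_with_teams are new names), team identity could be one-hot encoded with pd.get_dummies and appended to the numeric features:

#hypothetical: one-hot encode team names so a model can learn team-specific effects
team_dummies = pd.get_dummies(match_df_final[["first_team", "second_team"]], prefix = ["T1", "T2"])
X_with_teams = pd.concat([match_df_final[["T1_pref_score", "T2_pref_score", "T1_proficiency", "T2_proficiency"]], team_dummies], axis = 1)
#X_with_teams could then be scaled and fed to the same classifiers as above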
References¶
Dataset for CS:GO found on Kaggle, which was imported from HLTV
Kaggle integration on DeepNote: https://deepnote.com/@dhruvildave/Kaggle-heouwNORROiS3aTQFvbklg
Map pick/ban order in competitive CSGO (BO3s): https://help.challengermode.com/en/articles/684985-how-to-ban-and-pick-cs-go-maps
Using np.r_ indexer: https://stackoverflow.com/questions/45985877/slicing-multiple-column-ranges-from-a-dataframe-using-iloc
Graphical representation of different KNeighborsRegressor values: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html