Predicting Match/Map Outcomes in LAN versus Online Play Using Player Statistics in Professional Counter-Strike

Author: Andre Ngo

Email: 0andrengo@gmail.com (personal) and ango14@uci.edu (school)

Course Project, UC Irvine, Math 10, S23

Introduction#

I intend to explore how individual player performances affect the probability of winning a single map or a best-of-three series in Counter-Strike, as well as predicting players' end-of-map rating from their basic scoreboard statistics. These predictions will be split in several ways, such as LAN (in-person) versus online play and by player role, and then compared with each other.

The Dataset#

This dataset was taken from Kaggle, where it was scraped from HLTV.org by user Nafis Barzki. It contains match, tournament, and player data from January 2020 until December 2022, filtered exclusively for top-tier play. HLTV is the premier website for “Counter-Strike” news and statistics, with match, tournament, and player data going all the way back to the beginning of the current iteration of Counter-Strike in 2012. One of HLTV’s main proprietary features is its “HLTV Rating 2.0”, a metric used to condense a player’s in-game performance into a single number. The formula for this rating was kept secret, but it was eventually reverse engineered by a user named “Dave.” This rating will be one of the most important variables explored in this analysis.

import pandas as pd
import numpy as np
import altair as alt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("hltv_matches.csv")

Cleaning and Splitting the Dataset by LAN versus Online#

The first order of business is splitting the dataset into two. This would ideally be done using the running dates of tournaments or the playing dates of matches, but this dataset unfortunately does not contain any date columns. Therefore, we will instead use the name of the event. This also gives us the opportunity to use only tier-1 events in our dataset, because this dataframe, although filtered for “top tier” tournaments, also includes tier-2 tournaments.

df["bo_won"] = df["bo_won"].astype("bool")

This is a small cleanup step, likely needed because of a scraping error: the bo_won column is cast to a boolean type.
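As a minimal sketch of why it can help to inspect a column before casting it, the snippet below uses made-up values. The actual pre-cleanup contents of bo_won are not shown in this notebook, so the values here are purely hypothetical.

```python
import pandas as pd

# Hypothetical raw values; the real pre-cleanup contents of "bo_won" are not shown here.
raw = pd.Series([1, 0, 2, 1, 0], name="bo_won")

# Look at the distinct values first, so unexpected entries are visible.
counts = raw.value_counts().to_dict()

# astype("bool") maps 0 to False and every other number to True.
cleaned = raw.astype("bool")

print(counts)
print(cleaned.tolist())
```

If an unexpected value shows up in the counts, it may be safer to decide explicitly how it should be mapped rather than relying on the cast.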

Online Events#

Using HLTV’s built-in filters to select online events from 2020-01-01 until 2021-12-31 (considered the “online era”) with a minimum prize pool of $200,000, we can get the names of the events we are looking for. Because the .str.contains() method matches partial names, we also pick up matches from the preliminary stages of these events (such as play-ins), which would otherwise be skipped since they award qualification to the main event instead of prize money.

online_list = [
    "ESL Pro League Season 11",
    "Flashpoint 1",
    "BLAST Premier Spring 2020 Europe Finals",
    "BLAST Premier Spring 2020 Americas Finals",
    "ESL One Cologne 2020",
    "ESL Pro League Season 12",
    "Flashpoint 2",
    "BLAST Premier Fall 2020 Finals",
    "IEM Global Challenge 2020",
    "BLAST Premier Global Final 2020",
    "cs_summit 7",
    "IEM Katowice 2021",
    "ESL Pro League Season 13",
    "Funspark ULTI 2020 Europe Final",
    "Dreamhack Masters Spring 2021",
    "IEM Summer 2021",
    "BLAST Premier Spring Final 2021",
    "ESL Pro League Season 14"
]

df_online = df[df["event"].str.contains("|".join(online_list)) & ~df["event"].str.contains("Qualifier|Conference")]

The second condition weeds out lower-level competitions that use the words “Qualifier” or “Conference” in their names, since they would otherwise pass through online_list without being top-tier events.
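The filtering idea above can be sketched on a tiny, made-up set of event names. The names below are hypothetical stand-ins, and the use of re.escape is an extra precaution that the original code does not take.

```python
import re
import pandas as pd

# Hypothetical event names, for illustration only.
events = pd.Series([
    "ESL Pro League Season 11 Europe",
    "ESL Pro League Season 11 Qualifier",
    "Flashpoint 1",
    "Some Other Cup 2020",
])
wanted = ["ESL Pro League Season 11", "Flashpoint 1"]

# "|".join builds a regex alternation of the form name1|name2|...
# re.escape guards against names that contain regex metacharacters.
pattern = "|".join(re.escape(name) for name in wanted)

# Keep rows that match a wanted name but are not qualifiers or conferences.
keep = events.str.contains(pattern) & ~events.str.contains("Qualifier|Conference")
kept_events = events[keep].tolist()
print(kept_events)
```

Because the match is on substrings, "ESL Pro League Season 11 Europe" is kept even though only the shorter name appears in the list.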

df_online["event"].unique()

array(['BLAST Premier Spring 2020 Americas Finals',
       'BLAST Premier Spring 2020 Europe Finals',
       'ESL Pro League Season 11 North America', 'Flashpoint 1',
       'ESL Pro League Season 11 Europe', 'IEM Global Challenge 2020',
       'BLAST Premier Fall 2020 Finals', 'Flashpoint 2',
       'ESL Pro League Season 12 North America',
       'ESL Pro League Season 12 Europe', 'ESL One Cologne 2020 Europe',
       'ESL One Cologne 2020 North America',
       'Funspark ULTI 2020 Europe Final', 'ESL Pro League Season 13',
       'IEM Katowice 2021', 'IEM Katowice 2021 Play-In', 'cs_summit 7',
       'BLAST Premier Global Final 2020', 'ESL Pro League Season 14',
       'BLAST Premier Spring Final 2021', 'IEM Summer 2021'], dtype=object)

Listing the unique event names lets me confirm that every event in this new dataframe is what spectators would consider a tier-1 event.

International LANs and Majors#

International LANs are tournaments in which teams from around the world fly into one location to compete. Majors are the pinnacle of “Counter-Strike” competition and the marquee events of the year, usually held twice a year. Once again using HLTV’s built-in filters, we select majors and international LANs from 2021-01-01 until 2022-12-31 (considered the “post-online era”) with a minimum prize pool of $200,000 to get the names of the events we are looking for. Using the .str.contains() method, we can also get matches from the play-in stages of these events, which would otherwise be skipped for the same reasons as above.

lan_list = [
    "IEM Cologne 2021",
    "PGL Major Stockholm 2021",
    "V4 Future Sports Festival 2021",
    "BLAST Premier Fall Final 2021",
    "IEM Winter 2021",
    "BLAST Premier World Final 2021",
    "IEM Katowice 2022",
    "ESL Pro League Season 15",
    "PGL Major Antwerp 2022",
    "IEM Dallas 2022",
    "Pinicle Cup Championship 2022",
    "Global Esports Tour Dubai 2022",
    "BLAST Premier Spring Final 2022",
    "IEM Cologne 2022",
    "ESL Pro League Season 16",
    "IEM Rio Major 2022",
    "Elisa Masters Espoo 2022",
    "BLAST Premier Fall Final 2022",
    "BLAST Premier World Final 2022"
]

df_lan = df[df["event"].str.contains("|".join(lan_list)) & ~df["event"].str.contains("Qualifier|Conference")]

As before, this weeds out lower-level competitions that use the words “Qualifier” or “Conference” in their names. They would otherwise pass through lan_list without being top-tier events, and they are usually played online.

df_lan["event"].unique()

array(['IEM Cologne 2021', 'IEM Cologne 2021 Play-In',
       'BLAST Premier World Final 2021', 'IEM Winter 2021',
       'BLAST Premier Fall Final 2021', 'V4 Future Sports Festival 2021',
       'V4 Future Sports Festival 2021 Play-In',
       'PGL Major Stockholm 2021',
       'PGL Major Stockholm 2021 Challengers Stage',
       'V4 Future Sports Festival 2021 International Cup',
       'PGL Major Antwerp 2022 Europe RMR B',
       'PGL Major Antwerp 2022 Europe RMR A',
       'PGL Major Antwerp 2022 Americas RMR', 'ESL Pro League Season 15',
       'IEM Katowice 2022', 'IEM Katowice 2022 Play-In',
       'ESL Pro League Season 16', 'IEM Cologne 2022',
       'IEM Cologne 2022 Play-In', 'BLAST Premier Spring Final 2022',
       'Global Esports Tour Dubai 2022', 'IEM Dallas 2022',
       'PGL Major Antwerp 2022',
       'PGL Major Antwerp 2022 Challengers Stage',
       'IEM Rio Major 2022 Challengers Stage',
       'BLAST Premier Fall Final 2022', 'Elisa Masters Espoo 2022',
       'IEM Rio Major 2022', 'BLAST Premier World Final 2022'],
      dtype=object)
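A quick sanity check for lists such as lan_list is to see whether every name actually matched at least one event, since a small spelling or capitalization difference silently drops that event. The helper below is a hypothetical sketch (unmatched_names is not part of the original notebook), shown on made-up event names.

```python
import pandas as pd

def unmatched_names(events, names):
    """Return the entries of `names` that appear in no event name."""
    unique_events = pd.Series(events.dropna().unique())
    return [
        name for name in names
        # regex=False treats the name as a literal, case-sensitive substring.
        if not unique_events.str.contains(name, regex=False).any()
    ]

# Hypothetical data for illustration.
events = pd.Series([
    "Example Cup 2021",
    "Example Cup 2021 Play-In",
    "Example Masters 2021",
])
names = ["Example Cup 2021", "Example masters 2021"]

missing = unmatched_names(events, names)
print(missing)
```

In the notebook this could be called as unmatched_names(df["event"], lan_list), and any names it returns would be worth checking against the spelling used on HLTV.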

Winrate Correlation#

Before I try predicting winrates, I’m going to first check the correlations of our future inputs. This is done by comparing the performance of the AWPer role against the rest of the team. AWPers are widely considered the most important role in the game. Usually, they are the star players of the roster, and teams will build their strategies and playstyles around the presence of their own AWPer or the opposing team’s AWPer. In short, they’re important, but how important? Are they more important than the rest of the team?

AWP_list = [
    "s1mple",
    "ZywOo",
    "CadiaN",
    "broky",
    "m0nesy",
    "SunPayus",
    "sh1ro",
    "oSee",
    "torzsi",
    "saffee",
    "Jame",
    "headtr1ck",
    "nicoodoz",
    "FalleN",
    "degseter",
    "hyped",
    "syrsoN",
    "hallzerk",
    "junior",
    "aliStair",
    "try",
    "zevy",
    "acoR",
    "kennyS"
]

Here, I begin by manually creating a list of tier-1 AWPers who played within our given timeframe. This will be used to further split our dataframe.

Correlation of Winrate Using Statistics of the Whole Team#

player_stats =[
    "kill_score",
    "death_score",
    "kddiff",
    "adr",
    "kast",
    "rating"
]

df_lan_team = df_lan[df_lan["player_name"].str.contains("|".join(AWP_list)) == False]

df_online_team = df_online[df_online["player_name"].str.contains("|".join(AWP_list)) == False]
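One thing to keep in mind with .str.contains() is that it matches substrings, so a short name in AWP_list can also match longer, unrelated player names. The sketch below uses made-up names to contrast this with exact matching via isin. Whether such collisions actually occur in this dataset is not something the notebook shows.

```python
import pandas as pd

# Hypothetical player names for illustration.
players = pd.Series(["try", "Gentry", "Jame", "Jamey", "s1mple"])
awp_names = ["try", "Jame"]

# Substring matching: any name containing "try" or "Jame" is selected.
substring_hits = players[players.str.contains("|".join(awp_names))].tolist()

# Exact matching: only names equal to a list entry are selected.
exact_hits = players[players.isin(awp_names)].tolist()

print(substring_hits)
print(exact_hits)
```

With exact matching, the complement (the rest of the team) would be ~players.isin(awp_names).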

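# numeric_only=True skips text columns such as player and event names (required on pandas 2.x)
lan_map_team = df_lan_team.corr(numeric_only=True)["map_won"].rename("lan_map_team")
lan_series_team = df_lan_team.corr(numeric_only=True)["bo_won"].rename("lan_series_team")
online_map_team = df_online_team.corr(numeric_only=True)["map_won"].rename("online_map_team")
online_series_team = df_online_team.corr(numeric_only=True)["bo_won"].rename("online_series_team")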

team_correlation = pd.concat([lan_map_team,lan_series_team,online_map_team,online_series_team], axis=1)
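To make the idea of correlating a statistic with a win/loss column concrete, here is a small sketch on synthetic data. The numbers are generated, not taken from the dataset, and corrwith is used here as one possible alternative to building the full correlation matrix.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; column names mirror the notebook's.
rng = np.random.default_rng(0)
rating = rng.normal(1.0, 0.3, size=500)
# In this toy setup, higher ratings make a win more likely.
map_won = (rating + rng.normal(0, 0.3, size=500)) > 1.0
adr = rng.normal(75, 15, size=500)  # unrelated to the outcome by construction
toy = pd.DataFrame({"rating": rating, "adr": adr, "map_won": map_won})

# Pearson correlation of each stat with the outcome (True/False treated as 1/0).
corr = toy[["rating", "adr"]].corrwith(toy["map_won"].astype(float))
print(corr)
```

With a boolean target, this Pearson correlation is the point-biserial correlation: it is positive when wins tend to come with higher values of the statistic.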

Aside: Map vs Series Winrate#

BO3 (best-of-three) matches are considered better indicators of team and player performance per match than BO1 (best-of-one) matches. There are seven maps that teams are able to veto and pick from. If you’ve been keeping track, I’ve now separated our data three ways: LAN versus online, AWPer versus non-AWPer, and now map versus series. We will find out the significance of map versus series soon.

Correlation of Winrate Using Only AWPer Statistics#

df_lan_awp = df_lan[df_lan["player_name"].str.contains("|".join(AWP_list))]

df_online_awp = df_online[df_online["player_name"].str.contains("|".join(AWP_list))]

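# numeric_only=True skips text columns such as player and event names (required on pandas 2.x)
lan_map_awp = df_lan_awp.corr(numeric_only=True)["map_won"].rename("lan_map_awp")
lan_series_awp = df_lan_awp.corr(numeric_only=True)["bo_won"].rename("lan_series_awp")
online_map_awp = df_online_awp.corr(numeric_only=True)["map_won"].rename("online_map_awp")
online_series_awp = df_online_awp.corr(numeric_only=True)["bo_won"].rename("online_series_awp")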

awp_correlation = pd.concat([lan_map_awp,lan_series_awp, online_map_awp, online_series_awp], axis=1)

t_corr = team_correlation.loc[player_stats].T.copy()
t_corr["index"] = t_corr.index

a_corr = awp_correlation.loc[player_stats].T.copy()
a_corr["index"] = a_corr.index

Here, I transpose these new dataframes using [.T](https://www.youtube.com/watch?v=6AjYmHLHlpY) to make graphs later.

for i in player_stats:
    c1 = alt.Chart(a_corr).mark_bar().encode(
        x="index",
        y=i
    )
    c2 = alt.Chart(t_corr).mark_bar().encode(
        x="index",
        y=i
    )
    (c1+c2).display()

So far, there appears to be no significant difference between LAN and online play. There does appear to be a small difference between map and series outcomes, and between AWPers and the rest of the team, with AWPers having a slightly higher win correlation in every metric except kill score. Therefore, we will continue by applying machine learning to AWPers versus the rest of the team, and abandon the LAN versus online angle due to a lack of findings.

(Small aside: I was originally going to try to make these graphs clearer by creating double bar graphs using str.startswith for LAN/Online and str.contains for team/AWP but unfortunately I couldn’t find a method despite scouring the internet. I also tried some .mean methods/techniques to try to extract even more information but it quickly became too bloated.)
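One possible route to the grouped charts described in this aside is to reshape the correlations into long form, derive the grouping columns with str.startswith and str.contains, and let Altair facet on them. The sketch below uses hypothetical correlation values (not the notebook's results), and grouped_bars is an illustrative helper that needs Altair installed when it is called.

```python
import pandas as pd

# Hypothetical correlation values; only the row names mimic the notebook's series.
wide = pd.DataFrame(
    {"rating": [0.50, 0.40, 0.48, 0.38, 0.52, 0.42, 0.51, 0.41]},
    index=[
        "lan_map_team", "lan_series_team", "online_map_team", "online_series_team",
        "lan_map_awp", "lan_series_awp", "online_map_awp", "online_series_awp",
    ],
)

long_df = wide.rename_axis("series").reset_index()
# Derive grouping columns from the series names.
long_df["setting"] = long_df["series"].str.startswith("lan").map({True: "LAN", False: "Online"})
long_df["role"] = long_df["series"].str.contains("awp").map({True: "AWPer", False: "Team"})
long_df["target"] = long_df["series"].str.split("_").str[1]  # "map" or "series"

def grouped_bars(data, stat="rating"):
    """Faceted bars: AWPer vs Team side by side, one panel per setting and target."""
    import altair as alt  # imported lazily so the reshaping above runs without Altair
    return alt.Chart(data).mark_bar().encode(
        x="role:N",
        y=f"{stat}:Q",
        color="role:N",
        column="setting:N",
        row="target:N",
    )

print(long_df)
```

Calling grouped_bars(long_df) in a notebook would then draw one small panel per LAN/Online and map/series combination, with the AWPer and team bars next to each other.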

Winrate Prediction#

Here, we begin the machine learning portion of the project, starting with the train/test split. Based on the findings of the previous section, we create two new datasets: one for the AWP players and one for the rest of the team. The target I have chosen is map wins rather than series wins, because map wins showed the greater correlation in the graphs and for the reasons given in my earlier aside.

rfc = RandomForestClassifier(n_estimators=200, max_leaf_nodes=5)
clf = DecisionTreeClassifier(max_leaf_nodes=5)

df_awpers = df[df["player_name"].str.contains("|".join(AWP_list))]

Machine Learning for AWPers#

X_train_awp,X_test_awp,y_train_awp,y_test_awp=train_test_split(df_awpers[player_stats],df_awpers["map_won"],train_size=0.8)

clf.fit(X_train_awp,y_train_awp)
rfc.fit(X_train_awp, y_train_awp)

RandomForestClassifier(max_leaf_nodes=5, n_estimators=200)

print(clf.score(X_train_awp,y_train_awp))
print(clf.score(X_test_awp,y_test_awp))
print(abs(clf.score(X_train_awp,y_train_awp)-clf.score(X_test_awp,y_test_awp)))

0.7793427230046949
0.7542448614834674
0.025097861521227505

print(rfc.score(X_train_awp,y_train_awp))
print(rfc.score(X_test_awp,y_test_awp))
print(abs(rfc.score(X_train_awp,y_train_awp)-rfc.score(X_test_awp,y_test_awp)))

0.7918622848200313
0.7631814119749777
0.028680872845053607

The test set accuracy is close to the train set accuracy for both estimators (a gap of roughly 0.03), which suggests little or no overfitting.
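The train-versus-test comparison above can be wrapped in a small helper so that it is repeatable. This is a sketch on synthetic data with a fixed random_state, which the original cells do not set. train_test_gap is a hypothetical name, and the real inputs would be df_awpers[player_stats] and df_awpers["map_won"].

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=1000)) > 0

def train_test_gap(model, X, y, seed=0):
    """Fit on a seeded 80/20 split and return (train score, test score, gap)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=0.8, random_state=seed
    )
    model.fit(X_tr, y_tr)
    train_score = model.score(X_tr, y_tr)
    test_score = model.score(X_te, y_te)
    return train_score, test_score, abs(train_score - test_score)

tree = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
train_score, test_score, gap = train_test_gap(tree, X, y)
print(train_score, test_score, gap)
```

Running the helper with several different seeds gives a rough sense of how much the gap moves from split to split.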

Machine Learning for the Rest of the Team#

df_team = df[df["player_name"].str.contains("|".join(AWP_list)) == False]

X_train_team,X_test_team,y_train_team,y_test_team=train_test_split(df_team[player_stats],df_team["map_won"],train_size=0.8)

clf.fit(X_train_team,y_train_team)
rfc.fit(X_train_team, y_train_team)

RandomForestClassifier(max_leaf_nodes=5, n_estimators=200)

print(clf.score(X_train_team,y_train_team))
print(clf.score(X_test_team,y_test_team))
print(abs(clf.score(X_train_team,y_train_team)-clf.score(X_test_team,y_test_team)))

0.7436487297459492
0.7340134693605388
0.009635260385410405

print(rfc.score(X_train_team,y_train_team))
print(rfc.score(X_test_team,y_test_team))
print(abs(rfc.score(X_train_team,y_train_team)-rfc.score(X_test_team,y_test_team)))

0.7543508701740348
0.7456824698272988
0.008668400346735994

Once again, the test set accuracy is close to the train set accuracy for both estimators (a gap of roughly 0.01), which suggests little or no overfitting.

Summary/Findings#

AWPers vs The Rest of the Team#

In the winrate correlation section, we found that the performance of AWP players had a slightly higher correlation with map and series wins than the performance of the rest of the team. After running our machine learning methods, we see that the test set scores for AWPers were also higher than the test set scores for the rest of the team (~0.75 vs ~0.73 for DecisionTreeClassifier, and ~0.76 vs ~0.75 for RandomForestClassifier). This is consistent with my assumption that AWP players are a better indicator of team performance than the rest of the team, due to the impact their role inherently has on the game, although the differences are small.
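One rough way to judge whether a gap such as ~0.75 vs ~0.73 is larger than sampling noise is the binomial standard error of an accuracy, sqrt(p(1-p)/n). The sketch below assumes independent test rows and uses hypothetical test set sizes, since the notebook does not report them. Rows from the same map are not truly independent, so this is only an approximation.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# The n_test values are hypothetical; the notebook does not report test set sizes.
se_awp = accuracy_standard_error(0.75, 1000)
se_team = accuracy_standard_error(0.73, 4000)

# Standard error of the difference between two independent accuracies.
se_diff = math.sqrt(se_awp**2 + se_team**2)

print(se_awp, se_team, se_diff)
```

If the observed difference is not much larger than se_diff, it may be within what a different random split could produce.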

Maps vs Series#

All inputs were less strongly correlated with series wins than with map wins. A plausible reason is that a series is longer and spans several maps, so an individual’s performance on any one map matters less to the overall result.

LAN vs Online#

Unfortunately, there were no significant differences observed between LAN and online play when using player statistics as inputs. Perhaps something such as player age or number of LANs attended would be a better choice, although data like that would be nearly impossible to scrape.

References#

What is the source of your dataset(s)? This dataset was taken from Kaggle, scraped from HLTV.org by user Nafis Barzki.

List any other references that you found helpful.

StackOverflow, YouTube, and miscellaneous websites.
