Predicting League of Legends Winners#

Author: Kyle Lemler

Course Project, UC Irvine, Math 10, S23

Introduction#


In this project, I will be exploring which team is predicted to win using LogisticRegression, followed by assessing the accuracy of the predictions and ways to improve them. I finish the project by using ChatGPT and looking at the importance of all of the columns instead of just the ones I chose.

Exploring Gold Earned and Blue Team’s Winrate#

The first step in this process was importing all of the needed libraries, as I will need them throughout the project. After importing the libraries, the next step was understanding how the dataset was formed.

import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LogisticRegression
df= pd.read_csv('high_diamond_ranked_10min.csv')
df.head(6)
gameId blueWins blueWardsPlaced blueWardsDestroyed blueFirstBlood blueKills blueDeaths blueAssists blueEliteMonsters blueDragons ... redTowersDestroyed redTotalGold redAvgLevel redTotalExperience redTotalMinionsKilled redTotalJungleMinionsKilled redGoldDiff redExperienceDiff redCSPerMin redGoldPerMin
0 4519157822 0 28 2 1 9 6 11 0 0 ... 0 16567 6.8 17047 197 55 -643 8 19.7 1656.7
1 4523371949 0 12 1 0 5 5 5 0 0 ... 1 17620 6.8 17438 240 52 2908 1173 24.0 1762.0
2 4521474530 0 15 0 0 7 11 4 1 1 ... 0 17285 6.8 17254 203 28 1172 1033 20.3 1728.5
3 4524384067 0 43 1 0 4 5 5 1 0 ... 0 16478 7.0 17961 235 47 1321 7 23.5 1647.8
4 4436033771 0 75 4 0 6 6 6 0 0 ... 0 17404 7.0 18313 225 67 1004 -230 22.5 1740.4
5 4475365709 1 18 0 0 5 3 6 1 1 ... 0 15201 7.0 18060 221 59 -698 -101 22.1 1520.1

6 rows × 40 columns

As we can see, this is a large dataset with 40 columns. However, we can ignore some of these columns as irrelevant or redundant. For example, gameId is literally just a game identifier, and blueGoldDiff and redGoldDiff are mirror images of each other (redGoldDiff = -blueGoldDiff). For now we'll focus on the columns involving gold, as I believe they will be the most impactful.
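Before dropping a mirrored column, it's worth confirming the relationship actually holds. Here is a minimal sketch of the kind of check one could run; it uses a hypothetical three-row frame rather than the real data, but the same line works on the full df.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values); with the actual
# df, the same check can be run on the full columns.
toy = pd.DataFrame({'blueGoldDiff': [643, -2908, -1172],
                    'redGoldDiff': [-643, 2908, 1172]})

# If the two columns are exact mirrors, one of them adds no information.
mirrored = (toy['redGoldDiff'] == -toy['blueGoldDiff']).all()
print(mirrored)  # True
```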

Now we'll set up a LogisticRegression for 'blueTotalGold' and 'redTotalGold' in df.

clf1=LogisticRegression()
cols=['blueTotalGold','redTotalGold']
clf1.fit(df[cols],df['blueWins'])
LogisticRegression()

After setting up the LogisticRegression, I use predict_proba in order to predict the probability of who wins based on my input values. After that I create and reorganize a DataFrame df2 to allow for future graphing. I also renamed blueWins to whoWins and changed the values 0 to 'Red Wins' and 1 to 'Blue Wins', because I will be using tooltips in the Altair charts later and it reads more clearly.

df2=pd.DataFrame(clf1.predict_proba(df[cols]))
df2['blueTotalGold']=df['blueTotalGold']
df2['redTotalGold']=df['redTotalGold']


#ChatGPT helped here with the line underneath, Reference #1. My original code was working, but had a Warning
df2['whoWins'] = df['blueWins'].replace({0: 'Red Wins', 1: 'Blue Wins'})

#ChatGPT helped here with the three lines underneath, Reference #2.
#I really wanted to have "whoWins" as the first column as I believe it looks better in the DataFrame.
df2columns=df2.columns.tolist()
df2columns.insert(0, df2columns.pop(df2columns.index('whoWins')))
df2 = df2[df2columns]

#ChatGPT helped here with rename() and inplace, Reference #3.
df2.rename(columns={0:"ProbRed_Gold",1:'ProbBlue_Gold'},inplace=True)

print(len(df2))
df2.head(3)
9879
whoWins ProbRed_Gold ProbBlue_Gold blueTotalGold redTotalGold
0 Red Wins 0.409757 0.590243 17210 16567
1 Red Wins 0.847875 0.152125 14712 17620
2 Red Wins 0.668151 0.331849 16113 17285

After setting up df2, I attempted to graph it and failed because of Altair's default maximum of 5000 rows per chart. Accordingly, I took a random sample of df2 (df2sample) and split it by the true winner of the match, so that I could plot the two groups separately and then layer them for proper coloration of the true winner.

I chose a sample size of only 500 because, although 500/9879 is barely 5%, putting over 1000 points in the sample creates a messy graph (many points overlap).
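As an aside, sampling is only one way around the row limit. Altair also lets you lift the 5000-row cap entirely, at the cost of a larger notebook file; this isn't the approach used here, but it's a one-line alternative:

```python
import altair as alt

# Lift Altair's default 5000-row limit; charts will then embed the full
# dataset, which can make the saved notebook considerably larger.
alt.data_transformers.disable_max_rows()
```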

df2sample=df2.sample(500, random_state=36)
df2sampleRed_sub=df2sample[df2sample['whoWins']=='Red Wins']
df2sampleBlue_sub=df2sample[df2sample['whoWins']=='Blue Wins']

c1=alt.Chart(df2sampleRed_sub).mark_circle(color='red').encode(
    x='ProbRed_Gold:Q',
    y='redTotalGold:Q',
    tooltip=['ProbRed_Gold','redTotalGold','blueTotalGold','whoWins']
)

c2=alt.Chart(df2sampleBlue_sub).mark_circle(color='blue').encode(
    x='ProbBlue_Gold:Q',
    y='blueTotalGold:Q',
    tooltip=['ProbBlue_Gold','redTotalGold','blueTotalGold','whoWins']
)
c1+c2

Firstly, I remind you that the color corresponds to the team that DID win. Now that we have df2sample graphed, there are a couple of observations. One is that we can see a slight incline throughout the graph and a steep incline towards the end. This tells us that the more gold a team has, the higher the chance predict_proba assigns to that team winning; conversely, a team with less gold is predicted to have a lower chance of winning. Another observation is that we can see multiple matches where predict_proba predicts a \(<30\)% chance of winning, yet that team still won the match.

Now that we’ve seen that more gold can lead to a higher predict_proba value, let’s expand on that by looking at the graph using predict_proba’s predicted winner and how it predicts.

df2sample2=df2.sample(4000, random_state=36)
c3=alt.Chart(df2sample2).mark_circle().encode(

    #ChatGPT helped with adjusting the domain and range on the graph, Reference: #4. 
    x=alt.X('blueTotalGold:Q', scale=alt.Scale(domain=[10000, 24000])),
    y=alt.Y('redTotalGold:Q', scale=alt.Scale(domain=[11000, 23000])),

    color=alt.Color('ProbRed_Gold:Q', scale=alt.Scale(scheme='redblue',reverse=True)),
    tooltip=['ProbRed_Gold','redTotalGold','blueTotalGold','whoWins']
).properties(
    width=500
)
c3

Firstly, I need to point out that on this graph the coloring depends on predict_proba's predicted winner, NOT the actual winner. One of the first observations we can make is that if the red team has a lot of gold and the blue team has significantly less, predict_proba heavily favors the red team as the winner (and vice versa). Another observation concerns the white area towards the center of the graph. This area reveals that predict_proba has difficulty declaring a winner when the teams are similar in gold, which logically makes sense: if the teams are similar in gold, it is very difficult to predict a winner based on gold alone.

Another observation is that even a high predict_proba value doesn't necessarily mean that team is the winner. This is highlighted by the circled point on the following graph. With the tooltip we can hover over it and see that predict_proba predicts a 97.6% chance of winning for the red team. However, looking at whoWins we can see that the blue team actually wins this match, despite the 6324 gold difference.

c3dot=df2sample2.loc[[8459]]
c4=alt.Chart(c3dot).mark_point(color='black',size=250).encode(x='blueTotalGold:Q', y='redTotalGold:Q',
tooltip=['ProbRed_Gold','redTotalGold','blueTotalGold','whoWins'])

c3+c4

Accuracy of the Predictions#

Now that we've seen the graphs, it's clear there are some issues with the accuracy of the predictions. Next, we'll look into improving it.

Firstly, we can run the .score method on clf1 to see the accuracy of the current test.

print(f"Using the same inputs as before: {clf1.score(df[cols],df['blueWins'])}")

clf2=LogisticRegression()
clf2.fit(df[['blueGoldDiff']],df['blueWins'])
print(f"Using the new input of only 'blueGoldDiff' in df: {clf2.score(df[['blueGoldDiff']],df['blueWins'])}")
Using the same inputs as before: 0.7234537908695212
Using the new input of only 'blueGoldDiff' in df: 0.7228464419475655

After running .score on clf1 we see that it has an accuracy of 0.72345, or 72.345%. This accuracy is not very good; while the model is correct more often than not, it's not very impressive. My next thought was to change the input to a single variable closely related to blueTotalGold and redTotalGold: blueGoldDiff, which is simply blueTotalGold - redTotalGold. The score of this new input was 0.72284, or 72.284%, about a 0.06 percentage point decrease in accuracy. To be honest, I was not expecting a decrease and do not know why there was one; my guess was some rounding effect in the fitted model. A 0.06 decrease is very small, however, and now that the gold information is in a single column, I can filter the data to exclude small gold differences.
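One caveat about all of these scores is that they are computed on the same rows used for fitting, which can overstate accuracy. A fairer estimate uses a held-out test set. Here is a minimal sketch on synthetic stand-in data (the gold differences and win labels below are simulated, not the real match data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for blueGoldDiff / blueWins: blue wins more often
# when its gold lead is larger, plus noise.
rng = np.random.default_rng(0)
gold_diff = rng.normal(0, 2500, size=2000)
wins = (gold_diff + rng.normal(0, 2000, size=2000) > 0).astype(int)

# Hold out 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    gold_diff.reshape(-1, 1), wins, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print(f"train accuracy: {clf.score(X_train, y_train):.3f}")
print(f"test accuracy:  {clf.score(X_test, y_test):.3f}")
```

With only one feature and a simple model, the two numbers usually end up close, but the test score is the one that generalizes.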

clf3=LogisticRegression()
goldDiffSize=5000
clf3_df=df[abs(df['blueGoldDiff'])>=goldDiffSize]
print(f'This new data frame has {len(clf3_df)} rows with the minimum gold difference being {goldDiffSize}')

clf3.fit(clf3_df[['blueGoldDiff']],clf3_df['blueWins'])
clf3.score(clf3_df[['blueGoldDiff']],clf3_df['blueWins'])
This new data frame has 445 rows with the minimum gold difference being 5000
0.9842696629213483

After restricting the gold difference to at least 5000, we can see that the accuracy dramatically increased. This is directly related to removing many of the points that were in the "white zone" of the graph above. However, if we recall the circled point, its gold difference was 6324, so such points still play a role in impeding our predictions. We can also see that only 445 rows remain with a gold difference of 5000 or more, so there is a clear trade-off between the accuracy of the test and how many rows remain.

With this in mind, let’s create a graph that will explore the accuracy of the scorings with a changing gold difference.

clf3_Graphing_df=pd.DataFrame(columns=['goldDiffSize','Scoring','Length'])

#ChatGPT helped me on the range. I simply forgot how to do it. Reference #5. 
for i in range(0,10050,50): 
    goldDiffSize=i
    clf3_Graphing_df.loc[i,'goldDiffSize']=i
    clf3_df=df[abs(df['blueGoldDiff'])>=goldDiffSize]
    clf3.fit(clf3_df[['blueGoldDiff']],clf3_df['blueWins'])
    clf3_Graphing_df.loc[i,'Scoring']=clf3.score(clf3_df[['blueGoldDiff']],clf3_df['blueWins'])
    clf3_Graphing_df.loc[i,'Length']=len(clf3_df)


alt.Chart(clf3_Graphing_df).mark_line().encode( 
    x='goldDiffSize:Q',
    y=alt.Y('Scoring:Q', scale=alt.Scale(domain=[.7, 1])),
    tooltip=['goldDiffSize','Scoring','Length']
).properties(
    width=500
)

Looking at the graph above, we can see that as goldDiffSize increases the model usually becomes more accurate. There is a large jump in accuracy at a goldDiffSize of 4000; as with a goldDiffSize of 5000 we have fewer matches remaining, but nonetheless roughly twice as many matches are considered while still maintaining a high accuracy of about 95%. If we wished to include more games and accepted 90% accuracy, we could use a goldDiffSize of 2900, which keeps about 25% of the dataset. One of the main takeaways from this graph is seeing how badly the "white area" from above hurts the predictions: it's very difficult to make a prediction when a match's gold values are so close between the teams (or in this case, when goldDiffSize is small).

A less important but, in my opinion, still interesting feature is the spots where the accuracy decreases. I believe this indicates that increasing goldDiffSize at those points removes primarily correctly predicted matches. So if one wanted to choose a threshold, they'd want to avoid a dip, since moving in either direction would increase accuracy. The other interesting spot is where the graph hits 100% accuracy: at the point where it begins, there are only 88 games left to predict from. This type of prediction is fairly useless, as that isn't even 1% of the original dataset, but it is interesting nonetheless.

Now that we've explored the importance of goldDiffSize, this raises the question of whether we could introduce other columns alongside it to increase the prediction rate without decreasing the number of available matches.

Exploring other conditions#

Firstly, I am going to choose four columns, two of which are mirrored. Specifically, I am choosing 'blueEliteMonsters', 'blueTotalExperience', 'redEliteMonsters', and 'redTotalExperience' from df. I chose these columns because I believe they are some of the most important factors in winning a game.

Once again, I set up a LogisticRegression with these four inputs and the usual df['blueWins'] as the target variable.

clf4=LogisticRegression()
cols4=['blueEliteMonsters','blueTotalExperience','redEliteMonsters','redTotalExperience']
df4=df[cols4].copy()
df4['blueWins'] = df['blueWins']


clf4.fit(df4[cols4],df4['blueWins'])
clf4score_noGold=clf4.score(df4[cols4],df4['blueWins'])
clf4score_noGold
0.7127239599149712

This result surprised me, as I had previously stated that I valued [blue/red]EliteMonsters and [blue/red]TotalExperience as both being very important to a game. As such, the lower accuracy needed further investigation. My initial theory was that [blue/red]TotalExperience could be having an issue similar to the one explored previously with the gold columns: if both teams have similar experience, it is hard to predict based solely on that, due to the "white area". That is why I had introduced [blue/red]EliteMonsters, to try to offset some of the "white area". I wanted to see how [blue/red]EliteMonsters was affecting the predictions, so I ran another test with just [blue/red]TotalExperience.

clf4=LogisticRegression()
cols41=['blueTotalExperience','redTotalExperience']
df4=df[cols41].copy()
df4['blueWins'] = df['blueWins']

clf4.fit(df4[cols41],df4['blueWins'])
clf4.score(df4[cols41],df4['blueWins'])
0.7127239599149712

The result was exactly the same score as before. I found this shocking, and I interpret it as meaning the EliteMonsters values play very little role, or in this case no role, in the accuracy of predicting who wins. As such, I wanted to see other combinations and their accuracy.
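One way to make sense of a score that doesn't move is that a feature contributing no signal beyond the other inputs tends to leave the training accuracy essentially unchanged. This doesn't fully explain the EliteMonsters case (EliteMonsters alone scores better than chance), but the general pattern can be sketched on synthetic data, where the second feature is pure noise by construction:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic illustration (not the real data): one informative feature
# plus one feature that carries no signal about the label.
rng = np.random.default_rng(1)
signal = rng.normal(0, 1, size=3000).reshape(-1, 1)
noise_feature = rng.normal(0, 1, size=3000).reshape(-1, 1)
y = (signal.ravel() + rng.normal(0, 0.5, size=3000) > 0).astype(int)

base = LogisticRegression().fit(signal, y)
both = LogisticRegression().fit(np.hstack([signal, noise_feature]), y)

# The two accuracies typically come out almost identical: the
# uninformative feature barely moves the score.
print(f"{base.score(signal, y):.3f}")
print(f"{both.score(np.hstack([signal, noise_feature]), y):.3f}")
```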

clf4=LogisticRegression()
cols4=['blueGoldDiff','blueEliteMonsters','blueTotalExperience','redEliteMonsters','redTotalExperience']
df4=df[cols4].copy()
df4['blueWins'] = df['blueWins']

clf4.fit(df4[['blueEliteMonsters','redEliteMonsters']],df4['blueWins'])
print(f"Scoring of only EliteMonsters:{clf4.score(df4[['blueEliteMonsters','redEliteMonsters']],df4['blueWins'])}")

cols42=['blueTotalExperience','redTotalExperience']
clf4.fit(df4[cols42],df4['blueWins'])
print(f"Scoring of only Experience:{clf4.score(df4[cols42],df4['blueWins'])}")

print(f'Scoring of EliteMonsters and Experience: {clf4score_noGold}')

cols43=['blueGoldDiff','blueTotalExperience','redTotalExperience']
clf4.fit(df4[cols43],df4['blueWins'])
print(f"Scoring of only Gold and Experience {clf4.score(df4[cols43],df4['blueWins'])}")

clf4.fit(df4[cols4],df4['blueWins'])
print(f"scoring of Gold, Experience and Elite Monsters: {clf4.score(df4[cols4],df4['blueWins'])}")
Scoring of only EliteMonsters:0.6063366737524041
Scoring of only Experience:0.7127239599149712
Scoring of EliteMonsters and Experience: 0.7127239599149712
Scoring of only Gold and Experience 0.7269966595809293
scoring of Gold, Experience and Elite Monsters: 0.7269966595809293

Firstly, I found that a score using only the EliteMonsters columns was about 60.633%. Secondly, GoldDiff and TotalExperience had an accuracy of 72.699%. Finally, GoldDiff, TotalExperience, and EliteMonsters together had exactly the same accuracy of 72.699%. That makes two cases where EliteMonsters had no effect on the accuracy. Due to these results, I concluded that EliteMonsters have very little effect on accuracy when considered alongside other factors. However, this statement isn't entirely true, which will be explored in the next section.

Initially, I was surprised by the low number in the first score using just EliteMonsters. Having given it thought, I believe this is due to the nature of elite monsters in the game. Since this dataset is captured at exactly the 10th minute of the game, at most one Dragon and one Herald may spawn, and thus be killed, within that time. In addition, Dragons give small buffs individually but play a much larger role when multiple are obtained. As such, not enough time has passed in these matches to make a strong claim about who will win based purely on the first drake.

In the final section, we'll explore how ChatGPT believes each column relates to blueWins, and disprove my claim that "EliteMonsters have very little effect on accuracy when considering other factors alongside it."

ChatGPT correlations on blueWins#

Firstly, I need to state that I wrote none of the code in the block below; 100% of it was copied and pasted directly from ChatGPT. My only role was prompting ChatGPT with specific adjustments so the graph would appear how I desired.

Using ChatGPT, I wanted to see which columns have the most relevance to the blue team winning. You can also use this data to gauge relevance to the red team winning by reversing the sign of each column's correlation.

correlations = []
columns = df.columns  # assuming df is your dataframe

# Calculate the correlation coefficient for each column with 'blueWins'
for column in columns:
    correlation = df[column].corr(df['blueWins'])
    correlations.append(correlation)

# Create a DataFrame with the column names and correlation coefficients
data = pd.DataFrame({'Column': columns, 'Correlation Coefficient': correlations}) 

# Sort the DataFrame by correlation coefficient in descending order
data_sub = data.sort_values('Correlation Coefficient', ascending=False)
data = data_sub[data_sub['Column'] != 'blueWins']

# Create the Altair bar chart with tooltips
chart = alt.Chart(data).mark_bar().encode(
    x=alt.X('Column:N', sort='-y'),  # Sort the x-axis in descending order
    y='Correlation Coefficient:Q',
    tooltip=['Column', 'Correlation Coefficient']
).properties(
    width=600,
    height=400,
    title='Correlation Coefficients with blueWins'
)

chart

There are a few things to take away from this chart. Firstly, we can see that blueGoldDiff and blueExperienceDiff are the two columns most positively correlated with a blue win. Conversely, redExperienceDiff and redGoldDiff are the two most strongly correlated against the blue team winning.

Another spot of note is the areas with equal importance. Such pairs include blueTotalGold and blueGoldPerMin, blueCSPerMin and blueTotalMinionsKilled, blueKills and redDeaths, and the red team's counterparts. This stands to reason, as each pair is computed directly from the same quantity. Note that blueTotalGold and blueGoldDiff do not have this relation, since blueGoldDiff also depends on redTotalGold.
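The equal-height pairs can be explained by a property of Pearson correlation: rescaling a column by a positive constant leaves its correlation with anything else unchanged, so total gold and gold per minute (over a fixed 10 minutes) must produce identical bars. A quick sketch on synthetic values (not the real data):

```python
import numpy as np
import pandas as pd

# Synthetic column and target; the choice of values is irrelevant here,
# only the scaling property of Pearson correlation is being illustrated.
rng = np.random.default_rng(2)
s = pd.Series(rng.normal(0, 1, 500))
target = pd.Series(rng.integers(0, 2, 500).astype(float))

# Dividing by 10 (total gold -> gold per minute) leaves the
# correlation with the target unchanged.
print(np.isclose(s.corr(target), (s / 10.0).corr(target)))  # True
```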

Continuing, we can note that redTotalGold and blueTotalGold have different values in the chart. This leads me to believe that the difference in score between blueGoldDiff and the pair blueTotalGold/redTotalGold lies in this asymmetry, rather than my original thought of a rounding error.

Finally, as more of a fun note: despite saying we could ignore gameId at the start, the chart shows it with more importance in predicting a blue win than blueWardsPlaced. While it's a silly result, it genuinely makes no sense to me that blueWardsPlaced has such a small effect on predicting who wins; I can personally say that I have had fights decided by vision. It's a very shocking result.

Now that we've seen how ChatGPT ranks the importance of each column, we can revisit my earlier claim that "EliteMonsters have very little effect on accuracy when considering other factors alongside it."

The rest of project below was coded by me, not ChatGPT:

Before we continue, as a reminder: the score of just blueGoldDiff was 0.7228464419475655, and the score of only [red/blue]EliteMonsters was 0.6063366737524041.

clf5=LogisticRegression()
cols5=['blueEliteMonsters','redEliteMonsters','blueWardsPlaced']
clf5.fit(df[cols5],df['blueWins'])
clf5.score(df[cols5],df['blueWins'])
0.6060329992914263
clf6=LogisticRegression()
cols6=['blueEliteMonsters','redEliteMonsters','blueGoldDiff']
clf6.fit(df[cols6],df['blueWins'])
clf6.score(df[cols6],df['blueWins'])
0.7292236056281

Running a couple more scores, we can see that [red/blue]EliteMonsters did in fact have a role to play; it was just dependent on the other columns it was paired with. Note that including blueWardsPlaced, which according to ChatGPT had a very minor role, barely changes the score. Similarly, if we fit a regression with [red/blue]EliteMonsters and blueGoldDiff, the score slightly improves over blueGoldDiff alone, despite the massive difference in importance according to ChatGPT. As such, my initial claim isn't correct, because EliteMonsters did play a role in predicting who wins. Something must be going on between [red/blue]EliteMonsters and [red/blue]TotalExperience specifically. My theory is that slaying an elite monster grants experience, so part of its contribution toward the win is already a subset of [red/blue]TotalExperience. I have my doubts, though, because elite monsters have other roles besides granting experience. In addition, it's not simply that two strongly relevant columns overpowered [red/blue]EliteMonsters, because the score of [red/blue]TotalExperience plus [red/blue]EliteMonsters was exactly equal to [red/blue]TotalExperience's score alone. As a conclusion, [red/blue]EliteMonsters is not a column to be discarded, but the scoring between TotalExperience and EliteMonsters has a strange, unique connection.

Summary#


In this project, I explored how gold values directly relate to predicting the winner of a match. Afterwards, I computed the accuracy of the predictions and found ways to improve it by restricting blueGoldDiff and introducing other input columns. Finally, I finished the project by using ChatGPT to explore the importance of all the columns in one quick code block, which also supported my claim that gold is the most relevant factor in predicting the winner.

References#


  • What is the source of your dataset(s)?

My dataset came from Kaggle: https://www.kaggle.com/datasets/bobbyscience/league-of-legends-diamond-ranked-games-10-min?resource=download. Thank you to Yi Lan Ma for posting the dataset onto Kaggle, so I could use it.

  • List any other references that you found helpful.

Link to ChatGPT: https://openai.com/blog/chatgpt

ChatGPT assistance #1: In this line of code I asked ChatGPT to change the values 0 and 1 to whichever team wins. I had originally written my own code that worked, but it came with a "SettingWithCopyWarning." I wanted to remove the warning, as it was rather large and ugly, so I decided to change the code. The main reason I included either version was to help with the tooltips in the Altair graphs; I felt directly seeing who won instead of 0 or 1 was better. In addition, updating the code decreased the cell's run time. My original code:

df2['whoWins']=df['blueWins']
for i in range(len(df2['whoWins'])):
    if df2['whoWins'][i]==0:
        df2['whoWins'][i]='Red Wins'
    else:
        df2['whoWins'][i]='Blue Wins'

Exact ChatGPT code provided: df2['blueWins'] = df['blueWins'].replace({0: 'Red Wins', 1: 'Blue Wins'})

ChatGPT assistance #2: In these lines of code I used ChatGPT to help me rearrange the columns so that 'whoWins' would be first in the DataFrame. Although it wasn't necessary, I believe it made the DataFrame look cleaner. This was the exact code provided by ChatGPT:

columns = df.columns.tolist()
columns.insert(0, columns.pop(columns.index(column_to_move)))
df = df[columns]

ChatGPT assistance #3: In this line of code I used ChatGPT to give me the command of rename and inplace. I wanted to do this, because I planned on adding similar data to the DataFrame, so it would be confusing and error creating otherwise. The given version from ChatGPT would be df.rename(columns={'old_name': 'new_name'}, inplace=True)

ChatGPT assistance #4: Here I wanted to rescale the graph for a better visual effect. Code given by ChatGPT:

x=alt.X('blueTotalGold:Q', scale=alt.Scale(domain=[min_value_x, max_value_x])),
y=alt.Y('redTotalGold:Q', scale=alt.Scale(domain=[min_value_y, max_value_y])),

ChatGPT assistance #5: I wanted to create a range from 0 to 4800 originally, but found that the load time was too long and the difference in accuracy between numbers like 2000 and 2001 were minimal. As such, I decided to ask for help/a reminder of how to create a range with skipping values. Exact ChatGPT code: range_values = list(range(0, 90, 50))
