Examining Free Throw percentages in the NBA Playoffs
Author: Joseph Yecco, 61517010
Course Project, UC Irvine, Math 10, W22
Introduction
When an NBA player is fouled during a shot attempt and the shot misses, the player is awarded 2 or 3 free throws depending on where on the court the foul occurred; if the shot goes in, the player is awarded 1 free throw. In this project, I examine whether free throw percentage increases or decreases during the playoffs, and whether any of the variables in the available data help explain where that change comes from.
import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.preprocessing import StandardScaler
Below are the two datasets we will use. The first is a dataframe containing records of over 600k free throw attempts from 2006 to 2016, while the second lists all NBA All-Star selections from 2000 to 2016.
df0 = pd.read_csv('free_throws.csv')
all_stars = pd.read_csv('allstars.csv')
Main Portion
Now we move on to the main portion of our project. First, let’s restrict our dataframe to players who have taken more than 100 free throw attempts in the playoffs, so that each player’s playoff free throw percentage is based on a sufficient sample size. (The cell below may take a few minutes to run.)
total_player_list = df0["player"].value_counts().index  # All players, ordered by total free throw attempts
# For each player, check whether they have more than 100 playoff free throw attempts
has_enough = np.array([np.count_nonzero((df0["player"] == p) & (df0["playoffs"] == "playoffs")) > 100 for p in total_player_list])
player_list = total_player_list[has_enough]
df2 = df0[df0["player"].isin(player_list)]  # New dataframe only contains info for players with >100 playoff attempts
df2
 | end_result | game | game_id | period | play | player | playoffs | score | season | shot_made | time |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 106 - 114 | PHX - LAL | 261031013.0 | 1.0 | Andrew Bynum makes free throw 1 of 2 | Andrew Bynum | regular | 0 - 1 | 2006 - 2007 | 1 | 11:45 |
1 | 106 - 114 | PHX - LAL | 261031013.0 | 1.0 | Andrew Bynum makes free throw 2 of 2 | Andrew Bynum | regular | 0 - 2 | 2006 - 2007 | 1 | 11:45 |
2 | 106 - 114 | PHX - LAL | 261031013.0 | 1.0 | Andrew Bynum makes free throw 1 of 2 | Andrew Bynum | regular | 18 - 12 | 2006 - 2007 | 1 | 7:26 |
3 | 106 - 114 | PHX - LAL | 261031013.0 | 1.0 | Andrew Bynum misses free throw 2 of 2 | Andrew Bynum | regular | 18 - 12 | 2006 - 2007 | 0 | 7:26 |
4 | 106 - 114 | PHX - LAL | 261031013.0 | 1.0 | Shawn Marion makes free throw 1 of 1 | Shawn Marion | regular | 21 - 12 | 2006 - 2007 | 1 | 7:18 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
618010 | 104 - 118 | DAL - OKC | 400874368.0 | 4.0 | Russell Westbrook makes free throw 1 of 2 | Russell Westbrook | playoffs | 98 - 102 | 2015 - 2016 | 1 | 6:56 |
618011 | 104 - 118 | DAL - OKC | 400874368.0 | 4.0 | Russell Westbrook makes free throw 2 of 2 | Russell Westbrook | playoffs | 98 - 103 | 2015 - 2016 | 1 | 6:56 |
618012 | 104 - 118 | DAL - OKC | 400874368.0 | 4.0 | Kevin Durant makes technical free throw | Kevin Durant | playoffs | 103 - 112 | 2015 - 2016 | 1 | 2:48 |
618013 | 104 - 118 | DAL - OKC | 400874368.0 | 4.0 | Kevin Durant makes free throw 1 of 1 | Kevin Durant | playoffs | 103 - 113 | 2015 - 2016 | 1 | 2:48 |
618016 | 104 - 118 | DAL - OKC | 400874368.0 | 4.0 | Kevin Durant makes technical free throw | Kevin Durant | playoffs | 103 - 118 | 2015 - 2016 | 1 | 0:27 |
261646 rows × 11 columns
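The loop above rescans the full dataframe once per player, which is why the cell is slow. As a rough sketch (not the code used to produce the results in this project), the same filter can be written with `value_counts`. The function name `players_with_min_playoff_attempts` and the toy data are hypothetical; only the column names `player` and `playoffs` come from the dataset.

```python
import pandas as pd

def players_with_min_playoff_attempts(df, min_attempts=100):
    """Return players with more than `min_attempts` playoff free throws,
    ordered by total attempts (most first), like the loop-based version."""
    # All players, ordered by total number of free throw attempts
    total_order = df["player"].value_counts().index
    # Number of playoff attempts per player
    playoff_counts = df.loc[df["playoffs"] == "playoffs", "player"].value_counts()
    qualified = playoff_counts[playoff_counts > min_attempts].index
    # Keep the total-attempt ordering, drop players who do not qualify
    return total_order[total_order.isin(qualified)]

# Toy data (hypothetical): A has 3 playoff attempts, B has 1, C has 3
toy = pd.DataFrame({
    "player": ["A"] * 5 + ["B"] * 4 + ["C"] * 3,
    "playoffs": ["regular"] * 2 + ["playoffs"] * 3
              + ["regular"] * 3 + ["playoffs"] * 1
              + ["playoffs"] * 3,
})
toy_players = players_with_min_playoff_attempts(toy, min_attempts=2)
print(list(toy_players))  # ['A', 'C']
```

On the real data this would be called as `players_with_min_playoff_attempts(df0)`, and it only makes one pass over the rows instead of one pass per player.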
Below we create the main dataframe that we will use for the rest of the project.
main_df = pd.DataFrame()
main_df["Player"] = pd.Series(player_list)
main_df["Regular Season Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Season Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Percentage"] = (main_df["Regular Season Makes"]/main_df["Regular Season Attempts"]).round(2)
main_df["Playoff Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Percentage"] = (main_df["Playoff Makes"]/main_df["Playoff Attempts"]).round(2)
main_df["Change"] = (main_df["Playoff Percentage"]-main_df["Regular Percentage"]).round(2)
main_df["All Star"]= main_df.iloc[:,0].isin(all_stars["Player"])
main_df
 | Player | Regular Season Attempts | Regular Season Makes | Regular Percentage | Playoff Attempts | Playoff Makes | Playoff Percentage | Change | All Star |
---|---|---|---|---|---|---|---|---|---|
0 | LeBron James | 6318 | 4697 | 0.74 | 1683 | 1260 | 0.75 | 0.01 | True |
1 | Dwight Howard | 6839 | 3821 | 0.56 | 889 | 481 | 0.54 | -0.02 | True |
2 | Kevin Durant | 5226 | 4611 | 0.88 | 804 | 682 | 0.85 | -0.03 | True |
3 | Kobe Bryant | 4829 | 4055 | 0.84 | 765 | 647 | 0.85 | 0.01 | True |
4 | Dwyane Wade | 4885 | 3738 | 0.77 | 709 | 538 | 0.76 | -0.01 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
106 | Bradley Beal | 683 | 532 | 0.78 | 108 | 88 | 0.81 | 0.03 | False |
107 | Harrison Barnes | 656 | 484 | 0.74 | 119 | 90 | 0.76 | 0.02 | False |
108 | Mickael Pietrus | 631 | 418 | 0.66 | 143 | 97 | 0.68 | 0.02 | False |
109 | Delonte West | 623 | 515 | 0.83 | 106 | 90 | 0.85 | 0.02 | False |
110 | Rasheed Wallace | 533 | 411 | 0.77 | 106 | 85 | 0.80 | 0.03 | True |
111 rows × 9 columns
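For comparison, the per-player counts can also be obtained with a single `groupby` rather than one list comprehension per column. This is only a sketch under stated assumptions: the function name `summarize_free_throws` and the toy rows are hypothetical, and unlike the table above it computes the change from unrounded percentages, so values on the real data could differ slightly in the second decimal place.

```python
import pandas as pd

def summarize_free_throws(df):
    """Per-player free throw percentage in the regular season and playoffs."""
    # count = attempts, sum = makes (shot_made is 1 for a make, 0 for a miss)
    grouped = df.groupby(["player", "playoffs"])["shot_made"].agg(["count", "sum"])
    summary = grouped.unstack("playoffs")
    out = pd.DataFrame({
        "Regular Percentage": summary[("sum", "regular")] / summary[("count", "regular")],
        "Playoff Percentage": summary[("sum", "playoffs")] / summary[("count", "playoffs")],
    })
    out["Change"] = out["Playoff Percentage"] - out["Regular Percentage"]
    return out

# Toy data (hypothetical): A goes 3/4 then 1/2, B goes 1/2 then 2/2
toy = pd.DataFrame({
    "player":    ["A"] * 6 + ["B"] * 4,
    "playoffs":  ["regular"] * 4 + ["playoffs"] * 2 + ["regular"] * 2 + ["playoffs"] * 2,
    "shot_made": [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})
toy_summary = summarize_free_throws(toy)
print(toy_summary)
```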
Now let’s take a look at the difference between regular season and playoff percentage. Note that the “Change” column represents the change from the regular season to the playoffs, computed from the percentages after they have been rounded to two decimal places.
print(f'The average change in free throw percentage between the regular season and playoffs is {100*(main_df["Change"].mean().round(5))}%') #Mean value -.0093
The average change in free throw percentage between the regular season and playoffs is -0.928%
Now let’s look at whether the regular season percentage is related to that change, i.e. whether a higher regular season percentage goes along with more stability in the playoffs. We will use linear regression to check.
reg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(main_df[["Regular Percentage"]],main_df["Change"],test_size=0.2)
reg.fit(X_train,y_train)
print(f"The coefficient of our line is {reg.coef_.round(4)} and the intercept is {reg.intercept_.round(4)}")
The coefficient of our line is [-0.0248] and the intercept is 0.0097
These coefficients suggest a slightly larger loss in free throw percentage for players who shoot better in the regular season. We want to check how strong the correlation is between regular season free throw percentage and the change that occurs in the playoffs.
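To get a feel for the size of this effect, the fitted line can be evaluated by hand. The sketch below hard-codes the coefficient and intercept printed above (these values depend on the random train/test split, so a rerun would give somewhat different numbers); the helper name `predicted_change` is hypothetical.

```python
# Values printed by the fitted model above; they vary with the random split
coef, intercept = -0.0248, 0.0097

def predicted_change(regular_pct):
    """Predicted playoff change for a given regular season percentage."""
    return intercept + coef * regular_pct

for pct in (0.50, 0.90):
    print(f"{pct:.2f} -> {predicted_change(pct):+.4f}")
# 0.50 -> -0.0027
# 0.90 -> -0.0126
```

Under this fit, a 50% shooter is predicted to lose about 0.3 percentage points and a 90% shooter about 1.3, so the predicted difference across that whole range is only about one percentage point.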
main_df["pred"] = reg.predict(main_df[["Regular Percentage"]])
c1 = alt.Chart(main_df).mark_circle().encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'Change',
    color = "All Star"
)
c2 = alt.Chart(main_df).mark_line(color ='red').encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'pred'
)
(c1+c2).properties(
    title = "Change in Percentage by Regular Season Percentage",
    width = 600
)
From the above graph, it seems that there is not much of a linear relationship between regular season percentage and the change that occurs in the playoffs. The R² score on the test set below confirms that linear regression is not a great predictor of the change in playoff percentage.
reg.score(X_test,y_test,sample_weight=None)
0.03622750339721614
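The score reported above is the coefficient of determination R², which compares the model's squared error with the squared error of always predicting the mean. As a minimal sketch on synthetic data (the data and variable names here are made up for illustration, not taken from the free throw dataset), it can be computed by hand and checked against `LinearRegression.score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a weak negative trend buried in noise
rng = np.random.default_rng(0)
x = rng.uniform(0.4, 0.95, size=100).reshape(-1, 1)
y = -0.02 * x.ravel() + rng.normal(0, 0.03, size=100)

model = LinearRegression().fit(x, y)
pred = model.predict(x)

ss_res = np.sum((y - pred) ** 2)       # squared error of the model
ss_tot = np.sum((y - y.mean()) ** 2)   # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(x, y))    # the two values agree
```

A value near 0 means the line explains almost none of the variation in the target, which matches what the scatter plot suggests.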
Thus, while there may be a slight correlation between regular season percentage and the change that occurs in the playoffs, the relationship is too weak for regular season percentage alone to be a useful predictor of that change. From here, we will try to answer two follow-up questions:
1.) Are players who were at some point selected as all-stars (and thus are more likely to have their team depend on them to perform well in the playoffs) more or less consistent with their regular season percentage in the playoffs?
2.) Can we accurately predict the change that will occur in playoff percentage based on a player’s regular season percentage and all-star status?
change_by_starstatus = main_df.groupby('All Star')["Change"].mean()  # Select "Change" first so non-numeric columns are not averaged
change_by_starstatus #Average change from regular season to playoffs for non-all-stars and all-stars
All Star
False -0.010227
True -0.008657
Name: Change, dtype: float64
It appears that on average, non-all-stars see a decrease of approximately 1.02 percentage points in their playoff free throw percentage, while all-stars see a decrease of about 0.87 percentage points.
# iloc[0] is the non-all-star average and iloc[1] the all-star average; the difference is taken relative to the all-star average
relative_diff = (100*(abs(change_by_starstatus.iloc[1]-change_by_starstatus.iloc[0]))/abs(change_by_starstatus.iloc[1])).round(2)
print(f"The average drop in free throw percentage for non-All-Stars is {relative_diff}% larger than that for All-Stars when going from the regular season to the playoffs")
The average drop in free throw percentage for non-All-Stars is 18.14% larger than that for All-Stars when going from the regular season to the playoffs
From the above, it appears that all-stars are slightly better at maintaining their free throw percentage than non-all-stars: the average drop for non-all-stars is about 18% larger than the average drop for all-stars. Now let’s look at the second question. We will use a K-nearest neighbors regressor to try to predict the change. First, we need to scale the input variables so that neither one dominates the distance calculation.
var_cols = ["Regular Percentage", "All Star"]
scaler = StandardScaler()
scaler.fit(main_df[var_cols])
StandardScaler()
X_scaled = scaler.transform(main_df[var_cols])
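As a small illustration of what the scaler does, the sketch below standardizes a made-up two-column array (a percentage-like column and a 0/1 column standing in for All Star) and checks the result against the formula z = (x - mean) / std. The sample values are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [regular season percentage, all-star flag]
X = np.array([[0.56, 1.0],
              [0.74, 1.0],
              [0.88, 0.0],
              [0.80, 0.0]])

scaled = StandardScaler().fit_transform(X)

# Same computation by hand: subtract each column's mean, divide by its
# population standard deviation (NumPy's default, ddof=0)
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))  # True
```

Before scaling, two players differ by 1 in the all-star column but by well under 1 in the percentage column, so the all-star flag would dominate the distances used by k-nearest neighbors. After scaling, both columns have mean 0 and standard deviation 1.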
Below we split the data into training and test sets, then try to determine the optimal number of neighbors to use to obtain the best possible predictions without overfitting.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, main_df["Change"], test_size = 0.2)
def get_scores(k):
    # Fit a k-nearest neighbors regressor and return (training MAE, test MAE)
    clf = KNeighborsRegressor(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
scores = [get_scores(k) for k in range(1,20)]  # Compute each (train, test) pair once
error_df = pd.DataFrame({"K-inverse":[1/k for k in range(1,20)],"Training Error":[s[0] for s in scores],"Test Error":[s[1] for s in scores]})
e1 = alt.Chart(error_df).mark_line(color="Blue").encode(
    x = "K-inverse",
    y = "Training Error"
)
e2 = alt.Chart(error_df).mark_line(color="Orange").encode(
    x = "K-inverse",
    y = "Test Error"
)
(e1+e2).properties(
    title = "Training and test error by number of Neighbors",
    width = 500
)
Based on the above, we want to avoid low k values for our regressor, as these tend to cause overfitting. This is seen on the right-hand side of the graph, where a low k value leads to a high value of 1/k. There, the training error (blue line) is low while the test error (orange line) is high, suggesting overfitting. Thus we will want a k value of at least 10 (where K-inverse = 0.1) to limit this problem.
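The extreme case k = 1 shows why low k overfits: each training point is its own nearest neighbor, so the model reproduces the training targets exactly, even when they are pure noise. The sketch below uses synthetic data with distinct inputs (an assumption that does not hold for our dataframe, where many players share the same rounded percentage), and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(80, 1))   # distinct inputs
y = rng.normal(0, 1, size=80)         # targets are pure noise

X_tr, X_te, y_tr, y_te = X[:60], X[60:], y[:60], y[60:]

errors = {}
for k in (1, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    errors[k] = (
        mean_absolute_error(y_tr, knn.predict(X_tr)),  # training error
        mean_absolute_error(y_te, knn.predict(X_te)),  # test error
    )

print(errors)  # the k=1 training error is exactly 0
```

A training error of zero says nothing about how well the model generalizes, which is why the comparison with the test error matters.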
def score(n):
    # Print the R² score of a k-nearest neighbors regressor with n neighbors.
    # The model is fit on the training set and scored on the full dataset (training and test rows combined).
    clf2 = KNeighborsRegressor(n_neighbors=n)
    clf2.fit(X_train, y_train)
    print(clf2.score(X_scaled,main_df["Change"]))
for i in range(1,25):
    score(i)
-0.291718600191754
-0.031479206615532274
-0.014573346116970365
-0.005247408317353708
0.025450862895493698
0.01931228028123977
0.030959922319838906
0.03225034455896425
0.0100371524448708
-0.00047662991371000274
-0.008002484093088347
-0.017972994566954448
-0.023313425088076656
-0.020046116481108545
0.0004058005752635152
0.011955517250419323
-0.007930299044213163
0.011344867458541796
0.0039943111045007695
0.0005227558724834047
0.007320751886564891
0.0015322288297425768
0.002006750829637083
-0.00316997368368499
Looking at this list, none of the R² scores exceeds 0.04, and many are negative, meaning the model does no better than always predicting the mean change. This indicates either that k-nearest neighbors is not a good predictor of the change in free throw percentage in the playoffs, or that the independent variables used (Regular Percentage and All-Star status) are not sufficient to predict our dependent variable (Change in free throw percentage).
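One way to read these scores is against a baseline that ignores the inputs entirely. The sketch below, on synthetic data with hypothetical variable names, uses scikit-learn's `DummyRegressor` to show that always predicting the mean gives an R² of 0 on the data it was fit to, so scores at or below 0 indicate a model with essentially no predictive value.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Synthetic stand-ins for two scaled features and a small, noisy change
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 2))
y = rng.normal(-0.01, 0.03, size=50)

# Baseline that always predicts the mean of y, whatever the inputs are
baseline = DummyRegressor(strategy="mean").fit(X, y)
baseline_r2 = baseline.score(X, y)

print(round(baseline_r2, 6))  # 0 up to floating point error
```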
Summary
The results from this examination of NBA free throw data demonstrated a minor dropoff in free throw percentage during the playoffs, just under one percentage point on average. The dropoff was slightly smaller for All-Star caliber players than for non-all-stars. There did not seem to be any strong linear correlation between regular season free throw percentage and the amount that a player’s percentage would drop in the playoffs, although there was a slightly larger dropoff for players with higher regular season percentages. Finally, we found that k-nearest neighbors could not be successfully used to predict the change based on a player’s regular season percentage and all-star status.
While this investigation did not yield any major insights, it did reflect the idea that there are many, many factors affecting a player’s performance in a game. While the datasets we used were thorough, it would be hard to attribute a major effect across so many players to any single variable.
References
Free throw Dataset was found on Kaggle at https://www.kaggle.com/sebastianmantey/nba-free-throws/code
All star dataset was found at https://www.kaggle.com/fmejia21/nba-all-star-game-20002016