Examining Free Throw percentages in the NBA Playoffs

Author: Joseph Yecco, 61517010

Course Project, UC Irvine, Math 10, W22

Introduction

When an NBA player is fouled during an unsuccessful shot attempt, they are awarded 2 or 3 free throws depending on the spot of the foul; if the shot is successful, they are awarded 1 free throw. In this project, I aim to examine whether free throw percentage increases or decreases during the playoffs, and to determine whether any of the variables in the available data can explain where this increase/decrease stems from.

import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

Below are the two datasets we will use. The first is a dataframe containing records of over 600k free throw attempts from the 2006 through 2016 seasons, while the second lists all NBA All-Star selections from 2000 to 2016.

df0 = pd.read_csv('free_throws.csv')
all_stars = pd.read_csv('allstars.csv')

Main Portion

Now we move on to the main portion of our project. First, let's keep only the players in our dataframe who have taken more than 100 free throw attempts in the playoffs, so that we have a sufficient sample size to justify their free throw percentage. (The below cell may take a few minutes to run.)

total_player_list = df0["player"].value_counts().index  # All of the players in our dataframe
# Mask marking the players with more than 100 playoff free throw attempts
mask = [np.count_nonzero((df0["player"] == p) & (df0["playoffs"] == "playoffs")) > 100
        for p in total_player_list]
player_list = total_player_list[mask]
df2 = df0[df0['player'].isin(player_list)]  # New dataframe only contains players with >100 playoff attempts
df2
end_result game game_id period play player playoffs score season shot_made time
0 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 1 of 2 Andrew Bynum regular 0 - 1 2006 - 2007 1 11:45
1 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 2 of 2 Andrew Bynum regular 0 - 2 2006 - 2007 1 11:45
2 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 1 of 2 Andrew Bynum regular 18 - 12 2006 - 2007 1 7:26
3 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum misses free throw 2 of 2 Andrew Bynum regular 18 - 12 2006 - 2007 0 7:26
4 106 - 114 PHX - LAL 261031013.0 1.0 Shawn Marion makes free throw 1 of 1 Shawn Marion regular 21 - 12 2006 - 2007 1 7:18
... ... ... ... ... ... ... ... ... ... ... ...
618010 104 - 118 DAL - OKC 400874368.0 4.0 Russell Westbrook makes free throw 1 of 2 Russell Westbrook playoffs 98 - 102 2015 - 2016 1 6:56
618011 104 - 118 DAL - OKC 400874368.0 4.0 Russell Westbrook makes free throw 2 of 2 Russell Westbrook playoffs 98 - 103 2015 - 2016 1 6:56
618012 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes technical free throw Kevin Durant playoffs 103 - 112 2015 - 2016 1 2:48
618013 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes free throw 1 of 1 Kevin Durant playoffs 103 - 113 2015 - 2016 1 2:48
618016 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes technical free throw Kevin Durant playoffs 103 - 118 2015 - 2016 1 0:27

261646 rows × 11 columns
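The per-player loop above is what makes that cell slow. The same filter can be computed in one pass with `value_counts`; below is a sketch on a toy stand-in for `df0` (only the `player` and `playoffs` columns matter for the filter):

```python
import pandas as pd

# Toy stand-in for df0: one row per free throw attempt.
df0_toy = pd.DataFrame({
    "player":   ["A"] * 150 + ["B"] * 50,
    "playoffs": ["playoffs"] * 120 + ["regular"] * 30 + ["playoffs"] * 50,
})

# Count playoff attempts per player in one grouped pass, then keep those over 100.
playoff_counts = df0_toy.loc[df0_toy["playoffs"] == "playoffs", "player"].value_counts()
player_list = playoff_counts[playoff_counts > 100].index
df2_toy = df0_toy[df0_toy["player"].isin(player_list)]
```

On the real dataframe this produces the same `player_list` without scanning the 600k rows once per player.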

Below we create the main data table which we will use for the rest of the project.

main_df = pd.DataFrame()
main_df["Player"] = pd.Series(player_list)
main_df["Regular Season Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Season Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Percentage"] = (main_df["Regular Season Makes"]/main_df["Regular Season Attempts"]).round(2)
main_df["Playoff Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Percentage"] = (main_df["Playoff Makes"]/main_df["Playoff Attempts"]).round(2)
main_df["Change"] =  (main_df["Playoff Percentage"]-main_df["Regular Percentage"]).round(2)
main_df["All Star"] = main_df["Player"].isin(all_stars["Player"])
main_df
Player Regular Season Attempts Regular Season Makes Regular Percentage Playoff Attempts Playoff Makes Playoff Percentage Change All Star
0 LeBron James 6318 4697 0.74 1683 1260 0.75 0.01 True
1 Dwight Howard 6839 3821 0.56 889 481 0.54 -0.02 True
2 Kevin Durant 5226 4611 0.88 804 682 0.85 -0.03 True
3 Kobe Bryant 4829 4055 0.84 765 647 0.85 0.01 True
4 Dwyane Wade 4885 3738 0.77 709 538 0.76 -0.01 True
... ... ... ... ... ... ... ... ... ...
106 Bradley Beal 683 532 0.78 108 88 0.81 0.03 False
107 Harrison Barnes 656 484 0.74 119 90 0.76 0.02 False
108 Mickael Pietrus 631 418 0.66 143 97 0.68 0.02 False
109 Delonte West 623 515 0.83 106 90 0.85 0.02 False
110 Rasheed Wallace 533 411 0.77 106 85 0.80 0.03 True

111 rows × 9 columns
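The attempt and make counts above can also be built with a single `groupby` instead of one scan per player. A minimal sketch on toy data in the shape of `df2` (assuming the same `player`, `playoffs`, and `shot_made` columns):

```python
import pandas as pd

# Toy data in the shape of df2: one row per free throw attempt.
df2_toy = pd.DataFrame({
    "player":    ["A", "A", "A", "A", "B", "B"],
    "playoffs":  ["regular", "regular", "playoffs", "playoffs", "regular", "playoffs"],
    "shot_made": [1, 0, 1, 1, 1, 0],
})

# Attempts = row count, makes = sum of shot_made, per player and season type.
summary = df2_toy.groupby(["player", "playoffs"])["shot_made"].agg(
    Attempts="count", Makes="sum"
)
summary["Percentage"] = (summary["Makes"] / summary["Attempts"]).round(2)
```

Unstacking `summary` on `playoffs` would then give one row per player with the regular season and playoff columns side by side, like `main_df`.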

Now let’s take a look at the difference between regular season and playoff percentage. Note that the last column (“Change”) represents the change from the regular season to the playoffs.

print(f'The average change in free throw percentage between the regular season and playoffs is {100*(main_df["Change"].mean().round(5))}%') #Mean value -.0093
The average change in free throw percentage between the regular season and playoffs is -0.928%

Now let’s look at whether the regular season percentage has a direct impact on that change, i.e. whether a higher percentage yields more or less stability in the playoffs. We will use linear regression to check.

reg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(main_df[["Regular Percentage"]],main_df["Change"],test_size=0.2)
reg.fit(X_train,y_train)
print(f"The coefficient of our line is {reg.coef_.round(4)} and the intercept is {reg.intercept_.round(4)}")
The coefficient of our line is [-0.0248] and the intercept is 0.0097
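To make the fitted line concrete, we can plug two hypothetical regular season percentages into change = intercept + coef × percentage, using the coefficients printed above (note that `train_test_split` without a fixed `random_state` gives slightly different coefficients on each run):

```python
# Coefficients from the fit above (approximate; they vary with the train/test split).
coef, intercept = -0.0248, 0.0097

for pct in (0.60, 0.90):
    change = coef * pct + intercept
    print(f"Regular season {pct:.0%} shooter: predicted playoff change {change:+.4f}")
```

So the line predicts a 90% regular season shooter loses about 1.3 percentage points in the playoffs, while a 60% shooter loses only about half a point.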

These coefficients suggest a slightly higher degree of loss in free throw percentage for players who shoot better in the regular season. We want to check how strong the correlation is between regular season free throw percentage and the change that occurs in the playoffs.

main_df["pred"] = reg.predict(main_df[["Regular Percentage"]])
c1 = alt.Chart(main_df).mark_circle().encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'Change',
    color = "All Star"
)
c2 = alt.Chart(main_df).mark_line(color ='red').encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'pred'
)
(c1+c2).properties(
    title = "Change in Percentage by Regular Season Percentage",
    width = 600
)

From the above graph, it seems that there is not much of a linear relationship between regular season percentage and the change that occurs in the playoffs. Checking the score below confirms that linear regression is not a great predictor for change in playoff percentage.

reg.score(X_test,y_test,sample_weight=None)
0.03622750339721614
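The value returned by `score` is the coefficient of determination R² = 1 − SS_res/SS_tot; values near 0 mean the line explains almost none of the variance in “Change”. A small self-contained check of that formula with made-up numbers (not from our data):

```python
import numpy as np
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot, the quantity LinearRegression.score reports.
y_true = np.array([0.01, -0.02, -0.03, 0.01, -0.01])
y_pred = np.array([0.00, -0.01, -0.01, 0.00, -0.01])

ss_res = ((y_true - y_pred) ** 2).sum()              # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()       # total sum of squares
r2 = 1 - ss_res / ss_tot

assert np.isclose(r2, r2_score(y_true, y_pred))      # matches sklearn's computation
```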

Thus while there may be some correlation between regular season percentage and the change that occurs in the playoffs, it is too weak to be useful for prediction. From here, we will try to answer two follow-up questions:

1.) Are players who were at some point selected as All-Stars (and thus are more likely to have their team depend on them to perform well in the playoffs) more or less consistent with their regular season percentage in the playoffs?

2.) Can we accurately predict the change that will occur in playoff percentage based on a player’s regular season percentage and all-star status?

change_by_starstatus = main_df.groupby("All Star")["Change"].mean()
change_by_starstatus #Average change from regular season to playoffs for non-all-stars and all-stars
All Star
False   -0.010227
True    -0.008657
Name: Change, dtype: float64

It appears that on average, non-all-stars face a decrease of approximately 1.02% in their playoff free throw percentage, while all-stars decrease about 0.87%.

increase = (100*(abs(change_by_starstatus.loc[True]-change_by_starstatus.loc[False]))/abs(change_by_starstatus.loc[False])).round(2)
print(f"The free throw percentage of All-Stars changes {increase}% less than that of non-All-Stars when going from the regular season to the playoffs")
The free throw percentage of All-Stars changes 15.35% less than that of non-All-Stars when going from the regular season to the playoffs

From the above, it appears that All-Stars are slightly more stable in maintaining their free throw percentage than non-all-stars, with about 15% less variation (note that “X% less than” is measured relative to the non-All-Star change, so we divide by that value). Now let’s look at the second question. We will use a K-nearest neighbors regressor to try to predict the change. First, we need to scale the input variables so that one is not weighted more heavily than the other.

var_cols = ["Regular Percentage", "All Star"]
scaler = StandardScaler()
scaler.fit(main_df[var_cols])
X_scaled = scaler.transform(main_df[var_cols])
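As a quick illustration of what the scaler does (toy numbers, not our data): each column is shifted to mean 0 and rescaled to unit variance, so a 0–1 percentage and a boolean flag end up on comparable scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on different scales: a percentage and a 0/1 flag.
X = np.array([[0.55, 0.0],
              [0.70, 1.0],
              [0.90, 0.0],
              [0.65, 1.0]])
X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```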

Below we set up a train-test split, then try to determine the optimal number of neighbors to use to obtain the best possible predictions without overfitting.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, main_df["Change"], test_size = 0.2)
def get_scores(k):
    clf = KNeighborsRegressor(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
scores = [get_scores(k) for k in range(1, 20)]  # fit each k once, reuse both errors
error_df = pd.DataFrame({
    "K-inverse": [1/k for k in range(1, 20)],
    "Training Error": [s[0] for s in scores],
    "Test Error": [s[1] for s in scores],
})
e1 = alt.Chart(error_df).mark_line(color="Blue").encode(
    x = "K-inverse",
    y = "Training Error"
)
e2 = alt.Chart(error_df).mark_line(color="Orange").encode(
    x = "K-inverse",
    y = "Test Error"
)
(e1+e2).properties(
    title = "Training and test error by number of Neighbors",
    width = 500
)

Based on the above, we want to avoid low k values for our regressor, as these tend to cause overfitting. This is seen on the right-hand side of the graph, where a low k value leads to a high value for 1/k. There, the training error (blue line) is low while the test error (orange line) is high, suggesting overfitting. Thus we will want a k value of at least 10 (where K-inverse = 0.1) to avoid this problem and minimize overfitting.
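The k = 1 extreme makes the overfitting mechanism easy to see: each training point is its own nearest neighbor, so the training error is exactly zero even when the targets are pure noise. A sketch on random data (hypothetical, not our dataframe):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((30, 1))   # 30 random points, one feature
y = rng.random(30)        # pure-noise targets

# With k=1 the model memorizes the training set: predicting on the
# training inputs reproduces every target exactly (zero training error),
# while nothing learned here would generalize to new points.
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
train_preds = knn1.predict(X)
```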

def score(n):
    clf2 = KNeighborsRegressor(n_neighbors=n)
    clf2.fit(X_train, y_train)
    return clf2.score(X_scaled, main_df["Change"])
for i in range(1, 25):
    print(score(i))
-0.291718600191754
-0.031479206615532274
-0.014573346116970365
-0.005247408317353708
0.025450862895493698
0.01931228028123977
0.030959922319838906
0.03225034455896425
0.0100371524448708
-0.00047662991371000274
-0.008002484093088347
-0.017972994566954448
-0.023313425088076656
-0.020046116481108545
0.0004058005752635152
0.011955517250419323
-0.007930299044213163
0.011344867458541796
0.0039943111045007695
0.0005227558724834047
0.007320751886564891
0.0015322288297425768
0.002006750829637083
-0.00316997368368499

Looking at this list, we see that all of the scores for k>9 are below 0.05, indicating either that k-nearest neighbors may not be a good predictor for the change in free throw percentage in the playoffs, or that the combination of independent variables used (Regular Percentage and All-Star status) is not sufficient to predict our dependent variable (change in free throw percentage).

Summary

The results from this examination of NBA free throw data demonstrated a minor dropoff in free throw percentage during the playoffs. The dropoff was slightly smaller for All-Star caliber players than for non-all-stars. There did not seem to be any strong linear correlation between regular season free throw percentage and the amount that a player’s percentage would drop in the playoffs, although players with higher regular season percentages showed a slightly larger dropoff. Finally, we found that k-nearest neighbors could not successfully predict the change based on a player’s regular season percentage and all-star status.

While the results of this investigation did not yield any major insights, they did reflect the idea that there are many, many factors affecting a player’s in-game performance. While the datasets we used were thorough, it would be hard to fully isolate any particular variable as having a major impact across so many players.