Examining Free Throw percentages in the NBA Playoffs

Author: Joseph Yecco, 61517010

Course Project, UC Irvine, Math 10, W22

Introduction

When an NBA player is fouled during an unsuccessful shot attempt, they are awarded 2 or 3 free throws depending on the spot of the foul; if the shot is successful, they are awarded 1 free throw. In this project, I aim to examine whether free throw percentage increases or decreases during the playoffs, and to determine whether any of the variables in the available data can explain where this increase/decrease stems from.

import pandas as pd
import numpy as np
import altair as alt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

Below are the two datasets we will use. The first is a dataframe containing records of over 600k free throw attempts from the 2006 through 2016 seasons, while the second lists all NBA All-Star selections from 2000 to 2016.

df0 = pd.read_csv('free_throws.csv')
all_stars = pd.read_csv('allstars.csv')

Main Portion

Now we move on to the main portion of our project. First, let's keep only the players in our dataframe who have taken more than 100 free throw attempts in the playoffs, so that we have a sufficient sample size to justify their free throw percentage. (The below cell may take a few minutes to run.)

total_player_list = df0["player"].value_counts().index  # All of the players in our dataframe
# Mask marking the players with more than 100 playoff free throw attempts
mask = [np.count_nonzero((df0["player"] == p) & (df0["playoffs"] == "playoffs")) > 100
        for p in total_player_list]
player_list = total_player_list[mask]
df2 = df0[df0['player'].isin(player_list)]  # New dataframe only contains players with >100 playoff attempts
df2
end_result game game_id period play player playoffs score season shot_made time
0 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 1 of 2 Andrew Bynum regular 0 - 1 2006 - 2007 1 11:45
1 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 2 of 2 Andrew Bynum regular 0 - 2 2006 - 2007 1 11:45
2 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum makes free throw 1 of 2 Andrew Bynum regular 18 - 12 2006 - 2007 1 7:26
3 106 - 114 PHX - LAL 261031013.0 1.0 Andrew Bynum misses free throw 2 of 2 Andrew Bynum regular 18 - 12 2006 - 2007 0 7:26
4 106 - 114 PHX - LAL 261031013.0 1.0 Shawn Marion makes free throw 1 of 1 Shawn Marion regular 21 - 12 2006 - 2007 1 7:18
... ... ... ... ... ... ... ... ... ... ... ...
618010 104 - 118 DAL - OKC 400874368.0 4.0 Russell Westbrook makes free throw 1 of 2 Russell Westbrook playoffs 98 - 102 2015 - 2016 1 6:56
618011 104 - 118 DAL - OKC 400874368.0 4.0 Russell Westbrook makes free throw 2 of 2 Russell Westbrook playoffs 98 - 103 2015 - 2016 1 6:56
618012 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes technical free throw Kevin Durant playoffs 103 - 112 2015 - 2016 1 2:48
618013 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes free throw 1 of 1 Kevin Durant playoffs 103 - 113 2015 - 2016 1 2:48
618016 104 - 118 DAL - OKC 400874368.0 4.0 Kevin Durant makes technical free throw Kevin Durant playoffs 103 - 118 2015 - 2016 1 0:27

261646 rows × 11 columns
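The per-player loop above is what makes that cell slow. The same filter can be computed in one pass with `value_counts`; below is a sketch on a toy stand-in for `df0` (only the `player` and `playoffs` columns matter for the filter):

```python
import pandas as pd

# Toy stand-in for df0: one row per free throw attempt.
df0_toy = pd.DataFrame({
    "player":   ["A"] * 150 + ["B"] * 50,
    "playoffs": ["playoffs"] * 120 + ["regular"] * 30 + ["playoffs"] * 50,
})

# Count playoff attempts per player in one grouped pass, then keep those over 100.
playoff_counts = df0_toy.loc[df0_toy["playoffs"] == "playoffs", "player"].value_counts()
player_list = playoff_counts[playoff_counts > 100].index
df2_toy = df0_toy[df0_toy["player"].isin(player_list)]
```

On the real dataframe this produces the same `player_list` without scanning the 600k rows once per player.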

Below we create the main data table which we will use for the rest of the project.

main_df = pd.DataFrame()
main_df["Player"] = pd.Series(player_list)
main_df["Regular Season Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Season Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='regular')) for i in player_list])
main_df["Regular Percentage"] = (main_df["Regular Season Makes"]/main_df["Regular Season Attempts"]).round(2)
main_df["Playoff Attempts"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Makes"] = pd.Series([np.count_nonzero((df2["player"] == i) & (df2["shot_made"] == 1) & (df2["playoffs"]=='playoffs')) for i in player_list])
main_df["Playoff Percentage"] = (main_df["Playoff Makes"]/main_df["Playoff Attempts"]).round(2)
main_df["Change"] =  (main_df["Playoff Percentage"]-main_df["Regular Percentage"]).round(2)
main_df["All Star"] = main_df["Player"].isin(all_stars["Player"])
main_df
Player Regular Season Attempts Regular Season Makes Regular Percentage Playoff Attempts Playoff Makes Playoff Percentage Change All Star
0 LeBron James 6318 4697 0.74 1683 1260 0.75 0.01 True
1 Dwight Howard 6839 3821 0.56 889 481 0.54 -0.02 True
2 Kevin Durant 5226 4611 0.88 804 682 0.85 -0.03 True
3 Kobe Bryant 4829 4055 0.84 765 647 0.85 0.01 True
4 Dwyane Wade 4885 3738 0.77 709 538 0.76 -0.01 True
... ... ... ... ... ... ... ... ... ...
106 Bradley Beal 683 532 0.78 108 88 0.81 0.03 False
107 Harrison Barnes 656 484 0.74 119 90 0.76 0.02 False
108 Mickael Pietrus 631 418 0.66 143 97 0.68 0.02 False
109 Delonte West 623 515 0.83 106 90 0.85 0.02 False
110 Rasheed Wallace 533 411 0.77 106 85 0.80 0.03 True

111 rows × 9 columns
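The attempt and make counts above can also be built with a single `groupby` instead of one scan per player. A minimal sketch on toy data in the shape of `df2` (assuming the same `player`, `playoffs`, and `shot_made` columns):

```python
import pandas as pd

# Toy data in the shape of df2: one row per free throw attempt.
df2_toy = pd.DataFrame({
    "player":    ["A", "A", "A", "A", "B", "B"],
    "playoffs":  ["regular", "regular", "playoffs", "playoffs", "regular", "playoffs"],
    "shot_made": [1, 0, 1, 1, 1, 0],
})

# Attempts = row count, makes = sum of shot_made, per player and season type.
summary = df2_toy.groupby(["player", "playoffs"])["shot_made"].agg(
    Attempts="count", Makes="sum"
)
summary["Percentage"] = (summary["Makes"] / summary["Attempts"]).round(2)
```

Unstacking `summary` on `playoffs` would then give one row per player with the regular season and playoff columns side by side, like `main_df`.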

Now let’s take a look at the difference between regular season and playoff percentage. Note that the last column (“Change”) represents the change from the regular season to the playoffs.

print(f'The average change in free throw percentage between the regular season and playoffs is {100*(main_df["Change"].mean().round(5))}%') #Mean value -.0093
The average change in free throw percentage between the regular season and playoffs is -0.928%

Now let’s look at whether the regular season percentage has a direct impact on that change, i.e. whether a higher percentage yields more or less stability in the playoffs. We will use linear regression to check.

reg = LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(main_df[["Regular Percentage"]],main_df["Change"],test_size=0.2)
reg.fit(X_train,y_train)
print(f"The coefficient of our line is {reg.coef_.round(4)} and the intercept is {reg.intercept_.round(4)}")
The coefficient of our line is [-0.0248] and the intercept is 0.0097
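To make the fitted line concrete, we can plug two hypothetical regular season percentages into change = intercept + coef × percentage, using the coefficients printed above (note that `train_test_split` without a fixed `random_state` gives slightly different coefficients on each run):

```python
# Coefficients from the fit above (approximate; they vary with the train/test split).
coef, intercept = -0.0248, 0.0097

for pct in (0.60, 0.90):
    change = coef * pct + intercept
    print(f"Regular season {pct:.0%} shooter: predicted playoff change {change:+.4f}")
```

So the line predicts a 90% regular season shooter loses about 1.3 percentage points in the playoffs, while a 60% shooter loses only about half a point.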

These coefficients suggest a slightly higher degree of loss in free throw percentage for players who shoot better in the regular season. We want to check how strong the correlation is between regular season free throw percentage and the change that occurs in the playoffs.

main_df["pred"] = reg.predict(main_df[["Regular Percentage"]])
c1 = alt.Chart(main_df).mark_circle().encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'Change',
    color = "All Star"
)
c2 = alt.Chart(main_df).mark_line(color ='red').encode(
    x = alt.X("Regular Percentage", scale = alt.Scale(domain=(0.35,1.0))),
    y = 'pred'
)
(c1+c2).properties(
    title = "Change in Percentage by Regular Season Percentage",
    width = 600
)

From the above graph, it seems that there is not much of a linear relationship between regular season percentage and the change that occurs in the playoffs. Checking the score below confirms that linear regression is not a great predictor for change in playoff percentage.

reg.score(X_test,y_test,sample_weight=None)
0.03622750339721614
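The value returned by `score` is the coefficient of determination R² = 1 − SS_res/SS_tot; values near 0 mean the line explains almost none of the variance in “Change”. A small self-contained check of that formula with made-up numbers (not from our data):

```python
import numpy as np
from sklearn.metrics import r2_score

# R^2 = 1 - SS_res / SS_tot, the quantity LinearRegression.score reports.
y_true = np.array([0.01, -0.02, -0.03, 0.01, -0.01])
y_pred = np.array([0.00, -0.01, -0.01, 0.00, -0.01])

ss_res = ((y_true - y_pred) ** 2).sum()              # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()       # total sum of squares
r2 = 1 - ss_res / ss_tot

assert np.isclose(r2, r2_score(y_true, y_pred))      # matches sklearn's computation
```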

Thus while there may be some correlation between regular season percentage and the change that occurs in the playoffs, it is too weak to be useful for prediction. From here, we will try to answer two follow-up questions:

1.) Are players who were at some point selected as All-Stars (and thus are more likely to have their team depend on them to perform well in the playoffs) more or less consistent with their regular season percentage in the playoffs?

2.) Can we accurately predict the change that will occur in playoff percentage based on a player’s regular season percentage and all-star status?

change_by_starstatus = main_df.groupby("All Star")["Change"].mean()
change_by_starstatus #Average change from regular season to playoffs for non-all-stars and all-stars
All Star
False   -0.010227
True    -0.008657
Name: Change, dtype: float64

It appears that on average, non-all-stars face a decrease of approximately 1.02% in their playoff free throw percentage, while all-stars decrease about 0.87%.

increase = (100*(abs(change_by_starstatus.loc[True]-change_by_starstatus.loc[False]))/abs(change_by_starstatus.loc[False])).round(2)
print(f"The free throw percentage of All-Stars changes {increase}% less than that of non-All-Stars when going from the regular season to the playoffs")
The free throw percentage of All-Stars changes 15.35% less than that of non-All-Stars when going from the regular season to the playoffs

From the above, it appears that All-Stars are slightly more stable in maintaining their free throw percentage than non-all-stars, with about 15% less variation (note that “X% less than” is measured relative to the non-All-Star change, so we divide by that value). Now let’s look at the second question. We will use a K-nearest neighbors regressor to try to predict the change. First, we need to scale the input variables so that one is not weighted more heavily than the other.

var_cols = ["Regular Percentage", "All Star"]
scaler = StandardScaler()
scaler.fit(main_df[var_cols])
X_scaled = scaler.transform(main_df[var_cols])
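As a quick illustration of what the scaler does (toy numbers, not our data): each column is shifted to mean 0 and rescaled to unit variance, so a 0–1 percentage and a boolean flag end up on comparable scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on different scales: a percentage and a 0/1 flag.
X = np.array([[0.55, 0.0],
              [0.70, 1.0],
              [0.90, 0.0],
              [0.65, 1.0]])
X_scaled = StandardScaler().fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```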

Below we set up a train-test split, then try to determine the optimal number of neighbors to use to obtain the best possible predictions without overfitting.

X_train, X_test, y_train, y_test = train_test_split(X_scaled, main_df["Change"], test_size = 0.2)
def get_scores(k):
    clf = KNeighborsRegressor(n_neighbors=k)
    clf.fit(X_train, y_train)
    train_error = mean_absolute_error(clf.predict(X_train), y_train)
    test_error = mean_absolute_error(clf.predict(X_test), y_test)
    return (train_error, test_error)
scores = [get_scores(k) for k in range(1, 20)]  # fit each k once, reuse both errors
error_df = pd.DataFrame({
    "K-inverse": [1/k for k in range(1, 20)],
    "Training Error": [s[0] for s in scores],
    "Test Error": [s[1] for s in scores],
})
e1 = alt.Chart(error_df).mark_line(color="Blue").encode(
    x = "K-inverse",
    y = "Training Error"
)
e2 = alt.Chart(error_df).mark_line(color="Orange").encode(
    x = "K-inverse",
    y = "Test Error"
)
(e1+e2).properties(
    title = "Training and test error by number of Neighbors",
    width = 500
)

Based on the above, we want to avoid low k values for our regressor, as these tend to cause overfitting. This is seen on the right-hand side of the graph, where a low k value leads to a high value for 1/k. There, the training error (blue line) is low while the test error (orange line) is high, suggesting overfitting. Thus we will want a k value of at least 10 (where K-inverse = 0.1) to avoid this problem and minimize overfitting.
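The k = 1 extreme makes the overfitting mechanism easy to see: each training point is its own nearest neighbor, so the training error is exactly zero even when the targets are pure noise. A sketch on random data (hypothetical, not our dataframe):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((30, 1))   # 30 random points, one feature
y = rng.random(30)        # pure-noise targets

# With k=1 the model memorizes the training set: predicting on the
# training inputs reproduces every target exactly (zero training error),
# while nothing learned here would generalize to new points.
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X, y)
train_preds = knn1.predict(X)
```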

def score(n):
    clf2 = KNeighborsRegressor(n_neighbors=n)
    clf2.fit(X_train, y_train)
    return clf2.score(X_scaled, main_df["Change"])
for i in range(1, 25):
    print(score(i))
-0.291718600191754
-0.031479206615532274
-0.014573346116970365
-0.005247408317353708
0.025450862895493698
0.01931228028123977
0.030959922319838906
0.03225034455896425
0.0100371524448708
-0.00047662991371000274
-0.008002484093088347
-0.017972994566954448
-0.023313425088076656
-0.020046116481108545
0.0004058005752635152
0.011955517250419323
-0.007930299044213163
0.011344867458541796
0.0039943111045007695
0.0005227558724834047
0.007320751886564891
0.0015322288297425768
0.002006750829637083
-0.00316997368368499

Looking at this list, we see that all of the scores for k>9 are below 0.05, indicating either that k-nearest neighbors may not be a good predictor for the change in free throw percentage in the playoffs, or that the combination of independent variables used (Regular Percentage and All-Star status) is not sufficient to predict our dependent variable (change in free throw percentage).

Summary

The results from this examination of NBA free throw data demonstrated a minor dropoff in free throw percentage during the playoffs. The dropoff was slightly smaller for All-Star caliber players than for non-all-stars. There did not seem to be any strong linear correlation between regular season free throw percentage and the amount that a player’s percentage would drop in the playoffs, although players with higher regular season percentages showed a slightly larger dropoff. Finally, we found that k-nearest neighbors could not successfully predict the change based on a player’s regular season percentage and all-star status.

While the results of this investigation did not yield any major insights, they did reflect the idea that there are many, many factors affecting a player’s in-game performance. While the datasets we used were thorough, it would be hard to fully isolate any particular variable as having a major impact across so many players.