Final Project NBA Players¶

Author: Rex Kim

Course Project Preparation, UC Irvine, Math 10, W22

Introduction¶

This project dives into advanced statistics of current NBA players in the 2021-2022 NBA season. Different statistical categories include win shares (WS), value over replacement player (VORP), and box plus/minus. I’m going to discover how the NBA ranks its top 10 current NBA players based on the categories just listed as well as other main stats such as points, rebounds, and assists.

Main portion of the project¶

(You can either have all one section or divide into multiple sections)

import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss


df = pd.read_csv('nba_2020_adv.csv')
df

	Player	Pos	Age	Tm	G	MP	PER	TS%	3PAr	FTr	...	TOV%	USG%	OWS	DWS	WS	WS/48	OBPM	DBPM	BPM	VORP
0	Steven Adams	C	26	OKC	63	1680	20.5	0.604	0.006	0.421	...	14.2	17.3	3.8	2.7	6.5	0.185	1.9	1.1	2.9	2.1
1	Bam Adebayo	PF	22	MIA	72	2417	20.3	0.598	0.018	0.484	...	17.6	21.2	4.6	3.9	8.5	0.168	1.4	2.0	3.4	3.3
2	LaMarcus Aldridge	C	34	SAS	53	1754	19.7	0.571	0.198	0.241	...	7.8	23.4	3.0	1.4	4.5	0.122	1.8	-0.5	1.4	1.5
3	Kyle Alexander	PF	23	MIA	2	13	4.7	0.500	0.000	0.000	...	33.3	10.2	0.0	0.0	0.0	-0.003	-6.1	-3.5	-9.6	0.0
4	Nickeil Alexander-Walker	SG	21	NOP	47	591	8.9	0.473	0.500	0.139	...	16.1	23.3	-0.7	0.4	-0.2	-0.020	-3.2	-1.4	-4.6	-0.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
646	Trae Young	PG	21	ATL	60	2120	23.9	0.595	0.455	0.448	...	16.2	34.9	5.3	0.6	5.9	0.133	6.2	-2.3	3.9	3.1
647	Cody Zeller	C	27	CHO	58	1341	18.8	0.576	0.157	0.374	...	11.9	20.8	2.3	1.3	3.6	0.129	0.2	-0.8	-0.6	0.5
648	Tyler Zeller	C	30	SAS	2	4	22.4	0.250	0.000	0.000	...	0.0	43.2	0.0	0.0	0.0	-0.075	-0.3	-22.1	-22.4	0.0
649	Ante Žižić	C	23	CLE	22	221	16.4	0.597	0.000	0.264	...	11.1	17.5	0.3	0.2	0.5	0.106	-1.7	-1.5	-3.2	-0.1
650	Ivica Zubac	C	22	LAC	72	1326	21.7	0.651	0.005	0.431	...	11.8	16.4	4.4	2.3	6.6	0.241	1.9	0.8	2.8	1.6

651 rows × 26 columns

I will locate all players in the NBA and the position they play.

df.loc[:,["Player","Pos"]]

	Player	Pos
0	Steven Adams	C
1	Bam Adebayo	PF
2	LaMarcus Aldridge	C
3	Kyle Alexander	PF
4	Nickeil Alexander-Walker	SG
...	...	...
646	Trae Young	PG
647	Cody Zeller	C
648	Tyler Zeller	C
649	Ante Žižić	C
650	Ivica Zubac	C

651 rows × 2 columns

Because how much popularized the point guard position is, I decided to check all point guards and their statistics.

df.loc[df['Pos'] == 'PG']

	Player	Pos	Age	Tm	G	MP	PER	TS%	3PAr	FTr	...	TOV%	USG%	OWS	DWS	WS	WS/48	OBPM	DBPM	BPM	VORP
17	Ryan Arcidiacono	PG	25	CHI	58	930	9.0	0.551	0.627	0.173	...	13.5	12.5	0.7	0.7	1.4	0.071	-2.7	-0.1	-2.8	-0.2
21	D.J. Augustin	PG	32	ORL	57	1420	14.0	0.554	0.436	0.393	...	13.7	19.1	2.0	0.9	2.9	0.098	-0.3	-1.3	-1.6	0.2
25	Lonzo Ball	PG	22	NOP	63	2025	13.1	0.517	0.575	0.111	...	21.1	18.5	0.4	2.0	2.4	0.058	0.1	0.6	0.6	1.3
27	J.J. Barea	PG	35	DAL	29	450	14.6	0.512	0.411	0.106	...	14.9	24.2	0.5	0.2	0.7	0.072	1.6	-3.0	-1.4	0.1
53	Patrick Beverley	PG	31	LAC	51	1342	12.5	0.560	0.604	0.138	...	15.4	13.3	1.6	2.0	3.6	0.128	-0.3	2.5	2.2	1.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
618	Tremont Waters	PG	22	BOS	11	119	4.3	0.381	0.490	0.163	...	22.2	24.2	-0.5	0.2	-0.3	-0.103	-8.5	1.0	-7.5	-0.2
623	Russell Westbrook	PG	31	HOU	57	2049	21.0	0.536	0.166	0.297	...	15.0	34.4	1.7	2.5	4.2	0.098	1.6	-0.1	1.5	1.8
638	Nigel Williams-Goss	PG	25	UTA	10	50	8.1	0.415	0.438	0.125	...	15.1	17.4	-0.1	0.1	0.0	0.017	-5.2	1.3	-3.9	0.0
644	Justin Wright-Foreman	PG	22	UTA	4	45	8.8	0.437	0.500	0.200	...	12.1	24.1	-0.1	0.1	0.0	-0.008	-7.0	-2.0	-9.0	-0.1
646	Trae Young	PG	21	ATL	60	2120	23.9	0.595	0.455	0.448	...	16.2	34.9	5.3	0.6	5.9	0.133	6.2	-2.3	3.9	3.1

111 rows × 26 columns

df.loc[df['Pos'] == 'PG']

	Player	Pos	Age	Tm	G	MP	PER	TS%	3PAr	FTr	...	TOV%	USG%	OWS	DWS	WS	WS/48	OBPM	DBPM	BPM	VORP
17	Ryan Arcidiacono	PG	25	CHI	58	930	9.0	0.551	0.627	0.173	...	13.5	12.5	0.7	0.7	1.4	0.071	-2.7	-0.1	-2.8	-0.2
21	D.J. Augustin	PG	32	ORL	57	1420	14.0	0.554	0.436	0.393	...	13.7	19.1	2.0	0.9	2.9	0.098	-0.3	-1.3	-1.6	0.2
25	Lonzo Ball	PG	22	NOP	63	2025	13.1	0.517	0.575	0.111	...	21.1	18.5	0.4	2.0	2.4	0.058	0.1	0.6	0.6	1.3
27	J.J. Barea	PG	35	DAL	29	450	14.6	0.512	0.411	0.106	...	14.9	24.2	0.5	0.2	0.7	0.072	1.6	-3.0	-1.4	0.1
53	Patrick Beverley	PG	31	LAC	51	1342	12.5	0.560	0.604	0.138	...	15.4	13.3	1.6	2.0	3.6	0.128	-0.3	2.5	2.2	1.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
618	Tremont Waters	PG	22	BOS	11	119	4.3	0.381	0.490	0.163	...	22.2	24.2	-0.5	0.2	-0.3	-0.103	-8.5	1.0	-7.5	-0.2
623	Russell Westbrook	PG	31	HOU	57	2049	21.0	0.536	0.166	0.297	...	15.0	34.4	1.7	2.5	4.2	0.098	1.6	-0.1	1.5	1.8
638	Nigel Williams-Goss	PG	25	UTA	10	50	8.1	0.415	0.438	0.125	...	15.1	17.4	-0.1	0.1	0.0	0.017	-5.2	1.3	-3.9	0.0
644	Justin Wright-Foreman	PG	22	UTA	4	45	8.8	0.437	0.500	0.200	...	12.1	24.1	-0.1	0.1	0.0	-0.008	-7.0	-2.0	-9.0	-0.1
646	Trae Young	PG	21	ATL	60	2120	23.9	0.595	0.455	0.448	...	16.2	34.9	5.3	0.6	5.9	0.133	6.2	-2.3	3.9	3.1

111 rows × 26 columns

The following assigns a standard scaler to scaler and also initializes our training and testing data.

scaler = StandardScaler()
df[["BPM","USG%"]] = scaler.fit_transform(df[["BPM","USG%"]])

X_train, X_test, y_train, y_test = train_test_split(df[["BPM","USG%"]], df["Age"], test_size=0.2)

X = df[['USG%', 'BPM']]
X

	USG%	BPM
0	-0.179574	1.124504
1	0.512894	1.244773
2	0.903518	0.763698
3	-1.440223	-1.882216
4	0.885762	-0.679528
...	...	...
646	2.945413	1.365042
647	0.441872	0.282622
648	4.419128	-4.961097
649	-0.144063	-0.342775
650	-0.339375	1.100450

651 rows × 2 columns

Below, I’m ensuring that the dataframe X which holds usage rate and box plus/minus are resized for distribution.

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

y = df['Age']
y

    26
    22
    34
    23
    21
       ..
  21
  27
  30
  23
  22
Name: Age, Length: 651, dtype: int64

y_test

   27
  26
  23
  33
  33
       ..
  21
  28
  25
  28
  32
Name: Age, Length: 131, dtype: int64

The K-neighbors classifier is assigned to clf and this is where we create an Altair scatter plot with box plus/minus being on the x-axis and usage rate being on the y axis. I am also configuring the orientiation of the legend.

The Altair chart below shows when nearest neighbors has a k value of 1 meaning that the bias is 0. If k = 1, then the object is simply assigned to the class of that single nearest neighbor. By hovering over the rightmost players, we can see that the more notable players like James Harden and Lebron James lead the lead in box plus/minus but there are also players who played minimal minutes but had productive games.

clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X_train, y_train)
df["pred"] = clf.predict(df[["BPM","USG%"]])
c = alt.Chart(df).mark_circle().encode(
    x="BPM",
    y="USG%",
    color=alt.Color("pred", title="Age"),
    tooltip = ('Player:N','Age:Q','BPM:Q','USG%:Q','MP:Q')
).properties(
    title="Box Plus/Minus vs. Usage",
    width=1000,
    height=400,
)

c.configure_legend(
    strokeColor='gray',
    fillColor='#EEEEEE',
    padding=10,
    cornerRadius=10,
    orient='top-right'
)
alt.data_transformers.disable_max_rows()
print(f"The number of rows in this dataset is {df.shape[0]}")
c

The number of rows in this dataset is 651

Below I am checking to see whether there is overfitting or underfitting.

clf.score(X_train,y_train)

0.9903846153846154

clf.score(X_test,y_test)

0.08396946564885496

The data looks to be underfitting since the training set is lower than the testing set.

Here, I am creating another ten Altair scatter plots but when k neighbors increments from 1 to 9. Larger values of k will have smoother decision boundaries which mean lower variance but increased bias. In turn, this can significantly influence our result.

def make_chart(k):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    df[f"pred{k}"] = clf.predict(df[["BPM","USG%"]])
    test_score = clf.score(X_test,y_test) 
    train_score = clf.score(X_train,y_train)
    c = alt.Chart(df).mark_circle().encode(
        x="BPM",
        y="USG%",
        color=alt.Color(f"pred{k}", title="Predicted Age"),
        tooltip = ('Player:N','Age:Q','BPM:Q','USG%:Q','MP:Q')
    ).properties(
        title=f"n_neighbors = {k}",
        width=1000,
        height=200,
    )
    c.configure_legend(
    strokeColor='gray',
    fillColor='#EEEEEE',
    padding=10,
    cornerRadius=10,
    orient='top-right'
)
    
    return c 

alt.vconcat(*[make_chart(k) for k in range(1,10)])

Below I will now attempt to create a boxplot for all players’ age, box plus/minus, and usage rate. I will include how to read a boxplot

Minimum: Smallest number in the dataset. First quartile: Middle number between the minimum and the median. Second quartile (Median): Middle number of the (sorted) dataset. Third quartile: Middle number between median and maximum. Maximum: Highest number in the dataset.

The boxplot below shows that the average age of NBA players is roughly 25 years. With outliers like Jamal Crawford being age 43.

ageBoxPlot = df.boxplot(column=['Age'])  

The boxplot below measures all NBA players box plus/minus. The lower the box plus/minus the lesser the player actually performs in game and contributes to winning. This isn’t exactly the best representation of NBA players as Jamal Crawford only played 6 total minutes this recorded season.

bpmBoxPlot = df.boxplot(column=['BPM'])  

As k gets larger, there is lower variance and greater bias. This is how players like Tyler Zeller have such high usage rates despite playing such little minutes.

usgBoxPlot = df.boxplot(column=["USG%"]) 

boxplot = df.boxplot(column=["Age","BPM","USG%"]) 

I found that specifying age, BPM, and USG% by position and top 10 players in the NBA was somewhat difficult. The plots and altair charts above represent the idea that there are only a select few players whose advanced statistics still remain high despite aging. Players like Stephen Curry, Kevin Durant, and Lebron James are namely a few.

Summary¶

In the NBA, win shares are calculated using main stats. In this project, I tried to determine if a player’s age of could be predicted by a player’s usage rate(USG%) and box plus/minus. The models above show that with increased age, majority of players’ USG% and BPM decrease significantly.

References¶

Include references that you found helpful. Also say where you found the dataset you used.

Information on pie charts and box plots https://deepnote.com/@a11s/Data-Visualisation-with-Python-EgfTyjpfS129FYXEnB2U7Q

URL to dataset https://www.kaggle.com/nicklauskim/nba-per-game-stats-201920

Created in Deepnote

UC Irvine Math 10 W22

Final Project NBA Players

Contents

Final Project NBA Players¶

Introduction¶

Main portion of the project¶

Summary¶

References¶