NBA Salaries from the 2017-2018 Season

Author: Zhong Wang

Course Project, UC Irvine, Math 10, S22

Introduction

My project explores the salaries of NBA players from the 2017-2018 season. Given the many recorded statistics for NBA players, I hope to find several factors that could be used to model the salaries of these professional basketball players. As a fan of the NBA, I also wish to learn more about the players that make up this league.

Main portion of the project

#Import 
import pandas as pd
import altair as alt
#Reading the dataset
df = pd.read_csv("Data.csv")

An introduction to the dataset

This dataset contains many playing statistics of NBA players from the 2017-2018 season. In addition, it also contains the salary, nationality and draft number of each player. A guide to how some of the playing statistics are calculated can be found in the references section. Since this dataset contains a tremendous amount of information, I will only be using a select few of the factors.

#Renaming the columns from their abbreviations
df = df.rename({"Tm": "Team", "G":"Games", "MP":"Minutes Played", "PER":"Player Efficiency Rating",
"TS%": "True Shooting Percentage", "3PAr":"3 Point Attempt Rate", "FTr":"Free Throw Rate", 
"ORB%":"Offensive Rebound Percentage", "DRB%":"Defensive Rebound Percentage",
"TRB%": "Total Rebound Percentage", "AST%":"Assist Percentage", "STL%": "Steal Percentage",
"BLK%":"Block Percentage","TOV%":"Turnover Percentage", "USG%":"Usage Percentage", "OWS":"Offensive Win Shares",
"DWS":"Defensive Win Shares", "WS":"Win Shares", "WS/48":"Win Shares Per 48 Minutes", "OBPM":"Offensive Box Plus/Minus",
"DBPM": "Defensive Box Plus/Minus", "BPM":"Box Plus/Minus", "VORP":"Value Over Replacement Player"},axis="columns")
#Cleaning the dataframe
#Removing NA values
df = df.dropna()
#Some of the numbers are in decimal form while others are already percentages
#I will be converting the columns that are in decimal form to percentages
def to_percent(x):
    return round(x*100,2)
#Find the columns that are in decimal form
cols = ["True Shooting Percentage", "3 Point Attempt Rate", "Free Throw Rate"]
for x in cols:
    df[x] = df[x].map(to_percent)
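Rather than hard-coding the list above, the decimal-form columns could also be detected programmatically: a rate column whose maximum value is at most 1 is almost certainly stored as a decimal. A small sketch on toy data (the values and reduced set of columns are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy frame with one rate column in decimal form and one already in percent
# (column names kept from the dataset, values illustrative).
toy = pd.DataFrame({
    "True Shooting Percentage": [0.608, 0.529],  # decimal form
    "Usage Percentage": [21.5, 24.0],            # already a percentage
})

# Heuristic: a rate column whose maximum is at most 1 is likely a decimal.
decimal_cols = [c for c in toy.columns if toy[c].max() <= 1]

def to_percent(x):
    return round(x * 100, 2)

for c in decimal_cols:
    toy[c] = toy[c].map(to_percent)
```

On the real dataframe, the same comprehension over the rate columns would recover the three columns converted above.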

Exploring the dataset

The statistic “Player Efficiency Rating” is a rating of a player’s per-minute productivity. I will explore whether there is a trend between this metric and salary, as “Player Efficiency Rating” appears to be an overall rating of a player.

c = alt.Chart(df).mark_circle(clip=True).encode(
    x=alt.X("Player Efficiency Rating", scale=alt.Scale(domain=(-20, 45))),
    y="Salary",
    tooltip=["Salary", "Player Efficiency Rating", "Player"],
    color=alt.Color("NBA_DraftNumber", scale=alt.Scale(scheme="darkblue")))
c

Through this chart, we can see a slight positive correlation between a player’s efficiency rating and their salary. The coloring shows that players with a lower draft number (darker colors) tend to be scattered around the upper part of the chart, while players with a higher draft number tend to be scattered near the bottom. A lower draft number is seen as more prestigious, as it means teams picked that player early on.

Next, we examine whether age is correlated with salary. The idea is that as a player gains experience with age, their contribution to the team is reflected in their salary.

alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Age:O',
    y='Salary:Q'
)

Through these box plots, we can visually confirm that the salaries of NBA players are positively correlated with age: as age increases, the body of the box plot covers a higher range of salaries and the whiskers also rise. Specifically, during a player’s mid-to-late 20s, salaries tend to increase rapidly. However, a player’s salary tends to peak around their early 30s and decrease afterward.

I chose a box plot instead of a bar graph or scatter plot because a box plot conveys more information about salaries grouped by a player’s age. Had I chosen a bar graph, I would only be able to see the highest salary at each age. A box plot lets me see the quartiles of salaries at each age along with the minimum, median and maximum, and the data it conveys is less skewed by outliers.
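The five-number summary that each box in the plot encodes can also be computed directly with a groupby; a sketch on toy numbers (illustrative, not the real salaries):

```python
import pandas as pd

# Illustrative salaries for two ages.
toy = pd.DataFrame({
    "Age": [22, 22, 22, 30, 30, 30],
    "Salary": [1_000_000, 2_000_000, 3_000_000, 5_000_000, 10_000_000, 15_000_000],
})

# The same five-number summary a box plot encodes, one row per age.
summary = toy.groupby("Age")["Salary"].describe()[["min", "25%", "50%", "75%", "max"]]
```

Running this on the real dataframe would give the exact quartiles behind each box in the chart above.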

Now that I have examined some factors, I will move on to building a model that predicts salary.

Building a model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In the previous section, we examined the correlations between salary and other factors only through charts. Now we will approach this from a numerical perspective. I believe that variables that are highly correlated with salary will be good for modeling it.

df.corr().iloc[0]
Salary                           1.000000
NBA_DraftNumber                 -0.380664
Age                              0.336001
Games                            0.294014
Minutes Played                   0.505095
Player Efficiency Rating         0.266495
True Shooting Percentage         0.174759
3 Point Attempt Rate            -0.073502
Free Throw Rate                  0.023494
Offensive Rebound Percentage     0.000747
Defensive Rebound Percentage     0.190907
Total Rebound Percentage         0.135332
Assist Percentage                0.263263
Steal Percentage                 0.030657
Block Percentage                 0.042045
Turnover Percentage             -0.043205
Usage Percentage                 0.294996
Offensive Win Shares             0.561989
Defensive Win Shares             0.503794
Win Shares                       0.591307
Win Shares Per 48 Minutes        0.160954
Offensive Box Plus/Minus         0.263521
Defensive Box Plus/Minus         0.178119
Box Plus/Minus                   0.308737
Value Over Replacement Player    0.573295
Name: Salary, dtype: float64

The pandas series above shows the correlation coefficient between salary and each of the other factors. Among the stronger correlations, whether positive or negative, are draft number, age, games, minutes played and player efficiency rating.
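To rank the factors by strength of correlation regardless of sign (so that a strong negative correlation like draft number counts as strong), the series above could be sorted by absolute value; a sketch on a small illustrative frame:

```python
import pandas as pd

# Small illustrative frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Salary": [1.0, 2.0, 3.0, 4.0],
    "Minutes Played": [500, 1000, 1600, 2100],
    "NBA_DraftNumber": [40, 25, 10, 3],
})

# Rank factors by the magnitude of their correlation with Salary.
corr = toy.corr()["Salary"].drop("Salary")
ranked = corr.abs().sort_values(ascending=False)
```

On the real data, `df.corr()["Salary"].drop("Salary").abs().sort_values(ascending=False)` would list the candidates in the order discussed above.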

Given that minutes played (MP) is one of the factors most highly correlated with salary, I will attempt to build a model using it. Intuitively, if a player is paid a higher salary, I would expect teams to maximize that player’s output, which could be measured by playing time.

c1 = alt.Chart(df).mark_circle().encode(x="Minutes Played",y="Salary")
c1
#This looks like it could be fit using a second or third degree polynomial
#First get values for MP squared and MP cubed and add them to the new columns
MPs = []
for i in range(1,4):
    c = f"MP{i}"
    MPs.append(c)
    df[c] = df["Minutes Played"]**i
df.head()
   Player         Salary    NBA_Country NBA_DraftNumber Age Team Games Minutes Played ...   MP1   MP2      MP3
0  Zhou Qi        815615    China       43              22  HOU  16    87             ...   87    7569     658503
1  Zaza Pachulia  3477600   Georgia     42              33  GSW  66    937            ...   937   877969   822656953
2  Zach Randolph  12307692  USA         19              36  SAC  59    1508           ...   1508  2274064  3429288512
3  Zach LaVine    3202217   USA         13              22  CHI  24    656            ...   656   430336   282300416
4  Zach Collins   3057240   USA         10              20  POR  62    979            ...   979   958441   938313739

5 rows × 31 columns (middle rating columns elided in the notebook output)

#Retrieve a testing and training dataset
X_train, X_test, y_train, y_test = train_test_split(df[MPs], df["Salary"], train_size=0.5, random_state=0)
#Instantiate, fit, predict our model
#Find MSE for training and test dataset
mse_test_dict = {}
mse_train_dict = {}
coef_ints = {}
for i in range(2,4):
    reg = LinearRegression()
    reg.fit(X_train[MPs[:i]],y_train)
    df[f"Pred{i}"] = reg.predict(df[MPs[:i]])
    mse_test_dict[i] = mean_squared_error(y_test, reg.predict(X_test[MPs[:i]]))
    mse_train_dict[i] = mean_squared_error(y_train, reg.predict(X_train[MPs[:i]]))
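The power columns built by hand above could equivalently be produced with scikit-learn's PolynomialFeatures; a minimal sketch on toy minutes-played values (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy minutes-played values and targets (illustrative).
X = np.array([[100.0], [500.0], [1000.0], [2000.0]])
y = np.array([1.0, 2.5, 5.0, 9.0])

# degree=3, include_bias=False yields the columns [MP, MP^2, MP^3],
# the same features as the hand-built MP1/MP2/MP3 columns.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

reg = LinearRegression().fit(X_poly, y)
```

This avoids adding extra columns to the dataframe, though the loop above has the advantage of keeping the powers available for plotting.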

In addition to modeling salaries using minutes played, the idea of allowing KMeans to cluster players based on their salaries also interests me.

#import Kmeans
from sklearn.cluster import KMeans
#instantiate and fit
kmeans = KMeans(n_clusters = 5)
kmeans.fit(df[["Salary"]])
df["cluster"] = kmeans.predict(df[["Salary"]])
#Pred2 holds the predicted values of the second-degree polynomial
#while Pred3 holds those of the third-degree polynomial
c1= alt.Chart(df).mark_circle().encode(x="Minutes Played",y="Salary",
color = alt.Color('cluster:N',scale=alt.Scale(scheme="category10")), 
tooltip = ["Player"])
c2 = alt.Chart(df).mark_line(color="red").encode(x="Minutes Played",y="Pred2")
c3 = alt.Chart(df).mark_line(color="purple").encode(x="Minutes Played",y="Pred3")
c1+c2+c3

The red line represents the second-degree polynomial and the purple line represents the third-degree polynomial. Given that the two lines essentially overlap, I believe this data is best represented by a second-degree polynomial with minutes played as the explanatory variable.

As an NBA fan, I find the KMeans clustering of salaries very intriguing. Players in cluster 4 are all NBA superstars, while players in clusters 2 and 0 are more inexperienced rookie players. Although there are no precise definitions of superstar and rookie players, I believe KMeans clustering has done a very good job of classifying players.
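One caveat is that KMeans label numbers are arbitrary, so "cluster 4 contains the superstars" is specific to this run. To get a stable salary-tier ordering, the clusters could be ranked by their centers; a sketch on toy salaries (illustrative values):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy salaries (illustrative): three well-separated pay bands.
salaries = np.array([[0.8e6], [1.0e6], [5e6], [6e6], [25e6], [30e6]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(salaries)

# Rank clusters by their centers so tier 0 is always the lowest-paid group,
# regardless of which arbitrary label KMeans assigned it.
order = np.argsort(kmeans.cluster_centers_.ravel())
tier_of = {label: rank for rank, label in enumerate(order)}
tiers = [tier_of[label] for label in kmeans.labels_]
```

Applying the same relabeling to the five salary clusters above would make "superstar tier" a fixed number across reruns of the notebook.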

print(f"The mean squared error on the test set is {mse_test_dict[2]} for our first model and {mse_test_dict[3]} for our second model.")
The mean squared error on the test set is 31783168855090.08 for our first model and 31493181673353.64 for our second model.
print(f"The mean squared error on the training set is {mse_train_dict[2]} for our first model and {mse_train_dict[3]} for our second model.")
The mean squared error on the training set is 45858785276720.55 for our first model and 45823585405179.02 for our second model.

Given that the mean squared errors for both the training and testing datasets are high, I conclude that this model does not fit either dataset well. Although I could try higher-degree polynomials to find a better-fitting model, this may lead to overfitting. One reason I believe NBA salaries are hard to model is that a lot of players are paid only the minimum salary, as seen by the large number of points near the x-axis. In addition, many other factors, some of which cannot be represented numerically, contribute to a player’s salary. The combination of these reasons leads to the high variance in players’ salaries.
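One way to put these large MSE values on an interpretable scale is to take their square roots: the RMSE is in dollars, the same units as salary. Using the test-set MSEs printed above:

```python
import math

# Test-set MSEs printed above for the degree-2 and degree-3 models.
mse_test = {2: 31783168855090.08, 3: 31493181673353.64}

# RMSE is in the same units as salary (dollars), so it is easier to interpret.
rmse = {degree: math.sqrt(mse) for degree, mse in mse_test.items()}
```

Both RMSEs come out to roughly 5.6 million dollars, i.e. the models are typically off by several million dollars per player, which is consistent with the conclusion that the fit is poor.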

Stepwise regression

Upon further research into picking explanatory variables, I discovered a method called stepwise regression. According to towardsdatascience.com, stepwise regression essentially helps determine which factors are important and which are not. We will be using the statsmodels.api library.

import numpy as np
import statsmodels.api as sm
#y is our target variable, and x holds the columns whose explanatory power we want to test
y = df["Salary"]
x = df.columns[6:-6]
results = sm.OLS(y, df[x]).fit()
print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 Salary   R-squared (uncentered):                   0.689
Model:                            OLS   Adj. R-squared (uncentered):              0.674
Method:                 Least Squares   F-statistic:                              46.44
Date:                Sat, 04 Jun 2022   Prob (F-statistic):                   4.16e-102
Time:                        19:11:01   Log-Likelihood:                         -8185.8
No. Observations:                 483   AIC:                                  1.642e+04
Df Residuals:                     461   BIC:                                  1.651e+04
Df Model:                          22                                                  
Covariance Type:            nonrobust                                                  
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Games                         -1.602e+05   2.76e+04     -5.797      0.000   -2.15e+05   -1.06e+05
Minutes Played                 6641.3297   1129.197      5.881      0.000    4422.318    8860.341
Player Efficiency Rating      -4.987e+05    3.1e+05     -1.610      0.108   -1.11e+06     1.1e+05
True Shooting Percentage       8.412e+04   3.64e+04      2.313      0.021    1.27e+04    1.56e+05
3 Point Attempt Rate          -1.726e+04   1.84e+04     -0.940      0.348   -5.34e+04    1.88e+04
Free Throw Rate               -1.194e+04   9793.842     -1.219      0.224   -3.12e+04    7309.584
Offensive Rebound Percentage  -1.142e+06      1e+06     -1.140      0.255   -3.11e+06    8.26e+05
Defensive Rebound Percentage  -9.301e+05   9.94e+05     -0.936      0.350   -2.88e+06    1.02e+06
Total Rebound Percentage        2.25e+06   1.99e+06      1.132      0.258   -1.66e+06    6.16e+06
Assist Percentage              6.179e+04   4.63e+04      1.334      0.183   -2.92e+04    1.53e+05
Steal Percentage              -3.122e+04   4.68e+05     -0.067      0.947   -9.51e+05    8.89e+05
Block Percentage               1.282e+05   3.37e+05      0.381      0.704   -5.34e+05     7.9e+05
Turnover Percentage            -1.44e+04   5.54e+04     -0.260      0.795   -1.23e+05    9.44e+04
Usage Percentage               2.385e+05    1.1e+05      2.165      0.031     2.2e+04    4.55e+05
Offensive Win Shares          -1.994e+06   4.98e+06     -0.400      0.689   -1.18e+07     7.8e+06
Defensive Win Shares          -3.062e+06   5.01e+06     -0.611      0.541   -1.29e+07    6.78e+06
Win Shares                     2.897e+06   4.99e+06      0.580      0.562   -6.91e+06    1.27e+07
Win Shares Per 48 Minutes      1.328e+07   1.13e+07      1.179      0.239   -8.86e+06    3.54e+07
Offensive Box Plus/Minus       3.007e+06   5.19e+06      0.579      0.563   -7.19e+06    1.32e+07
Defensive Box Plus/Minus       2.945e+06   5.16e+06      0.571      0.569    -7.2e+06    1.31e+07
Box Plus/Minus                -2.787e+06   5.17e+06     -0.539      0.590   -1.29e+07    7.37e+06
Value Over Replacement Player  3.278e+05   6.92e+05      0.473      0.636   -1.03e+06    1.69e+06
==============================================================================
Omnibus:                       58.906   Durbin-Watson:                   1.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              108.223
Skew:                           0.728   Prob(JB):                     3.16e-24
Kurtosis:                       4.805   Cond. No.                     6.19e+04
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 6.19e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The column we are most interested in is the P>|t| column. According to UCLA Institute for Digital Research and Education, “Coefficients having p-values less than alpha are statistically significant. For example, if you chose alpha to be 0.05, coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0).” Basically, we should pick the columns with a p-value less than or equal to 0.05.
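Rather than reading the table by eye, the significant columns could be pulled from the fitted results object, which exposes the P>|t| column as `results.pvalues`. A self-contained sketch using illustrative p-values in the same shape:

```python
import pandas as pd

# Illustrative p-values in the shape of the P>|t| column above;
# a fitted statsmodels results object exposes the real ones as results.pvalues.
pvalues = pd.Series({
    "Games": 0.000,
    "Minutes Played": 0.000,
    "True Shooting Percentage": 0.021,
    "Usage Percentage": 0.031,
    "Steal Percentage": 0.947,
})

# Keep the coefficients that are significant at alpha = 0.05.
alpha = 0.05
sig_coef = pvalues[pvalues <= alpha].index.tolist()
```

In the notebook, `results.pvalues[results.pvalues <= 0.05].index.tolist()` would reproduce the hand-picked list below.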

#Picking out the coefficients with a P-value less than or equal to 0.05
#Sig_coef represents the statistically significant coefficients listed above
sig_coef = ["Games", "Minutes Played", "True Shooting Percentage", "Usage Percentage"]

I will now build a linear regression with the coefficients above in an attempt to model salary.

reg2 = LinearRegression()
reg2.fit(df[sig_coef],df["Salary"])
pd.Series(reg2.coef_, index=sig_coef)
Games                      -175897.771280
Minutes Played                8967.362560
True Shooting Percentage     54257.420356
Usage Percentage             78011.675985
dtype: float64
print(f"Salary = {reg2.intercept_} + {reg2.coef_[0]}*(Games) + {reg2.coef_[1]}*(Minutes Played) + {reg2.coef_[2]}*(True Shooting Percentage) + {reg2.coef_[3]}*(Usage Percentage)")
Salary = 743098.1367627522 + -175897.7712804055*(Games) + 8967.36256015433*(Minutes Played) + 54257.4203558637*(True Shooting Percentage) + 78011.67598486425*(Usage Percentage)
#Finding the mean squared error of this new model.
mean_squared_error(df["Salary"],reg2.predict(df[sig_coef]))
35209832886193.07

Although this MSE is higher than the MSE on our earlier test set, I believe that if we build a model using these variables together with their second powers, we may obtain a lower MSE.
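That claim could be checked by refitting with the squared terms appended and comparing test-set MSEs. A sketch on synthetic data with a genuine quadratic component (illustrative, not the real dataset); whether the improvement carries over to the real salaries would still need to be verified:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data where the target truly depends on a squared feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# Same split for both designs so the comparison is fair.
X_sq = np.hstack([X, X ** 2])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_sq, X_test_sq, _, _ = train_test_split(X_sq, y, random_state=0)

mse_lin = mean_squared_error(y_test, LinearRegression().fit(X_train, y_train).predict(X_test))
mse_sq = mean_squared_error(y_test, LinearRegression().fit(X_train_sq, y_train).predict(X_test_sq))
```

On this synthetic data the squared model fits substantially better, since the linear model cannot capture the curvature.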

Summary

I explored a dataset containing many statistics of NBA players, focusing on modeling salary, and throughout this project I examined the various factors that could possibly model it. I first started with second- and third-degree polynomials of minutes played. Then, upon further research, I discovered stepwise regression and used it to pick out the coefficients with significant explanatory power for salary, building a model from those factors.

References

Sources of code and ideas:

https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922 This article explains how stepwise regression works and lists a library I could use to perform it. I formatted my code similarly to theirs; the resulting object works very much like our LinearRegression object.

Other helpful references:

https://altair-viz.github.io/gallery/boxplot.html Box plot

https://vega.github.io/vega/docs/schemes/ Altair color schemes

https://www.basketball-reference.com/about/glossary.html A glossary explaining how the player statistics are calculated

https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922 Stepwise Regression.

https://stats.oarc.ucla.edu/stata/output/regression-analysis/ p-value explanation

https://www.espn.com/nba/columns/story?columnist=hollinger_john&id=2850240 In depth description of Player Efficiency Rating.
