NBA Salaries from the 2017-2018 Season

Author: Zhong Wang

Course Project, UC Irvine, Math 10, S22

Introduction

My project explores the salaries of NBA players from the 2017-2018 season. Given the many recorded statistics for NBA players, I hope to find several factors that could be used to model the salaries of these professional basketball players. As a fan of the NBA, I also wish to learn more about the players that make up this league.

Main portion of the project

#Import 
import pandas as pd
import altair as alt
#Reading the dataset
df = pd.read_csv("Data.csv")

An introduction to the dataset

This dataset contains many playing statistics of NBA players from the 2017-2018 season. In addition, it also contains the salary, nationality and draft number of each player. A guide to how some of the playing statistics are calculated can be found in the references section. Since this dataset contains a tremendous amount of information, I will only be using a select few of the factors.

#Renaming the columns from their abbreviations
df = df.rename({"Tm": "Team", "G":"Games", "MP":"Minutes Played", "PER":"Player Efficiency Rating",
"TS%": "True Shooting Percentage", "3PAr":"3 Point Attempt Rate", "FTr":"Free Throw Rate", 
"ORB%":"Offensive Rebound Percentage", "DRB%":"Defensive Rebound Percentage",
"TRB%": "Total Rebound Percentage", "AST%":"Assist Percentage", "STL%": "Steal Percentage",
"BLK%":"Block Percentage","TOV%":"Turnover Percentage", "USG%":"Usage Percentage", "OWS":"Offensive Win Shares",
"DWS":"Defensive Win Shares", "WS":"Win Shares", "WS/48":"Win Shares Per 48 Minutes", "OBPM":"Offensive Box Plus/Minus",
"DBPM": "Defensive Box Plus/Minus", "BPM":"Box Plus/Minus", "VORP":"Value Over Replacement Player"},axis="columns")
#Cleaning the dataframe
#Removing NA values
df = df.dropna()
#Some of the numbers are in decimal form while others are already percentages
#I will be converting the columns that are in decimal form to percentages
def to_percent(x):
    return round(x*100,2)
#Find the columns that are in decimal form
cols = ["True Shooting Percentage", "3 Point Attempt Rate", "Free Throw Rate"]
for x in cols:
    df[x] = df[x].map(to_percent)
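Rather than hard-coding the list above, the decimal-form columns could also be detected programmatically: a rate column whose maximum value is at most 1 is almost certainly stored as a decimal. A small sketch on toy data (the values and reduced set of columns are illustrative, not from the real dataset):

```python
import pandas as pd

# Toy frame with one rate column in decimal form and one already in percent
# (column names kept from the dataset, values illustrative).
toy = pd.DataFrame({
    "True Shooting Percentage": [0.608, 0.529],  # decimal form
    "Usage Percentage": [21.5, 24.0],            # already a percentage
})

# Heuristic: a rate column whose maximum is at most 1 is likely a decimal.
decimal_cols = [c for c in toy.columns if toy[c].max() <= 1]

def to_percent(x):
    return round(x * 100, 2)

for c in decimal_cols:
    toy[c] = toy[c].map(to_percent)
```

On the real dataframe, the same comprehension over the rate columns would recover the three columns converted above.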

Exploring the dataset

The statistic “Player Efficiency Rating” is a rating of a player’s per-minute productivity. I will explore whether there is a trend between this metric and salary, as “Player Efficiency Rating” appears to be an overall rating of a player.

c = alt.Chart(df).mark_circle(clip=True).encode(
    x=alt.X("Player Efficiency Rating", scale=alt.Scale(domain=(-20, 45))),
    y="Salary",
    tooltip=["Salary", "Player Efficiency Rating", "Player"],
    color=alt.Color("NBA_DraftNumber", scale=alt.Scale(scheme="darkblue")))
c

Through this chart, we can see a slight positive correlation between a player’s efficiency rating and their salary. The coloring shows that players with a lower draft number (darker colors) tend to be scattered around the upper part of the chart, while players with a higher draft number tend to be scattered near the bottom. A lower draft number is seen as more prestigious, as it means teams picked that player early on.

Next, we examine whether age is correlated with salary. The idea is that as a player gains experience with age, their contribution to the team is reflected in their salary.

alt.Chart(df).mark_boxplot(extent='min-max').encode(
    x='Age:O',
    y='Salary:Q'
)

Through these box plots, we can visually confirm that the salaries of NBA players are positively correlated with age: as age increases, the body of the box plot covers a higher range of salaries and the whiskers also rise. Specifically, during a player’s mid-to-late 20s, salaries tend to increase rapidly. However, a player’s salary tends to peak around their early 30s and decrease afterward.

I chose a box plot instead of a bar graph or scatter plot because a box plot conveys more information about salaries grouped by a player’s age. Had I chosen a bar graph, I would only be able to see the highest salary at each age. A box plot lets me see the quartiles of salaries at each age along with the minimum, median and maximum, and the data it conveys is less skewed by outliers.
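The five-number summary that each box in the plot encodes can also be computed directly with a groupby; a sketch on toy numbers (illustrative, not the real salaries):

```python
import pandas as pd

# Illustrative salaries for two ages.
toy = pd.DataFrame({
    "Age": [22, 22, 22, 30, 30, 30],
    "Salary": [1_000_000, 2_000_000, 3_000_000, 5_000_000, 10_000_000, 15_000_000],
})

# The same five-number summary a box plot encodes, one row per age.
summary = toy.groupby("Age")["Salary"].describe()[["min", "25%", "50%", "75%", "max"]]
```

Running this on the real dataframe would give the exact quartiles behind each box in the chart above.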

Now that I have examined some factors, I will move on to building a model that predicts salary.

Building a model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In the previous section, we examined the correlations between salary and other factors only through charts. Now we will approach this from a numerical perspective. I believe that variables that are highly correlated with salary will be good for modeling it.

df.corr().iloc[0]
Salary                           1.000000
NBA_DraftNumber                 -0.380664
Age                              0.336001
Games                            0.294014
Minutes Played                   0.505095
Player Efficiency Rating         0.266495
True Shooting Percentage         0.174759
3 Point Attempt Rate            -0.073502
Free Throw Rate                  0.023494
Offensive Rebound Percentage     0.000747
Defensive Rebound Percentage     0.190907
Total Rebound Percentage         0.135332
Assist Percentage                0.263263
Steal Percentage                 0.030657
Block Percentage                 0.042045
Turnover Percentage             -0.043205
Usage Percentage                 0.294996
Offensive Win Shares             0.561989
Defensive Win Shares             0.503794
Win Shares                       0.591307
Win Shares Per 48 Minutes        0.160954
Offensive Box Plus/Minus         0.263521
Defensive Box Plus/Minus         0.178119
Box Plus/Minus                   0.308737
Value Over Replacement Player    0.573295
Name: Salary, dtype: float64

The pandas series above shows the correlation coefficient between salary and each of the other factors. Among the stronger correlations, whether positive or negative, are draft number, age, games, minutes played and player efficiency rating.
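To rank the factors by strength of correlation regardless of sign (so that a strong negative correlation like draft number counts as strong), the series above could be sorted by absolute value; a sketch on a small illustrative frame:

```python
import pandas as pd

# Small illustrative frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "Salary": [1.0, 2.0, 3.0, 4.0],
    "Minutes Played": [500, 1000, 1600, 2100],
    "NBA_DraftNumber": [40, 25, 10, 3],
})

# Rank factors by the magnitude of their correlation with Salary.
corr = toy.corr()["Salary"].drop("Salary")
ranked = corr.abs().sort_values(ascending=False)
```

On the real data, `df.corr()["Salary"].drop("Salary").abs().sort_values(ascending=False)` would list the candidates in the order discussed above.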

Given that minutes played (MP) is one of the factors most highly correlated with salary, I will attempt to build a model using it. Intuitively, if a player is paid a higher salary, I would expect teams to maximize that player’s output, which could be measured by playing time.

c1 = alt.Chart(df).mark_circle().encode(x="Minutes Played",y="Salary")
c1
#This looks like it could be fit using a second or third degree polynomial
#First get values for MP squared and MP cubed and add them to the new columns
MPs = []
for i in range(1,4):
    c = f"MP{i}"
    MPs.append(c)
    df[c] = df["Minutes Played"]**i
df.head()
   Player         Salary    NBA_Country NBA_DraftNumber Age Team Games Minutes Played ...   MP1   MP2      MP3
0  Zhou Qi        815615    China       43              22  HOU  16    87             ...   87    7569     658503
1  Zaza Pachulia  3477600   Georgia     42              33  GSW  66    937            ...   937   877969   822656953
2  Zach Randolph  12307692  USA         19              36  SAC  59    1508           ...   1508  2274064  3429288512
3  Zach LaVine    3202217   USA         13              22  CHI  24    656            ...   656   430336   282300416
4  Zach Collins   3057240   USA         10              20  POR  62    979            ...   979   958441   938313739

5 rows × 31 columns (middle rating columns elided in the notebook output)

#Retrieve a testing and training dataset
X_train, X_test, y_train, y_test = train_test_split(df[MPs], df["Salary"], train_size=0.5, random_state=0)
#Instantiate, fit, predict our model
#Find MSE for training and test dataset
mse_test_dict = {}
mse_train_dict = {}
coef_ints = {}
for i in range(2,4):
    reg = LinearRegression()
    reg.fit(X_train[MPs[:i]],y_train)
    df[f"Pred{i}"] = reg.predict(df[MPs[:i]])
    mse_test_dict[i] = mean_squared_error(y_test, reg.predict(X_test[MPs[:i]]))
    mse_train_dict[i] = mean_squared_error(y_train, reg.predict(X_train[MPs[:i]]))
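The power columns built by hand above could equivalently be produced with scikit-learn's PolynomialFeatures; a minimal sketch on toy minutes-played values (the numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Toy minutes-played values and targets (illustrative).
X = np.array([[100.0], [500.0], [1000.0], [2000.0]])
y = np.array([1.0, 2.5, 5.0, 9.0])

# degree=3, include_bias=False yields the columns [MP, MP^2, MP^3],
# the same features as the hand-built MP1/MP2/MP3 columns.
poly = PolynomialFeatures(degree=3, include_bias=False)
X_poly = poly.fit_transform(X)

reg = LinearRegression().fit(X_poly, y)
```

This avoids adding extra columns to the dataframe, though the loop above has the advantage of keeping the powers available for plotting.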

In addition to modeling salaries using minutes played, the idea of allowing KMeans to cluster players based on their salaries also interests me.

#import Kmeans
from sklearn.cluster import KMeans
#instantiate and fit
kmeans = KMeans(n_clusters = 5)
kmeans.fit(df[["Salary"]])
df["cluster"] = kmeans.predict(df[["Salary"]])
#Pred2 holds the predicted values of the second-degree polynomial
#while Pred3 holds those of the third-degree polynomial
c1= alt.Chart(df).mark_circle().encode(x="Minutes Played",y="Salary",
color = alt.Color('cluster:N',scale=alt.Scale(scheme="category10")), 
tooltip = ["Player"])
c2 = alt.Chart(df).mark_line(color="red").encode(x="Minutes Played",y="Pred2")
c3 = alt.Chart(df).mark_line(color="purple").encode(x="Minutes Played",y="Pred3")
c1+c2+c3

The red line represents the second-degree polynomial and the purple line represents the third-degree polynomial. Given that the two lines essentially overlap, I believe this data is best represented by a second-degree polynomial with minutes played as the explanatory variable.

As an NBA fan, I find the KMeans clustering of salaries very intriguing. Players in cluster 4 are all NBA superstars, while players in clusters 2 and 0 are more inexperienced rookie players. Although there are no precise definitions of superstar and rookie players, I believe KMeans clustering has done a very good job of classifying players.
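One caveat is that KMeans label numbers are arbitrary, so "cluster 4 contains the superstars" is specific to this run. To get a stable salary-tier ordering, the clusters could be ranked by their centers; a sketch on toy salaries (illustrative values):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy salaries (illustrative): three well-separated pay bands.
salaries = np.array([[0.8e6], [1.0e6], [5e6], [6e6], [25e6], [30e6]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(salaries)

# Rank clusters by their centers so tier 0 is always the lowest-paid group,
# regardless of which arbitrary label KMeans assigned it.
order = np.argsort(kmeans.cluster_centers_.ravel())
tier_of = {label: rank for rank, label in enumerate(order)}
tiers = [tier_of[label] for label in kmeans.labels_]
```

Applying the same relabeling to the five salary clusters above would make "superstar tier" a fixed number across reruns of the notebook.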

print(f"The mean squared error on the test set is {mse_test_dict[2]} for our first model and {mse_test_dict[3]} for our second model.")
The mean squared error on the test set is 31783168855090.08 for our first model and 31493181673353.64 for our second model.
print(f"The mean squared error on the training set is {mse_train_dict[2]} for our first model and {mse_train_dict[3]} for our second model.")
The mean squared error on the training set is 45858785276720.55 for our first model and 45823585405179.02 for our second model.

Given that the mean squared errors for both the training and testing datasets are high, I conclude that this model does not fit either dataset well. Although I could try higher-degree polynomials to find a better-fitting model, this may lead to overfitting. One reason I believe NBA salaries are hard to model is that a lot of players are paid only the minimum salary, as seen by the large number of points near the x-axis. In addition, many other factors, some of which cannot be represented numerically, contribute to a player’s salary. The combination of these reasons leads to the high variance in players’ salaries.
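One way to put these large MSE values on an interpretable scale is to take their square roots: the RMSE is in dollars, the same units as salary. Using the test-set MSEs printed above:

```python
import math

# Test-set MSEs printed above for the degree-2 and degree-3 models.
mse_test = {2: 31783168855090.08, 3: 31493181673353.64}

# RMSE is in the same units as salary (dollars), so it is easier to interpret.
rmse = {degree: math.sqrt(mse) for degree, mse in mse_test.items()}
```

Both RMSEs come out to roughly 5.6 million dollars, i.e. the models are typically off by several million dollars per player, which is consistent with the conclusion that the fit is poor.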

Stepwise regression

Upon further research into picking explanatory variables, I discovered a method called stepwise regression. According to towardsdatascience.com, stepwise regression essentially helps determine which factors are important and which are not. We will be using the statsmodels.api library.

import numpy as np
import statsmodels.api as sm
#y is our target variable, and x holds the columns whose explanatory power we want to test
y = df["Salary"]
x = df.columns[6:-6]
results = sm.OLS(y, df[x]).fit()
print(results.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                 Salary   R-squared (uncentered):                   0.689
Model:                            OLS   Adj. R-squared (uncentered):              0.674
Method:                 Least Squares   F-statistic:                              46.44
Date:                Sat, 04 Jun 2022   Prob (F-statistic):                   4.16e-102
Time:                        19:11:01   Log-Likelihood:                         -8185.8
No. Observations:                 483   AIC:                                  1.642e+04
Df Residuals:                     461   BIC:                                  1.651e+04
Df Model:                          22                                                  
Covariance Type:            nonrobust                                                  
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Games                         -1.602e+05   2.76e+04     -5.797      0.000   -2.15e+05   -1.06e+05
Minutes Played                 6641.3297   1129.197      5.881      0.000    4422.318    8860.341
Player Efficiency Rating      -4.987e+05    3.1e+05     -1.610      0.108   -1.11e+06     1.1e+05
True Shooting Percentage       8.412e+04   3.64e+04      2.313      0.021    1.27e+04    1.56e+05
3 Point Attempt Rate          -1.726e+04   1.84e+04     -0.940      0.348   -5.34e+04    1.88e+04
Free Throw Rate               -1.194e+04   9793.842     -1.219      0.224   -3.12e+04    7309.584
Offensive Rebound Percentage  -1.142e+06      1e+06     -1.140      0.255   -3.11e+06    8.26e+05
Defensive Rebound Percentage  -9.301e+05   9.94e+05     -0.936      0.350   -2.88e+06    1.02e+06
Total Rebound Percentage        2.25e+06   1.99e+06      1.132      0.258   -1.66e+06    6.16e+06
Assist Percentage              6.179e+04   4.63e+04      1.334      0.183   -2.92e+04    1.53e+05
Steal Percentage              -3.122e+04   4.68e+05     -0.067      0.947   -9.51e+05    8.89e+05
Block Percentage               1.282e+05   3.37e+05      0.381      0.704   -5.34e+05     7.9e+05
Turnover Percentage            -1.44e+04   5.54e+04     -0.260      0.795   -1.23e+05    9.44e+04
Usage Percentage               2.385e+05    1.1e+05      2.165      0.031     2.2e+04    4.55e+05
Offensive Win Shares          -1.994e+06   4.98e+06     -0.400      0.689   -1.18e+07     7.8e+06
Defensive Win Shares          -3.062e+06   5.01e+06     -0.611      0.541   -1.29e+07    6.78e+06
Win Shares                     2.897e+06   4.99e+06      0.580      0.562   -6.91e+06    1.27e+07
Win Shares Per 48 Minutes      1.328e+07   1.13e+07      1.179      0.239   -8.86e+06    3.54e+07
Offensive Box Plus/Minus       3.007e+06   5.19e+06      0.579      0.563   -7.19e+06    1.32e+07
Defensive Box Plus/Minus       2.945e+06   5.16e+06      0.571      0.569    -7.2e+06    1.31e+07
Box Plus/Minus                -2.787e+06   5.17e+06     -0.539      0.590   -1.29e+07    7.37e+06
Value Over Replacement Player  3.278e+05   6.92e+05      0.473      0.636   -1.03e+06    1.69e+06
==============================================================================
Omnibus:                       58.906   Durbin-Watson:                   1.892
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              108.223
Skew:                           0.728   Prob(JB):                     3.16e-24
Kurtosis:                       4.805   Cond. No.                     6.19e+04
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 6.19e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The column we are most interested in is the P>|t| column. According to UCLA Institute for Digital Research and Education, “Coefficients having p-values less than alpha are statistically significant. For example, if you chose alpha to be 0.05, coefficients having a p-value of 0.05 or less would be statistically significant (i.e., you can reject the null hypothesis and say that the coefficient is significantly different from 0).” Basically, we should pick the columns with a p-value less than or equal to 0.05.
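Rather than reading the table by eye, the significant columns could be pulled from the fitted results object, which exposes the P>|t| column as `results.pvalues`. A self-contained sketch using illustrative p-values in the same shape:

```python
import pandas as pd

# Illustrative p-values in the shape of the P>|t| column above;
# a fitted statsmodels results object exposes the real ones as results.pvalues.
pvalues = pd.Series({
    "Games": 0.000,
    "Minutes Played": 0.000,
    "True Shooting Percentage": 0.021,
    "Usage Percentage": 0.031,
    "Steal Percentage": 0.947,
})

# Keep the coefficients that are significant at alpha = 0.05.
alpha = 0.05
sig_coef = pvalues[pvalues <= alpha].index.tolist()
```

In the notebook, `results.pvalues[results.pvalues <= 0.05].index.tolist()` would reproduce the hand-picked list below.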

#Picking out the coefficients with a P-value less than or equal to 0.05
#Sig_coef represents the statistically significant coefficients listed above
sig_coef = ["Games", "Minutes Played", "True Shooting Percentage", "Usage Percentage"]

I will now build a linear regression with the coefficients above in an attempt to model salary.

reg2 = LinearRegression()
reg2.fit(df[sig_coef],df["Salary"])
pd.Series(reg2.coef_, index=sig_coef)
Games                      -175897.771280
Minutes Played                8967.362560
True Shooting Percentage     54257.420356
Usage Percentage             78011.675985
dtype: float64
print(f"Salary = {reg2.intercept_} + {reg2.coef_[0]}*(Games) + {reg2.coef_[1]}*(Minutes Played) + {reg2.coef_[2]}*(True Shooting Percentage) + {reg2.coef_[3]}*(Usage Percentage)")
Salary = 743098.1367627522 + -175897.7712804055*(Games) + 8967.36256015433*(Minutes Played) + 54257.4203558637*(True Shooting Percentage) + 78011.67598486425*(Usage Percentage)
#Finding the mean squared error of this new model.
mean_squared_error(df["Salary"],reg2.predict(df[sig_coef]))
35209832886193.07

Although this MSE is higher than the MSE on our earlier test set, I believe that if we build a model using these variables together with their second powers, we may obtain a lower MSE.
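That claim could be checked by refitting with the squared terms appended and comparing test-set MSEs. A sketch on synthetic data with a genuine quadratic component (illustrative, not the real dataset); whether the improvement carries over to the real salaries would still need to be verified:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data where the target truly depends on a squared feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3 * X[:, 0] + 0.5 * X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(0, 1, size=200)

# Same split for both designs so the comparison is fair.
X_sq = np.hstack([X, X ** 2])
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_sq, X_test_sq, _, _ = train_test_split(X_sq, y, random_state=0)

mse_lin = mean_squared_error(y_test, LinearRegression().fit(X_train, y_train).predict(X_test))
mse_sq = mean_squared_error(y_test, LinearRegression().fit(X_train_sq, y_train).predict(X_test_sq))
```

On this synthetic data the squared model fits substantially better, since the linear model cannot capture the curvature.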

Summary

I explored a dataset containing many statistics of NBA players, focusing on modeling salary, and throughout this project I examined the various factors that could possibly model it. I first started with second- and third-degree polynomials of minutes played. Then, upon further research, I discovered stepwise regression and used it to pick out the coefficients with significant explanatory power for salary, building a model from those factors.

References

Sources of code and ideas:

https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922 This article explains how stepwise regression works and lists a library I could use to perform it. I formatted my code similarly to theirs; the resulting object works very much like our LinearRegression object.

Other helpful references:

https://altair-viz.github.io/gallery/boxplot.html Box plot

https://vega.github.io/vega/docs/schemes/ Altair color schemes

https://www.basketball-reference.com/about/glossary.html A glossary explaining how the player statistics are calculated

https://towardsdatascience.com/stepwise-regression-tutorial-in-python-ebf7c782c922 Stepwise Regression.

https://stats.oarc.ucla.edu/stata/output/regression-analysis/ p-value explanation

https://www.espn.com/nba/columns/story?columnist=hollinger_john&id=2850240 In depth description of Player Efficiency Rating.
