Predict Position of Player with Seasonal Stats

Author: Sarah Thayer

Course Project, UC Irvine, Math 10, W22

Introduction

My project explores two Kaggle datasets of NBA stats to predict each player's listed position. Two datasets are needed: one contains the players and their listed positions, and the second contains season stats, height, and weight. We merge the two into a complete dataset.

Main portion of the project


import numpy as np
import pandas as pd
import altair as alt

Load First Dataset

NBA stats found here. The columns we are interested in extracting are Player, Tm, Pos, and Age.

Display shape of the dataframe to confirm we have enough data points.

df_position = pd.read_csv("nba.csv", sep = ',' ,encoding = 'latin-1')

df_position = df_position.loc[:,df_position.columns.isin(['Player', 'Tm', 'Pos', 'Age']) ]
df_position.head()
Player Pos Age Tm
0 Alex Abrines\abrinal01 SG 24 OKC
1 Quincy Acy\acyqu01 PF 27 BRK
2 Steven Adams\adamsst01 C 24 OKC
3 Bam Adebayo\adebaba01 C 20 MIA
4 Arron Afflalo\afflaar01 SG 32 ORL
df_position.shape
(664, 4)

Clean First Dataset

Rename column Tm to Team.

NBA players have a unique player ID in the Player column. Remove the player ID to view names (e.g. "Alex Abrines\abrinal01": keep the name and drop the unique player ID after the backslash).
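A quick sketch of that split on the example name above (the raw value contains a literal backslash, so in Python we split on '\\'):

"Alex Abrines\\abrinal01".split('\\')[0]   # returns 'Alex Abrines'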

NBA players that were traded mid-season appear more than once. Drop Player duplicates from the dataframe and check the shape.

df_position = df_position.rename(columns={'Tm' : 'Team'}) 

df_position['Player'] = df_position['Player'].map(lambda x: x.split('\\')[0])

df_pos_unique = df_position[~df_position.duplicated(subset=['Player'])]
df_pos_unique.shape

df_pos_unique.head()
Player Pos Age Team
0 Alex Abrines SG 24 OKC
1 Quincy Acy PF 27 BRK
2 Steven Adams C 24 OKC
3 Bam Adebayo C 20 MIA
4 Arron Afflalo SG 32 ORL

Load Second Dataset

NBA stats found here.

This is a large dataset of 20 years of stats from 49 different leagues. Parse the dataframe for the relevant rows: the NBA league during the 2017 - 2018 regular season. The resulting dataframe also contains height and weight.

df = pd.read_csv("players_stats_by_season_full_details.csv",  encoding='latin-1' ) 

df= df[(df["League"] == 'NBA') &  (df["Season"] == '2017 - 2018') & (df["Stage"] == 'Regular_Season')] 

df_hw = df.loc[:,~df.columns.isin(['Rk', 'League', 'Season', 'Stage', 'birth_month', 'birth_date', 'height', 'weight',
                                  'nationality', 'high_school', 'draft_round', 'draft_pick', 'draft_team'])]

df_hw.shape
(279, 22)

Clean Second Dataset

Drop duplicate Player rows from the dataframe containing height and weight.

df_hw_unique = df_hw[~df_hw.duplicated(subset=['Player'])]
df_hw_unique.shape
(279, 22)

Prepare Merged Data

  • Merge First and Second Dataset

  • Encode the NBA listed positions

Confirm it’s the same player by matching name and team.

df_merged = pd.merge(df_pos_unique,df_hw_unique, on = ['Player','Team'])

Label Encoding

Encode the positions (strings) into integer labels.

enc = {'PG' : 1, 'SG' : 2,  'SF': 3, 'PF': 4, 'C':5}
df_merged["Pos_enc"] = df_merged["Pos"].map(enc)
df_merged
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... DRB REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc
0 Alex Abrines SG 24 OKC 75 1134.2 115 291 84 221 ... 88 114 28 38 8 353 1993.0 198.0 86.0 2
1 Quincy Acy PF 27 BRK 70 1359.2 130 365 102 292 ... 217 257 57 33 29 411 1990.0 201.0 109.0 4
2 Steven Adams C 24 OKC 76 2487.0 448 712 0 2 ... 301 685 88 92 78 1056 1993.0 213.0 120.0 5
3 Bam Adebayo C 20 MIA 69 1368.1 174 340 0 7 ... 263 381 101 32 41 477 1997.0 208.0 116.0 5
4 LaMarcus Aldridge C 32 SAS 75 2508.7 687 1347 27 92 ... 389 635 152 43 90 1735 1985.0 211.0 120.0 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229 Lou Williams SG 31 LAC 79 2589.2 582 1337 186 518 ... 158 198 417 85 19 1782 1986.0 185.0 79.0 2
230 Justise Winslow PF 21 MIA 68 1680.2 207 488 49 129 ... 306 370 148 54 33 529 1996.0 201.0 102.0 4
231 Delon Wright PG 25 TOR 69 1432.7 201 432 56 153 ... 153 198 200 72 33 555 1992.0 196.0 83.0 1
232 Nick Young SG 32 GSW 80 1393.1 201 488 123 326 ... 105 125 36 38 7 581 1985.0 201.0 95.0 2
233 Thaddeus Young PF 29 IND 81 2607.2 421 864 58 181 ... 328 512 152 135 36 955 1988.0 203.0 100.0 4

234 rows × 25 columns

Find Best Model

Feature Selection: The data has player name, team, position, height, weight, and 20+ seasonal stats. Not all features are relevant to predicting NBA position, so we test different variations of features by iterating through combinations() of k columns from cols. Background on combinations and on estimating the number of training trials can be found here: “…the number of k-element subsets (or k-combinations) of an n-element set”. A rough count is sketched below.
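As a rough sketch of the search size (assuming, as in the loop below, subsets of 12 to 17 of the 18 candidate columns and three n_neighbors options per subset, with early stopping once the loss target is reached):

from math import comb

n_cols = 18                   # number of candidate columns in cols
subset_sizes = range(12, 18)  # matches range(12, 18) in the search loop
neighbor_options = 3          # matches range(3, 6)

# upper bound on training trials if the search never stops early
print(sum(comb(n_cols, k) for k in subset_sizes) * neighbor_options)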

KNN Optimization: For each combination of possible training features, iterate through a range of integers for possible n_neighbors.

Log Results: Create a results dictionary storing the features, n_neighbors, log_loss, and classifier for each training iteration. Then iterate through the results dictionary to find the smallest log_loss along with the features used, the n_neighbors used, and the classifier object.

# Each entry of the results dictionary has the form:
# results[trial_num] = {
#     'features': [...],      # columns used for this trial
#     'n_neighbors': k,       # number of neighbors used for KNN
#     'log_loss': loss,       # log loss on the held-out test set
#     'classifier': clf       # the fitted KNeighborsClassifier
# }
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

from itertools import combinations
# candidate columns to draw feature subsets from ('Pos_enc' is the encoded position)
cols = ['Pos_enc', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'TOV', 'PF', 'ORB', 'DRB',
        'REB', 'AST', 'STL', 'BLK', 'PTS', 'height_cm', 'weight_kg']

trial_num = 0 # count of training attempts
loss_min = False # found log_loss minimum
n_search = True # searching for ideal n neighbors 
results = {} # dictionary of results per training attempt

found_clf = False
for count in range(12, 18): 
    print(f"Testing: {len(cols)} Choose {count}")
    
    for tup in combinations(cols,count):  # iterate through combination of columns
       
        for i in range(3,6): # iterate through options of n neighbors
            if n_search:
                X = df_merged[list(tup)]
                y = df_merged["Pos"]

                # scale the selected features before fitting KNN
                scaler = StandardScaler()
                scaler.fit(X)
                X_scaled = scaler.transform(X)

                # split the scaled features into train and test sets
                X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

                clf = KNeighborsClassifier(n_neighbors=i)
                clf.fit(X_scaled_train, y_train)

                # evaluate with log loss on the held-out test set
                probs = clf.predict_proba(X_scaled_test)
                loss = log_loss(y_true=y_test, y_pred=probs, labels=clf.classes_)

                results[trial_num] = {
                    'features': list(tup) ,
                    'n_neighbors': i,
                    'log_loss': loss,
                    'classifier': clf
                }

                trial_num+=1

                if loss < .7:
                    n_search = False
                    loss_min = True
                    found_clf = True
                    print(f"Found ideal n_neighbors")
                    break
        
        if (n_search == False) or (loss<.6): 
            loss_min = True
            print('Found combination of features')
            break
            
    if loss_min:
        print('Return classifier')
        break

if not found_clf:
    print(f"Couldn't find accurate classifier. Continue to find best results.")
Testing: 18 Choose 12
Found ideal n_neighbors
Found combination of features
Return classifier

Return Best Results

Find the training iteration with the best log_loss.

Return the classifier and print the features selected, neighbors used, and corresponding log_loss.

min_key = 0  # trial number with the smallest log_loss seen so far
min_log_loss = results[0]['log_loss']
for key in results:
    # key = trial number
    iter_features = results[key]['features']
    iter_n_neighbors = results[key]['n_neighbors']
    iter_log_loss = results[key]['log_loss']
    
    if iter_log_loss < min_log_loss:
        min_log_loss = iter_log_loss
        min_key=key

print(f"Total Attempts: {len(results)}")
print(f"Best log_loss: {results[min_key]['log_loss']}")
print(f"Best features: {results[min_key]['features']}")
print(f"Number of features: {len(results[min_key]['features'])}")
print(f"Ideal n_neighbors: {results[min_key]['n_neighbors']}")
print(f"Best classifier: {results[min_key]['classifier']}")
Total Attempts: 54609
Best log_loss: 0.6832883413753273
Best features: ['3PM', '3PA', 'FTM', 'FTA', 'TOV', 'ORB', 'DRB', 'REB', 'AST', 'PTS', 'height_cm', 'weight_kg']
Number of features: 12
Ideal n_neighbors: 5
Best classifier: KNeighborsClassifier()

Predict the position of NBA players on the entire dataset.

Access the best classifier in the results dict via min_key.

X = df_merged[results[min_key]['features']]
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

clf = results[min_key]['classifier']

df_merged['Preds'] = clf.predict(X_scaled)

Visualize Results

Display all the Centers. Our predicted values show good results at identifying Centers.

Look at a true Point Guard. Chris Paul is a good example of a Point Guard.

Look at LeBron James. In 2018, for Cleveland, Kaggle has him listed as a Power Forward.

df_merged[df_merged['Pos_enc']==5]
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
2 Steven Adams C 24 OKC 76 2487.0 448 712 0 2 ... 685 88 92 78 1056 1993.0 213.0 120.0 5 C
3 Bam Adebayo C 20 MIA 69 1368.1 174 340 0 7 ... 381 101 32 41 477 1997.0 208.0 116.0 5 C
4 LaMarcus Aldridge C 32 SAS 75 2508.7 687 1347 27 92 ... 635 152 43 90 1735 1985.0 211.0 120.0 5 C
5 Jarrett Allen C 19 BRK 72 1441.2 234 397 5 15 ... 388 49 28 88 587 1998.0 211.0 108.0 5 C
16 Aron Baynes C 31 BOS 81 1484.7 210 446 3 21 ... 434 93 22 51 482 1986.0 208.0 118.0 5 C
21 Jordan Bell C 23 GSW 57 808.9 116 185 0 4 ... 207 102 35 56 262 1995.0 206.0 102.0 5 C
23 Bismack Biyombo C 25 ORL 82 1494.9 183 352 0 1 ... 468 66 21 95 468 1992.0 206.0 116.0 5 C
25 Tarik Black C 26 HOU 51 536.5 75 127 1 11 ... 163 13 21 30 180 1991.0 206.0 113.0 5 C
35 Clint Capela C 23 HOU 74 2034.2 441 676 0 1 ... 802 68 58 138 1026 1994.0 208.0 109.0 5 C
38 Willie Cauley-Stein C 24 SAC 73 2044.0 388 773 3 12 ... 510 172 77 67 932 1993.0 213.0 109.0 5 C
43 Zach Collins C 20 POR 66 1045.6 115 289 35 113 ... 221 52 17 31 292 1997.0 213.0 107.0 5 C
51 Deyonta Davis C 21 MEM 62 942.6 161 265 0 0 ... 250 40 15 39 360 1996.0 206.0 108.0 5 C
52 Ed Davis C 28 POR 78 1470.9 170 292 0 1 ... 575 40 32 52 414 1989.0 208.0 102.0 5 C
53 Dewayne Dedmon C 28 ATL 62 1542.3 250 477 50 141 ... 489 90 40 51 617 1989.0 213.0 111.0 5 C
57 Gorgui Dieng C 28 MIN 79 1332.7 186 388 19 61 ... 360 71 45 39 470 1990.0 211.0 114.0 5 C
60 Andre Drummond C 24 DET 78 2625.0 466 881 0 11 ... 1247 237 114 127 1171 1993.0 211.0 127.0 5 C
63 Joel Embiid C 23 PHI 63 1912.3 510 1056 66 214 ... 690 199 40 111 1445 1994.0 213.0 113.0 5 C
64 Derrick Favors C 26 UTA 77 2153.5 395 702 14 63 ... 552 102 54 82 944 1991.0 208.0 120.0 5 C
72 Marc Gasol C 33 MEM 73 2408.4 434 1033 109 320 ... 592 305 54 101 1258 1985.0 216.0 116.0 5 C
73 Pau Gasol C 37 SAS 77 1812.0 287 627 43 120 ... 619 238 24 79 775 1980.0 213.0 113.0 5 C
78 Rudy Gobert C 25 UTA 56 1816.1 276 444 0 0 ... 601 80 44 129 756 1992.0 216.0 111.0 5 C
81 Marcin Gortat C 33 WAS 82 2074.6 290 560 0 0 ... 623 151 40 61 690 1984.0 211.0 109.0 5 C
90 Montrezl Harrell C 24 LAC 76 1293.0 348 548 1 7 ... 307 74 36 52 836 1994.0 203.0 109.0 5 PF
94 John Henson C 27 MIL 76 1969.7 287 502 1 7 ... 513 114 45 109 665 1990.0 211.0 99.0 5 C
100 Al Horford C 31 BOS 72 2277.1 368 753 97 226 ... 530 339 43 78 927 1986.0 208.0 111.0 5 PF
112 Amir Johnson C 30 PHI 74 1170.3 140 260 10 32 ... 330 118 45 44 342 1987.0 206.0 109.0 5 C
117 Nikola Jokic C 22 DEN 75 2440.9 504 1010 111 280 ... 803 458 90 61 1385 1995.0 213.0 113.0 5 C
119 DeAndre Jordan C 29 LAC 77 2423.7 373 578 0 0 ... 1171 117 39 71 927 1988.0 211.0 120.0 5 C
121 Enes Kanter C 25 NYK 71 1829.8 422 713 0 2 ... 780 105 36 37 1000 1992.0 211.0 113.0 5 C
125 Kosta Koufos C 28 SAC 71 1391.2 222 389 0 0 ... 472 87 48 32 477 1989.0 213.0 120.0 5 C
133 Kevon Looney C 21 GSW 66 910.0 112 193 1 5 ... 215 42 34 56 267 1996.0 206.0 100.0 5 C
134 Brook Lopez C 29 LAL 74 1735.0 369 793 112 325 ... 294 126 30 98 961 1988.0 213.0 122.0 5 SF
135 Robin Lopez C 29 CHI 64 1690.5 342 645 4 14 ... 290 124 14 53 756 1988.0 213.0 125.0 5 C
136 Kevin Love C 29 CLE 59 1651.2 334 729 137 330 ... 546 103 43 24 1039 1988.0 208.0 114.0 5 PF
140 Ian Mahinmi C 31 WAS 77 1145.3 138 248 0 2 ... 312 53 38 42 366 1986.0 211.0 119.0 5 C
141 Thon Maker C 20 MIL 74 1237.8 130 316 31 104 ... 225 46 38 53 356 1997.0 216.0 100.0 5 C
148 JaVale McGee C 30 GSW 65 615.2 136 219 0 6 ... 169 33 21 57 310 1988.0 213.0 122.0 5 C
150 Salah Mejri C 31 DAL 61 728.9 88 137 0 3 ... 246 35 22 67 214 1986.0 218.0 107.0 5 C
164 Dirk Nowitzki C 39 DAL 77 1900.4 346 758 138 337 ... 438 120 43 45 927 1978.0 213.0 111.0 5 PF
166 Jusuf Nurkic C 23 POR 79 2088.3 480 951 0 7 ... 708 143 64 111 1132 1994.0 213.0 132.0 5 C
169 Kyle O'Quinn C 27 NYK 77 1386.6 224 384 4 17 ... 470 158 36 98 550 1990.0 208.0 113.0 5 C
174 Zaza Pachulia C 33 GSW 69 971.8 149 264 0 1 ... 321 109 38 17 373 1984.0 211.0 122.0 5 C
179 Mason Plumlee C 27 DEN 74 1441.3 221 368 0 1 ... 400 142 49 81 524 1990.0 211.0 107.0 5 C
180 Jakob Poeltl C 22 TOR 82 1523.5 253 384 1 2 ... 393 57 39 100 567 1995.0 213.0 109.0 5 C
183 Dwight Powell C 26 DAL 79 1671.6 255 430 28 84 ... 444 91 67 32 671 1991.0 211.0 109.0 5 C
185 Julius Randle C 23 LAL 82 2190.5 504 904 10 45 ... 654 210 43 45 1323 1994.0 206.0 113.0 5 C
193 Domantas Sabonis C 21 IND 74 1811.8 340 661 13 37 ... 572 151 40 32 861 1996.0 211.0 109.0 5 C
211 Daniel Theis C 25 BOS 63 935.7 126 233 18 58 ... 274 56 30 48 331 1992.0 203.0 110.0 5 C
214 Tristan Thompson C 26 CLE 53 1072.2 132 235 0 0 ... 352 33 16 17 307 1991.0 208.0 108.0 5 C
217 Karl-Anthony Towns C 22 MIN 82 2918.1 639 1172 120 285 ... 1012 199 64 115 1743 1995.0 213.0 112.0 5 C
220 Myles Turner C 21 IND 65 1836.3 306 639 56 157 ... 417 87 38 118 828 1996.0 211.0 113.0 5 PF
221 Ekpe Udoh C 30 UTA 63 809.5 60 120 0 1 ... 150 53 43 74 162 1987.0 208.0 111.0 5 SF
222 Jonas Valanciunas C 25 TOR 77 1727.3 390 687 30 74 ... 660 81 29 69 980 1992.0 213.0 120.0 5 C
225 David West C 37 GSW 73 998.7 216 378 3 8 ... 238 138 47 75 495 1980.0 206.0 113.0 5 C
227 Hassan Whiteside C 28 MIA 54 1364.2 312 578 2 2 ... 618 54 38 94 754 1989.0 216.0 120.0 5 C

55 rows × 26 columns

df_merged[df_merged['Player']=='Chris Paul']
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
178 Chris Paul PG 32 HOU 58 1846.6 367 798 144 379 ... 313 457 96 14 1081 1985.0 183.0 79.0 1 PG

1 rows × 26 columns

df_merged[df_merged['Player']=='LeBron James']
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
110 LeBron James PF 33 CLE 82 3025.8 857 1580 149 406 ... 709 747 116 71 2251 1984.0 203.0 113.0 4 PF

1 rows × 26 columns

Evaluate Performance

Evaluate the performance of the classifier using the log_loss metric.
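As a quick sanity check on the metric itself, here is a minimal sketch (with made-up toy probabilities, not values from the NBA data) showing that log loss is near zero for confident correct predictions and large for confident wrong ones:

from sklearn.metrics import log_loss

# toy example: two samples, two classes, probability columns ordered as ["C", "PG"]
y_true = ["C", "PG"]
confident_right = [[0.9, 0.1], [0.1, 0.9]]
confident_wrong = [[0.1, 0.9], [0.9, 0.1]]

print(log_loss(y_true, confident_right, labels=["C", "PG"]))  # roughly 0.11
print(log_loss(y_true, confident_wrong, labels=["C", "PG"]))  # roughly 2.30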

# re-split the scaled data and evaluate the chosen classifier on the test portion
X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

probs = clf.predict_proba(X_scaled_test)

loss = log_loss(y_true=y_test, y_pred=probs, labels=clf.classes_)
loss
0.6991690157387052

The 1st row is mostly pink with some light blue: we are good at determining Point Guards, though Small Forwards sometimes trigger false positives.

The 5th row is mostly blue with some yellow, which makes sense: the model is great at determining Centers, though Power Forwards sometimes trigger false positives.

chart = alt.Chart(df_merged).mark_circle().encode(
    x=alt.X('height_cm:O', axis=alt.Axis(title='Height in cm')),
    y=alt.Y('Pos_enc:O', axis=alt.Axis(title='Encoded Position')),
    color=alt.Color("Preds", title="Positions"),
).properties(
    title="Predicted NBA Positions",
)
chart

Summary

Taking player seasonal stats, height, and weight, we attempted to predict NBA positions by classification. Some NBA positions are easier to predict than others.

References


Datasets used from Kaggle:

Basketball Players Stats per Season - 49 Leagues found here.

NBA Player Stats 2017-2018 found here.
