# Predict Position of Player with Seasonal Stats
Author: <b>Sarah Thayer</b>

Course Project, UC Irvine, Math 10, W22<br>

## Introduction

This project uses two Kaggle datasets of NBA statistics to predict each player's listed position.
One dataset contains the players and their listed positions; the other contains season stats, height, and weight. We merge the two into a single dataset.

## Main portion of the project


In [None]:
import numpy as np
import pandas as pd
import altair as alt

### Load First Dataset 

NBA Stats [found here](https://www.kaggle.com/mcamli/nba17-18?select=nba.csv). 
The columns we are interested in extracting are `Player`, `Tm` (team), `Pos`, and `Age`.

Display shape of the dataframe to confirm we have enough data points.

In [None]:
df_position = pd.read_csv("nba.csv", sep = ',' ,encoding = 'latin-1')

df_position = df_position.loc[:,df_position.columns.isin(['Player', 'Tm', 'Pos', 'Age']) ]
df_position.head()

Unnamed: 0,Player,Pos,Age,Tm
0,Alex Abrines\abrinal01,SG,24,OKC
1,Quincy Acy\acyqu01,PF,27,BRK
2,Steven Adams\adamsst01,C,24,OKC
3,Bam Adebayo\adebaba01,C,20,MIA
4,Arron Afflalo\afflaar01,SG,32,ORL


In [None]:
df_position.shape

(664, 4)

### Clean First Dataset
Rename column `Tm` to `Team`.

Each entry in the `Player` column ends with a unique player ID. Remove the ID to keep only the name (e.g., for "Alex Abrines\abrinal01", drop everything from the "\\" onward).

NBA players who were traded mid-season appear more than once. Drop duplicate `Player` rows from the dataframe and check the shape.

  

In [None]:
df_position = df_position.rename(columns={'Tm' : 'Team'}) 

df_position['Player'] = df_position['Player'].map(lambda x: x.split('\\')[0])

df_pos_unique = df_position[~df_position.duplicated(subset=['Player'])]
df_pos_unique.shape

df_pos_unique.head()

Unnamed: 0,Player,Pos,Age,Team
0,Alex Abrines,SG,24,OKC
1,Quincy Acy,PF,27,BRK
2,Steven Adams,C,24,OKC
3,Bam Adebayo,C,20,MIA
4,Arron Afflalo,SG,32,ORL


### Load Second Dataset

NBA stats [found here](https://www.kaggle.com/jacobbaruch/basketball-players-stats-per-season-49-leagues).

This is a large dataset covering 20 years of stats from 49 different leagues. Filter the dataframe down to the NBA during the 2017 - 2018 regular season. The filtered dataframe supplies each player's height and weight.

In [None]:
df = pd.read_csv("players_stats_by_season_full_details.csv",  encoding='latin-1' ) 

df= df[(df["League"] == 'NBA') &  (df["Season"] == '2017 - 2018') & (df["Stage"] == 'Regular_Season')] 

df_hw = df.loc[:,~df.columns.isin(['Rk', 'League', 'Season', 'Stage', 'birth_month', 'birth_date', 'height', 'weight',
                                  'nationality', 'high_school', 'draft_round', 'draft_pick', 'draft_team'])]

df_hw.shape

(279, 22)

### Clean Second Dataset
Drop duplicate `Player` rows from the dataframe containing height and weight.

In [None]:
df_hw_unique = df_hw[~df_hw.duplicated(subset=['Player'])]
df_hw_unique.shape

(279, 22)

### Prepare Merged Data
- Merge First and Second Dataset 
- Encode the NBA listed positions 


Confirm it's the same player by matching name and team.

In [None]:
df_merged = pd.merge(df_pos_unique,df_hw_unique, on = ['Player','Team'])
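
`pd.merge` defaults to an inner join, so players whose name and team do not match exactly in both dataframes are dropped without warning. A minimal sketch, on made-up rows, of how an outer merge with `indicator=True` could be used to see which rows fail to match; the player names and values here are hypothetical:

```python
import pandas as pd

# Toy stand-ins for the two cleaned dataframes (values are made up for illustration).
positions = pd.DataFrame({
    "Player": ["A. Guard", "B. Center", "C. Wing"],
    "Team": ["OKC", "MIA", "ORL"],
    "Pos": ["PG", "C", "SF"],
})
stats = pd.DataFrame({
    "Player": ["A. Guard", "B. Center", "C. Wing"],
    "Team": ["OKC", "MIA", "BOS"],  # C. Wing is listed with a different team here
    "height_cm": [185.0, 211.0, 201.0],
})

# An outer merge with indicator=True labels each row by the dataframe(s) it came from.
check = pd.merge(positions, stats, on=["Player", "Team"], how="outer", indicator=True)
matched = check[check["_merge"] == "both"]
unmatched = check[check["_merge"] != "both"]

print(f"matched rows: {len(matched)}")
print(unmatched[["Player", "Team", "_merge"]])
```

Rows marked `left_only` or `right_only` are the ones an inner merge would discard, which is one way to explain a merged dataframe that is smaller than either input.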

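### Encode Positions
Map the position strings to integers (`PG` = 1 through `C` = 5). Each position gets a single integer code, which is an ordinal (label) encoding rather than a one-hot encoding.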

In [None]:
enc = {'PG' : 1, 'SG' : 2,  'SF': 3, 'PF': 4, 'C':5}
df_merged["Pos_enc"] = df_merged["Pos"].map(enc)
df_merged

Unnamed: 0,Player,Pos,Age,Team,GP,MIN,FGM,FGA,3PM,3PA,...,DRB,REB,AST,STL,BLK,PTS,birth_year,height_cm,weight_kg,Pos_enc
0,Alex Abrines,SG,24,OKC,75,1134.2,115,291,84,221,...,88,114,28,38,8,353,1993.0,198.0,86.0,2
1,Quincy Acy,PF,27,BRK,70,1359.2,130,365,102,292,...,217,257,57,33,29,411,1990.0,201.0,109.0,4
2,Steven Adams,C,24,OKC,76,2487.0,448,712,0,2,...,301,685,88,92,78,1056,1993.0,213.0,120.0,5
3,Bam Adebayo,C,20,MIA,69,1368.1,174,340,0,7,...,263,381,101,32,41,477,1997.0,208.0,116.0,5
4,LaMarcus Aldridge,C,32,SAS,75,2508.7,687,1347,27,92,...,389,635,152,43,90,1735,1985.0,211.0,120.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229,Lou Williams,SG,31,LAC,79,2589.2,582,1337,186,518,...,158,198,417,85,19,1782,1986.0,185.0,79.0,2
230,Justise Winslow,PF,21,MIA,68,1680.2,207,488,49,129,...,306,370,148,54,33,529,1996.0,201.0,102.0,4
231,Delon Wright,PG,25,TOR,69,1432.7,201,432,56,153,...,153,198,200,72,33,555,1992.0,196.0,83.0,1
232,Nick Young,SG,32,GSW,80,1393.1,201,488,123,326,...,105,125,36,38,7,581,1985.0,201.0,95.0,2
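
For comparison, a minimal sketch on made-up rows that places the integer mapping used above next to a one-hot encoding built with `pd.get_dummies`, which creates one indicator column per position. The `toy` dataframe is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({"Pos": ["PG", "SG", "SF", "PF", "C", "PG"]})

# Integer mapping, as in the cell above: one column of codes from 1 to 5.
enc = {"PG": 1, "SG": 2, "SF": 3, "PF": 4, "C": 5}
toy["Pos_enc"] = toy["Pos"].map(enc)

# One-hot encoding: one 0/1 indicator column per position.
one_hot = pd.get_dummies(toy["Pos"], prefix="Pos", dtype=int)

print(toy)
print(one_hot)
```

The integer codes are convenient for filtering and plotting, as done later with `Pos_enc`, while the one-hot form avoids implying that the positions are evenly spaced numbers.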


### Find Best Model

Feature Selection: The data has player name, team, position, height, weight, and 20+ seasonal stats. Not all features are relevant to predicting NBA position, so we test different variations of features by iterating through `combinations()` of k columns in `cols`. The number of combinations, and therefore an estimate of the number of training trials, is given by the binomial coefficient: ["...the number of k-element subsets (or k-combinations) of an n-element set"](https://en.wikipedia.org/wiki/Binomial_coefficient).
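
To get a feel for the size of the search, the binomial coefficients can be added up directly. This sketch assumes the 18 candidate columns, subset sizes 12 through 17, and three `n_neighbors` values used in the search cell; it gives an upper bound, since the search stops early once its loss threshold is met:

```python
from math import comb

n_features = 18        # number of candidate columns in `cols`
neighbor_options = 3   # range(3, 6) tries n_neighbors = 3, 4, 5

total = 0
for k in range(12, 18):
    subsets = comb(n_features, k)  # number of k-column subsets
    total += subsets * neighbor_options
    print(f"{n_features} choose {k} = {subsets:>6} subsets, {subsets * neighbor_options:>6} fits")

print(f"Upper bound on training runs: {total}")
```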

KNN Optimization: On each combination of possible training features, iterate through range of ints for possible `n_neighbors`.

Log Results: Create results dictionary to store `features`, `n_neighbors`, `log_loss`, and `classifier` for each training iteration. Iterate through results dictionary to find smallest `log_loss` along with features used, `n_neighbors` used, and the classifier object.

```
results = {
    trial_num: {
        'features': [...],     # list of column names
        'n_neighbors': ...,    # int
        'log_loss': ...,       # float
        'classifier': ...,     # fitted KNeighborsClassifier
    }
}
```



In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

import warnings
warnings.filterwarnings('ignore')  # silence all warnings, including FutureWarning

from itertools import combinations
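# Candidate feature columns.
# Caution: 'Pos_enc' is the encoded target `Pos`, so any subset that includes it
# hands the classifier the label as a feature.
cols = ['Pos_enc','FGM', 'FGA', '3PM','3PA','FTM', 'FTA','TOV','PF','ORB','DRB',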
        'REB','AST','STL', "BLK",'PTS','height_cm','weight_kg'
]

trial_num = 0 # count of training attempts
loss_min = False # found log_loss minimum
n_search = True # searching for ideal n neighbors 
results = {} # dictionary of results per training attempt

found_clf = False
for count in range(12, 18): 
    print(f"Testing: {len(cols)} Choose {count}")
    
    for tup in combinations(cols,count):  # iterate through combination of columns
       
        for i in range(3,6): # iterate through options of n neighbors
            if n_search:
                X = df_merged[list(tup)]
                y = df_merged["Pos"]
                scaler = StandardScaler()
                scaler.fit(X)
                X_scaled = scaler.transform(X)

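                # Caution: the split is drawn from the raw X rather than X_scaled,
                # so the classifier is trained and scored on unscaled features.
                X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X,y,test_size=0.2)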

                clf = KNeighborsClassifier(n_neighbors=i)
                clf.fit(X_scaled_train,y_train)

                probs = clf.predict_proba(X_scaled_test)
                loss = log_loss( y_true=y_test ,y_pred=probs , labels= clf.classes_)

                results[trial_num] = {
                    'features': list(tup) ,
                    'n_neighbors': i,
                    'log_loss': loss,
                    'classifier': clf
                }

                trial_num+=1

                if loss < .7:
                    n_search = False
                    loss_min = True
                    found_clf = True
                    print(f"Found ideal n_neighbors")
                    break
        
        if (n_search == False) or (loss<.6): 
            loss_min = True
            print('Found combination of features')
            break
            
    if loss_min:
        print('Return classifier')
        break

if not found_clf:
    print(f"Couldn't find accurate classifier. Continue to find best results.")

Testing: 18 Choose 12
Found ideal n_neighbors
Found combination of features
Return classifier
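
Because K-nearest-neighbors relies on distances, features on large scales (such as `PTS`) can dominate features on small scales. One hedged way to keep standardization tied to the classifier is a scikit-learn pipeline, which fits the scaler on the training rows only and applies it automatically at prediction time. The sketch below uses synthetic data, not the NBA dataframe, and the names `model` and `rng` are illustrative:

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales (made up).
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
X = np.column_stack([
    190 + 8 * labels + rng.normal(0, 4, size=300),      # height-like, in cm
    500 - 150 * labels + rng.normal(0, 120, size=300),  # assist-like season totals
])

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0, stratify=labels
)

# The pipeline fits the scaler on the training rows only and reuses it when predicting.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

probs = model.predict_proba(X_test)
loss = log_loss(y_test, probs, labels=model.classes_)
print(f"log_loss on held-out rows: {loss:.3f}")
```

A pipeline stored in `results` could later be called with the raw feature columns, since the scaling step travels with the classifier.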


### Return Best Results

Find the training iteration with the best `log_loss`. 

Return the classifier and print the features selected, neighbors used, and corresponding `log_loss`.

In [None]:
# Trial number (key) with the smallest log_loss.
# Using min() also covers the case where trial 0 is the best one.
min_key = min(results, key=lambda key: results[key]['log_loss'])

print(f"Total Attempts: {len(results)}")
print(f"Best log_loss: {results[min_key]['log_loss']}")
print(f"Best features: {results[min_key]['features']}")
print(f"Number of features: {len(results[min_key]['features'])}")
print(f"Ideal n_neighbors: {results[min_key]['n_neighbors']}")
print(f"Best classifier: {results[min_key]['classifier']}")

Total Attempts: 54609
Best log_loss: 0.6832883413753273
Best features: ['3PM', '3PA', 'FTM', 'FTA', 'TOV', 'ORB', 'DRB', 'REB', 'AST', 'PTS', 'height_cm', 'weight_kg']
Number of features: 12
Ideal n_neighbors: 5
Best classifier: KNeighborsClassifier()


Predict position of NBA players on entire dataset.

Access best classifier in `results` dict by `min_key`.

In [None]:
X = df_merged[results[min_key]['features']]

# The classifier was fit on the unscaled feature columns during the search,
# so predict on X as-is.
clf = results[min_key]['classifier']

df_merged['Preds'] = clf.predict(X)

### Visualize Results

Display all the Centers. The predictions identify Centers well.

Look at a true Point Guard. Chris Paul is a good example of a Point Guard.

Look at LeBron James. In 2018, playing for Cleveland, he is listed in the Kaggle data as a Power Forward.

In [None]:
df_merged[df_merged['Pos_enc']==5]

Unnamed: 0,Player,Pos,Age,Team,GP,MIN,FGM,FGA,3PM,3PA,...,REB,AST,STL,BLK,PTS,birth_year,height_cm,weight_kg,Pos_enc,Preds
2,Steven Adams,C,24,OKC,76,2487.0,448,712,0,2,...,685,88,92,78,1056,1993.0,213.0,120.0,5,C
3,Bam Adebayo,C,20,MIA,69,1368.1,174,340,0,7,...,381,101,32,41,477,1997.0,208.0,116.0,5,C
4,LaMarcus Aldridge,C,32,SAS,75,2508.7,687,1347,27,92,...,635,152,43,90,1735,1985.0,211.0,120.0,5,C
5,Jarrett Allen,C,19,BRK,72,1441.2,234,397,5,15,...,388,49,28,88,587,1998.0,211.0,108.0,5,C
16,Aron Baynes,C,31,BOS,81,1484.7,210,446,3,21,...,434,93,22,51,482,1986.0,208.0,118.0,5,C
21,Jordan Bell,C,23,GSW,57,808.9,116,185,0,4,...,207,102,35,56,262,1995.0,206.0,102.0,5,C
23,Bismack Biyombo,C,25,ORL,82,1494.9,183,352,0,1,...,468,66,21,95,468,1992.0,206.0,116.0,5,C
25,Tarik Black,C,26,HOU,51,536.5,75,127,1,11,...,163,13,21,30,180,1991.0,206.0,113.0,5,C
35,Clint Capela,C,23,HOU,74,2034.2,441,676,0,1,...,802,68,58,138,1026,1994.0,208.0,109.0,5,C
38,Willie Cauley-Stein,C,24,SAC,73,2044.0,388,773,3,12,...,510,172,77,67,932,1993.0,213.0,109.0,5,C


In [None]:
df_merged[df_merged['Player']=='Chris Paul']

Unnamed: 0,Player,Pos,Age,Team,GP,MIN,FGM,FGA,3PM,3PA,...,REB,AST,STL,BLK,PTS,birth_year,height_cm,weight_kg,Pos_enc,Preds
178,Chris Paul,PG,32,HOU,58,1846.6,367,798,144,379,...,313,457,96,14,1081,1985.0,183.0,79.0,1,PG


In [None]:
df_merged[df_merged['Player']=='LeBron James']

Unnamed: 0,Player,Pos,Age,Team,GP,MIN,FGM,FGA,3PM,3PA,...,REB,AST,STL,BLK,PTS,birth_year,height_cm,weight_kg,Pos_enc,Preds
110,LeBron James,PF,33,CLE,82,3025.8,857,1580,149,406,...,709,747,116,71,2251,1984.0,203.0,113.0,4,PF


### Evaluate Performance
Evaluate the performance of the classifier using the `log_loss` metric.

In [None]:
# Draw a fresh random 80/20 split of the best feature set. Many of these test rows
# were also seen when `clf` was trained, so this estimate is likely optimistic.
X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

probs = clf.predict_proba(X_scaled_test)

loss = log_loss(y_true=y_test, y_pred=probs, labels=clf.classes_)
loss

0.6991690157387052
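
A single random 80/20 split gives a noisy estimate, since `log_loss` changes with every draw of `train_test_split`. A sketch of averaging over several folds with `cross_val_score`, on synthetic stand-in data and assuming a pipeline-wrapped KNN; this is an illustration, not the evaluation that produced the number above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: five made-up classes, 40 rows each, two overlapping features.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(5), 40)
X = np.column_stack([
    185 + 6 * y + rng.normal(0, 4, size=y.size),  # height-like
    80 + 9 * y + rng.normal(0, 6, size=y.size),   # weight-like
])

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negated log loss so that larger is better; flip the sign back.
scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")
print(f"log_loss per fold: {np.round(scores, 3)}")
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Every row is held out exactly once, so the mean is not inflated by rows the classifier has already seen.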

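In the chart below, each row is a listed position (encoded 1 to 5) and the color is the predicted position.

The 1st row (Point Guards) is mostly pink with some light blue.
The classifier is good at identifying Point Guards, though some are misclassified as Small Forwards.

The 5th row (Centers) is mostly blue with some yellow.
This makes sense: the classifier is great at identifying Centers, though some are misclassified as Power Forwards.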

In [None]:
chart = alt.Chart(df_merged).mark_circle().encode(
    x=alt.X('height_cm:O', axis=alt.Axis(title='Height in cm')),
    y=alt.Y('Pos_enc:O', axis=alt.Axis(title='Encoded')),  # the y channel uses alt.Y
    color=alt.Color("Preds", title="Positions"),
).properties(
    title="Predicted NBA Positions",
)
chart
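
The row-by-row reading of the chart can also be expressed as a table of listed versus predicted positions. A minimal sketch using `pd.crosstab` on made-up labels that stand in for the `Pos` and `Preds` columns; the counts are hypothetical and are not results from this project:

```python
import pandas as pd

# Made-up listed and predicted positions, standing in for `Pos` and `Preds`.
toy = pd.DataFrame({
    "Pos":   ["PG", "PG", "PG", "SF", "SF", "PF", "C", "C", "C"],
    "Preds": ["PG", "PG", "SF", "SF", "PG", "C",  "C", "C", "PF"],
})
order = ["PG", "SG", "SF", "PF", "C"]

# Rows are listed positions, columns are predictions; the diagonal holds the hits.
table = pd.crosstab(toy["Pos"], toy["Preds"]).reindex(
    index=order, columns=order, fill_value=0
)
accuracy = (toy["Pos"] == toy["Preds"]).mean()

print(table)
print(f"accuracy: {accuracy:.2f}")
```

Off-diagonal cells show which pairs of positions get confused, which the chart only conveys through color.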

## Summary
Using players' seasonal stats, height, and weight, we attempted to predict NBA positions with a K-nearest-neighbors classifier. Some NBA positions are easier to predict than others: Centers and Point Guards were identified most reliably.

## References


**Dataframes used from Kaggle**


Basketball Players Stats per Season - 49 Leagues [found here](https://www.kaggle.com/jacobbaruch/basketball-players-stats-per-season-49-leagues).

NBA Player Stats 2017-2018 [found here](https://www.kaggle.com/mcamli/nba17-18?select=nba.csv). 

