Predict Position of Player with Seasonal Stats

Author: Sarah Thayer

Course Project, UC Irvine, Math 10, W22

Introduction

My project explores two Kaggle datasets of NBA stats to predict each player's listed position. Two datasets are needed: one contains the players and their listed positions, and the second contains season stats, height, and weight. We merge the two into a complete dataset.

Main portion of the project


import numpy as np
import pandas as pd
import altair as alt

Load First Dataset

NBA stats found here. The columns we are interested in extracting are Player, Tm, Pos, and Age.

Display shape of the dataframe to confirm we have enough data points.

df_position = pd.read_csv("nba.csv", sep = ',' ,encoding = 'latin-1')

df_position = df_position.loc[:,df_position.columns.isin(['Player', 'Tm', 'Pos', 'Age']) ]
df_position.head()
Player Pos Age Tm
0 Alex Abrines\abrinal01 SG 24 OKC
1 Quincy Acy\acyqu01 PF 27 BRK
2 Steven Adams\adamsst01 C 24 OKC
3 Bam Adebayo\adebaba01 C 20 MIA
4 Arron Afflalo\afflaar01 SG 32 ORL
df_position.shape
(664, 4)

Clean First Dataset

Rename column Tm to Team.

NBA players have a unique player ID in the Player column. Remove the player ID to view names (e.g. "Alex Abrines\abrinal01": keep the name and drop the unique player ID after the backslash).
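A quick sketch of that split on the example name above (the raw value contains a literal backslash, so in Python we split on '\\'):

"Alex Abrines\\abrinal01".split('\\')[0]   # returns 'Alex Abrines'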

NBA players that were traded mid-season appear more than once. Drop Player duplicates from the dataframe and check the shape.

df_position = df_position.rename(columns={'Tm' : 'Team'}) 

df_position['Player'] = df_position['Player'].map(lambda x: x.split('\\')[0])

df_pos_unique = df_position[~df_position.duplicated(subset=['Player'])]
df_pos_unique.shape

df_pos_unique.head()
Player Pos Age Team
0 Alex Abrines SG 24 OKC
1 Quincy Acy PF 27 BRK
2 Steven Adams C 24 OKC
3 Bam Adebayo C 20 MIA
4 Arron Afflalo SG 32 ORL

Load Second Dataset

NBA stats found here.

This is a large dataset of 20 years of stats from 49 different leagues. Parse the dataframe for the relevant rows: the NBA league during the 2017 - 2018 regular season. The resulting dataframe also contains height and weight.

df = pd.read_csv("players_stats_by_season_full_details.csv",  encoding='latin-1' ) 

df= df[(df["League"] == 'NBA') &  (df["Season"] == '2017 - 2018') & (df["Stage"] == 'Regular_Season')] 

df_hw = df.loc[:,~df.columns.isin(['Rk', 'League', 'Season', 'Stage', 'birth_month', 'birth_date', 'height', 'weight',
                                  'nationality', 'high_school', 'draft_round', 'draft_pick', 'draft_team'])]

df_hw.shape
(279, 22)

Clean Second Dataset

Drop duplicate Player rows from the dataframe containing height and weight.

df_hw_unique = df_hw[~df_hw.duplicated(subset=['Player'])]
df_hw_unique.shape
(279, 22)

Prepare Merged Data

  • Merge First and Second Dataset

  • Encode the NBA listed positions

Confirm it’s the same player by matching name and team.

df_merged = pd.merge(df_pos_unique,df_hw_unique, on = ['Player','Team'])

Label Encoding

Encode the positions (strings) into integer labels.

enc = {'PG' : 1, 'SG' : 2,  'SF': 3, 'PF': 4, 'C':5}
df_merged["Pos_enc"] = df_merged["Pos"].map(enc)
df_merged
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... DRB REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc
0 Alex Abrines SG 24 OKC 75 1134.2 115 291 84 221 ... 88 114 28 38 8 353 1993.0 198.0 86.0 2
1 Quincy Acy PF 27 BRK 70 1359.2 130 365 102 292 ... 217 257 57 33 29 411 1990.0 201.0 109.0 4
2 Steven Adams C 24 OKC 76 2487.0 448 712 0 2 ... 301 685 88 92 78 1056 1993.0 213.0 120.0 5
3 Bam Adebayo C 20 MIA 69 1368.1 174 340 0 7 ... 263 381 101 32 41 477 1997.0 208.0 116.0 5
4 LaMarcus Aldridge C 32 SAS 75 2508.7 687 1347 27 92 ... 389 635 152 43 90 1735 1985.0 211.0 120.0 5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
229 Lou Williams SG 31 LAC 79 2589.2 582 1337 186 518 ... 158 198 417 85 19 1782 1986.0 185.0 79.0 2
230 Justise Winslow PF 21 MIA 68 1680.2 207 488 49 129 ... 306 370 148 54 33 529 1996.0 201.0 102.0 4
231 Delon Wright PG 25 TOR 69 1432.7 201 432 56 153 ... 153 198 200 72 33 555 1992.0 196.0 83.0 1
232 Nick Young SG 32 GSW 80 1393.1 201 488 123 326 ... 105 125 36 38 7 581 1985.0 201.0 95.0 2
233 Thaddeus Young PF 29 IND 81 2607.2 421 864 58 181 ... 328 512 152 135 36 955 1988.0 203.0 100.0 4

234 rows × 25 columns

Find Best Model

Feature Selection: The data has player name, team, position, height, weight, and 20+ seasonal stats. Not all features are relevant to predicting NBA position, so we test different variations of features by iterating through combinations() of k columns from cols. Background on combinations and on estimating the number of training trials can be found here: “…the number of k-element subsets (or k-combinations) of an n-element set”. A rough count is sketched below.
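As a rough sketch of the search size (assuming, as in the loop below, subsets of 12 to 17 of the 18 candidate columns and three n_neighbors options per subset, with early stopping once the loss target is reached):

from math import comb

n_cols = 18                   # number of candidate columns in cols
subset_sizes = range(12, 18)  # matches range(12, 18) in the search loop
neighbor_options = 3          # matches range(3, 6)

# upper bound on training trials if the search never stops early
print(sum(comb(n_cols, k) for k in subset_sizes) * neighbor_options)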

KNN Optimization: For each combination of possible training features, iterate through a range of integers for possible n_neighbors.

Log Results: Create a results dictionary storing the features, n_neighbors, log_loss, and classifier for each training iteration. Then iterate through the results dictionary to find the smallest log_loss along with the features used, the n_neighbors used, and the classifier object.

# Each entry of the results dictionary has the form:
# results[trial_num] = {
#     'features': [...],      # columns used for this trial
#     'n_neighbors': k,       # number of neighbors used for KNN
#     'log_loss': loss,       # log loss on the held-out test set
#     'classifier': clf       # the fitted KNeighborsClassifier
# }
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings('ignore')

from itertools import combinations
# candidate columns to draw feature subsets from ('Pos_enc' is the encoded position)
cols = ['Pos_enc', 'FGM', 'FGA', '3PM', '3PA', 'FTM', 'FTA', 'TOV', 'PF', 'ORB', 'DRB',
        'REB', 'AST', 'STL', 'BLK', 'PTS', 'height_cm', 'weight_kg']

trial_num = 0 # count of training attempts
loss_min = False # found log_loss minimum
n_search = True # searching for ideal n neighbors 
results = {} # dictionary of results per training attempt

found_clf = False
for count in range(12, 18): 
    print(f"Testing: {len(cols)} Choose {count}")
    
    for tup in combinations(cols,count):  # iterate through combination of columns
       
        for i in range(3,6): # iterate through options of n neighbors
            if n_search:
                X = df_merged[list(tup)]
                y = df_merged["Pos"]

                # scale the selected features before fitting KNN
                scaler = StandardScaler()
                scaler.fit(X)
                X_scaled = scaler.transform(X)

                # split the scaled features into train and test sets
                X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

                clf = KNeighborsClassifier(n_neighbors=i)
                clf.fit(X_scaled_train, y_train)

                # evaluate with log loss on the held-out test set
                probs = clf.predict_proba(X_scaled_test)
                loss = log_loss(y_true=y_test, y_pred=probs, labels=clf.classes_)

                results[trial_num] = {
                    'features': list(tup) ,
                    'n_neighbors': i,
                    'log_loss': loss,
                    'classifier': clf
                }

                trial_num+=1

                if loss < .7:
                    n_search = False
                    loss_min = True
                    found_clf = True
                    print(f"Found ideal n_neighbors")
                    break
        
        if (n_search == False) or (loss<.6): 
            loss_min = True
            print('Found combination of features')
            break
            
    if loss_min:
        print('Return classifier')
        break

if not found_clf:
    print(f"Couldn't find accurate classifier. Continue to find best results.")
Testing: 18 Choose 12
Found ideal n_neighbors
Found combination of features
Return classifier

Return Best Results

Find the training iteration with the best log_loss.

Return the classifier and print the features selected, neighbors used, and corresponding log_loss.

min_key = 0  # trial number with the smallest log_loss seen so far
min_log_loss = results[0]['log_loss']
for key in results:
    # key = trial number
    iter_features = results[key]['features']
    iter_n_neighbors = results[key]['n_neighbors']
    iter_log_loss = results[key]['log_loss']
    
    if iter_log_loss < min_log_loss:
        min_log_loss = iter_log_loss
        min_key=key

print(f"Total Attempts: {len(results)}")
print(f"Best log_loss: {results[min_key]['log_loss']}")
print(f"Best features: {results[min_key]['features']}")
print(f"Number of features: {len(results[min_key]['features'])}")
print(f"Ideal n_neighbors: {results[min_key]['n_neighbors']}")
print(f"Best classifier: {results[min_key]['classifier']}")
Total Attempts: 54609
Best log_loss: 0.6832883413753273
Best features: ['3PM', '3PA', 'FTM', 'FTA', 'TOV', 'ORB', 'DRB', 'REB', 'AST', 'PTS', 'height_cm', 'weight_kg']
Number of features: 12
Ideal n_neighbors: 5
Best classifier: KNeighborsClassifier()

Predict the position of NBA players on the entire dataset.

Access the best classifier in the results dict via min_key.

X = df_merged[results[min_key]['features']]
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

clf = results[min_key]['classifier']

df_merged['Preds'] = clf.predict(X_scaled)

Visualize Results

Display all the Centers. Our predicted values show good results at identifying Centers.

Look at a true Point Guard. Chris Paul is a good example of a Point Guard.

Look at LeBron James. In 2018, for Cleveland, Kaggle has him listed as a Power Forward.

df_merged[df_merged['Pos_enc']==5]
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
2 Steven Adams C 24 OKC 76 2487.0 448 712 0 2 ... 685 88 92 78 1056 1993.0 213.0 120.0 5 C
3 Bam Adebayo C 20 MIA 69 1368.1 174 340 0 7 ... 381 101 32 41 477 1997.0 208.0 116.0 5 C
4 LaMarcus Aldridge C 32 SAS 75 2508.7 687 1347 27 92 ... 635 152 43 90 1735 1985.0 211.0 120.0 5 C
5 Jarrett Allen C 19 BRK 72 1441.2 234 397 5 15 ... 388 49 28 88 587 1998.0 211.0 108.0 5 C
16 Aron Baynes C 31 BOS 81 1484.7 210 446 3 21 ... 434 93 22 51 482 1986.0 208.0 118.0 5 C
21 Jordan Bell C 23 GSW 57 808.9 116 185 0 4 ... 207 102 35 56 262 1995.0 206.0 102.0 5 C
23 Bismack Biyombo C 25 ORL 82 1494.9 183 352 0 1 ... 468 66 21 95 468 1992.0 206.0 116.0 5 C
25 Tarik Black C 26 HOU 51 536.5 75 127 1 11 ... 163 13 21 30 180 1991.0 206.0 113.0 5 C
35 Clint Capela C 23 HOU 74 2034.2 441 676 0 1 ... 802 68 58 138 1026 1994.0 208.0 109.0 5 C
38 Willie Cauley-Stein C 24 SAC 73 2044.0 388 773 3 12 ... 510 172 77 67 932 1993.0 213.0 109.0 5 C
43 Zach Collins C 20 POR 66 1045.6 115 289 35 113 ... 221 52 17 31 292 1997.0 213.0 107.0 5 C
51 Deyonta Davis C 21 MEM 62 942.6 161 265 0 0 ... 250 40 15 39 360 1996.0 206.0 108.0 5 C
52 Ed Davis C 28 POR 78 1470.9 170 292 0 1 ... 575 40 32 52 414 1989.0 208.0 102.0 5 C
53 Dewayne Dedmon C 28 ATL 62 1542.3 250 477 50 141 ... 489 90 40 51 617 1989.0 213.0 111.0 5 C
57 Gorgui Dieng C 28 MIN 79 1332.7 186 388 19 61 ... 360 71 45 39 470 1990.0 211.0 114.0 5 C
60 Andre Drummond C 24 DET 78 2625.0 466 881 0 11 ... 1247 237 114 127 1171 1993.0 211.0 127.0 5 C
63 Joel Embiid C 23 PHI 63 1912.3 510 1056 66 214 ... 690 199 40 111 1445 1994.0 213.0 113.0 5 C
64 Derrick Favors C 26 UTA 77 2153.5 395 702 14 63 ... 552 102 54 82 944 1991.0 208.0 120.0 5 C
72 Marc Gasol C 33 MEM 73 2408.4 434 1033 109 320 ... 592 305 54 101 1258 1985.0 216.0 116.0 5 C
73 Pau Gasol C 37 SAS 77 1812.0 287 627 43 120 ... 619 238 24 79 775 1980.0 213.0 113.0 5 C
78 Rudy Gobert C 25 UTA 56 1816.1 276 444 0 0 ... 601 80 44 129 756 1992.0 216.0 111.0 5 C
81 Marcin Gortat C 33 WAS 82 2074.6 290 560 0 0 ... 623 151 40 61 690 1984.0 211.0 109.0 5 C
90 Montrezl Harrell C 24 LAC 76 1293.0 348 548 1 7 ... 307 74 36 52 836 1994.0 203.0 109.0 5 PF
94 John Henson C 27 MIL 76 1969.7 287 502 1 7 ... 513 114 45 109 665 1990.0 211.0 99.0 5 C
100 Al Horford C 31 BOS 72 2277.1 368 753 97 226 ... 530 339 43 78 927 1986.0 208.0 111.0 5 PF
112 Amir Johnson C 30 PHI 74 1170.3 140 260 10 32 ... 330 118 45 44 342 1987.0 206.0 109.0 5 C
117 Nikola Jokic C 22 DEN 75 2440.9 504 1010 111 280 ... 803 458 90 61 1385 1995.0 213.0 113.0 5 C
119 DeAndre Jordan C 29 LAC 77 2423.7 373 578 0 0 ... 1171 117 39 71 927 1988.0 211.0 120.0 5 C
121 Enes Kanter C 25 NYK 71 1829.8 422 713 0 2 ... 780 105 36 37 1000 1992.0 211.0 113.0 5 C
125 Kosta Koufos C 28 SAC 71 1391.2 222 389 0 0 ... 472 87 48 32 477 1989.0 213.0 120.0 5 C
133 Kevon Looney C 21 GSW 66 910.0 112 193 1 5 ... 215 42 34 56 267 1996.0 206.0 100.0 5 C
134 Brook Lopez C 29 LAL 74 1735.0 369 793 112 325 ... 294 126 30 98 961 1988.0 213.0 122.0 5 SF
135 Robin Lopez C 29 CHI 64 1690.5 342 645 4 14 ... 290 124 14 53 756 1988.0 213.0 125.0 5 C
136 Kevin Love C 29 CLE 59 1651.2 334 729 137 330 ... 546 103 43 24 1039 1988.0 208.0 114.0 5 PF
140 Ian Mahinmi C 31 WAS 77 1145.3 138 248 0 2 ... 312 53 38 42 366 1986.0 211.0 119.0 5 C
141 Thon Maker C 20 MIL 74 1237.8 130 316 31 104 ... 225 46 38 53 356 1997.0 216.0 100.0 5 C
148 JaVale McGee C 30 GSW 65 615.2 136 219 0 6 ... 169 33 21 57 310 1988.0 213.0 122.0 5 C
150 Salah Mejri C 31 DAL 61 728.9 88 137 0 3 ... 246 35 22 67 214 1986.0 218.0 107.0 5 C
164 Dirk Nowitzki C 39 DAL 77 1900.4 346 758 138 337 ... 438 120 43 45 927 1978.0 213.0 111.0 5 PF
166 Jusuf Nurkic C 23 POR 79 2088.3 480 951 0 7 ... 708 143 64 111 1132 1994.0 213.0 132.0 5 C
169 Kyle O'Quinn C 27 NYK 77 1386.6 224 384 4 17 ... 470 158 36 98 550 1990.0 208.0 113.0 5 C
174 Zaza Pachulia C 33 GSW 69 971.8 149 264 0 1 ... 321 109 38 17 373 1984.0 211.0 122.0 5 C
179 Mason Plumlee C 27 DEN 74 1441.3 221 368 0 1 ... 400 142 49 81 524 1990.0 211.0 107.0 5 C
180 Jakob Poeltl C 22 TOR 82 1523.5 253 384 1 2 ... 393 57 39 100 567 1995.0 213.0 109.0 5 C
183 Dwight Powell C 26 DAL 79 1671.6 255 430 28 84 ... 444 91 67 32 671 1991.0 211.0 109.0 5 C
185 Julius Randle C 23 LAL 82 2190.5 504 904 10 45 ... 654 210 43 45 1323 1994.0 206.0 113.0 5 C
193 Domantas Sabonis C 21 IND 74 1811.8 340 661 13 37 ... 572 151 40 32 861 1996.0 211.0 109.0 5 C
211 Daniel Theis C 25 BOS 63 935.7 126 233 18 58 ... 274 56 30 48 331 1992.0 203.0 110.0 5 C
214 Tristan Thompson C 26 CLE 53 1072.2 132 235 0 0 ... 352 33 16 17 307 1991.0 208.0 108.0 5 C
217 Karl-Anthony Towns C 22 MIN 82 2918.1 639 1172 120 285 ... 1012 199 64 115 1743 1995.0 213.0 112.0 5 C
220 Myles Turner C 21 IND 65 1836.3 306 639 56 157 ... 417 87 38 118 828 1996.0 211.0 113.0 5 PF
221 Ekpe Udoh C 30 UTA 63 809.5 60 120 0 1 ... 150 53 43 74 162 1987.0 208.0 111.0 5 SF
222 Jonas Valanciunas C 25 TOR 77 1727.3 390 687 30 74 ... 660 81 29 69 980 1992.0 213.0 120.0 5 C
225 David West C 37 GSW 73 998.7 216 378 3 8 ... 238 138 47 75 495 1980.0 206.0 113.0 5 C
227 Hassan Whiteside C 28 MIA 54 1364.2 312 578 2 2 ... 618 54 38 94 754 1989.0 216.0 120.0 5 C

55 rows × 26 columns

df_merged[df_merged['Player']=='Chris Paul']
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
178 Chris Paul PG 32 HOU 58 1846.6 367 798 144 379 ... 313 457 96 14 1081 1985.0 183.0 79.0 1 PG

1 rows × 26 columns

df_merged[df_merged['Player']=='LeBron James']
Player Pos Age Team GP MIN FGM FGA 3PM 3PA ... REB AST STL BLK PTS birth_year height_cm weight_kg Pos_enc Preds
110 LeBron James PF 33 CLE 82 3025.8 857 1580 149 406 ... 709 747 116 71 2251 1984.0 203.0 113.0 4 PF

1 rows × 26 columns

Evaluate Performance

Evaluate the performance of the classifier using the log_loss metric.
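As a quick sanity check on the metric itself, here is a minimal sketch (with made-up toy probabilities, not values from the NBA data) showing that log loss is near zero for confident correct predictions and large for confident wrong ones:

from sklearn.metrics import log_loss

# toy example: two samples, two classes, probability columns ordered as ["C", "PG"]
y_true = ["C", "PG"]
confident_right = [[0.9, 0.1], [0.1, 0.9]]
confident_wrong = [[0.1, 0.9], [0.9, 0.1]]

print(log_loss(y_true, confident_right, labels=["C", "PG"]))  # roughly 0.11
print(log_loss(y_true, confident_wrong, labels=["C", "PG"]))  # roughly 2.30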

# re-split the scaled data and evaluate the chosen classifier on the test portion
X_scaled_train, X_scaled_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)

probs = clf.predict_proba(X_scaled_test)

loss = log_loss(y_true=y_test, y_pred=probs, labels=clf.classes_)
loss
0.6991690157387052

The 1st row is mostly pink with some light blue: we are good at determining Point Guards, though Small Forwards sometimes trigger false positives.

The 5th row is mostly blue with some yellow, which makes sense: the model is great at determining Centers, though Power Forwards sometimes trigger false positives.

chart = alt.Chart(df_merged).mark_circle().encode(
    x=alt.X('height_cm:O', axis=alt.Axis(title='Height in cm')),
    y=alt.Y('Pos_enc:O', axis=alt.Axis(title='Encoded Position')),
    color=alt.Color("Preds", title="Positions"),
).properties(
    title="Predicted NBA Positions",
)
chart

Summary

Taking player seasonal stats, height, and weight, we attempted to predict NBA positions by classification. Some NBA positions are easier to predict than others.

References


Datasets used from Kaggle:

Basketball Players Stats per Season - 49 Leagues found here.

NBA Player Stats 2017-2018 found here.
