Major League Baseball (MLB) Team Performances Predictions & Analysis#
Author: Yuan Gao
Course Project, UC Irvine, Math 10, F22
Introduction#
In this project, I explore a dataset of Major League Baseball (MLB) team seasons. As a baseball fan, I hope to find relationships between team performance and advanced statistics, to predict team performance, and to discover which factor(s) contribute the most to it.
Main Portion of the Project#
import pandas as pd
import altair as alt
import numpy as np
from sklearn.model_selection import train_test_split
df = pd.read_csv("baseball.csv")
df
Team | League | Year | RS | RA | W | OBP | SLG | BA | Playoffs | RankSeason | RankPlayoffs | G | OOBP | OSLG | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.320 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.260 | 0 | NaN | NaN | 162 | 0.331 | 0.428 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.240 | 0 | NaN | NaN | 162 | 0.335 | 0.424 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1227 | PHI | NL | 1962 | 705 | 759 | 81 | 0.330 | 0.390 | 0.260 | 0 | NaN | NaN | 161 | NaN | NaN |
1228 | PIT | NL | 1962 | 706 | 626 | 93 | 0.321 | 0.394 | 0.268 | 0 | NaN | NaN | 161 | NaN | NaN |
1229 | SFG | NL | 1962 | 878 | 690 | 103 | 0.341 | 0.441 | 0.278 | 1 | 1.0 | 2.0 | 165 | NaN | NaN |
1230 | STL | NL | 1962 | 774 | 664 | 84 | 0.335 | 0.394 | 0.271 | 0 | NaN | NaN | 163 | NaN | NaN |
1231 | WSA | AL | 1962 | 599 | 716 | 60 | 0.308 | 0.373 | 0.250 | 0 | NaN | NaN | 162 | NaN | NaN |
1232 rows × 15 columns
I would like to change the column names from abbreviations to full names for better readability. At the same time, I will convert the columns that represent percentages from decimals to percentages by defining a small helper function.
df = df.rename({"RS": "Runs Scored", "RA": "Runs Allowed", "W": "Wins", "OBP": "On-Base Percentage",
"SLG": "Slugging Percentage", "BA": "Batting Average", "G": "Games Played", "OOBP": "Opponents On-Base Percentage",
"OSLG": "Opponents Slugging Percentage"}, axis="columns")
df
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 0.328 | 0.418 | 0.259 | 0 | NaN | NaN | 162 | 0.317 | 0.415 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 0.320 | 0.389 | 0.247 | 1 | 4.0 | 5.0 | 162 | 0.306 | 0.378 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 0.311 | 0.417 | 0.247 | 1 | 5.0 | 4.0 | 162 | 0.315 | 0.403 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 0.315 | 0.415 | 0.260 | 0 | NaN | NaN | 162 | 0.331 | 0.428 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 0.302 | 0.378 | 0.240 | 0 | NaN | NaN | 162 | 0.335 | 0.424 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1227 | PHI | NL | 1962 | 705 | 759 | 81 | 0.330 | 0.390 | 0.260 | 0 | NaN | NaN | 161 | NaN | NaN |
1228 | PIT | NL | 1962 | 706 | 626 | 93 | 0.321 | 0.394 | 0.268 | 0 | NaN | NaN | 161 | NaN | NaN |
1229 | SFG | NL | 1962 | 878 | 690 | 103 | 0.341 | 0.441 | 0.278 | 1 | 1.0 | 2.0 | 165 | NaN | NaN |
1230 | STL | NL | 1962 | 774 | 664 | 84 | 0.335 | 0.394 | 0.271 | 0 | NaN | NaN | 163 | NaN | NaN |
1231 | WSA | AL | 1962 | 599 | 716 | 60 | 0.308 | 0.373 | 0.250 | 0 | NaN | NaN | 162 | NaN | NaN |
1232 rows × 15 columns
def to_percent(x):
    # Convert a decimal rate (e.g., 0.328) to a percentage (32.8), rounded to 2 decimal places.
    return round(x*100, 2)
cols = ["On-Base Percentage", "Slugging Percentage", "Batting Average", "Opponents On-Base Percentage", "Opponents Slugging Percentage"]
for x in cols:
    df[x] = df[x].map(to_percent)
df
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 32.0 | 38.9 | 24.7 | 1 | 4.0 | 5.0 | 162 | 30.6 | 37.8 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 31.1 | 41.7 | 24.7 | 1 | 5.0 | 4.0 | 162 | 31.5 | 40.3 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1227 | PHI | NL | 1962 | 705 | 759 | 81 | 33.0 | 39.0 | 26.0 | 0 | NaN | NaN | 161 | NaN | NaN |
1228 | PIT | NL | 1962 | 706 | 626 | 93 | 32.1 | 39.4 | 26.8 | 0 | NaN | NaN | 161 | NaN | NaN |
1229 | SFG | NL | 1962 | 878 | 690 | 103 | 34.1 | 44.1 | 27.8 | 1 | 1.0 | 2.0 | 165 | NaN | NaN |
1230 | STL | NL | 1962 | 774 | 664 | 84 | 33.5 | 39.4 | 27.1 | 0 | NaN | NaN | 163 | NaN | NaN |
1231 | WSA | AL | 1962 | 599 | 716 | 60 | 30.8 | 37.3 | 25.0 | 0 | NaN | NaN | 162 | NaN | NaN |
1232 rows × 15 columns
For the years before 1999, the dataset contains no opponents' statistics. Therefore, we use only the data from 1999 through 2012.
df = df[df["Year"] > 1998]
df
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 32.0 | 38.9 | 24.7 | 1 | 4.0 | 5.0 | 162 | 30.6 | 37.8 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 31.1 | 41.7 | 24.7 | 1 | 5.0 | 4.0 | 162 | 31.5 | 40.3 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
415 | SFG | NL | 1999 | 872 | 831 | 86 | 35.6 | 43.4 | 27.1 | 0 | NaN | NaN | 162 | 34.5 | 42.3 |
416 | STL | NL | 1999 | 809 | 838 | 75 | 33.8 | 42.6 | 26.2 | 0 | NaN | NaN | 161 | 35.5 | 42.7 |
417 | TBD | AL | 1999 | 772 | 913 | 69 | 34.3 | 41.1 | 27.4 | 0 | NaN | NaN | 162 | 37.1 | 44.8 |
418 | TEX | AL | 1999 | 945 | 859 | 95 | 36.1 | 47.9 | 29.3 | 1 | 5.0 | 4.0 | 162 | 34.6 | 45.9 |
419 | TOR | AL | 1999 | 883 | 862 | 84 | 35.2 | 45.7 | 28.0 | 0 | NaN | NaN | 162 | 35.3 | 45.6 |
420 rows × 15 columns
df1 = df.copy()
df2 = df.copy()
df3 = df.copy()
df4 = df.copy()
Predict Using the Decision Tree Classifier#
Since we want to find relationships between team statistics and team performance (measured here by whether the team makes the playoffs), we use a decision tree classifier to predict playoff appearances.
del df1["RankSeason"]
del df1["RankPlayoffs"]
del df1["Year"]
del df1["Wins"]
del df1["Runs Scored"]
del df1["Runs Allowed"]
del df1["Games Played"]
df1.columns
Index(['Team', 'League', 'On-Base Percentage', 'Slugging Percentage',
'Batting Average', 'Playoffs', 'Opponents On-Base Percentage',
'Opponents Slugging Percentage'],
dtype='object')
from pandas.api.types import is_numeric_dtype
# Keep only the numeric columns of df1.
num_cols = [c for c in df1.columns if is_numeric_dtype(df1[c])]
num_cols
['On-Base Percentage',
'Slugging Percentage',
'Batting Average',
'Playoffs',
'Opponents On-Base Percentage',
'Opponents Slugging Percentage']
df1_1 = df1[num_cols].drop("Playoffs", axis=1)
df1_1
On-Base Percentage | Slugging Percentage | Batting Average | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|
0 | 32.8 | 41.8 | 25.9 | 31.7 | 41.5 |
1 | 32.0 | 38.9 | 24.7 | 30.6 | 37.8 |
2 | 31.1 | 41.7 | 24.7 | 31.5 | 40.3 |
3 | 31.5 | 41.5 | 26.0 | 33.1 | 42.8 |
4 | 30.2 | 37.8 | 24.0 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... |
415 | 35.6 | 43.4 | 27.1 | 34.5 | 42.3 |
416 | 33.8 | 42.6 | 26.2 | 35.5 | 42.7 |
417 | 34.3 | 41.1 | 27.4 | 37.1 | 44.8 |
418 | 36.1 | 47.9 | 29.3 | 34.6 | 45.9 |
419 | 35.2 | 45.7 | 28.0 | 35.3 | 45.6 |
420 rows × 5 columns
df1_1.isna().any(axis=0)
On-Base Percentage False
Slugging Percentage False
Batting Average False
Opponents On-Base Percentage False
Opponents Slugging Percentage False
dtype: bool
df1_1.columns
Index(['On-Base Percentage', 'Slugging Percentage', 'Batting Average',
'Opponents On-Base Percentage', 'Opponents Slugging Percentage'],
dtype='object')
X_train, X_test, y_train, y_test = train_test_split(df1_1, df1["Playoffs"], test_size=0.2, random_state=0)
from sklearn.tree import DecisionTreeClassifier
# Try a range of max_leaf_nodes values; for each, print the training accuracy
# followed by the test accuracy.
for nodes in range(10,400,5):
    clf = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=nodes)
    clf.fit(X_train, y_train)
    a = clf.score(X_train, y_train)
    b = clf.score(X_test, y_test)
    print(a,b)
0.8988095238095238 0.7976190476190477
0.9077380952380952 0.7976190476190477
0.9255952380952381 0.7619047619047619
0.9404761904761905 0.7738095238095238
0.9613095238095238 0.7857142857142857
0.9732142857142857 0.7738095238095238
0.9821428571428571 0.75
0.9880952380952381 0.7857142857142857
0.9940476190476191 0.7142857142857143
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.75
0.9970238095238095 0.75
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.75
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.75
0.9970238095238095 0.8095238095238095
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.75
0.9970238095238095 0.75
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7976190476190477
0.9970238095238095 0.75
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7738095238095238
0.9970238095238095 0.75
0.9970238095238095 0.7857142857142857
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.75
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.75
0.9970238095238095 0.75
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.7738095238095238
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.75
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7738095238095238
0.9970238095238095 0.7261904761904762
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.75
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.75
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.75
0.9970238095238095 0.75
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7738095238095238
0.9970238095238095 0.7738095238095238
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.75
0.9970238095238095 0.7142857142857143
0.9970238095238095 0.75
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7619047619047619
0.9970238095238095 0.7380952380952381
0.9970238095238095 0.7619047619047619
In the loop above, I try max_leaf_nodes over the range (10, 400) in steps of 5. The training accuracy quickly plateaus around 0.99, while the test accuracy stays roughly between 0.71 and 0.81. The sizeable gap between the two suggests the model is overfitting the training data, although increasing max_leaf_nodes does not make the test accuracy noticeably worse.
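If we wanted to choose max_leaf_nodes systematically rather than by eye, a small sketch like the following (my own addition, reusing the X_train/X_test split above and fixing random_state for reproducibility) records the best-scoring value:
# Sketch: repeat the search, but track the best test score and the
# max_leaf_nodes value that produced it.
best_nodes, best_score = None, 0.0
for nodes in range(10, 400, 5):
    clf_temp = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=nodes, random_state=0)
    clf_temp.fit(X_train, y_train)
    score = clf_temp.score(X_test, y_test)
    if score > best_score:
        best_nodes, best_score = nodes, score
print(best_nodes, best_score)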
clf_1 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
clf_1.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
clf_1.score(X_train, y_train)
0.9970238095238095
clf_1.score(X_test, y_test)
0.7619047619047619
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(100,200))
plot_tree(
clf_1,
feature_names=clf_1.feature_names_in_,
filled=True
);
Since the test accuracy is stable across this range, I fit one decision tree with an arbitrarily chosen max_leaf_nodes of 225 and plot it. The clf_1 scores are close to the corresponding scores printed in the loop above.
Predict Using Logistic Regression#
We have already used the DecisionTreeClassifier to predict whether a team makes the playoffs. Now I use logistic regression on the same features and see whether the results are similar or different.
df1_1
On-Base Percentage | Slugging Percentage | Batting Average | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|
0 | 32.8 | 41.8 | 25.9 | 31.7 | 41.5 |
1 | 32.0 | 38.9 | 24.7 | 30.6 | 37.8 |
2 | 31.1 | 41.7 | 24.7 | 31.5 | 40.3 |
3 | 31.5 | 41.5 | 26.0 | 33.1 | 42.8 |
4 | 30.2 | 37.8 | 24.0 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... |
415 | 35.6 | 43.4 | 27.1 | 34.5 | 42.3 |
416 | 33.8 | 42.6 | 26.2 | 35.5 | 42.7 |
417 | 34.3 | 41.1 | 27.4 | 37.1 | 44.8 |
418 | 36.1 | 47.9 | 29.3 | 34.6 | 45.9 |
419 | 35.2 | 45.7 | 28.0 | 35.3 | 45.6 |
420 rows × 5 columns
from sklearn.linear_model import LogisticRegression
clf_2 = LogisticRegression()
clf_2.fit(df1_1, df["Playoffs"])  # note: fit on the full 420-row dataset, with no train/test split
LogisticRegression()
df2["Pred"] = clf_2.predict(df1_1)
df2
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | Pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 | 0 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 32.0 | 38.9 | 24.7 | 1 | 4.0 | 5.0 | 162 | 30.6 | 37.8 | 0 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 31.1 | 41.7 | 24.7 | 1 | 5.0 | 4.0 | 162 | 31.5 | 40.3 | 0 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 | 0 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
415 | SFG | NL | 1999 | 872 | 831 | 86 | 35.6 | 43.4 | 27.1 | 0 | NaN | NaN | 162 | 34.5 | 42.3 | 0 |
416 | STL | NL | 1999 | 809 | 838 | 75 | 33.8 | 42.6 | 26.2 | 0 | NaN | NaN | 161 | 35.5 | 42.7 | 0 |
417 | TBD | AL | 1999 | 772 | 913 | 69 | 34.3 | 41.1 | 27.4 | 0 | NaN | NaN | 162 | 37.1 | 44.8 | 0 |
418 | TEX | AL | 1999 | 945 | 859 | 95 | 36.1 | 47.9 | 29.3 | 1 | 5.0 | 4.0 | 162 | 34.6 | 45.9 | 0 |
419 | TOR | AL | 1999 | 883 | 862 | 84 | 35.2 | 45.7 | 28.0 | 0 | NaN | NaN | 162 | 35.3 | 45.6 | 0 |
420 rows × 16 columns
df2[df2["Playoffs"] == df2["Pred"]]
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | Pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 | 0 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 | 0 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 | 0 |
5 | CHW | AL | 2012 | 748 | 676 | 85 | 31.8 | 42.2 | 25.5 | 0 | NaN | NaN | 162 | 31.9 | 40.5 | 0 |
7 | CLE | AL | 2012 | 667 | 845 | 68 | 32.4 | 38.1 | 25.1 | 0 | NaN | NaN | 162 | 33.6 | 43.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
414 | SEA | AL | 1999 | 859 | 905 | 79 | 34.3 | 45.5 | 26.9 | 0 | NaN | NaN | 162 | 36.8 | 45.4 | 0 |
415 | SFG | NL | 1999 | 872 | 831 | 86 | 35.6 | 43.4 | 27.1 | 0 | NaN | NaN | 162 | 34.5 | 42.3 | 0 |
416 | STL | NL | 1999 | 809 | 838 | 75 | 33.8 | 42.6 | 26.2 | 0 | NaN | NaN | 161 | 35.5 | 42.7 | 0 |
417 | TBD | AL | 1999 | 772 | 913 | 69 | 34.3 | 41.1 | 27.4 | 0 | NaN | NaN | 162 | 37.1 | 44.8 | 0 |
419 | TOR | AL | 1999 | 883 | 862 | 84 | 35.2 | 45.7 | 28.0 | 0 | NaN | NaN | 162 | 35.3 | 45.6 | 0 |
358 rows × 16 columns
358/420
0.8523809523809524
Out of 420 rows, the prediction matches the actual playoff outcome in 358, i.e. about 85% accuracy. Note, however, that this accuracy is computed on the same data the model was fit on, so it is an in-sample estimate and not directly comparable to the decision tree's test accuracy above.
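For a fairer comparison with the decision tree, one could refit the logistic regression on the training split only and score it on the held-out test set. A minimal sketch (my own addition; clf_2b is a new name, and max_iter is raised to help convergence):
clf_2b = LogisticRegression(max_iter=1000)
clf_2b.fit(X_train, y_train)
# Training accuracy followed by test accuracy, both on the earlier split of df1_1.
print(clf_2b.score(X_train, y_train), clf_2b.score(X_test, y_test))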
from sklearn.metrics import mean_absolute_error
mean_absolute_error(clf_2.predict(X_test), y_test)
0.13095238095238096
mean_absolute_error(clf_2.predict(X_train), y_train)
0.15178571428571427
The mean absolute error on the training split is slightly larger than on the test split, which is the opposite of the usual pattern. With only 420 rows and an 80/20 split, such fluctuations are expected, so these error estimates should be read with caution.
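With a dataset this small, a single train/test split gives noisy estimates; cross-validation averages over several splits. A sketch (my own addition, assuming df1_1 and df1 as defined above):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated accuracy of the logistic regression.
scores = cross_val_score(LogisticRegression(max_iter=1000), df1_1, df1["Playoffs"], cv=5)
print(scores.mean(), scores.std())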
alt.data_transformers.enable('default', max_rows=500)
# Confusion matrix: actual Playoffs value on the x-axis, prediction on the y-axis.
c = alt.Chart(df2).mark_rect().encode(
    x="Playoffs:N",
    y="Pred:N",
    color = alt.Color('count()', scale = alt.Scale(scheme = "redpurple", reverse = True)),
)
# Overlay the count of team-seasons in each cell as text.
c_text = alt.Chart(df2).mark_text(color="black").encode(
    x="Playoffs:N",
    y="Pred:N",
    text="count()"
)
(c+c_text).properties(
height=200,
width=200
)
The confusion matrix above is consistent with the roughly 85% accuracy computed before: the two squares on the antidiagonal (wrong predictions) have the darkest color, meaning each contains fewer than 50 team-seasons.
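To double-check the chart numerically, scikit-learn can compute the same confusion matrix directly (a sketch, my own addition):
from sklearn.metrics import confusion_matrix
# Rows are the actual Playoffs values, columns are the predictions.
print(confusion_matrix(df2["Playoffs"], df2["Pred"]))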
Figure Out Which Factor Is the Most Important#
As a baseball fan, I understand the meaning and importance of all of these statistics. However, I am not sure which one(s) matter most for making the playoffs, so I would like to find out which factor contributes the most to team performance.
Since the logistic regression accuracy above is an in-sample estimate, I use the feature importances from the decision tree classifier (clf_1) instead.
clf_1.feature_importances_
array([0.14013201, 0.2535732 , 0.0887258 , 0.36051521, 0.15705379])
pd.Series(clf_1.feature_importances_, index=df1_1.columns)
On-Base Percentage 0.140132
Slugging Percentage 0.253573
Batting Average 0.088726
Opponents On-Base Percentage 0.360515
Opponents Slugging Percentage 0.157054
dtype: float64
df_fi = pd.DataFrame({"importance": clf_1.feature_importances_, "factors": clf_1.feature_names_in_})
df_fi
importance | factors | |
---|---|---|
0 | 0.140132 | On-Base Percentage |
1 | 0.253573 | Slugging Percentage |
2 | 0.088726 | Batting Average |
3 | 0.360515 | Opponents On-Base Percentage |
4 | 0.157054 | Opponents Slugging Percentage |
alt.Chart(df_fi).mark_bar().encode(
x="factors",
y="importance",
tooltip=["importance"]
).properties(
title = 'Importance of Factors'
)
Thus, the most important factor here is the opponents' on-base percentage, which suggests that good pitching and a defense that keeps opposing hitters off base play the biggest roles in helping a team reach the playoffs.
c1=alt.Chart(df).mark_circle().encode(
x=alt.X("Opponents On-Base Percentage", scale=alt.Scale(zero=False)),
y=alt.Y("Slugging Percentage", scale=alt.Scale(zero=False)),
color="Playoffs:Q",
tooltip=["Team", "Year", "Opponents On-Base Percentage","Slugging Percentage"],
).interactive()
c1
This graph shows that the data points in the upper-left region, which combine a low opponents' on-base percentage with a high slugging percentage, are more likely to be playoff teams.
import seaborn as sns
# Logistic-regression fit of playoff appearances against the most important factor.
sns.regplot(x=df1_1["Opponents On-Base Percentage"], y=df["Playoffs"], logistic=True, ci=None, scatter_kws={'color': 'blue'}, line_kws={'color': 'red'})
<AxesSubplot:xlabel='Opponents On-Base Percentage', ylabel='Playoffs'>
I also use seaborn to draw a logistic regression curve for the most important factor (opponents' on-base percentage) against playoff appearances. It shows a trend consistent with the analysis above: the lower the opponents' on-base percentage, the higher the predicted chance of making the playoffs.
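As a side note, the fitted model clf_2 can report predicted probabilities rather than hard 0/1 labels. A sketch (my own addition; the "Playoff Probability" column name is hypothetical):
# Probability of the positive class (making the playoffs) for each team-season.
df2["Playoff Probability"] = clf_2.predict_proba(df1_1)[:, 1].round(3)
df2[["Team", "Year", "Playoffs", "Playoff Probability"]].head()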
Analysis of Offense and Defense#
Next, we break the data into an offense part and a defense part. For offense, we look at the relationship between Runs Scored and the batting statistics (Batting Average, On-Base Percentage, Slugging Percentage). For defense, we look at the relationship between Runs Allowed and the opponents' statistics (Opponents On-Base Percentage, Opponents Slugging Percentage).
df3
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 32.0 | 38.9 | 24.7 | 1 | 4.0 | 5.0 | 162 | 30.6 | 37.8 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 31.1 | 41.7 | 24.7 | 1 | 5.0 | 4.0 | 162 | 31.5 | 40.3 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
415 | SFG | NL | 1999 | 872 | 831 | 86 | 35.6 | 43.4 | 27.1 | 0 | NaN | NaN | 162 | 34.5 | 42.3 |
416 | STL | NL | 1999 | 809 | 838 | 75 | 33.8 | 42.6 | 26.2 | 0 | NaN | NaN | 161 | 35.5 | 42.7 |
417 | TBD | AL | 1999 | 772 | 913 | 69 | 34.3 | 41.1 | 27.4 | 0 | NaN | NaN | 162 | 37.1 | 44.8 |
418 | TEX | AL | 1999 | 945 | 859 | 95 | 36.1 | 47.9 | 29.3 | 1 | 5.0 | 4.0 | 162 | 34.6 | 45.9 |
419 | TOR | AL | 1999 | 883 | 862 | 84 | 35.2 | 45.7 | 28.0 | 0 | NaN | NaN | 162 | 35.3 | 45.6 |
420 rows × 15 columns
Offense#
df3_of = df3[["On-Base Percentage", "Slugging Percentage", "Batting Average"]]
df3_of
On-Base Percentage | Slugging Percentage | Batting Average | |
---|---|---|---|
0 | 32.8 | 41.8 | 25.9 |
1 | 32.0 | 38.9 | 24.7 |
2 | 31.1 | 41.7 | 24.7 |
3 | 31.5 | 41.5 | 26.0 |
4 | 30.2 | 37.8 | 24.0 |
... | ... | ... | ... |
415 | 35.6 | 43.4 | 27.1 |
416 | 33.8 | 42.6 | 26.2 |
417 | 34.3 | 41.1 | 27.4 |
418 | 36.1 | 47.9 | 29.3 |
419 | 35.2 | 45.7 | 28.0 |
420 rows × 3 columns
X_train, X_test, y_train, y_test = train_test_split(df3_of, df3["Runs Scored"], test_size=0.2, random_state=0)
# Same search as before, now with Runs Scored as the target.
# (The original cell fitted the old clf object instead of clf_3, so the nodes
# value was never actually applied; fixed here.)
for nodes in range(10,400,5):
    clf_3 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=nodes)
    clf_3.fit(X_train, y_train)
    a = clf_3.score(X_train, y_train)
    b = clf_3.score(X_test, y_test)
    print(a,b)
0.5744047619047619 0.0
0.5952380952380952 0.0
0.5892857142857143 0.0
0.5892857142857143 0.0
0.5922619047619048 0.011904761904761904
0.5803571428571429 0.0
0.5892857142857143 0.011904761904761904
0.5863095238095238 0.0
0.5863095238095238 0.0
0.5922619047619048 0.0
0.5833333333333334 0.0
0.5803571428571429 0.0
0.5803571428571429 0.0
0.5803571428571429 0.0
0.5803571428571429 0.0
0.5803571428571429 0.0
0.5744047619047619 0.0
0.5892857142857143 0.0
0.5892857142857143 0.0
0.5863095238095238 0.0
0.5833333333333334 0.0
0.5922619047619048 0.0
0.5892857142857143 0.0
0.5773809523809523 0.0
0.5773809523809523 0.0
0.5922619047619048 0.0
0.5803571428571429 0.0
0.5773809523809523 0.0
0.5863095238095238 0.0
0.5892857142857143 0.0
0.5863095238095238 0.0
0.5773809523809523 0.0
0.5744047619047619 0.0
0.5833333333333334 0.0
0.5744047619047619 0.0
0.5684523809523809 0.0
0.5803571428571429 0.0
0.5833333333333334 0.0
0.5863095238095238 0.0
0.5892857142857143 0.0
0.5863095238095238 0.0
0.5803571428571429 0.0
0.5714285714285714 0.0
0.5863095238095238 0.0
0.5803571428571429 0.011904761904761904
0.5803571428571429 0.011904761904761904
0.5863095238095238 0.0
0.5833333333333334 0.0
0.5833333333333334 0.011904761904761904
0.5773809523809523 0.011904761904761904
0.5803571428571429 0.011904761904761904
0.5863095238095238 0.0
0.5714285714285714 0.0
0.5982142857142857 0.0
0.5922619047619048 0.0
0.5892857142857143 0.0
0.5803571428571429 0.0
0.5744047619047619 0.0
0.5863095238095238 0.0
0.5863095238095238 0.0
0.5863095238095238 0.011904761904761904
0.5863095238095238 0.0
0.5863095238095238 0.0
0.5863095238095238 0.0
0.5892857142857143 0.0
0.5803571428571429 0.0
0.5833333333333334 0.0
0.5773809523809523 0.0
0.5803571428571429 0.011904761904761904
0.5803571428571429 0.0
0.5833333333333334 0.0
0.5773809523809523 0.0
0.5744047619047619 0.0
0.5744047619047619 0.0
0.5863095238095238 0.011904761904761904
0.5863095238095238 0.0
0.5892857142857143 0.0
0.5803571428571429 0.0
clf_4 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
clf_4.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
pd.Series(clf_4.feature_importances_, index=df3_of.columns)
On-Base Percentage 0.310467
Slugging Percentage 0.321475
Batting Average 0.368057
dtype: float64
The importances show that the three offensive factors carry comparable weight, with batting average slightly ahead of slugging percentage and on-base percentage. Note, however, that the test accuracies above are essentially zero: Runs Scored takes many distinct integer values, so an exact-match classifier is almost never exactly right, and these importances should be read only as a rough guide.
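Since Runs Scored is effectively a continuous target, a regression tree scored by R² is arguably a better tool than a classifier scored by exact matches. A sketch of what that might look like (my own addition; max_depth=5 is an arbitrary choice):
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor(max_depth=5, random_state=0)
reg.fit(X_train, y_train)
# R^2 on the training and test splits, then the regression tree's importances.
print(reg.score(X_train, y_train), reg.score(X_test, y_test))
print(pd.Series(reg.feature_importances_, index=df3_of.columns))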
Defense#
df3_de = df3[["Opponents On-Base Percentage", "Opponents Slugging Percentage"]]
X_train, X_test, y_train, y_test = train_test_split(df3_de, df3["Runs Allowed"], test_size=0.2, random_state=0)
# Same search again, now with Runs Allowed as the target (fitting clf_5, not
# the old clf object).
for nodes in range(10,400,5):
    clf_5 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=nodes)
    clf_5.fit(X_train, y_train)
    a = clf_5.score(X_train, y_train)
    b = clf_5.score(X_test, y_test)
    print(a,b)
0.6607142857142857 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6636904761904762 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6547619047619048 0.023809523809523808
0.6547619047619048 0.023809523809523808
0.6577380952380952 0.023809523809523808
0.6577380952380952 0.023809523809523808
0.6547619047619048 0.011904761904761904
0.6577380952380952 0.023809523809523808
0.6636904761904762 0.011904761904761904
0.6607142857142857 0.023809523809523808
0.6577380952380952 0.023809523809523808
0.6577380952380952 0.011904761904761904
0.6636904761904762 0.023809523809523808
0.6547619047619048 0.023809523809523808
0.6547619047619048 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6666666666666666 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6636904761904762 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6458333333333334 0.023809523809523808
0.6547619047619048 0.011904761904761904
0.6547619047619048 0.023809523809523808
0.6607142857142857 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6577380952380952 0.023809523809523808
0.6607142857142857 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6636904761904762 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6607142857142857 0.023809523809523808
0.6577380952380952 0.011904761904761904
0.6636904761904762 0.011904761904761904
0.6607142857142857 0.023809523809523808
0.6577380952380952 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6607142857142857 0.011904761904761904
0.6577380952380952 0.023809523809523808
0.6547619047619048 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6696428571428571 0.011904761904761904
0.6577380952380952 0.023809523809523808
0.6577380952380952 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6488095238095238 0.011904761904761904
0.6547619047619048 0.011904761904761904
0.6577380952380952 0.023809523809523808
0.6636904761904762 0.011904761904761904
0.6666666666666666 0.011904761904761904
0.6517857142857143 0.023809523809523808
0.6607142857142857 0.023809523809523808
0.6517857142857143 0.023809523809523808
0.6607142857142857 0.023809523809523808
0.6607142857142857 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6577380952380952 0.011904761904761904
0.6547619047619048 0.023809523809523808
0.6607142857142857 0.011904761904761904
0.6636904761904762 0.023809523809523808
0.6666666666666666 0.023809523809523808
0.6517857142857143 0.023809523809523808
0.6666666666666666 0.011904761904761904
0.6547619047619048 0.023809523809523808
0.6577380952380952 0.011904761904761904
clf_6 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
clf_6.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=225)
pd.Series(clf_6.feature_importances_, index=df3_de.columns)
Opponents On-Base Percentage 0.504177
Opponents Slugging Percentage 0.495823
dtype: float64
We see that the two defensive factors are almost equally important. However, this should be taken only as a reference: the dataset does not include the opponents' batting average, and the same exact-match caveat as in the offense section applies here.
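As a quick sanity check on these importances (my own addition), we can look at the plain correlations between Runs Allowed and the two opponents' statistics:
# Pearson correlations of each defensive statistic with Runs Allowed.
df3[["Runs Allowed", "Opponents On-Base Percentage", "Opponents Slugging Percentage"]].corr()["Runs Allowed"]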
Los Angeles Dodgers Performance Analysis#
As a Dodgers fan, I would like to end by analyzing this team's performance. I will also try some new methods here, such as StandardScaler and the KNeighborsClassifier.
df4
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARI | NL | 2012 | 734 | 688 | 81 | 32.8 | 41.8 | 25.9 | 0 | NaN | NaN | 162 | 31.7 | 41.5 |
1 | ATL | NL | 2012 | 700 | 600 | 94 | 32.0 | 38.9 | 24.7 | 1 | 4.0 | 5.0 | 162 | 30.6 | 37.8 |
2 | BAL | AL | 2012 | 712 | 705 | 93 | 31.1 | 41.7 | 24.7 | 1 | 5.0 | 4.0 | 162 | 31.5 | 40.3 |
3 | BOS | AL | 2012 | 734 | 806 | 69 | 31.5 | 41.5 | 26.0 | 0 | NaN | NaN | 162 | 33.1 | 42.8 |
4 | CHC | NL | 2012 | 613 | 759 | 61 | 30.2 | 37.8 | 24.0 | 0 | NaN | NaN | 162 | 33.5 | 42.4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
415 | SFG | NL | 1999 | 872 | 831 | 86 | 35.6 | 43.4 | 27.1 | 0 | NaN | NaN | 162 | 34.5 | 42.3 |
416 | STL | NL | 1999 | 809 | 838 | 75 | 33.8 | 42.6 | 26.2 | 0 | NaN | NaN | 161 | 35.5 | 42.7 |
417 | TBD | AL | 1999 | 772 | 913 | 69 | 34.3 | 41.1 | 27.4 | 0 | NaN | NaN | 162 | 37.1 | 44.8 |
418 | TEX | AL | 1999 | 945 | 859 | 95 | 36.1 | 47.9 | 29.3 | 1 | 5.0 | 4.0 | 162 | 34.6 | 45.9 |
419 | TOR | AL | 1999 | 883 | 862 | 84 | 35.2 | 45.7 | 28.0 | 0 | NaN | NaN | 162 | 35.3 | 45.6 |
420 rows × 15 columns
df_4 = df4.loc[df4['Team'] == 'LAD'].copy()  # .copy() so adding columns later does not raise SettingWithCopyWarning
df_4
Team | League | Year | Runs Scored | Runs Allowed | Wins | On-Base Percentage | Slugging Percentage | Batting Average | Playoffs | RankSeason | RankPlayoffs | Games Played | Opponents On-Base Percentage | Opponents Slugging Percentage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13 | LAD | NL | 2012 | 637 | 597 | 86 | 31.7 | 37.4 | 25.2 | 0 | NaN | NaN | 162 | 31.0 | 36.4 |
44 | LAD | NL | 2011 | 644 | 612 | 82 | 32.2 | 37.5 | 25.7 | 0 | NaN | NaN | 161 | 31.0 | 37.1 |
74 | LAD | NL | 2010 | 667 | 692 | 80 | 32.2 | 37.9 | 25.2 | 0 | NaN | NaN | 162 | 31.6 | 37.8 |
104 | LAD | NL | 2009 | 780 | 611 | 95 | 34.6 | 41.2 | 27.0 | 1 | 3.0 | 3.0 | 162 | 31.2 | 36.1 |
134 | LAD | NL | 2008 | 700 | 648 | 84 | 33.3 | 39.9 | 26.4 | 1 | 7.0 | 3.0 | 162 | 31.5 | 37.6 |
164 | LAD | NL | 2007 | 735 | 727 | 82 | 33.7 | 40.6 | 27.5 | 0 | NaN | NaN | 162 | 32.6 | 39.9 |
194 | LAD | NL | 2006 | 820 | 751 | 88 | 34.8 | 43.2 | 27.6 | 1 | 5.0 | 4.0 | 162 | 33.0 | 41.8 |
224 | LAD | NL | 2005 | 685 | 755 | 71 | 32.6 | 39.5 | 25.3 | 0 | NaN | NaN | 162 | 32.7 | 42.9 |
254 | LAD | NL | 2004 | 761 | 684 | 93 | 33.2 | 42.3 | 26.2 | 1 | 5.0 | 4.0 | 162 | 32.3 | 40.8 |
284 | LAD | NL | 2003 | 574 | 556 | 85 | 30.3 | 36.8 | 24.3 | 0 | NaN | NaN | 162 | 30.6 | 35.4 |
314 | LAD | NL | 2002 | 713 | 643 | 92 | 32.0 | 40.9 | 26.4 | 0 | NaN | NaN | 162 | 31.5 | 38.7 |
344 | LAD | NL | 2001 | 758 | 744 | 86 | 32.3 | 42.5 | 25.5 | 0 | NaN | NaN | 162 | 32.3 | 41.1 |
374 | LAD | NL | 2000 | 798 | 729 | 86 | 34.1 | 43.1 | 25.7 | 0 | NaN | NaN | 162 | 33.2 | 40.7 |
404 | LAD | NL | 1999 | 793 | 787 | 77 | 33.9 | 42.0 | 26.6 | 0 | NaN | NaN | 162 | 33.6 | 42.2 |
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
scaler = StandardScaler()
# Note: the scaler is fit on all 14 seasons before splitting; with a dataset
# this small, this mild leakage is hard to avoid.
df_4[["Slugging Percentage","Opponents On-Base Percentage"]] = scaler.fit_transform(df_4[["Slugging Percentage","Opponents On-Base Percentage"]])
X_train, X_test, y_train, y_test = train_test_split(df_4[["Slugging Percentage","Opponents On-Base Percentage"]], df_4["Wins"], test_size=0.2)
X = df_4[["Slugging Percentage", "Opponents On-Base Percentage"]]
X
Slugging Percentage | Opponents On-Base Percentage | |
---|---|---|
13 | -1.373455 | -1.121558 |
44 | -1.326785 | -1.121558 |
74 | -1.140101 | -0.453396 |
104 | 0.400036 | -0.898838 |
134 | -0.206685 | -0.564756 |
164 | 0.120011 | 0.660208 |
194 | 1.333452 | 1.105650 |
224 | -0.393368 | 0.771568 |
254 | 0.913415 | 0.326127 |
284 | -1.653480 | -1.567000 |
314 | 0.260023 | -0.564756 |
344 | 1.006756 | 0.326127 |
374 | 1.286781 | 1.328370 |
404 | 0.773402 | 1.773812 |
y = df_4["Wins"]
y
13 86
44 82
74 80
104 95
134 84
164 82
194 88
224 71
254 93
284 85
314 92
344 86
374 86
404 77
Name: Wins, dtype: int64
We pick n_neighbors=3 here; the choice is arbitrary.
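Rather than fixing n_neighbors=3, we could compare a few values of k; with only 14 Dodgers seasons the scores are noisy, so this sketch (my own addition) is illustrative at best. Below we keep k=3 for simplicity.
# Training and test accuracy for several values of k.
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_train, y_train), knn.score(X_test, y_test))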
clf_7 = KNeighborsClassifier(n_neighbors=3)
clf_7.fit(X_train, y_train)
df_4["pred"] = clf_7.predict(df_4[["Slugging Percentage","Opponents On-Base Percentage"]])
c2 = alt.Chart(df_4).mark_circle().encode(
x="Opponents On-Base Percentage",
y="Slugging Percentage",
color=alt.Color("pred", title="Wins"),
tooltip = ("Year", "Playoffs", "Slugging Percentage", "Opponents On-Base Percentage")
).properties(
width=500,
height=200,
)
c2
Here we can see that the trend is similar to chart c1 above: the data points in the upper left usually correspond to playoff seasons. The relationship with wins is less clear, however, since making the playoffs also depends on how other teams perform. Therefore, analyzing a single team does not yield a firm conclusion.
Summary#
In this project, we compared two classifiers for predicting playoff appearances. The decision tree classifier reached roughly 75% accuracy on held-out data, while the logistic regression reached about 85% on the data it was trained on; since the latter is an in-sample estimate, the two figures are not directly comparable.
We also find that slugging percentage and opponents' on-base percentage are two of the most important factors affecting team performance, and that within offense the three hitting statistics carry similar weight, with batting average slightly ahead. Thus, we suggest that teams look for players who can hit well as well as players who are strong defensively.
However, we could not reach a clear conclusion about team defense. Also, we cannot accurately predict whether a team makes the playoffs by analyzing that team's statistics alone.
References#
Dataset: Moneyball MLB statistics, 1962–2012, Kaggle: https://www.kaggle.com/datasets/wduckett/moneyball-mlb-stats-19622012
K-neighbors classification basics: https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
Logistic regression: https://christopherdavisuci.github.io/UCI-Math-10-S22/Week7/Week7-Friday.html and https://www.statology.org/plot-logistic-regression-in-python/