Predicting Forza Horizon Car Classes and Ratings#
Author: Felipe I Rios
Course Project, UC Irvine, Math 10, S23
Introduction#
Since I was younger, my favorite genre of video games has always been racing and driving simulator games, and I remember how cars in these games have stats that determine how good they are. One game in particular, Forza Horizon, not only gives each car stats, but also has a classification system that corresponds to how good the car is. This will serve as the basis for this project, where I will see whether machine learning models can accurately predict a car's class based on its various stats.
Cleaning up the Data & Determining Inputs#
import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("Forza_Horizon_Cars.csv")
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | ... | braking | Offroad | Top_Speed | 0-60_Mph | 0-100_Mph | g-force | car_source_1 | car_source_2 | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | ... | 3.4 | 5.1 | 155.5 Mph | 6.2s | 14.7s | 0.90 g | info_not_found | info_not_found | 195 | 2,639 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | ... | 3.5 | 5.3 | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | 200 | 2,820 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | ... | 7.1 | 4.7 | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | 573 | 3,803 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | ... | 3.0 | 5.3 | 138.0 Mph | 7.0s | 20.6s | 0.84 g | info_not_found | info_not_found | 123 | 1,576 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | ... | 4.2 | 4.2 | info_not_found | info_not_found | info_not_found | info_not_found | Season Event | Festival Playlist | 248 | 2,432 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | ... | 2.5 | 5.0 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 240 | 3,230 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | ... | 3.9 | 5.1 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | Accolade | 346 | 3,985 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | ... | 7.8 | 3.7 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 385 | 1,598 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | ... | 2.1 | 7.1 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 60 | 2,137 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | ... | 9.6 | 4.0 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 1,177 | 3,410 |
539 rows Ă— 22 columns
df = df.applymap(lambda x: np.nan if x == "info_not_found" else x)
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | ... | braking | Offroad | Top_Speed | 0-60_Mph | 0-100_Mph | g-force | car_source_1 | car_source_2 | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | ... | 3.4 | 5.1 | 155.5 Mph | 6.2s | 14.7s | 0.90 g | NaN | NaN | 195 | 2,639 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | ... | 3.5 | 5.3 | NaN | NaN | NaN | NaN | NaN | NaN | 200 | 2,820 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | ... | 7.1 | 4.7 | NaN | NaN | NaN | NaN | NaN | NaN | 573 | 3,803 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | ... | 3.0 | 5.3 | 138.0 Mph | 7.0s | 20.6s | 0.84 g | NaN | NaN | 123 | 1,576 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | ... | 4.2 | 4.2 | NaN | NaN | NaN | NaN | Season Event | Festival Playlist | 248 | 2,432 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | ... | 2.5 | 5.0 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 240 | 3,230 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | ... | 3.9 | 5.1 | NaN | NaN | NaN | NaN | Wheelspin | Accolade | 346 | 3,985 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | ... | 7.8 | 3.7 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 385 | 1,598 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | ... | 2.1 | 7.1 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 60 | 2,137 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | ... | 9.6 | 4.0 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 1,177 | 3,410 |
539 rows Ă— 22 columns
Here I replaced every value that had been entered as “info_not_found” with a missing value (NaN).
df.dtypes
Car_Image object
Name_and_model object
Model_type object
In_Game_Price object
car_source object
stock_specs object
Stock_Rating object
Drive_Type object
speed object
handling object
acceleration object
launch object
braking object
Offroad object
Top_Speed object
0-60_Mph object
0-100_Mph object
g-force object
car_source_1 object
car_source_2 object
Horse_Power object
Weight_lbs object
dtype: object
cols = ["Stock_Rating", "speed", "handling", "acceleration", "launch", "braking", "Offroad", "Horse_Power", "Weight_lbs"]
Notice that all of the columns in the DataFrame have dtype object and thus need to be converted to numeric dtypes. In the code block above, I chose the columns that will serve as input features for the analysis later in the project. Since there is no pattern in how the column names are written, there was no concise way of building this list (for example with a list comprehension), so typing them out manually was the only option.
for c in cols:
    # Strip thousands separators before conversion; strings like "1,042"
    # cannot be parsed as numbers while the commas are present.
    if df[c].str.contains(",").any():
        df[c] = df[c].str.replace(",", "").astype(float)
    else:
        df[c] = pd.to_numeric(df[c])
The comma-handling branch in the code block directly above was partially adapted from ChatGPT because of how the initial DataFrame is formatted. Strings representing numbers cannot be converted to numeric dtypes if they contain commas, as in “1,042”, so the commas have to be removed first.
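To illustrate the issue, here is a minimal, self-contained sketch (using made-up values, not rows from the dataset) showing that pd.to_numeric rejects comma-formatted strings until the commas are stripped:

s = pd.Series(["1,042", "560"])
# pd.to_numeric raises a ValueError on "1,042" because of the comma...
try:
    pd.to_numeric(s)
except ValueError as e:
    print("Conversion failed:", e)
# ...but stripping the commas first works as expected.
print(pd.to_numeric(s.str.replace(",", "")))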
df.dtypes
Car_Image object
Name_and_model object
Model_type object
In_Game_Price object
car_source object
stock_specs object
Stock_Rating float64
Drive_Type object
speed float64
handling float64
acceleration float64
launch float64
braking float64
Offroad float64
Top_Speed object
0-60_Mph object
0-100_Mph object
g-force object
car_source_1 object
car_source_2 object
Horse_Power float64
Weight_lbs float64
dtype: object
threshold = (len(df)*0.1)
missing_values = df.isna().sum()
columns_to_drop = missing_values[missing_values > threshold].index
columns_to_drop
Index(['Top_Speed', '0-60_Mph', '0-100_Mph', 'g-force', 'car_source_1',
'car_source_2'],
dtype='object')
df = df.drop(columns_to_drop, axis = 1)
The four code blocks directly above were adapted from ChatGPT to help remove columns in which more than 10% of the values are missing, since such columns add little to the machine learning part of the project. The original DataFrame has 22 columns, and I removed these six to keep things concise, considering that some of them are over 70% missing values.
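For reference, pandas can express roughly the same column filter directly through dropna's thresh argument. This is a hedged, equivalent sketch rather than the approach used above; it assumes it is run on the version of df that still has all 22 columns, before the drop above.

# Keep only columns with at least ~90% non-missing values; this should drop
# approximately the same six columns identified above.
min_non_missing = int(len(df) * 0.9)
df_alt = df.dropna(axis=1, thresh=min_non_missing)
df_alt.shape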
df = df.dropna(axis = 0)
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | acceleration | launch | braking | Offroad | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | 3.9 | 3.1 | 3.4 | 5.1 | 195.0 | 2639.0 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | 3.9 | 3.0 | 3.5 | 5.3 | 200.0 | 2820.0 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | 9.2 | 10.0 | 7.1 | 4.7 | 573.0 | 3803.0 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | 4.1 | 3.1 | 3.0 | 5.3 | 123.0 | 1576.0 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | 5.4 | 5.7 | 4.2 | 4.2 | 248.0 | 2432.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | 3.3 | 2.6 | 2.5 | 5.0 | 240.0 | 3230.0 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | 5.5 | 3.6 | 3.9 | 5.1 | 346.0 | 3985.0 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | 8.5 | 9.2 | 7.8 | 3.7 | 385.0 | 1598.0 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | 2.0 | 3.3 | 2.1 | 7.1 | 60.0 | 2137.0 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | 6.7 | 7.2 | 9.6 | 4.0 | 1177.0 | 3410.0 |
537 rows Ă— 16 columns
df.shape
(537, 16)
This finishes the cleaning portion of the data and allows the data analysis to begin.
Predicting the Car’s Rating Using Various Regressions#
A comparison between Linear Regression, a polynomial-features Pipeline (at its best degree), a Decision Tree Regressor, and a Random Forest Regressor, to see which one is more accurate in its predictions.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
strings_to_drop = ["Stock_Rating", "Horse_Power", "Weight_lbs"]
cols1 = [x for x in cols if x not in strings_to_drop]
cols1
['speed', 'handling', 'acceleration', 'launch', 'braking', 'Offroad']
A new list is created from the original for ease of use when applying machine learning, which is why some features are removed. We don't want “Stock_Rating” as part of our features since it is the target column for the regressions, and “Horse_Power” and “Weight_lbs” are not what I'd consider traditional car stats, hence why I removed them as well.
def make_chart(feat):
    c = alt.Chart(df).mark_circle().encode(
        x = feat,
        y = "Stock_Rating",
        color = alt.Color(feat, legend = alt.Legend(title = "Feature"))
    )
    return c
import random
charts = [make_chart(x) for x in cols1]
# Create an empty list to store the layered charts
layered_charts = []

def generate_random_color():
    # Generate random values for red, green, and blue color components
    red = random.randint(0, 255)
    green = random.randint(0, 255)
    blue = random.randint(0, 255)
    # Construct the RGB color string
    color = f'rgb({red}, {green}, {blue})'
    return color

# Generate a random color
random_color = generate_random_color()
# Iterate over the charts and apply different visual properties to each layer
for i, (chart, x_label) in enumerate(zip(charts, cols1)):
    # Apply a different color to each layer
    color = random_color if i == 0 else generate_random_color()
    # Apply the visual properties to the current layer
    layered_chart = chart.mark_circle().encode(
        x = x_label,
        y = "Stock_Rating",
        color = alt.value(color),   # constant color for this layer
        tooltip = [x_label],
    ).interactive()
    layered_charts.append(layered_chart)
# Combine the layered charts using 'alt.layer'
combined_chart = alt.layer(*layered_charts)
combined_chart
This chart was made possible because of ChatGPT. Given that I have multiple input features, it is not easy to visualize all of the information at once, since that would technically require a multi-dimensional chart. As such, this was the best alternative I could come up with. While it is hard to read, it does help visualize how the stats correspond to each level of “Stock_Rating.”
legend_chart = alt.Chart(pd.DataFrame({'column': cols1})).mark_point().encode(
    color=alt.Color("column:N", legend=alt.Legend(title='Features'))
)
legend_chart
alt.layer(*(combined_chart, legend_chart))
This is how I would have wanted my chart to look, but despite searching the internet, I could not figure out how to get the legend to reflect the actual colors that are in the chart, given that I am using a random color generator to distinguish the different features.
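One possible fix, sketched below (not the approach used above), is to reshape the data into long format with pandas' melt and let Altair assign the colors itself, so the legend and the point colors automatically agree. This assumes df and cols1 as defined earlier.

# Reshape so each (feature, value) pair becomes its own row; Altair can then
# encode the feature name on the color channel and build a matching legend.
df_long = df.melt(
    id_vars=["Stock_Rating"],
    value_vars=cols1,
    var_name="feature",
    value_name="stat_value",
)
alt.Chart(df_long).mark_circle().encode(
    x="stat_value",
    y="Stock_Rating",
    color=alt.Color("feature:N", legend=alt.Legend(title="Features")),
    tooltip=["feature", "stat_value", "Stock_Rating"],
).interactive()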
Linear Regression#
As a precursor: not all regression types have a .score() method for assessing the accuracy of the model, so it is better to compare a train set and a test set using mean_absolute_error and mean_squared_error. While the raw numbers do not mean much in context, what matters is how the test error compares to the training error: if the test error is close to (rather than dramatically larger than) the training error, the model generalizes well and is not badly overfitting. As such, we instantiate the train_test_split below, which helps us detect and reduce the influence of overfitting in our models.
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["Stock_Rating"], test_size=0.7)
reg = LinearRegression()
reg.fit(X_train,y_train)
LinearRegression()
mean_squared_error(y_train,reg.predict(X_train))
5279.724534199711
mean_squared_error(y_test,reg.predict(X_test))
4339.725113794179
mean_absolute_error(y_train, reg.predict(X_train))
53.49695209326706
mean_absolute_error(y_test, reg.predict(X_test))
48.324484758786745
Looking at the values produced, the test errors are comparable to (in this run, even slightly lower than) the training errors for both mean_squared_error and mean_absolute_error, indicating that Linear Regression generalizes well and is a good model for predicting the “Stock_Rating” of the cars.
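As an optional sanity check (not part of the original workflow), cross-validation can confirm that the fit is stable across different splits rather than an artifact of this particular train/test split. This sketch assumes df and cols1 from above.

from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R^2 for the same linear model; consistently
# high scores across folds suggest the fit is not split-dependent.
scores = cross_val_score(LinearRegression(), df[cols1], df["Stock_Rating"], cv=5)
scores.mean(), scores.std()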
Polynomial Features with Pipeline#
def poly_fit(d, s):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ('reg', LinearRegression())
    ])
    pipe.fit(X_train, y_train)
    if s == "train":
        sqr_err_train = mean_squared_error(y_train, pipe.predict(X_train))
        abs_err_train = mean_absolute_error(y_train, pipe.predict(X_train))
        arr = [s, d, abs_err_train, sqr_err_train]
    else:
        sqr_err_test = mean_squared_error(y_test, pipe.predict(X_test))
        abs_err_test = mean_absolute_error(y_test, pipe.predict(X_test))
        arr = [s, d, abs_err_test, sqr_err_test]
    return arr
cols3 = ["set", "degree", "mae", "mse"]
df_arr = pd.DataFrame(columns=cols3)
for d in range(2, 9):
    series = [poly_fit(d, s) for s in ["train", "test"]]
    df_arr.loc[len(df_arr)] = series[0]
    df_arr.loc[len(df_arr)] = series[1]
df_arr
set | degree | mae | mse | |
---|---|---|---|---|
0 | train | 2 | 2.509779e+01 | 1.101053e+03 |
1 | test | 2 | 3.190195e+01 | 1.964577e+03 |
2 | train | 3 | 1.266801e+01 | 3.025544e+02 |
3 | test | 3 | 5.312205e+01 | 1.391856e+04 |
4 | train | 4 | 1.195453e-09 | 2.743645e-18 |
5 | test | 4 | 9.918989e+02 | 1.139071e+07 |
6 | train | 5 | 1.572095e-10 | 5.547747e-20 |
7 | test | 5 | 9.684616e+02 | 1.775343e+07 |
8 | train | 6 | 2.174201e-10 | 1.022347e-19 |
9 | test | 6 | 1.372796e+03 | 3.054070e+07 |
10 | train | 7 | 1.094435e-10 | 4.879715e-20 |
11 | test | 7 | 2.205851e+03 | 6.273823e+07 |
12 | train | 8 | 1.352969e-10 | 6.088622e-20 |
13 | test | 8 | 3.773978e+03 | 1.645278e+08 |
The Polynomial Features Pipeline shows that beyond degree 3 the results become unreasonable: the training errors collapse toward zero while the test errors explode, a clear sign of overfitting. At degree 2 the test errors stay reasonably close to the training errors, so a degree 2 (and, to a lesser extent, degree 3) regression is the most useful for predicting the “Stock_Rating” of the cars.
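To make that reading concrete, a small follow-up query on the df_arr table built above can pick out the degree with the lowest test error (a hedged convenience, not something the original analysis required):

# Sort the test rows by mean squared error; the top row is the degree that
# generalizes best on this split (degree 2 in the run shown above).
df_arr[df_arr["set"] == "test"].sort_values("mse").head(3)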
Decision Tree Regressor#
For the Decision Tree Regressor, the initial instantiation with max_leaf_nodes = 4 is a guess for determining the best results.
reg2 = DecisionTreeRegressor(max_leaf_nodes=4)
reg2.fit(X_train,y_train)
DecisionTreeRegressor(max_leaf_nodes=4)
mean_squared_error(y_train, reg2.predict(X_train))
4795.489271173062
mean_squared_error(y_test, reg2.predict(X_test))
6615.546457009281
mean_absolute_error(y_train, reg2.predict(X_train))
55.78619206880076
mean_absolute_error(y_test, reg2.predict(X_test))
64.99354430548637
From the values produced we can see that the test errors are somewhat higher than the training errors for both mean_squared_error and mean_absolute_error, but they remain of the same order, so the Decision Tree Regressor is a reasonable model for predicting the “Stock_Rating” of the cars. Given that our initial guess of max_leaf_nodes = 4 already fits reasonably, it should be possible to improve the results, reducing both errors while keeping the test error close to the training error, by using a test error curve.
df_err1 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2, 40):
    dtr = DecisionTreeRegressor(max_leaf_nodes=i)
    dtr.fit(X_train, y_train)
    d = {"leaves": i, "error": (1 - dtr.score(X_train, y_train)), "set": "train"}
    d2 = {"leaves": i, "error": (1 - dtr.score(X_test, y_test)), "set": "test"}
    df_err1.loc[len(df_err1)] = d
    df_err1.loc[len(df_err1)] = d2
c = alt.Chart(df_err1).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c
As we can see from the test error curve, max_leaf_nodes = 10 would produce slightly better results, given that it is the first point where the test curve switches from a negative slope to a positive one, i.e. roughly the minimum test error before overfitting sets in.
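Rather than eyeballing the curve, the same df_err1 table can be queried for the leaf count with the smallest test error, and the regressor refit with that value (a hedged sketch; the exact minimum may shift slightly between runs because of the random split):

# Leaf count with the lowest test error on this particular split.
best_row = df_err1[df_err1["set"] == "test"].sort_values("error").iloc[0]
best_leaves = int(best_row["leaves"])

reg3 = DecisionTreeRegressor(max_leaf_nodes=best_leaves)
reg3.fit(X_train, y_train)
mean_absolute_error(y_test, reg3.predict(X_test))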
Random Forest Regressor#
rfg = RandomForestRegressor(n_estimators=200, max_leaf_nodes=13)
rfg.fit(X_train,y_train)
RandomForestRegressor(max_leaf_nodes=13, n_estimators=200)
mean_squared_error(y_train, rfg.predict(X_train))
906.1045239852818
mean_squared_error(y_test, rfg.predict(X_test))
1609.5503655502175
mean_absolute_error(y_train, rfg.predict(X_train))
23.802718069907115
mean_absolute_error(y_test, rfg.predict(X_test))
29.794632547022715
From the values produced we can see that the test errors are higher than the training errors but stay in the same range for both mean_squared_error and mean_absolute_error, and both are much smaller than for the single decision tree, indicating that the Random Forest Regressor is a good model for predicting the “Stock_Rating” of the cars.
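As a small extra (not required for the comparison above), the fitted random forest also exposes feature importances, which hint at which in-game stats drive the predicted rating the most. This assumes the rfg model fitted above.

# Relative contribution of each stat to the forest's predictions.
pd.Series(rfg.feature_importances_, index=cols1).sort_values(ascending=False)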
Predicting the Car’s Classes and Assessing their Accuracy#
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Here we instantiate a new train_test_split because our target column, “stock_specs”, takes six different class values (D, C, B, A, S1, S2), so a new split is needed for our classification models. We will check how accurate each classification model is using the .score() method. As with the regressions, we want the test score to stay close to the training score; a much lower test score indicates overfitting.
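A quick look at how the cars are distributed across those classes helps put the accuracy scores below in context (a hedged aside, assuming the cleaned df from above):

# Number of cars in each stock class; an imbalanced distribution makes
# raw accuracy a slightly less informative metric.
df["stock_specs"].value_counts()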
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["stock_specs"], test_size=0.7)
Logistic Regression#
clf = LogisticRegression(max_iter=100000)
clf.fit(X_train, y_train)
LogisticRegression(max_iter=100000)
clf.score(X_train, y_train)
0.8260869565217391
clf.score(X_test, y_test)
0.723404255319149
As we can see, the score produced by the test split (about 72%) is noticeably lower than that of the train split (about 83%), indicating some overfitting and suggesting that the Logistic Regression model is not a particularly good fit for predicting the “stock_specs” of our cars.
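To see where the misclassifications land, a confusion matrix on the test split can be informative (a hedged extra, not part of the original analysis; it assumes the fitted clf from above):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, in clf.classes_ order;
# most of the confusion is expected between neighboring classes such as C and B.
pd.DataFrame(
    confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_),
    index=clf.classes_,
    columns=clf.classes_,
)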
Decision Tree Classifier#
For the Decision Tree Classifier, the initial instantiation with max_leaf_nodes = 9 is a guess for determining the best results.
clf2 = DecisionTreeClassifier(max_leaf_nodes=9)
clf2.fit(X_train,y_train)
DecisionTreeClassifier(max_leaf_nodes=9)
clf2.score(X_train, y_train)
0.8136645962732919
clf2.score(X_test, y_test)
0.6781914893617021
As we can see, the score produced by the test split is noticeably lower than that of the train split, indicating that the Decision Tree Classifier model is not a good fit for predicting the “stock_specs” of our cars.
clf2.classes_
array(['A', 'B', 'C', 'D', 'S1', 'S2'], dtype=object)
df_err2 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2, 40):
    dtc = DecisionTreeClassifier(max_leaf_nodes=i)
    dtc.fit(X_train, y_train)
    d3 = {"leaves": i, "error": (1 - dtc.score(X_train, y_train)), "set": "train"}
    d4 = {"leaves": i, "error": (1 - dtc.score(X_test, y_test)), "set": "test"}
    df_err2.loc[len(df_err2)] = d3
    df_err2.loc[len(df_err2)] = d4
c = alt.Chart(df_err2).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c
As we can see from the test error curve, max_leaf_nodes = 8 would produce slightly better results, given that it is the first point where the test curve switches from a negative slope to a positive one. But given that our original model did not produce favorable results with max_leaf_nodes = 9, it is possible that the model still won't be good at predicting the “stock_specs” of our cars.
Random Forest Classifier#
rfc = RandomForestClassifier(n_estimators=200,max_leaf_nodes=9)
rfc.fit(X_train,y_train)
RandomForestClassifier(max_leaf_nodes=9, n_estimators=200)
rfc.score(X_train,y_train)
0.8944099378881988
rfc.score(X_test,y_test)
0.7420212765957447
As we can see, the score produced by the test split is again noticeably lower than that of the train split, indicating that the Random Forest Classifier model is also not a good fit for predicting the “stock_specs” of our cars, although it achieves the best test accuracy of the three classifiers.
Checking the Adequacy of Our Classification Models#
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
As the extra portion of my project, I chose to use log_loss to determine the adequacy of the different classification models. The lower the value, the better the model is suited to predicting “stock_specs”. This function only works for classification models, as it needs the predicted class probabilities to compute its value.
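As a tiny illustration of how log_loss behaves (made-up values, not drawn from the car data), confident correct probabilities give a small loss while confident wrong ones are penalized heavily:

# Two-class toy example with labels "A" and "B" (probability columns in alphabetical order).
y_true = ["A", "B", "A"]
good_probs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]   # mostly confident and correct
bad_probs = [[0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]    # mostly confident and wrong
print(log_loss(y_true, good_probs))   # small value
print(log_loss(y_true, bad_probs))    # much larger value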
Logistic Regression#
log_loss(y_train, clf.predict_proba(X_train))
0.5002982111368205
log_loss(y_test, clf.predict_proba(X_test))
0.6696431553535065
As we can see, the test log loss is somewhat higher than the training log loss, suggesting that our model clf, the Logistic Regression, is not an entirely adequate model for our predictions.
Decision Tree Classifier#
log_loss(y_train, clf2.predict_proba(X_train))
0.46980498391662157
log_loss(y_test, clf2.predict_proba(X_test))
2.8158974669983867
As we can see, the test log loss is dramatically higher than the training log loss, indicating that our model clf2, the Decision Tree Classifier, is not an adequate model for our predictions.
Random Forest Classifier#
log_loss(y_train, rfc.predict_proba(X_train))
0.48410537656977254
log_loss(y_test, rfc.predict_proba(X_test))
0.6558411789879937
As we can see, the test log loss is somewhat higher than the training log loss, indicating that our model rfc, the Random Forest Classifier, is not an entirely adequate model for our predictions either, although it has the lowest test log loss of the three classifiers.
Summary#
To conclude this project, we can see that all of our regression models were a good fit for predicting the “Stock_Rating” of our cars. In retrospect, the data is rather linear in nature, since a higher value of an individual stat corresponds fairly directly to a higher “Stock_Rating”, so it was not difficult for the models to predict the corresponding values. On the other hand, it came as a surprise that the classification models were not as successful at predicting the “stock_specs” of the cars. From my in-game knowledge, I believe this is because the six in-game classes do not divide the rating values into equal parts; rather, each class covers its own range of rating values, and cars near a class boundary are easy to misclassify.
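That last point can be checked directly from the data: grouping the cleaned DataFrame by class shows each class covering its own band of rating values rather than an even split (a hedged check, assuming the df, columns, and imports defined earlier in the notebook):

# Rating range and car count for each stock class; the bands differ in width,
# which is consistent with the classes not dividing the ratings evenly.
df.groupby("stock_specs")["Stock_Rating"].agg(["min", "max", "count"])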
References#
What is the source of your dataset(s)?
This dataset comes from Kaggle, found here. It contains all of the vehicles that can be acquired in the game Forza Horizon 5, along with their respective stats.
List any other references that you found helpful.
My other references are from ChatGPT; the code blocks that made use of it are noted above.