Predicting Forza Horizon Car Classes and Ratings#

Author: Felipe I Rios

Course Project, UC Irvine, Math 10, S23

Introduction#

Since I was young, my favorite genre of video games has always been racing and driving simulator games, and I remember how cars in these games have stats that determine how good they are. One game in particular, Forza Horizon, not only has stats that determine how good the cars are, but also has a classification system in place that corresponds to how good each car is. This serves as the basis for this project, where I will see whether machine learning can accurately predict the classification classes based on the various stats.

Cleaning up the Data & Determining Inputs#

import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("Forza_Horizon_Cars.csv")
df
Car_Image Name_and_model Model_type In_Game_Price car_source stock_specs Stock_Rating Drive_Type speed handling ... braking Offroad Top_Speed 0-60_Mph 0-100_Mph g-force car_source_1 car_source_2 Horse_Power Weight_lbs
0 https://www.kudosprime.com/fh5/images/cars/sid... 2001 Acura Integra Type R RETRO HOT HATCH 25,000 Autoshow C 596.0 FWD 5.6 5.1 ... 3.4 5.1 155.5 Mph 6.2s 14.7s 0.90 g info_not_found info_not_found 195 2,639
1 https://www.kudosprime.com/fh5/images/cars/sid... 2002 Acura RSX Type S RETRO HOT HATCH 25,000 Autoshow C 585.0 FWD 5.6 5.1 ... 3.5 5.3 info_not_found info_not_found info_not_found info_not_found info_not_found info_not_found 200 2,820
2 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Acura NSX MODERN SUPERCARS 170,000 Autoshow S1 831.0 AWD 7.0 7.0 ... 7.1 4.7 info_not_found info_not_found info_not_found info_not_found info_not_found info_not_found 573 3,803
3 https://www.kudosprime.com/fh5/images/cars/sid... 1973 Alpine A110 1600s CLASSIC RALLY 98,000 Autoshow C 550.0 RWD 5.0 4.4 ... 3.0 5.3 138.0 Mph 7.0s 20.6s 0.84 g info_not_found info_not_found 123 1,576
4 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Alpine A110 MODERN SPORTS CARS 67,500 Wheelspin B 694.0 RWD 6.0 6.4 ... 4.2 4.2 info_not_found info_not_found info_not_found info_not_found Season Event Festival Playlist 248 2,432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
534 https://www.kudosprime.com/fh5/images/cars/sid... 1997 Volvo 850 R RETRO SALOONS 25,000 Autoshow C 511.0 FWD 5.8 4.3 ... 2.5 5.0 info_not_found info_not_found info_not_found info_not_found Wheelspin info_not_found 240 3,230
535 https://www.kudosprime.com/fh5/images/cars/sid... 2015 Volvo V60 Polestar SUPER SALOONS 62,000 Autoshow B 662.0 AWD 6.1 5.9 ... 3.9 5.1 info_not_found info_not_found info_not_found info_not_found Wheelspin Accolade 346 3,985
536 https://www.kudosprime.com/fh5/images/cars/sid... 2017 VUHL 05RR TRACK TOYS 100,000 Autoshow S1 886.0 RWD 6.8 8.6 ... 7.8 3.7 info_not_found info_not_found info_not_found info_not_found Wheelspin info_not_found 385 1,598
537 https://www.kudosprime.com/fh5/images/cars/sid... 1945 WILLYS MB Jeep PICK-UP & 4X4'S 40,000 Autoshow D 198.0 AWD 2.6 3.7 ... 2.1 7.1 info_not_found info_not_found info_not_found info_not_found Wheelspin info_not_found 60 2,137
538 https://www.kudosprime.com/fh5/images/cars/sid... 2019 Zenvo TSR-S HYPERCARS 1,200,000 Autoshow S2 927.0 RWD 9.0 8.7 ... 9.6 4.0 info_not_found info_not_found info_not_found info_not_found Wheelspin info_not_found 1,177 3,410

539 rows × 22 columns

df = df.applymap(lambda x: np.nan if x == "info_not_found" else x)
df
Car_Image Name_and_model Model_type In_Game_Price car_source stock_specs Stock_Rating Drive_Type speed handling ... braking Offroad Top_Speed 0-60_Mph 0-100_Mph g-force car_source_1 car_source_2 Horse_Power Weight_lbs
0 https://www.kudosprime.com/fh5/images/cars/sid... 2001 Acura Integra Type R RETRO HOT HATCH 25,000 Autoshow C 596.0 FWD 5.6 5.1 ... 3.4 5.1 155.5 Mph 6.2s 14.7s 0.90 g NaN NaN 195 2,639
1 https://www.kudosprime.com/fh5/images/cars/sid... 2002 Acura RSX Type S RETRO HOT HATCH 25,000 Autoshow C 585.0 FWD 5.6 5.1 ... 3.5 5.3 NaN NaN NaN NaN NaN NaN 200 2,820
2 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Acura NSX MODERN SUPERCARS 170,000 Autoshow S1 831.0 AWD 7.0 7.0 ... 7.1 4.7 NaN NaN NaN NaN NaN NaN 573 3,803
3 https://www.kudosprime.com/fh5/images/cars/sid... 1973 Alpine A110 1600s CLASSIC RALLY 98,000 Autoshow C 550.0 RWD 5.0 4.4 ... 3.0 5.3 138.0 Mph 7.0s 20.6s 0.84 g NaN NaN 123 1,576
4 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Alpine A110 MODERN SPORTS CARS 67,500 Wheelspin B 694.0 RWD 6.0 6.4 ... 4.2 4.2 NaN NaN NaN NaN Season Event Festival Playlist 248 2,432
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
534 https://www.kudosprime.com/fh5/images/cars/sid... 1997 Volvo 850 R RETRO SALOONS 25,000 Autoshow C 511.0 FWD 5.8 4.3 ... 2.5 5.0 NaN NaN NaN NaN Wheelspin NaN 240 3,230
535 https://www.kudosprime.com/fh5/images/cars/sid... 2015 Volvo V60 Polestar SUPER SALOONS 62,000 Autoshow B 662.0 AWD 6.1 5.9 ... 3.9 5.1 NaN NaN NaN NaN Wheelspin Accolade 346 3,985
536 https://www.kudosprime.com/fh5/images/cars/sid... 2017 VUHL 05RR TRACK TOYS 100,000 Autoshow S1 886.0 RWD 6.8 8.6 ... 7.8 3.7 NaN NaN NaN NaN Wheelspin NaN 385 1,598
537 https://www.kudosprime.com/fh5/images/cars/sid... 1945 WILLYS MB Jeep PICK-UP & 4X4'S 40,000 Autoshow D 198.0 AWD 2.6 3.7 ... 2.1 7.1 NaN NaN NaN NaN Wheelspin NaN 60 2,137
538 https://www.kudosprime.com/fh5/images/cars/sid... 2019 Zenvo TSR-S HYPERCARS 1,200,000 Autoshow S2 927.0 RWD 9.0 8.7 ... 9.6 4.0 NaN NaN NaN NaN Wheelspin NaN 1,177 3,410

539 rows × 22 columns

Here I made sure that all the values that had been entered as “info_not_found” were replaced with missing values.
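
An equivalent approach, assuming the same CSV file, would be to have pandas treat the sentinel string as missing at load time; a minimal sketch, not what I ran above:

# let read_csv convert "info_not_found" to NaN directly while reading the file
df = pd.read_csv("Forza_Horizon_Cars.csv", na_values="info_not_found")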

df.dtypes
Car_Image         object
Name_and_model    object
Model_type        object
In_Game_Price     object
car_source        object
stock_specs       object
Stock_Rating      object
Drive_Type        object
speed             object
handling          object
acceleration      object
launch            object
braking           object
Offroad           object
Top_Speed         object
0-60_Mph          object
0-100_Mph         object
g-force           object
car_source_1      object
car_source_2      object
Horse_Power       object
Weight_lbs        object
dtype: object
cols = ["Stock_Rating", "speed", "handling", "acceleration", "launch", "braking", "Offroad", "Horse_Power", "Weight_lbs"]

Notice that all the columns in the DataFrame have dtype object and thus need to be changed to numeric dtypes. In the code block above, I chose the columns that will be my input features in the data analysis later in the project. Given that there is no “pattern” in the way the column names are written, there was no concise way of building this list (such as through a list comprehension), so manually typing them out was my only choice.

for c in cols:
    if df[c].str.contains(",").any():
        df[c] = df[c].str.replace(',',"").astype(float)
    else:
        df[c] = pd.to_numeric(df[c])

Lines 2-3 of the code block directly above were partially adapted from ChatGPT, given how the initial DataFrame is formatted. Strings representing numbers cannot be converted to numeric dtypes if they contain commas, as in “1,042”, so the commas have to be removed first.

df.dtypes
Car_Image          object
Name_and_model     object
Model_type         object
In_Game_Price      object
car_source         object
stock_specs        object
Stock_Rating      float64
Drive_Type         object
speed             float64
handling          float64
acceleration      float64
launch            float64
braking           float64
Offroad           float64
Top_Speed          object
0-60_Mph           object
0-100_Mph          object
g-force            object
car_source_1       object
car_source_2       object
Horse_Power       float64
Weight_lbs        float64
dtype: object
threshold = (len(df)*0.1)
missing_values = df.isna().sum()
columns_to_drop = missing_values[missing_values > threshold].index
columns_to_drop
Index(['Top_Speed', '0-60_Mph', '0-100_Mph', 'g-force', 'car_source_1',
       'car_source_2'],
      dtype='object')
df = df.drop(columns_to_drop, axis = 1)

The four code blocks directly above were adapted from ChatGPT to help remove columns with more than 10% of their values missing, since such columns have no value in the machine learning part of the project. The original DataFrame has 22 columns, so I removed these columns to make it more concise, considering some of them have over 70% missing values.

df = df.dropna(axis = 0)
df
Car_Image Name_and_model Model_type In_Game_Price car_source stock_specs Stock_Rating Drive_Type speed handling acceleration launch braking Offroad Horse_Power Weight_lbs
0 https://www.kudosprime.com/fh5/images/cars/sid... 2001 Acura Integra Type R RETRO HOT HATCH 25,000 Autoshow C 596.0 FWD 5.6 5.1 3.9 3.1 3.4 5.1 195.0 2639.0
1 https://www.kudosprime.com/fh5/images/cars/sid... 2002 Acura RSX Type S RETRO HOT HATCH 25,000 Autoshow C 585.0 FWD 5.6 5.1 3.9 3.0 3.5 5.3 200.0 2820.0
2 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Acura NSX MODERN SUPERCARS 170,000 Autoshow S1 831.0 AWD 7.0 7.0 9.2 10.0 7.1 4.7 573.0 3803.0
3 https://www.kudosprime.com/fh5/images/cars/sid... 1973 Alpine A110 1600s CLASSIC RALLY 98,000 Autoshow C 550.0 RWD 5.0 4.4 4.1 3.1 3.0 5.3 123.0 1576.0
4 https://www.kudosprime.com/fh5/images/cars/sid... 2017 Alpine A110 MODERN SPORTS CARS 67,500 Wheelspin B 694.0 RWD 6.0 6.4 5.4 5.7 4.2 4.2 248.0 2432.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
534 https://www.kudosprime.com/fh5/images/cars/sid... 1997 Volvo 850 R RETRO SALOONS 25,000 Autoshow C 511.0 FWD 5.8 4.3 3.3 2.6 2.5 5.0 240.0 3230.0
535 https://www.kudosprime.com/fh5/images/cars/sid... 2015 Volvo V60 Polestar SUPER SALOONS 62,000 Autoshow B 662.0 AWD 6.1 5.9 5.5 3.6 3.9 5.1 346.0 3985.0
536 https://www.kudosprime.com/fh5/images/cars/sid... 2017 VUHL 05RR TRACK TOYS 100,000 Autoshow S1 886.0 RWD 6.8 8.6 8.5 9.2 7.8 3.7 385.0 1598.0
537 https://www.kudosprime.com/fh5/images/cars/sid... 1945 WILLYS MB Jeep PICK-UP & 4X4'S 40,000 Autoshow D 198.0 AWD 2.6 3.7 2.0 3.3 2.1 7.1 60.0 2137.0
538 https://www.kudosprime.com/fh5/images/cars/sid... 2019 Zenvo TSR-S HYPERCARS 1,200,000 Autoshow S2 927.0 RWD 9.0 8.7 6.7 7.2 9.6 4.0 1177.0 3410.0

537 rows × 16 columns

df.shape
(537, 16)

This effectively finishes the cleaning portion of the data and allows the data analysis to begin.

Predicting the Car’s Rating Using Various Regressions#

This section compares several regression models (plain Linear Regression, a Pipeline with PolynomialFeatures searching for the best degree, a Decision Tree Regressor, and a Random Forest Regressor) to see which one is most accurate in predicting a car’s rating.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
strings_to_drop = ["Stock_Rating", "Horse_Power", "Weight_lbs"]
cols1 = [x for x in cols if x not in strings_to_drop]
cols1
['speed', 'handling', 'acceleration', 'launch', 'braking', 'Offroad']

A new list is made from our original list for ease of use when applying machine learning, which is why some features are removed. We don’t want “Stock_Rating” as part of our features since it is our target column for the regressions, and “Horse_Power” and “Weight_lbs” are not what I’d consider traditional car stats, hence why I removed them as well.

def make_chart(feat):
    c = alt.Chart(df).mark_circle().encode(
        x = feat,
        y = "Stock_Rating",
        color = alt.Color(feat, legend = alt.Legend(title = "Feature"))
    )
    return c
import random

charts = [make_chart(x) for x in cols1]  
# Create an empty list to store the layered charts
layered_charts = []

def generate_random_color():
    # Generate random values for red, green, and blue color components
    red = random.randint(0, 255)
    green = random.randint(0, 255)
    blue = random.randint(0, 255)
    # Construct the RGB color string
    color = f'rgb({red}, {green}, {blue})'
    return color
# Generate a random color
random_color = generate_random_color()

# Iterate over the charts and apply different visual properties to each layer
for i, (chart,x_label) in enumerate(zip(charts, cols1)):
    # Apply different color to each layer
    color = random_color if i == 0 else generate_random_color() 
    # Apply the visual properties to the current layer
    layered_chart = chart.mark_circle().encode(
        x = x_label,
        y ="Stock_Rating",
        color = alt.value(color),
        tooltip = [x_label],
    ).interactive()
    layered_charts.append(layered_chart)

# Combine the layered charts using 'alt.layer'
combined_chart = alt.layer(*layered_charts)
combined_chart

This chart was made possible because of ChatGPT. Given that I have multiple input features, it is not easy to visualize all the information, as it would technically require a multi-dimensional chart. As such, this was the best alternative I could come up with. While it is hard to read, it does help visualize how the stats correspond to each level of “Stock_Rating.”

legend_chart = alt.Chart(pd.DataFrame({'column': cols1})).mark_point().encode(
    color=alt.Color("column:N", legend=alt.Legend(title='Features'))
)
legend_chart
alt.layer(*(combined_chart, legend_chart))

This is how I would have wanted my chart to look, but despite searching the internet, I could not figure out how to get the legend to reflect the actual colors that are in the chart, given that I am using a random color generator to distinguish the different features.
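
For reference, a sketch of one possible fix I did not end up using: reshape the data to long format with pandas melt so that a single color encoding drives both the point colors and the legend. This assumes df and cols1 as defined above; the names df_long and chart_long are just for illustration.

# reshape so every (feature, value) pair becomes its own row
df_long = df.melt(
    id_vars=["Stock_Rating"],   # keep the target column
    value_vars=cols1,           # the six stat columns
    var_name="Feature",
    value_name="Stat_Value",
)
# a single color encoding means the legend automatically matches the point colors
chart_long = alt.Chart(df_long).mark_circle().encode(
    x="Stat_Value",
    y="Stock_Rating",
    color=alt.Color("Feature:N", legend=alt.Legend(title="Features")),
    tooltip=["Feature", "Stat_Value", "Stock_Rating"],
).interactive()
chart_long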

Linear Regression#

As a precursor: when performing a regression, the .score() method reports R² rather than a direct accuracy, so it is more informative to compare errors on a train set and a test set using mean_absolute_error and mean_squared_error. While the numbers don’t mean much in context on their own, we do know that if the test error stays close to the training error the model generalizes well, whereas a much larger test error indicates overfitting. As such, we instantiate the train_test_split below, which lets us detect overfitting in our models.

X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["Stock_Rating"], test_size=0.7)
reg = LinearRegression()
reg.fit(X_train,y_train)
LinearRegression()
mean_squared_error(y_train,reg.predict(X_train))
5279.724534199711
mean_squared_error(y_test,reg.predict(X_test))
4339.725113794179
mean_absolute_error(y_train, reg.predict(X_train))
53.49695209326706
mean_absolute_error(y_test, reg.predict(X_test))
48.324484758786745

From the values produced we can see that the test errors are comparable to (in this split, even slightly lower than) the training errors for both mean_squared_error and mean_absolute_error, indicating that Linear Regression generalizes well and is a good model for predicting the “Stock_Rating” of the cars.

Polynomial Features with Pipeline#

def poly_fit(d, s):
    #cols2 = ["set", "degree", "mae", "mse"]
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d,include_bias=False)),
        ('reg', LinearRegression())
    ])
    pipe.fit(X_train, y_train)
    if s == "train":
        sqr_err_train = mean_squared_error(y_train, pipe.predict(X_train))
        abs_err_train = mean_absolute_error(y_train, pipe.predict(X_train))
        arr = [s, d, abs_err_train, sqr_err_train]
    else: 
        sqr_err_test = mean_squared_error(y_test, pipe.predict(X_test))
        abs_err_test = mean_absolute_error(y_test, pipe.predict(X_test))
        arr = [s, d, abs_err_test, sqr_err_test]
    return arr 
cols3 = ["set", "degree", "mae", "mse"]
df_arr = pd.DataFrame(columns=cols3)
for d in range(2,9):
    series = [poly_fit(d, s) for s in ["train", 'test']]
    df_arr.loc[len(df_arr)] = series[0]
    df_arr.loc[len(df_arr)] = series[1]
df_arr
set degree mae mse
0 train 2 2.509779e+01 1.101053e+03
1 test 2 3.190195e+01 1.964577e+03
2 train 3 1.266801e+01 3.025544e+02
3 test 3 5.312205e+01 1.391856e+04
4 train 4 1.195453e-09 2.743645e-18
5 test 4 9.918989e+02 1.139071e+07
6 train 5 1.572095e-10 5.547747e-20
7 test 5 9.684616e+02 1.775343e+07
8 train 6 2.174201e-10 1.022347e-19
9 test 6 1.372796e+03 3.054070e+07
10 train 7 1.094435e-10 4.879715e-20
11 test 7 2.205851e+03 6.273823e+07
12 train 8 1.352969e-10 6.088622e-20
13 test 8 3.773978e+03 1.645278e+08

Polynomial Features with Pipeline shows that beyond degree 3 the results become unreasonable: the training error collapses toward zero while the test error grows by several orders of magnitude, a clear sign of severe overfitting. Up to degree 3 the test error remains within a reasonable distance of the training error, so a degree 2 (and, to a lesser extent, degree 3) regression is useful for predicting the “Stock_Rating” of the cars.
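
One way to visualize the table above is sketched below (it assumes df_arr as built above; poly_err_chart is just an illustrative name, and the log scale is needed because the overfit degrees blow up by several orders of magnitude).

# plot MAE against polynomial degree for the train and test sets
poly_err_chart = alt.Chart(df_arr).mark_line(point=True).encode(
    x=alt.X("degree:O", title="Polynomial degree"),
    y=alt.Y("mae:Q", scale=alt.Scale(type="log"), title="Mean absolute error (log scale)"),
    color="set:N",
)
poly_err_chart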

Decision Tree Regressor#

For the Decision Tree Regressor, our initial instantiation with max_leaf_nodes = 4 is a guess at the setting that gives the best results.

reg2 = DecisionTreeRegressor(max_leaf_nodes=4)
reg2.fit(X_train,y_train)
DecisionTreeRegressor(max_leaf_nodes=4)
mean_squared_error(y_train, reg2.predict(X_train))
4795.489271173062
mean_squared_error(y_test, reg2.predict(X_test))
6615.546457009281
mean_absolute_error(y_train, reg2.predict(X_train))
55.78619206880076
mean_absolute_error(y_test, reg2.predict(X_test))
64.99354430548637

From the values produced we can see that the test errors are somewhat higher than the training errors for both mean_squared_error and mean_absolute_error, indicating mild overfitting but still reasonable predictive ability for the “Stock_Rating” of the cars. Given that our initial guess of max_leaf_nodes = 4 was only a rough choice, it is possible to improve the results, reducing both errors while keeping the train/test gap small, by using a test error curve.

df_err1 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2,40):
    dtr = DecisionTreeRegressor(max_leaf_nodes=i)
    dtr.fit(X_train, y_train)
    d = {"leaves": i, "error": (1 - dtr.score(X_train, y_train)),"set":"train"}
    d2 = {"leaves": i, "error": (1 - dtr.score(X_test, y_test)),"set":"test"}
    df_err1.loc[len(df_err1)] = d
    df_err1.loc[len(df_err1)] = d2
c = alt.Chart(df_err1).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c

As we can see from the test error curve, max_leaf_nodes = 10 would produce slightly better results, given that it is around the first point where the test curve switches from decreasing to increasing.
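
To confirm this, a refit with the suggested value could look like the following sketch (it reuses the same train/test split from above; reg3 is just a hypothetical name).

# retrain the tree with the leaf count suggested by the test error curve
reg3 = DecisionTreeRegressor(max_leaf_nodes=10)
reg3.fit(X_train, y_train)
# compare train and test errors for the tuned tree
print(mean_absolute_error(y_train, reg3.predict(X_train)),
      mean_absolute_error(y_test, reg3.predict(X_test)))
print(mean_squared_error(y_train, reg3.predict(X_train)),
      mean_squared_error(y_test, reg3.predict(X_test)))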

Random Forest Regressor#

rfg = RandomForestRegressor(n_estimators=200, max_leaf_nodes=13)
rfg.fit(X_train,y_train)
RandomForestRegressor(max_leaf_nodes=13, n_estimators=200)
mean_squared_error(y_train, rfg.predict(X_train))
906.1045239852818
mean_squared_error(y_test, rfg.predict(X_test))
1609.5503655502175
mean_absolute_error(y_train, rfg.predict(X_train))
23.802718069907115
mean_absolute_error(y_test, rfg.predict(X_test))
29.794632547022715

From the values produced we can see that the test errors are moderately higher than the training errors for both mean_squared_error and mean_absolute_error, but both are the lowest of all the models tried so far, indicating that the Random Forest Regressor is a good model for predicting the “Stock_Rating” of the cars.

Predicting the Car’s Classes and Assessing their Accuracy#

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Here we instantiate a new train_test_split because our target column, “stock_specs”, takes six different class values, so a new split is needed for our classification models. We will check how accurate each classification model is using the .score() method, which for classifiers reports accuracy. We want the test accuracy to be high and close to the training accuracy; a large gap between the two suggests overfitting.

X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["stock_specs"], test_size=0.7)

Logistic Regression#

clf = LogisticRegression(max_iter=100000)
clf.fit(X_train, y_train)
LogisticRegression(max_iter=100000)
clf.score(X_train, y_train)
0.8260869565217391
clf.score(X_test, y_test)
0.723404255319149

As we can see, the score produced by the test split (about 0.72) is noticeably lower than that of the train split (about 0.83), indicating that the Logistic Regression model overfits somewhat and is not an especially good fit for predicting the “stock_specs” of our cars.

Decision Tree Classifier#

For the Decision Tree Classifier, our initial instantiation with max_leaf_nodes = 9 is a guess at the setting that gives the best results.

clf2 = DecisionTreeClassifier(max_leaf_nodes=9)
clf2.fit(X_train,y_train)
DecisionTreeClassifier(max_leaf_nodes=9)
clf2.score(X_train, y_train)
0.8136645962732919
clf2.score(X_test, y_test)
0.6781914893617021

As we can see, the score produced by the test split (about 0.68) is noticeably lower than that of the train split (about 0.81), indicating that the Decision Tree Classifier model overfits and is not a good fit for predicting the “stock_specs” of our cars.

clf2.classes_
array(['A', 'B', 'C', 'D', 'S1', 'S2'], dtype=object)
df_err2 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2,40):
    dtc = DecisionTreeClassifier(max_leaf_nodes=i)
    dtc.fit(X_train, y_train)
    d3 = {"leaves": i, "error": (1 - dtc.score(X_train, y_train)),"set":"train"}
    d4 = {"leaves": i, "error": (1 - dtc.score(X_test, y_test)),"set":"test"}
    df_err2.loc[len(df_err2)] = d3
    df_err2.loc[len(df_err2)] = d4
c = alt.Chart(df_err2).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c

As we can see from the test error curve, max_leaf_nodes = 8 would produce slightly better results, given that it is around the first point where the test curve switches from decreasing to increasing. But, given that our original model did not produce favorable results with max_leaf_nodes = 9, it is possible that the model still won’t be good for predicting the “stock_specs” of our cars.

Random Forest Classifier#

rfc = RandomForestClassifier(n_estimators=200,max_leaf_nodes=9)
rfc.fit(X_train,y_train)
RandomForestClassifier(max_leaf_nodes=9, n_estimators=200)
rfc.score(X_train,y_train)
0.8944099378881988
rfc.score(X_test,y_test)
0.7420212765957447

As we can see, the score produced by the test split (about 0.74) is lower than that of the train split (about 0.89), indicating that the Random Forest Classifier model also overfits and is not an especially good fit for predicting the “stock_specs” of our cars.

Checking the Adequacy of Our Classification Models#

from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

As the extra portion of my project, I chose to use log_loss to assess the adequacy of the different classification models: the lower the value, the better suited the model is for predicting our “stock_specs”. This function only works for classification models, as it needs the predicted class probabilities to compute its value.

Logistic Regression#

log_loss(y_train, clf.predict_proba(X_train))
0.5002982111368205
log_loss(y_test, clf.predict_proba(X_test))
0.6696431553535065

As we can see, the value produced by the test set is moderately higher than that of the train set, indicating that our Logistic Regression model clf overfits somewhat and is not a fully adequate model for our predictions.

Decision Tree Classifier#

log_loss(y_train, clf2.predict_proba(X_train))
0.46980498391662157
log_loss(y_test, clf2.predict_proba(X_test))
2.8158974669983867

As we can see, the value produced by the test set is dramatically higher than that of the train set, indicating that our Decision Tree Classifier model clf2 overfits badly and is not an adequate model for our predictions.

Random Forest Classifier#

log_loss(y_train, rfc.predict_proba(X_train))
0.48410537656977254
log_loss(y_test, rfc.predict_proba(X_test))
0.6558411789879937

As we can see, the value produced by the test set is higher than that of the train set, indicating that our Random Forest Classifier model rfc also overfits, although its test log loss is the lowest of the three classifiers.

Summary#

To conclude this project, we can see that all of our regression models were a good fit for predicting the “Stock_Rating” of our cars. In retrospect, the data is rather linear in nature: a higher value in an individual stat corresponds quite directly to a higher “Stock_Rating”, so it was not difficult for the models to predict the corresponding values. On the other hand, it came as a surprise that the classification models were not as successful in predicting the “stock_specs” of the cars. From my in-game knowledge, I believe this is because the six in-game classes do not subdivide the rating values evenly into equal parts; rather, each class has its own corresponding range of rating values, so cars near a class boundary are easy to misclassify from the stats alone.
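
One quick way to check this conjecture on the cleaned DataFrame would be to look at the observed rating range within each class; a minimal sketch (not something I ran above):

# observed Stock_Rating band and car count for each in-game class
df.groupby("stock_specs")["Stock_Rating"].agg(["min", "max", "count"])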

References#


  • What is the source of your dataset(s)?

This dataset comes from Kaggle, found here. It is a dataset that contains all the vehicles that can be acquired within the game Forza Horizon 5, along with their respective stats.

  • List any other references that you found helpful.

All my references are from ChatGPT; I highlighted the code blocks that made use of it.

