Predicting Forza Horizon Car Classes and Ratings#
Author: Felipe I Rios
Course Project, UC Irvine, Math 10, S23
Introduction#
Since I was younger, my favorite genre of video games has always been racing and driving simulator games, and I remember how cars in these games have stats that determine how good they are. One game in particular, Forza Horizon, not only gives each car stats, but also has a classification system that corresponds to how good the car is. This will serve as the basis for this project, where I will see whether machine learning models can accurately predict a car's class based on its various stats.
Cleaning up the Data & Determining Inputs#
import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("Forza_Horizon_Cars.csv")
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | ... | braking | Offroad | Top_Speed | 0-60_Mph | 0-100_Mph | g-force | car_source_1 | car_source_2 | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | ... | 3.4 | 5.1 | 155.5 Mph | 6.2s | 14.7s | 0.90 g | info_not_found | info_not_found | 195 | 2,639 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | ... | 3.5 | 5.3 | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | 200 | 2,820 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | ... | 7.1 | 4.7 | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | info_not_found | 573 | 3,803 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | ... | 3.0 | 5.3 | 138.0 Mph | 7.0s | 20.6s | 0.84 g | info_not_found | info_not_found | 123 | 1,576 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | ... | 4.2 | 4.2 | info_not_found | info_not_found | info_not_found | info_not_found | Season Event | Festival Playlist | 248 | 2,432 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | ... | 2.5 | 5.0 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 240 | 3,230 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | ... | 3.9 | 5.1 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | Accolade | 346 | 3,985 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | ... | 7.8 | 3.7 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 385 | 1,598 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | ... | 2.1 | 7.1 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 60 | 2,137 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | ... | 9.6 | 4.0 | info_not_found | info_not_found | info_not_found | info_not_found | Wheelspin | info_not_found | 1,177 | 3,410 |
539 rows Ă— 22 columns
df = df.applymap(lambda x: np.nan if x == "info_not_found" else x)
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | ... | braking | Offroad | Top_Speed | 0-60_Mph | 0-100_Mph | g-force | car_source_1 | car_source_2 | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | ... | 3.4 | 5.1 | 155.5 Mph | 6.2s | 14.7s | 0.90 g | NaN | NaN | 195 | 2,639 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | ... | 3.5 | 5.3 | NaN | NaN | NaN | NaN | NaN | NaN | 200 | 2,820 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | ... | 7.1 | 4.7 | NaN | NaN | NaN | NaN | NaN | NaN | 573 | 3,803 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | ... | 3.0 | 5.3 | 138.0 Mph | 7.0s | 20.6s | 0.84 g | NaN | NaN | 123 | 1,576 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | ... | 4.2 | 4.2 | NaN | NaN | NaN | NaN | Season Event | Festival Playlist | 248 | 2,432 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | ... | 2.5 | 5.0 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 240 | 3,230 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | ... | 3.9 | 5.1 | NaN | NaN | NaN | NaN | Wheelspin | Accolade | 346 | 3,985 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | ... | 7.8 | 3.7 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 385 | 1,598 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | ... | 2.1 | 7.1 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 60 | 2,137 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | ... | 9.6 | 4.0 | NaN | NaN | NaN | NaN | Wheelspin | NaN | 1,177 | 3,410 |
539 rows Ă— 22 columns
Here I replaced every value that had been entered as “info_not_found” with a missing value (NaN).
df.dtypes
Car_Image object
Name_and_model object
Model_type object
In_Game_Price object
car_source object
stock_specs object
Stock_Rating object
Drive_Type object
speed object
handling object
acceleration object
launch object
braking object
Offroad object
Top_Speed object
0-60_Mph object
0-100_Mph object
g-force object
car_source_1 object
car_source_2 object
Horse_Power object
Weight_lbs object
dtype: object
cols = ["Stock_Rating", "speed", "handling", "acceleration", "launch", "braking", "Offroad", "Horse_Power", "Weight_lbs"]
Notice that all of the columns in the DataFrame have dtype object and thus need to be converted to numeric dtypes. In the code block above, I chose the columns that will serve as input features for the analysis later in the project. Since there is no pattern in how the column names are written, there was no concise way of building this list (for example with a list comprehension), so typing them out manually was the only option.
for c in cols:
    # Strip thousands separators before conversion; strings like "1,042"
    # cannot be parsed as numbers while the commas are present.
    if df[c].str.contains(",").any():
        df[c] = df[c].str.replace(",", "").astype(float)
    else:
        df[c] = pd.to_numeric(df[c])
The comma-handling branch in the code block directly above was partially adapted from ChatGPT because of how the initial DataFrame is formatted. Strings representing numbers cannot be converted to numeric dtypes if they contain commas, as in “1,042”, so the commas have to be removed first.
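To illustrate the issue, here is a minimal, self-contained sketch (using made-up values, not rows from the dataset) showing that pd.to_numeric rejects comma-formatted strings until the commas are stripped:

s = pd.Series(["1,042", "560"])
# pd.to_numeric raises a ValueError on "1,042" because of the comma...
try:
    pd.to_numeric(s)
except ValueError as e:
    print("Conversion failed:", e)
# ...but stripping the commas first works as expected.
print(pd.to_numeric(s.str.replace(",", "")))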
df.dtypes
Car_Image object
Name_and_model object
Model_type object
In_Game_Price object
car_source object
stock_specs object
Stock_Rating float64
Drive_Type object
speed float64
handling float64
acceleration float64
launch float64
braking float64
Offroad float64
Top_Speed object
0-60_Mph object
0-100_Mph object
g-force object
car_source_1 object
car_source_2 object
Horse_Power float64
Weight_lbs float64
dtype: object
threshold = (len(df)*0.1)
missing_values = df.isna().sum()
columns_to_drop = missing_values[missing_values > threshold].index
columns_to_drop
Index(['Top_Speed', '0-60_Mph', '0-100_Mph', 'g-force', 'car_source_1',
'car_source_2'],
dtype='object')
df = df.drop(columns_to_drop, axis = 1)
The four code blocks directly above were adapted from ChatGPT to help remove columns in which more than 10% of the values are missing, since such columns add little to the machine learning part of the project. The original DataFrame has 22 columns, and I removed these six to keep things concise, considering that some of them are over 70% missing values.
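For reference, pandas can express roughly the same column filter directly through dropna's thresh argument. This is a hedged, equivalent sketch rather than the approach used above; it assumes it is run on the version of df that still has all 22 columns, before the drop above.

# Keep only columns with at least ~90% non-missing values; this should drop
# approximately the same six columns identified above.
min_non_missing = int(len(df) * 0.9)
df_alt = df.dropna(axis=1, thresh=min_non_missing)
df_alt.shape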
df = df.dropna(axis = 0)
df
Car_Image | Name_and_model | Model_type | In_Game_Price | car_source | stock_specs | Stock_Rating | Drive_Type | speed | handling | acceleration | launch | braking | Offroad | Horse_Power | Weight_lbs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | https://www.kudosprime.com/fh5/images/cars/sid... | 2001 Acura Integra Type R | RETRO HOT HATCH | 25,000 | Autoshow | C | 596.0 | FWD | 5.6 | 5.1 | 3.9 | 3.1 | 3.4 | 5.1 | 195.0 | 2639.0 |
1 | https://www.kudosprime.com/fh5/images/cars/sid... | 2002 Acura RSX Type S | RETRO HOT HATCH | 25,000 | Autoshow | C | 585.0 | FWD | 5.6 | 5.1 | 3.9 | 3.0 | 3.5 | 5.3 | 200.0 | 2820.0 |
2 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Acura NSX | MODERN SUPERCARS | 170,000 | Autoshow | S1 | 831.0 | AWD | 7.0 | 7.0 | 9.2 | 10.0 | 7.1 | 4.7 | 573.0 | 3803.0 |
3 | https://www.kudosprime.com/fh5/images/cars/sid... | 1973 Alpine A110 1600s | CLASSIC RALLY | 98,000 | Autoshow | C | 550.0 | RWD | 5.0 | 4.4 | 4.1 | 3.1 | 3.0 | 5.3 | 123.0 | 1576.0 |
4 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 Alpine A110 | MODERN SPORTS CARS | 67,500 | Wheelspin | B | 694.0 | RWD | 6.0 | 6.4 | 5.4 | 5.7 | 4.2 | 4.2 | 248.0 | 2432.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
534 | https://www.kudosprime.com/fh5/images/cars/sid... | 1997 Volvo 850 R | RETRO SALOONS | 25,000 | Autoshow | C | 511.0 | FWD | 5.8 | 4.3 | 3.3 | 2.6 | 2.5 | 5.0 | 240.0 | 3230.0 |
535 | https://www.kudosprime.com/fh5/images/cars/sid... | 2015 Volvo V60 Polestar | SUPER SALOONS | 62,000 | Autoshow | B | 662.0 | AWD | 6.1 | 5.9 | 5.5 | 3.6 | 3.9 | 5.1 | 346.0 | 3985.0 |
536 | https://www.kudosprime.com/fh5/images/cars/sid... | 2017 VUHL 05RR | TRACK TOYS | 100,000 | Autoshow | S1 | 886.0 | RWD | 6.8 | 8.6 | 8.5 | 9.2 | 7.8 | 3.7 | 385.0 | 1598.0 |
537 | https://www.kudosprime.com/fh5/images/cars/sid... | 1945 WILLYS MB Jeep | PICK-UP & 4X4'S | 40,000 | Autoshow | D | 198.0 | AWD | 2.6 | 3.7 | 2.0 | 3.3 | 2.1 | 7.1 | 60.0 | 2137.0 |
538 | https://www.kudosprime.com/fh5/images/cars/sid... | 2019 Zenvo TSR-S | HYPERCARS | 1,200,000 | Autoshow | S2 | 927.0 | RWD | 9.0 | 8.7 | 6.7 | 7.2 | 9.6 | 4.0 | 1177.0 | 3410.0 |
537 rows Ă— 16 columns
df.shape
(537, 16)
This finishes the cleaning portion of the data and allows the data analysis to begin.
Predicting the Car’s Rating Using Various Regressions#
A comparison between Linear Regression, a polynomial-features Pipeline (at its best degree), a Decision Tree Regressor, and a Random Forest Regressor, to see which one is more accurate in its predictions.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
strings_to_drop = ["Stock_Rating", "Horse_Power", "Weight_lbs"]
cols1 = [x for x in cols if x not in strings_to_drop]
cols1
['speed', 'handling', 'acceleration', 'launch', 'braking', 'Offroad']
A new list is created from the original for ease of use when applying machine learning, which is why some features are removed. We don't want “Stock_Rating” as part of our features since it is the target column for the regressions, and “Horse_Power” and “Weight_lbs” are not what I'd consider traditional car stats, hence why I removed them as well.
def make_chart(feat):
    c = alt.Chart(df).mark_circle().encode(
        x = feat,
        y = "Stock_Rating",
        color = alt.Color(feat, legend = alt.Legend(title = "Feature"))
    )
    return c
import random
charts = [make_chart(x) for x in cols1]
# Create an empty list to store the layered charts
layered_charts = []

def generate_random_color():
    # Generate random values for red, green, and blue color components
    red = random.randint(0, 255)
    green = random.randint(0, 255)
    blue = random.randint(0, 255)
    # Construct the RGB color string
    color = f'rgb({red}, {green}, {blue})'
    return color

# Generate a random color
random_color = generate_random_color()
# Iterate over the charts and apply different visual properties to each layer
for i, (chart, x_label) in enumerate(zip(charts, cols1)):
    # Apply a different color to each layer
    color = random_color if i == 0 else generate_random_color()
    # Apply the visual properties to the current layer
    layered_chart = chart.mark_circle().encode(
        x = x_label,
        y = "Stock_Rating",
        color = alt.value(color),   # constant color for this layer
        tooltip = [x_label],
    ).interactive()
    layered_charts.append(layered_chart)
# Combine the layered charts using 'alt.layer'
combined_chart = alt.layer(*layered_charts)
combined_chart
This chart was made possible because of ChatGPT. Given that I have multiple input features, it is not easy to visualize all of the information at once, since that would technically require a multi-dimensional chart. As such, this was the best alternative I could come up with. While it is hard to read, it does help visualize how the stats correspond to each level of “Stock_Rating.”
legend_chart = alt.Chart(pd.DataFrame({'column': cols1})).mark_point().encode(
    color=alt.Color("column:N", legend=alt.Legend(title='Features'))
)
legend_chart
alt.layer(*(combined_chart, legend_chart))
This is how I would have wanted my chart to look, but despite searching the internet, I could not figure out how to get the legend to reflect the actual colors that are in the chart, given that I am using a random color generator to distinguish the different features.
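One possible fix, sketched below (not the approach used above), is to reshape the data into long format with pandas' melt and let Altair assign the colors itself, so the legend and the point colors automatically agree. This assumes df and cols1 as defined earlier.

# Reshape so each (feature, value) pair becomes its own row; Altair can then
# encode the feature name on the color channel and build a matching legend.
df_long = df.melt(
    id_vars=["Stock_Rating"],
    value_vars=cols1,
    var_name="feature",
    value_name="stat_value",
)
alt.Chart(df_long).mark_circle().encode(
    x="stat_value",
    y="Stock_Rating",
    color=alt.Color("feature:N", legend=alt.Legend(title="Features")),
    tooltip=["feature", "stat_value", "Stock_Rating"],
).interactive()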
Linear Regression#
As a precursor: not all regression types have a .score() method for assessing the accuracy of the model, so it is better to compare a train set and a test set using mean_absolute_error and mean_squared_error. While the raw numbers do not mean much in context, what matters is how the test error compares to the training error: if the test error is close to (rather than dramatically larger than) the training error, the model generalizes well and is not badly overfitting. As such, we instantiate the train_test_split below, which helps us detect and reduce the influence of overfitting in our models.
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["Stock_Rating"], test_size=0.7)
reg = LinearRegression()
reg.fit(X_train,y_train)
LinearRegression()
mean_squared_error(y_train,reg.predict(X_train))
5279.724534199711
mean_squared_error(y_test,reg.predict(X_test))
4339.725113794179
mean_absolute_error(y_train, reg.predict(X_train))
53.49695209326706
mean_absolute_error(y_test, reg.predict(X_test))
48.324484758786745
Looking at the values produced, the test errors are comparable to (in this run, even slightly lower than) the training errors for both mean_squared_error and mean_absolute_error, indicating that Linear Regression generalizes well and is a good model for predicting the “Stock_Rating” of the cars.
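As an optional sanity check (not part of the original workflow), cross-validation can confirm that the fit is stable across different splits rather than an artifact of this particular train/test split. This sketch assumes df and cols1 from above.

from sklearn.model_selection import cross_val_score

# Five-fold cross-validated R^2 for the same linear model; consistently
# high scores across folds suggest the fit is not split-dependent.
scores = cross_val_score(LinearRegression(), df[cols1], df["Stock_Rating"], cv=5)
scores.mean(), scores.std()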
Polynomial Features with Pipeline#
def poly_fit(d, s):
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ('reg', LinearRegression())
    ])
    pipe.fit(X_train, y_train)
    if s == "train":
        sqr_err_train = mean_squared_error(y_train, pipe.predict(X_train))
        abs_err_train = mean_absolute_error(y_train, pipe.predict(X_train))
        arr = [s, d, abs_err_train, sqr_err_train]
    else:
        sqr_err_test = mean_squared_error(y_test, pipe.predict(X_test))
        abs_err_test = mean_absolute_error(y_test, pipe.predict(X_test))
        arr = [s, d, abs_err_test, sqr_err_test]
    return arr
cols3 = ["set", "degree", "mae", "mse"]
df_arr = pd.DataFrame(columns=cols3)
for d in range(2, 9):
    series = [poly_fit(d, s) for s in ["train", "test"]]
    df_arr.loc[len(df_arr)] = series[0]
    df_arr.loc[len(df_arr)] = series[1]
df_arr
set | degree | mae | mse | |
---|---|---|---|---|
0 | train | 2 | 2.509779e+01 | 1.101053e+03 |
1 | test | 2 | 3.190195e+01 | 1.964577e+03 |
2 | train | 3 | 1.266801e+01 | 3.025544e+02 |
3 | test | 3 | 5.312205e+01 | 1.391856e+04 |
4 | train | 4 | 1.195453e-09 | 2.743645e-18 |
5 | test | 4 | 9.918989e+02 | 1.139071e+07 |
6 | train | 5 | 1.572095e-10 | 5.547747e-20 |
7 | test | 5 | 9.684616e+02 | 1.775343e+07 |
8 | train | 6 | 2.174201e-10 | 1.022347e-19 |
9 | test | 6 | 1.372796e+03 | 3.054070e+07 |
10 | train | 7 | 1.094435e-10 | 4.879715e-20 |
11 | test | 7 | 2.205851e+03 | 6.273823e+07 |
12 | train | 8 | 1.352969e-10 | 6.088622e-20 |
13 | test | 8 | 3.773978e+03 | 1.645278e+08 |
The Polynomial Features Pipeline shows that beyond degree 3 the results become unreasonable: the training errors collapse toward zero while the test errors explode, a clear sign of overfitting. At degree 2 the test errors stay reasonably close to the training errors, so a degree 2 (and, to a lesser extent, degree 3) regression is the most useful for predicting the “Stock_Rating” of the cars.
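To make that reading concrete, a small follow-up query on the df_arr table built above can pick out the degree with the lowest test error (a hedged convenience, not something the original analysis required):

# Sort the test rows by mean squared error; the top row is the degree that
# generalizes best on this split (degree 2 in the run shown above).
df_arr[df_arr["set"] == "test"].sort_values("mse").head(3)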
Decision Tree Regressor#
For the Decision Tree Regressor, the initial instantiation with max_leaf_nodes = 4 is a guess for determining the best results.
reg2 = DecisionTreeRegressor(max_leaf_nodes=4)
reg2.fit(X_train,y_train)
DecisionTreeRegressor(max_leaf_nodes=4)
mean_squared_error(y_train, reg2.predict(X_train))
4795.489271173062
mean_squared_error(y_test, reg2.predict(X_test))
6615.546457009281
mean_absolute_error(y_train, reg2.predict(X_train))
55.78619206880076
mean_absolute_error(y_test, reg2.predict(X_test))
64.99354430548637
From the values produced we can see that the test errors are somewhat higher than the training errors for both mean_squared_error and mean_absolute_error, but they remain of the same order, so the Decision Tree Regressor is a reasonable model for predicting the “Stock_Rating” of the cars. Given that our initial guess of max_leaf_nodes = 4 already fits reasonably, it should be possible to improve the results, reducing both errors while keeping the test error close to the training error, by using a test error curve.
df_err1 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2, 40):
    dtr = DecisionTreeRegressor(max_leaf_nodes=i)
    dtr.fit(X_train, y_train)
    d = {"leaves": i, "error": (1 - dtr.score(X_train, y_train)), "set": "train"}
    d2 = {"leaves": i, "error": (1 - dtr.score(X_test, y_test)), "set": "test"}
    df_err1.loc[len(df_err1)] = d
    df_err1.loc[len(df_err1)] = d2
c = alt.Chart(df_err1).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c
As we can see from the test error curve, max_leaf_nodes = 10 would produce slightly better results, given that it is the first point where the test curve switches from a negative slope to a positive one, i.e. roughly the minimum test error before overfitting sets in.
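Rather than eyeballing the curve, the same df_err1 table can be queried for the leaf count with the smallest test error, and the regressor refit with that value (a hedged sketch; the exact minimum may shift slightly between runs because of the random split):

# Leaf count with the lowest test error on this particular split.
best_row = df_err1[df_err1["set"] == "test"].sort_values("error").iloc[0]
best_leaves = int(best_row["leaves"])

reg3 = DecisionTreeRegressor(max_leaf_nodes=best_leaves)
reg3.fit(X_train, y_train)
mean_absolute_error(y_test, reg3.predict(X_test))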
Random Forest Regressor#
rfg = RandomForestRegressor(n_estimators=200, max_leaf_nodes=13)
rfg.fit(X_train,y_train)
RandomForestRegressor(max_leaf_nodes=13, n_estimators=200)
mean_squared_error(y_train, rfg.predict(X_train))
906.1045239852818
mean_squared_error(y_test, rfg.predict(X_test))
1609.5503655502175
mean_absolute_error(y_train, rfg.predict(X_train))
23.802718069907115
mean_absolute_error(y_test, rfg.predict(X_test))
29.794632547022715
From the values produced we can see that the test errors are higher than the training errors but stay in the same range for both mean_squared_error and mean_absolute_error, and both are much smaller than for the single decision tree, indicating that the Random Forest Regressor is a good model for predicting the “Stock_Rating” of the cars.
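As a small extra (not required for the comparison above), the fitted random forest also exposes feature importances, which hint at which in-game stats drive the predicted rating the most. This assumes the rfg model fitted above.

# Relative contribution of each stat to the forest's predictions.
pd.Series(rfg.feature_importances_, index=cols1).sort_values(ascending=False)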
Predicting the Car’s Classes and Assessing their Accuracy#
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
Here we instantiate a new train_test_split because our target column, “stock_specs”, takes six different class values (D, C, B, A, S1, S2), so a new split is needed for our classification models. We will check how accurate each classification model is using the .score() method. As with the regressions, we want the test score to stay close to the training score; a much lower test score indicates overfitting.
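A quick look at how the cars are distributed across those classes helps put the accuracy scores below in context (a hedged aside, assuming the cleaned df from above):

# Number of cars in each stock class; an imbalanced distribution makes
# raw accuracy a slightly less informative metric.
df["stock_specs"].value_counts()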
X_train, X_test, y_train, y_test = train_test_split(df[cols1], df["stock_specs"], test_size=0.7)
Logistic Regression#
clf = LogisticRegression(max_iter=100000)
clf.fit(X_train, y_train)
LogisticRegression(max_iter=100000)
clf.score(X_train, y_train)
0.8260869565217391
clf.score(X_test, y_test)
0.723404255319149
As we can see, the score produced by the test split (about 72%) is noticeably lower than that of the train split (about 83%), indicating some overfitting and suggesting that the Logistic Regression model is not a particularly good fit for predicting the “stock_specs” of our cars.
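To see where the misclassifications land, a confusion matrix on the test split can be informative (a hedged extra, not part of the original analysis; it assumes the fitted clf from above):

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes, in clf.classes_ order;
# most of the confusion is expected between neighboring classes such as C and B.
pd.DataFrame(
    confusion_matrix(y_test, clf.predict(X_test), labels=clf.classes_),
    index=clf.classes_,
    columns=clf.classes_,
)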
Decision Tree Classifier#
For the Decision Tree Classifier, the initial instantiation with max_leaf_nodes = 9 is a guess for determining the best results.
clf2 = DecisionTreeClassifier(max_leaf_nodes=9)
clf2.fit(X_train,y_train)
DecisionTreeClassifier(max_leaf_nodes=9)
clf2.score(X_train, y_train)
0.8136645962732919
clf2.score(X_test, y_test)
0.6781914893617021
As we can see, the score produced by the test split is noticeably lower than that of the train split, indicating that the Decision Tree Classifier model is not a good fit for predicting the “stock_specs” of our cars.
clf2.classes_
array(['A', 'B', 'C', 'D', 'S1', 'S2'], dtype=object)
df_err2 = pd.DataFrame(columns = ['leaves', 'error', 'set'])
for i in range(2, 40):
    dtc = DecisionTreeClassifier(max_leaf_nodes=i)
    dtc.fit(X_train, y_train)
    d3 = {"leaves": i, "error": (1 - dtc.score(X_train, y_train)), "set": "train"}
    d4 = {"leaves": i, "error": (1 - dtc.score(X_test, y_test)), "set": "test"}
    df_err2.loc[len(df_err2)] = d3
    df_err2.loc[len(df_err2)] = d4
c = alt.Chart(df_err2).mark_line().encode(
    x = alt.X("leaves", scale=alt.Scale(zero=False)),
    y = alt.Y("error", scale=alt.Scale(zero=False)),
    color = "set"
)
c
As we can see from the test error curve, max_leaf_nodes = 8 would produce slightly better results, given that it is the first point where the test curve switches from a negative slope to a positive one. But given that our original model did not produce favorable results with max_leaf_nodes = 9, it is possible that the model still won't be good at predicting the “stock_specs” of our cars.
Random Forest Classifier#
rfc = RandomForestClassifier(n_estimators=200,max_leaf_nodes=9)
rfc.fit(X_train,y_train)
RandomForestClassifier(max_leaf_nodes=9, n_estimators=200)
rfc.score(X_train,y_train)
0.8944099378881988
rfc.score(X_test,y_test)
0.7420212765957447
As we can see, the score produced by the test split is again noticeably lower than that of the train split, indicating that the Random Forest Classifier model is also not a good fit for predicting the “stock_specs” of our cars, although it achieves the best test accuracy of the three classifiers.
Checking the Adequacy of Our Classification Models#
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
As the extra portion of my project, I chose to use log_loss to determine the adequacy of the different classification models. The lower the value, the better the model is suited to predicting “stock_specs”. This function only works for classification models, as it needs the predicted class probabilities to compute its value.
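As a tiny illustration of how log_loss behaves (made-up values, not drawn from the car data), confident correct probabilities give a small loss while confident wrong ones are penalized heavily:

# Two-class toy example with labels "A" and "B" (probability columns in alphabetical order).
y_true = ["A", "B", "A"]
good_probs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]   # mostly confident and correct
bad_probs = [[0.2, 0.8], [0.7, 0.3], [0.3, 0.7]]    # mostly confident and wrong
print(log_loss(y_true, good_probs))   # small value
print(log_loss(y_true, bad_probs))    # much larger value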
Logistic Regression#
log_loss(y_train, clf.predict_proba(X_train))
0.5002982111368205
log_loss(y_test, clf.predict_proba(X_test))
0.6696431553535065
As we can see, the test log loss is somewhat higher than the training log loss, suggesting that our model clf, the Logistic Regression, is not an entirely adequate model for our predictions.
Decision Tree Classifier#
log_loss(y_train, clf2.predict_proba(X_train))
0.46980498391662157
log_loss(y_test, clf2.predict_proba(X_test))
2.8158974669983867
As we can see, the test log loss is dramatically higher than the training log loss, indicating that our model clf2, the Decision Tree Classifier, is not an adequate model for our predictions.
Random Forest Classifier#
log_loss(y_train, rfc.predict_proba(X_train))
0.48410537656977254
log_loss(y_test, rfc.predict_proba(X_test))
0.6558411789879937
As we can see, the test log loss is somewhat higher than the training log loss, indicating that our model rfc, the Random Forest Classifier, is not an entirely adequate model for our predictions either, although it has the lowest test log loss of the three classifiers.
Summary#
To conclude this project, we can see that all of our regression models were a good fit for predicting the “Stock_Rating” of our cars. In retrospect, the data is rather linear in nature, since a higher value of an individual stat corresponds fairly directly to a higher “Stock_Rating”, so it was not difficult for the models to predict the corresponding values. On the other hand, it came as a surprise that the classification models were not as successful at predicting the “stock_specs” of the cars. From my in-game knowledge, I believe this is because the six in-game classes do not divide the rating values into equal parts; rather, each class covers its own range of rating values, and cars near a class boundary are easy to misclassify.
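That last point can be checked directly from the data: grouping the cleaned DataFrame by class shows each class covering its own band of rating values rather than an even split (a hedged check, assuming the df, columns, and imports defined earlier in the notebook):

# Rating range and car count for each stock class; the bands differ in width,
# which is consistent with the classes not dividing the ratings evenly.
df.groupby("stock_specs")["Stock_Rating"].agg(["min", "max", "count"])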
References#
What is the source of your dataset(s)?
This dataset comes from Kaggle, found here. It contains all of the vehicles that can be acquired in the game Forza Horizon 5, along with their respective stats.
List any other references that you found helpful.
My other references are from ChatGPT; the code blocks that made use of it are noted above.