Analysis of the Factors That Influence a Car's Level

Author: Dongyu Cao

Course Project, UC Irvine, Math 10, S22

Introduction

Nowadays people sort cars into different levels based on lap time and similar metrics, with S being the best and A00 the worst. Many factors affect a car's lap time on the track, including the car's horsepower and the weather. Here I have selected one driver's lap times at the Beijing Ruisi racetrack for analysis, so that we can build a random-forest model to predict a car's level from the information we have. KNN was then used to create a model that predicts lap times from horsepower and tail speed.

Main portion of the project

Data Cleaning and Feature Engineering

Check whether there are any missing values that might influence the analysis.

import pandas as pd
import altair as alt
df=pd.read_csv("race10.csv")
df.isna().any(axis=0)  # locate which columns contain missing values
CAR                       False
LAPTIME                   False
Horsepower (Ps)           False
Powertrain                False
Modification Level        False
Tire                      False
Temp(℃)                   False
Tail speed                 True
0-100                      True
Level                     False
car_displacement_Liter     True
Trubo                     False
dtype: bool
df
CAR LAPTIME Horsepower (Ps) Powertrain Modification Level Tire Temp(℃) Tail speed 0-100 Level car_displacement_Liter Trubo
0 911 GT3RS(18 52.3 570 7DCT 1 Cup2R 26 164.960 3.29 S 4.0 True
1 BMW M4(17 53.5 521 7DCT 2 F200 13 163.465 NaN B 3.0 True
2 BMW M4(17 54.8 521 7DCT 2 TD 0 159.490 NaN B 3.0 True
3 911 S(20 54.8 450 8DCT 0 P0 PZ4 13 163.000 3.20 S 3.0 True
4 BMW M4(17 55.4 431 7DCT 1 Cup 2C 32 155.490 4.49 B 3.0 True
... ... ... ... ... ... ... ... ... ... ... ... ...
176 V3(15 68.8 120 5MT 0 Original manufacturer 28 NaN NaN A 1.5 False
177 Beetle(06 69.8 115 6AT 1 Original manufacturer 21 NaN NaN A 2.0 False
178 ALTO(13 71.3 71 5MT 0 Original manufacturer 30 NaN NaN A00 1.0 False
179 WulingEV(22 76.0 41 E 0 Original manufacturer 27 98.180 17.12 A00 0.0 False
180 WulingEV(20 87.5 27 E 0 Original manufacturer 10 70.000 0.00 A00 0.0 False

181 rows × 12 columns

df.corr()  # examine the correlations among the numeric columns
LAPTIME Horsepower (Ps) Modification Level Temp(℃) Tail speed 0-100 car_displacement_Liter Trubo
LAPTIME 1.000000 -0.705073 -0.334042 0.019101 -0.922816 0.552319 -0.510130 -0.326764
Horsepower (Ps) -0.705073 1.000000 0.005825 0.038513 0.840555 -0.765410 0.635993 0.183706
Modification Level -0.334042 0.005825 1.000000 -0.018087 0.203894 -0.006618 0.049866 -0.042773
Temp(℃) 0.019101 0.038513 -0.018087 1.000000 0.043525 -0.001694 0.050947 -0.039907
Tail speed -0.922816 0.840555 0.203894 0.043525 1.000000 -0.668005 0.612480 0.218554
0-100 0.552319 -0.765410 -0.006618 -0.001694 -0.668005 1.000000 -0.414595 -0.207787
car_displacement_Liter -0.510130 0.635993 0.049866 0.050947 0.612480 -0.414595 1.000000 0.038610
Trubo -0.326764 0.183706 -0.042773 -0.039907 0.218554 -0.207787 0.038610 1.000000

Throughout the dataset, we find missing data in the Tail speed, 0-100, and car_displacement_Liter columns. Directly dropping all the rows with missing data would greatly reduce the data available for model building, so for the sake of the final model we impute the null values in those columns with the mean value of the cars sharing the same Powertrain. These features are strongly related to the powertrain: the faster the gearbox shifts, the better the car's acceleration tends to be.

# For each column with missing values: first flag which rows were
# originally missing, then fill each gap with the mean value of the
# cars that share the same Powertrain.
df["Tail_speed_null"] = df["Tail speed"].isna().astype(int)
df["Tail speed"] = df.groupby("Powertrain")["Tail speed"].transform(lambda x: x.fillna(x.mean()))
df["0-100_null"] = df["0-100"].isna().astype(int)
df["0-100"] = df.groupby("Powertrain")["0-100"].transform(lambda x: x.fillna(x.mean()))
df["car_displacement_null"] = df["car_displacement_Liter"].isna().astype(int)
df["car_displacement_Liter"] = df.groupby("Powertrain")["car_displacement_Liter"].transform(lambda x: x.fillna(x.mean()))

For each imputed column we also add an indicator column, so that later we can still tell which values were missing in the original data.

df["Tail_speed_null"].value_counts()
0    147
1     34
Name: Tail_speed_null, dtype: int64
df["0-100_null"].value_counts()
0    152
1     29
Name: 0-100_null, dtype: int64
# check that no missing values remain
df.isna().any(axis=0)
CAR                       False
LAPTIME                   False
Horsepower (Ps)           False
Powertrain                False
Modification Level        False
Tire                      False
Temp(℃)                   False
Tail speed                False
0-100                     False
Level                     False
car_displacement_Liter    False
Trubo                     False
Tail_speed_null           False
0-100_null                False
car_displacement_null     False
dtype: bool
brush = alt.selection_interval()
c1=  alt.Chart(df).mark_circle().encode(
    x=alt.X("Tail speed",scale=alt.Scale(zero=False)),
    y=alt.Y("Horsepower (Ps)",scale=alt.Scale(zero=False)),
    color=alt.Color("Level",scale=alt.Scale(scheme='rainbow')),
    tooltip=["Temp(℃)","car_displacement_Liter"]
).properties(
    title="Interactive Relationship between Horsepower and tail speed, Level "
).add_selection(brush)
c = alt.Chart(df).mark_bar().encode(
    x="Level",
    y="count()",
).transform_filter(brush)
c2 = alt.Chart(df).mark_boxplot().encode(
    x="Level",
    y="0-100",
).transform_filter(brush)
c1
c&c1&c2

Horsepower

Horsepower plays a critical role in distinguishing levels: horsepower drops drastically for the lowest class.

Tail Speed

The graph also indicates that the higher the tail speed, the more likely the car is to be Level S.

0-100

According to the graph, the median time for an S-level car to accelerate to 100 km/h is less than that for an A00 car. This is evidence that the 0-100 time influences the level of the car.
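
As a quick numerical check of these three observations, we can compare the median of each feature across levels (a minimal sketch, assuming the df prepared above is still in memory):

# Median horsepower, tail speed, and 0-100 time for each level;
# S-level cars should show the highest power and speed and the quickest 0-100.
df.groupby("Level")[["Horsepower (Ps)", "Tail speed", "0-100"]].median()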

Decision Trees and Random Forests

from sklearn.model_selection import train_test_split
cols = ["Horsepower (Ps)", "0-100", "Tail speed", "car_displacement_Liter", "Trubo", "LAPTIME"]
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["Level"], test_size=0.3, random_state=0)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
clf.feature_names_in_
array(['Horsepower (Ps)', '0-100', 'Tail speed', 'car_displacement_Liter',
       'Trubo', 'LAPTIME'], dtype=object)
fig = plt.figure(figsize=(200, 100))  # a very large figure so the tree labels stay readable
plot_tree(
    clf,
    feature_names=clf.feature_names_in_,
    filled=True
);
[Figure: visualization of the fitted decision tree]
clf.score(X_train, y_train)
0.7619047619047619
clf.score(X_test, y_test)
0.45454545454545453

There is quite a gap between the train and test scores, so there is likely overfitting here. So I decided to use a random forest to look for a better model.
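
Before switching models, one way to confirm that the gap is not just an artifact of this particular split is cross-validation (a minimal sketch using scikit-learn's cross_val_score; the 5-fold choice is my own assumption):

from sklearn.model_selection import cross_val_score
# Mean accuracy over 5 different train/test splits; values well below
# the training accuracy above support the overfitting diagnosis.
cross_val_score(DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20), df[cols], df["Level"], cv=5).mean()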

from sklearn.ensemble import RandomForestClassifier
rfe = RandomForestClassifier(n_estimators=10000, max_leaf_nodes=25)
def error_check(model):
    # fit the given model, store its predictions, and return (train, test) accuracy
    model.fit(X_train, y_train)
    df["pred"] = model.predict(df[cols])
    return (model.score(X_train, y_train), model.score(X_test, y_test))
#tuple unpacking
train,test=error_check(clf)
train
0.7619047619047619
test
0.45454545454545453
train,test=error_check(rfe)
train
0.9761904761904762
test
0.5272727272727272

The RandomForestClassifier does not do well here either: it fits the training set almost perfectly (about 98% accuracy) while the test accuracy stays around 53%, which again indicates overfitting.
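
One way to reduce the overfitting would be to restrict the trees further. Here is a sketch comparing a few max_leaf_nodes values (the candidate values and the smaller n_estimators are arbitrary choices of mine, not the settings used above):

# Compare train/test accuracy as the trees get more restricted;
# a smaller gap between the two scores means less overfitting.
for leaves in (5, 10, 25):
    forest = RandomForestClassifier(n_estimators=200, max_leaf_nodes=leaves, random_state=0)
    forest.fit(X_train, y_train)
    print(leaves, forest.score(X_train, y_train), forest.score(X_test, y_test))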

rfe.feature_importances_
array([0.2551222 , 0.17843594, 0.16709328, 0.16009833, 0.04180281,
       0.19744743])

The higher a value in rfe.feature_importances_, the more influence that feature has on the classifier's decisions.
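
To see which number belongs to which feature, we can label the importances with the column names (a small convenience sketch):

# Pair each importance with its feature name and sort descending.
pd.Series(rfe.feature_importances_, index=rfe.feature_names_in_).sort_values(ascending=False)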

K-Nearest Neighbors

First we split the data into train and test sets, so that we can later compare the two and see whether the model is overfitting.

from sklearn.model_selection import train_test_split
# We choose horsepower and tail speed because our goal is to predict
# lap times from those two features. We use 70% of the data for training
# and fix random_state so the result is the same when the notebook is rerun.

X_train, X_test, y_train, y_test = train_test_split(df[["Horsepower (Ps)","Tail speed"]], df["LAPTIME"], test_size=0.3, random_state=0)
X_train.shape
(126, 2)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
reg = KNeighborsRegressor(n_neighbors=12)
reg.fit(X_train, y_train)
reg.predict(X_test)
array([61.34166667, 57.76666667, 61.65833333, 65.65833333, 64.00833333,
       61.79166667, 62.075     , 62.2       , 66.01666667, 64.99166667,
       56.975     , 64.4       , 64.85833333, 56.33333333, 58.79166667,
       66.01666667, 65.03333333, 61.475     , 59.475     , 61.21666667,
       60.1       , 61.75      , 62.075     , 57.40833333, 57.89166667,
       56.45833333, 63.90833333, 64.275     , 57.56666667, 65.65833333,
       62.075     , 56.76666667, 65.55      , 65.03333333, 56.76666667,
       57.64166667, 59.53333333, 64.90833333, 60.56666667, 61.675     ,
       66.40833333, 60.99166667, 59.59166667, 62.8       , 59.225     ,
       64.225     , 61.21666667, 57.64166667, 56.45833333, 59.61666667,
       61.475     , 57.40833333, 58.79166667, 59.74166667, 64.975     ])
mean_absolute_error(reg.predict(X_test), y_test)
1.938030303030304
mean_absolute_error(reg.predict(X_train), y_train)
1.9041666666666672
mean_squared_error(reg.predict(X_train), y_train)
7.99575672398589
mean_squared_error(reg.predict(X_test), y_test)
6.150948232323241

Visualization of KNN

dateset1 = df[["Horsepower (Ps)","Tail speed"]].copy()  # .copy() so adding columns does not raise a SettingWithCopyWarning
dateset1["Pred"] = reg.predict(dateset1)
dateset1["REL_LAPTIME"]=df["LAPTIME"]
dateset1
Horsepower (Ps) Tail speed Pred REL_LAPTIME
0 570 164.960000 56.333333 52.3
1 521 163.465000 56.433333 53.5
2 521 159.490000 56.433333 54.8
3 450 163.000000 57.766667 54.8
4 431 155.490000 57.408333 55.4
... ... ... ... ...
176 120 118.857143 65.658333 68.8
177 115 130.557000 65.233333 69.8
178 71 118.857143 66.408333 71.3
179 41 98.180000 67.983333 76.0
180 27 70.000000 67.825000 87.5

181 rows × 4 columns

chart1=alt.Chart(dateset1).mark_circle(color="red").encode(
    x="Horsepower (Ps)",
    y="Pred"
)
chart2=alt.Chart(dateset1).mark_circle().encode(
    x="Horsepower (Ps)",
    y="REL_LAPTIME",
)

ch=chart1+chart2
chart3=alt.Chart(dateset1).mark_circle(color="red").encode(
    x="Tail speed",
    y="Pred"
)
chart4=alt.Chart(dateset1).mark_circle(color="black").encode(
    x="Tail speed",
    y="REL_LAPTIME"
)
ch1=chart3+chart4
(chart1+chart2)&(chart3+chart4)
brush = alt.selection(type='interval', resolve='global')
chartbase =alt.Chart(dateset1).mark_circle(color="red").encode(
    color=alt.condition(brush, 'Pred', alt.ColorValue('black'))
).add_selection(
    brush
).properties(
    width=250,
    height=250
)

chartbase.encode(x='Tail speed',y="Pred") | chartbase.encode(x='Tail speed',y="REL_LAPTIME")| chartbase.encode(x='Horsepower (Ps)',y="Pred") | chartbase.encode(x='Horsepower (Ps)',y="REL_LAPTIME")

From the mean absolute error and mean squared error, we can tell that the difference between the test and train sets is small; the test MSE is even lower than the train MSE. Under these conditions the model is not overfitting when using K=12. The Altair charts above also let us double-check visually that the predictions track the real lap times.
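
To check that K=12 is a reasonable choice rather than a lucky one, we can sweep several values of K and compare the train and test errors (a sketch; the range of K values tried is my own assumption):

# Smaller K usually lowers the train error while raising the test error
# (overfitting); we look for a K where the two errors stay close together.
for k in range(2, 21, 2):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, mean_absolute_error(y_train, knn.predict(X_train)), mean_absolute_error(y_test, knn.predict(X_test)))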

Summary

Above we used two different scikit-learn models along with several Altair charts. It is clear that the decision tree and random forest models are not effective at determining which level a vehicle belongs to based on its lap time, tail speed, and 0-100 acceleration. However, the KNN model can determine lap time from horsepower and tail speed, and it predicts lap times effectively when K is equal to 12, as supported by the error metrics and the Altair charts. In conclusion, the higher the horsepower and tail speed, the lower (faster) the lap time will be.

References

  • Decision Trees and Random Forests: Math 10 lecture notes.

  • Feature engineering (group-mean imputation with null-indicator columns) adapted from the Titanic random forest notebook: https://www.kaggle.com/code/zlatankr/titanic-random-forest-82-78

