Analysis of the Factors That Influence a Car's Level

Author: Dongyu Cao

Course Project, UC Irvine, Math 10, S22

Introduction

Nowadays people sort cars into different levels based on lap time and similar metrics, with S being the best and A00 the worst. Many factors affect a car's lap time on the track, including the car's horsepower and the weather. Here I have selected one driver's lap times at the Beijing Ruisi racetrack for analysis, so that we can build a random-forest model to predict a car's level from the information we have. KNN was then used to create a model that predicts lap times from horsepower and tail speed.

Main portion of the project

Data Cleaning and Feature Engineering

Check whether there are any missing values that might influence the analysis.

import pandas as pd
import altair as alt
df=pd.read_csv("race10.csv")
df.isna().any(axis=0)  # locate which columns contain missing values
CAR                       False
LAPTIME                   False
Horsepower (Ps)           False
Powertrain                False
Modification Level        False
Tire                      False
Temp(℃)                   False
Tail speed                 True
0-100                      True
Level                     False
car_displacement_Liter     True
Trubo                     False
dtype: bool
df
CAR LAPTIME Horsepower (Ps) Powertrain Modification Level Tire Temp(℃) Tail speed 0-100 Level car_displacement_Liter Trubo
0 911 GT3RS(18 52.3 570 7DCT 1 Cup2R 26 164.960 3.29 S 4.0 True
1 BMW M4(17 53.5 521 7DCT 2 F200 13 163.465 NaN B 3.0 True
2 BMW M4(17 54.8 521 7DCT 2 TD 0 159.490 NaN B 3.0 True
3 911 S(20 54.8 450 8DCT 0 P0 PZ4 13 163.000 3.20 S 3.0 True
4 BMW M4(17 55.4 431 7DCT 1 Cup 2C 32 155.490 4.49 B 3.0 True
... ... ... ... ... ... ... ... ... ... ... ... ...
176 V3(15 68.8 120 5MT 0 Original manufacturer 28 NaN NaN A 1.5 False
177 Beetle(06 69.8 115 6AT 1 Original manufacturer 21 NaN NaN A 2.0 False
178 ALTO(13 71.3 71 5MT 0 Original manufacturer 30 NaN NaN A00 1.0 False
179 WulingEV(22 76.0 41 E 0 Original manufacturer 27 98.180 17.12 A00 0.0 False
180 WulingEV(20 87.5 27 E 0 Original manufacturer 10 70.000 0.00 A00 0.0 False

181 rows × 12 columns

df.corr()  # examine the correlations among the numeric columns
LAPTIME Horsepower (Ps) Modification Level Temp(℃) Tail speed 0-100 car_displacement_Liter Trubo
LAPTIME 1.000000 -0.705073 -0.334042 0.019101 -0.922816 0.552319 -0.510130 -0.326764
Horsepower (Ps) -0.705073 1.000000 0.005825 0.038513 0.840555 -0.765410 0.635993 0.183706
Modification Level -0.334042 0.005825 1.000000 -0.018087 0.203894 -0.006618 0.049866 -0.042773
Temp(℃) 0.019101 0.038513 -0.018087 1.000000 0.043525 -0.001694 0.050947 -0.039907
Tail speed -0.922816 0.840555 0.203894 0.043525 1.000000 -0.668005 0.612480 0.218554
0-100 0.552319 -0.765410 -0.006618 -0.001694 -0.668005 1.000000 -0.414595 -0.207787
car_displacement_Liter -0.510130 0.635993 0.049866 0.050947 0.612480 -0.414595 1.000000 0.038610
Trubo -0.326764 0.183706 -0.042773 -0.039907 0.218554 -0.207787 0.038610 1.000000

Throughout the dataset, we find missing data in the Tail speed, 0-100, and car_displacement_Liter columns. Directly dropping all the rows with missing data would greatly reduce the data available for model building, so for the sake of the final model we impute the null values in those columns with the mean value of the cars sharing the same Powertrain. These features are strongly related to the powertrain: the faster the gearbox shifts, the better the car's acceleration tends to be.

# For each column with missing values: first flag which rows were
# originally missing, then fill each gap with the mean value of the
# cars that share the same Powertrain.
df["Tail_speed_null"] = df["Tail speed"].isna().astype(int)
df["Tail speed"] = df.groupby("Powertrain")["Tail speed"].transform(lambda x: x.fillna(x.mean()))
df["0-100_null"] = df["0-100"].isna().astype(int)
df["0-100"] = df.groupby("Powertrain")["0-100"].transform(lambda x: x.fillna(x.mean()))
df["car_displacement_null"] = df["car_displacement_Liter"].isna().astype(int)
df["car_displacement_Liter"] = df.groupby("Powertrain")["car_displacement_Liter"].transform(lambda x: x.fillna(x.mean()))

For each imputed column we also add an indicator column, so that later we can still tell which values were missing in the original data.

df["Tail_speed_null"].value_counts()
0    147
1     34
Name: Tail_speed_null, dtype: int64
df["0-100_null"].value_counts()
0    152
1     29
Name: 0-100_null, dtype: int64
# check that no missing values remain
df.isna().any(axis=0)
CAR                       False
LAPTIME                   False
Horsepower (Ps)           False
Powertrain                False
Modification Level        False
Tire                      False
Temp(℃)                   False
Tail speed                False
0-100                     False
Level                     False
car_displacement_Liter    False
Trubo                     False
Tail_speed_null           False
0-100_null                False
car_displacement_null     False
dtype: bool
brush = alt.selection_interval()
c1=  alt.Chart(df).mark_circle().encode(
    x=alt.X("Tail speed",scale=alt.Scale(zero=False)),
    y=alt.Y("Horsepower (Ps)",scale=alt.Scale(zero=False)),
    color=alt.Color("Level",scale=alt.Scale(scheme='rainbow')),
    tooltip=["Temp(℃)","car_displacement_Liter"]
).properties(
    title="Interactive Relationship between Horsepower and tail speed, Level "
).add_selection(brush)
c = alt.Chart(df).mark_bar().encode(
    x="Level",
    y="count()",
).transform_filter(brush)
c2 = alt.Chart(df).mark_boxplot().encode(
    x="Level",
    y="0-100",
).transform_filter(brush)
c1
c&c1&c2

Horsepower

Horsepower plays a critical role in distinguishing levels: horsepower drops drastically for the lowest class.

Tail Speed

The graph also indicates that the higher the tail speed, the more likely the car is to be Level S.

0-100

According to the graph, the median time for an S-level car to accelerate to 100 km/h is less than that for an A00 car. This is evidence that the 0-100 time influences the level of the car.
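
As a quick numerical check of these three observations, we can compare the median of each feature across levels (a minimal sketch, assuming the df prepared above is still in memory):

# Median horsepower, tail speed, and 0-100 time for each level;
# S-level cars should show the highest power and speed and the quickest 0-100.
df.groupby("Level")[["Horsepower (Ps)", "Tail speed", "0-100"]].median()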

Decision Trees and Random Forests

from sklearn.model_selection import train_test_split
cols = ["Horsepower (Ps)", "0-100", "Tail speed", "car_displacement_Liter", "Trubo", "LAPTIME"]
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["Level"], test_size=0.3, random_state=0)
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
clf.feature_names_in_
array(['Horsepower (Ps)', '0-100', 'Tail speed', 'car_displacement_Liter',
       'Trubo', 'LAPTIME'], dtype=object)
fig = plt.figure(figsize=(200, 100))  # a very large figure so the tree labels stay readable
plot_tree(
    clf,
    feature_names=clf.feature_names_in_,
    filled=True
);
[Figure: visualization of the fitted decision tree]
clf.score(X_train, y_train)
0.7619047619047619
clf.score(X_test, y_test)
0.45454545454545453

There is quite a gap between the train and test scores, so there is likely overfitting here. So I decided to use a random forest to look for a better model.
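
Before switching models, one way to confirm that the gap is not just an artifact of this particular split is cross-validation (a minimal sketch using scikit-learn's cross_val_score; the 5-fold choice is my own assumption):

from sklearn.model_selection import cross_val_score
# Mean accuracy over 5 different train/test splits; values well below
# the training accuracy above support the overfitting diagnosis.
cross_val_score(DecisionTreeClassifier(max_depth=4, max_leaf_nodes=20), df[cols], df["Level"], cv=5).mean()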

from sklearn.ensemble import RandomForestClassifier
rfe = RandomForestClassifier(n_estimators=10000, max_leaf_nodes=25)
def error_check(model):
    # fit the given model, store its predictions, and return (train, test) accuracy
    model.fit(X_train, y_train)
    df["pred"] = model.predict(df[cols])
    return (model.score(X_train, y_train), model.score(X_test, y_test))
#tuple unpacking
train,test=error_check(clf)
train
0.7619047619047619
test
0.45454545454545453
train,test=error_check(rfe)
train
0.9761904761904762
test
0.5272727272727272

The RandomForestClassifier does not do well here either: it fits the training set almost perfectly (about 98% accuracy) while the test accuracy stays around 53%, which again indicates overfitting.
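
One way to reduce the overfitting would be to restrict the trees further. Here is a sketch comparing a few max_leaf_nodes values (the candidate values and the smaller n_estimators are arbitrary choices of mine, not the settings used above):

# Compare train/test accuracy as the trees get more restricted;
# a smaller gap between the two scores means less overfitting.
for leaves in (5, 10, 25):
    forest = RandomForestClassifier(n_estimators=200, max_leaf_nodes=leaves, random_state=0)
    forest.fit(X_train, y_train)
    print(leaves, forest.score(X_train, y_train), forest.score(X_test, y_test))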

rfe.feature_importances_
array([0.2551222 , 0.17843594, 0.16709328, 0.16009833, 0.04180281,
       0.19744743])

The higher a value in rfe.feature_importances_, the more influence that feature has on the classifier's decisions.
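
To see which number belongs to which feature, we can label the importances with the column names (a small convenience sketch):

# Pair each importance with its feature name and sort descending.
pd.Series(rfe.feature_importances_, index=rfe.feature_names_in_).sort_values(ascending=False)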

K-Nearest Neighbors

First we split the data into train and test sets, so that we can later compare the two and see whether the model is overfitting.

from sklearn.model_selection import train_test_split
# We choose horsepower and tail speed because our goal is to predict
# lap times from those two features. We use 70% of the data for training
# and fix random_state so the result is the same when the notebook is rerun.

X_train, X_test, y_train, y_test = train_test_split(df[["Horsepower (Ps)","Tail speed"]], df["LAPTIME"], test_size=0.3, random_state=0)
X_train.shape
(126, 2)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error
reg = KNeighborsRegressor(n_neighbors=12)
reg.fit(X_train, y_train)
reg.predict(X_test)
array([61.34166667, 57.76666667, 61.65833333, 65.65833333, 64.00833333,
       61.79166667, 62.075     , 62.2       , 66.01666667, 64.99166667,
       56.975     , 64.4       , 64.85833333, 56.33333333, 58.79166667,
       66.01666667, 65.03333333, 61.475     , 59.475     , 61.21666667,
       60.1       , 61.75      , 62.075     , 57.40833333, 57.89166667,
       56.45833333, 63.90833333, 64.275     , 57.56666667, 65.65833333,
       62.075     , 56.76666667, 65.55      , 65.03333333, 56.76666667,
       57.64166667, 59.53333333, 64.90833333, 60.56666667, 61.675     ,
       66.40833333, 60.99166667, 59.59166667, 62.8       , 59.225     ,
       64.225     , 61.21666667, 57.64166667, 56.45833333, 59.61666667,
       61.475     , 57.40833333, 58.79166667, 59.74166667, 64.975     ])
mean_absolute_error(reg.predict(X_test), y_test)
1.938030303030304
mean_absolute_error(reg.predict(X_train), y_train)
1.9041666666666672
mean_squared_error(reg.predict(X_train), y_train)
7.99575672398589
mean_squared_error(reg.predict(X_test), y_test)
6.150948232323241

Visualization of KNN

dateset1 = df[["Horsepower (Ps)","Tail speed"]].copy()  # .copy() so adding columns does not raise a SettingWithCopyWarning
dateset1["Pred"] = reg.predict(dateset1)
dateset1["REL_LAPTIME"]=df["LAPTIME"]
dateset1
Horsepower (Ps) Tail speed Pred REL_LAPTIME
0 570 164.960000 56.333333 52.3
1 521 163.465000 56.433333 53.5
2 521 159.490000 56.433333 54.8
3 450 163.000000 57.766667 54.8
4 431 155.490000 57.408333 55.4
... ... ... ... ...
176 120 118.857143 65.658333 68.8
177 115 130.557000 65.233333 69.8
178 71 118.857143 66.408333 71.3
179 41 98.180000 67.983333 76.0
180 27 70.000000 67.825000 87.5

181 rows × 4 columns

chart1=alt.Chart(dateset1).mark_circle(color="red").encode(
    x="Horsepower (Ps)",
    y="Pred"
)
chart2=alt.Chart(dateset1).mark_circle().encode(
    x="Horsepower (Ps)",
    y="REL_LAPTIME",
)

ch=chart1+chart2
chart3=alt.Chart(dateset1).mark_circle(color="red").encode(
    x="Tail speed",
    y="Pred"
)
chart4=alt.Chart(dateset1).mark_circle(color="black").encode(
    x="Tail speed",
    y="REL_LAPTIME"
)
ch1=chart3+chart4
(chart1+chart2)&(chart3+chart4)
brush = alt.selection(type='interval', resolve='global')
chartbase =alt.Chart(dateset1).mark_circle(color="red").encode(
    color=alt.condition(brush, 'Pred', alt.ColorValue('black'))
).add_selection(
    brush
).properties(
    width=250,
    height=250
)

chartbase.encode(x='Tail speed',y="Pred") | chartbase.encode(x='Tail speed',y="REL_LAPTIME")| chartbase.encode(x='Horsepower (Ps)',y="Pred") | chartbase.encode(x='Horsepower (Ps)',y="REL_LAPTIME")

From the mean absolute error and mean squared error, we can tell that the difference between the test and train sets is small; the test MSE is even lower than the train MSE. Under these conditions the model is not overfitting when using K=12. The Altair charts above also let us double-check visually that the predictions track the real lap times.
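
To check that K=12 is a reasonable choice rather than a lucky one, we can sweep several values of K and compare the train and test errors (a sketch; the range of K values tried is my own assumption):

# Smaller K usually lowers the train error while raising the test error
# (overfitting); we look for a K where the two errors stay close together.
for k in range(2, 21, 2):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, mean_absolute_error(y_train, knn.predict(X_train)), mean_absolute_error(y_test, knn.predict(X_test)))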

Summary

Above we used two different scikit-learn models along with several Altair charts. It is clear that the decision tree and random forest models are not effective at determining which level a vehicle belongs to based on its lap time, tail speed, and 0-100 acceleration. However, the KNN model can determine lap time from horsepower and tail speed, and it predicts lap times effectively when K is equal to 12, as supported by the error metrics and the Altair charts. In conclusion, the higher the horsepower and tail speed, the lower (faster) the lap time will be.

References

  • Decision Trees and Random Forests: Math 10 lecture notes.

  • Feature engineering (group-mean imputation with null-indicator columns) adapted from the Titanic random forest notebook: https://www.kaggle.com/code/zlatankr/titanic-random-forest-82-78

