Car Price Prediction#

Author: Shuoyu Zhu

Course Project, UC Irvine, Math 10, S23

Introduction#

The main goal of this project is to predict the price of a car from both categorical and numerical features. This is achieved by converting each categorical feature into 0/1 indicator columns using OneHotEncoder. After encoding all categorical features as numeric ones, we fit a regression model to the dataset.

Imports#

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

Organizing the Dataset#

This dataset was acquired from Kaggle: https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge

df = pd.read_csv('car_price_prediction.csv')
df.head()
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags
0 45654403 13328 1399 LEXUS RX 450 2010 Jeep Yes Hybrid 3.5 186005 km 6.0 Automatic 4x4 04-May Left wheel Silver 12
1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep No Petrol 3 192000 km 6.0 Tiptronic 4x4 04-May Left wheel Black 8
2 45774419 8467 - HONDA FIT 2006 Hatchback No Petrol 1.3 200000 km 4.0 Variator Front 04-May Right-hand drive Black 2
3 45769185 3607 862 FORD Escape 2011 Jeep Yes Hybrid 2.5 168966 km 4.0 Automatic 4x4 04-May Left wheel White 0
4 45809263 11726 446 HONDA FIT 2014 Hatchback Yes Petrol 1.3 91901 km 4.0 Automatic Front 04-May Left wheel Silver 4

Here I clean up the dataset to prepare it for the regression model.

df['Levy'] = df['Levy'].replace('-', 0)  # missing levies are recorded as '-'
df['Levy'] = df['Levy'].astype(float)
df['total_price'] = df['Price'] + df['Levy']
df['Leather interior'] = df['Leather interior'].replace('Yes', 1)
df['Leather interior'] = df['Leather interior'].replace('No', 0)
# Here I delete the 'km' in the Mileage column and make it an int type.
# Credit to my other python course ICS31 for explaining the str.extract() function.
df['Mileage'] = df['Mileage'].str.extract(r'(\d+)', expand=False)
df['Mileage'] = df['Mileage'].astype(int)

Engine volume is not numeric because some entries contain the word "Turbo" while others do not:

df[df['Engine volume'].str.contains('Turbo')].head(2)
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags total_price
23 45814106 7840 0.0 FORD Transit 2001 Microbus 0 Diesel 2.0 Turbo 230000 4.0 Manual Front 02-Mar Left wheel White 0 7840.0
25 45782859 20385 0.0 MERCEDES-BENZ E 220 2006 Sedan 1 Diesel 2.2 Turbo 210000 4.0 Tiptronic Rear 04-May Left wheel Black 8 20385.0
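This project leaves Engine volume out of the model, but as a sketch of how the column could be made numeric, one could split it into a numeric volume plus a turbo indicator. The column names Turbo and Engine volume num below are hypothetical and not used anywhere else in the project.

# Hypothetical sketch (not applied below): split 'Engine volume' into
# a numeric volume and a 0/1 turbo flag
df['Turbo'] = df['Engine volume'].str.contains('Turbo').astype(int)
df['Engine volume num'] = df['Engine volume'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)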

Since some rows have an unreasonable total_price (Price + Levy) for a car, I restrict the range to $3,000–$50,000 to remove the extreme values.

df = df[(df['total_price'] > 3000) & (df['total_price'] < 50000)]

Exploring the Dataset#

I keep only the top 10 manufacturers in the dataset to produce clearer charts. This dataframe is also used in the prediction part.

df_pre = df[df['Manufacturer'].isin(df['Manufacturer'].value_counts()[0:10].index)]
df_visual = df_pre.sample(n=5000, random_state= 10086)

This donut chart shows the approximate share of each manufacturer in the dataset. What I don’t want to see is a manufacturer contributing only one or two rows, which would have no significant effect on the prediction and could complicate the process. Restricting to the top 10 manufacturers looks like a decent choice at the moment.

alt.Chart(df_visual).mark_arc(innerRadius=50).encode(
    theta="count()",
    color="Manufacturer",
    tooltip=['Manufacturer', 'count()']
)

Source for creating the donut chart: https://altair-viz.github.io/gallery/donut_chart.html

This boxplot shows the approximate price range of each brand. From here, we can see a correlation between total_price and brand: better-known brands tend to have a higher median and third quartile of total price.

alt.Chart(df_visual).mark_boxplot(extent='min-max', size = 30).encode(
    x='Manufacturer',
    y='total_price:Q'
).properties(width=400)

Source for creating the boxplot chart: https://altair-viz.github.io/gallery/boxplot.html

The following chart is an interesting visualization of how car color may affect price. From the density of the scatter plot, the most popular colors are clearly Black, Grey, Silver, and White. We can also see a drop in median price for less popular colors such as Green, Orange, and Purple.

c1=alt.Chart(df_visual).mark_circle().encode(
    x='Color',
    y='total_price'
)
c2=alt.Chart(df_visual).mark_line(color='red').encode(
    x='Color',
    y='median(total_price)'
)
c1+c2

Machine Learning#

To include categorical features in the regression model, I use the OneHotEncoder module to convert these features to numeric. OneHotEncoder expands a column of categorical values into several indicator columns. For example, if we have a column Color with 3 possible values, Red, Green, and Yellow, OneHotEncoder adds three columns: Color_Red, Color_Green, and Color_Yellow. For a red car, the values of these three columns will be 1, 0, 0.
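As a minimal illustration of that behavior on a toy dataframe (not part of the project's data; note that the generated names keep the original capitalization and are ordered alphabetically):

# Toy example: one-hot encode a single Color column
toy = pd.DataFrame({'Color': ['Red', 'Green', 'Yellow']})
enc = OneHotEncoder()
pd.DataFrame(
    enc.fit_transform(toy[['Color']]).toarray(),
    columns=enc.get_feature_names_out(['Color'])
)
# columns: Color_Green, Color_Red, Color_Yellow; the 'Red' row is 0, 1, 0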

OneHotEncoder is the Extra part of the project.

encoder = OneHotEncoder()
cols = ['Manufacturer', 'Fuel type', 'Gear box type', 'Leather interior', 'Drive wheels']
encoded_cols = pd.DataFrame(
    encoder.fit_transform(df_pre[cols]).toarray(), 
    columns=encoder.get_feature_names_out(cols)
)
# By resetting the index of df_pre, I can merge the two dataframes on their shared index.
df_reset_index = df_pre.reset_index(drop=True)
df_encoded = pd.merge(df_reset_index, encoded_cols, left_index=True, right_index=True)
df_encoded
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume ... Fuel type_Plug-in Hybrid Gear box type_Automatic Gear box type_Manual Gear box type_Tiptronic Gear box type_Variator Leather interior_0 Leather interior_1 Drive wheels_4x4 Drive wheels_Front Drive wheels_Rear
0 45654403 13328 1399.0 LEXUS RX 450 2010 Jeep 1 Hybrid 3.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 44731507 16621 1018.0 CHEVROLET Equinox 2011 Jeep 0 Petrol 3 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
2 45774419 8467 0.0 HONDA FIT 2006 Hatchback 0 Petrol 1.3 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 45769185 3607 862.0 FORD Escape 2011 Jeep 1 Hybrid 2.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 45809263 11726 446.0 HONDA FIT 2014 Hatchback 1 Petrol 1.3 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12257 45772306 5802 1055.0 MERCEDES-BENZ E 350 2013 Sedan 1 Diesel 3.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
12258 45798355 8467 0.0 MERCEDES-BENZ CLK 200 1999 Coupe 1 CNG 2.0 Turbo ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
12259 45778856 15681 831.0 HYUNDAI Sonata 2011 Sedan 1 Petrol 2.4 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
12260 45804997 26108 836.0 HYUNDAI Tucson 2010 Jeep 1 Diesel 2 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
12261 45793526 5331 1288.0 CHEVROLET Captiva 2007 Jeep 1 Diesel 2 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

12262 rows × 45 columns
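As an aside, the same index-based merge can be written more compactly with pd.concat; df_encoded_alt below is a hypothetical name for this equivalent result.

# Equivalent to the merge above: concatenate column-wise on the shared RangeIndex
df_encoded_alt = pd.concat([df_pre.reset_index(drop=True), encoded_cols], axis=1)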

From here on we only need the numeric columns in the dataframe.

# Keep only the numeric columns, then drop the IDs and the price components already combined into total_price
numeric_mask = df_encoded.dtypes.apply(lambda x: pd.api.types.is_numeric_dtype(x))
df_reg = df_encoded.loc[:, numeric_mask]
df_reg = df_reg.drop(['ID', 'Price', 'Levy'], axis = 1)
df_reg.head()
Prod. year Leather interior Mileage Cylinders Airbags total_price Manufacturer_BMW Manufacturer_CHEVROLET Manufacturer_FORD Manufacturer_HONDA ... Fuel type_Plug-in Hybrid Gear box type_Automatic Gear box type_Manual Gear box type_Tiptronic Gear box type_Variator Leather interior_0 Leather interior_1 Drive wheels_4x4 Drive wheels_Front Drive wheels_Rear
0 2010 1 186005 6.0 12 14727.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 2011 0 192000 6.0 8 17639.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
2 2006 0 200000 4.0 2 8467.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 2011 1 168966 4.0 0 4469.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 2014 1 91901 4.0 4 12172.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

5 rows × 32 columns
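A more concise equivalent of the numeric-column selection above uses select_dtypes; df_reg_alt is a hypothetical name for the same result.

# Equivalent selection: keep numeric columns, then drop the unused ones
df_reg_alt = df_encoded.select_dtypes(include='number').drop(['ID', 'Price', 'Levy'], axis=1)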

features = list(df_reg.columns)
features.remove('total_price')

Splitting the data using train_test_split to find the best-fitting max_leaf_nodes.

X_train, X_test, y_train, y_test = train_test_split(
    df_reg[features], 
    df_reg['total_price'], 
    test_size=0.3, 
    random_state=10086
)
df_err = pd.DataFrame(columns=['leaves', 'error', 'set'])
for i in range(50, 2000, 50):
    rfr = RandomForestRegressor(max_leaf_nodes=i)
    rfr.fit(X_train, y_train)
    # score() returns R^2, so 1 - score is the unexplained-variance error
    train_error = 1 - rfr.score(X_train, y_train)
    test_error = 1 - rfr.score(X_test, y_test)
    df_err.loc[len(df_err)] = {'leaves': i, 'error': train_error, 'set': 'train'}
    df_err.loc[len(df_err)] = {'leaves': i, 'error': test_error, 'set': 'test'}

From this graph we can see that the test error curve reaches a minimum around 600 leaves, with the error steadying around 0.3. Increasing max_leaf_nodes does not lead to overfitting in this case; I think it is because I used many features when fitting the model. To see the test curve turn upward and display overfitting, we might need to increase max_leaf_nodes toward 10k, which would take a huge amount of time for Deepnote to process (the code section above already takes around a minute). I therefore choose 1000 leaves for the prediction below. (Note that without a fixed random_state in RandomForestRegressor, the curve varies slightly between runs.)

alt.Chart(df_err).mark_line().encode(
    x = 'leaves',
    y = 'error',
    color = 'set',
    tooltip = 'leaves'
)
rfr2 = RandomForestRegressor(max_leaf_nodes=1000)
rfr2.fit(df_reg[features], df_reg['total_price'])
RandomForestRegressor(max_leaf_nodes=1000)
df_reg['pred'] = rfr2.predict(df_reg[features])
df_reg['index'] = df_reg.index

The two graphs below give us an overview of the performance of the prediction model.

Here I only use the first 20 rows of the dataset to visualize the prediction. The blue line represents the predicted price and the red line represents the actual total price of the car. The prediction curve tracks the actual price curve closely.

c3 = alt.Chart(df_reg.head(20)).mark_line().encode(
    x = 'index',
    y = 'pred'
)
c4 = alt.Chart(df_reg.head(20)).mark_line(color = 'red').encode(
    x = 'index',
    y = 'total_price'
)
c3+c4

Here is another display of the prediction with a larger sample. We can see that the points are scattered around the line y = x, which is what I expect.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x = 'pred',
    y = 'total_price'
)
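To make the comparison with y = x explicit, a reference line can be layered on top. This is a sketch, with the endpoint 50000 chosen to match the price cap set earlier.

c5 = alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x = 'pred',
    y = 'total_price'
)
# Reference line y = x from 0 to the $50,000 price cap
c6 = alt.Chart(pd.DataFrame({'v': [0, 50000]})).mark_line(color='red').encode(
    x = 'v:Q',
    y = 'v:Q'
)
c5+c6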

Error Analysis#

To create a good visualization of the error, I create an err_percentage column: the difference between the prediction and the actual price, as a fraction of the actual price, rounded to two decimal places.

df_reg['err'] = df_reg['pred'] - df_reg['total_price']
df_reg['err_percentage'] = df_reg['err'] / df_reg['total_price']
df_reg['err_percentage'] = round(df_reg['err_percentage'], 2)

We can see in the following chart that the error percentages are roughly normally distributed around 0. Most of the error percentages are between -25% and 25%. I think this is a decent result for our prediction model.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_bar().encode(
    y="count()",
    x="err_percentage"
).properties(width=500, height=300)
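As a quick numeric check of that claim, one can compute the share of predictions within ±25% of the actual price (a sketch using the err_percentage column defined above):

within_25 = df_reg['err_percentage'].between(-0.25, 0.25).mean()
print(f'{within_25:.1%} of predictions fall within 25% of the actual price')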

The following chart shows the relationship between err_percentage and total_price. Interestingly, the spread of the error percentage shrinks as total_price increases, which means the model performs worse on cheaper cars.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x="total_price",
    y="err_percentage"
).properties(width=500, height=300)

Summary#


In general, the model performs much better than I expected. I first use OneHotEncoder to encode all the non-numeric features in the dataframe as numbers, then apply a RandomForestRegressor to make the predictions. The visualizations are limited by Altair’s default 5,000-row cap, so I use the sample() function in every Altair chart with the same random_state.
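As an alternative to sampling, Altair's row limit can also be lifted entirely, at the cost of much larger notebook output; this is a standard Altair call, though it is not used in this project.

# Disable Altair's default 5,000-row cap (not used here; sampling keeps outputs small)
alt.data_transformers.disable_max_rows()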

References#


  • Dataset source: https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge

  • OneHotEncoder documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

  • Converting categorical data in pandas and scikit-learn: https://www.turing.com/kb/convert-categorical-data-in-pandas-and-scikit-learn
