Car Price Prediction#
Author: Shuoyu Zhu
Course Project, UC Irvine, Math 10, S23
Introduction#
The main goal of this project is to predict the price of a car from both categorical and numerical features. This is achieved by converting the categorical features into 0/1 indicator columns using OneHotEncoder; once all categorical features are encoded, a regression model is fitted to the dataset.
Imports#
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
Organizing the Dataset#
This dataset is acquired from Kaggle: https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
df = pd.read_csv('car_price_prediction.csv')
df.head()
 | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45654403 | 13328 | 1399 | LEXUS | RX 450 | 2010 | Jeep | Yes | Hybrid | 3.5 | 186005 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Silver | 12 |
1 | 44731507 | 16621 | 1018 | CHEVROLET | Equinox | 2011 | Jeep | No | Petrol | 3 | 192000 km | 6.0 | Tiptronic | 4x4 | 04-May | Left wheel | Black | 8 |
2 | 45774419 | 8467 | - | HONDA | FIT | 2006 | Hatchback | No | Petrol | 1.3 | 200000 km | 4.0 | Variator | Front | 04-May | Right-hand drive | Black | 2 |
3 | 45769185 | 3607 | 862 | FORD | Escape | 2011 | Jeep | Yes | Hybrid | 2.5 | 168966 km | 4.0 | Automatic | 4x4 | 04-May | Left wheel | White | 0 |
4 | 45809263 | 11726 | 446 | HONDA | FIT | 2014 | Hatchback | Yes | Petrol | 1.3 | 91901 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |
Here I clean up the dataset to prepare it for the regression model.
df['Levy'] = df['Levy'].replace('-', 0)
df['Levy'] = df['Levy'].astype(float)
df['total_price'] = df['Price'] + df['Levy']
df['Leather interior'] = df['Leather interior'].replace('Yes', 1)
df['Leather interior'] = df['Leather interior'].replace('No', 0)
# Here I remove the 'km' suffix from the Mileage column and convert it to int
# Credit to my other python course ICS31 for explaining the str.extract() function
df['Mileage'] = df['Mileage'].str.extract(r'(\d+)', expand=False)
df['Mileage'] = df['Mileage'].astype(int)
Engine volume is not numerical because some entries contain 'Turbo' while others do not:
df[df['Engine volume'].str.contains('Turbo')].head(2)
 | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags | total_price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
23 | 45814106 | 7840 | 0.0 | FORD | Transit | 2001 | Microbus | 0 | Diesel | 2.0 Turbo | 230000 | 4.0 | Manual | Front | 02-Mar | Left wheel | White | 0 | 7840.0 |
25 | 45782859 | 20385 | 0.0 | MERCEDES-BENZ | E 220 | 2006 | Sedan | 1 | Diesel | 2.2 Turbo | 210000 | 4.0 | Tiptronic | Rear | 04-May | Left wheel | Black | 8 | 20385.0 |
Since some cars have an unreasonable total_price (Price + Levy), I set a boundary from $3,000 to $50,000 to get rid of the extreme values.
df = df[(df['total_price'] > 3000) & (df['total_price'] < 50000)]
Exploring the Dataset#
I keep only the top 10 manufacturers in the dataset to form better charts. This dataframe is also used in the prediction part.
df_pre = df[df['Manufacturer'].isin(df['Manufacturer'].value_counts()[0:10].index)]
df_visual = df_pre.sample(n=5000, random_state= 10086)
This donut chart shows the approximate share of each manufacturer in the dataset. What I don’t want to see is a manufacturer contributing only one or two rows, which would have no significant effect on the prediction and could complicate the process. Keeping the top 10 manufacturers seems to give a decent balance at the moment.
alt.Chart(df_visual).mark_arc(innerRadius=50).encode(
theta="count()",
color="Manufacturer",
tooltip=['Manufacturer', 'count()']
)
Source for creating the donut chart: https://altair-viz.github.io/gallery/donut_chart.html
This boxplot shows the approximate price range of each brand. From here, we can see a correlation between total_price and brand: better-known brands tend to have a higher median and third quartile of total price.
alt.Chart(df_visual).mark_boxplot(extent='min-max', size = 30).encode(
x='Manufacturer',
y='total_price:Q'
).properties(width=400)
Source for creating the boxplot chart: https://altair-viz.github.io/gallery/boxplot.html
The following chart is an interesting visualization of how car color may affect price. From the density of the scatter plot, we can see that the most popular colors are clearly Black, Grey, Silver, and White. We can also see a drop in median price for less popular colors like Green, Orange, and Purple.
c1=alt.Chart(df_visual).mark_circle().encode(
x='Color',
y='total_price'
)
c2=alt.Chart(df_visual).mark_line(color='red').encode(
x='Color',
y='median(total_price)'
)
c1+c2
Machine Learning#
To include categorical features in the regression model, I use the OneHotEncoder module to convert these features to numeric form. OneHotEncoder breaks a column of categorical values into several indicator columns. For example, if we have a column Color with 3 possible values (Red, Green, and Yellow), OneHotEncoder adds three columns: Color_red, Color_green, and Color_yellow. For a red car, the values in these three columns will be 1, 0, 0.
OneHotEncoder is the Extra part of the project.
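Before applying it to the car data, here is a minimal toy sketch of the Color example above (the toy dataframe and variable names are made up for illustration):
toy = pd.DataFrame({'Color': ['Red', 'Green', 'Yellow']})
toy_enc = OneHotEncoder()
pd.DataFrame(
    toy_enc.fit_transform(toy[['Color']]).toarray(),
    columns=toy_enc.get_feature_names_out(['Color'])
)
# scikit-learn sorts the categories, so the columns come out as
# Color_Green, Color_Red, Color_Yellow, and the red car's row reads 0, 1, 0.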
encoder = OneHotEncoder()
cols = ['Manufacturer', 'Fuel type', 'Gear box type', 'Leather interior', 'Drive wheels']
encoded_cols = pd.DataFrame(
encoder.fit_transform(df_pre[cols]).toarray(),
columns=encoder.get_feature_names_out(cols)
)
# By resetting the index of df_pre, I can merge the two DataFrames on their index.
df_reset_index = df_pre.reset_index(drop=True)
df_encoded = pd.merge(df_reset_index, encoded_cols, left_index=True, right_index=True)
df_encoded
 | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | ... | Fuel type_Plug-in Hybrid | Gear box type_Automatic | Gear box type_Manual | Gear box type_Tiptronic | Gear box type_Variator | Leather interior_0 | Leather interior_1 | Drive wheels_4x4 | Drive wheels_Front | Drive wheels_Rear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45654403 | 13328 | 1399.0 | LEXUS | RX 450 | 2010 | Jeep | 1 | Hybrid | 3.5 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
1 | 44731507 | 16621 | 1018.0 | CHEVROLET | Equinox | 2011 | Jeep | 0 | Petrol | 3 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 45774419 | 8467 | 0.0 | HONDA | FIT | 2006 | Hatchback | 0 | Petrol | 1.3 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 45769185 | 3607 | 862.0 | FORD | Escape | 2011 | Jeep | 1 | Hybrid | 2.5 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
4 | 45809263 | 11726 | 446.0 | HONDA | FIT | 2014 | Hatchback | 1 | Petrol | 1.3 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
12257 | 45772306 | 5802 | 1055.0 | MERCEDES-BENZ | E 350 | 2013 | Sedan | 1 | Diesel | 3.5 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
12258 | 45798355 | 8467 | 0.0 | MERCEDES-BENZ | CLK 200 | 1999 | Coupe | 1 | CNG | 2.0 Turbo | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
12259 | 45778856 | 15681 | 831.0 | HYUNDAI | Sonata | 2011 | Sedan | 1 | Petrol | 2.4 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
12260 | 45804997 | 26108 | 836.0 | HYUNDAI | Tucson | 2010 | Jeep | 1 | Diesel | 2 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
12261 | 45793526 | 5331 | 1288.0 | CHEVROLET | Captiva | 2007 | Jeep | 1 | Diesel | 2 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
12262 rows × 45 columns
From here on we only need the numeric columns in the dataframe.
# Build a boolean mask of the numeric columns, keep those, then drop the unused ones.
numeric_mask = df_encoded.dtypes.apply(lambda x: pd.api.types.is_numeric_dtype(x))
df_reg = df_encoded.loc[:, numeric_mask]
df_reg = df_reg.drop(['ID', 'Price', 'Levy'], axis=1)
df_reg.head()
 | Prod. year | Leather interior | Mileage | Cylinders | Airbags | total_price | Manufacturer_BMW | Manufacturer_CHEVROLET | Manufacturer_FORD | Manufacturer_HONDA | ... | Fuel type_Plug-in Hybrid | Gear box type_Automatic | Gear box type_Manual | Gear box type_Tiptronic | Gear box type_Variator | Leather interior_0 | Leather interior_1 | Drive wheels_4x4 | Drive wheels_Front | Drive wheels_Rear |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | 1 | 186005 | 6.0 | 12 | 14727.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
1 | 2011 | 0 | 192000 | 6.0 | 8 | 17639.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 2006 | 0 | 200000 | 4.0 | 2 | 8467.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 2011 | 1 | 168966 | 4.0 | 0 | 4469.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
4 | 2014 | 1 | 91901 | 4.0 | 4 | 12172.0 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
5 rows × 32 columns
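As an aside, pandas can do the same numeric-column selection in one line with select_dtypes; a sketch of the equivalent (this is not what the cell above ran):
# Equivalent shortcut: keep numeric columns, then drop the unused ones.
df_reg_alt = df_encoded.select_dtypes('number').drop(['ID', 'Price', 'Levy'], axis=1)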
features = list(df_reg.columns)
features.remove('total_price')
Splitting the data with train_test_split to find the best-fitting max_leaf_nodes:
X_train, X_test, y_train, y_test = train_test_split(
df_reg[features],
df_reg['total_price'],
test_size=0.3,
random_state=10086
)
df_err = pd.DataFrame(columns=['leaves', 'error', 'set'])
for i in range(50, 2000, 50):
rfr = RandomForestRegressor(max_leaf_nodes=i)
rfr.fit(X_train, y_train)
train_error = 1 - rfr.score(X_train, y_train)
test_error = 1 - rfr.score(X_test, y_test)
df_err.loc[len(df_err)] = {'leaves': i, 'error': train_error, 'set': 'train'}
df_err.loc[len(df_err)] = {'leaves': i, 'error': test_error, 'set': 'test'}
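One caveat: RandomForestRegressor is stochastic, so the error curve varies slightly between runs. If reproducibility matters, a fixed seed inside the loop would pin it down (an assumed variant, not what I ran):
rfr = RandomForestRegressor(max_leaf_nodes=i, random_state=10086)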
From this graph we can see that the error curve reaches a minimum around 600 leaves, and the test error steadies around 0.3. Increasing the maximum leaf nodes does not lead to overfitting in this case; I think that is because I used many features when fitting the model. To see the test curve turn upward and display overfitting, we might need to raise max_leaf_nodes to around 10k, which would take a huge amount of time for Deepnote to process (the code section above already took around 1 minute). I thus choose 1000 leaves for the prediction below.
alt.Chart(df_err).mark_line().encode(
x = 'leaves',
y = 'error',
color = 'set',
tooltip = 'leaves'
)
rfr2 = RandomForestRegressor(max_leaf_nodes=1000)
rfr2.fit(df_reg[features], df_reg['total_price'])
RandomForestRegressor(max_leaf_nodes=1000)
df_reg['pred'] = rfr2.predict(df_reg[features])
df_reg['index'] = df_reg.index
The two graphs below give us an overview of the performance of the prediction model. Note that these are in-sample predictions (rfr2 is fit on the full dataset), so they look somewhat more optimistic than the test error above.
Here I only use the first 20 rows of the dataset to visualize the prediction. The blue line represents the predicted price and the red line represents the actual total price. The prediction curve seems to fit the actual price curve well.
c3 = alt.Chart(df_reg.head(20)).mark_line().encode(
x = 'index',
y = 'pred'
)
c4 = alt.Chart(df_reg.head(20)).mark_line(color = 'red').encode(
x = 'index',
y = 'total_price'
)
c3+c4
Here is another display of the prediction with a larger sample. We can see that the points are scattered around the line y = x, which is what I expect.
alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
x = 'pred',
y = 'total_price'
)
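To make the y = x comparison explicit, a reference line can be layered on top; a sketch, where the endpoints 3000 and 50000 are assumed from the price filter earlier and c5/c6 are my own names:
# Diagonal reference line y = x over the assumed price range.
line_df = pd.DataFrame({'pred': [3000, 50000], 'total_price': [3000, 50000]})
c5 = alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x='pred',
    y='total_price'
)
c6 = alt.Chart(line_df).mark_line(color='red').encode(
    x='pred',
    y='total_price'
)
c5 + c6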
Error Analysis#
To create a good visualization of the error, I create the err_percentage column, which expresses the difference between the prediction and the actual price as a fraction of the actual price, rounded to 2 decimal places.
df_reg['err'] = df_reg['pred'] - df_reg['total_price']
df_reg['err_percentage'] = df_reg['err'] / df_reg['total_price']
df_reg['err_percentage'] = round(df_reg['err_percentage'], 2)
We can see in the following chart that the error percentages are roughly normally distributed around 0. Most of them fall between -25% and 25%. I think this is a decent result for the prediction model.
alt.Chart(df_reg.sample(5000, random_state=10086)).mark_bar().encode(
y="count()",
x="err_percentage"
).properties(width=500, height=300)
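The -25% to 25% claim can also be checked numerically; a small sketch (within_25 is my own name) computing the share of predictions within ±25% of the actual price:
within_25 = (df_reg['err_percentage'].abs() <= 0.25).mean()
within_25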
The following graph shows the relationship between err_percentage and total_price. Surprisingly, the spread of the error percentage shrinks as total_price increases, which means the model performs worse on cheaper cars.
alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
x="total_price",
y="err_percentage"
).properties(width=500, height=300)
Summary#
In general, the performance of the model is much better than I was expecting. I first used OneHotEncoder to encode all the non-numeric features of the dataframe into numerical form, then applied RandomForestRegressor to make the predictions. Because Altair limits how many rows a chart may use, I call the sample() function with the same random_state in every Altair chart.
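For reference, Altair's default cap is 5,000 rows per chart; instead of sampling, the cap can also be lifted entirely, at the cost of much heavier notebooks:
# Alternative to sampling: disable Altair's 5,000-row limit.
alt.data_transformers.disable_max_rows()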
References#
https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
https://www.turing.com/kb/convert-categorical-data-in-pandas-and-scikit-learn