Car Price Prediction#

Author: Shuoyu Zhu

Course Project, UC Irvine, Math 10, S23

Introduction#

The main goal of this project is to predict the price of a car from both categorical and numerical features. This is achieved by converting each categorical feature into 0/1 indicator columns using OneHotEncoder. After encoding all categorical features as numeric ones, we fit a regression model to the dataset.

Imports#

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

Organizing the Dataset#

This dataset was acquired from Kaggle: https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge

df = pd.read_csv('car_price_prediction.csv')
df.head()
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags
0 45654403 13328 1399 LEXUS RX 450 2010 Jeep Yes Hybrid 3.5 186005 km 6.0 Automatic 4x4 04-May Left wheel Silver 12
1 44731507 16621 1018 CHEVROLET Equinox 2011 Jeep No Petrol 3 192000 km 6.0 Tiptronic 4x4 04-May Left wheel Black 8
2 45774419 8467 - HONDA FIT 2006 Hatchback No Petrol 1.3 200000 km 4.0 Variator Front 04-May Right-hand drive Black 2
3 45769185 3607 862 FORD Escape 2011 Jeep Yes Hybrid 2.5 168966 km 4.0 Automatic 4x4 04-May Left wheel White 0
4 45809263 11726 446 HONDA FIT 2014 Hatchback Yes Petrol 1.3 91901 km 4.0 Automatic Front 04-May Left wheel Silver 4

Here I clean up the dataset to prepare it for the regression model.

df['Levy'] = df['Levy'].replace('-', 0)  # missing levies are recorded as '-'
df['Levy'] = df['Levy'].astype(float)
df['total_price'] = df['Price'] + df['Levy']
df['Leather interior'] = df['Leather interior'].replace('Yes', 1)
df['Leather interior'] = df['Leather interior'].replace('No', 0)
# Here I delete the 'km' in the Mileage column and make it an int type.
# Credit to my other python course ICS31 for explaining the str.extract() function.
df['Mileage'] = df['Mileage'].str.extract(r'(\d+)', expand=False)
df['Mileage'] = df['Mileage'].astype(int)

Engine volume is not numeric because some entries contain the word "Turbo" while others do not:

df[df['Engine volume'].str.contains('Turbo')].head(2)
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume Mileage Cylinders Gear box type Drive wheels Doors Wheel Color Airbags total_price
23 45814106 7840 0.0 FORD Transit 2001 Microbus 0 Diesel 2.0 Turbo 230000 4.0 Manual Front 02-Mar Left wheel White 0 7840.0
25 45782859 20385 0.0 MERCEDES-BENZ E 220 2006 Sedan 1 Diesel 2.2 Turbo 210000 4.0 Tiptronic Rear 04-May Left wheel Black 8 20385.0
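This project leaves Engine volume out of the model, but as a sketch of how the column could be made numeric, one could split it into a numeric volume plus a turbo indicator. The column names Turbo and Engine volume num below are hypothetical and not used anywhere else in the project.

# Hypothetical sketch (not applied below): split 'Engine volume' into
# a numeric volume and a 0/1 turbo flag
df['Turbo'] = df['Engine volume'].str.contains('Turbo').astype(int)
df['Engine volume num'] = df['Engine volume'].str.extract(r'(\d+\.?\d*)', expand=False).astype(float)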

Since some rows have an unreasonable total_price (Price + Levy) for a car, I restrict the range to $3,000–$50,000 to remove the extreme values.

df = df[(df['total_price'] > 3000) & (df['total_price'] < 50000)]

Exploring the Dataset#

I keep only the top 10 manufacturers in the dataset to produce clearer charts. This dataframe is also used in the prediction part.

df_pre = df[df['Manufacturer'].isin(df['Manufacturer'].value_counts()[0:10].index)]
df_visual = df_pre.sample(n=5000, random_state= 10086)

This donut chart shows the approximate share of each manufacturer in the dataset. What I don’t want to see is a manufacturer contributing only one or two rows, which would have no significant effect on the prediction and could complicate the process. Restricting to the top 10 manufacturers looks like a decent choice at the moment.

alt.Chart(df_visual).mark_arc(innerRadius=50).encode(
    theta="count()",
    color="Manufacturer",
    tooltip=['Manufacturer', 'count()']
)

Source for creating the donut chart: https://altair-viz.github.io/gallery/donut_chart.html

This boxplot shows the approximate price range of each brand. From here, we can see a correlation between total_price and brand: better-known brands tend to have a higher median and third quartile of total price.

alt.Chart(df_visual).mark_boxplot(extent='min-max', size = 30).encode(
    x='Manufacturer',
    y='total_price:Q'
).properties(width=400)

Source for creating the boxplot chart: https://altair-viz.github.io/gallery/boxplot.html

The following chart is an interesting visualization of how car color may affect price. From the density of the scatter plot, the most popular colors are clearly Black, Grey, Silver, and White. We can also see a drop in median price for less popular colors such as Green, Orange, and Purple.

c1=alt.Chart(df_visual).mark_circle().encode(
    x='Color',
    y='total_price'
)
c2=alt.Chart(df_visual).mark_line(color='red').encode(
    x='Color',
    y='median(total_price)'
)
c1+c2

Machine Learning#

To include categorical features in the regression model, I use the OneHotEncoder module to convert these features to numeric. OneHotEncoder expands a column of categorical values into several indicator columns. For example, if we have a column Color with 3 possible values, Red, Green, and Yellow, OneHotEncoder adds three columns: Color_Red, Color_Green, and Color_Yellow. For a red car, the values of these three columns will be 1, 0, 0.
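As a minimal illustration of that behavior on a toy dataframe (not part of the project's data; note that the generated names keep the original capitalization and are ordered alphabetically):

# Toy example: one-hot encode a single Color column
toy = pd.DataFrame({'Color': ['Red', 'Green', 'Yellow']})
enc = OneHotEncoder()
pd.DataFrame(
    enc.fit_transform(toy[['Color']]).toarray(),
    columns=enc.get_feature_names_out(['Color'])
)
# columns: Color_Green, Color_Red, Color_Yellow; the 'Red' row is 0, 1, 0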

OneHotEncoder is the Extra part of the project.

encoder = OneHotEncoder()
cols = ['Manufacturer', 'Fuel type', 'Gear box type', 'Leather interior', 'Drive wheels']
encoded_cols = pd.DataFrame(
    encoder.fit_transform(df_pre[cols]).toarray(), 
    columns=encoder.get_feature_names_out(cols)
)
# By resetting the index of df_pre, I can merge the two dataframes on their shared index.
df_reset_index = df_pre.reset_index(drop=True)
df_encoded = pd.merge(df_reset_index, encoded_cols, left_index=True, right_index=True)
df_encoded
ID Price Levy Manufacturer Model Prod. year Category Leather interior Fuel type Engine volume ... Fuel type_Plug-in Hybrid Gear box type_Automatic Gear box type_Manual Gear box type_Tiptronic Gear box type_Variator Leather interior_0 Leather interior_1 Drive wheels_4x4 Drive wheels_Front Drive wheels_Rear
0 45654403 13328 1399.0 LEXUS RX 450 2010 Jeep 1 Hybrid 3.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 44731507 16621 1018.0 CHEVROLET Equinox 2011 Jeep 0 Petrol 3 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
2 45774419 8467 0.0 HONDA FIT 2006 Hatchback 0 Petrol 1.3 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 45769185 3607 862.0 FORD Escape 2011 Jeep 1 Hybrid 2.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 45809263 11726 446.0 HONDA FIT 2014 Hatchback 1 Petrol 1.3 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12257 45772306 5802 1055.0 MERCEDES-BENZ E 350 2013 Sedan 1 Diesel 3.5 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
12258 45798355 8467 0.0 MERCEDES-BENZ CLK 200 1999 Coupe 1 CNG 2.0 Turbo ... 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0
12259 45778856 15681 831.0 HYUNDAI Sonata 2011 Sedan 1 Petrol 2.4 ... 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0
12260 45804997 26108 836.0 HYUNDAI Tucson 2010 Jeep 1 Diesel 2 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
12261 45793526 5331 1288.0 CHEVROLET Captiva 2007 Jeep 1 Diesel 2 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

12262 rows × 45 columns
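As an aside, the same index-based merge can be written more compactly with pd.concat; df_encoded_alt below is a hypothetical name for this equivalent result.

# Equivalent to the merge above: concatenate column-wise on the shared RangeIndex
df_encoded_alt = pd.concat([df_pre.reset_index(drop=True), encoded_cols], axis=1)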

From here on we only need the numeric columns in the dataframe.

# Keep only the numeric columns, then drop the IDs and the price components already combined into total_price
numeric_mask = df_encoded.dtypes.apply(lambda x: pd.api.types.is_numeric_dtype(x))
df_reg = df_encoded.loc[:, numeric_mask]
df_reg = df_reg.drop(['ID', 'Price', 'Levy'], axis = 1)
df_reg.head()
Prod. year Leather interior Mileage Cylinders Airbags total_price Manufacturer_BMW Manufacturer_CHEVROLET Manufacturer_FORD Manufacturer_HONDA ... Fuel type_Plug-in Hybrid Gear box type_Automatic Gear box type_Manual Gear box type_Tiptronic Gear box type_Variator Leather interior_0 Leather interior_1 Drive wheels_4x4 Drive wheels_Front Drive wheels_Rear
0 2010 1 186005 6.0 12 14727.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
1 2011 0 192000 6.0 8 17639.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 1.0 0.0 1.0 0.0 0.0
2 2006 0 200000 4.0 2 8467.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 1.0 0.0
3 2011 1 168966 4.0 0 4469.0 0.0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
4 2014 1 91901 4.0 4 12172.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0

5 rows × 32 columns
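A more concise equivalent of the numeric-column selection above uses select_dtypes; df_reg_alt is a hypothetical name for the same result.

# Equivalent selection: keep numeric columns, then drop the unused ones
df_reg_alt = df_encoded.select_dtypes(include='number').drop(['ID', 'Price', 'Levy'], axis=1)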

features = list(df_reg.columns)
features.remove('total_price')

Splitting the data using train_test_split to find the best-fitting max_leaf_nodes.

X_train, X_test, y_train, y_test = train_test_split(
    df_reg[features], 
    df_reg['total_price'], 
    test_size=0.3, 
    random_state=10086
)
df_err = pd.DataFrame(columns=['leaves', 'error', 'set'])
for i in range(50, 2000, 50):
    rfr = RandomForestRegressor(max_leaf_nodes=i)
    rfr.fit(X_train, y_train)
    # score() returns R^2, so 1 - score is the unexplained-variance error
    train_error = 1 - rfr.score(X_train, y_train)
    test_error = 1 - rfr.score(X_test, y_test)
    df_err.loc[len(df_err)] = {'leaves': i, 'error': train_error, 'set': 'train'}
    df_err.loc[len(df_err)] = {'leaves': i, 'error': test_error, 'set': 'test'}

From this graph we can see that the test error curve reaches a minimum around 600 leaves, with the error steadying around 0.3. Increasing max_leaf_nodes does not lead to overfitting in this case; I think it is because I used many features when fitting the model. To see the test curve turn upward and display overfitting, we might need to increase max_leaf_nodes toward 10k, which would take a huge amount of time for Deepnote to process (the code section above already takes around a minute). I therefore choose 1000 leaves for the prediction below. (Note that without a fixed random_state in RandomForestRegressor, the curve varies slightly between runs.)

alt.Chart(df_err).mark_line().encode(
    x = 'leaves',
    y = 'error',
    color = 'set',
    tooltip = 'leaves'
)
rfr2 = RandomForestRegressor(max_leaf_nodes=1000)
rfr2.fit(df_reg[features], df_reg['total_price'])
RandomForestRegressor(max_leaf_nodes=1000)
df_reg['pred'] = rfr2.predict(df_reg[features])
df_reg['index'] = df_reg.index

The two graphs below give us an overview of the performance of the prediction model.

Here I only use the first 20 rows of the dataset to visualize the prediction. The blue line represents the predicted price and the red line represents the actual total price of the car. The prediction curve tracks the actual price curve closely.

c3 = alt.Chart(df_reg.head(20)).mark_line().encode(
    x = 'index',
    y = 'pred'
)
c4 = alt.Chart(df_reg.head(20)).mark_line(color = 'red').encode(
    x = 'index',
    y = 'total_price'
)
c3+c4

Here is another display of the prediction with a larger sample. We can see that the points are scattered around the line y = x, which is what I expect.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x = 'pred',
    y = 'total_price'
)
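To make the comparison with y = x explicit, a reference line can be layered on top. This is a sketch, with the endpoint 50000 chosen to match the price cap set earlier.

c5 = alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x = 'pred',
    y = 'total_price'
)
# Reference line y = x from 0 to the $50,000 price cap
c6 = alt.Chart(pd.DataFrame({'v': [0, 50000]})).mark_line(color='red').encode(
    x = 'v:Q',
    y = 'v:Q'
)
c5+c6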

Error Analysis#

To create a good visualization of the error, I create an err_percentage column: the difference between the prediction and the actual price, as a fraction of the actual price, rounded to two decimal places.

df_reg['err'] = df_reg['pred'] - df_reg['total_price']
df_reg['err_percentage'] = df_reg['err'] / df_reg['total_price']
df_reg['err_percentage'] = round(df_reg['err_percentage'], 2)

We can see in the following chart that the error percentages are roughly normally distributed around 0. Most of the error percentages are between -25% and 25%. I think this is a decent result for our prediction model.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_bar().encode(
    y="count()",
    x="err_percentage"
).properties(width=500, height=300)
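As a quick numeric check of that claim, one can compute the share of predictions within ±25% of the actual price (a sketch using the err_percentage column defined above):

within_25 = df_reg['err_percentage'].between(-0.25, 0.25).mean()
print(f'{within_25:.1%} of predictions fall within 25% of the actual price')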

The following chart shows the relationship between err_percentage and total_price. Interestingly, the spread of the error percentage shrinks as total_price increases, which means the model performs worse on cheaper cars.

alt.Chart(df_reg.sample(5000, random_state=10086)).mark_circle().encode(
    x="total_price",
    y="err_percentage"
).properties(width=500, height=300)

Summary#


In general, the model performs much better than I expected. I first use OneHotEncoder to encode all the non-numeric features in the dataframe as numbers, then apply a RandomForestRegressor to make the predictions. The visualizations are limited by Altair’s default 5,000-row cap, so I use the sample() function in every Altair chart with the same random_state.
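As an alternative to sampling, Altair's row limit can also be lifted entirely, at the cost of much larger notebook output; this is a standard Altair call, though it is not used in this project.

# Disable Altair's default 5,000-row cap (not used here; sampling keeps outputs small)
alt.data_transformers.disable_max_rows()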

References#


  • Dataset source: https://www.kaggle.com/datasets/deepcontractor/car-price-prediction-challenge

  • OneHotEncoder documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

  • Converting categorical data in pandas and scikit-learn: https://www.turing.com/kb/convert-categorical-data-in-pandas-and-scikit-learn
