Predict The Car Price¶

Author: Kehan Li

Course Project, UC Irvine, Math 10, S22

Introduction¶

This project builds a LinearRegression model to predict vehicle prices from historical data. The input features are Latest_Launch, Sales_in_thousands, __year_resale_value, Passenger, Engine_size, Horsepower, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, and Power_perf_factor. Of these, the features most strongly correlated with price are __year_resale_value, Engine_size, Horsepower, Curb_weight, and Power_perf_factor. I examined the relationship between each variable and the price, then kept the most significant variables so that the LinearRegression predictions come closer to the actual values. I also tried to build a KNeighborsClassifier to determine whether a car is worth buying.

Import data:¶

import pandas as pd
df = pd.read_csv("Car_sales.csv").dropna()
df

	Manufacturer	Model	Sales_in_thousands	__year_resale_value	Vehicle_type	Price_in_thousands	Engine_size	Horsepower	Wheelbase	Width	Length	Curb_weight	Fuel_capacity	Fuel_efficiency	Latest_Launch	Power_perf_factor
0	Acura	Integra	16.919	16.360	Passenger	21.50	1.8	140.0	101.2	67.3	172.4	2.639	13.2	28.0	2/2/2012	58.280150
1	Acura	TL	39.384	19.875	Passenger	28.40	3.2	225.0	108.1	70.3	192.9	3.517	17.2	25.0	6/3/2011	91.370778
3	Acura	RL	8.588	29.725	Passenger	42.00	3.5	210.0	114.6	71.4	196.6	3.850	18.0	22.0	3/10/2011	91.389779
4	Audi	A4	20.397	22.255	Passenger	23.99	1.8	150.0	102.6	68.2	178.0	2.998	16.4	27.0	10/8/2011	62.777639
5	Audi	A6	18.780	23.555	Passenger	33.95	2.8	200.0	108.7	76.1	192.0	3.561	18.5	22.0	8/9/2011	84.565105
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
145	Volkswagen	Golf	9.761	11.425	Passenger	14.90	2.0	115.0	98.9	68.3	163.3	2.767	14.5	26.0	1/24/2011	46.943877
146	Volkswagen	Jetta	83.721	13.240	Passenger	16.70	2.0	115.0	98.9	68.3	172.3	2.853	14.5	26.0	8/27/2011	47.638237
147	Volkswagen	Passat	51.102	16.725	Passenger	21.20	1.8	150.0	106.4	68.5	184.1	3.043	16.4	27.0	10/30/2012	61.701381
148	Volkswagen	Cabrio	9.569	16.575	Passenger	19.99	2.0	115.0	97.4	66.7	160.4	3.079	13.7	26.0	5/31/2011	48.907372
149	Volkswagen	GTI	5.596	13.760	Passenger	17.50	2.0	115.0	98.9	68.3	163.3	2.762	14.6	26.0	4/1/2011	47.946841

117 rows × 16 columns

Manufacturer Distribution:¶

df.Manufacturer.unique()

array(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'Dodge', 'Ford', 'Honda', 'Hyundai', 'Infiniti',
       'Jeep', 'Lexus', 'Lincoln', 'Mitsubishi', 'Mercury', 'Mercedes-B',
       'Nissan', 'Oldsmobile', 'Plymouth', 'Pontiac', 'Porsche', 'Saturn',
       'Toyota', 'Volkswagen'], dtype=object)

df.Manufacturer.value_counts()

Ford          10
Dodge          9
Toyota         8
Chevrolet      8
Mitsubishi     7
Mercury        6
Honda          5
Volkswagen     5
Nissan         5
Pontiac        5
Chrysler       5
Oldsmobile     4
Buick          4
Mercedes-B     4
Plymouth       3
Acura          3
Jeep           3
Lexus          3
Saturn         3
Audi           3
Porsche        3
Hyundai        3
Cadillac       3
BMW            2
Lincoln        2
Infiniti       1
Name: Manufacturer, dtype: int64

import altair as alt
c3 = alt.Chart(df).mark_bar().encode(
    x = "Manufacturer",
    y = "count()",
    color="Model",
    tooltip=["Model","Sales_in_thousands","Price_in_thousands"]
)
c3

The Altair bar chart above shows the number of models per manufacturer. Ford (10) and Dodge (9) have more models in this dataset than any other manufacturer.

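import plotly.express as px
# Average sales (in thousands) per model, for each manufacturer
avg = df.groupby("Manufacturer", as_index=False)["Sales_in_thousands"].mean()

# Plot the aggregated frame directly so x and y come from the same table
fig = px.bar(avg, x="Manufacturer", y="Sales_in_thousands")
fig.show()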

The plotly bar chart above shows the average sales per model for each manufacturer. Ford has the highest average sales of all the manufacturers.
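The two charts answer different questions: the first counts models, the second averages sales per model. As a small illustration, the sketch below uses only the five Acura and Audi rows visible in the table preview above (a subset, not the full dataset), so the numbers are illustrative only. The names `sample` and `summary` are introduced here for the example.

```python
import pandas as pd

# Five rows copied from the table preview above (a subset, not the full data)
sample = pd.DataFrame({
    "Manufacturer": ["Acura", "Acura", "Acura", "Audi", "Audi"],
    "Model": ["Integra", "TL", "RL", "A4", "A6"],
    "Sales_in_thousands": [16.919, 39.384, 8.588, 20.397, 18.780],
})

# Number of models, mean sales per model, and total sales per manufacturer
summary = sample.groupby("Manufacturer")["Sales_in_thousands"].agg(
    models="count", mean_sales="mean", total_sales="sum"
)
print(summary)
```

A manufacturer with many models can still have a low mean if most of its models sell poorly, which is why the two charts can rank manufacturers differently.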

Create the Linear Regression with all features:¶

df.dtypes

Manufacturer            object
Model                   object
Sales_in_thousands     float64
__year_resale_value    float64
Vehicle_type            object
Price_in_thousands     float64
Engine_size            float64
Horsepower             float64
Wheelbase              float64
Width                  float64
Length                 float64
Curb_weight            float64
Fuel_capacity          float64
Fuel_efficiency        float64
Latest_Launch           object
Power_perf_factor      float64
dtype: object

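# Encode Vehicle_type as a 0/1 indicator: 1 for "Passenger", 0 otherwise
df["Passenger"] = (df["Vehicle_type"] == "Passenger").astype(int)

# Parse the launch date and store it as int64 nanoseconds since the Unix epoch
df["Latest_Launch"] = pd.to_datetime(df["Latest_Launch"]).astype("int64")

This converts the two object (string) columns used by the model into numbers: Vehicle_type becomes the 0/1 indicator Passenger, and Latest_Launch becomes an integer timestamp.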
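To make the date conversion concrete, here is a minimal sketch on the first two launch dates from the table above. It assumes the dates are in month/day/year order, which is consistent with the values shown. The second encoding (days since the earliest launch) is not used in this project; it is shown only as a smaller-scale alternative.

```python
import pandas as pd

# First two Latest_Launch values from the table above (month/day/year assumed)
dates = pd.Series(["2/2/2012", "6/3/2011"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

# Integer encoding used in this notebook: nanoseconds since 1970-01-01
as_ns = parsed.astype("datetime64[ns]").astype("int64")

# Alternative encoding (an assumption, not the notebook's choice):
# whole days since the earliest launch date in the sample
as_days = (parsed - parsed.min()).dt.days

print(as_ns.tolist())
print(as_days.tolist())
```

The nanosecond values are on the order of 1e18, far larger than any other feature in the table.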

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cols=['Latest_Launch','Sales_in_thousands','__year_resale_value','Passenger','Engine_size','Horsepower','Wheelbase','Width','Length','Curb_weight','Fuel_capacity','Power_perf_factor']
reg.fit(df[cols],df["Price_in_thousands"])
pd.Series(reg.coef_,index=cols)

Latest_Launch         -3.532375e-17
Sales_in_thousands    -7.167365e-03
__year_resale_value    8.218924e-01
Passenger             -5.986566e-03
Engine_size           -1.527691e-02
Horsepower            -5.138488e-02
Wheelbase              1.452939e-01
Width                 -9.911713e-03
Length                -2.618168e-02
Curb_weight            1.189987e-02
Fuel_capacity          1.283967e-01
Power_perf_factor      2.859755e-01
dtype: float64

reg.coef_

array([-3.53237502e-17, -7.16736497e-03,  8.21892425e-01, -5.98656606e-03,
       -1.52769076e-02, -5.13848849e-02,  1.45293877e-01, -9.91171336e-03,
       -2.61816793e-02,  1.18998664e-02,  1.28396731e-01,  2.85975504e-01])

# Build "feature x coefficient" terms from the fitted model instead of indexing each one by hand
terms = " + ".join(f"{name} x {coef}" for name, coef in zip(cols, reg.coef_))
print(f"The equation is: Pred Price = {terms} + {reg.intercept_}")

The equation is: Pred Price = Latest_Launch x -3.532375019547313e-17 + Sales_in_thousands x -0.007167364973728246 + __year_resale_value x 0.8218924254079063 + Passenger x -0.00598656606270679 + Engine_size x -0.015276907576185477 + Horsepower x -0.05138488489098564 + Wheelbase x 0.14529387667785817 + Width x -0.009911713362671822 + Length x -0.02618167933552728 + Curb_weight x 0.01189986643110451 + Fuel_capacity x 0.1283967313215668 + Power_perf_factor x 0.28597550410794764 + 33.94504156612675

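This equation expresses the predicted price as a linear combination of the features plus an intercept. The coefficient of Latest_Launch is tiny only because that feature is stored in nanoseconds (values around 1.3e18). Since the features are on different scales, the raw coefficients should not be compared as measures of importance.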
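The printed equation is exactly what LinearRegression.predict computes. The sketch below checks this on synthetic data. The arrays `X_demo` and `y_demo` and the coefficients used to generate them are made up for the illustration and are not taken from the car dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): 50 rows, 3 features, known linear relationship plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=50)

demo_reg = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: sum of feature * coefficient, plus the intercept
manual = X_demo @ demo_reg.coef_ + demo_reg.intercept_

# The hand-computed values agree with predict() up to floating-point error
print(np.allclose(manual, demo_reg.predict(X_demo)))
```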

df["Pred"] = reg.predict(df[cols])
df

	Manufacturer	Model	Sales_in_thousands	__year_resale_value	Vehicle_type	Price_in_thousands	Engine_size	Horsepower	Wheelbase	Width	Length	Curb_weight	Fuel_capacity	Fuel_efficiency	Latest_Launch	Power_perf_factor	Passenger	Pred
0	Acura	Integra	16.919	16.360	Passenger	21.50	1.8	140.0	101.2	67.3	172.4	2.639	13.2	28.0	1328140800000000000	58.280150	1	21.043551
1	Acura	TL	39.384	19.875	Passenger	28.40	3.2	225.0	108.1	70.3	192.9	3.517	17.2	25.0	1307059200000000000	91.370778	1	30.550278
3	Acura	RL	8.588	29.725	Passenger	42.00	3.5	210.0	114.6	71.4	196.6	3.850	18.0	22.0	1299715200000000000	91.389779	1	40.841002
4	Audi	A4	20.397	22.255	Passenger	23.99	1.8	150.0	102.6	68.2	178.0	2.998	16.4	27.0	1318032000000000000	62.777639	1	27.456097
5	Audi	A6	18.780	23.555	Passenger	33.95	2.8	200.0	108.7	76.1	192.0	3.561	18.5	22.0	1312848000000000000	84.565105	1	33.083205
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
145	Volkswagen	Golf	9.761	11.425	Passenger	14.90	2.0	115.0	98.9	68.3	163.3	2.767	14.5	26.0	1295827200000000000	46.943877	1	16.282528
146	Volkswagen	Jetta	83.721	13.240	Passenger	16.70	2.0	115.0	98.9	68.3	172.3	2.853	14.5	26.0	1314403200000000000	47.638237	1	16.551949
147	Volkswagen	Passat	51.102	16.725	Passenger	21.20	1.8	150.0	106.4	68.5	184.1	3.043	16.4	27.0	1351555200000000000	61.701381	1	21.588980
148	Volkswagen	Cabrio	9.569	16.575	Passenger	19.99	2.0	115.0	97.4	66.7	160.4	3.079	13.7	26.0	1306800000000000000	48.907372	1	20.465401
149	Volkswagen	GTI	5.596	13.760	Passenger	17.50	2.0	115.0	98.9	68.3	163.3	2.762	14.6	26.0	1301616000000000000	47.946841	1	18.326620

117 rows × 18 columns

d=df["Price_in_thousands"]

Build training and test set:¶

from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as a test set; random_state makes the split reproducible
c_train,c_test,d_train,d_test = train_test_split(df[cols],d,test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(c_train,d_train)
# R^2 (coefficient of determination) on the training set
model.score(c_train, d_train)

0.9553581564437191

Feature selection:¶

cols2=['Latest_Launch','Sales_in_thousands','__year_resale_value','Passenger','Engine_size','Horsepower','Wheelbase','Width','Length','Curb_weight','Fuel_capacity','Power_perf_factor','Price_in_thousands']

c = df[cols2]
corr = c.corr()
corr.style.background_gradient(cmap='coolwarm')

	Latest_Launch	Sales_in_thousands	__year_resale_value	Passenger	Engine_size	Horsepower	Wheelbase	Width	Length	Curb_weight	Fuel_capacity	Power_perf_factor	Price_in_thousands
Latest_Launch	1.000000	0.124997	-0.066848	0.125991	0.010749	0.024798	0.008621	0.050816	0.059959	-0.071564	0.020216	0.009201	-0.051249
Sales_in_thousands	0.124997	1.000000	-0.275426	-0.278774	0.038111	-0.152538	0.406839	0.177802	0.272336	0.067184	0.138045	-0.175562	-0.251705
__year_resale_value	-0.066848	-0.275426	1.000000	0.091679	0.527187	0.773110	-0.053685	0.178128	0.025390	0.363274	0.324796	0.829511	0.954757
Passenger	0.125991	-0.278774	0.091679	1.000000	-0.182515	0.045867	-0.385062	-0.220744	-0.109779	-0.469247	-0.586927	0.051096	0.076303
Engine_size	0.010749	0.038111	0.527187	-0.182515	1.000000	0.861618	0.410020	0.671756	0.537343	0.742831	0.616862	0.841005	0.649170
Horsepower	0.024798	-0.152538	0.773110	0.045867	0.861618	1.000000	0.225905	0.507275	0.400968	0.598603	0.479790	0.994071	0.853455
Wheelbase	0.008621	0.406839	-0.053685	-0.385062	0.410020	0.225905	1.000000	0.675559	0.853669	0.675609	0.658654	0.200228	0.067042
Width	0.050816	0.177802	0.178128	-0.220744	0.671756	0.507275	0.675559	1.000000	0.743226	0.735957	0.672191	0.478889	0.301292
Length	0.059959	0.272336	0.025390	-0.109779	0.537343	0.400968	0.853669	0.743226	1.000000	0.684305	0.562504	0.366831	0.182592
Curb_weight	-0.071564	0.067184	0.363274	-0.469247	0.742831	0.598603	0.675609	0.735957	0.684305	1.000000	0.847994	0.597586	0.511400
Fuel_capacity	0.020216	0.138045	0.324796	-0.586927	0.616862	0.479790	0.658654	0.672191	0.562504	0.847994	1.000000	0.478484	0.406496
Power_perf_factor	0.009201	-0.175562	0.829511	0.051096	0.841005	0.994071	0.200228	0.478889	0.366831	0.597586	0.478484	1.000000	0.905002
Price_in_thousands	-0.051249	-0.251705	0.954757	0.076303	0.649170	0.853455	0.067042	0.301292	0.182592	0.511400	0.406496	0.905002	1.000000

The correlation matrix shows strong positive associations (as one variable increases, so does the other) for the following pairs: (__year_resale_value, Price_in_thousands) 0.95, (Horsepower, Price_in_thousands) 0.85, (Horsepower, Engine_size) 0.86, (Length, Wheelbase) 0.85, (Curb_weight, Engine_size) 0.74, (Fuel_capacity, Curb_weight) 0.85, (Power_perf_factor, __year_resale_value) 0.83, (Power_perf_factor, Price_in_thousands) 0.91, (Power_perf_factor, Engine_size) 0.84, and (Power_perf_factor, Horsepower) 0.99. Fuel_efficiency is not included in this matrix, but the pairs (Fuel_efficiency, Engine_size), (Fuel_efficiency, Curb_weight), and (Fuel_efficiency, Fuel_capacity) have a strong negative association: as one variable increases, the other decreases.
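Reading strongly correlated pairs off a large matrix by eye is error-prone, so the same list can be extracted programmatically. The sketch below does this on a small synthetic frame. The column names `a` to `d` and the 0.7 cutoff are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic frame (hypothetical): b tracks a, d tracks -a, c is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
    "d": -a + rng.normal(scale=0.1, size=200),
})

corr_demo = demo.corr()

# Keep only the entries above the diagonal so each pair appears once
upper = np.triu(np.ones(corr_demo.shape, dtype=bool), k=1)
pairs = corr_demo.where(upper).stack()

# Pairs whose absolute correlation exceeds the (assumed) 0.7 cutoff
strong = pairs[pairs.abs() > 0.7].sort_values(ascending=False)
print(strong)
```

Positive entries in `strong` correspond to pairs that rise together, negative entries to pairs that move in opposite directions.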

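# numeric_only=True skips the text columns (required in pandas 2.0 and later)
corr = df.corr(numeric_only=True)
# Correlation of every numeric column with the price, strongest first
print(corr["Price_in_thousands"].sort_values(ascending=False))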

Price_in_thousands     1.000000
Pred                   0.976777
__year_resale_value    0.954757
Power_perf_factor      0.905002
Horsepower             0.853455
Engine_size            0.649170
Curb_weight            0.511400
Fuel_capacity          0.406496
Width                  0.301292
Length                 0.182592
Passenger              0.076303
Wheelbase              0.067042
Latest_Launch         -0.051249
Sales_in_thousands    -0.251705
Fuel_efficiency       -0.479539
Name: Price_in_thousands, dtype: float64

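The features most strongly correlated with the target price (correlation above 0.5) are __year_resale_value, Power_perf_factor, Horsepower, Engine_size, and Curb_weight. Pred is excluded because it is the model's own output, not an input feature.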
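The 0.5 cutoff can be applied directly to the correlations printed above. The sketch below copies those values into a Series (leaving out Price_in_thousands itself and Pred, which is a model output) and filters by absolute value. The variable names are introduced here for the example.

```python
import pandas as pd

# Correlations with Price_in_thousands, copied from the output above
corr_with_price = pd.Series({
    "__year_resale_value": 0.954757,
    "Power_perf_factor": 0.905002,
    "Horsepower": 0.853455,
    "Engine_size": 0.649170,
    "Curb_weight": 0.511400,
    "Fuel_capacity": 0.406496,
    "Width": 0.301292,
    "Length": 0.182592,
    "Passenger": 0.076303,
    "Wheelbase": 0.067042,
    "Latest_Launch": -0.051249,
    "Sales_in_thousands": -0.251705,
    "Fuel_efficiency": -0.479539,
})

# Keep features whose absolute correlation with price exceeds 0.5
selected = corr_with_price[corr_with_price.abs() > 0.5].index.tolist()
print(selected)
```

Using the absolute value means a strongly negative feature would also be kept. Here none reaches the cutoff, since Fuel_efficiency is at -0.48.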

cols

['Latest_Launch',
 'Sales_in_thousands',
 '__year_resale_value',
 'Passenger',
 'Engine_size',
 'Horsepower',
 'Wheelbase',
 'Width',
 'Length',
 'Curb_weight',
 'Fuel_capacity',
 'Power_perf_factor']

import altair as alt
cols2=['__year_resale_value','Engine_size','Horsepower','Curb_weight','Power_perf_factor']
A=[]
for i in cols2:
    c=alt.Chart(df).mark_circle().encode(
        x=i,
        y="Price_in_thousands",
        color="Manufacturer"
    )
    A.append(c)
alt.vconcat(*A)

import altair as alt
c = alt.Chart(df).mark_circle().encode(
    x='__year_resale_value',
    y="Price_in_thousands",
    color="Manufacturer"
)
c1=alt.Chart(df).mark_line(color="red").encode(
    x='__year_resale_value',
    y='Pred',
)
c+c1

Create the Linear Regression with the 4 most important features:¶

x=df[['__year_resale_value','Engine_size','Curb_weight','Power_perf_factor']]
y=df['Price_in_thousands']

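from sklearn.preprocessing import StandardScaler
# Standardized copy of the features (zero mean, unit variance), used for the train/test split below
sc = StandardScaler()
sc.fit(x)
x2 = sc.transform(x)

from sklearn.linear_model import LinearRegression
# reg2 is fit on the unscaled features, so its coefficients are in the original units
reg2=LinearRegression()
reg2.fit(x,y)
df["Pred2"]=reg2.predict(x)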

reg2.coef_

array([ 0.79057747, -1.48836066,  3.0635297 ,  0.21038971])

reg2.intercept_

-9.696273495640352

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x2,y,test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)
pred= model.predict(x_train)
model.score(x_train, y_train)

0.9602604670542807

from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
# LinearRegression.score returns the coefficient of determination R^2, not a classification accuracy
print("R^2 on training set: ",linear_reg.score(x_train,y_train))
print("R^2 on testing set: ",linear_reg.score(x_test,y_test))

R^2 on training set:  0.9602604670542807
R^2 on testing set:  0.9554237019850436
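In the cells above, the scaler was fit on all rows before the split, so the test rows influenced the scaling. One way to avoid this is to put the scaler and the regression in a single pipeline that is fit on the training rows only. This is a sketch on synthetic data (all names and numbers below are made up for the example), not the approach used in this notebook. It also shows that standardizing the features does not change the R^2 of an ordinary least-squares fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 4 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4)) * np.array([10.0, 1.0, 0.5, 30.0])
y_demo = X_demo @ np.array([0.8, -1.5, 3.0, 0.2]) + rng.normal(scale=1.0, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# The scaler inside the pipeline only ever sees the training rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)
train_r2 = pipe.score(X_tr, y_tr)
test_r2 = pipe.score(X_te, y_te)

# Same model without scaling, for comparison
plain = LinearRegression().fit(X_tr, y_tr)
plain_test_r2 = plain.score(X_te, y_te)

print(train_r2, test_r2, plain_test_r2)
```

For linear regression the scaling affects the size of the coefficients but not the predictions, so `test_r2` and `plain_test_r2` agree up to rounding.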

Check for overfitting:¶

from sklearn.model_selection import train_test_split
# No random_state here, so the split and the scores below change from run to run
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(x_train,y_train)
pred= linreg.predict(x_test)
from sklearn.metrics import mean_squared_error
# Mean squared error on the held-out test set, in squared units of Price_in_thousands
mean_squared_error(y_test, pred)

16.114304344869506

from sklearn.metrics import r2_score
r2_score(y_test,pred)

0.9314058140034983

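print("Our model explains almost 96% of the variability of the training data")

Our model explains almost 96% of the variability of the training data

The R^2 on held-out data (0.955 on the fixed split, 0.931 on the random split) is close to the training R^2 (0.960), so the model does not appear to overfit and we treat it as reliable.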
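A single random split gives only one estimate of the held-out score, and with about 117 rows that estimate can move noticeably between splits. K-fold cross-validation averages over several splits. The sketch below shows the mechanics on synthetic data of the same size. The data, the 5 folds, and the variable names are assumptions made for the illustration, and this check was not run on the car dataset here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (hypothetical): 117 rows and 4 features, mirroring the dataset's shape
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(117, 4))
y_demo = X_demo @ np.array([5.0, -2.0, 3.0, 1.0]) + rng.normal(scale=1.0, size=117)

# 5 shuffled folds; each fold serves once as the held-out set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

# Mean and spread of the held-out R^2 across the folds
print(scores.mean(), scores.std())
```

A mean cross-validated R^2 close to the training R^2, with a small spread, would support the conclusion that the model is not overfitting.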

Create the KNeighborsClassifier to determine if the car is worth buying:¶

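import numpy as np
# Label a car "Buy" when the model's predicted price exceeds the listed price
# (the car looks underpriced), and "Not Buy" otherwise
conditions = [df['Pred2'] > df['Price_in_thousands'],df['Pred2'] < df['Price_in_thousands']]
choices = ['Buy','Not Buy']
df['Trade Decision'] = np.select(conditions, choices, default='Not Buy')
df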

	Manufacturer	Model	Sales_in_thousands	__year_resale_value	Vehicle_type	Price_in_thousands	Engine_size	Horsepower	Wheelbase	Width	Length	Curb_weight	Fuel_capacity	Fuel_efficiency	Latest_Launch	Power_perf_factor	Passenger	Pred	Pred2	Trade Decision
0	Acura	Integra	16.919	16.360	Passenger	21.50	1.8	140.0	101.2	67.3	172.4	2.639	13.2	28.0	1328140800000000000	58.280150	1	21.043551	20.904723	Not Buy
1	Acura	TL	39.384	19.875	Passenger	28.40	3.2	225.0	108.1	70.3	192.9	3.517	17.2	25.0	1307059200000000000	91.370778	1	30.550278	31.251605	Buy
3	Acura	RL	8.588	29.725	Passenger	42.00	3.5	210.0	114.6	71.4	196.6	3.850	18.0	22.0	1299715200000000000	91.389779	1	40.841002	39.616438	Not Buy
4	Audi	A4	20.397	22.255	Passenger	23.99	1.8	150.0	102.6	68.2	178.0	2.998	16.4	27.0	1318032000000000000	62.777639	1	27.456097	27.611210	Buy
5	Audi	A6	18.780	23.555	Passenger	33.95	2.8	200.0	108.7	76.1	192.0	3.561	18.5	22.0	1312848000000000000	84.565105	1	33.083205	33.459226	Not Buy
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
145	Volkswagen	Golf	9.761	11.425	Passenger	14.90	2.0	115.0	98.9	68.3	163.3	2.767	14.5	26.0	1295827200000000000	46.943877	1	16.282528	14.712648	Not Buy
146	Volkswagen	Jetta	83.721	13.240	Passenger	16.70	2.0	115.0	98.9	68.3	172.3	2.853	14.5	26.0	1314403200000000000	47.638237	1	16.551949	16.557096	Not Buy
147	Volkswagen	Passat	51.102	16.725	Passenger	21.20	1.8	150.0	106.4	68.5	184.1	3.043	16.4	27.0	1351555200000000000	61.701381	1	21.588980	23.150742	Buy
148	Volkswagen	Cabrio	9.569	16.575	Passenger	19.99	2.0	115.0	97.4	66.7	160.4	3.079	13.7	26.0	1306800000000000000	48.907372	1	20.465401	20.153043	Buy
149	Volkswagen	GTI	5.596	13.760	Passenger	17.50	2.0	115.0	98.9	68.3	163.3	2.762	14.6	26.0	1301616000000000000	47.946841	1	18.326620	16.754342	Not Buy

117 rows × 20 columns
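The labeling rule is "Buy" when Pred2 is above the listed price and "Not Buy" otherwise. Because there are only two outcomes, the same labels can be produced with a single np.where call. The sketch below applies it to the first five rows of the table above. The `Margin` column is an extra introduced for the example and does not exist in the notebook's DataFrame.

```python
import numpy as np
import pandas as pd

# First five rows of the table above (Price_in_thousands and Pred2 columns)
sample = pd.DataFrame({
    "Model": ["Integra", "TL", "RL", "A4", "A6"],
    "Price_in_thousands": [21.50, 28.40, 42.00, 23.99, 33.95],
    "Pred2": [20.904723, 31.251605, 39.616438, 27.611210, 33.459226],
})

# "Buy" when the predicted price is above the listed price, else "Not Buy"
sample["Trade Decision"] = np.where(
    sample["Pred2"] > sample["Price_in_thousands"], "Buy", "Not Buy"
)

# Hypothetical extra column: how far the prediction is above (+) or below (-) the price
sample["Margin"] = sample["Pred2"] - sample["Price_in_thousands"]
print(sample)
```

The resulting labels match the Trade Decision column shown for these five cars.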

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
clf = KNeighborsClassifier()
X = df[cols]
Y = df['Trade Decision']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
# Fit on the training split only, so the test split stays unseen for evaluation
clf.fit(X_train,Y_train)
df['Pred3'] = clf.predict(X)

c1 = alt.Chart(df).mark_circle().encode(
    x = "__year_resale_value",
    y = "Price_in_thousands",
    color = 'Trade Decision'
)
c2 = alt.Chart(df).mark_circle().encode(
    x = "__year_resale_value",
    y = "Price_in_thousands",
    color = 'Pred3'
)
c1|c2

loss = log_loss(Y_test, clf.predict_proba(X_test))
loss

0.6343968063734172

(df["Trade Decision"] == df["Pred3"]).value_counts()

True     61
False    56
dtype: int64

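The classifier agrees with the Trade Decision label for only 61 of the 117 cars (about 52%), which is barely better than guessing, so the model cannot be used to make the trade decision. One likely cause is that the features are not scaled: Latest_Launch has values on the order of 1e18, so it dominates the distances that KNeighborsClassifier uses to find neighbors.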
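The effect of one very large feature on a nearest-neighbor classifier can be reproduced on synthetic data. In the sketch below (all data and names are made up for the illustration), the label depends only on a small-scale feature, while a second feature is pure noise with values around 1e18, similar in scale to the nanosecond timestamps. Standardizing the features before KNeighborsClassifier lets the informative feature count again. This is a demonstration of the mechanism, not a result for the car dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): one informative feature, one huge-scale noise feature
rng = np.random.default_rng(0)
n = 400
informative = rng.normal(size=n)
huge_noise = rng.normal(size=n) * 1e18
X_demo = np.column_stack([informative, huge_noise])
y_demo = np.where(informative > 0, "Buy", "Not Buy")

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.4, random_state=0)

# Without scaling, distances are determined almost entirely by the noise feature
raw_knn = KNeighborsClassifier().fit(X_tr, y_tr)

# With scaling, both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)

raw_acc = raw_knn.score(X_te, y_te)
scaled_acc = scaled_knn.score(X_te, y_te)
print(raw_acc, scaled_acc)
```

Scaling alone may not rescue the classifier in this project, because the labels come from the sign of the regression error, which may carry little signal to begin with.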

Summary¶

I trained a linear regression model to predict car prices. With all features included, the model reaches an R^2 of about 0.955 on the training set. Comparing the correlation of each variable with the price, I found that __year_resale_value and Power_perf_factor were the two most important. The model built on the four most dominant features reached an R^2 of about 0.96 on the training set and 0.955 on the test set, so the prediction was slightly more accurate with the most dominant features than with all of them. I also tried to predict trading decisions, but the data and the model were not sufficient to draw conclusions.

References¶

Dataset (Kaggle, Car sales): https://www.kaggle.com/datasets/gagandeep16/car-sales?resource=download
pandas.DataFrame.astype: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
plotly: https://plotly.com/python/plotly-express/
numpy.select: https://numpy.org/doc/stable/reference/generated/numpy.select.html

Created in Deepnote
