Predict The Car Price

Author: Kehan Li

Course Project, UC Irvine, Math 10, S22


My project builds a LinearRegression model to predict vehicle prices from historical data. The input features are Latest_Launch, Sales_in_thousands, __year_resale_value, Passenger, Engine_size, Horsepower, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, and Power_perf_factor. The features most strongly correlated with price are __year_resale_value, Power_perf_factor, Horsepower, Engine_size, and Curb_weight. I examined the relationship between each variable and the price, then kept the most significant variables to bring the LinearRegression predictions closer to the actual values. I also built a KNeighborsClassifier to decide whether a car is worth buying.

Import data:

import pandas as pd
df = pd.read_csv("Car_sales.csv").dropna()
Manufacturer Model Sales_in_thousands __year_resale_value Vehicle_type Price_in_thousands Engine_size Horsepower Wheelbase Width Length Curb_weight Fuel_capacity Fuel_efficiency Latest_Launch Power_perf_factor
0 Acura Integra 16.919 16.360 Passenger 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 2/2/2012 58.280150
1 Acura TL 39.384 19.875 Passenger 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 6/3/2011 91.370778
3 Acura RL 8.588 29.725 Passenger 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 3/10/2011 91.389779
4 Audi A4 20.397 22.255 Passenger 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 10/8/2011 62.777639
5 Audi A6 18.780 23.555 Passenger 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 8/9/2011 84.565105
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145 Volkswagen Golf 9.761 11.425 Passenger 14.90 2.0 115.0 98.9 68.3 163.3 2.767 14.5 26.0 1/24/2011 46.943877
146 Volkswagen Jetta 83.721 13.240 Passenger 16.70 2.0 115.0 98.9 68.3 172.3 2.853 14.5 26.0 8/27/2011 47.638237
147 Volkswagen Passat 51.102 16.725 Passenger 21.20 1.8 150.0 106.4 68.5 184.1 3.043 16.4 27.0 10/30/2012 61.701381
148 Volkswagen Cabrio 9.569 16.575 Passenger 19.99 2.0 115.0 97.4 66.7 160.4 3.079 13.7 26.0 5/31/2011 48.907372
149 Volkswagen GTI 5.596 13.760 Passenger 17.50 2.0 115.0 98.9 68.3 163.3 2.762 14.6 26.0 4/1/2011 47.946841

117 rows × 16 columns

Manufacturer Distribution:

array(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
       'Chrysler', 'Dodge', 'Ford', 'Honda', 'Hyundai', 'Infiniti',
       'Jeep', 'Lexus', 'Lincoln', 'Mitsubishi', 'Mercury', 'Mercedes-B',
       'Nissan', 'Oldsmobile', 'Plymouth', 'Pontiac', 'Porsche', 'Saturn',
       'Toyota', 'Volkswagen'], dtype=object)
Ford          10
Dodge          9
Toyota         8
Chevrolet      8
Mitsubishi     7
Mercury        6
Honda          5
Volkswagen     5
Nissan         5
Pontiac        5
Chrysler       5
Oldsmobile     4
Buick          4
Mercedes-B     4
Plymouth       3
Acura          3
Jeep           3
Lexus          3
Saturn         3
Audi           3
Porsche        3
Hyundai        3
Cadillac       3
BMW            2
Lincoln        2
Infiniti       1
Name: Manufacturer, dtype: int64
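
The two outputs above look like the results of unique() and value_counts() on the Manufacturer column; the cell that produced them is not shown. A minimal sketch of those calls, using a small hypothetical stand-in frame (df_demo is not the project's data):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Manufacturer column of Car_sales.csv.
df_demo = pd.DataFrame(
    {"Manufacturer": ["Ford", "Dodge", "Ford", "Acura", "Dodge", "Ford"]}
)

# Distinct manufacturer names, in order of first appearance.
manufacturers = df_demo["Manufacturer"].unique()

# Number of rows (models) per manufacturer, sorted from most to fewest.
counts = df_demo["Manufacturer"].value_counts()

print(manufacturers)
print(counts)
```

Each row of the dataset is one model, so these counts are models per manufacturer, not cars produced.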
import altair as alt
c3 = alt.Chart(df).mark_bar().encode(
    x = "Manufacturer",
    y = "count()",
)
c3

The Altair chart counts the rows for each manufacturer. By looking at the chart above, we can say that Ford (10 models) and Dodge (9 models) have more models in the dataset than any other manufacturer.

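import plotly.express as px
avg = pd.DataFrame(df.groupby('Manufacturer')['Sales_in_thousands'].mean())

# px.bar is assumed here; the name of the chart call was lost in extraction
fig = px.bar(avg, x=avg.index, y=avg["Sales_in_thousands"])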

The plotly chart shows the average sales per model for each manufacturer. By looking at the chart above, we can say that Ford has the highest average sales of all manufacturers.

Create the linear regression using all features:

Manufacturer            object
Model                   object
Sales_in_thousands     float64
__year_resale_value    float64
Vehicle_type            object
Price_in_thousands     float64
Engine_size            float64
Horsepower             float64
Wheelbase              float64
Width                  float64
Length                 float64
Curb_weight            float64
Fuel_capacity          float64
Fuel_efficiency        float64
Latest_Launch           object
Power_perf_factor      float64
dtype: object

Convert the object-typed columns needed for the regression (Latest_Launch and Vehicle_type) to numeric values.
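
The conversion cell itself is not shown, but the later tables contain Latest_Launch as a large integer and a new Passenger column of ones. A minimal sketch of one way to get there, assuming the dates are in month/day/year format and that Passenger marks rows whose Vehicle_type is "Passenger" (the frame demo and the value "Car" are hypothetical):

```python
import pandas as pd

# Hypothetical stand-in rows; only the two object columns of interest.
demo = pd.DataFrame({
    "Vehicle_type": ["Passenger", "Car", "Passenger"],
    "Latest_Launch": ["2/2/2012", "6/3/2011", "3/10/2011"],
})

# Parse the date strings, then store them as nanoseconds since 1970-01-01.
launch = pd.to_datetime(demo["Latest_Launch"], format="%m/%d/%Y")
demo["Latest_Launch"] = launch.astype("datetime64[ns]").astype("int64")

# Indicator column: 1 for passenger vehicles, 0 otherwise.
demo["Passenger"] = (demo["Vehicle_type"] == "Passenger").astype(int)

print(demo)
```

Because the timestamps are on the order of 10^18, any regression coefficient fitted on this column will be extremely small in absolute value.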

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
Latest_Launch         -3.532375e-17
Sales_in_thousands    -7.167365e-03
__year_resale_value    8.218924e-01
Passenger             -5.986566e-03
Engine_size           -1.527691e-02
Horsepower            -5.138488e-02
Wheelbase              1.452939e-01
Width                 -9.911713e-03
Length                -2.618168e-02
Curb_weight            1.189987e-02
Fuel_capacity          1.283967e-01
Power_perf_factor      2.859755e-01
dtype: float64
array([-3.53237502e-17, -7.16736497e-03,  8.21892425e-01, -5.98656606e-03,
       -1.52769076e-02, -5.13848849e-02,  1.45293877e-01, -9.91171336e-03,
       -2.61816793e-02,  1.18998664e-02,  1.28396731e-01,  2.85975504e-01])
print(f"The equation is: Pred Price = {cols[0]} x {reg.coef_[0]} + {cols[1]} x {reg.coef_[1]}+{cols[2]} x {reg.coef_[2]}+{cols[3]} x {reg.coef_[3]}+{cols[4]} x {reg.coef_[4]}+{cols[5]} x {reg.coef_[5]} +{cols[6]} x {reg.coef_[6]}+{cols[7]} x {reg.coef_[7]}+{cols[8]} x {reg.coef_[8]}+{cols[9]} x {reg.coef_[9]}+{cols[10]} x {reg.coef_[10]}+{cols[11]} x {reg.coef_[11]}+ {reg.intercept_}") 
The equation is: Pred Price = Latest_Launch x -3.532375019547313e-17 + Sales_in_thousands x -0.007167364973728246+__year_resale_value x 0.8218924254079063+Passenger x -0.00598656606270679+Engine_size x -0.015276907576185477+Horsepower x -0.05138488489098564 +Wheelbase x 0.14529387667785817+Width x -0.009911713362671822+Length x -0.02618167933552728+Curb_weight x 0.01189986643110451+Fuel_capacity x 0.1283967313215668+Power_perf_factor x 0.28597550410794764+ 33.94504156612675

The coefficients above give the fitted linear relationship between Price_in_thousands and the other features.
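
The fitting step and the long print statement can be written more compactly. The sketch below is an illustration on synthetic data, not the project's own cell: the feature names f0, f1, f2, the true coefficients, and the rounding to three decimals are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic, noiseless data with a known linear relationship (hypothetical).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f0", "f1", "f2"])
y_demo = 2.0 * X_demo["f0"] - 1.0 * X_demo["f1"] + 0.5 * X_demo["f2"] + 10.0

reg_demo = LinearRegression().fit(X_demo, y_demo)

# Build "name x coefficient" terms in a loop instead of indexing each one by hand.
terms = " + ".join(
    f"{name} x {coef:.3f}" for name, coef in zip(X_demo.columns, reg_demo.coef_)
)
equation = f"Pred Price = {terms} + {reg_demo.intercept_:.3f}"
print(equation)
```

Pairing names with coefficients through zip keeps the equation correct if the feature list changes length.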

df["Pred"] = reg.predict(df[cols])
Manufacturer Model Sales_in_thousands __year_resale_value Vehicle_type Price_in_thousands Engine_size Horsepower Wheelbase Width Length Curb_weight Fuel_capacity Fuel_efficiency Latest_Launch Power_perf_factor Passenger Pred
0 Acura Integra 16.919 16.360 Passenger 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 1328140800000000000 58.280150 1 21.043551
1 Acura TL 39.384 19.875 Passenger 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 1307059200000000000 91.370778 1 30.550278
3 Acura RL 8.588 29.725 Passenger 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 1299715200000000000 91.389779 1 40.841002
4 Audi A4 20.397 22.255 Passenger 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 1318032000000000000 62.777639 1 27.456097
5 Audi A6 18.780 23.555 Passenger 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 1312848000000000000 84.565105 1 33.083205
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145 Volkswagen Golf 9.761 11.425 Passenger 14.90 2.0 115.0 98.9 68.3 163.3 2.767 14.5 26.0 1295827200000000000 46.943877 1 16.282528
146 Volkswagen Jetta 83.721 13.240 Passenger 16.70 2.0 115.0 98.9 68.3 172.3 2.853 14.5 26.0 1314403200000000000 47.638237 1 16.551949
147 Volkswagen Passat 51.102 16.725 Passenger 21.20 1.8 150.0 106.4 68.5 184.1 3.043 16.4 27.0 1351555200000000000 61.701381 1 21.588980
148 Volkswagen Cabrio 9.569 16.575 Passenger 19.99 2.0 115.0 97.4 66.7 160.4 3.079 13.7 26.0 1306800000000000000 48.907372 1 20.465401
149 Volkswagen GTI 5.596 13.760 Passenger 17.50 2.0 115.0 98.9 68.3 163.3 2.762 14.6 26.0 1301616000000000000 47.946841 1 18.326620

117 rows × 18 columns


Build training and test set:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
d = df["Price_in_thousands"]  # target; assumed, since the cell defining d is not shown
c_train,c_test,d_train,d_test = train_test_split(df[cols],d,test_size = 0.2, random_state=0)
model = LinearRegression().fit(c_train,d_train)
pred= model.predict(c_train)
model.score(c_train, d_train)

Feature selection:

c = df[cols2]
corr = c.corr()
corr.style.background_gradient(cmap='coolwarm')
Latest_Launch Sales_in_thousands __year_resale_value Passenger Engine_size Horsepower Wheelbase Width Length Curb_weight Fuel_capacity Power_perf_factor Price_in_thousands
Latest_Launch 1.000000 0.124997 -0.066848 0.125991 0.010749 0.024798 0.008621 0.050816 0.059959 -0.071564 0.020216 0.009201 -0.051249
Sales_in_thousands 0.124997 1.000000 -0.275426 -0.278774 0.038111 -0.152538 0.406839 0.177802 0.272336 0.067184 0.138045 -0.175562 -0.251705
__year_resale_value -0.066848 -0.275426 1.000000 0.091679 0.527187 0.773110 -0.053685 0.178128 0.025390 0.363274 0.324796 0.829511 0.954757
Passenger 0.125991 -0.278774 0.091679 1.000000 -0.182515 0.045867 -0.385062 -0.220744 -0.109779 -0.469247 -0.586927 0.051096 0.076303
Engine_size 0.010749 0.038111 0.527187 -0.182515 1.000000 0.861618 0.410020 0.671756 0.537343 0.742831 0.616862 0.841005 0.649170
Horsepower 0.024798 -0.152538 0.773110 0.045867 0.861618 1.000000 0.225905 0.507275 0.400968 0.598603 0.479790 0.994071 0.853455
Wheelbase 0.008621 0.406839 -0.053685 -0.385062 0.410020 0.225905 1.000000 0.675559 0.853669 0.675609 0.658654 0.200228 0.067042
Width 0.050816 0.177802 0.178128 -0.220744 0.671756 0.507275 0.675559 1.000000 0.743226 0.735957 0.672191 0.478889 0.301292
Length 0.059959 0.272336 0.025390 -0.109779 0.537343 0.400968 0.853669 0.743226 1.000000 0.684305 0.562504 0.366831 0.182592
Curb_weight -0.071564 0.067184 0.363274 -0.469247 0.742831 0.598603 0.675609 0.735957 0.684305 1.000000 0.847994 0.597586 0.511400
Fuel_capacity 0.020216 0.138045 0.324796 -0.586927 0.616862 0.479790 0.658654 0.672191 0.562504 0.847994 1.000000 0.478484 0.406496
Power_perf_factor 0.009201 -0.175562 0.829511 0.051096 0.841005 0.994071 0.200228 0.478889 0.366831 0.597586 0.478484 1.000000 0.905002
Price_in_thousands -0.051249 -0.251705 0.954757 0.076303 0.649170 0.853455 0.067042 0.301292 0.182592 0.511400 0.406496 0.905002 1.000000

From the correlation table above, the following pairs of variables have a strong positive association, meaning that when one variable increases the other tends to increase as well: (__year_resale_value, Price_in_thousands) at 0.95, (Horsepower, Price_in_thousands) at 0.85, (Horsepower, Engine_size) at 0.86, (Length, Wheelbase) at 0.85, (Curb_weight, Engine_size) at 0.74, (Fuel_capacity, Curb_weight) at 0.85, (Power_perf_factor, __year_resale_value) at 0.83, (Power_perf_factor, Price_in_thousands) at 0.91, (Power_perf_factor, Engine_size) at 0.84, and (Power_perf_factor, Horsepower) at 0.99. Similarly, the pairs (Fuel_efficiency, Engine_size), (Fuel_efficiency, Curb_weight), and (Fuel_efficiency, Fuel_capacity) have a strong negative association, meaning that when one variable increases the other tends to decrease. Fuel_efficiency is not among the columns of the table above, so those three values are not shown there.
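
Strongly associated pairs can also be read off a correlation matrix programmatically rather than by eye. The helper below is a sketch: the name strong_pairs and the 0.7 cut-off are assumptions, and the small matrix reuses three values copied from the table above.

```python
import pandas as pd

def strong_pairs(corr, threshold=0.7):
    """Return (col_a, col_b, r) for each unordered pair with |r| >= threshold."""
    names = list(corr.columns)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only, skips the diagonal
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append((names[i], names[j], r))
    # Strongest association first.
    return sorted(pairs, key=lambda p: -abs(p[2]))

names = ["Power_perf_factor", "Horsepower", "Wheelbase"]
corr_demo = pd.DataFrame(
    [[1.000000, 0.994071, 0.200228],
     [0.994071, 1.000000, 0.225905],
     [0.200228, 0.225905, 1.000000]],
    index=names, columns=names,
)

result = strong_pairs(corr_demo)
print(result)
```

Using the absolute value means strongly negative pairs are reported as well.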

corr = df.corr()
corr.sort_values(["Price_in_thousands"], ascending = False, inplace = True)
corr["Price_in_thousands"]
Price_in_thousands     1.000000
Pred                   0.976777
__year_resale_value    0.954757
Power_perf_factor      0.905002
Horsepower             0.853455
Engine_size            0.649170
Curb_weight            0.511400
Fuel_capacity          0.406496
Width                  0.301292
Length                 0.182592
Passenger              0.076303
Wheelbase              0.067042
Latest_Launch         -0.051249
Sales_in_thousands    -0.251705
Fuel_efficiency       -0.479539
Name: Price_in_thousands, dtype: float64

The features most strongly correlated with the target Price_in_thousands (correlation > 0.5) are __year_resale_value, Power_perf_factor, Horsepower, Engine_size, and Curb_weight.
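
The 0.5 threshold can be applied directly to the correlation column. In this sketch the Series price_corr is typed in by hand from a subset of the values printed above; in the notebook it would come from the correlation matrix instead.

```python
import pandas as pd

# Correlations with Price_in_thousands, copied from the output above (subset).
price_corr = pd.Series({
    "__year_resale_value": 0.954757,
    "Power_perf_factor": 0.905002,
    "Horsepower": 0.853455,
    "Engine_size": 0.649170,
    "Curb_weight": 0.511400,
    "Fuel_capacity": 0.406496,
    "Fuel_efficiency": -0.479539,
})

# Keep only features whose correlation with price exceeds 0.5.
selected = price_corr[price_corr > 0.5].index.tolist()
print(selected)
```

Filtering on price_corr.abs() instead would also catch strong negative correlations; with these values the selection is the same, since Fuel_efficiency stays below 0.5 in magnitude.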

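import altair as alt
# The chart encodings were lost in extraction; plotting each feature
# against Price_in_thousands is an assumption.
charts = [
    alt.Chart(df).mark_circle().encode(x = i, y = "Price_in_thousands")
    for i in cols2
]
alt.vconcat(*charts)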

Create the linear regression using the 4 most important features:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x2 = sc.fit_transform(x)  # the scaler must be fitted before it can transform
from sklearn.linear_model import LinearRegression
array([ 0.79057747, -1.48836066,  3.0635297 ,  0.21038971])
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x2,y,test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(x_train,y_train)
pred= model.predict(x_train)
model.score(x_train, y_train)
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression().fit(x_train, y_train)
print("R^2 on training set: ",linear_reg.score(x_train,y_train))
print("R^2 on testing set: ",linear_reg.score(x_test,y_test))
R^2 on training set:  0.9602604670542807
R^2 on testing set:  0.9554237019850436
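
For a regressor, the score method of LinearRegression returns the coefficient of determination R^2 rather than a classification accuracy, so the two numbers above are R^2 values. A minimal sketch of how R^2 is computed, with a hypothetical helper name r_squared and toy inputs:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # predictions match exactly
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
print(perfect, mean_only)
```

A value near 1 means the predictions explain most of the variance in price; a value of 0 means the model does no better than predicting the mean price for every car.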

Check whether the model overfits:

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(x_train,y_train)
pred= linreg.predict(x_test)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)
from sklearn.metrics import r2_score
print("Our model explains almost 96% of the variability of the training data")
Our model explains almost 96% of the variability of the training data

The training and test scores are close (about 0.960 and 0.955), so we assume that the model does not overfit and is reliable.
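
A single train/test split depends on which rows happen to land in the test set. One common way to make the overfitting check less dependent on that split is k-fold cross-validation. The sketch below is not part of the original analysis: it uses synthetic data with 117 rows and 4 features, and the coefficients and noise level are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 117 rows, 4 features, linear signal plus noise (all hypothetical).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(117, 4))
true_coef = np.array([0.8, -1.5, 3.0, 0.2])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=117)

# Five R^2 scores, each measured on a fold the model was not trained on.
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

If the fold scores are all close to the training score, that supports the conclusion that the model is not overfitting.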

Create the KNeighborsClassifier to determine if the car is worth buying:

import numpy as np
conditions = [df['Pred2'] > df['Price_in_thousands'],df['Pred2'] < df['Price_in_thousands']]
choices = ['Buy','Not Buy']
df['Trade Decision'] = np.select(conditions, choices, default='Not Buy')
Manufacturer Model Sales_in_thousands __year_resale_value Vehicle_type Price_in_thousands Engine_size Horsepower Wheelbase Width Length Curb_weight Fuel_capacity Fuel_efficiency Latest_Launch Power_perf_factor Passenger Pred Pred2 Trade Decision
0 Acura Integra 16.919 16.360 Passenger 21.50 1.8 140.0 101.2 67.3 172.4 2.639 13.2 28.0 1328140800000000000 58.280150 1 21.043551 20.904723 Not Buy
1 Acura TL 39.384 19.875 Passenger 28.40 3.2 225.0 108.1 70.3 192.9 3.517 17.2 25.0 1307059200000000000 91.370778 1 30.550278 31.251605 Buy
3 Acura RL 8.588 29.725 Passenger 42.00 3.5 210.0 114.6 71.4 196.6 3.850 18.0 22.0 1299715200000000000 91.389779 1 40.841002 39.616438 Not Buy
4 Audi A4 20.397 22.255 Passenger 23.99 1.8 150.0 102.6 68.2 178.0 2.998 16.4 27.0 1318032000000000000 62.777639 1 27.456097 27.611210 Buy
5 Audi A6 18.780 23.555 Passenger 33.95 2.8 200.0 108.7 76.1 192.0 3.561 18.5 22.0 1312848000000000000 84.565105 1 33.083205 33.459226 Not Buy
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
145 Volkswagen Golf 9.761 11.425 Passenger 14.90 2.0 115.0 98.9 68.3 163.3 2.767 14.5 26.0 1295827200000000000 46.943877 1 16.282528 14.712648 Not Buy
146 Volkswagen Jetta 83.721 13.240 Passenger 16.70 2.0 115.0 98.9 68.3 172.3 2.853 14.5 26.0 1314403200000000000 47.638237 1 16.551949 16.557096 Not Buy
147 Volkswagen Passat 51.102 16.725 Passenger 21.20 1.8 150.0 106.4 68.5 184.1 3.043 16.4 27.0 1351555200000000000 61.701381 1 21.588980 23.150742 Buy
148 Volkswagen Cabrio 9.569 16.575 Passenger 19.99 2.0 115.0 97.4 66.7 160.4 3.079 13.7 26.0 1306800000000000000 48.907372 1 20.465401 20.153043 Buy
149 Volkswagen GTI 5.596 13.760 Passenger 17.50 2.0 115.0 98.9 68.3 163.3 2.762 14.6 26.0 1301616000000000000 47.946841 1 18.326620 16.754342 Not Buy

117 rows × 20 columns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
clf = KNeighborsClassifier()
X = df[cols]
Y = df['Trade Decision']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
clf.fit(X_train, Y_train)  # fit on the training split so the test split stays unseen
df['Pred3'] = clf.predict(X)
c1 = alt.Chart(df).mark_circle().encode(
    x = "__year_resale_value",
    y = "Price_in_thousands",
    color = 'Trade Decision'
)
c2 = alt.Chart(df).mark_circle().encode(
    x = "__year_resale_value",
    y = "Price_in_thousands",
    color = 'Pred3'
)
loss = log_loss(Y_test, clf.predict_proba(X_test))
(df["Trade Decision"] == df["Pred3"]).value_counts()
True     61
False    56
dtype: int64

The classifier agrees with the Trade Decision label on only 61 of 117 rows (about 52%), which is barely better than guessing, so the model cannot be used to make the trade decision.
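
To judge whether an agreement rate such as 61 of 117 is meaningful, it helps to compare it with a trivial baseline that always predicts the most common label. The sketch below uses a hypothetical helper agreement_and_baseline and made-up labels, since the actual split between Buy and Not Buy is not shown.

```python
import pandas as pd

def agreement_and_baseline(labels, predictions):
    """Return (share of matching rows, share of the most common label)."""
    labels = pd.Series(labels).reset_index(drop=True)
    predictions = pd.Series(predictions).reset_index(drop=True)
    agreement = float((labels == predictions).mean())
    baseline = float(labels.value_counts(normalize=True).iloc[0])
    return agreement, baseline

# Made-up labels and predictions for illustration only.
demo_labels = ["Buy", "Not Buy", "Not Buy", "Buy"]
demo_preds = ["Buy", "Buy", "Not Buy", "Not Buy"]
agreement, baseline = agreement_and_baseline(demo_labels, demo_preds)
print(agreement, baseline)

# Agreement rate reported above: 61 matches out of 117 rows.
reported_rate = 61 / 117
print(round(reported_rate, 3))
```

A classifier is only informative if its agreement rate clearly exceeds the baseline.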


I trained a linear regression model to predict car prices. With all features included, the model's R^2 score is about 95.5%. I compared how strongly each variable relates to the price and found that __year_resale_value and Power_perf_factor were the two most important. The model built on the most dominant features scored about 96% on the training set and 95.5% on the test set. I found that the price prediction was more accurate when the model was built with the most dominant features. In addition, I tried to predict trading decisions and found that the data and models were not sufficient to draw conclusions.