Predict The Car Price
Contents
Predict The Car Price¶
Author: Kehan Li
Course Project, UC Irvine, Math 10, S22
Introduction¶
My project is to build a LinearRegression and predict future vehicle prices based on historical data, which setup data is Latest_Launch, Sales_in_thousands, __year_resale_value, Passenger, Engine_size, setup, Wheelbase, Width, Length, Curb_weight, Fuel_capacity, Power_perf_factor, and found that the most correlated dataset is __year_resale_value, Engine_size, Engine_size, Curb_weight, Power_perf_factor. I tried to figure out the relationship between each variable and the final output, and chose the most significant variables to make the LinearRegression closer to the exact value. Also, I tried to build a KneighborsClassifier to determine if the car is worth buying.
Import data:¶
import pandas as pd
df = pd.read_csv("Car_sales.csv").dropna()
df
Manufacturer | Model | Sales_in_thousands | __year_resale_value | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | Passenger | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 2/2/2012 | 58.280150 |
1 | Acura | TL | 39.384 | 19.875 | Passenger | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 6/3/2011 | 91.370778 |
3 | Acura | RL | 8.588 | 29.725 | Passenger | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 3/10/2011 | 91.389779 |
4 | Audi | A4 | 20.397 | 22.255 | Passenger | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 10/8/2011 | 62.777639 |
5 | Audi | A6 | 18.780 | 23.555 | Passenger | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 8/9/2011 | 84.565105 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | Volkswagen | Golf | 9.761 | 11.425 | Passenger | 14.90 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.767 | 14.5 | 26.0 | 1/24/2011 | 46.943877 |
146 | Volkswagen | Jetta | 83.721 | 13.240 | Passenger | 16.70 | 2.0 | 115.0 | 98.9 | 68.3 | 172.3 | 2.853 | 14.5 | 26.0 | 8/27/2011 | 47.638237 |
147 | Volkswagen | Passat | 51.102 | 16.725 | Passenger | 21.20 | 1.8 | 150.0 | 106.4 | 68.5 | 184.1 | 3.043 | 16.4 | 27.0 | 10/30/2012 | 61.701381 |
148 | Volkswagen | Cabrio | 9.569 | 16.575 | Passenger | 19.99 | 2.0 | 115.0 | 97.4 | 66.7 | 160.4 | 3.079 | 13.7 | 26.0 | 5/31/2011 | 48.907372 |
149 | Volkswagen | GTI | 5.596 | 13.760 | Passenger | 17.50 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.762 | 14.6 | 26.0 | 4/1/2011 | 47.946841 |
117 rows × 16 columns
Manufacturer Distribution:¶
df.Manufacturer.unique()
array(['Acura', 'Audi', 'BMW', 'Buick', 'Cadillac', 'Chevrolet',
'Chrysler', 'Dodge', 'Ford', 'Honda', 'Hyundai', 'Infiniti',
'Jeep', 'Lexus', 'Lincoln', 'Mitsubishi', 'Mercury', 'Mercedes-B',
'Nissan', 'Oldsmobile', 'Plymouth', 'Pontiac', 'Porsche', 'Saturn',
'Toyota', 'Volkswagen'], dtype=object)
df.Manufacturer.value_counts()
Ford 10
Dodge 9
Toyota 8
Chevrolet 8
Mitsubishi 7
Mercury 6
Honda 5
Volkswagen 5
Nissan 5
Pontiac 5
Chrysler 5
Oldsmobile 4
Buick 4
Mercedes-B 4
Plymouth 3
Acura 3
Jeep 3
Lexus 3
Saturn 3
Audi 3
Porsche 3
Hyundai 3
Cadillac 3
BMW 2
Lincoln 2
Infiniti 1
Name: Manufacturer, dtype: int64
import altair as alt
c3 = alt.Chart(df).mark_bar().encode(
x = "Manufacturer",
y = "count()",
color="Model",
tooltip=["Model","Sales_in_thousands","Price_in_thousands"]
)
c3
Draw the Altair diagram based on the type and number of manufacturer. By looking at the chart above, we can say that manufacturers Dodge and Ford produce more cars than any other manufacturer.
import plotly.express as px
avg = pd.DataFrame(df.groupby('Manufacturer')['Sales_in_thousands'].mean())
fig = px.bar(df, x=avg.index, y=avg["Sales_in_thousands"])
fig.show()
Draw the plotly diagram based on the type of manufacturers and their sales. By looking at the chart above, We can say that the manufacturers Ford has the highest sales compared to other manufacturers.
Create the Linear Regression for all condition:¶
df.dtypes
Manufacturer object
Model object
Sales_in_thousands float64
__year_resale_value float64
Vehicle_type object
Price_in_thousands float64
Engine_size float64
Horsepower float64
Wheelbase float64
Width float64
Length float64
Curb_weight float64
Fuel_capacity float64
Fuel_efficiency float64
Latest_Launch object
Power_perf_factor float64
dtype: object
df["Passenger"]=0
df["Vehicle_type"]=="Passenger"
df.loc[df["Vehicle_type"]=="Passenger","Passenger"]=1
df["Latest_Launch"]=pd.to_datetime(df["Latest_Launch"]).astype(int)#New
It Converts data type for object
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cols=['Latest_Launch','Sales_in_thousands','__year_resale_value','Passenger','Engine_size','Horsepower','Wheelbase','Width','Length','Curb_weight','Fuel_capacity','Power_perf_factor']
reg.fit(df[cols],df["Price_in_thousands"])
pd.Series(reg.coef_,index=cols)
Latest_Launch -3.532375e-17
Sales_in_thousands -7.167365e-03
__year_resale_value 8.218924e-01
Passenger -5.986566e-03
Engine_size -1.527691e-02
Horsepower -5.138488e-02
Wheelbase 1.452939e-01
Width -9.911713e-03
Length -2.618168e-02
Curb_weight 1.189987e-02
Fuel_capacity 1.283967e-01
Power_perf_factor 2.859755e-01
dtype: float64
reg.coef_
array([-3.53237502e-17, -7.16736497e-03, 8.21892425e-01, -5.98656606e-03,
-1.52769076e-02, -5.13848849e-02, 1.45293877e-01, -9.91171336e-03,
-2.61816793e-02, 1.18998664e-02, 1.28396731e-01, 2.85975504e-01])
print(f"The equation is: Pred Price = {cols[0]} x {reg.coef_[0]} + {cols[1]} x {reg.coef_[1]}+{cols[2]} x {reg.coef_[2]}+{cols[3]} x {reg.coef_[3]}+{cols[4]} x {reg.coef_[4]}+{cols[5]} x {reg.coef_[5]} +{cols[6]} x {reg.coef_[6]}+{cols[7]} x {reg.coef_[7]}+{cols[8]} x {reg.coef_[8]}+{cols[9]} x {reg.coef_[9]}+{cols[10]} x {reg.coef_[10]}+{cols[11]} x {reg.coef_[11]}+ {reg.intercept_}")
The equation is: Pred Price = Latest_Launch x -3.532375019547313e-17 + Sales_in_thousands x -0.007167364973728246+__year_resale_value x 0.8218924254079063+Passenger x -0.00598656606270679+Engine_size x -0.015276907576185477+Horsepower x -0.05138488489098564 +Wheelbase x 0.14529387667785817+Width x -0.009911713362671822+Length x -0.02618167933552728+Curb_weight x 0.01189986643110451+Fuel_capacity x 0.1283967313215668+Power_perf_factor x 0.28597550410794764+ 33.94504156612675
We find the relationship between Price and other features
df["Pred"] = reg.predict(df[cols])
df
Manufacturer | Model | Sales_in_thousands | __year_resale_value | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor | Passenger | Pred | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | Passenger | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 1328140800000000000 | 58.280150 | 1 | 21.043551 |
1 | Acura | TL | 39.384 | 19.875 | Passenger | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 1307059200000000000 | 91.370778 | 1 | 30.550278 |
3 | Acura | RL | 8.588 | 29.725 | Passenger | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 1299715200000000000 | 91.389779 | 1 | 40.841002 |
4 | Audi | A4 | 20.397 | 22.255 | Passenger | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 1318032000000000000 | 62.777639 | 1 | 27.456097 |
5 | Audi | A6 | 18.780 | 23.555 | Passenger | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 1312848000000000000 | 84.565105 | 1 | 33.083205 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | Volkswagen | Golf | 9.761 | 11.425 | Passenger | 14.90 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.767 | 14.5 | 26.0 | 1295827200000000000 | 46.943877 | 1 | 16.282528 |
146 | Volkswagen | Jetta | 83.721 | 13.240 | Passenger | 16.70 | 2.0 | 115.0 | 98.9 | 68.3 | 172.3 | 2.853 | 14.5 | 26.0 | 1314403200000000000 | 47.638237 | 1 | 16.551949 |
147 | Volkswagen | Passat | 51.102 | 16.725 | Passenger | 21.20 | 1.8 | 150.0 | 106.4 | 68.5 | 184.1 | 3.043 | 16.4 | 27.0 | 1351555200000000000 | 61.701381 | 1 | 21.588980 |
148 | Volkswagen | Cabrio | 9.569 | 16.575 | Passenger | 19.99 | 2.0 | 115.0 | 97.4 | 66.7 | 160.4 | 3.079 | 13.7 | 26.0 | 1306800000000000000 | 48.907372 | 1 | 20.465401 |
149 | Volkswagen | GTI | 5.596 | 13.760 | Passenger | 17.50 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.762 | 14.6 | 26.0 | 1301616000000000000 | 47.946841 | 1 | 18.326620 |
117 rows × 18 columns
d=df["Price_in_thousands"]
Build training and test set:¶
from sklearn.model_selection import train_test_split
c_train,c_test,d_train,d_test = train_test_split(df[cols],d,test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(c_train,d_train)
pred= model.predict(c_train)
model.score(c_train, d_train)
0.9553581564437191
Feature selection:¶
cols2=['Latest_Launch','Sales_in_thousands','__year_resale_value','Passenger','Engine_size','Horsepower','Wheelbase','Width','Length','Curb_weight','Fuel_capacity','Power_perf_factor','Price_in_thousands']
c = df[cols2]
c.corr()
corr = c.corr()
corr.style.background_gradient(cmap='coolwarm')
Latest_Launch | Sales_in_thousands | __year_resale_value | Passenger | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Power_perf_factor | Price_in_thousands | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Latest_Launch | 1.000000 | 0.124997 | -0.066848 | 0.125991 | 0.010749 | 0.024798 | 0.008621 | 0.050816 | 0.059959 | -0.071564 | 0.020216 | 0.009201 | -0.051249 |
Sales_in_thousands | 0.124997 | 1.000000 | -0.275426 | -0.278774 | 0.038111 | -0.152538 | 0.406839 | 0.177802 | 0.272336 | 0.067184 | 0.138045 | -0.175562 | -0.251705 |
__year_resale_value | -0.066848 | -0.275426 | 1.000000 | 0.091679 | 0.527187 | 0.773110 | -0.053685 | 0.178128 | 0.025390 | 0.363274 | 0.324796 | 0.829511 | 0.954757 |
Passenger | 0.125991 | -0.278774 | 0.091679 | 1.000000 | -0.182515 | 0.045867 | -0.385062 | -0.220744 | -0.109779 | -0.469247 | -0.586927 | 0.051096 | 0.076303 |
Engine_size | 0.010749 | 0.038111 | 0.527187 | -0.182515 | 1.000000 | 0.861618 | 0.410020 | 0.671756 | 0.537343 | 0.742831 | 0.616862 | 0.841005 | 0.649170 |
Horsepower | 0.024798 | -0.152538 | 0.773110 | 0.045867 | 0.861618 | 1.000000 | 0.225905 | 0.507275 | 0.400968 | 0.598603 | 0.479790 | 0.994071 | 0.853455 |
Wheelbase | 0.008621 | 0.406839 | -0.053685 | -0.385062 | 0.410020 | 0.225905 | 1.000000 | 0.675559 | 0.853669 | 0.675609 | 0.658654 | 0.200228 | 0.067042 |
Width | 0.050816 | 0.177802 | 0.178128 | -0.220744 | 0.671756 | 0.507275 | 0.675559 | 1.000000 | 0.743226 | 0.735957 | 0.672191 | 0.478889 | 0.301292 |
Length | 0.059959 | 0.272336 | 0.025390 | -0.109779 | 0.537343 | 0.400968 | 0.853669 | 0.743226 | 1.000000 | 0.684305 | 0.562504 | 0.366831 | 0.182592 |
Curb_weight | -0.071564 | 0.067184 | 0.363274 | -0.469247 | 0.742831 | 0.598603 | 0.675609 | 0.735957 | 0.684305 | 1.000000 | 0.847994 | 0.597586 | 0.511400 |
Fuel_capacity | 0.020216 | 0.138045 | 0.324796 | -0.586927 | 0.616862 | 0.479790 | 0.658654 | 0.672191 | 0.562504 | 0.847994 | 1.000000 | 0.478484 | 0.406496 |
Power_perf_factor | 0.009201 | -0.175562 | 0.829511 | 0.051096 | 0.841005 | 0.994071 | 0.200228 | 0.478889 | 0.366831 | 0.597586 | 0.478484 | 1.000000 | 0.905002 |
Price_in_thousands | -0.051249 | -0.251705 | 0.954757 | 0.076303 | 0.649170 | 0.853455 | 0.067042 | 0.301292 | 0.182592 | 0.511400 | 0.406496 | 0.905002 | 1.000000 |
On observing the above correlation, we can say that the pair of the variables (_year_resale_value, Price_in_thousands), (horsepower, Price_in_thousands), (horsepower, engine_size), (length, wheel_base), (curb_weight, engine_size), (fuel_capacity, curb_weight), (power_perf_factor, _year_resale_value), (power_perf_factor, price_in_thousands), (power_perf_factor, engine_size), (power_perf_factor, horsepower) have a strong positive association that means if the value of one variable increases, then the value of the other variable also increases. Similarly, the pair of variables (fuel_efficiency, engine_size), (fuel_efficiency, curb_weight), (fuel_efficiency, fuel_capacity) have a strong negative association that means as the value of one variable increases the value of other variable decreases.
corr = df.corr()
corr.sort_values(["Price_in_thousands"], ascending = False, inplace = True)
print(corr.Price_in_thousands)
Price_in_thousands 1.000000
Pred 0.976777
__year_resale_value 0.954757
Power_perf_factor 0.905002
Horsepower 0.853455
Engine_size 0.649170
Curb_weight 0.511400
Fuel_capacity 0.406496
Width 0.301292
Length 0.182592
Passenger 0.076303
Wheelbase 0.067042
Latest_Launch -0.051249
Sales_in_thousands -0.251705
Fuel_efficiency -0.479539
Name: Price_in_thousands, dtype: float64
We find most important features relative to target Price:__year_resale_value,Power_perf_factor,Horsepower,Engine_size,Curb_weight (>0.5)
cols
['Latest_Launch',
'Sales_in_thousands',
'__year_resale_value',
'Passenger',
'Engine_size',
'Horsepower',
'Wheelbase',
'Width',
'Length',
'Curb_weight',
'Fuel_capacity',
'Power_perf_factor']
import altair as alt
cols2=['__year_resale_value','Engine_size','Horsepower','Curb_weight','Power_perf_factor']
A=[]
for i in cols2:
c=alt.Chart(df).mark_circle().encode(
x=i,
y="Price_in_thousands",
color="Manufacturer"
)
A.append(c)
alt.vconcat(*A)
import altair as alt
c = alt.Chart(df).mark_circle().encode(
x='__year_resale_value',
y="Price_in_thousands",
color="Manufacturer"
)
c1=alt.Chart(df).mark_line(color="red").encode(
x='__year_resale_value',
y='Pred',
)
c+c1
Create the Linear Regression for 4 most important condition:¶
x=df[['__year_resale_value','Engine_size','Curb_weight','Power_perf_factor']]
y=df['Price_in_thousands']
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc.fit(x)
x2 = sc.transform(x)
from sklearn.linear_model import LinearRegression
reg2=LinearRegression()
reg2.fit(x,y)
df["Pred2"]=reg2.predict(x)
reg2.coef_
array([ 0.79057747, -1.48836066, 3.0635297 , 0.21038971])
reg2.intercept_
-9.696273495640352
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x2,y,test_size = 0.2, random_state=0)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(x_train,y_train)
pred= model.predict(x_train)
model.score(x_train, y_train)
0.9602604670542807
from sklearn.linear_model import LinearRegression
linear_reg = LinearRegression()
linear_reg.fit(x_train, y_train)
linear_reg.predict(x_test)
print("Accuracy on Traing set: ",linear_reg.score(x_train,y_train))
print("Accuracy on Testing set: ",linear_reg.score(x_test,y_test))
Accuracy on Traing set: 0.9602604670542807
Accuracy on Testing set: 0.9554237019850436
Check whether overfits:¶
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(x_train,y_train)
pred= linreg.predict(x_test)
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, pred)
16.114304344869506
from sklearn.metrics import r2_score
r2_score(y_test,pred)
0.9314058140034983
print("Our model is explaining almost 96% of variablity of the training data")
Our model is explaining almost 96% of variablity of the training data
We assume that the model is reliable.
Create the KNeighborsClassifier to determine if the car is worth buying:¶
import numpy as np
conditions = [df['Pred2'] > df['Price_in_thousands'],df['Pred2'] < df['Price_in_thousands']]
choices = ['Buy','Not Buy']
df['Trade Decision'] = np.select(conditions, choices, default='Not Buy')
df
Manufacturer | Model | Sales_in_thousands | __year_resale_value | Vehicle_type | Price_in_thousands | Engine_size | Horsepower | Wheelbase | Width | Length | Curb_weight | Fuel_capacity | Fuel_efficiency | Latest_Launch | Power_perf_factor | Passenger | Pred | Pred2 | Trade Decision | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | 16.919 | 16.360 | Passenger | 21.50 | 1.8 | 140.0 | 101.2 | 67.3 | 172.4 | 2.639 | 13.2 | 28.0 | 1328140800000000000 | 58.280150 | 1 | 21.043551 | 20.904723 | Not Buy |
1 | Acura | TL | 39.384 | 19.875 | Passenger | 28.40 | 3.2 | 225.0 | 108.1 | 70.3 | 192.9 | 3.517 | 17.2 | 25.0 | 1307059200000000000 | 91.370778 | 1 | 30.550278 | 31.251605 | Buy |
3 | Acura | RL | 8.588 | 29.725 | Passenger | 42.00 | 3.5 | 210.0 | 114.6 | 71.4 | 196.6 | 3.850 | 18.0 | 22.0 | 1299715200000000000 | 91.389779 | 1 | 40.841002 | 39.616438 | Not Buy |
4 | Audi | A4 | 20.397 | 22.255 | Passenger | 23.99 | 1.8 | 150.0 | 102.6 | 68.2 | 178.0 | 2.998 | 16.4 | 27.0 | 1318032000000000000 | 62.777639 | 1 | 27.456097 | 27.611210 | Buy |
5 | Audi | A6 | 18.780 | 23.555 | Passenger | 33.95 | 2.8 | 200.0 | 108.7 | 76.1 | 192.0 | 3.561 | 18.5 | 22.0 | 1312848000000000000 | 84.565105 | 1 | 33.083205 | 33.459226 | Not Buy |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
145 | Volkswagen | Golf | 9.761 | 11.425 | Passenger | 14.90 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.767 | 14.5 | 26.0 | 1295827200000000000 | 46.943877 | 1 | 16.282528 | 14.712648 | Not Buy |
146 | Volkswagen | Jetta | 83.721 | 13.240 | Passenger | 16.70 | 2.0 | 115.0 | 98.9 | 68.3 | 172.3 | 2.853 | 14.5 | 26.0 | 1314403200000000000 | 47.638237 | 1 | 16.551949 | 16.557096 | Not Buy |
147 | Volkswagen | Passat | 51.102 | 16.725 | Passenger | 21.20 | 1.8 | 150.0 | 106.4 | 68.5 | 184.1 | 3.043 | 16.4 | 27.0 | 1351555200000000000 | 61.701381 | 1 | 21.588980 | 23.150742 | Buy |
148 | Volkswagen | Cabrio | 9.569 | 16.575 | Passenger | 19.99 | 2.0 | 115.0 | 97.4 | 66.7 | 160.4 | 3.079 | 13.7 | 26.0 | 1306800000000000000 | 48.907372 | 1 | 20.465401 | 20.153043 | Buy |
149 | Volkswagen | GTI | 5.596 | 13.760 | Passenger | 17.50 | 2.0 | 115.0 | 98.9 | 68.3 | 163.3 | 2.762 | 14.6 | 26.0 | 1301616000000000000 | 47.946841 | 1 | 18.326620 | 16.754342 | Not Buy |
117 rows × 20 columns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
clf = KNeighborsClassifier()
X = df[cols]
Y = df['Trade Decision']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.4)
clf.fit(X_test,Y_test)
df['Pred3'] = clf.predict(X)
c1 = alt.Chart(df).mark_circle().encode(
x = "__year_resale_value",
y = "Price_in_thousands",
color = 'Trade Decision'
)
c2 = alt.Chart(df).mark_circle().encode(
x = "__year_resale_value",
y = "Price_in_thousands",
color = 'Pred3'
)
c1|c2
loss = log_loss(Y_test, clf.predict_proba(X_test))
loss
0.6343968063734172
(df["Trade Decision"] == df["Pred3"]).value_counts()
True 61
False 56
dtype: int64
There are too many errors indicate that the model cannot be used to determine whether to trade decision
Summary¶
I trained a Linear regression model to predict car prices. The results show that the accuracy of the model is about 95.5% if all features are included in the data set. We compared the importance of each variable to the final output and found that Compactness and Perimeter were the two most important. The accuracy of the model based on the most dominant features was about 96% on the training data set and 95.5% on the test data set. We found that the price prediction was more accurate if the model was built with the most dominant features. In addition, we also conducted predictive trading decisions and found that the data and models were not sufficient for us to get conclusions.
References¶
The reference link: https://www.kaggle.com/datasets/gagandeep16/car-sales?resource=download
pandas.DataFrame.astype: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
numpy.select: https://numpy.org/doc/stable/reference/generated/numpy.select.html