Diamond price
Contents
Diamond price¶
Author: Huan Tran Huu
Course Project, UC Irvine, Math 10, W22
Introduction¶
For my project, I want to use the dataset “diamonds” from seaborn to explore how can we evaluate the price of a diamond base on some of its information such as cut, depth, carat, clarity and color.
Main portion of the project¶
(You can either have all one section or divide into multiple sections)
import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
In this step, I load the seaborn diamonond dataset, clean up the null value and delete some of the columns that I will not use for my project.
df = sns.load_dataset('diamonds')
df.dropna(inplace=True)
df = df.drop(columns=['table','x', 'y','z'])
df
carat | cut | color | clarity | depth | price | |
---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 326 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 326 |
2 | 0.23 | Good | E | VS1 | 56.9 | 327 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 334 |
4 | 0.31 | Good | J | SI2 | 63.3 | 335 |
... | ... | ... | ... | ... | ... | ... |
53935 | 0.72 | Ideal | D | SI1 | 60.8 | 2757 |
53936 | 0.72 | Good | D | SI1 | 63.1 | 2757 |
53937 | 0.70 | Very Good | D | SI1 | 62.8 | 2757 |
53938 | 0.86 | Premium | H | SI2 | 61.0 | 2757 |
53939 | 0.75 | Ideal | D | SI2 | 62.2 | 2757 |
53940 rows × 6 columns
As I’m seeing that the carat and depth column has numeric values, so I want to use Scikitlearn for linear regression to explore how the price of diamonds relate to the carat and depth.
from sklearn.linear_model import LinearRegression
df[["carat","depth","price"]] = df[["carat","depth","price"]].apply(lambda col: pd.to_numeric(col, errors="coerce"), axis=0)
# In here, I'm writing a function using f string to return an equation of linear regression
def findequation(x,y):
reg = LinearRegression()
reg.fit(df[x], df[y])
return print(f"The equation is: {y} = {x[0]} x {reg.coef_[0]} + {x[1]} x {reg.coef_[1]} + {reg.intercept_}")
findequation(["carat","depth"], "price")
The equation is: price = carat x 7765.140663767152 + depth x -102.1653221580108 + 4045.3331826016815
From the equation above, we see that carat has positive relation to price, and depth has negative relation to price. So, the higher value of carat and the lower value of depth will make the diamond’s price more expensive.
Now, I’m making a chart to see visualize how carat and depth relate to the price of diamonds.
alt.Chart(df[:5000]).mark_circle().encode(
x = alt.X("carat", scale = alt.Scale(zero=False)),
y = alt.Y("depth", scale = alt.Scale(zero=False)),
color=alt.Color("price", title="Price"),
tooltip = ["carat","depth", "price"]
).properties(
title="Diamond Chart",
width=700,
height=500,
).add_selection(alt.selection_interval())
From the chart above, it is very clear that the higher values of carat have a higher price. However, it is not clear that the lower values of depth relate to a cheaper price. The reason that I think for that is because the coefficient of carat from the equation that I get get earlier is very small. It is -102, compare to 7765 from the carat’s coefficient. So, I think the price will not fluctuate very much relate to the values of depth.
Now, let’s explore the other values of a diamond such as clarity, cut and color.
Since the value for the clarity, cut and color are not numerical values. I will use KNeighborsClassifier to explore these values.
First, I want to explore the values and categories of clarity.
df["clarity"].value_counts()
SI1 13065
VS2 12258
SI2 9194
VS1 8171
VVS2 5066
VVS1 3655
IF 1790
I1 741
Name: clarity, dtype: int64
I put an image for diamond’s clarity below for better understanding.
from IPython import display
display.Image("/work/clarity.png")
df["color"].unique()
['E', 'I', 'J', 'H', 'F', 'G', 'D']
Categories (7, object): ['D', 'E', 'F', 'G', 'H', 'I', 'J']
As the color category seems very strange to me, I did a research and came up with an image below.
from IPython import display
display.Image("/work/color.png")
For column cut, I also want to explore how many diamonds with a specific cut in the dataset. But I want to use for loop and f string.
for a,b in df.groupby("cut"):
print(f"The number of {a} cut in the data is {b.shape[0]} values")
The number of Ideal cut in the data is 21551 values
The number of Premium cut in the data is 13791 values
The number of Very Good cut in the data is 12082 values
The number of Good cut in the data is 4906 values
The number of Fair cut in the data is 1610 values
From the information above, I see that the number of Good cut is very low (1610) compare to other cuts. So, I decide to create a sub data frame contains only Ideal, Premium, Very Good and Good cut. But first, I want to highlight these 4 category in the original dataframe to see how it looks using pandas styler.
def highlight(x):
if "Ideal" in x:
return 'color:white;background-color:red'
if "Premium" in x:
return 'color:white;background-color:blue'
if "Very Good" in x:
return 'color:white;background-color:purple'
if "Good" in x:
return 'color:white;background-color:green'
else:
return None
df[:20].style.applymap(highlight,subset=["cut"])
carat | cut | color | clarity | depth | price | |
---|---|---|---|---|---|---|
0 | 0.230000 | Ideal | E | SI2 | 61.500000 | 326 |
1 | 0.210000 | Premium | E | SI1 | 59.800000 | 326 |
2 | 0.230000 | Good | E | VS1 | 56.900000 | 327 |
3 | 0.290000 | Premium | I | VS2 | 62.400000 | 334 |
4 | 0.310000 | Good | J | SI2 | 63.300000 | 335 |
5 | 0.240000 | Very Good | J | VVS2 | 62.800000 | 336 |
6 | 0.240000 | Very Good | I | VVS1 | 62.300000 | 336 |
7 | 0.260000 | Very Good | H | SI1 | 61.900000 | 337 |
8 | 0.220000 | Fair | E | VS2 | 65.100000 | 337 |
9 | 0.230000 | Very Good | H | VS1 | 59.400000 | 338 |
10 | 0.300000 | Good | J | SI1 | 64.000000 | 339 |
11 | 0.230000 | Ideal | J | VS1 | 62.800000 | 340 |
12 | 0.220000 | Premium | F | SI1 | 60.400000 | 342 |
13 | 0.310000 | Ideal | J | SI2 | 62.200000 | 344 |
14 | 0.200000 | Premium | E | SI2 | 60.200000 | 345 |
15 | 0.320000 | Premium | E | I1 | 60.900000 | 345 |
16 | 0.300000 | Ideal | I | SI2 | 62.000000 | 348 |
17 | 0.300000 | Good | J | SI1 | 63.400000 | 351 |
18 | 0.300000 | Good | J | SI1 | 63.800000 | 351 |
19 | 0.300000 | Very Good | J | SI1 | 62.700000 | 351 |
Below is a sub data that contains only top 4 cut
subdf = df[df.cut.isin(df.cut.value_counts()[:4].index)].copy()
From the information that we have, the number of Ideal cut diamonds and SI1 is the largest in this data frame. So, I think the combo Ideal and SI1 will also the largest in this data.
Now, I am writing some codes to test my guess.
# In this step, I'm using groupby to create a dictionary d.
d={}
for a,b in subdf.groupby(["cut","clarity"], sort=True):
d[(a[0],a[1])]=b.shape[0]
# Now, I'm sorting my order in ascending order. I'm using for loop with counter to print out the top 5 combo diamonds.
counter = 0
for n,m in sorted(d.items(),reverse=True, key=lambda item: item[1]):
print(f"The number of diamonds with {n[0]} cut and {n[1]} clarity is {m}")
counter += 1
if counter >= 5:
break
The number of diamonds with Ideal cut and VS2 clarity is 5071
The number of diamonds with Ideal cut and SI1 clarity is 4282
The number of diamonds with Ideal cut and VS1 clarity is 3589
The number of diamonds with Premium cut and SI1 clarity is 3575
The number of diamonds with Premium cut and VS2 clarity is 3357
As the result, my guess was wrong because Ideal and VS2 is the largest combo diamons in this dataset.
Now, I want to explore the data with top 5 diamond combos above
First, I will create a subdata with feature engineering
subdf["feature engineering"] = ((subdf["cut"] == "Ideal") & (subdf["clarity"] == "VS2")) | ((subdf["cut"] == "Ideal") & (subdf["clarity"] == "SI1")) | ((subdf["cut"] == "Ideal") & (subdf["clarity"] == "VS1")) | ((subdf["cut"] == "Premium") & (subdf["clarity"] == "SI1")) | ((subdf["cut"] == "Premium") & (subdf["clarity"] == "VS2")).copy()
I also create a function below to see if there is overfitting or underfitting base on the score value. The function will return the score values for both train set and test set. It also print out text in Blue if there is overfitting and print out Red text if there is underfitting.
The function contains the topic of KNeighborsClassifier, StandardScaler, train set and test set.
def fitting(x,y,dataframe,k): # This function will take the input of x and y columns from the dataframe that we want to use, and k is the number of n_neighbors for the classifier.
import sys
from termcolor import colored, cprint # This is a library that I use to change the text color
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # This step create a scaled X from the x input
scaler.fit(dataframe[x])
X_scaled = scaler.transform(dataframe[x])
clf = KNeighborsClassifier(n_neighbors=k)
clf.fit(X_scaled,dataframe[y])
X_train, X_test, y_train, y_test = train_test_split(X_scaled,dataframe[y],test_size=0.5)
clf.fit(X_train, y_train)
clf.predict(X_test)
# Below is a if and else condition
# It makes the function print out Blue color text if we have overfitting and otherwise it will print out Red color
# The overfitting and underfitting determine by the clf.score value. Overfittinf happens if we have greater score for the training set.
if clf.score(X_test,y_test) < clf.score(X_train,y_train):
return print(colored(f"The score for test set is {clf.score(X_test,y_test)} and the score for train set is {clf.score(X_train,y_train)} \nIt suggests overfitting as the score for train set is higher", 'blue', attrs=['reverse', 'blink']))
else:
return print(colored(f"The score for test set is {clf.score(X_test,y_test)} and the score for train set is {clf.score(X_train,y_train)} \nIt suggests underfitting as the score for train set is lower", 'red', attrs=['reverse', 'blink']))
fitting(["feature engineering", "price"],"color",subdf,4)
The score for test set is 0.30789222243454994 and the score for train set is 0.5081215364035926
It suggests overfitting as the score for train set is higher
Now, let’s explore how color relate to the price of diamond. For this dataset, I only want to explore the color G and E so I create a sub dataframe contains only G and E
subdf1 = df[(df["color"] == "G") | (df["color"] == "E")].copy()
Because G and E are not numerical value, so I convert it into numerical using labelencoder and call it newcolor.
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
subdf1["newcolor"] = labelencoder.fit_transform(subdf1['color']).copy()
subdf1["feature engineering"] = ((df["cut"] == "Ideal") & (df["clarity"] == "VS2")) | ((df["cut"] == "Ideal") & (df["clarity"] == "SI1")) | ((df["cut"] == "Ideal") & (df["clarity"] == "VS1")) | ((df["cut"] == "Premium") & (df["clarity"] == "SI1")) | ((df["cut"] == "Premium") & (df["clarity"] == "VS2")).copy()
fitting(["feature engineering", "newcolor"],"price",subdf1,4)
The score for test set is 0.0002844950213371266 and the score for train set is 0.0005690440060698028
It suggests overfitting as the score for train set is higher
Now, I want to make a graph to visualize the data
alt.data_transformers.disable_max_rows()
choice = alt.selection_multi(fields=['color'], bind='legend')
alt.Chart(df).mark_bar().encode(
x = "cut",
y = "clarity",
color = "color",
size = "price",
tooltip = ["cut","clarity", "price"],
opacity=alt.condition(choice, alt.value(1), alt.value(0.2))
).properties(
title="Diamond Chart",
width=600,
height=500,
).add_selection(choice,alt.selection_interval())
From the chart that I made above, I think it is quite unorganize and difficult to see the relation between all of the categories. So, I think it is better to break it up into graphs with less columns for better visulization.
choice = alt.selection_multi(fields=['color'], bind='legend')
c1 = alt.Chart(df).mark_bar().encode(
x = "cut",
y = "price",
color = "color",
tooltip = ["cut","color", "price"],
opacity=alt.condition(choice, alt.value(1), alt.value(0.2))
).add_selection(choice)
c1
c2 = alt.Chart(df).mark_bar().encode(
x='clarity',
y='price',
color='color',
).transform_filter(choice)
c1|c2