Predict Pumpkin Seeds¶
Author: Zian Dong
Course Project, UC Irvine, Math 10, W22
Student ID: 90294322
Introduction¶
The goal of this project is to predict the category of a pumpkin seed from a set of measurements such as its perimeter, compactness, and area. The dataset contains only two categories of pumpkin seeds, so I use logistic regression as the model. I also examine how each variable relates to the predicted class, and plot the two most significant variables against each other.
Main portion of the project¶
pip install openpyxl
Collecting openpyxl
Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
WARNING: You are using pip version 20.1.1; however, version 22.0.4 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
import numpy as np
Import data¶
df = pd.read_excel("Pumpkin_Seeds_Dataset.xlsx")
df.columns
Index(['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
'Roundness', 'Aspect_Ration', 'Compactness', 'Class'],
dtype='object')
Feature selection¶
We select the important features and drop input variables that are highly correlated with one another. This helps reduce overfitting to some extent.
X_unsel = df[['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
'Roundness', 'Aspect_Ration', 'Compactness' ]]
corr = X_unsel.corr()
corr.style.background_gradient(cmap='coolwarm')
| | Area | Perimeter | Major_Axis_Length | Minor_Axis_Length | Convex_Area | Equiv_Diameter | Eccentricity | Solidity | Extent | Roundness | Aspect_Ration | Compactness |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Area | 1.000000 | 0.928548 | 0.789133 | 0.685304 | 0.999806 | 0.998464 | 0.159624 | 0.158388 | -0.014018 | -0.149378 | 0.159960 | -0.160438 |
Perimeter | 0.928548 | 1.000000 | 0.946181 | 0.392913 | 0.929971 | 0.928055 | 0.464601 | 0.065340 | -0.140600 | -0.500968 | 0.487880 | -0.484440 |
Major_Axis_Length | 0.789133 | 0.946181 | 1.000000 | 0.099376 | 0.789061 | 0.787078 | 0.704287 | 0.119291 | -0.214990 | -0.684972 | 0.729156 | -0.726958 |
Minor_Axis_Length | 0.685304 | 0.392913 | 0.099376 | 1.000000 | 0.685634 | 0.690020 | -0.590877 | 0.090915 | 0.233576 | 0.558566 | -0.598475 | 0.603441 |
Convex_Area | 0.999806 | 0.929971 | 0.789061 | 0.685634 | 1.000000 | 0.998289 | 0.159156 | 0.139178 | -0.015449 | -0.153615 | 0.159822 | -0.160432 |
Equiv_Diameter | 0.998464 | 0.928055 | 0.787078 | 0.690020 | 0.998289 | 1.000000 | 0.156246 | 0.159454 | -0.010970 | -0.145313 | 0.155762 | -0.156411 |
Eccentricity | 0.159624 | 0.464601 | 0.704287 | -0.590877 | 0.159156 | 0.156246 | 1.000000 | 0.043991 | -0.327316 | -0.890651 | 0.950225 | -0.981689 |
Solidity | 0.158388 | 0.065340 | 0.119291 | 0.090915 | 0.139178 | 0.159454 | 0.043991 | 1.000000 | 0.067537 | 0.200836 | 0.026410 | -0.019967 |
Extent | -0.014018 | -0.140600 | -0.214990 | 0.233576 | -0.015449 | -0.010970 | -0.327316 | 0.067537 | 1.000000 | 0.352338 | -0.329933 | 0.336984 |
Roundness | -0.149378 | -0.500968 | -0.684972 | 0.558566 | -0.153615 | -0.145313 | -0.890651 | 0.200836 | 0.352338 | 1.000000 | -0.935233 | 0.933308 |
Aspect_Ration | 0.159960 | 0.487880 | 0.729156 | -0.598475 | 0.159822 | 0.155762 | 0.950225 | 0.026410 | -0.329933 | -0.935233 | 1.000000 | -0.990778 |
Compactness | -0.160438 | -0.484440 | -0.726958 | 0.603441 | -0.160432 | -0.156411 | -0.981689 | -0.019967 | 0.336984 | 0.933308 | -0.990778 | 1.000000 |
The matrix shows that Perimeter is strongly correlated (absolute correlation above 0.9) with Area, Convex_Area, Equiv_Diameter, and Major_Axis_Length. Likewise, Compactness is strongly correlated with Aspect_Ration, Roundness, and Eccentricity. We can therefore keep Perimeter and Compactness as the two representatives of these nine features. The remaining features (Minor_Axis_Length, Solidity, and Extent) are not that strongly correlated with any other feature, so we keep them unchanged.
X = df[['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness' ]]
Y = df['Class']
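The strongly correlated pairs were read off the matrix by eye. As a minimal sketch, the same pairs could be listed programmatically. The helper name `high_corr_pairs` and the synthetic `toy` frame are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.9):
    """Return (feature_a, feature_b, corr) for every pair with |corr| > threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    # Only look at the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs


# Synthetic stand-in data (not the pumpkin dataset):
# b is a linear function of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})

pairs = high_corr_pairs(toy)
print(pairs)
```

In the notebook this would be called as `high_corr_pairs(X_unsel)`.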
Check imbalanced data¶
Next, we check whether the dataset is balanced. If it were imbalanced, plain accuracy (correct predictions / total predictions) would not be a reliable measure of model performance. The counts show that the two classes are approximately balanced.
Y.value_counts()
Çerçevelik 1300
Ürgüp Sivrisi 1200
Name: Class, dtype: int64
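To make "approximately balanced" concrete, the sketch below turns the counts above into class proportions and the accuracy of a classifier that always predicts the larger class. The counts are copied from the output above; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Çerçevelik": 1300, "Ürgüp Sivrisi": 1200})

# Share of each class in the dataset.
proportions = counts / counts.sum()

# Accuracy of always predicting the most frequent class.
baseline_accuracy = proportions.max()

print(proportions)
print(baseline_accuracy)
```

Any model accuracy should be read against this baseline of 0.52. In the notebook, `Y.value_counts(normalize=True)` gives the proportions directly.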
Standardize the data¶
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled
array([[-2.21575484, -0.23853605, 0.20281179, 0.85540604, 2.19727996],
[-0.56880361, 0.36208858, 0.60362561, 0.35952305, 0.84023019],
[-0.43294002, -0.63321531, -1.08551833, 0.76838021, -0.21148339],
...,
[ 0.7326892 , -0.15488711, 0.71814384, 1.1673854 , -0.83346454],
[ 0.48215494, -0.90336996, -0.14074291, 0.7256883 , -1.28581446],
[ 0.27147071, 0.37629052, 0.17418223, 0.70270035, 0.1183551 ]])
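As a small check of what `StandardScaler` does, the sketch below compares it with the manual formula (x - mean) / std on a made-up array. The `data` values are arbitrary and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Arbitrary example values with columns on very different scales.
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaled = StandardScaler().fit_transform(data)

# Manual standardization per column; np.std uses the population
# standard deviation (ddof=0), as StandardScaler does.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))
```

After scaling, every column has mean 0 and standard deviation 1, which puts the logistic regression coefficients on a comparable scale.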
Build training and test set¶
X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2)
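Note that the scaler above was fitted on the full dataset before the split, so test-set statistics influence the scaling, and the split has no fixed seed, so results change between runs. One possible alternative, sketched here on synthetic data, is to split first and let a pipeline fit the scaler on the training part only. The names `features`, `labels`, and `model` and the seed value are assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the pumpkin features and classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 3))
labels = np.where(features[:, 0] + 0.5 * features[:, 1] > 0, "A", "B")

# Split first; random_state makes the split reproducible and
# stratify keeps the class proportions equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=0, stratify=labels
)

# The pipeline fits the scaler on the training data only and
# reuses those statistics when scoring the test data.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_tr, y_tr)

print(model.score(X_te, y_te))
```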
Train the model¶
clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
clf.predict_proba(X_train)
array([[0.82371344, 0.17628656],
[0.87303334, 0.12696666],
[0.25054071, 0.74945929],
...,
[0.84269527, 0.15730473],
[0.69738565, 0.30261435],
[0.74106931, 0.25893069]])
clf.predict(X_train)
array(['Çerçevelik', 'Çerçevelik', 'Ürgüp Sivrisi', ..., 'Çerçevelik',
'Çerçevelik', 'Çerçevelik'], dtype=object)
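The two outputs above are related: the columns of `predict_proba` follow the order of `clf.classes_`, and `predict` returns the class with the larger probability. The following sketch illustrates this on synthetic data; `toy_clf`, `pts`, and `lab` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data: the label depends on the sign of the first column.
rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 2))
lab = np.where(pts[:, 0] > 0, "right", "left")

toy_clf = LogisticRegression().fit(pts, lab)

# Column k of predict_proba is the probability of toy_clf.classes_[k].
proba = toy_clf.predict_proba(pts)
from_proba = toy_clf.classes_[np.argmax(proba, axis=1)]

print(toy_clf.classes_)
print(np.array_equal(from_proba, toy_clf.predict(pts)))
```

In the notebook, `clf.classes_` shows which seed category each probability column refers to.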
Check for overfitting¶
train_error = log_loss(y_train, clf.predict_proba(X_train))
test_error = log_loss(y_test, clf.predict_proba(X_test))
train_error
0.3226528735934736
test_error
0.30269970620960607
test_accuracy = np.count_nonzero(clf.predict(X_test) == y_test) / len(X_test)
train_accuracy = np.count_nonzero(clf.predict(X_train) == y_train) / len(X_train)
test_accuracy
0.874
train_accuracy
0.875
print(f"The log loss for the training set is {train_error}, and the log loss for the test set is {test_error}")
print(f"The accuracy for the training set is {train_accuracy}, and the accuracy for the test set is {test_accuracy}")
The log loss for the training set is 0.3226528735934736, and the log loss for the test set is 0.30269970620960607
The accuracy for the training set is 0.875, and the accuracy for the test set is 0.874
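For reference, the binary log loss is the negative mean log-probability assigned to the true class. The sketch below computes it by hand on four made-up predictions and compares the result with `sklearn.metrics.log_loss`. The values in `y_true` and `p_pos` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up true labels (1 = positive class) and predicted
# probabilities of the positive class.
y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.3])

# Average of -log(p) for positives and -log(1 - p) for negatives.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

library = log_loss(y_true, p_pos)

print(manual, library)
```

Unlike accuracy, log loss penalizes confident wrong predictions heavily. Training and test values that are close, as above, suggest the model is not overfitting.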
Visualize the feature importance¶
list(clf.coef_[0])
[0.7739633214464561,
-0.7281673848227552,
0.4759291759853715,
0.04615270158937232,
-2.2511074799364117]
df_coef = pd.DataFrame()
df_coef['feature'] = ['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness']
df_coef['value'] = list(abs(clf.coef_[0]))
df_coef['type'] = ['negative' if n < 0 else 'positive' for n in clf.coef_[0]]
df_coef
| | feature | value | type |
---|---|---|---|
0 | Perimeter | 0.773963 | positive |
1 | Minor_Axis_Length | 0.728167 | negative |
2 | Solidity | 0.475929 | positive |
3 | Extent | 0.046153 | positive |
4 | Compactness | 2.251107 | negative |
alt.Chart(df_coef).mark_bar().encode(
x = 'feature',
y = 'value',
color = 'type',
opacity=alt.value(0.5),
).properties(
title = 'Feature Importance'
)
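Because the inputs were standardized, each coefficient is the change in log-odds per one standard deviation of that feature, and exponentiating it gives an odds ratio. The sketch below uses the coefficients printed above. It assumes scikit-learn's sorted class order, so that the odds are those of `clf.classes_[1]`, which should be 'Ürgüp Sivrisi' here. That assumption can be checked with `clf.classes_`.

```python
import numpy as np

features = ['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness']

# Coefficients copied from the clf.coef_[0] output above.
coefs = np.array([
    0.7739633214464561,
    -0.7281673848227552,
    0.4759291759853715,
    0.04615270158937232,
    -2.2511074799364117,
])

# Odds ratio per one-standard-deviation increase of each feature.
odds_ratios = np.exp(coefs)

# Print features from largest to smallest absolute coefficient.
order = np.argsort(-np.abs(coefs))
for idx in order:
    print(f"{features[idx]}: coef={coefs[idx]:+.3f}, odds ratio={odds_ratios[idx]:.3f}")
```

An odds ratio below 1 (a negative coefficient) means that larger values of the feature make the second class less likely.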
Finally, we plot the test-set seeds using the two most important features, Compactness and Perimeter, to see how these two input variables relate to the class, both as predicted by the model and as given by the true labels.
# Look up the unscaled feature values for the rows in the test set.
tmp = df.loc[y_test.index, 'Compactness'].tolist()
tmp2 = df.loc[y_test.index, 'Perimeter'].tolist()
len(tmp2)
500
df_pred = pd.DataFrame()
df_pred['Compactness'] = tmp
df_pred['Perimeter'] = tmp2
df_pred['type'] = clf.predict(X_test)
df_pred
| | Compactness | Perimeter | type |
---|---|---|---|
0 | 0.6693 | 991.063 | Ürgüp Sivrisi |
1 | 0.6264 | 1366.056 | Ürgüp Sivrisi |
2 | 0.7348 | 1043.688 | Çerçevelik |
3 | 0.6834 | 975.628 | Ürgüp Sivrisi |
4 | 0.6916 | 1200.894 | Ürgüp Sivrisi |
... | ... | ... | ... |
495 | 0.6273 | 1263.079 | Ürgüp Sivrisi |
496 | 0.7076 | 1078.565 | Çerçevelik |
497 | 0.6919 | 1063.550 | Ürgüp Sivrisi |
498 | 0.7827 | 996.385 | Çerçevelik |
499 | 0.7116 | 1240.471 | Çerçevelik |
500 rows × 3 columns
c1 = alt.Chart(df_pred).mark_circle().encode(
x = 'Compactness',
y = 'Perimeter',
color = 'type'
).properties(
title = 'pred'
)
df_true = pd.DataFrame()
df_true['Compactness'] = tmp
df_true['Perimeter'] = tmp2
df_true['type'] = list(y_test)
c2 = alt.Chart(df_true).mark_circle().encode(
x = 'Compactness',
y = 'Perimeter',
color = 'type'
).properties(
title = 'true'
)
c1|c2
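The two charts can only be compared by eye. One way to quantify how the predictions differ from the true labels is a confusion matrix. The sketch below uses six made-up labels rather than the notebook's test set.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

classes = ["Çerçevelik", "Ürgüp Sivrisi"]

# Made-up labels for illustration only.
true_labels = ["Çerçevelik", "Çerçevelik", "Çerçevelik",
               "Ürgüp Sivrisi", "Ürgüp Sivrisi", "Ürgüp Sivrisi"]
pred_labels = ["Çerçevelik", "Çerçevelik", "Ürgüp Sivrisi",
               "Ürgüp Sivrisi", "Ürgüp Sivrisi", "Çerçevelik"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, pred_labels, labels=classes)
table = pd.DataFrame(
    cm,
    index=[f"true {c}" for c in classes],
    columns=[f"pred {c}" for c in classes],
)
print(table)
```

In the notebook the equivalent call would be `confusion_matrix(y_test, clf.predict(X_test))`.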
Summary¶
I trained a logistic regression model to predict the category of pumpkin seeds. In the run shown above, the model reached about 87.5% accuracy on the training set and 87.4% on the test set. I also compared the importance of each variable and found that Compactness and Perimeter are the two most important, with Compactness far more important than Perimeter in determining the category. The Compactness-Perimeter chart supports this: when compactness is greater than 0.7 the seeds are very likely to be Çerçevelik, and when it is smaller than 0.7 they are likely to be Ürgüp Sivrisi, whereas perimeter shows no such clear boundary. The chart also shows that compactness and perimeter are negatively correlated, which is consistent with the opposite signs of their coefficients.
References¶
https://www.kaggle.com/mkoklu42/pumpkin-seeds-dataset
Created in Deepnote