Predict Pumpkin Seeds

Author: Zian Dong

Course Project, UC Irvine, Math 10, W22

Student ID: 90294322

Introduction

The goal of my project is to predict the category of a pumpkin seed from a series of input features, such as its perimeter, compactness, and area. The dataset I use contains only two categories of pumpkin seeds, so I use Logistic Regression as my training model. Besides, I also try to find the relationship between each variable and the final output, and choose the two most significant variables to plot a relationship chart.

Main portion of the project

pip install openpyxl
Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
import numpy as np

Import data

df = pd.read_excel("Pumpkin_Seeds_Dataset.xlsx")
df.columns
Index(['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
       'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
       'Roundness', 'Aspect_Ration', 'Compactness', 'Class'],
      dtype='object')
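
Before selecting features, it helps to confirm the size of the data. A minimal sketch (the 2,500-row count is consistent with the class counts shown later):

# 2,500 rows: 12 numeric shape features plus the Class label
df.shape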

Feature selection

Select important features by eliminating highly correlated x variables. In this way we can prevent the model from overfitting to some extent.

X_unsel = df[['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
       'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
       'Roundness', 'Aspect_Ration', 'Compactness' ]]
corr = X_unsel.corr()

corr.style.background_gradient(cmap='coolwarm')
  Area Perimeter Major_Axis_Length Minor_Axis_Length Convex_Area Equiv_Diameter Eccentricity Solidity Extent Roundness Aspect_Ration Compactness
Area 1.000000 0.928548 0.789133 0.685304 0.999806 0.998464 0.159624 0.158388 -0.014018 -0.149378 0.159960 -0.160438
Perimeter 0.928548 1.000000 0.946181 0.392913 0.929971 0.928055 0.464601 0.065340 -0.140600 -0.500968 0.487880 -0.484440
Major_Axis_Length 0.789133 0.946181 1.000000 0.099376 0.789061 0.787078 0.704287 0.119291 -0.214990 -0.684972 0.729156 -0.726958
Minor_Axis_Length 0.685304 0.392913 0.099376 1.000000 0.685634 0.690020 -0.590877 0.090915 0.233576 0.558566 -0.598475 0.603441
Convex_Area 0.999806 0.929971 0.789061 0.685634 1.000000 0.998289 0.159156 0.139178 -0.015449 -0.153615 0.159822 -0.160432
Equiv_Diameter 0.998464 0.928055 0.787078 0.690020 0.998289 1.000000 0.156246 0.159454 -0.010970 -0.145313 0.155762 -0.156411
Eccentricity 0.159624 0.464601 0.704287 -0.590877 0.159156 0.156246 1.000000 0.043991 -0.327316 -0.890651 0.950225 -0.981689
Solidity 0.158388 0.065340 0.119291 0.090915 0.139178 0.159454 0.043991 1.000000 0.067537 0.200836 0.026410 -0.019967
Extent -0.014018 -0.140600 -0.214990 0.233576 -0.015449 -0.010970 -0.327316 0.067537 1.000000 0.352338 -0.329933 0.336984
Roundness -0.149378 -0.500968 -0.684972 0.558566 -0.153615 -0.145313 -0.890651 0.200836 0.352338 1.000000 -0.935233 0.933308
Aspect_Ration 0.159960 0.487880 0.729156 -0.598475 0.159822 0.155762 0.950225 0.026410 -0.329933 -0.935233 1.000000 -0.990778
Compactness -0.160438 -0.484440 -0.726958 0.603441 -0.160432 -0.156411 -0.981689 -0.019967 0.336984 0.933308 -0.990778 1.000000

We see that Area, Equiv_Diameter, Convex_Area, Perimeter, and Major_Axis_Length are closely related, and that Compactness, Aspect_Ration, Roundness, and Eccentricity are closely related (|corr| > 0.9). Thus, we can pick Perimeter and Compactness as representatives of these two groups. The remaining features do not show such strong relationships, so we keep them unchanged.
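
Before committing to this choice, we can programmatically list the pairs with |corr| > 0.9 to confirm the groups above (a small sketch using the corr matrix computed earlier):

# Print every feature pair whose absolute correlation exceeds 0.9
high_corr = [(a, b, corr.loc[a, b])
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i+1:]
             if abs(corr.loc[a, b]) > 0.9]
for a, b, r in high_corr:
    print(f"{a} ~ {b}: {r:.3f}")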

X = df[['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness' ]]
Y = df['Class']

Check for imbalanced data

Next, we check whether the data set is balanced. If it is imbalanced, we cannot simply use accuracy (number of correct predictions / total number) to judge whether the model performs well. The result shows it is approximately balanced.

Y.value_counts()
Çerçevelik       1300
Ürgüp Sivrisi    1200
Name: Class, dtype: int64
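
Looking at proportions instead of raw counts makes the balance explicit (roughly 52% vs 48%); a one-line sketch:

# Same counts as above, expressed as fractions of the data set
Y.value_counts(normalize=True)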

Standardize the data

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled
array([[-2.21575484, -0.23853605,  0.20281179,  0.85540604,  2.19727996],
       [-0.56880361,  0.36208858,  0.60362561,  0.35952305,  0.84023019],
       [-0.43294002, -0.63321531, -1.08551833,  0.76838021, -0.21148339],
       ...,
       [ 0.7326892 , -0.15488711,  0.71814384,  1.1673854 , -0.83346454],
       [ 0.48215494, -0.90336996, -0.14074291,  0.7256883 , -1.28581446],
       [ 0.27147071,  0.37629052,  0.17418223,  0.70270035,  0.1183551 ]])
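
Logistic regression coefficients are only comparable across features when the features share a scale, which is why we standardize. A quick sanity check (a minimal sketch) is that each scaled column has mean ≈ 0 and standard deviation ≈ 1:

# Each column of X_scaled should now have mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))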

Build the training and test sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2)
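
Note that without a fixed random seed, train_test_split draws a different split on every run, so the exact numbers below will vary slightly. A reproducible variant (a sketch of an alternative call, not what was run here) would be:

# Fix the seed and stratify by class so both splits keep the ~52/48 balance
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, Y, test_size=0.2, random_state=0, stratify=Y
)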

Train the model

clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
clf.predict_proba(X_train)
array([[0.82371344, 0.17628656],
       [0.87303334, 0.12696666],
       [0.25054071, 0.74945929],
       ...,
       [0.84269527, 0.15730473],
       [0.69738565, 0.30261435],
       [0.74106931, 0.25893069]])
clf.predict(X_train)
array(['Çerçevelik', 'Çerçevelik', 'Ürgüp Sivrisi', ..., 'Çerçevelik',
       'Çerçevelik', 'Çerçevelik'], dtype=object)
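
The two columns of predict_proba follow the order of clf.classes_ (scikit-learn sorts the labels), so the first column should be the probability of 'Çerçevelik' and the second of 'Ürgüp Sivrisi'; we can confirm with:

# The label order that predict_proba's columns correspond to
clf.classes_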

Check whether the model overfits

train_error = log_loss(y_train, clf.predict_proba(X_train))
test_error = log_loss(y_test, clf.predict_proba(X_test))
train_error
0.3226528735934736
test_error
0.30269970620960607
test_accuracy = np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
train_accuracy = np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
test_accuracy
0.874
train_accuracy
0.875
print(f"The log error for the training set is {train_error}, and the log error for the test set is {test_error}")
print(f"The accuracy for the training set is {train_accuracy}, and the accuracy for the test set is {test_accuracy}")
The log error for the training set is 0.3226528735934736, and the log error for the test set is 0.30269970620960607
The accuracy for the training set is 0.875, and the accuracy for the test set is 0.874
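
Since the test log loss (≈0.303) is no larger than the training log loss (≈0.323) and the two accuracies are nearly identical, the model does not appear to overfit. For a fuller picture than a single accuracy number, we could also look at per-class precision and recall (a sketch using scikit-learn's built-ins):

# Per-class precision, recall, and F1 on the test set
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test)))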

Visualize the feature importance

list(clf.coef_[0])
[0.7739633214464561,
 -0.7281673848227552,
 0.4759291759853715,
 0.04615270158937232,
 -2.2511074799364117]
df_coef = pd.DataFrame()
df_coef['feature'] = ['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness']
df_coef['value'] = list(abs(clf.coef_[0]))
df_coef['type'] = ['negative' if n < 0 else 'positive' for n in clf.coef_[0]]
df_coef
feature value type
0 Perimeter 0.773963 positive
1 Minor_Axis_Length 0.728167 negative
2 Solidity 0.475929 positive
3 Extent 0.046153 positive
4 Compactness 2.251107 negative
alt.Chart(df_coef).mark_bar().encode(
    x = 'feature',
    y = 'value',
    color = 'type',
    opacity=alt.value(0.5),
).properties(
    title = 'Feature Importance'
)
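
Because the features were standardized, the coefficient magnitudes are directly comparable, which is what justifies reading them as importances. Sorting makes the ranking explicit (a one-line sketch):

# Rank the features from most to least influential
df_coef.sort_values('value', ascending=False)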

Finally, we can plot a relationship chart using the two most important features, Compactness and Perimeter, to see how these two x variables relate to the predicted class.

# y_test keeps the original row index, so we can look up the raw feature values
tmp = [df['Compactness'][i] for i in y_test.index]
tmp2 = [df['Perimeter'][i] for i in y_test.index]
len(tmp2)
500
df_pred = pd.DataFrame()
df_pred['Compactness'] = tmp
df_pred['Perimeter'] = tmp2
df_pred['type'] = clf.predict(X_test)
df_pred
Compactness Perimeter type
0 0.6693 991.063 Ürgüp Sivrisi
1 0.6264 1366.056 Ürgüp Sivrisi
2 0.7348 1043.688 Çerçevelik
3 0.6834 975.628 Ürgüp Sivrisi
4 0.6916 1200.894 Ürgüp Sivrisi
... ... ... ...
495 0.6273 1263.079 Ürgüp Sivrisi
496 0.7076 1078.565 Çerçevelik
497 0.6919 1063.550 Ürgüp Sivrisi
498 0.7827 996.385 Çerçevelik
499 0.7116 1240.471 Çerçevelik

500 rows × 3 columns

c1 = alt.Chart(df_pred).mark_circle().encode(
    x = 'Compactness',
    y = 'Perimeter',
    color = 'type'   
).properties(
    title = 'pred'
)
df_true = pd.DataFrame()
df_true['Compactness'] = tmp
df_true['Perimeter'] = tmp2
df_true['type'] = list(y_test)
c2 = alt.Chart(df_true).mark_circle().encode(
    x = 'Compactness',
    y = 'Perimeter',
    color = 'type'   
).properties(
    title = 'true'
)
c1|c2

Summary

I trained a Logistic Regression model to predict the category of pumpkin seeds. The results showed the model had approximately 87.5% accuracy on the training set and 87.4% on the test set. We also compared the importance of each variable to the final output and found Compactness and Perimeter to be the two most important ones. More specifically, Compactness matters much more than Perimeter in determining the category of the pumpkin seeds, and we can verify that with the Compactness-Perimeter relation chart: when Compactness is greater than about 0.7 the seeds are very likely to be Çerçevelik, and when it is smaller than 0.7 they are likely to be Ürgüp Sivrisi, while Perimeter does not have such a clear boundary. Besides, the chart also shows that Compactness and Perimeter are negatively correlated, which is consistent with their opposite-signed coefficients in the feature importance chart.

References

The dataset comes from Kaggle:

https://www.kaggle.com/mkoklu42/pumpkin-seeds-dataset

Created in Deepnote