Predict Pumpkin Seeds

Author: Zian Dong

Course Project, UC Irvine, Math 10, W22

Student ID: 90294322

Introduction

The goal of my project is to predict the category of a pumpkin seed from a series of input features, such as its perimeter, compactness, and area. The dataset I use contains only two categories of pumpkin seeds, so I use Logistic Regression as my training model. Besides, I also try to find the relationship between each variable and the final output, and choose the two most significant variables to plot a relationship chart.

Main portion of the project

pip install openpyxl
Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.9
Note: you may need to restart the kernel to use updated packages.
import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
import numpy as np

Import data

df = pd.read_excel("Pumpkin_Seeds_Dataset.xlsx")
df.columns
Index(['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
       'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
       'Roundness', 'Aspect_Ration', 'Compactness', 'Class'],
      dtype='object')
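
Before selecting features, it helps to confirm the size of the data. A minimal sketch (the 2,500-row count is consistent with the class counts shown later):

# 2,500 rows: 12 numeric shape features plus the Class label
df.shape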

Feature selection

Select important features by eliminating highly correlated x variables. In this way we can prevent the model from overfitting to some extent.

X_unsel = df[['Area', 'Perimeter', 'Major_Axis_Length', 'Minor_Axis_Length',
       'Convex_Area', 'Equiv_Diameter', 'Eccentricity', 'Solidity', 'Extent',
       'Roundness', 'Aspect_Ration', 'Compactness' ]]
corr = X_unsel.corr()

corr.style.background_gradient(cmap='coolwarm')
  Area Perimeter Major_Axis_Length Minor_Axis_Length Convex_Area Equiv_Diameter Eccentricity Solidity Extent Roundness Aspect_Ration Compactness
Area 1.000000 0.928548 0.789133 0.685304 0.999806 0.998464 0.159624 0.158388 -0.014018 -0.149378 0.159960 -0.160438
Perimeter 0.928548 1.000000 0.946181 0.392913 0.929971 0.928055 0.464601 0.065340 -0.140600 -0.500968 0.487880 -0.484440
Major_Axis_Length 0.789133 0.946181 1.000000 0.099376 0.789061 0.787078 0.704287 0.119291 -0.214990 -0.684972 0.729156 -0.726958
Minor_Axis_Length 0.685304 0.392913 0.099376 1.000000 0.685634 0.690020 -0.590877 0.090915 0.233576 0.558566 -0.598475 0.603441
Convex_Area 0.999806 0.929971 0.789061 0.685634 1.000000 0.998289 0.159156 0.139178 -0.015449 -0.153615 0.159822 -0.160432
Equiv_Diameter 0.998464 0.928055 0.787078 0.690020 0.998289 1.000000 0.156246 0.159454 -0.010970 -0.145313 0.155762 -0.156411
Eccentricity 0.159624 0.464601 0.704287 -0.590877 0.159156 0.156246 1.000000 0.043991 -0.327316 -0.890651 0.950225 -0.981689
Solidity 0.158388 0.065340 0.119291 0.090915 0.139178 0.159454 0.043991 1.000000 0.067537 0.200836 0.026410 -0.019967
Extent -0.014018 -0.140600 -0.214990 0.233576 -0.015449 -0.010970 -0.327316 0.067537 1.000000 0.352338 -0.329933 0.336984
Roundness -0.149378 -0.500968 -0.684972 0.558566 -0.153615 -0.145313 -0.890651 0.200836 0.352338 1.000000 -0.935233 0.933308
Aspect_Ration 0.159960 0.487880 0.729156 -0.598475 0.159822 0.155762 0.950225 0.026410 -0.329933 -0.935233 1.000000 -0.990778
Compactness -0.160438 -0.484440 -0.726958 0.603441 -0.160432 -0.156411 -0.981689 -0.019967 0.336984 0.933308 -0.990778 1.000000

We see that Area, Equiv_Diameter, Convex_Area, Perimeter, and Major_Axis_Length are closely related, and that Compactness, Aspect_Ration, Roundness, and Eccentricity are closely related (|corr| > 0.9). Thus, we can pick Perimeter and Compactness as representatives of these two groups. The remaining features do not show such strong relationships, so we keep them unchanged.
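
Before committing to this choice, we can programmatically list the pairs with |corr| > 0.9 to confirm the groups above (a small sketch using the corr matrix computed earlier):

# Print every feature pair whose absolute correlation exceeds 0.9
high_corr = [(a, b, corr.loc[a, b])
             for i, a in enumerate(corr.columns)
             for b in corr.columns[i+1:]
             if abs(corr.loc[a, b]) > 0.9]
for a, b, r in high_corr:
    print(f"{a} ~ {b}: {r:.3f}")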

X = df[['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness' ]]
Y = df['Class']

Check for imbalanced data

Next, we check whether the data set is balanced. If it is imbalanced, we cannot simply use accuracy (number of correct predictions / total number) to judge whether the model performs well. The result shows it is approximately balanced.

Y.value_counts()
Çerçevelik       1300
Ürgüp Sivrisi    1200
Name: Class, dtype: int64
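
Looking at proportions instead of raw counts makes the balance explicit (roughly 52% vs 48%); a one-line sketch:

# Same counts as above, expressed as fractions of the data set
Y.value_counts(normalize=True)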

Standardize the data

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled
array([[-2.21575484, -0.23853605,  0.20281179,  0.85540604,  2.19727996],
       [-0.56880361,  0.36208858,  0.60362561,  0.35952305,  0.84023019],
       [-0.43294002, -0.63321531, -1.08551833,  0.76838021, -0.21148339],
       ...,
       [ 0.7326892 , -0.15488711,  0.71814384,  1.1673854 , -0.83346454],
       [ 0.48215494, -0.90336996, -0.14074291,  0.7256883 , -1.28581446],
       [ 0.27147071,  0.37629052,  0.17418223,  0.70270035,  0.1183551 ]])
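
Logistic regression coefficients are only comparable across features when the features share a scale, which is why we standardize. A quick sanity check (a minimal sketch) is that each scaled column has mean ≈ 0 and standard deviation ≈ 1:

# Each column of X_scaled should now have mean ~0 and std ~1
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))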

Build the training and test sets

X_train, X_test, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2)
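
Note that without a fixed random seed, train_test_split draws a different split on every run, so the exact numbers below will vary slightly. A reproducible variant (a sketch of an alternative call, not what was run here) would be:

# Fix the seed and stratify by class so both splits keep the ~52/48 balance
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, Y, test_size=0.2, random_state=0, stratify=Y
)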

Train the model

clf = LogisticRegression()
clf.fit(X_train, y_train)
LogisticRegression()
clf.predict_proba(X_train)
array([[0.82371344, 0.17628656],
       [0.87303334, 0.12696666],
       [0.25054071, 0.74945929],
       ...,
       [0.84269527, 0.15730473],
       [0.69738565, 0.30261435],
       [0.74106931, 0.25893069]])
clf.predict(X_train)
array(['Çerçevelik', 'Çerçevelik', 'Ürgüp Sivrisi', ..., 'Çerçevelik',
       'Çerçevelik', 'Çerçevelik'], dtype=object)
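
The two columns of predict_proba follow the order of clf.classes_ (scikit-learn sorts the labels), so the first column should be the probability of 'Çerçevelik' and the second of 'Ürgüp Sivrisi'; we can confirm with:

# The label order that predict_proba's columns correspond to
clf.classes_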

Check whether the model overfits

train_error = log_loss(y_train, clf.predict_proba(X_train))
test_error = log_loss(y_test, clf.predict_proba(X_test))
train_error
0.3226528735934736
test_error
0.30269970620960607
test_accuracy = np.count_nonzero(clf.predict(X_test) == y_test)/len(X_test)
train_accuracy = np.count_nonzero(clf.predict(X_train) == y_train)/len(X_train)
test_accuracy
0.874
train_accuracy
0.875
print(f"The log error for the training set is {train_error}, and the log error for the test set is {test_error}")
print(f"The accuracy for the training set is {train_accuracy}, and the accuracy for the test set is {test_accuracy}")
The log error for the training set is 0.3226528735934736, and the log error for the test set is 0.30269970620960607
The accuracy for the training set is 0.875, and the accuracy for the test set is 0.874
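
Since the test log loss (≈0.303) is no larger than the training log loss (≈0.323) and the two accuracies are nearly identical, the model does not appear to overfit. For a fuller picture than a single accuracy number, we could also look at per-class precision and recall (a sketch using scikit-learn's built-ins):

# Per-class precision, recall, and F1 on the test set
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test)))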

Visualize the feature importance

list(clf.coef_[0])
[0.7739633214464561,
 -0.7281673848227552,
 0.4759291759853715,
 0.04615270158937232,
 -2.2511074799364117]
df_coef = pd.DataFrame()
df_coef['feature'] = ['Perimeter', 'Minor_Axis_Length', 'Solidity', 'Extent', 'Compactness']
df_coef['value'] = list(abs(clf.coef_[0]))
df_coef['type'] = ['negative' if n < 0 else 'positive' for n in clf.coef_[0]]
df_coef
feature value type
0 Perimeter 0.773963 positive
1 Minor_Axis_Length 0.728167 negative
2 Solidity 0.475929 positive
3 Extent 0.046153 positive
4 Compactness 2.251107 negative
alt.Chart(df_coef).mark_bar().encode(
    x = 'feature',
    y = 'value',
    color = 'type',
    opacity=alt.value(0.5),
).properties(
    title = 'Feature Importance'
)
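
Because the features were standardized, the coefficient magnitudes are directly comparable, which is what justifies reading them as importances. Sorting makes the ranking explicit (a one-line sketch):

# Rank the features from most to least influential
df_coef.sort_values('value', ascending=False)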

Finally, we can plot a relationship chart using the two most important features, Compactness and Perimeter, to see how these two x variables relate to the predicted class.

# y_test keeps the original row index, so we can look up the raw feature values
tmp = [df['Compactness'][i] for i in y_test.index]
tmp2 = [df['Perimeter'][i] for i in y_test.index]
len(tmp2)
500
df_pred = pd.DataFrame()
df_pred['Compactness'] = tmp
df_pred['Perimeter'] = tmp2
df_pred['type'] = clf.predict(X_test)
df_pred
Compactness Perimeter type
0 0.6693 991.063 Ürgüp Sivrisi
1 0.6264 1366.056 Ürgüp Sivrisi
2 0.7348 1043.688 Çerçevelik
3 0.6834 975.628 Ürgüp Sivrisi
4 0.6916 1200.894 Ürgüp Sivrisi
... ... ... ...
495 0.6273 1263.079 Ürgüp Sivrisi
496 0.7076 1078.565 Çerçevelik
497 0.6919 1063.550 Ürgüp Sivrisi
498 0.7827 996.385 Çerçevelik
499 0.7116 1240.471 Çerçevelik

500 rows × 3 columns

c1 = alt.Chart(df_pred).mark_circle().encode(
    x = 'Compactness',
    y = 'Perimeter',
    color = 'type'   
).properties(
    title = 'pred'
)
df_true = pd.DataFrame()
df_true['Compactness'] = tmp
df_true['Perimeter'] = tmp2
df_true['type'] = list(y_test)
c2 = alt.Chart(df_true).mark_circle().encode(
    x = 'Compactness',
    y = 'Perimeter',
    color = 'type'   
).properties(
    title = 'true'
)
c1|c2

Summary

I trained a Logistic Regression model to predict the category of pumpkin seeds. The results showed the model had approximately 87.5% accuracy on the training set and 87.4% on the test set. We also compared the importance of each variable to the final output and found Compactness and Perimeter to be the two most important ones. More specifically, Compactness matters much more than Perimeter in determining the category of the pumpkin seeds, and we can verify that with the Compactness-Perimeter relation chart: when Compactness is greater than about 0.7 the seeds are very likely to be Çerçevelik, and when it is smaller than 0.7 they are likely to be Ürgüp Sivrisi, while Perimeter does not have such a clear boundary. Besides, the chart also shows that Compactness and Perimeter are negatively correlated, which is consistent with their opposite-signed coefficients in the feature importance chart.

References

The dataset comes from Kaggle:

https://www.kaggle.com/mkoklu42/pumpkin-seeds-dataset

Created in Deepnote