Categorizing Beans¶
Author: Evan McDuffie
Course Project, UC Irvine, Math 10, S22
Introduction¶
The dataset for this project included dimensions and features for ~13,000 beans of 7 species. I used logistic regression and the K-nearest neighbors algorithm to create models to predict a bean’s species. I wanted to test and compare the accuracies of these two methods.
Main portion of the project¶
Data import and cleaning¶
import pandas as pd
df = pd.read_csv("Dry_Beans_Dataset.csv")
First, I checked whether any values were missing and looked at the first few rows of the dataset. There are 16 columns of measurements, and the rightmost column contains each bean’s type. In total, there are 13611 beans of 7 types, with large differences in the counts of each type.
df.isna().any(axis = None)
False
display(df.shape)
df.head()
(13611, 17)
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 | SEKER |
| 2 | 29380 | 624.110 | 212.826130 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | SEKER |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.976696 | 0.903936 | 0.928329 | 0.007017 | 0.003215 | 0.861794 | 0.994199 | SEKER |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.333680 | 30417 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.941900 | 0.999166 | SEKER |
df["Class"].value_counts()
DERMASON 3546
SIRA 2636
SEKER 2027
HOROZ 1928
CALI 1630
BARBUNYA 1322
BOMBAY 522
Name: Class, dtype: int64
Now, I looked at summary statistics for each numeric column.
from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
pd.DataFrame({"mean": df[num_cols].mean(axis = 0), "std_dev": df[num_cols].std(axis = 0)})
| | mean | std_dev |
|---|---|---|
| Area | 53048.284549 | 29324.095717 |
| Perimeter | 855.283459 | 214.289696 |
| MajorAxisLength | 320.141867 | 85.694186 |
| MinorAxisLength | 202.270714 | 44.970091 |
| AspectRation | 1.583242 | 0.246678 |
| Eccentricity | 0.750895 | 0.092002 |
| ConvexArea | 53768.200206 | 29774.915817 |
| EquivDiameter | 253.064220 | 59.177120 |
| Extent | 0.749733 | 0.049086 |
| Solidity | 0.987143 | 0.004660 |
| roundness | 0.873282 | 0.059520 |
| Compactness | 0.799864 | 0.061713 |
| ShapeFactor1 | 0.006564 | 0.001128 |
| ShapeFactor2 | 0.001716 | 0.000596 |
| ShapeFactor3 | 0.643590 | 0.098996 |
| ShapeFactor4 | 0.995063 | 0.004366 |
The numeric columns in the dataset have wildly different means and standard deviations, so it’s a good idea to normalize the data. I used the `StandardScaler` from scikit-learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[num_cols])
df[num_cols] = scaler.transform(df[num_cols])
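As a quick sanity check (a small addition; the original run didn’t include this step), the standardized columns should now each have mean approximately 0 and standard deviation approximately 1:

# after scaling, every numeric column should have mean ~0 and std ~1
pd.DataFrame({"mean": df[num_cols].mean(), "std_dev": df[num_cols].std()}).round(3)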
Logistic regression¶
Test run: binary classifier for one bean type¶
To see if this idea would work at all, I wanted to try classifying one type of bean at a time. I used Dermason as a quick test, since it is the most common type in the data. I copied the dataframe and added an `is_dermason` column.
df_test = df.copy()
df_test["is_dermason"] = df_test["Class"] == "DERMASON"
I generated a training set and a test set so I could fairly measure the model’s success. I set aside 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_test[num_cols], df_test["is_dermason"], train_size=0.8, random_state = 123)
I then ran a logistic classifier. Since I was just checking how well the approach works, I didn’t need the model’s individual predictions; I just used `score()`.
from sklearn.linear_model import LogisticRegression
clf_test = LogisticRegression()
clf_test.fit(X_train, y_train)
train_score = clf_test.score(X_train, y_train)
test_score = clf_test.score(X_test, y_test)
print(f"The model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
The model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.
The model accurately predicted whether a bean was of type “Dermason” for both the training and test sets, and since the two accuracies were about equal, I don’t see a reason to worry about overfitting. The model performed better than I expected.
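For context (a check added here, not part of the original analysis), accuracy on an imbalanced binary problem is best compared with the majority-class baseline, i.e. a model that always answers “not Dermason”:

# baseline accuracy: always predict False ("not Dermason");
# this equals the share of non-Dermason beans in the full dataset
baseline = 1 - df_test["is_dermason"].mean()
print(f"Always predicting 'not Dermason' would be {round(baseline*100, 1)}% accurate.")

Since Dermason beans make up roughly a quarter of the data, this baseline is about 74%, so the model’s 95% is a real improvement rather than an artifact of class imbalance.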
Repeating the test¶
I decided to repeat this test run for each species to see if different kinds of beans were easier or harder to classify.
species = df["Class"].unique()
df_binary = df.copy()
for sp in species:
    name = f"is_{sp}"
    # add a boolean column for each bean species
    df_binary[name] = df_binary["Class"] == sp
    # create training and test sets and score the model
    X_train, X_test, y_train, y_test = train_test_split(df_binary[num_cols], df_binary[name], train_size=0.8, random_state = 123)
    clf_bin = LogisticRegression()
    clf_bin.fit(X_train, y_train)
    train_score = clf_bin.score(X_train, y_train)
    test_score = clf_bin.score(X_test, y_test)
    print(f"When classifying {sp} beans, the model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
When classifying SEKER beans, the model achieved 97.9% accuracy on the training set and 98.0% accuracy on the test set.
When classifying BARBUNYA beans, the model achieved 97.9% accuracy on the training set and 98.3% accuracy on the test set.
When classifying BOMBAY beans, the model achieved 100.0% accuracy on the training set and 100.0% accuracy on the test set.
When classifying CALI beans, the model achieved 97.6% accuracy on the training set and 97.5% accuracy on the test set.
When classifying HOROZ beans, the model achieved 99.0% accuracy on the training set and 98.3% accuracy on the test set.
When classifying SIRA beans, the model achieved 93.8% accuracy on the training set and 92.9% accuracy on the test set.
When classifying DERMASON beans, the model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.
The model performed similarly well to the first test, achieving at least about 93% accuracy for every bean type, with test-set accuracy usually within 1% of training accuracy. Interestingly, the binary model did remarkably well for Bombay beans, achieving 100% accuracy on both the training and test sets! I hypothesized that some features must separate Bombay beans sharply from the other types, such as Bombay beans having a much larger area on average than all others. I created some histograms in Altair to take a look.
import altair as alt
alt.data_transformers.enable('default', max_rows=15000)
# for each numeric column, plot a histogram of that dimension with Bombay beans highlighted
charts = [alt.Chart(df_binary).mark_bar().encode(
    x = alt.X(col, bin = alt.BinParams(maxbins = 16)),
    y = "count()",
    color = "is_BOMBAY"
).properties(
    width = 120,
    height = 120
) for col in num_cols]
alt.ConcatChart(concat = charts, columns = 4)
As I suspected, the Bombay beans are much larger, showing a clear divide between this species and the others in dimensions like Area and the axis lengths. Therefore, the regression model has a clear boundary separating this species from the rest.
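The same divide shows up numerically (a quick check added here, not in the original notebook). Since the features were standardized, the per-class means below are in units of standard deviations:

# mean standardized Area per class; Bombay sits far above all other species
df_binary.groupby("Class")["Area"].mean().sort_values()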
Multinomial regression¶
Since logistic regression worked well when classifying one bean type at a time, I wanted to test classifying all 7 at once. I used the SAG solver, similarly to the logistic regression model we used for classifying the MNIST dataset in a previous homework. I also increased the tolerance to 0.01 because the model wasn’t converging within the default iteration limit.
df_multi = df.copy()
clf_multi = LogisticRegression(tol = 0.01, solver = "sag")
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(df_multi[num_cols], df_multi["Class"], train_size = 0.8, random_state = 123)
# fit and score the model
clf_multi.fit(X_train, y_train)
test_score = clf_multi.score(X_test, y_test)
train_score = clf_multi.score(X_train, y_train)
print(f"The model was {round(train_score*100,1)}% accurate on training data and {round(test_score*100,1)}% accurate on test data.")
The model was 92.5% accurate on training data and 92.4% accurate on test data.
While the multinomial model was less accurate than the binary classification models above, again, the model performed very similarly on the training and test sets and did well overall.
Confusion matrix¶
I was curious to see if any bean types were mixed up more often than others, so I created a confusion matrix. I based most of the code off of snippets provided in the Homework 6 template.
test_predictions = clf_multi.predict(X_test)
confusion_df = pd.DataFrame({
    "actual": y_test,
    "predicted": test_predictions
})
cm_color = alt.Chart(confusion_df).mark_rect().encode(
    x = "actual",
    y = "predicted",
    color = alt.Color("count()", scale = alt.Scale(scheme = "lightmulti"))
).properties(width = 400, height = 400)
cm_text = alt.Chart(confusion_df).mark_text(color = "black").encode(
    x = "actual",
    y = "predicted",
    text = "count()"
)
cm_color + cm_text
This works well enough to see exact counts, but doesn’t tell us much overall. Since the dataset is uneven, with some bean types having far more samples than others, the matrix (especially its color scale) skews toward the more common beans. For example, Dermason beans appear to be the best-classified at first glance, but it is Bombay beans that are classified correctly 100% of the time.
To remedy this, I created a confusion matrix that displays proportions instead of raw counts. Since I’ll want to reuse it later, I wrapped it in a function.
import numpy as np
def confusion_matrices(actual, predicted):
    # create every possible pair of bean types to compare between
    x, y = map(lambda a: a.reshape(-1), np.meshgrid(species, species))
    # this dataframe will have five columns: two for the pair of species names, and three for the count and proportions
    confusion_df = pd.DataFrame({
        "actual": x,
        "predicted": y
    })
    # count how many data points fall in each actual and predicted class
    actual_counts = pd.Series(actual).value_counts()
    predicted_counts = pd.Series(predicted).value_counts()
    # calculate the proportions: for each pair of actual class a and prediction b, find the number of matches
    # proportion1 is the proportion of all actual data points with class a
    # proportion2 is the proportion of all predictions with class b
    # add these to the dataframe.
    confusion_df["count"] = [sum((np.asarray(actual) == a) & (np.asarray(predicted) == b)) for a, b in zip(x, y)]
    confusion_df["proportion1"] = [confusion_df["count"][idx] / actual_counts[a] for idx, a in enumerate(x)]
    confusion_df["proportion2"] = [confusion_df["count"][idx] / predicted_counts[b] for idx, b in enumerate(y)]
    # plot the data with altair to display the confusion matrices
    cm2a_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "actual:N",
        y = "predicted:N",
        color = alt.Color("proportion1", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of actual data")
    cm2a_text = alt.Chart(confusion_df).mark_text().encode(
        x = "actual",
        y = "predicted",
        text = alt.Text("proportion1", format = ".1%"),
        color = alt.condition(alt.datum.proportion1 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )
    cm2b_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "predicted",
        y = "actual",
        color = alt.Color("proportion2", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of predictions")
    cm2b_text = alt.Chart(confusion_df).mark_text().encode(
        x = "predicted",
        y = "actual",
        text = alt.Text("proportion2", format = ".1%"),
        color = alt.condition(alt.datum.proportion2 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )
    return (cm2a_color + cm2a_text) | (cm2b_color + cm2b_text)
# generate predictions and run the confusion matrix
confusion_matrices(predicted = clf_multi.predict(X_test), actual = y_test)
The chart on the left shows, for each actual bean type, how often it was classified as each type; the chart on the right shows, for each predicted type, how often that prediction corresponded to each actual species. For example, 7.0% of Dermason beans were classified as Sira, while 5.9% of beans classified as Dermason were actually Sira. The tooltip on each square displays the raw count for that pair of actual and predicted values.
In contrast with the chart above, we can quickly tell that the model works well for most species. Again, Bombay beans are classified correctly 100% of the time with no false positives. The lowest accuracy is 87.1%, for Barbunya beans (in the left chart), which were most commonly misclassified as Cali beans.
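As a cross-check on the hand-built charts (an addition; the original analysis relied only on the custom function), scikit-learn can compute the same row-normalized proportions directly. This sketch assumes the clf_multi model and test split from above:

from sklearn.metrics import confusion_matrix
# rows are actual classes, columns are predictions;
# normalize="true" divides each row by the count of that actual class,
# matching "proportion1" in the left-hand chart above
cm = confusion_matrix(y_test, clf_multi.predict(X_test), labels = species, normalize = "true")
pd.DataFrame(cm, index = species, columns = species).round(3)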
K-nearest neighbors classifier¶
Test run¶
I also wanted to see how effective a K-nearest neighbors (KNN) classifier would be at categorizing this dataset. I used scikit-learn’s KNN model.
from sklearn.neighbors import KNeighborsClassifier
# use default parameters for now
knn = KNeighborsClassifier()
# create a train-test split, just like always
df_knn = df.copy()
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 100)
# fit and score the model
knn.fit(X_train, y_train)
train_score = knn.score(X_train, y_train)
test_score = knn.score(X_test, y_test)
print(f"The initial KNN model had an accuracy of {round(train_score*100, 1)}% on the training set and {round(test_score*100, 1)}% on the test set." )
The initial KNN model had an accuracy of 94.3% on the training set and 92.4% on the test set.
The KNN model performed about as well as the logistic regression model above, at roughly 92% test accuracy, but with a wider gap (~2%) between the training and test accuracies. I made another set of confusion matrices to compare more specific classification accuracies.
confusion_matrices(predicted = knn.predict(X_test), actual = y_test)
These confusion matrices are very similar to those generated for the logistic regression model. Again, we see that Bombay beans were always classified correctly, and Sira and Barbunya beans were classified the least accurately overall.
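The wider train-test gap is characteristic of KNN: with a small n_neighbors, the model partially memorizes the training set, since each training point contributes to its own neighborhood. As a quick illustration (a sketch added here, not part of the original run), with n_neighbors = 1 the training accuracy should be essentially 100%, while the test accuracy will be noticeably lower:

# with k = 1, each training point is its own nearest neighbor,
# so training accuracy is (nearly) perfect while test accuracy is not
knn1 = KNeighborsClassifier(n_neighbors = 1)
knn1.fit(X_train, y_train)
print(round(knn1.score(X_train, y_train)*100, 1), round(knn1.score(X_test, y_test)*100, 1))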
Tuning the model¶
I wanted to see if changing the parameters could improve the model. I tested values of `n_neighbors` from 3 to 30 and displayed the results in an Altair line plot.
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 13)
scores = []
trials = range(3, 31)
for n in trials:
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
tune_df = pd.DataFrame({"n": trials, "score": scores})
alt.Chart(tune_df).mark_line().encode(
    x = "n",
    y = alt.Y("score", scale = alt.Scale(domain = [0.8, 1]))
)
tune_df.sort_values(by = "score", ascending = False).head()
| | n | score |
|---|---|---|
| 19 | 22 | 0.930959 |
| 16 | 19 | 0.930959 |
| 14 | 17 | 0.929857 |
| 12 | 15 | 0.929857 |
| 20 | 23 | 0.929490 |
The model’s accuracy didn’t increase much through this tuning process. In the end, the model seems to work best when `n_neighbors` = 22.
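Because the best n_neighbors can vary with the particular train-test split (note that the tuning and final runs use different random_state values), a more robust way to pick it would be cross-validation. Here is a minimal sketch using GridSearchCV (an alternative to the loop above, not part of the original tuning):

from sklearn.model_selection import GridSearchCV
# 5-fold cross-validation over the same candidate values of n_neighbors
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(3, 31))}, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))

For the final model below, I kept n_neighbors = 22, the best value from the single-split search.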
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 123)
final_knn = KNeighborsClassifier(n_neighbors = 22)
final_knn.fit(X_train, y_train)
accuracy = final_knn.score(X_test, y_test)
display(confusion_matrices(predicted = final_knn.predict(X_test), actual = y_test))
print(f"This KNN model achieved an accuracy of {round(accuracy*100, 1)}%.")
This KNN model achieved an accuracy of 91.9%.
I wasn’t able to make the model any more accurate than about 92%, so KNN worked about as well as logistic regression.
A different approach: testing KMeans¶
Even though KMeans is an unsupervised learning model, I wanted to see if the clusters it came up with matched the original species in the dataset. I set `n_clusters` equal to the number of classes because I wanted to see how well the model could “guess” the actual species.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 7)
df_kmeans = df.copy()
kmeans.fit(df_kmeans[num_cols])
df_kmeans["Cluster"] = kmeans.predict(df_kmeans[num_cols])
Since the clusters aren’t explicitly associated with the species classes, I can’t directly calculate an accuracy or create a confusion matrix. Instead, I created scatter plots in Altair comparing the clusters to the actual species for a couple of pairs of dimensions.
feature_pairs = [["Area", "AspectRation"], ["ShapeFactor1", "ShapeFactor2"]]
charts = []
for fp in feature_pairs:
    class_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Class"
    ).properties(width = 250, height = 250, title = f"Species: {fp[0]} vs {fp[1]}")
    cluster_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Cluster:N"
    ).properties(width = 250, height = 250, title = f"KMeans Clusters: {fp[0]} vs {fp[1]}")
    charts.append(class_chart)
    charts.append(cluster_chart)
alt.ConcatChart(concat = charts, columns = 2).resolve_scale(color = "independent")
Although KMeans doesn’t directly report an accuracy against the true species labels, these plots give us some insight. It looks like the model merged Cali and Barbunya beans into one cluster, which is understandable: the logistic regression and KNN models also often mistook these two species for one another. With those two species merged, the “extra” seventh cluster appears to have formed at the fringes of a few other species, capturing the overlap between their outliers.
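One rough way to quantify the agreement (a heuristic added here; KMeans itself knows nothing about the species labels) is to relabel each cluster with its most common species and measure how often that relabeling matches the truth:

# assign each cluster the species that appears most often within it,
# then check how often this majority-vote label matches the true class
cluster_to_species = df_kmeans.groupby("Cluster")["Class"].agg(lambda s: s.mode()[0])
matched = df_kmeans["Cluster"].map(cluster_to_species) == df_kmeans["Class"]
print(f"Majority-vote relabeling matches the true species {round(matched.mean()*100, 1)}% of the time.")

Because the merged Cali/Barbunya cluster can only be relabeled as one of those two species, this number is capped below 100% even if every other cluster were perfectly pure.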
Summary¶
Logistic regression and K-nearest neighbors performed about equally well when classifying the species of beans; each achieved roughly 92% accuracy, with the same types classified more (or less) accurately in both models. An attempt at clustering with the KMeans algorithm yielded groupings similar to the actual species classes, but with some noticeable errors, including incorrectly merging beans from two species into one cluster.
References¶
I found this dataset on Kaggle.
The code for the first confusion matrix was adapted from one of the Math 10 homework templates. I often referred to the Altair and Scikit-Learn documentation, in particular:
altair.ConcatChart
sklearn.neighbors.KNeighborsClassifier
sklearn.linear_model.LogisticRegression
Conditional text formatting for confusion matrices: https://www.programcreek.com/python/example/117307/altair.condition