Categorizing Beans

Author: Evan McDuffie

Course Project, UC Irvine, Math 10, S22

Introduction

The dataset for this project included dimensions and features for ~13,000 beans of 7 species. I used logistic regression and the K-nearest neighbors algorithm to create models to predict a bean’s species. I wanted to test and compare the accuracies of these two methods.

Main portion of the project

Data import and cleaning

import pandas as pd
df = pd.read_csv("Dry_Beans_Dataset.csv")

First, I checked whether any values were missing and looked at the first few rows of the dataset. There are 16 columns of measurements, and the rightmost column contains each bean's type. In total, there are 13611 beans of 7 types, with large differences in the counts of each type.

df.isna().any(axis = None)
False
display(df.shape)
(13611, 17)
df.head()
Area Perimeter MajorAxisLength MinorAxisLength AspectRation Eccentricity ConvexArea EquivDiameter Extent Solidity roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4 Class
0 28395 610.291 208.178117 173.888747 1.197191 0.549812 28715 190.141097 0.763923 0.988856 0.958027 0.913358 0.007332 0.003147 0.834222 0.998724 SEKER
1 28734 638.018 200.524796 182.734419 1.097356 0.411785 29172 191.272750 0.783968 0.984986 0.887034 0.953861 0.006979 0.003564 0.909851 0.998430 SEKER
2 29380 624.110 212.826130 175.931143 1.209713 0.562727 29690 193.410904 0.778113 0.989559 0.947849 0.908774 0.007244 0.003048 0.825871 0.999066 SEKER
3 30008 645.884 210.557999 182.516516 1.153638 0.498616 30724 195.467062 0.782681 0.976696 0.903936 0.928329 0.007017 0.003215 0.861794 0.994199 SEKER
4 30140 620.134 201.847882 190.279279 1.060798 0.333680 30417 195.896503 0.773098 0.990893 0.984877 0.970516 0.006697 0.003665 0.941900 0.999166 SEKER
df["Class"].value_counts()
DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: Class, dtype: int64
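
For reference, the same counts can be viewed as proportions, which makes the imbalance easier to judge against later accuracies; this is a quick extra check, not part of the original runs:

df["Class"].value_counts(normalize = True).round(3)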

Now, I looked at summary statistics for each numeric column.

from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]

pd.DataFrame({"mean": df[num_cols].mean(axis = 0), "std_dev": df[num_cols].std(axis = 0)})
mean std_dev
Area 53048.284549 29324.095717
Perimeter 855.283459 214.289696
MajorAxisLength 320.141867 85.694186
MinorAxisLength 202.270714 44.970091
AspectRation 1.583242 0.246678
Eccentricity 0.750895 0.092002
ConvexArea 53768.200206 29774.915817
EquivDiameter 253.064220 59.177120
Extent 0.749733 0.049086
Solidity 0.987143 0.004660
roundness 0.873282 0.059520
Compactness 0.799864 0.061713
ShapeFactor1 0.006564 0.001128
ShapeFactor2 0.001716 0.000596
ShapeFactor3 0.643590 0.098996
ShapeFactor4 0.995063 0.004366

The numeric columns in the dataset have wildly different means and standard deviations, so it's a good idea to standardize the data to zero mean and unit variance. I used StandardScaler from scikit-learn.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[num_cols])
df[num_cols] = scaler.transform(df[num_cols])
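
As a quick sanity check (an extra step, not in the original run), every scaled column should now have mean ≈ 0 and standard deviation ≈ 1:

# each standardized column should now have mean ~0 and std ~1
df[num_cols].agg(["mean", "std"]).round(3)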

Logistic regression

Test run: binary classifier for one bean type

To see whether this idea would work at all, I started by classifying one type of bean at a time. I used Dermason for a quick test, since it is the most common type in the data. I copied the dataframe and added an is_dermason column.

df_test = df.copy()
df_test["is_dermason"] = df_test["Class"] == "DERMASON"

I split the data into training and test sets so I could fairly measure the model's success, setting aside 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_test[num_cols], df_test["is_dermason"], train_size = 0.8, random_state = 123)

I then fit a logistic regression classifier. Since I was only checking whether the model works at this stage, I didn't need the individual predictions, just the overall accuracy from score().

from sklearn.linear_model import LogisticRegression
clf_test = LogisticRegression()
clf_test.fit(X_train, y_train)
train_score = clf_test.score(X_train, y_train)
test_score = clf_test.score(X_test, y_test)
print(f"The model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
The model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.

The model accurately predicted whether a bean was of type Dermason on both the training and test sets, and since the two accuracies were nearly equal, there's no sign of overfitting. The model performed better than I expected.
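
For context, it helps to compare this to a trivial baseline: since only about 26% of the beans are Dermason, a classifier that always answers "not Dermason" would already score around 74%. A quick sketch of that check (my addition):

# baseline: always predict the majority label (not Dermason)
baseline = 1 - df_test["is_dermason"].mean()
print(f"Majority-class baseline accuracy: {round(baseline*100, 1)}%")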

Repeating the test

I decided to repeat this test run for each species to see if different kinds of beans were easier or harder to classify.

species = df["Class"].unique()
df_binary = df.copy()
for sp in species:
    name = f"is_{sp}"
    # add a boolean column for each bean species
    df_binary[name] = df_binary["Class"] == sp

    # create training and test sets and score the model
    X_train, X_test, y_train, y_test = train_test_split(df_binary[num_cols], df_binary[name], train_size = 0.8, random_state = 123)
    clf_bin = LogisticRegression()
    clf_bin.fit(X_train, y_train)
    train_score = clf_bin.score(X_train, y_train)
    test_score = clf_bin.score(X_test, y_test)
    print(f"When classifying {sp} beans, the model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
When classifying SEKER beans, the model achieved 97.9% accuracy on the training set and 98.0% accuracy on the test set.
When classifying BARBUNYA beans, the model achieved 97.9% accuracy on the training set and 98.3% accuracy on the test set.
When classifying BOMBAY beans, the model achieved 100.0% accuracy on the training set and 100.0% accuracy on the test set.
When classifying CALI beans, the model achieved 97.6% accuracy on the training set and 97.5% accuracy on the test set.
When classifying HOROZ beans, the model achieved 99.0% accuracy on the training set and 98.3% accuracy on the test set.
When classifying SIRA beans, the model achieved 93.8% accuracy on the training set and 92.9% accuracy on the test set.
When classifying DERMASON beans, the model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.

The models performed similarly to the first test, scoring above 92% for every bean type, with test accuracy within about 1% of training accuracy. Interestingly, the binary model was perfect for Bombay beans, achieving 100% accuracy on both the training and test sets! I hypothesized that some features must separate Bombay beans sharply from the other types, such as Bombay beans having a much larger average area than all the others. I created some histograms in Altair to take a look.

import altair as alt
alt.data_transformers.enable('default', max_rows=15000)

# for each numeric column, plot a histogram of that dimension with Bombay beans highlighted
charts = [alt.Chart(df_binary).mark_bar().encode(
    x = alt.X(col, bin = alt.BinParams(maxbins = 16)),
    y = "count()",
    color = "is_BOMBAY"
).properties(
    width = 120,
    height = 120
) for col in num_cols]

alt.ConcatChart(concat = charts, columns = 4)

As I suspected, Bombay beans are much larger, showing a clear divide between this species and the others in dimensions like Area and the axis lengths. This gives the regression model a clean boundary for separating this species from the rest.
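
To put a rough number on that divide, the group means of the standardized Area show how far Bombay beans sit from the rest in units of standard deviations; a quick extra check, not in the original run:

# compare standardized Area between Bombay beans and all others
df_binary.groupby("is_BOMBAY")["Area"].agg(["mean", "min", "max"])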

Multinomial regression

Since logistic regression worked well for classifying one bean type at a time, I wanted to try classifying all 7 types at once. I used the SAG solver, similar to the logistic regression model we used to classify the MNIST dataset in a previous homework. I also increased the tolerance to 0.01 because the model wasn't converging within the default iteration limit.

df_multi = df.copy()
clf_multi = LogisticRegression(tol = 0.01, solver = "sag")

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(df_multi[num_cols], df_multi["Class"], train_size = 0.8, random_state = 123)

# fit and score the model
clf_multi.fit(X_train, y_train)
test_score = clf_multi.score(X_test, y_test)
train_score = clf_multi.score(X_train, y_train)

print(f"The model was {round(train_score*100,1)}% accurate on training data and {round(test_score*100,1)}% accurate on test data.")
The model was 92.5% accurate on training data and 92.4% accurate on test data.

While the multinomial model was less accurate than the binary classifiers above, it again performed very similarly on the training and test sets and did well overall.
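
For a finer-grained view than a single accuracy number, scikit-learn's classification report gives per-class precision, recall, and F1 scores; a sketch of a check one could run here before building the confusion matrices below:

from sklearn.metrics import classification_report
# per-class precision, recall, and F1 for the multinomial model on the test set
print(classification_report(y_test, clf_multi.predict(X_test)))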

Confusion matrix

I was curious to see if any bean types were mixed up more often than others, so I created a confusion matrix. I based most of the code on snippets provided in the Homework 6 template.

test_predictions = clf_multi.predict(X_test)
confusion_df = pd.DataFrame({
    "actual": y_test,
    "predicted": test_predictions
})

cm_color = alt.Chart(confusion_df).mark_rect().encode(
    x = "actual",
    y = "predicted",
    color = alt.Color("count()", scale = alt.Scale(scheme = "lightmulti"))
).properties(width = 400, height = 400)

cm_text = alt.Chart(confusion_df).mark_text(color = "black").encode(
    x = "actual",
    y = "predicted",
    text = "count()"
)

cm_color + cm_text

This works well enough for reading off exact counts, but it doesn't tell us much overall. Since the dataset is uneven, with some bean types having far more data than others, the matrix (especially its color scale) skews toward the more common beans. For example, Dermason beans appear to be the best-classified at first glance, but it's actually Bombay beans that are classified correctly 100% of the time.

To remedy this, I created confusion matrices that display proportions instead of raw counts. Since I'll want to reuse this later, I wrapped the code in a function.

import numpy as np

def confusion_matrices(actual, predicted):
    # convert inputs to plain arrays so pandas index alignment can't cause trouble
    actual = np.asarray(actual)
    predicted = np.asarray(predicted)

    # create every possible pair of bean types to compare between
    x, y = map(lambda a: a.reshape(-1), np.meshgrid(species, species))

    # this dataframe will have five columns: two for the pair of species names, and three for the count and proportions
    confusion_df = pd.DataFrame({
        "actual": x,
        "predicted": y
    })
    
    # count actual and predicted classes
    actual_counts = pd.Series(actual).value_counts()
    predicted_counts = pd.Series(predicted).value_counts()

    # calculate the proportions: for each pair of actual class a and prediction b, find the number of matches
    # proportion1 is the proportion of all actual data points with class a
    # proportion2 is the proportion of all predictions with class b
    # add these to the dataframe.
    confusion_df["count"] = [sum((y_test == a) & (predicted == b)) for a, b in zip(x, y)]
    confusion_df["proportion1"] = [confusion_df["count"][idx] / actual_counts[a] for idx, a in enumerate(x)]
    confusion_df["proportion2"] = [confusion_df["count"][idx] / predicted_counts[a] for idx, a in enumerate(y)]

    # plot the data with altair to display the confusion matrix
    cm2a_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "actual:N",
        y = "predicted:N",
        color = alt.Color("proportion1", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of actual data")
    cm2a_text = alt.Chart(confusion_df).mark_text().encode(
        x = "actual",
        y = "predicted",
        text = alt.Text("proportion1", format = ".1%"),
        color = alt.condition(alt.datum.proportion1 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )
    cm2b_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "predicted",
        y = "actual",
        color = alt.Color("proportion2", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of predictions")
    cm2b_text = alt.Chart(confusion_df).mark_text().encode(
        x = "predicted",
        y = "actual",
        text = alt.Text("proportion2", format = ".1%"),
        color = alt.condition(alt.datum.proportion2 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )

    return (cm2a_color + cm2a_text) | (cm2b_color + cm2b_text)
# generate predictions and run the confusion matrix
confusion_matrices(predicted = clf_multi.predict(X_test), actual = y_test)

The chart on the left shows, for each actual bean type, how often it was classified as each type; the chart on the right shows, for each predicted class, how often those predictions correspond to each actual species. For example, 7.0% of Dermason beans were classified as Sira, while 5.9% of beans classified as Dermason were actually Sira. The tooltip on each square displays the raw count for that pair of actual value and prediction.

In contrast with the chart above, we can now quickly tell that the model works very well for most species. Again, Bombay beans are classified correctly 100% of the time, with no false positives. The lowest accuracy is 87.1% for Barbunya beans (from the left chart), which were most often misclassified as Cali beans.
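
As an aside, the same proportions can be computed directly with scikit-learn's confusion_matrix and its normalize option; a minimal sketch, if the raw arrays are wanted rather than charts:

from sklearn.metrics import confusion_matrix
# normalized over true labels (rows): corresponds to the left chart
cm_actual = confusion_matrix(y_test, clf_multi.predict(X_test), labels = species, normalize = "true")
# normalized over predicted labels (columns): corresponds to the right chart
cm_pred = confusion_matrix(y_test, clf_multi.predict(X_test), labels = species, normalize = "pred")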

K-nearest neighbors classifier

Test run

I also wanted to see how effective a K-nearest neighbors (KNN) classifier would be at categorizing this dataset. I used scikit-learn’s KNN model.

from sklearn.neighbors import KNeighborsClassifier

# use default parameters for now
knn = KNeighborsClassifier()

# create a train-test split, just like always
df_knn = df.copy()
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 100)

# fit and score the model
knn.fit(X_train, y_train)
train_score = knn.score(X_train, y_train)
test_score = knn.score(X_test, y_test)

print(f"The initial KNN model had an accuracy of {round(train_score*100, 1)}% on the training set and {round(test_score*100, 1)}% on the test set." )
The initial KNN model had an accuracy of 94.3% on the training set and 92.4% on the test set.

The KNN model performed about as well as the logistic regression model above, at roughly 92% test accuracy, but with a wider gap (~2%) between training and test accuracy. I made another set of confusion matrices to compare the per-class results.

confusion_matrices(predicted = knn.predict(X_test), actual = y_test)

These confusion matrices are very similar to those for the logistic regression model. Again, Bombay beans were always classified correctly, while Sira and Barbunya beans were classified the least accurately overall.

Tuning the model

I wanted to see if changing the parameters could make the model perform any better. I tested values of n_neighbors from 3 to 30 and displayed the results in an Altair line plot.

X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 13)

scores = []
trials = range(3, 31)
for n in trials:
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))

tune_df = pd.DataFrame({"n": trials, "score": scores})
alt.Chart(tune_df).mark_line().encode(
    x = "n",
    y = alt.Y("score", scale = alt.Scale(domain = [0.8, 1]))
)
tune_df.sort_values(by = "score", ascending = False).head()
     n     score
19  22  0.930959
16  19  0.930959
14  17  0.929857
12  15  0.929857
20  23  0.929490

The model's accuracy didn't increase much through this tuning process. In the end, the model seems to work best at n_neighbors = 22, essentially tied with 19 on this split.
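
Another knob worth trying would be distance weighting, where closer neighbors get more influence on the vote; this is a sketch of a variant I didn't actually run, so I can't say whether it helps on this data:

# weight neighbors by inverse distance instead of uniformly
knn_w = KNeighborsClassifier(n_neighbors = 22, weights = "distance")
knn_w.fit(X_train, y_train)
print(round(knn_w.score(X_test, y_test)*100, 1))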

X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 123)
final_knn = KNeighborsClassifier(n_neighbors = 22)
final_knn.fit(X_train, y_train)
accuracy = final_knn.score(X_test, y_test)
display(confusion_matrices(predicted = final_knn.predict(X_test), actual = y_test))
print(f"This KNN model achieved an accuracy of {round(accuracy*100, 1)}%.")
This KNN model achieved an accuracy of 91.9%.

I wasn't able to push the model much past 92% accuracy, so KNN worked about as well as logistic regression.
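
A more systematic approach than tuning against a single train-test split would be a cross-validated grid search; a minimal sketch of how that could look (not something I ran for this project):

from sklearn.model_selection import GridSearchCV
# 5-fold cross-validated search over the same range of n_neighbors
search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(3, 31)}, cv = 5)
search.fit(df_knn[num_cols], df_knn["Class"])
print(search.best_params_, round(search.best_score_, 4))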

A different approach: testing KMeans

Even though KMeans is an unsupervised learning model, I wanted to see whether the clusters it found matched the actual species in the dataset. I set n_clusters equal to the number of classes because I wanted to see how well the model could "guess" the actual species.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 7)

df_kmeans = df.copy()
kmeans.fit(df_kmeans[num_cols])
df_kmeans["Cluster"] = kmeans.predict(df_kmeans[num_cols])

Since the clusters aren't explicitly associated with the species classes, I can't directly calculate an accuracy or build a confusion matrix. Instead, I created scatter plots in Altair comparing the clusters to the actual species for a couple of pairs of dimensions.

feature_pairs = [["Area", "AspectRation"], ["ShapeFactor1", "ShapeFactor2"]]
charts = []

for fp in feature_pairs:
    class_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Class"
    ).properties(width = 250, height = 250, title = f"Species: {fp[0]} vs {fp[1]}")
    cluster_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Cluster:N"
    ).properties(width = 250, height = 250, title = f"KMeans Clusters: {fp[0]} vs {fp[1]}")
    charts.append(class_chart)
    charts.append(cluster_chart)

alt.ConcatChart(concat = charts, columns = 2).resolve_scale(color = "independent")

Although these plots don't produce a single accuracy number the way the classifiers did, they give us some insight. It looks like the model merged Cali and Barbunya beans into one cluster. This is understandable, since the logistic regression and KNN models also often mistook these two species for one another. The seventh cluster, freed up by that merge, seems to have formed at the fringes of a few other species, capturing the overlap among their outliers.
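
That said, one way to quantify the agreement without mapping clusters to species is the adjusted Rand index, which scores how well two labelings match regardless of label names; a quick check one could add here:

from sklearn.metrics import adjusted_rand_score
# 1.0 means the clusters match the species exactly; ~0.0 means no better than chance
print(adjusted_rand_score(df_kmeans["Class"], df_kmeans["Cluster"]))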

Summary

Logistic regression and K-nearest neighbors performed about equally well when classifying the bean species, each reaching roughly 92% test accuracy, with some types classified more accurately than others under both models. An attempt at clustering with the KMeans algorithm yielded groupings similar to the actual species classes, but with some noticeable errors, including merging two species into a single cluster.

References

I found this dataset on Kaggle.

The code for the first confusion matrix was adapted from one of the Math 10 homework templates. I often referred to the Altair and Scikit-Learn documentation, in particular:

altair.ConcatChart
sklearn.neighbors.KNeighborsClassifier
sklearn.linear_model.LogisticRegression

Conditional text formatting for confusion matrices: https://www.programcreek.com/python/example/117307/altair.condition
