Categorizing Beans¶
Author: Evan McDuffie
Course Project, UC Irvine, Math 10, S22
Introduction¶
The dataset for this project included dimensions and features for ~13,000 beans of 7 species. I used logistic regression and the K-nearest neighbors algorithm to create models to predict a bean’s species. I wanted to test and compare the accuracies of these two methods.
Main portion of the project¶
Data import and cleaning¶
import pandas as pd
df = pd.read_csv("Dry_Beans_Dataset.csv")
First, I checked whether any values were missing and looked at the first few rows of the dataset. There are 16 columns of measurements, and the rightmost column contains each bean’s type. In total, there are 13611 beans of 7 types, with large differences in the counts of each type.
df.isna().any(axis = None)
False
display(df.shape)
df.head()
(13611, 17)
| | Area | Perimeter | MajorAxisLength | MinorAxisLength | AspectRation | Eccentricity | ConvexArea | EquivDiameter | Extent | Solidity | roundness | Compactness | ShapeFactor1 | ShapeFactor2 | ShapeFactor3 | ShapeFactor4 | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 28395 | 610.291 | 208.178117 | 173.888747 | 1.197191 | 0.549812 | 28715 | 190.141097 | 0.763923 | 0.988856 | 0.958027 | 0.913358 | 0.007332 | 0.003147 | 0.834222 | 0.998724 | SEKER |
| 1 | 28734 | 638.018 | 200.524796 | 182.734419 | 1.097356 | 0.411785 | 29172 | 191.272750 | 0.783968 | 0.984986 | 0.887034 | 0.953861 | 0.006979 | 0.003564 | 0.909851 | 0.998430 | SEKER |
| 2 | 29380 | 624.110 | 212.826130 | 175.931143 | 1.209713 | 0.562727 | 29690 | 193.410904 | 0.778113 | 0.989559 | 0.947849 | 0.908774 | 0.007244 | 0.003048 | 0.825871 | 0.999066 | SEKER |
| 3 | 30008 | 645.884 | 210.557999 | 182.516516 | 1.153638 | 0.498616 | 30724 | 195.467062 | 0.782681 | 0.976696 | 0.903936 | 0.928329 | 0.007017 | 0.003215 | 0.861794 | 0.994199 | SEKER |
| 4 | 30140 | 620.134 | 201.847882 | 190.279279 | 1.060798 | 0.333680 | 30417 | 195.896503 | 0.773098 | 0.990893 | 0.984877 | 0.970516 | 0.006697 | 0.003665 | 0.941900 | 0.999166 | SEKER |
df["Class"].value_counts()
DERMASON 3546
SIRA 2636
SEKER 2027
HOROZ 1928
CALI 1630
BARBUNYA 1322
BOMBAY 522
Name: Class, dtype: int64
Now, I looked at summary statistics for each numeric column.
from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
pd.DataFrame({"mean": df[num_cols].mean(axis = 0), "std_dev": df[num_cols].std(axis = 0)})
| | mean | std_dev |
|---|---|---|
| Area | 53048.284549 | 29324.095717 |
| Perimeter | 855.283459 | 214.289696 |
| MajorAxisLength | 320.141867 | 85.694186 |
| MinorAxisLength | 202.270714 | 44.970091 |
| AspectRation | 1.583242 | 0.246678 |
| Eccentricity | 0.750895 | 0.092002 |
| ConvexArea | 53768.200206 | 29774.915817 |
| EquivDiameter | 253.064220 | 59.177120 |
| Extent | 0.749733 | 0.049086 |
| Solidity | 0.987143 | 0.004660 |
| roundness | 0.873282 | 0.059520 |
| Compactness | 0.799864 | 0.061713 |
| ShapeFactor1 | 0.006564 | 0.001128 |
| ShapeFactor2 | 0.001716 | 0.000596 |
| ShapeFactor3 | 0.643590 | 0.098996 |
| ShapeFactor4 | 0.995063 | 0.004366 |
The numeric columns in the dataset have wildly different means and standard deviations, so it’s a good idea to normalize the data. I used the `StandardScaler` from scikit-learn.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df[num_cols])
df[num_cols] = scaler.transform(df[num_cols])
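As a quick sanity check (a small addition; the original run didn’t include this step), the standardized columns should now each have mean approximately 0 and standard deviation approximately 1:

# after scaling, every numeric column should have mean ~0 and std ~1
pd.DataFrame({"mean": df[num_cols].mean(), "std_dev": df[num_cols].std()}).round(3)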
Logistic regression¶
Test run: binary classifier for one bean type¶
To see if this idea would work at all, I wanted to try classifying one type of bean at a time. I used Dermason as a quick test, since it is the most common type in the data. I copied the dataframe and added an `is_dermason` column.
df_test = df.copy()
df_test["is_dermason"] = df_test["Class"] == "DERMASON"
I generated a training set and a test set so I could fairly measure the model’s success. I set aside 80% of the data for training and 20% for testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_test[num_cols], df_test["is_dermason"], train_size=0.8, random_state = 123)
I then ran a logistic classifier. Since I was just checking how well the approach works, I didn’t need the model’s individual predictions; I just used `score()`.
from sklearn.linear_model import LogisticRegression
clf_test = LogisticRegression()
clf_test.fit(X_train, y_train)
train_score = clf_test.score(X_train, y_train)
test_score = clf_test.score(X_test, y_test)
print(f"The model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
The model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.
The model accurately predicted whether a bean was of type “Dermason” for both the training and test sets, and since the two accuracies were about equal, I don’t see a reason to worry about overfitting. The model performed better than I expected.
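For context (a check added here, not part of the original analysis), accuracy on an imbalanced binary problem is best compared with the majority-class baseline, i.e. a model that always answers “not Dermason”:

# baseline accuracy: always predict False ("not Dermason");
# this equals the share of non-Dermason beans in the full dataset
baseline = 1 - df_test["is_dermason"].mean()
print(f"Always predicting 'not Dermason' would be {round(baseline*100, 1)}% accurate.")

Since Dermason beans make up roughly a quarter of the data, this baseline is about 74%, so the model’s 95% is a real improvement rather than an artifact of class imbalance.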
Repeating the test¶
I decided to repeat this test run for each species to see if different kinds of beans were easier or harder to classify.
species = df["Class"].unique()
df_binary = df.copy()
for sp in species:
    name = f"is_{sp}"
    # add a boolean column for each bean species
    df_binary[name] = df_binary["Class"] == sp
    # create training and test sets and score the model
    X_train, X_test, y_train, y_test = train_test_split(df_binary[num_cols], df_binary[name], train_size=0.8, random_state = 123)
    clf_bin = LogisticRegression()
    clf_bin.fit(X_train, y_train)
    train_score = clf_bin.score(X_train, y_train)
    test_score = clf_bin.score(X_test, y_test)
    print(f"When classifying {sp} beans, the model achieved {round(train_score*100,1)}% accuracy on the training set and {round(test_score*100,1)}% accuracy on the test set.")
When classifying SEKER beans, the model achieved 97.9% accuracy on the training set and 98.0% accuracy on the test set.
When classifying BARBUNYA beans, the model achieved 97.9% accuracy on the training set and 98.3% accuracy on the test set.
When classifying BOMBAY beans, the model achieved 100.0% accuracy on the training set and 100.0% accuracy on the test set.
When classifying CALI beans, the model achieved 97.6% accuracy on the training set and 97.5% accuracy on the test set.
When classifying HOROZ beans, the model achieved 99.0% accuracy on the training set and 98.3% accuracy on the test set.
When classifying SIRA beans, the model achieved 93.8% accuracy on the training set and 92.9% accuracy on the test set.
When classifying DERMASON beans, the model achieved 95.3% accuracy on the training set and 95.3% accuracy on the test set.
The model performed similarly well to the first test, achieving at least about 93% accuracy for every bean type, with test-set accuracy usually within 1% of training accuracy. Interestingly, the binary model did remarkably well for Bombay beans, achieving 100% accuracy on both the training and test sets! I hypothesized that some features must separate Bombay beans sharply from the other types, such as Bombay beans having a much larger area on average than all others. I created some histograms in Altair to take a look.
import altair as alt
alt.data_transformers.enable('default', max_rows=15000)
# for each numeric column, plot a histogram of that dimension with Bombay beans highlighted
charts = [alt.Chart(df_binary).mark_bar().encode(
    x = alt.X(col, bin = alt.BinParams(maxbins = 16)),
    y = "count()",
    color = "is_BOMBAY"
).properties(
    width = 120,
    height = 120
) for col in num_cols]
alt.ConcatChart(concat = charts, columns = 4)
As I suspected, the Bombay beans are much larger, showing a clear divide between this species and the others in dimensions like Area and the axis lengths. Therefore, the regression model has a clear boundary separating this species from the rest.
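The same divide shows up numerically (a quick check added here, not in the original notebook). Since the features were standardized, the per-class means below are in units of standard deviations:

# mean standardized Area per class; Bombay sits far above all other species
df_binary.groupby("Class")["Area"].mean().sort_values()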
Multinomial regression¶
Since logistic regression worked well when classifying one bean type at a time, I wanted to test classifying all 7 at once. I used the SAG solver, similarly to the logistic regression model we used for classifying the MNIST dataset in a previous homework. I also increased the tolerance to 0.01 because the model wasn’t converging within the default iteration limit.
df_multi = df.copy()
clf_multi = LogisticRegression(tol = 0.01, solver = "sag")
# split the dataset
X_train, X_test, y_train, y_test = train_test_split(df_multi[num_cols], df_multi["Class"], train_size = 0.8, random_state = 123)
# fit and score the model
clf_multi.fit(X_train, y_train)
test_score = clf_multi.score(X_test, y_test)
train_score = clf_multi.score(X_train, y_train)
print(f"The model was {round(train_score*100,1)}% accurate on training data and {round(test_score*100,1)}% accurate on test data.")
The model was 92.5% accurate on training data and 92.4% accurate on test data.
While the multinomial model was less accurate than the binary classification models above, again, the model performed very similarly on the training and test sets and did well overall.
Confusion matrix¶
I was curious to see if any bean types were mixed up more often than others, so I created a confusion matrix. I based most of the code off of snippets provided in the Homework 6 template.
test_predictions = clf_multi.predict(X_test)
confusion_df = pd.DataFrame({
    "actual": y_test,
    "predicted": test_predictions
})
cm_color = alt.Chart(confusion_df).mark_rect().encode(
    x = "actual",
    y = "predicted",
    color = alt.Color("count()", scale = alt.Scale(scheme = "lightmulti"))
).properties(width = 400, height = 400)
cm_text = alt.Chart(confusion_df).mark_text(color = "black").encode(
    x = "actual",
    y = "predicted",
    text = "count()"
)
cm_color + cm_text
This works well enough to see exact counts, but doesn’t tell us much overall. Since the dataset is uneven, with some bean types having far more samples than others, the matrix (especially its color scale) skews toward the more common beans. For example, Dermason beans appear to be the best-classified at first glance, but it is Bombay beans that are classified correctly 100% of the time.
To remedy this, I created a confusion matrix that displays proportions instead of raw counts. Since I’ll want to reuse it later, I wrapped it in a function.
import numpy as np
def confusion_matrices(actual, predicted):
    # create every possible pair of bean types to compare between
    x, y = map(lambda a: a.reshape(-1), np.meshgrid(species, species))
    # this dataframe will have five columns: two for the pair of species names, and three for the count and proportions
    confusion_df = pd.DataFrame({
        "actual": x,
        "predicted": y
    })
    # count how many data points fall in each actual and predicted class
    actual_counts = pd.Series(actual).value_counts()
    predicted_counts = pd.Series(predicted).value_counts()
    # calculate the proportions: for each pair of actual class a and prediction b, find the number of matches
    # proportion1 is the proportion of all actual data points with class a
    # proportion2 is the proportion of all predictions with class b
    # add these to the dataframe.
    confusion_df["count"] = [sum((np.asarray(actual) == a) & (np.asarray(predicted) == b)) for a, b in zip(x, y)]
    confusion_df["proportion1"] = [confusion_df["count"][idx] / actual_counts[a] for idx, a in enumerate(x)]
    confusion_df["proportion2"] = [confusion_df["count"][idx] / predicted_counts[b] for idx, b in enumerate(y)]
    # plot the data with altair to display the confusion matrices
    cm2a_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "actual:N",
        y = "predicted:N",
        color = alt.Color("proportion1", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of actual data")
    cm2a_text = alt.Chart(confusion_df).mark_text().encode(
        x = "actual",
        y = "predicted",
        text = alt.Text("proportion1", format = ".1%"),
        color = alt.condition(alt.datum.proportion1 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )
    cm2b_color = alt.Chart(confusion_df).mark_rect().encode(
        x = "predicted",
        y = "actual",
        color = alt.Color("proportion2", scale = alt.Scale(scheme = "lightmulti")),
        tooltip = ["count"]
    ).properties(width = 350, height = 350, title = "As a proportion of predictions")
    cm2b_text = alt.Chart(confusion_df).mark_text().encode(
        x = "predicted",
        y = "actual",
        text = alt.Text("proportion2", format = ".1%"),
        color = alt.condition(alt.datum.proportion2 > 0.8, alt.value("white"), alt.value("black")),
        tooltip = ["count"]
    )
    return (cm2a_color + cm2a_text) | (cm2b_color + cm2b_text)
# generate predictions and run the confusion matrix
confusion_matrices(predicted = clf_multi.predict(X_test), actual = y_test)
The chart on the left shows, for each actual bean type, how often it was classified as each type; the chart on the right shows, for each predicted type, how often that prediction corresponded to each actual species. For example, 7.0% of Dermason beans were classified as Sira, while 5.9% of beans classified as Dermason were actually Sira. The tooltip on each square displays the raw count for that pair of actual and predicted values.
In contrast with the chart above, we can quickly tell that the model works well for most species. Again, Bombay beans are classified correctly 100% of the time with no false positives. The lowest accuracy is 87.1%, for Barbunya beans (in the left chart), which were most commonly misclassified as Cali beans.
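As a cross-check on the hand-built charts (an addition; the original analysis relied only on the custom function), scikit-learn can compute the same row-normalized proportions directly. This sketch assumes the clf_multi model and test split from above:

from sklearn.metrics import confusion_matrix
# rows are actual classes, columns are predictions;
# normalize="true" divides each row by the count of that actual class,
# matching "proportion1" in the left-hand chart above
cm = confusion_matrix(y_test, clf_multi.predict(X_test), labels = species, normalize = "true")
pd.DataFrame(cm, index = species, columns = species).round(3)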
K-nearest neighbors classifier¶
Test run¶
I also wanted to see how effective a K-nearest neighbors (KNN) classifier would be at categorizing this dataset. I used scikit-learn’s KNN model.
from sklearn.neighbors import KNeighborsClassifier
# use default parameters for now
knn = KNeighborsClassifier()
# create a train-test split, just like always
df_knn = df.copy()
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 100)
# fit and score the model
knn.fit(X_train, y_train)
train_score = knn.score(X_train, y_train)
test_score = knn.score(X_test, y_test)
print(f"The initial KNN model had an accuracy of {round(train_score*100, 1)}% on the training set and {round(test_score*100, 1)}% on the test set." )
The initial KNN model had an accuracy of 94.3% on the training set and 92.4% on the test set.
The KNN model performed about as well as the logistic regression model above, at roughly 92% test accuracy, but with a wider gap (~2%) between the training and test accuracies. I made another set of confusion matrices to compare more specific classification accuracies.
confusion_matrices(predicted = knn.predict(X_test), actual = y_test)
These confusion matrices are very similar to those generated for the logistic regression model. Again, we see that Bombay beans were always classified correctly, and Sira and Barbunya beans were classified the least accurately overall.
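The wider train-test gap is characteristic of KNN: with a small n_neighbors, the model partially memorizes the training set, since each training point contributes to its own neighborhood. As a quick illustration (a sketch added here, not part of the original run), with n_neighbors = 1 the training accuracy should be essentially 100%, while the test accuracy will be noticeably lower:

# with k = 1, each training point is its own nearest neighbor,
# so training accuracy is (nearly) perfect while test accuracy is not
knn1 = KNeighborsClassifier(n_neighbors = 1)
knn1.fit(X_train, y_train)
print(round(knn1.score(X_train, y_train)*100, 1), round(knn1.score(X_test, y_test)*100, 1))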
Tuning the model¶
I wanted to see if changing the parameters could improve the model. I tested values of `n_neighbors` from 3 to 30 and displayed the results in an Altair line plot.
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 13)
scores = []
trials = range(3, 31)
for n in trials:
    knn = KNeighborsClassifier(n_neighbors = n)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
tune_df = pd.DataFrame({"n": trials, "score": scores})
alt.Chart(tune_df).mark_line().encode(
    x = "n",
    y = alt.Y("score", scale = alt.Scale(domain = [0.8, 1]))
)
tune_df.sort_values(by = "score", ascending = False).head()
| | n | score |
|---|---|---|
| 19 | 22 | 0.930959 |
| 16 | 19 | 0.930959 |
| 14 | 17 | 0.929857 |
| 12 | 15 | 0.929857 |
| 20 | 23 | 0.929490 |
The model’s accuracy didn’t increase much through this tuning process. In the end, the model seems to work best when `n_neighbors` = 22.
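Because the best n_neighbors can vary with the particular train-test split (note that the tuning and final runs use different random_state values), a more robust way to pick it would be cross-validation. Here is a minimal sketch using GridSearchCV (an alternative to the loop above, not part of the original tuning):

from sklearn.model_selection import GridSearchCV
# 5-fold cross-validation over the same candidate values of n_neighbors
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": list(range(3, 31))}, cv = 5)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))

For the final model below, I kept n_neighbors = 22, the best value from the single-split search.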
X_train, X_test, y_train, y_test = train_test_split(df_knn[num_cols], df_knn["Class"], train_size = 0.8, random_state = 123)
final_knn = KNeighborsClassifier(n_neighbors = 22)
final_knn.fit(X_train, y_train)
accuracy = final_knn.score(X_test, y_test)
display(confusion_matrices(predicted = final_knn.predict(X_test), actual = y_test))
print(f"This KNN model achieved an accuracy of {round(accuracy*100, 1)}%.")
This KNN model achieved an accuracy of 91.9%.
I wasn’t able to make the model any more accurate than about 92%, so KNN worked about as well as logistic regression.
A different approach: testing KMeans¶
Even though KMeans is an unsupervised learning model, I wanted to see if the clusters it came up with matched the original species in the dataset. I set `n_clusters` equal to the number of classes because I wanted to see how well the model could “guess” the actual species.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 7)
df_kmeans = df.copy()
kmeans.fit(df_kmeans[num_cols])
df_kmeans["Cluster"] = kmeans.predict(df_kmeans[num_cols])
Since the clusters aren’t explicitly associated with the species classes, I can’t directly calculate an accuracy or create a confusion matrix. Instead, I created scatter plots in Altair comparing the clusters to the actual species for a couple of pairs of dimensions.
feature_pairs = [["Area", "AspectRation"], ["ShapeFactor1", "ShapeFactor2"]]
charts = []
for fp in feature_pairs:
    class_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Class"
    ).properties(width = 250, height = 250, title = f"Species: {fp[0]} vs {fp[1]}")
    cluster_chart = alt.Chart(df_kmeans).mark_point(filled = True, size = 20, opacity = 0.2).encode(
        x = fp[0],
        y = fp[1],
        color = "Cluster:N"
    ).properties(width = 250, height = 250, title = f"KMeans Clusters: {fp[0]} vs {fp[1]}")
    charts.append(class_chart)
    charts.append(cluster_chart)
alt.ConcatChart(concat = charts, columns = 2).resolve_scale(color = "independent")
Although KMeans doesn’t directly report an accuracy against the true species labels, these plots give us some insight. It looks like the model merged Cali and Barbunya beans into one cluster, which is understandable: the logistic regression and KNN models also often mistook these two species for one another. With those two species merged, the “extra” seventh cluster appears to have formed at the fringes of a few other species, capturing the overlap between their outliers.
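One rough way to quantify the agreement (a heuristic added here; KMeans itself knows nothing about the species labels) is to relabel each cluster with its most common species and measure how often that relabeling matches the truth:

# assign each cluster the species that appears most often within it,
# then check how often this majority-vote label matches the true class
cluster_to_species = df_kmeans.groupby("Cluster")["Class"].agg(lambda s: s.mode()[0])
matched = df_kmeans["Cluster"].map(cluster_to_species) == df_kmeans["Class"]
print(f"Majority-vote relabeling matches the true species {round(matched.mean()*100, 1)}% of the time.")

Because the merged Cali/Barbunya cluster can only be relabeled as one of those two species, this number is capped below 100% even if every other cluster were perfectly pure.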
Summary¶
Logistic regression and K-nearest neighbors performed about equally well when classifying the species of beans; each achieved roughly 92% accuracy, with the same types classified more (or less) accurately in both models. An attempt at clustering with the KMeans algorithm yielded groupings similar to the actual species classes, but with some noticeable errors, including incorrectly merging beans from two species into one cluster.
References¶
I found this dataset on Kaggle.
The code for the first confusion matrix was adapted from one of the Math 10 homework templates. I often referred to the Altair and Scikit-Learn documentation, in particular:
altair.ConcatChart
sklearn.neighbors.KNeighborsClassifier
sklearn.linear_model.LogisticRegression
Conditional text formatting for confusion matrices: https://www.programcreek.com/python/example/117307/altair.condition