Prediction and Analysis of Crabs’ Age and Sex Based on Physical Features#

Author: Kaijie Zhang

Course Project, UC Irvine, Math 10, S23

Introduction#

In everyday experience, distinguishing between male and female crabs is important: sex largely dictates a crab’s role in the market. For instance, female crabs are often used to produce crab roe because of their higher roe content. However, identifying a crab’s sex requires experience, especially for non-industrial fishermen; seasoned fishermen are said to be able to estimate a crab’s sex by evaluating its weight and size. I intend to start with the age of the crabs, using a regressor to predict age from the available data. Subsequently, I will attempt to develop a machine learning model that helps fishermen accurately determine the sex of crabs, and assess its efficacy.

from PIL import Image
Image.open("dataset-cover.jpeg")
../../_images/7828ab6c1643a844cb70c67b96a1dd3766453f52f22914a69141f879d1f5c25b.png

Section 1: Data Cleaning and Feature Engineering#

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import product
from pandas.api.types import is_numeric_dtype
from sklearn.tree import plot_tree, DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.cluster import KMeans
df = pd.read_csv("Crab_Features.csv")

Here is a short preview of the dataset:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 10 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   id              200000 non-null  int64  
 1   Sex             200000 non-null  object 
 2   Length          200000 non-null  float64
 3   Diameter        200000 non-null  float64
 4   Height          200000 non-null  float64
 5   Weight          200000 non-null  float64
 6   Shucked Weight  200000 non-null  float64
 7   Viscera Weight  200000 non-null  float64
 8   Shell Weight    200000 non-null  float64
 9   Age             200000 non-null  float64
dtypes: float64(8), int64(1), object(1)
memory usage: 15.3+ MB

Since the original dataset is too large for Deepnote to process in full, I take the first 5000 rows as the investigated dataset.

df = df[:5000].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
df.dropna(inplace=True)
df.shape
(5000, 10)

The original dataset provides several plain but key variables. We can create our own features based on them:

df['Volume'] = df['Length'] * df['Diameter'] * df['Height']

# Water loss during the experiment: total weight minus the component weights
df["water_loss"] = df["Weight"] - df["Shucked Weight"] - df['Viscera Weight'] - df['Shell Weight']
# Clamp negative values (measurement noise) to a small positive floor
df["water_loss"] = np.where(df["water_loss"] < 0,
                            min(df["Shucked Weight"].min(),
                                df["Viscera Weight"].min(),
                                df["Shell Weight"].min()),
                            df["water_loss"])
# Approximate crab density
df['Density'] = df['Weight'] / df['Volume']

# Normalize weights to represent them as a ratio of the total weight
df['Shucked Weight Ratio'] = df['Shucked Weight'] / df['Weight']
df['Shell Weight Ratio'] = df['Shell Weight'] / df['Weight']
df['Viscera Weight Ratio'] = df['Viscera Weight'] / df['Weight']
df['water_loss Ratio'] = df['water_loss'] / df['Weight']

# Numeric encoding of Sex: I=0, M=1, F=2
df['Sex_num'] = df['Sex'].replace({'I': 0, 'M': 1, 'F': 2})

Volume & Density: more familiar variables to replace height, length, and diameter. It would be a little confusing if we used three separate variables to represent the “size” of the crabs.

This is the relation between volume and weight (shown in the Seaborn plot below), which is clearly linear according to its regression line. Combining several variables like this helps us review the whole dataset more effectively. (We can see there is an outlier at about (29, 3.45).)

with sns.axes_style('darkgrid'):
    # Create jointplot without regression line
    joint = sns.jointplot(data=df, x='Weight', y='Volume', color='black')
    # Add the regression line separately
    joint.ax_joint.clear()  # Clear the original points
    sns.scatterplot(data=df, x='Weight', y='Volume', ax=joint.ax_joint, color='black')
    sns.regplot(data=df, x='Weight', y='Volume', ax=joint.ax_joint, color='red', scatter=False)
../../_images/5712359281aa210eda277bbd605831cac939a5702e90973333451a2766c7414b.png
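As a quick sanity check, we can look up that outlier directly. This is a minimal sketch: the filter thresholds below are eyeballed from the plot, not exact values.

# Rows near the apparent outlier at roughly (Weight 29, Volume 3.45);
# the cutoffs are rough guesses read off the plot above
df[(df["Weight"] > 25) & (df["Volume"] > 3)]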

Shucked/Viscera/Shell Weight ratio: the ratio of each section within an individual may tell us more than simply reporting the weights themselves. [It will be used later to explain water_loss.]

water_loss: We observe that the sum of Shucked/Viscera/Shell Weight does not equal the original weight of the individual. One significant reason is that the water contained in live crabs flows out while researchers dissect them, so a significant part of the lost weight corresponds to water loss. (As you can see below, water loss is actually visible as a share of total weight [by ratio].)
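We can verify this numerically. The quick check below (a sketch using the columns defined above) computes the fraction of crabs whose three component weights sum to more than the recorded total, i.e., whose raw water_loss would be negative.

# Fraction of rows where the component weights exceed the total weight
components = df["Shucked Weight"] + df["Viscera Weight"] + df["Shell Weight"]
(components > df["Weight"]).mean()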

# Melt the DataFrame to long format
df_melt = df.copy().melt(id_vars='Age', value_vars=['Shucked Weight Ratio', 'Shell Weight Ratio', 'Viscera Weight Ratio', 'water_loss Ratio'], var_name='Weight Type', value_name='Weight Ratio')

# Deepnote limits displayed dataframes to 5000 rows, so aggregate by taking the mean ratio for each Age
df_agg = df_melt.groupby(['Age', 'Weight Type']).mean().reset_index()

# Create the stacked bar chart
chart = alt.Chart(df_agg).mark_bar().encode(
    x='Age:O',
    y=alt.Y('Weight Ratio:Q', stack='normalize'),
    color='Weight Type:N',
    tooltip = ['Weight Ratio:Q']
).properties(
    title="Ratio of Crabs' Weight")

chart

id is not related to our prediction, so I preview the dataset without it. (Note that drop without inplace=True returns a new DataFrame, so df itself still contains id; we exclude it explicitly when fitting the models below.)

df.drop("id", axis=1)
Sex Length Diameter Height Weight Shucked Weight Viscera Weight Shell Weight Age Shucked Weight Ratio Viscera Weight Ratio Shell Weight Ratio Volume water_loss Density Sex_num
0 M 1.5750 1.2250 0.3750 31.226974 12.303683 6.321938 9.638830 10.0 0.394008 0.202451 0.308670 0.723516 2.962523 43.160055 1
1 I 1.2375 1.0000 0.3750 21.885814 7.654365 3.798833 7.654365 19.0 0.349741 0.173575 0.349741 0.464063 2.778251 47.161350 0
2 F 1.4500 1.1625 0.4125 28.250277 11.127179 7.016501 7.257472 11.0 0.393879 0.248369 0.256899 0.695320 2.849125 40.629155 2
3 I 1.3500 1.0250 0.3750 21.588144 9.738053 4.110678 6.378637 9.0 0.451083 0.190414 0.295469 0.518906 1.360776 41.603169 0
4 I 1.1375 0.8750 0.2875 14.968536 5.953395 2.962523 3.713785 8.0 0.397727 0.197917 0.248106 0.286152 2.338834 52.309675 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 M 1.6750 1.2750 0.5000 37.407165 14.855138 7.002326 11.906790 14.0 0.397120 0.187192 0.318302 1.067813 3.642911 35.031586 1
4996 M 1.6000 1.2000 0.4375 31.397071 14.047177 7.016501 8.930093 10.0 0.447404 0.223476 0.284424 0.840000 1.403300 37.377466 1
4997 M 1.6000 1.2750 0.4375 36.273185 15.521351 9.270286 10.404267 10.0 0.427902 0.255569 0.286831 0.892500 1.077281 40.642224 1
4998 F 1.3625 1.0375 0.3375 21.559795 9.128539 4.479221 7.087375 10.0 0.423406 0.207758 0.328731 0.477088 0.864660 45.190404 2
4999 F 1.4500 1.1625 0.3625 27.144646 10.517665 6.251065 7.512618 13.0 0.387467 0.230287 0.276762 0.611039 2.863299 44.423750 2

5000 rows × 16 columns

df.columns #Preview all current variables in the dataset:
Index(['id', 'Sex', 'Length', 'Diameter', 'Height', 'Weight', 'Shucked Weight',
       'Viscera Weight', 'Shell Weight', 'Age', 'Volume', 'water_loss',
       'Density', 'Shucked Weight Ratio', 'Shell Weight Ratio',
       'Viscera Weight Ratio', 'water_loss Ratio', 'Sex_num'],
      dtype='object')

Since Sex is the only string variable in the dataset, I prefer to separate it into three Boolean indicator columns: male, female, and indeterminate.

df_num = pd.get_dummies(df, columns=['Sex'])

Let’s see whether all columns in df_num are now numeric:

len([c for c in df_num.columns if is_numeric_dtype(df_num[c])]) == df_num.shape[1]
True
df_num.sample(3)
id Length Diameter Height Weight Shucked Weight Viscera Weight Shell Weight Age Shucked Weight Ratio Viscera Weight Ratio Shell Weight Ratio Volume water_loss Density Sex_num Sex_F Sex_I Sex_M
4697 4697 1.1250 0.8750 0.2750 12.998246 5.499803 2.806601 3.685435 9.0 0.423119 0.215921 0.283533 0.270703 1.006407 48.016608 0 0 1 0
4217 4217 1.4375 1.1875 0.4500 29.852024 10.347568 6.265239 11.765042 16.0 0.346629 0.209877 0.394112 0.768164 1.474174 38.861521 1 0 0 1
2978 2978 1.2875 0.9875 0.2875 16.768729 7.342521 3.614561 4.677668 10.0 0.437870 0.215554 0.278952 0.365529 1.133980 45.875199 2 1 0 0

Section 2: Regressor - Age Prediction#

A basic view of the distribution of the population across ages. We can see the majority of the population falls in the range from 7 to 11 years.

alt.Chart(df).mark_bar().encode(
    x = alt.X("Age", bin=alt.Bin(maxbins=100)),
    y="count()"
).properties(
    title="Distribution of Age")

First, before predicting the sex, I want to use a regressor to predict age from a crab’s weight, connecting a PolynomialFeatures transformer and a LinearRegression model in a Pipeline so that we can compare regression curves of different degrees.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

The first stage uses the PolynomialFeatures class from sklearn.preprocessing to transform the original variable ‘Weight’ into polynomial features of a specified degree d. The argument include_bias=False ensures that a column of ones (the bias or intercept) is not added to the polynomial features. The second stage uses the LinearRegression class from sklearn.linear_model to perform linear regression on the transformed data.

def poly_fit(df, d, color):
    df_sub = df.copy()
    X = df_sub[["Weight"]]
    y = df_sub["Age"]
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d, include_bias=False)),
        ("reg", LinearRegression())
    ])
    pipe.fit(X, y)
    df_sub[f"pred{d}"] = pipe.predict(X)
    chart = alt.Chart(df_sub).mark_line(clip=True, color=color).encode(
        x=alt.X("Weight", scale=alt.Scale(domain=(0, 80))),
        y=alt.Y(f"pred{d}", scale=alt.Scale(domain=(0, 28))),
    )
    return chart
chartl = alt.Chart(df).mark_circle().encode(
        x="Weight",
        y="Age"
    )

I choose three representative degrees: 1, 2, and 8. An overly high degree will overfit individual data points, which makes the plot look messy.

state_colors = [(1, 'red'), (2, 'blue'), (8, 'green')]
chart_list = [poly_fit(df, d, color) for d, color in state_colors]

We can observe that the degree of the regression determines how many bends the fitted curve can have: a straight line for degree 1, a single bend for degree 2, and multiple bends for degree 8. But all of them capture the positive relationship between weight and age (which may become slightly negative when crabs are old).

chartl+chart_list[0]+chart_list[1]+chart_list[2]

We can observe that where the data points become sparse (and their distribution more scattered), the up-and-down fluctuation of the high-degree regression line becomes very exaggerated. This is because fitting a flexible model to only a few data points can lead to overfitting.

cols0 = ["Length","Diameter","Height","Weight"]
X_train0, X_test0, y_train0, y_test0 = train_test_split(df[cols0], df["Age"], test_size = 0.3)

Now, I want to use K-Nearest Neighbors to check how the value of k affects the error when predicting age from multiple features. The “k” in the K-Nearest Neighbors regressor is the number of nearest neighbors considered when making a prediction for a new instance.

reg = KNeighborsRegressor(n_neighbors=10)
reg.fit(X_train0, y_train0)
KNeighborsRegressor(n_neighbors=10)
def get_scores(k):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train0, y_train0)
    train_error = mean_absolute_error(reg.predict(X_train0), y_train0)
    test_error = mean_absolute_error(reg.predict(X_test0), y_test0)
    return (train_error, test_error)
df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})
df_scores.head()
k train_error test_error
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 NaN NaN
4 5 NaN NaN
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores
k train_error test_error
0 1 0.007723 2.265510
1 2 1.104977 2.051701
2 3 1.341533 1.964643
3 4 1.457165 1.903936
4 5 1.514245 1.868045
... ... ... ...
144 145 1.763669 1.788714
145 146 1.764298 1.788868
146 147 1.764584 1.789765
147 148 1.764393 1.789594
148 149 1.763939 1.788651

149 rows × 3 columns

If “k” is too small, the model may be overly sensitive to noise in the data. If “k” is too large, the model may include points that are too far away from the query point, which can result in predictions that are less relevant or accurate. The same phenomenon appears in other datasets as well.

df_scores["k_value"] = 1/df_scores.k
ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "k_value",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "k_value",
    y = "test_error"
)
alt.layer(ctrain,ctest)
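From df_scores we can also read off the best k programmatically. This is a minimal sketch; the selected k will vary with the random train/test split.

# k with the smallest test error in df_scores
best_row = df_scores.loc[df_scores["test_error"].idxmin()]
print(f"best k = {int(best_row['k'])}, test MAE = {best_row['test_error']:.3f}")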

Section 2.5: Classifier - Distribution of Crabs’ Sex#

Delete outliers so that the plot makes more sense.

df = df[df["Density"]<=120]

You can click the radio buttons below to see the distribution for each sex:

Sex_features = df["Sex"].unique()
Sex_radio = alt.binding_radio(options=Sex_features)
Sex_select = alt.selection_single(
    fields=["Sex"], bind=Sex_radio, name="Select to see the distribution for each sex:(Male,Indeterminate,Female) "
)
Sex_color_condition = alt.condition(
    Sex_select,
    alt.Color("Sex:N", legend=None),
    alt.value("lightgray"),
)

Sex_distribution = alt.Chart(df).mark_point(filled=True).encode(
    x=alt.X("Volume", scale=alt.Scale(domain=(0,1.8))),
    y=alt.Y("Density", scale=alt.Scale(domain=(20,120))),
    color="Sex:N",
    tooltip = ["Sex","Volume","Density"]
).add_selection(Sex_select).encode(color=Sex_color_condition).properties(
    title="Use the radio button to filter the scatter plot")
Sex_distribution
pic_sex = Image.open("Sex_Crab_eg.png")

According to the above plot of the Density and Volume distribution of crabs of different sexes, we can tell that when volume is in a relatively low range, researchers are more likely to label the crab with an indeterminate sex. Perhaps this means the features that help tell male from female crabs are not obvious enough while the crabs are small (as the picture below shows, distinctive features are required for classification).

pic_sex
../../_images/5fc36cabac238d1e03caa94a9703fa8ecec3c10638202bcf13da1aa30ab62488.png
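We can back up this observation with a quick summary on the existing df (a rough sketch): if the reading of the plot is right, indeterminate (“I”) crabs should come out noticeably smaller on average.

# Mean and median Volume for each sex
df.groupby("Sex")["Volume"].agg(["mean", "median", "count"])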

Now, I want to see how K-Means Clustering works to group the sex.

colsS = ["Volume","Sex_num","Density","Sex"]
X_trainS = df[colsS][:2500]
X_testS = df[colsS][2500:]
kmeans = KMeans(n_clusters=3)
kmeans.fit(X_trainS[["Volume","Density"]])
KMeans(n_clusters=3)
kmeans.cluster_centers_
array([[ 0.49578693, 46.90816583],
       [ 0.62110157, 40.04474117],
       [ 0.22056579, 58.98263489]])

Mark the centers as triangles. We can see that K-Means does not apply well to this data when we look for a grouping of Sex based on volume and density. (Afterwards we will apply the same centers to the held-out data and see that both sets are grouped in the same way.)

arr = kmeans.predict(X_trainS[["Volume","Density"]])
X_trainS["cluster"] = arr
# Relabel clusters so they line up with Sex_num (I=0, M=1, F=2)
X_trainS['cluster'] = X_trainS['cluster'].replace({1: 0, 2: 1, 0: 2})
# Cluster centers as a DataFrame so Altair can plot them
df_center = pd.DataFrame(kmeans.cluster_centers_, columns=['ColumnX', 'ColumnY'])
cen0 = alt.Chart(X_trainS).mark_circle().encode(
    x=alt.X("Volume", scale=alt.Scale(zero=False)),
    y=alt.Y("Density", scale=alt.Scale(zero=False)),
    tooltip=["Sex"],
    color="cluster:N"
)
cen2 = alt.Chart(df_center).mark_point(color="black", shape='triangle', size=80).encode(
    x='ColumnX',
    y='ColumnY',
    tooltip=['ColumnX', 'ColumnY']
)
alt.layer(cen0, cen2)

Check whether the clusters based on the training dataset coincide exactly with the crabs’ sex:

lst = X_trainS['cluster'] == X_trainS["Sex_num"]
# lst is a Boolean Series; check whether it contains any False,
# i.e., whether any cluster label disagrees with the actual sex
contains_false = lst.isin([False]).any()
contains_false
True

True: the grouping differs from the actual sexes (which is also easy to observe by eye).
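To quantify the disagreement, the sketch below computes the fraction of training rows where the (relabeled) cluster happens to match Sex_num. Note that the relabeling above is one manual choice, so this is only a rough measure.

# Fraction of training rows where the relabeled cluster agrees with Sex_num
(X_trainS["cluster"] == X_trainS["Sex_num"]).mean()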

arr_new = kmeans.predict(X_testS[["Volume","Density"]])
X_testS["cluster_new"] = arr_new
X_testS['cluster_new'] = X_testS['cluster_new'].replace({1: 0, 2: 1, 0: 2})
cen1 = alt.Chart(X_testS).mark_circle().encode(
    x=alt.X("Volume", scale=alt.Scale(zero=False)),
    y=alt.Y("Density", scale=alt.Scale(zero=False)),
    tooltip = ["Sex"],
    color="cluster_new:N"
)
alt.layer(cen1, cen2)

Same for the new dataset:

lst_new = X_testS['cluster_new'] == X_testS["Sex_num"]
# Boolean Series: True where the cluster label agrees with the actual sex
contains_false_new = lst_new.isin([False]).any()
contains_false_new
True

Next I use a decision tree classifier, but I would like to fit the model with fewer features first:

cols1 = ["Volume","Density"]
X_train1, X_test1, y_train1, y_test1 = train_test_split(df[cols1], df["Sex"], test_size=0.3)
clf1= DecisionTreeClassifier(max_leaf_nodes=5)
clf1.fit(X_train1, y_train1)
DecisionTreeClassifier(max_leaf_nodes=5)
clf1.classes_
array(['F', 'I', 'M'], dtype=object)
fig = plt.figure(figsize=(20,10))
_ = plot_tree(clf1, 
                feature_names=clf1.feature_names_in_,
                class_names=["Female", "Indeterminate","Male"],
                filled=True)
../../_images/b4a8f3937446cfa847d73b10ca4cb7c344f23c95840f8a113bf396d30357c055.png
xx = np.linspace(0,2,70)
yy = np.linspace(0,0.7,70)
df_art = pd.DataFrame(list(product(xx,yy)), columns=cols1)
df_temp = df_art.copy()
df_temp["pred"] = clf1.predict_proba(df_temp)[:, 1] #So we are looking for "Indeterminate"
color_spec = alt.Color("pred:Q", scale=alt.Scale(scheme="blueorange"))
alt.Chart(df_temp).mark_circle().encode(
    x="Volume",
    y="Density",
    color=color_spec,
    tooltip=["Volume", "Density", "pred"] 
).properties(
    title="True values"
).configure_axis(
    grid=False
)
clf1.score(X_test1,y_test1)
0.571714476317545
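To judge whether this score is meaningful, we can compare it against the majority-class baseline (a minimal sketch; with three roughly balanced classes the baseline should sit near 1/3):

# Accuracy of always predicting the most common sex in the test labels
y_test1.value_counts(normalize=True).max()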

Section 3: Classifier - Crab’s Sex Prediction#

Use a heatmap to view which features might significantly correlate with one particular sex.

corr = df_num.drop(columns=['id','Sex_num']).corr().round(2)
plt.figure(figsize=(20,10))
sns.heatmap(corr, vmin=-1, vmax=1, center=0, square=False, annot=True, cmap='coolwarm')
<AxesSubplot:>
../../_images/538e174030dfdabc65ef9466aa2494dc25aa44aecc3f36ca6b107a824945401c.png
cols2 = ['Length', 'Diameter', 'Height', 'Weight',
         'Shucked Weight', 'Viscera Weight', 'Shell Weight', "Age"]
X_train2, X_test2, y_train2, y_test2 = train_test_split(df[cols2], df["Sex"], test_size=0.3)
clf2 = DecisionTreeClassifier(max_leaf_nodes=18)
clf2.fit(X_train2, y_train2)
DecisionTreeClassifier(max_leaf_nodes=18)
fig = plt.figure(figsize=(20,10))
_ = plot_tree(clf2, 
                feature_names=clf2.feature_names_in_,
                class_names=["Female", "Indeterminate","Male"],
                filled=True)
../../_images/db80a05072bfcd182454f9b881746dadc7284e29f030e0a557d2e5302a77f898.png

Unfortunately, using more features does not significantly improve the accuracy of the classification, even though I tried to prevent overfitting by controlling the maximum number of leaves and the depth of the tree.

clf2.score(X_test2,y_test2)
0.6050700466977985
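One way to check whether tree size is the bottleneck is to sweep max_leaf_nodes and watch the test score (a sketch under the same train/test split; exact numbers will vary):

# Test accuracy for a range of tree sizes; a plateau suggests that
# extra capacity will not help with these features
for n in [5, 10, 18, 30, 50, 100]:
    clf = DecisionTreeClassifier(max_leaf_nodes=n)
    clf.fit(X_train2, y_train2)
    print(n, round(clf.score(X_test2, y_test2), 3))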

Conclusion#

In this project, we mainly used a series of physical characteristics of crabs to train machine learning models, aiming to predict the sex of crabs as accurately as possible. It is regrettable that our final prediction accuracy is only about 60%, roughly 10 percentage points above half. However, this is also related to the limitations of the dataset. Since it primarily provides numerical measurements, while in reality most people rely on distinctive organs to distinguish the sex of crabs (hence, training an image recognition model might be more accurate), the existing numerical data alone are not comprehensive enough.

Summary#

After previewing the dataset and engineering new features, I started with predicting the age of the crabs, using a regressor on the available data. After that, I developed two similar machine learning models with decision tree classifiers to help fishermen accurately determine the sex of crabs, and assessed their efficacy; however, the results do not support a very efficient outcome. During the investigation I also used K-Nearest Neighbors and K-Means Clustering as extensions, among other tools such as Seaborn.

References#

  • Source of the dataset:

https://www.kaggle.com/datasets/sidhus/crab-age-prediction

  • Other references that I found helpful:

https://wsg.washington.edu/crabteam-malevsfemale/

https://towardsdatascience.com/how-to-create-bindings-and-conditions-between-multiple-plots-using-altair-4e4fe907de37

https://jakevdp.github.io/PythonDataScienceHandbook/04.14-visualization-with-seaborn.html

Submission#

Using the Share button at the top right, enable Comment privileges for anyone with a link to the project. Then submit that link on Canvas.
