Zoo Animal Classification

Author: Alisa Crowe

Course Project, UC Irvine, Math 10, S22

Introduction

In this project, I will be examining two datasets that hold data about 101 zoo animals and their characteristics. I will use scikit-learn’s K-Means Clustering, Logistic Regression, Decision Tree Classifier, and K-Nearest Neighbors to attempt to classify the animals, discussing the results and any challenges I come across. I will then compare these methods to conclude which was best at classifying.

Main portion of the project

Data Cleaning and Merging Dataframes

import pandas as pd
df_zoo = pd.read_csv("zoo.csv")
df_zoo
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize class_type
0 aardvark 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
1 antelope 1 0 0 1 0 0 0 1 1 1 0 0 4 1 0 1 1
2 bass 0 0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 4
3 bear 1 0 0 1 0 0 1 1 1 1 0 0 4 0 0 1 1
4 boar 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 1 0 0 2 1 0 1 1
97 wasp 1 0 1 0 1 0 0 0 0 1 1 0 6 0 0 0 6
98 wolf 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1 1
99 worm 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 7
100 wren 0 1 1 0 1 0 0 0 1 1 0 0 2 1 0 0 2

101 rows × 18 columns

Here we introduce our first dataframe df_zoo. To ensure that there are no repeats of species, we can check value_counts() of the column ‘animal_name’.

df_zoo["animal_name"].value_counts()
frog        2
newt        1
aardvark    1
cavy        1
termite     1
           ..
crab        1
stingray    1
seahorse    1
bear        1
ladybird    1
Name: animal_name, Length: 100, dtype: int64

We can see that there are two rows for frogs in this dataframe, but only one of every other species.

df_zoo[df_zoo.loc[:,"animal_name"] == "frog"]
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize class_type
25 frog 0 0 1 0 0 1 1 1 1 1 0 0 4 0 0 0 4
26 frog 0 0 1 0 0 1 1 1 1 1 1 0 4 0 0 0 4

Looking at the ‘venomous’ column, it is clear that these two rows are not duplicates; we can leave both of them in.
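The same check can be done in code instead of by eye. The sketch below uses a small made-up dataframe (not the real zoo data) to show the difference between rows that are exact duplicates and rows that merely share a name; the variable names are illustrative.

```python
import pandas as pd

# Toy rows standing in for df_zoo (values are illustrative, not the real data).
toy = pd.DataFrame({
    "animal_name": ["frog", "frog", "newt", "newt"],
    "venomous":    [0, 1, 0, 0],
    "legs":        [4, 4, 4, 4],
})

# Rows that repeat an earlier row in every column.
exact_dupes = toy[toy.duplicated(keep="first")]

# Rows that share a name, whether or not the other columns agree.
same_name = toy[toy.duplicated(subset="animal_name", keep=False)]

print(exact_dupes)
print(same_name)
```

Here the two toy "frog" rows differ in "venomous", so they are not flagged as exact duplicates, while the second "newt" row is.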

df_class = pd.read_csv("class.csv")
df_class
Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names
0 1 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
1 2 20 Bird chicken, crow, dove, duck, flamingo, gull, haw...
2 3 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara
3 4 13 Fish bass, carp, catfish, chub, dogfish, haddock, h...
4 5 4 Amphibian frog, frog, newt, toad
5 6 8 Bug flea, gnat, honeybee, housefly, ladybird, moth...
6 7 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi...

Since I plan on merging these two dataframes, I want to ensure that the number of rows for each value of df_zoo["class_type"] matches the values in df_class["Number_Of_Animal_Species_In_Class"].

df_zoo["class_type"].value_counts()
0    41
1    20
3    13
6    10
5     8
2     5
4     4
Name: class_type, dtype: int64
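This comparison can also be made explicit rather than read off the two outputs. A minimal sketch, assuming two toy dataframes with the same column names as df_zoo and df_class (the values are made up):

```python
import pandas as pd

# Toy stand-ins for df_zoo and df_class (illustrative values only).
zoo = pd.DataFrame({"class_type": [1, 1, 1, 2, 2, 3]})
classes = pd.DataFrame({
    "Class_Number": [1, 2, 3],
    "Number_Of_Animal_Species_In_Class": [3, 2, 1],
})

observed = zoo["class_type"].value_counts().sort_index()
expected = classes.set_index("Class_Number")["Number_Of_Animal_Species_In_Class"]

# True only if every class has as many rows as the lookup table says it should.
counts_match = observed.to_dict() == expected.to_dict()
print(counts_match)
```

Note that both columns must use the same numbering (both starting at 0 or both at 1) for this comparison to be meaningful.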

To make things easier to read, I want to subtract one from every value in both df_zoo["class_type"] and df_class["Class_Number"], so that the class numbers match the index of df_class and start at 0 instead of 1. I am doing this using apply with a lambda function so that it is applied to every entry in the column; a lambda function is appropriate here because the function is very simple.

df_zoo["class_type"] = df_zoo["class_type"].apply(lambda x: x-1)
df_class["Class_Number"] = df_class["Class_Number"].apply(lambda x: x-1)
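For a shift this simple, pandas arithmetic on the whole column gives the same result as apply with a lambda. A small sketch on a made-up Series:

```python
import pandas as pd

labels = pd.Series([1, 2, 7], name="class_type")

shifted_apply = labels.apply(lambda x: x - 1)  # element by element, as in the project
shifted_vector = labels - 1                    # vectorized arithmetic on the whole Series

print(shifted_apply.equals(shifted_vector))
```

Either way, the assignment changes the column in place, so running the cell a second time would subtract one again.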

Now to clean the data, I am using .isna().any().any() to see if there are any missing values in either dataframe. In this case, there are none.

# data cleaning
df_zoo.isna().any().any()
False
df_class.isna().any().any()
False

In order to make one dataframe, I am going to merge the two dataframes together with .merge. The class number is stored under a different column name in each dataframe, so I pass left_on="class_type" and right_on="Class_Number" to merge on it. We can see that the resulting dataframe df has 101 rows; this is because I used how="left", which keeps every row of df_zoo.

df = df_zoo.merge(df_class, how="left", left_on="class_type", right_on="Class_Number")
df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... fins legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names
0 aardvark 1 0 0 1 0 0 1 1 1 ... 0 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
1 antelope 1 0 0 1 0 0 0 1 1 ... 0 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
2 bass 0 0 1 0 0 1 1 1 1 ... 1 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h...
3 bear 1 0 0 1 0 0 1 1 1 ... 0 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
4 boar 1 0 0 1 0 0 1 1 1 ... 0 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 0 2 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
97 wasp 1 0 1 0 1 0 0 0 0 ... 0 6 0 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth...
98 wolf 1 0 0 1 0 0 1 1 1 ... 0 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,...
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi...
100 wren 0 1 1 0 1 0 0 0 1 ... 0 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw...

101 rows × 22 columns
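A merge like this can be made to check itself. The sketch below, on toy dataframes with made-up values, uses two optional arguments of pandas' merge: validate, which raises an error if the lookup table has repeated keys, and indicator, which records whether each row found a match.

```python
import pandas as pd

zoo = pd.DataFrame({"animal_name": ["bass", "bear", "wren"], "class_type": [3, 0, 1]})
classes = pd.DataFrame({"Class_Number": [0, 1, 3], "Class_Type": ["Mammal", "Bird", "Fish"]})

merged = zoo.merge(
    classes,
    how="left",
    left_on="class_type",
    right_on="Class_Number",
    validate="many_to_one",  # raises MergeError if a Class_Number repeats in `classes`
    indicator=True,          # adds a "_merge" column recording where each row matched
)

print(len(merged) == len(zoo))
print((merged["_merge"] == "both").all())
```

If any animal had a class number missing from the lookup table, its "_merge" value would be "left_only" and its Class_Type would be NaN.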

Now that we are working with only one dataframe, we can check the dtypes to see which columns can be used for the various machine learning techniques we are going to be using. Since all of the characteristic data (e.g. “hair”, “eggs”) is of type int64, these can be used.

df.dtypes
animal_name                          object
hair                                  int64
feathers                              int64
eggs                                  int64
milk                                  int64
airborne                              int64
aquatic                               int64
predator                              int64
toothed                               int64
backbone                              int64
breathes                              int64
venomous                              int64
fins                                  int64
legs                                  int64
tail                                  int64
domestic                              int64
catsize                               int64
class_type                            int64
Class_Number                          int64
Number_Of_Animal_Species_In_Class     int64
Class_Type                           object
Animal_Names                         object
dtype: object

I will start by making a list of the column names that are usable for the machine learning techniques in this project. We want to use the columns that contain the characteristics of the animal, which are the items at positions 1 through 16 of df.columns (the 2nd through 17th columns). Here I am using the slice [1:17], which excludes position 17, to obtain this list, naming it numcols.

df.columns
Index(['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne',
       'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous',
       'fins', 'legs', 'tail', 'domestic', 'catsize', 'class_type',
       'Class_Number', 'Number_Of_Animal_Species_In_Class', 'Class_Type',
       'Animal_Names'],
      dtype='object')
numcols = df.columns[1:17]
numcols
Index(['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator',
       'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail',
       'domestic', 'catsize'],
      dtype='object')

K-Means Clustering

I am using the standard process of importing, instantiating, fitting, and predicting for K-Means Clustering. Here I both fit and predict on df[numcols], making a new column df["pred"] for the predicted values. My goal for this section is to have the clusters match the class number for the animals as closely as possible.

# import
from sklearn.cluster import KMeans
# instantiate
kmeans = KMeans(n_clusters=7)
# fit
kmeans.fit(df[numcols])
KMeans(n_clusters=7)
# predict
df["pred"] = kmeans.predict(df[numcols])
df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred
0 aardvark 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
1 antelope 1 0 0 1 0 0 0 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
2 bass 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
3 bear 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
4 boar 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 2 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 4
97 wasp 1 0 1 0 1 0 0 0 0 ... 6 0 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth... 3
98 wolf 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 5
100 wren 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1

101 rows × 23 columns

Now we can try to get the “pred” and “class_type” columns to match as best we can. The best way I could come up with to do this is to use value_counts() for both columns, and match them up this way.

df["pred"].value_counts()
0    31
1    20
2    19
3    12
6     8
4     7
5     4
Name: pred, dtype: int64
df["class_type"].value_counts()
0    41
1    20
3    13
6    10
5     8
2     5
4     4
Name: class_type, dtype: int64

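I will be matching these values up by creating a dictionary class_dict that pairs each cluster number with a class number, going through both lists of counts in descending order (largest cluster with largest class, and so on).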

class_dict = {0:0, 1:1, 2:3, 3:6, 6:5, 4:2, 5:4}
class_dict
{0: 0, 1: 1, 2: 3, 3: 6, 6: 5, 4: 2, 5: 4}
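Matching by size alone can pair a cluster with a class it barely overlaps. One alternative, which is not what this project does, is to count how often each cluster co-occurs with each class and pick the one-to-one pairing with the most agreement. The sketch below does this on made-up labels using SciPy's linear_sum_assignment; all names and values are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Toy labels: the clusters are the true classes under a permutation, with one mistake.
true_class = np.array([0, 0, 0, 1, 1, 2, 2, 2])
cluster    = np.array([2, 2, 2, 0, 0, 1, 1, 0])

# Rows are clusters, columns are classes, entries are co-occurrence counts.
table = pd.crosstab(cluster, true_class)

# Choose the one-to-one pairing that maximizes the total count on matched pairs.
rows, cols = linear_sum_assignment(table.to_numpy(), maximize=True)
mapping = {int(table.index[r]): int(table.columns[c]) for r, c in zip(rows, cols)}

relabeled = pd.Series(cluster).map(mapping)
print(mapping)
print((relabeled == true_class).sum())
```

The resulting dictionary has the same form as class_dict and could be passed to .map in the same way.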

Testing out some of these values, we can see that the rows corresponding to cluster 1 appear to be birds.

df.loc[(df["pred"] == 1),:]
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred
11 chicken 0 1 1 0 1 0 0 0 1 ... 2 1 1 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
16 crow 0 1 1 0 1 0 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
20 dove 0 1 1 0 1 0 0 0 1 ... 2 1 1 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
21 duck 0 1 1 0 1 1 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
23 flamingo 0 1 1 0 1 0 0 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
33 gull 0 1 1 0 1 1 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
37 hawk 0 1 1 0 1 0 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
41 kiwi 0 1 1 0 0 0 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
43 lark 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
56 ostrich 0 1 1 0 0 0 0 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
57 parakeet 0 1 1 0 1 0 0 0 1 ... 2 1 1 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
58 penguin 0 1 1 0 0 1 1 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
59 pheasant 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
71 rhea 0 1 1 0 0 0 1 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
78 skimmer 0 1 1 0 1 1 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
79 skua 0 1 1 0 1 1 1 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
83 sparrow 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
87 swan 0 1 1 0 1 1 0 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
95 vulture 0 1 1 0 1 0 1 0 1 ... 2 1 0 1 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1
100 wren 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1

20 rows × 23 columns

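Similarly, the rows corresponding to cluster 2 appear to be mostly fish, although the cluster also contains three mammals (dolphin, porpoise, seal) and three reptiles (pitviper, seasnake, slowworm).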

df.loc[(df["pred"] == 2),:]
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred
2 bass 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
7 carp 0 0 1 0 0 1 0 1 1 ... 0 1 1 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
8 catfish 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
12 chub 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
18 dogfish 0 0 1 0 0 1 1 1 1 ... 0 1 0 1 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
19 dolphin 0 0 0 1 0 1 1 1 1 ... 0 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2
34 haddock 0 0 1 0 0 1 0 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
38 herring 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
60 pike 0 0 1 0 0 1 1 1 1 ... 0 1 0 1 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
61 piranha 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
62 pitviper 0 0 1 0 0 0 1 1 1 ... 0 1 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 2
66 porpoise 0 0 0 1 0 1 1 1 1 ... 0 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2
73 seahorse 0 0 1 0 0 1 0 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
74 seal 1 0 0 1 0 1 1 1 1 ... 0 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2
76 seasnake 0 0 0 0 0 1 1 1 1 ... 0 1 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 2
80 slowworm 0 0 1 0 0 0 1 1 1 ... 0 1 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 2
82 sole 0 0 1 0 0 1 0 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
86 stingray 0 0 1 0 0 1 1 1 1 ... 0 1 0 1 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2
92 tuna 0 0 1 0 0 1 1 1 1 ... 0 1 0 1 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 2

19 rows × 23 columns

Now we can use .map to apply class_dict to the entire column df["pred"].

df["pred"] = df["pred"].map(class_dict)
df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred
0 aardvark 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
1 antelope 1 0 0 1 0 0 0 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
2 bass 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 3
3 bear 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
4 boar 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 2 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2
97 wasp 1 0 1 0 1 0 0 0 0 ... 6 0 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth... 6
98 wolf 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4
100 wren 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1

101 rows × 23 columns

Now that the cluster numbers are somewhat matched up with the class types, we can score these “predictions”. Here we can see that the cluster label matches the class type for 68 of the 101 animals, which is not a great result. We can make graphs using Altair anyway to visualize.

(df["pred"] == df["class_type"]).sum()
68
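A count like this depends on how cluster numbers were paired with class numbers. scikit-learn also provides adjusted_rand_score, which compares two groupings without needing any relabeling. A minimal sketch on made-up labels:

```python
from sklearn.metrics import adjusted_rand_score

true_class = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_a  = [2, 2, 2, 0, 0, 1, 1, 1]  # same grouping, different label names
cluster_b  = [0, 0, 1, 1, 1, 2, 2, 0]  # a genuinely different grouping

print(adjusted_rand_score(true_class, cluster_a))
print(adjusted_rand_score(true_class, cluster_b))
```

A score of 1 means the two groupings are identical up to renaming, and values near 0 mean the agreement is about what random labels would give.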
import altair as alt

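I encountered an issue below with the graph. The class types are on the y-axis, with “legs” on the x-axis since it is the characteristic with the largest number of distinct values. The problem is that animals of the same class type do not always end up in the same cluster: for example, the points (0,6) and (8,6) are of the same class type, but an animal at (0,6) can be closer to the animals at (0,3) than to the one at (8,6), and so it gets clustered with them. K-Means only measures distance between the 16 characteristic columns, and “legs” ranges from 0 to 8 while the other columns are 0 or 1, so a difference in leg count can outweigh several of the other characteristics.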

alt.Chart(df).mark_circle().encode(
    x="legs",
    y="class_type",
    color="pred:N",
    tooltip=["animal_name", "class_type", "pred"]
)
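Part of the distance problem described before this chart comes from “legs” being on a larger scale than the 0/1 columns. One common remedy, not tried in this project, is to rescale every column to the same range before clustering. The sketch below uses scikit-learn's MinMaxScaler on a made-up table and compares one distance before and after scaling.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy feature table (illustrative values): two 0/1 traits and a leg count.
toy = pd.DataFrame({
    "hair": [1, 1, 0, 0],
    "eggs": [0, 0, 1, 1],
    "legs": [4, 2, 0, 8],
})

scaled = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

def dist(frame, i, j):
    """Euclidean distance between rows i and j of a DataFrame."""
    return float(np.linalg.norm(frame.iloc[i].to_numpy() - frame.iloc[j].to_numpy()))

# Rows 2 and 3 differ only in legs (0 versus 8).
print(dist(toy, 2, 3), dist(scaled, 2, 3))
```

The scaled table could be passed to KMeans in place of df[numcols]; whether that improves the match with class_type on the zoo data is not tested here.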
len(numcols)
16
numcols[0]
'hair'
numcols[1]
'feathers'

This is a similar graph using “hair” and “feathers”. Here I encountered my second issue: there are only three distinct combinations of these two characteristics, (1,0), (0,0), and (0,1). This means that although there are 101 points, they are all stacked on top of each other at these three locations.

alt.Chart(df).mark_circle().encode(
    x=numcols[0],
    y=numcols[1],
    color="pred:N",
    tooltip=["animal_name", "class_type", "pred"]
)
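When points stack like this, counting the rows at each coordinate shows what the scatter plot hides. A small sketch with pandas on made-up data (the column name n_animals is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "hair":     [1, 1, 0, 0, 0, 0],
    "feathers": [0, 0, 0, 1, 1, 1],
})

# Number of rows that land on each (hair, feathers) coordinate.
stack_sizes = toy.groupby(["hair", "feathers"]).size().reset_index(name="n_animals")
print(stack_sizes)
```

In Altair, a similar effect can be had by encoding the point size with a count aggregate such as size="count()", so that larger circles stand for more animals.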

Below I attempted to make a list of Altair charts, using “class_type” as the y-axis for all of them and all 16 characteristic columns of df for the x-axis. Using tooltip= allows for information about each point to be displayed when the mouse is hovered over it.

chart_list = []
for col in numcols:
    chart = alt.Chart(df).mark_circle().encode(
        x=col,
        y="class_type",
        color="pred:N",
        tooltip=["animal_name", "class_type", "pred"]
    )
    chart_list.append(chart)
chart_list[0]
chart_list[1]

The last problem I encountered with these graphs is that when I evaluate chart_list, the charts do not show up; only the text representation of the list is displayed.

chart_list
[alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...)]

Logistic Regression

For the logistic regression section of this project, I will use numcols as the inputs and try to classify Class_Type.

df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... legs tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred
0 aardvark 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
1 antelope 1 0 0 1 0 0 0 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
2 bass 0 0 1 0 0 1 1 1 1 ... 0 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 3
3 bear 1 0 0 1 0 0 1 1 1 ... 4 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
4 boar 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 2 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2
97 wasp 1 0 1 0 1 0 0 0 0 ... 6 0 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth... 6
98 wolf 1 0 0 1 0 0 1 1 1 ... 4 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4
100 wren 0 1 1 0 1 0 0 0 1 ... 2 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1

101 rows × 23 columns

#import
from sklearn.linear_model import LogisticRegression
# instantiate
clf = LogisticRegression()
# fit
clf.fit(df[numcols], df["Class_Type"])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()
# predict
df["class_pred"] = clf.predict(df[numcols])
df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred class_pred
0 aardvark 1 0 0 1 0 0 1 1 1 ... 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal
1 antelope 1 0 0 1 0 0 0 1 1 ... 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal
2 bass 0 0 1 0 0 1 1 1 1 ... 1 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 3 Fish
3 bear 1 0 0 1 0 0 1 1 1 ... 0 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal
4 boar 1 0 0 1 0 0 1 1 1 ... 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2 Mammal
97 wasp 1 0 1 0 1 0 0 0 0 ... 0 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth... 6 Bug
98 wolf 1 0 0 1 0 0 1 1 1 ... 1 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate
100 wren 0 1 1 0 1 0 0 0 1 ... 1 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1 Bird

101 rows × 24 columns

(df["Class_Type"] == df["class_pred"]).sum()
100
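The ConvergenceWarning printed during fitting says the solver stopped at its iteration limit, which defaults to 100 for LogisticRegression. As the warning suggests, one option is to raise max_iter. The sketch below does so on synthetic data from make_classification, which merely stands in for df[numcols] and the class labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for df[numcols] and the class labels.
X, y = make_classification(n_samples=100, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# n_iter_ records how many iterations the solver actually used.
print(clf.n_iter_.max(), "iterations used out of", clf.max_iter)
```

If the number of iterations used is below max_iter, the solver stopped because it converged rather than because it ran out of iterations.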

This may be an example of overfitting the data: 100/101 rows in the dataframe were predicted correctly, but the model is being scored on the same rows it was fit on, so this number says little about how it would do on new animals. Below I wanted to see which one row was predicted incorrectly. The only animal that was predicted incorrectly was “tortoise,” which was predicted as a bird instead of a reptile. My guess is that this is because it lays eggs, which could have confused the model.

df.loc[(df["Class_Type"] != df["class_pred"]),:]
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... tail domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred class_pred
90 tortoise 0 0 1 0 0 0 0 0 1 ... 1 0 1 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 5 Bird

1 rows × 24 columns

One thing we can do to check for overfitting is to split the data into a training set and a testing set. I am using train_size=0.8, so 80 rows are used for fitting and 21 are held out for testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[numcols],
    df["Class_Type"],
    train_size=0.8
)
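The split above is random, so it changes on every run, and small classes may end up entirely in one half. train_test_split accepts random_state to make the split repeatable and stratify to keep class proportions similar in both parts. A sketch on made-up labels, using train_size=0.5 only so that the toy counts divide evenly:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels with one rare class, loosely mirroring the imbalance in Class_Type.
toy = pd.DataFrame({
    "legs":  [4] * 12 + [2] * 6 + [0] * 2,
    "label": ["Mammal"] * 12 + ["Bird"] * 6 + ["Amphibian"] * 2,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["legs"]],
    toy["label"],
    train_size=0.5,
    random_state=0,         # same split on every run
    stratify=toy["label"],  # keep class proportions equal in both halves
)

print(y_te.value_counts().to_dict())
```

With stratification, each half of the toy data receives exactly one of the two rare "Amphibian" rows.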
X_train
hair feathers eggs milk airborne aquatic predator toothed backbone breathes venomous fins legs tail domestic catsize
48 1 0 0 1 0 1 1 1 1 1 0 0 4 1 0 1
73 0 0 1 0 0 1 0 1 1 0 0 1 0 1 0 0
33 0 1 1 0 1 1 1 0 1 1 0 0 2 1 0 0
7 0 0 1 0 0 1 0 1 1 0 0 1 0 1 1 0
69 1 0 0 1 0 0 1 1 1 1 0 0 4 1 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
57 0 1 1 0 1 0 0 0 1 1 0 0 2 1 1 0
87 0 1 1 0 1 1 0 0 1 1 0 0 2 1 0 1
62 0 0 1 0 0 0 1 1 1 1 1 0 0 1 0 0
66 0 0 0 1 0 1 1 1 1 1 0 1 0 1 0 1
74 1 0 0 1 0 1 1 1 1 1 0 1 0 0 0 1

80 rows × 16 columns

y_train
48     Mammal
73       Fish
33       Bird
7        Fish
69     Mammal
       ...   
57       Bird
87       Bird
62    Reptile
66     Mammal
74     Mammal
Name: Class_Type, Length: 80, dtype: object
# instantiate
clf2 = LogisticRegression()
# fit
clf2.fit(X_train, y_train)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()

Although it isn’t common practice, we can predict again on the training set: the model classifies every training row correctly, which is consistent with overfitting but does not confirm it on its own.

# predict
(clf2.predict(X_train) == y_train).sum()/len(y_train)
1.0

We proceed by predicting on the unseen testing data.

clf2.predict(X_test)
array(['Mammal', 'Mammal', 'Bird', 'Bug', 'Fish', 'Mammal', 'Mammal',
       'Bug', 'Mammal', 'Mammal', 'Bird', 'Mammal', 'Mammal', 'Bug',
       'Bird', 'Invertebrate', 'Mammal', 'Fish', 'Mammal', 'Mammal',
       'Mammal'], dtype=object)

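I will now calculate the score both manually and by using clf2.score. This is a pretty good score; 95.2% of the test rows (20 of 21) were predicted correctly. While there may still be some overfitting, the gap between the training score (100%) and the testing score is small. With only 21 test rows, each row is worth almost 5 percentage points, so this score would change noticeably with a different random split.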

(clf2.predict(X_test) == y_test).sum()/len(y_test)
0.9523809523809523
# same as:
clf2.score(X_test, y_test)
0.9523809523809523
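Because a single small test set gives a noisy score, one common complement is cross-validation, which repeats the fit-and-score step over several splits. The sketch below applies scikit-learn's cross_val_score to synthetic data that stands in for df[numcols] and df["Class_Type"]; the numbers it prints say nothing about the zoo data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for df[numcols] and df["Class_Type"].
X, y = make_classification(n_samples=100, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# One accuracy per fold; the spread hints at how much a single split can vary.
print(scores.mean(), scores.std())
```

On the zoo data the smallest class (Amphibian) has only 4 rows, so a stratified split into more than 4 folds cannot place an amphibian in every fold.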

Decision Tree

The last section from the course material is decision trees. My goal here is to see which characteristics are the most “important” in determining classification. We will try first without using test or train sets for comparison.

# import
from sklearn.tree import DecisionTreeClassifier
# instantiate
clf_tree = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10)
# fit
clf_tree.fit(df[numcols], df["Class_Type"])
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10)
from sklearn import tree
import matplotlib.pyplot as plt
# plot
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf_tree, 
                   feature_names=clf_tree.feature_names_in_,  
                   class_names=clf_tree.classes_,
                   filled=True)
[Figure: plot of the decision tree clf_tree, fit on all 101 rows]

This diagram shows us the most important characteristics for each class type. For mammals it’s milk, for birds it’s feathers, etc. It seems like there may be a bit of overfitting since “mammal”, “bird”, and “fish” all have 100% probabilities; however, the rest do not.

Now we can do the same thing with X_train and y_train to compare the two.

# instantiate
clf_tree2 = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=12)
# fit
clf_tree2.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=4, max_leaf_nodes=12)
# plot
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf_tree2, 
                   feature_names=clf_tree2.feature_names_in_,  
                   class_names=clf_tree2.classes_,
                   filled=True)
[Figure: plot of the decision tree clf_tree2, fit on the training set]

The two plots are identical except for the bottom two levels; in the first one the split is on “backbone,” while in the second one it is on “airborne”. Now we can use score to compare the two.

clf_tree.score(df[numcols], df["Class_Type"])
0.8811881188118812
clf_tree2.score(X_train, y_train)
0.8875

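Both of these scores are measured on the data each tree was fit on (all 101 rows for clf_tree, the 80 training rows for clf_tree2), so they are training accuracies. The tree fit on the training set does slightly better, but it isn’t too big of a difference. Overall, the training score for the decision tree was lower than that of logistic regression; scoring clf_tree2 on X_test would be needed for a direct comparison with the 95.2% test score above.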
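The goal stated at the start of this section, finding the most “important” characteristics, can also be read from a fitted tree's feature_importances_ attribute, and the held-out rows give a score that is comparable across models. The sketch below shows both on synthetic data; the column names f0 to f5 are placeholders, not zoo features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the zoo features; the column names f0..f5 are placeholders.
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(6)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

tree_clf = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=12, random_state=0)
tree_clf.fit(X_tr, y_tr)

# Accuracy on rows the tree has seen versus rows it has not.
print(tree_clf.score(X_tr, y_tr), tree_clf.score(X_te, y_te))

# Impurity-based importances, one per column, summing to 1.
importances = pd.Series(tree_clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

Columns the tree never splits on receive an importance of 0.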

K-Nearest Neighbors

# import
from sklearn.neighbors import KNeighborsClassifier
# instantiate
clf_nb = KNeighborsClassifier(n_neighbors=10)

Since I specified n_neighbors=10 during the instantiating step, the classifier will find the 10 nearest training points to each new data point based on distance and predict the most common class among them. This use of distance is similar to K-Means clustering; however, K-Nearest Neighbors is supervised, so we will be trying to predict “Class_Type” once again.

clf_nb.fit(df[numcols], df["Class_Type"])
KNeighborsClassifier(n_neighbors=10)
df["pred_nb"] = clf_nb.predict(df[numcols])
df
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred class_pred pred_nb
0 aardvark 1 0 0 1 0 0 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal Mammal
1 antelope 1 0 0 1 0 0 0 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal Mammal
2 bass 0 0 1 0 0 1 1 1 1 ... 0 0 3 3 13 Fish bass, carp, catfish, chub, dogfish, haddock, h... 3 Fish Fish
3 bear 1 0 0 1 0 0 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal Mammal
4 boar 1 0 0 1 0 0 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal Mammal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
96 wallaby 1 0 0 1 0 0 0 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 2 Mammal Mammal
97 wasp 1 0 1 0 1 0 0 0 0 ... 0 0 5 5 8 Bug flea, gnat, honeybee, housefly, ladybird, moth... 6 Bug Bug
98 wolf 1 0 0 1 0 0 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 0 Mammal Mammal
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate Fish
100 wren 0 1 1 0 1 0 0 0 1 ... 0 0 1 1 20 Bird chicken, crow, dove, duck, flamingo, gull, haw... 1 Bird Bird

101 rows × 25 columns

After adding a new column to df with the prediction from K-Nearest Neighbors, we can use score to compare this to the other methods.

clf_nb.score(df[numcols], df["Class_Type"])
0.8316831683168316

This score is not the best, but it’s not bad either. Here I am taking a closer look at exactly which rows were difficult for the model to predict. It seems like the invertebrates and reptiles were especially challenging to predict using this method: 9 of the 10 invertebrates and all 5 reptiles were misclassified, along with 3 aquatic mammals.

df.loc[(df["Class_Type"] != df["pred_nb"]), :]
animal_name hair feathers eggs milk airborne aquatic predator toothed backbone ... domestic catsize class_type Class_Number Number_Of_Animal_Species_In_Class Class_Type Animal_Names pred class_pred pred_nb
13 clam 0 0 1 0 0 0 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate Fish
14 crab 0 0 1 0 0 1 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 5 Invertebrate Amphibian
15 crayfish 0 0 1 0 0 1 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 6 Invertebrate Bug
19 dolphin 0 0 0 1 0 1 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 3 Mammal Fish
46 lobster 0 0 1 0 0 1 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 6 Invertebrate Bug
53 octopus 0 0 1 0 0 1 1 0 0 ... 0 1 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 6 Invertebrate Bug
62 pitviper 0 0 1 0 0 0 1 1 1 ... 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 3 Reptile Fish
66 porpoise 0 0 0 1 0 1 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 3 Mammal Fish
72 scorpion 0 0 0 0 0 0 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 6 Invertebrate Bug
74 seal 1 0 0 1 0 1 1 1 1 ... 0 1 0 0 41 Mammal aardvark, antelope, bear, boar, buffalo, calf,... 3 Mammal Fish
76 seasnake 0 0 0 0 0 1 1 1 1 ... 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 3 Reptile Fish
77 seawasp 0 0 1 0 0 1 1 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate Fish
80 slowworm 0 0 1 0 0 0 1 1 1 ... 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 3 Reptile Fish
81 slug 0 0 1 0 0 0 0 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate Fish
90 tortoise 0 0 1 0 0 0 0 0 1 ... 0 1 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 5 Bird Mammal
91 tuatara 0 0 1 0 0 0 1 1 1 ... 0 0 2 2 5 Reptile pitviper, seasnake, slowworm, tortoise, tuatara 5 Reptile Amphibian
99 worm 0 0 1 0 0 0 0 0 0 ... 0 0 6 6 10 Invertebrate clam, crab, crayfish, lobster, octopus, scorpi... 4 Invertebrate Fish

17 rows × 25 columns

Now we can add in our X_train, y_train to fit and X_test, y_test to score.

clf_nb2 = KNeighborsClassifier(n_neighbors=10)
clf_nb2.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
clf_nb2.score(X_train, y_train)
0.8
clf_nb2.score(X_test, y_test)
0.9047619047619048

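I find it a bit strange that the testing set had a much better score than the training set. With only 21 test rows, this may simply reflect which animals happened to land in the test set. The test score (90.5%) is lower than the logistic regression test score (95.2%); the decision tree was only scored on the data it was fit on, so it is not directly comparable.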
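The choice of n_neighbors=10 may matter here: some classes are smaller than 10 (Amphibian has 4 rows and Reptile has 5), so for those animals most of the 10 nearest neighbors necessarily come from other classes. One way to explore this, not done in the project, is to compare several values of k with cross-validation. The sketch below uses synthetic data in place of the zoo features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for df[numcols] and df["Class_Type"].
X, y = make_classification(n_samples=100, n_features=16, n_informative=8,
                           n_classes=3, random_state=0)

mean_accuracy = {}
for k in (1, 3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    # Average accuracy over 5 folds for this value of k.
    mean_accuracy[k] = cross_val_score(knn, X, y, cv=5).mean()

print(mean_accuracy)
```

The value of k with the highest mean accuracy on the real data would be a reasonable candidate to use in place of 10.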

Summary

In summary, I used K-Means clustering, logistic regression, decision tree classifiers, and K-Nearest Neighbors to attempt to classify zoo animals. Scored on the full dataset, logistic regression was best (with a possibility of overfitting), followed by the decision tree classifier, and then K-Nearest Neighbors. K-Means clustering was not well suited to classification in this project, but it was still a useful tool for visualizing the data.

References

  • What is the source of your dataset(s)?

kaggle.com

  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

I used this article from analyticsvidhya.com to come up with the K-Nearest Neighbors portion of my project.
