Zoo Animal Classification

Author: Alisa Crowe

Course Project, UC Irvine, Math 10, S22

Introduction

In this project, I will be examining two datasets that describe 101 zoo animals and various characteristics that they have. I will be using scikit-learn’s K-Means Clustering, Logistic Regression, Decision Tree Classifier, and K-Nearest Neighbors to attempt to classify the animals, discussing any results and challenges I come across. I will then compare these methods to conclude which one classified the animals best.

Main portion of the project

Data Cleaning and Merging Dataframes

import pandas as pd

df_zoo = pd.read_csv("zoo.csv")
df_zoo

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	breathes	venomous	fins	legs	tail	domestic	catsize	class_type
0	aardvark	1	0	0	1	0	0	1	1	1	1	0	0	4	0	0	1	1
1	antelope	1	0	0	1	0	0	0	1	1	1	0	0	4	1	0	1	1
2	bass	0	0	1	0	0	1	1	1	1	0	0	1	0	1	0	0	4
3	bear	1	0	0	1	0	0	1	1	1	1	0	0	4	0	0	1	1
4	boar	1	0	0	1	0	0	1	1	1	1	0	0	4	1	0	1	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	1	0	0	2	1	0	1	1
97	wasp	1	0	1	0	1	0	0	0	0	1	1	0	6	0	0	0	6
98	wolf	1	0	0	1	0	0	1	1	1	1	0	0	4	1	0	1	1
99	worm	0	0	1	0	0	0	0	0	0	1	0	0	0	0	0	0	7
100	wren	0	1	1	0	1	0	0	0	1	1	0	0	2	1	0	0	2

101 rows × 18 columns

Here we introduce our first dataframe df_zoo. To ensure that there are no repeats of species, we can check value_counts() of the column ‘animal_name’.

df_zoo["animal_name"].value_counts()

frog        2
newt        1
aardvark    1
cavy        1
termite     1
           ..
crab        1
stingray    1
seahorse    1
bear        1
ladybird    1
Name: animal_name, Length: 100, dtype: int64

We can see that there are two rows for frogs in this dataframe, but only one of every other species.

df_zoo[df_zoo.loc[:,"animal_name"] == "frog"]

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	breathes	venomous	fins	legs	tail	domestic	catsize	class_type
25	frog	0	0	1	0	0	1	1	1	1	1	0	0	4	0	0	0	4
26	frog	0	0	1	0	0	1	1	1	1	1	1	0	4	0	0	0	4

Looking at the ‘venomous’ column, it is clear that these two rows are not duplicates; we can leave both of them in.
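The same check can be done programmatically rather than by eye. The following is a minimal sketch on a small hypothetical frame (toy_zoo is an assumption for illustration, not the real zoo.csv); it separates rows that merely share a name from rows that are identical in every column.

```python
import pandas as pd

# Hypothetical miniature of df_zoo; the real data comes from zoo.csv.
toy_zoo = pd.DataFrame({
    "animal_name": ["frog", "frog", "newt"],
    "aquatic": [1, 1, 1],
    "venomous": [0, 1, 0],
})

# Rows whose name appears more than once.
repeated = toy_zoo[toy_zoo["animal_name"].duplicated(keep=False)]

# Rows that are identical in every column (true duplicates).
exact_duplicates = toy_zoo[toy_zoo.duplicated(keep=False)]

# Columns in which the repeated rows disagree.
differing_cols = [c for c in repeated.columns if repeated[c].nunique() > 1]

print(len(repeated), len(exact_duplicates), differing_cols)
```

If exact_duplicates is empty while repeated is not, the repeated names describe different animals and can both stay.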

df_class = pd.read_csv("class.csv")
df_class

	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names
0	1	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
1	2	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...
2	3	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara
3	4	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...
4	5	4	Amphibian	frog, frog, newt, toad
5	6	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...
6	7	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...

Since I plan on merging these two dataframes together, I want to ensure that the number of animals in each class of df_zoo["class_type"] matches the values in df_class["Number_Of_Animal_Species_In_Class"].

df_zoo["class_type"].value_counts()

1    41
2    20
4    13
7    10
6     8
3     5
5     4
Name: class_type, dtype: int64

To make things easier to read, I am subtracting one from every value in both df_zoo["class_type"] and df_class["Class_Number"], so that the class numbers match the index of df_class and start at 0 instead of 1. I am doing this with apply and a lambda function so that it applies to the entire column; a lambda function is appropriate here because the function is very simple.

df_zoo["class_type"] = df_zoo["class_type"].apply(lambda x: x-1)

df_class["Class_Number"] = df_class["Class_Number"].apply(lambda x: x-1)
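As a side note, pandas also supports arithmetic on a whole column at once, which gives the same result as apply with a lambda here. A minimal sketch, using a hypothetical Series in place of df_zoo["class_type"]:

```python
import pandas as pd

# Hypothetical stand-in for df_zoo["class_type"].
class_type = pd.Series([1, 4, 7, 2], name="class_type")

shifted_apply = class_type.apply(lambda x: x - 1)  # one element at a time
shifted_vector = class_type - 1                    # whole column at once

print(shifted_vector.tolist())
```

Both versions produce the same column; the vectorized form is shorter and usually faster on large data, while the lambda form generalizes to more complicated functions.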

Now to clean the data, I am using .isna().any().any() to see if there are any missing values in either dataframe. In this case, there are none.

# data cleaning
df_zoo.isna().any().any()

False

df_class.isna().any().any()

False

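In order to make one dataframe, I am going to merge the two dataframes together with .merge. The class number is stored under a different column name in each dataframe (class_type in df_zoo and Class_Number in df_class), so I pass these names to left_on and right_on. We can see that the resulting dataframe df has 101 rows, one for every row of df_zoo; this is because I used how="left" and each class number appears only once in df_class.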

df = df_zoo.merge(df_class, how="left", left_on="class_type", right_on="Class_Number")
df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	fins	legs	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names
0	aardvark	1	0	0	1	0	0	1	1	1	...	0	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
1	antelope	1	0	0	1	0	0	0	1	1	...	0	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
2	bass	0	0	1	0	0	1	1	1	1	...	1	0	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...
3	bear	1	0	0	1	0	0	1	1	1	...	0	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
4	boar	1	0	0	1	0	0	1	1	1	...	0	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	0	2	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
97	wasp	1	0	1	0	1	0	0	0	0	...	0	6	0	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...
98	wolf	1	0	0	1	0	0	1	1	1	...	0	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	0	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...
100	wren	0	1	1	0	1	0	0	0	1	...	0	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...

101 rows × 22 columns
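The row count of a left merge can be guarded with the validate argument of merge. This is a minimal sketch on hypothetical miniature frames (toy_zoo and toy_class are assumptions, not the real files); validate="many_to_one" asks pandas to raise an error if a key is repeated in the right-hand frame, which is the situation that would add extra rows.

```python
import pandas as pd

# Hypothetical miniatures of df_zoo and df_class.
toy_zoo = pd.DataFrame({"animal_name": ["bass", "bear", "wren"],
                        "class_type": [3, 0, 1]})
toy_class = pd.DataFrame({"Class_Number": [0, 1, 3],
                          "Class_Type": ["Mammal", "Bird", "Fish"]})

merged = toy_zoo.merge(
    toy_class,
    how="left",
    left_on="class_type",
    right_on="Class_Number",
    validate="many_to_one",  # raises MergeError if Class_Number repeats
)

# A left merge keeps one row per row of the left frame.
print(merged.shape)
print(merged["Class_Type"].tolist())
```

With how="left", any class number missing from the right-hand frame would show up as NaN in the merged columns, so checking isna() again after merging is a cheap extra safeguard.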

Now that we are working with only one dataframe, we can check the dtypes to see which columns can be used for the various machine learning techniques we are going to be using. Since all of the characteristic data (e.g. “hair”, “eggs”) is of type int64, these can be used.

df.dtypes

animal_name                          object
hair                                  int64
feathers                              int64
eggs                                  int64
milk                                  int64
airborne                              int64
aquatic                               int64
predator                              int64
toothed                               int64
backbone                              int64
breathes                              int64
venomous                              int64
fins                                  int64
legs                                  int64
tail                                  int64
domestic                              int64
catsize                               int64
class_type                            int64
Class_Number                          int64
Number_Of_Animal_Species_In_Class     int64
Class_Type                           object
Animal_Names                         object
dtype: object

I will start by making a list of the column names that are usable for the machine learning techniques in this project. We want the columns that contain the characteristics of the animals, which sit at positions 1 through 16 of df.columns (position 0 is “animal_name” and position 17 is “class_type”). Here I am using the slice [1:17] on df.columns, which excludes the endpoint 17, to obtain this list, naming it numcols.

df.columns

Index(['animal_name', 'hair', 'feathers', 'eggs', 'milk', 'airborne',
       'aquatic', 'predator', 'toothed', 'backbone', 'breathes', 'venomous',
       'fins', 'legs', 'tail', 'domestic', 'catsize', 'class_type',
       'Class_Number', 'Number_Of_Animal_Species_In_Class', 'Class_Type',
       'Animal_Names'],
      dtype='object')

numcols = df.columns[1:17]
numcols

Index(['hair', 'feathers', 'eggs', 'milk', 'airborne', 'aquatic', 'predator',
       'toothed', 'backbone', 'breathes', 'venomous', 'fins', 'legs', 'tail',
       'domestic', 'catsize'],
      dtype='object')

K-Means Clustering

I am using the standard process of importing, instantiating, fitting, and predicting for K-Means Clustering. Here I both fit and predict on df[numcols], making a new column df["pred"] for the predicted values. My goal for this section is to have the clusters match the class number for the animals as closely as possible.

# import
from sklearn.cluster import KMeans

# instantiate
kmeans = KMeans(n_clusters=7)

# fit
kmeans.fit(df[numcols])

KMeans(n_clusters=7)

# predict
df["pred"] = kmeans.predict(df[numcols])

df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	legs	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred
0	aardvark	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
1	antelope	1	0	0	1	0	0	0	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
2	bass	0	0	1	0	0	1	1	1	1	...	0	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
3	bear	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
4	boar	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	2	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	4
97	wasp	1	0	1	0	1	0	0	0	0	...	6	0	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...	3
98	wolf	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	5
100	wren	0	1	1	0	1	0	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1

101 rows × 23 columns

Now we can try to get the “pred” and “class_type” columns to match as best we can. The best way I could come up with to do this is to use value_counts() for both columns, and match them up this way.

df["pred"].value_counts()

0    31
1    20
2    19
3    12
6     8
4     7
5     4
Name: pred, dtype: int64

df["class_type"].value_counts()

0    41
1    20
3    13
6    10
5     8
2     5
4     4
Name: class_type, dtype: int64

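I will be matching these values up by creating a dictionary class_dict that pairs the largest cluster with the largest class, the second largest cluster with the second largest class, and so on in descending order of count.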

class_dict = {0:0, 1:1, 2:3, 3:6, 6:5, 4:2, 5:4}
class_dict

{0: 0, 1: 1, 2: 3, 3: 6, 6: 5, 4: 2, 5: 4}
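Matching by size alone can pair a cluster with a class it barely overlaps. One alternative, shown here only as a sketch and not as the method used in this project, is to count how many animals each cluster shares with each class and pick the one-to-one pairing with the largest total overlap. The example assumes SciPy is installed and uses hypothetical labels (toy, toy_dict) rather than the real df.

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment

# Hypothetical labels: true classes and arbitrary cluster numbers.
toy = pd.DataFrame({
    "class_type": [0, 0, 0, 1, 1, 2, 2, 2],
    "pred":       [2, 2, 2, 0, 0, 1, 1, 0],
})

# Rows are clusters, columns are classes, entries count shared animals.
table = pd.crosstab(toy["pred"], toy["class_type"])

# Choose the one-to-one pairing that maximizes the total overlap.
row_idx, col_idx = linear_sum_assignment(table.to_numpy(), maximize=True)
toy_dict = {int(table.index[r]): int(table.columns[c])
            for r, c in zip(row_idx, col_idx)}

matched = (toy["pred"].map(toy_dict) == toy["class_type"]).sum()
print(toy_dict, matched)
```

The resulting dictionary can be passed to .map in the same way as class_dict. Because K-Means numbers its clusters arbitrarily on each run, building the dictionary from the data also avoids retyping it by hand after refitting.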

Testing out some of these values, we can see that the rows corresponding to cluster 1 appear to be birds.

df.loc[(df["pred"] == 1),:]

	animal_name	feathers	eggs	airborne	aquatic	predator	backbone	...	legs	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred
11	chicken	1	1	1	0	0	1	...	2	1	1	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
16	crow	1	1	1	0	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
20	dove	1	1	1	0	0	1	...	2	1	1	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
21	duck	1	1	1	1	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
23	flamingo	1	1	1	0	0	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
33	gull	1	1	1	1	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
37	hawk	1	1	1	0	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
41	kiwi	1	1	0	0	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
43	lark	1	1	1	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
56	ostrich	1	1	0	0	0	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
57	parakeet	1	1	1	0	0	1	...	2	1	1	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
58	penguin	1	1	0	1	1	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
59	pheasant	1	1	1	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
71	rhea	1	1	0	0	1	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
78	skimmer	1	1	1	1	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
79	skua	1	1	1	1	1	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
83	sparrow	1	1	1	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
87	swan	1	1	1	1	0	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
95	vulture	1	1	1	0	1	1	...	2	1	0	1	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1
100	wren	1	1	1	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1

20 rows × 23 columns

Similarly, the rows corresponding to cluster 2 are mostly fish (13 of the 19 rows), along with three aquatic mammals and three reptiles.

df.loc[(df["pred"] == 2),:]

	animal_name	hair	eggs	milk	aquatic	predator	toothed	backbone	...	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred
2	bass	0	1	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
7	carp	0	1	0	1	0	1	1	...	1	1	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
8	catfish	0	1	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
12	chub	0	1	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
18	dogfish	0	1	0	1	1	1	1	...	1	0	1	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
19	dolphin	0	0	1	1	1	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2
34	haddock	0	1	0	1	0	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
38	herring	0	1	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
60	pike	0	1	0	1	1	1	1	...	1	0	1	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
61	piranha	0	1	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
62	pitviper	0	1	0	0	1	1	1	...	1	0	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	2
66	porpoise	0	0	1	1	1	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2
73	seahorse	0	1	0	1	0	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
74	seal	1	0	1	1	1	1	1	...	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2
76	seasnake	0	0	0	1	1	1	1	...	1	0	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	2
80	slowworm	0	1	0	0	1	1	1	...	1	0	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	2
82	sole	0	1	0	1	0	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
86	stingray	0	1	0	1	1	1	1	...	1	0	1	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2
92	tuna	0	1	0	1	1	1	1	...	1	0	1	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	2

19 rows × 23 columns

Now we can use .map to apply class_dict to the entire column df["pred"].

df["pred"] = df["pred"].map(class_dict)

df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	legs	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred
0	aardvark	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
1	antelope	1	0	0	1	0	0	0	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
2	bass	0	0	1	0	0	1	1	1	1	...	0	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	3
3	bear	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
4	boar	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	2	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2
97	wasp	1	0	1	0	1	0	0	0	0	...	6	0	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...	6
98	wolf	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4
100	wren	0	1	1	0	1	0	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1

101 rows × 23 columns

Now that the cluster numbers are somewhat matched up with the class types, we can score these “predictions”. Here we can see that 68 of the 101 rows (about 67%) have a cluster number that matches the class type, which is not a great prediction. We can make graphs using Altair anyway to visualize.

(df["pred"] == df["class_type"]).sum()
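A score that does not depend on how the cluster numbers are matched to class numbers is the adjusted Rand index, available in scikit-learn as adjusted_rand_score. The sketch below uses hypothetical label lists, not the real columns; it shows that the index ignores the numbering and only looks at which animals are grouped together.

```python
from sklearn.metrics import adjusted_rand_score

true_classes = [0, 0, 1, 1, 2, 2]
relabeled    = [2, 2, 0, 0, 1, 1]  # same grouping, different numbers
imperfect    = [2, 2, 0, 1, 1, 1]  # one animal placed in the wrong group

ari_same = adjusted_rand_score(true_classes, relabeled)
ari_off = adjusted_rand_score(true_classes, imperfect)
print(ari_same, ari_off)
```

On the real data this would be called with df["class_type"] and the raw cluster labels, before any mapping through class_dict.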

import altair as alt

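I encountered an issue below with the graph. The class types are on the y-axis, with “legs” on the x-axis since it is the characteristic with the largest number of distinct values, so each point is (legs, class_type). Although the points (0,6) and (8,6) are of the same class type, K-Means never sees the class type; it only measures distance between the characteristic columns, and “legs” has a much larger range than the 0/1 columns. An animal at (0,6) is therefore likely to be closer to the animals at (0,3) than to those at (8,6), and so it gets clustered with (0,3).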

alt.Chart(df).mark_circle().encode(
    x="legs",
    y="class_type",
    color="pred:N",
    tooltip=["animal_name", "class_type", "pred"]
)
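The distance argument above can be checked with a few numbers. This is a minimal sketch with three hypothetical animals described by only three columns (the real rows have 16); it compares Euclidean distances before and after rescaling “legs” to the same 0 to 1 range as the other columns.

```python
import numpy as np

# Hypothetical rows with columns [aquatic, backbone, legs].
legless_invertebrate = np.array([0, 0, 0], dtype=float)
fish = np.array([1, 1, 0], dtype=float)
eight_legged_invertebrate = np.array([0, 0, 8], dtype=float)

def dist(a, b):
    """Euclidean distance, the measure K-Means is built on."""
    return float(np.linalg.norm(a - b))

raw_to_fish = dist(legless_invertebrate, fish)
raw_to_invertebrate = dist(legless_invertebrate, eight_legged_invertebrate)

# Rescale legs from 0-8 to 0-1 so no single column dominates.
scale = np.array([1.0, 1.0, 1.0 / 8.0])
scaled_to_fish = dist(legless_invertebrate * scale, fish * scale)
scaled_to_invertebrate = dist(legless_invertebrate * scale,
                              eight_legged_invertebrate * scale)

print(raw_to_fish, raw_to_invertebrate)
print(scaled_to_fish, scaled_to_invertebrate)
```

In this toy case the legless invertebrate is closer to the fish on the raw scale and closer to the other invertebrate after rescaling. Whether rescaling would improve the clusters on the real data is not something this sketch shows.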

len(numcols)

numcols[0]

'hair'

numcols[1]

'feathers'

This is a similar graph using “hair” and “feathers”. Here I encountered my second issue, being that there are only three distinct combinations of these two characteristics - (1,0), (0,0), and (0,1). This means that although there are 101 points, they are all stacked on top of each other at these three points.

alt.Chart(df).mark_circle().encode(
    x=numcols[0],
    y=numcols[1],
    color="pred:N",
    tooltip=["animal_name", "class_type", "pred"]
)
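One way to see how many animals are stacked on each point is to count the rows for every combination of the two columns. A minimal sketch with a hypothetical six-row frame standing in for df:

```python
import pandas as pd

# Hypothetical stand-in for df[["hair", "feathers"]].
toy = pd.DataFrame({
    "hair":     [1, 1, 0, 0, 0, 1],
    "feathers": [0, 0, 1, 0, 1, 0],
})

# One row per distinct (hair, feathers) pair, with the number of animals.
combo_counts = (
    toy.groupby(["hair", "feathers"])
       .size()
       .reset_index(name="count")
)
print(combo_counts)
```

A table like this could also be charted directly, for example with the count encoded as the size of each circle, so that overlapping points become visible.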

Below I attempted to make a list of Altair charts, using “class_type” as the y-axis for all of them and all 16 characteristic columns of df for the x-axis. Using tooltip= allows for information about each point to be displayed when the mouse is hovered over it.

chart_list = []
for col in numcols:
    chart = alt.Chart(df).mark_circle().encode(
        x=col,
        y="class_type",
        color="pred:N",
        tooltip=["animal_name", "class_type", "pred"]
    )
    chart_list.append(chart)

chart_list[0]

chart_list[1]

The last problem I encountered with these graphs is that when I run chart_list, the charts do not show up; the notebook prints the list as text instead of rendering each chart.

chart_list

[alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...),
 alt.Chart(...)]

Logistic Regression

For the logistic regression section of this project, I will use numcols as the inputs and try to classify Class_Type.

df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	legs	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred
0	aardvark	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
1	antelope	1	0	0	1	0	0	0	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
2	bass	0	0	1	0	0	1	1	1	1	...	0	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	3
3	bear	1	0	0	1	0	0	1	1	1	...	4	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
4	boar	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	2	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2
97	wasp	1	0	1	0	1	0	0	0	0	...	6	0	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...	6
98	wolf	1	0	0	1	0	0	1	1	1	...	4	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4
100	wren	0	1	1	0	1	0	0	0	1	...	2	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1

101 rows × 23 columns

#import
from sklearn.linear_model import LogisticRegression

# instantiate
clf = LogisticRegression()

# fit
clf.fit(df[numcols], df["Class_Type"])

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,

LogisticRegression()
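The ConvergenceWarning above suggests raising max_iter or scaling the data. A minimal sketch of the first option follows, on a small hypothetical dataset (X, y and clf_demo are assumptions for illustration) that mixes 0/1 columns with a “legs”-like column, as the zoo data does.

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: two 0/1 columns plus a column ranging from 0 to 8.
X = np.array(list(product([0, 1], [0, 1], range(9))))
y = np.where(X[:, 0] == 1, "A", np.where(X[:, 2] > 4, "B", "C"))

# max_iter defaults to 100; a larger value gives the solver more room.
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
print(clf_demo.n_iter_, clf_demo.score(X, y))
```

If n_iter_ stays below max_iter, the solver stopped because it converged rather than because it ran out of iterations. The value 1000 is an arbitrary choice for this sketch.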

# predict
df["class_pred"] = clf.predict(df[numcols])

df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred	class_pred
0	aardvark	1	0	0	1	0	0	1	1	1	...	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal
1	antelope	1	0	0	1	0	0	0	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal
2	bass	0	0	1	0	0	1	1	1	1	...	1	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	3	Fish
3	bear	1	0	0	1	0	0	1	1	1	...	0	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal
4	boar	1	0	0	1	0	0	1	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2	Mammal
97	wasp	1	0	1	0	1	0	0	0	0	...	0	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...	6	Bug
98	wolf	1	0	0	1	0	0	1	1	1	...	1	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate
100	wren	0	1	1	0	1	0	0	0	1	...	1	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1	Bird

101 rows × 24 columns

(df["Class_Type"] == df["class_pred"]).sum()

This may be an example of overfitting the data: 100/101 rows in the dataframe were predicted correctly, but the model was scored on the same rows it was fit on. Below I wanted to see which row was predicted incorrectly. The only animal that was predicted incorrectly was “tortoise,” which was predicted as a bird instead of a reptile. My guess is that this is because it lays eggs, which could have confused the model.

df.loc[(df["Class_Type"] != df["class_pred"]),:]

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	tail	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred	class_pred
90	tortoise	0	0	1	0	0	0	0	0	1	...	1	0	1	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	5	Bird

1 rows × 24 columns

One thing we can do to check for overfitting is to split the data into a training set and a testing set, so that the model is scored on rows it has not seen. I am using train_size=0.8.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df[numcols],
    df["Class_Type"],
    train_size=0.8
)
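Because train_test_split shuffles the rows randomly, every run gives a different split and different scores. The sketch below, on a hypothetical imbalanced dataset, shows two optional arguments: random_state to make the split repeatable and stratify to keep the class proportions the same in both parts. The project itself does not use them; they are shown as an assumption about what might make the results easier to reproduce.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical, imbalanced stand-in for df[numcols] and df["Class_Type"].
toy_X = pd.DataFrame({"legs": [4] * 15 + [2] * 5})
toy_y = pd.Series(["Mammal"] * 15 + ["Bird"] * 5, name="Class_Type")

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X,
    toy_y,
    train_size=0.8,
    random_state=0,   # fixes the shuffle so reruns give the same split
    stratify=toy_y,   # keeps class proportions equal in both parts
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without stratify, a small class can end up entirely in the training set or entirely in the testing set, which matters here because the smallest class has only four animals.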

X_train

	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	breathes	venomous	fins	legs	tail	domestic	catsize
48	1	0	0	1	0	1	1	1	1	1	0	0	4	1	0	1
73	0	0	1	0	0	1	0	1	1	0	0	1	0	1	0	0
33	0	1	1	0	1	1	1	0	1	1	0	0	2	1	0	0
7	0	0	1	0	0	1	0	1	1	0	0	1	0	1	1	0
69	1	0	0	1	0	0	1	1	1	1	0	0	4	1	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
57	0	1	1	0	1	0	0	0	1	1	0	0	2	1	1	0
87	0	1	1	0	1	1	0	0	1	1	0	0	2	1	0	1
62	0	0	1	0	0	0	1	1	1	1	1	0	0	1	0	0
66	0	0	0	1	0	1	1	1	1	1	0	1	0	1	0	1
74	1	0	0	1	0	1	1	1	1	1	0	1	0	0	0	1

80 rows × 16 columns

y_train

   Mammal
     Fish
     Bird
      Fish
   Mammal
       ...   
     Bird
     Bird
  Reptile
   Mammal
   Mammal
Name: Class_Type, Length: 80, dtype: object

# instantiate
clf2 = LogisticRegression()

# fit
clf2.fit(X_train, y_train)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,

LogisticRegression()

Although it isn’t common practice, we can score the model on the training set itself. It predicts every training row correctly, which again hints at overfitting.

# predict
(clf2.predict(X_train) == y_train).sum()/len(y_train)

1.0

We proceed by predicting on the unseen testing data.

clf2.predict(X_test)

array(['Mammal', 'Mammal', 'Bird', 'Bug', 'Fish', 'Mammal', 'Mammal',
       'Bug', 'Mammal', 'Mammal', 'Bird', 'Mammal', 'Mammal', 'Bug',
       'Bird', 'Invertebrate', 'Mammal', 'Fish', 'Mammal', 'Mammal',
       'Mammal'], dtype=object)

I will now calculate the score both manually and by using clf2.score. This is a pretty good score; 95.2% of the test rows (20 of 21) were predicted correctly. While there may still be some overfitting, the model fit on the training set performs nearly as well on data it has not seen.

(clf2.predict(X_test) == y_test).sum()/len(y_test)

0.9523809523809523

# same as:
clf2.score(X_test, y_test)

0.9523809523809523
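With only 21 test rows, a single animal changes the score by almost 5 percentage points, so one split gives a noisy estimate. Cross-validation averages over several splits. This is a minimal sketch using scikit-learn's cross_val_score on a small hypothetical dataset (X_demo and y_demo are assumptions, not the zoo data).

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: two 0/1 columns plus a column ranging from 0 to 8.
X_demo = np.array(list(product([0, 1], [0, 1], range(9))))
y_demo = np.where(X_demo[:, 0] == 1, "A",
                  np.where(X_demo[:, 2] > 4, "B", "C"))

# cv=4 fits the model four times, each time scoring on a different quarter.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_demo, y_demo, cv=4)
print(scores, scores.mean())
```

For a classifier, an integer cv uses stratified folds, so every class needs at least that many members; on the zoo data the smallest class has four animals, which would limit cv to 4.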

Decision Tree

The last section from the course material is decision trees. My goal here is to see which characteristics are the most “important” in determining classification. We will first fit on the full dataset, without a train/test split, for comparison.

# import
from sklearn.tree import DecisionTreeClassifier

# instantiate
clf_tree = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10)

# fit
clf_tree.fit(df[numcols], df["Class_Type"])

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=10)

from sklearn import tree
import matplotlib.pyplot as plt

# plot
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf_tree, 
                   feature_names=clf_tree.feature_names_in_,  
                   class_names=clf_tree.classes_,
                   filled=True)

This diagram shows us the most important characteristics for each class type. For mammals it’s milk, for birds it’s feathers, etc. It seems like there may be a bit of overfitting, since the “mammal”, “bird”, and “fish” leaves all have 100% probabilities while the rest do not.
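Besides reading the diagram, a fitted tree exposes a numeric summary of how much each column contributed through its feature_importances_ attribute. A minimal sketch on a hypothetical eight-row frame (toy and tree_demo are assumptions, not the fitted clf_tree):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical miniature of df: milk and feathers separate the classes,
# while domestic carries no information about the class.
toy = pd.DataFrame({
    "milk":     [1, 1, 1, 0, 0, 0, 0, 0],
    "feathers": [0, 0, 0, 1, 1, 1, 0, 0],
    "domestic": [0, 1, 0, 1, 0, 0, 1, 0],
    "Class_Type": ["Mammal"] * 3 + ["Bird"] * 3 + ["Fish"] * 2,
})
features = ["milk", "feathers", "domestic"]

tree_demo = DecisionTreeClassifier(max_depth=4, random_state=0)
tree_demo.fit(toy[features], toy["Class_Type"])

# Importances are normalized to sum to 1; unused columns get 0.
importances = pd.Series(tree_demo.feature_importances_, index=features)
print(importances.sort_values(ascending=False))
```

On the real data, the equivalent would be pd.Series(clf_tree.feature_importances_, index=clf_tree.feature_names_in_).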

Now we can do the same thing with X_train and y_train to compare the two.

# instantiate
clf_tree2 = DecisionTreeClassifier(max_depth=4, max_leaf_nodes=12)

# fit
clf_tree2.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=4, max_leaf_nodes=12)

# plot
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf_tree2, 
                   feature_names=clf_tree2.feature_names_in_,  
                   class_names=clf_tree2.classes_,
                   filled=True)

The two plots are identical except for the bottom two levels; in the top one it is “backbone,” while in the second one it is “airborne”. Now we can use score to compare the two.

clf_tree.score(df[numcols], df["Class_Type"])

0.8811881188118812

clf_tree2.score(X_train, y_train)

0.8875

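We can see that the tree fit on the training set scores slightly higher (88.8% versus 88.1%), but it isn’t too big of a difference. Note that both numbers are measured on the data each tree was fit on, and that the second tree allows 12 leaf nodes instead of 10; scoring clf_tree2 on X_test and y_test would be needed to measure performance on unseen data. Overall, the score for the decision tree was lower than that of logistic regression.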

K-Nearest Neighbors

# import
from sklearn.neighbors import KNeighborsClassifier

# instantiate
clf_nb = KNeighborsClassifier(n_neighbors=10)

Since I specified n_neighbors=10 during the instantiating step, this is going to find the 10 nearest data points for our new data point based on distance, similar to K-Means clustering. However, for K-Nearest Neighbors we will be trying to predict “Class_Type” once again.
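The idea described above can be sketched in plain Python. This is not scikit-learn's implementation, only a minimal illustration under simple assumptions (Euclidean distance, majority vote, hypothetical rows): find the k closest training rows and return the most common label among them.

```python
from collections import Counter
import math

def knn_predict(train_rows, train_labels, new_row, k):
    """Return the most common label among the k nearest training rows."""
    distances = sorted(
        (math.dist(row, new_row), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical rows with columns [milk, feathers, legs].
rows = [(1, 0, 4), (1, 0, 4), (1, 0, 4), (0, 1, 2), (0, 1, 2), (0, 0, 0)]
labels = ["Mammal", "Mammal", "Mammal", "Bird", "Bird", "Fish"]

pred_k1 = knn_predict(rows, labels, (0, 0, 0), k=1)
pred_k3 = knn_predict(rows, labels, (0, 0, 0), k=3)
print(pred_k1, pred_k3)
```

In this toy data the only fish is outvoted by its two nearest neighbors once k is 3, even though the query row is identical to it. A class with fewer members than about half of k can be outvoted in the same way, which may be relevant for the smaller classes here when n_neighbors=10.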

clf_nb.fit(df[numcols], df["Class_Type"])

KNeighborsClassifier(n_neighbors=10)

df["pred_nb"] = clf_nb.predict(df[numcols])

df

	animal_name	hair	feathers	eggs	milk	airborne	aquatic	predator	toothed	backbone	...	domestic	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred	class_pred	pred_nb
0	aardvark	1	0	0	1	0	0	1	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal	Mammal
1	antelope	1	0	0	1	0	0	0	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal	Mammal
2	bass	0	0	1	0	0	1	1	1	1	...	0	0	3	3	13	Fish	bass, carp, catfish, chub, dogfish, haddock, h...	3	Fish	Fish
3	bear	1	0	0	1	0	0	1	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal	Mammal
4	boar	1	0	0	1	0	0	1	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal	Mammal
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
96	wallaby	1	0	0	1	0	0	0	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	2	Mammal	Mammal
97	wasp	1	0	1	0	1	0	0	0	0	...	0	0	5	5	8	Bug	flea, gnat, honeybee, housefly, ladybird, moth...	6	Bug	Bug
98	wolf	1	0	0	1	0	0	1	1	1	...	0	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	0	Mammal	Mammal
99	worm	0	0	1	0	0	0	0	0	0	...	0	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate	Fish
100	wren	0	1	1	0	1	0	0	0	1	...	0	0	1	1	20	Bird	chicken, crow, dove, duck, flamingo, gull, haw...	1	Bird	Bird

101 rows × 25 columns

After adding a new column to df with the prediction from K-Nearest Neighbors, we can use score to compare this to the other methods.

clf_nb.score(df[numcols], df["Class_Type"])

0.8316831683168316

This score is not the best, but it’s not bad either. Here I am taking a closer look at exactly which rows were difficult for the model to predict. Of the 17 misclassified animals, 9 are invertebrates and 5 are reptiles (every reptile in the dataset), so these two classes were especially challenging to predict using this method.

df.loc[(df["Class_Type"] != df["pred_nb"]), :]

	animal_name	hair	eggs	milk	aquatic	predator	toothed	backbone	...	catsize	class_type	Class_Number	Number_Of_Animal_Species_In_Class	Class_Type	Animal_Names	pred	class_pred	pred_nb
13	clam	0	1	0	0	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate	Fish
14	crab	0	1	0	1	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	5	Invertebrate	Amphibian
15	crayfish	0	1	0	1	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	6	Invertebrate	Bug
19	dolphin	0	0	1	1	1	1	1	...	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	3	Mammal	Fish
46	lobster	0	1	0	1	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	6	Invertebrate	Bug
53	octopus	0	1	0	1	1	0	0	...	1	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	6	Invertebrate	Bug
62	pitviper	0	1	0	0	1	1	1	...	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	3	Reptile	Fish
66	porpoise	0	0	1	1	1	1	1	...	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	3	Mammal	Fish
72	scorpion	0	0	0	0	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	6	Invertebrate	Bug
74	seal	1	0	1	1	1	1	1	...	1	0	0	41	Mammal	aardvark, antelope, bear, boar, buffalo, calf,...	3	Mammal	Fish
76	seasnake	0	0	0	1	1	1	1	...	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	3	Reptile	Fish
77	seawasp	0	1	0	1	1	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate	Fish
80	slowworm	0	1	0	0	1	1	1	...	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	3	Reptile	Fish
81	slug	0	1	0	0	0	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate	Fish
90	tortoise	0	1	0	0	0	0	1	...	1	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	5	Bird	Mammal
91	tuatara	0	1	0	0	1	1	1	...	0	2	2	5	Reptile	pitviper, seasnake, slowworm, tortoise, tuatara	5	Reptile	Amphibian
99	worm	0	1	0	0	0	0	0	...	0	6	6	10	Invertebrate	clam, crab, crayfish, lobster, octopus, scorpi...	4	Invertebrate	Fish

17 rows × 25 columns

Now we can add in our X_train, y_train to fit and X_test, y_test to score.

clf_nb2 = KNeighborsClassifier(n_neighbors=10)

clf_nb2.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=10)

clf_nb2.score(X_train, y_train)

0.8

clf_nb2.score(X_test, y_test)

0.9047619047619048

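I find it a bit strange that the testing set had a much better score than the training set (90.5% versus 80%). With only 21 test rows, each animal changes the score by almost 5 percentage points, so part of this gap may be due to chance. The training score is lower than that of both logistic regression and the decision tree classifier, while the testing score is lower than logistic regression’s 95.2% but slightly higher than the decision tree’s scores, which were measured on training data.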

Summary

In summary, I used K-Means clustering, logistic regression, decision tree classifiers, and K-Nearest Neighbors to attempt to classify zoo animals. The results show that logistic regression was best (with possibility of overfitting), followed by decision tree classifiers, and then K-Nearest Neighbors. K-Means clustering was not the best suited for this project, but was still a good tool to visualize.

References

What is the source of your dataset(s)?

kaggle.com

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

I used this article from analyticsvidhya.com to come up with the K-Nearest Neighbors portion of my project.

Created in Deepnote
