Feature Engineering with the Titanic dataset
Contents
Feature Engineering with the Titanic dataset¶
Feature Engineering¶
Many of my ideas come from this notebook on Kaggle by ZlatanKremonic.
import pandas as pd
df = pd.read_csv("../data/titanic_train.csv")
df
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
Notice that the “Age”, “Cabin”, and “Embarked” columns have missing data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
We are going to fill in the missing age values. Let’s make a new column (let’s “engineer a new feature”) to remember whether the value was missing or not. We’ll make it a numerical column (with 0
and 1
) instead of a Boolean column. (The eventual decision tree would show it as numeric anyway.)
df["AgeNull"] = df["Age"].isna().map(int)
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
We are going to fill in the missing age values with the median age value. The above linked Kaggle notebook has a more sophisticated method, which uses the median age grouped by “Pclass”.
df["Age"].median()
28.0
We use the fillna
method to say what to fill these missing values with.
df["Age"] = df["Age"].fillna(df["Age"].median())
We also could have used inplace=True
, as the documentation shows.
help(df["Age"].fillna)
Help on method fillna in module pandas.core.series:
fillna(value: 'object | ArrayLike | None' = None, method: 'FillnaOptions | None' = None, axis=None, inplace=False, limit=None, downcast=None) -> 'Series | None' method of pandas.core.series.Series instance
Fill NA/NaN values using the specified method.
Parameters
----------
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). Values not
in the dict/Series/DataFrame will not be filled. This value cannot
be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use next valid observation to fill gap.
axis : {0 or 'index'}
Axis along which to fill missing values.
inplace : bool, default False
If True, fill in-place. Note: this will modify any
other views on this object (e.g., a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is
a gap with more than this number of consecutive NaNs, it will only
be partially filled. If method is not specified, this is the
maximum number of entries along the entire axis where NaNs will be
filled. Must be greater than 0 if not None.
downcast : dict, default is None
A dict of item->dtype of what to downcast if possible,
or the string 'infer' which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible).
Returns
-------
Series or None
Object with missing values filled or None if ``inplace=True``.
See Also
--------
interpolate : Fill NaN values using interpolation.
reindex : Conform object to new index.
asfreq : Convert TimeSeries to specified frequency.
Examples
--------
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
... [3, 4, np.nan, 1],
... [np.nan, np.nan, np.nan, 5],
... [np.nan, 3, np.nan, 4]],
... columns=list("ABCD"))
>>> df
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
3 NaN 3.0 NaN 4
Replace all NaN elements with 0s.
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
We can also propagate non-null values forward or backward.
>>> df.fillna(method="ffill")
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
2, and 3 respectively.
>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 2.0 1
2 0.0 1.0 2.0 5
3 0.0 3.0 2.0 4
Only replace the first NaN element.
>>> df.fillna(value=values, limit=1)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 NaN 1
2 NaN 1.0 NaN 5
3 NaN 3.0 NaN 4
When filling using a DataFrame, replacement happens along
the same column names and same indices
>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
Notice how there are no longer any missing values in the “Age” column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 891 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
12 AgeNull 891 non-null int64
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB
We have distorted the data, because now by far the most frequent age value is 28, but because we also recorded which rows used to have missing data, it does not seem like too big of a deal.
df["Age"].value_counts()
28.00 202
24.00 30
22.00 27
18.00 26
19.00 25
...
36.50 1
55.50 1
0.92 1
23.50 1
74.00 1
Name: Age, Length: 88, dtype: int64
Notice how at least the average age differs significantly by “Pclass”. For example, the average age in “Pclass” 1 is nearly 37.
df.groupby("Pclass").mean()
PassengerId | Survived | Age | SibSp | Parch | Fare | AgeNull | |
---|---|---|---|---|---|---|---|
Pclass | |||||||
1 | 461.597222 | 0.629630 | 36.812130 | 0.416667 | 0.356481 | 84.154687 | 0.138889 |
2 | 445.956522 | 0.472826 | 29.765380 | 0.402174 | 0.380435 | 20.662183 | 0.059783 |
3 | 439.154786 | 0.242363 | 25.932627 | 0.615071 | 0.393075 | 13.675550 | 0.276986 |
df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
The error in the following is that we did not specify to look among the column names. By default, pandas is looking among the row names.
df.drop("PassengerId")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38048/3370910653.py in <module>
----> 1 df.drop("PassengerId")
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4911 level=level,
4912 inplace=inplace,
-> 4913 errors=errors,
4914 )
4915
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
4148 for axis, labels in axes.items():
4149 if labels is not None:
-> 4150 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
4151
4152 if inplace:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
4183 new_axis = axis.drop(labels, level=level, errors=errors)
4184 else:
-> 4185 new_axis = axis.drop(labels, errors=errors)
4186 result = self.reindex(**{axis_name: new_axis})
4187
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
6015 if mask.any():
6016 if errors != "ignore":
-> 6017 raise KeyError(f"{labels[mask]} not found in axis")
6018 indexer = indexer[~mask]
6019 return self.delete(indexer)
KeyError: "['PassengerId'] not found in axis"
df.drop("PassengerId", axis=1, inplace=True)
Notice how the “PassengerId” column has disappeared now.
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
df["Sex"].unique()
array(['male', 'female'], dtype=object)
Probably using a Boolean Series would be the most elegant way to convert the “Sex” column to a numeric column, but let’s use a dictionary just for practice.
rep_dict = {'male':0, 'female':1}
rep_dict['male']
0
I think we could actually just put rep_dict
itself into map
, but I’m not certain.
df["Sex"] = df["Sex"].map(lambda x: rep_dict[x])
Now the “Sex” column is numeric.
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 |
2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 |
4 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
Let’s try to extract some information out of the “Name” column.
df["Name"]
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
886 Montvila, Rev. Juozas
887 Graham, Miss. Margaret Edith
888 Johnston, Miss. Catherine Helen "Carrie"
889 Behr, Mr. Karl Howell
890 Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object
Let’s first separate these names at the commas.
df["Name"].map(lambda s: s.split(","))
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
The following will get the title (and a space before it).
df["Title"] = df["Name"].map(lambda s: s.split(",")[1].split(".")[0])
Now we have engineered a new feature holding the title.
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | Title | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | Mr |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | Mrs |
2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | Miss |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | Mrs |
4 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | Mr |
Let’s also save the length of the name.
df["NameLength"] = df["Name"].map(len)
Here would be another way.
df["NameLength"] = df["Name"].map(lambda s: len(s))
df.head()
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | AgeNull | Title | NameLength | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 0 | Mr | 23 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 0 | Mrs | 51 |
2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | Miss | 22 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 0 | Mrs | 44 |
4 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | Mr | 24 |
Here are the different titles.
df.Title.value_counts()
Mr 517
Miss 182
Mrs 125
Master 40
Dr 7
Rev 6
Mlle 2
Major 2
Col 2
the Countess 1
Capt 1
Ms 1
Sir 1
Lady 1
Mme 1
Don 1
Jonkheer 1
Name: Title, dtype: int64
There are 17 unique titles. But they’re not numeric, so what good are they?
len(df.Title.unique())
17
We will use a new pandas function called get_dummies
. The following procedure is also called “one hot encoding”.
pd.get_dummies(df.Title)
Capt | Col | Don | Dr | Jonkheer | Lady | Major | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir | the Countess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
887 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
888 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
890 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
891 rows × 17 columns
Notice how the number of columns is the same as the number of unique values in the “Title” column. The above should remind you of the “Monday”, …, “Sunday” columns we made when working with the bicycle dataset. You see now that we could have saved some time by using the get_dummies
function.
pd.get_dummies(df.Title).shape
(891, 17)
Let’s put these dummy columns onto the end of our DataFrame. Because we are putting the DataFrames side-by-side (not one on over the other), we specify axis=1
.
df2 = pd.concat([df, pd.get_dummies(df.Title)], axis=1)
Notice how the name titles correspond to the 1s that occur in the right side.
df2
Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | ... | Master | Miss | Mlle | Mme | Mr | Mrs | Ms | Rev | Sir | the Countess | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | Braund, Mr. Owen Harris | 0 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
2 | 1 | 3 | Heikkinen, Miss. Laina | 1 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 3 | Allen, Mr. William Henry | 0 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 0 | 2 | Montvila, Rev. Juozas | 0 | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
887 | 1 | 1 | Graham, Miss. Margaret Edith | 1 | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
888 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | 1 | 28.0 | 1 | 2 | W./C. 6607 | 23.4500 | NaN | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
889 | 1 | 1 | Behr, Mr. Karl Howell | 0 | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
890 | 0 | 3 | Dooley, Mr. Patrick | 0 | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
891 rows × 31 columns
We could keep going, but that’s all the feature engineering we’ll do for today. Let’s find the numeric columns. (Only numeric columns can be used in our Decision Tree.)
from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df2.columns if is_numeric_dtype(df2[c])]
cols = [c for c in num_cols if c != "Survived"]
Baseline prediction¶
What would be a good accuracy? Well, if we just predict everyone did not survive, then we would get an accuracy of 61.45 percent (for the following test data), so that is a good baseline against which to meaure our performance.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df2[cols],
df2["Survived"],
test_size=0.2,
random_state=0
)
1-y_test.mean()
0.6145251396648045
Fitting a decision tree¶
from sklearn.tree import DecisionTreeClassifier
If we specify a maximum depth of 7, then there are at most \(2^7 = 128\) leaves. We will restrict to 40 leaves. (These parameters of 7 and 40 were chosen moreorless randomly.)
clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)
The difference of 89% vs 81% accuracy is high enough that there is probably overfitting to be concerned with, but it is not drastic. (At least our test accuracy is much better than the baseline from above.)
clf.score(X_train, y_train)
0.8946629213483146
clf.score(X_test, y_test)
0.8100558659217877
Let’s visualize the tree, similarly to what you did in the discussion section yesterday.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(300,200))
plot_tree(
clf,
feature_names=clf.feature_names_in_,
filled=True
);
One of the best features of decision trees is that they are very interpretable. For example, in the following, we have a record of which features (columns) contributed the most to the decision tree. It might seem surprising that the “Sex” column was not important, but notice that the “Mr” column was very important.
pd.Series(clf.feature_importances_, index=cols)
Pclass 0.137991
Sex 0.000000
Age 0.054578
SibSp 0.013301
Parch 0.024459
Fare 0.163993
AgeNull 0.000000
NameLength 0.062319
Capt 0.008217
Col 0.000000
Don 0.000000
Dr 0.009907
Jonkheer 0.000000
Lady 0.000000
Major 0.003665
Master 0.000000
Miss 0.000000
Mlle 0.000000
Mme 0.000000
Mr 0.498170
Mrs 0.000000
Ms 0.000000
Rev 0.023400
Sir 0.000000
the Countess 0.000000
dtype: float64