Feature Engineering with the Titanic dataset

Feature Engineering

Many of my ideas come from this notebook on Kaggle by ZlatanKremonic.

import pandas as pd

df = pd.read_csv("../data/titanic_train.csv")

df

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Notice that the “Age”, “Cabin”, and “Embarked” columns have missing data.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

We are going to fill in the missing age values. Let’s make a new column (let’s “engineer a new feature”) to remember whether the value was missing or not. We’ll make it a numerical column (with 0 and 1) instead of a Boolean column. (The eventual decision tree would show it as numeric anyway.)

df["AgeNull"] = df["Age"].isna().map(int)
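As a small aside, here is a sketch of the same "remember what was missing" idea on a toy DataFrame. The values in toy are made up for illustration, and astype(int) is used here as one possible alternative to map(int).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic "Age" column (made-up values, not the real data)
toy = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan]})

# isna() gives a Boolean Series; astype(int) converts True/False to 1/0
toy["AgeNull"] = toy["Age"].isna().astype(int)

print(toy["AgeNull"].tolist())  # [0, 1, 0, 1]
```

The indicator has to be created before the missing values are filled, otherwise isna() would no longer find anything.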

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

We are going to fill in the missing age values with the median age value. The above linked Kaggle notebook has a more sophisticated method, which uses the median age grouped by “Pclass”.

df["Age"].median()

28.0

We use the fillna method to say what to fill these missing values with.

df["Age"] = df["Age"].fillna(df["Age"].median())
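For comparison, the grouped-median idea mentioned above might look roughly like the following. This is only a sketch on a made-up toy DataFrame, and it is not necessarily how the Kaggle notebook does it; the names toy and group_median are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: two passenger classes, one missing age in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 20.0, np.nan, 24.0],
})

# For each row, the median age of that row's Pclass group (NaN values are skipped)
group_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the missing ages, each with the median of its own group
toy["Age"] = toy["Age"].fillna(group_median)

print(toy["Age"].tolist())  # [40.0, 36.0, 38.0, 20.0, 22.0, 24.0]
```

Because transform returns a Series with the same index as toy, it can be passed straight to fillna, which fills each missing entry with the value at the matching index.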

We also could have used inplace=True, as the documentation shows.

help(df["Age"].fillna)

Help on method fillna in module pandas.core.series:

fillna(value: 'object | ArrayLike | None' = None, method: 'FillnaOptions | None' = None, axis=None, inplace=False, limit=None, downcast=None) -> 'Series | None' method of pandas.core.series.Series instance
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use next valid observation to fill gap.
    axis : {0 or 'index'}
        Axis along which to fill missing values.
    inplace : bool, default False
        If True, fill in-place. Note: this will modify any
        other views on this object (e.g., a no-copy slice for a column in a
        DataFrame).
    limit : int, default None
        If method is specified, this is the maximum number of consecutive
        NaN values to forward/backward fill. In other words, if there is
        a gap with more than this number of consecutive NaNs, it will only
        be partially filled. If method is not specified, this is the
        maximum number of entries along the entire axis where NaNs will be
        filled. Must be greater than 0 if not None.
    downcast : dict, default is None
        A dict of item->dtype of what to downcast if possible,
        or the string 'infer' which will try to downcast to an appropriate
        equal type (e.g. float64 to int64 if possible).
    
    Returns
    -------
    Series or None
        Object with missing values filled or None if ``inplace=True``.
    
    See Also
    --------
    interpolate : Fill NaN values using interpolation.
    reindex : Conform object to new index.
    asfreq : Convert TimeSeries to specified frequency.
    
    Examples
    --------
    >>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
    ...                    [3, 4, np.nan, 1],
    ...                    [np.nan, np.nan, np.nan, 5],
    ...                    [np.nan, 3, np.nan, 4]],
    ...                   columns=list("ABCD"))
    >>> df
         A    B   C  D
    0  NaN  2.0 NaN  0
    1  3.0  4.0 NaN  1
    2  NaN  NaN NaN  5
    3  NaN  3.0 NaN  4
    
    Replace all NaN elements with 0s.
    
    >>> df.fillna(0)
        A   B   C   D
    0   0.0 2.0 0.0 0
    1   3.0 4.0 0.0 1
    2   0.0 0.0 0.0 5
    3   0.0 3.0 0.0 4
    
    We can also propagate non-null values forward or backward.
    
    >>> df.fillna(method="ffill")
        A   B   C   D
    0   NaN 2.0 NaN 0
    1   3.0 4.0 NaN 1
    2   3.0 4.0 NaN 5
    3   3.0 3.0 NaN 4
    
    Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
    2, and 3 respectively.
    
    >>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
    >>> df.fillna(value=values)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 2.0 1
    2   0.0 1.0 2.0 5
    3   0.0 3.0 2.0 4
    
    Only replace the first NaN element.
    
    >>> df.fillna(value=values, limit=1)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 NaN 1
    2   NaN 1.0 NaN 5
    3   NaN 3.0 NaN 4
    
    When filling using a DataFrame, replacement happens along
    the same column names and same indices
    
    >>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
    >>> df.fillna(df2)
        A   B   C   D
    0   0.0 2.0 0.0 0
    1   3.0 4.0 0.0 1
    2   0.0 0.0 0.0 5
    3   0.0 3.0 0.0 4

Notice how there are no longer any missing values in the “Age” column.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  AgeNull      891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB

We have distorted the data: 28 is now by far the most frequent age value. But because we also recorded which rows originally had a missing age, this does not seem like too big a deal.

df["Age"].value_counts()

28.00    202
00     30
00     27
00     26
00     25
        ... 
50      1
50      1
92       1
50      1
00      1
Name: Age, Length: 88, dtype: int64

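Notice that the average age differs noticeably by “Pclass”. For example, the average age in “Pclass” 1 is nearly 37. (We pass numeric_only=True because recent versions of pandas raise an error when asked to average columns of strings such as “Name”.)

df.groupby("Pclass").mean(numeric_only=True)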

	PassengerId	Survived	Age	SibSp	Parch	Fare	AgeNull
Pclass
1	461.597222	0.629630	36.812130	0.416667	0.356481	84.154687	0.138889
2	445.956522	0.472826	29.765380	0.402174	0.380435	20.662183	0.059783
3	439.154786	0.242363	25.932627	0.615071	0.393075	13.675550	0.276986

df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

The following raises an error because we did not tell pandas to look among the column names. By default, the drop method looks among the row labels (axis=0).

df.drop("PassengerId")

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38048/3370910653.py in <module>
----> 1 df.drop("PassengerId")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4911             level=level,
   4912             inplace=inplace,
-> 4913             errors=errors,
   4914         )
   4915 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4148         for axis, labels in axes.items():
   4149             if labels is not None:
-> 4150                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4151 
   4152         if inplace:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
   4183                 new_axis = axis.drop(labels, level=level, errors=errors)
   4184             else:
-> 4185                 new_axis = axis.drop(labels, errors=errors)
   4186             result = self.reindex(**{axis_name: new_axis})
   4187 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   6015         if mask.any():
   6016             if errors != "ignore":
-> 6017                 raise KeyError(f"{labels[mask]} not found in axis")
   6018             indexer = indexer[~mask]
   6019         return self.delete(indexer)

KeyError: "['PassengerId'] not found in axis"

df.drop("PassengerId", axis=1, inplace=True)
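Another way to say "look among the columns" is the columns keyword argument of drop. Here is a short sketch on a made-up two-column DataFrame; it returns a new DataFrame rather than using inplace=True.

```python
import pandas as pd

# Made-up miniature version of the Titanic DataFrame
toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# columns="PassengerId" has the same effect as passing "PassengerId" with axis=1
smaller = toy.drop(columns="PassengerId")

print(list(smaller.columns))  # ['Survived']
print(list(toy.columns))      # ['PassengerId', 'Survived'] (toy itself is unchanged)
```

Without inplace=True, drop leaves the original DataFrame alone, so the result needs to be assigned to a variable.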

Notice how the “PassengerId” column has disappeared now.

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

df["Sex"].unique()

array(['male', 'female'], dtype=object)

Probably using a Boolean Series would be the most elegant way to convert the “Sex” column to a numeric column, but let’s use a dictionary just for practice.

rep_dict = {'male':0, 'female':1}

rep_dict['male']

We could also pass rep_dict itself to map, because the map method of a pandas Series accepts a dictionary. Here we use a lambda function instead.

df["Sex"] = df["Sex"].map(lambda x: rep_dict[x])
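Here is a quick sketch, on a made-up Series named sex, of two alternatives to the lambda function: passing the dictionary directly to map, and the Boolean Series approach.

```python
import pandas as pd

# Made-up stand-in for the "Sex" column
sex = pd.Series(["male", "female", "female", "male"])
rep_dict = {"male": 0, "female": 1}

# Series.map accepts a dictionary directly
via_dict = sex.map(rep_dict)

# Boolean Series approach: True where the entry is "female", then convert to 0/1
via_bool = (sex == "female").astype(int)

print(via_dict.tolist())                       # [0, 1, 1, 0]
print(via_dict.tolist() == via_bool.tolist())  # True
```

One difference to keep in mind: with map(rep_dict), a value that is not a key of the dictionary becomes NaN, whereas the lambda version raises a KeyError.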

Now the “Sex” column is numeric.

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	0	22.0	1	A/5 21171	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	PC 17599	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	1	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	113803	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	0	35.0	0	373450	8.0500	NaN	S

Let’s try to extract some information out of the “Name” column.

df["Name"]

                              Braund, Mr. Owen Harris
    Cumings, Mrs. John Bradley (Florence Briggs Th...
                               Heikkinen, Miss. Laina
         Futrelle, Mrs. Jacques Heath (Lily May Peel)
                             Allen, Mr. William Henry
                             ...                        
                              Montvila, Rev. Juozas
                       Graham, Miss. Margaret Edith
           Johnston, Miss. Catherine Helen "Carrie"
                              Behr, Mr. Karl Howell
                                Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Let’s first separate these names at the commas.

df["Name"].map(lambda s: s.split(","))

                           [Braund,  Mr. Owen Harris]
    [Cumings,  Mrs. John Bradley (Florence Briggs ...
                            [Heikkinen,  Miss. Laina]
      [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
                          [Allen,  Mr. William Henry]
                             ...                        
                           [Montvila,  Rev. Juozas]
                    [Graham,  Miss. Margaret Edith]
        [Johnston,  Miss. Catherine Helen "Carrie"]
                           [Behr,  Mr. Karl Howell]
                             [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

The following will get the title (and a space before it).

df["Title"] = df["Name"].map(lambda s: s.split(",")[1].split(".")[0])
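If the leading space is unwanted, one option is to remove it with the str.strip method. The sketch below uses two names copied from the rows shown above; the variable names raw and clean are just for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
])

# Same extraction as above: text between the comma and the first period
raw = names.map(lambda s: s.split(",")[1].split(".")[0])
print(raw.tolist())    # [' Mr', ' Miss']

# str.strip removes leading and trailing whitespace from every entry
clean = raw.str.strip()
print(clean.tolist())  # ['Mr', 'Miss']
```

In these notes the space is kept, which is why the titles appear with a leading space later on.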

Now we have engineered a new feature holding the title.

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title
0	0	3	Braund, Mr. Owen Harris	0	22.0	1	A/5 21171	7.2500	NaN	S	Mr
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	PC 17599	71.2833	C85	C	Mrs
2	1	3	Heikkinen, Miss. Laina	1	26.0	0	STON/O2. 3101282	7.9250	NaN	S	Miss
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	113803	53.1000	C123	S	Mrs
4	0	3	Allen, Mr. William Henry	0	35.0	0	373450	8.0500	NaN	S	Mr

Let’s also save the length of the name.

df["NameLength"] = df["Name"].map(len)

Here would be another way.

df["NameLength"] = df["Name"].map(lambda s: len(s))

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	NameLength
0	0	3	Braund, Mr. Owen Harris	0	22.0	1	A/5 21171	7.2500	NaN	S	Mr	23
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	PC 17599	71.2833	C85	C	Mrs	51
2	1	3	Heikkinen, Miss. Laina	1	26.0	0	STON/O2. 3101282	7.9250	NaN	S	Miss	22
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	113803	53.1000	C123	S	Mrs	44
4	0	3	Allen, Mr. William Henry	0	35.0	0	373450	8.0500	NaN	S	Mr	24

Here are the different titles.

df.Title.value_counts()

 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Mlle              2
 Major             2
 Col               2
 the Countess      1
 Capt              1
 Ms                1
 Sir               1
 Lady              1
 Mme               1
 Don               1
 Jonkheer          1
Name: Title, dtype: int64

There are 17 unique titles. But they’re not numeric, so what good are they?

len(df.Title.unique())

We will use a pandas function that is new to us, called get_dummies. The procedure it performs is also called “one-hot encoding”.

pd.get_dummies(df.Title)

	Capt	Col	Don	Dr	Jonkheer	Lady	Major	Master	Miss	Mlle	Mme	Mr	Mrs	Ms	Rev	Sir	the Countess
0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
2	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
887	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
888	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
889	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
890	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0

891 rows × 17 columns

Notice how the number of columns is the same as the number of unique values in the “Title” column. The above should remind you of the “Monday”, …, “Sunday” columns we made when working with the bicycle dataset. You see now that we could have saved some time by using the get_dummies function.

pd.get_dummies(df.Title).shape

(891, 17)

Let’s put these dummy columns onto the end of our DataFrame. Because we are putting the DataFrames side-by-side (not one on top of the other), we specify axis=1.

df2 = pd.concat([df, pd.get_dummies(df.Title)], axis=1)
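To see the two steps in miniature, here is a sketch on a made-up four-row DataFrame. The dtype=int argument is an addition for this sketch: recent versions of pandas return Boolean dummy columns by default, and dtype=int asks for 0s and 1s like the ones displayed above.

```python
import pandas as pd

# Made-up miniature DataFrame with a "Title" column
toy = pd.DataFrame({
    "Title": ["Mr", "Mrs", "Miss", "Mr"],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

# One column per unique title, holding 1 where the row has that title
dummies = pd.get_dummies(toy["Title"], dtype=int)

# axis=1 places the two DataFrames side-by-side, matching rows by index
wide = pd.concat([toy, dummies], axis=1)

print(list(wide.columns))   # ['Title', 'Age', 'Miss', 'Mr', 'Mrs']
print(wide["Mr"].tolist())  # [1, 0, 0, 1]
```

Each row has exactly one 1 among the dummy columns, which is where the name “one hot” comes from.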

Notice how the titles in the names correspond to the 1s that occur on the right side.

df2

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	...	Master	Miss	Mlle	Mme	Mr	Mrs	Ms	Rev	Sir	the Countess
0	0	3	Braund, Mr. Owen Harris	0	22.0	1	0	A/5 21171	7.2500	NaN	...	0	0	0	0	1	0	0	0	0	0
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	1	38.0	1	0	PC 17599	71.2833	C85	...	0	0	0	0	0	1	0	0	0	0
2	1	3	Heikkinen, Miss. Laina	1	26.0	0	0	STON/O2. 3101282	7.9250	NaN	...	0	1	0	0	0	0	0	0	0	0
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1	35.0	1	0	113803	53.1000	C123	...	0	0	0	0	0	1	0	0	0	0
4	0	3	Allen, Mr. William Henry	0	35.0	0	0	373450	8.0500	NaN	...	0	0	0	0	1	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
886	0	2	Montvila, Rev. Juozas	0	27.0	0	0	211536	13.0000	NaN	...	0	0	0	0	0	0	0	1	0	0
887	1	1	Graham, Miss. Margaret Edith	1	19.0	0	0	112053	30.0000	B42	...	0	1	0	0	0	0	0	0	0	0
888	0	3	Johnston, Miss. Catherine Helen "Carrie"	1	28.0	1	2	W./C. 6607	23.4500	NaN	...	0	1	0	0	0	0	0	0	0	0
889	1	1	Behr, Mr. Karl Howell	0	26.0	0	0	111369	30.0000	C148	...	0	0	0	0	1	0	0	0	0	0
890	0	3	Dooley, Mr. Patrick	0	32.0	0	0	370376	7.7500	NaN	...	0	0	0	0	1	0	0	0	0	0

891 rows × 31 columns

We could keep going, but that’s all the feature engineering we’ll do for today. Let’s find the numeric columns. (Only numeric columns can be used in our Decision Tree.)

from pandas.api.types import is_numeric_dtype

num_cols = [c for c in df2.columns if is_numeric_dtype(df2[c])]

cols = [c for c in num_cols if c != "Survived"]
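An alternative way to find the numeric columns is the select_dtypes method. The following is only a sketch on a made-up DataFrame, not a replacement for the code above.

```python
import pandas as pd

# Made-up miniature DataFrame mixing numeric and string columns
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Age": [22.0, 26.0],
    "Mr": [1, 0],
})

# Keep only the columns with a numeric dtype
num_cols = list(toy.select_dtypes(include="number").columns)

# Remove the target column, as in the list comprehension above
cols = [c for c in num_cols if c != "Survived"]

print(cols)  # ['Age', 'Mr']
```

The two approaches are not identical: is_numeric_dtype counts Boolean columns as numeric, while select_dtypes(include="number") leaves them out.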

Baseline prediction

What would be a good accuracy? Well, if we just predict that nobody survived, then we would get an accuracy of 61.45 percent (for the following test data), so that is a good baseline against which to measure our performance.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df2[cols],
    df2["Survived"],
    test_size=0.2,
    random_state=0
)

1-y_test.mean()

0.6145251396648045
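The same kind of baseline can also be computed with scikit-learn's DummyClassifier. Here is a sketch on made-up labels; the arrays X and y below are hypothetical and are not the Titanic data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: six 0s (did not survive) and four 1s (survived)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
X = np.zeros((len(y), 1))  # the features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

print(baseline.score(X, y))  # 0.6
print(1 - y.mean())          # 0.6
```

In practice one would fit the DummyClassifier on the training data and score it on the test data. That agrees with 1 - y_test.mean() whenever the majority class in the training data is 0.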

Fitting a decision tree

from sklearn.tree import DecisionTreeClassifier

If we specify a maximum depth of 7, then there are at most \(2^7 = 128\) leaves. We will restrict to 40 leaves. (These parameters of 7 and 40 were chosen more or less randomly.)

clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)

clf.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)
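One way to confirm that both restrictions are respected is to ask the fitted tree for its depth and number of leaves. The sketch below uses synthetic data from make_classification rather than the Titanic data, and clf_demo is a hypothetical name chosen so as not to clash with clf.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data (not the Titanic data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf_demo = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40, random_state=0)
clf_demo.fit(X, y)

# Both limits apply at once: the tree stops growing when either one is reached
print(clf_demo.get_depth())     # at most 7
print(clf_demo.get_n_leaves())  # at most 40
```

The same two methods could be called on clf after fitting it to the Titanic training data.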

The gap between 89% accuracy on the training set and 81% accuracy on the test set is large enough to suggest some overfitting, but it is not drastic. (At least our test accuracy is much better than the baseline from above.)

clf.score(X_train, y_train)

0.8946629213483146

clf.score(X_test, y_test)

0.8100558659217877
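To get a feel for how the gap between training and test accuracy depends on the size of the tree, one could loop over several values of max_leaf_nodes. This is a sketch on synthetic data; the particular values 5, 10, 20, 40 and the names below are arbitrary choices for illustration, and the numbers it prints say nothing about the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the Titanic data), split 80/20 as above
X, y = make_classification(n_samples=600, n_features=10, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gaps = {}
for n_leaves in [5, 10, 20, 40]:
    tree = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    # Training accuracy minus test accuracy; a large positive gap suggests overfitting
    gaps[n_leaves] = tree.score(X_tr, y_tr) - tree.score(X_te, y_te)
    print(n_leaves, round(gaps[n_leaves], 3))
```

A more careful comparison would use cross-validation rather than a single train/test split.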

Let’s visualize the tree, similarly to what you did in the discussion section yesterday.

from sklearn.tree import plot_tree

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(300,200))
plot_tree(
    clf,
    feature_names=clf.feature_names_in_,
    filled=True
);

One of the best features of decision trees is that they are very interpretable. For example, in the following, we have a record of which features (columns) contributed the most to the decision tree. It might seem surprising that the “Sex” column was not important, but notice that the “Mr” column was very important, and that column carries much of the same information, since it marks most of the male passengers.

pd.Series(clf.feature_importances_, index=cols)

Pclass           0.137991
Sex              0.000000
Age              0.054578
SibSp            0.013301
Parch            0.024459
Fare             0.163993
AgeNull          0.000000
NameLength       0.062319
 Capt            0.008217
 Col             0.000000
 Don             0.000000
 Dr              0.009907
 Jonkheer        0.000000
 Lady            0.000000
 Major           0.003665
 Master          0.000000
 Miss            0.000000
 Mlle            0.000000
 Mme             0.000000
 Mr              0.498170
 Mrs             0.000000
 Ms              0.000000
 Rev             0.023400
 Sir             0.000000
 the Countess    0.000000
dtype: float64
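With this many columns, it can help to sort the importances from largest to smallest. Here is a sketch on synthetic data; the feature names f0, ..., f4 are made up, and on the Titanic data one would use cols as the index, as above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with five features (not the Titanic data)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"f{i}" for i in range(5)]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Label each importance with its feature name, then sort in decreasing order
importances = pd.Series(tree.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)

print(ranked)
print(round(ranked.sum(), 6))  # the importances sum to 1.0
```

A feature with importance 0 was never used for a split in the tree. That does not by itself mean the feature is uninformative, since another column may carry the same information.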

UC Irvine Math 10 S22
