Feature Engineering with the Titanic dataset

Feature Engineering

Many of my ideas come from this notebook on Kaggle by ZlatanKremonic.

import pandas as pd
df = pd.read_csv("../data/titanic_train.csv")
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Notice that the “Age”, “Cabin”, and “Embarked” columns have missing data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

We are going to fill in the missing age values. Before doing so, let’s make a new column (that is, let’s “engineer a new feature”) that records whether the age was originally missing. We’ll make it a numeric column of 0s and 1s rather than a Boolean column. (The eventual decision tree would display it as numeric anyway.)

df["AgeNull"] = df["Age"].isna().map(int)
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0
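
As a small check of the idea above, here is a sketch on a made-up Series (the ages below are hypothetical, not taken from the Titanic data). It shows that isna() produces a Boolean Series, and that either map(int) or astype(int) turns it into 0s and 1s.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, two of them missing
ages = pd.Series([22.0, np.nan, 26.0, np.nan])

flag_map = ages.isna().map(int)        # the approach used above
flag_astype = ages.isna().astype(int)  # an equivalent alternative

print(flag_map.tolist())
print(flag_astype.tolist())
```

Both versions give the same indicator column, so the choice between them is mostly a matter of taste.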

We are going to fill in the missing age values with the median age. The Kaggle notebook linked above uses a more sophisticated method, filling with the median age within each “Pclass”.

df["Age"].median()
28.0
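
The grouped-median idea mentioned above could look roughly like the following. This is only a sketch on a tiny made-up DataFrame (the name toy and its values are assumptions for illustration), and it is not necessarily how the Kaggle notebook implements it.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two classes, one missing age in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# For every row, the median age of that row's Pclass (NaN values are skipped)
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill each missing age with the median of its own class
toy["AgeFilled"] = toy["Age"].fillna(class_median)
print(toy)
```

Here the missing first-class age becomes 45 and the missing third-class age becomes 25, rather than both receiving one overall median.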

We use the fillna method to say what to fill these missing values with.

df["Age"] = df["Age"].fillna(df["Age"].median())

We could also have used the inplace=True argument, which the documentation below describes.

help(df["Age"].fillna)
Help on method fillna in module pandas.core.series:

fillna(value: 'object | ArrayLike | None' = None, method: 'FillnaOptions | None' = None, axis=None, inplace=False, limit=None, downcast=None) -> 'Series | None' method of pandas.core.series.Series instance
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame).  Values not
        in the dict/Series/DataFrame will not be filled. This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use next valid observation to fill gap.
    axis : {0 or 'index'}
        Axis along which to fill missing values.
    inplace : bool, default False
        If True, fill in-place. Note: this will modify any
        other views on this object (e.g., a no-copy slice for a column in a
        DataFrame).
    limit : int, default None
        If method is specified, this is the maximum number of consecutive
        NaN values to forward/backward fill. In other words, if there is
        a gap with more than this number of consecutive NaNs, it will only
        be partially filled. If method is not specified, this is the
        maximum number of entries along the entire axis where NaNs will be
        filled. Must be greater than 0 if not None.
    downcast : dict, default is None
        A dict of item->dtype of what to downcast if possible,
        or the string 'infer' which will try to downcast to an appropriate
        equal type (e.g. float64 to int64 if possible).
    
    Returns
    -------
    Series or None
        Object with missing values filled or None if ``inplace=True``.
    
    See Also
    --------
    interpolate : Fill NaN values using interpolation.
    reindex : Conform object to new index.
    asfreq : Convert TimeSeries to specified frequency.
    
    Examples
    --------
    >>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
    ...                    [3, 4, np.nan, 1],
    ...                    [np.nan, np.nan, np.nan, 5],
    ...                    [np.nan, 3, np.nan, 4]],
    ...                   columns=list("ABCD"))
    >>> df
         A    B   C  D
    0  NaN  2.0 NaN  0
    1  3.0  4.0 NaN  1
    2  NaN  NaN NaN  5
    3  NaN  3.0 NaN  4
    
    Replace all NaN elements with 0s.
    
    >>> df.fillna(0)
        A   B   C   D
    0   0.0 2.0 0.0 0
    1   3.0 4.0 0.0 1
    2   0.0 0.0 0.0 5
    3   0.0 3.0 0.0 4
    
    We can also propagate non-null values forward or backward.
    
    >>> df.fillna(method="ffill")
        A   B   C   D
    0   NaN 2.0 NaN 0
    1   3.0 4.0 NaN 1
    2   3.0 4.0 NaN 5
    3   3.0 3.0 NaN 4
    
    Replace all NaN elements in column 'A', 'B', 'C', and 'D', with 0, 1,
    2, and 3 respectively.
    
    >>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
    >>> df.fillna(value=values)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 2.0 1
    2   0.0 1.0 2.0 5
    3   0.0 3.0 2.0 4
    
    Only replace the first NaN element.
    
    >>> df.fillna(value=values, limit=1)
        A   B   C   D
    0   0.0 2.0 2.0 0
    1   3.0 4.0 NaN 1
    2   NaN 1.0 NaN 5
    3   NaN 3.0 NaN 4
    
    When filling using a DataFrame, replacement happens along
    the same column names and same indices
    
    >>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
    >>> df.fillna(df2)
        A   B   C   D
    0   0.0 2.0 0.0 0
    1   3.0 4.0 0.0 1
    2   0.0 0.0 0.0 5
    3   0.0 3.0 0.0 4

Notice how there are no longer any missing values in the “Age” column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
 12  AgeNull      891 non-null    int64  
dtypes: float64(2), int64(6), object(5)
memory usage: 90.6+ KB

We have distorted the data: 28 is now by far the most frequent age. Because we also recorded which rows originally had a missing age, this seems less of a concern.

df["Age"].value_counts()
28.00    202
24.00     30
22.00     27
18.00     26
19.00     25
        ... 
36.50      1
55.50      1
0.92       1
23.50      1
74.00      1
Name: Age, Length: 88, dtype: int64

Notice how the average age differs noticeably by “Pclass”, even after filling in the missing values. For example, the average age in “Pclass” 1 is nearly 37.

# numeric_only=True skips the text columns, which newer pandas versions refuse to average
df.groupby("Pclass").mean(numeric_only=True)
        PassengerId  Survived        Age     SibSp     Parch       Fare   AgeNull
Pclass
1        461.597222  0.629630  36.812130  0.416667  0.356481  84.154687  0.138889
2        445.956522  0.472826  29.765380  0.402174  0.380435  20.662183  0.059783
3        439.154786  0.242363  25.932627  0.615071  0.393075  13.675550  0.276986
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0

The following call raises an error because we did not tell pandas to look among the column names. By default, drop looks among the row labels.

df.drop("PassengerId")
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_38048/3370910653.py in <module>
----> 1 df.drop("PassengerId")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312 
    313         return wrapper

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4911             level=level,
   4912             inplace=inplace,
-> 4913             errors=errors,
   4914         )
   4915 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in drop(self, labels, axis, index, columns, level, inplace, errors)
   4148         for axis, labels in axes.items():
   4149             if labels is not None:
-> 4150                 obj = obj._drop_axis(labels, axis, level=level, errors=errors)
   4151 
   4152         if inplace:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/generic.py in _drop_axis(self, labels, axis, level, errors)
   4183                 new_axis = axis.drop(labels, level=level, errors=errors)
   4184             else:
-> 4185                 new_axis = axis.drop(labels, errors=errors)
   4186             result = self.reindex(**{axis_name: new_axis})
   4187 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
   6015         if mask.any():
   6016             if errors != "ignore":
-> 6017                 raise KeyError(f"{labels[mask]} not found in axis")
   6018             indexer = indexer[~mask]
   6019         return self.delete(indexer)

KeyError: "['PassengerId'] not found in axis"

Specifying axis=1 tells pandas to look among the column names instead.

df.drop("PassengerId", axis=1, inplace=True)

Notice how the “PassengerId” column has disappeared now.

df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull
0 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 0
2 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 0
4 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 0
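
To see the behavior of drop in isolation, here is a minimal sketch on a made-up two-row DataFrame (toy is a hypothetical name). It reproduces the KeyError and then shows two equivalent ways of dropping a column, without using inplace.

```python
import pandas as pd

toy = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Without axis or columns, drop searches the row labels (0 and 1 here)
try:
    toy.drop("PassengerId")
    raised = False
except KeyError:
    raised = True

by_axis = toy.drop("PassengerId", axis=1)      # look among the columns
by_keyword = toy.drop(columns="PassengerId")   # same result, more explicit

print(raised)
print(by_axis.columns.tolist())
```

Because inplace is not used here, toy itself is unchanged and each call returns a new DataFrame.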
df["Sex"].unique()
array(['male', 'female'], dtype=object)

Probably using a Boolean Series would be the most elegant way to convert the “Sex” column to a numeric column, but let’s use a dictionary just for practice.

rep_dict = {'male':0, 'female':1}
rep_dict['male']
0

We could also pass rep_dict itself to map, because the map method of a pandas Series accepts a dictionary. Here we use a lambda function instead.

df["Sex"] = df["Sex"].map(lambda x: rep_dict[x])
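
The three approaches mentioned above (a lambda function, the dictionary itself, and a Boolean Series) can be compared on a small made-up Series. The Series sex below is hypothetical and only stands in for the real column.

```python
import pandas as pd

rep_dict = {"male": 0, "female": 1}
sex = pd.Series(["male", "female", "female", "male"])  # hypothetical values

via_lambda = sex.map(lambda x: rep_dict[x])  # the approach used above
via_dict = sex.map(rep_dict)                 # pass the dictionary directly
via_bool = (sex == "female").astype(int)     # Boolean Series converted to 0/1

print(via_lambda.tolist())
print(via_dict.tolist())
print(via_bool.tolist())
```

One difference worth knowing: if a value is not a key of the dictionary, the lambda version raises a KeyError, while passing the dictionary directly produces a missing value for that row.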

Now the “Sex” column is numeric.

df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull
0 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C 0
2 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S 0
4 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S 0

Let’s try to extract some information out of the “Name” column.

df["Name"]
0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                 Heikkinen, Miss. Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                         Graham, Miss. Margaret Edith
888             Johnston, Miss. Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

Let’s first separate these names at the commas.

df["Name"].map(lambda s: s.split(","))
0                             [Braund,  Mr. Owen Harris]
1      [Cumings,  Mrs. John Bradley (Florence Briggs ...
2                              [Heikkinen,  Miss. Laina]
3        [Futrelle,  Mrs. Jacques Heath (Lily May Peel)]
4                            [Allen,  Mr. William Henry]
                             ...                        
886                             [Montvila,  Rev. Juozas]
887                      [Graham,  Miss. Margaret Edith]
888          [Johnston,  Miss. Catherine Helen "Carrie"]
889                             [Behr,  Mr. Karl Howell]
890                               [Dooley,  Mr. Patrick]
Name: Name, Length: 891, dtype: object

The following extracts the title, together with the space that precedes it.

df["Title"] = df["Name"].map(lambda s: s.split(",")[1].split(".")[0])
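
If the leading space is unwanted, one option is to add strip to the same expression. The sketch below uses three names copied from the dataset in a stand-alone Series (names is a hypothetical variable); the rest of this page keeps the unstripped version, which is why the titles in later outputs begin with a space.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Montvila, Rev. Juozas",
])

# Same expression as above: text after the comma, before the first period
with_space = names.map(lambda s: s.split(",")[1].split(".")[0])

# Adding strip() removes the surrounding whitespace
stripped = names.map(lambda s: s.split(",")[1].split(".")[0].strip())

# An alternative using the pandas string accessor instead of a lambda
via_str = names.str.split(",").str[1].str.split(".").str[0].str.strip()

print(with_space.tolist())
print(stripped.tolist())
```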

Now we have engineered a new feature holding the title.

df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull Title
0 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S 0 Mr
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C 0 Mrs
2 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 Miss
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S 0 Mrs
4 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S 0 Mr

Let’s also save the length of the name.

df["NameLength"] = df["Name"].map(len)

Here is another way, using a lambda function.

df["NameLength"] = df["Name"].map(lambda s: len(s))
df.head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked AgeNull Title NameLength
0 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN S 0 Mr 23
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 C 0 Mrs 51
2 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 0 Miss 22
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 S 0 Mrs 44
4 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN S 0 Mr 24
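
A third option, sketched here on two names from the dataset placed in a stand-alone Series (names is a hypothetical variable), is the pandas string accessor, which avoids map altogether.

```python
import pandas as pd

names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

via_map = names.map(len)      # the approach used above
via_str = names.str.len()     # string accessor alternative

print(via_map.tolist())
print(via_str.tolist())
```

The two lengths agree with the NameLength values shown for these passengers above.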

Here are the different titles.

df.Title.value_counts()
 Mr              517
 Miss            182
 Mrs             125
 Master           40
 Dr                7
 Rev               6
 Mlle              2
 Major             2
 Col               2
 the Countess      1
 Capt              1
 Ms                1
 Sir               1
 Lady              1
 Mme               1
 Don               1
 Jonkheer          1
Name: Title, dtype: int64

There are 17 unique titles. But they’re not numeric, so what good are they?

len(df.Title.unique())
17

We will use a pandas function that is new to us, get_dummies. The procedure it performs is also called “one-hot encoding”.

pd.get_dummies(df.Title)
Capt Col Don Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Rev Sir the Countess
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
887 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
888 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
889 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
890 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0

891 rows × 17 columns

Notice how the number of columns is the same as the number of unique values in the “Title” column. The above should remind you of the “Monday”, …, “Sunday” columns we made when working with the bicycle dataset. You see now that we could have saved some time by using the get_dummies function.

pd.get_dummies(df.Title).shape
(891, 17)
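
To make the encoding easier to read, here is a sketch on a made-up Series with only three distinct titles (titles is a hypothetical variable). The dtype=int argument is an addition not used above: newer pandas versions return True/False columns by default, and dtype=int asks for 0s and 1s like the output shown on this page.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mrs", "Miss", "Mr"], name="Title")

# One column per distinct value; each row has a single 1
dummies = pd.get_dummies(titles, dtype=int)

print(dummies)
print(dummies.shape)
```

Each row sums to 1, because every passenger has exactly one title, and the number of columns equals the number of distinct titles.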

Let’s put these dummy columns onto the end of our DataFrame. Because we are putting the DataFrames side by side (not one on top of the other), we specify axis=1.

df2 = pd.concat([df, pd.get_dummies(df.Title)], axis=1)

Notice how the titles in the names correspond to the 1s that occur on the right side.

df2
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin ... Master Miss Mlle Mme Mr Mrs Ms Rev Sir the Countess
0 0 3 Braund, Mr. Owen Harris 0 22.0 1 0 A/5 21171 7.2500 NaN ... 0 0 0 0 1 0 0 0 0 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38.0 1 0 PC 17599 71.2833 C85 ... 0 0 0 0 0 1 0 0 0 0
2 1 3 Heikkinen, Miss. Laina 1 26.0 0 0 STON/O2. 3101282 7.9250 NaN ... 0 1 0 0 0 0 0 0 0 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35.0 1 0 113803 53.1000 C123 ... 0 0 0 0 0 1 0 0 0 0
4 0 3 Allen, Mr. William Henry 0 35.0 0 0 373450 8.0500 NaN ... 0 0 0 0 1 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 Montvila, Rev. Juozas 0 27.0 0 0 211536 13.0000 NaN ... 0 0 0 0 0 0 0 1 0 0
887 1 1 Graham, Miss. Margaret Edith 1 19.0 0 0 112053 30.0000 B42 ... 0 1 0 0 0 0 0 0 0 0
888 0 3 Johnston, Miss. Catherine Helen "Carrie" 1 28.0 1 2 W./C. 6607 23.4500 NaN ... 0 1 0 0 0 0 0 0 0 0
889 1 1 Behr, Mr. Karl Howell 0 26.0 0 0 111369 30.0000 C148 ... 0 0 0 0 1 0 0 0 0 0
890 0 3 Dooley, Mr. Patrick 0 32.0 0 0 370376 7.7500 NaN ... 0 0 0 0 1 0 0 0 0 0

891 rows × 31 columns
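
The role of the axis argument in concat can be seen on a small made-up example (left and right are hypothetical names). With axis=1 the columns are appended and the rows are matched by index; with axis=0 the rows are stacked.

```python
import pandas as pd

left = pd.DataFrame({"Age": [22.0, 38.0], "Title": ["Mr", "Mrs"]})
right = pd.get_dummies(left["Title"], dtype=int)

side_by_side = pd.concat([left, right], axis=1)  # more columns, same rows
stacked = pd.concat([left, left], axis=0)        # more rows, same columns

print(side_by_side.shape)
print(stacked.shape)
```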

We could keep going, but that’s all the feature engineering we’ll do for today. Let’s find the numeric columns. (Only numeric columns can be used in our Decision Tree.)

from pandas.api.types import is_numeric_dtype
num_cols = [c for c in df2.columns if is_numeric_dtype(df2[c])]
cols = [c for c in num_cols if c != "Survived"]
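
An alternative to the list comprehension is the select_dtypes method. The sketch below compares the two on a made-up DataFrame (toy is a hypothetical name) containing integer, float, and text columns.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

toy = pd.DataFrame({
    "Survived": [0, 1],
    "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"],
    "Fare": [7.25, 7.925],
})

# The approach used above
num_cols = [c for c in toy.columns if is_numeric_dtype(toy[c])]

# Alternative: let pandas select the numeric columns
alt_cols = toy.select_dtypes(include="number").columns.tolist()

# Remove the target column, as above
cols = [c for c in num_cols if c != "Survived"]
print(num_cols, alt_cols, cols)
```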

Baseline prediction

What would count as a good accuracy? If we simply predict that nobody survived, we get an accuracy of 61.45 percent on the test data below, so that is a good baseline against which to measure our performance.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df2[cols],
    df2["Survived"],
    test_size=0.2,
    random_state=0
)
1-y_test.mean()
0.6145251396648045
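
The expression 1-y_test.mean() works because the mean of a column of 0s and 1s is the fraction of 1s. A small sketch with made-up labels (y below is hypothetical) makes the arithmetic explicit.

```python
import pandas as pd

y = pd.Series([0, 0, 1, 0, 1])  # hypothetical labels: 2 of 5 survived

survived_fraction = y.mean()              # fraction of 1s
always_zero_accuracy = 1 - y.mean()       # accuracy of predicting 0 for everyone

# The same baseline, computed as the share of the most common class
majority_accuracy = y.value_counts(normalize=True).max()

print(survived_fraction, always_zero_accuracy, majority_accuracy)
```

The two baselines coincide here because 0 is the most common class, as it is in the Titanic test data.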

Fitting a decision tree

from sklearn.tree import DecisionTreeClassifier

If we specify a maximum depth of 7, then the tree has at most \(2^7 = 128\) leaves, since each level can at most double the number of nodes. We will further restrict it to 40 leaves. (The values 7 and 40 were chosen more or less arbitrarily.)

clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)
clf.fit(X_train, y_train)
DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40)
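
To check that both restrictions are respected, one can ask a fitted tree for its actual depth and number of leaves. The sketch below uses synthetic data from make_classification instead of the Titanic data, and the random_state values are assumptions added for reproducibility.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

clf = DecisionTreeClassifier(max_depth=7, max_leaf_nodes=40, random_state=0)
clf.fit(X, y)

# Both limits act as upper bounds on the fitted tree
print(clf.get_depth(), clf.get_n_leaves())
```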

The gap between 89% training accuracy and 81% test accuracy is large enough to suggest some overfitting, but it is not drastic. (At least our test accuracy is much better than the baseline from above.)

clf.score(X_train, y_train)
0.8946629213483146
clf.score(X_test, y_test)
0.8100558659217877
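
One way to get a feel for overfitting is to compare training and test accuracy for several values of max_depth. The following is a sketch on synthetic data from make_classification (not the Titanic data), with depths chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the Titanic dataset
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scores = {}
for depth in [1, 3, 7, None]:  # None means no depth limit
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    scores[depth] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for depth, (train_acc, test_acc) in scores.items():
    print(depth, round(train_acc, 3), round(test_acc, 3))
```

With no depth limit the tree typically fits the training data perfectly, so a wide gap between the two numbers in that row is a sign of overfitting.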

Let’s visualize the tree, similarly to what you did in the discussion section yesterday.

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(300,200))
plot_tree(
    clf,
    feature_names=clf.feature_names_in_,
    filled=True
);
[Figure: the fitted decision tree, drawn by plot_tree]

One of the best features of decision trees is that they are very interpretable. For example, the following records how much each feature (column) contributed to the decision tree. It might seem surprising that the “Sex” column has zero importance, but notice that the “Mr” column was very important, and that column already carries much of the same information.

pd.Series(clf.feature_importances_, index=cols)
Pclass           0.137991
Sex              0.000000
Age              0.054578
SibSp            0.013301
Parch            0.024459
Fare             0.163993
AgeNull          0.000000
NameLength       0.062319
 Capt            0.008217
 Col             0.000000
 Don             0.000000
 Dr              0.009907
 Jonkheer        0.000000
 Lady            0.000000
 Major           0.003665
 Master          0.000000
 Miss            0.000000
 Mlle            0.000000
 Mme             0.000000
 Mr              0.498170
 Mrs             0.000000
 Ms              0.000000
 Rev             0.023400
 Sir             0.000000
 the Countess    0.000000
dtype: float64
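
With this many columns, it can help to sort the importances so the largest appear first. The sketch below does this for a small tree fit on synthetic data; the feature names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with made-up feature names
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

importances = pd.Series(clf.feature_importances_, index=feature_names)
ranked = importances.sort_values(ascending=False)
print(ranked)
```

The importances are normalized to sum to 1, so each value can be read as a share of the total contribution.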