Titanic Survival

Author: Brigitte Harijanto

Email: brigith@uci.edu

Course Project, UC Irvine, Math 10, W22

Introduction

In this project, I will be using the Titanic dataset, imported from Kaggle as train and test, which has never been used in class before. Here, we will explore whether the fare a person paid and the cabin they stayed in can predict their survival by using scikit-learn. Furthermore, we will compare the reliability of several machine learning models and average their scores to check the machine's overall confidence on this matter. Last, we will show graphically why Linear Regression is not the way to go.

Main portion of the project

Importing Files

import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss, mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LinearRegression
from torch import nn
training = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

training['train_test'] = 1
test['train_test'] = 0
test['Survived'] = np.NaN
df = pd.concat([training,test])
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked train_test
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S 1
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S 1
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S 1
df.dropna(inplace=True)
df
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked train_test
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C 1
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S 1
6 7 0.0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S 1
10 11 1.0 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S 1
11 12 1.0 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
871 872 1.0 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S 1
872 873 0.0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S 1
879 880 1.0 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C 1
887 888 1.0 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S 1
889 890 1.0 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C 1

183 rows × 13 columns

for k in df["Cabin"].unique():
    df[f"Cabin_{k}"] = (df["Cabin"] == k)
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:2: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead.  To get a de-fragmented frame, use `newframe = frame.copy()`
  

Using .unique() to get all distinct cabin values, then using f-strings to name a new column for each cabin and assigning it True/False (numerical) values. The for loop repeats this for every distinct cabin, so each one ends up with its own indicator column. (An alternative one-step construction is sketched below.)
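As the PerformanceWarning above hints, the same indicator columns could also be built in one step. This is only a sketch of an alternative, assuming the same df with its Cabin column; the variable names cabin_dummies and df_alt are just for this illustration.

# Alternative sketch: build every Cabin indicator column in a single step and
# attach them with pd.concat, avoiding the fragmentation warning that comes
# from inserting columns one at a time.
cabin_dummies = pd.get_dummies(df["Cabin"], prefix="Cabin", prefix_sep="_")
df_alt = pd.concat([df, cabin_dummies], axis=1)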

df.head() #check if each Cabin has its own column
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Cabin_D17 Cabin_A36 Cabin_B69 Cabin_E49 Cabin_D28 Cabin_E17 Cabin_A24 Cabin_C50 Cabin_B42 Cabin_C148
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 ... False False False False False False False False False False
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 ... False False False False False False False False False False
6 7 0.0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 ... False False False False False False False False False False
10 11 1.0 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 ... False False False False False False False False False False
11 12 1.0 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 ... False False False False False False False False False False

5 rows × 146 columns

#def order(char):
    #return ord(char) - ord('A') +1
#df['deck_num'] = df['deck'].map(lambda s : order(s))
X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X = df.loc[:, X_colnames].copy()
y = df.loc[:, y_colname].copy()
[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
['Cabin_A10',
 'Cabin_A16',
 'Cabin_A20',
 'Cabin_A23',
 'Cabin_A24',
 'Cabin_A26',
 'Cabin_A31',
 'Cabin_A34',
 'Cabin_A36',
 'Cabin_A5',
 'Cabin_A6',
 'Cabin_A7',
 'Cabin_B101',
 'Cabin_B18',
 'Cabin_B19',
 'Cabin_B20',
 'Cabin_B22',
 'Cabin_B3',
 'Cabin_B30',
 'Cabin_B35',
 'Cabin_B37',
 'Cabin_B38',
 'Cabin_B39',
 'Cabin_B4',
 'Cabin_B41',
 'Cabin_B42',
 'Cabin_B49',
 'Cabin_B5',
 'Cabin_B50',
 'Cabin_B51 B53 B55',
 'Cabin_B57 B59 B63 B66',
 'Cabin_B58 B60',
 'Cabin_B69',
 'Cabin_B71',
 'Cabin_B73',
 'Cabin_B77',
 'Cabin_B79',
 'Cabin_B80',
 'Cabin_B82 B84',
 'Cabin_B86',
 'Cabin_B94',
 'Cabin_B96 B98',
 'Cabin_C101',
 'Cabin_C103',
 'Cabin_C104',
 'Cabin_C110',
 'Cabin_C111',
 'Cabin_C118',
 'Cabin_C123',
 'Cabin_C124',
 'Cabin_C125',
 'Cabin_C126',
 'Cabin_C148',
 'Cabin_C2',
 'Cabin_C22 C26',
 'Cabin_C23 C25 C27',
 'Cabin_C30',
 'Cabin_C32',
 'Cabin_C45',
 'Cabin_C46',
 'Cabin_C49',
 'Cabin_C50',
 'Cabin_C52',
 'Cabin_C54',
 'Cabin_C62 C64',
 'Cabin_C65',
 'Cabin_C68',
 'Cabin_C7',
 'Cabin_C70',
 'Cabin_C78',
 'Cabin_C82',
 'Cabin_C83',
 'Cabin_C85',
 'Cabin_C86',
 'Cabin_C87',
 'Cabin_C90',
 'Cabin_C91',
 'Cabin_C92',
 'Cabin_C93',
 'Cabin_C99',
 'Cabin_D',
 'Cabin_D10 D12',
 'Cabin_D11',
 'Cabin_D15',
 'Cabin_D17',
 'Cabin_D19',
 'Cabin_D20',
 'Cabin_D26',
 'Cabin_D28',
 'Cabin_D30',
 'Cabin_D33',
 'Cabin_D35',
 'Cabin_D36',
 'Cabin_D37',
 'Cabin_D46',
 'Cabin_D47',
 'Cabin_D48',
 'Cabin_D49',
 'Cabin_D50',
 'Cabin_D56',
 'Cabin_D6',
 'Cabin_D7',
 'Cabin_D9',
 'Cabin_E10',
 'Cabin_E101',
 'Cabin_E12',
 'Cabin_E121',
 'Cabin_E17',
 'Cabin_E24',
 'Cabin_E25',
 'Cabin_E31',
 'Cabin_E33',
 'Cabin_E34',
 'Cabin_E36',
 'Cabin_E38',
 'Cabin_E40',
 'Cabin_E44',
 'Cabin_E46',
 'Cabin_E49',
 'Cabin_E50',
 'Cabin_E58',
 'Cabin_E63',
 'Cabin_E67',
 'Cabin_E68',
 'Cabin_E77',
 'Cabin_E8',
 'Cabin_F G63',
 'Cabin_F G73',
 'Cabin_F2',
 'Cabin_F33',
 'Cabin_F4',
 'Cabin_G6',
 'Cabin_T']

Using scikit-learn's StandardScaler, KNeighborsClassifier, and train_test_split

scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
clf = KNeighborsClassifier(n_neighbors=10)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
len(X_train)
146
len(y_train)
146
scaler.fit(X_train)   # fit the scaler on the training set only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
len(X_train_scaled)
146
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
clf.predict_proba(X_scaled)
array([[0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.6, 0.4],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.6, 0.4],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3]])
clf.score(X_test, y_test, sample_weight = None)
0.7567567567567568
clf.score(X_train, y_train, sample_weight = None)
0.7123287671232876

When I ran the code, the score on the test set was actually slightly higher than on the training set, so the model is not overfitting; in any case the two scores are close, so the fit seems under control.

log_loss(df['Survived'], clf.predict_proba(X_scaled), labels = clf.classes_)
0.9230261607842106

The log loss is less than 1, i.e. not far from 0, which suggests the classifier's probability estimates carry some real information about survival; the baseline comparison below puts this number in context.
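To put that number in context, here is a hedged sketch (reusing the df, clf, and X_scaled objects defined above; baseline_proba is just a name for this sketch). It compares the model's log loss with that of a baseline which always predicts the overall survival frequency; a model with informative probabilities should beat this baseline.

# Baseline: predict the overall survival rate for every passenger, then
# compare its log loss against the fitted classifier's log loss.
p = df["Survived"].mean()
baseline_proba = np.column_stack([np.full(len(df), 1 - p), np.full(len(df), p)])
print("baseline log loss:", log_loss(df["Survived"], baseline_proba))
print("model log loss:   ", log_loss(df["Survived"], clf.predict_proba(X_scaled), labels=clf.classes_))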

def get_scores(k):
    # fit a K-nearest-neighbors classifier and return its log loss on the test set
    reg = KNeighborsClassifier(n_neighbors=k)
    reg.fit(X_train, y_train)
    test_error = log_loss(y_test, reg.predict_proba(X_test), labels = reg.classes_)
    return test_error
for i in range(1,11):
    print(f"when n_nearest neighbor is {i}, the test error is {get_scores(i)}")
when n_nearest neighbor is 1, the test error is 14.00220664658541
when n_nearest neighbor is 2, the test error is 4.948407829268989
when n_nearest neighbor is 3, the test error is 2.275243075462202
when n_nearest neighbor is 4, the test error is 1.4544746859879571
when n_nearest neighbor is 5, the test error is 0.5125348430463309
when n_nearest neighbor is 6, the test error is 0.5475811883637854
when n_nearest neighbor is 7, the test error is 0.572556653112447
when n_nearest neighbor is 8, the test error is 0.5541995482782791
when n_nearest neighbor is 9, the test error is 0.547555203527817
when n_nearest neighbor is 10, the test error is 0.5781849858754858

When I ran the code, the test error dropped sharply up to K = 5 and stayed low (roughly 0.51 to 0.58) for K = 5 through 10, so any value in that range is reasonable; I keep n_neighbors = 10 for the classifiers below. A cross-validated version of the same search is sketched next.
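Because the numbers above depend on one random train/test split, a cross-validated version of the same search is less noisy. This is only a sketch, reusing X_train and y_train from above and scoring by log loss (cross_val_score reports the negative log loss, so the sign is flipped back).

# Sketch: choose K by 5-fold cross-validation on the training set instead of
# a single split; a lower mean log loss is better.
for k in range(1, 11):
    cv_loss = -cross_val_score(KNeighborsClassifier(n_neighbors=k),
                               X_train, y_train, cv=5, scoring="neg_log_loss")
    print(f"K={k}: mean CV log loss = {cv_loss.mean():.3f}")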

c1 = alt.Chart(df).mark_circle().encode(
    x = "Cabin:O",
    y = "Fare",
    color = "Survived:N",
    tooltip = ["Fare", "Cabin", "Survived"]
).properties(
    title = "Fare paid for each deck",
    height = 550,
    width = 800
)

c1

Here we can see that there is an outlier in the graph; now let's get rid of it to see if it makes the data more reliable.

df[df["Fare"] > 500] #check which rows has datas with fare > $500
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Cabin_D17 Cabin_A36 Cabin_B69 Cabin_E49 Cabin_D28 Cabin_E17 Cabin_A24 Cabin_C50 Cabin_B42 Cabin_C148
679 680 1.0 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 ... False False False False False False False False False False
737 738 1.0 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 ... False False False False False False False False False False

2 rows × 146 columns

df1 = df[~(df["Fare"] > 500)] #Removing the anomaly from the dataset and setting it to df1
c2 = alt.Chart(df1).mark_circle().encode(
    x = "Cabin:O",
    y = "Fare",
    color = "Survived:N",
    tooltip = ["Fare", "Cabin", "Survived"]
).properties(
    title = "Fare paid for each deck",
    height = 550,
    width = 800
)

c2

After removing the two outliers, the graph looks more evenly distributed. Since there are more orange (survived) dots than blue (did not survive) dots in the upper half of the graph, we can tell that the people who paid more had a higher rate of survival. Now we can recalculate our losses, errors, and scores to see whether anything changes once the outliers are removed. (A more automatic way of choosing the cutoff is sketched below.)
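The $500 cutoff above was chosen by eye from the chart. As a hedged alternative sketch (assuming the same df; q1, q3, and upper are names used only here), a common rule of thumb flags fares more than 1.5 interquartile ranges above the third quartile:

# Flag fares more than 1.5 IQRs above the third quartile (a common heuristic).
q1, q3 = df["Fare"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
print("upper cutoff:", upper)
print("rows flagged:", (df["Fare"] > upper).sum())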

X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X1 = df1.loc[:, X_colnames].copy()
y1 = df1.loc[:, y_colname].copy()
#extra 
Cabin_col = [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
clf1 = KNeighborsClassifier(n_neighbors=10)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1,y1,test_size=0.2)
len(X_train1)
144
scaler = StandardScaler()
scaler.fit(X_train1)   # fit the scaler on the training split only
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)
X_scaled1 = StandardScaler().fit_transform(X1)   # separately scaled copy of all of df1's features

The difference between X_scaled and X_scaled1 is their length: X_scaled1 has fewer rows because the anomalies have been dropped.

clf1.fit(X_train1, y_train1)
KNeighborsClassifier(n_neighbors=10)
clf1.predict_proba(X_scaled1)
array([[0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3],
       [0.7, 0.3]])
clf1.score(X_test1, y_test1, sample_weight = None)
0.6486486486486487
clf1.score(X_train1, y_train1, sample_weight = None)
0.7291666666666666
log_loss(df1['Survived'], clf1.predict_proba(X_scaled1), labels = clf.classes_)
0.923100585413051

The results vary from run to run because of the random train/test split. In the run shown here, the log loss on df1 is essentially the same as on df and the test score is somewhat lower, so removing the outliers did not change the model dramatically; we will nevertheless proceed with df1 rather than df. For both datasets the training and test scores are reasonably close to each other, so neither fit is badly over- or underfitted.

Using KNeighborsRegressor

kreg = KNeighborsRegressor(n_neighbors=10)
kreg.fit(X_train1, y_train1)
KNeighborsRegressor(n_neighbors=10)
kreg.predict(X_scaled1)
array([0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
       0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])
#since KNeighborsRegressor does not have a predict_proba attribute, we use .predict
kreg.score(X_test1, y_test1, sample_weight = None)
0.0010000000000000009
kreg.score(X_train1, y_train1, sample_weight = None)
0.1317979797979797

Since the score is much better on the training set than on the test set, this is very likely a sign of overfitting.

mean_squared_error(df1['Survived'], kreg.predict(X_scaled1))
0.35740331491712707

Through this comparison, we can see that KNeighborsClassifier is better suited here than KNeighborsRegressor: the gap between its training and test scores is much smaller, and both of its scores are far higher than the regressor's, which are close to zero. Keep in mind that the two .score methods measure different things (accuracy for the classifier, R² for the regressor), as the sketch below illustrates, but the pattern still clearly favors KNeighborsClassifier.
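As a sanity check on that comparison, here is a small sketch reusing kreg, clf1, X_test1, and y_test1 from above (kreg_labels is a name used only here). Thresholding the regressor's continuous predictions at 0.5 converts them into 0/1 labels whose accuracy can be compared directly with the classifier's.

# Turn the regressor's continuous predictions into 0/1 labels at a 0.5
# threshold, then compute plain accuracy for a like-for-like comparison.
kreg_labels = (kreg.predict(X_test1) >= 0.5).astype(float)
print("KNeighborsRegressor accuracy (thresholded):", np.mean(kreg_labels == y_test1))
print("KNeighborsClassifier accuracy:", clf1.score(X_test1, y_test1))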

Extra topics not learned in class

In the following blocks, I will use Logistic Regression, K-Nearest Neighbors, and Random Forest to compute scores of how well the machine can predict survival from the fare a passenger paid and the cabin they stayed in, using cross-validation.

lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.67857143]
0.6667487684729064

Here we use max_iter to set the maximum number of iterations the solver is allowed to take before it stops.

knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.72413793 0.62068966 0.72413793 0.5862069  0.53571429]
0.6381773399014778
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.71428571]
0.6738916256157635

Here we use cv = 5, which produces the five scores printed in the square brackets: the training features and target are split into 5 folds, and for each fold the model is fit on the other 4 folds and evaluated on the held-out fold; .mean() then averages the five scores. The KFold sketch below shows the same procedure written out by hand.
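To make that concrete, here is a sketch of the same procedure written out by hand with KFold, reusing X_train1 and y_train1 from above (fold_scores is a name used only here). The numbers will not match cross_val_score exactly, because for classifiers cross_val_score uses stratified folds by default, but the idea is identical.

from sklearn.model_selection import KFold

# Split the training data into 5 folds: fit on 4 of them, score on the
# held-out fold, and average the 5 resulting accuracies.
kf = KFold(n_splits=5)
fold_scores = []
for fit_idx, val_idx in kf.split(X_train1):
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train1.iloc[fit_idx], y_train1.iloc[fit_idx])
    fold_scores.append(model.score(X_train1.iloc[val_idx], y_train1.iloc[val_idx]))
print(fold_scores)
print(np.mean(fold_scores))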

Now we want to use the voting classifier. The voting classifier combines the individual machine learning models: it pools their predictions, and if the combined estimate of the probability of survival is above 50%, the passenger is predicted to have survived, and vice versa.

voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf)], voting = 'soft') 

Here we use soft voting because we want the classifier to decide based on each model's predicted probabilities (and weights) rather than on a majority vote over hard class labels; a small sketch of this averaging is shown below.
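For intuition, soft voting amounts to averaging each model's predict_proba output, something like the sketch below, reusing lr, knn, rf, X_train_scaled1, and y_train1 from above (avg_proba is a name used only here; the real VotingClassifier also supports per-model weights).

# Fit each model, average their predicted probabilities for a few rows, and
# predict "survived" wherever the averaged survival probability exceeds 0.5.
for model in (lr, knn, rf):
    model.fit(X_train_scaled1, y_train1)
avg_proba = np.mean([m.predict_proba(X_train_scaled1[:5]) for m in (lr, knn, rf)], axis=0)
print(avg_proba)                            # averaged [P(died), P(survived)] per row
print((avg_proba[:, 1] > 0.5).astype(int))  # the resulting soft-vote predictions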

cv = cross_val_score(voting_clf,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.68965517 0.62068966 0.71428571]
0.6807881773399015

Here we can see from the cross-validation scores that the combined machine is about 68% accurate on average across the models tested. Among the individual models in this run, the Random Forest classifier had the highest mean score, followed by Logistic Regression and then K-Nearest Neighbors. Now let's test the data with Logistic Regression and Linear Regression.

Using Logistic Regression

clfo = LogisticRegression()
clfo.fit(X_train1, y_train1)
LogisticRegression()
clfo.predict(X_train1)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
       1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1.])
np.count_nonzero(clfo.predict(X_test1) == y_test1)/len(X_test1)
0.5945945945945946
np.count_nonzero(clfo.predict(X_train1) == y_train1)/len(X_train1)
0.7222222222222222

Here we can see that, in this run, Logistic Regression is about 59% accurate on the test set and about 72% accurate on the training set (the built-in .score method, shown below, gives the same numbers).
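The same accuracies can be read off directly with .score, which for classifiers returns the fraction of correct predictions (a quick check, reusing clfo and the split from above).

# .score computes exactly the accuracy counted by hand above.
print(clfo.score(X_test1, y_test1))
print(clfo.score(X_train1, y_train1))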

Using LinearRegression

reg = LinearRegression()
reg.fit(df[["Fare",]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    alt.data_transformers.disable_max_rows()

    d1 = alt.Chart(df1).mark_circle().encode(
        x = "Fare",
        y = "Survived"
    )

    xmax = 40
    # the regression line passes through (0, b) and (xmax, m*xmax + b)
    df_line = pd.DataFrame({"Fare":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Fare",
        y = "Survived"
    )
    return d1+d2

Although this is really a classification problem, better suited to Logistic Regression, we will try using Linear Regression.

draw_line(0.00082767, 0.6070082823070905)

Here it is obvious why we do not use Linear Regression: Survived is categorical, either yes or no, with nothing in between, because people can't half-survive. Since the intercept is above 0.5, the line predicts survival as slightly more likely even at a fare of 0, and the coefficient is so small (about 0.0008) that the line is nearly flat over this fare range, so it barely separates survivors from non-survivors. We would expect that the more a passenger pays, the more likely they are to survive, but a straight line through 0/1 outcomes cannot express that well, which further shows that Linear Regression is not the right tool here. For contrast, the sketch below shows the S-shaped probability curve that Logistic Regression fits to the same feature.
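For contrast, a hedged sketch (reusing df1, alt, and LogisticRegression from above; log_fare and fare_grid are names used only for this illustration) of what Logistic Regression does with the same single feature: it fits a probability curve that stays between 0 and 1 instead of an unbounded straight line.

# Fit logistic regression on Fare alone and plot its predicted survival
# probability over a grid of fares, on top of the raw 0/1 outcomes.
log_fare = LogisticRegression()
log_fare.fit(df1[["Fare"]], df1["Survived"])

fare_grid = pd.DataFrame({"Fare": np.linspace(0, 300, 200)})
fare_grid["Survived"] = log_fare.predict_proba(fare_grid[["Fare"]])[:, 1]

points = alt.Chart(df1).mark_circle().encode(x="Fare", y="Survived")
curve = alt.Chart(fare_grid).mark_line(color="red").encode(x="Fare", y="Survived")
points + curve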

Now let's see whether using Cabin as the x-axis works at all.

reg1 = LinearRegression()
reg1.fit(df[[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    alt.data_transformers.disable_max_rows()

    d1 = alt.Chart(df).mark_circle().encode(
        x = "Cabin:O",
        y = "Survived"
    )

    xmax = 40
    df_line = pd.DataFrame({"Cabin":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Cabin",
        y = "Survived"
    )
    return d1+d2
draw_line(1.1276363e+14, -112763629749553.84)

Seeing how the graph does not make any sense for Cabin, Linear Regression will not work in this case: the model multiplies its coefficient by the x value, but Cabin is categorical, so drawing a single straight line over cabin labels is meaningless.

Summary

In this project, I used scikit-learn to do the machine learning. First, I compared KNeighborsClassifier and KNeighborsRegressor. Looking at the training and test scores, the classifier is clearly the better choice: the gap between its two scores is small, while the regressor overfits badly and scores close to zero. Next, I used cross-validation scores and a voting classifier to get an average of how confident the machine is in predicting survival with three models: Logistic Regression, K-Nearest Neighbors, and Random Forest. In my run, Random Forest had the highest individual mean score, and the ensemble's mean accuracy was about 68%, so the machine is on the better side of guessing but not exactly reliable. Following that, I used Logistic Regression and Linear Regression. Logistic Regression is suitable for classification problems, so I simply compared its training accuracy (about 72%) with its test accuracy (about 59%); the model is somewhat overfitted, but the overfitting is still under control. For Linear Regression, I had doubts about whether it could be used at all, so I graphed it, and the plots confirm that it does not make sense for a yes/no outcome. Overall, I wouldn't recommend using the machine to predict survival from fare and cabin alone.