Titanic Survival¶
Author: Brigitte Harijanto
Email: brigith@uci.edu
Course Project, UC Irvine, Math 10, W22
Introduction¶
In this project, I will be using the Titanic dataset, imported from Kaggle as train and test sets, which we have not used in class before. We will explore whether the fare a passenger paid and the cabin they stayed in can predict their survival using scikit-learn. Furthermore, we will compare the reliability of several machine learning models and average their cross-validation scores to gauge the machine's overall confidence on this question. Last, we will show graphically why Linear Regression is not the way to go.
Main portion of the project¶
Importing Files¶
import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss, mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LinearRegression
from torch import nn
training = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
training['train_test'] = 1
test['train_test'] = 0
test['Survived'] = np.NaN
df = pd.concat([training,test])
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
df.dropna(inplace=True)
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
6 | 7 | 0.0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 1 |
10 | 11 | 1.0 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 1 |
11 | 12 | 1.0 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
871 | 872 | 1.0 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S | 1 |
872 | 873 | 0.0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S | 1 |
879 | 880 | 1.0 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C | 1 |
887 | 888 | 1.0 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 1 |
889 | 890 | 1.0 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 1 |
183 rows × 13 columns
for k in df["Cabin"].unique():
df[f"Cabin_{k}"] = (df["Cabin"] == k)
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
We use .unique() to get every distinct cabin value, then use an f-string to name a new column for each one (e.g. Cabin_C85) and assign it True/False values, which scikit-learn can treat as numerical. The for loop repeats this for every distinct cabin, so each cabin ends up with its own indicator column.
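As an aside, the PerformanceWarning above appears because the loop inserts the new columns one at a time. The same indicator columns could be built in a single step; the following is only a sketch (cabin_dummies and df_alt are names introduced here, and the rest of the notebook keeps using the loop's result):
# Build one True/False indicator column per distinct Cabin value in one call,
# then attach them with a single concat instead of inserting columns one by one.
cabin_dummies = pd.get_dummies(df["Cabin"], prefix="Cabin", prefix_sep="_").astype(bool)
df_alt = pd.concat([df[["PassengerId", "Survived", "Fare", "Cabin"]], cabin_dummies], axis=1)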
df.head() #check that each cabin now has its own column
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Cabin_D17 | Cabin_A36 | Cabin_B69 | Cabin_E49 | Cabin_D28 | Cabin_E17 | Cabin_A24 | Cabin_C50 | Cabin_B42 | Cabin_C148 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | False | False | False | False | False | False | False | False | False | False |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | False | False | False | False | False | False | False | False | False | False |
6 | 7 | 0.0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | ... | False | False | False | False | False | False | False | False | False | False |
10 | 11 | 1.0 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | ... | False | False | False | False | False | False | False | False | False | False |
11 | 12 | 1.0 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 146 columns
#def order(char):
#return ord(char) - ord('A') +1
#df['deck_num'] = df['deck'].map(lambda s : order(s))
X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X = df.loc[:, X_colnames].copy()
y = df.loc[:, y_colname].copy()
[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
['Cabin_A10',
'Cabin_A16',
'Cabin_A20',
'Cabin_A23',
'Cabin_A24',
'Cabin_A26',
'Cabin_A31',
'Cabin_A34',
'Cabin_A36',
'Cabin_A5',
'Cabin_A6',
'Cabin_A7',
'Cabin_B101',
'Cabin_B18',
'Cabin_B19',
'Cabin_B20',
'Cabin_B22',
'Cabin_B3',
'Cabin_B30',
'Cabin_B35',
'Cabin_B37',
'Cabin_B38',
'Cabin_B39',
'Cabin_B4',
'Cabin_B41',
'Cabin_B42',
'Cabin_B49',
'Cabin_B5',
'Cabin_B50',
'Cabin_B51 B53 B55',
'Cabin_B57 B59 B63 B66',
'Cabin_B58 B60',
'Cabin_B69',
'Cabin_B71',
'Cabin_B73',
'Cabin_B77',
'Cabin_B79',
'Cabin_B80',
'Cabin_B82 B84',
'Cabin_B86',
'Cabin_B94',
'Cabin_B96 B98',
'Cabin_C101',
'Cabin_C103',
'Cabin_C104',
'Cabin_C110',
'Cabin_C111',
'Cabin_C118',
'Cabin_C123',
'Cabin_C124',
'Cabin_C125',
'Cabin_C126',
'Cabin_C148',
'Cabin_C2',
'Cabin_C22 C26',
'Cabin_C23 C25 C27',
'Cabin_C30',
'Cabin_C32',
'Cabin_C45',
'Cabin_C46',
'Cabin_C49',
'Cabin_C50',
'Cabin_C52',
'Cabin_C54',
'Cabin_C62 C64',
'Cabin_C65',
'Cabin_C68',
'Cabin_C7',
'Cabin_C70',
'Cabin_C78',
'Cabin_C82',
'Cabin_C83',
'Cabin_C85',
'Cabin_C86',
'Cabin_C87',
'Cabin_C90',
'Cabin_C91',
'Cabin_C92',
'Cabin_C93',
'Cabin_C99',
'Cabin_D',
'Cabin_D10 D12',
'Cabin_D11',
'Cabin_D15',
'Cabin_D17',
'Cabin_D19',
'Cabin_D20',
'Cabin_D26',
'Cabin_D28',
'Cabin_D30',
'Cabin_D33',
'Cabin_D35',
'Cabin_D36',
'Cabin_D37',
'Cabin_D46',
'Cabin_D47',
'Cabin_D48',
'Cabin_D49',
'Cabin_D50',
'Cabin_D56',
'Cabin_D6',
'Cabin_D7',
'Cabin_D9',
'Cabin_E10',
'Cabin_E101',
'Cabin_E12',
'Cabin_E121',
'Cabin_E17',
'Cabin_E24',
'Cabin_E25',
'Cabin_E31',
'Cabin_E33',
'Cabin_E34',
'Cabin_E36',
'Cabin_E38',
'Cabin_E40',
'Cabin_E44',
'Cabin_E46',
'Cabin_E49',
'Cabin_E50',
'Cabin_E58',
'Cabin_E63',
'Cabin_E67',
'Cabin_E68',
'Cabin_E77',
'Cabin_E8',
'Cabin_F G63',
'Cabin_F G73',
'Cabin_F2',
'Cabin_F33',
'Cabin_F4',
'Cabin_G6',
'Cabin_T']
Using scikit-learn’s StandardScaler, KNeighborsClassifier, and train_test_split¶
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
clf = KNeighborsClassifier(n_neighbors=10)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
len(X_train)
146
len(y_train)
146
scaler.fit(X_train)
scaler.fit(X_test)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
len(X_train_scaled)
146
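For reference, the usual convention is to fit the scaler only on the training split and then reuse that fit to transform both splits, so the test data stays unseen during fitting; a minimal sketch (the _alt names are mine, and this is not the workflow the cells above follow, since they re-fit the scaler on X_test):
# Fit the scaler on the training features only, apply the same transformation
# to the test features, and train/evaluate the classifier on the scaled data.
scaler_alt = StandardScaler()
X_train_alt = scaler_alt.fit_transform(X_train)
X_test_alt = scaler_alt.transform(X_test)
clf_alt = KNeighborsClassifier(n_neighbors=10)
clf_alt.fit(X_train_alt, y_train)
print(clf_alt.score(X_test_alt, y_test))  # accuracy on the held-out split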
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
clf.predict_proba(X_scaled)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
array([[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.6, 0.4],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.6, 0.4],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3]])
clf.score(X_test, y_test, sample_weight = None)
0.7567567567567568
clf.score(X_train, y_train, sample_weight = None)
0.7123287671232876
In the run shown above, the test-set score is actually slightly higher than the training-set score, and the two are close to each other, so there is no sign of severe overfitting; the exact values change between runs because train_test_split is random.
log_loss(df['Survived'], clf.predict_proba(X_scaled), labels = clf.classes_)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
0.9230261607842106
The log loss is below 1, though not close to 0; the lower the log loss, the closer the predicted probabilities are to the true outcomes, so there is still room for the model to improve.
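For context, log loss is the average of the negative log of the probability assigned to the true class, so a perfect model has log loss 0. A quick sanity check on a made-up two-sample example:
# Toy example: true labels 0 and 1, predicted probabilities for classes [0, 1].
toy_true = [0, 1]
toy_proba = [[0.7, 0.3], [0.4, 0.6]]
manual = -(np.log(0.7) + np.log(0.6)) / 2       # average of -log(p of the true class)
print(manual, log_loss(toy_true, toy_proba))    # the two values agree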
def get_scores(k):
    # Fit a k-nearest-neighbors classifier on the training split and
    # return its log loss on the held-out test split.
    reg = KNeighborsClassifier(n_neighbors=k)
    reg.fit(X_train, y_train)
    test_error = log_loss(y_test, reg.predict_proba(X_test), labels=reg.classes_)
    return test_error
for i in range(1,11):
print(f"when n_nearest neighbor is {i}, the test error is {get_scores(i)}")
when n_nearest neighbor is 1, the test error is 14.00220664658541
when n_nearest neighbor is 2, the test error is 4.948407829268989
when n_nearest neighbor is 3, the test error is 2.275243075462202
when n_nearest neighbor is 4, the test error is 1.4544746859879571
when n_nearest neighbor is 5, the test error is 0.5125348430463309
when n_nearest neighbor is 6, the test error is 0.5475811883637854
when n_nearest neighbor is 7, the test error is 0.572556653112447
when n_nearest neighbor is 8, the test error is 0.5541995482782791
when n_nearest neighbor is 9, the test error is 0.547555203527817
when n_nearest neighbor is 10, the test error is 0.5781849858754858
In the run shown above, k = 5 gives the lowest test error, with k = 6 through 10 close behind; because train_test_split is random, the best k can change from run to run.
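To make this comparison easier to read, the test error can also be plotted against k; a small sketch using the matplotlib import from the top of the notebook:
# Plot the test log loss for k = 1..10 so the best k is visible at a glance.
ks = list(range(1, 11))
errors = [get_scores(k) for k in ks]
plt.plot(ks, errors, marker="o")
plt.xlabel("n_neighbors (k)")
plt.ylabel("test log loss")
plt.title("KNN test error vs. number of neighbors")
plt.show()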
c1 = alt.Chart(df).mark_circle().encode(
x = "Cabin:O",
y = "Fare",
color = "Survived:N",
tooltip = ["Fare", "Cabin", "Survived"]
).properties(
title = "Fare paid for each deck",
height = 550,
width = 800
)
c1
Here we can see an outlier in the graph (in fact two passengers who each paid a fare above $500). Let’s remove them to see whether that makes the data more reliable.
df[df["Fare"] > 500] #check which rows has datas with fare > $500
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Cabin_D17 | Cabin_A36 | Cabin_B69 | Cabin_E49 | Cabin_D28 | Cabin_E17 | Cabin_A24 | Cabin_C50 | Cabin_B42 | Cabin_C148 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
679 | 680 | 1.0 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | ... | False | False | False | False | False | False | False | False | False | False |
737 | 738 | 1.0 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | ... | False | False | False | False | False | False | False | False | False | False |
2 rows × 146 columns
df1 = df[~(df["Fare"] > 500)] #Removing the anomaly from the dataset and setting it to df1
c2 = alt.Chart(df1).mark_circle().encode(
x = "Cabin:O",
y = "Fare",
color = "Survived:N",
tooltip = ["Fare", "Cabin", "Survived"]
).properties(
title = "Fare paid for each deck",
height = 550,
width = 800
)
c2
After removing the two outliers, the graph looks more evenly distributed. Since there are more orange (survived) dots than blue (did not survive) dots in the upper half of the graph, we can tell that people who paid more had a higher rate of survival. Now we can recompute our losses, errors, and scores to see whether anything changes once the outliers are removed.
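Since the next cells repeat the same split-fit-score steps on the filtered data, one way to keep the comparison consistent would be a small helper function; this is only a sketch (the name knn_scores is mine), and the notebook below simply repeats the cells instead:
def knn_scores(X, y, k=10, test_size=0.2):
    # Split the data, fit a k-NN classifier, and return (test accuracy, train accuracy).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size)
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te), model.score(X_tr, y_tr)
# Example usage (results vary between runs because the split is random):
# knn_scores(X, y)    # with the outliers
# knn_scores(X1, y1)  # without the outliers (X1 and y1 are built in the next cell)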
X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X1 = df1.loc[:, X_colnames].copy()
y1 = df1.loc[:, y_colname].copy()
#extra
Cabin_col = [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
clf1 = KNeighborsClassifier(n_neighbors=10)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1,y1,test_size=0.2)
len(X_train1)
144
scaler = StandardScaler()
scaler.fit(X_train1)
scaler.fit(X_test1)
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)
scaler.fit(X1)
X_scaled1 = scaler.transform(X1)
The difference between X_scaled and X_scaled1 is their length: X_scaled1 has fewer rows because the outliers have been dropped.
clf1.fit(X_train1, y_train1)
KNeighborsClassifier(n_neighbors=10)
clf1.predict_proba(X_scaled1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
array([[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3]])
clf1.score(X_test1, y_test1, sample_weight = None)
0.6486486486486487
clf1.score(X_train1, y_train1, sample_weight = None)
0.7291666666666666
log_loss(df1['Survived'], clf1.predict_proba(X_scaled1), labels = clf.classes_)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
0.923100585413051
In the run shown above, removing the outliers does not actually improve the numbers: the test-set score for df1 is lower than for df, and the log loss is essentially unchanged (the exact values vary between runs because the split is random). We will nevertheless proceed with df1 rather than df, since the two $500+ fares are extreme values that distort the fare scale. In both cases the training and test scores are reasonably close to each other, so neither model is drastically over- or under-fit; for df1 the training score is somewhat higher than the test score, which is a mild sign of overfitting but still under control.
Using KNeighborsRegressor¶
kreg = KNeighborsRegressor(n_neighbors=10)
kreg.fit(X_train1, y_train1)
KNeighborsRegressor(n_neighbors=10)
kreg.predict(X_scaled1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names
"X does not have valid feature names, but"
array([0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])
#since KNeighborsRegressor does not have a predict_proba attribute, we use .predict
kreg.score(X_test1, y_test1, sample_weight = None)
0.0010000000000000009
kreg.score(X_train1, y_train1, sample_weight = None)
0.1317979797979797
For a regressor, .score returns the coefficient of determination R² rather than an accuracy. The score is much higher on the training set than on the test set, a likely sign of overfitting, and both values are close to zero, so the model explains very little of the variation in survival.
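For reference, R² = 1 - SS_res/SS_tot compares the model's squared error to that of always predicting the mean; a quick check on made-up numbers shows the manual formula matches scikit-learn's r2_score:
from sklearn.metrics import r2_score
# R^2 compares the model's squared error to the error of always predicting the mean.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.7, 0.6, 0.3])
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both print 0.62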
mean_squared_error(df1['Survived'], kreg.predict(X_scaled1))
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names
"X does not have valid feature names, but"
0.35740331491712707
Through this comparison, we can see that KNeighborsClassifier is better suited here than KNeighborsRegressor: the gap between training and test scores is much smaller for the classifier, and its scores are much higher on both sets (keeping in mind that the classifier's score is accuracy while the regressor's is R², so the two are not on exactly the same scale). Higher scores are better, so KNeighborsClassifier is the more suitable choice.
Extra topics not learned in class¶
In the following blocks, I will use Linear Regression
, KNearestNeighbors
, and Random Forest
to compute scores of how well the machine can predict the survival rate according to the fare and deck the passengers are at, by using Cross Validation
.
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.67857143]
0.6667487684729064
Here we use max_iter to set the maximum number of iterations the solver is allowed to take while fitting the model.
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.72413793 0.62068966 0.72413793 0.5862069 0.53571429]
0.6381773399014778
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.71428571]
0.6738916256157635
Here we use cv = 5: cross_val_score splits the training features and target into 5 folds, fits the model on 4 of them, evaluates on the held-out fold, and repeats so that each fold is held out exactly once. The five scores printed inside the square brackets are the results for the five folds.
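A sketch of what cross_val_score does under the hood, written out with KFold (the variable names are mine; note that for classifiers cross_val_score actually uses stratified folds by default, so the numbers will not match exactly):
from sklearn.model_selection import KFold
# Manually run 5-fold cross-validation for the random forest: fit on four folds,
# score on the held-out fold, and repeat so that every fold is held out once.
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_train1):
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train1.iloc[train_idx], y_train1.iloc[train_idx])
    fold_scores.append(model.score(X_train1.iloc[val_idx], y_train1.iloc[val_idx]))
print(fold_scores, np.mean(fold_scores))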
Now we want to use the VotingClassifier. With soft voting, it averages the class probabilities predicted by each of the machine learning models above; if the averaged probability of survival is greater than 50%, the passenger is predicted to have survived, and otherwise not.
voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf)], voting = 'soft')
Here we use soft voting because we want the classifier to combine predicted probabilities (optionally weighted) rather than hard class labels.
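Before cross-validating the ensemble, here is a small illustration of what soft voting means: each fitted model's predicted probabilities are averaged with equal weights, and a passenger is predicted to survive when the averaged survival probability exceeds 0.5. This sketch fits all three models on the scaled training data, which is a slight simplification of the setup above:
# Fit the three individual models, average their predicted probabilities on the
# test split, and turn the averaged probability into a 0/1 prediction.
for model in (lr, knn, rf):
    model.fit(X_train_scaled1, y_train1)
avg_proba = np.mean([m.predict_proba(X_test_scaled1) for m in (lr, knn, rf)], axis=0)
manual_soft_vote = (avg_proba[:, 1] > 0.5).astype(int)
print((manual_soft_vote == y_test1).mean())  # accuracy of the hand-rolled soft vote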
cv = cross_val_score(voting_clf,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.68965517 0.62068966 0.71428571]
0.6807881773399015
Here we can see from the cross-validation scores that the combined model is about 68% accurate on average across the folds. Among the individual models tested, Random Forest has the highest mean score, followed closely by Logistic Regression, with K-Nearest Neighbors last. Now let’s test the data with Logistic Regression and Linear Regression.
Using Logistic Regression¶
clfo = LogisticRegression()
clfo.fit(X_train1, y_train1)
LogisticRegression()
clfo.predict(X_train1)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1.])
np.count_nonzero(clfo.predict(X_test1) == y_test1)/len(X_test1)
0.5945945945945946
np.count_nonzero(clfo.predict(X_train1) == y_train1)/len(X_train1)
0.7222222222222222
Here we can see that, in this run, Logistic Regression is about 59% accurate on the test set and about 72% accurate on the training set (the exact values depend on the random split).
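The accuracy computed by hand above with np.count_nonzero is exactly what the classifier's score method returns, so the same numbers can be obtained more directly:
# .score returns mean accuracy, i.e. the fraction of correct predictions.
print(clfo.score(X_test1, y_test1))    # matches the manual count on the test set
print(clfo.score(X_train1, y_train1))  # matches the manual count on the training set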
Using LinearRegression¶
reg = LinearRegression()
reg.fit(df[["Fare",]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    # Overlay the fitted line y = m*x + b (in red) on a scatter plot of Fare vs. Survived.
    alt.data_transformers.disable_max_rows()
    d1 = alt.Chart(df1).mark_circle().encode(
        x = "Fare",
        y = "Survived"
    )
    xmax = 40
    df_line = pd.DataFrame({"Fare":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Fare",
        y = "Survived"
    )
    return d1+d2
Although this is really a classification problem suited to Logistic Regression, we will try Linear Regression anyway.
draw_line(0.00082767, 0.6070082823070905)
Here it is obvious why we do not use Linear Regression: Survived is categorical, either yes or no with nothing in between (people can’t half-survive), yet the regression line predicts values in between. Since the intercept is above 0.5, the model predicts that a passenger who paid nothing is more likely than not to survive (most passengers in this cabin-only subset did survive), and the positive but tiny coefficient (about 0.0008) means the fare barely moves the prediction; for very large fares the predicted value would even exceed 1, which makes no sense for a 0/1 outcome. We would expect that the more a passenger paid, the more likely they were to survive, so this further shows that linear regression is not the right tool here.
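To see the contrast, we can fit a logistic regression on Fare alone and plot its predicted survival probability next to the straight line from the linear fit; a sketch (fare_grid and logit_fare are names introduced here):
# Compare the linear fit (reg, fitted above) with a logistic curve on the same feature.
fare_grid = pd.DataFrame({"Fare": np.linspace(0, 300, 200)})
logit_fare = LogisticRegression()
logit_fare.fit(df1[["Fare"]], df1["Survived"])
plt.scatter(df1["Fare"], df1["Survived"], alpha=0.3, label="passengers")
plt.plot(fare_grid["Fare"], reg.predict(fare_grid), label="linear regression")
plt.plot(fare_grid["Fare"], logit_fare.predict_proba(fare_grid)[:, 1], label="logistic regression probability")
plt.xlabel("Fare")
plt.ylabel("Survived / predicted probability")
plt.legend()
plt.show()
Unlike the straight line, the logistic curve always stays between 0 and 1, which is what we want for a survival probability.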
Now let’s see whether using Cabin as the x-axis works at all.
reg1 = LinearRegression()
reg1.fit(df[[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    # Attempt the same overlay with the categorical Cabin values on the x-axis;
    # the line endpoints y = b and y = b + xmax*m have no meaningful position here.
    alt.data_transformers.disable_max_rows()
    d1 = alt.Chart(df).mark_circle().encode(
        x = "Cabin:O",
        y = "Survived"
    )
    xmax = 40
    df_line = pd.DataFrame({"Cabin":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Cabin",
        y = "Survived"
    )
    return d1+d2
draw_line(1.1276363e+14, -112763629749553.84)
Seeing how the graph does not make any sense with Cabin on the x-axis, Linear Regression will not work in this case: the model multiplies a coefficient by the x value, but Cabin is categorical, so there is no meaningful number to multiply.
Summary¶
In this project, I used scikit-learn to carry out the machine learning process. First, I compared KNeighborsClassifier and KNeighborsRegressor. Looking at the training and test scores, the classifier is clearly the better choice: its two scores are close to each other and much higher, while the regressor’s scores are near zero and far apart, a sign of overfitting. Next, I used cross-validation scores and a voting classifier to measure how confident the machine is in predicting survival, using the models Logistic Regression, K-Nearest Neighbors, and Random Forest; in the run shown above, Random Forest had the highest individual mean and the soft-voting ensemble averaged about 68%, so the models do somewhat better than chance but are not especially reliable. Following that, I looked at Logistic Regression and Linear Regression. Logistic Regression is suitable for classification problems, so I simply compared its accuracy on the training and test sets: roughly 72% versus 59% in this run, which indicates some overfitting, though it is still under control. For Linear Regression, I had doubts about whether it could be used at all, so I graphed it, and the plots confirm that it does not make sense for a 0/1 target. Overall, I would not recommend using these models to predict survival from fare and cabin alone.
References¶
Dataset: https://www.kaggle.com/c/titanic/data
Exploration: https://www.youtube.com/watch?v=I3FBJdiExcg
Reference for new material: https://www.kaggle.com/kenjee/titanic-project-example
Understanding of cv parameter: https://stackoverflow.com/questions/52611498/need-help-understanding-cross-val-score-in-sklearn-python
KNeighborsRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor
Referenced from Homework 6, the Week 6 video notebook, and the Week 9 Monday Linear Regression worksheet from Math 10.