Titanic Survival¶
Author: Brigitte Harijanto
Email: brigith@uci.edu
Course Project, UC Irvine, Math 10, W22
Introduction¶
In this project, I will be using the Titanic dataset, imported from Kaggle as train and test sets, which we have not used in class before. We will explore whether the fare a passenger paid and the cabin they stayed in can predict their survival using scikit-learn. Furthermore, we will compare the reliability of several machine learning models and average their cross-validation scores to gauge the machine's overall confidence on this question. Last, we will show graphically why Linear Regression is not the way to go.
Main portion of the project¶
Importing Files¶
import seaborn as sns
import numpy as np
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.metrics import log_loss, mean_squared_error, mean_absolute_error
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LinearRegression
from torch import nn
training = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
training['train_test'] = 1
test['train_test'] = 0
test['Survived'] = np.NaN
df = pd.concat([training,test])
df.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 1 |
df.dropna(inplace=True)
df
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | train_test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
6 | 7 | 0.0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 1 |
10 | 11 | 1.0 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 1 |
11 | 12 | 1.0 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
871 | 872 | 1.0 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S | 1 |
872 | 873 | 0.0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S | 1 |
879 | 880 | 1.0 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C | 1 |
887 | 888 | 1.0 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S | 1 |
889 | 890 | 1.0 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C | 1 |
183 rows × 13 columns
for k in df["Cabin"].unique():
df[f"Cabin_{k}"] = (df["Cabin"] == k)
/shared-libs/python3.7/py-core/lib/python3.7/site-packages/ipykernel_launcher.py:2: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
We use .unique() to get every distinct cabin value, then use an f-string to name a new column for each one (e.g. Cabin_C85) and assign it True/False values, which scikit-learn can treat as numerical. The for loop repeats this for every distinct cabin, so each cabin ends up with its own indicator column.
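As an aside, the PerformanceWarning above appears because the loop inserts the new columns one at a time. The same indicator columns could be built in a single step; the following is only a sketch (cabin_dummies and df_alt are names introduced here, and the rest of the notebook keeps using the loop's result):
# Build one True/False indicator column per distinct Cabin value in one call,
# then attach them with a single concat instead of inserting columns one by one.
cabin_dummies = pd.get_dummies(df["Cabin"], prefix="Cabin", prefix_sep="_").astype(bool)
df_alt = pd.concat([df[["PassengerId", "Survived", "Fare", "Cabin"]], cabin_dummies], axis=1)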
df.head() #check that each cabin now has its own column
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Cabin_D17 | Cabin_A36 | Cabin_B69 | Cabin_E49 | Cabin_D28 | Cabin_E17 | Cabin_A24 | Cabin_C50 | Cabin_B42 | Cabin_C148 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | False | False | False | False | False | False | False | False | False | False |
3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | False | False | False | False | False | False | False | False | False | False |
6 | 7 | 0.0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | ... | False | False | False | False | False | False | False | False | False | False |
10 | 11 | 1.0 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | ... | False | False | False | False | False | False | False | False | False | False |
11 | 12 | 1.0 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | ... | False | False | False | False | False | False | False | False | False | False |
5 rows × 146 columns
#def order(char):
#return ord(char) - ord('A') +1
#df['deck_num'] = df['deck'].map(lambda s : order(s))
X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X = df.loc[:, X_colnames].copy()
y = df.loc[:, y_colname].copy()
[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
['Cabin_A10',
'Cabin_A16',
'Cabin_A20',
'Cabin_A23',
'Cabin_A24',
'Cabin_A26',
'Cabin_A31',
'Cabin_A34',
'Cabin_A36',
'Cabin_A5',
'Cabin_A6',
'Cabin_A7',
'Cabin_B101',
'Cabin_B18',
'Cabin_B19',
'Cabin_B20',
'Cabin_B22',
'Cabin_B3',
'Cabin_B30',
'Cabin_B35',
'Cabin_B37',
'Cabin_B38',
'Cabin_B39',
'Cabin_B4',
'Cabin_B41',
'Cabin_B42',
'Cabin_B49',
'Cabin_B5',
'Cabin_B50',
'Cabin_B51 B53 B55',
'Cabin_B57 B59 B63 B66',
'Cabin_B58 B60',
'Cabin_B69',
'Cabin_B71',
'Cabin_B73',
'Cabin_B77',
'Cabin_B79',
'Cabin_B80',
'Cabin_B82 B84',
'Cabin_B86',
'Cabin_B94',
'Cabin_B96 B98',
'Cabin_C101',
'Cabin_C103',
'Cabin_C104',
'Cabin_C110',
'Cabin_C111',
'Cabin_C118',
'Cabin_C123',
'Cabin_C124',
'Cabin_C125',
'Cabin_C126',
'Cabin_C148',
'Cabin_C2',
'Cabin_C22 C26',
'Cabin_C23 C25 C27',
'Cabin_C30',
'Cabin_C32',
'Cabin_C45',
'Cabin_C46',
'Cabin_C49',
'Cabin_C50',
'Cabin_C52',
'Cabin_C54',
'Cabin_C62 C64',
'Cabin_C65',
'Cabin_C68',
'Cabin_C7',
'Cabin_C70',
'Cabin_C78',
'Cabin_C82',
'Cabin_C83',
'Cabin_C85',
'Cabin_C86',
'Cabin_C87',
'Cabin_C90',
'Cabin_C91',
'Cabin_C92',
'Cabin_C93',
'Cabin_C99',
'Cabin_D',
'Cabin_D10 D12',
'Cabin_D11',
'Cabin_D15',
'Cabin_D17',
'Cabin_D19',
'Cabin_D20',
'Cabin_D26',
'Cabin_D28',
'Cabin_D30',
'Cabin_D33',
'Cabin_D35',
'Cabin_D36',
'Cabin_D37',
'Cabin_D46',
'Cabin_D47',
'Cabin_D48',
'Cabin_D49',
'Cabin_D50',
'Cabin_D56',
'Cabin_D6',
'Cabin_D7',
'Cabin_D9',
'Cabin_E10',
'Cabin_E101',
'Cabin_E12',
'Cabin_E121',
'Cabin_E17',
'Cabin_E24',
'Cabin_E25',
'Cabin_E31',
'Cabin_E33',
'Cabin_E34',
'Cabin_E36',
'Cabin_E38',
'Cabin_E40',
'Cabin_E44',
'Cabin_E46',
'Cabin_E49',
'Cabin_E50',
'Cabin_E58',
'Cabin_E63',
'Cabin_E67',
'Cabin_E68',
'Cabin_E77',
'Cabin_E8',
'Cabin_F G63',
'Cabin_F G73',
'Cabin_F2',
'Cabin_F33',
'Cabin_F4',
'Cabin_G6',
'Cabin_T']
Using scikit-learn’s StandardScaler, KNeighborsClassifier, and train_test_split¶
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
clf = KNeighborsClassifier(n_neighbors=10)
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
len(X_train)
146
len(y_train)
146
scaler.fit(X_train)
scaler.fit(X_test)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
len(X_train_scaled)
146
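For reference, the usual convention is to fit the scaler only on the training split and then reuse that fit to transform both splits, so the test data stays unseen during fitting; a minimal sketch (the _alt names are mine, and this is not the workflow the cells above follow, since they re-fit the scaler on X_test):
# Fit the scaler on the training features only, apply the same transformation
# to the test features, and train/evaluate the classifier on the scaled data.
scaler_alt = StandardScaler()
X_train_alt = scaler_alt.fit_transform(X_train)
X_test_alt = scaler_alt.transform(X_test)
clf_alt = KNeighborsClassifier(n_neighbors=10)
clf_alt.fit(X_train_alt, y_train)
print(clf_alt.score(X_test_alt, y_test))  # accuracy on the held-out split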
clf.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10)
clf.predict_proba(X_scaled)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
array([[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.6, 0.4],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.6, 0.4],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3]])
clf.score(X_test, y_test, sample_weight = None)
0.7567567567567568
clf.score(X_train, y_train, sample_weight = None)
0.7123287671232876
In the run shown above, the test-set score is actually slightly higher than the training-set score, and the two are close to each other, so there is no sign of severe overfitting; the exact values change between runs because train_test_split is random.
log_loss(df['Survived'], clf.predict_proba(X_scaled), labels = clf.classes_)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
0.9230261607842106
The log loss is below 1, though not close to 0; the lower the log loss, the closer the predicted probabilities are to the true outcomes, so there is still room for the model to improve.
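For context, log loss is the average of the negative log of the probability assigned to the true class, so a perfect model has log loss 0. A quick sanity check on a made-up two-sample example:
# Toy example: true labels 0 and 1, predicted probabilities for classes [0, 1].
toy_true = [0, 1]
toy_proba = [[0.7, 0.3], [0.4, 0.6]]
manual = -(np.log(0.7) + np.log(0.6)) / 2       # average of -log(p of the true class)
print(manual, log_loss(toy_true, toy_proba))    # the two values agree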
def get_scores(k):
    # Fit a k-nearest-neighbors classifier on the training split and
    # return its log loss on the held-out test split.
    reg = KNeighborsClassifier(n_neighbors=k)
    reg.fit(X_train, y_train)
    test_error = log_loss(y_test, reg.predict_proba(X_test), labels=reg.classes_)
    return test_error
for i in range(1,11):
print(f"when n_nearest neighbor is {i}, the test error is {get_scores(i)}")
when n_nearest neighbor is 1, the test error is 14.00220664658541
when n_nearest neighbor is 2, the test error is 4.948407829268989
when n_nearest neighbor is 3, the test error is 2.275243075462202
when n_nearest neighbor is 4, the test error is 1.4544746859879571
when n_nearest neighbor is 5, the test error is 0.5125348430463309
when n_nearest neighbor is 6, the test error is 0.5475811883637854
when n_nearest neighbor is 7, the test error is 0.572556653112447
when n_nearest neighbor is 8, the test error is 0.5541995482782791
when n_nearest neighbor is 9, the test error is 0.547555203527817
when n_nearest neighbor is 10, the test error is 0.5781849858754858
In the run shown above, k = 5 gives the lowest test error, with k = 6 through 10 close behind; because train_test_split is random, the best k can change from run to run.
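To make this comparison easier to read, the test error can also be plotted against k; a small sketch using the matplotlib import from the top of the notebook:
# Plot the test log loss for k = 1..10 so the best k is visible at a glance.
ks = list(range(1, 11))
errors = [get_scores(k) for k in ks]
plt.plot(ks, errors, marker="o")
plt.xlabel("n_neighbors (k)")
plt.ylabel("test log loss")
plt.title("KNN test error vs. number of neighbors")
plt.show()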
c1 = alt.Chart(df).mark_circle().encode(
x = "Cabin:O",
y = "Fare",
color = "Survived:N",
tooltip = ["Fare", "Cabin", "Survived"]
).properties(
title = "Fare paid for each deck",
height = 550,
width = 800
)
c1
Here we can see an outlier in the graph (in fact two passengers who each paid a fare above $500). Let’s remove them to see whether that makes the data more reliable.
df[df["Fare"] > 500] #check which rows has datas with fare > $500
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Cabin_D17 | Cabin_A36 | Cabin_B69 | Cabin_E49 | Cabin_D28 | Cabin_E17 | Cabin_A24 | Cabin_C50 | Cabin_B42 | Cabin_C148 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
679 | 680 | 1.0 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | ... | False | False | False | False | False | False | False | False | False | False |
737 | 738 | 1.0 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | ... | False | False | False | False | False | False | False | False | False | False |
2 rows × 146 columns
df1 = df[~(df["Fare"] > 500)] #Removing the anomaly from the dataset and setting it to df1
c2 = alt.Chart(df1).mark_circle().encode(
x = "Cabin:O",
y = "Fare",
color = "Survived:N",
tooltip = ["Fare", "Cabin", "Survived"]
).properties(
title = "Fare paid for each deck",
height = 550,
width = 800
)
c2
After removing the two outliers, the graph looks more evenly distributed. Since there are more orange (survived) dots than blue (did not survive) dots in the upper half of the graph, we can tell that people who paid more had a higher rate of survival. Now we can recompute our losses, errors, and scores to see whether anything changes once the outliers are removed.
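Since the next cells repeat the same split-fit-score steps on the filtered data, one way to keep the comparison consistent would be a small helper function; this is only a sketch (the name knn_scores is mine), and the notebook below simply repeats the cells instead:
def knn_scores(X, y, k=10, test_size=0.2):
    # Split the data, fit a k-NN classifier, and return (test accuracy, train accuracy).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size)
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_tr, y_tr)
    return model.score(X_te, y_te), model.score(X_tr, y_tr)
# Example usage (results vary between runs because the split is random):
# knn_scores(X, y)    # with the outliers
# knn_scores(X1, y1)  # without the outliers (X1 and y1 are built in the next cell)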
X_colnames = ["Fare"] + [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
y_colname = 'Survived'
X1 = df1.loc[:, X_colnames].copy()
y1 = df1.loc[:, y_colname].copy()
#extra
Cabin_col = [f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]
clf1 = KNeighborsClassifier(n_neighbors=10)
X_train1, X_test1, y_train1, y_test1 = train_test_split(X1,y1,test_size=0.2)
len(X_train1)
144
scaler = StandardScaler()
scaler.fit(X_train1)
scaler.fit(X_test1)
X_train_scaled1 = scaler.transform(X_train1)
X_test_scaled1 = scaler.transform(X_test1)
scaler.fit(X1)
X_scaled1 = scaler.transform(X1)
The difference between X_scaled and X_scaled1 is their length: X_scaled1 has fewer rows because the outliers have been dropped.
clf1.fit(X_train1, y_train1)
KNeighborsClassifier(n_neighbors=10)
clf1.predict_proba(X_scaled1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
array([[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3],
[0.7, 0.3]])
clf1.score(X_test1, y_test1, sample_weight = None)
0.6486486486486487
clf1.score(X_train1, y_train1, sample_weight = None)
0.7291666666666666
log_loss(df1['Survived'], clf1.predict_proba(X_scaled1), labels = clf.classes_)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsClassifier was fitted with feature names
"X does not have valid feature names, but"
0.923100585413051
In the run shown above, removing the outliers does not actually improve the numbers: the test-set score for df1 is lower than for df, and the log loss is essentially unchanged (the exact values vary between runs because the split is random). We will nevertheless proceed with df1 rather than df, since the two $500+ fares are extreme values that distort the fare scale. In both cases the training and test scores are reasonably close to each other, so neither model is drastically over- or under-fit; for df1 the training score is somewhat higher than the test score, which is a mild sign of overfitting but still under control.
Using KNeighborsRegressor¶
kreg = KNeighborsRegressor(n_neighbors=10)
kreg.fit(X_train1, y_train1)
KNeighborsRegressor(n_neighbors=10)
kreg.predict(X_scaled1)
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names
"X does not have valid feature names, but"
array([0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3,
0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3])
#since KNeighborsRegressor does not have a predict_proba attribute, we use .predict
kreg.score(X_test1, y_test1, sample_weight = None)
0.0010000000000000009
kreg.score(X_train1, y_train1, sample_weight = None)
0.1317979797979797
For a regressor, .score returns the coefficient of determination R² rather than an accuracy. The score is much higher on the training set than on the test set, a likely sign of overfitting, and both values are close to zero, so the model explains very little of the variation in survival.
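For reference, R² = 1 - SS_res/SS_tot compares the model's squared error to that of always predicting the mean; a quick check on made-up numbers shows the manual formula matches scikit-learn's r2_score:
from sklearn.metrics import r2_score
# R^2 compares the model's squared error to the error of always predicting the mean.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.2, 0.7, 0.6, 0.3])
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
print(1 - ss_res / ss_tot, r2_score(y_true, y_pred))  # both print 0.62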
mean_squared_error(df1['Survived'], kreg.predict(X_scaled1))
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:451: UserWarning: X does not have valid feature names, but KNeighborsRegressor was fitted with feature names
"X does not have valid feature names, but"
0.35740331491712707
Through this comparison, we can see that KNeighborsClassifier is better suited here than KNeighborsRegressor: the gap between training and test scores is much smaller for the classifier, and its scores are much higher on both sets (keeping in mind that the classifier's score is accuracy while the regressor's is R², so the two are not on exactly the same scale). Higher scores are better, so KNeighborsClassifier is the more suitable choice.
Extra topics not learned in class¶
In the following blocks, I will use Linear Regression
, KNearestNeighbors
, and Random Forest
to compute scores of how well the machine can predict the survival rate according to the fare and deck the passengers are at, by using Cross Validation
.
lr = LogisticRegression(max_iter = 2000)
cv = cross_val_score(lr,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.67857143]
0.6667487684729064
Here we use max_iter to set the maximum number of iterations the solver is allowed to take while fitting the model.
knn = KNeighborsClassifier()
cv = cross_val_score(knn,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.72413793 0.62068966 0.72413793 0.5862069 0.53571429]
0.6381773399014778
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf,X_train1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.62068966 0.65517241 0.71428571]
0.6738916256157635
Here we use cv = 5: cross_val_score splits the training features and target into 5 folds, fits the model on 4 of them, evaluates on the held-out fold, and repeats so that each fold is held out exactly once. The five scores printed inside the square brackets are the results for the five folds.
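A sketch of what cross_val_score does under the hood, written out with KFold (the variable names are mine; note that for classifiers cross_val_score actually uses stratified folds by default, so the numbers will not match exactly):
from sklearn.model_selection import KFold
# Manually run 5-fold cross-validation for the random forest: fit on four folds,
# score on the held-out fold, and repeat so that every fold is held out once.
fold_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_train1):
    model = RandomForestClassifier(random_state=1)
    model.fit(X_train1.iloc[train_idx], y_train1.iloc[train_idx])
    fold_scores.append(model.score(X_train1.iloc[val_idx], y_train1.iloc[val_idx]))
print(fold_scores, np.mean(fold_scores))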
Now we want to use the VotingClassifier. With soft voting, it averages the class probabilities predicted by each of the machine learning models above; if the averaged probability of survival is greater than 50%, the passenger is predicted to have survived, and otherwise not.
voting_clf = VotingClassifier(estimators = [('lr',lr),('knn',knn),('rf',rf)], voting = 'soft')
Here we use soft voting because we want the classifier to combine predicted probabilities (optionally weighted) rather than hard class labels.
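Before cross-validating the ensemble, here is a small illustration of what soft voting means: each fitted model's predicted probabilities are averaged with equal weights, and a passenger is predicted to survive when the averaged survival probability exceeds 0.5. This sketch fits all three models on the scaled training data, which is a slight simplification of the setup above:
# Fit the three individual models, average their predicted probabilities on the
# test split, and turn the averaged probability into a 0/1 prediction.
for model in (lr, knn, rf):
    model.fit(X_train_scaled1, y_train1)
avg_proba = np.mean([m.predict_proba(X_test_scaled1) for m in (lr, knn, rf)], axis=0)
manual_soft_vote = (avg_proba[:, 1] > 0.5).astype(int)
print((manual_soft_vote == y_test1).mean())  # accuracy of the hand-rolled soft vote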
cv = cross_val_score(voting_clf,X_train_scaled1,y_train1,cv=5)
print(cv)
print(cv.mean())
[0.68965517 0.68965517 0.68965517 0.62068966 0.71428571]
0.6807881773399015
Here we can see from the cross-validation scores that the combined model is about 68% accurate on average across the folds. Among the individual models tested, Random Forest has the highest mean score, followed closely by Logistic Regression, with K-Nearest Neighbors last. Now let’s test the data with Logistic Regression and Linear Regression.
Using Logistic Regression¶
clfo = LogisticRegression()
clfo.fit(X_train1, y_train1)
LogisticRegression()
clfo.predict(X_train1)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1.,
1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
1., 1., 1., 1., 1., 1., 1., 1.])
np.count_nonzero(clfo.predict(X_test1) == y_test1)/len(X_test1)
0.5945945945945946
np.count_nonzero(clfo.predict(X_train1) == y_train1)/len(X_train1)
0.7222222222222222
Here we can see that, in this run, Logistic Regression is about 59% accurate on the test set and about 72% accurate on the training set (the exact values depend on the random split).
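The accuracy computed by hand above with np.count_nonzero is exactly what the classifier's score method returns, so the same numbers can be obtained more directly:
# .score returns mean accuracy, i.e. the fraction of correct predictions.
print(clfo.score(X_test1, y_test1))    # matches the manual count on the test set
print(clfo.score(X_train1, y_train1))  # matches the manual count on the training set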
Using LinearRegression¶
reg = LinearRegression()
reg.fit(df[["Fare",]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    # Overlay the fitted line y = m*x + b (in red) on a scatter plot of Fare vs. Survived.
    alt.data_transformers.disable_max_rows()
    d1 = alt.Chart(df1).mark_circle().encode(
        x = "Fare",
        y = "Survived"
    )
    xmax = 40
    df_line = pd.DataFrame({"Fare":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Fare",
        y = "Survived"
    )
    return d1+d2
Although this is really a classification problem suited to Logistic Regression, we will try Linear Regression anyway.
draw_line(0.00082767, 0.6070082823070905)
Here it is obvious why we do not use Linear Regression: Survived is categorical, either yes or no with nothing in between (people can’t half-survive), yet the regression line predicts values in between. Since the intercept is above 0.5, the model predicts that a passenger who paid nothing is more likely than not to survive (most passengers in this cabin-only subset did survive), and the positive but tiny coefficient (about 0.0008) means the fare barely moves the prediction; for very large fares the predicted value would even exceed 1, which makes no sense for a 0/1 outcome. We would expect that the more a passenger paid, the more likely they were to survive, so this further shows that linear regression is not the right tool here.
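To see the contrast, we can fit a logistic regression on Fare alone and plot its predicted survival probability next to the straight line from the linear fit; a sketch (fare_grid and logit_fare are names introduced here):
# Compare the linear fit (reg, fitted above) with a logistic curve on the same feature.
fare_grid = pd.DataFrame({"Fare": np.linspace(0, 300, 200)})
logit_fare = LogisticRegression()
logit_fare.fit(df1[["Fare"]], df1["Survived"])
plt.scatter(df1["Fare"], df1["Survived"], alpha=0.3, label="passengers")
plt.plot(fare_grid["Fare"], reg.predict(fare_grid), label="linear regression")
plt.plot(fare_grid["Fare"], logit_fare.predict_proba(fare_grid)[:, 1], label="logistic regression probability")
plt.xlabel("Fare")
plt.ylabel("Survived / predicted probability")
plt.legend()
plt.show()
Unlike the straight line, the logistic curve always stays between 0 and 1, which is what we want for a survival probability.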
Now let’s see whether using Cabin as the x-axis works at all.
reg1 = LinearRegression()
reg1.fit(df[[f"Cabin_{k}" for k in sorted(df["Cabin"].unique())]], df["Survived"])
LinearRegression()
reg.coef_
array([0.00082767])
reg.intercept_
0.6070082823070905
def draw_line(m,b):
    # Attempt the same overlay with the categorical Cabin values on the x-axis;
    # the line endpoints y = b and y = b + xmax*m have no meaningful position here.
    alt.data_transformers.disable_max_rows()
    d1 = alt.Chart(df).mark_circle().encode(
        x = "Cabin:O",
        y = "Survived"
    )
    xmax = 40
    df_line = pd.DataFrame({"Cabin":[0,xmax],"Survived":[b, b + xmax*m]})
    d2 = alt.Chart(df_line).mark_line(color="red").encode(
        x = "Cabin",
        y = "Survived"
    )
    return d1+d2
draw_line(1.1276363e+14, -112763629749553.84)
Seeing how the graph does not make any sense with Cabin on the x-axis, Linear Regression will not work in this case: the model multiplies a coefficient by the x value, but Cabin is categorical, so there is no meaningful number to multiply.
Summary¶
In this project, I used scikit-learn to carry out the machine learning process. First, I compared KNeighborsClassifier and KNeighborsRegressor. Looking at the training and test scores, the classifier is clearly the better choice: its two scores are close to each other and much higher, while the regressor’s scores are near zero and far apart, a sign of overfitting. Next, I used cross-validation scores and a voting classifier to measure how confident the machine is in predicting survival, using the models Logistic Regression, K-Nearest Neighbors, and Random Forest; in the run shown above, Random Forest had the highest individual mean and the soft-voting ensemble averaged about 68%, so the models do somewhat better than chance but are not especially reliable. Following that, I looked at Logistic Regression and Linear Regression. Logistic Regression is suitable for classification problems, so I simply compared its accuracy on the training and test sets: roughly 72% versus 59% in this run, which indicates some overfitting, though it is still under control. For Linear Regression, I had doubts about whether it could be used at all, so I graphed it, and the plots confirm that it does not make sense for a 0/1 target. Overall, I would not recommend using these models to predict survival from fare and cabin alone.
References¶
Dataset: https://www.kaggle.com/c/titanic/data
Exploration: https://www.youtube.com/watch?v=I3FBJdiExcg
Reference for new material: https://www.kaggle.com/kenjee/titanic-project-example
Understanding of cv parameter: https://stackoverflow.com/questions/52611498/need-help-understanding-cross-val-score-in-sklearn-python
KNeighborsRegressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor
Referenced from Homework 6, the Week 6 video notebook, and the Week 9 Monday Linear Regression worksheet from Math 10.