Predicting Heart Disease#

Author: Emily Wang

Course Project, UC Irvine, Math 10, F22

Introduction#

Cardiovascular disease is one of the leading causes of death in the world, so it is important to understand which risk factors play a role in it; that understanding can help with prevention and treatment. In this project, I try to predict whether a person has heart disease based on certain risk factors. In addition, I look at which factors are most strongly associated with heart disease and which show little or no correlation.

Main portion of the project#

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

Reading Data#

df = pd.read_csv("heart_data.csv")
df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0

918 rows × 12 columns

df.shape
(918, 12)
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")
There are 918 rows and 12 columns in this dataset.

Cleaning Data#

# Checking to see if there are any missing values in the dataset.
df.isna().sum()
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

The dataset doesn’t have any missing data.
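One caveat worth flagging, though: although nothing is NaN, some Cholesterol entries are 0 (a few appear in the random sample further below), which is physiologically implausible and most likely encodes a missing measurement rather than a true value. A quick check, as a sketch:

# Cholesterol readings of 0 are implausible and likely stand in
# for missing measurements rather than true values.
(df["Cholesterol"] == 0).sum()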

df.dtypes
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object
df.columns
Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease'],
      dtype='object')
# Selecting the columns whose datatype is numeric.
num_cols = df.select_dtypes(include=np.number).columns.tolist()
num_cols
['Age',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'MaxHR',
 'Oldpeak',
 'HeartDisease']
# Creating a list of the non-numeric (categorical) columns in the dataset.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
categorical_cols
['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
df["HD"] = df["HeartDisease"].map({0:"False", 1:"True"})

To use the HeartDisease column as the output (y) variable for classification in the decision tree, I create a new HD column mapping the integers 0 and 1 to the strings "False" and "True"; this changes the datatype from integer to object, so the classes display as labels in the tree plot rather than as numbers.

df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0 False
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1 True
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0 False
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1 True
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1 True
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1 True
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1 True
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1 True
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0 False

918 rows × 13 columns

Data Analysis#

df.sample(10)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD
50 50 M ASY 130 233 0 Normal 121 Y 2.0 Flat 1 True
604 68 M NAP 134 254 1 Normal 151 Y 0.0 Up 0 False
27 52 M ATA 120 284 0 Normal 118 N 0.0 Up 0 False
353 58 M ASY 130 0 0 ST 100 Y 1.0 Flat 1 True
122 46 M ASY 110 240 0 ST 140 N 0.0 Up 0 False
258 51 F NAP 150 200 0 Normal 120 N 0.5 Up 0 False
333 40 M ASY 95 0 1 ST 144 N 0.0 Up 1 True
363 56 M ASY 120 0 0 ST 148 N 0.0 Flat 1 True
71 44 M ATA 130 215 0 Normal 135 N 0.0 Up 0 False
507 40 M NAP 106 240 0 Normal 80 Y 0.0 Up 0 False
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
 12  HD              918 non-null    object 
dtypes: float64(1), int64(6), object(6)
memory usage: 93.4+ KB
df["HeartDisease"].value_counts()
1    508
0    410
Name: HeartDisease, dtype: int64

In this heart failure prediction dataset, 508 of 918 people (about 55%) have heart disease. In the HeartDisease column, the value 1 means the person has heart disease, and 0 means none was detected.
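Given this roughly 55/45 split, a useful reference point for the classifiers below is the majority-class baseline of 508/918 ≈ 0.55. As a hedged sketch (not part of the original analysis), scikit-learn's DummyClassifier makes this explicit:

from sklearn.dummy import DummyClassifier

# A baseline that always predicts the most frequent class;
# the features passed in are ignored by this strategy.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(df[["Age"]], df["HeartDisease"])
baseline.score(df[["Age"]], df["HeartDisease"])  # about 508/918, i.e. roughly 0.55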

# numeric_only=True restricts the correlation matrix to the numeric
# columns (newer versions of pandas require this when object columns are present).
dataplot = sns.heatmap(df.corr(numeric_only=True), annot=True)
[Figure: annotated heatmap of pairwise correlations between the numeric columns]
  • Based on the correlation matrix, none of the cardiovascular risk factors is strongly correlated with the target.

  • Oldpeak (a numeric measure of ST depression on the ECG) has the highest positive correlation with HeartDisease: as ST depression increases, the likelihood of heart disease increases, and vice versa.

  • MaxHR (maximum heart rate) has the strongest negative correlation with HeartDisease: as maximum heart rate increases, the likelihood of heart disease decreases.

  • A surprising finding in this data: the correlation between Cholesterol and HeartDisease is negative. This is likely distorted by the Cholesterol values of 0 flagged earlier, which appear to encode missing measurements. The sketch below reads these correlations directly off the matrix.
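The bullets above can be checked against one column of the correlation matrix, sorted (a small sketch):

# Each numeric feature's correlation with the target, sorted;
# numeric_only=True matches the heatmap computation above.
df.corr(numeric_only=True)["HeartDisease"].sort_values()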

Train-Test Split#

# Remove the target from the list of feature columns.
num_cols.remove("HeartDisease")
X = df[num_cols]
y = df["HD"]
X.columns
Index(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak'], dtype='object')
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
# Fit the scaler on the training set only, then apply the same
# transformation to the test set to avoid data leakage.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape
(734, 6)
y_train.shape
(734,)

Decision Tree Classifier#

# Decision trees are insensitive to feature scaling, so the
# unscaled numeric columns are used here.
X1 = df[num_cols]
y1 = df["HD"]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, train_size=0.8, random_state=1)
clf = DecisionTreeClassifier(max_leaf_nodes=7, max_depth=10, random_state=1)
clf.fit(X1_train, y1_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=7, random_state=1)
clf.predict(X1_test)
array(['True', 'True', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'False', 'False', 'True', 'True', 'True',
       'False', 'True', 'False', 'True', 'True', 'False', 'False', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'False',
       'False', 'False', 'False', 'False', 'True', 'True', 'True', 'True',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'False', 'True', 'True',
       'False', 'True', 'False', 'True', 'False', 'True', 'False', 'True',
       'True', 'True', 'True', 'False', 'True', 'True', 'True', 'False',
       'True', 'True', 'False', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True',
       'False', 'True', 'True', 'True', 'True', 'True', 'True', 'False',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'False', 'True', 'False', 'False', 'False',
       'False', 'False', 'False', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'False', 'True', 'False', 'True', 'True', 'False',
       'True', 'False', 'True', 'True', 'False', 'True', 'True', 'False',
       'True', 'True', 'False', 'False', 'True', 'True', 'True', 'True',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'False',
       'True', 'False', 'True', 'True', 'True', 'False', 'True', 'True',
       'True', 'True', 'False', 'False', 'True', 'True', 'True', 'False',
       'True', 'True', 'True', 'True', 'False', 'False', 'False', 'True',
       'False', 'False', 'False', 'False', 'False'], dtype=object)
clf.score(X1_train, y1_train)
0.8010899182561307
clf.score(X1_test, y1_test)
0.8152173913043478

Based on the decision tree results, the training and test scores are close to each other, so the model appears to generalize reasonably well. Furthermore, we don't have to worry about overfitting here, since the training score is not higher than the test score.

(y1_train == clf.predict(X1_train)).value_counts()
True     588
False    146
Name: HD, dtype: int64
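To see how the misclassifications split into false positives and false negatives (a point revisited in the summary), a hedged sketch using scikit-learn's confusion_matrix on the test set:

from sklearn.metrics import confusion_matrix

# Rows are true classes and columns are predicted classes,
# both ordered as clf.classes_ (here 'False', then 'True').
confusion_matrix(y1_test, clf.predict(X1_test), labels=clf.classes_)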
fig = plt.figure()
_ = plot_tree(clf,
                   feature_names = clf.feature_names_in_,
                   class_names=clf.classes_,
                   filled=True)
[Figure: the fitted decision tree plotted with plot_tree, nodes colored by class]
clf.feature_names_in_
array(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak'],
      dtype=object)
pd.Series(clf.feature_importances_)
0    0.046660
1    0.000000
2    0.364315
3    0.000000
4    0.156920
5    0.432105
dtype: float64
pd.Series(clf.feature_importances_, index=clf.feature_names_in_).sort_values()
RestingBP      0.000000
FastingBS      0.000000
Age            0.046660
MaxHR          0.156920
Cholesterol    0.364315
Oldpeak        0.432105
dtype: float64

The factors with the highest feature-importance values have the greatest influence on the decision tree's splits. Here, the most relevant factors for predicting whether a person has heart disease are Oldpeak and Cholesterol, with MaxHR a distant third.
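As a quick visual companion (a sketch, not part of the original output), the same importances can be drawn as a bar chart:

# Horizontal bar chart of the tree's feature importances;
# the importances sum to 1 across all features.
importances = pd.Series(clf.feature_importances_, index=clf.feature_names_in_).sort_values()
importances.plot.barh()
plt.xlabel("feature importance")
plt.show()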

Random Forest Classifier#

X2 = df[num_cols]
y2 = df["HD"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2,  train_size = 0.8, random_state = 1)
rfc = RandomForestClassifier(n_estimators = 1000, max_leaf_nodes = 10, random_state=1)
rfc.fit(X2_train, y2_train)
RandomForestClassifier(max_leaf_nodes=10, n_estimators=1000, random_state=1)
rfc.score(X2_train, y2_train)
0.8160762942779292
rfc.score(X2_test, y2_test)
0.8369565217391305

The scores for the random forest are about the same as for the decision tree. The training and test scores are again close together, so there are no signs of overfitting, and at roughly 82-84% they are reasonably high, suggesting the numeric features carry real predictive signal. A cross-validation check follows below.
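Since a single train/test split can be sensitive to the choice of random_state, a hedged robustness check (not in the original analysis) is to average the score over several splits with cross-validation:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same random forest on the full
# numeric feature set; the mean indicates how stable the ~0.82 score is.
scores = cross_val_score(rfc, X2, y2, cv=5)
scores.mean()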

K-Nearest Neighbors Classifier#

X3 = df[num_cols]
y3 = df["HD"]
X3_train, X3_test, y3_train, y3_test = train_test_split(X3,y3,  train_size = 0.8, random_state = 1)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf2.fit(X3_train, y3_train)
KNeighborsClassifier(n_neighbors=10)
clf2.predict(X3_test)
array(['False', 'True', 'True', 'True', 'True', 'True', 'False', 'True',
       'False', 'False', 'False', 'False', 'False', 'False', 'True',
       'False', 'True', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'False', 'True',
       'False', 'False', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'False', 'False', 'False', 'True', 'False',
       'True', 'False', 'True', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'False',
       'False', 'False', 'True', 'False', 'False', 'False', 'False',
       'True', 'True', 'False', 'True', 'False', 'True', 'False', 'True',
       'True', 'False', 'False', 'False', 'True', 'True', 'True', 'False',
       'False', 'True', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'False', 'True', 'False', 'False', 'True', 'True', 'True',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'True', 'False', 'True', 'True', 'True', 'True', 'True',
       'False', 'False', 'True', 'False', 'False', 'True', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'True', 'True', 'False', 'True', 'True', 'False', 'False', 'False',
       'True', 'True', 'False', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'True', 'False', 'False', 'False', 'False',
       'False'], dtype=object)
clf2.score(X3_test, y3_test)
0.6847826086956522
clf2.score(X3_train, y3_train)
0.7438692098092643

The training and test scores for this classifier are about six percentage points apart, a larger gap than for the other models, which could be a sign of overfitting since the training score is the higher of the two. The scores are also much lower than the others, so this classifier is probably not the best one to use here; one likely reason is that it was fit on unscaled features, as explored in the sketch below.
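Because KNN is distance-based, it usually benefits from standardized features, and the scaled X_train/X_test prepared in the Train-Test Split section were never used above. As a sketch, refitting the same model on the scaled data (the rows line up because that split used the same train_size and random_state):

# Refit KNN on the standardized features; scaling puts all features
# on comparable ranges, which matters for distance-based models.
clf2_scaled = KNeighborsClassifier(n_neighbors=10)
clf2_scaled.fit(X_train, y_train)
clf2_scaled.score(X_test, y_test)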

df["Prediction"] = clf2.predict(X3)
df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD Prediction
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0 False False
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1 True False
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0 False True
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1 True True
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0 False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1 True False
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1 True True
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1 True True
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1 True False
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0 False False

918 rows × 14 columns

c1 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="MaxHR",
    color="HD:N"
)
c2 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="MaxHR",
    color="Prediction:N"
)
c3 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="Oldpeak",
    color="HD:N"
)
c4 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="Oldpeak",
    color="Prediction:N"
)
c1 | c2
c3 | c4

Looking at the charts, the KNN predictor labels more points False than the actual data does. The same pattern appears in both pairs of charts, which plot the two features most strongly correlated with heart disease (MaxHR and Oldpeak) against Age.
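That visual impression can be checked numerically by comparing the label counts (a small sketch):

# Side-by-side counts of the actual labels and the KNN predictions
# over all 918 rows, to verify the impression from the charts.
pd.DataFrame({"actual": df["HD"].value_counts(),
              "predicted": df["Prediction"].value_counts()})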

Summary#

My project aimed to predict whether a person has a heart condition based on some common risk factors. I used several different classifiers to see how well heart disease can be predicted from the given data. The decision tree and random forest classifiers had the best training and test scores, with both landing at about the same accuracy. Even though the models reached a fairly high accuracy of around 80%, the individual correlations were not strong enough to determine heart disease from any single factor alone: the strongest correlation between a risk factor and heart disease, for Oldpeak, was only about 0.4. Finally, even with relatively high accuracy, the models produced some false negatives, which would be a bad outcome in practice, since people might believe they don't have heart disease when they do.

References#


  • What is the source of your dataset(s)?

https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

  • List any other references that you found helpful.

https://www.geeksforgeeks.org/how-to-create-a-seaborn-correlation-heatmap-in-python/
https://www.kaggle.com/code/durgancegaur/a-guide-to-any-classification-problem
https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
