Predicting Heart Disease#

Author: Emily Wang

Course Project, UC Irvine, Math 10, F22

Introduction#

Cardiovascular disease is one of the leading causes of death in the world, so it is important to understand which risk factors play a role in it; that understanding can help with prevention and treatment. In this project, I try to predict whether a person has heart disease based on certain risk factors. In addition, I look at which factors are most strongly associated with heart disease and which show little or no correlation.

Main portion of the project#

import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

Reading Data#

df = pd.read_csv("heart_data.csv")
df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0

918 rows × 12 columns

df.shape
(918, 12)
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns in this dataset.")
There are 918 rows and 12 columns in this dataset.

Cleaning Data#

# Checking to see if there are any missing values in the dataset.
df.isna().sum()
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64

The dataset doesn’t have any missing data.
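One caveat worth flagging, though: although nothing is NaN, some Cholesterol entries are 0 (a few appear in the random sample further below), which is physiologically implausible and most likely encodes a missing measurement rather than a true value. A quick check, as a sketch:

# Cholesterol readings of 0 are implausible and likely stand in
# for missing measurements rather than true values.
(df["Cholesterol"] == 0).sum()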

df.dtypes
Age                 int64
Sex                object
ChestPainType      object
RestingBP           int64
Cholesterol         int64
FastingBS           int64
RestingECG         object
MaxHR               int64
ExerciseAngina     object
Oldpeak           float64
ST_Slope           object
HeartDisease        int64
dtype: object
df.columns
Index(['Age', 'Sex', 'ChestPainType', 'RestingBP', 'Cholesterol', 'FastingBS',
       'RestingECG', 'MaxHR', 'ExerciseAngina', 'Oldpeak', 'ST_Slope',
       'HeartDisease'],
      dtype='object')
# Selecting the columns whose datatype is numeric.
num_cols = df.select_dtypes(include=np.number).columns.tolist()
num_cols
['Age',
 'RestingBP',
 'Cholesterol',
 'FastingBS',
 'MaxHR',
 'Oldpeak',
 'HeartDisease']
# Creating a list of the non-numeric (categorical) columns in the dataset.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
categorical_cols
['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
df["HD"] = df["HeartDisease"].map({0:"False", 1:"True"})

To use the HeartDisease column as the output (y) variable for classification in the decision tree, I create a new HD column mapping the integers 0 and 1 to the strings "False" and "True"; this changes the datatype from integer to object, so the classes display as labels in the tree plot rather than as numbers.

df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0 False
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1 True
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0 False
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1 True
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0 False
... ... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1 True
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1 True
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1 True
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1 True
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0 False

918 rows × 13 columns

Data Analysis#

df.sample(10)
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD
50 50 M ASY 130 233 0 Normal 121 Y 2.0 Flat 1 True
604 68 M NAP 134 254 1 Normal 151 Y 0.0 Up 0 False
27 52 M ATA 120 284 0 Normal 118 N 0.0 Up 0 False
353 58 M ASY 130 0 0 ST 100 Y 1.0 Flat 1 True
122 46 M ASY 110 240 0 ST 140 N 0.0 Up 0 False
258 51 F NAP 150 200 0 Normal 120 N 0.5 Up 0 False
333 40 M ASY 95 0 1 ST 144 N 0.0 Up 1 True
363 56 M ASY 120 0 0 ST 148 N 0.0 Flat 1 True
71 44 M ATA 130 215 0 Normal 135 N 0.0 Up 0 False
507 40 M NAP 106 240 0 Normal 80 Y 0.0 Up 0 False
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
 12  HD              918 non-null    object 
dtypes: float64(1), int64(6), object(6)
memory usage: 93.4+ KB
df["HeartDisease"].value_counts()
1    508
0    410
Name: HeartDisease, dtype: int64

In this heart failure prediction dataset, 508 of 918 people (about 55%) have heart disease. In the HeartDisease column, the value 1 means the person has heart disease, and 0 means none was detected.
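Given this roughly 55/45 split, a useful reference point for the classifiers below is the majority-class baseline of 508/918 ≈ 0.55. As a hedged sketch (not part of the original analysis), scikit-learn's DummyClassifier makes this explicit:

from sklearn.dummy import DummyClassifier

# A baseline that always predicts the most frequent class;
# the features passed in are ignored by this strategy.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(df[["Age"]], df["HeartDisease"])
baseline.score(df[["Age"]], df["HeartDisease"])  # about 508/918, i.e. roughly 0.55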

# numeric_only=True restricts the correlation matrix to the numeric
# columns (newer versions of pandas require this when object columns are present).
dataplot = sns.heatmap(df.corr(numeric_only=True), annot=True)
[Figure: annotated heatmap of pairwise correlations between the numeric columns]
  • Based on the correlation matrix, none of the cardiovascular risk factors is strongly correlated with the target.

  • Oldpeak (a numeric measure of ST depression on the ECG) has the highest positive correlation with HeartDisease: as ST depression increases, the likelihood of heart disease increases, and vice versa.

  • MaxHR (maximum heart rate) has the strongest negative correlation with HeartDisease: as maximum heart rate increases, the likelihood of heart disease decreases.

  • A surprising finding in this data: the correlation between Cholesterol and HeartDisease is negative. This is likely distorted by the Cholesterol values of 0 flagged earlier, which appear to encode missing measurements. The sketch below reads these correlations directly off the matrix.
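The bullets above can be checked against one column of the correlation matrix, sorted (a small sketch):

# Each numeric feature's correlation with the target, sorted;
# numeric_only=True matches the heatmap computation above.
df.corr(numeric_only=True)["HeartDisease"].sort_values()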

Train-Test Split#

# Remove the target from the list of feature columns.
num_cols.remove("HeartDisease")
X = df[num_cols]
y = df["HD"]
X.columns
Index(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak'], dtype='object')
scaler = StandardScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1)
# Fit the scaler on the training set only, then apply the same
# transformation to the test set to avoid data leakage.
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train.shape
(734, 6)
y_train.shape
(734,)

Decision Tree Classifier#

# Decision trees are insensitive to feature scaling, so the
# unscaled numeric columns are used here.
X1 = df[num_cols]
y1 = df["HD"]
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, train_size=0.8, random_state=1)
clf = DecisionTreeClassifier(max_leaf_nodes=7, max_depth=10, random_state=1)
clf.fit(X1_train, y1_train)
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=7, random_state=1)
clf.predict(X1_test)
array(['True', 'True', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'False', 'False', 'True', 'True', 'True',
       'False', 'True', 'False', 'True', 'True', 'False', 'False', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'False',
       'False', 'False', 'False', 'False', 'True', 'True', 'True', 'True',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'False', 'True', 'True',
       'False', 'True', 'False', 'True', 'False', 'True', 'False', 'True',
       'True', 'True', 'True', 'False', 'True', 'True', 'True', 'False',
       'True', 'True', 'False', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'True', 'True', 'False', 'False', 'False', 'True', 'True',
       'False', 'True', 'True', 'True', 'True', 'True', 'True', 'False',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'False', 'True', 'False', 'False', 'False',
       'False', 'False', 'False', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'False', 'True', 'False', 'True', 'True', 'False',
       'True', 'False', 'True', 'True', 'False', 'True', 'True', 'False',
       'True', 'True', 'False', 'False', 'True', 'True', 'True', 'True',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'False',
       'True', 'False', 'True', 'True', 'True', 'False', 'True', 'True',
       'True', 'True', 'False', 'False', 'True', 'True', 'True', 'False',
       'True', 'True', 'True', 'True', 'False', 'False', 'False', 'True',
       'False', 'False', 'False', 'False', 'False'], dtype=object)
clf.score(X1_train, y1_train)
0.8010899182561307
clf.score(X1_test, y1_test)
0.8152173913043478

Based on the decision tree results, the training and test scores are close to each other, so the model appears to generalize reasonably well. Furthermore, we don't have to worry about overfitting here, since the training score is not higher than the test score.

(y1_train == clf.predict(X1_train)).value_counts()
True     588
False    146
Name: HD, dtype: int64
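To see how the misclassifications split into false positives and false negatives (a point revisited in the summary), a hedged sketch using scikit-learn's confusion_matrix on the test set:

from sklearn.metrics import confusion_matrix

# Rows are true classes and columns are predicted classes,
# both ordered as clf.classes_ (here 'False', then 'True').
confusion_matrix(y1_test, clf.predict(X1_test), labels=clf.classes_)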
fig = plt.figure()
_ = plot_tree(clf,
                   feature_names = clf.feature_names_in_,
                   class_names=clf.classes_,
                   filled=True)
[Figure: the fitted decision tree plotted with plot_tree, nodes colored by class]
clf.feature_names_in_
array(['Age', 'RestingBP', 'Cholesterol', 'FastingBS', 'MaxHR', 'Oldpeak'],
      dtype=object)
pd.Series(clf.feature_importances_)
0    0.046660
1    0.000000
2    0.364315
3    0.000000
4    0.156920
5    0.432105
dtype: float64
pd.Series(clf.feature_importances_, index=clf.feature_names_in_).sort_values()
RestingBP      0.000000
FastingBS      0.000000
Age            0.046660
MaxHR          0.156920
Cholesterol    0.364315
Oldpeak        0.432105
dtype: float64

The factors with the highest feature-importance values have the greatest influence on the decision tree's splits. Here, the most relevant factors for predicting whether a person has heart disease are Oldpeak and Cholesterol, with MaxHR a distant third.
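As a quick visual companion (a sketch, not part of the original output), the same importances can be drawn as a bar chart:

# Horizontal bar chart of the tree's feature importances;
# the importances sum to 1 across all features.
importances = pd.Series(clf.feature_importances_, index=clf.feature_names_in_).sort_values()
importances.plot.barh()
plt.xlabel("feature importance")
plt.show()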

Random Forest Classifier#

X2 = df[num_cols]
y2 = df["HD"]
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2,  train_size = 0.8, random_state = 1)
rfc = RandomForestClassifier(n_estimators = 1000, max_leaf_nodes = 10, random_state=1)
rfc.fit(X2_train, y2_train)
RandomForestClassifier(max_leaf_nodes=10, n_estimators=1000, random_state=1)
rfc.score(X2_train, y2_train)
0.8160762942779292
rfc.score(X2_test, y2_test)
0.8369565217391305

The scores for the random forest are about the same as for the decision tree. The training and test scores are again close together, so there are no signs of overfitting, and at roughly 82-84% they are reasonably high, suggesting the numeric features carry real predictive signal. A cross-validation check follows below.
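Since a single train/test split can be sensitive to the choice of random_state, a hedged robustness check (not in the original analysis) is to average the score over several splits with cross-validation:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation of the same random forest on the full
# numeric feature set; the mean indicates how stable the ~0.82 score is.
scores = cross_val_score(rfc, X2, y2, cv=5)
scores.mean()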

K-Nearest Neighbors Classifier#

X3 = df[num_cols]
y3 = df["HD"]
X3_train, X3_test, y3_train, y3_test = train_test_split(X3,y3,  train_size = 0.8, random_state = 1)
clf2 = KNeighborsClassifier(n_neighbors=10)
clf2.fit(X3_train, y3_train)
KNeighborsClassifier(n_neighbors=10)
clf2.predict(X3_test)
array(['False', 'True', 'True', 'True', 'True', 'True', 'False', 'True',
       'False', 'False', 'False', 'False', 'False', 'False', 'True',
       'False', 'True', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'False', 'True',
       'False', 'False', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'False', 'False', 'False', 'True', 'False',
       'True', 'False', 'True', 'True', 'True', 'True', 'False', 'False',
       'False', 'True', 'True', 'False', 'True', 'True', 'True', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'False',
       'False', 'False', 'True', 'False', 'False', 'False', 'False',
       'True', 'True', 'False', 'True', 'False', 'True', 'False', 'True',
       'True', 'False', 'False', 'False', 'True', 'True', 'True', 'False',
       'False', 'True', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'False', 'True', 'False', 'False', 'True', 'True', 'True',
       'True', 'False', 'True', 'True', 'True', 'False', 'False', 'True',
       'False', 'True', 'False', 'True', 'True', 'True', 'True', 'True',
       'False', 'False', 'True', 'False', 'False', 'True', 'True',
       'False', 'True', 'True', 'False', 'False', 'True', 'True', 'False',
       'False', 'False', 'True', 'False', 'True', 'False', 'False',
       'True', 'True', 'False', 'True', 'True', 'False', 'False', 'False',
       'True', 'True', 'False', 'True', 'True', 'False', 'False', 'True',
       'False', 'False', 'True', 'False', 'False', 'False', 'False',
       'False'], dtype=object)
clf2.score(X3_test, y3_test)
0.6847826086956522
clf2.score(X3_train, y3_train)
0.7438692098092643

The training and test scores for this classifier are about six percentage points apart, a larger gap than for the other models, which could be a sign of overfitting since the training score is the higher of the two. The scores are also much lower than the others, so this classifier is probably not the best one to use here; one likely reason is that it was fit on unscaled features, as explored in the sketch below.
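Because KNN is distance-based, it usually benefits from standardized features, and the scaled X_train/X_test prepared in the Train-Test Split section were never used above. As a sketch, refitting the same model on the scaled data (the rows line up because that split used the same train_size and random_state):

# Refit KNN on the standardized features; scaling puts all features
# on comparable ranges, which matters for distance-based models.
clf2_scaled = KNeighborsClassifier(n_neighbors=10)
clf2_scaled.fit(X_train, y_train)
clf2_scaled.score(X_test, y_test)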

df["Prediction"] = clf2.predict(X3)
df
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease HD Prediction
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0 False False
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1 True False
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0 False True
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1 True True
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0 False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
913 45 M TA 110 264 0 Normal 132 N 1.2 Flat 1 True False
914 68 M ASY 144 193 1 Normal 141 N 3.4 Flat 1 True True
915 57 M ASY 130 131 0 Normal 115 Y 1.2 Flat 1 True True
916 57 F ATA 130 236 0 LVH 174 N 0.0 Flat 1 True False
917 38 M NAP 138 175 0 Normal 173 N 0.0 Up 0 False False

918 rows × 14 columns

c1 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="MaxHR",
    color="HD:N"
)
c2 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="MaxHR",
    color="Prediction:N"
)
c3 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="Oldpeak",
    color="HD:N"
)
c4 = alt.Chart(df).mark_circle().encode(
    x="Age",
    y="Oldpeak",
    color="Prediction:N"
)
c1 | c2
c3 | c4

Looking at the charts, the KNN predictor labels more points False than the actual data does. The same pattern appears in both pairs of charts, which plot the two features most strongly correlated with heart disease (MaxHR and Oldpeak) against Age.
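That visual impression can be checked numerically by comparing the label counts (a small sketch):

# Side-by-side counts of the actual labels and the KNN predictions
# over all 918 rows, to verify the impression from the charts.
pd.DataFrame({"actual": df["HD"].value_counts(),
              "predicted": df["Prediction"].value_counts()})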

Summary#

My project aimed to predict whether a person has a heart condition based on some common risk factors. I used several different classifiers to see how well heart disease can be predicted from the given data. The decision tree and random forest classifiers had the best training and test scores, with both landing at about the same accuracy. Even though the models reached a fairly high accuracy of around 80%, the individual correlations were not strong enough to determine heart disease from any single factor alone: the strongest correlation between a risk factor and heart disease, for Oldpeak, was only about 0.4. Finally, even with relatively high accuracy, the models produced some false negatives, which would be a bad outcome in practice, since people might believe they don't have heart disease when they do.

References#


  • What is the source of your dataset(s)?

https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction

  • List any other references that you found helpful.

https://www.geeksforgeeks.org/how-to-create-a-seaborn-correlation-heatmap-in-python/
https://www.kaggle.com/code/durgancegaur/a-guide-to-any-classification-problem
https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn
https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
