Payment and treatment in U.S. Hospital

Author: Linjun Zhou

Course Project, UC Irvine, Math 10, S22

Introduction

The dataset “Payment_and_value_of_care_-_Hospital.csv” includes what Medicaid paid for three types of heart disease in each U.S. state. In this project, the payments for the three diseases are separated out to provide a more accurate estimate of the health care situation in each state. And then, The K-Nearest Neighbors Regressor is used to infer the payment association between the three diseases.

Main portion of the project

Dataset Adjustment

import pandas as pd
import altair as alt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
df = pd.read_csv("/work/Payment_and_value_of_care_-_Hospital.csv")
# drop unnecessary data
df = df[[i for i in df.columns if i in ['Hospital name', 'State', 'Payment measure ID', 'Payment']]]
df = df.dropna()
# remove characters and convert type of column
for i in ["Payment"]:
    for a in df.index:
        df.loc[a,i] = df.loc[a,i].replace("$","")
for i in ["Payment"]:
    df[i] = pd.to_numeric(df[i])
df
Hospital name State Payment measure ID Payment
0 MARSHALL MEDICAL CENTER SOUTH AL PAYM_30_AMI 23171.0
1 MARSHALL MEDICAL CENTER SOUTH AL PAYM_30_HF 16376.0
2 MARSHALL MEDICAL CENTER SOUTH AL PAYM_30_PN 14384.0
4 WEDOWEE HOSPITAL AL PAYM_30_HF 16649.0
5 WEDOWEE HOSPITAL AL PAYM_30_PN 13168.0
... ... ... ... ...
14446 SETON MEDICAL CENTER HAYS TX PAYM_30_HF 17189.0
14448 NORTH CYPRESS MEDICAL CENTER TX PAYM_30_AMI 23587.0
14451 MEMORIAL MEDICAL CENTER WI PAYM_30_PN 13813.0
14452 STAR VALLEY MEDICAL CENTER WY PAYM_30_PN 18226.0
14453 LAKEWAY REGIONAL MEDICAL CENTER, LLC TX PAYM_30_HF 17076.0

9880 rows × 4 columns

df["Payment measure ID"].unique()
array(['PAYM_30_AMI', 'PAYM_30_HF', 'PAYM_30_PN'], dtype=object)

To better distinguish different payments, I create new dataframe which divides “payments” into three payments according to the payment measure.

for a,b in df.groupby('Payment measure ID'):
    if a == 'PAYM_30_AMI':
        df1 = b[[i for i in df.columns if i != 'Payment measure ID']]
    elif a == 'PAYM_30_HF':
        df_HF = b[[i for i in df.columns if i != 'Payment measure ID']]
    else:
        df_PN = b[[i for i in df.columns if i != 'Payment measure ID']]

df1.rename(columns={'Payment': 'AMI_Payment'}, inplace=True)

df1 = df1.reset_index()
df_HF = df_HF.reset_index()
df_PN = df_PN.reset_index()

df1.drop('index', axis=1, inplace=True)
df_HF.drop('index', axis=1, inplace=True)
df_PN.drop('index', axis=1, inplace=True)

df1['HF_Payment'] = df_HF['Payment']
df1['PN_Payment'] = df_PN['Payment']
df1.head()
Hospital name State AMI_Payment HF_Payment PN_Payment
0 MARSHALL MEDICAL CENTER SOUTH AL 23171.0 16376.0 14384.0
1 CRESTWOOD MEDICAL CENTER AL 20007.0 16649.0 13168.0
2 PROVIDENCE ALASKA MEDICAL CENTER AK 24309.0 14229.0 13258.0
3 CHI-ST VINCENT INFIRMARY AR 23600.0 15339.0 12303.0
4 CHICOT MEMORIAL MEDICAL CENTER AR 23543.0 14558.0 10817.0
numcols = ['AMI_Payment', 'HF_Payment', 'PN_Payment']
df1[numcols]
AMI_Payment HF_Payment PN_Payment
0 23171.0 16376.0 14384.0
1 20007.0 16649.0 13168.0
2 24309.0 14229.0 13258.0
3 23600.0 15339.0 12303.0
4 23543.0 14558.0 10817.0
... ... ... ...
2338 20340.0 15377.0 13525.0
2339 22608.0 15453.0 15425.0
2340 23941.0 17143.0 13979.0
2341 22231.0 17935.0 12097.0
2342 23587.0 16171.0 13932.0

2343 rows × 3 columns

Some trials about clustering

The graph below is the overview of three treatments’ payments. It shows that payment for heart attack patients (with payment measure id “PAYM_30_AMI”) has the highest cost.

alt.data_transformers.enable('default', max_rows=None)
c1 = alt.Chart(df).mark_boxplot(size=50, extent=0.5).encode(
    x="Payment measure ID",
    y=alt.Y('Payment',scale=alt.Scale(zero=False))
).properties(width=300).configure_axis(
    labelFontSize=16,
    titleFontSize=16
)
c1
kmeans1 = KMeans(n_clusters=2)
kmeans1.fit(df1[numcols])
KMeans(n_clusters=2)
df1['pred'] = kmeans1.predict(df1[numcols])
df1
Hospital name State AMI_Payment HF_Payment PN_Payment pred
0 MARSHALL MEDICAL CENTER SOUTH AL 23171.0 16376.0 14384.0 0
1 CRESTWOOD MEDICAL CENTER AL 20007.0 16649.0 13168.0 0
2 PROVIDENCE ALASKA MEDICAL CENTER AK 24309.0 14229.0 13258.0 1
3 CHI-ST VINCENT INFIRMARY AR 23600.0 15339.0 12303.0 1
4 CHICOT MEMORIAL MEDICAL CENTER AR 23543.0 14558.0 10817.0 0
... ... ... ... ... ... ...
2338 MEMORIAL MEDICAL CENTER WI 20340.0 15377.0 13525.0 0
2339 BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE TX 22608.0 15453.0 15425.0 0
2340 BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY TX 23941.0 17143.0 13979.0 1
2341 SETON MEDICAL CENTER HARKER HEIGHTS TX 22231.0 17935.0 12097.0 0
2342 NORTH CYPRESS MEDICAL CENTER TX 23587.0 16171.0 13932.0 1

2343 rows × 6 columns

c1=[]
for i in numcols:
    c1.append(alt.Chart(df1).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = "pred:N"
    ))
c1[0]&c1[1]&c1[2]

I use KMeans here and want to classify the hospital into two categories, one represents high payments and other represents low payments. However, according to the graphs above, only the first figure is reasonable and valueable. And the reason might be the generally higher value of first figure, which means it will account for a larger portion of the algorithm. So next I will rescale data and try cluster again.

scaler = StandardScaler(with_mean=True, with_std=False)
scaler.fit(df1[numcols])
StandardScaler(with_std=False)
df2 = df1.copy()
df2[numcols] = scaler.transform(df1[numcols])
kmeans2 = KMeans(n_clusters=2)
kmeans2.fit(df2[numcols])
KMeans(n_clusters=2)
df2['pred'] = kmeans2.predict(df2[numcols])
alt.Chart(df2).mark_circle().encode(
    x = 'PN_Payment',
    y = 'HF_Payment',
    color = "pred:N"
)

This chart is pretty chaos, it shows that my second cluster does not work either.

c2=[]
for i in numcols:
    c2.append(alt.Chart(df2).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = "pred:N"
    ))
c2[0]&c2[1]&c2[2]

This chart also makes no sense, the cluster only work for one payment. So next I will use KMeans to predict three payments seperately.

for i in numcols:
    kmeansi = KMeans(n_clusters=2)
    kmeansi.fit(df2[[i]])
    df2[f'pred_{i}'] = kmeansi.predict(df2[[i]])
df2
Hospital name State AMI_Payment HF_Payment PN_Payment pred pred_AMI_Payment pred_HF_Payment pred_PN_Payment cluster type_state
714 BAYHEALTH - KENT GENERAL HOSPITAL DE -1201.681178 1925.911652 662.243278 1 0 1 1 3 2.166667
625 BAYHEALTH - MILFORD MEMORIAL HOSPITAL DE 986.318822 3038.911652 1329.243278 0 1 1 1 0 2.166667
623 CHRISTIANA CARE HEALTH SERVICES, INC. DE 1057.318822 906.911652 -676.756722 0 1 1 0 1 2.166667
485 NANTICOKE MEMORIAL HOSPITAL DE 1048.318822 -1004.088348 374.243278 0 1 0 1 2 2.166667
562 BEEBE MEDICAL CENTER DE 3679.318822 -1690.088348 834.243278 0 1 0 1 2 2.166667
... ... ... ... ... ... ... ... ... ... ... ...
1931 SANFORD USD MEDICAL CENTER SD -510.681178 -952.088348 -2479.756722 1 0 0 0 7 6.181818
1910 AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE... SD -935.681178 -2082.088348 -1312.756722 1 0 0 0 7 6.181818
1922 RAPID CITY REGIONAL HOSPITAL SD -109.681178 -646.088348 767.243278 1 0 0 1 6 6.181818
1984 BROOKINGS HEALTH SYSTEM SD -1803.681178 -28.088348 -110.756722 1 0 0 0 7 6.181818
150 COMMONWEALTH HEALTH CENTER MP -6907.681178 -1760.088348 -1121.756722 1 0 0 0 7 7.000000

2343 rows × 11 columns

c3=[]
for i in numcols:
    c3.append(alt.Chart(df2).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = f"pred_{i}:N"
    ))
c3[0]&c3[1]&c3[2]

Each payment is divided into two clusters, and then I will aggregate the eight combination cases of the clusters of three payments.

df2['cluster'] = 0
i = "pred_AMI_Payment"
j = "pred_HF_Payment"
k = "pred_PN_Payment"
df2.loc[(df2[i] == 0) & (df2[j] == 0) & (df2[k] == 1), 'cluster'] = 1
df2.loc[(df2[i] == 0) & (df2[j] == 1) & (df2[k] == 0), 'cluster'] = 2
df2.loc[(df2[i] == 1) & (df2[j] == 0) & (df2[k] == 0), 'cluster'] = 3
df2.loc[(df2[i] == 0) & (df2[j] == 1) & (df2[k] == 1), 'cluster'] = 4
df2.loc[(df2[i] == 1) & (df2[j] == 0) & (df2[k] == 1), 'cluster'] = 5
df2.loc[(df2[i] == 1) & (df2[j] == 1) & (df2[k] == 0), 'cluster'] = 6
df2.loc[(df2[i] == 1) & (df2[j] == 1) & (df2[k] == 1), 'cluster'] = 7

The aggregation is from 0 to 7. The default number is 0 so I did not reset it here. The larger the aggregation number, the higher the payment.

df2
Hospital name State AMI_Payment HF_Payment PN_Payment pred pred_AMI_Payment pred_HF_Payment pred_PN_Payment cluster type_state
714 BAYHEALTH - KENT GENERAL HOSPITAL DE -1201.681178 1925.911652 662.243278 1 0 1 1 4 2.166667
625 BAYHEALTH - MILFORD MEMORIAL HOSPITAL DE 986.318822 3038.911652 1329.243278 0 1 1 1 7 2.166667
623 CHRISTIANA CARE HEALTH SERVICES, INC. DE 1057.318822 906.911652 -676.756722 0 1 1 0 6 2.166667
485 NANTICOKE MEMORIAL HOSPITAL DE 1048.318822 -1004.088348 374.243278 0 1 0 1 5 2.166667
562 BEEBE MEDICAL CENTER DE 3679.318822 -1690.088348 834.243278 0 1 0 1 5 2.166667
... ... ... ... ... ... ... ... ... ... ... ...
1931 SANFORD USD MEDICAL CENTER SD -510.681178 -952.088348 -2479.756722 1 0 0 0 0 6.181818
1910 AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE... SD -935.681178 -2082.088348 -1312.756722 1 0 0 0 0 6.181818
1922 RAPID CITY REGIONAL HOSPITAL SD -109.681178 -646.088348 767.243278 1 0 0 1 1 6.181818
1984 BROOKINGS HEALTH SYSTEM SD -1803.681178 -28.088348 -110.756722 1 0 0 0 0 6.181818
150 COMMONWEALTH HEALTH CENTER MP -6907.681178 -1760.088348 -1121.756722 1 0 0 0 0 7.000000

2343 rows × 11 columns

alt.Chart(df2).mark_circle().encode(
    x = 'HF_Payment',
    y = 'AMI_Payment',
    color = "cluster:N"
)

This chart is still a bit messy, but it is much clear than charts in my previous trials. In this chart, clusters with first four lower numbers are below axis = 0, and other four clusters are above axis = 0.

c4 = alt.Chart(df2).mark_boxplot().encode(
    x = 'State',
    y = 'mean(cluster)'
)
c4
selection = alt.selection_single()
c5 = alt.Chart(df2).mark_circle().encode(
    x = 'State',
    y = 'mean(cluster)',
    tooltip = [alt.Tooltip('mean(cluster)'), alt.Tooltip('State')]
).add_selection(selection)

c5

Here I make other two charts to visualize the average payments of each state. From the circle chart we can know that states like MP and SD have the lowest values of payments, and states like DE and NJ have the highest values of payments.

Finding overall payments of each state

x = df2['State'].unique()
df2['type_state'] = 0
for a,b in df2.groupby('State'):
    for i in range(len(x)):
        if a == x[i]:
            df2.loc[b.index,'type_state'] = b['cluster'].mean()

I create a new column which represents the average of each state’s cluster.

df2['type_state'].unique()
array([4.83333333, 4.77777778, 4.73684211, 4.68421053, 4.21568627,
       4.04347826, 4.        , 3.96296296, 3.81632653, 3.88888889,
       3.61111111, 3.57142857, 3.5       , 3.44791667, 3.40291262,
       3.40462428, 3.40909091, 3.31372549, 3.25      , 3.16666667,
       3.15555556, 3.11111111, 3.10869565, 3.04      , 2.96551724,
       2.86666667, 2.8       , 2.78125   , 2.75862069, 2.72727273,
       2.70149254, 2.69014085, 2.65625   , 2.59259259, 2.58064516,
       2.57142857, 2.53333333, 2.525     , 2.5       , 2.38181818,
       2.35555556, 2.34615385, 2.33333333, 2.29166667, 2.11111111,
       2.07692308, 2.        , 1.90909091, 1.58333333, 1.54166667,
       1.        , 0.81818182, 0.        ])
df2 = df2.sort_values(['type_state'])
df2
Hospital name State AMI_Payment HF_Payment PN_Payment pred pred_AMI_Payment pred_HF_Payment pred_PN_Payment cluster type_state
150 COMMONWEALTH HEALTH CENTER MP -6907.681178 -1760.088348 -1121.756722 1 0 0 0 0 0.000000
1915 PRAIRIE LAKES HOSPITAL SD -221.681178 638.911652 -544.756722 1 0 1 0 2 0.818182
1913 AVERA ST LUKES SD -1586.681178 -737.088348 573.243278 1 0 0 1 1 0.818182
1872 AVERA SACRED HEART HOSPITAL SD -1060.681178 -1325.088348 597.243278 1 0 0 1 1 0.818182
1891 HURON REGIONAL MEDICAL CENTER SD -1659.681178 -1845.088348 -788.756722 1 0 0 0 0 0.818182
... ... ... ... ... ... ... ... ... ... ... ...
562 BEEBE MEDICAL CENTER DE 3679.318822 -1690.088348 834.243278 0 1 0 1 5 4.833333
485 NANTICOKE MEMORIAL HOSPITAL DE 1048.318822 -1004.088348 374.243278 0 1 0 1 5 4.833333
623 CHRISTIANA CARE HEALTH SERVICES, INC. DE 1057.318822 906.911652 -676.756722 0 1 1 0 6 4.833333
625 BAYHEALTH - MILFORD MEMORIAL HOSPITAL DE 986.318822 3038.911652 1329.243278 0 1 1 1 7 4.833333
714 BAYHEALTH - KENT GENERAL HOSPITAL DE -1201.681178 1925.911652 662.243278 1 0 1 1 4 4.833333

2343 rows × 11 columns

selection = alt.selection_single()
c6 = alt.Chart(df2).mark_circle().encode(
    y = 'type_state',
    x = 'type_state',
    color = 'State',
    tooltip = [alt.Tooltip('type_state'), alt.Tooltip('State')]
).add_selection(selection)
c6

Here I chart the average of each state’s cluster in order.

Assuming type_state >= 3.5 is outstanding states. So we have:

temp_lst = [df2.loc[i,'State'] for i in df2.index if df2.loc[i,'type_state'] >= 3.5]
o_state = np.array(temp_lst)
o_state = np.unique(o_state)
print(f"Outstanding states are: {o_state}")
Outstanding states are: ['CT' 'DC' 'DE' 'FL' 'IL' 'KS' 'MA' 'NE' 'NH' 'NJ' 'NV' 'RI' 'UT']

Section 2 : Predict the AMI payment by using the K-Nearest Neighbors Regressor

df3 = df1.copy().drop('pred', axis=1)
df3
Hospital name State AMI_Payment HF_Payment PN_Payment
0 MARSHALL MEDICAL CENTER SOUTH AL 23171.0 16376.0 14384.0
1 CRESTWOOD MEDICAL CENTER AL 20007.0 16649.0 13168.0
2 PROVIDENCE ALASKA MEDICAL CENTER AK 24309.0 14229.0 13258.0
3 CHI-ST VINCENT INFIRMARY AR 23600.0 15339.0 12303.0
4 CHICOT MEMORIAL MEDICAL CENTER AR 23543.0 14558.0 10817.0
... ... ... ... ... ...
2338 MEMORIAL MEDICAL CENTER WI 20340.0 15377.0 13525.0
2339 BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE TX 22608.0 15453.0 15425.0
2340 BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY TX 23941.0 17143.0 13979.0
2341 SETON MEDICAL CENTER HARKER HEIGHTS TX 22231.0 17935.0 12097.0
2342 NORTH CYPRESS MEDICAL CENTER TX 23587.0 16171.0 13932.0

2343 rows × 5 columns

X_train, X_test, y_train, y_test = train_test_split(
    df3[["HF_Payment", "PN_Payment"]], df3["AMI_Payment"], test_size = 0.4)
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=2)
df3['pred'] = reg.predict(df3[["HF_Payment", "PN_Payment"]])
df3
Hospital name State AMI_Payment HF_Payment PN_Payment pred
0 MARSHALL MEDICAL CENTER SOUTH AL 23171.0 16376.0 14384.0 21394.5
1 CRESTWOOD MEDICAL CENTER AL 20007.0 16649.0 13168.0 24125.0
2 PROVIDENCE ALASKA MEDICAL CENTER AK 24309.0 14229.0 13258.0 23598.0
3 CHI-ST VINCENT INFIRMARY AR 23600.0 15339.0 12303.0 22985.5
4 CHICOT MEMORIAL MEDICAL CENTER AR 23543.0 14558.0 10817.0 23346.0
... ... ... ... ... ... ...
2338 MEMORIAL MEDICAL CENTER WI 20340.0 15377.0 13525.0 22522.5
2339 BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE TX 22608.0 15453.0 15425.0 22758.5
2340 BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY TX 23941.0 17143.0 13979.0 22909.5
2341 SETON MEDICAL CENTER HARKER HEIGHTS TX 22231.0 17935.0 12097.0 21336.5
2342 NORTH CYPRESS MEDICAL CENTER TX 23587.0 16171.0 13932.0 23185.5

2343 rows × 6 columns

c11 = alt.Chart(df3).mark_circle().encode(
    x = alt.X('HF_Payment', scale=alt.Scale(zero=False)),
    y = alt.Y('pred', scale=alt.Scale(zero=False))
)
c12 = alt.Chart(df3).mark_circle(color='purple').encode(
    x = alt.X('HF_Payment', scale=alt.Scale(zero=False)),
    y = alt.Y('AMI_Payment', scale=alt.Scale(zero=False))
)
c11+c12
reg.score(df3[["HF_Payment", "PN_Payment"]], df3[['AMI_Payment']])
0.06560938956960727

Although two charts above are pretty similar, the accuracy of prediction is significantly low. Hence, there is no relation between HF_Payment, PN_Payment and AMI_Payment.

Summary

Either summarize what you did, or summarize the results. Maybe 3 sentences.

In this project, I used KMeans to group hospital payments by state and to visualize the overall payments by state. Basically, this result is in accordance with the real situation that the top three states, DE, NJ, and NE, with highest payments are very high ranked in terms of health care in the U.S. In addition, I used two of the payment scenarios to predict the other payment scenario, but the prediction accuracy was very low, so there was no connection between them.

References

  • What is the source of your dataset(s)?

The dataset “Payment_and_value_of_care_-_Hospital.csv” was adapted from Hospital Payment and Value of Care

  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

  • List other references that you found helpful.

Created in deepnote.com Created in Deepnote