Payment and treatment in U.S. Hospital¶

Author: Linjun Zhou

Course Project, UC Irvine, Math 10, S22

Introduction¶

The dataset “Payment_and_value_of_care_-_Hospital.csv” includes what Medicaid paid for three types of heart disease in each U.S. state. In this project, the payments for the three diseases are separated out to provide a more accurate estimate of the health care situation in each state. And then, The K-Nearest Neighbors Regressor is used to infer the payment association between the three diseases.

Main portion of the project¶

Dataset Adjustment¶

import pandas as pd
import altair as alt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("/work/Payment_and_value_of_care_-_Hospital.csv")

# drop unnecessary data
df = df[[i for i in df.columns if i in ['Hospital name', 'State', 'Payment measure ID', 'Payment']]]
df = df.dropna()

# remove characters and convert type of column
for i in ["Payment"]:
    for a in df.index:
        df.loc[a,i] = df.loc[a,i].replace("$","")
for i in ["Payment"]:
    df[i] = pd.to_numeric(df[i])

df

	Hospital name	State	Payment measure ID	Payment
0	MARSHALL MEDICAL CENTER SOUTH	AL	PAYM_30_AMI	23171.0
1	MARSHALL MEDICAL CENTER SOUTH	AL	PAYM_30_HF	16376.0
2	MARSHALL MEDICAL CENTER SOUTH	AL	PAYM_30_PN	14384.0
4	WEDOWEE HOSPITAL	AL	PAYM_30_HF	16649.0
5	WEDOWEE HOSPITAL	AL	PAYM_30_PN	13168.0
...	...	...	...	...
14446	SETON MEDICAL CENTER HAYS	TX	PAYM_30_HF	17189.0
14448	NORTH CYPRESS MEDICAL CENTER	TX	PAYM_30_AMI	23587.0
14451	MEMORIAL MEDICAL CENTER	WI	PAYM_30_PN	13813.0
14452	STAR VALLEY MEDICAL CENTER	WY	PAYM_30_PN	18226.0
14453	LAKEWAY REGIONAL MEDICAL CENTER, LLC	TX	PAYM_30_HF	17076.0

9880 rows × 4 columns

df["Payment measure ID"].unique()

array(['PAYM_30_AMI', 'PAYM_30_HF', 'PAYM_30_PN'], dtype=object)

To better distinguish different payments, I create new dataframe which divides “payments” into three payments according to the payment measure.

for a,b in df.groupby('Payment measure ID'):
    if a == 'PAYM_30_AMI':
        df1 = b[[i for i in df.columns if i != 'Payment measure ID']]
    elif a == 'PAYM_30_HF':
        df_HF = b[[i for i in df.columns if i != 'Payment measure ID']]
    else:
        df_PN = b[[i for i in df.columns if i != 'Payment measure ID']]

df1.rename(columns={'Payment': 'AMI_Payment'}, inplace=True)

df1 = df1.reset_index()
df_HF = df_HF.reset_index()
df_PN = df_PN.reset_index()

df1.drop('index', axis=1, inplace=True)
df_HF.drop('index', axis=1, inplace=True)
df_PN.drop('index', axis=1, inplace=True)

df1['HF_Payment'] = df_HF['Payment']
df1['PN_Payment'] = df_PN['Payment']

df1.head()

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment
0	MARSHALL MEDICAL CENTER SOUTH	AL	23171.0	16376.0	14384.0
1	CRESTWOOD MEDICAL CENTER	AL	20007.0	16649.0	13168.0
2	PROVIDENCE ALASKA MEDICAL CENTER	AK	24309.0	14229.0	13258.0
3	CHI-ST VINCENT INFIRMARY	AR	23600.0	15339.0	12303.0
4	CHICOT MEMORIAL MEDICAL CENTER	AR	23543.0	14558.0	10817.0

numcols = ['AMI_Payment', 'HF_Payment', 'PN_Payment']
df1[numcols]

	AMI_Payment	HF_Payment	PN_Payment
0	23171.0	16376.0	14384.0
1	20007.0	16649.0	13168.0
2	24309.0	14229.0	13258.0
3	23600.0	15339.0	12303.0
4	23543.0	14558.0	10817.0
...	...	...	...
2338	20340.0	15377.0	13525.0
2339	22608.0	15453.0	15425.0
2340	23941.0	17143.0	13979.0
2341	22231.0	17935.0	12097.0
2342	23587.0	16171.0	13932.0

2343 rows × 3 columns

Some trials about clustering¶

The graph below is the overview of three treatments’ payments. It shows that payment for heart attack patients (with payment measure id “PAYM_30_AMI”) has the highest cost.

alt.data_transformers.enable('default', max_rows=None)
c1 = alt.Chart(df).mark_boxplot(size=50, extent=0.5).encode(
    x="Payment measure ID",
    y=alt.Y('Payment',scale=alt.Scale(zero=False))
).properties(width=300).configure_axis(
    labelFontSize=16,
    titleFontSize=16
)
c1

kmeans1 = KMeans(n_clusters=2)
kmeans1.fit(df1[numcols])

KMeans(n_clusters=2)

df1['pred'] = kmeans1.predict(df1[numcols])
df1

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment	pred
0	MARSHALL MEDICAL CENTER SOUTH	AL	23171.0	16376.0	14384.0	0
1	CRESTWOOD MEDICAL CENTER	AL	20007.0	16649.0	13168.0	0
2	PROVIDENCE ALASKA MEDICAL CENTER	AK	24309.0	14229.0	13258.0	1
3	CHI-ST VINCENT INFIRMARY	AR	23600.0	15339.0	12303.0	1
4	CHICOT MEMORIAL MEDICAL CENTER	AR	23543.0	14558.0	10817.0	0
...	...	...	...	...	...	...
2338	MEMORIAL MEDICAL CENTER	WI	20340.0	15377.0	13525.0	0
2339	BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE	TX	22608.0	15453.0	15425.0	0
2340	BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY	TX	23941.0	17143.0	13979.0	1
2341	SETON MEDICAL CENTER HARKER HEIGHTS	TX	22231.0	17935.0	12097.0	0
2342	NORTH CYPRESS MEDICAL CENTER	TX	23587.0	16171.0	13932.0	1

2343 rows × 6 columns

c1=[]
for i in numcols:
    c1.append(alt.Chart(df1).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = "pred:N"
    ))
c1[0]&c1[1]&c1[2]

I use KMeans here and want to classify the hospital into two categories, one represents high payments and other represents low payments. However, according to the graphs above, only the first figure is reasonable and valueable. And the reason might be the generally higher value of first figure, which means it will account for a larger portion of the algorithm. So next I will rescale data and try cluster again.

scaler = StandardScaler(with_mean=True, with_std=False)
scaler.fit(df1[numcols])

StandardScaler(with_std=False)

df2 = df1.copy()
df2[numcols] = scaler.transform(df1[numcols])

kmeans2 = KMeans(n_clusters=2)
kmeans2.fit(df2[numcols])

KMeans(n_clusters=2)

df2['pred'] = kmeans2.predict(df2[numcols])

alt.Chart(df2).mark_circle().encode(
    x = 'PN_Payment',
    y = 'HF_Payment',
    color = "pred:N"
)

This chart is pretty chaos, it shows that my second cluster does not work either.

c2=[]
for i in numcols:
    c2.append(alt.Chart(df2).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = "pred:N"
    ))
c2[0]&c2[1]&c2[2]

This chart also makes no sense, the cluster only work for one payment. So next I will use KMeans to predict three payments seperately.

for i in numcols:
    kmeansi = KMeans(n_clusters=2)
    kmeansi.fit(df2[[i]])
    df2[f'pred_{i}'] = kmeansi.predict(df2[[i]])
df2

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment	pred	pred_AMI_Payment	pred_HF_Payment	pred_PN_Payment	cluster	type_state
714	BAYHEALTH - KENT GENERAL HOSPITAL	DE	-1201.681178	1925.911652	662.243278	1	0	1	1	3	2.166667
625	BAYHEALTH - MILFORD MEMORIAL HOSPITAL	DE	986.318822	3038.911652	1329.243278	0	1	1	1	0	2.166667
623	CHRISTIANA CARE HEALTH SERVICES, INC.	DE	1057.318822	906.911652	-676.756722	0	1	1	0	1	2.166667
485	NANTICOKE MEMORIAL HOSPITAL	DE	1048.318822	-1004.088348	374.243278	0	1	0	1	2	2.166667
562	BEEBE MEDICAL CENTER	DE	3679.318822	-1690.088348	834.243278	0	1	0	1	2	2.166667
...	...	...	...	...	...	...	...	...	...	...	...
1931	SANFORD USD MEDICAL CENTER	SD	-510.681178	-952.088348	-2479.756722	1	0	0	0	7	6.181818
1910	AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE...	SD	-935.681178	-2082.088348	-1312.756722	1	0	0	0	7	6.181818
1922	RAPID CITY REGIONAL HOSPITAL	SD	-109.681178	-646.088348	767.243278	1	0	0	1	6	6.181818
1984	BROOKINGS HEALTH SYSTEM	SD	-1803.681178	-28.088348	-110.756722	1	0	0	0	7	6.181818
150	COMMONWEALTH HEALTH CENTER	MP	-6907.681178	-1760.088348	-1121.756722	1	0	0	0	7	7.000000

2343 rows × 11 columns

c3=[]
for i in numcols:
    c3.append(alt.Chart(df2).mark_circle().encode(
        x = alt.X('State'),
        y = alt.Y(i),
        color = f"pred_{i}:N"
    ))
c3[0]&c3[1]&c3[2]

Each payment is divided into two clusters, and then I will aggregate the eight combination cases of the clusters of three payments.

df2['cluster'] = 0
i = "pred_AMI_Payment"
j = "pred_HF_Payment"
k = "pred_PN_Payment"
df2.loc[(df2[i] == 0) & (df2[j] == 0) & (df2[k] == 1), 'cluster'] = 1
df2.loc[(df2[i] == 0) & (df2[j] == 1) & (df2[k] == 0), 'cluster'] = 2
df2.loc[(df2[i] == 1) & (df2[j] == 0) & (df2[k] == 0), 'cluster'] = 3
df2.loc[(df2[i] == 0) & (df2[j] == 1) & (df2[k] == 1), 'cluster'] = 4
df2.loc[(df2[i] == 1) & (df2[j] == 0) & (df2[k] == 1), 'cluster'] = 5
df2.loc[(df2[i] == 1) & (df2[j] == 1) & (df2[k] == 0), 'cluster'] = 6
df2.loc[(df2[i] == 1) & (df2[j] == 1) & (df2[k] == 1), 'cluster'] = 7

The aggregation is from 0 to 7. The default number is 0 so I did not reset it here. The larger the aggregation number, the higher the payment.

df2

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment	pred	pred_AMI_Payment	pred_HF_Payment	pred_PN_Payment	cluster	type_state
714	BAYHEALTH - KENT GENERAL HOSPITAL	DE	-1201.681178	1925.911652	662.243278	1	0	1	1	4	2.166667
625	BAYHEALTH - MILFORD MEMORIAL HOSPITAL	DE	986.318822	3038.911652	1329.243278	0	1	1	1	7	2.166667
623	CHRISTIANA CARE HEALTH SERVICES, INC.	DE	1057.318822	906.911652	-676.756722	0	1	1	0	6	2.166667
485	NANTICOKE MEMORIAL HOSPITAL	DE	1048.318822	-1004.088348	374.243278	0	1	0	1	5	2.166667
562	BEEBE MEDICAL CENTER	DE	3679.318822	-1690.088348	834.243278	0	1	0	1	5	2.166667
...	...	...	...	...	...	...	...	...	...	...	...
1931	SANFORD USD MEDICAL CENTER	SD	-510.681178	-952.088348	-2479.756722	1	0	0	0	0	6.181818
1910	AVERA MCKENNAN HOSPITAL & UNIVERSITY HEALTH CE...	SD	-935.681178	-2082.088348	-1312.756722	1	0	0	0	0	6.181818
1922	RAPID CITY REGIONAL HOSPITAL	SD	-109.681178	-646.088348	767.243278	1	0	0	1	1	6.181818
1984	BROOKINGS HEALTH SYSTEM	SD	-1803.681178	-28.088348	-110.756722	1	0	0	0	0	6.181818
150	COMMONWEALTH HEALTH CENTER	MP	-6907.681178	-1760.088348	-1121.756722	1	0	0	0	0	7.000000

2343 rows × 11 columns

alt.Chart(df2).mark_circle().encode(
    x = 'HF_Payment',
    y = 'AMI_Payment',
    color = "cluster:N"
)

This chart is still a bit messy, but it is much clear than charts in my previous trials. In this chart, clusters with first four lower numbers are below axis = 0, and other four clusters are above axis = 0.

c4 = alt.Chart(df2).mark_boxplot().encode(
    x = 'State',
    y = 'mean(cluster)'
)
c4

selection = alt.selection_single()
c5 = alt.Chart(df2).mark_circle().encode(
    x = 'State',
    y = 'mean(cluster)',
    tooltip = [alt.Tooltip('mean(cluster)'), alt.Tooltip('State')]
).add_selection(selection)

c5

Here I make other two charts to visualize the average payments of each state. From the circle chart we can know that states like MP and SD have the lowest values of payments, and states like DE and NJ have the highest values of payments.

Finding overall payments of each state¶

x = df2['State'].unique()
df2['type_state'] = 0
for a,b in df2.groupby('State'):
    for i in range(len(x)):
        if a == x[i]:
            df2.loc[b.index,'type_state'] = b['cluster'].mean()

I create a new column which represents the average of each state’s cluster.

df2['type_state'].unique()

array([4.83333333, 4.77777778, 4.73684211, 4.68421053, 4.21568627,
       4.04347826, 4.        , 3.96296296, 3.81632653, 3.88888889,
       3.61111111, 3.57142857, 3.5       , 3.44791667, 3.40291262,
       3.40462428, 3.40909091, 3.31372549, 3.25      , 3.16666667,
       3.15555556, 3.11111111, 3.10869565, 3.04      , 2.96551724,
       2.86666667, 2.8       , 2.78125   , 2.75862069, 2.72727273,
       2.70149254, 2.69014085, 2.65625   , 2.59259259, 2.58064516,
       2.57142857, 2.53333333, 2.525     , 2.5       , 2.38181818,
       2.35555556, 2.34615385, 2.33333333, 2.29166667, 2.11111111,
       2.07692308, 2.        , 1.90909091, 1.58333333, 1.54166667,
       1.        , 0.81818182, 0.        ])

df2 = df2.sort_values(['type_state'])
df2

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment	pred	pred_AMI_Payment	pred_HF_Payment	pred_PN_Payment	cluster	type_state
150	COMMONWEALTH HEALTH CENTER	MP	-6907.681178	-1760.088348	-1121.756722	1	0	0	0	0	0.000000
1915	PRAIRIE LAKES HOSPITAL	SD	-221.681178	638.911652	-544.756722	1	0	1	0	2	0.818182
1913	AVERA ST LUKES	SD	-1586.681178	-737.088348	573.243278	1	0	0	1	1	0.818182
1872	AVERA SACRED HEART HOSPITAL	SD	-1060.681178	-1325.088348	597.243278	1	0	0	1	1	0.818182
1891	HURON REGIONAL MEDICAL CENTER	SD	-1659.681178	-1845.088348	-788.756722	1	0	0	0	0	0.818182
...	...	...	...	...	...	...	...	...	...	...	...
562	BEEBE MEDICAL CENTER	DE	3679.318822	-1690.088348	834.243278	0	1	0	1	5	4.833333
485	NANTICOKE MEMORIAL HOSPITAL	DE	1048.318822	-1004.088348	374.243278	0	1	0	1	5	4.833333
623	CHRISTIANA CARE HEALTH SERVICES, INC.	DE	1057.318822	906.911652	-676.756722	0	1	1	0	6	4.833333
625	BAYHEALTH - MILFORD MEMORIAL HOSPITAL	DE	986.318822	3038.911652	1329.243278	0	1	1	1	7	4.833333
714	BAYHEALTH - KENT GENERAL HOSPITAL	DE	-1201.681178	1925.911652	662.243278	1	0	1	1	4	4.833333

2343 rows × 11 columns

selection = alt.selection_single()
c6 = alt.Chart(df2).mark_circle().encode(
    y = 'type_state',
    x = 'type_state',
    color = 'State',
    tooltip = [alt.Tooltip('type_state'), alt.Tooltip('State')]
).add_selection(selection)
c6

Here I chart the average of each state’s cluster in order.

Assuming type_state >= 3.5 is outstanding states. So we have:

temp_lst = [df2.loc[i,'State'] for i in df2.index if df2.loc[i,'type_state'] >= 3.5]
o_state = np.array(temp_lst)
o_state = np.unique(o_state)
print(f"Outstanding states are: {o_state}")

Outstanding states are: ['CT' 'DC' 'DE' 'FL' 'IL' 'KS' 'MA' 'NE' 'NH' 'NJ' 'NV' 'RI' 'UT']

Section 2 : Predict the AMI payment by using the K-Nearest Neighbors Regressor¶

df3 = df1.copy().drop('pred', axis=1)
df3

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment
0	MARSHALL MEDICAL CENTER SOUTH	AL	23171.0	16376.0	14384.0
1	CRESTWOOD MEDICAL CENTER	AL	20007.0	16649.0	13168.0
2	PROVIDENCE ALASKA MEDICAL CENTER	AK	24309.0	14229.0	13258.0
3	CHI-ST VINCENT INFIRMARY	AR	23600.0	15339.0	12303.0
4	CHICOT MEMORIAL MEDICAL CENTER	AR	23543.0	14558.0	10817.0
...	...	...	...	...	...
2338	MEMORIAL MEDICAL CENTER	WI	20340.0	15377.0	13525.0
2339	BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE	TX	22608.0	15453.0	15425.0
2340	BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY	TX	23941.0	17143.0	13979.0
2341	SETON MEDICAL CENTER HARKER HEIGHTS	TX	22231.0	17935.0	12097.0
2342	NORTH CYPRESS MEDICAL CENTER	TX	23587.0	16171.0	13932.0

2343 rows × 5 columns

X_train, X_test, y_train, y_test = train_test_split(
    df3[["HF_Payment", "PN_Payment"]], df3["AMI_Payment"], test_size = 0.4)

reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=2)

df3['pred'] = reg.predict(df3[["HF_Payment", "PN_Payment"]])
df3

	Hospital name	State	AMI_Payment	HF_Payment	PN_Payment	pred
0	MARSHALL MEDICAL CENTER SOUTH	AL	23171.0	16376.0	14384.0	21394.5
1	CRESTWOOD MEDICAL CENTER	AL	20007.0	16649.0	13168.0	24125.0
2	PROVIDENCE ALASKA MEDICAL CENTER	AK	24309.0	14229.0	13258.0	23598.0
3	CHI-ST VINCENT INFIRMARY	AR	23600.0	15339.0	12303.0	22985.5
4	CHICOT MEMORIAL MEDICAL CENTER	AR	23543.0	14558.0	10817.0	23346.0
...	...	...	...	...	...	...
2338	MEMORIAL MEDICAL CENTER	WI	20340.0	15377.0	13525.0	22522.5
2339	BAYLOR SCOTT AND WHITE MEDICAL CENTER SUNNYVALE	TX	22608.0	15453.0	15425.0	22758.5
2340	BAYLOR SCOTT AND WHITE MEDICAL CENTER MCKINNEY	TX	23941.0	17143.0	13979.0	22909.5
2341	SETON MEDICAL CENTER HARKER HEIGHTS	TX	22231.0	17935.0	12097.0	21336.5
2342	NORTH CYPRESS MEDICAL CENTER	TX	23587.0	16171.0	13932.0	23185.5

2343 rows × 6 columns

c11 = alt.Chart(df3).mark_circle().encode(
    x = alt.X('HF_Payment', scale=alt.Scale(zero=False)),
    y = alt.Y('pred', scale=alt.Scale(zero=False))
)
c12 = alt.Chart(df3).mark_circle(color='purple').encode(
    x = alt.X('HF_Payment', scale=alt.Scale(zero=False)),
    y = alt.Y('AMI_Payment', scale=alt.Scale(zero=False))
)
c11+c12

reg.score(df3[["HF_Payment", "PN_Payment"]], df3[['AMI_Payment']])

0.06560938956960727

Although two charts above are pretty similar, the accuracy of prediction is significantly low. Hence, there is no relation between HF_Payment, PN_Payment and AMI_Payment.

Summary¶

Either summarize what you did, or summarize the results. Maybe 3 sentences.

In this project, I used KMeans to group hospital payments by state and to visualize the overall payments by state. Basically, this result is in accordance with the real situation that the top three states, DE, NJ, and NE, with highest payments are very high ranked in terms of health care in the U.S. In addition, I used two of the payment scenarios to predict the other payment scenario, but the prediction accuracy was very low, so there was no connection between them.

References¶

What is the source of your dataset(s)?

The dataset “Payment_and_value_of_care_-_Hospital.csv” was adapted from Hospital Payment and Value of Care

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

List other references that you found helpful.

Created in Deepnote

UC Irvine Math 10 S22

Payment and treatment in U.S. Hospital

Contents

Payment and treatment in U.S. Hospital¶

Introduction¶

Main portion of the project¶

Dataset Adjustment¶

Some trials about clustering¶

Finding overall payments of each state¶

Section 2 : Predict the AMI payment by using the K-Nearest Neighbors Regressor¶

Summary¶

References¶