Decision Tree Regressor Analysis on Insurance Charges: The Impact of Age, BMI, and Smoking Status


Author: Xifan Jiang

Course Project, UC Irvine, Math 10, S23

Introduction#

Insurance premiums vary significantly among individuals and are influenced by factors such as age, body mass index (BMI), and smoking habits. This project uses a Decision Tree Regressor, a machine learning algorithm, to model and predict insurance charges from these and other features. The objective is to explore how accurately these individual health indicators can predict insurance charges, which may offer useful insights for both insurers and policyholders.

Import Data and Clean Data#

import math

import altair as alt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree

df=pd.read_csv('insurance.csv')
df.head(5)

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

Check for missing values and describe the data.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB

df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Since there are no null values, we don’t need to drop any rows or columns for missing data.

We only need to retain and reorder the columns of interest (the region column is left out).

df=df[['charges','age','bmi','children','smoker','sex']]
df.head()

	charges	age	bmi	children	smoker	sex
0	16884.92400	19	27.900	0	yes	female
1	1725.55230	18	33.770	1	no	male
2	4449.46200	28	33.000	3	no	male
3	21984.47061	33	22.705	0	no	male
4	3866.85520	32	28.880	0	no	male

Create dummy variables for sex and smoker.

df = pd.get_dummies(df,drop_first=True)
df.head()

	charges	age	bmi	children	smoker_yes	sex_male
0	16884.92400	19	27.900	0	1	0
1	1725.55230	18	33.770	1	0	1
2	4449.46200	28	33.000	3	0	1
3	21984.47061	33	22.705	0	0	1
4	3866.85520	32	28.880	0	0	1
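To make the encoding step more concrete, here is a minimal sketch on a toy frame. The values are hypothetical and only mirror the shape of the `smoker` and `sex` columns. The `dtype=int` argument is an addition that is not in the cell above; newer pandas versions may otherwise show True/False instead of 1/0.

```python
import pandas as pd

# Toy frame with hypothetical values, mirroring the 'smoker' and 'sex' columns
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
    "sex": ["female", "male", "male"],
})

# drop_first=True drops the first category (in sorted order) of each text
# column: 'no' for smoker and 'female' for sex, leaving one indicator each.
# dtype=int forces 0/1 output regardless of the pandas version.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Dropping the first category means `smoker_yes = 0` stands for a non-smoker and `sex_male = 0` stands for female, so no information is lost.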

Visualize the Data

plt.figure(figsize=(8, 6))
df['smoker_yes'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Smoker vs. Non-Smoker Distribution')
plt.show()

../../_images/e2e37e809642cc74a456f0c850c0d01a6ebb7f78809794f8c4b3696181522c83.png

plt.figure(figsize=(8, 6))
sns.boxplot(data=df, y='charges')
plt.ylabel('Charges')
plt.title('Medical Charges')
plt.show()

../../_images/1e34cd91ef5f1deac104876c925be598556560d2eeb5bf1aef5098a0f43557f1.png

Use scatter plots to see which features may contribute to the charges.

c1=alt.Chart(df).mark_circle().encode(
    x='bmi',
    y='charges',
    color='sex_male:N',
)
c1
# This chart shows no obvious relationship between sex and charges

alt.Chart(df).mark_circle().encode(
    x='children:N',
    y='charges',
)
# There seems to be some relationship with the number of children, but it is weak

alt.Chart(df).mark_circle().encode(
    x='bmi',
    y='charges',
    color='smoker_yes:N',
)
# Smokers generally have higher charges

alt.Chart(df).mark_circle().encode(
    x=alt.X('age', scale=alt.Scale(domain=(18,64))),
    y='charges')

# Charges seem to rise with age, especially the lowest charge at each age

alt.Chart(df).mark_circle().encode(
    x=alt.X('bmi', scale=alt.Scale(domain=(10,55))),
    y='charges')
# BMI seems less important: a higher BMI seems to raise only the ceiling of charges, not the floor

Overall, these charts suggest that every feature except sex matters to some degree. Sex does not appear to be related to charges at all, while smoking appears to contribute the most. We will test this using linear regression and a decision tree regressor.

The correlation heatmap between the variables is another way to gauge importance. It shows that smoking status has the strongest correlation with charges, with age second.

corr_df = df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr_df, annot=True, cmap="coolwarm")

<AxesSubplot:>

../../_images/535c910c7bca3c7b5358eb7cafc62e9780ce09c6dea7a232959c1f987f1fcb5d.png
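The heatmap is built from `df.corr()`, which computes Pearson correlations by default. The sketch below shows how the same matrix could be turned into a ranking of features by their correlation with charges. The toy values are hypothetical and are not rows from insurance.csv.

```python
import pandas as pd

# Hypothetical toy values, not rows from insurance.csv
toy = pd.DataFrame({
    "charges": [16000.0, 1700.0, 4400.0, 22000.0, 3900.0, 30000.0],
    "age": [19, 18, 28, 33, 32, 60],
    "smoker_yes": [1, 0, 0, 0, 0, 1],
})

# DataFrame.corr() defaults to the Pearson correlation coefficient
corr = toy.corr()

# Rank the features by the absolute strength of their correlation with charges
ranking = corr["charges"].drop("charges").abs().sort_values(ascending=False)
print(ranking)
```

Correlation only measures linear association, so it is a rough guide rather than a full measure of importance.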

Split the Data for Train and Test#

features = ['age','bmi','children','smoker_yes','sex_male']

X = df[features]
y = df['charges']
# Note: train_size=0.2 trains on 20% of the rows and tests on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)
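The split above uses `train_size=0.2`, so only about a fifth of the rows are used for training. The sketch below compares this with `test_size=0.2` on placeholder arrays, to show how the two arguments differ. The arrays `X_demo` and `y_demo` are hypothetical stand-ins with the same row count as the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1338 placeholder rows, matching the row count reported by df.info()
X_demo = np.arange(1338).reshape(-1, 1)
y_demo = np.arange(1338)

# train_size=0.2: about 20% of the rows go to training, the rest to testing
X_tr, X_te, _, _ = train_test_split(X_demo, y_demo, train_size=0.2, random_state=42)
small_train = (len(X_tr), len(X_te))

# test_size=0.2: about 20% of the rows go to testing, the rest to training
X_tr, X_te, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
large_train = (len(X_tr), len(X_te))

print("train_size=0.2 ->", small_train)
print("test_size=0.2  ->", large_train)
```

Either choice is valid, but a smaller training set usually gives the model less to learn from, so it is worth checking which one is intended.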

Linear Regression#

reg=LinearRegression()
reg.fit(X_train, y_train)
print('The train score is ' + str(reg.score(X_train, y_train)))
print('The test score is ' + str(reg.score(X_test, y_test)))
print('The score is ' + str(reg.score(df[features], df['charges'])))

The train score is 0.7555272998081446
The test score is 0.7439322852344854
The score is 0.7464160661734538

The train and test scores are close (about 0.76 and 0.74), so the linear regression shows no sign of overfitting.
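The `score` method of a scikit-learn regressor returns the coefficient of determination R^2. As a minimal sketch of what that number means, the helper below (the name `r_squared` is hypothetical, not a library function) computes it by hand on made-up values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

# Perfect predictions leave no residual error
perfect = r_squared(y_true, [1.0, 2.0, 3.0, 4.0])

# Always predicting the mean (2.5) is no better than the baseline
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])

print(perfect, mean_only)
```

A score of about 0.75 therefore means the model removes roughly three quarters of the squared error of a mean-only prediction.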

reg_mse_train = mean_squared_error(y_train, reg.predict(X_train))
math.sqrt(reg_mse_train)

6187.598871935761

reg_mse_test = mean_squared_error(y_test, reg.predict(X_test))
math.sqrt(reg_mse_test)

6072.915873685427

# Baseline: RMSE of always predicting the mean charge
df['test'] = df['charges'].mean()
baseline_mse = mean_squared_error(df['charges'], df['test'])
math.sqrt(baseline_mse)

12105.484975561612
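This last value is the RMSE of a model that always predicts the mean charge, which serves as a baseline for the RMSE values above. It is the same quantity as the population standard deviation of the charges. The sketch below checks that identity on a few hypothetical values using only the standard library.

```python
import math
import statistics

# Hypothetical charges, used only to illustrate the identity
charges = [16884.92, 1725.55, 4449.46, 21984.47, 3866.86]

mean_charge = statistics.fmean(charges)

# RMSE of a model that always predicts the mean
baseline_rmse = math.sqrt(
    sum((c - mean_charge) ** 2 for c in charges) / len(charges)
)

# Population standard deviation (divides by n, not n - 1)
pop_std = statistics.pstdev(charges)

print(baseline_rmse, pop_std)
```

Any useful model should have an RMSE clearly below this baseline.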

for name, coef in zip(features, reg.coef_):
    print(name, "=", coef)

age = 245.955948591765
bmi = 383.9550413407004
children = 395.24145654726436
smoker_yes = 24902.795380463333
sex_male = -288.927901059906

The regression coefficients appear to support the hypothesis: smoker_yes has by far the largest coefficient.
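Because the features are on different scales, the raw coefficients are not directly comparable: each one is the change in predicted charges per one-unit change in that feature. A rough way to compare them is to multiply each coefficient by a plausible range of its feature. The sketch below does this for age and smoking status, using the coefficients printed above. The age range of 18 to 64 is an assumption taken from the axis domain of the age chart.

```python
# Coefficients copied from the regression output above
coefs = {
    "age": 245.955948591765,
    "bmi": 383.9550413407004,
    "children": 395.24145654726436,
    "smoker_yes": 24902.795380463333,
    "sex_male": -288.927901059906,
}

# Assumed age range, taken from the axis domain of the age chart
age_span = 64 - 18

# Predicted difference in charges between the oldest and youngest ages
age_effect = coefs["age"] * age_span

# Predicted difference in charges between a smoker and a non-smoker
smoker_effect = coefs["smoker_yes"] * 1

print(round(age_effect, 2), round(smoker_effect, 2))
```

Under this assumption, the full age range moves the prediction by less than half of what smoking status alone does, which is consistent with the charts.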

Decision Tree Regressor and MSE#

dtr = DecisionTreeRegressor(max_depth=10, min_samples_split=6, min_samples_leaf=10, max_features='sqrt',random_state=10)
dtr.fit(X_train, y_train)
print('The train score is ' + str(dtr.score(X_train, y_train)))
print('The test score is ' + str(dtr.score(X_test, y_test)))
print('The score is ' + str(dtr.score(df[features], df['charges'])))

The train score is 0.7969213149718739
The test score is 0.7649359650091231
The score is 0.7717672349529094

The train score (about 0.80) is only slightly higher than the test score (about 0.76), so the decision tree regressor shows little sign of overfitting.
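The limited gap between train and test scores is likely helped by the constraints `max_depth=10` and `min_samples_leaf=10`, which stop the tree from memorizing individual rows. The sketch below fits a tree with the same two constraints on synthetic data (the arrays `X_demo` and `y_demo` are hypothetical) and inspects the fitted tree to confirm that the constraints hold.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only to illustrate the constraints
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(300, 3))
y_demo = 10 * X_demo[:, 0] + rng.normal(0, 1, size=300)

tree = DecisionTreeRegressor(max_depth=10, min_samples_leaf=10, random_state=10)
tree.fit(X_demo, y_demo)

# In the fitted tree structure, leaves are the nodes without a left child (-1)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]

print("depth:", tree.get_depth())
print("smallest leaf:", leaf_sizes.min())
```

Since every split must leave at least 10 samples on each side, a node needs at least 20 samples to be split, so the `min_samples_split=6` setting in the model above probably never comes into play.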

mse_train = mean_squared_error(y_train, dtr.predict(X_train))
math.sqrt(mse_train)

5639.481005734181

mse_test = mean_squared_error(y_test, dtr.predict(X_test))
math.sqrt(mse_test)

5818.525547701853

# Baseline: RMSE of always predicting the mean charge
df['test'] = df['charges'].mean()
baseline_mse = mean_squared_error(df['charges'], df['test'])
math.sqrt(baseline_mse)

12105.484975561612

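The RMSE values (about 5,600 on the training set and 5,800 on the test set) are well below the baseline of about 12,100 obtained by always predicting the mean charge, so the tree also looks reasonable for prediction.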

plt.figure(figsize=(200,100))
plot_tree(dtr, feature_names=features, filled=True)
plt.show()

../../_images/0337a3b27b3c37382896fb1a472117f3b2f017779f3dba9ec491ab4414d383c0.png

Check the importance of each feature.

for name, importance in zip(features, dtr.feature_importances_):
    print(name, "=", importance)

age = 0.1492232839167378
bmi = 0.08866239089208457
children = 0.009894399841378665
smoker_yes = 0.7520104207980166
sex_male = 0.00020950455178229155

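The feature importances also appear to support the hypothesis: smoker_yes accounts for about 75% of the total importance, followed by age at about 15%, while sex_male is close to zero.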
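In scikit-learn, `feature_importances_` are normalized so that they sum to 1, which is why they can be read as shares. The sketch below illustrates this on synthetic data in which the target depends strongly on one column, weakly on another, and not at all on a third. The data and the variable names are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: column 0 causes a large jump in the target (a bit like a
# yes/no flag), column 1 has a small effect, and column 2 has no effect.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(500, 3))
y_demo = (
    20 * (X_demo[:, 0] > 0.5)
    + 2 * X_demo[:, 1]
    + rng.normal(0, 0.5, size=500)
)

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
importances = tree.feature_importances_

print(importances)
print("sum:", importances.sum())
```

A feature that separates the data into two very different groups, as smoking status does here, tends to be chosen for early splits and so collects most of the importance.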

Visualize the Results#

df['Pred_reg'] = reg.predict(df[features])

df['Pred_dtr']=dtr.predict(df[features])

c1=alt.Chart(df).mark_circle().encode(
    y='charges',
    x='bmi'
)

c2=alt.Chart(df).mark_circle(color='red').encode(
    y='Pred_reg',
    x='bmi'
)

c3=alt.Chart(df).mark_circle(color='blue').encode(
    y='Pred_dtr',
    x='bmi'
)

c1+c2

c1+c3

Summary#

For this project, I fit a linear regression and a decision tree regressor to predict the insurance premium from age, children, BMI, smoker, and sex. Before fitting the models, I predicted from the charts that age and number of children matter less and that smoking status is a major contributor to the premium. Both models support this prediction. In addition, the decision tree regressor has a higher R^2 (around 0.77 on the full dataset) than the linear regression (around 0.75).

References#


What is the source of your dataset(s)?

This dataset was downloaded from Kaggle (https://www.kaggle.com/mirichoi0218/insurance).

List any other references that you found helpful.
ChatGPT (https://chat.openai.com/), which helped me understand machine learning more deeply.
Decision tree regressor details: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
Heatmap from the seaborn package: https://seaborn.pydata.org/generated/seaborn.heatmap.html
