Decision Tree Regressor Analysis on Insurance Charges: The Impact of Age, BMI, and Smoking Status#

Author: Xifan Jiang

Course Project, UC Irvine, Math 10, S23

Introduction#

Insurance premiums vary significantly among individuals and are influenced by factors such as age, body mass index (BMI), and smoking habits. This project uses a Decision Tree Regressor, a machine learning algorithm, to model and predict insurance charges from these features, along with sex and number of children. The objective is to explore how accurately these individual indicators can predict insurance charges, which may offer useful insights for both insurers and policyholders.

Import Data and Clean Data#

import math
import altair as alt
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree
df=pd.read_csv('insurance.csv')
df.head(5)
age sex bmi children smoker region charges
0 19 female 27.900 0 yes southwest 16884.92400
1 18 male 33.770 1 no southeast 1725.55230
2 28 male 33.000 3 no southeast 4449.46200
3 33 male 22.705 0 no northwest 21984.47061
4 32 male 28.880 0 no northwest 3866.85520

Check for missing values and describe the data.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
df.isnull().sum()
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

Since there are no null values, we don't need to drop any rows.

We only need to retain and reorder the columns used in the analysis (the region column is left out).

df=df[['charges','age','bmi','children','smoker','sex']]
df.head()
charges age bmi children smoker sex
0 16884.92400 19 27.900 0 yes female
1 1725.55230 18 33.770 1 no male
2 4449.46200 28 33.000 3 no male
3 21984.47061 33 22.705 0 no male
4 3866.85520 32 28.880 0 no male

Create dummy variables for sex and smoker.

df = pd.get_dummies(df,drop_first=True)
df.head()
charges age bmi children smoker_yes sex_male
0 16884.92400 19 27.900 0 1 0
1 1725.55230 18 33.770 1 0 1
2 4449.46200 28 33.000 3 0 1
3 21984.47061 33 22.705 0 0 1
4 3866.85520 32 28.880 0 0 1
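
To make the encoding step concrete, here is a minimal pure-Python sketch of the idea behind `drop_first=True`: each categorical column is expanded into one 0/1 column per category, and the first category (in sorted order) is dropped so that it becomes the implicit baseline. The helper `dummies_drop_first` is hypothetical and only mimics the idea; it is not the pandas implementation.

```python
def dummies_drop_first(name, values):
    """Return {column_name: [0/1, ...]} for every category except the first (sorted)."""
    categories = sorted(set(values))
    return {
        f"{name}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories[1:]  # the first category becomes the implicit baseline
    }

# The first five rows shown by df.head()
smoker = ["yes", "no", "no", "no", "no"]
sex = ["female", "male", "male", "male", "male"]

encoded = {**dummies_drop_first("smoker", smoker), **dummies_drop_first("sex", sex)}
print(encoded)
```

For these five rows the sketch produces the same `smoker_yes` and `sex_male` values as the table above: "no" and "female" are the dropped baseline categories.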

Visualize the Data

plt.figure(figsize=(8, 6))
df['smoker_yes'].value_counts().plot(kind='pie', autopct='%1.1f%%')
plt.title('Smoker vs. Non-Smoker Distribution')
plt.show()
../../_images/e2e37e809642cc74a456f0c850c0d01a6ebb7f78809794f8c4b3696181522c83.png
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, y='charges')
plt.ylabel('Charges')
plt.title('Medical Charges')
plt.show()
../../_images/1e34cd91ef5f1deac104876c925be598556560d2eeb5bf1aef5098a0f43557f1.png

Use scatter plots to see which features may contribute to the charges.

c1=alt.Chart(df).mark_circle().encode(
    x='bmi',
    y='charges',
    color='sex_male:N',
)
c1
# This chart shows no obvious relationship between sex and charges
alt.Chart(df).mark_circle().encode(
    x='children:N',
    y='charges',)
# There seems to be some relationship with the number of children, but not a strong one
alt.Chart(df).mark_circle().encode(
    x='bmi',
    y='charges',
    color='smoker_yes:N',
)
# Smokers generally have higher charges
alt.Chart(df).mark_circle().encode(
    x=alt.X('age', scale=alt.Scale(domain=(18,64))),
    y='charges')

# Charges seem to increase with age, especially the lowest charge at each age
alt.Chart(df).mark_circle().encode(
    x=alt.X('bmi', scale=alt.Scale(domain=(10,55))),
    y='charges')
# BMI does not seem very important on its own: a higher BMI seems to raise only the ceiling of charges, not the floor

Generally, from these charts, all the features seem important except sex, which does not appear to be related to the charges at all. Smoking appears to contribute the most to the charges. We will test this using linear regression and a decision tree regressor.

Next, read the correlation map between the variables, which is another good way to gauge importance. From this map, the feature most strongly correlated with charges is again smoking status, with age second.

corr_df = df.corr()
plt.figure(figsize=(15,10))
sns.heatmap(corr_df, annot=True, cmap="coolwarm")
<AxesSubplot:>
../../_images/535c910c7bca3c7b5358eb7cafc62e9780ce09c6dea7a232959c1f987f1fcb5d.png
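
By default, `df.corr()` computes the Pearson correlation coefficient for each pair of columns. As a rough illustration of what each cell in the heatmap represents, the sketch below computes that coefficient by hand. The function name `pearson` is hypothetical, and the five rows from `df.head()` are far too few to be meaningful; they are used only to keep the example small.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# First five rows from df.head(), for illustration only
smoker_yes = [1, 0, 0, 0, 0]
charges = [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520]

r = pearson(smoker_yes, charges)
print(round(r, 3))
```

The value always lies between -1 and 1; values near 0 indicate little linear relationship, which is how the `sex_male` row of the heatmap should be read.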

Split the Data for Train and Test#

features = ['age','bmi','children','smoker_yes','sex_male']
X = df[features]
y = df['charges']
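# Note: train_size=0.2 trains on 20% of the rows and tests on the remaining 80%;
# test_size=0.2 would give the more common 80/20 split (the scores below would then change).
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2, random_state=42)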
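
Because the split uses `train_size=0.2`, only about one fifth of the 1338 rows are used for fitting and the rest are held out. The sketch below shows the resulting sizes with a small shuffled split written in plain Python. The helper `simple_split` is hypothetical and is not scikit-learn's implementation, so the exact rounding and the rows selected may differ.

```python
import random

def simple_split(rows, train_fraction, seed=42):
    """Shuffle a copy of rows and cut it into (train, test) at train_fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

rows = list(range(1338))  # stand-in for the 1338 rows of the dataset
train, test = simple_split(rows, train_fraction=0.2)
print(len(train), len(test))  # 267 1071
```

A small training set like this makes the test score a demanding check, since most of the data is unseen during fitting.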

Linear Regression#

reg=LinearRegression()
reg.fit(X_train, y_train)
print('The train score is ' + str(reg.score(X_train, y_train)))
print('The test score is ' + str(reg.score(X_test, y_test)))
print('The score is ' + str(reg.score(df[features], df['charges'])))
The train score is 0.7555272998081446
The test score is 0.7439322852344854
The score is 0.7464160661734538

The train and test scores are close, so the linear regression shows no sign of overfitting.
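
The `score` method of a scikit-learn regressor returns the coefficient of determination R^2, which compares the model's squared error with that of always predicting the mean. A minimal sketch of that formula follows; the function name `r2_manual` and the toy numbers are assumptions made for illustration.

```python
def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]            # toy targets, mean is 6.0
perfect = r2_manual(y_true, y_true)       # predictions match exactly
baseline = r2_manual(y_true, [6.0] * 4)   # always predict the mean

print(perfect, baseline)  # 1.0 0.0
```

A score of about 0.75 therefore means the model removes roughly three quarters of the squared error that the mean-only baseline would make.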

reg_mse_train = mean_squared_error(y_train, reg.predict(X_train))
math.sqrt(reg_mse_train)
6187.598871935761
reg_mse_test = mean_squared_error(y_test, reg.predict(X_test))
math.sqrt(reg_mse_test)
6072.915873685427
# Baseline: RMSE of always predicting the mean charge
df['test'] = df['charges'].mean()
baseline_mse = mean_squared_error(df['charges'], df['test'])
math.sqrt(baseline_mse)
12105.484975561612
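
The last value is a baseline: the root mean squared error (RMSE) obtained by predicting the mean charge for everyone, which equals the population standard deviation of the charges. The sketch below checks this on the five rows from `df.head()`; the helper name `rmse` is an assumption, and the five rows are only a stand-in for the full column.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# First five rows from df.head(), for illustration only
charges = [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520]

mean_charge = statistics.fmean(charges)
baseline_rmse = rmse(charges, [mean_charge] * len(charges))

# Predicting the mean gives an RMSE equal to the population standard deviation
print(baseline_rmse, statistics.pstdev(charges))
```

Any useful model should have an RMSE clearly below this baseline, which is the comparison made with the train and test RMSE values above.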
for name, coef in zip(features, reg.coef_):
    print(name, "=", coef)
age = 245.955948591765
bmi = 383.9550413407004
children = 395.24145654726436
smoker_yes = 24902.795380463333
sex_male = -288.927901059906

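The regression coefficients seem to support the hypothesis: smoker_yes has by far the largest coefficient, while the coefficient for sex_male is small. Note that the features are on different scales, so the raw coefficients are not directly comparable.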
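
Since each coefficient is measured per unit of its own feature, one rough way to compare them is to multiply each by a plausible range of that feature. The sketch below does this for age and smoking status using the coefficients printed above. The age range of 18 to 64 is an assumption taken from the axis domain of the age chart, not from a computed minimum and maximum.

```python
# Coefficients copied from the regression output above
coef = {"age": 245.955948591765, "smoker_yes": 24902.795380463333}

age_span = 64 - 18  # assumed age range, taken from the chart's axis domain
age_effect = coef["age"] * age_span   # predicted change from youngest to oldest
smoker_effect = coef["smoker_yes"]    # predicted change from non-smoker to smoker

print(round(age_effect, 2), round(smoker_effect, 2))
```

Under this assumption, the smoking effect is still more than twice the effect of the full age range, which is consistent with smoking being the dominant feature.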

Decision Tree Regressor and MSE#

dtr = DecisionTreeRegressor(max_depth=10, min_samples_split=6, min_samples_leaf=10, max_features='sqrt',random_state=10)
dtr.fit(X_train, y_train)
print('The train score is ' + str(dtr.score(X_train, y_train)))
print('The test score is ' + str(dtr.score(X_test, y_test)))
print('The score is ' + str(dtr.score(df[features], df['charges'])))
The train score is 0.7969213149718739
The test score is 0.7649359650091231
The score is 0.7717672349529094

The train score (about 0.80) is only slightly higher than the test score (about 0.76), so the decision tree regressor shows little sign of overfitting.

mse_train = mean_squared_error(y_train, dtr.predict(X_train))
math.sqrt(mse_train)
5639.481005734181
mse_test = mean_squared_error(y_test, dtr.predict(X_test))
math.sqrt(mse_test)
5818.525547701853
# Baseline: RMSE of always predicting the mean charge
df['test'] = df['charges'].mean()
baseline_mse = mean_squared_error(df['charges'], df['test'])
math.sqrt(baseline_mse)
12105.484975561612

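The RMSE also looks reasonable: about 5,819 on the test set, less than half of the baseline RMSE of about 12,105 obtained by always predicting the mean charge.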

plt.figure(figsize=(200,100))
plot_tree(dtr, feature_names=features, filled=True)
plt.show()
../../_images/0337a3b27b3c37382896fb1a472117f3b2f017779f3dba9ec491ab4414d383c0.png

See the importance of each feature.

for name, importance in zip(features, dtr.feature_importances_):
    print(name, "=", importance)
age = 0.1492232839167378
bmi = 0.08866239089208457
children = 0.009894399841378665
smoker_yes = 0.7520104207980166
sex_male = 0.00020950455178229155

The feature importances also seem to support the hypothesis: smoker_yes dominates (about 0.75), age is second (about 0.15), and sex_male is negligible.
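
In scikit-learn, a feature's importance is the normalized total reduction of the splitting criterion (squared error by default for `DecisionTreeRegressor`) achieved by splits on that feature. The sketch below computes that reduction for a single split on toy data. The data and the function names `node_mse` and `impurity_decrease` are hypothetical; this is an illustration of the idea for one node, not the library's code.

```python
def node_mse(values):
    """Mean squared deviation from the mean (the impurity of a regression node)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def impurity_decrease(feature, target, threshold):
    """Weighted drop in MSE when splitting on feature <= threshold."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    n = len(target)
    weighted_children = (len(left) / n) * node_mse(left) + (len(right) / n) * node_mse(right)
    return node_mse(target) - weighted_children

# Hypothetical toy data: charges depend strongly on smoking, weakly on sex
smoker = [0, 0, 0, 1, 1, 1]
sex = [0, 1, 0, 1, 0, 1]
charge = [2000.0, 2500.0, 3000.0, 30000.0, 32000.0, 34000.0]

gain_smoker = impurity_decrease(smoker, charge, 0.5)
gain_sex = impurity_decrease(sex, charge, 0.5)
print(gain_smoker > gain_sex)  # True
```

A tree prefers the split with the larger decrease, so a feature like smoking status that separates low and high charges cleanly accumulates most of the importance.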

Visualize the Result#

df['Pred_reg'] = reg.predict(df[features])
df['Pred_dtr']=dtr.predict(df[features])
c1=alt.Chart(df).mark_circle().encode(
    y='charges',
    x='bmi'
)
c2=alt.Chart(df).mark_circle(color='red').encode(
    y='Pred_reg',
    x='bmi'
)
c3=alt.Chart(df).mark_circle(color='blue').encode(
    y='Pred_dtr',
    x='bmi'
)
c1+c2
c1+c3

Summary#

In this project, I fit a linear regression and a decision tree regressor to predict insurance charges from age, BMI, number of children, smoking status, and sex. Before fitting the models, I predicted from the charts that sex and the number of children are not that important and that smoking is the largest contributor to the charges. Both models support this prediction. In addition, the decision tree regressor has a higher R^2 (around 0.77) than the linear regression (around 0.75).

References#


  • What is the source of your dataset(s)?

This dataset was downloaded from Kaggle (https://www.kaggle.com/mirichoi0218/insurance).
