Depression Prediction

Author: Tong Su

Course Project, UC Irvine, Math 10, S22

Introduction

As a psychology major, I use a depression dataset imported from Kaggle and scikit-learn to explore which socioeconomic factors can predict whether rural residents are diagnosed with depression.

Main portion of the project


import numpy as np
import pandas as pd
import seaborn as sns
import altair as alt
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, log_loss

Basic info about the dataset

df = pd.read_csv('b_depressed.csv')
df.head()
Survey_id Ville_id sex Age Married Number_children education_level total_members gained_asset durable_asset ... incoming_salary incoming_own_farm incoming_business incoming_no_business incoming_agricultural farm_expenses labor_primary lasting_investment no_lasting_investmen depressed
0 926 91 1 28 1 4 10 5 28912201 22861940 ... 0 0 0 0 30028818 31363432 0 28411718 28292707.0 0
1 747 57 1 23 1 3 8 5 28912201 22861940 ... 0 0 0 0 30028818 31363432 0 28411718 28292707.0 1
2 1190 115 1 22 1 3 9 5 28912201 22861940 ... 0 0 0 0 30028818 31363432 0 28411718 28292707.0 0
3 1065 97 1 27 1 2 10 4 52667108 19698904 ... 0 1 0 1 22288055 18751329 0 7781123 69219765.0 0
4 806 42 0 59 0 4 10 6 82606287 17352654 ... 1 0 0 0 53384566 20731006 1 20100562 43419447.0 0

5 rows × 23 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1429 entries, 0 to 1428
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Survey_id              1429 non-null   int64  
 1   Ville_id               1429 non-null   int64  
 2   sex                    1429 non-null   int64  
 3   Age                    1429 non-null   int64  
 4   Married                1429 non-null   int64  
 5   Number_children        1429 non-null   int64  
 6   education_level        1429 non-null   int64  
 7   total_members          1429 non-null   int64  
 8   gained_asset           1429 non-null   int64  
 9   durable_asset          1429 non-null   int64  
 10  save_asset             1429 non-null   int64  
 11  living_expenses        1429 non-null   int64  
 12  other_expenses         1429 non-null   int64  
 13  incoming_salary        1429 non-null   int64  
 14  incoming_own_farm      1429 non-null   int64  
 15  incoming_business      1429 non-null   int64  
 16  incoming_no_business   1429 non-null   int64  
 17  incoming_agricultural  1429 non-null   int64  
 18  farm_expenses          1429 non-null   int64  
 19  labor_primary          1429 non-null   int64  
 20  lasting_investment     1429 non-null   int64  
 21  no_lasting_investmen   1409 non-null   float64
 22  depressed              1429 non-null   int64  
dtypes: float64(1), int64(22)
memory usage: 256.9 KB

Eliminate the null values in the “no_lasting_investmen” column.

df = df[~(df.no_lasting_investmen.isnull())]
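
Equivalently, this filter could be written with pandas' dropna; the line below is an alternative to the line above, not an additional step:

df = df.dropna(subset=['no_lasting_investmen'])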

Since Survey_id and Ville_id are just identifiers and not relevant to the mental disorder, drop them.

df = df.drop(['Survey_id','Ville_id'],axis=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1409 entries, 0 to 1428
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   sex                    1409 non-null   int64  
 1   Age                    1409 non-null   int64  
 2   Married                1409 non-null   int64  
 3   Number_children        1409 non-null   int64  
 4   education_level        1409 non-null   int64  
 5   total_members          1409 non-null   int64  
 6   gained_asset           1409 non-null   int64  
 7   durable_asset          1409 non-null   int64  
 8   save_asset             1409 non-null   int64  
 9   living_expenses        1409 non-null   int64  
 10  other_expenses         1409 non-null   int64  
 11  incoming_salary        1409 non-null   int64  
 12  incoming_own_farm      1409 non-null   int64  
 13  incoming_business      1409 non-null   int64  
 14  incoming_no_business   1409 non-null   int64  
 15  incoming_agricultural  1409 non-null   int64  
 16  farm_expenses          1409 non-null   int64  
 17  labor_primary          1409 non-null   int64  
 18  lasting_investment     1409 non-null   int64  
 19  no_lasting_investmen   1409 non-null   float64
 20  depressed              1409 non-null   int64  
dtypes: float64(1), int64(20)
memory usage: 242.2 KB
df.describe()
sex Age Married Number_children education_level total_members gained_asset durable_asset save_asset living_expenses ... incoming_salary incoming_own_farm incoming_business incoming_no_business incoming_agricultural farm_expenses labor_primary lasting_investment no_lasting_investmen depressed
count 1409.000000 1409.000000 1409.000000 1409.000000 1409.000000 1409.000000 1.409000e+03 1.409000e+03 1.409000e+03 1.409000e+03 ... 1409.000000 1409.000000 1409.000000 1409.000000 1.409000e+03 1.409000e+03 1409.000000 1.409000e+03 1.409000e+03 1409.000000
mean 0.918382 34.733854 0.774308 2.904897 8.697658 4.996451 3.360588e+07 2.707096e+07 2.744453e+07 3.248661e+07 ... 0.176011 0.254081 0.109297 0.264017 3.457400e+07 3.555012e+07 0.209368 3.300612e+07 3.360385e+07 0.166785
std 0.273879 13.800712 0.418186 1.872585 2.913673 1.772778 2.007839e+07 1.804276e+07 1.778911e+07 2.101274e+07 ... 0.380965 0.435498 0.312123 0.440965 2.091860e+07 2.126744e+07 0.407002 2.114974e+07 2.160228e+07 0.372916
min 0.000000 17.000000 0.000000 0.000000 1.000000 1.000000 3.251120e+05 1.625560e+05 1.729660e+05 2.629190e+05 ... 0.000000 0.000000 0.000000 0.000000 3.251120e+05 2.715050e+05 0.000000 7.429200e+04 1.263120e+05 0.000000
25% 1.000000 25.000000 1.000000 2.000000 8.000000 4.000000 2.312976e+07 1.929852e+07 2.339998e+07 2.103352e+07 ... 0.000000 0.000000 0.000000 0.000000 2.295536e+07 2.239928e+07 0.000000 2.010056e+07 2.064203e+07 0.000000
50% 1.000000 31.000000 1.000000 3.000000 9.000000 5.000000 2.891220e+07 2.286194e+07 2.339998e+07 2.669228e+07 ... 0.000000 0.000000 0.000000 0.000000 3.002882e+07 3.136343e+07 0.000000 2.841172e+07 2.829271e+07 0.000000
75% 1.000000 42.000000 1.000000 4.000000 10.000000 6.000000 3.717283e+07 2.634528e+07 2.339998e+07 3.870381e+07 ... 0.000000 1.000000 0.000000 1.000000 4.003842e+07 4.399778e+07 0.000000 3.978445e+07 4.151762e+07 0.000000
max 1.000000 91.000000 1.000000 11.000000 19.000000 12.000000 9.912755e+07 9.961560e+07 9.992676e+07 9.929528e+07 ... 1.000000 1.000000 1.000000 1.000000 9.978910e+07 9.965119e+07 1.000000 9.944667e+07 9.965119e+07 1.000000

8 rows × 21 columns
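
Since the models below predict the depressed column, it also helps to check how imbalanced that target is before modeling; a quick check (not in the original notebook):

# Proportion of depressed vs. non-depressed respondents
df['depressed'].value_counts(normalize=True)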

Visualization

plt.figure(figsize=(15,5))
sns.countplot(x='sex',hue='depressed',data=df)
plt.show()
../../_images/TongSu_14_0.png
plt.figure(figsize=(15,5))
sns.barplot(x='sex',y='depressed',hue='Married',data=df)
plt.show()
../../_images/TongSu_15_0.png
sns.countplot(x='Number_children', hue='depressed', data=df)
<AxesSubplot:xlabel='Number_children', ylabel='count'>
../../_images/TongSu_16_1.png
plt.figure(figsize=(15,5))
sns.barplot(x='depressed',y='education_level',data=df)
plt.show()
../../_images/TongSu_17_0.png
sel = alt.selection_single(fields=["depressed"])
c1 = alt.Chart(df).mark_circle().encode(
    x="gained_asset",
    y="living_expenses",
    color=alt.Color("depressed:N"),
    opacity=alt.condition(sel, alt.value(0.8), alt.value(0.1))
).add_selection(sel)
c2 = alt.Chart(df).mark_circle().encode(
    x="incoming_agricultural",
    y="lasting_investment",
    color=alt.Color("depressed:N"),
    opacity=alt.condition(sel, alt.value(0.8), alt.value(0.1))
).add_selection(sel)
c1 & c2

Based on these graphs, we can get a basic intuition about the dataset. They suggest that unmarried men are more likely to be depressed than married men, while married women are more likely to be depressed than unmarried women. Among people who are not depressed, the number of children is roughly normally distributed, and people who have no children appear more likely to be depressed.
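
These impressions can be checked numerically; for example, a crosstab of depression rates by sex and marital status (a small sketch, not part of the original notebook):

# Depression rate within each sex / marital-status group
pd.crosstab([df['sex'], df['Married']], df['depressed'], normalize='index')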

Correlation

plt.subplots(figsize=(15,10)) 
sns.heatmap(df.corr(), annot = True, fmt = ".2f")
plt.show()
../../_images/TongSu_21_0.png

Here is the correlation matrix for all the predictor variables and the response variable. Notice that the correlation between labor_primary and incoming_salary is 0.90, which indicates that these two indicators almost always occur together: people whose income comes from a salary also tend to report labor as their primary activity.
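
To focus on the response variable specifically, the same correlation matrix can be reduced to a single sorted column (a small addition, not in the original notebook):

# Correlation of each predictor with the depressed column, sorted
df.corr()['depressed'].drop('depressed').sort_values()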

Train_Test_Split

socioecon_state = ['gained_asset','durable_asset', 'save_asset', 'living_expenses', 'other_expenses','incoming_agricultural', 'farm_expenses', 'lasting_investment']
X = df[socioecon_state]
y = df['depressed']
scaler = StandardScaler()
scaler.fit(X)
StandardScaler()
# Apply the fitted scaler so the KNN distances are not dominated by the largest-scale features
X = pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)
KNeighborsClassifier()
# Record the test-set predictions in df, aligned on the held-out row indices
df.loc[X_test.index, 'pred_depressed'] = clf.predict(X_test)
log_loss(df['depressed'], clf.predict_proba(X))
0.7256651995634845

The log loss is about 0.73, which is moderate, so the KNN model provides a usable but imperfect baseline for predicting depression from these socioeconomic features.
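
For context (this comparison is not in the original notebook), a naive model that always predicts the overall depression rate gives a baseline log loss to compare against:

# Hypothetical baseline: always predict the overall proportion of depressed respondents
log_loss(y, np.full(len(y), y.mean()))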

pred_depression_type = (df[['depressed', 'pred_depressed']]).copy()
pdt = pred_depression_type.value_counts(normalize=True)
pdt
depressed  pred_depressed
0          0.0               0.797101
1          0.0               0.184783
0          1.0               0.014493
1          1.0               0.003623
dtype: float64
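
To see the raw performance on the held-out rows as well, here is a short sketch using scikit-learn's confusion_matrix and accuracy_score (an added check, not part of the original notebook):

from sklearn.metrics import confusion_matrix, accuracy_score
# Confusion matrix and accuracy on the test set only (hypothetical check)
y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))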

Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
clf1 = DecisionTreeClassifier()
clf1.fit(df[socioecon_state],df["depressed"])
DecisionTreeClassifier()
clf1.score(df[socioecon_state],df["depressed"])
0.9460610361958836
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf1, 
                   feature_names=clf1.feature_names_in_
                   )
../../_images/TongSu_38_0.png

Restrict the maximum depth and maximum number of leaf nodes by instantiating a new classifier

clf2 = DecisionTreeClassifier(max_depth=10, max_leaf_nodes=20)
clf2.fit(df[socioecon_state],df["depressed"])
DecisionTreeClassifier(max_depth=10, max_leaf_nodes=20)
clf2.score(df[socioecon_state],df["depressed"])
0.8523775727466288
fig = plt.figure(figsize=(200,100))
_ = tree.plot_tree(clf2, 
                   feature_names=clf2.feature_names_in_
                   )
../../_images/TongSu_43_0.png
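
Both trees above are scored on the same data they were fit on, so these scores mostly reflect how closely the trees memorize the training set. A hold-out comparison along the following lines (an added sanity check, not in the original notebook) makes the effect of the restriction on generalization clearer:

# Compare train vs. test accuracy for the unrestricted and restricted trees (hypothetical check)
X_tr, X_te, y_tr, y_te = train_test_split(df[socioecon_state], df["depressed"], test_size=0.2, random_state=0)
for name, c in [("unrestricted", DecisionTreeClassifier()),
                ("restricted", DecisionTreeClassifier(max_depth=10, max_leaf_nodes=20))]:
    c.fit(X_tr, y_tr)
    print(name, "train:", round(c.score(X_tr, y_tr), 3), "test:", round(c.score(X_te, y_te), 3))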

Summary

In this project, I analyzed a dataset on depression among rural residents. First, I created graphs to describe how depression is distributed across different factors: some factors, such as marriage and sex, appear to influence the probability of being depressed substantially, while others, such as education level, do not. Next, I presented the correlations between the variables. I then used scikit-learn to fit models and test whether the numerical socioeconomic variables can predict depression. The last part builds decision trees for predicting depression and restricts the size of the tree by setting the maximum number of leaf nodes and the maximum depth.