Title#

Name: Yihao Zhang

ID: 67069281

Introduction#

The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields (economics, psychology, survey analysis, national statistics, health, public policy and more) describe how measurements of well-being can be used effectively to assess the progress of nations. My project builds a linear regression model that predicts the happiness score from the economy, family, trust, health, and freedom variables, and finds out which variables contribute the most to the model. Finally, I use sklearn to determine whether a country’s happiness is beyond expectation.

import altair as alt
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
from pandas.api.types import is_numeric_dtype
from sklearn.linear_model import LinearRegression

Dataframe Outlook#

Import data:#

df = pd.read_csv("data happiness.csv", na_values=" ").dropna(axis=0).copy()
df=df.loc[:,["Country name","Regional indicator","Ladder score","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]].copy()
df
Country name Regional indicator Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption
0 Finland Western Europe 7.842 10.775 0.954 72.000 0.949 -0.098 0.186
1 Denmark Western Europe 7.620 10.933 0.954 72.700 0.946 0.030 0.179
2 Switzerland Western Europe 7.571 11.117 0.942 74.400 0.919 0.025 0.292
3 Iceland Western Europe 7.554 10.878 0.983 73.000 0.955 0.160 0.673
4 Netherlands Western Europe 7.464 10.932 0.942 72.400 0.913 0.175 0.338
... ... ... ... ... ... ... ... ... ...
144 Lesotho Sub-Saharan Africa 3.512 7.926 0.787 48.700 0.715 -0.131 0.915
145 Botswana Sub-Saharan Africa 3.467 9.782 0.784 59.269 0.824 -0.246 0.801
146 Rwanda Sub-Saharan Africa 3.415 7.676 0.552 61.400 0.897 0.061 0.167
147 Zimbabwe Sub-Saharan Africa 3.145 7.943 0.750 56.201 0.677 -0.047 0.821
148 Afghanistan South Asia 2.523 7.695 0.463 52.493 0.382 -0.102 0.924

149 rows × 9 columns

df.shape
(149, 9)

Rank the Countries by Happiness Ladder Score

df["Ladder score"].sort_values(ascending= False)
0      7.842
1      7.620
2      7.571
3      7.554
4      7.464
       ...  
144    3.512
145    3.467
146    3.415
147    3.145
148    2.523
Name: Ladder score, Length: 149, dtype: float64
df["Rank"]=pd.Series(range(1,150))

The DataFrame is already sorted by ladder score in descending order, so ranks can be assigned directly from the row position.
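Assigning ranks from the row position only works while the rows stay sorted. As a minimal sketch on made-up values (the toy countries and scores below are illustrative, not from the dataset), `Series.rank` computes the same ranking without relying on row order:

```python
import pandas as pd

# Toy scores in arbitrary order (illustrative values, not from the dataset).
toy = pd.DataFrame(
    {
        "Country name": ["A", "B", "C", "D"],
        "Ladder score": [5.1, 7.3, 4.2, 6.0],
    }
)

# Rank 1 goes to the highest score; tied scores would share the smallest rank.
toy["Rank"] = toy["Ladder score"].rank(ascending=False, method="min").astype(int)
print(toy.sort_values("Rank"))
```

The choice of `method="min"` is one option among several; it only matters if two countries have exactly the same score.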

10 Countries With the Highest Happiness Ladder Score#

df[["Country name","Rank"]].head(10)
Country name Rank
0 Finland 1
1 Denmark 2
2 Switzerland 3
3 Iceland 4
4 Netherlands 5
5 Norway 6
6 Sweden 7
7 Luxembourg 8
8 New Zealand 9
9 Austria 10

10 Countries With the Lowest Happiness Ladder Score#

df[["Country name","Rank"]].tail(10)
Country name Rank
139 Burundi 140
140 Yemen 141
141 Tanzania 142
142 Haiti 143
143 Malawi 144
144 Lesotho 145
145 Botswana 146
146 Rwanda 147
147 Zimbabwe 148
148 Afghanistan 149

Data visualization#

Assigning Happiness Levels by Quartiles of the Ladder Score#

Medianhp=df["Ladder score"].quantile(q=0.5)
firstq=df["Ladder score"].quantile(q=0.25)
thirdq=df["Ladder score"].quantile(q=0.75)
def quartile(data):
    if data >= thirdq:
        return("Very Happy")
    if data <= firstq:
        return("Not Happy")
    if Medianhp >= data >= firstq:
        return("Less Happy")
    if thirdq >= data >= Medianhp:
        return("Happy")
df["Happiness"]= df["Ladder score"].map(quartile)
df[["Country name","Regional indicator","Happiness"]]
Country name Regional indicator Happiness
0 Finland Western Europe Very Happy
1 Denmark Western Europe Very Happy
2 Switzerland Western Europe Very Happy
3 Iceland Western Europe Very Happy
4 Netherlands Western Europe Very Happy
... ... ... ...
144 Lesotho Sub-Saharan Africa Not Happy
145 Botswana Sub-Saharan Africa Not Happy
146 Rwanda Sub-Saharan Africa Not Happy
147 Zimbabwe Sub-Saharan Africa Not Happy
148 Afghanistan South Asia Not Happy

149 rows × 3 columns

Countries are assigned to four happiness levels using the first quartile, the median, and the third quartile of the ladder score.
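The same quartile binning can be sketched with `pd.qcut`, which computes the cut points and assigns the labels in one step. The scores below are made up for illustration. Note one assumption that differs from the hand-written function: `qcut` uses right-closed intervals, so a value exactly equal to a quartile lands in the lower bin.

```python
import pandas as pd

# Illustrative scores, not taken from the dataset.
scores = pd.Series([2.5, 3.4, 4.8, 5.2, 5.9, 6.3, 7.1, 7.8])
labels = ["Not Happy", "Less Happy", "Happy", "Very Happy"]

# q=4 splits the values at the 25th, 50th and 75th percentiles.
levels = pd.qcut(scores, q=4, labels=labels)
print(levels.value_counts().sort_index())
```

The result is an ordered categorical, which keeps the four levels in a meaningful order when they are used in charts or group-bys.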

Graph by Level of Happiness#

c1 = alt.Chart(df).mark_point().encode(
    x=alt.X("Regional indicator", scale=alt.Scale(zero=False)),
    y=alt.Y("Ladder score", scale=alt.Scale(zero=False)),
    color="Happiness",
    tooltip=["Country name","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
).properties(
    height=800,
    width=300,
    title="Country Region VS Happiness Ladder Score"
)
c1

Ladder Score Global Distribution#

Use plotly to draw a world map showing the rough distribution of the ladder score by country and region.

import plotly.express as px
fig = px.choropleth(
    df,
    locations="Country name",
    color="Ladder score",
    locationmode="country names",
    color_continuous_scale=px.colors.sequential.Plasma
)
fig.update_layout(title="Ladder Score worldwide distribution")
fig.show()
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x=df["Ladder score"], hue=df["Regional indicator"],fill=False, linewidth=2)
plt.axvline(df["Ladder score"].mean(), c="black")
plt.title("Ladder Score Distribution by Region")
plt.show()
../../_images/YihaoZhang_32_0.png

Scatterplot with Numerical variable as x axis, Ladder score as y axis, with Color Indicating its Region#

chart_list = []
for i in df.columns:
    c = alt.Chart(df).mark_circle().encode(
    x=alt.X(i, scale=alt.Scale(zero=False)),
    y = "Ladder score",
    color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
    tooltip=["Country name","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
).properties(
    title=f"{i} VS Happiness Ladder Score"
)
    chart_list.append(c)
alt.hconcat(chart_list[3], chart_list[4], chart_list[5],chart_list[6],chart_list[7],chart_list[8])

Plotly Bar Chart with region as x axis and average ladder score as y axis

import plotly.express as px
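# Average ladder score per region, with the region as a regular column
avg = df.groupby("Regional indicator")["Ladder score"].mean().reset_index()

# Plot from the aggregated frame so x and y have matching lengths
fig = px.bar(avg, x="Regional indicator", y="Ladder score")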
fig.show()

After plotting the data by regional indicator, happiness level and ladder score, the charts above suggest that countries in Western Europe and in Central and Eastern Europe tend to have a happier perception of life, while countries in Sub-Saharan Africa and South Asia tend to have a less happy perception of current life. Economy, healthy life expectancy and social support show a clearer relationship with the ladder score than the other variables. In particular, there is a clear positive correlation between GDP and happiness, and between social support and happiness.

Clustering the Countries Using StandardScaler and KMeans#

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("kmeans", KMeans(n_clusters=5))
    ]
)
cols0=["Ladder score","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
pipe.fit(df[cols0])
Pipeline(steps=[('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=5))])
arr=pipe.predict(df[cols0])
df["cluster"] = arr
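The choice of five clusters above is not justified in the notebook. One common way to compare candidate values of `n_clusters` is the silhouette score. The sketch below uses synthetic blobs (not the happiness data), and the fixed `random_state` and `n_init` values are assumptions made so that the result is reproducible:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for several k and record the mean silhouette score of each fit.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Fixing `random_state` in `KMeans` also makes the cluster labels stable between runs, which matters when the labels are stored in a column and plotted later.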

Scatterplot with Numerical variable as x axis, Ladder score as y axis, with Color Indicating its Cluster#

chart_list0 = []
for i in df.columns:
    e = alt.Chart(df).mark_circle().encode(
    x=alt.X(i, scale=alt.Scale(zero=False)),
    y = "Ladder score",
    color=alt.Color("cluster:N", scale=alt.Scale(scheme='blues')),
    tooltip=["Country name","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
).properties(
    title=f"{i} VS Happiness Ladder Score with Cluster"
)
    chart_list0.append(e)
alt.hconcat(chart_list0[3], chart_list0[4], chart_list0[5],chart_list0[6],chart_list0[7],chart_list0[8])

Correlations and Feature Selection#

cols=["Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
import matplotlib.pyplot  as plt
import seaborn as sns
plt.figure(figsize=(4,3))
sns.heatmap(df[cols].corr(),annot=True,fmt=".2f",linewidth=0.7)
<AxesSubplot: >
../../_images/YihaoZhang_51_1.png
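# Restrict to numeric columns; text columns such as "Country name" cannot be correlated
corr = df.select_dtypes("number").corr()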
corr.sort_values(["Ladder score"], ascending = False, inplace = True)
print(corr["Ladder score"])
Ladder score                    1.000000
Logged GDP per capita           0.789760
Healthy life expectancy         0.768099
Social support                  0.756888
Freedom to make life choices    0.607753
Generosity                     -0.017799
cluster                        -0.075286
Perceptions of corruption      -0.421140
Rank                           -0.984265
Name: Ladder score, dtype: float64

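Among the features, Logged GDP per capita has the strongest correlation with the ladder score (0.79) and Generosity the weakest (-0.02). Rank is almost perfectly negatively correlated with the ladder score by construction, because it was derived from the score.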

Linear Regression and Prediction#

Sklearn Linear Regression Based on Numerical Variables Only#

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cols=["Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
reg.fit(df[cols],df["Ladder score"])
pd.Series(reg.coef_,index=cols)
Logged GDP per capita           0.279533
Social support                  2.476206
Healthy life expectancy         0.030314
Freedom to make life choices    2.010465
Generosity                      0.364382
Perceptions of corruption      -0.605092
dtype: float64
reg.coef_
array([ 0.2795329 ,  2.47620585,  0.03031381,  2.0104647 ,  0.36438194,
       -0.60509177])
print(f"The equation is: Pred Ladder score = {cols[0]} x {reg.coef_[0]} + {cols[1]} x {reg.coef_[1]}+{cols[2]} x {reg.coef_[2]}+{cols[3]} x {reg.coef_[3]}+{cols[4]} x {reg.coef_[4]}+{cols[5]} x {reg.coef_[5]} + {reg.intercept_}") 
The equation is: Pred Ladder score = Logged GDP per capita x 0.2795328970903119 + Social support x 2.476205853915902+Healthy life expectancy x 0.030313812350904194+Freedom to make life choices x 2.01046470184026+Generosity x 0.3643819429244515+Perceptions of corruption x -0.6050917656434847 + -2.2372192944749907
df["Pred"] = reg.predict(df[cols])
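The raw coefficients above are expressed in the units of each feature, so they are not directly comparable: a feature measured in years has a very different scale from one measured as a fraction between 0 and 1. One way to compare contributions is to standardize the features before fitting. The sketch below uses synthetic data with hypothetical column names (`small_scale`, `large_scale`) to show how the ranking can flip:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two synthetic features on very different scales (illustrative only).
X = pd.DataFrame(
    {
        "small_scale": rng.normal(0.0, 1.0, n),     # standard deviation near 1
        "large_scale": rng.normal(60.0, 100.0, n),  # standard deviation near 100
    }
)
y = 1.0 * X["small_scale"] + 0.05 * X["large_scale"] + rng.normal(0.0, 0.1, n)

# Coefficients per original unit of each feature.
raw = pd.Series(LinearRegression().fit(X, y).coef_, index=X.columns)

# Coefficients per standard deviation of each feature.
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
std = pd.Series(LinearRegression().fit(X_std, y).coef_, index=X.columns)

print(pd.DataFrame({"raw": raw, "standardized": std}))
```

Here the raw coefficient of `small_scale` is larger, but per standard deviation `large_scale` moves the target more. Applying the same idea to `df[cols]` would be one way to compare the happiness features on a common scale.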

Construct charts to compare the true ladder score with the predicted values.

import altair as alt
chartlist2=[]
for i in cols:
    c=alt.Chart(df).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
    )
    c1=alt.Chart(df).mark_line().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Pred",
    )
    c3=c+c1
    chartlist2.append(c3)
alt.hconcat(*chartlist2)

Build training and test set:#

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df[cols],df["Ladder score"],test_size = 0.3, random_state=1)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
predtrain= model.predict(X_train)

predtest= model.predict(X_test)

# For a regressor, score() returns the R^2 coefficient of determination, not a classification accuracy
print("R^2 on training set: ",model.score(X_train,y_train))
print("R^2 on testing set: ",model.score(X_test,y_test))
R^2 on training set:  0.7816778724066249
R^2 on testing set:  0.6683353869020149
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("Mean Squared Error: ",mean_squared_error(y_test, predtest))
print("R^2 Score: ",r2_score(y_test,predtest))
Mean Squared Error:  0.3334530610483154
R^2 Score:  0.6683353869020149

The R^2 score shows that the model explains about 67% of the variance of the ladder score on the test set. For a regressor, `model.score` returns this same R^2 statistic, so the model scores 0.78 on the training set and 0.67 on the test set.

The fit is only moderate. To improve the model, we take the regional indicator into account.
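With only 149 rows, a single 70/30 split gives a fairly noisy estimate of R^2. K-fold cross-validation is one way to see how much the score varies between splits. The sketch below runs on synthetic regression data of the same size (not the happiness data); the number of folds and the seeds are arbitrary assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with the same number of rows and features (illustrative only).
X, y = make_regression(n_samples=149, n_features=6, noise=20.0, random_state=0)

# Five shuffled folds; each fold is used once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R^2 per fold:", scores.round(3))
print("mean:", scores.mean().round(3), "std:", scores.std().round(3))
```

Replacing `X` and `y` with `df[cols]` and `df["Ladder score"]` would give a per-fold spread for the model in this notebook.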

Improvements of Prediction Model#

Create Binary Indicator Columns for Region to Improve Model Accuracy#

df2=pd.get_dummies(df["Regional indicator"])
df3=pd.concat([df, df2], axis=1).copy()
df3
Country name Regional indicator Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Rank ... Central and Eastern Europe Commonwealth of Independent States East Asia Latin America and Caribbean Middle East and North Africa North America and ANZ South Asia Southeast Asia Sub-Saharan Africa Western Europe
0 Finland Western Europe 7.842 10.775 0.954 72.000 0.949 -0.098 0.186 1 ... 0 0 0 0 0 0 0 0 0 1
1 Denmark Western Europe 7.620 10.933 0.954 72.700 0.946 0.030 0.179 2 ... 0 0 0 0 0 0 0 0 0 1
2 Switzerland Western Europe 7.571 11.117 0.942 74.400 0.919 0.025 0.292 3 ... 0 0 0 0 0 0 0 0 0 1
3 Iceland Western Europe 7.554 10.878 0.983 73.000 0.955 0.160 0.673 4 ... 0 0 0 0 0 0 0 0 0 1
4 Netherlands Western Europe 7.464 10.932 0.942 72.400 0.913 0.175 0.338 5 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
144 Lesotho Sub-Saharan Africa 3.512 7.926 0.787 48.700 0.715 -0.131 0.915 145 ... 0 0 0 0 0 0 0 0 1 0
145 Botswana Sub-Saharan Africa 3.467 9.782 0.784 59.269 0.824 -0.246 0.801 146 ... 0 0 0 0 0 0 0 0 1 0
146 Rwanda Sub-Saharan Africa 3.415 7.676 0.552 61.400 0.897 0.061 0.167 147 ... 0 0 0 0 0 0 0 0 1 0
147 Zimbabwe Sub-Saharan Africa 3.145 7.943 0.750 56.201 0.677 -0.047 0.821 148 ... 0 0 0 0 0 0 0 0 1 0
148 Afghanistan South Asia 2.523 7.695 0.463 52.493 0.382 -0.102 0.924 149 ... 0 0 0 0 0 0 1 0 0 0

149 rows × 23 columns

cols4=[ 'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 
       'Central and Eastern Europe', 'Commonwealth of Independent States',
       'East Asia', 'Latin America and Caribbean',
       'Middle East and North Africa', 'North America and ANZ', 'South Asia',
       'Southeast Asia', 'Sub-Saharan Africa', 'Western Europe']
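One caveat about using all ten region columns together with an intercept: the dummy columns sum to 1 in every row, so they are linearly dependent on the intercept. `LinearRegression` still returns a fit and the predictions are unaffected, but the individual region coefficients are not uniquely determined. A common alternative, sketched here on a toy frame (values are illustrative), is to drop one category and treat it as the baseline:

```python
import pandas as pd

# Toy frame with three regions (illustrative only).
toy = pd.DataFrame(
    {"Regional indicator": ["Western Europe", "South Asia", "East Asia", "South Asia"]}
)

full = pd.get_dummies(toy["Regional indicator"])
reduced = pd.get_dummies(toy["Regional indicator"], drop_first=True)

# Every row of the full encoding sums to 1, duplicating the intercept column.
print(full.sum(axis=1).tolist())
# The first category (alphabetically) becomes the implicit baseline.
print(list(reduced.columns))
```

With the reduced encoding, each remaining region coefficient reads as the difference from the baseline region.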

Build Training and Testing set for the New Model#

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df3[cols4],df3["Ladder score"],test_size = 0.3, random_state=1)
from sklearn.linear_model import LinearRegression
model2 = LinearRegression()
model2.fit(X_train,y_train)
pred2= model2.predict(X_train)
predtest2=model2.predict(X_test)
# score() returns R^2 for a regressor
print("R^2 on training set: ",model2.score(X_train,y_train))
print("R^2 on testing set: ",model2.score(X_test,y_test))
R^2 on training set:  0.8272728525667357
R^2 on testing set:  0.7353645962541258
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("Mean Squared Error: ",mean_squared_error(y_test, predtest2))
print("R^2 Score: ",r2_score(y_test,predtest2))
Mean Squared Error:  0.2660624074921989
R^2 Score:  0.7353645962541258

The new model explains about 74% of the variance of the ladder score on the test set. Its R^2 is higher on both the training set and the test set, and its mean squared error on the test set is lower (0.27 compared with 0.33).

sklearn Linear Regression of the New Model#

from sklearn.linear_model import LinearRegression
reg2 = LinearRegression()
reg2.fit(df3[cols4],df3["Ladder score"])
pd.Series(reg2.coef_,index=cols4)
Logged GDP per capita                 0.267515
Social support                        1.949552
Healthy life expectancy               0.014086
Freedom to make life choices          2.266572
Generosity                            0.497026
Perceptions of corruption            -0.328476
Central and Eastern Europe            0.170953
Commonwealth of Independent States   -0.189213
East Asia                            -0.048350
Latin America and Caribbean           0.299890
Middle East and North Africa         -0.108041
North America and ANZ                 0.520121
South Asia                           -0.557216
Southeast Asia                       -0.460052
Sub-Saharan Africa                   -0.134426
Western Europe                        0.506334
dtype: float64
reg2.coef_
array([ 0.26751474,  1.94955181,  0.01408642,  2.26657235,  0.49702641,
       -0.32847612,  0.17095269, -0.18921326, -0.04834953,  0.29989009,
       -0.10804123,  0.52012137, -0.55721574, -0.46005188, -0.13442618,
        0.50633366])
print(f"The equation is: Pred Ladder score = {cols4[0]} x {reg2.coef_[0]} + {cols4[1]} x {reg2.coef_[1]}+{cols4[2]} x {reg2.coef_[2]}+{cols4[3]} x {reg2.coef_[3]}+{cols4[4]} x {reg2.coef_[4]}+{cols4[5]} x {reg2.coef_[5]}+{cols4[6]} x {reg2.coef_[6]}+{cols4[7]} x {reg2.coef_[7]}+{cols4[8]} x {reg2.coef_[8]}+{cols4[9]} x {reg2.coef_[9]}+{cols4[10]} x {reg2.coef_[10]}+{cols4[11]} x {reg2.coef_[11]}+{cols4[12]} x {reg2.coef_[12]}+{cols4[13]} x {reg2.coef_[13]}+{cols4[14]} x {reg2.coef_[14]}+{cols4[15]} x {reg2.coef_[15]} + {reg2.intercept_}") 
The equation is: Pred Ladder score = Logged GDP per capita x 0.2675147422939184 + Social support x 1.9495518113713843+Healthy life expectancy x 0.014086415435709418+Freedom to make life choices x 2.2665723530069286+Generosity x 0.4970264092008061+Perceptions of corruption x -0.32847612358341893+Central and Eastern Europe x 0.17095269208625866+Commonwealth of Independent States x -0.1892132556381877+East Asia x -0.048349528121480884+Latin America and Caribbean x 0.29989008516075766+Middle East and North Africa x -0.10804122503952122+North America and ANZ x 0.5201213713676418+South Asia x -0.5572157356073879+Southeast Asia x -0.46005188026096594+Sub-Saharan Africa x -0.13442618032271927+Western Europe x 0.5063336563756091 + -1.0711857633458735
df3["Pred"] = reg2.predict(df3[cols4])
df3
Country name Regional indicator Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Rank ... Central and Eastern Europe Commonwealth of Independent States East Asia Latin America and Caribbean Middle East and North Africa North America and ANZ South Asia Southeast Asia Sub-Saharan Africa Western Europe
0 Finland Western Europe 7.842 10.775 0.954 72.000 0.949 -0.098 0.186 1 ... 0 0 0 0 0 0 0 0 0 1
1 Denmark Western Europe 7.620 10.933 0.954 72.700 0.946 0.030 0.179 2 ... 0 0 0 0 0 0 0 0 0 1
2 Switzerland Western Europe 7.571 11.117 0.942 74.400 0.919 0.025 0.292 3 ... 0 0 0 0 0 0 0 0 0 1
3 Iceland Western Europe 7.554 10.878 0.983 73.000 0.955 0.160 0.673 4 ... 0 0 0 0 0 0 0 0 0 1
4 Netherlands Western Europe 7.464 10.932 0.942 72.400 0.913 0.175 0.338 5 ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
144 Lesotho Sub-Saharan Africa 3.512 7.926 0.787 48.700 0.715 -0.131 0.915 145 ... 0 0 0 0 0 0 0 0 1 0
145 Botswana Sub-Saharan Africa 3.467 9.782 0.784 59.269 0.824 -0.246 0.801 146 ... 0 0 0 0 0 0 0 0 1 0
146 Rwanda Sub-Saharan Africa 3.415 7.676 0.552 61.400 0.897 0.061 0.167 147 ... 0 0 0 0 0 0 0 0 1 0
147 Zimbabwe Sub-Saharan Africa 3.145 7.943 0.750 56.201 0.677 -0.047 0.821 148 ... 0 0 0 0 0 0 0 0 1 0
148 Afghanistan South Asia 2.523 7.695 0.463 52.493 0.382 -0.102 0.924 149 ... 0 0 0 0 0 0 1 0 0 0

149 rows × 23 columns

True Ladder Score VS. Predicted Ladder Score with New Model#

import altair as alt
chartlist3=[]
for i in cols:
    d=alt.Chart(df3).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
    )
    d1=alt.Chart(df3).mark_line().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Pred",
        color=alt.value("#FFAA00")
    )
    d3=d+d1
    chartlist3.append(d3)
alt.hconcat(*chartlist3)

Use Sklearn Gradient Boosting Classifier to determine whether a country’s happiness is beyond expectation, considering various variables.#

from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
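# Compare prediction and true score within the same frame (df3)
conditions = [df3['Pred'] > df3['Ladder score'], df3['Pred'] < df3['Ladder score']]
choices = ['Under Expectation','Beyond Expectation']
# Exact ties (prediction equal to the true score) fall back to 'Beyond Expectation'
df3['Happiness Expectation'] = np.select(conditions, choices, default='Beyond Expectation')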
df3
Country name Regional indicator Ladder score Logged GDP per capita Social support Healthy life expectancy Freedom to make life choices Generosity Perceptions of corruption Rank ... Commonwealth of Independent States East Asia Latin America and Caribbean Middle East and North Africa North America and ANZ South Asia Southeast Asia Sub-Saharan Africa Western Europe Happiness Expectation
0 Finland Western Europe 7.842 10.775 0.954 72.000 0.949 -0.098 0.186 1 ... 0 0 0 0 0 0 0 0 1 Beyond Expectation
1 Denmark Western Europe 7.620 10.933 0.954 72.700 0.946 0.030 0.179 2 ... 0 0 0 0 0 0 0 0 1 Beyond Expectation
2 Switzerland Western Europe 7.571 11.117 0.942 74.400 0.919 0.025 0.292 3 ... 0 0 0 0 0 0 0 0 1 Beyond Expectation
3 Iceland Western Europe 7.554 10.878 0.983 73.000 0.955 0.160 0.673 4 ... 0 0 0 0 0 0 0 0 1 Beyond Expectation
4 Netherlands Western Europe 7.464 10.932 0.942 72.400 0.913 0.175 0.338 5 ... 0 0 0 0 0 0 0 0 1 Beyond Expectation
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
144 Lesotho Sub-Saharan Africa 3.512 7.926 0.787 48.700 0.715 -0.131 0.915 145 ... 0 0 0 0 0 0 0 1 0 Under Expectation
145 Botswana Sub-Saharan Africa 3.467 9.782 0.784 59.269 0.824 -0.246 0.801 146 ... 0 0 0 0 0 0 0 1 0 Under Expectation
146 Rwanda Sub-Saharan Africa 3.415 7.676 0.552 61.400 0.897 0.061 0.167 147 ... 0 0 0 0 0 0 0 1 0 Under Expectation
147 Zimbabwe Sub-Saharan Africa 3.145 7.943 0.750 56.201 0.677 -0.047 0.821 148 ... 0 0 0 0 0 0 0 1 0 Under Expectation
148 Afghanistan South Asia 2.523 7.695 0.463 52.493 0.382 -0.102 0.924 149 ... 0 0 0 0 0 1 0 0 0 Under Expectation

149 rows × 24 columns

X_train,X_test,y_train,y_test = train_test_split(df3[cols4],df3['Happiness Expectation'],test_size = 0.3, random_state=1)
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=1.0,max_depth=1, random_state=0).fit(X_train, y_train)
clf.score(X_test, y_test)
0.4666666666666667

The model score is too low (about 0.47, roughly chance level for two classes), which indicates that this classifier cannot be used to determine whether a country’s happiness score is beyond expectation.
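A likely reason for the near-chance score is the way the label is built: it is the sign of the linear model's residual, and least-squares residuals are uncorrelated with the features used in the fit. A classifier trained on those same features therefore has little signal to work with. The sketch below reproduces the effect on synthetic data; the coefficients, noise level and classifier settings are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic linear data with noise independent of the features (illustrative only).
X = rng.normal(size=(1000, 6))
true_coef = np.array([1.0, 0.5, -0.3, 0.8, 0.0, -0.6])
y = X @ true_coef + rng.normal(scale=0.5, size=1000)

# Label each row by whether the true value is above or below the linear prediction.
pred = LinearRegression().fit(X, y).predict(X)
label = np.where(y >= pred, "Beyond Expectation", "Under Expectation")

# Try to recover that label from the same features.
X_tr, X_te, l_tr, l_te = train_test_split(X, label, test_size=0.3, random_state=1)
clf = GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0)
acc = clf.fit(X_tr, l_tr).score(X_te, l_te)
print(acc)
```

On data like this the accuracy stays close to 0.5. Features that were not part of the regression would be needed for such a label to become predictable.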

Summary#

I built a linear regression model to predict the happiness ladder score of each country's residents. With the regional indicators included alongside the numerical features, the model reaches an R^2 of about 0.827 on the training set and about 0.735 on the test set, compared with 0.782 and 0.668 without them, so building regional factors into the model improved it. Comparing the variables, Logged GDP per capita (the strongest correlation with the ladder score, 0.79) and Social support (correlation 0.76 and the largest coefficient in the first regression) stand out as important factors, while Generosity is the least correlated. In addition, I trained a gradient boosting classifier to predict whether a country's happiness exceeds expectation, but its test score of about 0.47 was too low to draw conclusions.

References#

Dataframe reference link: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021/code

seaborn.kdeplot: https://seaborn.pydata.org/generated/seaborn.kdeplot.html

plotly: https://plotly.com/python/plotly-express/

Choropleth Maps in Python: https://plotly.com/python/choropleth-maps/

Seaborn Heatmap: https://machinelearningknowledge.ai/seaborn-heatmap-using-sns-heatmap-with-examples-for-beginners/

sklearn.ensemble.GradientBoostingClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
