Title#

Name: Yihao Zhang

ID: 67069281

Introduction#

The World Happiness Report is a landmark survey of the state of global happiness. The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields (economics, psychology, survey analysis, national statistics, health, public policy and more) describe how measurements of well-being can be used effectively to assess the progress of nations. In this project I fit a linear regression model that predicts the happiness (ladder) score from GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption, and I examine which variables contribute the most to the model. Finally, I use sklearn to try to determine whether a country’s happiness is beyond expectation.

import altair as alt
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
from pandas.api.types import is_numeric_dtype
from sklearn.linear_model import LinearRegression

Dataframe Outlook#

Import data#

df = pd.read_csv("data happiness.csv", na_values=" ").dropna(axis=0).copy()

df=df.loc[:,["Country name","Regional indicator","Ladder score","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]].copy()

df

	Country name	Regional indicator	Ladder score	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption
0	Finland	Western Europe	7.842	10.775	0.954	72.000	0.949	-0.098	0.186
1	Denmark	Western Europe	7.620	10.933	0.954	72.700	0.946	0.030	0.179
2	Switzerland	Western Europe	7.571	11.117	0.942	74.400	0.919	0.025	0.292
3	Iceland	Western Europe	7.554	10.878	0.983	73.000	0.955	0.160	0.673
4	Netherlands	Western Europe	7.464	10.932	0.942	72.400	0.913	0.175	0.338
...	...	...	...	...	...	...	...	...	...
144	Lesotho	Sub-Saharan Africa	3.512	7.926	0.787	48.700	0.715	-0.131	0.915
145	Botswana	Sub-Saharan Africa	3.467	9.782	0.784	59.269	0.824	-0.246	0.801
146	Rwanda	Sub-Saharan Africa	3.415	7.676	0.552	61.400	0.897	0.061	0.167
147	Zimbabwe	Sub-Saharan Africa	3.145	7.943	0.750	56.201	0.677	-0.047	0.821
148	Afghanistan	South Asia	2.523	7.695	0.463	52.493	0.382	-0.102	0.924

149 rows × 9 columns

df.shape

(149, 9)

Rank the Countries by Happiness Ladder Score

df["Ladder score"].sort_values(ascending= False)

    7.842
    7.620
    7.571
    7.554
    7.464
       ...  
  3.512
  3.467
  3.415
  3.145
  2.523
Name: Ladder score, Length: 149, dtype: float64

df["Rank"]=pd.Series(range(1,150))

The DataFrame is already sorted by ladder score in descending order, so ranks can be assigned to the countries directly from the row order.
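
If the rows were not already sorted, the rank could be computed from the scores themselves rather than from the row order. A minimal sketch on made-up scores (the toy values and the name toy are illustrative, not taken from the dataset), using the pandas Series.rank method:

```python
import pandas as pd

# Toy scores in arbitrary order (illustrative values only)
toy = pd.DataFrame({"Country name": ["A", "B", "C"],
                    "Ladder score": [5.5, 7.8, 3.1]})

# Highest score gets rank 1; method="min" gives tied scores the same (lowest) rank
toy["Rank"] = toy["Ladder score"].rank(ascending=False, method="min").astype(int)
print(toy)
```

This version does not depend on how the DataFrame happens to be ordered.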

10 Countries With the Highest Happiness Ladder Score#

df[["Country name","Rank"]].head(10)

	Country name	Rank
0	Finland	1
1	Denmark	2
2	Switzerland	3
3	Iceland	4
4	Netherlands	5
5	Norway	6
6	Sweden	7
7	Luxembourg	8
8	New Zealand	9
9	Austria	10

10 Countries With the Lowest Happiness Ladder Score#

df[["Country name","Rank"]].tail(10)

	Country name	Rank
139	Burundi	140
140	Yemen	141
141	Tanzania	142
142	Haiti	143
143	Malawi	144
144	Lesotho	145
145	Botswana	146
146	Rwanda	147
147	Zimbabwe	148
148	Afghanistan	149

Data visualization#

Assigning Happiness Levels by Quartiles of Ladder Score#

Medianhp=df["Ladder score"].quantile(q=0.5)
firstq=df["Ladder score"].quantile(q=0.25)
thirdq=df["Ladder score"].quantile(q=0.75)

def quartile(data):
    if data >= thirdq:
        return("Very Happy")
    if data <= firstq:
        return("Not Happy")
    if Medianhp >= data >= firstq:
        return("Less Happy")
    if thirdq >= data >= Medianhp:
        return("Happy")

df["Happiness"]= df["Ladder score"].map(quartile)

df[["Country name","Regional indicator","Happiness"]]

	Country name	Regional indicator	Happiness
0	Finland	Western Europe	Very Happy
1	Denmark	Western Europe	Very Happy
2	Switzerland	Western Europe	Very Happy
3	Iceland	Western Europe	Very Happy
4	Netherlands	Western Europe	Very Happy
...	...	...	...
144	Lesotho	Sub-Saharan Africa	Not Happy
145	Botswana	Sub-Saharan Africa	Not Happy
146	Rwanda	Sub-Saharan Africa	Not Happy
147	Zimbabwe	Sub-Saharan Africa	Not Happy
148	Afghanistan	South Asia	Not Happy

149 rows × 3 columns

Countries are assigned to four happiness levels using the first quartile, the median, and the third quartile of the ladder score.
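
The same kind of quartile labelling can be sketched with pandas' qcut function, which computes the cut points and assigns the labels in one call. The scores below are made up for illustration. One difference to be aware of: qcut intervals are closed on the right, so a score exactly equal to the third quartile lands in "Happy", whereas the quartile function above sends it to "Very Happy".

```python
import pandas as pd

# Toy ladder scores (illustrative values only)
scores = pd.Series([2.5, 3.5, 4.5, 5.5, 6.0, 6.5, 7.0, 7.8])

labels = ["Not Happy", "Less Happy", "Happy", "Very Happy"]
# q=4 cuts at the 25th, 50th and 75th percentiles; labels run from lowest to highest bin
levels = pd.qcut(scores, q=4, labels=labels)
print(levels.tolist())
```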

Graph of Ladder Score by Region and Happiness Level#

c1 = alt.Chart(df).mark_point().encode(
    x=alt.X("Regional indicator", scale=alt.Scale(zero=False)),
    y=alt.Y("Ladder score", scale=alt.Scale(zero=False)),
    color="Happiness",
    tooltip=["Country name","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
).properties(
    height=800,
    width=300,
    title="Country Region VS Happiness Ladder Score"
)
c1

Ladder Score Global Distribution#

Use plotly to draw a world map showing the rough distribution of the ladder score by country and region.

import plotly.express as px
fig = px.choropleth(
    df,
    locations="Country name",
    color="Ladder score",
    locationmode="country names",
    color_continuous_scale=px.colors.sequential.Plasma
)
fig.update_layout(title="Ladder Score worldwide distribution")
fig.show()

import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.kdeplot(data=df, x="Ladder score", hue="Regional indicator", fill=False, linewidth=2)
plt.axvline(df["Ladder score"].mean(), c="black")
plt.title("Ladder Score Distribution by Region")
plt.show()

Scatterplots of Each Numerical Variable (x Axis) Against Ladder Score (y Axis), Colored by Region#

# Numeric predictors to plot against the ladder score
num_cols = ["Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
chart_list = []
for i in num_cols:
    c = alt.Chart(df).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
        tooltip=["Country name"] + num_cols
    ).properties(
        title=f"{i} VS Happiness Ladder Score"
    )
    chart_list.append(c)

# One chart per numeric predictor, side by side
alt.hconcat(*chart_list)

Plotly Bar Chart with region as x axis and average ladder score as y axis

import plotly.express as px
# Average ladder score per region, as a DataFrame with one row per region
avg = df.groupby("Regional indicator", as_index=False)["Ladder score"].mean()

fig = px.bar(avg, x="Regional indicator", y="Ladder score")
fig.show()

The charts above are based on the regional indicator, happiness level, and ladder score. They suggest that countries in Western Europe and in Central and Eastern Europe tend to report a happier perception of life, while countries in Sub-Saharan Africa and South Asia tend to report a less happy one. Economy, healthy life expectancy, and social support also show a clearer correlation with the ladder score than the other variables do. In particular, there is a clearly positive correlation between GDP and happiness, and between social support and happiness.

Clustering the Countries Using Standard Scaler and KMeans Clustering#

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("kmeans", KMeans(n_clusters=5))
    ]
)

cols0=["Ladder score","Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]

pipe.fit(df[cols0])

Pipeline(steps=[('scaler', StandardScaler()), ('kmeans', KMeans(n_clusters=5))])


arr=pipe.predict(df[cols0])

df["cluster"] = arr
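
KMeans starts from randomly chosen centroids, so the cluster labels can change between runs of the notebook. A minimal sketch, on made-up two-dimensional points, of fixing random_state (and setting n_init explicitly) so that the result is repeatable. The toy data and the choice of two clusters are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two well-separated toy groups (illustrative values only)
points = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
                   [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

toy_pipe = Pipeline([
    ("scaler", StandardScaler()),
    # random_state fixes the initial centroids; n_init=10 keeps the best of 10 starts
    ("kmeans", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
labels = toy_pipe.fit_predict(points)
print(labels)
```

The same two keyword arguments could be passed to the KMeans step of the pipeline above if reproducible cluster numbers are wanted.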

Scatterplots of Each Numerical Variable (x Axis) Against Ladder Score (y Axis), Colored by Cluster#

chart_list0 = []
for i in cols0[1:]:  # the six numeric predictors; cols0[0] is the ladder score itself
    e = alt.Chart(df).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("cluster:N", scale=alt.Scale(scheme='blues')),
        tooltip=["Country name"] + cols0[1:]
    ).properties(
        title=f"{i} VS Happiness Ladder Score with Cluster"
    )
    chart_list0.append(e)

# One chart per numeric predictor, side by side
alt.hconcat(*chart_list0)

Correlations and Feature Selection#

cols=["Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]

import matplotlib.pyplot  as plt
import seaborn as sns
plt.figure(figsize=(4,3))
sns.heatmap(df[cols].corr(),annot=True,fmt=".2f",linewidth=0.7)

<AxesSubplot: >

# numeric_only=True skips text columns such as "Country name" (recent pandas versions raise an error otherwise)
corr = df.corr(numeric_only=True)
print(corr["Ladder score"].sort_values(ascending=False))

Ladder score                    1.000000
Logged GDP per capita           0.789760
Healthy life expectancy         0.768099
Social support                  0.756888
Freedom to make life choices    0.607753
Generosity                     -0.017799
cluster                        -0.075286
Perceptions of corruption      -0.421140
Rank                           -0.984265
Name: Ladder score, dtype: float64

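Among the predictors, Logged GDP per capita has the strongest correlation with the ladder score (0.79), and Generosity has the weakest (-0.02). Rank correlates almost perfectly (-0.98) only because it is derived from the ladder score itself.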

Linear Regression and Prediction#

Sklearn Linear Regression Based on Numerical Variables Only#

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
cols=["Logged GDP per capita","Social support","Healthy life expectancy","Freedom to make life choices","Generosity","Perceptions of corruption"]
reg.fit(df[cols],df["Ladder score"])
pd.Series(reg.coef_,index=cols)

Logged GDP per capita           0.279533
Social support                  2.476206
Healthy life expectancy         0.030314
Freedom to make life choices    2.010465
Generosity                      0.364382
Perceptions of corruption      -0.605092
dtype: float64

reg.coef_

array([ 0.2795329 ,  2.47620585,  0.03031381,  2.0104647 ,  0.36438194,
       -0.60509177])

print(f"The equation is: Pred Ladder score = {cols[0]} x {reg.coef_[0]} + {cols[1]} x {reg.coef_[1]}+{cols[2]} x {reg.coef_[2]}+{cols[3]} x {reg.coef_[3]}+{cols[4]} x {reg.coef_[4]}+{cols[5]} x {reg.coef_[5]} + {reg.intercept_}")

The equation is: Pred Ladder score = Logged GDP per capita x 0.2795328970903119 + Social support x 2.476205853915902+Healthy life expectancy x 0.030313812350904194+Freedom to make life choices x 2.01046470184026+Generosity x 0.3643819429244515+Perceptions of corruption x -0.6050917656434847 + -2.2372192944749907
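
As a worked example, the equation can be applied by hand to a single country. The sketch below copies the coefficients and intercept from the output above and Finland's feature values from the first row of the DataFrame. The dictionary names are illustrative.

```python
# Coefficients and intercept copied from the fitted model's output above
coef = {
    "Logged GDP per capita": 0.2795328970903119,
    "Social support": 2.476205853915902,
    "Healthy life expectancy": 0.030313812350904194,
    "Freedom to make life choices": 2.01046470184026,
    "Generosity": 0.3643819429244515,
    "Perceptions of corruption": -0.6050917656434847,
}
intercept = -2.2372192944749907

# Finland's feature values (row 0 of the DataFrame)
finland = {
    "Logged GDP per capita": 10.775,
    "Social support": 0.954,
    "Healthy life expectancy": 72.000,
    "Freedom to make life choices": 0.949,
    "Generosity": -0.098,
    "Perceptions of corruption": 0.186,
}

# Prediction = intercept + sum of coefficient * feature value
pred = intercept + sum(coef[name] * finland[name] for name in coef)
print(round(pred, 2))
```

This gives a predicted ladder score of about 7.08 for Finland, compared with the observed 7.842, so the model underestimates the happiest country.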

df["Pred"] = reg.predict(df[cols])

The charts below compare the true ladder score (points) with the model's predicted score (line) for each feature.

import altair as alt
chartlist2=[]
for i in cols:
    c=alt.Chart(df).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
    )
    c1=alt.Chart(df).mark_line().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Pred",
    )
    c3=c+c1
    chartlist2.append(c3)
alt.hconcat(*chartlist2)

Build training and test set#

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df[cols],df["Ladder score"],test_size = 0.3, random_state=1)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
predtrain= model.predict(X_train)

predtest= model.predict(X_test)

# For a regression model, score() returns R^2 (not classification accuracy)
print("R^2 on training set: ",model.score(X_train,y_train))
print("R^2 on testing set: ",model.score(X_test,y_test))

R^2 on training set:  0.7816778724066249
R^2 on testing set:  0.6683353869020149

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("Mean Squared Error: ",mean_squared_error(y_test, predtest))
print("R^2 Score: ",r2_score(y_test,predtest))

Mean Squared Error:  0.3334530610483154
R^2 Score:  0.6683353869020149

Interpreting the R^2 score, the model explains about 67% of the variance of the ladder score in the test data. The score method of a regression model returns R^2, so the two values printed above are an R^2 of 0.78 on the training set and 0.67 on the testing set, not a classification accuracy.

The model's test score is not very high. To improve the model, we take the regional indicator into account.
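
For reference, the two metrics reported above can be computed by hand: the mean squared error is the average squared prediction error, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch on made-up numbers (not the notebook's data):

```python
# Toy true and predicted values (illustrative only)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

n = len(y_true)
mean_y = sum(y_true) / n

# Residual sum of squares: squared prediction errors
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: squared deviations from the mean of the true values
ss_tot = sum((t - mean_y) ** 2 for t in y_true)

mse = ss_res / n
r2 = 1 - ss_res / ss_tot
print(mse, r2)
```

Here the errors sum to 1.5 against a total of 20, giving an MSE of 0.375 and an R^2 of 0.925.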

Improvements of Prediction Model#

Create Binary Variable Columns for the Region to Improve Model Accuracy#

# dtype=int keeps the indicators as 0/1 (recent pandas versions default to True/False)
df2=pd.get_dummies(df["Regional indicator"], dtype=int)

df3=pd.concat([df, df2], axis=1).copy()

df3

	Country name	Regional indicator	Ladder score	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption	Rank	...	Central and Eastern Europe	Commonwealth of Independent States	East Asia	Latin America and Caribbean	Middle East and North Africa	North America and ANZ	South Asia	Southeast Asia	Sub-Saharan Africa	Western Europe
0	Finland	Western Europe	7.842	10.775	0.954	72.000	0.949	-0.098	0.186	1	...	0	0	0	0	0	0	0	0	0	1
1	Denmark	Western Europe	7.620	10.933	0.954	72.700	0.946	0.030	0.179	2	...	0	0	0	0	0	0	0	0	0	1
2	Switzerland	Western Europe	7.571	11.117	0.942	74.400	0.919	0.025	0.292	3	...	0	0	0	0	0	0	0	0	0	1
3	Iceland	Western Europe	7.554	10.878	0.983	73.000	0.955	0.160	0.673	4	...	0	0	0	0	0	0	0	0	0	1
4	Netherlands	Western Europe	7.464	10.932	0.942	72.400	0.913	0.175	0.338	5	...	0	0	0	0	0	0	0	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
144	Lesotho	Sub-Saharan Africa	3.512	7.926	0.787	48.700	0.715	-0.131	0.915	145	...	0	0	0	0	0	0	0	0	1	0
145	Botswana	Sub-Saharan Africa	3.467	9.782	0.784	59.269	0.824	-0.246	0.801	146	...	0	0	0	0	0	0	0	0	1	0
146	Rwanda	Sub-Saharan Africa	3.415	7.676	0.552	61.400	0.897	0.061	0.167	147	...	0	0	0	0	0	0	0	0	1	0
147	Zimbabwe	Sub-Saharan Africa	3.145	7.943	0.750	56.201	0.677	-0.047	0.821	148	...	0	0	0	0	0	0	0	0	1	0
148	Afghanistan	South Asia	2.523	7.695	0.463	52.493	0.382	-0.102	0.924	149	...	0	0	0	0	0	0	1	0	0	0

149 rows × 23 columns

cols4=[ 'Logged GDP per capita', 'Social support', 'Healthy life expectancy',
       'Freedom to make life choices', 'Generosity',
       'Perceptions of corruption', 
       'Central and Eastern Europe', 'Commonwealth of Independent States',
       'East Asia', 'Latin America and Caribbean',
       'Middle East and North Africa', 'North America and ANZ', 'South Asia',
       'Southeast Asia', 'Sub-Saharan Africa', 'Western Europe']
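
One caveat with using all ten indicator columns: each row has exactly one 1 among them, so together they duplicate the intercept, and the individual region coefficients are not uniquely determined (only their differences are). A common alternative, sketched here on a made-up three-region column, is to drop one category with drop_first=True so that the remaining coefficients are read relative to that baseline region. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Toy region column (illustrative values only)
regions = pd.Series(["A", "B", "C", "A"], name="Region")

# One indicator column per region
full = pd.get_dummies(regions, dtype=int)
# Drop the first region, which becomes the baseline
reduced = pd.get_dummies(regions, drop_first=True, dtype=int)

# Every row of the full encoding sums to 1, duplicating the intercept column
print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

The predictions of the linear model are the same either way; only the interpretation of the region coefficients changes.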

Build Training and Testing set for the New Model#

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df3[cols4],df3["Ladder score"],test_size = 0.3, random_state=1)
from sklearn.linear_model import LinearRegression
model2 = LinearRegression()
model2.fit(X_train,y_train)
pred2= model2.predict(X_train)
predtest2=model2.predict(X_test)
# For a regression model, score() returns R^2 (not classification accuracy)
print("R^2 on training set: ",model2.score(X_train,y_train))
print("R^2 on testing set: ",model2.score(X_test,y_test))

R^2 on training set:  0.8272728525667357
R^2 on testing set:  0.7353645962541258

from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
print("Mean Squared Error: ",mean_squared_error(y_test, predtest2))
print("R^2 Score: ",r2_score(y_test,predtest2))

Mean Squared Error:  0.2660624074921989
R^2 Score:  0.7353645962541258

Interpreting the R^2 score, the new model explains about 74% of the variance of the ladder score in the test data. R^2 is higher on both the training set and the testing set, and the mean squared error is lower (0.27 versus 0.33).

sklearn Linear Regression of the New Model#

from sklearn.linear_model import LinearRegression
reg2 = LinearRegression()
reg2.fit(df3[cols4],df3["Ladder score"])
pd.Series(reg2.coef_,index=cols4)

Logged GDP per capita                 0.267515
Social support                        1.949552
Healthy life expectancy               0.014086
Freedom to make life choices          2.266572
Generosity                            0.497026
Perceptions of corruption            -0.328476
Central and Eastern Europe            0.170953
Commonwealth of Independent States   -0.189213
East Asia                            -0.048350
Latin America and Caribbean           0.299890
Middle East and North Africa         -0.108041
North America and ANZ                 0.520121
South Asia                           -0.557216
Southeast Asia                       -0.460052
Sub-Saharan Africa                   -0.134426
Western Europe                        0.506334
dtype: float64

reg2.coef_

array([ 0.26751474,  1.94955181,  0.01408642,  2.26657235,  0.49702641,
       -0.32847612,  0.17095269, -0.18921326, -0.04834953,  0.29989009,
       -0.10804123,  0.52012137, -0.55721574, -0.46005188, -0.13442618,
        0.50633366])

print(f"The equation is: Pred Ladder score = {cols4[0]} x {reg2.coef_[0]} + {cols4[1]} x {reg2.coef_[1]}+{cols4[2]} x {reg2.coef_[2]}+{cols4[3]} x {reg2.coef_[3]}+{cols4[4]} x {reg2.coef_[4]}+{cols4[5]} x {reg2.coef_[5]}+{cols4[6]} x {reg2.coef_[6]}+{cols4[7]} x {reg2.coef_[7]}+{cols4[8]} x {reg2.coef_[8]}+{cols4[9]} x {reg2.coef_[9]}+{cols4[10]} x {reg2.coef_[10]}+{cols4[11]} x {reg2.coef_[11]}+{cols4[12]} x {reg2.coef_[12]}+{cols4[13]} x {reg2.coef_[13]}+{cols4[14]} x {reg2.coef_[14]}+{cols4[15]} x {reg2.coef_[15]} + {reg2.intercept_}")

The equation is: Pred Ladder score = Logged GDP per capita x 0.2675147422939184 + Social support x 1.9495518113713843+Healthy life expectancy x 0.014086415435709418+Freedom to make life choices x 2.2665723530069286+Generosity x 0.4970264092008061+Perceptions of corruption x -0.32847612358341893+Central and Eastern Europe x 0.17095269208625866+Commonwealth of Independent States x -0.1892132556381877+East Asia x -0.048349528121480884+Latin America and Caribbean x 0.29989008516075766+Middle East and North Africa x -0.10804122503952122+North America and ANZ x 0.5201213713676418+South Asia x -0.5572157356073879+Southeast Asia x -0.46005188026096594+Sub-Saharan Africa x -0.13442618032271927+Western Europe x 0.5063336563756091 + -1.0711857633458735

df3["Pred"] = reg2.predict(df3[cols4])
df3

	Country name	Regional indicator	Ladder score	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption	Rank	...	Central and Eastern Europe	Commonwealth of Independent States	East Asia	Latin America and Caribbean	Middle East and North Africa	North America and ANZ	South Asia	Southeast Asia	Sub-Saharan Africa	Western Europe
0	Finland	Western Europe	7.842	10.775	0.954	72.000	0.949	-0.098	0.186	1	...	0	0	0	0	0	0	0	0	0	1
1	Denmark	Western Europe	7.620	10.933	0.954	72.700	0.946	0.030	0.179	2	...	0	0	0	0	0	0	0	0	0	1
2	Switzerland	Western Europe	7.571	11.117	0.942	74.400	0.919	0.025	0.292	3	...	0	0	0	0	0	0	0	0	0	1
3	Iceland	Western Europe	7.554	10.878	0.983	73.000	0.955	0.160	0.673	4	...	0	0	0	0	0	0	0	0	0	1
4	Netherlands	Western Europe	7.464	10.932	0.942	72.400	0.913	0.175	0.338	5	...	0	0	0	0	0	0	0	0	0	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
144	Lesotho	Sub-Saharan Africa	3.512	7.926	0.787	48.700	0.715	-0.131	0.915	145	...	0	0	0	0	0	0	0	0	1	0
145	Botswana	Sub-Saharan Africa	3.467	9.782	0.784	59.269	0.824	-0.246	0.801	146	...	0	0	0	0	0	0	0	0	1	0
146	Rwanda	Sub-Saharan Africa	3.415	7.676	0.552	61.400	0.897	0.061	0.167	147	...	0	0	0	0	0	0	0	0	1	0
147	Zimbabwe	Sub-Saharan Africa	3.145	7.943	0.750	56.201	0.677	-0.047	0.821	148	...	0	0	0	0	0	0	0	0	1	0
148	Afghanistan	South Asia	2.523	7.695	0.463	52.493	0.382	-0.102	0.924	149	...	0	0	0	0	0	0	1	0	0	0

149 rows × 23 columns

True Ladder Score VS. Predicted Ladder Score with New Model#

import altair as alt
chartlist3=[]
for i in cols:
    d=alt.Chart(df3).mark_circle().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Ladder score",
        color=alt.Color("Regional indicator", scale=alt.Scale(scheme='dark2')),
    )
    d1=alt.Chart(df3).mark_line().encode(
        x=alt.X(i, scale=alt.Scale(zero=False)),
        y="Pred",
        color=alt.value("#FFAA00")
    )
    d3=d+d1
    chartlist3.append(d3)
alt.hconcat(*chartlist3)

Use sklearn Gradient Boosting Classifier to Determine Whether a Country’s Happiness Is Beyond Expectation#

from sklearn.ensemble import GradientBoostingClassifier

import numpy as np
# Compare the prediction with the actual score in the same DataFrame (df3)
conditions = [df3['Pred'] > df3['Ladder score'],df3['Pred'] < df3['Ladder score']]
choices = ['Under Expectation','Beyond Expectation']
df3['Happiness Expectation'] = np.select(conditions, choices, default='Beyond Expectation')
df3

	Country name	Regional indicator	Ladder score	Logged GDP per capita	Social support	Healthy life expectancy	Freedom to make life choices	Generosity	Perceptions of corruption	Rank	...	Commonwealth of Independent States	East Asia	Latin America and Caribbean	Middle East and North Africa	North America and ANZ	South Asia	Southeast Asia	Sub-Saharan Africa	Western Europe	Happiness Expectation
0	Finland	Western Europe	7.842	10.775	0.954	72.000	0.949	-0.098	0.186	1	...	0	0	0	0	0	0	0	0	1	Beyond Expectation
1	Denmark	Western Europe	7.620	10.933	0.954	72.700	0.946	0.030	0.179	2	...	0	0	0	0	0	0	0	0	1	Beyond Expectation
2	Switzerland	Western Europe	7.571	11.117	0.942	74.400	0.919	0.025	0.292	3	...	0	0	0	0	0	0	0	0	1	Beyond Expectation
3	Iceland	Western Europe	7.554	10.878	0.983	73.000	0.955	0.160	0.673	4	...	0	0	0	0	0	0	0	0	1	Beyond Expectation
4	Netherlands	Western Europe	7.464	10.932	0.942	72.400	0.913	0.175	0.338	5	...	0	0	0	0	0	0	0	0	1	Beyond Expectation
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
144	Lesotho	Sub-Saharan Africa	3.512	7.926	0.787	48.700	0.715	-0.131	0.915	145	...	0	0	0	0	0	0	0	1	0	Under Expectation
145	Botswana	Sub-Saharan Africa	3.467	9.782	0.784	59.269	0.824	-0.246	0.801	146	...	0	0	0	0	0	0	0	1	0	Under Expectation
146	Rwanda	Sub-Saharan Africa	3.415	7.676	0.552	61.400	0.897	0.061	0.167	147	...	0	0	0	0	0	0	0	1	0	Under Expectation
147	Zimbabwe	Sub-Saharan Africa	3.145	7.943	0.750	56.201	0.677	-0.047	0.821	148	...	0	0	0	0	0	0	0	1	0	Under Expectation
148	Afghanistan	South Asia	2.523	7.695	0.463	52.493	0.382	-0.102	0.924	149	...	0	0	0	0	0	1	0	0	0	Under Expectation

149 rows × 24 columns

X_train,X_test,y_train,y_test = train_test_split(df3[cols4],df3['Happiness Expectation'],test_size = 0.3, random_state=1)

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=1.0,max_depth=1, random_state=0).fit(X_train, y_train)

clf.score(X_test, y_test)

0.4666666666666667

The test accuracy of about 0.47 is too low, which indicates that the model cannot be used to determine whether a country’s happiness score is beyond expectation.
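
One likely reason for the low score: the label is the sign of the linear regression residual, and least-squares residuals are uncorrelated with the features used in the fit, so the same features carry little information about that sign. A minimal sketch on synthetic data (the data, coefficients and names are made up for illustration) that checks this property with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features and target (illustrative only)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta

# Correlation between each feature and the residuals (zero up to rounding error)
corrs = [np.corrcoef(X[:, j], residuals)[0, 1] for j in range(X.shape[1])]
print(corrs)
```

A nonlinear classifier could still pick up structure that the linear fit missed, but with only 149 countries there is little data to find it.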

Summary#

I built a linear regression model to predict the happiness (ladder) score of a country's residents. With the six numerical features and the regional indicators included, the R^2 of the model on the test set is about 0.735. Comparing the variables by their correlation with the ladder score, GDP per capita, healthy life expectancy, and social support are the most strongly related, and generosity is the least. The model with regional indicators has an R^2 of about 0.827 on the training set and about 0.735 on the test set, up from about 0.782 and 0.668 without them, so building regional factors into the model improved it. In addition, I trained a gradient boosting classifier to predict whether a country's happiness exceeds the model's expectation, but its test accuracy of about 0.47 was too low to draw conclusions.