Global Life Expectancy
Global Life Expectancy#
Author: Eden Kim
Course Project, UC Irvine, Math 10, F22
Introduction#
In this project, we look at a historical dataset that records life expectancy from 1960 to 2020 for a number of countries. We want to explore which country has the greatest life expectancy, how life expectancy varies by region, and the global trend, with a prediction line.
Country with Greatest Life Expectancy#
First, we’ll try to create a bar graph to visually find the country with greatest life expectancy.
import pandas as pd
import altair as alt
df = pd.read_csv("global life expectancy.csv")
df
Country Name | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 65.662 | 66.074 | 66.444 | 66.787 | 67.113 | 67.435 | 67.762 | 68.095 | ... | 75.158 | 75.299000 | 75.441000 | 75.583000 | 75.725000 | 75.868000 | 76.010000 | 76.152000 | 76.293000 | 76.434000 |
1 | Afghanistan | AFG | 32.446 | 32.962 | 33.471 | 33.971 | 34.463 | 34.948 | 35.430 | 35.914 | ... | 61.553 | 62.054000 | 62.525000 | 62.966000 | 63.377000 | 63.763000 | 64.130000 | 64.486000 | 64.833000 | 65.173000 |
2 | Angola | AGO | 37.524 | 37.811 | 38.113 | 38.430 | 38.760 | 39.102 | 39.454 | 39.813 | ... | 56.330 | 57.236000 | 58.054000 | 58.776000 | 59.398000 | 59.925000 | 60.379000 | 60.782000 | 61.147000 | 61.487000 |
3 | Albania | ALB | 62.283 | 63.301 | 64.190 | 64.914 | 65.463 | 65.850 | 66.110 | 66.304 | ... | 76.914 | 77.252000 | 77.554000 | 77.813000 | 78.025000 | 78.194000 | 78.333000 | 78.458000 | 78.573000 | 78.686000 |
4 | United Arab Emirates | ARE | 51.537 | 52.560 | 53.573 | 54.572 | 55.555 | 56.523 | 57.482 | 58.432 | ... | 76.521 | 76.711000 | 76.903000 | 77.095000 | 77.285000 | 77.470000 | 77.647000 | 77.814000 | 77.972000 | 78.120000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
200 | Kosovo | XKX | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 70.200 | 70.497561 | 70.797561 | 71.097561 | 71.346341 | 71.846341 | 72.295122 | 72.695122 | 73.092683 | 71.087805 |
201 | Yemen, Rep. | YEM | 29.919 | 30.163 | 30.500 | 30.943 | 31.501 | 32.175 | 32.960 | 33.836 | ... | 65.768 | 65.920000 | 66.016000 | 66.066000 | 66.085000 | 66.087000 | 66.086000 | 66.096000 | 66.125000 | 66.181000 |
202 | South Africa | ZAF | 48.406 | 48.777 | 49.142 | 49.509 | 49.888 | 50.284 | 50.705 | 51.148 | ... | 58.895 | 60.060000 | 61.099000 | 61.968000 | 62.649000 | 63.153000 | 63.538000 | 63.857000 | 64.131000 | 64.379000 |
203 | Zambia | ZMB | 46.687 | 47.084 | 47.446 | 47.772 | 48.068 | 48.351 | 48.643 | 48.960 | ... | 57.126 | 58.502000 | 59.746000 | 60.831000 | 61.737000 | 62.464000 | 63.043000 | 63.510000 | 63.886000 | 64.194000 |
204 | Zimbabwe | ZWE | 53.019 | 53.483 | 53.946 | 54.403 | 54.849 | 55.274 | 55.671 | 56.034 | ... | 52.896 | 55.032000 | 56.897000 | 58.410000 | 59.534000 | 60.294000 | 60.812000 | 61.195000 | 61.490000 | 61.738000 |
205 rows × 63 columns
Some countries have missing data. We drop them from the dataset because missing values would lead to inaccurate calculations and problems in graphing.
df1 = df.dropna(axis=0).copy()
df1
Country Name | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 65.662 | 66.074 | 66.444 | 66.787 | 67.113 | 67.435 | 67.762 | 68.095 | ... | 75.158 | 75.299 | 75.441 | 75.583 | 75.725 | 75.868 | 76.010 | 76.152 | 76.293 | 76.434 |
1 | Afghanistan | AFG | 32.446 | 32.962 | 33.471 | 33.971 | 34.463 | 34.948 | 35.430 | 35.914 | ... | 61.553 | 62.054 | 62.525 | 62.966 | 63.377 | 63.763 | 64.130 | 64.486 | 64.833 | 65.173 |
2 | Angola | AGO | 37.524 | 37.811 | 38.113 | 38.430 | 38.760 | 39.102 | 39.454 | 39.813 | ... | 56.330 | 57.236 | 58.054 | 58.776 | 59.398 | 59.925 | 60.379 | 60.782 | 61.147 | 61.487 |
3 | Albania | ALB | 62.283 | 63.301 | 64.190 | 64.914 | 65.463 | 65.850 | 66.110 | 66.304 | ... | 76.914 | 77.252 | 77.554 | 77.813 | 78.025 | 78.194 | 78.333 | 78.458 | 78.573 | 78.686 |
4 | United Arab Emirates | ARE | 51.537 | 52.560 | 53.573 | 54.572 | 55.555 | 56.523 | 57.482 | 58.432 | ... | 76.521 | 76.711 | 76.903 | 77.095 | 77.285 | 77.470 | 77.647 | 77.814 | 77.972 | 78.120 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
199 | Samoa | WSM | 56.902 | 57.188 | 57.472 | 57.756 | 58.045 | 58.340 | 58.642 | 58.951 | ... | 71.906 | 72.136 | 72.351 | 72.549 | 72.730 | 72.895 | 73.046 | 73.187 | 73.321 | 73.450 |
201 | Yemen, Rep. | YEM | 29.919 | 30.163 | 30.500 | 30.943 | 31.501 | 32.175 | 32.960 | 33.836 | ... | 65.768 | 65.920 | 66.016 | 66.066 | 66.085 | 66.087 | 66.086 | 66.096 | 66.125 | 66.181 |
202 | South Africa | ZAF | 48.406 | 48.777 | 49.142 | 49.509 | 49.888 | 50.284 | 50.705 | 51.148 | ... | 58.895 | 60.060 | 61.099 | 61.968 | 62.649 | 63.153 | 63.538 | 63.857 | 64.131 | 64.379 |
203 | Zambia | ZMB | 46.687 | 47.084 | 47.446 | 47.772 | 48.068 | 48.351 | 48.643 | 48.960 | ... | 57.126 | 58.502 | 59.746 | 60.831 | 61.737 | 62.464 | 63.043 | 63.510 | 63.886 | 64.194 |
204 | Zimbabwe | ZWE | 53.019 | 53.483 | 53.946 | 54.403 | 54.849 | 55.274 | 55.671 | 56.034 | ... | 52.896 | 55.032 | 56.897 | 58.410 | 59.534 | 60.294 | 60.812 | 61.195 | 61.490 | 61.738 |
187 rows × 63 columns
Then, df.melt and groupby are used to get the average life expectancy for each country.
With df.melt, all the years are brought into one column. Calling groupby("Country Name").mean() on the resulting dataframe then takes all the life expectancies for each country (for the years 1960, 1961, ..., 2020) and averages them.
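To see what these two steps do in isolation, here is a minimal sketch on a toy frame. The countries "A" and "B" and their values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy wide-format frame (made-up numbers, not from the real dataset).
wide = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "1960": [50.0, 60.0],
    "1961": [52.0, 62.0],
})

# melt stacks the year columns into (Year, Life Expectancy) pairs,
# giving one row per country and year.
long = wide.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Life Expectancy",
)

# One mean per country, taken across all of its years.
avg = long.groupby("Country Name")["Life Expectancy"].mean().reset_index()

print(long)
print(avg)
```

Selecting the "Life Expectancy" column before calling mean() is one way to keep the text columns out of the average.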
df1_melt = df1.melt(
id_vars=["Country Name", "Country Code"],
var_name="Year",
value_name="Life Expectancy"
)
df1_melt
Country Name | Country Code | Year | Life Expectancy | |
---|---|---|---|---|
0 | Aruba | ABW | 1960 | 65.662 |
1 | Afghanistan | AFG | 1960 | 32.446 |
2 | Angola | AGO | 1960 | 37.524 |
3 | Albania | ALB | 1960 | 62.283 |
4 | United Arab Emirates | ARE | 1960 | 51.537 |
... | ... | ... | ... | ... |
11402 | Samoa | WSM | 2020 | 73.450 |
11403 | Yemen, Rep. | YEM | 2020 | 66.181 |
11404 | South Africa | ZAF | 2020 | 64.379 |
11405 | Zambia | ZMB | 2020 | 64.194 |
11406 | Zimbabwe | ZWE | 2020 | 61.738 |
11407 rows × 4 columns
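# numeric_only=True skips the text columns (Country Code, Year), which recent pandas versions no longer drop automatically
df1_bar = df1_melt.groupby("Country Name").mean(numeric_only=True).copy()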
df1_bar.reset_index(inplace=True)
df1_bar
Country Name | Life Expectancy | |
---|---|---|
0 | Afghanistan | 49.459066 |
1 | Albania | 71.795115 |
2 | Algeria | 63.784213 |
3 | Angola | 46.999508 |
4 | Antigua and Barbuda | 70.901902 |
... | ... | ... |
182 | Vietnam | 68.850000 |
183 | Virgin Islands (U.S.) | 73.813023 |
184 | Yemen, Rep. | 52.760738 |
185 | Zambia | 51.307410 |
186 | Zimbabwe | 54.651705 |
187 rows × 2 columns
Now we have the x-axis and y-axis data that we wanted.
To make the categorical bar graph, we use altair with df1_bar, “Country Name” as x-axis and “Life Expectancy” as y-axis.
I encoded the color as quantitative (:Q) so that we can distinguish countries with higher and lower life expectancy. I also added a tooltip so that hovering over a bar shows the exact life expectancy along with the country name.
alt.Chart(df1_bar).mark_bar().encode(
x='Country Name',
y='Life Expectancy',
color='Life Expectancy:Q',
tooltip=['Country Name','Life Expectancy']
)
This bar graph does show which countries have generally higher and lower life expectancies, but it’s still hard to tell at a glance which one has the greatest.
We can get the country we're looking for by sorting the life expectancy values in df1_bar in descending order, reading off the index of the maximum, and looking up that index in the dataframe.
df1_bar["Life Expectancy"].sort_values(ascending=False)
76 78.173703
84 78.010892
160 77.845010
161 77.665282
127 77.428477
...
125 45.920328
124 45.779213
153 44.564557
106 44.138295
146 41.197459
Name: Life Expectancy, Length: 187, dtype: float64
df1_bar.loc[76,"Country Name"]
'Iceland'
With this, we have found that Iceland had the greatest average life expectancy over 1960 to 2020 (of the countries in the data).
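The sort-then-look-up steps can also be collapsed into a single call. The sketch below uses idxmax and nlargest on a toy frame. The three rows are hypothetical values chosen only to illustrate the lookup.

```python
import pandas as pd

# Hypothetical averages, only to illustrate the lookup.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Life Expectancy": [70.1, 78.2, 65.4],
})

# idxmax returns the index label of the largest value.
top_label = toy["Life Expectancy"].idxmax()
top_country = toy.loc[top_label, "Country Name"]
print(top_country)

# nlargest keeps the names next to the values for a quick top-N view.
top2 = toy.nlargest(2, "Life Expectancy")
print(top2)
```

Applied to df1_bar, the same two calls would avoid reading the index off the sorted output by hand.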
Life Expectancy Over Time for Each Continent#
In this section, we want to see and compare the life expectancy trend (from 1960 to 2020) between different regions. As for how to divide the regions, I used continents.
To do this, I used a function that returns the continent for a given country name. The function is from https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry. Some things to note:
It required installing the pycountry_convert module (listed in the project file “requirements.txt”).
Some country names couldn’t be processed by the function, either because of their format or because the function didn’t recognize them as countries. To address this, I used try and except.
try runs the code that follows it; if any error occurs in the process, control passes to except, where I handled the error by labeling the continent as 'Misc'.
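import pycountry_convert as pc

def to_continent(country_name):
    # Country name -> ISO alpha-2 code -> continent code -> continent name
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name

continents = []
for i in range(len(df1)):
    try:
        # NOTE: df1.loc[i, ...] looks rows up by index label. df1 kept its
        # original labels after dropna, so a label missing from
        # 0..len(df1)-1 also raises an error here and ends up as 'Misc'.
        continents.append(to_continent(df1.loc[i, "Country Name"]))
    except:
        continents.append('Misc')
continents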
['North America',
'Asia',
'Africa',
'Europe',
'Asia',
'South America',
'Asia',
'North America',
'Oceania',
'Europe',
'Asia',
'Africa',
'Europe',
'Africa',
'Africa',
'Asia',
'Europe',
'Asia',
'Misc',
'Europe',
'Europe',
'North America',
'Misc',
'South America',
'South America',
'North America',
'Asia',
'Asia',
'Africa',
'North America',
'Europe',
'South America',
'Asia',
'Misc',
'Africa',
'Misc',
'Misc',
'South America',
'Africa',
'Africa',
'North America',
'North America',
'Misc',
'Misc',
'Asia',
'Europe',
'Europe',
'Africa',
'Misc',
'Europe',
'North America',
'Africa',
'South America',
'Misc',
'Africa',
'Europe',
'Europe',
'Africa',
'Europe',
'Oceania',
'Europe',
'Misc',
'Misc',
'Africa',
'Europe',
'Asia',
'Africa',
'Africa',
'Misc',
'Africa',
'Africa',
'Europe',
'North America',
'Misc',
'North America',
'Oceania',
'South America',
'Misc',
'North America',
'Europe',
'North America',
'Europe',
'Asia',
'Misc',
'Asia',
'Europe',
'Misc',
'Asia',
'Europe',
'Misc',
'Europe',
'North America',
'Asia',
'Asia',
'Asia',
'Africa',
'Asia',
'Asia',
'Oceania',
'Misc',
'Misc',
'Asia',
'Misc',
'Asia',
'Africa',
'Africa',
'North America',
'Misc',
'Asia',
'Africa',
'Europe',
'Europe',
'Europe',
'Misc',
'Misc',
'Africa',
'Europe',
'Africa',
'Asia',
'North America',
'Misc',
'Europe',
'Africa',
'Europe',
'Asia',
'Europe',
'Asia',
'Africa',
'Africa',
'Africa',
'Africa',
'Asia',
'Africa',
'Oceania',
'Africa',
'Africa',
'North America',
'Europe',
'Europe',
'Asia',
'Oceania',
'Asia',
'Asia',
'North America',
'South America',
'Asia',
'Misc',
'Oceania',
'Europe',
'North America',
'Misc',
'Europe',
'South America',
'Misc',
'Oceania',
'Asia',
'Europe',
'Europe',
'Africa',
'Asia',
'Africa',
'Africa',
'Asia',
'Oceania',
'Africa',
'North America',
'Africa',
'Misc',
'Africa',
'Africa',
'South America',
'Europe',
'Europe',
'Europe',
'Africa',
'Misc',
'Misc',
'Asia',
'Africa',
'Africa',
'Asia',
'Asia',
'Asia',
'Misc',
'Oceania',
'North America',
'Africa']
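One thing worth checking in the loop above: after dropna, df1 keeps its original row labels (the last label is 204 although only 187 rows remain). df1.loc[i, "Country Name"] looks rows up by label, so for i in range(len(df1)) a dropped label is recorded as 'Misc' and labels from 187 upward are never visited. The list then no longer lines up row by row with df1, which is likely why the table further down pairs Zambia with North America. The sketch below reproduces the effect on a toy frame and shows one possible alternative. The lookup dictionary and the two helper functions are hypothetical stand-ins for to_continent, not part of pycountry_convert.

```python
import pandas as pd

# Toy frame with a gap in the index, as dropna leaves behind.
toy = pd.DataFrame(
    {"Country Name": ["Aruba", "Angola", "Zambia"]},
    index=[0, 2, 3],
)

# Hypothetical stand-in for to_continent; raises KeyError on unknown names.
lookup = {"Aruba": "North America", "Angola": "Africa", "Zambia": "Africa"}

def to_continent_toy(name):
    return lookup[name]

# Label-based loop as in the text: label 1 is missing, label 3 is never reached.
by_label = []
for i in range(len(toy)):
    try:
        by_label.append(to_continent_toy(toy.loc[i, "Country Name"]))
    except KeyError:
        by_label.append("Misc")

# Iterating over the column values keeps each result on its own row.
def safe_continent(name):
    try:
        return to_continent_toy(name)
    except KeyError:
        return "Misc"

by_value = toy["Country Name"].apply(safe_continent)

print(by_label)           # misaligned: the third row gets Angola's continent
print(by_value.tolist())  # aligned with the rows of toy
```

With the real data, the same idea would be to apply a try/except wrapper around to_continent to df1["Country Name"]. The continent averages and the chart below would then need to be recomputed.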
We create a new column in the dataframe that has the continents.
df2 = df1.copy()
df2["Continent"] = continents
df2
Country Name | Country Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Continent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Aruba | ABW | 65.662 | 66.074 | 66.444 | 66.787 | 67.113 | 67.435 | 67.762 | 68.095 | ... | 75.299 | 75.441 | 75.583 | 75.725 | 75.868 | 76.010 | 76.152 | 76.293 | 76.434 | North America |
1 | Afghanistan | AFG | 32.446 | 32.962 | 33.471 | 33.971 | 34.463 | 34.948 | 35.430 | 35.914 | ... | 62.054 | 62.525 | 62.966 | 63.377 | 63.763 | 64.130 | 64.486 | 64.833 | 65.173 | Asia |
2 | Angola | AGO | 37.524 | 37.811 | 38.113 | 38.430 | 38.760 | 39.102 | 39.454 | 39.813 | ... | 57.236 | 58.054 | 58.776 | 59.398 | 59.925 | 60.379 | 60.782 | 61.147 | 61.487 | Africa |
3 | Albania | ALB | 62.283 | 63.301 | 64.190 | 64.914 | 65.463 | 65.850 | 66.110 | 66.304 | ... | 77.252 | 77.554 | 77.813 | 78.025 | 78.194 | 78.333 | 78.458 | 78.573 | 78.686 | Europe |
4 | United Arab Emirates | ARE | 51.537 | 52.560 | 53.573 | 54.572 | 55.555 | 56.523 | 57.482 | 58.432 | ... | 76.711 | 76.903 | 77.095 | 77.285 | 77.470 | 77.647 | 77.814 | 77.972 | 78.120 | Asia |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
199 | Samoa | WSM | 56.902 | 57.188 | 57.472 | 57.756 | 58.045 | 58.340 | 58.642 | 58.951 | ... | 72.136 | 72.351 | 72.549 | 72.730 | 72.895 | 73.046 | 73.187 | 73.321 | 73.450 | Asia |
201 | Yemen, Rep. | YEM | 29.919 | 30.163 | 30.500 | 30.943 | 31.501 | 32.175 | 32.960 | 33.836 | ... | 65.920 | 66.016 | 66.066 | 66.085 | 66.087 | 66.086 | 66.096 | 66.125 | 66.181 | Misc |
202 | South Africa | ZAF | 48.406 | 48.777 | 49.142 | 49.509 | 49.888 | 50.284 | 50.705 | 51.148 | ... | 60.060 | 61.099 | 61.968 | 62.649 | 63.153 | 63.538 | 63.857 | 64.131 | 64.379 | Oceania |
203 | Zambia | ZMB | 46.687 | 47.084 | 47.446 | 47.772 | 48.068 | 48.351 | 48.643 | 48.960 | ... | 58.502 | 59.746 | 60.831 | 61.737 | 62.464 | 63.043 | 63.510 | 63.886 | 64.194 | North America |
204 | Zimbabwe | ZWE | 53.019 | 53.483 | 53.946 | 54.403 | 54.849 | 55.274 | 55.671 | 56.034 | ... | 55.032 | 56.897 | 58.410 | 59.534 | 60.294 | 60.812 | 61.195 | 61.490 | 61.738 | Africa |
187 rows × 64 columns
Next, group the dataframe by continent so that we have year/life expectancy data by continent, not country.
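# numeric_only=True averages only the year columns and skips Country Name / Country Code
df_conti = df2.groupby("Continent").mean(numeric_only=True).copy()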
df_conti.reset_index(inplace=True)
df_conti
Continent | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Africa | 51.911373 | 52.315236 | 52.684135 | 53.103648 | 53.520000 | 53.917729 | 54.322771 | 54.709387 | 55.099520 | ... | 69.607766 | 69.906567 | 70.250675 | 70.546000 | 70.781485 | 71.049631 | 71.290738 | 71.542371 | 71.767809 | 71.847491 |
1 | Asia | 52.830392 | 53.399982 | 53.920459 | 54.435588 | 54.981245 | 55.461831 | 55.948146 | 56.391173 | 56.811198 | ... | 70.412202 | 70.774841 | 71.158889 | 71.506061 | 71.796693 | 72.109442 | 72.340053 | 72.584231 | 72.810201 | 72.909318 |
2 | Europe | 57.659488 | 58.135747 | 58.456193 | 58.826982 | 59.260429 | 59.588553 | 59.953093 | 60.312871 | 60.648314 | ... | 72.373682 | 72.687835 | 73.010700 | 73.364541 | 73.529201 | 73.785163 | 74.020479 | 74.194212 | 74.441680 | 74.365482 |
3 | Misc | 51.405921 | 51.830659 | 52.134498 | 52.573478 | 53.043221 | 53.392378 | 53.776949 | 54.147137 | 54.459372 | ... | 67.674541 | 68.079644 | 68.460127 | 68.861809 | 69.124180 | 69.446056 | 69.711120 | 69.957793 | 70.220220 | 70.158805 |
4 | North America | 56.544218 | 57.091116 | 57.575617 | 58.062500 | 58.577691 | 59.041588 | 59.495272 | 59.947293 | 60.372332 | ... | 72.423866 | 72.725760 | 73.033960 | 73.337421 | 73.578526 | 73.854245 | 74.113457 | 74.303688 | 74.544428 | 74.552098 |
5 | Oceania | 57.841107 | 58.207232 | 58.364212 | 58.699668 | 59.066634 | 59.269985 | 59.577412 | 59.841573 | 60.042966 | ... | 72.541788 | 72.883963 | 73.242651 | 73.567929 | 73.753539 | 74.028029 | 74.195817 | 74.406273 | 74.589739 | 74.657851 |
6 | South America | 50.836990 | 51.294980 | 51.773002 | 52.278156 | 52.809593 | 53.359563 | 53.914420 | 54.457712 | 54.976193 | ... | 70.434427 | 70.767883 | 71.154246 | 71.429815 | 71.752224 | 72.003946 | 72.277668 | 72.505290 | 72.744534 | 72.903534 |
7 rows × 62 columns
In order to make the line graph, we will use df.melt again to get the years into their own column. This time, the melt is used so that we can have the year column be the x-axis.
df_melt = df_conti.melt(
id_vars="Continent",
var_name="Year",
value_name="Life Expectancy"
)
df_melt
Continent | Year | Life Expectancy | |
---|---|---|---|
0 | Africa | 1960 | 51.911373 |
1 | Asia | 1960 | 52.830392 |
2 | Europe | 1960 | 57.659488 |
3 | Misc | 1960 | 51.405921 |
4 | North America | 1960 | 56.544218 |
... | ... | ... | ... |
422 | Europe | 2020 | 74.365482 |
423 | Misc | 2020 | 70.158805 |
424 | North America | 2020 | 74.552098 |
425 | Oceania | 2020 | 74.657851 |
426 | South America | 2020 | 72.903534 |
427 rows × 3 columns
Finally, we’ll create a new altair chart, this time a line graph with “Year” as x-axis and “Life Expectancy” as y-axis, with 7 lines: one for each of the six continents plus Misc. Since the lines are fairly close to each other and hard to distinguish, I set the y-axis scale to not start at zero and added a legend selection that highlights one continent at a time.
sel = alt.selection_single(fields=["Continent"], bind="legend")
alt.Chart(df_melt).mark_line().encode(
x="Year",
y=alt.Y("Life Expectancy", scale=alt.Scale(zero=False)),
color=alt.condition(sel, "Continent", alt.value("lightgrey")),
opacity=alt.condition(sel, alt.value(1), alt.value(0.2))
).add_selection(sel)
With this, we can make some observations:
The overall trend for all continents is positive. (There is an increase in life expectancy over the years.)
Europe, North America, and Oceania (Australia) have almost the same trend, converging more as time passes.
There is a noticeable dip in the trend of South America, starting around 1987, lowest being around 1992.
Global Trend in Life Expectancy (prediction line)#
In this section, we want to see the global trend and prediction line, using data from all of the countries.
First we graph a scatter plot of life expectancy by year (using a random sample of 500 points because the full dataset is too large to plot comfortably).
Life_Trend = alt.Chart(df1_melt.sample(500)).mark_point(size=20).encode(
x = 'Year',
y = 'Life Expectancy',
color = alt.Color('Life Expectancy'),
tooltip=['Country Name','Year','Life Expectancy']
).properties(
width=800,height=300,
title='Year x Life Expectancy'
)
Life_Trend
Next, we fit the linear regression line using scikit-learn.
from sklearn.linear_model import LinearRegression
df_lin = df1_melt.copy()
reg = LinearRegression()
reg.fit(df_lin[["Year"]], df_lin["Life Expectancy"])
LinearRegression()
df_lin["Pred"] = reg.predict(df_lin[["Year"]])
df_lin
Country Name | Country Code | Year | Life Expectancy | Pred | |
---|---|---|---|---|---|
0 | Aruba | ABW | 1960 | 65.662 | 55.051293 |
1 | Afghanistan | AFG | 1960 | 32.446 | 55.051293 |
2 | Angola | AGO | 1960 | 37.524 | 55.051293 |
3 | Albania | ALB | 1960 | 62.283 | 55.051293 |
4 | United Arab Emirates | ARE | 1960 | 51.537 | 55.051293 |
... | ... | ... | ... | ... | ... |
11402 | Samoa | WSM | 2020 | 73.450 | 73.224064 |
11403 | Yemen, Rep. | YEM | 2020 | 66.181 | 73.224064 |
11404 | South Africa | ZAF | 2020 | 64.379 | 73.224064 |
11405 | Zambia | ZMB | 2020 | 64.194 | 73.224064 |
11406 | Zimbabwe | ZWE | 2020 | 61.738 | 73.224064 |
11407 rows × 5 columns
We’ll be using the prediction from regression to draw the regression/prediction line.
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(df_lin).mark_line().encode(
x="Year",
y="Pred"
)
Life_Trend+c1
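The line looks about right against the scatter plot. To quantify the trend, we want the slope of the line. Because the predictions lie exactly on an increasing line, the slope equals the standard deviation of Pred divided by the standard deviation of Year. The computation below uses df_lin2, the copy of the data in which Year has been standardized (built in the Standardized Linear Regression section), so the resulting slope is per standard deviation of Year rather than per calendar year.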
df_lin2.std(axis=0, numeric_only=True)
Year 1.000044
Life Expectancy 11.405573
Pred 5.332978
dtype: float64
slope = 5.332978/1.000044
slope
5.332743359292191
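This coefficient tells us that general life expectancy increased by approximately 5.33 years per standard deviation of Year (about 17.6 years). In calendar terms that is roughly 0.30 years per year, which matches the rise in Pred from 55.05 in 1960 to 73.22 in 2020.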
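As a quick arithmetic check, the per-year rate can be recovered from the numbers already printed above. The sketch copies the Pred values for 1960 and 2020 and the ratio 5.332978/1.000044 from the outputs. It assumes every year from 1960 to 2020 appears equally often in the melted data, which holds here because rows with missing values were dropped.

```python
import math

# Values copied from the outputs shown in this notebook.
pred_1960 = 55.051293
pred_2020 = 73.224064
coef_per_std = 5.332978 / 1.000044  # ratio computed above

# Rise of the fitted line divided by the elapsed calendar years.
slope_per_year = (pred_2020 - pred_1960) / (2020 - 1960)

# Population standard deviation of the integers 1960..2020,
# assuming each year appears equally often.
n_years = 61
std_year = math.sqrt((n_years**2 - 1) / 12)

print(round(slope_per_year, 4))            # per calendar year
print(round(std_year, 2))                  # about 17.6 years
print(round(coef_per_std / std_year, 4))   # same rate, from the ratio
```

Both routes give about 0.30 years of life expectancy gained per calendar year.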
Standardized Linear Regression#
While on the topic of linear regression, I also wanted to compare it with a regression on a standardized input, and show that the slope can be found from the standardized fit as well. First, we scale the data using StandardScaler.
df_lin2 = df1_melt.copy()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df_lin2[["Year"]])
StandardScaler()
df_lin2[["Year"]] = scaler.transform(df1_melt[["Year"]])
df_lin2
Country Name | Country Code | Year | Life Expectancy | |
---|---|---|---|---|
0 | Aruba | ABW | -1.703886 | 65.662 |
1 | Afghanistan | AFG | -1.703886 | 32.446 |
2 | Angola | AGO | -1.703886 | 37.524 |
3 | Albania | ALB | -1.703886 | 62.283 |
4 | United Arab Emirates | ARE | -1.703886 | 51.537 |
... | ... | ... | ... | ... |
11402 | Samoa | WSM | 1.703886 | 73.450 |
11403 | Yemen, Rep. | YEM | 1.703886 | 66.181 |
11404 | South Africa | ZAF | 1.703886 | 64.379 |
11405 | Zambia | ZMB | 1.703886 | 64.194 |
11406 | Zimbabwe | ZWE | 1.703886 | 61.738 |
11407 rows × 4 columns
We can check that the data was successfully rescaled by seeing whether the mean of the input/x-component (“Year”) is close to 0 and its standard deviation is close to 1.
df_lin2.mean(axis=0, numeric_only=True)
Year -3.986564e-17
Life Expectancy 6.413768e+01
dtype: float64
df_lin2.std(axis=0, numeric_only=True)
Year 1.000044
Life Expectancy 11.405573
dtype: float64
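The standard deviation of Year comes out as 1.000044 rather than exactly 1 because the two libraries use different conventions: StandardScaler divides by the population standard deviation (ddof=0), while pandas .std() defaults to the sample standard deviation (ddof=1). A small sketch of the resulting factor, using the row count shown above:

```python
import math

n = 11407  # number of rows in the melted frame shown above

# A column scaled to population std 1 has sample std sqrt(n / (n - 1)).
sample_std = math.sqrt(n / (n - 1))
print(round(sample_std, 6))
```

The result rounds to 1.000044, matching the value in the output.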
Here, we’ll make the second (but this time rescaled) regression line.
reg2 = LinearRegression()
reg2.fit(df_lin2[["Year"]], df_lin2["Life Expectancy"])
LinearRegression()
df_lin2["Pred"] = reg2.predict(df_lin2[["Year"]])
df_lin2
Country Name | Country Code | Year | Life Expectancy | Pred | |
---|---|---|---|---|---|
0 | Aruba | ABW | -1.703886 | 65.662 | 55.051293 |
1 | Afghanistan | AFG | -1.703886 | 32.446 | 55.051293 |
2 | Angola | AGO | -1.703886 | 37.524 | 55.051293 |
3 | Albania | ALB | -1.703886 | 62.283 | 55.051293 |
4 | United Arab Emirates | ARE | -1.703886 | 51.537 | 55.051293 |
... | ... | ... | ... | ... | ... |
11402 | Samoa | WSM | 1.703886 | 73.450 | 73.224064 |
11403 | Yemen, Rep. | YEM | 1.703886 | 66.181 | 73.224064 |
11404 | South Africa | ZAF | 1.703886 | 64.379 | 73.224064 |
11405 | Zambia | ZMB | 1.703886 | 64.194 | 73.224064 |
11406 | Zimbabwe | ZWE | 1.703886 | 61.738 | 73.224064 |
11407 rows × 5 columns
reg2.coef_
array([5.33274403])
Here we can see that the coefficient is the slope of the standardized fit and equals the ratio we calculated before (about 5.33).
c2 = alt.Chart(df_lin2).mark_line().encode(
x="Year",
y=alt.Y("Pred", scale=alt.Scale(zero=False))
)
c2
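The relationship between the two fits can be checked on synthetic data: standardizing the input multiplies the slope by the standard deviation of the input and leaves the predictions unchanged. The sketch below is a stand-in that uses numpy's polyfit instead of scikit-learn's LinearRegression. The data is made up, with a true slope of 0.3 chosen for illustration.

```python
import numpy as np

# Synthetic data (made up): years 1960..2020 and a noisy linear trend.
rng = np.random.default_rng(0)
year = np.arange(1960, 2021, dtype=float)
life = 55.0 + 0.3 * (year - 1960) + rng.normal(0.0, 1.0, size=year.size)

# Least-squares slope on the raw years (polyfit returns [slope, intercept]).
slope_raw = np.polyfit(year, life, 1)[0]

# Same fit after standardizing the years (population std, like StandardScaler).
year_scaled = (year - year.mean()) / year.std()
slope_scaled = np.polyfit(year_scaled, life, 1)[0]

# Dividing the standardized slope by std(year) recovers the per-year slope.
print(slope_raw, slope_scaled / year.std())
```

This is why dividing the standardized coefficient by the standard deviation of the original years gives the rate per calendar year.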
Summary#
In summary, I took a dataset from Kaggle that records the life expectancy of people in different countries from 1960 to 2020. Using the data, I first found which country had the greatest average life expectancy, which was Iceland. Then, I graphed the life expectancy trend by continent, which gave several observations, one of which was that South America showed a noticeable dip around 1992, unlike the other continents. Lastly, I found the global rate of increase in life expectancy using linear regression, which was approximately 0.3 years per year.
References#
What is the source of your dataset(s)?
https://www.kaggle.com/datasets/hasibalmuzdadid/global-life-expectancy-historical-dataset
List any other references that you found helpful.
https://stackoverflow.com/questions/55910004/get-continent-name-from-country-using-pycountry