Global Life Expectancy#

Author: Eden Kim

Course Project, UC Irvine, Math 10, F22


In this project, we will look at a historical dataset that contains the recorded life expectancy for the years 1960 to 2020 in a number of countries. Some things we want to explore are the country with the greatest life expectancy, life expectancy by region, and a global prediction line/trend.

Country with Greatest Life Expectancy#

First, we’ll try to create a bar graph to visually find the country with greatest life expectancy.

import pandas as pd
import altair as alt
df = pd.read_csv("global life expectancy.csv")
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Aruba ABW 65.662 66.074 66.444 66.787 67.113 67.435 67.762 68.095 ... 75.158 75.299000 75.441000 75.583000 75.725000 75.868000 76.010000 76.152000 76.293000 76.434000
1 Afghanistan AFG 32.446 32.962 33.471 33.971 34.463 34.948 35.430 35.914 ... 61.553 62.054000 62.525000 62.966000 63.377000 63.763000 64.130000 64.486000 64.833000 65.173000
2 Angola AGO 37.524 37.811 38.113 38.430 38.760 39.102 39.454 39.813 ... 56.330 57.236000 58.054000 58.776000 59.398000 59.925000 60.379000 60.782000 61.147000 61.487000
3 Albania ALB 62.283 63.301 64.190 64.914 65.463 65.850 66.110 66.304 ... 76.914 77.252000 77.554000 77.813000 78.025000 78.194000 78.333000 78.458000 78.573000 78.686000
4 United Arab Emirates ARE 51.537 52.560 53.573 54.572 55.555 56.523 57.482 58.432 ... 76.521 76.711000 76.903000 77.095000 77.285000 77.470000 77.647000 77.814000 77.972000 78.120000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 Kosovo XKX NaN NaN NaN NaN NaN NaN NaN NaN ... 70.200 70.497561 70.797561 71.097561 71.346341 71.846341 72.295122 72.695122 73.092683 71.087805
201 Yemen, Rep. YEM 29.919 30.163 30.500 30.943 31.501 32.175 32.960 33.836 ... 65.768 65.920000 66.016000 66.066000 66.085000 66.087000 66.086000 66.096000 66.125000 66.181000
202 South Africa ZAF 48.406 48.777 49.142 49.509 49.888 50.284 50.705 51.148 ... 58.895 60.060000 61.099000 61.968000 62.649000 63.153000 63.538000 63.857000 64.131000 64.379000
203 Zambia ZMB 46.687 47.084 47.446 47.772 48.068 48.351 48.643 48.960 ... 57.126 58.502000 59.746000 60.831000 61.737000 62.464000 63.043000 63.510000 63.886000 64.194000
204 Zimbabwe ZWE 53.019 53.483 53.946 54.403 54.849 55.274 55.671 56.034 ... 52.896 55.032000 56.897000 58.410000 59.534000 60.294000 60.812000 61.195000 61.490000 61.738000

205 rows × 63 columns

Some countries have missing data. We drop them from the dataset because missing values would distort the averages and cause problems when graphing.

df1 = df.dropna(axis=0).copy()
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Aruba ABW 65.662 66.074 66.444 66.787 67.113 67.435 67.762 68.095 ... 75.158 75.299 75.441 75.583 75.725 75.868 76.010 76.152 76.293 76.434
1 Afghanistan AFG 32.446 32.962 33.471 33.971 34.463 34.948 35.430 35.914 ... 61.553 62.054 62.525 62.966 63.377 63.763 64.130 64.486 64.833 65.173
2 Angola AGO 37.524 37.811 38.113 38.430 38.760 39.102 39.454 39.813 ... 56.330 57.236 58.054 58.776 59.398 59.925 60.379 60.782 61.147 61.487
3 Albania ALB 62.283 63.301 64.190 64.914 65.463 65.850 66.110 66.304 ... 76.914 77.252 77.554 77.813 78.025 78.194 78.333 78.458 78.573 78.686
4 United Arab Emirates ARE 51.537 52.560 53.573 54.572 55.555 56.523 57.482 58.432 ... 76.521 76.711 76.903 77.095 77.285 77.470 77.647 77.814 77.972 78.120
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
199 Samoa WSM 56.902 57.188 57.472 57.756 58.045 58.340 58.642 58.951 ... 71.906 72.136 72.351 72.549 72.730 72.895 73.046 73.187 73.321 73.450
201 Yemen, Rep. YEM 29.919 30.163 30.500 30.943 31.501 32.175 32.960 33.836 ... 65.768 65.920 66.016 66.066 66.085 66.087 66.086 66.096 66.125 66.181
202 South Africa ZAF 48.406 48.777 49.142 49.509 49.888 50.284 50.705 51.148 ... 58.895 60.060 61.099 61.968 62.649 63.153 63.538 63.857 64.131 64.379
203 Zambia ZMB 46.687 47.084 47.446 47.772 48.068 48.351 48.643 48.960 ... 57.126 58.502 59.746 60.831 61.737 62.464 63.043 63.510 63.886 64.194
204 Zimbabwe ZWE 53.019 53.483 53.946 54.403 54.849 55.274 55.671 56.034 ... 52.896 55.032 56.897 58.410 59.534 60.294 60.812 61.195 61.490 61.738

187 rows × 63 columns
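Before dropping rows, it can help to see which ones will be lost. The following is a minimal sketch on a made-up three-country table (the names and values are hypothetical, not taken from the dataset). It also shows that dropna keeps the original index labels, so the index can have gaps afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: country "B" is missing its 1960 value.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "1960": [50.0, np.nan, 70.0],
    "1961": [51.0, 61.0, 71.0],
})

# Rows that contain at least one missing value (these will be dropped).
dropped = toy[toy.isna().any(axis=1)]

# Same call as above: drop every row that has a missing value.
kept = toy.dropna(axis=0).copy()

print(dropped["Country Name"].tolist())
print(kept.index.tolist())  # index labels are kept, so label 1 is now missing
```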

Then, df.melt and groupby were used to get the average life expectancy for each country. df.melt brings all the years into one column. Calling groupby("Country Name") on the resulting dataframe and taking the mean of "Life Expectancy" collects all the life expectancies for each country (for years 1960, 1961, ..., 2020) and averages them.

df1_melt = df1.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Life Expectancy"
)
Country Name Country Code Year Life Expectancy
0 Aruba ABW 1960 65.662
1 Afghanistan AFG 1960 32.446
2 Angola AGO 1960 37.524
3 Albania ALB 1960 62.283
4 United Arab Emirates ARE 1960 51.537
... ... ... ... ...
11402 Samoa WSM 2020 73.450
11403 Yemen, Rep. YEM 2020 66.181
11404 South Africa ZAF 2020 64.379
11405 Zambia ZMB 2020 64.194
11406 Zimbabwe ZWE 2020 61.738

11407 rows × 4 columns

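# Average only the numeric column and keep "Country Name" as a regular column
df1_bar = df1_melt.groupby("Country Name", as_index=False)["Life Expectancy"].mean()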
Country Name Life Expectancy
0 Afghanistan 49.459066
1 Albania 71.795115
2 Algeria 63.784213
3 Angola 46.999508
4 Antigua and Barbuda 70.901902
... ... ...
182 Vietnam 68.850000
183 Virgin Islands (U.S.) 73.813023
184 Yemen, Rep. 52.760738
185 Zambia 51.307410
186 Zimbabwe 54.651705

187 rows × 2 columns
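To see the melt and groupby steps on a small scale, here is a minimal sketch with a made-up two-country table (the names and values are hypothetical). Each year column becomes rows in a single "Year" column, and the per-country mean then averages over those rows.

```python
import pandas as pd

# Hypothetical toy data: two countries, three years.
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "1960": [50.0, 60.0],
    "1961": [52.0, 61.0],
    "1962": [54.0, 62.0],
})

# Wide to long: one row per (country, year) pair.
toy_melt = toy.melt(
    id_vars=["Country Name", "Country Code"],
    var_name="Year",
    value_name="Life Expectancy",
)

# Average over the years for each country.
toy_mean = toy_melt.groupby("Country Name", as_index=False)["Life Expectancy"].mean()

print(toy_melt.shape)  # 2 countries x 3 years = 6 rows, 4 columns
print(toy_mean)
```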

Now we have the x-axis and y-axis data that we wanted. To make the categorical bar graph, we use altair with df1_bar, with “Country Name” on the x-axis and “Life Expectancy” on the y-axis. I encoded the color as quantitative (:Q) so that the shading distinguishes countries with higher and lower life expectancy. I also added a tooltip so that we can see the exact life expectancy value along with the country name.

alt.Chart(df1_bar).mark_bar().encode(
    x='Country Name',
    y='Life Expectancy',
    color='Life Expectancy:Q',
    tooltip=['Country Name','Life Expectancy']
)

This bar graph does show which countries have generally higher and lower life expectancies, but it’s still hard to tell at a glance which one has the greatest. We can find the country we're looking for by sorting the life expectancy values in df1_bar, reading off the index of the maximum, and looking up that index in the dataframe.

df1_bar["Life Expectancy"].sort_values(ascending=False)
76     78.173703
84     78.010892
160    77.845010
161    77.665282
127    77.428477
         ...    
125    45.920328
124    45.779213
153    44.564557
106    44.138295
146    41.197459
Name: Life Expectancy, Length: 187, dtype: float64
df1_bar.loc[76,"Country Name"]

With this, we have found that Iceland had the overall greatest life expectancy (of the countries in the data).
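As an alternative to sorting and reading off the index by eye, pandas can return the row label of the maximum directly. A minimal sketch on made-up numbers (toy_bar and its values are hypothetical stand-ins for df1_bar):

```python
import pandas as pd

# Hypothetical stand-in for df1_bar.
toy_bar = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "Life Expectancy": [70.1, 78.2, 65.3],
})

# idxmax returns the index label of the largest value.
top_idx = toy_bar["Life Expectancy"].idxmax()
top_country = toy_bar.loc[top_idx, "Country Name"]

# nlargest returns the top rows already sorted, which is handy for a top-N list.
top2 = toy_bar.nlargest(2, "Life Expectancy")

print(top_country)
print(top2)
```

On the real data, the same two calls on df1_bar should point at the same row as the sorted output above.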

Life Expectancy Over Time for Each Continent#

In this section, we want to see and compare the life expectancy trend (from 1960 to 2020) between different regions. As for how to divide the regions, I used continents.

To do this, I used a function that returns the continent given the country’s name. The function is from

Some things to note:

  • It required installing pycountry_convert module (which was moved to files: “requirements.txt”)

  • Some country names couldn’t be processed by the function, either because of how the name is formatted or because the function didn’t recognize it as a country. To address this, I used try and except.

  • try runs the code in its block. If any error occurs along the way, control passes to except, where I handle the error by labeling the continent as 'Misc'.
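The try/except fallback described in the notes above can be sketched without the external module. In this minimal example, TOY_LOOKUP and toy_to_continent are hypothetical stand-ins for the pycountry_convert lookup, and the three names are illustrative. It also lists which names ended up as 'Misc', which is a quick way to see what the lookup could not handle.

```python
import pandas as pd

# Hypothetical stand-in for the real lookup, which raises an error for
# names it does not recognize.
TOY_LOOKUP = {"Albania": "Europe", "Angola": "Africa"}

def toy_to_continent(country_name):
    return TOY_LOOKUP[country_name]  # raises KeyError for unknown names

def safe_continent(country_name):
    # Fall back to 'Misc' when the lookup fails.
    try:
        return toy_to_continent(country_name)
    except KeyError:
        return "Misc"

# The index labels are deliberately non-consecutive, like a dataframe after dropna.
names = pd.Series(["Albania", "Yemen, Rep.", "Angola"], index=[3, 201, 2])

# map applies the function row by row and keeps each result on its own row.
continents = names.map(safe_continent)
unmatched = names[continents == "Misc"].tolist()

print(continents)
print(unmatched)
```

Using map keeps each continent aligned with its country regardless of gaps in the index.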

import pycountry_convert as pc

def to_continent(country_name):
    country_alpha2 = pc.country_name_to_country_alpha2(country_name)
    country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pc.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name
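continents = []
for name in df1["Country Name"]:  # iterate names directly; df1's index has gaps after dropna
    try:
        continents.append(to_continent(name))
    except Exception:  # names the lookup cannot handle are labeled 'Misc'
        continents.append("Misc")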
['North America',
 'South America',
 'North America',
 'North America',
 'South America',
 'South America',
 'North America',
 'North America',
 'South America',
 'South America',
 'North America',
 'North America',
 'North America',
 'South America',
 'North America',
 'North America',
 'South America',
 'North America',
 'North America',
 'North America',
 'North America',
 'North America',
 'North America',
 'North America',
 'South America',
 'North America',
 'South America',
 'North America',
 'South America',
 'North America',

We create a new column in the dataframe that has the continents.

df2 = df1.copy()
df2["Continent"] = continents
Country Name Country Code 1960 1961 1962 1963 1964 1965 1966 1967 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 Continent
0 Aruba ABW 65.662 66.074 66.444 66.787 67.113 67.435 67.762 68.095 ... 75.299 75.441 75.583 75.725 75.868 76.010 76.152 76.293 76.434 North America
1 Afghanistan AFG 32.446 32.962 33.471 33.971 34.463 34.948 35.430 35.914 ... 62.054 62.525 62.966 63.377 63.763 64.130 64.486 64.833 65.173 Asia
2 Angola AGO 37.524 37.811 38.113 38.430 38.760 39.102 39.454 39.813 ... 57.236 58.054 58.776 59.398 59.925 60.379 60.782 61.147 61.487 Africa
3 Albania ALB 62.283 63.301 64.190 64.914 65.463 65.850 66.110 66.304 ... 77.252 77.554 77.813 78.025 78.194 78.333 78.458 78.573 78.686 Europe
4 United Arab Emirates ARE 51.537 52.560 53.573 54.572 55.555 56.523 57.482 58.432 ... 76.711 76.903 77.095 77.285 77.470 77.647 77.814 77.972 78.120 Asia
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
199 Samoa WSM 56.902 57.188 57.472 57.756 58.045 58.340 58.642 58.951 ... 72.136 72.351 72.549 72.730 72.895 73.046 73.187 73.321 73.450 Asia
201 Yemen, Rep. YEM 29.919 30.163 30.500 30.943 31.501 32.175 32.960 33.836 ... 65.920 66.016 66.066 66.085 66.087 66.086 66.096 66.125 66.181 Misc
202 South Africa ZAF 48.406 48.777 49.142 49.509 49.888 50.284 50.705 51.148 ... 60.060 61.099 61.968 62.649 63.153 63.538 63.857 64.131 64.379 Oceania
203 Zambia ZMB 46.687 47.084 47.446 47.772 48.068 48.351 48.643 48.960 ... 58.502 59.746 60.831 61.737 62.464 63.043 63.510 63.886 64.194 North America
204 Zimbabwe ZWE 53.019 53.483 53.946 54.403 54.849 55.274 55.671 56.034 ... 55.032 56.897 58.410 59.534 60.294 60.812 61.195 61.490 61.738 Africa

187 rows × 64 columns

Next, group the dataframe by continent so that we have year/life expectancy data by continent, not country.

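# Average only the numeric (year) columns and keep "Continent" as a regular column
df_conti = df2.groupby("Continent", as_index=False).mean(numeric_only=True)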
Continent 1960 1961 1962 1963 1964 1965 1966 1967 1968 ... 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Africa 51.911373 52.315236 52.684135 53.103648 53.520000 53.917729 54.322771 54.709387 55.099520 ... 69.607766 69.906567 70.250675 70.546000 70.781485 71.049631 71.290738 71.542371 71.767809 71.847491
1 Asia 52.830392 53.399982 53.920459 54.435588 54.981245 55.461831 55.948146 56.391173 56.811198 ... 70.412202 70.774841 71.158889 71.506061 71.796693 72.109442 72.340053 72.584231 72.810201 72.909318
2 Europe 57.659488 58.135747 58.456193 58.826982 59.260429 59.588553 59.953093 60.312871 60.648314 ... 72.373682 72.687835 73.010700 73.364541 73.529201 73.785163 74.020479 74.194212 74.441680 74.365482
3 Misc 51.405921 51.830659 52.134498 52.573478 53.043221 53.392378 53.776949 54.147137 54.459372 ... 67.674541 68.079644 68.460127 68.861809 69.124180 69.446056 69.711120 69.957793 70.220220 70.158805
4 North America 56.544218 57.091116 57.575617 58.062500 58.577691 59.041588 59.495272 59.947293 60.372332 ... 72.423866 72.725760 73.033960 73.337421 73.578526 73.854245 74.113457 74.303688 74.544428 74.552098
5 Oceania 57.841107 58.207232 58.364212 58.699668 59.066634 59.269985 59.577412 59.841573 60.042966 ... 72.541788 72.883963 73.242651 73.567929 73.753539 74.028029 74.195817 74.406273 74.589739 74.657851
6 South America 50.836990 51.294980 51.773002 52.278156 52.809593 53.359563 53.914420 54.457712 54.976193 ... 70.434427 70.767883 71.154246 71.429815 71.752224 72.003946 72.277668 72.505290 72.744534 72.903534

7 rows × 62 columns

In order to make the line graph, we will use df.melt again to get the years into their own column. This time, the melt is used so that we can have the year column be the x-axis.

df_melt = df_conti.melt(
    id_vars="Continent", var_name="Year", value_name="Life Expectancy"
)
Continent Year Life Expectancy
0 Africa 1960 51.911373
1 Asia 1960 52.830392
2 Europe 1960 57.659488
3 Misc 1960 51.405921
4 North America 1960 56.544218
... ... ... ...
422 Europe 2020 74.365482
423 Misc 2020 70.158805
424 North America 2020 74.552098
425 Oceania 2020 74.657851
426 South America 2020 72.903534

427 rows × 3 columns

Finally, we’ll create a new altair chart, this time a line graph with “Year” on the x-axis and “Life Expectancy” on the y-axis, with one line per continent (six continents plus Misc). Since the lines are fairly close together and hard to distinguish, I set the y-axis so that it does not start at zero and added a legend selection that highlights the chosen continent.

sel = alt.selection_single(fields=["Continent"], bind="legend")

alt.Chart(df_melt).mark_line().encode(
    x="Year",
    y=alt.Y("Life Expectancy", scale=alt.Scale(zero=False)),
    color=alt.condition(sel, "Continent", alt.value("lightgrey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.2))
).add_selection(sel)

With this, we can make some observations:

  • The overall trend for all continents is positive. (There is an increase in life expectancy over the years.)

  • Europe, North America, and Oceania (Australia) have almost the same trend, converging more as time passes.

  • There is a noticeable dip in the trend of South America, starting around 1987, lowest being around 1992.
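The first two observations can be checked numerically from the table rather than read off the chart. This sketch copies the 1960 and 2020 columns of df_conti as printed above and computes the gain per continent, plus the spread among Europe, North America, and Oceania at both ends of the period.

```python
import pandas as pd

# Continent means copied from the 1960 and 2020 columns of df_conti above.
ends = pd.DataFrame(
    {
        "1960": [51.911373, 52.830392, 57.659488, 51.405921,
                 56.544218, 57.841107, 50.836990],
        "2020": [71.847491, 72.909318, 74.365482, 70.158805,
                 74.552098, 74.657851, 72.903534],
    },
    index=["Africa", "Asia", "Europe", "Misc",
           "North America", "Oceania", "South America"],
)

# Total change over the period for each group.
ends["Gain"] = ends["2020"] - ends["1960"]

# Gap between the highest and lowest of the three similar continents.
trio = ends.loc[["Europe", "North America", "Oceania"]]
spread_1960 = trio["1960"].max() - trio["1960"].min()
spread_2020 = trio["2020"].max() - trio["2020"].min()

print(ends["Gain"].round(2))
print(round(spread_1960, 2), round(spread_2020, 2))  # about 1.30 and 0.29
```

Every gain is positive, and the gap among the three continents shrinks from about 1.3 years to about 0.3 years.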

Global Trend in Life Expectancy (prediction line)#

In this section, we want to see the global trend and prediction line, using data from all of the countries.

First, we draw a scatter plot of life expectancy by year, using a random sample of 500 rows because the full dataset is too large to plot.

Life_Trend = alt.Chart(df1_melt.sample(500)).mark_point(size=20).encode(
    x='Year',
    y='Life Expectancy',
    color=alt.Color('Life Expectancy'),
    tooltip=['Country Name','Year','Life Expectancy']
).properties(
    title='Year x Life Expectancy'
)

Next, we make the linear regression line using scikit learn.

from sklearn.linear_model import LinearRegression
df_lin = df1_melt.copy()
reg = LinearRegression()
reg.fit(df_lin[["Year"]], df_lin["Life Expectancy"])
df_lin["Pred"] = reg.predict(df_lin[["Year"]])
Country Name Country Code Year Life Expectancy Pred
0 Aruba ABW 1960 65.662 55.051293
1 Afghanistan AFG 1960 32.446 55.051293
2 Angola AGO 1960 37.524 55.051293
3 Albania ALB 1960 62.283 55.051293
4 United Arab Emirates ARE 1960 51.537 55.051293
... ... ... ... ... ...
11402 Samoa WSM 2020 73.450 73.224064
11403 Yemen, Rep. YEM 2020 66.181 73.224064
11404 South Africa ZAF 2020 64.379 73.224064
11405 Zambia ZMB 2020 64.194 73.224064
11406 Zimbabwe ZWE 2020 61.738 73.224064

11407 rows × 5 columns

We’ll be using the prediction from regression to draw the regression/prediction line.

c1 = alt.Chart(df_lin).mark_line().encode(x="Year", y="Pred")

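The line seems about right, fitted to the scatter plot. In order to find the trend, we’ll want to calculate the slope of the line. Because the predictions are a linear function of the input, the standard deviation of the predictions equals the absolute value of the slope times the standard deviation of the input, so we can find the slope from the two standard deviations, or with standard scaling.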

Year                1.000044
Life Expectancy    11.405573
Pred                5.332978
dtype: float64
slope = 5.332978/1.000044

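This value is the slope with Year measured in standard deviations: the Year standard deviation printed above is about 1, which indicates that the column had already been standardized when it was computed. Life expectancy therefore rises by about 5.33 years per standard deviation of Year, not per calendar year. One standard deviation of the years 1960 to 2020 is about 17.6 calendar years, so the rate is roughly 5.33/17.6 ≈ 0.30 years of life expectancy per calendar year, which agrees with the predictions above: (73.22 − 55.05)/60 ≈ 0.30.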
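As a cross-check on the units of this slope, the numbers printed above can be combined in two ways. This sketch assumes every year from 1960 to 2020 appears equally often in the data (which holds if each country has all 61 years), so the population standard deviation of Year has a closed form.

```python
import math

# Numbers copied from the outputs above.
pred_1960, pred_2020 = 55.051293, 73.224064
std_year_scaled, std_pred = 1.000044, 5.332978

# Slope read directly off the fitted line:
# years of life expectancy per calendar year.
slope_per_year = (pred_2020 - pred_1960) / (2020 - 1960)

# The ratio computed above: slope per one standard deviation of Year.
slope_per_std = std_pred / std_year_scaled

# Assumption: 61 equally frequent years, so the population standard
# deviation of Year is sqrt((61**2 - 1) / 12), about 17.6.
year_std = math.sqrt((61**2 - 1) / 12)

print(round(slope_per_year, 4))
print(round(slope_per_std / year_std, 4))
```

Both routes give about 0.30 years of life expectancy per calendar year.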

Standardized Linear Regression#

While on the topic of linear regression I also wanted to try comparing it to regression rescaled to standard scale, and show that slope can be found using standard scaling as well. First off, we scale the data using StandardScaler.

df_lin2 = df1_melt.copy()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(df1_melt[["Year"]])
df_lin2[["Year"]] = scaler.transform(df1_melt[["Year"]])
Country Name Country Code Year Life Expectancy
0 Aruba ABW -1.703886 65.662
1 Afghanistan AFG -1.703886 32.446
2 Angola AGO -1.703886 37.524
3 Albania ALB -1.703886 62.283
4 United Arab Emirates ARE -1.703886 51.537
... ... ... ... ...
11402 Samoa WSM 1.703886 73.450
11403 Yemen, Rep. YEM 1.703886 66.181
11404 South Africa ZAF 1.703886 64.379
11405 Zambia ZMB 1.703886 64.194
11406 Zimbabwe ZWE 1.703886 61.738

11407 rows × 4 columns

We can check that the data was successfully rescaled by seeing whether the mean of the input/x-component (“Year”) is close to 0 and its standard deviation is close to 1.

Year              -3.986564e-17
Life Expectancy    6.413768e+01
dtype: float64
Year                1.000044
Life Expectancy    11.405573
dtype: float64
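The rescaled values and the standard deviation of 1.000044 (rather than exactly 1) can be reproduced by hand. This sketch builds a Year column with the same shape as df1_melt (187 countries times 61 years, an assumption read off the row counts above) and standardizes it with the population standard deviation, which is what StandardScaler uses. pandas' std defaults to the sample standard deviation (ddof=1), which explains the small excess over 1.

```python
import numpy as np
import pandas as pd

# 187 countries x 61 years = 11407 rows; only the Year values matter here.
years = pd.Series(np.repeat(np.arange(1960, 2021), 187), dtype=float)

# Standardize with the population standard deviation (ddof=0).
z = (years - years.mean()) / years.std(ddof=0)

print(round(z.iloc[0], 6), round(z.iloc[-1], 6))  # scaled 1960 and 2020
print(round(z.std(), 6))         # pandas default, ddof=1
print(round(z.std(ddof=0), 6))   # population std
```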

Here, we’ll make the second (but this time rescaled) regression line.

reg2 = LinearRegression()
reg2.fit(df_lin2[["Year"]], df_lin2["Life Expectancy"])
df_lin2["Pred"] = reg2.predict(df_lin2[["Year"]])
Country Name Country Code Year Life Expectancy Pred
0 Aruba ABW -1.703886 65.662 55.051293
1 Afghanistan AFG -1.703886 32.446 55.051293
2 Angola AGO -1.703886 37.524 55.051293
3 Albania ALB -1.703886 62.283 55.051293
4 United Arab Emirates ARE -1.703886 51.537 55.051293
... ... ... ... ... ...
11402 Samoa WSM 1.703886 73.450 73.224064
11403 Yemen, Rep. YEM 1.703886 66.181 73.224064
11404 South Africa ZAF 1.703886 64.379 73.224064
11405 Zambia ZMB 1.703886 64.194 73.224064
11406 Zimbabwe ZWE 1.703886 61.738 73.224064

11407 rows × 5 columns


Here we can see that the coefficient refers to the slope and equals what we calculated before.
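The relationship between the two fits can be checked on made-up data. This sketch uses np.polyfit as a stand-in for LinearRegression (both compute an ordinary least-squares line), and the y values are hypothetical. Standardizing x multiplies the slope by the standard deviation of x, moves the intercept to the mean of y, and leaves the predictions unchanged.

```python
import numpy as np

# Made-up data: a linear trend with a small deterministic wiggle.
x = np.arange(1960, 2021, dtype=float)
y = 0.3 * x - 533.0 + np.sin(x)

# Least-squares line on the raw years.
raw_slope, raw_intercept = np.polyfit(x, y, 1)

# Least-squares line on standardized years (population std, as StandardScaler uses).
z = (x - x.mean()) / x.std()
std_slope, std_intercept = np.polyfit(z, y, 1)

# Both lines give the same predictions.
same_predictions = np.allclose(
    raw_slope * x + raw_intercept,
    std_slope * z + std_intercept,
)

print(raw_slope, std_slope, x.std())
print(same_predictions)
```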

c2 = alt.Chart(df_lin2).mark_line().encode(
    x="Year",
    y=alt.Y("Pred", scale=alt.Scale(zero=False))
)


In summary, I took a dataset from Kaggle that records the life expectancy of people in different countries from 1960 to 2020. Using the data, I first found which country had the greatest overall life expectancy, which was Iceland. Then, I graphed the life expectancy trend by region/continent, which gave several observations, one of which was that South America was an outlier, with a noticeable dip around 1992 that the other continents did not show. Lastly, I found the global trend/rate of increase in life expectancy using linear regression, which was approximately 0.3 years per year.

