Russian Federation Economic Indicators
Author: Lily McBeath
Course Project, UC Irvine, Math 10, W22
Introduction
I am planning to explore various economic indicators in the Russian Federation from 2006 to 2020, and to what extent they correlate with one another. I am using data on economic factors from The World Bank’s Data website, cited below. I am also looking into whether any of these economic indicators correlate with the exchange rate between the Russian Ruble and US Dollar. This data may be interesting given the recent sharp fall in the value of the Ruble, following sanctions on Russia related to the war in Ukraine.
Main portion of the project
import numpy as np
import pandas as pd
df = pd.read_csv("russia.csv")
df
Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | Unnamed: 65 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Russian Federation | RUS | Air transport, freight (million ton-km) | IS.AIR.GOOD.MT.K1 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.132144e+03 | 4.249269e+03 | 4.413559e+03 | 4.761047e+03 | 5.863197e+03 | 6.845230e+03 | 6.810610e+03 | 6.481000e+03 | 4.314605e+03 | NaN |
1 | Russian Federation | RUS | CPIA efficiency of revenue mobilization rating... | IQ.CPA.REVN.XQ | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | Russian Federation | RUS | CPIA business regulatory environment rating (1... | IQ.CPA.BREG.XQ | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | Russian Federation | RUS | Investment in transport with private participa... | IE.PPI.TRAN.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.983900e+09 | 3.100000e+06 | NaN | 1.822200e+09 | 7.940000e+07 | 2.037000e+09 | 1.622700e+09 | 3.357770e+09 | 9.257100e+08 | NaN |
4 | Russian Federation | RUS | Time required to start a business, male (days) | IC.REG.DURS.MA | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 1.320000e+01 | 1.150000e+01 | 1.080000e+01 | 1.010000e+01 | 1.010000e+01 | 1.010000e+01 | 1.010000e+01 | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1438 | Russian Federation | RUS | Changes in inventories (current US$) | NE.GDI.STKB.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 6.608344e+10 | 3.052414e+10 | 2.005045e+10 | 2.095911e+10 | 1.550199e+10 | 2.557471e+10 | 2.094207e+10 | 2.742143e+10 | 3.310177e+10 | NaN |
1439 | Russian Federation | RUS | Gross national expenditure (constant LCU) | NE.DAB.TOTL.KN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 9.010340e+13 | 9.124340e+13 | 9.030140e+13 | 8.225460e+13 | 8.117190e+13 | 8.452690e+13 | 8.639170e+13 | 8.899610e+13 | 8.505000e+13 | NaN |
1440 | Russian Federation | RUS | Households and NPISHs Final consumption expend... | NE.CON.PRVT.KN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 4.815000e+13 | 5.059880e+13 | 5.164600e+13 | 4.677560e+13 | 4.558870e+13 | 4.727660e+13 | 4.927200e+13 | 5.082080e+13 | 4.648470e+13 | NaN |
1441 | Russian Federation | RUS | Military expenditure (current USD) | MS.MIL.XPND.CD | NaN | NaN | NaN | NaN | NaN | NaN | ... | 8.146940e+10 | 8.835290e+10 | 8.469650e+10 | 6.642182e+10 | 6.924529e+10 | 6.691303e+10 | 6.160920e+10 | 6.520134e+10 | 6.171254e+10 | NaN |
1442 | Russian Federation | RUS | Fixed broadband subscriptions (per 100 people) | IT.NET.BBND.P2 | NaN | NaN | NaN | NaN | NaN | NaN | ... | 1.453068e+01 | 1.645264e+01 | 1.724714e+01 | 1.854085e+01 | 1.894519e+01 | 2.137238e+01 | 2.200089e+01 | 2.252492e+01 | 2.321094e+01 | NaN |
1443 rows × 66 columns
df.isna().sum()[36:]
1992 882
1993 838
1994 739
1995 750
1996 748
1997 720
1998 706
1999 692
2000 590
2001 579
2002 512
2003 563
2004 562
2005 555
2006 564
2007 499
2008 528
2009 516
2010 495
2011 517
2012 466
2013 489
2014 414
2015 454
2016 458
2017 467
2018 513
2019 591
2020 885
Unnamed: 65 1443
dtype: int64
We can see that there is a lot of missing data here, which I will address soon.
First I want to drop the Country Name, Country Code, and Unnamed: 65 columns, which are the same for all factors, and the Indicator Code column, as the Indicator Name is more descriptive.
df.drop(labels=["Country Name", "Country Code", "Unnamed: 65", "Indicator Code"], axis=1, inplace=True)
Now I want to keep only the years from 2006 onward, and then only the factors that have no missing data in those years.
# Drop the year columns 1960-2005, which sit at positions 1 through 46
df.drop(columns=df.columns[1:47], inplace=True)
df.dropna(inplace=True)
df
Indicator Name | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Air transport, freight (million ton-km) | 1.926295e+03 | 1.224313e+03 | 2.399593e+03 | 2.305548e+03 | 3.531583e+03 | 3.900120e+03 | 4.132144e+03 | 4.249269e+03 | 4.413559e+03 | 4.761047e+03 | 5.863197e+03 | 6.845230e+03 | 6.810610e+03 | 6.481000e+03 | 4.314605e+03 |
12 | Broad money (% of GDP) | 3.762204e+01 | 4.297675e+01 | 3.950786e+01 | 4.929882e+01 | 5.143911e+01 | 4.738647e+01 | 4.728942e+01 | 5.120003e+01 | 5.429533e+01 | 6.182658e+01 | 5.944583e+01 | 5.952221e+01 | 5.911864e+01 | 5.907597e+01 | 7.038103e+01 |
13 | Commercial bank branches (per 100,000 adults) | 3.036000e+01 | 3.364000e+01 | 3.557000e+01 | 3.464000e+01 | 3.506000e+01 | 3.675000e+01 | 3.825000e+01 | 3.852000e+01 | 3.704000e+01 | 3.293000e+01 | 3.013000e+01 | 2.923000e+01 | 2.626000e+01 | 2.563000e+01 | 2.459000e+01 |
20 | Debt service on external debt, total (TDS, cur... | 4.789859e+10 | 3.787248e+10 | 8.105131e+10 | 9.113474e+10 | 5.571607e+10 | 4.463868e+10 | 5.006398e+10 | 5.092890e+10 | 8.983215e+10 | 1.040257e+11 | 9.344588e+10 | 8.111808e+10 | 1.099975e+11 | 9.620816e+10 | 9.761423e+10 |
22 | Commercial banks and other lending (PPG + PNG)... | 2.519503e+10 | 5.257957e+10 | 4.573345e+10 | -1.850812e+10 | 2.342559e+10 | 4.698240e+10 | 7.076591e+10 | 4.357459e+10 | -3.068587e+10 | -5.652469e+10 | 1.189512e+10 | 2.225490e+09 | -4.589384e+10 | -3.161458e+10 | -1.757524e+10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1438 | Changes in inventories (current US$) | 2.641315e+10 | 4.119105e+10 | 5.332577e+10 | -3.752001e+10 | 1.509159e+10 | 6.013144e+10 | 6.608344e+10 | 3.052414e+10 | 2.005045e+10 | 2.095911e+10 | 1.550199e+10 | 2.557471e+10 | 2.094207e+10 | 2.742143e+10 | 3.310177e+10 |
1439 | Gross national expenditure (constant LCU) | 6.764728e+13 | 7.732504e+13 | 8.442601e+13 | 7.140223e+13 | 7.777010e+13 | 8.501400e+13 | 9.010340e+13 | 9.124340e+13 | 9.030140e+13 | 8.225460e+13 | 8.117190e+13 | 8.452690e+13 | 8.639170e+13 | 8.899610e+13 | 8.505000e+13 |
1440 | Households and NPISHs Final consumption expend... | 3.327116e+13 | 3.798122e+13 | 4.195640e+13 | 3.980456e+13 | 4.197077e+13 | 4.478010e+13 | 4.815000e+13 | 5.059880e+13 | 5.164600e+13 | 4.677560e+13 | 4.558870e+13 | 4.727660e+13 | 4.927200e+13 | 5.082080e+13 | 4.648470e+13 |
1441 | Military expenditure (current USD) | 3.451778e+10 | 4.353499e+10 | 5.618379e+10 | 5.153212e+10 | 5.872023e+10 | 7.023752e+10 | 8.146940e+10 | 8.835290e+10 | 8.469650e+10 | 6.642182e+10 | 6.924529e+10 | 6.691303e+10 | 6.160920e+10 | 6.520134e+10 | 6.171254e+10 |
1442 | Fixed broadband subscriptions (per 100 people) | 2.022269e+00 | 3.420206e+00 | 6.478241e+00 | 9.000403e+00 | 1.094235e+01 | 1.227170e+01 | 1.453068e+01 | 1.645264e+01 | 1.724714e+01 | 1.854085e+01 | 1.894519e+01 | 2.137238e+01 | 2.200089e+01 | 2.252492e+01 | 2.321094e+01 |
493 rows × 16 columns
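The positional slice used above depends on the exact column order of the file. As a minimal sketch of an alternative, the same selection could be made by label. The DataFrame `toy` and its values below are made up for illustration and are not the World Bank data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the World Bank table: one row per indicator, one column per year.
# All names and values here are hypothetical.
toy = pd.DataFrame({
    "Indicator Name": ["A", "B", "C"],
    "2004": [1.0, 2.0, 3.0],
    "2005": [1.5, np.nan, 3.5],
    "2006": [2.0, 2.5, np.nan],
    "2007": [2.5, 3.0, 4.5],
})

# Keep the name column plus every year column from 2006 onward, chosen by label.
year_cols = [c for c in toy.columns if c.isdigit() and int(c) >= 2006]
subset = toy[["Indicator Name"] + year_cols]

# Drop indicators with any missing value in the retained years.
complete = subset.dropna()
print(complete)
```

Selecting by label keeps working if the file gains or loses columns, at the cost of assuming that the year columns are named with plain digits.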
I am going to transform the structure of this DataFrame somewhat to more closely mimic the data we are used to working with in this class.
df = df.transpose().copy()
df.columns = df.iloc[0]
df = df[1:].copy()
df.index.name = 'Year'
df = df.reset_index()
df['Year'] = pd.to_datetime(df['Year']).dt.year
df.iloc[:,1:] = df.iloc[:,1:].apply(pd.to_numeric)
df.head()
Indicator Name | Year | Air transport, freight (million ton-km) | Broad money (% of GDP) | Commercial bank branches (per 100,000 adults) | Debt service on external debt, total (TDS, current US$) | Commercial banks and other lending (PPG + PNG) (NFL, current US$) | Portfolio investment, bonds (PPG + PNG) (NFL, current US$) | External debt stocks (% of GNI) | Secondary income receipts (BoP, current US$) | Foreign direct investment, net outflows (% of GDP) | ... | Taxes less subsidies on products (constant LCU) | GNI per capita, Atlas method (current US$) | GDP per capita, PPP (constant 2017 international $) | Manufacturing, value added (constant 2015 US$) | External balance on goods and services (current LCU) | Changes in inventories (current US$) | Gross national expenditure (constant LCU) | Households and NPISHs Final consumption expenditure (constant LCU) | Military expenditure (current USD) | Fixed broadband subscriptions (per 100 people) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006 | 1926.295 | 37.622039 | 30.36 | 4.789859e+10 | 2.519503e+10 | 1.221660e+10 | 32.386639 | 5.318470e+09 | 3.029824 | ... | 7.740037e+12 | 5810.0 | 21757.465166 | 1.496040e+11 | 3.425900e+12 | 2.641315e+10 | 6.764728e+13 | 3.327116e+13 | 3.451778e+10 | 2.022269 |
1 | 2007 | 1224.313 | 42.976748 | 33.64 | 3.787248e+10 | 5.257957e+10 | 3.830479e+09 | 32.759128 | 6.220470e+09 | 3.447028 | ... | 8.481176e+12 | 7560.0 | 23647.266506 | 1.608578e+11 | 2.866600e+12 | 4.119105e+10 | 7.732504e+13 | 3.798122e+13 | 4.353499e+10 | 3.420206 |
2 | 2008 | 2399.593 | 39.507861 | 35.57 | 8.105131e+10 | 4.573345e+10 | -2.304212e+10 | 25.973782 | 7.345430e+09 | 3.351460 | ... | 8.942480e+12 | 9580.0 | 24887.852720 | 1.574440e+11 | 3.812600e+12 | 5.332577e+10 | 8.442601e+13 | 4.195640e+13 | 5.618379e+10 | 6.478241 |
3 | 2009 | 2305.548 | 49.298818 | 34.64 | 9.113474e+10 | -1.850812e+10 | -9.323914e+09 | 34.361362 | 6.369080e+09 | 3.539911 | ... | 7.653594e+12 | 9230.0 | 22939.694054 | 1.344322e+11 | 2.887700e+12 | -3.752001e+10 | 7.140223e+13 | 3.980456e+13 | 5.153212e+10 | 9.000403 |
4 | 2010 | 3531.583 | 51.439107 | 35.06 | 5.571607e+10 | 2.342559e+10 | 3.674379e+09 | 28.277358 | 7.258440e+09 | 3.450434 | ... | 8.211698e+12 | 9980.0 | 23961.220293 | 1.460048e+11 | 3.739700e+12 | 1.509159e+10 | 7.777010e+13 | 4.197077e+13 | 5.872023e+10 | 10.942347 |
5 rows × 494 columns
df.columns
Index(['Year', 'Air transport, freight (million ton-km)',
'Broad money (% of GDP)',
'Commercial bank branches (per 100,000 adults)',
'Debt service on external debt, total (TDS, current US$)',
'Commercial banks and other lending (PPG + PNG) (NFL, current US$)',
'Portfolio investment, bonds (PPG + PNG) (NFL, current US$)',
'External debt stocks (% of GNI)',
'Secondary income receipts (BoP, current US$)',
'Foreign direct investment, net outflows (% of GDP)',
...
'Taxes less subsidies on products (constant LCU)',
'GNI per capita, Atlas method (current US$)',
'GDP per capita, PPP (constant 2017 international $)',
'Manufacturing, value added (constant 2015 US$)',
'External balance on goods and services (current LCU)',
'Changes in inventories (current US$)',
'Gross national expenditure (constant LCU)',
'Households and NPISHs Final consumption expenditure (constant LCU)',
'Military expenditure (current USD)',
'Fixed broadband subscriptions (per 100 people)'],
dtype='object', name='Indicator Name', length=494)
I will perform an initial analysis with the population data, looking at how the rural population is related to dependency.
df.iloc[:,df.columns.str.contains('Rural population')]
Indicator Name | Rural population growth (annual %) | Rural population (% of total population) | Rural population |
---|---|---|---|
0 | -0.497164 | 26.492 | 37896710.0 |
1 | -0.341091 | 26.447 | 37767668.0 |
2 | -0.214246 | 26.402 | 37686839.0 |
3 | -0.136685 | 26.358 | 37635362.0 |
4 | -0.125974 | 26.313 | 37587981.0 |
5 | -0.093185 | 26.268 | 37552971.0 |
6 | -0.056555 | 26.209 | 37531739.0 |
7 | -0.062143 | 26.137 | 37508423.0 |
8 | -0.115774 | 26.050 | 37465023.0 |
9 | -0.192057 | 25.950 | 37393138.0 |
10 | -0.270029 | 25.836 | 37292302.0 |
11 | -0.389793 | 25.708 | 37147222.0 |
12 | -0.563045 | 25.567 | 36938654.0 |
13 | -0.653729 | 25.413 | 36697963.0 |
14 | -0.868789 | 25.246 | 36380516.0 |
df.iloc[:,df.columns.str.contains('dependency')]
Indicator Name | Age dependency ratio, young (% of working-age population) | Age dependency ratio (% of working-age population) | Age dependency ratio, old (% of working-age population) |
---|---|---|---|
0 | 20.942940 | 40.335644 | 19.392704 |
1 | 20.632962 | 39.705743 | 19.072782 |
2 | 20.490918 | 39.157910 | 18.666992 |
3 | 20.527639 | 38.884284 | 18.356645 |
4 | 20.742029 | 38.953829 | 18.211800 |
5 | 21.144384 | 39.410853 | 18.266469 |
6 | 21.793443 | 40.217864 | 18.424421 |
7 | 22.605389 | 41.290973 | 18.685584 |
8 | 23.454453 | 42.495255 | 19.040802 |
9 | 24.259000 | 43.741551 | 19.482552 |
10 | 25.122954 | 45.295788 | 20.172834 |
11 | 25.893200 | 46.837028 | 20.943828 |
12 | 26.574334 | 48.343277 | 21.768943 |
13 | 27.198002 | 49.810602 | 22.612600 |
14 | 27.767416 | 51.220020 | 23.452604 |
df = df.rename(columns={"Rural population (% of total population)":"Rural population percent of total", "Age dependency ratio (% of working-age population)":"Age dependency ratio"})
According to the World Bank’s Glossary, the age dependency ratio is the ratio of those younger than 15 or older than 64 to the working-age population, and the rural population as a percentage of total population is found according to the data from the United Nations Population Division.
I will use the rural population as a percentage of total population as input, and the age dependency ratio as a percent of working-age population as output. Since these are both percentages, rescaling is likely unnecessary.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(df[['Rural population percent of total']], df['Age dependency ratio'])
LinearRegression()
df['Age dependency ratio pred'] = reg.predict(df[['Rural population percent of total']])
Now we will plot the results in Altair. The method for creating repeat layered charts is taken from the Altair documentation and requires Altair version 4.2.0 to run. If the charts display an error, uncomment the !pip install altair==4.2.0 line, run the cell, comment the line out again, and run the cell once more.
import altair as alt
# !pip install altair==4.2.0 # (run this if graphs are not showing, then run graphs again)
c1 = alt.Chart(df).mark_line().encode(
x="Rural population percent of total:Q",
y=alt.Y('Age dependency ratio pred:Q',
scale=alt.Scale(zero=False)
)
).properties(
title="Predicted and Actual Age Dependency Ratio"
)
c2 = alt.Chart(df).mark_point().encode(
x="Rural population percent of total",
y=alt.Y('Age dependency ratio:Q',
scale=alt.Scale(zero=False)
)
)
c3 = alt.Chart(df).mark_line().encode(
x = 'Year:O',
y=alt.Y(alt.repeat('layer'),
type='quantitative',
title='Rural Population and Age Dependency Ratio',
scale=alt.Scale(zero=False)
),
color=alt.ColorDatum(alt.repeat('layer'))
).properties(
title="Actual Age Dependency vs. Rural Population"
).repeat(layer=["Age dependency ratio", "Rural population percent of total"])
alt.layer(c1, c2)|c3
print(f"Age dependency ratio appears negatively correlated with rural population percentage, by a factor of {round(reg.coef_[0])}.")
Age dependency ratio appears negatively correlated with rural population percentage, by a factor of -11.
from sklearn.metrics import mean_absolute_error
mean_absolute_error(df['Age dependency ratio'],df['Age dependency ratio pred'])
0.637496926380498
It appears from this preliminary work that the rural population share is a reasonable predictor of the age dependency ratio over this period: the mean absolute error is about 0.64 percentage points, and as the rural population as a percent of total population increases, the age dependency ratio as a percent of working-age population decreases.
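The value reported by the regression is a slope, which is not the same thing as a correlation coefficient. As a rough check, the sketch below recomputes both from the two columns displayed earlier (values copied from the tables above). The use of `np.corrcoef` and `np.polyfit` is my own choice for illustration and not part of the original analysis.

```python
import numpy as np

# Rural population (% of total population), 2006-2020, copied from the table above.
rural_pct = np.array([26.492, 26.447, 26.402, 26.358, 26.313, 26.268, 26.209,
                      26.137, 26.050, 25.950, 25.836, 25.708, 25.567, 25.413, 25.246])

# Age dependency ratio (% of working-age population), 2006-2020, copied from the table above.
age_dep = np.array([40.335644, 39.705743, 39.157910, 38.884284, 38.953829,
                    39.410853, 40.217864, 41.290973, 42.495255, 43.741551,
                    45.295788, 46.837028, 48.343277, 49.810602, 51.220020])

# Pearson correlation measures the strength of the linear relationship (between -1 and 1).
r = np.corrcoef(rural_pct, age_dep)[0, 1]

# The least-squares slope measures how much the output changes per unit of input.
slope, intercept = np.polyfit(rural_pct, age_dep, 1)

print(f"Pearson r = {r:.3f}, slope = {slope:.2f}")
```

The slope depends on the units of both variables, while the correlation coefficient does not, so the two numbers answer different questions.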
The next step will be to consider some other non-population-related indicators and compare with the Russian Ruble/US Dollar exchange rate. The Ruble to US Dollar historical spot rates were obtained from the Bank of England’s Statistical Interactive Database. We will reverse the order of the rows to sort by ascending years, and invert the average annual USD/RUB rates given to obtain the RUB/USD rates.
rubusd = pd.read_csv("rubusd.csv")
rubusd = rubusd.reindex(index=rubusd.index[::-1])
rubusd.reset_index(inplace=True)
df["RATE"] = rubusd.iloc[:,2]
df["RATE"] = 1/df["RATE"]
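As a minimal sketch of the two steps just described (reordering to ascending years, then taking the reciprocal), here is the same idea on a small made-up table. The column names `Year`, `Quoted rate`, and `Inverse rate` and all values are hypothetical and do not reflect the layout of the Bank of England file.

```python
import pandas as pd

# Hypothetical layout: newest year first, one quoted rate per year (made-up values).
rates = pd.DataFrame({"Year": [2008, 2007, 2006], "Quoted rate": [25.0, 25.6, 27.2]})

# Sort ascending by year and rebuild a clean 0..n-1 index,
# so the rows line up by position with another table sorted the same way.
rates = rates.sort_values("Year").reset_index(drop=True)

# The reciprocal expresses the same rate in the opposite direction.
rates["Inverse rate"] = 1 / rates["Quoted rate"]
print(rates)
```

Sorting on an explicit year column, rather than reversing the rows, avoids relying on the file already being in strictly descending order.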
Let’s examine how the indicators we looked at earlier compare to the RUB/USD rate from 2006 to 2020 using a chart, rescaling to more clearly compare the values.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = df.rename(columns={"Rural population percent of total":"Rural population percent", "Age dependency ratio":"Age dependency", "RATE":"Ruble to Dollar"})
dfscaled = df.copy()
dfscaled.iloc[:,1:] = scaler.fit_transform(df.iloc[:,1:])
c4 = alt.Chart(dfscaled).mark_line().encode(
x = 'Year:O',
y=alt.Y(alt.repeat('layer'), type='quantitative', title='Scaled Indicators and Exchange Rate'),
color=alt.ColorDatum(alt.repeat('layer'))
).properties(
title="Scaled Rural Population Percentage and Age Dependency Ratio vs. Ruble to Dollar Exchange Rates",
width=700
).repeat(
layer=["Rural population percent","Age dependency", "Ruble to Dollar"]
)
c4
It is interesting that the rural population as a percent of total population in Russia (blue) seems to decline together with the RUB/USD rate (red), while the age dependency ratio as a percent of working population in Russia (orange) has been increasing since 2009.
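For reference, the rescaling used for this chart subtracts each column's mean and divides by its standard deviation. The sketch below checks this on a tiny made-up array (the array `x` is not from the project data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A single made-up column with four values.
x = np.array([[1.0], [2.0], [3.0], [4.0]])

# What the scaler produces.
scaled = StandardScaler().fit_transform(x)

# The same computation by hand; np.std uses the population standard deviation (ddof=0) by default.
manual = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(scaled, manual))
```

After this transformation every column has mean 0 and standard deviation 1, which is why indicators measured in very different units can share one vertical axis.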
What do the remaining indicators look like, in comparison to the Ruble / Dollar exchange rates?
Let us see if we can use some of these indicators to predict the Ruble to Dollar exchange rates. After looking into the definitions of leading vs. lagging indicators and the top economic indicators for the U.S. economy (acknowledging that these may not be optimal for an analysis of the Russian economy, but useful nonetheless for our purposes), I will try to predict the exchange rates using the following economic indicators, which include the rural population and age dependency metrics we looked at earlier, as well as others that are relevant to a nation’s economy:
Rural population percent
Age dependency
GDP (current US$)
Population growth (annual %)
Real interest rate (%)
Inflation, consumer prices (annual %)
Unemployment, total (% of total labor force) (national estimate)
Stocks traded, total value (current US$)
Merchandise trade (% of GDP)
Air transport, passengers carried
International tourism, number of arrivals
Net primary income (Net income from abroad) (current US$)
Refugee population by country or territory of origin
Foreign direct investment, net inflows (% of GDP)
Here are the indicators plotted together.
indicators = ['Rural population percent', 'Age dependency', 'GDP (current US$)', 'Population growth (annual %)', 'Real interest rate (%)', 'Inflation, consumer prices (annual %)', 'Unemployment, total (% of total labor force) (national estimate)', 'Stocks traded, total value (current US$)', 'Merchandise trade (% of GDP)', 'Air transport, passengers carried', 'International tourism, number of arrivals', 'Net primary income (Net income from abroad) (current US$)', 'Refugee population by country or territory of origin', 'Foreign direct investment, net inflows (% of GDP)']
c5 = alt.Chart(dfscaled).mark_line().encode(
x = 'Year:O',
y = alt.Y(alt.repeat('layer'), type='quantitative', title='Scaled Indicators'),
color = alt.ColorDatum(alt.repeat('layer')), #.Color(scale=alt.Scale(scheme = 'category20c'))
strokeDash = alt.StrokeDashDatum(alt.repeat('layer')),
).properties(
title="Scaled Economic Indicators in Russia from 2006 to 2020",
width=600
).repeat(
layer=indicators
)
c5
We can see that there is a significant amount of variation in these indicators’ values over the 15 years. Also, notice that international tourism decreases sharply in 2020 (as expected).
We will use the years 2006-2016 as our training set and the years 2017-2020 as our test set, so that we can test the accuracy of our model on the data and be warned of possible overfitting. Since I want to use these specific rows for my training and test sets, I will use iloc to define them rather than train_test_split.
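As a small illustration of this kind of chronological split, the sketch below builds a made-up 15-row table and splits it by position. It also shows that, as far as I know, `train_test_split` with `shuffle=False` gives the same rows. The DataFrame `toy` and its columns are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with one row per year from 2006 to 2020.
toy = pd.DataFrame({
    "Year": range(2006, 2021),
    "x": np.arange(15.0),
    "y": np.arange(15.0) * 2,
})

# Positional split: first 11 rows (2006-2016) for training, last 4 (2017-2020) for testing.
train_pos, test_pos = toy.iloc[:11], toy.iloc[11:]

# The same split without shuffling, so the test set is still the final four years.
train_alt, test_alt = train_test_split(toy, test_size=4, shuffle=False)

print(list(test_pos["Year"]))
```

Keeping the test years at the end means the model is always evaluated on years that come after the ones it was trained on.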
First we will attempt to use Linear Regression, in hopes that the coefficients can give us some idea of how these indicators might be related to the exchange rate.
X_train = df[indicators].iloc[:11]
X_test = df[indicators].iloc[11:]
y_train = df['Ruble to Dollar'].iloc[:11]
y_test = df['Ruble to Dollar'].iloc[11:]
df_train = df.iloc[:11].copy()
df_test = df.iloc[11:].copy()
lrg = LinearRegression()
lrg.fit(X_train, y_train)
LinearRegression()
To visually evaluate the results of Linear Regression, I will use methods adapted from the Pandas user guide.
coefficients = pd.DataFrame(indicators, columns=["Indicators"])
coefficients["Coefficients"] = lrg.coef_
coefficients.set_index("Indicators", inplace=True)
def style_positive(v, props=''):
    return props if v >= 0 else None
def style_negative(v, props=''):
    return props if v < 0 else None
def highlight_max(s, props=''):
    return np.where(s == np.max(s.values), props, '')
def highlight_min(s, props=''):
    return np.where(s == np.min(s.values), props, '')
coefficients[["Coefficients"]].style.format({"Coefficients": '{:.2E}'})\
.applymap(style_positive, props='color:green;')\
.applymap(style_negative, props='color:red;')\
.applymap(lambda v: 'opacity: 20%;' if (v < 0.000001) and (v > -0.000001) else None)\
.apply(highlight_max, props='color:white;background-color:green', axis=0)\
.apply(highlight_min, props='color:white;background-color:red', axis=0)
Indicators | Coefficients |
---|---|
Rural population percent | 3.48E-06 |
Age dependency | -2.08E-04 |
GDP (current US$) | -1.06E-14 |
Population growth (annual %) | 9.57E-06 |
Real interest rate (%) | -6.69E-05 |
Inflation, consumer prices (annual %) | -2.76E-04 |
Unemployment, total (% of total labor force) (national estimate) | 1.73E-05 |
Stocks traded, total value (current US$) | 1.58E-14 |
Merchandise trade (% of GDP) | -6.10E-05 |
Air transport, passengers carried | -1.40E-10 |
International tourism, number of arrivals | 3.94E-10 |
Net primary income (Net income from abroad) (current US$) | -4.33E-13 |
Refugee population by country or territory of origin | 8.23E-08 |
Foreign direct investment, net inflows (% of GDP) | 1.53E-04 |
Above, we see indicators with relatively small coefficients in low opacity, negative coefficients (i.e. a negative relationship according to the model) in red, and positive coefficients in green. We also see the highlighted maximum and minimum coefficients, representing the largest positive and the most negative coefficient, respectively. Note that the model was fit on unscaled data, so the size of each coefficient depends on its indicator's units (for example, GDP in current US$ versus a percentage), and the coefficients are not directly comparable across indicators. With that caveat, it appears that net inflows of foreign direct investment as a percent of GDP may be an indicator of growth in the RUB/USD exchange rate, while inflation of consumer prices may be an indicator of decline in the RUB/USD exchange rate. This seems reasonable from an economic standpoint.
Now we will attempt to gauge the accuracy of this model.
from sklearn.metrics import mean_absolute_error
train_error_lrg = mean_absolute_error(lrg.predict(X_train), y_train)
test_error_lrg = mean_absolute_error(lrg.predict(X_test), y_test)
train_error_lrg
7.273399827656654e-09
test_error_lrg
0.003865468556259171
test_error_lrg/train_error_lrg
531452.7796974622
test_error_lrg/(y_test.mean())
0.24804983762756344
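Notice that the test error is larger than the training error by more than five orders of magnitude (a ratio of about 5.3 × 10^5), which suggests overfitting. This is not surprising: the model uses 14 indicators but only 11 training rows, so it can fit the training data almost exactly.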
The test error from linear regression is about 25% of the average RUB/USD value in the test set.
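To illustrate why a near-zero training error is expected when there are more features than training rows, here is a minimal sketch on pure random noise with the same shapes as our data (11 training rows, 4 test rows, 14 features). The data is synthetic and has no relationship to the Russian indicators.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic noise: there is nothing real to learn here.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(11, 14)), rng.normal(size=11)
X_test, y_test = rng.normal(size=(4, 14)), rng.normal(size=4)

model = LinearRegression().fit(X_train, y_train)

# With 14 features and only 11 rows, the fit can pass through every training point.
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print(f"train MAE = {train_mae:.2e}, test MAE = {test_mae:.2e}")
```

Even on noise the training error is essentially zero, so a tiny training error by itself says little about how well the model will predict new years.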
Now we will try a different approach: K-Nearest Neighbors Regression.
from sklearn.neighbors import KNeighborsRegressor
kng3 = KNeighborsRegressor(n_neighbors=3)
kng3.fit(X_train,y_train)
KNeighborsRegressor(n_neighbors=3)
train_error_kng3 = mean_absolute_error(kng3.predict(X_train), y_train)
test_error_kng3 = mean_absolute_error(kng3.predict(X_test), y_test)
train_error_kng3
0.0038378584123845735
test_error_kng3
0.006711039436260572
test_error_kng3/train_error_kng3
1.7486417462938157
test_error_kng3/y_test.mean()
0.4306521236037697
Notice that this time our training and test errors are closer in value, which suggests that we do not have a problem with overfitting. However, the test error as a percentage of the mean of the RUB/USD values is larger, about 43%.
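For intuition about what the K-Nearest Neighbors regressor is doing, the sketch below checks on synthetic data that (with the default uniform weights) its prediction is the average of the target values of the K closest training points. The arrays `X`, `y`, and `query` are random and purely illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data: 11 rows, 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(11, 3))
y = rng.normal(size=11)

knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)

# One new point to predict.
query = rng.normal(size=(1, 3))

# kneighbors returns the distances to, and the row indices of, the 3 closest training points.
_, idx = knn.kneighbors(query)
manual = y[idx[0]].mean()

pred = knn.predict(query)[0]
print(np.isclose(pred, manual))
```

Because the neighbors are chosen by Euclidean distance by default, features with very large raw values (such as dollar amounts) can dominate the distance unless the inputs are rescaled first.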
Let us visually compare our results from both methods.
df_test["Linear regression prediction"] = lrg.predict(X_test)
df_test["3-Neighbors regression prediction"] = kng3.predict(X_test)
df_test["Actual Ruble to Dollar"] = y_test
c6 = alt.Chart(df_test).mark_line().encode(
x = 'Year:O',
y=alt.Y(alt.repeat('layer'),
type='quantitative',
title='Predicted and Actual RUB/USD'
#scale=alt.Scale(zero=False)
),
color=alt.ColorDatum(alt.repeat('layer'))
).properties(
title="Predicted vs. Actual RUB/USD Using Linear Regression, K-Neighbors Regression",
width=500,
height=300
).repeat(layer=["Actual Ruble to Dollar", "Linear regression prediction", "3-Neighbors regression prediction"])
c6
It is clear that the linear regression method underestimates the RUB/USD exchange rates, while the K-Neighbors prediction is an overestimate.
Would a different value of K give a better model? We investigate using the method for plotting the train and test curves from the course notes.
def get_scores(k):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    train_error = mean_absolute_error(reg.predict(X_train), y_train)
    test_error = mean_absolute_error(reg.predict(X_test), y_test)
    return (train_error, test_error)
df_scores = pd.DataFrame({"k":range(1,12),"train_error":np.nan,"test_error":np.nan})
for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])
df_scores["kinv"] = 1/df_scores.k
c7 = alt.Chart(df_scores).mark_line().encode(
x = "kinv",
y=alt.Y(alt.repeat('layer'),
type='quantitative',
title='Mean Absolute Error'
#scale=alt.Scale(zero=False)
),
color=alt.ColorDatum(alt.repeat('layer'))
).properties(
title="Train Error and Test Error for Different Values of K",
width=600
).repeat(layer=["train_error", "test_error"])
c7
It appears from the above graph that 1/K = 0.25 (so K = 4) may give a slightly better prediction than the original K = 3. We will avoid K = 1, despite the seemingly accurate results above, as this will result in overfitting.
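The reason K = 1 looks so accurate on the training set is that each training point is its own nearest neighbor. A minimal sketch on synthetic data (random arrays, not the project data) shows the two extremes: K = 1 reproduces the training targets exactly, while K equal to the number of training rows predicts the overall mean for every point.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.neighbors import KNeighborsRegressor

# Synthetic training data: 11 distinct rows, 3 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(11, 3))
y = rng.normal(size=11)

# Training error for every K from 1 to 11.
train_errors = {}
for k in range(1, 12):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    train_errors[k] = mean_absolute_error(y, knn.predict(X))

# With K = 11 every prediction averages all 11 targets.
all_neighbors_pred = KNeighborsRegressor(n_neighbors=11).fit(X, y).predict(X)

print(train_errors[1], np.allclose(all_neighbors_pred, y.mean()))
```

A training error of zero at K = 1 is therefore guaranteed whenever the training rows are distinct, and it is the test curve that should guide the choice of K.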
As a final step, we will observe all of these predictions together along with the RUB/USD actual rates, from 2017 to 2020:
kng4 = KNeighborsRegressor(n_neighbors=4)
kng4.fit(X_train,y_train)
df_test["4-Neighbors regression prediction"] = kng4.predict(X_test)
c8 = alt.Chart(df_test).mark_line().encode(
x = 'Year:O',
y=alt.Y(alt.repeat('layer'),
type='quantitative',
title='Predicted and Actual RUB/USD'
#scale=alt.Scale(zero=False)
),
color=alt.ColorDatum(alt.repeat('layer'))
).properties(
title="Predicted vs. Actual RUB/USD Using Linear Regression, K-Neighbors Regression",
width=500,
height=300
).repeat(layer=["Actual Ruble to Dollar", "Linear regression prediction", "3-Neighbors regression prediction", "4-Neighbors regression prediction"])
c8
Summary
Although we had only a few years of data to work with, it was possible to gain some insights into the relationship between key economic indicators in Russia and the Ruble to U.S. Dollar exchange rates over the past 15 years. The linear regression model pointed to inflation and foreign investment as potentially significant economic indicators relating to the value of the Ruble. Additionally, predictions using linear regression and K-nearest neighbors regression appear to have some degree of usefulness, although this is limited.
There are a few potential areas where this analysis could be improved through further work. First, if more frequent data were available (quarterly, monthly, weekly, etc.), this would likely improve the accuracy of our models. Unfortunately such data did not appear to be readily available.
Additionally, the year 2020 presents a possible outlier in our data, considering how international tourism and other economic indicators were at uncharacteristic levels during this year. That being said, the year 2020 remains in this project for two reasons: the first being that removing it would constitute a 7% reduction in the number of data points, and the second being that a good model of RUB/USD exchange rates with regard to economic factors would ideally be able to accurately model the Ruble’s value even in times of crisis and uncertainty.
References
Bank of England (2022) - “Interest & exchange rates data.” Published online at BankOfEngland.co.uk. Retrieved from: https://www.bankofengland.co.uk/boeapps/database/index.asp?first=yes&SectionRequired=I&HideNums=-1&ExtraInfo=true
Bruce C. Dieffenbach (2014) - “Leading Economic Indicators.” Published online at Albany.edu/~bd445/. Retrieved from: https://www.albany.edu/~bd445/Economics_301_Intermediate_Macroeconomics_Slides_Spring_2014/Leading_Economic_Indicators_(Print).pdf
The Conference Board (2012) - “Description of Components.” Published online at Conference-Board.org. Retrieved from: https://www.conference-board.org/data/bci/index.cfm?id=2160
Hannah Ritchie, Edouard Mathieu, Max Roser, Bastian Herre, Joe Hasell, Esteban Ortiz-Ospina, Bobbie Macdonald, Fiona Spooner and Pablo Rosado (2022) - “War in Ukraine.” Published online at OurWorldInData.org. Retrieved from: https://ourworldindata.org/ukraine-war
Max Roser, Bastian Herre and Joe Hasell (2013) - “Nuclear Weapons.” Published online at OurWorldInData.org. Retrieved from: https://ourworldindata.org/nuclear-weapons
World Bank (2020) - “Russian Federation Data.” Published online at Data.WorldBank.org. Retrieved from: https://data.worldbank.org/country/russian-federation