Predicting COVID-19 Trends
Author: Huy Tran
Course Project, UC Irvine, Math 10, S22
Introduction
This project uses a country-level COVID-19 dataset, labeled by WHO region, to examine each region and determine which one is combating COVID-19 most effectively. We also try to cluster and classify the datapoints to see whether a country's statistics identify its region. By the end of the project, after examining the different data columns, we identify the columns best suited to determining which region has navigated COVID-19 best so far.
Main portion of the project
import pandas as pd
import altair as alt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df=pd.read_csv("country_wise_latest.csv")
Rename the “Confirmed” column to “Confirmed cases” for clarity.
df=df.rename(columns={'Confirmed':'Confirmed cases'})
df.head()
| | Country/Region | Confirmed cases | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.50 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean |
1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.00 | Europe |
2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa |
3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.60 | Europe |
4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa |
The output below shows the number of countries in each region, taken from the “WHO Region” column.
df["WHO Region"].value_counts()
Europe 56
Africa 48
Americas 35
Eastern Mediterranean 22
Western Pacific 16
South-East Asia 10
Name: WHO Region, dtype: int64
The bar chart below shows the number of countries per region.
Region_Chart = alt.Chart(df).mark_bar(size=15).encode(
    x=alt.X('WHO Region', sort='y'),
    y='count()',
    color=alt.Color('WHO Region', legend=None)
).properties(
    title='Number of countries per region'
)
Region_Chart
The chart below plots total confirmed cases against total recovered cases for each country. The color of each point indicates its region, and the tooltip shows the country it represents. The chart is hard to read, however, because most of the datapoints are crowded into the bottom-left corner.
Confirmed_Country = alt.Chart(df).mark_point(size=100).encode(
    x='Confirmed cases',
    y='Recovered',
    color=alt.Color('WHO Region', legend=None),
    tooltip=['WHO Region', 'Country/Region', 'Confirmed cases', 'Recovered']
).properties(
    width=800, height=300,
    title='Cases per country'
)
Confirmed_Country
Calculate the average 1 week percent increase for each region.
Regions = df['WHO Region'].unique()
RegionMeans = []
for y in Regions:
    # Rows belonging to the current region
    Regiondf = df[df['WHO Region'] == y]
    newvar = np.mean([x for x in Regiondf['1 week % increase']])
    RegionMeans.append(newvar)
    print("The average for the", y, "region is", newvar)
The average for the Eastern Mediterranean region is 10.48227272727273
The average for the Europe region is 7.769642857142857
The average for the Africa region is 18.086458333333333
The average for the Americas region is 16.331142857142858
The average for the Western Pacific region is 22.11125
The average for the South-East Asia region is 8.513
RegionMeans
[10.48227272727273,
7.769642857142857,
18.086458333333333,
16.331142857142858,
22.11125,
8.513]
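The loop above can also be written as a single pandas `groupby` call. The sketch below is only illustrative: it uses the five rows shown in `df.head()` rather than the full dataset, and the names `sample` and `region_means` are assumptions introduced here, not part of the project.

```python
import pandas as pd

# Illustrative subset: the five rows displayed by df.head() above.
sample = pd.DataFrame({
    "WHO Region": ["Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"],
    "1 week % increase": [2.07, 17.00, 18.07, 2.60, 26.84],
})

# One mean per region; sort=False keeps regions in order of first appearance,
# matching the order produced by df['WHO Region'].unique().
region_means = sample.groupby("WHO Region", sort=False)["1 week % increase"].mean()
print(region_means)
```

The result is a Series indexed by region, so it can be turned into a chart-ready DataFrame with `region_means.reset_index()`.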
The bar chart below shows the average 1 week % increase for each region.
# New DataFrame containing a list of regions and their respective means.
dff=pd.DataFrame()
dff['Regions']=Regions
dff['RegionMeans']=RegionMeans
WeekMeans = alt.Chart(dff).mark_bar(size=15).encode(
    x=alt.X('Regions', sort='y'),
    y='RegionMeans',
    color=alt.Color('Regions', legend=None)
)
WeekMeans
# New DataFrame holding the two columns used for k-means clustering
dfff = df[['New cases', 'New deaths']].copy()
kmeans=KMeans(6)
kmeans.fit(dfff)
KMeans(n_clusters=6)
dfff["cluster"] = kmeans.predict(dfff)
We take the base-10 logarithm of “New cases” (adding 1 first, so that a value of 0 stays 0) because the column contains some extremely large values.
dfff['New cases']=np.log10(dfff['New cases']+1)
dfff
| | New cases | New deaths | cluster |
---|---|---|---|
0 | 2.029384 | 10 | 0 |
1 | 2.071882 | 6 | 0 |
2 | 2.790285 | 8 | 0 |
3 | 1.041393 | 0 | 0 |
4 | 1.278754 | 1 | 0 |
... | ... | ... | ... |
182 | 2.184691 | 2 | 0 |
183 | 0.000000 | 0 | 0 |
184 | 1.041393 | 4 | 0 |
185 | 1.857332 | 1 | 0 |
186 | 2.285557 | 2 | 0 |
187 rows × 3 columns
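As a small worked check of the `log10(x + 1)` transform, the sketch below applies it to three raw “New cases” values that appear in this notebook (106 and 10 from `df.head()`, plus 0). The helper name `log_cases` is an assumption made for this example.

```python
import math

def log_cases(new_cases):
    """Base-10 log of (count + 1); the +1 keeps a count of 0 defined."""
    return math.log10(new_cases + 1)

# Raw "New cases" values: Afghanistan (106), Andorra (10), and a zero count.
for raw in [106, 10, 0]:
    print(raw, round(log_cases(raw), 6))
```

The rounded results agree with rows 0 and 3 of the table above and with the 0.000000 entry, which shows how the transform compresses large counts while leaving zero unchanged.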
The chart below shows that the variables we picked did not yield a useful clustering of the dataset. Possible reasons are that the datapoints are tightly packed and that there is no clear separation between the clusters.
clusters = alt.Chart(dfff).mark_point(size=10).encode(
    x='New cases',
    y='New deaths',
    color=alt.Color('cluster', legend=None),
).interactive().properties(
    width=500, height=500,
    title='New deaths and cases'
)
clusters
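One detail of the cells above may contribute to this result: `kmeans.fit` ran on the raw columns, and the logarithm was applied to “New cases” only afterwards, so the cluster labels reflect the raw scale, where a few very large values dominate the distances. A minimal sketch of the alternative order (transform first, then cluster) follows. The data is synthetic, and the names `toy`, `features`, and `km` are assumptions for illustration; whether this gives more interpretable clusters on the real dataset is untested here.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic, heavy-tailed stand-in for the real columns (not project data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "New cases": rng.lognormal(mean=5, sigma=2, size=200).round(),
    "New deaths": rng.lognormal(mean=2, sigma=1.5, size=200).round(),
})

# Transform BEFORE clustering: log10(x + 1) compresses the large values,
# then standardizing gives both columns comparable weight in the distances.
features = StandardScaler().fit_transform(np.log10(toy + 1))

km = KMeans(n_clusters=6, n_init=10, random_state=0)
toy["cluster"] = km.fit_predict(features)

# Cluster sizes: a quick check that points are not all assigned to one cluster.
print(toy["cluster"].value_counts().sort_index())
```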
We now turn to K-Nearest Neighbors regression.
DropCountry = df.drop(columns='Country/Region')
We hypothesize that, knowing “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase”, we can predict the number of new cases.
X_train, X_test, y_train, y_test = train_test_split(
df[["Deaths / 100 Cases","Recovered / 100 Cases","1 week % increase"]], df["New cases"], test_size = 0.3)
reg = KNeighborsRegressor(n_neighbors=6)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=6)
The score below shows that the columns we are using, “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase”, are not good indicators of “New cases”: a negative score means the model does worse on the test set than always predicting the mean.
reg.score(X_test,y_test)
-0.09183683492283201
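The `score` method of a scikit-learn regressor returns the coefficient of determination, R² = 1 - SS_res / SS_tot, which drops below zero when the predictions are worse than simply predicting the mean of the test targets. The toy numbers below are made up to illustrate this; `r_squared` is a helper written for this sketch, not a library function.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only.
y_true = [10, 20, 30, 40]
baseline = [25, 25, 25, 25]   # always predict the mean of y_true
poor = [40, 10, 45, 5]        # predictions unrelated to the targets

print(r_squared(y_true, baseline))  # 0.0: no better than the mean
print(r_squared(y_true, poor))      # negative: worse than the mean
```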
clf = LogisticRegression()
cols=["Deaths / 100 Cases","Recovered / 100 Cases"]
clf.fit(df[cols],df['WHO Region'])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()
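The warning above names two remedies: scaling the data or raising `max_iter`. A minimal sketch of both, assuming a scikit-learn pipeline, is shown below. The data is synthetic, and `X`, `y`, and `model` are names introduced for illustration; the value `max_iter=1000` is an arbitrary choice, not a tuned setting.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature, three-class data on different scales (not project data).
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(0, 30, 150), rng.uniform(0, 100, 150)])
y = np.repeat(["A", "B", "C"], 50)

# Standardize the features, then fit with a larger iteration budget.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

with warnings.catch_warnings():
    # Turn a convergence warning into an error so a failure cannot go unnoticed.
    warnings.simplefilter("error", ConvergenceWarning)
    model.fit(X, y)

print(model.predict(X[:5]))
```

Putting the scaler inside the pipeline means the same scaling is applied automatically whenever `model.predict` is called.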
df["pred"] = clf.predict(df[cols])
The chart below shows the result of logistic regression on the two columns “Deaths / 100 Cases” and “Recovered / 100 Cases”. Although there are 6 regions, the classifier predicts only 3 of them. This may be because the datapoints are dense and the regions overlap heavily in these two columns; the convergence warning above may also play a role.
alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred",
    tooltip=['WHO Region', 'Country/Region']
).interactive().properties(
    title="Recovered and deaths"
)
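A direct way to confirm how many regions the classifier predicts, and how often it is right, is to tabulate the predicted labels against the true ones. The sketch below uses a handful of made-up labels; with the project's data, `actual` and `pred` would presumably be `df['WHO Region']` and `df['pred']`.

```python
import pandas as pd

# Made-up labels for illustration; not taken from the dataset.
actual = pd.Series(
    ["Europe", "Africa", "Americas", "Europe", "Africa", "Western Pacific"],
    name="actual",
)
pred = pd.Series(
    ["Europe", "Africa", "Africa", "Europe", "Africa", "Europe"],
    name="pred",
)

# Rows: true region, columns: predicted region.
table = pd.crosstab(actual, pred)
print(table)

# Number of distinct regions the classifier ever predicts.
print(pred.nunique())

# Fraction of rows where the prediction matches the true region.
print((actual == pred).mean())
```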
Below is an interactive chart of deaths and recoveries per 100 cases for each country. Dragging a brush over the scatter plot updates the bar chart, so we can see how many countries in each region fall above or below a chosen threshold of “Deaths / 100 Cases” or “Recovered / 100 Cases”.
brush = alt.selection(type='interval')
points = alt.Chart(df).mark_point().encode(
    x='Recovered / 100 Cases:Q',
    y='Deaths / 100 Cases:Q',
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray'))
).add_selection(
    brush
)
bars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)
points & bars
Now we examine the two columns that can best help us identify the region that is navigating COVID-19 most successfully: “Recovered / 100 Cases” and “1 week % increase”. These columns are useful for this purpose because they are relative measures (percentages), so the total population of each country has less influence on them.
df['Recovered / 100 Cases'].corr(df['1 week % increase'])
-0.3942542429619809
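`Series.corr` computes the Pearson correlation coefficient by default. As a worked illustration of what that number measures, the sketch below computes it by hand for the five rows shown in `df.head()`; the helper `pearson` is written for this example. Five rows are far too few to reproduce the value above, so only the method, not the result, carries over.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The five rows displayed by df.head(): Recovered / 100 Cases, 1 week % increase.
recovered = [69.49, 56.25, 67.34, 88.53, 25.47]
increase = [2.07, 17.00, 18.07, 2.60, 26.84]

r = pearson(recovered, increase)
print(round(r, 3))
```

A negative value means that countries with a higher recovery rate tend to have a lower weekly increase, which is the direction of the relationship reported for the full dataset.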
RecoveredIncreasePoints = alt.Chart(df).mark_point().encode(
    x=alt.X('Recovered / 100 Cases'),
    y=alt.Y('1 week % increase'),
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray')),
    tooltip=['1 week % increase']
).add_selection(
    brush
)
RecoveredIncreaseBars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)
RecoveredIncreasePoints & RecoveredIncreaseBars
In this chart, the bottom-right corner (high recovery rate, low weekly increase) contains the countries that are performing best against COVID-19. Brushing that corner shows a large number of countries from the Europe region, which indicates a low percent increase per week and a high recovery rate per 100 cases.
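The brush selection can also be expressed as an explicit filter, which makes the chosen thresholds reproducible. The sketch below is illustrative only: it uses the five rows from `df.head()`, and the cutoffs (at least 60 recovered per 100 cases, at most a 5% weekly increase) are arbitrary assumptions rather than values used in the project.

```python
import pandas as pd

# Illustrative subset: the five rows displayed by df.head() above.
sample = pd.DataFrame({
    "Country/Region": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "Recovered / 100 Cases": [69.49, 56.25, 67.34, 88.53, 25.47],
    "1 week % increase": [2.07, 17.00, 18.07, 2.60, 26.84],
    "WHO Region": ["Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa"],
})

# Hypothetical thresholds standing in for the brushed bottom-right corner.
min_recovered = 60
max_increase = 5

selected = sample[
    (sample["Recovered / 100 Cases"] >= min_recovered)
    & (sample["1 week % increase"] <= max_increase)
]

# Countries per region inside the selection, like the linked bar chart.
print(selected["WHO Region"].value_counts())
```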
Summary
In this project, we looked for relationships between data columns; most were not helpful because the columns are raw counts that depend on population. Several methods of analysis had trouble separating the datapoints by region, which led us to use population-independent columns instead. The brush feature of the selection histogram let us pick custom thresholds for the data we wished to examine; in other words, we could count the countries within each region that fall in a given range of “Deaths / 100 Cases” and “Recovered / 100 Cases”.
References
The source of this dataset: https://www.kaggle.com/datasets/imdevskp/corona-virus-report
Portions of the code were adapted from the following sources:
KNN: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
Selection histogram: https://altair-viz.github.io/gallery/selection_histogram.html