Predicting COVID-19 Trends

Author: Huy Tran

Course Project, UC Irvine, Math 10, S22

Introduction

This project uses the dataset from WHO to examine each WHO region and determine which region is combating COVID-19 most effectively. We will also try to assign datapoints to the region most likely to have produced them. By the end of the project, after examining the different data columns, we will have identified the columns that are best suited to determining which region has navigated COVID-19 best so far.

Main portion of the project

import pandas as pd
import altair as alt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df=pd.read_csv("country_wise_latest.csv")

Clarify/Rename columns.

df=df.rename(columns={'Confirmed':'Confirmed cases'})

df.head()

	Country/Region	Confirmed cases	Deaths	Recovered	Active	New cases	New deaths	New recovered	Deaths / 100 Cases	Recovered / 100 Cases	Deaths / 100 Recovered	Confirmed last week	1 week change	1 week % increase	WHO Region
0	Afghanistan	36263	1269	25198	9796	106	10	18	3.50	69.49	5.04	35526	737	2.07	Eastern Mediterranean
1	Albania	4880	144	2745	1991	117	6	63	2.95	56.25	5.25	4171	709	17.00	Europe
2	Algeria	27973	1163	18837	7973	616	8	749	4.16	67.34	6.17	23691	4282	18.07	Africa
3	Andorra	907	52	803	52	10	0	0	5.73	88.53	6.48	884	23	2.60	Europe
4	Angola	950	41	242	667	18	1	0	4.32	25.47	16.94	749	201	26.84	Africa

From the output below, we can see the number of countries in each region, taken from the “WHO Region” column.

df["WHO Region"].value_counts()

Europe                   56
Africa                   48
Americas                 35
Eastern Mediterranean    22
Western Pacific          16
South-East Asia          10
Name: WHO Region, dtype: int64

Here is a bar chart of the number of countries per region.

Region_Chart = alt.Chart(df).mark_bar(size=15).encode(
    x = alt.X('WHO Region', sort='y'),
    y = 'count()',
    color = alt.Color('WHO Region', legend=None)
).properties(
    title='Number of countries per region'
)
Region_Chart

The chart below plots the total number of confirmed cases against the number of recovered cases. The color of each point indicates its region, and the tooltip shows the country that the point represents. However, the chart is hard to read, since the majority of the datapoints are crowded into the bottom left corner.

Confirmed_Country = alt.Chart(df).mark_point(size=100).encode(
    x = 'Confirmed cases',
    y = 'Recovered',
    color = alt.Color('WHO Region', legend=None),
    tooltip=['WHO Region','Country/Region','Confirmed cases','Recovered']
).properties(
    width=800,height=300,
    title='Cases per country'
)
Confirmed_Country

Calculate the average 1 week percent increase for each region.

Regions=df['WHO Region'].unique()

RegionMeans=[]

for y in Regions:
    Regiondf=df[df['WHO Region']==y]
    newvar=np.mean([x for x in Regiondf['1 week % increase']])
    RegionMeans.append(newvar)
    print("The average for the",y,"region is", newvar)

The average for the Eastern Mediterranean region is 10.48227272727273
The average for the Europe region is 7.769642857142857
The average for the Africa region is 18.086458333333333
The average for the Americas region is 16.331142857142858
The average for the Western Pacific region is 22.11125
The average for the South-East Asia region is 8.513

RegionMeans

[10.48227272727273,
 7.769642857142857,
 18.086458333333333,
 16.331142857142858,
 22.11125,
 8.513]
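The loop above can also be written with pandas `groupby`. The sketch below is only an illustration: it uses a small made-up frame (the numbers are not taken from the dataset) with the same two column names, so that it runs without the CSV file.

```python
import pandas as pd

# Toy stand-in for df: illustrative values only, not real data.
toy = pd.DataFrame({
    "WHO Region": ["Europe", "Europe", "Africa", "Africa", "Americas"],
    "1 week % increase": [2.0, 4.0, 10.0, 20.0, 7.0],
})

# One mean per region, sorted from the slowest to the fastest weekly growth.
region_means = (
    toy.groupby("WHO Region")["1 week % increase"]
    .mean()
    .sort_values()
)
print(region_means)
```

Calling `region_means.reset_index()` would give a two-column DataFrame that could be charted in the same way as `dff`.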

Bar chart of the average 1 week % increase per region.

# New DataFrame containing a list of regions and their respective means.
dff=pd.DataFrame()
dff['Regions']=Regions
dff['RegionMeans']=RegionMeans

WeekMeans = alt.Chart(dff).mark_bar(size=15).encode(
    x = alt.X('Regions', sort='y'),
    y = 'RegionMeans',
    color = alt.Color('Regions', legend=None)
)
WeekMeans

# New DataFrame for KMeans clustering
dfff=pd.DataFrame()
dfff['New cases']=df['New cases']
dfff['New deaths']=df['New deaths']

kmeans=KMeans(6)
kmeans.fit(dfff)

KMeans(n_clusters=6)

dfff["cluster"] = kmeans.predict(dfff)

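We take the base-10 logarithm of “New cases” (after adding 1, so that zero counts stay defined) because the column contains some extremely large values. Note that this transform is applied after the clusters were assigned, so it changes how the points are plotted but not the cluster labels.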

dfff['New cases']=np.log10(dfff['New cases']+1)

dfff

	New cases	New deaths	cluster
0	2.029384	10	0
1	2.071882	6	0
2	2.790285	8	0
3	1.041393	0	0
4	1.278754	1	0
...	...	...	...
182	2.184691	2	0
183	0.000000	0	0
184	1.041393	4	0
185	1.857332	1	0
186	2.285557	2	0

187 rows × 3 columns

The graph below shows that the variables we picked did not yield a useful interpretation of the dataset. A possible reason is that the datapoints are tightly packed, with no clear separation between clusters.

clusters = alt.Chart(dfff).mark_point(size=10).encode(
    x = 'New cases',
    y = 'New deaths',
    color = alt.Color('cluster', legend=None),
).interactive().properties(
    width=500,height=500,
    title='New deaths and cases'
)
clusters
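One thing that might be worth trying, as an assumption rather than a result from this project, is to apply the log transform and standardize both columns before fitting KMeans, so that a few very large values do not dominate the distance computation. The sketch below uses randomly generated heavy-tailed counts in place of the real columns, and the choice of 3 clusters is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic heavy-tailed counts standing in for "New cases" and "New deaths".
toy = pd.DataFrame({
    "New cases": rng.lognormal(mean=5, sigma=2, size=100).round(),
    "New deaths": rng.lognormal(mean=2, sigma=1.5, size=100).round(),
})

# Compress the long tail first, then put both columns on a common scale.
features = StandardScaler().fit_transform(np.log10(toy + 1))

# n_clusters=3 is an arbitrary choice for this illustration.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(features)
print(toy["cluster"].value_counts())
```

Whether this produces more meaningful clusters on the real data would have to be checked against the dataset itself.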

We will now proceed with the K Nearest Neighbors test.

DropCountry = df.drop(columns='Country/Region')

We hypothesize that, knowing “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase”, we can predict the number of new cases.

X_train, X_test, y_train, y_test = train_test_split(
    df[["Deaths / 100 Cases","Recovered / 100 Cases","1 week % increase"]], df["New cases"], test_size = 0.3)

reg = KNeighborsRegressor(n_neighbors=6)

reg.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=6)

Based on our score, the columns we are using, “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase”, are not good indicators of “New cases”.

reg.score(X_test,y_test)

-0.09183683492283201
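For a regressor, `score` returns the coefficient of determination R². It is 0 for a model that always predicts the mean of the test targets and negative for a model that does worse than that, which is how the value above should be read. A small sketch on made-up numbers:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets, for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Always predicting the mean of y_true gives R^2 = 0.
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# Predictions that are further off than the mean give a negative R^2.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])
r2_bad = r2_score(y_true, bad_pred)

print(r2_mean, r2_bad)
```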

clf = LogisticRegression()

cols=["Deaths / 100 Cases","Recovered / 100 Cases"]
clf.fit(df[cols],df['WHO Region'])

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,

LogisticRegression()

df["pred"] = clf.predict(df[cols])

The chart below shows the result of logistic regression on the two columns “Deaths / 100 Cases” and “Recovered / 100 Cases”. Although there are 6 WHO regions, the classifier predicts only 3 of them; this may be because the datapoints are dense and the regions overlap heavily in these two features. The convergence warning above also indicates that the solver stopped before converging.

alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred",
    tooltip=['WHO Region','Country/Region']
).interactive().properties(

    title="Recovered and deaths"
)
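The ConvergenceWarning printed earlier suggests scaling the data or raising `max_iter`. A minimal sketch of doing both is shown below, on synthetic data with three made-up class labels. Whether this changes the number of predicted regions on the real data was not tested here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic two-feature data with three made-up class labels.
centers = np.array([[2.0, 60.0], [4.0, 80.0], [6.0, 40.0]])
X = np.vstack([c + rng.normal(scale=[0.5, 5.0], size=(30, 2)) for c in centers])
y = np.repeat(["A", "B", "C"], 30)

# Standardize the features, then fit with a larger iteration budget.
clf_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf_scaled.fit(X, y)

# Which labels does the fitted model actually predict, and how often?
labels, counts = np.unique(clf_scaled.predict(X), return_counts=True)
print(dict(zip(labels, counts)))
```

Counting the predicted labels in this way is also a quick check of how many regions a classifier covers.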

Below is an interactive visual of deaths per 100 cases against recoveries per 100 cases for each country, colored by region. Dragging a brush over the scatter plot updates the bar chart, so we can see how many countries in each region fall above or below a chosen threshold of “Deaths / 100 Cases” or “Recovered / 100 Cases”.

brush = alt.selection(type='interval')

points = alt.Chart(df).mark_point().encode(
    x='Recovered / 100 Cases:Q',
    y='Deaths / 100 Cases:Q',
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray'))
).add_selection(
    brush
)

bars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)

points & bars

Now we will examine the columns that can help us identify the region that is performing best in navigating COVID-19: “Recovered / 100 Cases” and “1 week % increase”. These columns can be effective for this purpose because they are relative measures (rates and percentages), so the total population of each region has less of an impact on them.

df['Recovered / 100 Cases'].corr(df['1 week % increase'])

-0.3942542429619809
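`Series.corr` computes the Pearson correlation coefficient by default. The sketch below reproduces that calculation by hand on made-up values, chosen only to illustrate a negative relationship.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only.
x = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
y = pd.Series([9.0, 7.0, 8.0, 3.0, 2.0])

# Pearson r: sum of products of deviations, divided by the square root of
# the product of the summed squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())

print(r_manual, x.corr(y))
```

A value near -0.39, as found above, indicates a weak to moderate negative linear relationship between the two columns.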

RecoveredIncreasePoints = alt.Chart(df).mark_point().encode(
    x=alt.X('Recovered / 100 Cases'),
    y=alt.Y('1 week % increase'),
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray')),
    tooltip=['1 week % increase']
).add_selection(
    brush
)

RecoveredIncreaseBars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)

RecoveredIncreasePoints & RecoveredIncreaseBars

In this visual, the bottom right corner of the chart contains the countries that are performing best against COVID-19: a low percent increase per week and a high recovery rate per 100 cases. Brushing over that corner shows a large number of countries from the Europe region.
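The kind of selection that the brush makes can also be written as a boolean filter in pandas. In the sketch below, both the rows and the two thresholds are hypothetical and serve only to show the mechanics.

```python
import pandas as pd

# Toy rows with the dataset's column names; values are illustrative only.
toy = pd.DataFrame({
    "WHO Region": ["Europe", "Europe", "Africa", "Americas", "Europe"],
    "Recovered / 100 Cases": [90.0, 85.0, 40.0, 88.0, 30.0],
    "1 week % increase": [2.0, 4.0, 25.0, 30.0, 3.0],
})

# Hypothetical thresholds for the "bottom right corner" of the chart.
min_recovered = 80.0
max_increase = 5.0

# Keep rows with a high recovery rate and a low weekly increase.
selected = toy[
    (toy["Recovered / 100 Cases"] >= min_recovered)
    & (toy["1 week % increase"] <= max_increase)
]
region_counts = selected["WHO Region"].value_counts()
print(region_counts)
```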

Summary

In this project, we looked for relationships between data columns, most of which were not helpful, since they are raw counts that depend on population. The methods we tried had trouble separating datapoints by region, which led us to use population-independent columns instead. Finally, the brush feature of the selection histogram let us pick custom thresholds for the data we wished to examine; in other words, we were able to see the number of countries within a region that fall in a chosen range of “Deaths / 100 Cases”, “Recovered / 100 Cases”, or “1 week % increase”.

References

What is the source of your dataset(s)?

The source of this dataset: https://www.kaggle.com/datasets/imdevskp/corona-virus-report

Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

KNN: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html

Selection histogram: https://altair-viz.github.io/gallery/selection_histogram.html

Created in Deepnote
