Predicting COVID-19 Trends

Author: Huy Tran

Course Project, UC Irvine, Math 10, S22

Introduction

This project uses a country-level COVID-19 dataset, labeled by WHO region, to examine each region and determine which one is combating COVID-19 most effectively. We also try to assign data points to the region most likely to have produced them. By the end of the project, after examining the different columns, we identify the columns that are most useful for judging which region has navigated COVID-19 best so far.

Main portion of the project

import pandas as pd
import altair as alt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df=pd.read_csv("country_wise_latest.csv")

Clarify/Rename columns.

df=df.rename(columns={'Confirmed':'Confirmed cases'})
df.head()
  | Country/Region | Confirmed cases | Deaths | Recovered | Active | New cases | New deaths | New recovered | Deaths / 100 Cases | Recovered / 100 Cases | Deaths / 100 Recovered | Confirmed last week | 1 week change | 1 week % increase | WHO Region
0 | Afghanistan | 36263 | 1269 | 25198 | 9796 | 106 | 10 | 18 | 3.50 | 69.49 | 5.04 | 35526 | 737 | 2.07 | Eastern Mediterranean
1 | Albania | 4880 | 144 | 2745 | 1991 | 117 | 6 | 63 | 2.95 | 56.25 | 5.25 | 4171 | 709 | 17.00 | Europe
2 | Algeria | 27973 | 1163 | 18837 | 7973 | 616 | 8 | 749 | 4.16 | 67.34 | 6.17 | 23691 | 4282 | 18.07 | Africa
3 | Andorra | 907 | 52 | 803 | 52 | 10 | 0 | 0 | 5.73 | 88.53 | 6.48 | 884 | 23 | 2.60 | Europe
4 | Angola | 950 | 41 | 242 | 667 | 18 | 1 | 0 | 4.32 | 25.47 | 16.94 | 749 | 201 | 26.84 | Africa

The output below shows the number of countries in each region, taken from the “WHO Region” column.

df["WHO Region"].value_counts()
Europe                   56
Africa                   48
Americas                 35
Eastern Mediterranean    22
Western Pacific          16
South-East Asia          10
Name: WHO Region, dtype: int64

Here is a bar chart of the number of countries per region.

Region_Chart = alt.Chart(df).mark_bar(size=15).encode(
    x = alt.X('WHO Region', sort='y'),
    y = 'count()',
    color = alt.Color('WHO Region', legend=None)
).properties(
    title='Number of countries per WHO region'
)
Region_Chart

The chart below plots recovered cases against total confirmed cases. The color of each point indicates its region, and the tooltip shows the country each point represents. The chart is hard to read, however, because most of the points are crowded into the bottom-left corner.

Confirmed_Country = alt.Chart(df).mark_point(size=100).encode(
    x = 'Confirmed cases',
    y = 'Recovered',
    color = alt.Color('WHO Region', legend=None),
    tooltip=['WHO Region','Country/Region','Confirmed cases','Recovered']
).properties(
    width=800,height=300,
    title='Cases per country'
)
Confirmed_Country
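
Because most points crowd into the bottom-left corner, one option is to plot log-transformed counts instead of raw counts. The sketch below is only an illustration: the DataFrame `toy` holds made-up numbers, and the column names `log_confirmed` and `log_recovered` are hypothetical. It uses the same `log10(x + 1)` transform that this notebook applies to “New cases” later on.

```python
import numpy as np
import pandas as pd

# Illustrative values only; these are not rows from country_wise_latest.csv.
toy = pd.DataFrame({
    "Confirmed cases": [0, 9, 999, 99999],
    "Recovered": [0, 9, 99, 9999],
})

# log10(x + 1) keeps zero counts defined and compresses the largest values.
toy["log_confirmed"] = np.log10(toy["Confirmed cases"] + 1)
toy["log_recovered"] = np.log10(toy["Recovered"] + 1)

print(toy)
```

Columns like these could then be used for the `x` and `y` encodings of a scatter plot, so that small and large countries are spread more evenly along each axis.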

Calculate the average 1 week percent increase for each region.

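# Regions in order of first appearance in the data.
Regions=df['WHO Region'].unique()
RegionMeans=[]

for region in Regions:
    # Rows belonging to the current region.
    Regiondf=df[df['WHO Region']==region]
    # Mean weekly percent increase over the countries in this region.
    region_mean=Regiondf['1 week % increase'].mean()
    RegionMeans.append(region_mean)
    print("The average for the",region,"region is", region_mean)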
The average for the Eastern Mediterranean region is 10.48227272727273
The average for the Europe region is 7.769642857142857
The average for the Africa region is 18.086458333333333
The average for the Americas region is 16.331142857142858
The average for the Western Pacific region is 22.11125
The average for the South-East Asia region is 8.513
RegionMeans
[10.48227272727273,
 7.769642857142857,
 18.086458333333333,
 16.331142857142858,
 22.11125,
 8.513]
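
The per-region averages can also be computed without an explicit loop. The following is a minimal sketch using pandas `groupby`; the DataFrame `toy` contains made-up numbers, not values from the dataset, and `region_means` is a hypothetical name.

```python
import pandas as pd

# Illustrative values only; these are not rows from country_wise_latest.csv.
toy = pd.DataFrame({
    "WHO Region": ["Europe", "Europe", "Africa", "Africa", "Americas"],
    "1 week % increase": [2.0, 4.0, 10.0, 20.0, 7.0],
})

# One mean per region, without an explicit Python loop.
region_means = toy.groupby("WHO Region")["1 week % increase"].mean()

for region, mean in region_means.items():
    print("The average for the", region, "region is", mean)
```

Note that `groupby` returns the regions in sorted order by default, rather than in order of first appearance as `unique()` does.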

Bar chart of the average 1 week % increase for each region.

# New DataFrame containing a list of regions and their respective means.
dff=pd.DataFrame()
dff['Regions']=Regions
dff['RegionMeans']=RegionMeans
WeekMeans = alt.Chart(dff).mark_bar(size=15).encode(
    x = alt.X('Regions', sort='y'),
    y = 'RegionMeans',
    color = alt.Color('Regions', legend=None)
)
WeekMeans
# New DataFrame for K-means clustering, holding only the two columns of interest.
dfff=df[['New cases','New deaths']].copy()
kmeans=KMeans(6)
kmeans.fit(dfff)
KMeans(n_clusters=6)
dfff["cluster"] = kmeans.predict(dfff)

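We take the base-10 logarithm of “New cases” (after adding 1) because the column contains some extremely large values. Note that this transform is applied after the K-means fit above, so it changes how the points are plotted but not the cluster assignments.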

dfff['New cases']=np.log10(dfff['New cases']+1)
dfff
New cases New deaths cluster
0 2.029384 10 0
1 2.071882 6 0
2 2.790285 8 0
3 1.041393 0 0
4 1.278754 1 0
... ... ... ...
182 2.184691 2 0
183 0.000000 0 0
184 1.041393 4 0
185 1.857332 1 0
186 2.285557 2 0

187 rows × 3 columns

The chart below shows that clustering on these two variables does not give a useful picture of the dataset. Possible reasons are that the points are tightly packed and that there is no clear separation between clusters.

clusters = alt.Chart(dfff).mark_point(size=10).encode(
    x = 'New cases',
    y = 'New deaths',
    # ':N' tells Altair to treat the cluster label as a category, not a number.
    color = alt.Color('cluster:N', legend=None),
).interactive().properties(
    width=500,height=500,
    title='New deaths and cases'
)
clusters
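
One thing that might be worth trying is to transform and rescale the columns before fitting K-means, since the algorithm is distance-based and a few very large values can dominate. The sketch below is an assumption-laden illustration, not the method used in this project: it runs on synthetic data, uses 3 clusters arbitrarily, and the names `toy` and `features` are hypothetical. Whether this would separate the regions in the real dataset is not known.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic, heavily skewed counts standing in for "New cases" and "New deaths".
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "New cases": rng.lognormal(mean=5, sigma=2, size=60).round(),
    "New deaths": rng.lognormal(mean=2, sigma=1.5, size=60).round(),
})

# Log-transform first, then standardize, so no single huge value dominates.
features = StandardScaler().fit_transform(np.log10(toy + 1))

# n_init and random_state are set explicitly so repeated runs give the same labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy["cluster"] = kmeans.fit_predict(features)

print(toy["cluster"].value_counts())
```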

We will now try K-nearest neighbors regression.

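# Pass the column by keyword; the positional axis argument is not accepted by newer pandas versions.
DropCountry = df.drop(columns='Country/Region')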

We hypothesize that knowing “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase” lets us predict the number of new cases.

X_train, X_test, y_train, y_test = train_test_split(
    df[["Deaths / 100 Cases","Recovered / 100 Cases","1 week % increase"]], df["New cases"], test_size = 0.3)
reg = KNeighborsRegressor(n_neighbors=6)
reg.fit(X_train, y_train)
KNeighborsRegressor(n_neighbors=6)

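The score below is the coefficient of determination (R²) on the test set. A negative value means the model does worse than a constant model that always predicts the mean, so “Deaths / 100 Cases”, “Recovered / 100 Cases”, and “1 week % increase” are not good predictors of “New cases”.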

reg.score(X_test,y_test)
-0.09183683492283201
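
To make the meaning of a negative score concrete, here is a small worked example of the R² formula, 1 - SS_res / SS_tot, which is what a scikit-learn regressor's `score` method reports. The helper `r_squared` and the numbers are hypothetical and for illustration only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # squared deviations from the mean
    return 1 - ss_res / ss_tot

# Illustrative targets only; these are not values from the dataset.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                  # every prediction exact
baseline = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])      # predictions in reverse order

print(perfect, baseline, worse)  # 1.0 0.0 -3.0
```

A score slightly below zero, like the one above, therefore means the model is roughly on par with, and slightly worse than, guessing the average every time.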
clf = LogisticRegression()
cols=["Deaths / 100 Cases","Recovered / 100 Cases"]
clf.fit(df[cols],df['WHO Region'])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()
df["pred"] = clf.predict(df[cols])

The chart below shows the result of logistic regression on the two columns “Deaths / 100 Cases” and “Recovered / 100 Cases”. Although there are 6 WHO regions, the model predicts only 3 of them; this may be because the points are dense and lie close to one another. The solver also reached its iteration limit before converging (see the warning above), which may contribute.

alt.Chart(df).mark_circle().encode(
    x=alt.X(cols[0], scale=alt.Scale(zero=False)),
    y=alt.Y(cols[1], scale=alt.Scale(zero=False)),
    color="pred",
    tooltip=['WHO Region','Country/Region']
).interactive().properties(
    title="Recovered and deaths"
)
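
The convergence warning shown earlier suggests either raising `max_iter` or scaling the data. A minimal sketch of doing both with a scikit-learn pipeline follows. It runs on synthetic data with three made-up labels, so it says nothing about how the real regions would be classified; the names `X`, `y`, and `pred` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two percentage columns and three region labels.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[2, 40], scale=[1, 10], size=(30, 2)),
    rng.normal(loc=[5, 60], scale=[1, 10], size=(30, 2)),
    rng.normal(loc=[8, 80], scale=[1, 10], size=(30, 2)),
])
y = np.repeat(["A", "B", "C"], 30)

# Standardize the inputs and allow more solver iterations than the default of 100.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

pred = clf.predict(X)
print(sorted(set(pred)))   # which labels the model actually predicts
print(clf.score(X, y))     # training accuracy on the synthetic data
```

Checking `set(pred)` is a quick way to see how many distinct classes a fitted model ever predicts, which is the issue observed in the chart above.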

Below is an interactive chart of “Deaths / 100 Cases” against “Recovered / 100 Cases” for every country, colored by region. Dragging a selection over the points updates the bar chart, so we can see how many countries in each region fall within a chosen range of either rate.

brush = alt.selection(type='interval')

points = alt.Chart(df).mark_point().encode(
    x='Recovered / 100 Cases:Q',
    y='Deaths / 100 Cases:Q',
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray'))
).add_selection(
    brush
)

bars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)

points & bars

Next, we examine two columns that can help identify the region that is handling COVID-19 best: “Recovered / 100 Cases” and “1 week % increase”. These columns are useful for comparison because they are relative measures (rates and percentages), so a country's total population has less influence on them.

df['Recovered / 100 Cases'].corr(df['1 week % increase'])
-0.3942542429619809
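
For reference, `Series.corr` computes the Pearson correlation coefficient by default. The sketch below reproduces that calculation by hand on made-up numbers (not values from the dataset) to show what the coefficient measures; `r_manual` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Illustrative values only; these are not rows from the dataset.
toy = pd.DataFrame({
    "Recovered / 100 Cases": [90.0, 80.0, 60.0, 40.0, 20.0],
    "1 week % increase": [2.0, 5.0, 4.0, 12.0, 15.0],
})
x = toy["Recovered / 100 Cases"]
y = toy["1 week % increase"]

# Pearson r: co-variation of the two columns divided by the product of their spreads.
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_manual, x.corr(y))
```

A negative value, as in the real data, means that countries with a higher recovery rate tend to have a lower weekly percent increase.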
RecoveredIncreasePoints = alt.Chart(df).mark_point().encode(
    x=alt.X('Recovered / 100 Cases'),
    y=alt.Y('1 week % increase'),
    color=alt.condition(brush, 'WHO Region', alt.value('lightgray')),
    tooltip=['1 week % increase']
).add_selection(
    brush
)

RecoveredIncreaseBars = alt.Chart(df).mark_bar().encode(
    y='WHO Region:N',
    color='WHO Region:N',
    x='count(WHO Region):Q'
).transform_filter(
    brush
)

RecoveredIncreasePoints & RecoveredIncreaseBars

In this chart, the bottom-right corner (high recovery rate per 100 cases, low percent increase over the week) contains the countries that are doing best against COVID-19. Selecting that corner with the brush shows a large number of European countries, which suggests that Europe has a low weekly percent increase and a high recovery rate.
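
The same kind of count can be obtained without the interactive brush by filtering the DataFrame directly. The sketch below uses made-up rows, and the thresholds `min_recovered` and `max_increase` are hypothetical stand-ins for the rectangle drawn with the brush.

```python
import pandas as pd

# Illustrative values only; these are not rows from country_wise_latest.csv.
toy = pd.DataFrame({
    "WHO Region": ["Europe", "Europe", "Africa", "Americas", "Europe"],
    "Recovered / 100 Cases": [92.0, 85.0, 40.0, 88.0, 55.0],
    "1 week % increase": [1.5, 3.0, 25.0, 18.0, 2.0],
})

# Hypothetical thresholds standing in for the rectangle drawn with the brush.
min_recovered = 80.0
max_increase = 5.0

# Keep only the rows in the "bottom-right corner" of the chart.
selected = toy[
    (toy["Recovered / 100 Cases"] >= min_recovered)
    & (toy["1 week % increase"] <= max_increase)
]
print(selected["WHO Region"].value_counts())
```

Fixing the thresholds in code makes the comparison between regions repeatable, whereas a hand-drawn brush selection differs slightly each time.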

Summary

In this project we looked for relationships between the data columns. Most were not helpful, because the raw counts depend on population. The methods we tried (K-means clustering, K-nearest neighbors regression, and logistic regression) had trouble separating the data points by region, which led us to use population-independent columns instead. Finally, the brush in the selection histogram lets us pick custom thresholds for the data we want to examine; in other words, we could count the countries in each region that fall within a chosen range of “Deaths / 100 Cases” and “Recovered / 100 Cases”.

References

  • What is the source of your dataset(s)?

The source of this dataset: https://www.kaggle.com/datasets/imdevskp/corona-virus-report

  • Were any portions of the code or ideas taken from another source? List those sources here and say how they were used.

KNN: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html

Selection histogram: https://altair-viz.github.io/gallery/selection_histogram.html


By Christopher Davis
© Copyright 2022.