An analysis of the relationship between total COVID-19 cases and other data#

Author: Haoyang Wang

Course Project, UC Irvine, Math 10, F22

Introduction#

During the pandemic, millions of people were infected with COVID-19. This project looks for relationships between the total number of people infected (Total Cases) and other data (Population, PCR tests, etc.). It uses pandas, Altair, Seaborn, and machine learning tools to look for these relationships.

Exploring the data with pandas#

import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns

First, let’s see what the dataset generally looks like.

df = pd.read_csv("Covid Live.csv")
df.head()
   #  Country,\nOther  Total\nCases  Total\nDeaths  New\nDeaths  Total\nRecovered  Active\nCases  Serious,\nCritical  Tot Cases/\n1M pop  Deaths/\n1M pop   Total\nTests  Tests/\n1M pop     Population
0  1              USA    98,166,904      1,084,282          NaN        94,962,112      2,120,510               2,970             293,206            3,239  1,118,158,870       3,339,729    334,805,269
1  2            India    44,587,307        528,629          NaN        44,019,095         39,583                 698              31,698              376    894,416,853         635,857  1,406,631,776
2  3           France    35,342,950        155,078          NaN        34,527,115        660,757                 869             538,892            2,365    271,490,188       4,139,547     65,584,518
3  4           Brazil    34,706,757        686,027          NaN        33,838,636        182,094               8,318             161,162            3,186     63,776,166         296,146    215,353,593
4  5          Germany    33,312,373        149,948          NaN        32,315,200        847,225               1,406             397,126            1,788    122,332,384       1,458,359     83,883,596

The column names contain stray newline characters. To make the columns easier to reference, I rename them.

# Display the columns so their exact original names can be used as dictionary keys below
df.columns
Index(['#', 'Country,\nOther', 'Total\nCases', 'Total\nDeaths', 'New\nDeaths',
       'Total\nRecovered', 'Active\nCases', 'Serious,\nCritical',
       'Tot Cases/\n1M pop', 'Deaths/\n1M pop', 'Total\nTests',
       'Tests/\n1M pop', 'Population'],
      dtype='object')
# Use a dictionary to map the original names to clean ones
col_name = {'Country,\nOther':"Country", 'Total\nCases':"Total Cases", 'Total\nDeaths':"Total Deaths", 'New\nDeaths':"New Deaths",
       'Total\nRecovered':"Total Recovered", 'Active\nCases':"Active Cases", 'Serious,\nCritical':"Serious Criticals",
       'Tot Cases/\n1M pop':"Total Cases/m", 'Deaths/\n1M pop':"Deaths/m", 'Total\nTests':"Total Test",
       'Tests/\n1M pop':"Tests/m"}
df.rename(col_name, axis=1, inplace=True)
df.head()
   #  Country  Total Cases  Total Deaths  New Deaths  Total Recovered  Active Cases  Serious Criticals  Total Cases/m  Deaths/m     Total Test    Tests/m     Population
0  1      USA   98,166,904     1,084,282         NaN       94,962,112     2,120,510              2,970        293,206     3,239  1,118,158,870  3,339,729    334,805,269
1  2    India   44,587,307       528,629         NaN       44,019,095        39,583                698         31,698       376    894,416,853    635,857  1,406,631,776
2  3   France   35,342,950       155,078         NaN       34,527,115       660,757                869        538,892     2,365    271,490,188  4,139,547     65,584,518
3  4   Brazil   34,706,757       686,027         NaN       33,838,636       182,094              8,318        161,162     3,186     63,776,166    296,146    215,353,593
4  5  Germany   33,312,373       149,948         NaN       32,315,200       847,225              1,406        397,126     1,788    122,332,384  1,458,359     83,883,596
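
The renaming dictionary lists every column by hand. As a possible alternative, the sketch below collapses the embedded newlines programmatically. It runs on a small stand-in frame (`toy` is hypothetical, not the project's data), and it would yield names such as "Country, Other" rather than the shorter names chosen in the dictionary.

```python
import pandas as pd

# Stand-in frame (hypothetical) with the same kind of newline-laden headers
toy = pd.DataFrame(columns=["#", "Country,\nOther", "Total\nCases", "Tests/\n1M pop"])

# Replace each literal newline in the header with a space
toy.columns = toy.columns.str.replace("\n", " ", regex=False)

print(list(toy.columns))
```

A dictionary is still the better choice when the new names should differ from the originals by more than whitespace.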

Drop the “New Deaths” column entirely because it has almost no data, then drop the rows that still have missing values.

del df["New Deaths"] #"del" reference: https://www.educative.io/answers/how-to-delete-a-column-in-pandas
df = df.dropna(axis=0).copy()
df.head()
   #  Country  Total Cases  Total Deaths  Total Recovered  Active Cases  Serious Criticals  Total Cases/m  Deaths/m     Total Test    Tests/m     Population
0  1      USA   98,166,904     1,084,282       94,962,112     2,120,510              2,970        293,206     3,239  1,118,158,870  3,339,729    334,805,269
1  2    India   44,587,307       528,629       44,019,095        39,583                698         31,698       376    894,416,853    635,857  1,406,631,776
2  3   France   35,342,950       155,078       34,527,115       660,757                869        538,892     2,365    271,490,188  4,139,547     65,584,518
3  4   Brazil   34,706,757       686,027       33,838,636       182,094              8,318        161,162     3,186     63,776,166    296,146    215,353,593
4  5  Germany   33,312,373       149,948       32,315,200       847,225              1,406        397,126     1,788    122,332,384  1,458,359     83,883,596
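
One way to back up the decision to delete a column is to measure how much of it is missing. The following is a minimal sketch on made-up data; the frame `toy` and the 0.5 cutoff are assumptions chosen for illustration, not values taken from the project.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "New Deaths": [np.nan, np.nan, np.nan, "3"],
    "Total Cases": ["10", "20", np.nan, "40"],
})

# Fraction of missing entries in each column
missing_share = toy.isna().mean()

# Columns that are mostly empty (the 0.5 cutoff is an arbitrary assumption)
sparse_cols = missing_share[missing_share > 0.5].index.tolist()

# Drop the sparse columns first, then the rows that still contain NaN
cleaned = toy.drop(columns=sparse_cols).dropna(axis=0)
print(missing_share)
print(cleaned)
```

The order matters: calling `dropna` before removing the sparse column would discard almost every row, because a row is dropped as soon as any one of its values is missing.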

The values in these columns are strings containing commas. I convert them to a numeric type so they are easier to work with.

# First, find the columns that hold strings
from pandas.api.types import is_string_dtype
str_col = [i for i in df.columns if is_string_dtype(df[i])]

# The first string column is "Country"; drop it from the list
str_col = str_col[1:]
# Remove the commas, then convert the string columns to numeric
for i in str_col:
    df[i] = df[i].str.replace(",","")
    df[i] = pd.to_numeric(df[i])
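
If a column also contained entries that are not numbers at all, `pd.to_numeric` would raise an error by default. The sketch below shows one hedged way to handle that case with `errors="coerce"`; the frame `toy` and its "N/A" entry are invented for illustration and do not come from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["USA", "India", "Nowhere"],
    "Total Cases": ["98,166,904", "44,587,307", "N/A"],
})

# Strip thousands separators, then convert; errors="coerce" turns
# unparseable entries such as "N/A" into NaN instead of raising
stripped = toy["Total Cases"].str.replace(",", "", regex=False)
toy["Total Cases"] = pd.to_numeric(stripped, errors="coerce")
print(toy)
```

Because NaN is a floating-point value, the converted column has dtype float64 rather than an integer dtype.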

Visualization with Altair and Seaborn#

First, I’d like to see whether population is strongly related to the density of cases (total cases per million people). In other words, does a country’s population affect the likelihood of being infected?

alt.Chart(df).mark_circle().encode(
    x = alt.X("Population",sort="ascending"),
    y = alt.Y("Total Cases/m",sort="ascending"),
    tooltip = ("Country","Total Cases")
).interactive()

I use an interval selection to show the details.

sel = alt.selection_interval()

c1 = alt.Chart(df).mark_circle().encode(
    x =alt.X("Population", scale=alt.Scale(domain=[0,400000000])),
    y ="Total Cases/m",
    tooltip = ("Country","Total Cases"),
    color = alt.condition(sel, alt.value("Black"), "Population")
).add_selection(sel)

c2 = alt.Chart(df).mark_bar().encode(
    x ="Country",
    y ="Total Cases/m", 
    color ="Country"
).transform_filter(sel)

alt.vconcat(c1,c2) # Drag a rectangle with the mouse to inspect the selected countries

Leaving aside the outliers (China and India), it is hard to see a strong relationship between case density and population. Next, I’d like to check whether testing helps prevent people from being infected.
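
Setting the outliers aside can also be done explicitly in pandas before measuring the relationship. The sketch below is only an illustration: the rows in `toy` are invented, and the one-billion cutoff is an assumption picked so that only countries the size of China and India would be excluded.

```python
import pandas as pd

toy = pd.DataFrame({
    "Country": ["P", "Q", "R", "S"],
    "Population": [5_000_000, 60_000_000, 330_000_000, 1_400_000_000],
    "Total Cases/m": [250_000, 400_000, 290_000, 30_000],
})

# Hypothetical cutoff: treat countries above one billion people as outliers
POPULATION_CUTOFF = 1_000_000_000
trimmed = toy[toy["Population"] < POPULATION_CUTOFF]

# Pearson correlation with and without the outlier rows
corr_all = toy["Population"].corr(toy["Total Cases/m"])
corr_trimmed = trimmed["Population"].corr(trimmed["Total Cases/m"])
print(corr_all, corr_trimmed)
```

Comparing the two coefficients shows how much a single very large country can drive the apparent relationship.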

sns.scatterplot(
    data = df,
    x = "Tests/m",
    y = "Total Cases/m"
)
[Figure: Seaborn scatter plot of Tests/m (x-axis) against Total Cases/m (y-axis)]

Surprisingly, no negative relationship between Tests/m and Total Cases/m (which would suggest that testing helps prevent infection) appears here.
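
A scatter plot can be complemented by a correlation coefficient, which puts a number on the direction and strength of the trend. The sketch below uses invented values in `toy` (not the project's data) and computes Spearman's coefficient from ranks, so it relies only on pandas.

```python
import pandas as pd

toy = pd.DataFrame({
    "Tests/m": [1, 2, 3, 4],
    "Total Cases/m": [1, 4, 9, 16],
})

# Pearson measures linear association
pearson = toy["Tests/m"].corr(toy["Total Cases/m"])

# Spearman is the Pearson correlation of the ranks, so it captures any
# monotonic trend; computing it from ranks avoids needing SciPy
spearman = toy["Tests/m"].rank().corr(toy["Total Cases/m"].rank())
print(round(pearson, 3), round(spearman, 3))
```

A coefficient near zero or above zero would be consistent with the observation that no negative relationship is visible, although correlation alone says nothing about cause.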

DecisionTreeRegressor#

Although relationships between “Total Cases/m” and the other data are hard to see in the charts, I’m going to use machine learning to try to predict “Total Cases/m” using the other data as input.

To avoid feeding redundant information into the model, I drop the per-million columns (those ending in “/m”), which repeat the totals; for example, “Tests/m” repeats “Total Test”.

cols = [i for i in str_col if i[-2:] != "/m"]
cols = cols[1:] # the first one is "Total Cases", which the target "Total Cases/m" is derived from; drop it
from sklearn.model_selection import train_test_split
X = df[cols]
y = df["Total Cases/m"]
X_train, X_test, y_train, y_test = train_test_split(X, y,train_size=.8,random_state=59547172)
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
DecisionTreeRegressor()
reg.score(X_test, y_test) # the test score is poor, which may indicate overfitting
0.33628386401402366
reg.score(X_train, y_train) # 1.0 on the training data, far above the test score, indicates overfitting
1.0
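
The gap between a perfect training score and a weak test score is the usual signature of an unrestricted tree memorising its training data. The sketch below reproduces that pattern on synthetic data; the data, the names `full_tree` and `small_tree`, and the leaf limit of 8 are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=200)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Unrestricted tree: it can isolate every training point, noise included
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Restricted tree: at most 8 leaves, so it can only capture the broad shape
small_tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_tr, y_tr)

full_scores = (full_tree.score(X_tr, y_tr), full_tree.score(X_te, y_te))
small_scores = (small_tree.score(X_tr, y_tr), small_tree.score(X_te, y_te))
print(full_scores, small_scores)
```

Limiting the number of leaves lowers the training score, and it typically narrows the gap between training and test scores.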

The feature importances do not really make sense to me.

pd.Series(reg.feature_importances_, index=reg.feature_names_in_)
Total Deaths         0.049213
Total Recovered      0.343638
Active Cases         0.035992
Serious Criticals    0.011876
Total Test           0.126348
Population           0.432932
dtype: float64
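
Importances read from an overfitted tree reflect how the tree split its training data, which may explain why they look odd. A possible cross-check is permutation importance measured on held-out rows. The sketch below uses synthetic data in which only the first column matters; none of the names or numbers come from the project.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=300)  # column 1 is pure noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=17, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and record the score drop
result = permutation_importance(tree, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)
```

A feature whose shuffling barely changes the test score contributes little to the predictions, whatever its impurity-based importance says.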

To choose a suitable max_leaf_nodes value, I first plot the training and test error curves and look for a U shape in the test error.

from sklearn.metrics import mean_absolute_error
train_dic = {}
test_dic = {}

for i in range(2,100):
    reg = DecisionTreeRegressor(criterion = "absolute_error", 
    max_leaf_nodes = i)
    reg.fit(X_train, y_train)
    train_dic[i] = mean_absolute_error(y_train, reg.predict(X_train))
    test_dic[i] = mean_absolute_error(y_test, reg.predict(X_test))
train_loss = pd.Series(train_dic)
test_loss = pd.Series(test_dic)
train_loss.name = "train"
test_loss.name = "test"
df_loss = pd.concat((train_loss, test_loss), axis=1)
df_loss.reset_index(inplace=True)
df_loss.rename({"index": "max_leaf_nodes"}, axis=1, inplace=True)
df_melted = df_loss.melt(id_vars="max_leaf_nodes", var_name="error_type", value_name="loss")

The test curve is not a typical U shape. The best max_leaf_nodes I can choose without overfitting is 17. (I also checked 37, but there the training score is .98 versus a test score of .70, which suggests overfitting, so I discarded it.)

alt.Chart(df_melted).mark_line().encode(
    x = "max_leaf_nodes",
    y = "loss",
    color = "error_type",
    tooltip = "max_leaf_nodes"
)

Fit the DecisionTreeRegressor again with max_leaf_nodes=17.

reg2 = DecisionTreeRegressor(max_leaf_nodes=17)
reg2.fit(X_train, y_train)
DecisionTreeRegressor(max_leaf_nodes=17)
reg2.score(X_test, y_test) # the score is still not good
0.6106858707481919

RandomForestRegressor#

I will use a random forest to try to make the predictions more accurate.

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(n_estimators=100, max_leaf_nodes=17)
forest_reg.fit(X_train, y_train)
RandomForestRegressor(max_leaf_nodes=17)
forest_reg.score(X_test, y_test)
0.7140076019321402
# forest_reg was fitted on y = df["Total Cases/m"], so "pred" holds per-million estimates
df["pred"] = forest_reg.predict(df[cols])
c1 = alt.Chart(df).mark_circle(color="black").encode(
    x = alt.X("Population", scale=alt.Scale(domain=[0,400000000])),
    y = "Total Cases",
    tooltip = "Country"
)

c2  = alt.Chart(df).mark_circle(color="red").encode(
    x = alt.X("Population", scale=alt.Scale(domain=[0,400000000])),
    y = "pred"
)

c1+c2

The predictions in the chart do not match the actual values well. I think the outliers strongly affect the predicted results.
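
A numeric check can complement the chart. Because the model was fitted with `y = df["Total Cases/m"]`, its predictions are per-million values, so they should be compared with the same quantity. The sketch below shows such a comparison on synthetic data; the data and the name `forest` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 100 * X[:, 0] + 20 * X[:, 1] + rng.normal(scale=5, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)
forest = RandomForestRegressor(n_estimators=100, max_leaf_nodes=17, random_state=0)
forest.fit(X_tr, y_tr)

pred = forest.predict(X_te)
# Both metrics compare predictions with true values of the same quantity
r2 = r2_score(y_te, pred)              # same value as forest.score(X_te, y_te)
mae = mean_absolute_error(y_te, pred)  # average miss, in the target's units
print(round(r2, 3), round(mae, 3))
```

The mean absolute error is often easier to interpret than R², since it is expressed in the units of the target.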

KNeighborsRegressor#

I will use KNeighborsRegressor to check the predictions again.

from sklearn.neighbors import KNeighborsRegressor
# Again, start with finding the best k
def get_scores(k):
    K_reg=KNeighborsRegressor(n_neighbors=k)
    K_reg.fit(X_train, y_train)
    train_error=mean_absolute_error(K_reg.predict(X_train), y_train)
    test_error=mean_absolute_error(K_reg.predict(X_test), y_test)
    return (train_error, test_error)
# Fit each k once and index the results by k (1 to 99)
ks = range(1, 100)
scores = [get_scores(k) for k in ks]
df_k = pd.DataFrame(scores, columns=["train_error", "test_error"], index=pd.Index(ks, name="k"))

The chart shows that, in general, the larger k is, the larger the error. The best k I can get from it is 5.

sns.lineplot(data=df_k, markers=True)
[Figure: Seaborn line plot of train_error and test_error for each k]
K_reg = KNeighborsRegressor(n_neighbors=5)
K_reg.fit(X_train, y_train)
df["K_predict"] = K_reg.predict(df[cols])
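
KNeighborsRegressor measures distances on the raw feature values, so a column with very large numbers (Population runs into the hundreds of millions) can outweigh a column with small ones (Serious Criticals is in the hundreds or thousands). Whether scaling would help on this dataset is untested. The sketch below only demonstrates the effect on synthetic data, and all names in it are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
informative = rng.uniform(0, 1, size=500)    # small-scale, drives y
large_noise = rng.uniform(0, 1e6, size=500)  # huge-scale, irrelevant
X = np.column_stack([informative, large_noise])
y = 10 * informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=0)

# Without scaling, distances are dominated by the large irrelevant column
raw_knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)
# With scaling, both columns contribute comparably to the distance
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)).fit(X_tr, y_tr)

raw_score = raw_knn.score(X_te, y_te)
scaled_score = scaled_knn.score(X_te, y_te)
print(round(raw_score, 3), round(scaled_score, 3))
```

Putting the scaler inside a pipeline ensures it is fitted on the training rows only and then applied unchanged to the test rows.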

Even with KNeighbors, the predictions do not differ much from those of the random forest. Therefore, I conclude that there is not much of a relationship between Total Cases and the other data.

c3 = alt.Chart(df).mark_circle(color="black").encode(
    x = alt.X("Population", scale=alt.Scale(domain=[0,400000000])),
    y = "Total Cases",
    tooltip = "Country"
)

c4  = alt.Chart(df).mark_circle(color="red").encode(
    x = alt.X("Population", scale=alt.Scale(domain=[0,400000000])),
    y = "K_predict"
)

c3+c4

Summary#

I used pandas to clean and analyze the data, Altair and Seaborn for visualization, and Decision Tree, Random Forest, and KNeighbors regressors for machine learning. Although I tried different techniques, the predictions did not perform well. Therefore, I would say it is hard to observe a relationship between Total Cases and the other data in this dataset.

References#

del: Neko Yan, “How to delete a column in pandas”, https://www.educative.io/answers/how-to-delete-a-column-in-pandas

Seaborn visualization: seaborn documentation, https://seaborn.pydata.org/generated/seaborn.scatterplot.html and https://seaborn.pydata.org/generated/seaborn.lineplot.html

Dataset source: Kaggle
