Countries and Air Quality#
Author: Seth Abuhamdeh, email: seth.abuhamdeh@gmail.com
Course Project, UC Irvine, Math 10, S23
Introduction#
My project uses a dataset containing thousands of cities and their respective air quality indexes, along with other measurements relating to air pollution and air quality. In this project, I plan to use machine learning algorithms to see how well a machine can predict a country based on its air quality information. I will compare several machine learning models to see how accurate their predictions are and try to visualize some of their predictions.
Main Section#
import pandas as pd
import altair as alt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Cleaning the Data#
df_pre = pd.read_csv("global air pollution dataset.csv")
df_pre = df_pre.dropna()
df_pre.shape
(23035, 12)
df_pre
 | Country | City | AQI Value | AQI Category | CO AQI Value | CO AQI Category | Ozone AQI Value | Ozone AQI Category | NO2 AQI Value | NO2 AQI Category | PM2.5 AQI Value | PM2.5 AQI Category |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Russian Federation | Praskoveya | 51 | Moderate | 1 | Good | 36 | Good | 0 | Good | 51 | Moderate |
1 | Brazil | Presidente Dutra | 41 | Good | 1 | Good | 5 | Good | 1 | Good | 41 | Good |
2 | Italy | Priolo Gargallo | 66 | Moderate | 1 | Good | 39 | Good | 2 | Good | 66 | Moderate |
3 | Poland | Przasnysz | 34 | Good | 1 | Good | 34 | Good | 0 | Good | 20 | Good |
4 | France | Punaauia | 22 | Good | 0 | Good | 22 | Good | 0 | Good | 6 | Good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23458 | India | Gursahaiganj | 184 | Unhealthy | 3 | Good | 154 | Unhealthy | 2 | Good | 184 | Unhealthy |
23459 | France | Sceaux | 50 | Good | 1 | Good | 20 | Good | 5 | Good | 50 | Good |
23460 | India | Mormugao | 50 | Good | 1 | Good | 22 | Good | 1 | Good | 50 | Good |
23461 | United States of America | Westerville | 71 | Moderate | 1 | Good | 44 | Good | 2 | Good | 71 | Moderate |
23462 | Malaysia | Marang | 70 | Moderate | 1 | Good | 38 | Good | 0 | Good | 70 | Moderate |
23035 rows × 12 columns
Since there are 175 countries in this dataset, it would be very difficult for any model to accurately predict a country based on air quality without extreme overfitting. Therefore, I will reduce the dataset down to the 10 most common countries.
top_ten = df_pre["Country"].value_counts().head(10)
top_ten = top_ten.index.values
top_ten
array(['United States of America', 'India', 'Brazil', 'Germany',
'Russian Federation', 'Italy', 'France', 'China', 'Japan',
'Mexico'], dtype=object)
# .copy() avoids SettingWithCopyWarning when prediction columns are added later
df = df_pre[df_pre["Country"].isin(top_ten)].copy()
df
 | Country | City | AQI Value | AQI Category | CO AQI Value | CO AQI Category | Ozone AQI Value | Ozone AQI Category | NO2 AQI Value | NO2 AQI Category | PM2.5 AQI Value | PM2.5 AQI Category |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Russian Federation | Praskoveya | 51 | Moderate | 1 | Good | 36 | Good | 0 | Good | 51 | Moderate |
1 | Brazil | Presidente Dutra | 41 | Good | 1 | Good | 5 | Good | 1 | Good | 41 | Good |
2 | Italy | Priolo Gargallo | 66 | Moderate | 1 | Good | 39 | Good | 2 | Good | 66 | Moderate |
4 | France | Punaauia | 22 | Good | 0 | Good | 22 | Good | 0 | Good | 6 | Good |
5 | United States of America | Punta Gorda | 54 | Moderate | 1 | Good | 14 | Good | 11 | Good | 54 | Moderate |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23456 | United States of America | Highland Springs | 54 | Moderate | 1 | Good | 34 | Good | 5 | Good | 54 | Moderate |
23458 | India | Gursahaiganj | 184 | Unhealthy | 3 | Good | 154 | Unhealthy | 2 | Good | 184 | Unhealthy |
23459 | France | Sceaux | 50 | Good | 1 | Good | 20 | Good | 5 | Good | 50 | Good |
23460 | India | Mormugao | 50 | Good | 1 | Good | 22 | Good | 1 | Good | 50 | Good |
23461 | United States of America | Westerville | 71 | Moderate | 1 | Good | 44 | Good | 2 | Good | 71 | Moderate |
13374 rows × 12 columns
Now that the dataset only contains the 10 most common countries, it should be easier for a model to give accurate predictions of a country based on its air quality. Since we are predicting a categorical (string) label, we will start with LogisticRegression as our first model.
cols = ["AQI Value", "CO AQI Value", "Ozone AQI Value", "NO2 AQI Value" ]
Since the overall AQI value incorporates the CO, Ozone, and NO2 AQI values, it is useful to make a bar chart of each country's average AQI value to see whether there is some pattern we can spot without a model. The PM2.5 AQI values appear to be equal to the overall AQI values, so they will be left out.
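That last claim can be verified with a one-line check (a quick sketch; the exact fraction depends on the dataset):
# Fraction of rows where the PM2.5 AQI value equals the overall AQI value;
# a number close to 1 supports leaving the PM2.5 column out of the features.
(df["PM2.5 AQI Value"] == df["AQI Value"]).mean()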
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(df).mark_bar().encode(
x = "Country",
y = "mean(AQI Value)",
color = "Country",
tooltip = "mean(AQI Value)"
)
c2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "Country",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City"]
)
c3 = c2.mark_point().encode(
y = "Ozone AQI Value"
)
c4 = c2.mark_point().encode(
y = "NO2 AQI Value",
)
c1|c2|c3|c4
As we can see, countries like India and China have high AQI values, while countries like Russia and Germany have low AQI values. CO AQI values seem to have little effect on the overall AQI value, whereas Ozone has a strong effect. NO2 values seem to have a positive correlation as well for most countries except India; this may be due to bad data or some other unknown variable. Hopefully this will be reflected in our Logistic Regression model. I took the "c1|c2|c3|c4" syntax for placing charts side by side from a previous project, since I'm not sure it was used in class: https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/WenqiZhao.html
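The observations about country averages can also be checked numerically; a simple groupby (a small sketch, equivalent to reading the bar chart's tooltips) sorts the countries by their mean AQI value:
# Mean AQI value per country, sorted from most to least polluted on average.
df.groupby("Country")["AQI Value"].mean().sort_values(ascending=False)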
Logistic Regression#
I will start by analyzing the data with Logistic Regression, since it is the first machine learning model we talked about in class that is relevant to my data, and I believe it will be the least accurate of the models.
# Note: with the default max_iter, the lbfgs solver does not fully converge
# (see the warning below), so the scores should be read with that in mind.
reg = LogisticRegression()
reg.fit(df[cols], df["Country"])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()
df["pred"] = reg.predict(df[cols])
reg.score(df[cols], df["Country"])
0.4976072977418872
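Because the ten countries are not equally represented (the United States contributes by far the most cities), a fairer baseline than 10% random guessing is a model that always predicts the most frequent country. A hedged sketch using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier

# Always predict the most common country; its accuracy equals the share of the
# largest class, which is the number any real model should beat.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(df[cols], df["Country"])
baseline.score(df[cols], df["Country"])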
reg.coef_
array([[ 6.50046494e-02, 9.45639375e-02, -1.12932429e-01,
-5.05171815e-01],
[-1.94443208e-02, 5.26109952e-01, 3.94297927e-02,
1.82107958e-01],
[ 4.50445090e-03, -7.47896702e-02, -2.52123497e-04,
5.88505304e-02],
[-2.30427750e-02, -2.91911178e-01, 3.93321462e-02,
2.53735292e-01],
[ 8.36036755e-02, 1.32801467e-01, -5.39874918e-02,
-6.95648418e-01],
[ 2.68064295e-03, -1.47268176e-01, 1.77612356e-02,
7.42574780e-02],
[-1.31608557e-01, 4.27969692e-02, 1.26071037e-01,
7.38353800e-01],
[ 8.32250925e-02, -1.79764854e-01, -1.42281754e-01,
-5.51433144e-01],
[-6.99260897e-03, 2.99336877e-01, 1.66556023e-02,
-1.84493317e-01],
[-5.79302497e-02, -4.01875324e-01, 7.02039852e-02,
6.29441636e-01]])
reg.classes_
array(['Brazil', 'China', 'France', 'Germany', 'India', 'Italy', 'Japan',
'Mexico', 'Russian Federation', 'United States of America'],
dtype=object)
reg.intercept_
array([ 0.90527227, -0.96775771, 0.02062416, 0.27396248, -0.89017611,
-0.11496629, -0.04354956, 0.00957241, 0.39408518, 0.41293317])
d1 = c1.mark_bar().encode(
x = "pred",
color = "pred",
)
d1|c1
d2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred"]
)
d3 = d2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred"]
)
d4 = d2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred"]
)
d2|c2|d3|c3|d4|c4
Our Logistic Regression model predicts the right country with close to 50% accuracy, which is significantly better than randomly choosing a country (10%). The new charts show how the logistic model distributes its predictions across countries: Mexico's pollution is greatly exaggerated, other countries such as India are underestimated, and France is never predicted at all. The point charts give some sense of how the model assigns countries and where it is inaccurate.
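The claim that France is never predicted can be confirmed directly by counting how often each country appears in the prediction column (a quick check; countries missing from the output are never predicted):
# Number of cities assigned to each country by the logistic model.
df["pred"].value_counts()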
Decision Tree Classifier#
Next I'll use a DecisionTreeClassifier, since of the models we have used in class it seems best suited to this kind of data.
clf1 = DecisionTreeClassifier(max_leaf_nodes= 15)
clf1.fit(df[cols], df["Country"])
DecisionTreeClassifier(max_leaf_nodes=15)
df["pred2"] = clf1.predict(df[cols])
clf1.score(df[cols], df["Country"])
0.5508449229848961
clf2 = DecisionTreeClassifier(max_leaf_nodes= 15)
Just in case, I will check this model for overfitting.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["Country"], train_size = .6, random_state= 18)
clf2.fit(X_train, y_train)
DecisionTreeClassifier(max_leaf_nodes=15)
clf2.score(X_test, y_test)
0.5435514018691588
Since our DecisionTreeClassifier has very similar accuracy on the full data and on the held-out test data, it is very unlikely that there is much overfitting, so we can continue using this model to predict the country from its air pollution values. This model is also about 5% more accurate than the LogisticRegression model.
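As an additional check beyond a single train/test split (which depends on the chosen random_state), cross-validation averages the test accuracy over several splits; a short sketch:
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated accuracy for the same tree settings; a mean close
# to the scores above again suggests the tree is not badly overfitting.
cross_val_score(DecisionTreeClassifier(max_leaf_nodes=15), df[cols], df["Country"], cv=5).mean()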
e1 = c1.mark_bar().encode(
x = "pred2",
color = "pred2",
)
e1|d1|c1
An interesting point to note is that the DecisionTreeClassifier never predicts France or Japan, yet it is still more accurate than the Logistic Regression model. Still, this model greatly exaggerates the values for Mexico and China.
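Which features the tree actually relies on can be read off its feature_importances_, and the missing countries can be confirmed the same way as for the logistic model (a sketch):
# Importance of each AQI feature in the fitted tree (the values sum to 1).
print(pd.Series(clf1.feature_importances_, index=cols))
# Countries absent from this output (such as France and Japan) are never predicted.
df["pred2"].value_counts()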
e2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred2",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred2"]
)
e3 = e2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred2"]
)
e4 = e2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred2"]
)
e2|d2|c2|e3|d3|c3|e4|d4|c4
These charts depict how the DecisionTreeClassifier predicts countries based on each type of air pollution index. We can see many differences between how clf1 predicts its countries and how reg predicts its countries. The DecisionTreeClassifier seems to section the plane off into regions, and whatever point falls into a region is predicted to be that region's country. This is best seen in the chart for Ozone AQI values.
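Those rectangular regions correspond to the leaves of the fitted tree; printing the tree's rules makes the thresholds explicit (a sketch using scikit-learn's export_text):
from sklearn.tree import export_text

# Text form of the fitted tree: each split is a threshold on one AQI feature,
# and each leaf is one of the regions visible in the charts above.
print(export_text(clf1, feature_names=cols))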
Extra: KNeighborsClassifier#
For the extra part of this project I am using the KNeighborsClassifier, as it is another machine learning model relevant to my data.
from sklearn.neighbors import KNeighborsClassifier
reg2 = KNeighborsClassifier(n_neighbors=15)
reg2.fit(df[cols],df["Country"])
KNeighborsClassifier(n_neighbors=15)
df["pred3"] = reg2.predict(df[cols])
Here I use the KNeighborsClassifier algorithm to analyze the data. I took this code from the Winter Quarter 2022 Week 6 class notes: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
reg2.score(df[cols], df["Country"])
0.6704800358905338
reg3 = KNeighborsClassifier(n_neighbors= 15)
reg3.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=15)
reg3.score(X_test, y_test)
0.6272897196261682
Our classifier scores fairly similarly on the test data and on the full data, suggesting minimal overfitting. The KNeighborsClassifier therefore gives us our most accurate model for this project, with nearly 2/3 accuracy: much better than random guessing, and considerably better than both the DecisionTreeClassifier and the LogisticRegression model.
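The choice of n_neighbors=15 was somewhat arbitrary; a short loop over candidate values of k (a sketch, scored on the same train/test split) shows how sensitive the test accuracy is to that choice:
# Test accuracy for several values of n_neighbors; the best k may differ from 15.
for k in [5, 10, 15, 25, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))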
f1 = c1.mark_bar().encode(
x = "pred3",
color = "pred3",
)
f1|e1|d1|c1
f2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred3",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred3"]
)
f3 = f2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred3"]
)
f4 = f2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred3"]
)
f2|e2|d2|c2|f3|e3|d3|c3|f4|e4|d4|c4
From our charts, we see that the KNeighborsClassifier does not leave any of the 10 countries out of its predictions, and the per-country mean AQI values for our third prediction column are the closest to those of the original data. We also see in the point charts that the KNeighborsClassifier produces much more mixed colors, rather than clustering countries into large regions the way the other models do. It classifies the data quite differently from the other models, which allows it to be more accurate.
We also see that our models seem to exaggerate the AQI values of countries that already have high AQI values while underestimating those with lower AQI values. This is less pronounced with the KNeighborsClassifier but still present.
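That tendency can be quantified by comparing each country's actual mean AQI value with the mean AQI value of the cities a model assigns to it; a sketch for the KNeighborsClassifier predictions:
# Mean AQI of the cities actually in each country versus the cities predicted
# to be in it; a larger "predicted" mean suggests the model exaggerates that
# country's pollution.
pd.concat(
    {
        "actual": df.groupby("Country")["AQI Value"].mean(),
        "predicted": df.groupby("pred3")["AQI Value"].mean(),
    },
    axis=1,
)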
Visualizing the Predictions#
Finally, I wanted to visualize the decision boundaries of the different machine learning models. I chose to visualize the Ozone AQI values, since they produced the most interesting chart of the three pollutant values and hopefully give the most interesting picture of the decision boundaries. Note that only the "AQI Value" and "Ozone AQI Value" columns of the random data are rescaled to realistic ranges; the CO and NO2 columns are left in [0, 1), so they are effectively held near zero. All the code for setting up the df_rep dataframe is taken from the Week 8 Monday lecture notes: https://christopherdavisuci.github.io/UCI-Math-10-S23/Week8/Week8-Monday.html
rng = np.random.default_rng()
arr = rng.random(size=(5000, 4))  # random values in [0, 1)
df_rep = pd.DataFrame(arr, columns=cols)
df_rep["Ozone AQI Value"] *= 210  # rescale to roughly the observed Ozone AQI range
df_rep["AQI Value"] *= 500  # rescale to roughly the observed overall AQI range
df_rep["pred"] = reg.predict(df_rep[cols])
df_rep["pred2"] = clf1.predict(df_rep[cols])
df_rep["pred3"] = reg2.predict(df_rep[cols])
g1 = alt.Chart(df_rep).mark_point().encode(
x = "AQI Value",
y = "Ozone AQI Value",
color = "pred",
tooltip = ["pred", "Ozone AQI Value", "AQI Value"]
)
g2 = g1.mark_point().encode(
color = "pred2",
tooltip = ["pred2", "Ozone AQI Value", "AQI Value"]
)
g3 = g1.mark_point().encode(
color = "pred3",
tooltip = ["pred3", "Ozone AQI Value", "AQI Value"]
)
g1|g2|g3
As we can see, the decision boundaries on the left side of each chart are very different: only Logistic Regression has clean boundaries, while the DecisionTreeClassifier and KNeighborsClassifier have much more mixed ones. On the right side, all three charts predict almost exclusively India once the AQI value passes roughly 200. While this does not completely show how each model makes its predictions, it does offer some interesting insight into how each of these models works and the differences between them.
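The claim that everything past an AQI value of roughly 200 gets labeled India can be checked on the simulated grid (a quick check; the threshold of 200 is read off the charts):
# Share of each country among the simulated points with AQI Value above 200,
# for each model's prediction column.
high = df_rep[df_rep["AQI Value"] > 200]
for col in ["pred", "pred2", "pred3"]:
    print(col)
    print(high[col].value_counts(normalize=True).head())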
Summary#
Over the course of this project we used three different machine learning models to analyze air pollution data for 10 countries and then used that analysis to predict a country from its air quality measurements. Our three models had surprisingly accurate results, with the KNeighborsClassifier being the most accurate (67%) and Logistic Regression the least accurate (50%). Each model predicted that Mexico, China, and India were the most polluted countries, which matches real data, though with some exaggerations, especially for China and Mexico.
References#
Dataset source: https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset
Other references:
KNeighborsClassifier code: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
Altair chart concatenation: https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/WenqiZhao.html
Visualizing-the-predictions code: https://christopherdavisuci.github.io/UCI-Math-10-S23/Week8/Week8-Monday.html