Countries and Air Quality#
Author: Seth Abuhamdeh, email: seth.abuhamdeh@gmail.com
Course Project, UC Irvine, Math 10, S23
Introduction#
My project uses a dataset containing thousands of cities and their respective air quality indexes, along with other measurements relating to air pollution and air quality. In this project, I plan to use machine learning algorithms to see how well a machine can predict a country based on its air quality information. I will compare several machine learning models to see how accurate their predictions are and try to visualize some of their predictions.
Main Section#
import pandas as pd
import altair as alt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Cleaning the Data#
df_pre = pd.read_csv("global air pollution dataset.csv")
df_pre = df_pre.dropna()
df_pre.shape
(23035, 12)
df_pre
 | Country | City | AQI Value | AQI Category | CO AQI Value | CO AQI Category | Ozone AQI Value | Ozone AQI Category | NO2 AQI Value | NO2 AQI Category | PM2.5 AQI Value | PM2.5 AQI Category |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Russian Federation | Praskoveya | 51 | Moderate | 1 | Good | 36 | Good | 0 | Good | 51 | Moderate |
1 | Brazil | Presidente Dutra | 41 | Good | 1 | Good | 5 | Good | 1 | Good | 41 | Good |
2 | Italy | Priolo Gargallo | 66 | Moderate | 1 | Good | 39 | Good | 2 | Good | 66 | Moderate |
3 | Poland | Przasnysz | 34 | Good | 1 | Good | 34 | Good | 0 | Good | 20 | Good |
4 | France | Punaauia | 22 | Good | 0 | Good | 22 | Good | 0 | Good | 6 | Good |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23458 | India | Gursahaiganj | 184 | Unhealthy | 3 | Good | 154 | Unhealthy | 2 | Good | 184 | Unhealthy |
23459 | France | Sceaux | 50 | Good | 1 | Good | 20 | Good | 5 | Good | 50 | Good |
23460 | India | Mormugao | 50 | Good | 1 | Good | 22 | Good | 1 | Good | 50 | Good |
23461 | United States of America | Westerville | 71 | Moderate | 1 | Good | 44 | Good | 2 | Good | 71 | Moderate |
23462 | Malaysia | Marang | 70 | Moderate | 1 | Good | 38 | Good | 0 | Good | 70 | Moderate |
23035 rows × 12 columns
Since there are 175 countries in this dataset, it would be very difficult for any model to accurately predict a country based on air quality without extreme overfitting. Therefore, I will reduce the dataset down to the 10 most common countries.
top_ten = df_pre["Country"].value_counts().head(10)
top_ten = top_ten.index.values
top_ten
array(['United States of America', 'India', 'Brazil', 'Germany',
'Russian Federation', 'Italy', 'France', 'China', 'Japan',
'Mexico'], dtype=object)
# .copy() avoids SettingWithCopyWarning when prediction columns are added later
df = df_pre[df_pre["Country"].isin(top_ten)].copy()
df
 | Country | City | AQI Value | AQI Category | CO AQI Value | CO AQI Category | Ozone AQI Value | Ozone AQI Category | NO2 AQI Value | NO2 AQI Category | PM2.5 AQI Value | PM2.5 AQI Category |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Russian Federation | Praskoveya | 51 | Moderate | 1 | Good | 36 | Good | 0 | Good | 51 | Moderate |
1 | Brazil | Presidente Dutra | 41 | Good | 1 | Good | 5 | Good | 1 | Good | 41 | Good |
2 | Italy | Priolo Gargallo | 66 | Moderate | 1 | Good | 39 | Good | 2 | Good | 66 | Moderate |
4 | France | Punaauia | 22 | Good | 0 | Good | 22 | Good | 0 | Good | 6 | Good |
5 | United States of America | Punta Gorda | 54 | Moderate | 1 | Good | 14 | Good | 11 | Good | 54 | Moderate |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
23456 | United States of America | Highland Springs | 54 | Moderate | 1 | Good | 34 | Good | 5 | Good | 54 | Moderate |
23458 | India | Gursahaiganj | 184 | Unhealthy | 3 | Good | 154 | Unhealthy | 2 | Good | 184 | Unhealthy |
23459 | France | Sceaux | 50 | Good | 1 | Good | 20 | Good | 5 | Good | 50 | Good |
23460 | India | Mormugao | 50 | Good | 1 | Good | 22 | Good | 1 | Good | 50 | Good |
23461 | United States of America | Westerville | 71 | Moderate | 1 | Good | 44 | Good | 2 | Good | 71 | Moderate |
13374 rows × 12 columns
Now that the dataset only contains the 10 most common countries, it should be easier for a model to give accurate predictions of a country based on its air quality. Since we are predicting a categorical (string) label, we will start with LogisticRegression as our first model.
cols = ["AQI Value", "CO AQI Value", "Ozone AQI Value", "NO2 AQI Value" ]
Since the overall AQI value incorporates the CO, Ozone, and NO2 AQI values, it is useful to make a bar chart of each country's average AQI value to see whether there is some pattern we can spot without a model. The PM2.5 AQI values appear to be equal to the overall AQI values, so they will be left out.
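That last claim can be verified with a one-line check (a quick sketch; the exact fraction depends on the dataset):
# Fraction of rows where the PM2.5 AQI value equals the overall AQI value;
# a number close to 1 supports leaving the PM2.5 column out of the features.
(df["PM2.5 AQI Value"] == df["AQI Value"]).mean()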
alt.data_transformers.disable_max_rows()
c1 = alt.Chart(df).mark_bar().encode(
x = "Country",
y = "mean(AQI Value)",
color = "Country",
tooltip = "mean(AQI Value)"
)
c2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "Country",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City"]
)
c3 = c2.mark_point().encode(
y = "Ozone AQI Value"
)
c4 = c2.mark_point().encode(
y = "NO2 AQI Value",
)
c1|c2|c3|c4
As we can see, countries like India and China have high AQI values, while countries like Russia and Germany have low AQI values. CO AQI values seem to have little effect on the overall AQI value, whereas Ozone has a strong effect. NO2 values seem to have a positive correlation as well for most countries except India; this may be due to bad data or some other unknown variable. Hopefully this will be reflected in our Logistic Regression model. I took the "c1|c2|c3|c4" syntax for placing charts side by side from a previous project, since I'm not sure it was used in class: https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/WenqiZhao.html
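The observations about country averages can also be checked numerically; a simple groupby (a small sketch, equivalent to reading the bar chart's tooltips) sorts the countries by their mean AQI value:
# Mean AQI value per country, sorted from most to least polluted on average.
df.groupby("Country")["AQI Value"].mean().sort_values(ascending=False)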
Logistic Regression#
I will start by analyzing the data with Logistic Regression, since it is the first machine learning model we talked about in class that is relevant to my data, and I believe it will be the least accurate of the models.
# Note: with the default max_iter, the lbfgs solver does not fully converge
# (see the warning below), so the scores should be read with that in mind.
reg = LogisticRegression()
reg.fit(df[cols], df["Country"])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:818: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG,
LogisticRegression()
df["pred"] = reg.predict(df[cols])
reg.score(df[cols], df["Country"])
0.4976072977418872
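Because the ten countries are not equally represented (the United States contributes by far the most cities), a fairer baseline than 10% random guessing is a model that always predicts the most frequent country. A hedged sketch using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier

# Always predict the most common country; its accuracy equals the share of the
# largest class, which is the number any real model should beat.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(df[cols], df["Country"])
baseline.score(df[cols], df["Country"])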
reg.coef_
array([[ 6.50046494e-02, 9.45639375e-02, -1.12932429e-01,
-5.05171815e-01],
[-1.94443208e-02, 5.26109952e-01, 3.94297927e-02,
1.82107958e-01],
[ 4.50445090e-03, -7.47896702e-02, -2.52123497e-04,
5.88505304e-02],
[-2.30427750e-02, -2.91911178e-01, 3.93321462e-02,
2.53735292e-01],
[ 8.36036755e-02, 1.32801467e-01, -5.39874918e-02,
-6.95648418e-01],
[ 2.68064295e-03, -1.47268176e-01, 1.77612356e-02,
7.42574780e-02],
[-1.31608557e-01, 4.27969692e-02, 1.26071037e-01,
7.38353800e-01],
[ 8.32250925e-02, -1.79764854e-01, -1.42281754e-01,
-5.51433144e-01],
[-6.99260897e-03, 2.99336877e-01, 1.66556023e-02,
-1.84493317e-01],
[-5.79302497e-02, -4.01875324e-01, 7.02039852e-02,
6.29441636e-01]])
reg.classes_
array(['Brazil', 'China', 'France', 'Germany', 'India', 'Italy', 'Japan',
'Mexico', 'Russian Federation', 'United States of America'],
dtype=object)
reg.intercept_
array([ 0.90527227, -0.96775771, 0.02062416, 0.27396248, -0.89017611,
-0.11496629, -0.04354956, 0.00957241, 0.39408518, 0.41293317])
d1 = c1.mark_bar().encode(
x = "pred",
color = "pred",
)
d1|c1
d2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred"]
)
d3 = d2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred"]
)
d4 = d2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred"]
)
d2|c2|d3|c3|d4|c4
Our Logistic Regression model predicts the right country with close to 50% accuracy, which is significantly better than randomly choosing a country (10%). The new charts show how the logistic model distributes its predictions across countries: Mexico's pollution is greatly exaggerated, other countries such as India are underestimated, and France is never predicted at all. The point charts give some sense of how the model assigns countries and where it is inaccurate.
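The claim that France is never predicted can be confirmed directly by counting how often each country appears in the prediction column (a quick check; countries missing from the output are never predicted):
# Number of cities assigned to each country by the logistic model.
df["pred"].value_counts()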
Decision Tree Classifier#
Next I'll use a DecisionTreeClassifier, since of the models we have used in class it seems best suited to this kind of data.
clf1 = DecisionTreeClassifier(max_leaf_nodes= 15)
clf1.fit(df[cols], df["Country"])
DecisionTreeClassifier(max_leaf_nodes=15)
df["pred2"] = clf1.predict(df[cols])
clf1.score(df[cols], df["Country"])
0.5508449229848961
clf2 = DecisionTreeClassifier(max_leaf_nodes= 15)
Just in case, I will check this model for overfitting.
X_train, X_test, y_train, y_test = train_test_split(df[cols], df["Country"], train_size = .6, random_state= 18)
clf2.fit(X_train, y_train)
DecisionTreeClassifier(max_leaf_nodes=15)
clf2.score(X_test, y_test)
0.5435514018691588
Since our DecisionTreeClassifier has very similar accuracy on the full data and on the held-out test data, it is very unlikely that there is much overfitting, so we can continue using this model to predict the country from its air pollution values. This model is also about 5% more accurate than the LogisticRegression model.
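As an additional check beyond a single train/test split (which depends on the chosen random_state), cross-validation averages the test accuracy over several splits; a short sketch:
from sklearn.model_selection import cross_val_score

# Five-fold cross-validated accuracy for the same tree settings; a mean close
# to the scores above again suggests the tree is not badly overfitting.
cross_val_score(DecisionTreeClassifier(max_leaf_nodes=15), df[cols], df["Country"], cv=5).mean()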
e1 = c1.mark_bar().encode(
x = "pred2",
color = "pred2",
)
e1|d1|c1
An interesting point to note is that the DecisionTreeClassifier never predicts France or Japan, yet it is still more accurate than the Logistic Regression model. Still, this model greatly exaggerates the values for Mexico and China.
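Which features the tree actually relies on can be read off its feature_importances_, and the missing countries can be confirmed the same way as for the logistic model (a sketch):
# Importance of each AQI feature in the fitted tree (the values sum to 1).
print(pd.Series(clf1.feature_importances_, index=cols))
# Countries absent from this output (such as France and Japan) are never predicted.
df["pred2"].value_counts()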
e2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred2",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred2"]
)
e3 = e2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred2"]
)
e4 = e2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred2"]
)
e2|d2|c2|e3|d3|c3|e4|d4|c4
These charts depict how the DecisionTreeClassifier predicts countries based on each type of air pollution index. We can see many differences between how clf1 predicts its countries and how reg predicts its countries. The DecisionTreeClassifier seems to section the plane off into regions, and whatever point falls into a region is predicted to be that region's country. This is best seen in the chart for Ozone AQI values.
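Those rectangular regions correspond to the leaves of the fitted tree; printing the tree's rules makes the thresholds explicit (a sketch using scikit-learn's export_text):
from sklearn.tree import export_text

# Text form of the fitted tree: each split is a threshold on one AQI feature,
# and each leaf is one of the regions visible in the charts above.
print(export_text(clf1, feature_names=cols))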
Extra: KNeighborsClassifier#
For the extra part of this project I am using the KNeighborsClassifier, as it is another machine learning model relevant to my data.
from sklearn.neighbors import KNeighborsClassifier
reg2 = KNeighborsClassifier(n_neighbors=15)
reg2.fit(df[cols],df["Country"])
KNeighborsClassifier(n_neighbors=15)
df["pred3"] = reg2.predict(df[cols])
Here I use the KNeighborsClassifier algorithm to analyze the data. I took this code from the Winter Quarter 2022 Week 6 class notes: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
reg2.score(df[cols], df["Country"])
0.6704800358905338
reg3 = KNeighborsClassifier(n_neighbors= 15)
reg3.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=15)
reg3.score(X_test, y_test)
0.6272897196261682
Our classifier scores fairly similarly on the test data and on the full data, suggesting minimal overfitting. The KNeighborsClassifier therefore gives us our most accurate model for this project, with nearly 2/3 accuracy: much better than random guessing, and considerably better than both the DecisionTreeClassifier and the LogisticRegression model.
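The choice of n_neighbors=15 was somewhat arbitrary; a short loop over candidate values of k (a sketch, scored on the same train/test split) shows how sensitive the test accuracy is to that choice:
# Test accuracy for several values of n_neighbors; the best k may differ from 15.
for k in [5, 10, 15, 25, 50]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))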
f1 = c1.mark_bar().encode(
x = "pred3",
color = "pred3",
)
f1|e1|d1|c1
f2 = alt.Chart(df).mark_point().encode(
x = "AQI Value",
y = "CO AQI Value",
color = "pred3",
tooltip = ["Country", "AQI Value", "CO AQI Value", "City","pred3"]
)
f3 = f2.mark_point().encode(
y = "Ozone AQI Value",
tooltip = ["Country", "AQI Value", "Ozone AQI Value", "City","pred3"]
)
f4 = f2.mark_point().encode(
y = "NO2 AQI Value",
tooltip = ["Country", "AQI Value", "NO2 AQI Value", "City","pred3"]
)
f2|e2|d2|c2|f3|e3|d3|c3|f4|e4|d4|c4
From our charts, we see that the KNeighborsClassifier does not leave any of the 10 countries out of its predictions, and the per-country mean AQI values for our third prediction column are the closest to those of the original data. We also see in the point charts that the KNeighborsClassifier produces much more mixed colors, rather than clustering countries into large regions the way the other models do. It classifies the data quite differently from the other models, which allows it to be more accurate.
We also see that our models seem to exaggerate the AQI values of countries that already have high AQI values while underestimating those with lower AQI values. This is less pronounced with the KNeighborsClassifier but still present.
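That tendency can be quantified by comparing each country's actual mean AQI value with the mean AQI value of the cities a model assigns to it; a sketch for the KNeighborsClassifier predictions:
# Mean AQI of the cities actually in each country versus the cities predicted
# to be in it; a larger "predicted" mean suggests the model exaggerates that
# country's pollution.
pd.concat(
    {
        "actual": df.groupby("Country")["AQI Value"].mean(),
        "predicted": df.groupby("pred3")["AQI Value"].mean(),
    },
    axis=1,
)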
Visualizing the Predictions#
Finally, I wanted to visualize the decision boundaries of the different machine learning models. I chose to visualize the Ozone AQI values, since they produced the most interesting chart of the three pollutant values and hopefully give the most interesting picture of the decision boundaries. Note that only the "AQI Value" and "Ozone AQI Value" columns of the random data are rescaled to realistic ranges; the CO and NO2 columns are left in [0, 1), so they are effectively held near zero. All the code for setting up the df_rep dataframe is taken from the Week 8 Monday lecture notes: https://christopherdavisuci.github.io/UCI-Math-10-S23/Week8/Week8-Monday.html
rng = np.random.default_rng()
arr = rng.random(size=(5000, 4))  # random values in [0, 1)
df_rep = pd.DataFrame(arr, columns=cols)
df_rep["Ozone AQI Value"] *= 210  # rescale to roughly the observed Ozone AQI range
df_rep["AQI Value"] *= 500  # rescale to roughly the observed overall AQI range
df_rep["pred"] = reg.predict(df_rep[cols])
df_rep["pred2"] = clf1.predict(df_rep[cols])
df_rep["pred3"] = reg2.predict(df_rep[cols])
g1 = alt.Chart(df_rep).mark_point().encode(
x = "AQI Value",
y = "Ozone AQI Value",
color = "pred",
tooltip = ["pred", "Ozone AQI Value", "AQI Value"]
)
g2 = g1.mark_point().encode(
color = "pred2",
tooltip = ["pred2", "Ozone AQI Value", "AQI Value"]
)
g3 = g1.mark_point().encode(
color = "pred3",
tooltip = ["pred3", "Ozone AQI Value", "AQI Value"]
)
g1|g2|g3
As we can see, the decision boundaries on the left side of each chart are very different: only Logistic Regression has clean boundaries, while the DecisionTreeClassifier and KNeighborsClassifier have much more mixed ones. On the right side, all three charts predict almost exclusively India once the AQI value passes roughly 200. While this does not completely show how each model makes its predictions, it does offer some interesting insight into how each of these models works and the differences between them.
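The claim that everything past an AQI value of roughly 200 gets labeled India can be checked on the simulated grid (a quick check; the threshold of 200 is read off the charts):
# Share of each country among the simulated points with AQI Value above 200,
# for each model's prediction column.
high = df_rep[df_rep["AQI Value"] > 200]
for col in ["pred", "pred2", "pred3"]:
    print(col)
    print(high[col].value_counts(normalize=True).head())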
Summary#
Over the course of this project we used three different machine learning models to analyze air pollution data for 10 countries and then used that analysis to predict a country from its air quality measurements. Our three models had surprisingly accurate results, with the KNeighborsClassifier being the most accurate (67%) and Logistic Regression the least accurate (50%). Each model predicted that Mexico, China, and India were the most polluted countries, which matches real data, though with some exaggerations, especially for China and Mexico.
References#
Dataset source: https://www.kaggle.com/datasets/hasibalmuzdadid/global-air-pollution-dataset
Other references:
KNeighborsClassifier code: https://christopherdavisuci.github.io/UCI-Math-10-W22/Week6/Week6-Wednesday.html
Altair chart concatenation: https://christopherdavisuci.github.io/UCI-Math-10-S22/Proj/StudentProjects/WenqiZhao.html
Visualizing-the-predictions code: https://christopherdavisuci.github.io/UCI-Math-10-S23/Week8/Week8-Monday.html