Brazil Forest Fires Prediction and Testing


Author: Diana Gonzalez

Course Project, UC Irvine, Math 10, S23

Introduction:#

This project has three parts. In the first part, we explore which state in Brazil reported the most forest fires. In the second part, we move on to hypothetical situations: we demonstrate how scikit-learn models (run in Deepnote) decide whether a piece of data comes from a selected state in a set of states, and we predict the date on which a given number of forest fires was reported. In the third part, we try to understand what the K-nearest neighbors regressor does, using the forest fires csv file for Brazil.

What state has the most forest fires?#

Note: The code provided here is from the Math 10 Sp23 notes.

import pandas as pd
import altair as alt
import numpy as np

There are no empty values in the dataset loaded below, so no further cleaning is needed. The csv file is from Kaggle.

df=pd.read_csv("amazon.csv")
df["date"]=pd.to_datetime(df["date"])
df.sample(5)

	year	state	month	number	date
5520	2000	Santa Catarina	Fevereiro	8.0	2000-01-01
1222	2004	Ceara	Fevereiro	5.0	2004-01-01
444	2002	Alagoas	Novembro	50.0	2002-01-01
5599	1999	Santa Catarina	Junho	0.0	1999-01-01
1368	2010	Ceara	Setembro	357.0	2010-01-01

If we try to plot the whole csv file, we first notice that we cannot use all of it, because it exceeds the number of rows Altair will plot by default (5000). To see the data visually, we plot only the first 5000 rows.

alt.Chart(df[:5000]).mark_circle().encode(
    x="date",
    y="number",
    color="state:N",
    tooltip=["number","date",]
)
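One thing to watch with a slice such as `df[:5000]`: if the rows of the file are grouped by state (an assumption here, suggested by the row indices in the tables of this notebook), taking the first rows drops whichever states are stored last. The sketch below uses a made-up toy table, not the project data, to compare a slice with a random sample of the same size.

```python
import pandas as pd

# Hypothetical toy table, sorted by state the way a grouped CSV might be.
toy = pd.DataFrame({
    "state": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "number": range(12),
})

first_rows = toy[:8]                            # slice: keeps only the first rows
random_rows = toy.sample(n=8, random_state=0)   # random sample: drawn from all rows

print(sorted(first_rows["state"].unique()))     # states present in the slice
print(sorted(random_rows["state"].unique()))    # states present in the sample
```

In the toy table the slice never contains state "C", while a random sample can draw from every state. Whether this matters for the chart above depends on how the csv file is ordered.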

df.shape

(6454, 5)

To avoid any loss of data, we will not use the sample method to get random data points. Instead, we will use the mean method to average the number of forest fires recorded in each state. For that, we will need a new DataFrame.

# Average number of fires per record in each state (numeric column only)
df_statesmean = df.groupby('state')[["number"]].mean()
amountinplace=pd.DataFrame({
    # take the labels from the groupby index so each state lines up with its own average
    "states":df_statesmean.index.to_numpy(),
    "Total fires":df_statesmean["number"]
})
amountinplace

	states	Total fires
state
Acre	Acre	77.255356
Alagoas	Alagoas	19.350000
Amapa	Amapa	91.345506
Amazonas	Amazonas	128.243218
Bahia	Bahia	187.222703
Ceara	Ceara	127.314071
Distrito Federal	Distrito Federal	14.899582
Espirito Santo	Espirito Santo	27.389121
Goias	Goias	157.721841
Maranhao	Maranhao	105.142808
Mato Grosso	Mato Grosso	201.351523
Minas Gerais	Minas Gerais	156.800243
Paraiba	Paraiba	109.698573
Pará	Pará	102.561272
Pernambuco	Pernambuco	102.502092
Piau	Piau	158.174674
Rio	Rio	62.985865
Rondonia	Rondonia	84.876272
Roraima	Roraima	102.029598
Santa Catarina	Santa Catarina	101.924067
Sao Paulo	Sao Paulo	213.896226
Sergipe	Sergipe	13.543933
Tocantins	Tocantins	141.037176

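The chart below is a visual representation of the average number of fires per record in each Brazilian state. Although the column is named "Total fires", it holds these averages. Keep in mind that the average is taken over all records, regardless of the date.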

alt.Chart(amountinplace, title="Brazil's forest fires (years 1998-2017)").mark_bar().encode(
    x="states",
    y="Total fires",
    color=alt.Color("states", scale=alt.Scale(scheme="category20b")),
    tooltip=["Total fires"]
)

From the chart above, Sao Paulo has the highest average number of recorded fires, so it is the most likely answer to our question.
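Because the ranking above is based on averages, it is worth seeing how an average and a total can rank states differently. The sketch below uses a made-up toy table (the states "X" and "Y" and their values are hypothetical) and reads off the top state with `idxmax`.

```python
import pandas as pd

# Hypothetical toy data: two states with different numbers of records.
toy = pd.DataFrame({
    "state": ["X", "X", "X", "Y", "Y"],
    "number": [10.0, 20.0, 30.0, 40.0, 10.0],
})

# One row per state, with the mean and the sum of "number".
per_state = toy.groupby("state")["number"].agg(["mean", "sum"])
print(per_state)

# idxmax returns the index label (the state) of the largest value.
top_by_mean = per_state["mean"].idxmax()
top_by_sum = per_state["sum"].idxmax()
print(top_by_mean, top_by_sum)
```

In this toy table the state with the highest average is not the state with the highest total, because the two states have different numbers of records.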

Section 2: Predicted elements for Brazilian fires#

As shown below, we will use the three states with the highest average number of fires: Sao Paulo, Mato Grosso, and Bahia. We will explore this part of the csv file using linear regression, logistic regression, and a decision tree. The target is whether a row belongs to the state with the most fires, which was Sao Paulo. We will see how strongly the number of fires in each row of df_sub relates to Sao Paulo. Then we will come up with random examples on which to use a prediction.

df_sub=df[(df["state"]=="Sao Paulo")|(df["state"]=="Mato Grosso")|(df["state"]=="Bahia")].copy()
df_sub["IsSao"]=df_sub["state"]=="Sao Paulo"

Notice that df_sub is small enough to plot in full, so no information will be lost.

df_sub.shape

(956, 8)

Our expectation is that, among the three states (Sao Paulo, Bahia, and Mato Grosso), the rows the models relate most strongly to Sao Paulo will be those of Sao Paulo itself. Judging by the chart above, we expect Mato Grosso to come next and Bahia last.

from sklearn.linear_model import LinearRegression
lreg=LinearRegression()
lreg.fit(df_sub[["number"]],df_sub["IsSao"])

from sklearn.linear_model import LogisticRegression
logreg=LogisticRegression()
logreg.fit(df_sub[["number"]],df_sub["IsSao"])
df_sub["lin_pred"]=lreg.predict(df_sub[["number"]])
# predict_proba returns one column per class, ordered as in logreg.classes_ (False, True).
# Column 0 is the probability that a row is not Sao Paulo; these are the values shown in log_pred below.
df_sub["log_pred"]=logreg.predict_proba(df_sub[["number"]])[:,0]
df_sub

	year	state	month	number	date	IsSao	lin_pred	log_pred
957	1998	Bahia	Janeiro	0.0	1998-01-01	False	0.238503	0.761266
958	1999	Bahia	Janeiro	114.0	1999-01-01	False	0.245025	0.755048
959	2000	Bahia	Janeiro	31.0	2000-01-01	False	0.240276	0.759586
960	2001	Bahia	Janeiro	24.0	2001-01-01	False	0.239876	0.759966
961	2002	Bahia	Janeiro	125.0	2002-01-01	False	0.245654	0.754442
...	...	...	...	...	...	...	...	...
5971	2012	Sao Paulo	Dezembro	64.0	2012-01-01	True	0.242164	0.757788
5972	2013	Sao Paulo	Dezembro	109.0	2013-01-01	True	0.244739	0.755323
5973	2014	Sao Paulo	Dezembro	57.0	2014-01-01	True	0.241764	0.758170
5974	2015	Sao Paulo	Dezembro	45.0	2015-01-01	True	0.241077	0.758824
5975	2016	Sao Paulo	Dezembro	47.0	2016-01-01	True	0.241192	0.758716

956 rows × 8 columns
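To make the meaning of the probability columns concrete, here is a minimal sketch on made-up data (the arrays `X` and `y` are hypothetical, not taken from df_sub). It shows that `predict_proba` returns one column per class, in the order given by `classes_`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature and a boolean target.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([False, False, False, True, True, True])

clf = LogisticRegression()
clf.fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, n_classes)
print(clf.classes_)            # column order used by predict_proba
p_false = proba[:, 0]          # probability of class False
p_true = proba[:, 1]           # probability of class True
print(p_false + p_true)        # each row sums to 1
```

When the target is boolean, selecting column 1 gives the probability of True, which is usually the quantity one wants to compare with the output of a linear regression fit on the same target.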

# Sao Paulo rows only, used for the line charts of its own predictions
df_sub2=df_sub[df_sub["state"]=="Sao Paulo"]

linear=alt.Chart(df_sub).mark_circle().encode(
    x="date",
    y="lin_pred",
    color="state:N",
    tooltip=["number", "date"]
)
log=alt.Chart(df_sub).mark_circle().encode(
    x="date",
    y="log_pred",
    color="state:N",
    tooltip=["number", "date"]
)
# the two charts below focus on the state Sao Paulo
linear1=alt.Chart(df_sub2).mark_line().encode(
    x="date",
    y="lin_pred",
    color="state",
    tooltip=["number", "date"]
)
log1=alt.Chart(df_sub2).mark_line().encode(
    x="date",
    y="log_pred",
    color="state:N",
    tooltip=["number", "date"]
)


linear+log+linear1+log1

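As expected, the chart suggests that the predictions line up most closely with Sao Paulo itself, followed by Mato Grosso and then Bahia. Note, however, that both predictions vary only slightly across rows (lin_pred stays near 0.24 and log_pred near 0.76 in the table above), so the number of fires alone separates the three states only weakly.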

Now we will generate random fire counts and have a classifier predict a date for each one. Let us use a decision tree to see what the output would be for these random samples. This kind of model is useful if one wants to predict which date is the most likely outcome for a given number of fires.

from sklearn.tree import DecisionTreeClassifier
dtc=DecisionTreeClassifier()
dtc.fit(df_sub[["number"]],df_sub["date"])

DecisionTreeClassifier()

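The warning in the output below is not caused by the random numbers themselves. It appears because the column passed to predict is named sample_fires, while the classifier was fit on a column named number; scikit-learn warns that the feature names should match those passed during fit. The predictions are still computed, and a decision tree can classify fire counts that never appeared in the number column.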

rng = np.random.default_rng()

Test_subject=pd.DataFrame({
    "sample_fires":rng.integers(0,450,size=12)
})
Test_subject["pred"]=dtc.predict(Test_subject[["sample_fires"]])
Test_subject

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py:493: FutureWarning: The feature names should match those that were passed during fit. Starting version 1.2, an error will be raised.
Feature names unseen at fit time:
- sample_fires
Feature names seen at fit time, yet now missing:
- number

  warnings.warn(message, FutureWarning)

	sample_fires	pred
0	255	2015-01-01
1	329	2017-01-01
2	183	2014-01-01
3	111	2002-01-01
4	37	1999-01-01
5	162	2006-01-01
6	398	1999-01-01
7	418	2016-01-01
8	208	2004-01-01
9	193	2012-01-01
10	100	2013-01-01
11	82	1998-01-01
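One way to avoid the feature-name warning is to give the new samples the same column name that was used during fit. The sketch below is a minimal illustration on made-up data (the `train` table and its labels are hypothetical); it renames the column just before calling `predict`.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data; "number" is the feature name seen during fit.
train = pd.DataFrame({"number": [0, 10, 20, 30], "label": ["a", "a", "b", "b"]})
tree = DecisionTreeClassifier(random_state=0)
tree.fit(train[["number"]], train["label"])

# New samples stored under a different column name.
samples = pd.DataFrame({"sample_fires": [5, 25]})

# Rename the column so it matches the name used during fit.
X_new = samples[["sample_fires"]].rename(columns={"sample_fires": "number"})
samples["pred"] = tree.predict(X_new)
print(samples)
```

The original table keeps its `sample_fires` column, so any chart built on it is unaffected; only the frame handed to `predict` is renamed.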

alt.Chart(Test_subject).mark_circle().encode(
    x="pred",
    y="sample_fires",
    color="pred",
    tooltip=["pred","sample_fires"]
)

In the chart above, we can see which date the decision tree assigns to each sampled number of fires.

Section 3: K-nearest Neighbors using the Brazil dataset#

Note: The code in this section is from the Math 10 W22 notes.

from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

X_train, X_test, y_train, y_test = train_test_split(
    df_sub[["number"]],df_sub["year"], test_size = 0.5)

Compare this to the shape of df_sub. As the Math 10 notes stated, with test_size = 0.5 the training set has half as many rows as df_sub (478 of 956).

X_train.shape

(478, 1)
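The split sizes can be checked on stand-in data. The sketch below assumes only that the input has 956 rows, like df_sub; the arrays themselves are hypothetical, and `random_state=0` is added just to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in arrays with the same number of rows as df_sub (956).
X = np.arange(956).reshape(-1, 1)
y = np.arange(956)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
print(X_tr.shape, X_te.shape)  # rows in the training and test sets
```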

# k is the number of neighbors (n_neighbors)
nreg = KNeighborsRegressor(n_neighbors=3)
nreg.fit(X_train, y_train)

KNeighborsRegressor(n_neighbors=3)

The larger the number of neighbors k, the higher the bias and the lower the variance of the model. (Math 10 W22)
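The two extremes of k illustrate this trade-off. In the sketch below, on made-up data with distinct inputs (the arrays are hypothetical), k = 1 reproduces the training targets exactly, while k equal to the number of points predicts the overall mean everywhere.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data with distinct inputs.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])

flexible = KNeighborsRegressor(n_neighbors=1).fit(X, y)     # k = 1
rigid = KNeighborsRegressor(n_neighbors=len(X)).fit(X, y)   # k = number of points

pred_flexible = flexible.predict(X)  # each point is its own nearest neighbor
pred_rigid = rigid.predict(X)        # every prediction averages all targets
print(pred_flexible)
print(pred_rigid)
```

In the project data the training error at k = 1 is not exactly zero, which would be consistent with several rows sharing the same number of fires but different years (an assumption, not checked here).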

nreg.predict(X_train)

array([2005.        , 2005.66666667, 2005.33333333, 2006.        ,
       2005.66666667, 2012.66666667, 2006.        , 2010.33333333,
       2003.66666667, 2012.33333333, 1998.        , 2002.        ,
       2006.        , 2004.        , 2012.66666667, 2007.        ,
       2004.        , 2009.33333333, 2009.66666667, 2005.33333333,
       2014.66666667, 2010.        , 2011.        , 2001.33333333,
       2009.33333333, 2010.        , 2008.        , 2003.        ,
       2012.        , 2002.66666667, 2006.33333333, 2004.66666667,
       2009.66666667, 1998.        , 2010.66666667, 2004.33333333,
       2007.66666667, 2006.        , 2004.        , 2006.66666667,
       2003.66666667, 2012.        , 2006.66666667, 2000.66666667,
       2003.        , 2007.        , 2005.66666667, 2007.66666667,
       2007.        , 2003.66666667, 2008.66666667, 2008.        ,
       2003.66666667, 2009.66666667, 2005.33333333, 2008.33333333,
       2012.        , 2006.33333333, 2008.33333333, 2005.        ,
       2002.66666667, 2002.66666667, 2007.66666667, 2006.33333333,
       2010.        , 2005.33333333, 2000.        , 2004.33333333,
       2006.33333333, 2009.33333333, 2010.        , 2014.        ,
       2011.33333333, 2005.33333333, 2002.66666667, 2010.33333333,
       2004.66666667, 2009.66666667, 2007.        , 2009.33333333,
       2013.66666667, 2007.        , 2010.66666667, 2011.        ,
       2008.33333333, 2006.        , 2007.        , 2008.33333333,
       2009.33333333, 2008.33333333, 2004.66666667, 2005.        ,
       2008.33333333, 2012.66666667, 2015.33333333, 2005.66666667,
       2008.33333333, 2009.33333333, 2008.66666667, 2003.66666667,
       2012.        , 2013.33333333, 2005.        , 2007.66666667,
       2005.33333333, 2009.66666667, 2009.33333333, 2009.66666667,
       2011.        , 2005.66666667, 2012.33333333, 2006.        ,
       2004.66666667, 2006.33333333, 1998.        , 2010.        ,
       2007.        , 2002.        , 2001.66666667, 2011.33333333,
       2002.66666667, 2008.        , 2007.33333333, 2007.66666667,
       2003.66666667, 2005.33333333, 2003.33333333, 2010.66666667,
       2005.66666667, 2008.33333333, 2008.66666667, 2008.66666667,
       2006.        , 2010.66666667, 1998.        , 2011.33333333,
       2008.        , 2008.33333333, 2008.33333333, 2011.66666667,
       2009.33333333, 2008.66666667, 2012.66666667, 2004.33333333,
       2011.66666667, 1998.        , 2010.33333333, 2011.        ,
       2003.66666667, 2011.66666667, 2009.33333333, 2005.33333333,
       2008.        , 2003.66666667, 2010.        , 2008.33333333,
       2006.        , 2007.33333333, 2012.        , 2004.66666667,
       2007.33333333, 2008.33333333, 2009.        , 2012.        ,
       2004.66666667, 2008.33333333, 2006.66666667, 2011.33333333,
       2007.66666667, 2013.66666667, 2012.66666667, 2009.        ,
       2007.        , 2006.33333333, 2009.66666667, 2005.33333333,
       2010.        , 2010.66666667, 2004.        , 2011.        ,
       2011.        , 2012.66666667, 2010.33333333, 2010.33333333,
       2011.        , 2006.33333333, 2007.        , 2006.33333333,
       2007.33333333, 1998.        , 2008.66666667, 2006.        ,
       2013.33333333, 2005.66666667, 2011.        , 2010.        ,
       2012.        , 2006.33333333, 2008.        , 2006.        ,
       2004.        , 2010.33333333, 2006.33333333, 2012.66666667,
       2008.        , 2008.        , 2005.66666667, 2010.66666667,
       2001.66666667, 1998.        , 2005.33333333, 2012.33333333,
       2005.66666667, 2011.66666667, 2010.        , 2009.33333333,
       2011.66666667, 2008.66666667, 2003.66666667, 1998.        ,
       2004.        , 2005.33333333, 2002.        , 2008.        ,
       2007.        , 2012.        , 2012.        , 2011.66666667,
       2003.        , 1998.        , 2009.        , 2009.33333333,
       2008.33333333, 2004.33333333, 2008.66666667, 2009.66666667,
       2005.66666667, 2006.66666667, 2007.66666667, 2006.66666667,
       2006.33333333, 2011.66666667, 2007.66666667, 2009.33333333,
       2008.66666667, 2006.        , 2007.        , 2012.66666667,
       2006.33333333, 2012.66666667, 2008.33333333, 2000.        ,
       2005.        , 2005.66666667, 2007.        , 2005.33333333,
       2009.        , 2004.66666667, 2011.66666667, 2006.        ,
       2008.        , 2005.33333333, 2011.        , 2013.33333333,
       2006.33333333, 2003.33333333, 2005.66666667, 2000.        ,
       2003.66666667, 2010.33333333, 2008.        , 2003.33333333,
       2010.33333333, 2009.33333333, 2004.33333333, 2002.66666667,
       2001.33333333, 2008.33333333, 2006.33333333, 2007.        ,
       2010.66666667, 2007.33333333, 2008.66666667, 2007.        ,
       2012.        , 2012.66666667, 2008.66666667, 2003.66666667,
       2007.66666667, 2007.        , 2007.66666667, 2008.        ,
       2011.        , 2006.33333333, 2010.66666667, 2006.66666667,
       2009.66666667, 2008.        , 2007.66666667, 2011.        ,
       1998.        , 2005.66666667, 2013.33333333, 2004.33333333,
       2009.        , 2001.33333333, 2003.        , 2006.        ,
       2008.66666667, 2004.        , 2000.        , 2007.        ,
       2009.        , 2004.33333333, 2014.66666667, 2008.        ,
       2012.        , 2011.        , 2004.66666667, 2009.66666667,
       2007.33333333, 2011.33333333, 2009.33333333, 2008.        ,
       2006.66666667, 2000.66666667, 2006.33333333, 2001.66666667,
       2010.66666667, 2009.33333333, 2010.        , 2000.        ,
       2006.66666667, 2003.        , 2004.66666667, 2010.        ,
       2007.66666667, 2010.        , 2005.66666667, 2012.33333333,
       2012.        , 2005.66666667, 2004.33333333, 2008.66666667,
       2008.33333333, 2005.66666667, 2011.66666667, 2006.33333333,
       2005.        , 2010.66666667, 2012.        , 2013.66666667,
       2010.33333333, 2011.66666667, 2014.33333333, 2009.33333333,
       2000.        , 2005.33333333, 2011.        , 2007.        ,
       2003.66666667, 2004.66666667, 2010.66666667, 2003.66666667,
       2007.66666667, 2004.66666667, 2010.33333333, 2004.66666667,
       2008.33333333, 2006.        , 2006.33333333, 2006.        ,
       2009.66666667, 2010.66666667, 2003.        , 2011.        ,
       2010.33333333, 2012.        , 2001.66666667, 2004.        ,
       2008.66666667, 2007.33333333, 2009.        , 2005.        ,
       2007.66666667, 2008.33333333, 2009.66666667, 2005.66666667,
       2014.66666667, 2009.33333333, 2007.33333333, 2008.        ,
       2008.        , 2006.33333333, 2001.66666667, 2008.33333333,
       2004.        , 2006.66666667, 2008.33333333, 2009.33333333,
       2007.33333333, 2009.33333333, 2011.        , 2006.33333333,
       2009.66666667, 1999.33333333, 2005.33333333, 2008.66666667,
       1998.        , 2010.33333333, 2011.        , 2007.66666667,
       2011.        , 2011.66666667, 2005.33333333, 2007.66666667,
       2004.        , 2008.        , 2001.33333333, 2001.33333333,
       2005.66666667, 2008.        , 2010.        , 2007.66666667,
       2005.66666667, 2012.33333333, 2004.66666667, 2004.66666667,
       2014.33333333, 2011.66666667, 2007.33333333, 2011.66666667,
       2009.66666667, 2008.33333333, 2007.66666667, 2011.        ,
       2008.66666667, 2011.        , 2003.66666667, 2008.        ,
       2008.33333333, 2006.33333333, 2002.66666667, 2008.        ,
       2009.33333333, 2006.33333333, 2013.66666667, 2013.66666667,
       2005.        , 2007.        , 2006.66666667, 2006.33333333,
       2015.        , 2010.        , 2004.33333333, 2008.33333333,
       2007.        , 2006.        , 2011.66666667, 2001.66666667,
       2001.33333333, 2011.        , 2005.        , 2009.33333333,
       2008.33333333, 2012.        , 1998.        , 2009.        ,
       2009.66666667, 2006.        , 2012.        , 2008.        ,
       2011.33333333, 2008.66666667, 2009.        , 2003.66666667,
       2003.66666667, 2008.33333333])

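In the calls below, the first argument holds the predicted values and the second holds the true values. The scikit-learn signature is mean_absolute_error(y_true, y_pred), but the mean absolute error is symmetric in its two arguments, so the order does not change the result.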

mean_absolute_error(nreg.predict(X_test), y_test)

5.275453277545327

mean_absolute_error(nreg.predict(X_train), y_train)

3.6387726638772597

Notice that the mean absolute error on the test set (about 5.28) is greater than on the training set (about 3.64). This gap suggests that the model is overfitting.
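For reference, the mean absolute error is the average of the absolute differences between true and predicted values. The sketch below computes it by hand on made-up years (the arrays are hypothetical) and compares the result with `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted years.
y_true = np.array([2000, 2005, 2010, 2015])
y_pred = np.array([2002.0, 2004.0, 2010.0, 2011.0])

by_hand = np.mean(np.abs(y_true - y_pred))        # average absolute difference
from_sklearn = mean_absolute_error(y_true, y_pred)
swapped = mean_absolute_error(y_pred, y_true)     # same value: |a - b| == |b - a|
print(by_hand, from_sklearn, swapped)
```

Since the target here is a year, an error of about 5 means the predictions are off by roughly five years on average.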

We combine the steps above into the function below:

def get_scores(k):
    nreg = KNeighborsRegressor(n_neighbors=k)
    nreg.fit(X_train, y_train)
    train_error = mean_absolute_error(nreg.predict(X_train), y_train)
    test_error = mean_absolute_error(nreg.predict(X_test), y_test)
    return (train_error, test_error)

df_scores = pd.DataFrame({"k":range(1,150),"train_error":np.nan,"test_error":np.nan})

The code below is an example of what the function does. It puts the training error and the test error for k=1 in the first row of df_scores.

df_scores.loc[0,["train_error","test_error"]] = get_scores(1)

df_scores.head()

	k	train_error	test_error
0	1	1.125523	6.320084
1	2	NaN	NaN
2	3	NaN	NaN
3	4	NaN	NaN
4	5	NaN	NaN

The loop below does the same for every row, putting each pair of scores in the row for its value of k.

for i in df_scores.index:
    df_scores.loc[i,["train_error","test_error"]] = get_scores(df_scores.loc[i,"k"])

As the Math 10 notes stated, we plot against 1/k so that the flexibility of the model increases as we move to the right.

df_scores["kinv"] = 1/df_scores.k

df_scores.head()

	k	train_error	test_error	kinv
0	1	1.125523	6.320084	1.000000
1	2	3.173640	5.609833	0.500000
2	3	3.638773	5.275453	0.333333
3	4	3.904812	5.118724	0.250000
4	5	4.046025	4.948954	0.200000

ctrain = alt.Chart(df_scores).mark_line().encode(
    x = "kinv",
    y = "train_error"
)
ctest = alt.Chart(df_scores).mark_line(color="orange").encode(
    x = "kinv",
    y = "test_error"
)
ctrain+ctest

We are looking for a U-shape in the test error curve to identify where the model is underfitting and where it is overfitting. Underfitting corresponds to the decreasing part of the U-shape, which here is the part before kinv=0.1. Overfitting corresponds to the part where the U-shape starts to increase, which here is at kinv=0.2 and beyond.
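Besides reading the curve by eye, the bottom of the U-shape can be located with `idxmin`. The sketch below is only an illustration: it rebuilds the first five rows of df_scores from the table shown earlier in this section, so its answer covers only k = 1 to 5, not the full range of k.

```python
import pandas as pd

# First five rows of df_scores, copied from the table shown earlier in this section.
scores = pd.DataFrame({
    "k": [1, 2, 3, 4, 5],
    "train_error": [1.125523, 3.173640, 3.638773, 3.904812, 4.046025],
    "test_error": [6.320084, 5.609833, 5.275453, 5.118724, 4.948954],
})
scores["kinv"] = 1 / scores["k"]

# Row with the smallest test error among the rows included here.
best_row = scores.loc[scores["test_error"].idxmin()]
print(best_row["k"], best_row["kinv"])
```

Applying the same line to the full df_scores would give the value of k with the lowest test error over all 149 candidates.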