Predicting Australian Cities With Weather#

Author: Ryan Harner

email: ryanharner413@gmail.com

Course Project, UC Irvine, Math 10, F22

Introduction#

In this project, I look at weather data from Australian cities and try to predict the daily maximum temperature. I use Pipeline with StandardScaler, LinearRegression, PoissonRegressor, and Lasso to understand how “MaxTemp” is affected by the other columns of this dataset.

Main Portion of the Project#

Below I have all the libraries and modules needed for my project.

import pandas as pd
import altair as alt
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

I used the weatherAUS.csv file from Kaggle.

df = pd.read_csv("weatherAUS.csv")
df[:3]
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
1 2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2 2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No

3 rows × 23 columns

The columns are for the most part self-explanatory. The temperature columns are in Celsius, “Sunshine” is measured in hours, and the other columns use metric units, such as millimeters for the “Evaporation” column. The “Location” column is filled with Australian cities. If interested, check out Reference to see the description of the columns.

df.shape
(145460, 23)

There are 145460 rows and 23 columns in df. To clean the data, I chose to drop every row that has NaN as an entry in any column.

df = df.dropna(axis=0)
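
Before dropping rows, it can help to check how many missing values each column has and how many rows survive. Here is a minimal sketch on a toy DataFrame; the `toy` frame and its values are made up for illustration and are not taken from weatherAUS.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few weather columns.
toy = pd.DataFrame({
    "MaxTemp": [22.9, 25.1, np.nan, 28.0],
    "Evaporation": [np.nan, 4.4, 7.2, 2.2],
    "Sunshine": [9.5, np.nan, 6.2, 8.7],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# dropna(axis=0) keeps only the rows with no missing value in any column.
cleaned = toy.dropna(axis=0)

print(missing_per_column)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Because a single NaN in any column removes the whole row, columns with many gaps (such as “Evaporation” and “Sunshine” in the preview above) can remove a large share of the data.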

With is_numeric_dtype and list comprehension, I’m able to find the columns which have dtypes that are numerical. I told Xlist to keep only the first 6 using slicing.

from pandas.api.types import is_numeric_dtype
Xlist = [c for c in df.columns if is_numeric_dtype(df[c])][:6]
Xlist
['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed']
Xlist.append("Location") 
Xlist.append("Date") #adds Location and Date into the list
Xlist
['MinTemp',
 'MaxTemp',
 'Rainfall',
 'Evaporation',
 'Sunshine',
 'WindGustSpeed',
 'Location',
 'Date']
df_mini = df[Xlist] #creating DataFrame with strings in Xlist as columns

I use list comprehension to list the unique locations whose names are shorter than 7 letters.

listcomp = [c for c in df_mini["Location"].unique() if len(c)<7] 
listcomp
['Cobar', 'Moree', 'Sydney', 'Sale', 'Cairns', 'Perth', 'Hobart', 'Darwin']

Boolean indexing with isin then restricts df_mini to the rows whose “Location” entry is one of the cities in listcomp.

df_mini = df_mini[df_mini["Location"].isin(listcomp)].copy()

Using dtypes, I can see that the “Date” column has strings as entries. In following steps, I will make a new column “Datetime” that has datetime values as entries. Also I will drop the “Date” column and make df_mini have 5000 random rows.

df_mini.dtypes
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustSpeed    float64
Location          object
Date              object
dtype: object
datetimed = pd.to_datetime(df_mini["Date"]).to_frame() #This is a dataframe.
df_mini["Datetime"] = datetimed["Date"] #method 1 to get series values into DataFrame
df_mini = df_mini.drop("Date",axis=1).sample(5000, random_state=82623597).copy()
df_mini[:3]
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime
120955 9.0 24.5 0.0 4.4 9.5 20.0 Perth 2009-05-14
33415 22.2 27.4 0.0 7.2 6.2 56.0 Sydney 2017-03-13
122006 17.7 23.8 40.8 2.2 8.7 39.0 Perth 2012-04-29
df_mini.shape
(5000, 8)
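
The conversion above goes through an intermediate DataFrame. As a hedged alternative, the Series returned by `pd.to_datetime` can also be assigned to the new column directly. The sketch below uses a toy frame (`toy` is a made-up name) with three of the dates shown above.

```python
import pandas as pd

# Hypothetical toy frame with string dates, as in the "Date" column.
toy = pd.DataFrame({"Date": ["2009-05-14", "2017-03-13", "2012-04-29"]})

# pd.to_datetime returns a Series that can be assigned as a column directly.
toy["Datetime"] = pd.to_datetime(toy["Date"])
toy = toy.drop(columns="Date")

print(toy.dtypes)
print(toy["Datetime"].dt.year.tolist())
```

Once the column has a datetime dtype, the `.dt` accessor gives access to parts of the date such as the year and the month.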

Graphs from Altair#

sel = alt.selection_single(fields=["Location"], bind="legend")
c1 = alt.Chart(df_mini).mark_circle().encode(
    x= "Datetime",
    y= "MaxTemp",
    color=alt.condition(sel,"Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip=["Location","Datetime","MaxTemp"]
).properties(
    title='Max Temp Data'
).add_selection(sel)
c1

The two charts below are interactive: scroll your mouse to zoom in and out, and left-click and drag to move.

sel = alt.selection_single(fields=["Location"], bind="legend")

c2 = alt.Chart(df_mini).mark_line().encode(
    x= "Datetime",
    y= "MaxTemp",
    color=alt.condition(sel,"Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(0.65), alt.value(0.1)),
    tooltip=["Location","Datetime","MaxTemp"]
).add_selection(sel).interactive()
c2
c1+c2

For these graphs I focused on how “MaxTemp” changes over time for each “Location”. Looking at the graphs, I noticed that some cities have long horizontal lines where data is missing. I talk more about this below in the caption for another graph, but to summarize, this is a result of how I cleaned the data with dropna().

Also there is definitely a pattern in these graphs. Although there isn’t a positive or negative trend over the course of the timeframe, the points seem to make a zig-zagging pattern that seems to correspond to the month/season.

Below I make a smaller dataframe consisting of only the rows which are in the year 2014. I do this because I eventually want to see how the columns affect “MaxTemp” within one year.

df_mini = df_mini[(df_mini["Datetime"].dt.year==2014)].copy()
sel = alt.selection_single(fields=["Location"], bind="legend")
c2 = alt.Chart(df_mini).mark_circle().encode(
    x= "Datetime",
    y= "MaxTemp", 
    color=alt.condition(sel, "Location", alt.value("grey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
    tooltip=["Datetime"]
).add_selection(sel)
c2

The graph’s points are widely spread out, but they trace a dip similar to a flattened-out x^2 graph.

Creating a new column “Month” in df_mini which gives the month as a number.

df_mini["Month"]=df_mini["Datetime"].dt.month.values.copy()
df_mini
#method 2 to get series values into DataFrame
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime Month
89199 13.3 26.2 0.0 6.0 9.9 33.0 Cairns 2014-08-20 8
141116 25.3 32.9 0.0 2.8 10.0 35.0 Darwin 2014-03-26 3
89177 18.1 26.6 0.8 1.2 10.4 50.0 Cairns 2014-07-29 7
122796 12.6 20.7 0.0 3.0 8.9 37.0 Perth 2014-08-26 8
62941 8.4 26.6 0.0 7.4 12.2 39.0 Sale 2014-01-23 1
... ... ... ... ... ... ... ... ... ...
141361 24.1 33.4 13.8 6.6 5.9 61.0 Darwin 2014-11-26 11
88987 23.2 30.9 0.0 5.4 8.9 28.0 Cairns 2014-01-20 1
131988 3.2 12.5 0.0 0.8 6.6 22.0 Hobart 2014-08-18 8
141295 23.3 32.7 0.0 9.2 11.0 35.0 Darwin 2014-09-21 9
13828 20.4 34.2 0.4 4.4 11.3 44.0 Moree 2014-01-25 1

593 rows × 9 columns

In the following steps, I use groupby to get the averages for each month for columns “MaxTemp” and “MinTemp”.

df_mon = df_mini.groupby("Month")[["MaxTemp","MinTemp"]].mean() #select the numeric columns before averaging
df_mon
MaxTemp MinTemp
Month
1 31.048077 19.700000
2 30.760000 20.035556
3 28.877778 19.051111
4 26.363462 16.650000
5 24.345946 14.932432
6 21.524490 12.718367
7 20.872340 10.774468
8 21.805882 11.105882
9 24.478182 13.316364
10 27.425424 16.874576
11 28.545652 18.180435
12 29.658182 19.827273
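
As a small sketch of the groupby step, the toy example below (made-up values, hypothetical `toy` and `monthly` names) averages two numeric columns per month. Selecting the columns before calling `mean()` means pandas is never asked to average a text column such as “Location”, which recent pandas versions refuse to do.

```python
import pandas as pd

# Hypothetical toy data with one text column and two numeric columns.
toy = pd.DataFrame({
    "Month": [1, 1, 7, 7],
    "Location": ["Perth", "Sydney", "Perth", "Hobart"],
    "MaxTemp": [31.0, 29.0, 18.0, 12.0],
    "MinTemp": [19.0, 21.0, 9.0, 3.0],
})

# Pick the numeric columns first, then take the mean within each month.
monthly = toy.groupby("Month")[["MaxTemp", "MinTemp"]].mean()
print(monthly)
```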
print(f''' 
min MaxTemp: {min(df_mon["MaxTemp"])},  month: {df_mon["MaxTemp"].idxmin()}
min MinTemp: {min(df_mon["MinTemp"])},  month: {df_mon["MinTemp"].idxmin()}
''')
 
min MaxTemp: 20.872340425531913,  month: 7
min MinTemp: 10.774468085106381,  month: 7

In the above steps, by using groupby on the Month column of df_mini, we can see that the averages for MaxTemp and MinTemp are both lowest in month 7, which is July. I use idxmin here because it returns the index label (the month number), whereas argmin returns the 0-based position, which would be 6 for July. July is a winter month in Australia. This makes sense because winter is typically the coldest season.
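
The difference between a positional and a label-based lookup of the minimum can be checked on the monthly “MaxTemp” averages themselves. This sketch copies the values from the df_mon output above into a Series indexed by month; the variable names are my own.

```python
import pandas as pd

# Monthly MaxTemp averages copied from the df_mon output, indexed by month 1..12.
max_temp = pd.Series(
    [31.048077, 30.760000, 28.877778, 26.363462, 24.345946, 21.524490,
     20.872340, 21.805882, 24.478182, 27.425424, 28.545652, 29.658182],
    index=pd.RangeIndex(1, 13, name="Month"),
)

position = max_temp.argmin()  # 0-based position of the smallest value
label = max_temp.idxmin()     # index label (the month number) of the smallest value

print(position, label)
```

Since the index starts at 1, the position is always one less than the month number, so the label is the value to read as a month.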

alt.Chart(df_mini,title="2014 Max Temperature (C) in Australian Cities").mark_rect().encode(
    x="Month:O",
    y="Location:O",
    color=alt.Color('MaxTemp:Q', scale=alt.Scale(scheme="inferno")),
    tooltip=[alt.Tooltip('MaxTemp:Q', title='Max Temp')]
).properties(width=550)

Note that the colors correspond to “Max Temp” in degrees Celsius. The darker colors indicate a cooler temperature, and the warmer, yellow colors indicate that it is hot.

This colorful graph does more than just look pretty. It displays the max temperatures of the cities for each month of 2014 (the year df_mini was filtered to), and it also tells a story about the data that was used. In the graph, there is missing data for the cities Sale and Hobart. This comes from taking away the rows with NaN values at the beginning of my project. From April to December, Sale has NaN values in columns such as “Evaporation”, so those rows were cut out of the df_mini dataset when I used dropna. Hobart has the same problem from January to April.

Interpreting this map, we can also see that Hobart is most likely the coldest of the cities, since it has the lowest max temperatures overall (C). However, Hobart’s picture is skewed because it is missing data from the summer. It is the southernmost of these cities (it is in Tasmania), so presumably it is cooler than the other cities in both summer and winter. Darwin is most likely the hottest, as its max temperature never drops below 27.3 degrees Celsius.

Reference: Heatmap from Altair

Standard Scaler#

I create a list, using list comprehension, of all the column names that have numeric dtypes. I then remove “MaxTemp” from the list because I want to predict “MaxTemp” from the other columns. (“Location” is not numeric, so it is not in the list.)

cols = [c for c in df_mini.columns if is_numeric_dtype(df_mini[c])]
cols.remove("MaxTemp")
cols
['MinTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'Month']
df_mini2 = df_mini.copy()

Using StandardScaler() to fit then transform df_mini[cols] so that the mean of the columns (df_mini[cols]) is 0 and the standard deviation of the columns is 1.

scaler = StandardScaler() # mean=0 and std=1
scaler.fit(df_mini[cols])
StandardScaler()
df_mini2[cols] = scaler.transform(df_mini[cols])
df_mini2
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime Month
89199 -0.442614 26.2 -0.294668 -0.028919 0.490930 -0.583096 Cairns 2014-08-20 0.382545
141116 1.448493 32.9 -0.294668 -0.765512 0.519041 -0.419105 Darwin 2014-03-26 -1.049583
89177 0.313829 26.6 -0.198020 -1.133809 0.631487 0.810832 Cairns 2014-07-29 0.096119
122796 -0.552928 20.7 -0.294668 -0.719475 0.209816 -0.255113 Perth 2014-08-26 0.382545
62941 -1.214816 26.6 -0.294668 0.293341 1.137492 -0.091122 Sale 2014-01-23 -1.622434
... ... ... ... ... ... ... ... ... ...
141361 1.259383 33.4 1.372494 0.109193 -0.633525 1.712785 Darwin 2014-11-26 1.241821
88987 1.117550 30.9 -0.294668 -0.167030 0.209816 -0.993075 Cairns 2014-01-20 -1.622434
131988 -2.034295 12.5 -0.294668 -1.225883 -0.436746 -1.485050 Hobart 2014-08-18 0.382545
141295 1.133309 32.7 -0.294668 0.707675 0.800155 -0.419105 Darwin 2014-09-21 0.668970
13828 0.676291 34.2 -0.246344 -0.397216 0.884489 0.318857 Moree 2014-01-25 -1.622434

593 rows × 9 columns

As seen below, for each column the mean is near 0 and the std is near 1.

df_mini2[cols].mean()
MinTemp         -1.198217e-16
Rainfall        -2.695988e-17
Evaporation      4.193760e-17
Sunshine        -2.516256e-16
WindGustSpeed    1.977058e-16
Month           -8.387520e-17
dtype: float64
df_mini2[cols].std()
MinTemp          1.000844
Rainfall         1.000844
Evaporation      1.000844
Sunshine         1.000844
WindGustSpeed    1.000844
Month            1.000844
dtype: float64
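
The std comes out as 1.000844 rather than exactly 1 because StandardScaler scales by the population standard deviation (ddof=0), while pandas’ `.std()` uses the sample standard deviation (ddof=1) by default. With n = 593 rows the ratio is sqrt(593/592). Here is a small sketch with made-up data (the random column `x` is only a stand-in for a real column):

```python
import numpy as np

n = 593  # number of rows in df_mini for 2014, from the output above

# Hypothetical column of n values; any non-constant data behaves the same way.
rng = np.random.default_rng(0)
x = rng.normal(loc=20.0, scale=5.0, size=n)

# Standardize the way StandardScaler does: divide by the ddof=0 standard deviation.
z = (x - x.mean()) / x.std(ddof=0)

pop_std = z.std(ddof=0)     # the quantity StandardScaler normalizes to 1
sample_std = z.std(ddof=1)  # what pandas' .std() reports by default

print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```

Calling `df_mini2[cols].std(ddof=0)` should therefore give values of 1 up to rounding error.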

Linear Regression#

Next we use LinearRegression(). We fit then predict.

reg = LinearRegression()
reg.fit(df_mini2[cols], df_mini[["MaxTemp"]])
LinearRegression()

I set a new column in df_mini2 called “pred” equal to the LinearRegression predictions for df_mini2[cols].

df_mini2["pred"] = reg.predict(df_mini2[cols])
df_mini2[:3]
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime Month pred
89199 -0.442614 26.2 -0.294668 -0.028919 0.490930 -0.583096 Cairns 2014-08-20 0.382545 25.111805
141116 1.448493 32.9 -0.294668 -0.765512 0.519041 -0.419105 Darwin 2014-03-26 -1.049583 34.467149
89177 0.313829 26.6 -0.198020 -1.133809 0.631487 0.810832 Cairns 2014-07-29 0.096119 28.064050

Below, I graph the prediction next to the actual “MaxTemp”. The predicted curve looks similar to the actual one.

c3 = alt.Chart(df_mini2).mark_line().encode(
    x= "Datetime",
    y= "pred", 
    tooltip=["Datetime"]
)
c4 = alt.Chart(df_mini2).mark_line().encode(
    x= "Datetime",
    y= "MaxTemp", 
    tooltip=["Datetime"])
c3|c4

Pipeline#

Pipeline is a more convenient way of combining StandardScaler() with any type of regression. It requires a lot less code, and the scaling step is applied automatically whenever the pipeline fits or predicts.

pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("reg", LinearRegression())
    ]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('reg', LinearRegression())])
pipe.predict(df_mini[cols])
array([25.11180492, 34.4671491 , 28.06404967, ..., 14.23509335,
       33.87461726, 32.34948915])

Three cells of code are all that is needed to use Pipeline. Below, I insert the predicted values into a column called “pred2” in df_mini2.

df_mini2["pred2"]=pipe.predict(df_mini[cols])

Also, the following line shows that the “pred” column has all the same values as the “pred2” column in df_mini2. This confirms that Pipeline did the same thing as applying StandardScaler first and then LinearRegression.

(df_mini2["pred"]==df_mini2["pred2"]).all()
True
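
The exact `==` comparison happens to return True here, but floating-point results that come from two different code paths can differ in the last few bits. A tolerance-based check such as `np.allclose` is a more forgiving alternative. The sketch below uses made-up arrays `a` and `b` to show the difference:

```python
import numpy as np

# 0.1 + 0.2 is not exactly 0.3 in floating point.
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

exact = bool((a == b).all())     # strict equality, sensitive to rounding
close = bool(np.allclose(a, b))  # equality up to a small tolerance

print(exact, close)
```

For the two prediction columns, `np.allclose(df_mini2["pred"], df_mini2["pred2"])` would be the analogous check.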

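The coefficients and intercept are listed below. “MinTemp”, “Sunshine”, and “Evaporation” are the only columns with positive coefficients, and since the inputs were standardized, they are also the three coefficients that are largest in size. Later I will fit a model on just these three columns and compare it with the model that uses all of cols.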

reg.coef_
array([[ 4.77635062, -0.1989574 ,  1.01691609,  2.24238717, -0.02565457,
        -0.70732408]])
reg.coef_.shape
(1, 6)
pd.Series(reg.coef_.reshape(-1), index=reg.feature_names_in_)
MinTemp          4.776351
Rainfall        -0.198957
Evaporation      1.016916
Sunshine         2.242387
WindGustSpeed   -0.025655
Month           -0.707324
dtype: float64
reg.intercept_
array([26.35143339])

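The score tells us how well the prediction does. For LinearRegression it is the coefficient of determination R^2, computed here on the same data the model was fit on. The closer to 1, the better.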

pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7997639618096023
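
For reference, R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch with made-up observed and predicted values (not taken from the dataset):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])  # hypothetical observed values
y_pred = np.array([2.5, 5.5, 7.5, 8.5])  # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2 = 1 - ss_res / ss_tot

print(r2)
```

A score of about 0.80 therefore means the model’s squared error is roughly a fifth of what predicting the mean “MaxTemp” for every row would give.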

PoissonRegressor#

Using PoissonRegressor() to predict “MaxTemp” with the cols list then with just “MinTemp”, “Sunshine”, and “Evaporation”.

from sklearn.linear_model import PoissonRegressor
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("pois", PoissonRegressor())
    ]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('pois', PoissonRegressor())])
df_mini2["pred3"] = pipe.predict(df_mini[cols])
df_mini2
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime Month pred pred2 pred3
89199 -0.442614 26.2 -0.294668 -0.028919 0.490930 -0.583096 Cairns 2014-08-20 0.382545 25.111805 25.111805 24.503057
141116 1.448493 32.9 -0.294668 -0.765512 0.519041 -0.419105 Darwin 2014-03-26 -1.049583 34.467149 34.467149 35.065473
89177 0.313829 26.6 -0.198020 -1.133809 0.631487 0.810832 Cairns 2014-07-29 0.096119 28.064050 28.064050 27.499847
122796 -0.552928 20.7 -0.294668 -0.719475 0.209816 -0.255113 Perth 2014-08-26 0.382545 23.243885 23.243885 22.833752
62941 -1.214816 26.6 -0.294668 0.293341 1.137492 -0.091122 Sale 2014-01-23 -1.622434 24.606599 24.606599 23.955344
... ... ... ... ... ... ... ... ... ... ... ... ...
141361 1.259383 33.4 1.372494 0.109193 -0.633525 1.712785 Darwin 2014-11-26 1.241821 29.861739 29.861739 29.567048
88987 1.117550 30.9 -0.294668 -0.167030 0.209816 -0.993075 Cairns 2014-01-20 -1.622434 33.221566 33.221566 33.371385
131988 -2.034295 12.5 -0.294668 -1.225883 -0.436746 -1.485050 Hobart 2014-08-18 0.382545 14.235093 14.235093 16.150914
141295 1.133309 32.7 -0.294668 0.707675 0.800155 -0.419105 Darwin 2014-09-21 0.668970 33.874617 33.874617 34.298310
13828 0.676291 34.2 -0.246344 -0.397216 0.884489 0.318857 Moree 2014-01-25 -1.622434 32.349489 32.349489 32.299978

593 rows × 12 columns

pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7688083582478654

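Above, you can see that the score is lower when using PoissonRegressor (about 0.769) than when I used LinearRegression (about 0.800). The two numbers are not the same metric, though: for LinearRegression the score is R^2, while for PoissonRegressor it is D^2, the fraction of Poisson deviance explained, so the comparison is only a rough one.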
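
To make the D^2 metric concrete, here is a minimal sketch of how it can be computed by hand: one minus the ratio of the model’s mean Poisson deviance to the deviance of always predicting the mean. The helper names and the toy values are my own, and the sketch assumes all targets and predictions are strictly positive.

```python
import numpy as np

def mean_poisson_deviance(y, mu):
    """Mean Poisson deviance; this sketch assumes y > 0 and mu > 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return float(np.mean(2.0 * (y * np.log(y / mu) - y + mu)))

def d2_poisson(y, mu):
    """Fraction of Poisson deviance explained, relative to predicting the mean."""
    y = np.asarray(y, dtype=float)
    null_prediction = np.full_like(y, y.mean())
    return 1.0 - mean_poisson_deviance(y, mu) / mean_poisson_deviance(y, null_prediction)

y_true = np.array([20.0, 25.0, 30.0, 35.0])  # hypothetical observed values
y_pred = np.array([21.0, 24.0, 31.0, 34.0])  # hypothetical predictions

d2 = d2_poisson(y_true, y_pred)
print(d2)
```

Like R^2, a perfect prediction gives 1 and always predicting the mean gives 0, but the errors are weighted differently, which is why the two scores should not be read on exactly the same scale.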

Below, I use fewer columns to see how that affects the predictions and the score.

pipe.fit(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('pois', PoissonRegressor())])
pipe.predict(df_mini[["MinTemp","Evaporation","Sunshine"]])
array([24.66571687, 34.20879157, 27.61003155, 23.02854555, 22.82376875,
       25.9498588 , 27.62548724, 30.86231634, 27.49623567, 22.56935588,
       18.17665532, 29.31108894, 26.72477169, 20.2894126 , 33.94097311,
       29.19967209, 23.86267716, 15.6165626 , 33.6610098 , 36.03452942,
       22.12359397, 24.41929994, 19.47195449, 25.30524483, 17.55403011,
       16.48318184, 25.45283019, 23.2471915 , 22.27727317, 25.9436926 ,
       24.73608438, 19.68028036, 28.20024264, 35.52535358, 29.35059837,
       22.68976816, 33.52483839, 30.65022025, 33.64123955, 25.38893588,
       25.70847601, 26.03312007, 31.36098189, 31.87701812, 28.25068994,
       29.18948605, 15.14589709, 21.321386  , 23.61190657, 18.00670287,
       27.45958235, 26.76752194, 17.20734116, 34.63270217, 20.79638375,
       16.52998143, 19.3793249 , 21.91294448, 27.65877381, 19.10171201,
       23.90376461, 31.52739808, 33.80621809, 27.55904414, 24.59963112,
       33.26375341, 28.93899261, 18.60671764, 32.66481136, 23.19615522,
       23.86815819, 24.99701059, 28.5419773 , 18.14958418, 32.91163951,
       30.18313862, 25.05802607, 29.78660977, 34.8340233 , 34.53908952,
       22.43344417, 21.78262526, 19.23661579, 20.1557387 , 28.87333503,
       31.3528652 , 33.29987477, 24.54894169, 29.59222955, 31.35223771,
       22.63650649, 41.84586116, 15.57594639, 28.8155143 , 22.02861081,
       23.93691197, 28.64267765, 20.86225003, 21.24725559, 25.99375654,
       19.56831318, 29.80625218, 26.53354249, 25.42424561, 29.67150144,
       34.6976942 , 32.56110404, 25.48281891, 19.56377763, 27.98962546,
       33.29572187, 23.8695088 , 17.19298447, 25.78547327, 21.74220271,
       23.09351901, 28.76209514, 34.60618956, 20.30870629, 30.45615722,
       24.29644031, 37.42727498, 17.31126091, 23.18826776, 29.85904955,
       27.9477997 , 23.84994021, 28.19624205, 16.68202922, 25.35422785,
       31.5340987 , 32.56196118, 21.71687467, 24.74457799, 21.15680341,
       27.84852062, 20.36803896, 34.97776622, 22.47433016, 27.08099845,
       20.61050835, 17.10621524, 31.03373397, 32.94783761, 34.17012542,
       28.70801245, 25.9896913 , 32.77363491, 31.40293616, 26.48657491,
       17.30414386, 26.26332485, 35.71904945, 33.19576049, 32.5871756 ,
       19.38032684, 19.13653263, 18.9962088 , 32.70839313, 19.69550568,
       27.6563358 , 25.95952678, 35.5528287 , 34.97075457, 15.1603231 ,
       26.91914909, 22.05893712, 34.5304463 , 30.27920723, 35.21860016,
       31.89073279, 30.97952738, 18.52414555, 19.22013324, 27.73901819,
       19.93719941, 32.76807984, 24.86810989, 17.36048186, 22.64178418,
       18.66272963, 22.13958079, 23.15854604, 27.41736643, 23.70682797,
       16.26330079, 27.20684946, 37.60527842, 24.33011473, 30.08477312,
       24.04876753, 30.26090414, 23.86337358, 30.16190068, 18.25954892,
       23.73685471, 33.97120862, 18.08182842, 28.28775217, 20.51684425,
       22.05887402, 24.83101973, 26.15416055, 26.47949256, 34.0393869 ,
       29.56726858, 28.00798602, 20.58761696, 26.0808929 , 23.52371977,
       28.06556712, 19.37467848, 30.54754861, 29.86881817, 17.48230529,
       23.55426373, 25.29044546, 30.94610001, 18.40916606, 27.6418737 ,
       28.44733697, 23.39550664, 16.37858385, 26.39647639, 27.91463411,
       19.83187189, 20.62207294, 28.15466091, 21.1168125 , 35.72923332,
       18.23405064, 29.60766632, 27.91436812, 23.34681718, 30.38187087,
       21.01878678, 33.98918884, 21.80204302, 32.04942506, 27.14232165,
       33.52928353, 28.64433239, 33.62065989, 30.40348826, 29.22297906,
       30.82667991, 31.68176865, 26.67584182, 21.39634593, 30.86652597,
       19.43761406, 18.73594931, 19.47957792, 18.25047882, 30.85913643,
       33.1418753 , 18.0200174 , 23.52309472, 29.69430774, 16.92886948,
       27.28986542, 32.72817657, 33.52599656, 20.83919329, 17.47251398,
       26.16951202, 32.02473746, 29.82121445, 22.94230795, 34.10953036,
       31.05840489, 27.99695492, 36.99730496, 34.9990433 , 29.22063295,
       28.12043926, 35.64948206, 34.39617812, 28.77555635, 17.26724488,
       32.76801745, 35.16164484, 21.11598241, 32.87242673, 26.9227857 ,
       28.88266081, 28.98988767, 24.68293201, 36.40462826, 30.95804967,
       25.96827327, 21.50914971, 33.585715  , 20.49657405, 34.05681901,
       26.91797874, 19.82109755, 20.39557628, 30.32222115, 23.45300687,
       28.1031733 , 25.50084426, 24.39990273, 23.68322555, 18.04539481,
       20.90149989, 35.1414622 , 24.97538126, 33.73581031, 28.37437739,
       20.32953169, 19.90232769, 26.66895978, 21.57324094, 25.34373471,
       19.04126575, 28.38713628, 35.86309556, 22.74909709, 28.49304632,
       29.14755807, 25.98879347, 23.74333911, 23.03548689, 33.95413546,
       36.54728079, 18.37886287, 31.71694663, 33.66611458, 17.99427699,
       18.95589365, 27.42762022, 46.2529031 , 19.65438072, 23.15288319,
       17.60322001, 18.38300994, 24.1494607 , 25.87988856, 31.56878169,
       26.96447149, 28.10687268, 25.76378159, 16.00234074, 37.57580584,
       22.06142307, 34.82644306, 26.97661576, 35.39723385, 14.83927324,
       27.37416194, 20.89648101, 29.08080033, 22.79157107, 31.21655104,
       30.38192871, 32.61677631, 15.9992034 , 19.12695164, 24.64165296,
       30.73256062, 23.53897131, 20.06683985, 29.51354569, 32.96122777,
       23.61069435, 20.76458973, 35.99739078, 28.30875985, 29.29553815,
       22.52681015, 27.74102411, 30.31773885, 35.71776489, 32.47905633,
       28.30465615, 33.20804096, 36.22560105, 30.59969061, 24.87092729,
       29.33958355, 29.91051868, 22.96583396, 18.10827001, 16.47230377,
       23.47592126, 28.14541928, 30.85323333, 31.41992405, 27.18953224,
       19.59697753, 22.63828086, 29.51456528, 28.18506291, 20.4819688 ,
       23.78115867, 31.31396901, 19.93543963, 29.79901695, 18.47493724,
       25.93897694, 29.53968578, 24.70292796, 34.80870409, 24.52934787,
       18.18284242, 29.9062114 , 19.2590304 , 33.57683043, 36.00047402,
       31.6226782 , 34.33889622, 25.26068915, 23.76266178, 28.56104378,
       36.60849069, 38.88848412, 29.26342585, 24.98105538, 23.90522825,
       18.19788692, 28.70892209, 23.16376329, 26.73333663, 27.17011582,
       23.89481105, 35.34386697, 29.78603142, 30.83343721, 20.1837631 ,
       30.30475834, 17.47280743, 31.97775425, 22.54528252, 31.07339611,
       19.83090334, 26.20796464, 30.21425533, 24.28061981, 37.62651371,
       36.26889583, 20.35892311, 23.40958188, 21.73585557, 24.78393606,
       19.28696143, 19.89673582, 28.34730303, 28.59307328, 21.0528157 ,
       21.71193678, 25.28926756, 20.67249572, 23.00951024, 22.49815621,
       25.95422386, 27.34322096, 30.45794587, 18.89844347, 26.85216883,
       22.26844344, 27.13738812, 28.23590351, 24.18934768, 35.49118953,
       21.19089925, 29.04480592, 27.17509728, 25.75387669, 19.92587411,
       29.98757104, 21.94341996, 32.85521344, 25.30592608, 20.87624107,
       33.1817376 , 18.75691363, 30.22122417, 32.98622862, 20.3879951 ,
       22.24114053, 33.25345683, 24.81565524, 19.01378795, 17.62037696,
       20.82159557, 35.51665399, 26.78055042, 31.3426709 , 34.39757521,
       30.45766657, 31.78454184, 20.54885862, 31.39109164, 25.60453964,
       18.34282715, 21.44311065, 32.26204022, 20.91977566, 34.52699545,
       32.99820366, 22.0010149 , 27.16970797, 17.10609503, 23.5817093 ,
       25.11453208, 24.4037193 , 30.9172605 , 26.56029869, 26.67301377,
       31.91806976, 21.49771504, 29.66163831, 20.43954982, 29.3961833 ,
       19.23665932, 21.97767614, 22.20747638, 23.18213367, 36.01184013,
       27.35640912, 30.08166971, 23.12455651, 20.64292463, 19.24644875,
       27.48325185, 28.21765569, 26.49662586, 32.09906783, 35.22567414,
       29.36899104, 24.49653279, 30.22390998, 30.30555608, 34.02390512,
       23.67378499, 31.63591041, 34.42826276, 16.84866871, 27.68330094,
       30.30980547, 26.10450578, 36.84665382, 21.78358293, 16.4145067 ,
       17.67510193, 33.3323939 , 21.90810547, 20.25052818, 26.64868024,
       33.40360309, 17.08894011, 22.58333497, 21.25462963, 26.7603103 ,
       31.17203822, 29.5152472 , 20.12477066, 20.52412865, 18.30649348,
       20.37695742, 17.22891609, 25.97278689, 28.80012338, 31.45092958,
       35.5602282 , 15.00997788, 28.47483578, 36.46971747, 29.8509232 ,
       19.13990899, 18.98732236, 33.04912696, 15.95879719, 30.27540209,
       31.70361606, 23.27589018, 27.08331799, 29.77779376, 25.15166816,
       32.77329909, 19.68225909, 19.10971866, 18.46395212, 29.85185122,
       29.68714232, 24.49708435, 26.71432878, 30.93687454, 32.02389447,
       16.27313753, 34.81713068, 30.96763578])
pipe.score(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])
0.7573496399072459

The score from Poisson Regression is about 0.05 lower than the score from Linear Regression, which suggests that Linear Regression is the better model for this data. One likely reason is that Poisson Regression assumes the variance of the target is equal to its mean, and there is no reason to expect that of “MaxTemp”. Note that Standard Scaler only sets the mean of each input column to zero; it does not rescale “MaxTemp” itself.
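The comparison above can be sketched on synthetic data. This is only an illustration under stated assumptions: the data below is randomly generated, not taken from weatherAUS.csv, and the coefficients and noise level are made up. Because `PoissonRegressor.score` reports the fraction of deviance explained (D²) rather than R², the sketch uses `r2_score` for both models so that the two numbers are on the same scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data (an assumption, not the real dataset):
# the target depends linearly on three features, plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 25 + X @ np.array([5.0, 2.5, 1.0]) + rng.normal(scale=2.0, size=500)
y = np.clip(y, 0.1, None)  # guard: PoissonRegressor needs a non-negative target


def fit_and_r2(model):
    """Fit scaler + model on (X, y) and return R^2 on the same data."""
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X, y)
    return r2_score(y, pipe.predict(X))


r2_linear = fit_and_r2(LinearRegression())
r2_poisson = fit_and_r2(PoissonRegressor())

print(f"LinearRegression R^2: {r2_linear:.3f}")
print(f"PoissonRegressor R^2: {r2_poisson:.3f}")
```

When the true relationship is linear with constant noise, as it is in this toy data, the log link and the default penalty of `PoissonRegressor` should leave it slightly behind `LinearRegression`. Whether the same explanation holds for the real data would need to be checked separately.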

Also, when fewer columns are used for training, the score goes down. This makes sense because each additional column gives the model more information about “MaxTemp”.
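A minimal sketch of the effect of adding columns, again on made-up synthetic data rather than the project's dataset. For ordinary least squares, the score on the training data can never decrease when columns are added, so the sketch also reports the score on a held-out split (using `train_test_split`), where extra columns are not guaranteed to help. The helper name `scores` and the coefficients are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): five features, the last one is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = 25 + X @ np.array([5.0, 2.5, 1.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=600)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def scores(n_cols):
    """Fit on the first n_cols columns; return (train R^2, test R^2)."""
    pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
    pipe.fit(X_train[:, :n_cols], y_train)
    train_score = pipe.score(X_train[:, :n_cols], y_train)
    test_score = pipe.score(X_test[:, :n_cols], y_test)
    return train_score, test_score


train_3, test_3 = scores(3)
train_5, test_5 = scores(5)

print(f"3 columns: train R^2 = {train_3:.4f}, test R^2 = {test_3:.4f}")
print(f"5 columns: train R^2 = {train_5:.4f}, test R^2 = {test_5:.4f}")
```

Comparing the held-out scores is one way to tell whether an added column carries real signal or only lets the model fit the training rows more closely.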

Reference

Lasso#

Next, I try Lasso to see whether it works better.

from sklearn.linear_model import Lasso
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("lasso", Lasso())
    ]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('lasso', Lasso())])
df_mini2["pred4"] = pipe.predict(df_mini[cols])
df_mini2
MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustSpeed Location Datetime Month pred pred2 pred3 pred4
89199 -0.442614 26.2 -0.294668 -0.028919 0.490930 -0.583096 Cairns 2014-08-20 0.382545 25.111805 25.111805 24.503057 25.237897
141116 1.448493 32.9 -0.294668 -0.765512 0.519041 -0.419105 Darwin 2014-03-26 -1.049583 34.467149 34.467149 35.065473 32.440049
89177 0.313829 26.6 -0.198020 -1.133809 0.631487 0.810832 Cairns 2014-07-29 0.096119 28.064050 28.064050 27.499847 27.872262
122796 -0.552928 20.7 -0.294668 -0.719475 0.209816 -0.255113 Perth 2014-08-26 0.382545 23.243885 23.243885 22.833752 24.054934
62941 -1.214816 26.6 -0.294668 0.293341 1.137492 -0.091122 Sale 2014-01-23 -1.622434 24.606599 24.606599 23.955344 23.203801
... ... ... ... ... ... ... ... ... ... ... ... ... ...
141361 1.259383 33.4 1.372494 0.109193 -0.633525 1.712785 Darwin 2014-11-26 1.241821 29.861739 29.861739 29.567048 30.572726
88987 1.117550 30.9 -0.294668 -0.167030 0.209816 -0.993075 Cairns 2014-01-20 -1.622434 33.221566 33.221566 33.371385 31.010369
131988 -2.034295 12.5 -0.294668 -1.225883 -0.436746 -1.485050 Hobart 2014-08-18 0.382545 14.235093 14.235093 16.150914 16.998344
141295 1.133309 32.7 -0.294668 0.707675 0.800155 -0.419105 Darwin 2014-09-21 0.668970 33.874617 33.874617 34.298310 32.332803
13828 0.676291 34.2 -0.246344 -0.397216 0.884489 0.318857 Moree 2014-01-25 -1.622434 32.349489 32.349489 32.299978 30.047367

593 rows × 13 columns

pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7388354956518769

Of all the regression models I tried, Lasso performed the worst at predicting “MaxTemp” using the columns from df_mini that were in the list cols. A likely reason is that Lasso() uses the default penalty alpha=1.0, which shrinks the coefficients toward zero.
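To see how much the Lasso penalty matters, the sketch below fits Lasso with a few values of `alpha` and compares the training score with plain Linear Regression. It uses synthetic data with made-up coefficients (an assumption, not the weather data), and the `alpha` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption): linear target with Gaussian noise.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = 25 + X @ np.array([5.0, 2.5, 1.0]) + rng.normal(scale=2.0, size=500)


def train_r2(model):
    """Fit scaler + model and return R^2 on the training data."""
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X, y)
    return pipe.score(X, y)


ols_r2 = train_r2(LinearRegression())
# alpha=1.0 is the scikit-learn default; smaller values penalize less.
lasso_r2 = {alpha: train_r2(Lasso(alpha=alpha)) for alpha in (1.0, 0.1, 0.01)}

print(f"LinearRegression: {ols_r2:.4f}")
for alpha, r2 in lasso_r2.items():
    print(f"Lasso(alpha={alpha}): {r2:.4f}")
```

On the training data Lasso cannot beat Linear Regression, since Linear Regression already minimizes the squared error; as `alpha` shrinks, the Lasso score approaches the Linear Regression score. The benefit of Lasso, if any, would show up on held-out data or when many columns are irrelevant.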

Reference

Summary#

In the Altair section, I displayed charts of “MaxTemp” in relation to time (“Datetime”) and the cities (“Location”). In the machine learning section, I used Standard Scaler, Pipeline, Linear Regression, Poisson Regressor, and Lasso to predict “MaxTemp”. I showed that, for my data, Linear Regression worked best and that using more columns improved both the predictions and the score.

References#


  • What is the source of your dataset(s)? Reference


