Predicting Australian Cities With Weather
Author: Ryan Harner
email: ryanharner413@gmail.com
Course Project, UC Irvine, Math 10, F22
Introduction#
In this project, I use weather data from Australian cities to predict the daily maximum temperature (“MaxTemp”) from the other columns of the dataset. I will be using Pipeline, StandardScaler, LinearRegression, PoissonRegressor, and Lasso to understand how “MaxTemp” is affected by the other parts of this dataset.
Main Portion of the Project#
Below I have all the libraries and modules needed for my project.
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
I used the weatherAUS.csv file from Kaggle.
df = pd.read_csv("weatherAUS.csv")
df[:3]
Date | Location | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2008-12-01 | Albury | 13.4 | 22.9 | 0.6 | NaN | NaN | W | 44.0 | W | ... | 71.0 | 22.0 | 1007.7 | 1007.1 | 8.0 | NaN | 16.9 | 21.8 | No | No |
1 | 2008-12-02 | Albury | 7.4 | 25.1 | 0.0 | NaN | NaN | WNW | 44.0 | NNW | ... | 44.0 | 25.0 | 1010.6 | 1007.8 | NaN | NaN | 17.2 | 24.3 | No | No |
2 | 2008-12-03 | Albury | 12.9 | 25.7 | 0.0 | NaN | NaN | WSW | 46.0 | W | ... | 38.0 | 30.0 | 1007.6 | 1008.7 | NaN | 2.0 | 21.0 | 23.2 | No | No |
3 rows × 23 columns
The columns are for the most part self-explanatory. The temperature columns are in Celsius, “Sunshine” is measured in hours, and the other columns use metric units, such as millimeters for the “Evaporation” column. The “Location” column is filled with Australian cities. If interested, check out Reference to see the description of the columns.
df.shape
(145460, 23)
There are 145460 rows and 23 columns in df. To clean the data, I chose to drop every row that has nan as an entry.
df = df.dropna(axis=0)
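As a minimal sketch of what dropna(axis=0) does, the toy DataFrame below (made-up values, not rows from weatherAUS.csv) loses every row that contains at least one missing value. Counting the missing values per column with isna().sum() beforehand shows which columns are responsible for most of the dropped rows.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values; not taken from weatherAUS.csv
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "Evaporation": [np.nan, 4.4, np.nan, 2.2],
    "Sunshine": [np.nan, 9.5, 8.7, 6.2],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the rows with no missing value in any column
toy_clean = toy.dropna(axis=0)

print(missing_per_column)
print(toy_clean.shape)
```

Because a single nan anywhere in a row removes the whole row, columns with many gaps (such as “Evaporation” and “Sunshine” in the preview above) can remove a large share of the data.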
Using is_numeric_dtype in a list comprehension, I find the columns that have numeric dtypes. Slicing keeps only the first 6 of them in Xlist.
from pandas.api.types import is_numeric_dtype
Xlist = [c for c in df.columns if is_numeric_dtype(df[c])][:6]
Xlist
['MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed']
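An alternative sketch, not the approach used in this project: select_dtypes can pick out the numeric columns without a list comprehension. The toy DataFrame copies a few values from the preview table above; for a frame like this one, both routes give the same list.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# A few values copied from the first rows of the preview table
toy = pd.DataFrame({
    "Date": ["2008-12-01", "2008-12-02"],
    "Location": ["Albury", "Albury"],
    "MinTemp": [13.4, 7.4],
    "MaxTemp": [22.9, 25.1],
    "Rainfall": [0.6, 0.0],
})

# List comprehension, as in the project
numeric_a = [c for c in toy.columns if is_numeric_dtype(toy[c])]

# Alternative: let pandas select the numeric columns
numeric_b = list(toy.select_dtypes(include="number").columns)

print(numeric_a)
print(numeric_b)
```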
Xlist.append("Location")
Xlist.append("Date") #adds Location and Date into the list
Xlist
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'Location',
'Date']
df_mini = df[Xlist] #creating DataFrame with strings in Xlist as columns
I use a list comprehension to find the unique locations whose names are fewer than 7 letters long.
listcomp = [c for c in df_mini["Location"].unique() if len(c)<7]
listcomp
['Cobar', 'Moree', 'Sydney', 'Sale', 'Cairns', 'Perth', 'Hobart', 'Darwin']
Boolean indexing lets us shorten df_mini to the rows whose “Location” entry is one of the strings in listcomp.
df_mini = df_mini[df_mini["Location"].isin(listcomp)].copy()
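The filtering step can be sketched on a toy DataFrame (made-up values). isin produces a Boolean Series, and indexing with that Series keeps only the rows where it is True.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "Location": ["Albury", "Perth", "Sydney", "Melbourne", "Perth"],
    "MaxTemp": [22.9, 24.5, 27.4, 19.0, 23.8],
})

# Unique city names with fewer than 7 letters
short_names = [c for c in toy["Location"].unique() if len(c) < 7]

# Boolean Series: True where the location is one of the short names
mask = toy["Location"].isin(short_names)

# Keep only the matching rows; .copy() gives an independent DataFrame
toy_short = toy[mask].copy()
print(toy_short)
```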
Using dtypes, I can see that the “Date” column has strings as entries. In the following steps, I make a new column “Datetime” that has datetime values as entries. I also drop the “Date” column and sample 5000 random rows from df_mini.
df_mini.dtypes
MinTemp float64
MaxTemp float64
Rainfall float64
Evaporation float64
Sunshine float64
WindGustSpeed float64
Location object
Date object
dtype: object
datetimed = pd.to_datetime(df_mini["Date"]).to_frame() #This is a dataframe.
df_mini["Datetime"] = datetimed["Date"] #method 1 to get series values into DataFrame
df_mini = df_mini.drop("Date",axis=1).sample(5000, random_state=82623597).copy()
df_mini[:3]
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | |
---|---|---|---|---|---|---|---|---|
120955 | 9.0 | 24.5 | 0.0 | 4.4 | 9.5 | 20.0 | Perth | 2009-05-14 |
33415 | 22.2 | 27.4 | 0.0 | 7.2 | 6.2 | 56.0 | Sydney | 2017-03-13 |
122006 | 17.7 | 23.8 | 40.8 | 2.2 | 8.7 | 39.0 | Perth | 2012-04-29 |
df_mini.shape
(5000, 8)
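A slightly shorter route to the same result, offered as a sketch rather than a correction: the Series returned by pd.to_datetime can be assigned to a new column directly, without going through to_frame. The .dt accessor then exposes parts of each date, such as the year. The dates and temperatures are copied from the three rows shown above.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["2009-05-14", "2017-03-13", "2012-04-29"],
    "MaxTemp": [24.5, 27.4, 23.8],
})

# Parse the strings and assign the resulting Series directly
toy["Datetime"] = pd.to_datetime(toy["Date"])
toy = toy.drop("Date", axis=1)

print(toy.dtypes)
print(toy["Datetime"].dt.year.tolist())
```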
Graphs from Altair#
sel = alt.selection_single(fields=["Location"], bind="legend")
c1 = alt.Chart(df_mini).mark_circle().encode(
x= "Datetime",
y= "MaxTemp",
color=alt.condition(sel,"Location", alt.value("grey")),
opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
tooltip=["Location","Datetime","MaxTemp"]
).properties(
title='Max Temp Data'
).add_selection(sel)
c1
The two charts below are interactive: scroll to zoom in and out, and left-click and drag to pan.
sel = alt.selection_single(fields=["Location"], bind="legend")
c2 = alt.Chart(df_mini).mark_line().encode(
x= "Datetime",
y= "MaxTemp",
color=alt.condition(sel,"Location", alt.value("grey")),
opacity=alt.condition(sel, alt.value(0.65), alt.value(0.1)),
tooltip=["Location","Datetime","MaxTemp"]
).add_selection(sel).interactive()
c2
c1+c2
For these graphs I focused on how “MaxTemp” changes over time for each “Location”. Looking at the graphs, I noticed that some cities have long horizontal lines caused by a lack of data. I talk more about this below in the discussion of another graph, but to summarize, this is a result of how I cleaned the data with dropna().
Also there is definitely a pattern in these graphs. Although there isn’t a positive or negative trend over the course of the timeframe, the points seem to make a zig-zagging pattern that seems to correspond to the month/season.
Below I make a smaller dataframe consisting of only the rows which are in the year 2014. I do this because I eventually want to see how the columns affect “MaxTemp” within one year.
df_mini = df_mini[(df_mini["Datetime"].dt.year==2014)].copy()
sel = alt.selection_single(fields=["Location"], bind="legend")
c2 = alt.Chart(df_mini).mark_circle().encode(
x= "Datetime",
y= "MaxTemp",
color=alt.condition(sel, "Location", alt.value("grey")),
opacity=alt.condition(sel, alt.value(1), alt.value(0.1)),
tooltip=["Datetime"]
).add_selection(sel)
c2
The graph’s points are widely spread out; however, they make a dipping pattern similar to a flattened-out x^2 graph.
I create a new column “Month” in df_mini that holds the month number of each date.
df_mini["Month"]=df_mini["Datetime"].dt.month.values.copy()
df_mini
#method 2 to get series values into DataFrame
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | Month | |
---|---|---|---|---|---|---|---|---|---|
89199 | 13.3 | 26.2 | 0.0 | 6.0 | 9.9 | 33.0 | Cairns | 2014-08-20 | 8 |
141116 | 25.3 | 32.9 | 0.0 | 2.8 | 10.0 | 35.0 | Darwin | 2014-03-26 | 3 |
89177 | 18.1 | 26.6 | 0.8 | 1.2 | 10.4 | 50.0 | Cairns | 2014-07-29 | 7 |
122796 | 12.6 | 20.7 | 0.0 | 3.0 | 8.9 | 37.0 | Perth | 2014-08-26 | 8 |
62941 | 8.4 | 26.6 | 0.0 | 7.4 | 12.2 | 39.0 | Sale | 2014-01-23 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
141361 | 24.1 | 33.4 | 13.8 | 6.6 | 5.9 | 61.0 | Darwin | 2014-11-26 | 11 |
88987 | 23.2 | 30.9 | 0.0 | 5.4 | 8.9 | 28.0 | Cairns | 2014-01-20 | 1 |
131988 | 3.2 | 12.5 | 0.0 | 0.8 | 6.6 | 22.0 | Hobart | 2014-08-18 | 8 |
141295 | 23.3 | 32.7 | 0.0 | 9.2 | 11.0 | 35.0 | Darwin | 2014-09-21 | 9 |
13828 | 20.4 | 34.2 | 0.4 | 4.4 | 11.3 | 44.0 | Moree | 2014-01-25 | 1 |
593 rows × 9 columns
In the following steps, I use groupby to get the averages for each month for columns “MaxTemp” and “MinTemp”.
df_mon = df_mini.groupby("Month")[["MaxTemp","MinTemp"]].mean()
df_mon
MaxTemp | MinTemp | |
---|---|---|
Month | ||
1 | 31.048077 | 19.700000 |
2 | 30.760000 | 20.035556 |
3 | 28.877778 | 19.051111 |
4 | 26.363462 | 16.650000 |
5 | 24.345946 | 14.932432 |
6 | 21.524490 | 12.718367 |
7 | 20.872340 | 10.774468 |
8 | 21.805882 | 11.105882 |
9 | 24.478182 | 13.316364 |
10 | 27.425424 | 16.874576 |
11 | 28.545652 | 18.180435 |
12 | 29.658182 | 19.827273 |
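The same monthly averaging can be sketched on a toy DataFrame (made-up values). Selecting the numeric columns before calling mean() avoids asking pandas to average a text column such as “Location”, which newer pandas versions (2.0 and later, as far as I know) reject with an error.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "Month": [1, 1, 7, 7],
    "Location": ["Perth", "Darwin", "Perth", "Hobart"],
    "MaxTemp": [30.0, 34.0, 18.0, 12.0],
    "MinTemp": [18.0, 25.0, 8.0, 4.0],
})

# Select the numeric columns first, then average within each month
toy_mon = toy.groupby("Month")[["MaxTemp", "MinTemp"]].mean()
print(toy_mon)
```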
# idxmin returns the index label (the month number); argmin would return the zero-based position
print(f'''
min MaxTemp: {min(df_mon["MaxTemp"])}, month: {df_mon["MaxTemp"].idxmin()}
min MinTemp: {min(df_mon["MinTemp"])}, month: {df_mon["MinTemp"].idxmin()}
''')
min MaxTemp: 20.872340425531913, month: 7
min MinTemp: 10.774468085106381, month: 7
In the above steps, by using groupby on the “Month” column of df_mini, we can see that the averages of both MaxTemp and MinTemp are lowest in month 7, which is July. July is a winter month in Australia, so this makes sense because winter is typically the coldest season.
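When reading a month off a Series indexed by month number, it helps to distinguish position from label. In this sketch, which uses the MaxTemp averages from df_mon rounded to two decimals, argmin returns the zero-based position of the smallest value, while idxmin returns its index label, that is, the month number itself.

```python
import pandas as pd

# Monthly MaxTemp averages rounded from the df_mon table, indexed by month number
max_temp = pd.Series(
    [31.05, 30.76, 28.88, 26.36, 24.35, 21.52,
     20.87, 21.81, 24.48, 27.43, 28.55, 29.66],
    index=range(1, 13),
    name="MaxTemp",
)

position = max_temp.argmin()  # zero-based position of the minimum
label = max_temp.idxmin()     # index label (month number) of the minimum

print(position, label)
```

Position 6 is the seventh entry, so both calls point at the same row: month 7, July.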
alt.Chart(df_mini,title="2014 Max Temperature (C) in Australian Cities").mark_rect().encode(
x="Month:O",
y="Location:O",
color=alt.Color('MaxTemp:Q', scale=alt.Scale(scheme="inferno")),
tooltip=[alt.Tooltip('MaxTemp:Q', title='Max Temp')]
).properties(width=550)
Note that the colors correspond to “Max Temp” in degrees Celsius. The darker colors indicate a cooler temperature, and the warmer, yellow colors indicate that it is hot.
This colorful graph does more than just look pretty. It not only displays the temperatures of cities for each month in 2014 (the year df_mini was filtered to), but it also tells a story about the data that was used. In the graph, there’s missing data for the cities Sale and Hobart. This comes from dropping the rows with nan values at the beginning of my project. From April to December, Sale has nan values in columns such as “Evaporation”, so those rows were cut out of the dataset when I used dropna. Hobart has similar gaps from January to April.
Interpreting this map, we can also see that Hobart is most likely the coldest of the cities, since it has the lowest overall max temperature (C). However, Hobart’s result is skewed because it is missing data from the summer. It is the southernmost city (it is in Tasmania), so presumably it is cooler than the other cities in both summer and winter. Darwin would most likely be the hottest, as its max temperature never drops below 27.3 degrees Celsius.
Reference: Heatmap from Altair
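One way to check gaps like these numerically, sketched on made-up values rather than the project data: a pivot table of counts per city and month shows 0 wherever a city has no rows left in a given month.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "Location": ["Sale", "Sale", "Hobart", "Darwin", "Darwin", "Darwin"],
    "Month": [1, 2, 5, 1, 2, 5],
    "MaxTemp": [26.6, 24.0, 14.1, 33.0, 32.5, 31.8],
})

# Rows = cities, columns = months, cells = number of observations
counts = toy.pivot_table(
    index="Location",
    columns="Month",
    values="MaxTemp",
    aggfunc="count",
    fill_value=0,
)
print(counts)
```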
Standard Scaler#
I use a list comprehension to create a list of all the column names that have numeric dtypes. I then remove “MaxTemp” from the list because I want to predict “MaxTemp” from the other columns.
cols = [c for c in df_mini.columns if is_numeric_dtype(df_mini[c])]
cols.remove("MaxTemp")
cols
['MinTemp', 'Rainfall', 'Evaporation', 'Sunshine', 'WindGustSpeed', 'Month']
df_mini2 = df_mini.copy()
I use StandardScaler() to fit and then transform df_mini[cols] so that each column has mean 0 and standard deviation 1.
scaler = StandardScaler() # mean=0 and std=1
scaler.fit(df_mini[cols])
StandardScaler()
df_mini2[cols] = scaler.transform(df_mini[cols])
df_mini2
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | Month | |
---|---|---|---|---|---|---|---|---|---|
89199 | -0.442614 | 26.2 | -0.294668 | -0.028919 | 0.490930 | -0.583096 | Cairns | 2014-08-20 | 0.382545 |
141116 | 1.448493 | 32.9 | -0.294668 | -0.765512 | 0.519041 | -0.419105 | Darwin | 2014-03-26 | -1.049583 |
89177 | 0.313829 | 26.6 | -0.198020 | -1.133809 | 0.631487 | 0.810832 | Cairns | 2014-07-29 | 0.096119 |
122796 | -0.552928 | 20.7 | -0.294668 | -0.719475 | 0.209816 | -0.255113 | Perth | 2014-08-26 | 0.382545 |
62941 | -1.214816 | 26.6 | -0.294668 | 0.293341 | 1.137492 | -0.091122 | Sale | 2014-01-23 | -1.622434 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
141361 | 1.259383 | 33.4 | 1.372494 | 0.109193 | -0.633525 | 1.712785 | Darwin | 2014-11-26 | 1.241821 |
88987 | 1.117550 | 30.9 | -0.294668 | -0.167030 | 0.209816 | -0.993075 | Cairns | 2014-01-20 | -1.622434 |
131988 | -2.034295 | 12.5 | -0.294668 | -1.225883 | -0.436746 | -1.485050 | Hobart | 2014-08-18 | 0.382545 |
141295 | 1.133309 | 32.7 | -0.294668 | 0.707675 | 0.800155 | -0.419105 | Darwin | 2014-09-21 | 0.668970 |
13828 | 0.676291 | 34.2 | -0.246344 | -0.397216 | 0.884489 | 0.318857 | Moree | 2014-01-25 | -1.622434 |
593 rows × 9 columns
As seen below, for each column the mean is near 0 and the std is near 1.
df_mini2[cols].mean()
MinTemp -1.198217e-16
Rainfall -2.695988e-17
Evaporation 4.193760e-17
Sunshine -2.516256e-16
WindGustSpeed 1.977058e-16
Month -8.387520e-17
dtype: float64
df_mini2[cols].std()
MinTemp 1.000844
Rainfall 1.000844
Evaporation 1.000844
Sunshine 1.000844
WindGustSpeed 1.000844
Month 1.000844
dtype: float64
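The value 1.000844 is not rounding noise. StandardScaler divides by the population standard deviation (ddof=0), while the pandas std method defaults to the sample standard deviation (ddof=1), so with n = 593 rows the two differ by a factor of sqrt(n / (n - 1)). A small numeric check, using randomly generated numbers in place of the project data:

```python
import numpy as np
import pandas as pd

n = 593  # number of rows in df_mini after filtering to 2014

# Random stand-in data; not the weather data
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(loc=20.0, scale=5.0, size=n))

# Standardize the way StandardScaler does: divide by the population std (ddof=0)
z = (x - x.mean()) / x.std(ddof=0)

print(z.std(ddof=0))          # population std of the scaled values
print(z.std())                # pandas default (ddof=1)
print(np.sqrt(n / (n - 1)))   # predicted ratio between the two
```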
Linear Regression#
Next we use LinearRegression(). We fit then predict.
reg = LinearRegression()
reg.fit(df_mini2[cols], df_mini[["MaxTemp"]])
LinearRegression()
I set a new column “pred” in df_mini2 equal to the LinearRegression predictions for df_mini2[cols].
df_mini2["pred"] = reg.predict(df_mini2[cols])
df_mini2[:3]
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | Month | pred | |
---|---|---|---|---|---|---|---|---|---|---|
89199 | -0.442614 | 26.2 | -0.294668 | -0.028919 | 0.490930 | -0.583096 | Cairns | 2014-08-20 | 0.382545 | 25.111805 |
141116 | 1.448493 | 32.9 | -0.294668 | -0.765512 | 0.519041 | -0.419105 | Darwin | 2014-03-26 | -1.049583 | 34.467149 |
89177 | 0.313829 | 26.6 | -0.198020 | -1.133809 | 0.631487 | 0.810832 | Cairns | 2014-07-29 | 0.096119 | 28.064050 |
Below, I graph both the prediction and “MaxTemp” side by side. The prediction chart looks similar to the chart for “MaxTemp”.
c3 = alt.Chart(df_mini2).mark_line().encode(
x= "Datetime",
y= "pred",
tooltip=["Datetime"]
)
c4 = alt.Chart(df_mini2).mark_line().encode(
x= "Datetime",
y= "MaxTemp",
tooltip=["Datetime"])
c3|c4
Pipeline#
Pipeline is a more convenient way of combining StandardScaler() with any type of regression. It requires a lot less code.
pipe = Pipeline(
[
("scaler", StandardScaler()),
("reg", LinearRegression())
]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('reg', LinearRegression())])
pipe.predict(df_mini[cols])
array([25.11180492, 34.4671491 , 28.06404967, 23.24388546, 24.60659851,
25.63241793, 28.47856407, 31.25363106, 27.95678221, 23.28560335,
16.99981648, 30.66529502, 26.41665912, 21.35954964, 32.44412481,
30.99253536, 23.88620124, 13.1172668 , 33.92676117, 34.1584419 ,
23.15595835, 26.38314459, 19.29818682, 25.69450172, 17.55670647,
14.71606152, 25.82657478, 23.22298781, 22.78431273, 27.17429432,
24.60619193, 18.96278159, 28.01657322, 35.8924847 , 30.51995875,
22.49907529, 31.51448768, 32.21935974, 32.7180799 , 25.91273469,
26.33792628, 26.12468334, 32.36652342, 32.29533213, 28.89832845,
30.11857077, 12.50780271, 20.07715154, 23.33436236, 17.48145606,
27.44350297, 26.72049109, 15.93125125, 34.66040528, 20.37139474,
14.48894165, 19.62557181, 22.0144519 , 29.55341072, 18.4840181 ,
24.09524684, 32.49921767, 34.61680769, 27.4543947 , 24.46048158,
32.21068784, 28.73042989, 17.69850502, 31.54618215, 24.31677757,
24.0602949 , 26.46847828, 29.01168165, 17.2955666 , 32.43611455,
29.6425662 , 26.31830925, 31.03116205, 35.17697833, 34.59641022,
23.25060345, 22.1202053 , 19.01975535, 19.86471064, 30.55624 ,
30.38505465, 33.95562126, 24.54023987, 29.86794888, 31.12618648,
21.5462229 , 37.51495769, 13.49796123, 28.4498155 , 22.9314973 ,
25.05120386, 28.27250936, 20.92199315, 21.31746447, 26.19817002,
19.93333077, 29.81820146, 26.45953229, 27.00935926, 30.97614295,
33.55390455, 32.06860846, 26.72109612, 19.38896467, 28.56322159,
33.31658393, 24.99565337, 15.58620344, 25.75867106, 22.30727015,
23.52294871, 28.69251324, 33.20515248, 19.87554022, 30.24991998,
23.89921331, 35.56569054, 15.50188351, 23.29846499, 31.17681291,
28.51408079, 25.0002229 , 28.79128326, 15.12662227, 26.17571564,
31.23478754, 33.63521772, 21.54449372, 25.08678032, 21.36404516,
29.381258 , 20.84051676, 35.78587307, 22.54862144, 26.72097764,
20.05803291, 15.55375431, 31.79515784, 31.14796806, 32.51152105,
28.5464485 , 25.72064748, 31.4667868 , 32.53790344, 27.16421269,
15.87382383, 26.58307598, 36.24172134, 31.99889982, 31.68952746,
19.11157066, 17.86177782, 18.56735103, 31.84883556, 19.26299396,
27.27323416, 26.46172708, 34.16724123, 35.69404038, 12.54307335,
28.59851643, 22.73523048, 33.25333571, 31.04451987, 33.53110161,
31.86753572, 30.92871858, 18.04198024, 19.22020833, 27.95051493,
19.98903282, 33.80563341, 26.68688784, 16.63493653, 22.60998233,
18.13139726, 21.82709379, 23.25648437, 28.87922459, 23.85444992,
14.43663339, 27.8902302 , 35.79925594, 24.08917403, 30.63461155,
24.63797229, 30.63707287, 25.02771895, 31.15011531, 17.15354581,
24.56661908, 32.83396215, 16.73847074, 28.17104819, 21.04243351,
22.00303404, 25.12215568, 26.18266841, 27.75923867, 32.8143378 ,
29.95084976, 28.95037015, 20.6086872 , 25.57805433, 23.59310116,
28.50696607, 18.91132176, 30.07519396, 31.11156072, 15.78198069,
24.19307921, 25.2835486 , 32.03862172, 17.56034406, 28.81633259,
29.64033077, 24.11939815, 14.68794046, 26.42168028, 27.68616336,
19.21290045, 21.0359949 , 29.15735257, 20.11765136, 35.45629543,
17.90912582, 29.49145124, 28.27609045, 23.44047755, 30.10711795,
20.67961755, 32.97360432, 22.3425101 , 31.44949447, 27.96763161,
33.79160683, 28.66594875, 33.73075461, 30.09855848, 28.92781511,
31.72098198, 31.26509922, 27.64192978, 21.60535659, 31.09036079,
18.7407589 , 17.40609844, 18.75162016, 17.27303315, 31.53967789,
32.35544762, 16.94144383, 23.94622402, 29.24633187, 15.47827692,
29.1914118 , 32.23675043, 34.43120192, 21.1361934 , 16.35349618,
27.50917597, 31.51929801, 30.65348665, 23.92965076, 32.5883252 ,
32.37073637, 29.18107823, 36.75911599, 33.77944163, 28.64912505,
28.78701546, 35.37153341, 32.86601385, 29.89941952, 16.04982427,
33.46558432, 35.6611221 , 20.9486018 , 32.20750591, 27.36295722,
30.0338853 , 28.69445741, 26.34002414, 34.30261777, 32.56643795,
25.54589797, 22.08838678, 33.63062182, 20.46889068, 32.98990307,
26.85261323, 19.42266162, 19.83479623, 30.62382459, 24.28296039,
28.66393393, 27.21732682, 25.54598952, 24.79733111, 16.67746479,
20.98635538, 33.33693891, 26.1038826 , 34.64901957, 29.73702348,
19.75135259, 19.37352898, 26.43060103, 22.81805857, 25.32930527,
17.84173545, 27.88588136, 34.39368424, 23.62355837, 29.86328685,
29.23406453, 27.73234107, 23.90041807, 23.73555367, 33.29839666,
34.502106 , 17.27807252, 30.62623161, 33.75045091, 17.1305301 ,
17.27046229, 29.05325388, 41.51310144, 19.3529964 , 23.170393 ,
15.47576803, 17.65867973, 25.53892852, 26.31041664, 32.39042914,
27.3778165 , 28.52073882, 27.02328929, 14.11990926, 35.40827672,
22.57497544, 33.13477644, 28.01799341, 33.75786915, 11.01114451,
27.61872403, 20.27267667, 29.75489197, 22.36246598, 30.95941108,
32.03788235, 32.60248591, 13.62690516, 17.91284799, 25.78553377,
30.62950945, 25.29017444, 19.595173 , 30.54154161, 32.06380868,
25.2827903 , 20.2743339 , 34.22763357, 29.00049798, 28.80182176,
22.89092094, 28.52970602, 32.00447739, 34.09916128, 32.98042225,
27.89675602, 34.1246218 , 34.2427634 , 32.26208747, 26.24184336,
29.30563627, 31.33430142, 23.81177383, 16.19866116, 14.54920541,
23.51981587, 29.94970781, 30.20772764, 31.92670512, 27.91355825,
19.22723923, 22.46242461, 30.64865111, 28.58361038, 21.09873926,
23.68287303, 31.28180122, 19.19424837, 31.28447732, 17.88164046,
25.91068954, 31.3306721 , 24.60644268, 33.5961115 , 26.44699752,
17.67463355, 30.29774755, 18.67709832, 33.64001112, 34.07279167,
32.59016246, 34.20684925, 25.23713705, 25.20596459, 28.90533059,
36.90434198, 38.31038392, 30.62272335, 23.83896509, 25.10021776,
16.88299869, 29.07384239, 24.02271516, 27.01225084, 28.37185455,
23.4606484 , 36.03169344, 31.09883818, 32.22033656, 19.63154693,
30.55368845, 16.27216252, 31.61636728, 23.54319878, 31.38147131,
19.50959045, 27.75886735, 31.46628158, 24.53258902, 35.44291392,
36.45656129, 20.04274663, 23.59952818, 21.02642569, 26.05857586,
19.52620328, 19.64706842, 28.31800425, 28.15303791, 21.59706892,
20.7975672 , 25.83521 , 20.42688312, 24.33980634, 23.23241659,
25.7139883 , 28.83663828, 29.7234653 , 18.76950522, 28.22157739,
22.21627545, 26.89610466, 28.52914916, 24.31462299, 33.6608422 ,
20.43756566, 30.27161846, 28.46968796, 27.49561941, 19.01033814,
30.39597284, 22.28076335, 34.04714761, 26.57082675, 20.4205012 ,
31.78946556, 19.37771145, 30.77548429, 33.99907836, 20.11635434,
21.98597322, 31.87151843, 24.48336791, 18.34893942, 16.64909184,
21.26639534, 33.71939773, 28.410873 , 31.7763605 , 33.02125944,
31.82805817, 32.66150491, 20.71590324, 32.7432846 , 25.47300834,
17.14866059, 20.92946607, 32.8114294 , 20.50017301, 33.00249105,
31.74648939, 23.38702454, 27.97709893, 15.76549988, 24.20102754,
25.6605807 , 24.36277905, 30.56094318, 28.10877757, 26.45993606,
33.30013724, 22.06968064, 29.66689117, 19.94308618, 30.29691019,
19.00693442, 21.49496209, 21.69152084, 22.87170952, 36.27796687,
28.82786582, 30.05992794, 24.41326856, 20.85692279, 19.05595796,
27.64256588, 29.48771376, 27.1505203 , 31.49242119, 35.73126212,
29.18299342, 25.72955516, 31.19491922, 32.11609058, 34.36585424,
24.17552131, 31.35686228, 34.71252306, 15.63893479, 28.48565782,
31.77274912, 25.90148252, 36.87689064, 21.37330951, 14.64507899,
17.21821545, 32.56143718, 22.66182045, 20.12673531, 28.06245424,
34.25343073, 15.11787108, 23.18539271, 22.06750746, 27.64423009,
30.99394714, 29.14712011, 19.2388886 , 20.7018771 , 17.68412965,
19.72170047, 15.88490034, 25.73038243, 28.80246812, 32.45534008,
34.17584102, 12.39585257, 30.01765158, 36.50061802, 29.87710833,
18.79262579, 17.74490389, 32.04353705, 14.07513479, 31.76931483,
31.0714712 , 24.22021367, 27.57280333, 29.41945333, 25.63401521,
32.12191343, 19.06143557, 18.74588025, 16.54484715, 29.42563917,
30.81501831, 25.65206999, 26.16566663, 29.86173851, 33.22156564,
14.23509335, 33.87461726, 32.34948915])
Three cells of code are all that is needed to use Pipeline. Below, I insert the predicted values into a column called “pred2” in df_mini2.
df_mini2["pred2"]=pipe.predict(df_mini[cols])
Also, the following line shows that the “pred” column has exactly the same values as the “pred2” column in df_mini2. This confirms that Pipeline did the same thing as applying StandardScaler first and then LinearRegression.
(df_mini2["pred"]==df_mini2["pred2"]).all()
True
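The same equivalence can be sketched on synthetic data (randomly generated, not the weather data). np.allclose is used for the comparison here because floating-point results from two code paths are not always bit-for-bit identical, even though they were in the cell above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target; not the weather data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 25 + 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100)

# Manual route: scale first, then fit the regression on the scaled features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
reg = LinearRegression().fit(X_scaled, y)
pred_manual = reg.predict(X_scaled)

# Pipeline route: both steps wrapped in one estimator
pipe = Pipeline([("scaler", StandardScaler()), ("reg", LinearRegression())])
pred_pipe = pipe.fit(X, y).predict(X)

print(np.allclose(pred_manual, pred_pipe))
```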
The coefficients and intercept are listed below. “MinTemp”, “Sunshine”, and “Evaporation” all have positive coefficients, so later I fit a model on just these three columns and compare how well it predicts against the model that uses all of cols.
reg.coef_
array([[ 4.77635062, -0.1989574 , 1.01691609, 2.24238717, -0.02565457,
-0.70732408]])
reg.coef_.shape
(1, 6)
pd.Series(reg.coef_.reshape(-1), index=reg.feature_names_in_)
MinTemp 4.776351
Rainfall -0.198957
Evaporation 1.016916
Sunshine 2.242387
WindGustSpeed -0.025655
Month -0.707324
dtype: float64
reg.intercept_
array([26.35143339])
The score tells us how well the predictions fit the data; the closer it is to 1, the better.
pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7997639618096023
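For a regressor such as LinearRegression, score returns the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A minimal sketch of that formula, using the three actual and predicted values from the first rows of df_mini2 purely as an illustration (so the result is not the score of the full model):

```python
import numpy as np

# MaxTemp and pred from the first three rows of df_mini2
y_true = np.array([26.2, 32.9, 26.6])
y_pred = np.array([25.111805, 34.467149, 28.064050])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print(r2)
```

A value of 1 would mean the predictions match the data exactly, and a value of 0 would mean they do no better than always predicting the mean.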
PoissonRegressor#
I use PoissonRegressor() to predict “MaxTemp”, first with all the columns in cols and then with just “MinTemp”, “Sunshine”, and “Evaporation”.
from sklearn.linear_model import PoissonRegressor
pipe = Pipeline(
[
("scaler", StandardScaler()),
("pois", PoissonRegressor())
]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('pois', PoissonRegressor())])
df_mini2["pred3"] = pipe.predict(df_mini[cols])
df_mini2
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | Month | pred | pred2 | pred3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
89199 | -0.442614 | 26.2 | -0.294668 | -0.028919 | 0.490930 | -0.583096 | Cairns | 2014-08-20 | 0.382545 | 25.111805 | 25.111805 | 24.503057 |
141116 | 1.448493 | 32.9 | -0.294668 | -0.765512 | 0.519041 | -0.419105 | Darwin | 2014-03-26 | -1.049583 | 34.467149 | 34.467149 | 35.065473 |
89177 | 0.313829 | 26.6 | -0.198020 | -1.133809 | 0.631487 | 0.810832 | Cairns | 2014-07-29 | 0.096119 | 28.064050 | 28.064050 | 27.499847 |
122796 | -0.552928 | 20.7 | -0.294668 | -0.719475 | 0.209816 | -0.255113 | Perth | 2014-08-26 | 0.382545 | 23.243885 | 23.243885 | 22.833752 |
62941 | -1.214816 | 26.6 | -0.294668 | 0.293341 | 1.137492 | -0.091122 | Sale | 2014-01-23 | -1.622434 | 24.606599 | 24.606599 | 23.955344 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
141361 | 1.259383 | 33.4 | 1.372494 | 0.109193 | -0.633525 | 1.712785 | Darwin | 2014-11-26 | 1.241821 | 29.861739 | 29.861739 | 29.567048 |
88987 | 1.117550 | 30.9 | -0.294668 | -0.167030 | 0.209816 | -0.993075 | Cairns | 2014-01-20 | -1.622434 | 33.221566 | 33.221566 | 33.371385 |
131988 | -2.034295 | 12.5 | -0.294668 | -1.225883 | -0.436746 | -1.485050 | Hobart | 2014-08-18 | 0.382545 | 14.235093 | 14.235093 | 16.150914 |
141295 | 1.133309 | 32.7 | -0.294668 | 0.707675 | 0.800155 | -0.419105 | Darwin | 2014-09-21 | 0.668970 | 33.874617 | 33.874617 | 34.298310 |
13828 | 0.676291 | 34.2 | -0.246344 | -0.397216 | 0.884489 | 0.318857 | Moree | 2014-01-25 | -1.622434 | 32.349489 | 32.349489 | 32.299978 |
593 rows × 12 columns
pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7688083582478654
Above, you can see that the score is lower when using Poisson Regressor than when I used Linear Regression.
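One caveat when comparing these two scores: for PoissonRegressor, score reports D^2, the fraction of Poisson deviance explained, rather than R^2. To compare the models on the same scale, R^2 can be computed from the Poisson predictions with r2_score. The sketch below uses synthetic data (not the weather data) and checks that score matches a hand-computed D^2.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_poisson_deviance, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the target is strictly positive, roughly on a Celsius-like scale
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 25 + 3 * X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

pipe = Pipeline([("scaler", StandardScaler()), ("pois", PoissonRegressor())])
pipe.fit(X, y)
pred = pipe.predict(X)

# D^2: one minus the ratio of the model deviance to the deviance of a constant-mean model
d2_by_hand = 1 - mean_poisson_deviance(y, pred) / mean_poisson_deviance(
    y, np.full_like(y, y.mean())
)

print(pipe.score(X, y), d2_by_hand)  # D^2 from score and by hand
print(r2_score(y, pred))             # R^2, comparable to LinearRegression's score
```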
Below, I use fewer columns to see whether that affects the predictions and the score.
pipe.fit(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('pois', PoissonRegressor())])
pipe.predict(df_mini[["MinTemp","Evaporation","Sunshine"]])
array([24.66571687, 34.20879157, 27.61003155, 23.02854555, 22.82376875,
25.9498588 , 27.62548724, 30.86231634, 27.49623567, 22.56935588,
18.17665532, 29.31108894, 26.72477169, 20.2894126 , 33.94097311,
29.19967209, 23.86267716, 15.6165626 , 33.6610098 , 36.03452942,
22.12359397, 24.41929994, 19.47195449, 25.30524483, 17.55403011,
16.48318184, 25.45283019, 23.2471915 , 22.27727317, 25.9436926 ,
24.73608438, 19.68028036, 28.20024264, 35.52535358, 29.35059837,
22.68976816, 33.52483839, 30.65022025, 33.64123955, 25.38893588,
25.70847601, 26.03312007, 31.36098189, 31.87701812, 28.25068994,
29.18948605, 15.14589709, 21.321386 , 23.61190657, 18.00670287,
27.45958235, 26.76752194, 17.20734116, 34.63270217, 20.79638375,
16.52998143, 19.3793249 , 21.91294448, 27.65877381, 19.10171201,
23.90376461, 31.52739808, 33.80621809, 27.55904414, 24.59963112,
33.26375341, 28.93899261, 18.60671764, 32.66481136, 23.19615522,
23.86815819, 24.99701059, 28.5419773 , 18.14958418, 32.91163951,
30.18313862, 25.05802607, 29.78660977, 34.8340233 , 34.53908952,
22.43344417, 21.78262526, 19.23661579, 20.1557387 , 28.87333503,
31.3528652 , 33.29987477, 24.54894169, 29.59222955, 31.35223771,
22.63650649, 41.84586116, 15.57594639, 28.8155143 , 22.02861081,
23.93691197, 28.64267765, 20.86225003, 21.24725559, 25.99375654,
19.56831318, 29.80625218, 26.53354249, 25.42424561, 29.67150144,
34.6976942 , 32.56110404, 25.48281891, 19.56377763, 27.98962546,
33.29572187, 23.8695088 , 17.19298447, 25.78547327, 21.74220271,
23.09351901, 28.76209514, 34.60618956, 20.30870629, 30.45615722,
24.29644031, 37.42727498, 17.31126091, 23.18826776, 29.85904955,
27.9477997 , 23.84994021, 28.19624205, 16.68202922, 25.35422785,
31.5340987 , 32.56196118, 21.71687467, 24.74457799, 21.15680341,
27.84852062, 20.36803896, 34.97776622, 22.47433016, 27.08099845,
20.61050835, 17.10621524, 31.03373397, 32.94783761, 34.17012542,
28.70801245, 25.9896913 , 32.77363491, 31.40293616, 26.48657491,
17.30414386, 26.26332485, 35.71904945, 33.19576049, 32.5871756 ,
19.38032684, 19.13653263, 18.9962088 , 32.70839313, 19.69550568,
27.6563358 , 25.95952678, 35.5528287 , 34.97075457, 15.1603231 ,
26.91914909, 22.05893712, 34.5304463 , 30.27920723, 35.21860016,
31.89073279, 30.97952738, 18.52414555, 19.22013324, 27.73901819,
19.93719941, 32.76807984, 24.86810989, 17.36048186, 22.64178418,
18.66272963, 22.13958079, 23.15854604, 27.41736643, 23.70682797,
16.26330079, 27.20684946, 37.60527842, 24.33011473, 30.08477312,
24.04876753, 30.26090414, 23.86337358, 30.16190068, 18.25954892,
23.73685471, 33.97120862, 18.08182842, 28.28775217, 20.51684425,
22.05887402, 24.83101973, 26.15416055, 26.47949256, 34.0393869 ,
29.56726858, 28.00798602, 20.58761696, 26.0808929 , 23.52371977,
28.06556712, 19.37467848, 30.54754861, 29.86881817, 17.48230529,
23.55426373, 25.29044546, 30.94610001, 18.40916606, 27.6418737 ,
28.44733697, 23.39550664, 16.37858385, 26.39647639, 27.91463411,
19.83187189, 20.62207294, 28.15466091, 21.1168125 , 35.72923332,
18.23405064, 29.60766632, 27.91436812, 23.34681718, 30.38187087,
21.01878678, 33.98918884, 21.80204302, 32.04942506, 27.14232165,
33.52928353, 28.64433239, 33.62065989, 30.40348826, 29.22297906,
30.82667991, 31.68176865, 26.67584182, 21.39634593, 30.86652597,
19.43761406, 18.73594931, 19.47957792, 18.25047882, 30.85913643,
33.1418753 , 18.0200174 , 23.52309472, 29.69430774, 16.92886948,
27.28986542, 32.72817657, 33.52599656, 20.83919329, 17.47251398,
26.16951202, 32.02473746, 29.82121445, 22.94230795, 34.10953036,
31.05840489, 27.99695492, 36.99730496, 34.9990433 , 29.22063295,
28.12043926, 35.64948206, 34.39617812, 28.77555635, 17.26724488,
32.76801745, 35.16164484, 21.11598241, 32.87242673, 26.9227857 ,
28.88266081, 28.98988767, 24.68293201, 36.40462826, 30.95804967,
25.96827327, 21.50914971, 33.585715 , 20.49657405, 34.05681901,
26.91797874, 19.82109755, 20.39557628, 30.32222115, 23.45300687,
28.1031733 , 25.50084426, 24.39990273, 23.68322555, 18.04539481,
20.90149989, 35.1414622 , 24.97538126, 33.73581031, 28.37437739,
20.32953169, 19.90232769, 26.66895978, 21.57324094, 25.34373471,
19.04126575, 28.38713628, 35.86309556, 22.74909709, 28.49304632,
29.14755807, 25.98879347, 23.74333911, 23.03548689, 33.95413546,
36.54728079, 18.37886287, 31.71694663, 33.66611458, 17.99427699,
18.95589365, 27.42762022, 46.2529031 , 19.65438072, 23.15288319,
17.60322001, 18.38300994, 24.1494607 , 25.87988856, 31.56878169,
26.96447149, 28.10687268, 25.76378159, 16.00234074, 37.57580584,
22.06142307, 34.82644306, 26.97661576, 35.39723385, 14.83927324,
27.37416194, 20.89648101, 29.08080033, 22.79157107, 31.21655104,
30.38192871, 32.61677631, 15.9992034 , 19.12695164, 24.64165296,
30.73256062, 23.53897131, 20.06683985, 29.51354569, 32.96122777,
23.61069435, 20.76458973, 35.99739078, 28.30875985, 29.29553815,
22.52681015, 27.74102411, 30.31773885, 35.71776489, 32.47905633,
28.30465615, 33.20804096, 36.22560105, 30.59969061, 24.87092729,
29.33958355, 29.91051868, 22.96583396, 18.10827001, 16.47230377,
23.47592126, 28.14541928, 30.85323333, 31.41992405, 27.18953224,
19.59697753, 22.63828086, 29.51456528, 28.18506291, 20.4819688 ,
23.78115867, 31.31396901, 19.93543963, 29.79901695, 18.47493724,
25.93897694, 29.53968578, 24.70292796, 34.80870409, 24.52934787,
18.18284242, 29.9062114 , 19.2590304 , 33.57683043, 36.00047402,
31.6226782 , 34.33889622, 25.26068915, 23.76266178, 28.56104378,
36.60849069, 38.88848412, 29.26342585, 24.98105538, 23.90522825,
18.19788692, 28.70892209, 23.16376329, 26.73333663, 27.17011582,
23.89481105, 35.34386697, 29.78603142, 30.83343721, 20.1837631 ,
30.30475834, 17.47280743, 31.97775425, 22.54528252, 31.07339611,
19.83090334, 26.20796464, 30.21425533, 24.28061981, 37.62651371,
36.26889583, 20.35892311, 23.40958188, 21.73585557, 24.78393606,
19.28696143, 19.89673582, 28.34730303, 28.59307328, 21.0528157 ,
21.71193678, 25.28926756, 20.67249572, 23.00951024, 22.49815621,
25.95422386, 27.34322096, 30.45794587, 18.89844347, 26.85216883,
22.26844344, 27.13738812, 28.23590351, 24.18934768, 35.49118953,
21.19089925, 29.04480592, 27.17509728, 25.75387669, 19.92587411,
29.98757104, 21.94341996, 32.85521344, 25.30592608, 20.87624107,
33.1817376 , 18.75691363, 30.22122417, 32.98622862, 20.3879951 ,
22.24114053, 33.25345683, 24.81565524, 19.01378795, 17.62037696,
20.82159557, 35.51665399, 26.78055042, 31.3426709 , 34.39757521,
30.45766657, 31.78454184, 20.54885862, 31.39109164, 25.60453964,
18.34282715, 21.44311065, 32.26204022, 20.91977566, 34.52699545,
32.99820366, 22.0010149 , 27.16970797, 17.10609503, 23.5817093 ,
25.11453208, 24.4037193 , 30.9172605 , 26.56029869, 26.67301377,
31.91806976, 21.49771504, 29.66163831, 20.43954982, 29.3961833 ,
19.23665932, 21.97767614, 22.20747638, 23.18213367, 36.01184013,
27.35640912, 30.08166971, 23.12455651, 20.64292463, 19.24644875,
27.48325185, 28.21765569, 26.49662586, 32.09906783, 35.22567414,
29.36899104, 24.49653279, 30.22390998, 30.30555608, 34.02390512,
23.67378499, 31.63591041, 34.42826276, 16.84866871, 27.68330094,
30.30980547, 26.10450578, 36.84665382, 21.78358293, 16.4145067 ,
17.67510193, 33.3323939 , 21.90810547, 20.25052818, 26.64868024,
33.40360309, 17.08894011, 22.58333497, 21.25462963, 26.7603103 ,
31.17203822, 29.5152472 , 20.12477066, 20.52412865, 18.30649348,
20.37695742, 17.22891609, 25.97278689, 28.80012338, 31.45092958,
35.5602282 , 15.00997788, 28.47483578, 36.46971747, 29.8509232 ,
19.13990899, 18.98732236, 33.04912696, 15.95879719, 30.27540209,
31.70361606, 23.27589018, 27.08331799, 29.77779376, 25.15166816,
32.77329909, 19.68225909, 19.10971866, 18.46395212, 29.85185122,
29.68714232, 24.49708435, 26.71432878, 30.93687454, 32.02389447,
16.27313753, 34.81713068, 30.96763578])
pipe.score(df_mini[["MinTemp","Evaporation","Sunshine"]],df_mini["MaxTemp"])
0.7573496399072459
The score after using Poisson regression is about 0.05 lower than with linear regression, so linear regression is the better model for this data. One likely reason is that Poisson regression assumes the variance of the target equals its mean, an assumption suited to count data rather than to a continuous quantity like temperature. StandardScaler gives each input column mean zero and variance one, but it is applied only to the inputs and not to “MaxTemp”, so it does not make that assumption hold.
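One caveat when comparing these two scores: as far as I can tell from the scikit-learn documentation, `LinearRegression.score` reports R², while `PoissonRegressor.score` reports D² (the fraction of Poisson deviance explained). The sketch below puts both models on the same footing by computing R² from their predictions with `r2_score`. It uses made-up synthetic data, not `df_mini`, and the helper name `fit_pipe` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, PoissonRegressor
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the weather data (assumption: NOT the real df_mini).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 25 + 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=2, size=500)
y = np.clip(y, 1.0, None)  # PoissonRegressor requires non-negative targets


def fit_pipe(model):
    """Hypothetical helper: scale the inputs, then fit the given model."""
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    return pipe.fit(X, y)


lin = fit_pipe(LinearRegression())
poi = fit_pipe(PoissonRegressor())

# Same metric (R^2) for both models, computed from the predictions.
r2_lin = r2_score(y, lin.predict(X))
r2_poi = r2_score(y, poi.predict(X))

# What pipe.score returns when the final step is a PoissonRegressor (D^2).
d2_poi = poi.score(X, y)

print("R^2 linear: ", r2_lin)
print("R^2 Poisson:", r2_poi)
print("D^2 Poisson:", d2_poi)
```

On the real data, the same idea would be `r2_score(df_mini["MaxTemp"], pipe.predict(df_mini[cols]))` for each fitted pipeline.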
Also, when using fewer columns for the training data, the score goes down. This makes sense because each additional relevant column gives the model more information about “MaxTemp”.
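A minimal sketch of this effect, again on synthetic data rather than `df_mini`: the linear model is fit on the first 1, 2, ..., 5 columns, and the score is computed on the same rows it was trained on. The coefficients and the noise column are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): four informative columns and one pure-noise column.
rng = np.random.default_rng(1)
n = 400
X = rng.normal(size=(n, 5))
y = (20 + 3 * X[:, 0] + 2 * X[:, 1] + 1 * X[:, 2] + 0.5 * X[:, 3]
     + rng.normal(size=n))

# Fit on the first k columns and record the R^2 on the training rows.
scores = {}
for k in range(1, 6):
    model = LinearRegression().fit(X[:, :k], y)
    scores[k] = model.score(X[:, :k], y)

for k, s in scores.items():
    print(k, "columns -> training R^2 =", s)
```

For ordinary least squares, the score on the training rows cannot go down when a column is added, even a pure-noise one, so a held-out split (for example with `train_test_split`) is a fairer way to check whether an extra column really helps.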
Lasso#
Next I try Lasso to see if it works better.
from sklearn.linear_model import Lasso
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("lasso", Lasso())
    ]
)
pipe.fit(df_mini[cols],df_mini["MaxTemp"])
Pipeline(steps=[('scaler', StandardScaler()), ('lasso', Lasso())])
df_mini2["pred4"] = pipe.predict(df_mini[cols])
df_mini2
MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | Location | Datetime | Month | pred | pred2 | pred3 | pred4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
89199 | -0.442614 | 26.2 | -0.294668 | -0.028919 | 0.490930 | -0.583096 | Cairns | 2014-08-20 | 0.382545 | 25.111805 | 25.111805 | 24.503057 | 25.237897 |
141116 | 1.448493 | 32.9 | -0.294668 | -0.765512 | 0.519041 | -0.419105 | Darwin | 2014-03-26 | -1.049583 | 34.467149 | 34.467149 | 35.065473 | 32.440049 |
89177 | 0.313829 | 26.6 | -0.198020 | -1.133809 | 0.631487 | 0.810832 | Cairns | 2014-07-29 | 0.096119 | 28.064050 | 28.064050 | 27.499847 | 27.872262 |
122796 | -0.552928 | 20.7 | -0.294668 | -0.719475 | 0.209816 | -0.255113 | Perth | 2014-08-26 | 0.382545 | 23.243885 | 23.243885 | 22.833752 | 24.054934 |
62941 | -1.214816 | 26.6 | -0.294668 | 0.293341 | 1.137492 | -0.091122 | Sale | 2014-01-23 | -1.622434 | 24.606599 | 24.606599 | 23.955344 | 23.203801 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
141361 | 1.259383 | 33.4 | 1.372494 | 0.109193 | -0.633525 | 1.712785 | Darwin | 2014-11-26 | 1.241821 | 29.861739 | 29.861739 | 29.567048 | 30.572726 |
88987 | 1.117550 | 30.9 | -0.294668 | -0.167030 | 0.209816 | -0.993075 | Cairns | 2014-01-20 | -1.622434 | 33.221566 | 33.221566 | 33.371385 | 31.010369 |
131988 | -2.034295 | 12.5 | -0.294668 | -1.225883 | -0.436746 | -1.485050 | Hobart | 2014-08-18 | 0.382545 | 14.235093 | 14.235093 | 16.150914 | 16.998344 |
141295 | 1.133309 | 32.7 | -0.294668 | 0.707675 | 0.800155 | -0.419105 | Darwin | 2014-09-21 | 0.668970 | 33.874617 | 33.874617 | 34.298310 | 32.332803 |
13828 | 0.676291 | 34.2 | -0.246344 | -0.397216 | 0.884489 | 0.318857 | Moree | 2014-01-25 | -1.622434 | 32.349489 | 32.349489 | 32.299978 | 30.047367 |
593 rows × 13 columns
pipe.score(df_mini[cols],df_mini["MaxTemp"])
0.7388354956518769
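Of all the regression models I tried, Lasso did the worst at predicting “MaxTemp” from the columns of df_mini listed in cols. This is plausible because Lasso() uses its default penalty strength alpha=1.0, which shrinks the coefficients and so lowers the fit on the training data compared with plain linear regression.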
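To see how much the penalty strength matters, here is a rough sketch that fits the same scaler-plus-model pipeline with plain linear regression and with Lasso at a few values of `alpha`. The data is synthetic (not `df_mini`), the `alpha` values are arbitrary, and the helper name `fit_score` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): three informative columns, two noise columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
y = (25 + 4 * X[:, 0] + 2 * X[:, 1] + 0.5 * X[:, 2]
     + rng.normal(scale=2, size=500))


def fit_score(model):
    """Hypothetical helper: return (training R^2, fitted coefficients)."""
    pipe = Pipeline([("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X, y)
    return pipe.score(X, y), pipe.named_steps["model"].coef_


ols_score, ols_coef = fit_score(LinearRegression())
lasso_results = {
    alpha: fit_score(Lasso(alpha=alpha)) for alpha in (0.01, 0.1, 1.0)
}

print("linear regression:", ols_score, ols_coef)
for alpha, (score, coef) in lasso_results.items():
    print("Lasso alpha =", alpha, ":", score, coef)
```

With a small `alpha` the Lasso fit should be close to linear regression, and with a larger `alpha` more coefficients are pushed toward zero. On the real data the fitted coefficients could be inspected the same way with `pipe.named_steps["lasso"].coef_`.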
Summary#
In the Altair section, I displayed charts of “MaxTemp” in relation to time (“Datetime”) and city (“Location”). In the machine learning section, I used StandardScaler, Pipeline, LinearRegression, PoissonRegressor, and Lasso to predict “MaxTemp”. For my data, linear regression worked best, and using more columns improved both the predictions and the score.
References#
Source of the dataset: weatherAUS.csv from Kaggle (Reference).