Week 10 Friday
I’ve attached a dataset containing hourly temperatures from Kaggle: source
The unit for these temperatures is kelvin.
The idea of using a decision tree for this kind of dataset comes from Jake VanderPlas’s Python Data Science Handbook (which was also the source of the bicycle worksheet).
Announcements
- Be sure you’re writing your project using the Project Template in the Project folder. (This is different from the Worksheet 16 template.)
- I have Zoom office hours during the first hour of the time scheduled for our final exam: Monday, 10:30-11:30am. See our Canvas homepage for the Zoom link.
- Projects are due 11:59pm Monday. (An extension of 1-2 days may be possible on an individual basis; email me Sunday or Monday if you think you need one, and share what you have so far.)
- If you get stuck on something or have a logistical question, please ask on Ed Discussion, even if you think it is very specific to your own project.
- Please fill out a course evaluation if you haven’t already!
A decision tree to predict temperature
We won’t divide the data into a training set and a test set in this case. My intuition is that, because the temperature at 3pm (say) is so similar to the temperature at 4pm, randomly dividing the data isn’t appropriate. Phrased another way, our eventual goal is probably to predict the temperature at some future date, not at some random hour in the middle of a day for which we already know the surrounding temperatures.
My goal here is to see what the decision tree prediction function looks like. I’m not thinking about overfitting right now.
import pandas as pd
import altair as alt
df_pre = pd.read_csv("temperature.csv")
Different cities are in different columns. If we wanted, for example, to plot these temperatures in different colors, we could use df_pre.melt to combine the city columns into a single column (a sketch follows the table below).
df_pre
|   | datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 12:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 309.100000 | NaN | NaN | NaN |
| 1 | 2012-10-01 13:00:00 | 284.630000 | 282.080000 | 289.480000 | 281.800000 | 291.870000 | 291.530000 | 293.410000 | 296.600000 | 285.120000 | ... | 285.630000 | 288.220000 | 285.830000 | 287.170000 | 307.590000 | 305.470000 | 310.580000 | 304.4 | 304.4 | 303.5 |
| 2 | 2012-10-01 14:00:00 | 284.629041 | 282.083252 | 289.474993 | 281.797217 | 291.868186 | 291.533501 | 293.403141 | 296.608509 | 285.154558 | ... | 285.663208 | 288.247676 | 285.834650 | 287.186092 | 307.590000 | 304.310000 | 310.495769 | 304.4 | 304.4 | 303.5 |
| 3 | 2012-10-01 15:00:00 | 284.626998 | 282.091866 | 289.460618 | 281.789833 | 291.862844 | 291.543355 | 293.392177 | 296.631487 | 285.233952 | ... | 285.756824 | 288.326940 | 285.847790 | 287.231672 | 307.391513 | 304.281841 | 310.411538 | 304.4 | 304.4 | 303.5 |
| 4 | 2012-10-01 16:00:00 | 284.624955 | 282.100481 | 289.446243 | 281.782449 | 291.857503 | 291.553209 | 293.381213 | 296.654466 | 285.313345 | ... | 285.850440 | 288.406203 | 285.860929 | 287.277251 | 307.145200 | 304.238015 | 310.327308 | 304.4 | 304.4 | 303.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45248 | 2017-11-29 20:00:00 | NaN | 282.000000 | NaN | 280.820000 | 293.550000 | 292.150000 | 289.540000 | 294.710000 | 285.720000 | ... | 290.240000 | NaN | 275.130000 | 288.080000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45249 | 2017-11-29 21:00:00 | NaN | 282.890000 | NaN | 281.650000 | 295.680000 | 292.740000 | 290.610000 | 295.590000 | 286.450000 | ... | 289.240000 | NaN | 274.130000 | 286.020000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45250 | 2017-11-29 22:00:00 | NaN | 283.390000 | NaN | 282.750000 | 295.960000 | 292.580000 | 291.340000 | 296.250000 | 286.440000 | ... | 286.780000 | NaN | 273.480000 | 283.940000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45251 | 2017-11-29 23:00:00 | NaN | 283.020000 | NaN | 282.960000 | 295.650000 | 292.610000 | 292.150000 | 297.150000 | 286.140000 | ... | 284.570000 | NaN | 272.480000 | 282.170000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45252 | 2017-11-30 00:00:00 | NaN | 282.280000 | NaN | 283.040000 | 294.930000 | 291.400000 | 291.640000 | 297.150000 | 284.700000 | ... | 283.420000 | NaN | 271.800000 | 280.650000 | NaN | NaN | NaN | NaN | NaN | NaN |

45253 rows × 37 columns
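As an aside, here is a minimal sketch of that melt reshaping (the names df_long, "city", and "temperature" are my own choices; we don’t use this below):

df_long = df_pre.melt(
    id_vars="datetime",        # keep the timestamp as its own column
    var_name="city",           # the former column labels become a "city" column
    value_name="temperature",  # the kelvin readings
)
df_long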
We will keep just 201 of the rows (the slice 400:600 is inclusive with .loc) and two of the columns from this dataset.
# greatly reduce the rows and columns
df = df_pre.loc[400:600, ["datetime", "Detroit"]].copy()
df
|   | datetime | Detroit |
|---|---|---|
| 400 | 2012-10-18 04:00:00 | 284.52 |
| 401 | 2012-10-18 05:00:00 | 284.45 |
| 402 | 2012-10-18 06:00:00 | 285.51 |
| 403 | 2012-10-18 07:00:00 | 284.93 |
| 404 | 2012-10-18 08:00:00 | 285.34 |
| ... | ... | ... |
| 596 | 2012-10-26 08:00:00 | 288.50 |
| 597 | 2012-10-26 09:00:00 | 287.94 |
| 598 | 2012-10-26 10:00:00 | 287.49 |
| 599 | 2012-10-26 11:00:00 | 287.09 |
| 600 | 2012-10-26 12:00:00 | 287.69 |

201 rows × 2 columns
Notice how strange the x-axis labels look. That is a sign that something is wrong with the data type of the “datetime” column.
c1 = alt.Chart(df).mark_line().encode(
    x="datetime",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)
c1
These values are strings, not datetime objects.
df.dtypes
datetime object
Detroit float64
dtype: object
We convert that column into the datetime data type. We could replace the old column, but here we include it as a new column. (That way, if we make a mistake in this cell, we don’t need to re-load the data.)
df["date"] = pd.to_datetime(df["datetime"])
df.dtypes
datetime object
Detroit float64
date datetime64[ns]
dtype: object
Now the image looks much more natural.
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)
c1
It would be incorrect to use a classifier in this context, because predicting temperature is a regression problem. The following import, for example, would give us the wrong kind of model.
from sklearn.tree import DecisionTreeClassifier
Here is the correct import.
from sklearn.tree import DecisionTreeRegressor
We’ll allow at most 15 leaf nodes when we instantiate the regressor object.
reg = DecisionTreeRegressor(max_leaf_nodes=15)
Here is our usual error. Remember that the first input needs to be two-dimensional.
reg.fit(df["date"], df["Detroit"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_84/3032300603.py in <module>
----> 1 reg.fit(df["date"], df["Detroit"])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
1318 sample_weight=sample_weight,
1319 check_input=check_input,
-> 1320 X_idx_sorted=X_idx_sorted,
1321 )
1322 return self
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
164 check_y_params = dict(ensure_2d=False, dtype=None)
165 X, y = self._validate_data(
--> 166 X, y, validate_separately=(check_X_params, check_y_params)
167 )
168 if issparse(X):
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
576 # :(
577 check_X_params, check_y_params = validate_separately
--> 578 X = check_array(X, **check_X_params)
579 y = check_array(y, **check_y_params)
580 else:
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
--> 773 "if it contains a single sample.".format(array)
774 )
775
ValueError: Expected 2D array, got 1D array instead:
array=[1.3505327e+18 1.3505365e+18 1.3505400e+18 1.3505436e+18 1.3505472e+18
 ...
 1.3512528e+18].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
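# A one-column DataFrame (note the double brackets) counts as two-dimensional.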
reg.fit(df[["date"]], df["Detroit"])
DecisionTreeRegressor(max_leaf_nodes=15)
We add the predictions as a new column. I expect this would raise a SettingWithCopyWarning if we hadn’t used copy() above.
df["pred_tree"] = reg.predict(df[["date"]])
Here are the predictions. If you count, there should be 15 horizontal line segments, corresponding to the 15 leaf nodes. (Remember that each leaf node corresponds to a region, and all inputs in a region have the same output.)
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_tree", scale=alt.Scale(zero=False), title="kelvin")
    # Note: inside encode(), color="orange" doesn't work,
    # but color=alt.value("orange") does.
)

c1+c2
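If you would rather count the leaves with code than by eye, scikit-learn can report them directly. (A quick sanity check, not something we rely on below; export_text prints the tree’s splits and the constant value predicted at each leaf.)

from sklearn.tree import export_text

# Should agree with the max_leaf_nodes=15 we specified above.
reg.get_n_leaves()

# The split thresholds are nanosecond timestamps, since that is how
# scikit-learn saw the datetime column internally.
print(export_text(reg))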
This setup is a little different from our other decision tree examples: there is only one input column. So, for example, the feature importances array is not interesting in this case, because there is only one feature.
reg.feature_importances_
array([1.])
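To see a case where this array is more informative, here is a sketch with a second, hypothetical feature, the hour of the day (the names X2 and reg2 are my own):

# Two features: the timestamp and the hour of the day.
X2 = pd.DataFrame({"date": df["date"], "hour": df["date"].dt.hour})
reg2 = DecisionTreeRegressor(max_leaf_nodes=15)
reg2.fit(X2, df["Detroit"])

# Now there are two entries, one per feature, summing to 1; the larger
# entry indicates which feature the tree's splits relied on more.
reg2.feature_importances_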
Let’s see how this compares if we use a random forest with 100 trees, each with at most 15 leaf nodes.
from sklearn.ensemble import RandomForestRegressor
rfe = RandomForestRegressor(n_estimators=100, max_leaf_nodes=15)
rfe.fit(df[["date"]], df["Detroit"])
RandomForestRegressor(max_leaf_nodes=15)
df["pred_forest"] = rfe.predict(df[["date"]])
The image looks pretty similar in this case, but the corners are rounded because of the averaging across the 100 trees. Look, for example, at the very right side of this picture and compare it to the picture above: the random forest plot has a diagonal curve there, which is very different from the straight horizontal segments of the decision tree plot.
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_forest", scale=alt.Scale(zero=False), title="kelvin")
    # Note: inside encode(), color="orange" doesn't work,
    # but color=alt.value("orange") does.
)

c1+c2
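To see that averaging directly, here is a sketch comparing the forest’s prediction to the mean of its individual trees’ predictions (rfe.estimators_ holds the 100 fitted trees; the names X_arr and tree_preds are my own):

import numpy as np

# The individual trees were fit on a plain array (no column names),
# so we pass a NumPy array to their predict methods.
X_arr = df[["date"]].to_numpy()

# Stack the 100 trees' predictions into a (100, 201) array.
tree_preds = np.stack([t.predict(X_arr) for t in rfe.estimators_])

# The forest's prediction is the mean over the trees; expect True.
np.allclose(tree_preds.mean(axis=0), df["pred_forest"])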
We could definitely keep going, for example by considering overfitting or by adjusting the model so that its plot more closely matches the true data, but this is where we’ll finish the course!