Week 10 Friday#

I’ve attached a dataset containing hourly temperatures from Kaggle: source

The unit for these temperatures is kelvin.

The idea for using a decision tree for this kind of data set comes from Jake VanderPlas’s Python Data Science Handbook (which was also the source of the bicycle worksheet).

Announcements#

  • Be sure you’re writing your project using the Project Template in the Project folder. (This is different from the Worksheet 16 template.)

  • I will hold Zoom office hours during the first hour of the time scheduled for our final exam: Monday, 10:30-11:30am. See our Canvas homepage for the Zoom link.

  • Projects are due at 11:59pm Monday. (An extension of 1-2 days may be possible on an individual basis; email me Sunday or Monday if you think you need one, and share what you have so far.)

  • If you get stuck on something or have a logistical question, please ask on Ed Discussion, even if you think it is very specific to your own project.

  • Please fill out a course evaluation if you haven’t already!

A decision tree to predict temperature#

We won’t divide the data into a training set and a test set in this case. My intuition is that because the temperature at 3pm (say) is so similar to the temperature at 4pm, randomly dividing the data won’t be appropriate. Phrased another way, our eventual goal is probably to predict the temperature at some future date, not at some random hour in the middle of a day for which we already know the surrounding temperatures.

My goal here is to see what the decision tree prediction function looks like. I’m not thinking about overfitting right now.

import pandas as pd
import altair as alt
df_pre = pd.read_csv("temperature.csv")

Different cities are in different columns. If we wanted, for example, to plot these temperatures in different colors, we would use df_pre.melt to combine the city columns into a single column (a sketch follows the DataFrame display below).

df_pre
datetime Vancouver Portland San Francisco Seattle Los Angeles San Diego Las Vegas Phoenix Albuquerque ... Philadelphia New York Montreal Boston Beersheba Tel Aviv District Eilat Haifa Nahariyya Jerusalem
0 2012-10-01 12:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 309.100000 NaN NaN NaN
1 2012-10-01 13:00:00 284.630000 282.080000 289.480000 281.800000 291.870000 291.530000 293.410000 296.600000 285.120000 ... 285.630000 288.220000 285.830000 287.170000 307.590000 305.470000 310.580000 304.4 304.4 303.5
2 2012-10-01 14:00:00 284.629041 282.083252 289.474993 281.797217 291.868186 291.533501 293.403141 296.608509 285.154558 ... 285.663208 288.247676 285.834650 287.186092 307.590000 304.310000 310.495769 304.4 304.4 303.5
3 2012-10-01 15:00:00 284.626998 282.091866 289.460618 281.789833 291.862844 291.543355 293.392177 296.631487 285.233952 ... 285.756824 288.326940 285.847790 287.231672 307.391513 304.281841 310.411538 304.4 304.4 303.5
4 2012-10-01 16:00:00 284.624955 282.100481 289.446243 281.782449 291.857503 291.553209 293.381213 296.654466 285.313345 ... 285.850440 288.406203 285.860929 287.277251 307.145200 304.238015 310.327308 304.4 304.4 303.5
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45248 2017-11-29 20:00:00 NaN 282.000000 NaN 280.820000 293.550000 292.150000 289.540000 294.710000 285.720000 ... 290.240000 NaN 275.130000 288.080000 NaN NaN NaN NaN NaN NaN
45249 2017-11-29 21:00:00 NaN 282.890000 NaN 281.650000 295.680000 292.740000 290.610000 295.590000 286.450000 ... 289.240000 NaN 274.130000 286.020000 NaN NaN NaN NaN NaN NaN
45250 2017-11-29 22:00:00 NaN 283.390000 NaN 282.750000 295.960000 292.580000 291.340000 296.250000 286.440000 ... 286.780000 NaN 273.480000 283.940000 NaN NaN NaN NaN NaN NaN
45251 2017-11-29 23:00:00 NaN 283.020000 NaN 282.960000 295.650000 292.610000 292.150000 297.150000 286.140000 ... 284.570000 NaN 272.480000 282.170000 NaN NaN NaN NaN NaN NaN
45252 2017-11-30 00:00:00 NaN 282.280000 NaN 283.040000 294.930000 291.400000 291.640000 297.150000 284.700000 ... 283.420000 NaN 271.800000 280.650000 NaN NaN NaN NaN NaN NaN

45253 rows × 37 columns
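
Here is a minimal sketch of that reshaping. (The names df_long, "city", and "temperature" are my own choices, not part of the dataset.)

# Combine the 36 city columns into a single "city" column, so each row
# is one (datetime, city, temperature) observation -- the "long" format
# that an Altair color encoding would expect.
df_long = df_pre.melt(
    id_vars="datetime",        # keep datetime as an identifier column
    var_name="city",           # former column names become values here
    value_name="temperature",  # the kelvin measurements
)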

We will take just 201 of the rows and two of the columns from this dataset. (Recall that .loc slicing includes both endpoints, so 400:600 yields 201 rows.)

# greatly reduce the rows and columns
df = df_pre.loc[400:600, ["datetime", "Detroit"]].copy()
df
datetime Detroit
400 2012-10-18 04:00:00 284.52
401 2012-10-18 05:00:00 284.45
402 2012-10-18 06:00:00 285.51
403 2012-10-18 07:00:00 284.93
404 2012-10-18 08:00:00 285.34
... ... ...
596 2012-10-26 08:00:00 288.50
597 2012-10-26 09:00:00 287.94
598 2012-10-26 10:00:00 287.49
599 2012-10-26 11:00:00 287.09
600 2012-10-26 12:00:00 287.69

201 rows × 2 columns

Notice how strange the x-axis labels look. That is a sign that something is wrong with the data type of the “datetime” column.

c1 = alt.Chart(df).mark_line().encode(
    x="datetime",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c1

These values are strings, not datetime objects.

df.dtypes
datetime     object
Detroit     float64
dtype: object

We convert that column to the datetime data type. We could replace the old column, but here we add the result as a new column. (That way, if we make a mistake in this cell, we don’t need to reload the data.)

df["date"] = pd.to_datetime(df["datetime"])
df.dtypes
datetime            object
Detroit            float64
date        datetime64[ns]
dtype: object
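
As an aside, we could have avoided the separate conversion step by parsing the dates when loading the file. (A quick sketch using the same file name as above; df_parsed is a name of my own choosing.)

# Hypothetical alternative: parse the "datetime" column at load time.
df_parsed = pd.read_csv("temperature.csv", parse_dates=["datetime"])
df_parsed.dtypes  # "datetime" would now be datetime64[ns] rather than object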

Now the image looks much more natural.

c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c1

It would be incorrect to use a classifier in this context, because predicting temperature is a regression problem.

from sklearn.tree import DecisionTreeClassifier

Here is the correct import.

from sklearn.tree import DecisionTreeRegressor

We’ll allow at most 15 leaf nodes when we instantiate the regressor object.

reg = DecisionTreeRegressor(max_leaf_nodes=15)

Here is our usual error. Remember that the first input needs to be two-dimensional.

reg.fit(df["date"], df["Detroit"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_84/3032300603.py in <module>
----> 1 reg.fit(df["date"], df["Detroit"])

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
   1318             sample_weight=sample_weight,
   1319             check_input=check_input,
-> 1320             X_idx_sorted=X_idx_sorted,
   1321         )
   1322         return self

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
    164             check_y_params = dict(ensure_2d=False, dtype=None)
    165             X, y = self._validate_data(
--> 166                 X, y, validate_separately=(check_X_params, check_y_params)
    167             )
    168             if issparse(X):

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    576                 # :(
    577                 check_X_params, check_y_params = validate_separately
--> 578                 X = check_array(X, **check_X_params)
    579                 y = check_array(y, **check_y_params)
    580             else:

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    771                     "Reshape your data either using array.reshape(-1, 1) if "
    772                     "your data has a single feature or array.reshape(1, -1) "
--> 773                     "if it contains a single sample.".format(array)
    774                 )
    775 

ValueError: Expected 2D array, got 1D array instead:
array=[1.3505327e+18 1.3505365e+18 1.3505400e+18 1.3505436e+18 1.3505472e+18
 ...
 1.3512528e+18].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Passing the one-column DataFrame df[["date"]] (note the double brackets) instead of the Series df["date"] fixes the error.

reg.fit(df[["date"]], df["Detroit"])
DecisionTreeRegressor(max_leaf_nodes=15)
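
We can check the dimensionality directly. (The shapes in the comments are what we expect for our 201-row DataFrame.)

# df["date"] is a pandas Series; df[["date"]] is a one-column DataFrame.
df["date"].shape    # (201,)   -- one-dimensional, rejected by fit
df[["date"]].shape  # (201, 1) -- two-dimensional, accepted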

We add the predictions as a new column. I expect this would raise a SettingWithCopyWarning if we hadn’t used copy() above.

df["pred_tree"] = reg.predict(df[["date"]])

Here are the predictions. If you count, there should be 15 horizontal line segments, corresponding to the 15 leaf nodes. (Remember that each leaf node corresponds to a region, and all inputs in a region have the same output.)

c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_tree", scale=alt.Scale(zero=False), title="kelvin")
    # Doesn't work.  color="orange"
    # Does work.  color=alt.value("orange")
)

c1+c2
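
Rather than counting the segments by eye, we can check the leaf count directly. (A quick sketch; the nunique count assumes no two leaves happen to have the same mean value.)

# Each leaf node contributes one constant prediction value.
reg.get_n_leaves()         # 15
df["pred_tree"].nunique()  # should also be 15, unless two leaf means coincide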

This setup is a little different from our other decision tree examples: there is only one input column. So, for example, the feature importances array is not interesting in this case, because the single feature necessarily receives all of the importance.

reg.feature_importances_
array([1.])

Let’s see how this compares if we use a random forest with 100 trees, each with at most 15 leaf nodes.

from sklearn.ensemble import RandomForestRegressor
rfe = RandomForestRegressor(n_estimators=100, max_leaf_nodes=15)
rfe.fit(df[["date"]], df["Detroit"])
RandomForestRegressor(max_leaf_nodes=15)
df["pred_forest"] = rfe.predict(df[["date"]])

The image looks pretty similar in this case, but the corners are rounded because of the averaging across the 100 trees. Look, for example, at the very right side of this picture and compare it to the picture above: the random forest plot shows a diagonal curve, which is very different from the straight horizontal segments of the decision tree plot.

c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_forest", scale=alt.Scale(zero=False), title="kelvin")
    # Doesn't work.  color="orange"
    # Does work.  color=alt.value("orange")
)

c1+c2
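
To make the averaging concrete: the 100 fitted trees are stored in rfe.estimators_, each tree’s prediction function is a step function, and the forest prediction is their pointwise mean. Here is a quick sketch verifying that (up to floating-point tolerance):

import numpy as np

# Average the 100 individual tree predictions by hand; the result
# should match the forest's own predict output.
tree_preds = np.stack([t.predict(df[["date"]]) for t in rfe.estimators_])
np.allclose(tree_preds.mean(axis=0), df["pred_forest"])  # True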

We could definitely keep going, for example by considering overfitting or by making changes so that the predictions more closely match the true data, but this is where we’ll finish the course!