Week 10 Friday
I’ve attached a dataset containing hourly temperatures from Kaggle: source
The unit for these temperatures is kelvin.
The idea of using a decision tree for this kind of dataset comes from Jake VanderPlas’s Python Data Science Handbook (which was also the source of the bicycle worksheet).
Announcements
- Be sure you’re writing your project using the Project Template in the Project folder. (This is different from the Worksheet 16 template.)
- I have Zoom office hours during the first hour of the time scheduled for our final exam: Monday, 10:30-11:30am. See our Canvas homepage for the Zoom link.
- Projects are due 11:59pm Monday. (An extension of 1-2 days may be possible on an individual basis; email me Sunday or Monday if you think you need one, and share what you have so far.)
- If you get stuck on something or have a logistical question, please ask on Ed Discussion, even if you think it is very specific to your own project.
- Please fill out a course evaluation if you haven’t already!
A decision tree to predict temperature
We won’t divide the data into a training set and a test set in this case. My intuition is that, because the temperature at 3pm (say) is so similar to the temperature at 4pm, randomly dividing the data isn’t appropriate. Phrased another way, our eventual goal is probably to predict the temperature at some future date, not at some random hour in the middle of a day for which we already know the surrounding temperatures.
My goal here is to see what the decision tree prediction function looks like. I’m not thinking about overfitting right now.
import pandas as pd
import altair as alt
df_pre = pd.read_csv("temperature.csv")
Different cities are in different columns. If we wanted, for example, to plot these temperatures in different colors, we could use df_pre.melt to combine the city columns into a single column (a sketch follows the table below).
df_pre
|   | datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2012-10-01 12:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | 309.100000 | NaN | NaN | NaN |
| 1 | 2012-10-01 13:00:00 | 284.630000 | 282.080000 | 289.480000 | 281.800000 | 291.870000 | 291.530000 | 293.410000 | 296.600000 | 285.120000 | ... | 285.630000 | 288.220000 | 285.830000 | 287.170000 | 307.590000 | 305.470000 | 310.580000 | 304.4 | 304.4 | 303.5 |
| 2 | 2012-10-01 14:00:00 | 284.629041 | 282.083252 | 289.474993 | 281.797217 | 291.868186 | 291.533501 | 293.403141 | 296.608509 | 285.154558 | ... | 285.663208 | 288.247676 | 285.834650 | 287.186092 | 307.590000 | 304.310000 | 310.495769 | 304.4 | 304.4 | 303.5 |
| 3 | 2012-10-01 15:00:00 | 284.626998 | 282.091866 | 289.460618 | 281.789833 | 291.862844 | 291.543355 | 293.392177 | 296.631487 | 285.233952 | ... | 285.756824 | 288.326940 | 285.847790 | 287.231672 | 307.391513 | 304.281841 | 310.411538 | 304.4 | 304.4 | 303.5 |
| 4 | 2012-10-01 16:00:00 | 284.624955 | 282.100481 | 289.446243 | 281.782449 | 291.857503 | 291.553209 | 293.381213 | 296.654466 | 285.313345 | ... | 285.850440 | 288.406203 | 285.860929 | 287.277251 | 307.145200 | 304.238015 | 310.327308 | 304.4 | 304.4 | 303.5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45248 | 2017-11-29 20:00:00 | NaN | 282.000000 | NaN | 280.820000 | 293.550000 | 292.150000 | 289.540000 | 294.710000 | 285.720000 | ... | 290.240000 | NaN | 275.130000 | 288.080000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45249 | 2017-11-29 21:00:00 | NaN | 282.890000 | NaN | 281.650000 | 295.680000 | 292.740000 | 290.610000 | 295.590000 | 286.450000 | ... | 289.240000 | NaN | 274.130000 | 286.020000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45250 | 2017-11-29 22:00:00 | NaN | 283.390000 | NaN | 282.750000 | 295.960000 | 292.580000 | 291.340000 | 296.250000 | 286.440000 | ... | 286.780000 | NaN | 273.480000 | 283.940000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45251 | 2017-11-29 23:00:00 | NaN | 283.020000 | NaN | 282.960000 | 295.650000 | 292.610000 | 292.150000 | 297.150000 | 286.140000 | ... | 284.570000 | NaN | 272.480000 | 282.170000 | NaN | NaN | NaN | NaN | NaN | NaN |
| 45252 | 2017-11-30 00:00:00 | NaN | 282.280000 | NaN | 283.040000 | 294.930000 | 291.400000 | 291.640000 | 297.150000 | 284.700000 | ... | 283.420000 | NaN | 271.800000 | 280.650000 | NaN | NaN | NaN | NaN | NaN | NaN |

45253 rows × 37 columns
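As an aside, here is a minimal sketch of that melt reshaping (the names df_long, "city", and "temperature" are my own choices; we don’t use this below):

df_long = df_pre.melt(
    id_vars="datetime",        # keep the timestamp as its own column
    var_name="city",           # the former column labels become a "city" column
    value_name="temperature",  # the kelvin readings
)
df_long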
We will keep just 201 of the rows (the slice 400:600 is inclusive with .loc) and two of the columns from this dataset.
# greatly reduce the rows and columns
df = df_pre.loc[400:600, ["datetime", "Detroit"]].copy()
df
|   | datetime | Detroit |
|---|---|---|
| 400 | 2012-10-18 04:00:00 | 284.52 |
| 401 | 2012-10-18 05:00:00 | 284.45 |
| 402 | 2012-10-18 06:00:00 | 285.51 |
| 403 | 2012-10-18 07:00:00 | 284.93 |
| 404 | 2012-10-18 08:00:00 | 285.34 |
| ... | ... | ... |
| 596 | 2012-10-26 08:00:00 | 288.50 |
| 597 | 2012-10-26 09:00:00 | 287.94 |
| 598 | 2012-10-26 10:00:00 | 287.49 |
| 599 | 2012-10-26 11:00:00 | 287.09 |
| 600 | 2012-10-26 12:00:00 | 287.69 |

201 rows × 2 columns
Notice how strange the x-axis labels look. That is a sign that something is wrong with the data type of the “datetime” column.
c1 = alt.Chart(df).mark_line().encode(
    x="datetime",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)
c1
These values are strings, not datetime objects.
df.dtypes
datetime object
Detroit float64
dtype: object
We convert that column into the datetime data type. We could replace the old column, but here we include it as a new column. (That way, if we make a mistake in this cell, we don’t need to re-load the data.)
df["date"] = pd.to_datetime(df["datetime"])
df.dtypes
datetime object
Detroit float64
date datetime64[ns]
dtype: object
Now the image looks much more natural.
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)
c1
It would be incorrect to use a classifier in this context, because predicting temperature is a regression problem. The following import, for example, would give us the wrong kind of model.
from sklearn.tree import DecisionTreeClassifier
Here is the correct import.
from sklearn.tree import DecisionTreeRegressor
We’ll allow at most 15 leaf nodes when we instantiate the regressor object.
reg = DecisionTreeRegressor(max_leaf_nodes=15)
Here is our usual error. Remember that the first input needs to be two-dimensional.
reg.fit(df["date"], df["Detroit"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/tmp/ipykernel_84/3032300603.py in <module>
----> 1 reg.fit(df["date"], df["Detroit"])
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
1318 sample_weight=sample_weight,
1319 check_input=check_input,
-> 1320 X_idx_sorted=X_idx_sorted,
1321 )
1322 return self
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/tree/_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
164 check_y_params = dict(ensure_2d=False, dtype=None)
165 X, y = self._validate_data(
--> 166 X, y, validate_separately=(check_X_params, check_y_params)
167 )
168 if issparse(X):
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
576 # :(
577 check_X_params, check_y_params = validate_separately
--> 578 X = check_array(X, **check_X_params)
579 y = check_array(y, **check_y_params)
580 else:
/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
771 "Reshape your data either using array.reshape(-1, 1) if "
772 "your data has a single feature or array.reshape(1, -1) "
--> 773 "if it contains a single sample.".format(array)
774 )
775
ValueError: Expected 2D array, got 1D array instead:
array=[1.3505327e+18 1.3505365e+18 1.3505400e+18 1.3505436e+18 1.3505472e+18
 ...
 1.3512528e+18].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
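# A one-column DataFrame (note the double brackets) counts as two-dimensional.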
reg.fit(df[["date"]], df["Detroit"])
DecisionTreeRegressor(max_leaf_nodes=15)
We add the predictions as a new column. I expect this would raise a SettingWithCopyWarning if we hadn’t used copy() above.
df["pred_tree"] = reg.predict(df[["date"]])
Here are the predictions. If you count, there should be 15 horizontal line segments, corresponding to the 15 leaf nodes. (Remember that each leaf node corresponds to a region, and all inputs in a region have the same output.)
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_tree", scale=alt.Scale(zero=False), title="kelvin")
    # Note: inside encode(), color="orange" doesn't work,
    # but color=alt.value("orange") does.
)

c1+c2
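If you would rather count the leaves with code than by eye, scikit-learn can report them directly. (A quick sanity check, not something we rely on below; export_text prints the tree’s splits and the constant value predicted at each leaf.)

from sklearn.tree import export_text

# Should agree with the max_leaf_nodes=15 we specified above.
reg.get_n_leaves()

# The split thresholds are nanosecond timestamps, since that is how
# scikit-learn saw the datetime column internally.
print(export_text(reg))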
This setup is a little different from our other decision tree examples: there is only one input column. So, for example, the feature importances array is not interesting in this case, because there is only one feature.
reg.feature_importances_
array([1.])
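To see a case where this array is more informative, here is a sketch with a second, hypothetical feature, the hour of the day (the names X2 and reg2 are my own):

# Two features: the timestamp and the hour of the day.
X2 = pd.DataFrame({"date": df["date"], "hour": df["date"].dt.hour})
reg2 = DecisionTreeRegressor(max_leaf_nodes=15)
reg2.fit(X2, df["Detroit"])

# Now there are two entries, one per feature, summing to 1; the larger
# entry indicates which feature the tree's splits relied on more.
reg2.feature_importances_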
Let’s see how this compares if we use a random forest with 100 trees, each with at most 15 leaf nodes.
from sklearn.ensemble import RandomForestRegressor
rfe = RandomForestRegressor(n_estimators=100, max_leaf_nodes=15)
rfe.fit(df[["date"]], df["Detroit"])
RandomForestRegressor(max_leaf_nodes=15)
df["pred_forest"] = rfe.predict(df[["date"]])
The image looks pretty similar in this case, but the corners are rounded because of the averaging across the 100 trees. Look, for example, at the very right side of this picture and compare it to the picture above: the random forest plot has a diagonal curve there, which is very different from the straight horizontal segments of the decision tree plot.
c1 = alt.Chart(df).mark_line().encode(
    x="date",
    y=alt.Y("Detroit", scale=alt.Scale(zero=False), title="kelvin")
).properties(
    width=700,
    title="Detroit"
)

c2 = alt.Chart(df).mark_line(color="orange").encode(
    x="date",
    y=alt.Y("pred_forest", scale=alt.Scale(zero=False), title="kelvin")
    # Note: inside encode(), color="orange" doesn't work,
    # but color=alt.value("orange") does.
)

c1+c2
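To see that averaging directly, here is a sketch comparing the forest’s prediction to the mean of its individual trees’ predictions (rfe.estimators_ holds the 100 fitted trees; the names X_arr and tree_preds are my own):

import numpy as np

# The individual trees were fit on a plain array (no column names),
# so we pass a NumPy array to their predict methods.
X_arr = df[["date"]].to_numpy()

# Stack the 100 trees' predictions into a (100, 201) array.
tree_preds = np.stack([t.predict(X_arr) for t in rfe.estimators_])

# The forest's prediction is the mean over the trees; expect True.
np.allclose(tree_preds.mean(axis=0), df["pred_forest"])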
We could definitely keep going, for example by considering overfitting or by adjusting the model so that its plot more closely matches the true data, but this is where we’ll finish the course!