Week 10 Monday#

Announcements#

  • Midterms will be returned at the end of this week.

  • Two worksheets due tonight.

  • No more quizzes or exams. After tonight’s worksheets, your main responsibility is to complete the course project.

  • Plan for Wednesday: some time to work on the project, and for most of the class, an introduction to programming using ChatGPT (or actually Bing chat).

import pandas as pd
import numpy as np
import altair as alt
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

The numbers in the decision tree diagram#

The file cleaned_temperature.csv contains the data we finished with on Wednesday. (So we’ve already melted the data, removed rows with missing values, restricted the result to three cities, added indicator columns for the cities, and added numerical columns related to the datetime value.)
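For reference, that preprocessing looked roughly like the following sketch. (This is a reconstruction, not Wednesday's exact code; in particular, the raw file name and column layout are assumptions.)

df_raw = pd.read_csv("temperature.csv")  # hypothetical raw file: one column per city

# Melt to long format: one row per (datetime, city) pair.
df = df_raw.melt(id_vars="datetime", var_name="city", value_name="temperature")

# Remove rows with missing values and restrict to three cities.
df = df.dropna()
df = df[df["city"].isin(["Detroit", "San Diego", "San Francisco"])]

# Indicator columns: city_Detroit, city_San Diego, city_San Francisco.
df = pd.concat([df, pd.get_dummies(df["city"], prefix="city")], axis=1)

# Numerical columns related to the datetime value.
df["datetime"] = pd.to_datetime(df["datetime"])
for attr in ["day", "month", "year", "hour"]:
    df[attr] = getattr(df["datetime"].dt, attr)
df["day_of_week"] = df["datetime"].dt.dayofweek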

  • Load the attached csv file and keep only the first 20000 rows (to speed up the fit step below).

df = pd.read_csv("cleaned_temperature.csv")[:20000]
  • Plot the data from rows 6400 to 7000 using the following code.

alt.Chart(df[6400:7000]).mark_line().encode(
    x="datetime",
    y="temperature",
    color="city",
    tooltip=["city", "temperature", "datetime"]
).properties(
    width=600
)
  • Use a format keyword argument to an Altair Axis object (documentation) so that we see the day of the week in the chart.

Notice how bad the x-axis looks. Something has clearly “gone wrong”.

alt.Chart(df[6400:7000]).mark_line().encode(
    x="datetime",
    y="temperature",
    color="city",
    tooltip=["city", "temperature", "datetime"]
).properties(
    width=600
)

The problem is that we forgot to use the parse_dates keyword argument. (Another option would be to use pd.to_datetime… that’s what we mostly did in the first half of Math 10.)

df = pd.read_csv("cleaned_temperature.csv", parse_dates=["datetime"])[:20000]
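(For reference, the pd.to_datetime alternative mentioned above would look something like this sketch.)

df = pd.read_csv("cleaned_temperature.csv")[:20000]
df["datetime"] = pd.to_datetime(df["datetime"])  # convert the column after loading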

Notice how much better the x-axis looks. We also use an Altair Axis object with the format keyword argument to specify that the name of the day of the week should be shown along the axis. (See the format portion of the above documentation link.) We use "%A" to get the full weekday name.

alt.Chart(df[6400:7000]).mark_line().encode(
    x=alt.X("datetime", axis=alt.Axis(format="%A")),
    y="temperature",
    color="city",
    tooltip=["city", "temperature", "datetime"]
).properties(
    width=600
)

As another example (again, this is based on following the above documentation link), here we use the short form of the day name (“Sat” instead of “Saturday”, because we used “%a” instead of “%A”), and also list the hour and whether it is AM or PM.

alt.Chart(df[6400:7000]).mark_line().encode(
    x=alt.X("datetime", axis=alt.Axis(format="%a %I %p")),
    y="temperature",
    color="city",
    tooltip=["city", "temperature", "datetime"]
).properties(
    width=600
)

We will fit our decision tree using the following predictor columns.

var_cols = ["year", "day", "month", "hour", "day_of_week"]
city_cols = [c for c in df.columns if c.startswith("city_")]
cols = var_cols + city_cols

Here are the eight input variables we will use.

cols
['year',
 'day',
 'month',
 'hour',
 'day_of_week',
 'city_Detroit',
 'city_San Diego',
 'city_San Francisco']
  • Instantiate a DecisionTreeRegressor object specifying 10 for the maximum number of leaf nodes.

  • Instead of using the default Mean Squared Error (MSE) criterion, specify the Mean Absolute Error (MAE) criterion.

Be sure you’re using a DecisionTreeRegressor, and not a DecisionTreeClassifier. We are trying to predict temperature, which is a quantitative value, not a discrete value. That’s why this is a regression task.

from sklearn.tree import DecisionTreeRegressor

Just for fun, and to see an alternative, we specify a non-default criterion: Mean Absolute Error instead of the default Mean Squared Error.

tree = DecisionTreeRegressor(max_leaf_nodes=10, criterion="absolute_error")
  • Fit the decision tree. (Even with only 20,000 rows and 10 leaf nodes, this seems quite a bit slower than using MSE. It took about 30 seconds when I tried. I couldn’t get it to work at all using the full 100,000 row dataset.)

tree.fit(df[cols], df["temperature"])
DecisionTreeRegressor(criterion='absolute_error', max_leaf_nodes=10)
  • Make the corresponding decision tree diagram using the following code.

fig = plt.figure(figsize=(20, 10))
_ = plot_tree(tree,
              feature_names=tree.feature_names_in_,
              filled=False)
(Output: the decision tree diagram.)

We are going to focus on one particular leaf node towards the middle left of the diagram: the one with absolute error 5.586 and 1080 samples. Locate that node now, because we will refer to it numerous times below.

  • Using Boolean indexing, can you identify what the values in one of the leaf nodes represent?

At the very first split in the decision tree, we take the left-hand branch, corresponding to city_Detroit <= 0.5. This is like saying city_Detroit == 0, which is like saying city_Detroit == False, which is like saying the city is not Detroit. That is the basis for our first Boolean Series ser1 in the following definition.

ser1 = df["city"] != "Detroit"
ser1
0         True
1         True
2        False
3        False
4         True
         ...  
19995     True
19996    False
19997     True
19998     True
19999     True
Name: city, Length: 20000, dtype: bool

The next branch corresponds to month <= 3.5. The branch after that is a little different: it corresponds to traveling to the right from the split hour <= 17.5. Because we are traveling to the right rather than the left, we are in the portion of the data for which hour > 17.5. That is why we set ser3 = df["hour"] > 17.5.

ser2 = df["month"] <= 3.5
ser3 = df["hour"] > 17.5

We want the portion of the data where all three conditions hold. So we combine these Boolean Series together using &. We set df_sub to be the rows from df for which these three conditions are all satisfied. Here we are using Boolean indexing.

df_sub = df[ser1 & ser2 & ser3]

Notice that the rows in df_sub are indeed not from Detroit, are from one of the months January through March, and are from an hour later than 5pm.

df_sub.head(4)

|      | datetime            | temperature | city_Detroit | city_San Diego | city_San Francisco | city          | day | month | year | hour | day_of_week |
|------|---------------------|-------------|--------------|----------------|--------------------|---------------|-----|-------|------|------|-------------|
| 6639 | 2013-01-01 18:00:00 | 47.3        | 0            | 0              | 1                  | San Francisco | 1   | 1     | 2013 | 18   | 1           |
| 6641 | 2013-01-01 18:00:00 | 51.2        | 0            | 1              | 0                  | San Diego     | 1   | 1     | 2013 | 18   | 1           |
| 6642 | 2013-01-01 19:00:00 | 48.6        | 0            | 0              | 1                  | San Francisco | 1   | 1     | 2013 | 19   | 1           |
| 6644 | 2013-01-01 19:00:00 | 54.8        | 0            | 1              | 0                  | San Diego     | 1   | 1     | 2013 | 19   | 1           |
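If we want more than a spot-check of the first four rows, checks along the following lines (a quick sketch, not something we ran in class) would confirm the three conditions hold for every row of df_sub.

df_sub["city"].unique()   # should not include 'Detroit'
df_sub["month"].unique()  # should only contain 1, 2, 3
df_sub["hour"].min()      # should be at least 18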

Look back at the leaf node in our tree diagram. We are ready to see what one of the values there represents. The samples = 1080 corresponds to how many data points are in our region of the data, or equivalently, how many rows are in df_sub. Indeed, we check that df_sub has 1080 rows.

df_sub.shape
(1080, 11)

What about the value 59.15, which is the number our decision tree will predict for any input in our region? I would have expected it to be the average of the values for our 1080 data points, but it’s not.

df_sub["temperature"].mean()
59.63074074074074

Instead, because we set criterion="absolute_error", scikit-learn automatically uses the median instead of the mean. In general, you should think of Mean Absolute Error and the median as less sensitive to outliers, and Mean Squared Error and the mean as more sensitive to outliers. As a simple example involving mean vs median: the mean of [70, 70, 74, 80, 10**100] will be huge, roughly \(10^{99}\), but the median of [70, 70, 74, 80, 10**100] will simply be 74; the outlier value has no influence on the median calculation.
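Here is a quick check of that example using Python’s built-in statistics module (this computation wasn’t part of class):

from statistics import mean, median

vals = [70, 70, 74, 80, 10**100]
mean(vals)    # huge, roughly 2e99: dominated by the outlier
median(vals)  # 74: the outlier has no influence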

Back to the value 59.15 shown in the leaf node: that is indeed the median of the 1080 values in our sub-DataFrame.

df_sub["temperature"].median()
59.150000000000006

Lastly, let’s see where the absolute_error = 5.586 comes from. We could compute MAE on our own, but let’s go ahead and import the relevant function from sklearn.metrics.

from sklearn.metrics import mean_absolute_error

We aren’t allowed to compare with a float directly.

mean_absolute_error(df_sub["temperature"], 59.15)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/tmp/ipykernel_102/303660946.py in <module>
----> 1 mean_absolute_error(df_sub["temperature"], 59.15)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/metrics/_regression.py in mean_absolute_error(y_true, y_pred, sample_weight, multioutput)
    190     """
    191     y_type, y_true, y_pred, multioutput = _check_reg_targets(
--> 192         y_true, y_pred, multioutput
    193     )
    194     check_consistent_length(y_true, y_pred, sample_weight)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/metrics/_regression.py in _check_reg_targets(y_true, y_pred, multioutput, dtype)
     92         the dtype argument passed to check_array.
     93     """
---> 94     check_consistent_length(y_true, y_pred)
     95     y_true = check_array(y_true, ensure_2d=False, dtype=dtype)
     96     y_pred = check_array(y_pred, ensure_2d=False, dtype=dtype)

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in check_consistent_length(*arrays)
    327     """
    328 
--> 329     lengths = [_num_samples(X) for X in arrays if X is not None]
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in <listcomp>(.0)
    327     """
    328 
--> 329     lengths = [_num_samples(X) for X in arrays if X is not None]
    330     uniques = np.unique(lengths)
    331     if len(uniques) > 1:

/shared-libs/python3.7/py/lib/python3.7/site-packages/sklearn/utils/validation.py in _num_samples(x)
    263             x = np.asarray(x)
    264         else:
--> 265             raise TypeError(message)
    266 
    267     if hasattr(x, "shape") and x.shape is not None:

TypeError: Expected sequence or array-like, got <class 'float'>

Just to see a new NumPy function, here is an example of how np.full_like works. The following puts three copies of 59.15 into a NumPy array (because our first argument has three elements).

(I didn’t notice this during class, but the values also get cast to integers, because np.full_like uses the dtype of its first argument by default, and [1, 3, 5] is an integer array.)

np.full_like([1,3,5], 59.15)
array([59, 59, 59])
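If we want the actual float values, one option (not something we did in class) is to pass the dtype keyword argument, which overrides the dtype inherited from the first argument.

np.full_like([1, 3, 5], 59.15, dtype=float)  # array([59.15, 59.15, 59.15])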

Here is a fancy way to get the correct number of 59.15 values in a NumPy array. This can be compared to df_sub["temperature"] using mean_absolute_error.

y_pred = np.full_like(df_sub["temperature"], 59.15)
y_pred
array([59.15, 59.15, 59.15, ..., 59.15, 59.15, 59.15])

Notice that this y_pred does indeed have the correct number of entries. We are calling it y_pred because these are the predictions that our decision tree will make for every input in df_sub. (Each leaf node has a single output. For this particular leaf node, the output is 59.15.)

len(y_pred)
1080

Now we can finally reproduce the number 5.586 from the tree diagram.

mean_absolute_error(df_sub["temperature"], y_pred)
5.586111111111111
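As a check, the same number can be computed by hand (a one-line sketch, not run in class): Mean Absolute Error is just the mean of the absolute deviations from the predicted value.

(df_sub["temperature"] - 59.15).abs().mean()  # matches 5.586111111111111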

Notice that our array y_pred does indeed match the predicted values for our sub-DataFrame.

tree.predict(df_sub[cols])
array([59.15, 59.15, 59.15, ..., 59.15, 59.15, 59.15])
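One caveat: an exact == comparison with y_pred could fail, because the median printed above as 59.150000000000006, while y_pred holds exactly 59.15. A quick sketch of a tolerance-based check:

np.allclose(tree.predict(df_sub[cols]), y_pred)  # True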

We didn’t get further than this. I don’t think we will continue with this on Wednesday; instead we’ll see an example of using Bing Chat for programming.

Inadequacy of train_test_split#

I don’t think we will get this far today. (A sketch of these steps appears after the list below.)

  • Load the full dataset.

  • Divide the data using train_test_split, using 80% of the rows for the training set.

  • Instantiate a new decision tree, this time using the default criterion choice (MSE) and allowing a maximum of 1000 leaf nodes.

  • Fit the data. (What set should we use?)

  • What is the Mean Squared Error for our predictions on the training set? On the test set?
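Since we didn’t get to this section in class, here is a sketch of what those steps might look like. (This wasn’t run in class; it assumes the same csv file and the same cols list as above, and the random_state value is arbitrary.)

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the full dataset (no row limit this time).
df = pd.read_csv("cleaned_temperature.csv", parse_dates=["datetime"])

# Use 80% of the rows for the training set.
X_train, X_test, y_train, y_test = train_test_split(
    df[cols], df["temperature"], train_size=0.8, random_state=0
)

# Default criterion (squared error) this time, at most 1000 leaf nodes.
tree2 = DecisionTreeRegressor(max_leaf_nodes=1000)
tree2.fit(X_train, y_train)  # fit using the training set only

# A much lower error on the training set than on the test set
# would be a sign of overfitting.
mean_squared_error(y_train, tree2.predict(X_train))
mean_squared_error(y_test, tree2.predict(X_test))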