Week 8 Friday#
Announcements#
Midterm 2 is a week from today. Get a notecard if you need one.
There is an in-class quiz on Tuesday of Week 9. It’s the last in-class quiz of the quarter.
Worksheets 15 and 16 are due Tuesday night (instead of the usual Monday night) because of the Memorial Day holiday.
Worksheet 17, due Week 10, is posted in the Week 9 folder. It is meant to get you started on the course project.
Worksheet 18, also due Week 10, is posted in the Week 9 folder. I recommend finishing this worksheet before Midterm 2, as it is mostly review material. Worksheet 18 is the last worksheet for Math 10.
A sample midterm is posted on the Week 9 page on Canvas. Please let me know if you see any mistakes, and I will correct them on that page.
Today we will have lecture for about 15-20 minutes, then about 10 minutes to work on the worksheets, then about 20 minutes to discuss the Course Project.
import pandas as pd
Introduction#
I’ve attached a dataset temperature.csv containing hourly temperatures from Kaggle: source. The unit for those temperatures is Kelvin. I’ve also converted the temperatures to Fahrenheit using the formula (found on Google) \(F = 1.8 \cdot K - 459.67\), and stored the converted temperatures in the csv file temperature_f.csv. We will use temperature_f.csv today.
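The conversion itself is a short pandas computation. Here is a hedged sketch of how temperature_f.csv could have been produced (the exact code is not shown in these notes, and the rounding to one decimal place is an assumption based on the values displayed below):
import pandas as pd

df_k = pd.read_csv("temperature.csv")
# Every column except "datetime" holds Kelvin temperatures.
temp_cols = df_k.columns.drop("datetime")
df_f = df_k.copy()
# Apply F = 1.8*K - 459.67, rounding to one decimal place (an assumption).
df_f[temp_cols] = (1.8 * df_k[temp_cols] - 459.67).round(1)
df_f.to_csv("temperature_f.csv", index=False)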
Eventually (probably on Wednesday next week) I want to give an example of using a Decision Tree to predict probabilities. Today will be spent preparing the data.
Preparing the data#
- Read in the data from temperature_f.csv. Use the parse_dates keyword argument of the read_csv function so the “datetime” column gets processed automatically.
- Store the resulting DataFrame in the variable name df_pre.
Notice how there are many columns. We will eventually shrink this many-column dataset down to just 3 columns, without losing any of the data. (In this first cell we read in the file with no extra keyword arguments, just to see its contents; df_pre comes below.)
temp = pd.read_csv("temperature_f.csv")
temp
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-10-01 13:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.5 | ... | 54.5 | 59.1 | 54.8 | 57.2 | 94.0 | 90.2 | 99.4 | 88.2 | 88.2 | 86.6 |
1 | 2012-10-01 14:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.6 | ... | 54.5 | 59.2 | 54.8 | 57.3 | 94.0 | 88.1 | 99.2 | 88.2 | 88.2 | 86.6 |
2 | 2012-10-01 15:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.4 | 74.3 | 53.8 | ... | 54.7 | 59.3 | 54.9 | 57.3 | 93.6 | 88.0 | 99.1 | 88.2 | 88.2 | 86.6 |
3 | 2012-10-01 16:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 53.9 | ... | 54.9 | 59.5 | 54.9 | 57.4 | 93.2 | 88.0 | 98.9 | 88.2 | 88.2 | 86.6 |
4 | 2012-10-01 17:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 54.0 | ... | 55.0 | 59.6 | 54.9 | 57.5 | 92.7 | 87.9 | 98.8 | 88.2 | 88.2 | 86.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45247 | 2017-11-29 20:00:00 | NaN | 47.9 | NaN | 45.8 | 68.7 | 66.2 | 61.5 | 70.8 | 54.6 | ... | 62.8 | NaN | 35.6 | 58.9 | NaN | NaN | NaN | NaN | NaN | NaN |
45248 | 2017-11-29 21:00:00 | NaN | 49.5 | NaN | 47.3 | 72.6 | 67.3 | 63.4 | 72.4 | 55.9 | ... | 61.0 | NaN | 33.8 | 55.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45249 | 2017-11-29 22:00:00 | NaN | 50.4 | NaN | 49.3 | 73.1 | 67.0 | 64.7 | 73.6 | 55.9 | ... | 56.5 | NaN | 32.6 | 51.4 | NaN | NaN | NaN | NaN | NaN | NaN |
45250 | 2017-11-29 23:00:00 | NaN | 49.8 | NaN | 49.7 | 72.5 | 67.0 | 66.2 | 75.2 | 55.4 | ... | 52.6 | NaN | 30.8 | 48.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45251 | 2017-11-30 00:00:00 | NaN | 48.4 | NaN | 49.8 | 71.2 | 64.8 | 65.3 | 75.2 | 52.8 | ... | 50.5 | NaN | 29.6 | 45.5 | NaN | NaN | NaN | NaN | NaN | NaN |
45252 rows × 37 columns
For practice reading error messages, see if you can tell what we did wrong here. (Scroll to the bottom of the error message.)
df_pre = pd.read_csv("temperature_f.csv", parse_dates="datetime")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_106/2615044441.py in <module>
----> 1 df_pre = pd.read_csv("temperature_f.csv", parse_dates="datetime")
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
608 kwds.update(kwds_defaults)
609
--> 610 return _read(filepath_or_buffer, kwds)
611
612
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
460
461 # Create the parser.
--> 462 parser = TextFileReader(filepath_or_buffer, **kwds)
463
464 if chunksize or iterator:
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
817 self.options["has_index_names"] = kwds["has_index_names"]
818
--> 819 self._engine = self._make_engine(self.engine)
820
821 def close(self):
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1048 )
1049 # error: Too many arguments for "ParserBase"
-> 1050 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1051
1052 def _failover_to_python(self):
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1855 kwds = kwds.copy()
1856
-> 1857 ParserBase.__init__(self, kwds)
1858
1859 # #2442
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, kwds)
1281 self.col_names = None
1282
-> 1283 self.parse_dates = _validate_parse_dates_arg(kwds.pop("parse_dates", False))
1284 self.date_parser = kwds.pop("date_parser", None)
1285 self.dayfirst = kwds.pop("dayfirst", False)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _validate_parse_dates_arg(parse_dates)
1261 if is_scalar(parse_dates):
1262 if not lib.is_bool(parse_dates):
-> 1263 raise TypeError(msg)
1264
1265 elif not isinstance(parse_dates, (list, dict)):
TypeError: Only booleans, lists, and dictionaries are accepted for the 'parse_dates' parameter
We correct the issue by replacing the string "datetime" with the length-one list ["datetime"]. (Comment: Using list("datetime") would not work, because that would produce the length-eight list ["d", "a", ...].)
Overall, this parse_dates keyword argument takes the place of what we did before, which was making an assignment like df["datetime"] = pd.to_datetime(df["datetime"]). The read_csv function does the conversion for us at the same time that we read in the csv file.
df_pre = pd.read_csv("temperature_f.csv", parse_dates=["datetime"])
df_pre
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-10-01 13:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.5 | ... | 54.5 | 59.1 | 54.8 | 57.2 | 94.0 | 90.2 | 99.4 | 88.2 | 88.2 | 86.6 |
1 | 2012-10-01 14:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.6 | ... | 54.5 | 59.2 | 54.8 | 57.3 | 94.0 | 88.1 | 99.2 | 88.2 | 88.2 | 86.6 |
2 | 2012-10-01 15:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.4 | 74.3 | 53.8 | ... | 54.7 | 59.3 | 54.9 | 57.3 | 93.6 | 88.0 | 99.1 | 88.2 | 88.2 | 86.6 |
3 | 2012-10-01 16:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 53.9 | ... | 54.9 | 59.5 | 54.9 | 57.4 | 93.2 | 88.0 | 98.9 | 88.2 | 88.2 | 86.6 |
4 | 2012-10-01 17:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 54.0 | ... | 55.0 | 59.6 | 54.9 | 57.5 | 92.7 | 87.9 | 98.8 | 88.2 | 88.2 | 86.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45247 | 2017-11-29 20:00:00 | NaN | 47.9 | NaN | 45.8 | 68.7 | 66.2 | 61.5 | 70.8 | 54.6 | ... | 62.8 | NaN | 35.6 | 58.9 | NaN | NaN | NaN | NaN | NaN | NaN |
45248 | 2017-11-29 21:00:00 | NaN | 49.5 | NaN | 47.3 | 72.6 | 67.3 | 63.4 | 72.4 | 55.9 | ... | 61.0 | NaN | 33.8 | 55.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45249 | 2017-11-29 22:00:00 | NaN | 50.4 | NaN | 49.3 | 73.1 | 67.0 | 64.7 | 73.6 | 55.9 | ... | 56.5 | NaN | 32.6 | 51.4 | NaN | NaN | NaN | NaN | NaN | NaN |
45250 | 2017-11-29 23:00:00 | NaN | 49.8 | NaN | 49.7 | 72.5 | 67.0 | 66.2 | 75.2 | 55.4 | ... | 52.6 | NaN | 30.8 | 48.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45251 | 2017-11-30 00:00:00 | NaN | 48.4 | NaN | 49.8 | 71.2 | 64.8 | 65.3 | 75.2 | 52.8 | ... | 50.5 | NaN | 29.6 | 45.5 | NaN | NaN | NaN | NaN | NaN | NaN |
45252 rows × 37 columns
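As a quick check (a small sketch), we can confirm that the “datetime” column really was parsed, by looking at its dtype:
# The column now has a datetime dtype, so the .dt accessor works
# without any further conversion step.
print(df_pre["datetime"].dtype)          # datetime64[ns]
print(df_pre["datetime"].dt.year.min())  # 2012, matching the first rows above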
Plotting the data: attempt 1#
Say you want to make an Altair line chart showing the temperatures for the cities ["San Diego", "San Francisco", "Detroit"] in different colors. That would be a little annoying with our current techniques. Here is the best approach I could come up with.
The main thing I want you to get from this section is that it is quite inconvenient to try to make this basic chart from our DataFrame.
import altair as alt
We will make three charts, each starting from the following base. For later reference, notice that we are using only the first 400 rows of the DataFrame.
# Base chart using only the first 400 rows; each city's chart will
# add its own y encoding and color on top of this.
base = alt.Chart(df_pre[:400]).mark_line().encode(
    x="datetime"
).properties(
    width=600
)
Here is a helper function, which takes as input a city, a color, and a base chart (like the one we just defined), and as output returns the temperature chart for the given city in the given color.
def make_chart(city, color, base):
    # Encode the given city's column on the y-axis and draw the line
    # in the given (constant) color.
    c = base.encode(
        y=city,
        color=alt.value(color),
    )
    return c
Here is an example of the output produced by our make_chart function.
make_chart("Seattle", "yellow", base)
Here is a length-three helper list that we will use. (I reiterate that the overall point in this section is that it is not easy to make our simple chart, with the data in its current form.)
pair_list = [
("San Diego", "red"),
("San Francisco", "blue"),
("Detroit", "orange")
]
Here we use a list comprehension to get our three desired Altair charts in a list.
chart_list = [make_chart(city, color, base) for city, color in pair_list]
This does include three charts (the same as the length of pair_list).
len(chart_list)
3
Here is an example of one of the charts in chart_list.
chart_list[0]
We can input these three charts to the alt.layer function, but see below for a way to do this with less typing.
alt.layer(chart_list[0], chart_list[1], chart_list[2])
Here is the “correct” way to pass the entries in chart_list as arguments to the alt.layer function.
Aside from the procedure in this section taking numerous steps, there are other issues as well. For example, how can we tell from the following chart which color corresponds to which city? Related to this: there would be no way to add a tooltip to this chart that would display the corresponding city. (It would be no problem to add a tooltip displaying the temperature or the date.)
alt.layer(*chart_list)
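The * in alt.layer(*chart_list) is Python's argument unpacking. Here is a tiny sketch of the idea on its own, independent of Altair (the function add3 is just a made-up example):
def add3(a, b, c):
    return a + b + c

args = [10, 20, 30]
# f(*some_list) passes the entries of some_list as separate positional arguments.
print(add3(*args))  # same as add3(10, 20, 30); prints 60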
Convert from wide-form to long-form data#
You don’t need to know this melt method for Midterm 2, but many datasets you find (for example, for your course project) will benefit from this melting.
The basic motto for using melt is that we don’t want information stored in the column names. In our current DataFrame, the city names are stored as column names.
For example, why couldn’t we include a tooltip above indicating the city name? Because there was no column containing the city name. The melt method will produce a much longer DataFrame which does have a column containing the city names.
Use the DataFrame method melt to create a much longer DataFrame that has only three columns, “datetime”, “city”, and “temperature”. Name the DataFrame df_melted.
df_pre.sample(4, random_state=0)
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37037 | 2016-12-22 18:00:00 | 38.7 | 31.7 | 52.9 | 34.9 | 55.3 | 60.9 | 45.6 | 60.8 | 34.1 | ... | 46.8 | 43.6 | 32.0 | 36.4 | 44.3 | 58.4 | 57.5 | 55.4 | 55.4 | 58.7 |
23197 | 2015-05-26 02:00:00 | 61.6 | 60.6 | 56.2 | 60.3 | 65.7 | 64.0 | 75.3 | 78.4 | 60.3 | ... | 75.0 | 73.2 | 62.5 | 66.9 | 58.2 | 59.7 | 71.4 | 69.9 | 69.9 | 65.1 |
33027 | 2016-07-08 16:00:00 | 59.9 | 62.0 | 59.3 | 59.9 | 70.9 | 70.5 | 90.6 | 95.2 | 79.6 | ... | 87.7 | 83.8 | 69.4 | 66.5 | 87.3 | 87.4 | 104.4 | 97.0 | 79.0 | 87.4 |
22542 | 2015-04-28 19:00:00 | 50.6 | 55.1 | 57.0 | 56.8 | 79.2 | 82.3 | 74.1 | 78.5 | 51.3 | ... | 63.6 | 63.6 | 58.7 | 50.8 | 75.0 | 70.7 | 74.1 | 70.7 | 66.9 | 70.6 |
4 rows × 37 columns
I couldn’t remember what the keyword arguments were for this melt method, so I looked them up using the help function.
help(df_pre.melt)
Help on method melt in module pandas.core.frame:
melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (`id_vars`), while all other
columns, considered measured variables (`value_vars`), are "unpivoted" to
the row axis, leaving just two non-identifier columns, 'variable' and
'value'.
Parameters
----------
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that
are not set as `id_vars`.
var_name : scalar
Name to use for the 'variable' column. If None it uses
``frame.columns.name`` or 'variable'.
value_name : scalar, default 'value'
Name to use for the 'value' column.
col_level : int or str, optional
If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
If True, original index is ignored. If False, the original index is retained.
Index labels will be repeated as necessary.
.. versionadded:: 1.1.0
Returns
-------
DataFrame
Unpivoted DataFrame.
See Also
--------
melt : Identical method.
pivot_table : Create a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivot : Return reshaped DataFrame organized
by given index / column values.
DataFrame.explode : Explode a DataFrame from list-like
columns to long format.
Examples
--------
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
... 'B': {0: 1, 1: 3, 2: 5},
... 'C': {0: 2, 1: 4, 2: 6}})
>>> df
A B C
0 a 1 2
1 b 3 4
2 c 5 6
>>> df.melt(id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'])
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
The names of 'variable' and 'value' columns can be customized:
>>> df.melt(id_vars=['A'], value_vars=['B'],
... var_name='myVarname', value_name='myValname')
A myVarname myValname
0 a B 1
1 b B 3
2 c B 5
Original index values can be kept around:
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
A variable value
0 a B 1
1 b B 3
2 c B 5
0 a C 2
1 b C 4
2 c C 6
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')]
>>> df
A B C
D E F
0 a 1 2
1 b 3 4
2 c 5 6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')])
(A, D) variable_0 variable_1 value
0 a B E 1
1 b B E 3
2 c B E 5
Here we specify:

- id_vars: Which columns do we want to keep? In this case, it is only the “datetime” column that we want to keep.
- var_name: All remaining column names will be placed into this column. Because the column names represent cities, an obvious choice of name for this new column is “city”.
- value_name: The previous entries in the DataFrame (like 38.7) will be placed into this column. Because these values represent temperatures, we use the name “temperature”.
This melt method is a little hard to understand without seeing an example, so the DataFrame shown below should make it clearer what’s going on.
df_melted = df_pre.melt(
    id_vars="datetime",        # keep "datetime" as an identifier column
    var_name="city",           # the old column names (the cities) go here
    value_name="temperature"   # the old DataFrame entries go here
)
Here we look at 4 random rows from our newly melted DataFrame. Notice how it is down to just three columns, corresponding to what we specified above. (It is also possible to specify multiple columns for the id_vars argument.) Notice how the old column names now appear in their own “city” column, and the old DataFrame entries now appear in the “temperature” column.
Our resulting DataFrame df_melted is much longer than df_pre, because every temperature value in df_pre now lives in its own row in df_melted.
df_melted.sample(4, random_state=0)
datetime | city | temperature | |
---|---|---|---|
1478577 | 2016-03-25 22:00:00 | Eilat | 78.0 |
1299334 | 2016-06-07 11:00:00 | Montreal | 61.1 |
1072194 | 2016-05-01 19:00:00 | Miami | 84.5 |
863753 | 2013-03-15 18:00:00 | Atlanta | 48.5 |
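As a sanity check (a quick sketch), we can compare the shapes. Melting turned each of the 36 city columns of df_pre into its own batch of 45252 rows:
print(df_pre.shape)     # (45252, 37)
print(df_melted.shape)  # (1629072, 3), since 45252 * 36 == 1629072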
Now we do some more pre-processing of this data.

- Keep only the rows corresponding to the cities in the following cities list.
- Drop the rows with missing values.
- Sort the DataFrame by date.
- Name the resulting DataFrame df.
cities = ["San Diego", "San Francisco", "Detroit"]
To keep only the rows corresponding to the three given cities, we will use Boolean indexing with the isin method. We check that those cities do seem to be the only cities appearing.
df1 = df_melted[df_melted["city"].isin(cities)]
df1.sample(4, random_state=0)
datetime | city | temperature | |
---|---|---|---|
267876 | 2017-07-01 13:00:00 | San Diego | 64.1 |
112689 | 2015-04-13 22:00:00 | San Francisco | 59.8 |
921169 | 2014-08-04 14:00:00 | Detroit | 76.4 |
104765 | 2014-05-18 18:00:00 | San Francisco | 62.8 |
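A random sample is not a proof, so here is a more direct check (a quick sketch) that only our three cities remain:
# The distinct values in the "city" column should be exactly our three cities.
print(df1["city"].unique())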
We want to drop rows containing missing values (rather than dropping entire columns), which is why we use the axis=0 keyword argument: axis=0 refers to the rows of a DataFrame, and axis=1 refers to the columns.
df2 = df1.dropna(axis=0)
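As a quick sketch of what dropna accomplished, we can compare row counts and confirm no missing values remain:
print(len(df1), len(df2))      # df2 has at most as many rows as df1
print(df2.isna().any().any())  # expect False: no missing values remain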
I’m not sure if we’ve used sort_values so far in Math 10, but it’s a very natural method to use. In this case we sort the DataFrame by the values in the “datetime” column. (If you want the values sorted in reverse order, you can pass the keyword argument ascending=False.)
df = df2.sort_values("datetime")
Notice how we start out at 1pm on October 1st, 2012, and then we go to 2pm, etc.
df.head(4)
datetime | city | temperature | |
---|---|---|---|
90504 | 2012-10-01 13:00:00 | San Francisco | 61.4 |
226260 | 2012-10-01 13:00:00 | San Diego | 65.1 |
905040 | 2012-10-01 13:00:00 | Detroit | 51.6 |
905041 | 2012-10-01 14:00:00 | Detroit | 51.7 |
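As one more quick check (a sketch), each timestamp should now appear at most three times, once per city, and possibly fewer where a missing value was dropped:
print(df["datetime"].value_counts().max())  # expect 3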
Plotting the data: attempt 2#
Now that we have “melted” the DataFrame, it will be much easier to plot the data.
Plot the data by adapting the following.
alt.Chart(df[???]).mark_line().encode(
x="datetime",
???
).properties(
width=600
)
What number of rows should we use to match the plot from above?
Now that our data is in “long form” instead of “wide form”, it is much easier to plot using Altair. Notice also how a legend appears automatically on the right side, indicating which color corresponds to which city.
Lastly, recall that above we were plotting only the first 400 rows of the DataFrame. Those first 400 rows contained 400 values for many different cities. Now each row corresponds to a single city. Since we have three cities, to produce the same plot, we need to use \(3 \cdot 400 = 1200\) rows.
alt.Chart(df[:3*400]).mark_line().encode(
x="datetime",
y="temperature",
color="city",
tooltip=["city", "temperature"]
).properties(
width=600
)
Preview of Week 9 Wednesday#
On Wednesday of next week, I want to use a DecisionTreeRegressor object to try to predict the temperature. I want the city to be one of the inputs, but decision trees need numerical inputs, so we will add some Boolean “indicator” columns, indicating with True or False whether a row corresponds to “Detroit”, and so on.
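As a hedged preview (next week’s lecture may build these columns differently), one standard way to make such indicator columns is the pd.get_dummies function:
# One 0/1 indicator column per city; the result keeps df's index,
# so we can concatenate the new columns alongside the originals.
df_ind = pd.get_dummies(df["city"])
df_encoded = pd.concat([df, df_ind], axis=1)
df_encoded.head(3)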
10 minutes to work on the Worksheets#
Discussion of the Course Project#
See the instructions in the course notes.