Week 8 Friday#
Announcements#
Midterm 2 is a week from today. Get a notecard if you need one.
There is an in-class quiz on Tuesday of Week 9. It’s the last in-class quiz of the quarter.
Worksheets 15 and 16 are due Tuesday night (instead of the usual Monday night) because of the Memorial Day holiday.
Worksheet 17, due Week 10, is posted in the Week 9 folder. It is meant to get you started on the course project.
Worksheet 18, also due Week 10, is posted in the Week 9 folder. I recommend finishing this worksheet before Midterm 2, as it is mostly review material. Worksheet 18 is the last worksheet for Math 10.
A sample midterm is posted on the Week 9 page on Canvas. Please let me know if you see any mistakes, and I will correct them on that page.
Today we will have lecture for about 15-20 minutes, then about 10 minutes to work on the worksheets, then about 20 minutes to discuss the Course Project.
import pandas as pd
Introduction#
I’ve attached a dataset temperature.csv containing hourly temperatures from Kaggle: source. The unit for those temperatures is Kelvin. I’ve also converted the temperatures to Fahrenheit using the formula (found on Google) \(F = 1.8 \cdot K - 459.67\), and stored the converted temperatures in the csv file temperature_f.csv. We will use temperature_f.csv today.
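The conversion itself is a short pandas computation. Here is a hedged sketch of how temperature_f.csv could have been produced (the exact code is not shown in these notes, and the rounding to one decimal place is an assumption based on the values displayed below):
import pandas as pd

df_k = pd.read_csv("temperature.csv")
# Every column except "datetime" holds Kelvin temperatures.
temp_cols = df_k.columns.drop("datetime")
df_f = df_k.copy()
# Apply F = 1.8*K - 459.67, rounding to one decimal place (an assumption).
df_f[temp_cols] = (1.8 * df_k[temp_cols] - 459.67).round(1)
df_f.to_csv("temperature_f.csv", index=False)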
Eventually (probably on Wednesday next week) I want to give an example of using a Decision Tree to predict probabilities. Today will be spent preparing the data.
Preparing the data#
- Read in the data from temperature_f.csv. Use the parse_dates keyword argument of the read_csv function so the “datetime” column gets processed automatically.
- Store the resulting DataFrame in the variable name df_pre.
Notice how there are many columns. We will eventually shrink this many-column dataset down to just 3 columns, without losing any of the data. (In this first cell we read in the file with no extra keyword arguments, just to see its contents; df_pre comes below.)
temp = pd.read_csv("temperature_f.csv")
temp
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-10-01 13:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.5 | ... | 54.5 | 59.1 | 54.8 | 57.2 | 94.0 | 90.2 | 99.4 | 88.2 | 88.2 | 86.6 |
1 | 2012-10-01 14:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.6 | ... | 54.5 | 59.2 | 54.8 | 57.3 | 94.0 | 88.1 | 99.2 | 88.2 | 88.2 | 86.6 |
2 | 2012-10-01 15:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.4 | 74.3 | 53.8 | ... | 54.7 | 59.3 | 54.9 | 57.3 | 93.6 | 88.0 | 99.1 | 88.2 | 88.2 | 86.6 |
3 | 2012-10-01 16:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 53.9 | ... | 54.9 | 59.5 | 54.9 | 57.4 | 93.2 | 88.0 | 98.9 | 88.2 | 88.2 | 86.6 |
4 | 2012-10-01 17:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 54.0 | ... | 55.0 | 59.6 | 54.9 | 57.5 | 92.7 | 87.9 | 98.8 | 88.2 | 88.2 | 86.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45247 | 2017-11-29 20:00:00 | NaN | 47.9 | NaN | 45.8 | 68.7 | 66.2 | 61.5 | 70.8 | 54.6 | ... | 62.8 | NaN | 35.6 | 58.9 | NaN | NaN | NaN | NaN | NaN | NaN |
45248 | 2017-11-29 21:00:00 | NaN | 49.5 | NaN | 47.3 | 72.6 | 67.3 | 63.4 | 72.4 | 55.9 | ... | 61.0 | NaN | 33.8 | 55.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45249 | 2017-11-29 22:00:00 | NaN | 50.4 | NaN | 49.3 | 73.1 | 67.0 | 64.7 | 73.6 | 55.9 | ... | 56.5 | NaN | 32.6 | 51.4 | NaN | NaN | NaN | NaN | NaN | NaN |
45250 | 2017-11-29 23:00:00 | NaN | 49.8 | NaN | 49.7 | 72.5 | 67.0 | 66.2 | 75.2 | 55.4 | ... | 52.6 | NaN | 30.8 | 48.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45251 | 2017-11-30 00:00:00 | NaN | 48.4 | NaN | 49.8 | 71.2 | 64.8 | 65.3 | 75.2 | 52.8 | ... | 50.5 | NaN | 29.6 | 45.5 | NaN | NaN | NaN | NaN | NaN | NaN |
45252 rows × 37 columns
For practice reading error messages, see if you can tell what we did wrong here. (Scroll to the bottom of the error message.)
df_pre = pd.read_csv("temperature_f.csv", parse_dates="datetime")
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/tmp/ipykernel_106/2615044441.py in <module>
----> 1 df_pre = pd.read_csv("temperature_f.csv", parse_dates="datetime")
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
608 kwds.update(kwds_defaults)
609
--> 610 return _read(filepath_or_buffer, kwds)
611
612
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
460
461 # Create the parser.
--> 462 parser = TextFileReader(filepath_or_buffer, **kwds)
463
464 if chunksize or iterator:
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
817 self.options["has_index_names"] = kwds["has_index_names"]
818
--> 819 self._engine = self._make_engine(self.engine)
820
821 def close(self):
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
1048 )
1049 # error: Too many arguments for "ParserBase"
-> 1050 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1051
1052 def _failover_to_python(self):
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
1855 kwds = kwds.copy()
1856
-> 1857 ParserBase.__init__(self, kwds)
1858
1859 # #2442
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, kwds)
1281 self.col_names = None
1282
-> 1283 self.parse_dates = _validate_parse_dates_arg(kwds.pop("parse_dates", False))
1284 self.date_parser = kwds.pop("date_parser", None)
1285 self.dayfirst = kwds.pop("dayfirst", False)
/shared-libs/python3.7/py/lib/python3.7/site-packages/pandas/io/parsers.py in _validate_parse_dates_arg(parse_dates)
1261 if is_scalar(parse_dates):
1262 if not lib.is_bool(parse_dates):
-> 1263 raise TypeError(msg)
1264
1265 elif not isinstance(parse_dates, (list, dict)):
TypeError: Only booleans, lists, and dictionaries are accepted for the 'parse_dates' parameter
We correct the issue by replacing the string "datetime" with the length-one list ["datetime"]. (Comment: Using list("datetime") would not work, because that would produce the length-eight list ["d", "a", ...].)
Overall, this parse_dates keyword argument takes the place of what we did before, which was making an assignment like df["datetime"] = pd.to_datetime(df["datetime"]). The read_csv function does the conversion for us at the same time that we read in the csv file.
df_pre = pd.read_csv("temperature_f.csv", parse_dates=["datetime"])
df_pre
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2012-10-01 13:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.5 | ... | 54.5 | 59.1 | 54.8 | 57.2 | 94.0 | 90.2 | 99.4 | 88.2 | 88.2 | 86.6 |
1 | 2012-10-01 14:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.5 | 74.2 | 53.6 | ... | 54.5 | 59.2 | 54.8 | 57.3 | 94.0 | 88.1 | 99.2 | 88.2 | 88.2 | 86.6 |
2 | 2012-10-01 15:00:00 | 52.7 | 48.1 | 61.4 | 47.6 | 65.7 | 65.1 | 68.4 | 74.3 | 53.8 | ... | 54.7 | 59.3 | 54.9 | 57.3 | 93.6 | 88.0 | 99.1 | 88.2 | 88.2 | 86.6 |
3 | 2012-10-01 16:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 53.9 | ... | 54.9 | 59.5 | 54.9 | 57.4 | 93.2 | 88.0 | 98.9 | 88.2 | 88.2 | 86.6 |
4 | 2012-10-01 17:00:00 | 52.7 | 48.1 | 61.3 | 47.5 | 65.7 | 65.1 | 68.4 | 74.3 | 54.0 | ... | 55.0 | 59.6 | 54.9 | 57.5 | 92.7 | 87.9 | 98.8 | 88.2 | 88.2 | 86.6 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
45247 | 2017-11-29 20:00:00 | NaN | 47.9 | NaN | 45.8 | 68.7 | 66.2 | 61.5 | 70.8 | 54.6 | ... | 62.8 | NaN | 35.6 | 58.9 | NaN | NaN | NaN | NaN | NaN | NaN |
45248 | 2017-11-29 21:00:00 | NaN | 49.5 | NaN | 47.3 | 72.6 | 67.3 | 63.4 | 72.4 | 55.9 | ... | 61.0 | NaN | 33.8 | 55.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45249 | 2017-11-29 22:00:00 | NaN | 50.4 | NaN | 49.3 | 73.1 | 67.0 | 64.7 | 73.6 | 55.9 | ... | 56.5 | NaN | 32.6 | 51.4 | NaN | NaN | NaN | NaN | NaN | NaN |
45250 | 2017-11-29 23:00:00 | NaN | 49.8 | NaN | 49.7 | 72.5 | 67.0 | 66.2 | 75.2 | 55.4 | ... | 52.6 | NaN | 30.8 | 48.2 | NaN | NaN | NaN | NaN | NaN | NaN |
45251 | 2017-11-30 00:00:00 | NaN | 48.4 | NaN | 49.8 | 71.2 | 64.8 | 65.3 | 75.2 | 52.8 | ... | 50.5 | NaN | 29.6 | 45.5 | NaN | NaN | NaN | NaN | NaN | NaN |
45252 rows × 37 columns
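As a quick check (a small sketch), we can confirm that the “datetime” column really was parsed, by looking at its dtype:
# The column now has a datetime dtype, so the .dt accessor works
# without any further conversion step.
print(df_pre["datetime"].dtype)          # datetime64[ns]
print(df_pre["datetime"].dt.year.min())  # 2012, matching the first rows above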
Plotting the data: attempt 1#
Say you want to make an Altair line chart showing the temperatures for the cities ["San Diego", "San Francisco", "Detroit"] in different colors. That would be a little annoying with our current techniques. Here is the best approach I could come up with.
The main thing I want you to get from this section is that it is quite inconvenient to try to make this basic chart from our DataFrame.
import altair as alt
We will make three charts, each starting from the following base. For later reference, notice that we are using only the first 400 rows of the DataFrame.
# Base chart using only the first 400 rows; each city's chart will
# add its own y encoding and color on top of this.
base = alt.Chart(df_pre[:400]).mark_line().encode(
    x="datetime"
).properties(
    width=600
)
Here is a helper function, which takes as input a city, a color, and a base chart (like the one we just defined), and as output returns the temperature chart for the given city in the given color.
def make_chart(city, color, base):
    # Encode the given city's column on the y-axis and draw the line
    # in the given (constant) color.
    c = base.encode(
        y=city,
        color=alt.value(color),
    )
    return c
Here is an example of the output produced by our make_chart function.
make_chart("Seattle", "yellow", base)
Here is a length-three helper list that we will use. (I reiterate that the overall point in this section is that it is not easy to make our simple chart, with the data in its current form.)
pair_list = [
("San Diego", "red"),
("San Francisco", "blue"),
("Detroit", "orange")
]
Here we use a list comprehension to get our three desired Altair charts in a list.
chart_list = [make_chart(city, color, base) for city, color in pair_list]
This does include three charts (the same as the length of pair_list).
len(chart_list)
3
Here is an example of one of the charts in chart_list.
chart_list[0]
We can input these three charts to the alt.layer function, but see below for a way to do this with less typing.
alt.layer(chart_list[0], chart_list[1], chart_list[2])
Here is the “correct” way to pass the entries in chart_list as arguments to the alt.layer function.
Aside from the procedure in this section taking numerous steps, there are other issues as well. For example, how can we tell from the following chart which color corresponds to which city? Related to this: there would be no way to add a tooltip to this chart that would display the corresponding city. (It would be no problem to add a tooltip displaying the temperature or the date.)
alt.layer(*chart_list)
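The * in alt.layer(*chart_list) is Python's argument unpacking. Here is a tiny sketch of the idea on its own, independent of Altair (the function add3 is just a made-up example):
def add3(a, b, c):
    return a + b + c

args = [10, 20, 30]
# f(*some_list) passes the entries of some_list as separate positional arguments.
print(add3(*args))  # same as add3(10, 20, 30); prints 60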
Convert from wide-form to long-form data#
You don’t need to know this melt method for Midterm 2, but many datasets you find (for example, for your course project) will benefit from this melting.
The basic motto for using melt is that we don’t want information stored in the column names. In our current DataFrame, the city names are stored as column names.
For example, why couldn’t we include a tooltip above indicating the city name? Because there was no column containing the city name. The melt method will produce a much longer DataFrame which does have a column containing the city names.
Use the DataFrame method melt to create a much longer DataFrame that has only three columns, “datetime”, “city”, and “temperature”. Name the DataFrame df_melted.
df_pre.sample(4, random_state=0)
datetime | Vancouver | Portland | San Francisco | Seattle | Los Angeles | San Diego | Las Vegas | Phoenix | Albuquerque | ... | Philadelphia | New York | Montreal | Boston | Beersheba | Tel Aviv District | Eilat | Haifa | Nahariyya | Jerusalem | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37037 | 2016-12-22 18:00:00 | 38.7 | 31.7 | 52.9 | 34.9 | 55.3 | 60.9 | 45.6 | 60.8 | 34.1 | ... | 46.8 | 43.6 | 32.0 | 36.4 | 44.3 | 58.4 | 57.5 | 55.4 | 55.4 | 58.7 |
23197 | 2015-05-26 02:00:00 | 61.6 | 60.6 | 56.2 | 60.3 | 65.7 | 64.0 | 75.3 | 78.4 | 60.3 | ... | 75.0 | 73.2 | 62.5 | 66.9 | 58.2 | 59.7 | 71.4 | 69.9 | 69.9 | 65.1 |
33027 | 2016-07-08 16:00:00 | 59.9 | 62.0 | 59.3 | 59.9 | 70.9 | 70.5 | 90.6 | 95.2 | 79.6 | ... | 87.7 | 83.8 | 69.4 | 66.5 | 87.3 | 87.4 | 104.4 | 97.0 | 79.0 | 87.4 |
22542 | 2015-04-28 19:00:00 | 50.6 | 55.1 | 57.0 | 56.8 | 79.2 | 82.3 | 74.1 | 78.5 | 51.3 | ... | 63.6 | 63.6 | 58.7 | 50.8 | 75.0 | 70.7 | 74.1 | 70.7 | 66.9 | 70.6 |
4 rows × 37 columns
I couldn’t remember what the keyword arguments were for this melt method, so I looked them up using the help function.
help(df_pre.melt)
Help on method melt in module pandas.core.frame:
melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True) -> 'DataFrame' method of pandas.core.frame.DataFrame instance
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (`id_vars`), while all other
columns, considered measured variables (`value_vars`), are "unpivoted" to
the row axis, leaving just two non-identifier columns, 'variable' and
'value'.
Parameters
----------
id_vars : tuple, list, or ndarray, optional
Column(s) to use as identifier variables.
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that
are not set as `id_vars`.
var_name : scalar
Name to use for the 'variable' column. If None it uses
``frame.columns.name`` or 'variable'.
value_name : scalar, default 'value'
Name to use for the 'value' column.
col_level : int or str, optional
If columns are a MultiIndex then use this level to melt.
ignore_index : bool, default True
If True, original index is ignored. If False, the original index is retained.
Index labels will be repeated as necessary.
.. versionadded:: 1.1.0
Returns
-------
DataFrame
Unpivoted DataFrame.
See Also
--------
melt : Identical method.
pivot_table : Create a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivot : Return reshaped DataFrame organized
by given index / column values.
DataFrame.explode : Explode a DataFrame from list-like
columns to long format.
Examples
--------
>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
... 'B': {0: 1, 1: 3, 2: 5},
... 'C': {0: 2, 1: 4, 2: 6}})
>>> df
A B C
0 a 1 2
1 b 3 4
2 c 5 6
>>> df.melt(id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'])
A variable value
0 a B 1
1 b B 3
2 c B 5
3 a C 2
4 b C 4
5 c C 6
The names of 'variable' and 'value' columns can be customized:
>>> df.melt(id_vars=['A'], value_vars=['B'],
... var_name='myVarname', value_name='myValname')
A myVarname myValname
0 a B 1
1 b B 3
2 c B 5
Original index values can be kept around:
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
A variable value
0 a B 1
1 b B 3
2 c B 5
0 a C 2
1 b C 4
2 c C 6
If you have multi-index columns:
>>> df.columns = [list('ABC'), list('DEF')]
>>> df
A B C
D E F
0 a 1 2
1 b 3 4
2 c 5 6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B'])
A variable value
0 a B 1
1 b B 3
2 c B 5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')])
(A, D) variable_0 variable_1 value
0 a B E 1
1 b B E 3
2 c B E 5
Here we specify:

- id_vars: Which columns do we want to keep? In this case, it is only the “datetime” column that we want to keep.
- var_name: All remaining column names will be placed into this column. Because the column names represent cities, an obvious choice of name for this new column is “city”.
- value_name: The previous entries in the DataFrame (like 38.7) will be placed into this column. Because these values represent temperatures, we use the name “temperature”.
This melt method is a little hard to understand without seeing an example, so the DataFrame shown below should make it clearer what’s going on.
df_melted = df_pre.melt(
    id_vars="datetime",        # keep "datetime" as an identifier column
    var_name="city",           # the old column names (the cities) go here
    value_name="temperature"   # the old DataFrame entries go here
)
Here we look at 4 random rows from our newly melted DataFrame. Notice how it is down to just three columns, corresponding to what we specified above. (It is also possible to specify multiple columns for the id_vars argument.) Notice how the old column names now appear in their own “city” column, and the old DataFrame entries now appear in the “temperature” column.
Our resulting DataFrame df_melted is much longer than df_pre, because every temperature value in df_pre now lives in its own row in df_melted.
df_melted.sample(4, random_state=0)
datetime | city | temperature | |
---|---|---|---|
1478577 | 2016-03-25 22:00:00 | Eilat | 78.0 |
1299334 | 2016-06-07 11:00:00 | Montreal | 61.1 |
1072194 | 2016-05-01 19:00:00 | Miami | 84.5 |
863753 | 2013-03-15 18:00:00 | Atlanta | 48.5 |
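As a sanity check (a quick sketch), we can compare the shapes. Melting turned each of the 36 city columns of df_pre into its own batch of 45252 rows:
print(df_pre.shape)     # (45252, 37)
print(df_melted.shape)  # (1629072, 3), since 45252 * 36 == 1629072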
Now we do some more pre-processing of this data.

- Keep only the rows corresponding to the cities in the following cities list.
- Drop the rows with missing values.
- Sort the DataFrame by date.
- Name the resulting DataFrame df.
cities = ["San Diego", "San Francisco", "Detroit"]
To keep only the rows corresponding to the three given cities, we will use Boolean indexing with the isin method. We check that those cities do seem to be the only cities appearing.
df1 = df_melted[df_melted["city"].isin(cities)]
df1.sample(4, random_state=0)
datetime | city | temperature | |
---|---|---|---|
267876 | 2017-07-01 13:00:00 | San Diego | 64.1 |
112689 | 2015-04-13 22:00:00 | San Francisco | 59.8 |
921169 | 2014-08-04 14:00:00 | Detroit | 76.4 |
104765 | 2014-05-18 18:00:00 | San Francisco | 62.8 |
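A random sample is not a proof, so here is a more direct check (a quick sketch) that only our three cities remain:
# The distinct values in the "city" column should be exactly our three cities.
print(df1["city"].unique())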
We want to drop rows containing missing values (rather than dropping entire columns), which is why we use the axis=0 keyword argument: axis=0 refers to the rows of a DataFrame, and axis=1 refers to the columns.
df2 = df1.dropna(axis=0)
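As a quick sketch of what dropna accomplished, we can compare row counts and confirm no missing values remain:
print(len(df1), len(df2))      # df2 has at most as many rows as df1
print(df2.isna().any().any())  # expect False: no missing values remain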
I’m not sure if we’ve used sort_values so far in Math 10, but it’s a very natural method to use. In this case we sort the DataFrame by the values in the “datetime” column. (If you want the values sorted in reverse order, you can pass the keyword argument ascending=False.)
df = df2.sort_values("datetime")
Notice how we start out at 1pm on October 1st, 2012, and then we go to 2pm, etc.
df.head(4)
datetime | city | temperature | |
---|---|---|---|
90504 | 2012-10-01 13:00:00 | San Francisco | 61.4 |
226260 | 2012-10-01 13:00:00 | San Diego | 65.1 |
905040 | 2012-10-01 13:00:00 | Detroit | 51.6 |
905041 | 2012-10-01 14:00:00 | Detroit | 51.7 |
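As one more quick check (a sketch), each timestamp should now appear at most three times, once per city, and possibly fewer where a missing value was dropped:
print(df["datetime"].value_counts().max())  # expect 3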
Plotting the data: attempt 2#
Now that we have “melted” the DataFrame, it will be much easier to plot the data.
Plot the data by adapting the following.
alt.Chart(df[???]).mark_line().encode(
x="datetime",
???
).properties(
width=600
)
What number of rows should we use to match the plot from above?
Now that our data is in “long form” instead of “wide form”, it is much easier to plot using Altair. Notice also how a legend appears automatically on the right side, indicating which color corresponds to which city.
Lastly, recall that above we were plotting only the first 400 rows of the DataFrame. Those first 400 rows contained 400 values for many different cities. Now each row corresponds to a single city. Since we have three cities, to produce the same plot, we need to use \(3 \cdot 400 = 1200\) rows.
alt.Chart(df[:3*400]).mark_line().encode(
x="datetime",
y="temperature",
color="city",
tooltip=["city", "temperature"]
).properties(
width=600
)
Preview of Week 9 Wednesday#
On Wednesday of next week, I want to use a DecisionTreeRegressor object to try to predict the temperature. I want the city to be one of the inputs, but decision trees need numerical inputs, so we will add some Boolean “indicator” columns, indicating with True or False whether a row corresponds to “Detroit”, and so on.
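As a hedged preview (next week’s lecture may build these columns differently), one standard way to make such indicator columns is the pd.get_dummies function:
# One 0/1 indicator column per city; the result keeps df's index,
# so we can concatenate the new columns alongside the originals.
df_ind = pd.get_dummies(df["city"])
df_encoded = pd.concat([df, df_ind], axis=1)
df_encoded.head(3)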
10 minutes to work on the Worksheets#
Discussion of the Course Project#
See the instructions in the course notes.