Week 4 Wednesday

Announcements

  • I have office hours at 1pm, downstairs in ALP 2800. Please come by with any questions!

  • Videos and video quizzes posted; due Friday before lecture. (I plan to close all the quizzes and convert them to “practice quizzes” sometime before the midterm, so you can use them to study. If you are behind on the video quizzes, this is a good time to catch up.)

  • Midterm next Wednesday, October 26th, during lecture. Similar question style to the in-class quizzes, but some questions may be longer. Sample midterm posted by the end of this week.

  • Note cards will be passed out later today during the “worksheet time”. (Remind me if I forget.) You can put hand-written notes on them (both sides, one card per student) and use them during the midterm.

A DataFrame which is difficult to use “as is” with Altair

  • Using the following pandas DataFrame, draw a line that goes 60, 80, 65 for Irvine and another line that goes 25, 85, 50 for New York. Why is this more difficult than you would expect?

import pandas as pd
import altair as alt

Here is a very simple DataFrame, but its data is presented slightly differently from how Altair expects.

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50]
})

Because no single column contains the month names, we do not yet know how to plot those months along an axis using Altair. It would be easy to plot, for example, “Irvine” and “New York” along the x-axis, or to plot 25 and 60 along the x-axis, but not to plot “Feb”, “Jul”, “Nov” along the x-axis.

df
City Feb Jul Nov
0 Irvine 60 80 65
1 New York 25 85 50

The solution is to use the pandas DataFrame method melt.

df.melt(
    id_vars=["City"], # columns to keep the same
    var_name="Month", # the other column labels go here
    value_name="Temperature", # the old values go here
    )
City Month Temperature
0 Irvine Feb 60
1 New York Feb 25
2 Irvine Jul 80
3 New York Jul 85
4 Irvine Nov 65
5 New York Nov 50
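One way to see that the result is not magic is to build the same table by hand. The sketch below is only an illustration of the idea, not how pandas implements melt internally: for each month column it keeps “City”, renames the month column to “Temperature”, records the month label in a new “Month” column, and then stacks the pieces. The names pieces, piece and manual are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50]
})

pieces = []
for month in ["Feb", "Jul", "Nov"]:
    # Keep "City" and treat this month's column as the temperatures.
    piece = df[["City", month]].rename(columns={month: "Temperature"})
    # Record which column the values came from.
    piece.insert(1, "Month", month)
    pieces.append(piece)

# Stack the three two-row pieces into one six-row DataFrame.
manual = pd.concat(pieces, ignore_index=True)

melted = df.melt(id_vars=["City"], var_name="Month", value_name="Temperature")

# Compare the contents row by row.
print(manual.values.tolist() == melted.values.tolist())
```

Two rows times three month columns gives the six rows seen in the melted output.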

The syntax takes some getting used to. It can seem like magic that the month labels and temperatures showed up in the correct spot. Here is another example, where we tell melt to leave both the “City” and the “Jul” columns unchanged, so only “Feb” and “Nov” get melted.

df.melt(
    id_vars=["City", "Jul"], # columns to keep the same
    var_name="Month", # the other column labels go here
    value_name="Temperature", # the old values go here
    )
City Jul Month Temperature
0 Irvine 80 Feb 60
1 New York 85 Feb 25
2 Irvine 80 Nov 65
3 New York 85 Nov 50
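In the output above, “Jul” stays as its own column because it was listed in id_vars. If the goal is instead to melt only some columns and drop the others, melt also accepts a value_vars argument. A minimal sketch using the same small DataFrame (the variable name subset is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50]
})

# Melt only "Feb" and "Nov"; "Jul" does not appear in the result at all.
subset = df.melt(
    id_vars=["City"],            # columns to keep the same
    value_vars=["Feb", "Nov"],   # the only columns to melt
    var_name="Month",
    value_name="Temperature",
)
print(subset)
```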

pandas did not know that the “Feb” corresponded to a “Month”… we told it that. If we chose different names, then the newly formed columns would have different labels. Notice how var_name describes what to call the old column labels, and value_name describes what to call the old values in those columns.

df.melt(
    id_vars=["City"], # columns to keep the same
    var_name="Variable", # the other column labels go here
    value_name="Value", # the old values go here
    )
City Variable Value
0 Irvine Feb 60
1 New York Feb 25
2 Irvine Jul 80
3 New York Jul 85
4 Irvine Nov 65
5 New York Nov 50
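For comparison, here is a short sketch of what happens when var_name and value_name are left out entirely. In that case pandas falls back to the generic labels “variable” and “value”, which again shows that the column names come from us and not from the data.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50]
})

# No var_name or value_name given, so pandas uses its default labels.
default_names = df.melt(id_vars=["City"])
print(list(default_names.columns))  # ['City', 'variable', 'value']
```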

A common source of mistakes in Python is thinking that code like the following changed df. A hint that df did not change is the fact that a new DataFrame got displayed.

# this code does not change df
df.melt(
    id_vars=["City"], # columns to keep the same
    var_name="Month", # the other column labels go here
    value_name="Temperature", # the old values go here
    )
City Month Temperature
0 Irvine Feb 60
1 New York Feb 25
2 Irvine Jul 80
3 New York Jul 85
4 Irvine Nov 65
5 New York Nov 50

Nothing we have done so far has changed df.

df
City Feb Jul Nov
0 Irvine 60 80 65
1 New York 25 85 50
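The same point can be checked in code rather than by eye. This sketch keeps a copy of df, calls melt, and then confirms that melt returned a different object and that df still matches its copy. The names before and result are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50]
})

before = df.copy()  # snapshot of df before calling melt
result = df.melt(id_vars=["City"], var_name="Month", value_name="Temperature")

print(result is df)            # False: melt returned a new DataFrame
print(df.equals(before))       # True: df itself is unchanged
print(df.shape, result.shape)  # (2, 4) (6, 3)
```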

Here we store the melted DataFrame in a new variable named df2. Notice how nothing gets displayed beneath this cell, because an assignment does not produce any output.

df2 = df.melt(
    id_vars=["City"], # columns to keep the same
    var_name="Month", # the other column labels go here
    value_name="Temperature", # the old values go here
    )

This is almost what we want, but we haven’t told Altair to do anything with the “City” column yet.

alt.Chart(df2).mark_line().encode(
    x="Month",
    y="Temperature",
)

This is the kind of chart we were looking for. You will need to do something similar on Worksheet 8, where we are displaying various assignment names along the x-axis, like “Quiz 1” and “Quiz 2”.

alt.Chart(df2).mark_line().encode(
    x="Month",
    y="Temperature",
    color="City"
)

Interactive chart, example 1

  • Run alt.data_transformers.enable('default', max_rows=10000) so you can plot points from up to 10,000 rows in a DataFrame. (Warning: don’t use numbers much higher than this. Because every data point is stored in the chart, the file sizes can become huge.)

  • Using the normalized stock data from Worksheet 4 (attached), make a line chart which highlights a particular stock exchange when you click on its abbreviation in the legend.

df = pd.read_csv("wk4.csv")
df.shape
(67194, 4)
df.columns
Index(['Abbreviation', 'Date', 'Open', 'NormOpen'], dtype='object')

Here is the error Altair will raise if you try to plot from a DataFrame with more than 5000 rows.

alt.Chart(df).mark_line().encode(
    x="Date",
    y="NormOpen",
    color="Abbreviation"
)
---------------------------------------------------------------------------
MaxRowsError                              Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
   2018     copy.data = core.InlineData(values=[{}])
   2019     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:374, in TopLevelMixin.to_dict(self, *args, **kwargs)
    372 copy = self.copy(deep=False)
    373 original_data = getattr(copy, "data", Undefined)
--> 374 copy.data = _prepare_data(original_data, context)
    376 if original_data is not Undefined:
    377     context["data"] = original_data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:89, in _prepare_data(data, context)
     87 # convert dataframes  or objects with __geo_interface__ to dict
     88 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 89     data = _pipe(data, data_transformers.get())
     91 # convert string input to a URLData
     92 if isinstance(data, str):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
     17 @curried.curry
     18 def default_data_transformer(data, max_rows=5000):
---> 19     return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
    608 """ Pipe a value through a sequence of functions
    609 
    610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
   (...)
    625     thread_last
    626 """
    627 for func in funcs:
--> 628     data = func(data)
    629 return data

File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
    302 def __call__(self, *args, **kwargs):
    303     try:
--> 304         return self._partial(*args, **kwargs)
    305     except TypeError as exc:
    306         if self._should_curry(args, kwargs, exc):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/data.py:80, in limit_rows(data, max_rows)
     78         return data
     79 if max_rows is not None and len(values) > max_rows:
---> 80     raise MaxRowsError(
     81         "The number of rows in your dataset is greater "
     82         "than the maximum allowed ({}). "
     83         "For information on how to plot larger datasets "
     84         "in Altair, see the documentation".format(max_rows)
     85     )
     86 return data

MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)

Here we specify that Altair should allow up to 10,000 rows. Be careful with this tool; I do not think you should allow more than maybe 20,000 rows. The risk is producing a huge file, and possibly crashing the machine.

alt.data_transformers.enable('default', max_rows=10000)
DataTransformerRegistry.enable('default')

Because df had over 60,000 rows, we still need to decrease the size of df somehow. Here we use sample to get only 30 rows.

# still need to shrink df
alt.Chart(df.sample(30)).mark_line().encode(
    x="Date",
    y="NormOpen",
    color="Abbreviation"
)
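Since df.sample picks different rows each time the cell is run, the chart changes from run to run. The sketch below wraps the sampling in a small helper that leaves small DataFrames alone and uses random_state to make the sample reproducible. The helper name shrink_for_altair and the stand-in DataFrame big are made up for this example; they are not part of Altair or of the worksheet.

```python
import pandas as pd

def shrink_for_altair(data, max_rows=5000, seed=0):
    """Return data unchanged if it is small enough, otherwise a random sample."""
    if len(data) <= max_rows:
        return data
    # random_state makes the same rows get chosen on every run.
    return data.sample(max_rows, random_state=seed)

# Stand-in for the stock data: 12,000 rows, 2 columns.
big = pd.DataFrame({"x": range(12000), "y": range(12000)})

small = shrink_for_altair(big)
print(small.shape)  # (5000, 2)
```

The order of the sampled rows does not matter for a line chart, because Altair orders the points of each line by their x values by default.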

By default, Altair doesn’t know that the “Date” column holds values representing dates. We can tell Altair this by specifying :T as the encoding data type. (Another option would be to use pd.to_datetime on the “Date” column, and then Altair would recognize automatically that these represent datetime values.) If you try to plot 10,000 points with the dates still treated as plain strings, the file will be huge and the chart will probably not be displayed.

# still need to shrink df
alt.Chart(df.sample(10000)).mark_line().encode(
    x="Date:T",
    y="NormOpen",
    color="Abbreviation"
)
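The pd.to_datetime alternative mentioned above can be sketched as follows. The rows here are made up, and the exact format of the real “Date” column in wk4.csv is an assumption; the point is only that the column changes from strings to a datetime type, after which x="Date" should work without the :T suffix.

```python
import pandas as pd

# Made-up rows; the abbreviations and date format are placeholders.
df = pd.DataFrame({
    "Abbreviation": ["AAA", "AAA", "BBB"],
    "Date": ["2020-01-02", "2020-01-03", "2020-01-02"],
    "NormOpen": [1.00, 1.01, 1.00],
})

# Before conversion the column holds plain strings.
print(pd.api.types.is_datetime64_any_dtype(df["Date"]))  # False

df["Date"] = pd.to_datetime(df["Date"])

# After conversion pandas (and therefore Altair) sees real datetime values.
print(pd.api.types.is_datetime64_any_dtype(df["Date"]))  # True
```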

Now we finally get to interactivity.

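Step 1. Create an Altair selection object. Here we specify that we want to select objects by the “Abbreviation” field, and bind="legend" says that the selection is made by clicking on the legend. Creating the selection does not change the chart yet.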

sel = alt.selection_single(fields=["Abbreviation"], bind="legend")

alt.Chart(df.sample(10000)).mark_line().encode(
    x="Date:T",
    y="NormOpen",
    color="Abbreviation"
)

Step 2. Add the selection object to the chart using add_selection.

sel = alt.selection_single(fields=["Abbreviation"], bind="legend")

alt.Chart(df.sample(10000)).mark_line().encode(
    x="Date:T",
    y="NormOpen",
    color="Abbreviation"
).add_selection(sel)

Step 3. Tell Altair how to respond to the selection. Here we use alt.condition to say that if the point is selected, use the default coloring and an opacity of 1, and if the point is not selected, use light grey for the color and make the line 80% transparent (an opacity of 0.2).

Try clicking on one of the stock exchange abbreviations listed in the legend below. Notice how the chart responds.

sel = alt.selection_single(fields=["Abbreviation"], bind="legend")

alt.Chart(df.sample(10000)).mark_line().encode(
    x="Date:T",
    y="NormOpen",
    color=alt.condition(sel, "Abbreviation", alt.value("lightgrey")),
    opacity=alt.condition(sel, alt.value(1), alt.value(0.2))
).add_selection(sel)

You will see more examples of interactivity on Worksheet 8. A very nice aspect of this interactivity is that, once the visualization is produced, the interactivity can be presented on any website, even if Python is not available to the website.

Interactive chart, example 2

  • Using the “mpg” dataset from Seaborn, make a scatter plot showing “horsepower” vs “mpg”, together with a bar chart that shows how many cars there are from each origin.

We didn’t get here on Wednesday.