Week 4 Wednesday
Announcements#
I have office hours at 1pm, downstairs in ALP 2800. Please come by with any questions!
Videos and video quizzes posted; due Friday before lecture. (I plan to close all the quizzes and convert them to “practice quizzes” sometime before the midterm, so you can use them to study. If you are behind on the video quizzes, this is a good time to catch up.)
Midterm next Wednesday, October 26th, during lecture. Similar question style to the in-class quizzes, but some questions may be longer. Sample midterm posted by the end of this week.
Note cards will be passed out later today during the “worksheet time”. (Remind me if I forget.) You can put hand-written notes on them (both sides, one card per student) and use them during the midterm.
A DataFrame which is difficult to use “as is” with Altair#
Using the following pandas DataFrame, draw a line that goes 60, 80, 65 for Irvine and another line that goes 25, 85, 50 for New York. Why is this more difficult than you would expect?
import pandas as pd
import altair as alt
Here is a very simple DataFrame, but its data is presented slightly differently from how Altair expects.
df = pd.DataFrame({
"City": ["Irvine", "New York"],
"Feb": [60, 25],
"Jul": [80, 85],
"Nov": [65, 50]
})
Because there is no single column containing the month values, we do not currently know how to plot those month values along an axis using Altair. It would be easy to plot, for example “Irvine” and “New York” along the x-axis, or to plot 25 and 60 along the x-axis, but not to plot “Feb”, “Jul”, “Nov” along the x-axis.
df
| | City | Feb | Jul | Nov |
|---|---|---|---|---|
| 0 | Irvine | 60 | 80 | 65 |
| 1 | New York | 25 | 85 | 50 |
The solution is to use the pandas DataFrame method melt.
df.melt(
id_vars=["City"], # columns to keep the same
var_name="Month", # the other column labels go here
value_name="Temperature", # the old values go here
)
| | City | Month | Temperature |
|---|---|---|---|
| 0 | Irvine | Feb | 60 |
| 1 | New York | Feb | 25 |
| 2 | Irvine | Jul | 80 |
| 3 | New York | Jul | 85 |
| 4 | Irvine | Nov | 65 |
| 5 | New York | Nov | 50 |
The syntax takes some getting used to. It can seem like magic that the month labels and temperatures showed up in the correct spot. Here is another example, where we tell melt to leave both the “City” and the “Jul” columns unchanged.
df.melt(
id_vars=["City", "Jul"], # columns to keep the same
var_name="Month", # the other column labels go here
value_name="Temperature", # the old values go here
)
| | City | Jul | Month | Temperature |
|---|---|---|---|---|
| 0 | Irvine | 80 | Feb | 60 |
| 1 | New York | 85 | Feb | 25 |
| 2 | Irvine | 80 | Nov | 65 |
| 3 | New York | 85 | Nov | 50 |
pandas did not know that “Feb” corresponded to a “Month”; we told it that. If we chose different names, then the newly formed columns would have different labels. Notice how var_name describes what to call the old column labels, and value_name describes what to call the old values in those columns.
df.melt(
id_vars=["City"], # columns to keep the same
var_name="Variable", # the other column labels go here
value_name="Value", # the old values go here
)
| | City | Variable | Value |
|---|---|---|---|
| 0 | Irvine | Feb | 60 |
| 1 | New York | Feb | 25 |
| 2 | Irvine | Jul | 80 |
| 3 | New York | Jul | 85 |
| 4 | Irvine | Nov | 65 |
| 5 | New York | Nov | 50 |
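To take some of the mystery out of melt, here is a rough sketch that builds the same long-form table by hand with two loops. It is only meant as an illustration of the idea, not a description of how pandas implements melt internally; the variable names manual, melted, and value_vars are chosen just for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50],
})

id_vars = ["City"]
# Every column not listed in id_vars gets "melted".
value_vars = [c for c in df.columns if c not in id_vars]

rows = []
for col in value_vars:          # "Feb", then "Jul", then "Nov"
    for i in df.index:          # row 0 (Irvine), then row 1 (New York)
        new_row = {name: df.loc[i, name] for name in id_vars}
        new_row["Month"] = col                    # the old column label
        new_row["Temperature"] = df.loc[i, col]   # the old value in that column
        rows.append(new_row)

manual = pd.DataFrame(rows)
melted = df.melt(id_vars=id_vars, var_name="Month", value_name="Temperature")

# Compare the two results cell by cell.
print(manual.values.tolist() == melted.values.tolist())  # True
```

Each original row produces one new row per melted column, which is why the two-row DataFrame becomes a six-row DataFrame.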
A common source of mistakes in Python is thinking that code like the following changed df. A hint that df did not change is the fact that a new DataFrame got displayed.
# this code does not change df
df.melt(
id_vars=["City"], # columns to keep the same
var_name="Month", # the other column labels go here
value_name="Temperature", # the old values go here
)
| | City | Month | Temperature |
|---|---|---|---|
| 0 | Irvine | Feb | 60 |
| 1 | New York | Feb | 25 |
| 2 | Irvine | Jul | 80 |
| 3 | New York | Jul | 85 |
| 4 | Irvine | Nov | 65 |
| 5 | New York | Nov | 50 |
Nothing we have done so far has changed df.
df
| | City | Feb | Jul | Nov |
|---|---|---|---|---|
| 0 | Irvine | 60 | 80 | 65 |
| 1 | New York | 25 | 85 | 50 |
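If you would rather check this with code than by eye, a small sketch like the following can help. The name result is just a placeholder for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "City": ["Irvine", "New York"],
    "Feb": [60, 25],
    "Jul": [80, 85],
    "Nov": [65, 50],
})

result = df.melt(id_vars=["City"], var_name="Month", value_name="Temperature")

print(result is df)    # False: melt returned a different object
print(df.shape)        # (2, 4): the original is still in wide form
print(result.shape)    # (6, 3): the returned DataFrame is in long form
```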
Here we store the melted DataFrame in a new variable named df2. Notice how nothing gets displayed beneath this cell.
df2 = df.melt(
id_vars=["City"], # columns to keep the same
var_name="Month", # the other column labels go here
value_name="Temperature", # the old values go here
)
This is almost what we want, but we haven’t told Altair to do anything with the “City” column yet.
alt.Chart(df2).mark_line().encode(
x="Month",
y="Temperature",
)
This is the kind of chart we were looking for. You will need to do something similar on Worksheet 8, where we are displaying various assignment names along the x-axis, like “Quiz 1” and “Quiz 2”.
alt.Chart(df2).mark_line().encode(
x="Month",
y="Temperature",
color="City"
)
Interactive chart, example 1#
Run alt.data_transformers.enable('default', max_rows=10000) so you can plot points from up to 10,000 rows in a DataFrame. (Warning: don’t use numbers much higher than this. Because every data point is plotted, the file sizes can become huge.)
Using the normalized stock data from Worksheet 4 (attached), make a line chart which highlights a certain stock market when you click on the legend.
df = pd.read_csv("wk4.csv")
df.shape
(67194, 4)
df.columns
Index(['Abbreviation', 'Date', 'Open', 'NormOpen'], dtype='object')
Here is the error Altair will raise if you try to plot from a DataFrame with more than 5000 rows.
alt.Chart(df).mark_line().encode(
x="Date",
y="NormOpen",
color="Abbreviation"
)
---------------------------------------------------------------------------
MaxRowsError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:374, in TopLevelMixin.to_dict(self, *args, **kwargs)
372 copy = self.copy(deep=False)
373 original_data = getattr(copy, "data", Undefined)
--> 374 copy.data = _prepare_data(original_data, context)
376 if original_data is not Undefined:
377 context["data"] = original_data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:89, in _prepare_data(data, context)
87 # convert dataframes or objects with __geo_interface__ to dict
88 if isinstance(data, pd.DataFrame) or hasattr(data, "__geo_interface__"):
---> 89 data = _pipe(data, data_transformers.get())
91 # convert string input to a URLData
92 if isinstance(data, str):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/data.py:19, in default_data_transformer(data, max_rows)
17 @curried.curry
18 def default_data_transformer(data, max_rows=5000):
---> 19 return curried.pipe(data, limit_rows(max_rows=max_rows), to_values)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:628, in pipe(data, *funcs)
608 """ Pipe a value through a sequence of functions
609
610 I.e. ``pipe(data, f, g, h)`` is equivalent to ``h(g(f(data)))``
(...)
625 thread_last
626 """
627 for func in funcs:
--> 628 data = func(data)
629 return data
File /shared-libs/python3.9/py/lib/python3.9/site-packages/toolz/functoolz.py:304, in curry.__call__(self, *args, **kwargs)
302 def __call__(self, *args, **kwargs):
303 try:
--> 304 return self._partial(*args, **kwargs)
305 except TypeError as exc:
306 if self._should_curry(args, kwargs, exc):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/data.py:80, in limit_rows(data, max_rows)
78 return data
79 if max_rows is not None and len(values) > max_rows:
---> 80 raise MaxRowsError(
81 "The number of rows in your dataset is greater "
82 "than the maximum allowed ({}). "
83 "For information on how to plot larger datasets "
84 "in Altair, see the documentation".format(max_rows)
85 )
86 return data
MaxRowsError: The number of rows in your dataset is greater than the maximum allowed (5000). For information on how to plot larger datasets in Altair, see the documentation
alt.Chart(...)
Here we specify that Altair should allow up to 10,000 rows. Be careful with this tool; I do not think you should allow more than maybe 20,000 rows. The risk is producing a huge file, and possibly crashing the machine.
alt.data_transformers.enable('default', max_rows=10000)
DataTransformerRegistry.enable('default')
Because df had over 60,000 rows, we still need to decrease the size of df somehow. Here we use sample to get only 30 rows.
# still need to shrink df
alt.Chart(df.sample(30)).mark_line().encode(
x="Date",
y="NormOpen",
color="Abbreviation"
)
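One possible way to package the "shrink if needed" step is a small helper like the one below. The function shrink_for_altair and its parameters are hypothetical names invented for this sketch (they are not part of pandas or Altair), and the DataFrame big is made-up data with the same number of rows as the stock data.

```python
import pandas as pd

def shrink_for_altair(data, max_rows=10000, seed=0):
    """Return data unchanged if it is small enough; otherwise a random sample of max_rows rows."""
    if len(data) <= max_rows:
        return data
    # random_state makes the sample reproducible from run to run
    return data.sample(max_rows, random_state=seed)

# Made-up stand-in with the same number of rows as the stock data
big = pd.DataFrame({"x": range(67194)})
small = shrink_for_altair(big)

print(len(big), len(small))  # 67194 10000
```

Without random_state, sample returns different rows each time the cell is run, so the chart can look slightly different on every run.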
By default, Altair doesn’t know that the “Date” column holds values representing dates. We can tell Altair this by specifying :T as the encoding data type. (Another option would be to use pd.to_datetime on the “Date” column, and then Altair would recognize automatically that these represent datetime values.) If you try to plot 10,000 points using just string encodings, the file will be huge and it will probably not be displayed.
# still need to shrink df
alt.Chart(df.sample(10000)).mark_line().encode(
x="Date:T",
y="NormOpen",
color="Abbreviation"
)
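Here is a brief sketch of the pd.to_datetime option mentioned above. The DataFrame demo is made-up data standing in for a few rows of the stock file; the abbreviations and the date format are assumptions, and the real "Date" strings may look different.

```python
import pandas as pd

# Made-up stand-in for a few rows of the stock data
demo = pd.DataFrame({
    "Abbreviation": ["AAA", "AAA", "BBB"],
    "Date": ["2020-01-02", "2020-01-03", "2020-01-02"],
    "NormOpen": [1.00, 1.02, 0.98],
})

# Before conversion, the column holds strings, not datetime values.
print(pd.api.types.is_datetime64_any_dtype(demo["Date"]))  # False

demo["Date"] = pd.to_datetime(demo["Date"])

# After conversion, the column has a datetime dtype.
print(pd.api.types.is_datetime64_any_dtype(demo["Date"]))  # True
```

Unlike melt in the earlier example, the assignment on the left-hand side means this line really does change demo.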
Now we finally get to interactivity.
Step 1. Create an Altair selection object. Here we specify that we want to select objects by the “Abbreviation” field.
sel = alt.selection_single(fields=["Abbreviation"], bind="legend")
alt.Chart(df.sample(10000)).mark_line().encode(
x="Date:T",
y="NormOpen",
color="Abbreviation"
)
Step 2. Add the selection object to the chart using add_selection.
sel = alt.selection_single(fields=["Abbreviation"], bind="legend")
alt.Chart(df.sample(10000)).mark_line().encode(
x="Date:T",
y="NormOpen",
color="Abbreviation"
).add_selection(sel)
Step 3. Tell Altair how to respond to the selection. Here we use alt.condition to say that if the point is selected, use the default coloring and an opacity of 1, and if the point is not selected, use light grey for the color and make the line 80% transparent (an opacity of 0.2).
Try clicking on one of the stock exchange abbreviations listed in the legend below. Notice how the chart responds.
sel = alt.selection_single(fields=["Abbreviation"], bind="legend")
alt.Chart(df.sample(10000)).mark_line().encode(
x="Date:T",
y="NormOpen",
color=alt.condition(sel, "Abbreviation", alt.value("lightgrey")),
opacity=alt.condition(sel, alt.value(1), alt.value(0.2))
).add_selection(sel)
You will see more examples of interactivity on Worksheet 8. A very nice aspect of this interactivity is that, once the visualization is produced, the interactivity can be presented on any website, even if Python is not available to the website.
Interactive chart, example 2#
Using the “mpg” dataset from Seaborn, make a scatter plot showing “horsepower” vs “mpg”, together with a bar chart that shows how many cars there are from each origin.
We didn’t get here on Wednesday.