Week 3 Monday
Announcements
Worksheets 3 and 4 due before discussion on Tuesday. (I tried to make Worksheet 5 shorter.)
In-class quiz tomorrow during discussion section. Based on the material up to and including Worksheet 4. Ask William for any precise details.
I have office hours after class today at 11am, next door in ALP 3610. Please come by with any questions!
Videos and video quizzes for Friday are posted.
Plotting based on the Grammar of Graphics
If you’ve already seen one plotting library in Python, it was probably Matplotlib. Matplotlib is the most flexible and most widely used Python plotting library. In Math 10, our main interest is in using Python for Data Science, and for that, there are some specialty plotting libraries that will get us nice results much faster than Matplotlib.
Here we will introduce the plotting library we will use most often in Math 10, Altair. In today’s Worksheet, you will see two more plotting libraries, Seaborn and Plotly.
These three libraries are very similar to each other (not so similar to Matplotlib, although Seaborn is built on top of Matplotlib), and I believe all three are based on a notion called the Grammar of Graphics. (Here is the book The Grammar of Graphics, which is freely available to download from on campus or using VPN. There is also a widely used, and I think older, plotting library for the R statistical software that uses the same conventions, ggplot2.)
Here is the basic setup for Altair, Seaborn, and Plotly:
We have a pandas DataFrame, and each row in the DataFrame corresponds to one observation (or one data point).
Each column in the DataFrame corresponds to a variable (also called a dimension, or a field).
To produce the visualizations, we encode different columns from the DataFrame into visual properties of the chart (like the x-coordinate, or the color).
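As a rough illustration of this setup, here is a minimal sketch using only pandas. The three rows are copied by hand from the mpg table shown later in these notes. The names `small_df` and `encodings` are made up for this example and are not part of any plotting library; the dictionary is only meant to show the idea of pairing a visual property with a column.

```python
import pandas as pd

# Each row is one observation (one car); each column is one variable.
small_df = pd.DataFrame({
    "weight": [3504, 3693, 2130],
    "horsepower": [130.0, 165.0, 52.0],
    "origin": ["usa", "usa", "europe"],
})

# An "encoding" pairs a visual property of the chart with a column name.
# This plain dictionary is only an illustration of the idea.
encodings = {"x": "weight", "y": "horsepower", "color": "origin"}

# For one observation, look up the value each visual property would display.
first_car = small_df.iloc[0]
for channel, column in encodings.items():
    print(f"{channel} <- {column} = {first_car[column]}")
```

The plotting libraries do this lookup for every row at once, which is why the data needs to be arranged with one observation per row.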
Altair tries to choose default values that produce high-quality visualizations; this greatly reduces the need for fine-tuning. But there is also a huge amount of customization possible. As one example, here are the named color schemes available in Altair.
import pandas as pd
We will get a dataset from Seaborn. Other than that, we won’t use Seaborn in today’s lecture. Seaborn does show up on Worksheet 5.
# Seaborn
import seaborn as sns
Here are the datasets included with Seaborn. Most of these are small (a few hundred rows), so they should be considered as “toy” datasets for practice.
sns.get_dataset_names()
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'dowjones',
'exercise',
'flights',
'fmri',
'geyser',
'glue',
'healthexp',
'iris',
'mpg',
'penguins',
'planets',
'seaice',
'taxis',
'tips',
'titanic']
Here is the syntax for loading a dataset from Seaborn. The result is a pandas DataFrame.
df = sns.load_dataset("mpg")
df
| mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | europe | vw pickup |
395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | usa | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | usa | ford ranger |
397 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | usa | chevy s-10 |
398 rows × 9 columns
Here are the data types in this cars dataset.
df.dtypes
mpg float64
cylinders int64
displacement float64
horsepower float64
weight int64
acceleration float64
model_year int64
origin object
name object
dtype: object
Now we will start working with Altair.
import altair as alt
The syntax will take some getting used to. (And we’ll see below that it can get quite a bit more complicated than this, depending on how much control you want to have over the chart.)
Think of the following as proceeding in three steps.
We tell Altair what pandas DataFrame we will use.
alt.Chart(df)
We tell Altair that we want to use circles (I think of them as disks) as the mark type. In other words, this will be a scatter plot (as opposed to, for example, a bar chart or a line chart).
.mark_circle()
We tell Altair which columns (in this case the “weight” column and the “horsepower” column) we want to encode in which visual properties of the chart. Here they are encoded as the x and y positions, respectively.
.encode(
x="weight",
y="horsepower"
)
# syntax will seem weird at first
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower"
)
The strength of this style of plotting (which is shared by Seaborn and Plotly) is that you can use many more encodings than just the x-coordinate and y-coordinate. Here we encode the “cylinders” column from the DataFrame in the size of the points.
Much more important than memorizing this syntax is to understand how the data in df is reflected in the following chart. (Pick a row in the DataFrame and try to find the corresponding point in the chart. Does its size look correct?)
# encode the cylinders column as the size of each point
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
size="cylinders"
)
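One way to carry out the suggested check is to pull a single row out of the DataFrame and read off the values that were encoded. The sketch below uses two rows copied by hand from the mpg table above, so that it runs on its own; with the full dataset you could use `df.loc[394]` in the same way.

```python
import pandas as pd

# Two rows copied by hand from the mpg table above (index labels 0 and 394).
small_df = pd.DataFrame(
    {
        "name": ["chevrolet chevelle malibu", "vw pickup"],
        "weight": [3504, 2130],
        "horsepower": [130.0, 52.0],
        "cylinders": [8, 4],
    },
    index=[0, 394],
)

# Pick one row by its index label and read off the encoded values.
row = small_df.loc[394]
print(row["name"], row["weight"], row["horsepower"], row["cylinders"])

# In the chart, this car should appear near x=2130, y=52,
# drawn smaller than the 8-cylinder car in row 0.
```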
The following chart adds two encodings. The “origin” column is encoded as the color of each point, and the tooltip list means that when we hover our mouse over a point, we will see that point’s values for the columns listed in the tooltip.
# color by origin; show name, weight, and horsepower on hover
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color="origin",
tooltip=["name", "weight", "horsepower"]
)
A common mistake is to spell a column name incorrectly. When that happens, you will receive the following error message. (You have to read the last part to get a clue that you input an incorrect column name.)
ValueError: year encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.
# "year" is not a column of df, so this raises a ValueError
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color="year",
tooltip=["name", "weight", "horsepower"]
)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
2018 copy.data = core.InlineData(values=[{}])
2019 return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:384, in TopLevelMixin.to_dict(self, *args, **kwargs)
381 kwargs["context"] = context
383 try:
--> 384 dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
385 except jsonschema.ValidationError:
386 dct = None
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:326, in SchemaBase.to_dict(self, validate, ignore, context)
324 result = _todict(self._args[0], validate=sub_validate, context=context)
325 elif not self._args:
--> 326 result = _todict(
327 {k: v for k, v in self._kwds.items() if k not in ignore},
328 validate=sub_validate,
329 context=context,
330 )
331 else:
332 raise ValueError(
333 "{} instance has both a value and properties : "
334 "cannot serialize to dict".format(self.__class__)
335 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:60, in _todict(obj, validate, context)
58 return [_todict(v, validate, context) for v in obj]
59 elif isinstance(obj, dict):
---> 60 return {
61 k: _todict(v, validate, context)
62 for k, v in obj.items()
63 if v is not Undefined
64 }
65 elif hasattr(obj, "to_dict"):
66 return obj.to_dict()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:61, in <dictcomp>(.0)
58 return [_todict(v, validate, context) for v in obj]
59 elif isinstance(obj, dict):
60 return {
---> 61 k: _todict(v, validate, context)
62 for k, v in obj.items()
63 if v is not Undefined
64 }
65 elif hasattr(obj, "to_dict"):
66 return obj.to_dict()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:56, in _todict(obj, validate, context)
54 """Convert an object to a dict representation."""
55 if isinstance(obj, SchemaBase):
---> 56 return obj.to_dict(validate=validate, context=context)
57 elif isinstance(obj, (list, tuple, np.ndarray)):
58 return [_todict(v, validate, context) for v in obj]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:326, in SchemaBase.to_dict(self, validate, ignore, context)
324 result = _todict(self._args[0], validate=sub_validate, context=context)
325 elif not self._args:
--> 326 result = _todict(
327 {k: v for k, v in self._kwds.items() if k not in ignore},
328 validate=sub_validate,
329 context=context,
330 )
331 else:
332 raise ValueError(
333 "{} instance has both a value and properties : "
334 "cannot serialize to dict".format(self.__class__)
335 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:60, in _todict(obj, validate, context)
58 return [_todict(v, validate, context) for v in obj]
59 elif isinstance(obj, dict):
---> 60 return {
61 k: _todict(v, validate, context)
62 for k, v in obj.items()
63 if v is not Undefined
64 }
65 elif hasattr(obj, "to_dict"):
66 return obj.to_dict()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:61, in <dictcomp>(.0)
58 return [_todict(v, validate, context) for v in obj]
59 elif isinstance(obj, dict):
60 return {
---> 61 k: _todict(v, validate, context)
62 for k, v in obj.items()
63 if v is not Undefined
64 }
65 elif hasattr(obj, "to_dict"):
66 return obj.to_dict()
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:56, in _todict(obj, validate, context)
54 """Convert an object to a dict representation."""
55 if isinstance(obj, SchemaBase):
---> 56 return obj.to_dict(validate=validate, context=context)
57 elif isinstance(obj, (list, tuple, np.ndarray)):
58 return [_todict(v, validate, context) for v in obj]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/schema/channels.py:40, in FieldChannelMixin.to_dict(self, validate, ignore, context)
38 elif not (type_in_shorthand or type_defined_explicitly):
39 if isinstance(context.get('data', None), pd.DataFrame):
---> 40 raise ValueError("{} encoding field is specified without a type; "
41 "the type cannot be inferred because it does not "
42 "match any column in the data.".format(shorthand))
43 else:
44 raise ValueError("{} encoding field is specified without a type; "
45 "the type cannot be automatically inferred because "
46 "the data is not specified as a pandas.DataFrame."
47 "".format(shorthand))
ValueError: year encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.
alt.Chart(...)
It should have been “model_year” instead of “year”.
df.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
'acceleration', 'model_year', 'origin', 'name'],
dtype='object')
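Since the error message is easy to miss at the bottom of a long traceback, it can help to check column names yourself before building the chart. Here is a minimal sketch of such a check. The helper `missing_columns` is hypothetical (it is not part of pandas or Altair), and it uses `difflib.get_close_matches` from the Python standard library to suggest likely spellings; the cutoff of 0.5 is an arbitrary choice for this example.

```python
import difflib

import pandas as pd


def missing_columns(df, names):
    """Map each name that is not a column of df to a list of close matches."""
    columns = [str(c) for c in df.columns]
    return {
        name: difflib.get_close_matches(name, columns, n=3, cutoff=0.5)
        for name in names
        if name not in columns
    }


# An empty DataFrame with the same column names as the mpg dataset.
cars = pd.DataFrame(columns=[
    "mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "name",
])

# "year" is the misspelled name from the chart above.
problems = missing_columns(cars, ["weight", "horsepower", "year"])
print(problems)
```

An empty result means every name matches a column; otherwise the suggestions point to the intended column.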
Here is the default coloring used when we encode the “model_year” column in the color channel.
# corrected column name: model_year instead of year
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color="model_year",
tooltip=["name", "weight", "horsepower"]
)
Let’s see how to choose our own color scheme. As a first step, making no change to the produced chart, we replace color="model_year" with color=alt.Color("model_year"). The advantage of using this longer syntax is that we can pass keyword arguments to the alt.Color constructor, which will be used to customize the appearance.
# same chart as before, using alt.Color instead of a plain string
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color=alt.Color("model_year"),
tooltip=["name", "weight", "horsepower"]
)
The thing we are customizing is the color scale. I know that it is a lot of writing, but the benefit is that there is a huge amount of customization possible. See the following for the possible named color schemes: color scheme choices.
# pass a named color scheme to the color scale
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color=alt.Color("model_year", scale=alt.Scale(scheme="goldgreen")),
tooltip=["name", "weight", "horsepower"]
)
If you want the colors to progress in the opposite order, you can add another keyword argument, reverse=True. (This is getting added to the alt.Scale constructor.)
# reverse=True flips the order of the color scheme
alt.Chart(df).mark_circle().encode(
x="weight",
y="horsepower",
color=alt.Color("model_year", scale=alt.Scale(scheme="goldgreen", reverse=True)),
tooltip=["name", "weight", "horsepower"]
)