Week 3 Monday

Announcements

  • Worksheets 3 and 4 due before discussion on Tuesday. (I tried to make Worksheet 5 shorter.)

  • In-class quiz tomorrow during discussion section. Based on the material up to and including Worksheet 4. Ask William for any precise details.

  • I have office hours after class today at 11am, next door in ALP 3610. Please come by with any questions!

  • Videos and video quizzes for Friday are posted.

Plotting based on the Grammar of Graphics

If you’ve already seen one plotting library in Python, it was probably Matplotlib. Matplotlib is the most flexible and most widely used Python plotting library. In Math 10, our main interest is in using Python for Data Science, and for that, there are some specialty plotting libraries that will get us nice results much faster than Matplotlib.

Here we will introduce the plotting library we will use most often in Math 10, Altair. In today’s Worksheet, you will see two more plotting libraries, Seaborn and Plotly.

These three libraries are very similar to each other (not so similar to Matplotlib, although Seaborn is built on top of Matplotlib), and I believe all three are based on a notion called the Grammar of Graphics. (Here is the book The Grammar of Graphics, which is freely available to download from on campus or using VPN. There is also a widely used, and I think older, plotting library for the R statistical software that uses the same conventions, ggplot2.)

Here is the basic setup for Altair, Seaborn, and Plotly:

  • We have a pandas DataFrame, and each row in the DataFrame corresponds to one observation (or one data point).

  • Each column in the DataFrame corresponds to a variable (also called a dimension, or a field).

  • To produce the visualizations, we encode different columns from the DataFrame into visual properties of the chart (like the x-coordinate, or the color).
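
As a rough sketch of this setup using only pandas: the three rows below are a hand-typed excerpt of a cars dataset, and the names `cars` and `encodings` are only for illustration (the dictionary is not an object from any plotting library).

```python
import pandas as pd

# A tiny hand-made DataFrame: three cars, three variables.
cars = pd.DataFrame({
    "weight": [3504, 3693, 3436],
    "horsepower": [130.0, 165.0, 150.0],
    "origin": ["usa", "usa", "usa"],
})

# Each row is one observation (one car) ...
n_observations = len(cars)
# ... and each column is one variable.
variables = list(cars.columns)

# An encoding pairs a visual property (a "channel") with a column name.
# This plain dictionary only illustrates the idea.
encodings = {"x": "weight", "y": "horsepower", "color": "origin"}

for channel, column in encodings.items():
    print(f"{channel:>5} <- {column}: {cars[column].tolist()}")
```

The point is only that a chart is described by which column feeds which visual property; the plotting library handles the drawing.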

Altair tries to choose default values that produce high-quality visualizations; this greatly reduces the need for fine-tuning. But there is also a huge amount of customization possible. As one example, here are the named color schemes available in Altair.

import pandas as pd

We will get a dataset from Seaborn. Other than that, we won’t use Seaborn in today’s lecture. Seaborn does show up on Worksheet 5.

# Seaborn
import seaborn as sns

Here are the datasets included with Seaborn. Most of these are small (a few hundred rows), so they should be considered as “toy” datasets for practice.

sns.get_dataset_names()
['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

Here is the syntax for loading a dataset from Seaborn. The result is a pandas DataFrame.

df = sns.load_dataset("mpg")
df
      mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin  name
0    18.0          8         307.0       130.0    3504          12.0          70  usa     chevrolet chevelle malibu
1    15.0          8         350.0       165.0    3693          11.5          70  usa     buick skylark 320
2    18.0          8         318.0       150.0    3436          11.0          70  usa     plymouth satellite
3    16.0          8         304.0       150.0    3433          12.0          70  usa     amc rebel sst
4    17.0          8         302.0       140.0    3449          10.5          70  usa     ford torino
...   ...        ...           ...         ...     ...           ...         ...  ...     ...
393  27.0          4         140.0        86.0    2790          15.6          82  usa     ford mustang gl
394  44.0          4          97.0        52.0    2130          24.6          82  europe  vw pickup
395  32.0          4         135.0        84.0    2295          11.6          82  usa     dodge rampage
396  28.0          4         120.0        79.0    2625          18.6          82  usa     ford ranger
397  31.0          4         119.0        82.0    2720          19.4          82  usa     chevy s-10

398 rows × 9 columns

Here are the data types in this cars dataset.

df.dtypes
mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
name             object
dtype: object

Now we will start working with Altair.

import altair as alt

The syntax will take some getting used to. (And we’ll see below that it can get quite a bit more complicated than this, depending on how much control you want to have over the chart.)

Think of the following as proceeding in three steps.

  1. We tell Altair what pandas DataFrame we will use. alt.Chart(df)

  2. We tell Altair that we want to use circles (I think of them as disks) as the mark type. In other words, this will be a scatter plot (as opposed to, for example, a bar chart or a line chart). .mark_circle()

  3. We tell Altair which columns (in this case the “weight” column and the “horsepower” column) we want to encode in which visual properties of the chart. Here they are encoded as the x and y positions, respectively.

.encode(
    x="weight",
    y="horsepower"
)
# syntax will seem weird at first
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower"
)

The strength of this style of plotting (which is shared by Seaborn and Plotly) is that you can use many more encodings than just the x-coordinate and y-coordinate. Here we encode the “cylinders” column from the DataFrame in the size of the points.

Much more important than memorizing this syntax is to understand how the data in df is reflected in the following chart. (Pick a row in the DataFrame and try to find the corresponding point in the chart. Does its size look correct?)

# encode "cylinders" as the size of each point
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    size="cylinders"
)
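
One way to do the "pick a row" exercise is to print the values that a single point should display. This is only a sketch: the three rows are copied by hand from the DataFrame output above, and the name `sample` is made up for this example.

```python
import pandas as pd

# Rows 0, 1 and 393 of the cars DataFrame, restricted to the columns used in the chart.
sample = pd.DataFrame(
    {
        "name": ["chevrolet chevelle malibu", "buick skylark 320", "ford mustang gl"],
        "weight": [3504, 3693, 2790],
        "horsepower": [130.0, 165.0, 86.0],
        "cylinders": [8, 8, 4],
    },
    index=[0, 1, 393],
)

# Look up one observation by its index label.
row = sample.loc[393]
print(f"{row['name']}: x = {row['weight']}, y = {row['horsepower']}, size from cylinders = {row['cylinders']}")
```

The point for this car should sit at that x and y position and be drawn smaller than the points for the 8-cylinder cars.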

The following tooltip list means that when we hover our mouse over a point, we will see the values for the properties listed in the tooltip. (In the same chart, we also encode the “origin” column in the color of the points.)

# color by "origin"; show name, weight, and horsepower on hover
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color="origin",
    tooltip=["name", "weight", "horsepower"]
)

A common mistake is to spell a column name incorrectly. When that happens, you will receive the following error message. (You have to read the last part to get a clue that you input an incorrect column name.)

ValueError: year encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.

# intentional mistake: "year" is not a column name in df
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color="year",
    tooltip=["name", "weight", "horsepower"]
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2020, in Chart.to_dict(self, *args, **kwargs)
   2018     copy.data = core.InlineData(values=[{}])
   2019     return super(Chart, copy).to_dict(*args, **kwargs)
-> 2020 return super().to_dict(*args, **kwargs)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:384, in TopLevelMixin.to_dict(self, *args, **kwargs)
    381 kwargs["context"] = context
    383 try:
--> 384     dct = super(TopLevelMixin, copy).to_dict(*args, **kwargs)
    385 except jsonschema.ValidationError:
    386     dct = None

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:326, in SchemaBase.to_dict(self, validate, ignore, context)
    324     result = _todict(self._args[0], validate=sub_validate, context=context)
    325 elif not self._args:
--> 326     result = _todict(
    327         {k: v for k, v in self._kwds.items() if k not in ignore},
    328         validate=sub_validate,
    329         context=context,
    330     )
    331 else:
    332     raise ValueError(
    333         "{} instance has both a value and properties : "
    334         "cannot serialize to dict".format(self.__class__)
    335     )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:60, in _todict(obj, validate, context)
     58     return [_todict(v, validate, context) for v in obj]
     59 elif isinstance(obj, dict):
---> 60     return {
     61         k: _todict(v, validate, context)
     62         for k, v in obj.items()
     63         if v is not Undefined
     64     }
     65 elif hasattr(obj, "to_dict"):
     66     return obj.to_dict()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:61, in <dictcomp>(.0)
     58     return [_todict(v, validate, context) for v in obj]
     59 elif isinstance(obj, dict):
     60     return {
---> 61         k: _todict(v, validate, context)
     62         for k, v in obj.items()
     63         if v is not Undefined
     64     }
     65 elif hasattr(obj, "to_dict"):
     66     return obj.to_dict()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:56, in _todict(obj, validate, context)
     54 """Convert an object to a dict representation."""
     55 if isinstance(obj, SchemaBase):
---> 56     return obj.to_dict(validate=validate, context=context)
     57 elif isinstance(obj, (list, tuple, np.ndarray)):
     58     return [_todict(v, validate, context) for v in obj]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:326, in SchemaBase.to_dict(self, validate, ignore, context)
    324     result = _todict(self._args[0], validate=sub_validate, context=context)
    325 elif not self._args:
--> 326     result = _todict(
    327         {k: v for k, v in self._kwds.items() if k not in ignore},
    328         validate=sub_validate,
    329         context=context,
    330     )
    331 else:
    332     raise ValueError(
    333         "{} instance has both a value and properties : "
    334         "cannot serialize to dict".format(self.__class__)
    335     )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:60, in _todict(obj, validate, context)
     58     return [_todict(v, validate, context) for v in obj]
     59 elif isinstance(obj, dict):
---> 60     return {
     61         k: _todict(v, validate, context)
     62         for k, v in obj.items()
     63         if v is not Undefined
     64     }
     65 elif hasattr(obj, "to_dict"):
     66     return obj.to_dict()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:61, in <dictcomp>(.0)
     58     return [_todict(v, validate, context) for v in obj]
     59 elif isinstance(obj, dict):
     60     return {
---> 61         k: _todict(v, validate, context)
     62         for k, v in obj.items()
     63         if v is not Undefined
     64     }
     65 elif hasattr(obj, "to_dict"):
     66     return obj.to_dict()

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/utils/schemapi.py:56, in _todict(obj, validate, context)
     54 """Convert an object to a dict representation."""
     55 if isinstance(obj, SchemaBase):
---> 56     return obj.to_dict(validate=validate, context=context)
     57 elif isinstance(obj, (list, tuple, np.ndarray)):
     58     return [_todict(v, validate, context) for v in obj]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/schema/channels.py:40, in FieldChannelMixin.to_dict(self, validate, ignore, context)
     38 elif not (type_in_shorthand or type_defined_explicitly):
     39     if isinstance(context.get('data', None), pd.DataFrame):
---> 40         raise ValueError("{} encoding field is specified without a type; "
     41                          "the type cannot be inferred because it does not "
     42                          "match any column in the data.".format(shorthand))
     43     else:
     44         raise ValueError("{} encoding field is specified without a type; "
     45                          "the type cannot be automatically inferred because "
     46                          "the data is not specified as a pandas.DataFrame."
     47                          "".format(shorthand))

ValueError: year encoding field is specified without a type; the type cannot be inferred because it does not match any column in the data.
alt.Chart(...)

It should have been “model_year” instead of “year”.

df.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')
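
If the error message is hard to read, a quick check like the following can find the misspelled name before plotting. This is a sketch using only pandas; the empty DataFrame `cars` and the list `requested` are stand-ins I made up so the example runs on its own.

```python
import pandas as pd

# An empty DataFrame with the same column names as the cars dataset.
cars = pd.DataFrame(columns=[
    "mpg", "cylinders", "displacement", "horsepower", "weight",
    "acceleration", "model_year", "origin", "name",
])

# The column names we intended to use in the chart.
requested = ["weight", "horsepower", "year"]

# Keep only the names that are not actual columns.
missing = [field for field in requested if field not in cars.columns]
print(missing)  # ['year']
```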

Here is the default coloring used when we encode the “model_year” column in the color channel.

# corrected: color by the "model_year" column
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color="model_year",
    tooltip=["name", "weight", "horsepower"]
)

Let’s see how to choose our own color scheme. As a first step, making no change to the produced chart, we replace color="model_year" with color=alt.Color("model_year"). The advantage of using this longer syntax is that we can pass keyword arguments to the alt.Color constructor, which will be used to customize the appearance.

# same chart, written with the longer alt.Color syntax
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color=alt.Color("model_year"),
    tooltip=["name", "weight", "horsepower"]
)

The thing we are customizing is the color scale. I know that it is a lot of writing, but the benefit is that there is a huge amount of customization possible. See the following for the possible named color schemes: color scheme choices.

# choose a named color scheme via the scale argument
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color=alt.Color("model_year", scale=alt.Scale(scheme="goldgreen")),
    tooltip=["name", "weight", "horsepower"]
)

If you want the colors to progress in the opposite order, you can add another keyword argument, reverse=True. (This is getting added to the alt.Scale constructor.)

# reverse the order of the color scheme
alt.Chart(df).mark_circle().encode(
    x="weight",
    y="horsepower",
    color=alt.Color("model_year", scale=alt.Scale(scheme="goldgreen", reverse=True)),
    tooltip=["name", "weight", "horsepower"]
)