Week 4 Monday
Contents
Week 4 Monday#
Announcements#
Worksheets 5 and 6 due tomorrow before discussion section.
Videos and video quizzes posted; due Friday before lecture.
Midterm next Wednesday, October 26th, during lecture. Similar question style to the in-class quizzes, but some questions may be longer. Note cards available Wednesday (Oct 19th). Sample midterm posted by the end of this week.
Introduction#
Say we want to investigate how the bill length of penguins differs between male penguins and female penguins. How can we investigate this using a facet chart in Altair?
import pandas as pd
import altair as alt
import seaborn as sns
We are going to use a shortcut, dropna
, to drop rows with missing values. See this week’s videos for an explanation of why dropna(axis=0)
will drop the rows (as opposed to columns). In Math 10, we will basically always drop rows instead of dropping columns. That is because each row typically represents a data point.
If instead we were to drop columns with missing values, we would wind up with a very weird dataset. This is what it would look like if we dropped the columns with missing values from this dataset.
# Weird
# Removing columns with missing values. See this week's videos for why `axis=0` is correct
df = sns.load_dataset("penguins").dropna(axis=1)
Only two remaining columns.
df
species | island | |
---|---|---|
0 | Adelie | Torgersen |
1 | Adelie | Torgersen |
2 | Adelie | Torgersen |
3 | Adelie | Torgersen |
4 | Adelie | Torgersen |
... | ... | ... |
339 | Gentoo | Biscoe |
340 | Gentoo | Biscoe |
341 | Gentoo | Biscoe |
342 | Gentoo | Biscoe |
343 | Gentoo | Biscoe |
344 rows × 2 columns
Here is the usual usage. This looks much more typical.
# Removing rows with missing values. See this week's videos for why `axis=0` is correct
df = sns.load_dataset("penguins").dropna(axis=0)
df
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Our charts below will be variants of this one. We eventually want to see whether male penguins or female penguins have longer bill length. We cannot answer that from this image, because the sex of the penguins is not shown in this chart.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_depth_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).properties(
height=500,
width=500,
title="Penguins"
)
Facet charts#
Investigate bill length using a scatter plot together with
facet
.
Here are the columns we have to work with. We are most interested in the “bill_length_mm” column and the “sex” column.
df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex'],
dtype='object')
There was a question about what the height
and width
keyword arguments are doing. Here we change the height
from 200
to 20
; notice how the chart gets squashed.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_depth_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).properties(
height=20,
width=200
).facet(
column="sex"
)
Back to the original height. This facet chart already gives us a good understanding of how the bill length differs between male and female penguins across various species.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_depth_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).properties(
height=200,
width=200
).facet(
column="sex"
)
If put the male and female charts in different rows instead of different columns, it is easier to compare the bill depths but it becomes harder to compare the bill lengths. Today we are primarily interested in bill length.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_depth_mm", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).properties(
height=200,
width=200
).facet(
row="sex"
)
Investigate bill length using a bar chart together with
facet
.
Here we use the exact same code as above, but switching to a bar chart with mark_bar
instead of mark_circle
.
alt.Chart(df).mark_bar().encode(
x=alt.X("species", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).properties(
height=200,
width=200
).facet(
column="sex"
)
I don’t think our size specifications (height
and width
) are improving the appearance of these bar charts; I think the default values look better. So let’s remove that portion of the chart.
alt.Chart(df).mark_bar().encode(
x=alt.X("species", scale=alt.Scale(zero=False)),
y=alt.Y("bill_length_mm", scale=alt.Scale(zero=False)),
color="species",
).facet(
column="sex"
)
Let’s also bring the bars back to their default setting, of including zero for quantitative data types (like the y-axis in this example).
alt.Chart(df).mark_bar().encode(
x=alt.X("species"),
y=alt.Y("bill_length_mm"),
color="species",
).facet(
column="sex"
)
Notice how the above charts seem to suggest that female chinstrap penguins have a longer bill than male chinstrap penguins. That is deceptive though, because these charts have one bar for each penguin, layered (not stacked) on top of each other. So in fact, all this means is that one female chinstrap penguin has a longer bill. Go back up to the scatter plots and see if you believe that one female penguin has a longer bill.
If we instead specify that we just want to know the mean
of the bill length, then we get a more representative image. The following is the first time we can tell from the bar chart that the average bill length is higher for male penguins than for female penguins.
alt.Chart(df).mark_bar().encode(
x=alt.X("species"),
y=alt.Y("mean(bill_length_mm)"),
color="species",
).facet(
column="sex"
)
I actually think it’s more useful to use some different encodings in this example. Let’s start by making the facet charts by “species” instead of by “sex”. The following chart does not include the sex information (and overall it doesn’t look very good).
alt.Chart(df).mark_bar().encode(
x=alt.X("species"),
y=alt.Y("mean(bill_length_mm)"),
color="species",
).facet(
column="species"
)
Here we specify that the x-axis should use the “sex” column. I think this is the most readable of the charts, in terms of ease of comparing average bill length between different species of penguins.
alt.Chart(df).mark_bar().encode(
x=alt.X("sex"),
y=alt.Y("mean(bill_length_mm)"),
color="species",
).facet(
column="species"
)
Because we’ve gotten rid of the keyword arguments (like scale = alt.Scale(...)
), we can get rid of the alt.X
and alt.Y
portions (we need to keep the column names).
alt.Chart(df).mark_bar().encode(
x="sex",
y="mean(bill_length_mm)",
color="species",
).facet(
column="species"
)
Facet charts “by hand” using groupby
#
Make a similar chart to the above bar chart using
groupby
andhconcat
.
Here is a first attempt at making three bar charts (one for each species) and putting those charts into a list named chart_list
. (Python is very flexible about what can go into a list or a tuple. As far as I know, any Python object can go into a list or a tuple.)
chart_list = []
for species, df_sub in df.groupby("species"):
c = alt.Chart(df).mark_bar().encode(
x="sex",
y="mean(bill_length_mm)",
)
chart_list.append(c)
Let’s try to display some of these charts horizontally, using alt.hconcat
(the “h” stands for “horizontal”.) Because there are 3 species of penguins, we repeat the code inside the for loop 3 times, so there are only 3 charts in chart_list
, so the following code raises an error. (For testing, it is very helpful to initialize chart_list = []
in the same cell as the for loop, so that it gets reset every time we run this cell.)
alt.hconcat(chart_list[0], chart_list[1], chart_list[2], chart_list[3])
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In [20], line 1
----> 1 alt.hconcat(chart_list[0], chart_list[1], chart_list[2], chart_list[3])
IndexError: list index out of range
The following works, but the charts all look the same. That is because we were using the original DataFrame df
, so each of the charts c
was exactly the same.
alt.hconcat(chart_list[0], chart_list[1], chart_list[2])
Here we use df_sub
instead of df
, and we also add a title.
(See the notes from Friday of last week if you’re confused about what species
and df_sub
represent in the line for species, df_sub in df.groupby("species"):
. On Friday we talked about what happens when we iterate over a pandas GroupBy object. On Friday we were using the cars dataset.)
chart_list = []
for species, df_sub in df.groupby("species"):
c = alt.Chart(df_sub).mark_bar().encode(
x="sex",
y="mean(bill_length_mm)",
color="species"
).properties(
title=species
)
chart_list.append(c)
alt.hconcat(chart_list[0], chart_list[1], chart_list[2])
It is definitely cumbersome to write out chart_list[0], chart_list[1], chart_list[2]
. It would be nice if we could just pass the list as our argument, but alt.hconcat
does not accept a list as an argument; it wants charts as its arguments.
alt.hconcat(chart_list)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [37], line 1
----> 1 alt.hconcat(chart_list)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2296, in hconcat(*charts, **kwargs)
2294 def hconcat(*charts, **kwargs):
2295 """Concatenate charts horizontally"""
-> 2296 return HConcatChart(hconcat=charts, **kwargs)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2270, in HConcatChart.__init__(self, data, hconcat, **kwargs)
2267 def __init__(self, data=Undefined, hconcat=(), **kwargs):
2268 # TODO: move common data to top level?
2269 for spec in hconcat:
-> 2270 _check_if_valid_subspec(spec, "HConcatChart")
2271 super(HConcatChart, self).__init__(data=data, hconcat=list(hconcat), **kwargs)
2272 self.data, self.hconcat = _combine_subchart_data(self.data, self.hconcat)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/altair/vegalite/v4/api.py:2074, in _check_if_valid_subspec(spec, classname)
2068 err = (
2069 'Objects with "{0}" attribute cannot be used within {1}. '
2070 "Consider defining the {0} attribute in the {1} object instead."
2071 )
2073 if not isinstance(spec, (core.SchemaBase, dict)):
-> 2074 raise ValueError("Only chart objects can be used in {0}.".format(classname))
2075 for attr in TOPLEVEL_ONLY_KEYS:
2076 if isinstance(spec, core.SchemaBase):
ValueError: Only chart objects can be used in HConcatChart.
Luckily there is a Python abbreviation (not an Altair abbreviation) for “unpacking” the elements in the list and using them as inputs. All we have to do is put a *
before the name of the list.
# same as alt.hconcat(chart_list[0], chart_list[1], ...)
alt.hconcat(*chart_list)