Week 3 Wednesday#
Announcements#
I’m trying to use a “paid” machine. I’m not sure what happens when you duplicate this project. You might have to go to Environment on the right-hand side, click the Settings wheel, and switch to the Basic machine.
Deepnote support thinks my “hardware resetting” issue should be fixed. Let’s try and see.
Reminder: I have office hours in here Friday before class, 12-1pm.
Worksheet 6 posted today.
Maya is here to help.
Plan#
Encoding data types
Other types of charts in Altair
Multi-view plots in Altair
Install the newest version of Altair by executing the following command. (Even though we are executing it in a code cell, this isn’t a Python command; the leading ! sends it to the underlying Linux operating system.)
Let me know if this process significantly slows down your workspace!
!pip install altair==5.0.0rc1
Collecting altair==5.0.0rc1
Downloading altair-5.0.0rc1-py3-none-any.whl (709 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 709.5/709.5 KB 64.4 MB/s eta 0:00:00
Requirement already satisfied: numpy in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (1.23.4)
Requirement already satisfied: jinja2 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from altair==5.0.0rc1) (2.11.3)
Requirement already satisfied: jsonschema>=3.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from altair==5.0.0rc1) (3.2.0)
Requirement already satisfied: typing-extensions>=4.0.1 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (4.4.0)
Requirement already satisfied: pandas>=0.18 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (1.2.5)
Requirement already satisfied: toolz in /shared-libs/python3.9/py/lib/python3.9/site-packages (from altair==5.0.0rc1) (0.12.0)
Requirement already satisfied: pyrsistent>=0.14.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (0.18.1)
Requirement already satisfied: attrs>=17.4.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (22.1.0)
Requirement already satisfied: six>=1.11.0 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (1.16.0)
Requirement already satisfied: setuptools in /root/venv/lib/python3.9/site-packages (from jsonschema>=3.0->altair==5.0.0rc1) (58.1.0)
Requirement already satisfied: pytz>=2017.3 in /shared-libs/python3.9/py/lib/python3.9/site-packages (from pandas>=0.18->altair==5.0.0rc1) (2022.5)
Requirement already satisfied: python-dateutil>=2.7.3 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from pandas>=0.18->altair==5.0.0rc1) (2.8.2)
Requirement already satisfied: MarkupSafe>=0.23 in /shared-libs/python3.9/py-core/lib/python3.9/site-packages (from jinja2->altair==5.0.0rc1) (2.0.0)
Installing collected packages: altair
Attempting uninstall: altair
Found existing installation: altair 4.2.2
Not uninstalling altair at /shared-libs/python3.9/py/lib/python3.9/site-packages, outside environment /root/venv
Can't uninstall 'altair'. No files were found to uninstall.
Successfully installed altair-5.0.0rc1
WARNING: You are using pip version 22.0.4; however, version 23.1 is available.
You should consider upgrading via the '/root/venv/bin/python -m pip install --upgrade pip' command.
Import Altair (how is that different from what we just did?) and check that the version is indeed 5.0.0rc1, the release candidate installed above.
import altair as alt
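Installing a package puts files on disk, while importing loads it into the current Python session. As a rough sketch of the version check, the standard library can report which version pip installed. The helper name installed_version is made up for this illustration; after import altair as alt, the attribute alt.__version__ should report the same kind of string.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package as a string, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# "altair" is the name pip knows the package by; the import name happens to match.
altair_version = installed_version("altair")
print(altair_version)
```

If this prints None, the package is not installed in the environment Python is using.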
Encoding data types#
(This notion of quantitative data vs categorical data will also be very important when we get to the Machine Learning portion of Math 10.) Altair chooses different default values depending on the type of the data being encoded. These are the 5 types of data distinguished by Altair (Reference):
Data Type |
Shorthand Code |
Description |
---|---|---|
quantitative |
Q |
a continuous real-valued quantity |
ordinal |
O |
a discrete ordered quantity |
nominal |
N |
a discrete unordered category |
temporal |
T |
a time or date value |
geojson |
G |
a geographic shape |
A quantitative data type is just an ordinary numeric data type, like floats. Ordinal and Nominal data types are categorical data types, where the values represent discrete categories or classes. We use the Ordinal designation if the categories have a natural ordering and we use Nominal if the categories do not have a natural ordering. A Temporal data type is used for data representing datetime-like values. The last encoding data type (which I haven’t covered in Math 10 before) is for geographic values (like for maps).
Load the “mpg” dataset from Seaborn (using sns.load_dataset) and name the DataFrame df.
import seaborn as sns
df = sns.load_dataset("mpg")
Find the sub-DataFrame for which the name of the car contains the substring “skylark”. Name the sub-DataFrame df_sub. (Reminder: use str and contains.)
Notice how the row at index 1 has the substring "skylark" in the “name” column.
df.head(4)
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
We need to use the accessor attribute str, because that will give us access to methods like contains. If we try to use contains directly on this pandas Series, then we get an error, because pandas Series do not have such a method.
df["name"].contains("skylark")
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 df["name"].contains("skylark")
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5465, in NDFrame.__getattr__(self, name)
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5464 return self[name]
-> 5465 return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'contains'
On the other hand, if we use str, then we can use the contains method. Notice how we get a True in the position at index 1, corresponding to the substring "skylark" we saw above.
df["name"].str.contains("skylark")
0 False
1 True
2 False
3 False
4 False
...
393 False
394 False
395 False
396 False
397 False
Name: name, Length: 398, dtype: bool
Here we use Boolean indexing to get the appropriate sub-DataFrame. Notice how there are only 4 rows, and each of the names contains the substring “skylark”.
df_sub = df[df["name"].str.contains("skylark")]
df_sub
| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name |
---|---|---|---|---|---|---|---|---|---|
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
226 | 20.5 | 6 | 231.0 | 105.0 | 3425 | 16.9 | 77 | usa | buick skylark |
305 | 28.4 | 4 | 151.0 | 90.0 | 2670 | 16.0 | 79 | usa | buick skylark limited |
339 | 26.6 | 4 | 151.0 | 84.0 | 2635 | 16.4 | 81 | usa | buick skylark |
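The two steps (build a Boolean Series, then index with it) can be tried on a tiny Series without loading the dataset. This is only a sketch: the variable names names and mask are chosen for illustration, and the car names are copied from the tables above rather than from the full dataset.

```python
import pandas as pd

# Car names copied from the tables above; this is not the full dataset.
names = pd.Series(
    [
        "chevrolet chevelle malibu",
        "buick skylark 320",
        "plymouth satellite",
        "amc rebel sst",
        "buick skylark",
        "buick skylark limited",
        "buick skylark",
    ],
    name="name",
)

mask = names.str.contains("skylark")  # Boolean Series, one entry per row
print(mask.sum())    # True counts as 1, so this is the number of matches
print(names[mask])   # Boolean indexing keeps only the rows where mask is True
```

Summing the mask is a quick way to confirm how many rows the sub-DataFrame should have before building it.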
Make a scatter plot in Altair from this sub-DataFrame using the “model_year” for both the x-coordinate and the color, and using “mpg” for the y-coordinate. (We can increase the size of the points, and remove zero from the x-axis, to make it easier to see.)
Here is the basic chart that shows up. We will make it look better by making some changes.
alt.Chart(df_sub).mark_circle().encode(
x="model_year",
color="model_year",
y="mpg"
)
First we increase the point size.
alt.Chart(df_sub).mark_circle(size=150).encode(
x="model_year",
color="model_year",
y="mpg"
)
Next we specify that the x-axis is not required to include zero. This syntax has recently become shorter. If you skipped the !pip install ... step above, you will need to use the old syntax:
x=alt.X("model_year", scale=alt.Scale(zero=False))
instead of the new syntax:
x=alt.X("model_year").scale(zero=False)
# without pip install: x=alt.X("model_year", scale=alt.Scale(zero=False))
alt.Chart(df_sub).mark_circle(size=150).encode(
x=alt.X("model_year").scale(zero=False),
color="model_year",
y="mpg"
)
I would say the above chart still does not look very good. Let’s see what effect changing the encoding types will have.
What changes if you specify different encoding types for “model_year”? (The difference in color between quantitative and ordinal will be more clear if you use a different color scheme: options.)
Here we switch the x-axis to the “Ordinal” encoding data type, using :O. Notice how the values 70, 77, 79, 81 are now treated like discrete categories, and the spacing between them is ignored.
alt.Chart(df_sub).mark_circle(size=150).encode(
x=alt.X("model_year:O").scale(zero=False),
color="model_year",
y="mpg"
)
Here we change the color scheme (see the above link for options). We are specifying that the color channel should use a “Quantitative” encoding, but that is the default, so you will see the same thing if you omit it. (This color scheme syntax will not work if you are on the previous version of Altair. See above or the version 4 documentation for how to make the change.)
alt.Chart(df_sub).mark_circle(size=150).encode(
x=alt.X("model_year:O").scale(zero=False),
color=alt.Color("model_year:Q").scale(scheme="lightgreyred"),
y="mpg"
)
Here is the exact same chart, but where we switch the color to the “Ordinal” encoding data type. Do you see the differences? One difference is that the “Quantitative” legend shows a continuous progression of numbers. A more subtle difference is that the colors for 77, 79, 81 are grouped closer to each other in the “Quantitative” version, whereas everything, including the 70, is equally spaced colorwise in the “Ordinal” version.
alt.Chart(df_sub).mark_circle(size=150).encode(
x=alt.X("model_year:O").scale(zero=False),
color=alt.Color("model_year:O").scale(scheme="lightgreyred"),
y="mpg"
)
If you switch the color data type to “Nominal” (which means unordered) and use the default color scheme, you can see that there is no ordering or progression to the colors used. The colors in this case are chosen to make the values as distinct as possible.
alt.Chart(df_sub).mark_circle(size=150).encode(
x=alt.X("model_year:O").scale(zero=False),
color="model_year:N",
y="mpg"
)
Other types of charts in Altair#
Here we switch back to the full DataFrame, df. There are many types of charts in Altair (browse the example gallery to see some of the possibilities).
Make a bar chart using “cylinders” for the x-coordinate, using the median of the mpg values for the y-coordinate.
We haven’t gotten to the “median” part yet.
alt.Chart(df).mark_bar().encode(
x="cylinders",
y="mpg"
)
We specify ordinal encoding using :O. This looks better, but it still looks strange. Do you see all the little white horizontal lines? That is because the “mpg” bars are getting stacked on top of each other. Another thing to notice: the bars go up to about 6000… that is more evidence that these are being stacked on top of each other.
This kind of chart makes most sense if we perform some kind of aggregation, which we will do below using median.
alt.Chart(df).mark_bar().encode(
x="cylinders:O",
y="mpg"
)
There is no column called "median(mpg)" in our DataFrame. Instead this syntax is telling Altair to compute the median and plot the bar heights based on the result.
alt.Chart(df).mark_bar().encode(
x="cylinders:O",
y="median(mpg)"
)
Add a tooltip so we can find the precise median values.
For example, if you put your mouse over the 4-cylinders bar, it will report a median value of 28.25. That is telling us that the median miles-per-gallon across 4-cylinder cars in the dataset is 28.25.
alt.Chart(df).mark_bar().encode(
x="cylinders:O",
y="median(mpg)",
tooltip=["median(mpg)", "cylinders"]
)
Can you find these same median values using df.groupby? Deepnote hides the warning, but use the keyword argument numeric_only when computing the median to avoid a Python warning.
df.groupby("cylinders").median(numeric_only=True)
cylinders | mpg | displacement | horsepower | weight | acceleration | model_year |
---|---|---|---|---|---|---|
3 | 20.25 | 70.0 | 98.5 | 2375.0 | 13.5 | 75.0 |
4 | 28.25 | 105.0 | 78.0 | 2232.0 | 16.2 | 78.0 |
5 | 25.40 | 131.0 | 77.0 | 2950.0 | 19.9 | 79.0 |
6 | 19.00 | 228.0 | 100.0 | 3201.5 | 16.1 | 76.0 |
8 | 14.00 | 350.0 | 150.0 | 4140.0 | 13.0 | 73.0 |
The above result is a pandas DataFrame, so we can get a column out of that DataFrame just like always, by using square brackets and the name of the column.
Notice how that same 28.25 number is visible here.
df.groupby("cylinders").median(numeric_only=True)["mpg"]
cylinders
3 20.25
4 28.25
5 25.40
6 19.00
8 14.00
Name: mpg, dtype: float64
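A variation worth knowing: if the column is selected before aggregating, only that column is passed to median, so the numeric_only argument is not needed. The sketch below uses a small hand-made DataFrame (called sample here, an illustrative name) with rows copied from the tables above, so its medians differ from the full-dataset values.

```python
import pandas as pd

# Rows copied from the tables above (a tiny sample, not the full mpg dataset).
sample = pd.DataFrame(
    {
        "cylinders": [8, 8, 8, 8, 6, 4, 4],
        "mpg": [18.0, 15.0, 18.0, 16.0, 20.5, 28.4, 26.6],
        "origin": ["usa"] * 7,  # a non-numeric column that median cannot use
    }
)

# Select the "mpg" column from the grouped object, then take the median per group.
medians = sample.groupby("cylinders")["mpg"].median()
print(medians)
```

With the full DataFrame, the same pattern would be df.groupby("cylinders")["mpg"].median().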
Make a “rectangle chart” using mark_rect with “model_year” along the x-axis, with “cylinders” along the y-axis, and with the rectangles colored by "count()".
Note that here "count()" is something defined by Altair, not one of the columns in df.
Judging by the colors, the most common combination in this dataset appears to be model year 82 with 4 cylinders. The tooltip lets us check that there are 28 such cars in the dataset.
Here we are storing the chart in the variable c1 so we can refer to it below.
c1 = alt.Chart(df).mark_rect().encode(
x="model_year:O",
y="cylinders:O",
color="count()",
tooltip=["count()"]
)
c1
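The number Altair shows for each rectangle is a row count for one (model_year, cylinders) pair, which can be cross-checked in pandas by grouping on both columns and taking the group sizes. This is a sketch on a few rows copied from the tables above; the names sample and counts are illustrative, and the counts here are for the sample only.

```python
import pandas as pd

# Rows copied from the tables above (a tiny sample, not the full mpg dataset).
sample = pd.DataFrame(
    {
        "model_year": [70, 70, 70, 70, 77, 79, 81],
        "cylinders": [8, 8, 8, 8, 6, 4, 4],
    }
)

# size() counts the rows in each (model_year, cylinders) group,
# which plays the role of Altair's "count()".
counts = sample.groupby(["model_year", "cylinders"]).size()
print(counts)
```

On the full DataFrame, df.groupby(["model_year", "cylinders"]).size() gives one number per rectangle in the chart.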
Make a “text chart” using mark_text with the same parameters as above, but remove the color encoding, and add a text encoding based on "count()".
By itself, this looks a little strange. (We are using text for the marks in this case, that is why it is called mark_text.)
c2 = alt.Chart(df).mark_text().encode(
x="model_year:O",
y="cylinders:O",
text="count()"
)
c2
Layer these last two charts together, either using + or using alt.layer.
c1+c2
The + notation in Altair is just shorthand for layering using alt.layer. Here we are getting both charts displayed at once.
alt.layer(c1, c2)
We didn’t get further than this.
Multi-view plots in Altair#
A facet chart breaks the dataset up into sub-datasets and makes a different chart for each sub-dataset.
Make a facet chart using “horsepower” for the x-coordinate, “mpg” for the y-coordinate, “cylinders” for the color with the Nominal data encoding type, and dividing the data according to the number of cylinders. Put each chart in its own row.
Time to work on Worksheets 5-6#
Maya and I are here to help.