Week 4 Friday
Contents
Week 4 Friday#
Announcements#
Video quizzes due. (You can still get credit on all the video quizzes up until Sunday, so if you’re behind, try to catch up this weekend. Early Monday I will convert them to practice quizzes so you can use them to practice for the midterm.)
Midterm Wednesday. Covers material up to and including anything covered Monday.
Pick up a notecard if you haven’t already. Hand-written notes, written on both sides, can be used during the midterm.
Sample midterm posted on the “Week 5” page on Canvas. I’ll ask William to go over as much of the sample midterm as possible on Tuesday.
import pandas as pd
import altair as alt
import seaborn as sns
Just for a change of pace, we are calling the DataFrame cars
instead of the usual df
today.
cars = sns.load_dataset("mpg")
Interactive bar chart#
Using the “mpg” dataset from Seaborn, display a scatterplot of horsepower vs mpg, colored by origin, alongside a bar chart, showing how many cars there are from each origin.
We need to use cars
instead of df
when we specify the pandas DataFrame for this Altair chart.
alt.Chart(df).mark_circle().encode(
x="horsepower",
y="mpg",
color="origin"
)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [3], line 1
----> 1 alt.Chart(df).mark_circle().encode(
2 x="horsepower",
3 y="mpg",
4 color="origin"
5 )
NameError: name 'df' is not defined
Here is the basic scatter plot for this data.
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color="origin"
)
c1
Here is the basic bar chart for this data. Notice how we use "count()"
to tell Altair to count how many instances there are in each category. For example, the following says that there are about 70 cars from Europe in this dataset.
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y="count()",
color="origin"
)
c2
Similar to count
, there are other functions we can use, like sum
and mean
and median
. Using these on their own does not make sense, like y="sum()"
would not make sense… instead, we need to specify what should be summed. Here we are summing the “horsepower” column. (If there were a column that contained all 1s, we could sum that column to get the same thing as count()
.)
# what about sum?
alt.Chart(cars).mark_bar().encode(
x="origin",
y="sum(horsepower)",
color="origin"
)
Sample question: Try to find the heights of these blue bars exactly using pandas and Boolean indexing.
df_europe = cars[cars["origin"] == "europe"]
Here is one way to find how many rows there are in df_europe
. Notice how this matches the height of our bar above.
# count
df_europe.shape[0]
70
Here is another approach, using the built-in Python function len
to get the length of the DataFrame.
len(df_europe)
70
Notice how the following matches our y="sum(horsepower)"
height for Europe.
df_europe["horsepower"].sum()
5478.0
I expected the following to work but to possibly be slower. Instead, it doesn’t work because of missing values (or rather, it works, but it does not give us any information about the height of the bar).
# might be slower since it's not specific to pandas
# also handles missing values differently
sum(df_europe["horsepower"])
nan
Back to our original charts.
c1
c2
Here we create a selection object (step 1) and add it to c1
(step 2). Now you can click and drag and a grey rectangle shows up, but there is no responsiveness yet. (On Wednesday, we used alt.selection_single
to choose one thing at a time; here, we are using alt.selection_interval
to choose an interval at a time.)
brush = alt.selection_interval() # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color="origin"
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y="count()",
color="origin"
)
alt.hconcat(c1,c2) # c1|c2
Using an Altair selection interval object along with
transform_filter
, change the heights of the bars to match how many cars were selected. Use thescale
keyword argument so that the y-axis of the bar chart does not change.
Here we add the responsiveness, in the transform_filter(brush)
portion. This tells Altair to only use objects which have been selected. We also specify a fixed domain
for the y-axis so that the height of the bars are easier to compare.
Try clicking and dragging on the left chart, and notice how the bar heights respond.
brush = alt.selection_interval() # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color="origin"
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Because c2
references brush
, but we never added brush
to c2
, if we try to view c2
by itself, we get an error. (We didn’t get an error above, because we had it displayed together with c1
using alt.hconcat
.)
c2
The chart c1
does work on its own, there’s just no responsiveness.
c1
I’m not sure why the following doesn’t work. I assumed that when we clicked and dragged on it, all the non-selected points would disappear.
# Not sure why this doesn't work
c1.transform_filter(brush)
As review from Wednesday, change the colors in the scatterplot so that unselected points get a constant color, using
alt.value
. (Aside about colors. You can use any valid CSS color string. There are very many options for these CSS color strings. The easiest option is to use a named color. Another common option is to specify RGB values. I haven’t learned it myself yet, but I believe HSL values in general are supposed to be easier to work with than RGB values, in terms of the colors produced. I heard this article Color for the Color-challenged recommended on an episode of the podcast syntax.fm.)
Notice how if you click and drag, the unselected points now have an “orchid” color. (Notice how we need to use alt.value("orchid")
. If we just used "orchid"
, Altair would be looking for a column in the DataFrame named “orchid”.)
brush = alt.selection_interval() # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color=alt.condition(brush, "origin", alt.value("orchid"))
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Here is an example of using a different color encoding. Selected points are colored according to “horsepower”; unselected points are still “orchid”.
brush = alt.selection_interval() # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color=alt.condition(brush, "horsepower", alt.value("orchid"))
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Here is one more example of how alt.condition
works. Selected points are colored black; unselected points are colored orchid.
brush = alt.selection_interval() # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color=alt.condition(brush, alt.value("black"), alt.value("orchid"))
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Specify that the selection interval should only be bound to the x-encoding, and specify an initial value with the
init
keyword argument. Theinit
keyword argument should be assigned to a dictionary, which in this case will have only one key. Can you guess what the key-value pair should look like?
Notice how now we have no control over the vertical height of the selection interval; we can only control the span of the x encoding.
brush = alt.selection_interval(encodings=["x"]) # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color=alt.condition(brush, "origin", alt.value("orchid"))
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Specifying something like {"x": 100}
wouldn’t really make sense, because to specify an interval, we need to specify both endpoints of the interval. So we specify init={"x": [100,140]}
. Notice how the chart starts with this interval selected.
brush = alt.selection_interval(encodings=["x"], init={"x": [100,140]}) # step 1
c1 = alt.Chart(cars).mark_circle().encode(
x="horsepower",
y="mpg",
color=alt.condition(brush, "origin", alt.value("orchid"))
).add_selection(brush) # step 2
c2 = alt.Chart(cars).mark_bar().encode(
x="origin",
y=alt.Y("count()", scale=alt.Scale(domain=[0,260])),
color="origin"
).transform_filter(brush)
alt.hconcat(c1,c2) # c1|c2
Review#
I’m not sure how much time will be left today. Be thinking about what topics you would like to review. (Please phrase them in terms of the Python topic, as opposed to for example “Worksheet 5, #4”.)
We’ll talk more about apply
on Monday. (Secretly there is also an apply
method for pandas Series, but we will never use that in Math 10.)
map: is a Series method
apply and applymap: DataFrame methods
df = pd.DataFrame({
"A": [3,1,4,1],
"B": [2,0,2,5]
})
If the following is confusing, try breaking it up into smaller pieces, like df["A"]
, then df["A"].map(lambda x: x**2)
, then df["A"].map(lambda x: x**2 > 10)
, then df["A"].map(lambda x: x**2 > 10).sum()
.
df["A"].map(lambda x: x**2 > 10).sum()
1
Here is an example of applying the function str
to every entry in df
.
# df.applymap(f) applies f to every entry in df
df2 = df.applymap(str)
The DataFrame looks the same, but the entries have been converted to strings.
df2
A | B | |
---|---|---|
0 | 3 | 2 |
1 | 1 | 0 |
2 | 4 | 2 |
3 | 1 | 5 |
Let’s check that the lower-right 5
is really a string.
type(df2.loc[3,"B"])
str