Practice with Altair
Midterm comments
Best way to study: go through the sample midterm on Canvas (solutions by Yasmeen are on Canvas), go through the course notes, including the notes from the videos.
Remember your notecards. Double-sided handwritten.
It’s very hard to learn code by just reading or watching. Try it out yourself (try making changes and see if it works the way you expect).
In most cases, syntax is not important for the midterm. If you write color=Valence,scale=turbo and you meant color=alt.Color("Valence", scale=alt.Scale(scheme="turbo")), that will probably still get full credit, as long as I can tell what you mean, and as long as it’s a syntax error and not a conceptual error.
Do you need to know a method like isin? You can always ask on Ed Discussion if there is something in particular you are curious about. I could imagine asking a question that is easiest to answer using isin, but where there are alternative methods also. I could also imagine giving you code that involves isin and asking what that code does.
I won’t ask anything explicitly about groupby on the midterm, but the other material from the videos is important to study.
Overall I hope that if you are comfortable with the material from class so far, then you will do well on the midterm. The best way to get more comfortable is to practice writing the code yourself.
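As a concrete illustration of the isin point above, here is a minimal sketch on a small made-up DataFrame (the data is hypothetical, not the Spotify dataset). It shows isin next to one alternative that selects the same rows.

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "A"],
    "Valence": [0.2, 0.9, 0.5, 0.7],
})

# isin returns a Boolean Series: True where the value appears in the given list
mask = toy["Artist"].isin(["A", "C"])
subset = toy[mask]

# An alternative without isin: combine two comparisons with | (elementwise "or")
subset_alt = toy[(toy["Artist"] == "A") | (toy["Artist"] == "C")]
```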
Goal for today’s class
We will use the Spotify dataset to investigate how the month of “Week of Highest Charting” is related to the “Energy” and the “Valence” of the song. For example, do songs with lower energy and lower valence do better during winter months?
Note: The goal is not to have a definitive correct answer. Similarly, for the course project at the end of the class, the goal is to explore the data, not to produce a definitive answer.
import pandas as pd
import altair as alt
Data cleaning
Import the Spotify dataset and drop the rows which contain missing values. Save the resulting DataFrame with the variable name df.
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
# There are other ways to do this
df.dropna(inplace=True)
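To see what na_values=" " changes, here is a small sketch that reads a made-up CSV from a string instead of from the course file (the three rows are hypothetical). A cell containing a single space is read as text by default, but it is treated as missing once it is listed in na_values, which is what lets dropna remove that row.

```python
import io
import pandas as pd

# Hypothetical CSV text; the second row has a single space in the Energy column
csv_text = "Song Name,Energy,Valence\nA,0.5,0.3\nB, ,0.8\nC,0.9,0.1\n"

# Default behavior: the single space is kept as a string, so nothing counts as missing
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the single space is read as a missing value (NaN)
toy = pd.read_csv(io.StringIO(csv_text), na_values=" ")

# One of the "other ways": reassign the result of dropna instead of using inplace=True
toy_clean = toy.dropna()
```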
Feature Engineering
Add a new column called “Month” to df which contains the numerical month from the column “Week of Highest Charting”.
Notice that the dtype is listed as “object”.
df["Week of Highest Charting"]
0 2021-07-23--2021-07-30
1 2021-07-23--2021-07-30
2 2021-06-25--2021-07-02
3 2021-07-02--2021-07-09
4 2021-07-23--2021-07-30
...
1551 2019-12-27--2020-01-03
1552 2019-12-27--2020-01-03
1553 2019-12-27--2020-01-03
1554 2019-12-27--2020-01-03
1555 2019-12-27--2020-01-03
Name: Week of Highest Charting, Length: 1545, dtype: object
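Here is the procedure we will follow, shown first on a single string. The month sits at positions 5 and 6 of the string, so we slice with [5:7].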
x = "2021-07-23--2021-07-30"
x[5:7]
'07'
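The slice x[5:7] works because every entry has the same layout. As a small sketch, the positions can be spelled out, and str.split gives an alternative that does not depend on counting characters.

```python
x = "2021-07-23--2021-07-30"

# Indexing starts at 0, and the right endpoint of a slice is excluded
year = x[0:4]    # characters 0, 1, 2, 3
month = x[5:7]   # characters 5, 6
day = x[8:10]    # characters 8, 9

# Alternative: take the start date (before "--"), then split it on the single dashes
start_date = x.split("--")[0]
month_alt = start_date.split("-")[1]
```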
Now apply that same approach to every entry in the column.
# on Monday, we used .dt
df["Week of Highest Charting"].map(lambda x: x[5:7])
0 07
1 07
2 06
3 07
4 07
..
1551 12
1552 12
1553 12
1554 12
1555 12
Name: Week of Highest Charting, Length: 1545, dtype: object
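The comment in the code above mentions .dt. As a hedged sketch of two other routes to the same values (using a short made-up Series in place of the real column): the .str accessor slices every entry without a lambda function, and converting the start date with pd.to_datetime makes the .dt accessor available.

```python
import pandas as pd

# A few sample entries in the same format as "Week of Highest Charting"
weeks = pd.Series([
    "2021-07-23--2021-07-30",
    "2021-06-25--2021-07-02",
    "2019-12-27--2020-01-03",
])

# Same strings as .map(lambda x: x[5:7]), using the .str accessor
month_str = weeks.str[5:7]

# Convert the first ten characters (the start date) to datetime, then use .dt
start = pd.to_datetime(weeks.str[:10])
month_num = start.dt.month
```

The .dt route returns numbers directly, so no separate conversion step is needed.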
One way to make this numeric is to use the function pd.to_numeric.
pd.to_numeric(df["Week of Highest Charting"].map(lambda x: x[5:7]))
0 7
1 7
2 6
3 7
4 7
..
1551 12
1552 12
1553 12
1554 12
1555 12
Name: Week of Highest Charting, Length: 1545, dtype: int64
Another way to make it numeric is to use int inside the lambda function.
df["Week of Highest Charting"].map(lambda x: int(x[5:7]))
0 7
1 7
2 6
3 7
4 7
..
1551 12
1552 12
1553 12
1554 12
1555 12
Name: Week of Highest Charting, Length: 1545, dtype: int64
Let’s save it as a new column.
df["Month"] = df["Week of Highest Charting"].map(lambda x: int(x[5:7]))
Notice how the “Month” column has “int64” as its dtype.
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
Month int64
dtype: object
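After adding a column like this, a quick sanity check can catch slicing mistakes. The sketch below uses a small made-up DataFrame (not the real data) and checks that every extracted month lies between 1 and 12.

```python
import pandas as pd

# Hypothetical entries in the same format as the real column
toy = pd.DataFrame({
    "Week of Highest Charting": [
        "2021-07-23--2021-07-30",
        "2021-07-02--2021-07-09",
        "2021-06-25--2021-07-02",
        "2019-12-27--2020-01-03",
    ]
})
toy["Month"] = toy["Week of Highest Charting"].map(lambda x: int(x[5:7]))

# between is inclusive on both ends by default
all_valid = toy["Month"].between(1, 12).all()

# Count how many songs fall in each month, ordered by month
counts = toy["Month"].value_counts().sort_index()
```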
Displaying the data
Make a scatter plot of the data from df by encoding “Energy” as the x-coordinate, encoding “Valence” as the y-coordinate, and encoding “Month” as the color.
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color="Month"
)
Make another chart, this time a bar chart, where the bars correspond to different months, and the heights of the bars correspond to the median value (reference) of “Valence”.
(We will make this look better later.)
alt.Chart(df).mark_bar().encode(
    x="Month",
    y="median(Valence)"
)
Reminder: There are different encoding data types in Altair, including Quantitative, Nominal, Ordinal (reference). Of these three options, which do you think makes most sense for “Month”? Change the scatter plot and the bar chart to reflect this encoding data type. Do they look better?
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="median(Valence)"
)
What if you use max instead of median?
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="max(Valence)"
)
What if you use mark_circle instead of mark_bar and remove the max portion?
Can you see the relationship between the highest circles on the scatter plot and the bar heights on the bar chart?
alt.Chart(df).mark_circle().encode(
    x="Month:O",
    y="Valence"
)
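The same relationship can be checked numerically. As a sketch on made-up data (the values are hypothetical), the height of each max(Valence) bar is the largest “Valence” value in that month, and the same holds for min with the smallest value. In pandas this can be computed with groupby.

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Month": [1, 1, 7, 7, 12],
    "Valence": [0.2, 0.4, 0.9, 0.3, 0.5],
})

# Largest Valence in each month: the height of the highest circle, and of the max bar
max_by_month = toy.groupby("Month")["Valence"].max()

# Smallest Valence in each month: the lowest circle, and the min bar
min_by_month = toy.groupby("Month")["Valence"].min()
```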
Same question but for minimum. We will specify the domain (0,1) so that the y-axis range is the same in both charts.
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y=alt.Y("min(Valence)", scale=alt.Scale(domain=(0, 1)))
)
Customizing the charts
Choose an appropriate color scheme for the scatter plot chart from the list of color scheme options.
The cyclical schemes make sense for months, because December and January should be close to each other.
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month", scale=alt.Scale(scheme="sinebow"))
)
What happens if you change the data encoding type specification for that color scheme?
alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O", scale=alt.Scale(scheme="sinebow"))
)
Try making the bar chart with median look more dramatic by specifying that zero does not need to be included in the y-axis scale (reference).
# zero=True (the default)
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="median(Valence)"
)
# zero=False
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y=alt.Y("median(Valence)", scale=alt.Scale(zero=False))
)
Interactive chart
We have used alt.selection_interval a few times. In this example, we will use alt.selection_single. It works in the same way. The only difference is that we are only selecting a single data point at a time.
Define an Altair selection object using the following. It will raise an error. Can you figure out how to correct the error?
sel = alt.selection_single(fields="Month")
The error is telling us that a string (like “Month”) is not an appropriate value for the fields keyword argument. We should instead use something like a list or a tuple.
sel = alt.selection_single(fields="Month")
---------------------------------------------------------------------------
SchemaValidationError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_43383/4284748121.py in <module>
----> 1 sel = alt.selection_single(fields="Month")
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/api.py in selection_single(**kwargs)
254 def selection_single(**kwargs):
255 """Create a selection with type='single'"""
--> 256 return selection(type="single", **kwargs)
257
258
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/api.py in selection(name, type, **kwds)
236 The selection object that can be used in chart creation.
237 """
--> 238 return Selection(name, core.SelectionDef(type=type, **kwds))
239
240
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/schema/core.py in __init__(self, *args, **kwds)
13722
13723 def __init__(self, *args, **kwds):
> 13724 super(SelectionDef, self).__init__(*args, **kwds)
13725
13726
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/utils/schemapi.py in __init__(self, *args, **kwds)
175
176 if DEBUG_MODE and self._class_is_valid_at_instantiation:
--> 177 self.to_dict(validate=True)
178
179 def copy(self, deep=True, ignore=()):
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
338 self.validate(result)
339 except jsonschema.ValidationError as err:
--> 340 raise SchemaValidationError(self, err)
341 return result
342
SchemaValidationError: Invalid specification
altair.vegalite.v4.schema.core.SelectionDef->0->fields, validating 'type'
'Month' is not of type 'array'
sel = alt.selection_single(fields=["Month"])
Display two identical scatter plots from the Spotify data, one above the other, using “Energy” for the x-axis, using “Valence” for the y-axis, and using “Month” for the color.
Since this portion of the code will be used twice, we can make our code more DRY (Don’t Repeat Yourself) by assigning it to the variable name c.
c = alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O", scale=alt.Scale(scheme="sinebow"))
)
c&c
Using add_selection, add the above selection object to the top chart.
This is correct, but the selection has no visible effect so far. Usually we have been using add_selection(brush), but since our variable name is sel, we have to use that instead.
c.add_selection(sel)&c
Use transform_filter on the bottom chart so that only data matching the selected month gets kept on the bottom chart.
Try clicking on the top chart. Only the songs from the same month will be kept on the bottom chart. (The fields=["Month"] part is what says to only pay attention to the “Month” field when deciding what to keep and what to discard.)
# used selection_single
c.add_selection(sel)&c.transform_filter(sel)
We can do the same thing with the “Artist” field. We’ll add a tooltip so we can see who the artist is.
sel2 = alt.selection_single(fields=["Artist"])
c = alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O", scale=alt.Scale(scheme="sinebow")),
    tooltip=["Artist", "Song Name"]
)
Now if you click on a point in the top chart, only songs by the same artist will be shown in the bottom chart. (There is a bug in Altair that prevents the tooltip from being shown on the bottom chart. Ask Chris if you’re interested in a workaround for that bug.)
c.add_selection(sel2)&c.transform_filter(sel2)
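For intuition about what transform_filter(sel2) keeps, the same filtering can be written in pandas. This is only an analogy on made-up data (the artists and songs are hypothetical): clicking a point corresponds to keeping the rows that share that point's value in the “Artist” field.

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Artist": ["A", "B", "A"],
    "Song Name": ["s1", "s2", "s3"],
    "Valence": [0.2, 0.9, 0.5],
})

# Pretend the first point was clicked; only its "Artist" value matters
clicked_artist = toy.loc[0, "Artist"]

# Boolean indexing keeps the rows by that same artist
kept = toy[toy["Artist"] == clicked_artist]
```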