Practice with Altair

Midterm comments

  • Best way to study: go through the sample midterm on Canvas (solutions by Yasmeen are on Canvas), go through the course notes, including the notes from the videos.

  • Remember your notecards. Double-sided handwritten.

  • It’s very hard to learn code by just reading or watching. Try it out yourself (try making changes and see if it works the way you expect).

  • In most cases, syntax is not important for the midterm. If you write color=Valence,scale=turbo and you meant color=alt.Color("Valence", scale=alt.Scale(scheme="turbo")), that will probably still get full credit, as long as I can tell what you mean, and as long as it’s a syntax error and not a conceptual error.

  • Do you need to know a method like isin? You can always ask on Ed Discussion if there is something in particular you are curious about. I could imagine asking a question that is easiest to answer using isin, but where there are alternative methods also. I could also imagine giving you code that involves isin and asking what that code does.

  • I won’t ask anything explicitly about groupby on the midterm, but the other material from the videos is important to study.

  • Overall I hope that if you are comfortable with the material from class so far, then you will do well on the midterm. The best way to get more comfortable is to practice writing the code yourself.
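
For the isin bullet above, here is a minimal sketch of what the method does, next to an alternative that gives the same result. The tiny DataFrame toy and its values are made up for illustration; they are not from the Spotify dataset.

```python
import pandas as pd

# Hypothetical toy data, not the Spotify dataset
toy = pd.DataFrame({
    "Artist": ["A", "B", "C", "A"],
    "Month": [1, 7, 12, 2],
})

# isin returns a Boolean Series: True where the value appears in the list
winter_mask = toy["Month"].isin([12, 1, 2])
winter = toy[winter_mask]

# An alternative without isin: combine comparisons with | (elementwise "or")
winter_alt = toy[(toy["Month"] == 12) | (toy["Month"] == 1) | (toy["Month"] == 2)]

print(winter)
print(winter.equals(winter_alt))
```

Both approaches keep the same rows; isin is just shorter once the list of allowed values gets long.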

Goal for today’s class

We will use the Spotify dataset to investigate how the month of “Week of Highest Charting” is related to the “Energy” and the “Valence” of the song. For example, do songs with lower energy and lower valence do better during winter months?

Note: The goal is not to have a definitive correct answer. Similarly, for the course project at the end of the class, the goal is to explore the data, not to produce a definitive answer.

import pandas as pd
import altair as alt

Data cleaning

  • Import the Spotify dataset and drop the rows which contain missing values. Save the resulting DataFrame with the variable name df.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
# There are other ways to do this
df.dropna(inplace=True)
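
To see what na_values=" " and dropna are doing, here is a small sketch that reads a made-up CSV from a string instead of from a file. The three songs and their values are hypothetical; the only assumption carried over from the real file is that a missing entry shows up as a single space.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file: "Song B" has a single space for Energy
csv_text = "Song Name,Energy,Valence\nSong A,0.8,0.6\nSong B, ,0.4\nSong C,0.5,0.9\n"

# na_values=" " tells read_csv to treat a single space as a missing value
toy = pd.read_csv(io.StringIO(csv_text), na_values=" ")
print(toy["Energy"].isna().sum())  # number of missing Energy values

# One of the "other ways": dropna without inplace returns a new DataFrame
cleaned = toy.dropna()
print(cleaned.shape)
print(cleaned.index.tolist())  # the original row labels are kept
```

Dropping rows does not renumber the remaining ones. That is why the Series displayed further down has row labels going up to 1555 even though its length is 1545.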

Feature Engineering

  • Add a new column called “Month” to df which contains the numerical month from the column “Week of Highest Charting”.

Notice that the dtype is listed as “object”.

df["Week of Highest Charting"]
0       2021-07-23--2021-07-30
1       2021-07-23--2021-07-30
2       2021-06-25--2021-07-02
3       2021-07-02--2021-07-09
4       2021-07-23--2021-07-30
                 ...          
1551    2019-12-27--2020-01-03
1552    2019-12-27--2020-01-03
1553    2019-12-27--2020-01-03
1554    2019-12-27--2020-01-03
1555    2019-12-27--2020-01-03
Name: Week of Highest Charting, Length: 1545, dtype: object

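Here is the procedure we will follow. The month is the two-character piece of the string starting at index 5, which we can get by slicing.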

x = "2021-07-23--2021-07-30"
x[5:7]
'07'

Now apply that same approach to every entry in the column.

# On Monday, we used the .dt accessor instead
df["Week of Highest Charting"].map(lambda x: x[5:7])
0       07
1       07
2       06
3       07
4       07
        ..
1551    12
1552    12
1553    12
1554    12
1555    12
Name: Week of Highest Charting, Length: 1545, dtype: object

One way to make this numeric is to use the function pd.to_numeric.

pd.to_numeric(df["Week of Highest Charting"].map(lambda x: x[5:7]))
0        7
1        7
2        6
3        7
4        7
        ..
1551    12
1552    12
1553    12
1554    12
1555    12
Name: Week of Highest Charting, Length: 1545, dtype: int64

Another way to make it numeric is to use int inside the lambda function.

df["Week of Highest Charting"].map(lambda x: int(x[5:7]))
0        7
1        7
2        6
3        7
4        7
        ..
1551    12
1552    12
1553    12
1554    12
1555    12
Name: Week of Highest Charting, Length: 1545, dtype: int64

Let’s save it as a new column.

df["Month"] = df["Week of Highest Charting"].map(lambda x: int(x[5:7]))

Notice how the “Month” column has “int64” as its dtype.

df.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
Month                          int64
dtype: object
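
The code comment above mentions .dt. As a rough sketch of how the approaches compare, here are two other ways to get the same months, shown on a made-up two-row Series (weeks is a hypothetical stand-in for df["Week of Highest Charting"]). The .dt version assumes the first 10 characters of each entry are the start date.

```python
import pandas as pd

# Hypothetical two-row stand-in for df["Week of Highest Charting"]
weeks = pd.Series(["2021-07-23--2021-07-30", "2019-12-27--2020-01-03"])

# Same idea as map with a lambda, but using the .str accessor for slicing
month_str = weeks.str[5:7].astype(int)

# Using .dt: first convert the start date (first 10 characters) to a datetime
month_dt = pd.to_datetime(weeks.str[:10]).dt.month

print(month_str.tolist())
print(month_dt.tolist())
```

All of these should agree with the map and lambda version. Which one to use is mostly a matter of taste.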

Displaying the data

  • Make a scatter plot of the data from df by encoding “Energy” as the x-coordinate, encoding “Valence” as the y-coordinate, and encoding “Month” as the color.

alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color="Month"
)
  • Make another chart, this time a bar chart, where the bars correspond to different months, and the heights of the bars correspond to the median value (reference) of “Valence”.

(We will make this look better later.)

alt.Chart(df).mark_bar().encode(
    x="Month",
    y="median(Valence)"
)
  • Reminder: There are different encoding data types in Altair, including Quantitative, Nominal, Ordinal (reference). Of these three options, which do you think makes most sense for “Month”? Change the scatter plot and the bar chart to reflect this encoding data type. Do they look better?

alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="median(Valence)"
)
  • What if you use max instead of median?

alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="max(Valence)"
)
  • What if you use mark_circle instead of mark_bar and remove the max portion?

Can you see the relationship between the highest circles on the scatter plot and the bar heights on the max bar chart?

alt.Chart(df).mark_circle().encode(
    x="Month:O",
    y="Valence"
)
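
To check the relationship numerically, here is a sketch that computes the values Altair is plotting as bar heights. It uses groupby (which will not be asked about explicitly on the midterm) on a made-up DataFrame; toy and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df
toy = pd.DataFrame({
    "Month": [1, 1, 1, 7, 7, 7],
    "Valence": [0.2, 0.4, 0.9, 0.3, 0.7, 0.6],
})

# One row per month: the bar heights for median(Valence), max(Valence), min(Valence)
summary = toy.groupby("Month")["Valence"].agg(["median", "max", "min"])
print(summary)
```

The "max" column holds the highest Valence in each month, which is the height of the highest circle in that month's column of the circle chart. The "min" column does the same for the lowest circle.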

The same question, but for the minimum. We specify the domain (0,1) so that the y-axis covers the same range in both charts.

alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y=alt.Y("min(Valence)",scale=alt.Scale(domain=(0,1)))
)

Customizing the charts

  • Choose an appropriate color scheme for the scatter plot chart from the list of color scheme options.

The cyclical schemes make sense for months, because December and January should be close to each other.

alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month",scale=alt.Scale(scheme="sinebow"))
)
  • What happens if you change the data encoding type specification for that color scheme?

alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O",scale=alt.Scale(scheme="sinebow"))
)
  • Try making the bar chart with median look more dramatic by specifying that zero does not need to be included in the y-axis scale (reference).

# zero=True (the default)
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y="median(Valence)"
)
# zero=False
alt.Chart(df).mark_bar().encode(
    x="Month:O",
    y=alt.Y("median(Valence)", scale=alt.Scale(zero=False))
)

Interactive chart

We have used alt.selection_interval a few times. In this example, we will use alt.selection_single. It works the same way, except that it selects a single data point at a time.

  • Define an Altair selection object using the following. It will raise an error. Can you figure out how to correct the error?

sel = alt.selection_single(fields="Month")

The last line of the error message ('Month' is not of type 'array') is telling us that a string (like "Month") is not an appropriate value for the fields keyword argument. We should instead use something like a list or a tuple.

sel = alt.selection_single(fields="Month")
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_43383/4284748121.py in <module>
----> 1 sel = alt.selection_single(fields="Month")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/api.py in selection_single(**kwargs)
    254 def selection_single(**kwargs):
    255     """Create a selection with type='single'"""
--> 256     return selection(type="single", **kwargs)
    257 
    258 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/api.py in selection(name, type, **kwds)
    236         The selection object that can be used in chart creation.
    237     """
--> 238     return Selection(name, core.SelectionDef(type=type, **kwds))
    239 
    240 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/vegalite/v4/schema/core.py in __init__(self, *args, **kwds)
  13722 
  13723     def __init__(self, *args, **kwds):
> 13724         super(SelectionDef, self).__init__(*args, **kwds)
  13725 
  13726 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/utils/schemapi.py in __init__(self, *args, **kwds)
    175 
    176         if DEBUG_MODE and self._class_is_valid_at_instantiation:
--> 177             self.to_dict(validate=True)
    178 
    179     def copy(self, deep=True, ignore=()):

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/altair/utils/schemapi.py in to_dict(self, validate, ignore, context)
    338                 self.validate(result)
    339             except jsonschema.ValidationError as err:
--> 340                 raise SchemaValidationError(self, err)
    341         return result
    342 

SchemaValidationError: Invalid specification

        altair.vegalite.v4.schema.core.SelectionDef->0->fields, validating 'type'

        'Month' is not of type 'array'
        
sel = alt.selection_single(fields=["Month"])
  • Display two identical scatter plots from the Spotify data, one above the other, using “Energy” for the x-axis, using “Valence” for the y-axis, and using “Month” for the color.

Since this portion of the code will be used twice, we can make our code more DRY (Don’t Repeat Yourself) by assigning it the variable name c.

c = alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O",scale=alt.Scale(scheme="sinebow"))
)
c&c
  • Using add_selection, add the above selection object to the top chart.

This code is correct, but the selection does not affect anything yet. Usually we have been using add_selection(brush), but since our variable name is sel, we have to use that instead.

c.add_selection(sel)&c
  • Use transform_filter on the bottom chart so that only data matching the selected month gets kept on the bottom chart.

Try clicking on the top chart. Only the songs from the same month will be kept on the bottom chart. (The fields=["Month"] part is what says to only pay attention to the “Month” field when deciding what to keep and what to discard.)

# used selection_single
c.add_selection(sel)&c.transform_filter(sel)

We can do the same thing with the “Artist” field. We’ll add a tooltip so we can see who the artist is.

sel2 = alt.selection_single(fields=["Artist"])
c = alt.Chart(df).mark_circle().encode(
    x="Energy",
    y="Valence",
    color=alt.Color("Month:O",scale=alt.Scale(scheme="sinebow")),
    tooltip=["Artist","Song Name"]
)

Now if you click on a point in the top chart, only songs by the same artist will be shown in the bottom chart. (There is a bug in Altair that prevents the tooltip from being shown on the bottom chart. Ask Chris if you're interested in a workaround for that bug.)

c.add_selection(sel2)&c.transform_filter(sel2)
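
To make the role of fields more concrete, here is a rough pandas analogue of what the bottom chart keeps after a click. This is only a sketch of the idea, not how Altair implements it; keep_matching is a hypothetical helper and toy is made-up data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C"],
    "Month": [7, 7, 12, 1],
    "Valence": [0.6, 0.3, 0.8, 0.5],
})

def keep_matching(data, clicked_row, fields):
    """Hypothetical helper: keep the rows that agree with clicked_row on every field."""
    mask = pd.Series(True, index=data.index)
    for field in fields:
        mask &= data[field] == clicked_row[field]
    return data[mask]

clicked = toy.loc[0]  # pretend the user clicked the first point

print(keep_matching(toy, clicked, ["Month"]))   # rows 0 and 1 (same month)
print(keep_matching(toy, clicked, ["Artist"]))  # rows 0 and 2 (same artist)
```

With fields=["Month"], the clicked point's artist and valence are ignored and only its month matters. With fields=["Artist"], only the artist matters.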