First two examples of Altair Charts

We give two examples, one produced with random data from NumPy, and one using a Kaggle dataset about top 2020-2021 Spotify songs.

import altair as alt
import pandas as pd
import numpy as np
rng = np.random.default_rng()

Basic example with random data

We first make a \(20 \times 4\) NumPy array of random integers, and then convert it to a pandas DataFrame.

A = rng.integers(0,100,size = (20,4))
rand_df = pd.DataFrame(A, columns = ["a","b","c","d"])
rand_df
a b c d
0 72 41 17 31
1 28 73 48 16
2 2 72 45 26
3 20 54 97 58
4 23 85 23 41
5 33 64 86 37
6 3 30 38 83
7 45 37 56 31
8 59 66 53 74
9 79 48 51 69
10 3 52 35 95
11 35 40 47 21
12 49 59 60 86
13 7 75 51 90
14 31 12 38 60
15 35 23 60 5
16 29 99 52 37
17 33 29 73 44
18 19 99 40 13
19 18 43 28 20
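
Because default_rng() is called without a seed, the numbers above will differ each time the notebook is run. Here is a minimal sketch of a reproducible variant, assuming an arbitrary seed of 0. The names seeded_rng, B, and seeded_df are ours and do not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Passing a seed makes the "random" integers identical on every run.
# The seed value 0 is an arbitrary choice for illustration.
seeded_rng = np.random.default_rng(seed=0)

# integers(low, high) draws from low (inclusive) up to high (exclusive)
B = seeded_rng.integers(0, 100, size=(20, 4))
seeded_df = pd.DataFrame(B, columns=["a", "b", "c", "d"])

print(seeded_df.shape)   # (20, 4)
```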

Each row in the DataFrame will correspond to a point in our chart. The values in the \(a\) column and \(b\) column correspond to the \(x\)-coordinate and the \(y\)-coordinate, respectively.

Here we use mark_line to connect the datapoints with lines.

The syntax for making an Altair chart can be intimidating. The faster you can get comfortable with it, the better.

alt.Chart(rand_df).mark_line().encode(
    x = "a",
    y = "b"
)

The same data, but with disks drawn instead of lines: we changed from mark_line to mark_circle.

alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b"
)

By looking at the chart, how many points have an a value less than 30? Let’s verify that in the dataset.

(rand_df["a"] < 30).sum()
10
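
The count works because a comparison such as rand_df["a"] < 30 produces a Boolean Series, and sum treats True as 1 and False as 0. Here is a small sketch using the values of column a displayed above. The names a_values and mask are ours.

```python
import pandas as pd

# The 20 values of column "a" from the DataFrame displayed above
a_values = pd.Series([72, 28, 2, 20, 23, 33, 3, 45, 59, 79,
                      3, 35, 49, 7, 31, 35, 29, 33, 19, 18])

mask = a_values < 30            # Boolean Series: True where the value is below 30
print(mask.head(3).tolist())    # [False, True, True]
print(mask.sum())               # 10, since each True counts as 1
```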

Here are those points explicitly.

rand_df[rand_df["a"] < 30]
a b c d
1 28 73 48 16
2 2 72 45 26
3 20 54 97 58
4 23 85 23 41
6 3 30 38 83
10 3 52 35 95
13 7 75 51 90
16 29 99 52 37
18 19 99 40 13
19 18 43 28 20

Here is another chart, where we color the points using column c, and we change the size of the points using column d. We add a tooltip showing all the values, so put your mouse over a point to see its values of a,b,c,d.

alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    tooltip = ["a","b","c","d"],
)

Exercise

Put your mouse over one of the points in the above chart.

  • How is the underlying data for that point reflected in its location, size, and color?

  • Can you find the row in rand_df corresponding to this point?

  • Choose another row in the DataFrame. What point does it correspond to in the chart?

For your convenience, the original pandas DataFrame is shown below.

rand_df
a b c d
0 72 41 17 31
1 28 73 48 16
2 2 72 45 26
3 20 54 97 58
4 23 85 23 41
5 33 64 86 37
6 3 30 38 83
7 45 37 56 31
8 59 66 53 74
9 79 48 51 69
10 3 52 35 95
11 35 40 47 21
12 49 59 60 86
13 7 75 51 90
14 31 12 38 60
15 35 23 60 5
16 29 99 52 37
17 33 29 73 44
18 19 99 40 13
19 18 43 28 20

Another example with more points, and where we add an opacity channel.

A = rng.integers(0,100,size = (1000,5))
rand_df2 = pd.DataFrame(A, columns = ["a","b","c","d","e"])
alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = "e",
    tooltip = ["a","b","c","d","e"],
)

Exercise

I don’t like the d and e parts of the legend. Can you figure out how to remove them, by mimicking an example from the documentation?

It’s hard to recognize the opacity in the above example. Let’s change the DataFrame so that all the points to the left of \(a = 40\) are only 15% opaque (that is, 85% transparent), and the rest of the points are 100% opaque. We add scale=None to the opacity channel so that the values in column e are used directly as opacities, which gives us complete control over the opacity.

rand_df2["e"] = 1.0  # use a float column so that 0.15 can be stored in it
rand_df2.loc[(rand_df2["a"] < 40),"e"] = 0.15
alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = alt.Opacity("e",scale=None),
    tooltip = ["a","b","c","d","e"],
)
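
The two assignment lines at the top of the cell above can also be written as a single step. The following sketch uses a small hand-made DataFrame named demo (an assumption for illustration, not rand_df2) and builds the opacity column with np.where, where 0 means fully transparent and 1 means fully opaque.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rand_df2, with only the column we need
demo = pd.DataFrame({"a": [5, 39, 40, 87]})

# np.where(condition, value_if_true, value_if_false), applied elementwise
demo["e"] = np.where(demo["a"] < 40, 0.15, 1.0)

print(demo["e"].tolist())   # [0.15, 0.15, 1.0, 1.0]
```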

Example with data from Spotify

Here we use the spotify_dataset.csv file from Canvas. The dataset originally came from Kaggle, and the Kaggle page includes a description of the columns.

We perform some “cleaning” of the dataset. By the end of Math 10, everything in the following cell should be understandable, but for now, you shouldn’t worry about the details of this “cleaning”.

Important: You may need to change the path from ../data/spotify_dataset.csv, depending on where you have this csv file stored.

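df = pd.read_csv("../data/spotify_dataset.csv") # change path if necessary
df = df.replace(" ",np.nan) # treat blank entries as missing values
df["Streams"] = df["Streams"].str.replace(",","") # remove the commas, e.g. "1,234" becomes "1234"
# Convert the columns that hold numbers stored as strings.
# Assigning by column name replaces each column, so the numeric dtype is kept.
num_cols = list(df.columns[[5,7]]) + list(df.columns[12:22])
df[num_cols] = df[num_cols].apply(pd.to_numeric,axis=0)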
df.head()
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48633449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.714 0.800 -4.808 0.0504 0.1270 0.3590 134.002 211560.0 0.589 B
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47248719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.591 0.764 -5.484 0.0483 0.0383 0.1030 169.928 141806.0 0.478 C#/Db
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40162559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.563 0.664 -5.044 0.1540 0.3350 0.0849 166.928 178147.0 0.688 A
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37799456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.808 0.897 -3.712 0.0348 0.0469 0.3640 126.026 231041.0 0.591 B
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33948454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.736 0.704 -7.409 0.0615 0.0203 0.0501 149.995 212000.0 0.894 D#/Eb

5 rows × 23 columns

By default, Altair raises an error if the DataFrame has more than 5000 rows, so in that case we would need to do some data preprocessing before giving the DataFrame to Altair. But in this case, there are only 1556 rows.

df.shape
(1556, 23)
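
For a larger dataset, one simple form of preprocessing is to plot a random sample of at most 5000 rows. Here is a sketch under that assumption, using a made-up DataFrame big_df (not the Spotify data). The names max_rows and plot_df are also ours.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with more rows than Altair accepts by default
big_df = pd.DataFrame({"x": np.arange(12000), "y": np.arange(12000) % 7})

max_rows = 5000
if len(big_df) > max_rows:
    # random_state fixes the sample so the chart is the same on every run
    plot_df = big_df.sample(n=max_rows, random_state=0)
else:
    plot_df = big_df

print(plot_df.shape)   # (5000, 2)
```

The sampled plot_df could then be passed to alt.Chart in place of the full DataFrame. Sampling discards data, so whether it is appropriate depends on what the chart is meant to show.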

If a column has type object, that often means it is a string, even if the values look numerical. If you’re having a hard time plotting data, make sure the values are numbers and not strings. The main point of the data-cleaning we did above was to make more of the columns numerical. Of course, a column like Song Name is never going to be numerical.

df.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                        int64
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
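
To see the difference between strings and numbers concretely, here is a small sketch with a made-up Series raw, whose entries are formatted with commas the way the Streams column was before cleaning. The names raw and cleaned are ours.

```python
import pandas as pd

# These entries look like numbers, but they are stored as strings
raw = pd.Series(["48,633,449", "47,248,719", "40,162,559"])
print(pd.api.types.is_numeric_dtype(raw))       # False

# Remove the commas, then convert the strings to numbers
cleaned = pd.to_numeric(raw.str.replace(",", ""))
print(pd.api.types.is_numeric_dtype(cleaned))   # True
print(cleaned.max())                            # 48633449
```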

Scatter plot

The following Altair chart is just like what we made above with our random DataFrame. We again use the column names to specify which parts of the data to use. Previously, we used column names like “a” and “b”. Here the column names are more descriptive, like “Energy” and “Loudness”.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = 'Acousticness',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

One of my favorite customizations in Altair is to use a more interesting color scheme. Here is an example using the color scheme “goldred”. You can find more color options in the Vega documentation.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred")),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Sometimes the colors look more natural if they are reversed. We do that by adding reverse=True in the alt.Scale component.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred",reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Spotify chart with tooltip

In the following chart we use a different color scheme, we specify the dimensions of the chart to make it a little bigger, and we give the chart a title.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness', scale=alt.Scale(scheme='turbo',reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)

Caution

The rest of this notebook can be skipped on a first reading. We give some more advanced examples.

Histogram

Here is an example of how to make a histogram using Altair. The height of each bar indicates how many rows of the DataFrame fall in that category (here, how many songs each artist has). The count() entry is not the name of a column. Instead, it is a special Altair aggregation that counts the rows in each category.

alt.Chart(df).mark_bar().encode(
    x = "Artist",
    y = "count()"
)
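
As a rough pandas analogue of what count() computes, the sketch below counts the rows per category in a tiny made-up DataFrame toy. The artist names X, Y, and Z are placeholders.

```python
import pandas as pd

# Hypothetical data: one row per song
toy = pd.DataFrame({
    "Artist": ["X", "Y", "X", "Z", "X", "Y"],
    "Streams": [10, 20, 30, 40, 50, 60],
})

# Number of rows for each artist; this plays the role of the bar height
counts = toy.groupby("Artist").size()
print(counts.to_dict())   # {'X': 3, 'Y': 2, 'Z': 1}
```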

There are so many artists that this chart is pretty difficult to interpret. Let’s restrict ourselves to the top artists.

Here are the top 19 artists. (Why 19 rather than 20? No great reason, but this particular chart looks better with 19.)

top_artists = df.Artist.value_counts()[:19]
top_artists
Taylor Swift          52
Lil Uzi Vert          32
Justin Bieber         32
Juice WRLD            30
Pop Smoke             29
BTS                   29
Bad Bunny             28
Eminem                22
The Weeknd            21
Ariana Grande         20
Drake                 19
Billie Eilish         18
Selena Gomez          17
J. Cole               16
Doja Cat              16
Dua Lipa              15
Lady Gaga             14
Tyler, The Creator    14
DaBaby                14
Name: Artist, dtype: int64

Let’s make our Altair chart using the sub-DataFrame with just these 19 top artists. We make this using a new pandas method, isin.

df_top = df[df.Artist.isin(top_artists.index)]
df_top.head()
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
6 7 3 16 2021-05-14--2021-05-21 Kiss Me More (feat. SZA) 29356736 Doja Cat 8640063.0 748mdHapucXQri7IAO8yFK ['dance pop', 'pop'] ... 0.762 0.701 -3.541 0.0286 0.23500 0.1230 110.968 208867.0 0.742 G#/Ab
8 9 3 8 2021-06-18--2021-06-25 Yonaguni 25030128 Bad Bunny 36142273.0 2JPLbjOn0wPCngEot2STUS ['latin', 'reggaeton', 'trap latino'] ... 0.644 0.648 -4.601 0.1180 0.27600 0.1350 179.951 206710.0 0.440 C#/Db
10 11 4 43 2021-05-07--2021-05-14 Levitating (feat. DaBaby) 23518010 Dua Lipa 27142474.0 463CkQjx2Zk1yXoBuierM9 ['dance pop', 'pop', 'uk pop'] ... 0.702 0.825 -3.787 0.0601 0.00883 0.0674 102.977 203064.0 0.915 F#/Gb
12 13 5 3 2021-07-09--2021-07-16 Permission to Dance 22062812 BTS 37106176.0 0LThjFY2iTtNdd4wviwVV2 ['k-pop', 'k-pop boy group'] ... 0.702 0.741 -5.330 0.0427 0.00544 0.3370 124.925 187585.0 0.646 A
13 14 1 19 2021-04-02--2021-04-09 Peaches (feat. Daniel Caesar & Giveon) 20294457 Justin Bieber 48504126.0 4iJyoBOLtHqaGxP12qzhQI ['canadian pop', 'pop', 'post-teen pop'] ... 0.677 0.696 -6.181 0.1190 0.32100 0.4200 90.030 198082.0 0.464 C

5 rows × 23 columns
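
The isin method returns a Boolean Series that is True wherever the entry belongs to the given collection, and we then use that Series to select rows. Here is a minimal sketch on made-up data. The names songs, top, in_top, and songs_top are ours.

```python
import pandas as pd

# Hypothetical miniature version of the Artist column
songs = pd.DataFrame({"Artist": ["X", "Y", "X", "Z", "Y", "X"]})

# The two most frequent artists (X appears 3 times, Y appears 2 times)
top = songs["Artist"].value_counts().iloc[:2]

# True for the rows whose artist is one of the top artists
in_top = songs["Artist"].isin(top.index)
print(in_top.tolist())    # [True, True, True, False, True, True]

songs_top = songs[in_top]
print(len(songs_top))     # 5
```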

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()"
)

Let’s add color to the chart, using the average number of Streams for each artist. In this example, mean is a special function in Altair, just like count.

Spotify bar chart

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()",
    color = "mean(Streams)"
)
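
The mean(Streams) shorthand asks Altair to average the Streams column within each bar. As a rough pandas analogue, the sketch below computes the same kind of per-artist average on a made-up DataFrame toy_streams with placeholder artist names.

```python
import pandas as pd

# Hypothetical data: one row per song
toy_streams = pd.DataFrame({
    "Artist": ["X", "Y", "X", "Z", "X", "Y"],
    "Streams": [10, 20, 30, 40, 50, 60],
})

# Average number of streams per artist; this value would drive the bar color
mean_streams = toy_streams.groupby("Artist")["Streams"].mean()
print(mean_streams.to_dict())   # {'X': 30.0, 'Y': 40.0, 'Z': 40.0}
```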

Exercise

Copy the above histogram code, and replace mean with sum. Suddenly the colors are less interesting. Why do you think that is?

Interactive example

We end with an example just for inspiration. One of the distinguishing features of Altair is its support for interactivity. If you click and drag on the chart below, the points in the region you select will gain color.

brush = alt.selection_interval(empty='none')

chart = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.condition(brush,
                          alt.Color('Acousticness:Q', scale=alt.Scale(scheme='turbo',reverse=True)),
                          alt.value("lightgrey")),
).add_selection(
    brush,
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)

chart