First two examples of Altair Charts
We give two examples, one produced with random data from NumPy, and one using a Kaggle dataset about top 2020-2021 Spotify songs.
import altair as alt
import pandas as pd
import numpy as np
rng = np.random.default_rng()
Basic example with random data
We first make a \(20 \times 4\) NumPy array of random integers between 0 and 99, and then convert it to a pandas DataFrame.
A = rng.integers(0,100,size = (20,4))
rand_df = pd.DataFrame(A, columns = ["a","b","c","d"])
rand_df
 | a | b | c | d |
---|---|---|---|---|
0 | 72 | 41 | 17 | 31 |
1 | 28 | 73 | 48 | 16 |
2 | 2 | 72 | 45 | 26 |
3 | 20 | 54 | 97 | 58 |
4 | 23 | 85 | 23 | 41 |
5 | 33 | 64 | 86 | 37 |
6 | 3 | 30 | 38 | 83 |
7 | 45 | 37 | 56 | 31 |
8 | 59 | 66 | 53 | 74 |
9 | 79 | 48 | 51 | 69 |
10 | 3 | 52 | 35 | 95 |
11 | 35 | 40 | 47 | 21 |
12 | 49 | 59 | 60 | 86 |
13 | 7 | 75 | 51 | 90 |
14 | 31 | 12 | 38 | 60 |
15 | 35 | 23 | 60 | 5 |
16 | 29 | 99 | 52 | 37 |
17 | 33 | 29 | 73 | 44 |
18 | 19 | 99 | 40 | 13 |
19 | 18 | 43 | 28 | 20 |
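Because rng was created without a seed, the random values above will be different each time the notebook is run. Here is a minimal sketch of how to make them reproducible, assuming a fixed seed is acceptable. The seed value 0 and the names rng_a, rng_b, df1, df2 are arbitrary choices for illustration and are not used elsewhere in these notes.

```python
import numpy as np
import pandas as pd

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(seed=0)
rng_b = np.random.default_rng(seed=0)

A1 = rng_a.integers(0, 100, size=(20, 4))  # integers from 0 to 99
A2 = rng_b.integers(0, 100, size=(20, 4))

df1 = pd.DataFrame(A1, columns=["a", "b", "c", "d"])
df2 = pd.DataFrame(A2, columns=["a", "b", "c", "d"])

print(df1.equals(df2))  # checks whether the two DataFrames hold the same values
```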
Each row in the DataFrame will correspond to a point in our chart. The values in the \(a\) column and \(b\) column correspond to the \(x\)-coordinate and the \(y\)-coordinate, respectively.
Here we use mark_line to connect the data points with lines. By default the points are joined in order of increasing a value, not in the order of the rows.
The syntax for making an Altair chart can be intimidating. The faster you can get comfortable with it, the better.
alt.Chart(rand_df).mark_line().encode(
    x = "a",
    y = "b"
)
The same data, but with disks drawn instead of lines: we changed from mark_line to mark_circle.
alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b"
)
By looking at the chart, how many points have an a value less than 30? Let’s verify that in the dataset.
(rand_df["a"] < 30).sum()
10
Here are those points explicitly.
rand_df[rand_df["a"] < 30]
 | a | b | c | d |
---|---|---|---|---|
1 | 28 | 73 | 48 | 16 |
2 | 2 | 72 | 45 | 26 |
3 | 20 | 54 | 97 | 58 |
4 | 23 | 85 | 23 | 41 |
6 | 3 | 30 | 38 | 83 |
10 | 3 | 52 | 35 | 95 |
13 | 7 | 75 | 51 | 90 |
16 | 29 | 99 | 52 | 37 |
18 | 19 | 99 | 40 | 13 |
19 | 18 | 43 | 28 | 20 |
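The count works because a comparison such as rand_df["a"] < 30 produces a Boolean Series, and sum treats True as 1 and False as 0. Here is a small sketch with made-up values (the Series s is hypothetical and is not part of the data above).

```python
import pandas as pd

s = pd.Series([72, 28, 2, 45, 29])

mask = s < 30          # Boolean Series: False, True, True, False, True
print(mask.sum())      # number of True entries
print(s[mask])         # Boolean indexing keeps only the True positions
```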
Here is another chart, where we color the points using column c, and we change the size of the points using column d. We add a tooltip showing all the values, so put your mouse over a point to see its values of a, b, c, and d.
alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    tooltip = ["a","b","c","d"],
)
Exercise
Put your mouse over one of the points in the above chart.
How is the underlying data for that point reflected in its location, size, and color?
Can you find the row in rand_df corresponding to this point?
Choose another row in the DataFrame. What point does it correspond to in the chart?
For your convenience, the original pandas DataFrame is shown below.
rand_df
 | a | b | c | d |
---|---|---|---|---|
0 | 72 | 41 | 17 | 31 |
1 | 28 | 73 | 48 | 16 |
2 | 2 | 72 | 45 | 26 |
3 | 20 | 54 | 97 | 58 |
4 | 23 | 85 | 23 | 41 |
5 | 33 | 64 | 86 | 37 |
6 | 3 | 30 | 38 | 83 |
7 | 45 | 37 | 56 | 31 |
8 | 59 | 66 | 53 | 74 |
9 | 79 | 48 | 51 | 69 |
10 | 3 | 52 | 35 | 95 |
11 | 35 | 40 | 47 | 21 |
12 | 49 | 59 | 60 | 86 |
13 | 7 | 75 | 51 | 90 |
14 | 31 | 12 | 38 | 60 |
15 | 35 | 23 | 60 | 5 |
16 | 29 | 99 | 52 | 37 |
17 | 33 | 29 | 73 | 44 |
18 | 19 | 99 | 40 | 13 |
19 | 18 | 43 | 28 | 20 |
Here is another example with more points, where we also add an opacity channel.
A = rng.integers(0,100,size = (1000,5))
rand_df2 = pd.DataFrame(A, columns = ["a","b","c","d","e"])
alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = "e",
    tooltip = ["a","b","c","d","e"],
)
Exercise
I don’t like the d and e parts of the legend. Can you figure out how to remove them, by mimicking an example from the documentation?
It’s hard to recognize the opacity in the above example. Let’s change the DataFrame so that all the points to the left of \(a = 40\) are only 15% opaque (that is, mostly transparent), and the rest of the points are 100% opaque. We add scale=None to the opacity channel so that the values in column e are used directly as opacities, giving us complete control.
rand_df2["e"] = 1.0 # use a float so that 0.15 can be stored without changing the column's dtype
rand_df2.loc[(rand_df2["a"] < 40),"e"] = 0.15
alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = alt.Opacity("e",scale=None),
    tooltip = ["a","b","c","d","e"],
)
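The two assignment lines that set column e can also be written as a single step. This is only a sketch of an alternative, using NumPy's where function on a small made-up DataFrame (demo_df is a hypothetical name); it is not the approach used in the cell above.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({"a": [10, 39, 40, 75]})

# Where the condition is True use 0.15, otherwise use 1.0.
demo_df["e"] = np.where(demo_df["a"] < 40, 0.15, 1.0)

print(demo_df)
```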
Example with data from Spotify
Here we use the spotify_dataset.csv file from Canvas. The dataset originally came from Kaggle. The Kaggle page includes a description of the columns.
We perform some “cleaning” of the dataset. By the end of Math 10, everything in the following cell should be understandable, but for now, you shouldn’t worry about the details of this “cleaning”.
Important: You may need to change the path from ../data/spotify_dataset.csv, depending on where you have this csv file stored.
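df = pd.read_csv("../data/spotify_dataset.csv") # change path if necessary
df = df.replace(" ",np.nan) # treat blank entries as missing values
df["Streams"] = df["Streams"].str.replace(",","") # remove the thousands separators
# Convert the columns that should be numerical. Assigning by column name replaces each column, so its dtype changes.
num_cols = list(df.columns[[5,7]]) + list(df.columns[12:22])
df[num_cols] = df[num_cols].apply(pd.to_numeric,axis=0)
df.head()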
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48633449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.714 | 0.800 | -4.808 | 0.0504 | 0.1270 | 0.3590 | 134.002 | 211560.0 | 0.589 | B |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47248719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.591 | 0.764 | -5.484 | 0.0483 | 0.0383 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40162559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.563 | 0.664 | -5.044 | 0.1540 | 0.3350 | 0.0849 | 166.928 | 178147.0 | 0.688 | A |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37799456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.808 | 0.897 | -3.712 | 0.0348 | 0.0469 | 0.3640 | 126.026 | 231041.0 | 0.591 | B |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33948454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.736 | 0.704 | -7.409 | 0.0615 | 0.0203 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb |
5 rows × 23 columns
By default, Altair will not plot a DataFrame with more than 5000 rows, so in that case we would need to do some data preprocessing before giving the DataFrame to Altair. But in this case, there are only 1556 rows.
df.shape
(1556, 23)
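As a sketch of one possible preprocessing step for a larger dataset, we could plot a random sample of the rows. The DataFrame big_df and the helper limit_rows below are hypothetical, made up for illustration; they are not needed for the Spotify data.

```python
import numpy as np
import pandas as pd

def limit_rows(df, max_rows=5000, seed=0):
    """Return df unchanged if it is small enough, otherwise a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)

demo_rng = np.random.default_rng(seed=0)
big_df = pd.DataFrame(demo_rng.integers(0, 100, size=(8000, 2)), columns=["a", "b"])

plot_df = limit_rows(big_df)
print(big_df.shape, plot_df.shape)
```

Sampling changes what the chart shows, so whether it is appropriate depends on the question being asked of the data.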
If a column has type object, that often means it holds strings, even if the values look numerical. If you’re having a hard time plotting data, make sure the values are numbers and not strings. The main point of the data-cleaning we did above was to make more of the columns numerical. Of course, a column like Song Name is never going to be numerical.
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams int64
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
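To illustrate the point about strings that look numerical, here is a small sketch with made-up data (the DataFrame demo_df is hypothetical). The column starts out holding strings and becomes numerical after removing the commas and calling pd.to_numeric.

```python
import pandas as pd

demo_df = pd.DataFrame({"Streams": ["48,633,449", "47,248,719", "40,162,559"]})
print(demo_df["Streams"].dtype)   # the values are strings, so this is not a numeric dtype

# Remove the thousands separators, then convert the strings to numbers.
cleaned = demo_df["Streams"].str.replace(",", "")
demo_df["Streams"] = pd.to_numeric(cleaned)
print(demo_df["Streams"].dtype)   # now a numeric dtype
```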
Scatter plot
The following Altair chart is just like what we made above with our random DataFrame. We again use the column names to specify which parts of the data to use. Before, we used column names like “a” and “b”. Here the column names are more descriptive, like “Energy” and “Loudness”.
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = 'Acousticness',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)
One of my favorite customizations in Altair is to use a more interesting color scheme. Here is an example using the color scheme “goldred”. You can find more color options in the Vega documentation.
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred")),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)
Sometimes the colors look more natural if they are reversed. We do that by adding reverse=True in the alt.Scale component.
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred",reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)
Spotify chart with tooltip
In the following chart we use a different color scheme, we specify the dimensions of the chart to make it a little bigger, and we give the chart a title.
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness', scale=alt.Scale(scheme='turbo',reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)
Caution
The rest of this notebook can be skipped on a first reading. We give some more advanced examples.
Histogram
Here is an example of how to make a histogram using Altair. The height of each bar indicates how many rows there are for that artist. The count() entry is not the name of a column. Instead, it is a special Altair function that counts how many rows fall in each category.
alt.Chart(df).mark_bar().encode(
    x = "Artist",
    y = "count()"
)
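The heights that count() produces are the same numbers pandas would report with value_counts. Here is a small sketch with made-up rows (demo_df is hypothetical).

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "A", "B"],
    "Streams": [10, 20, 30, 40, 50, 60],
})

# Number of rows for each artist, most frequent first.
counts = demo_df["Artist"].value_counts()
print(counts)
```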
There are so many artists that the chart of all of them is pretty difficult to interpret. Let’s restrict ourselves to the top artists.
Here are the top 19 artists. (Why 19 rather than 20? No great reason, but this particular chart looks better with 19.)
top_artists = df.Artist.value_counts()[:19]
top_artists
Taylor Swift 52
Lil Uzi Vert 32
Justin Bieber 32
Juice WRLD 30
Pop Smoke 29
BTS 29
Bad Bunny 28
Eminem 22
The Weeknd 21
Ariana Grande 20
Drake 19
Billie Eilish 18
Selena Gomez 17
J. Cole 16
Doja Cat 16
Dua Lipa 15
Lady Gaga 14
Tyler, The Creator 14
DaBaby 14
Name: Artist, dtype: int64
Let’s make our Altair chart using the sub-DataFrame with just these 19 top artists. We make this using a new pandas method, isin.
df_top = df[df.Artist.isin(top_artists.index)]
df_top.head()
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 7 | 3 | 16 | 2021-05-14--2021-05-21 | Kiss Me More (feat. SZA) | 29356736 | Doja Cat | 8640063.0 | 748mdHapucXQri7IAO8yFK | ['dance pop', 'pop'] | ... | 0.762 | 0.701 | -3.541 | 0.0286 | 0.23500 | 0.1230 | 110.968 | 208867.0 | 0.742 | G#/Ab |
8 | 9 | 3 | 8 | 2021-06-18--2021-06-25 | Yonaguni | 25030128 | Bad Bunny | 36142273.0 | 2JPLbjOn0wPCngEot2STUS | ['latin', 'reggaeton', 'trap latino'] | ... | 0.644 | 0.648 | -4.601 | 0.1180 | 0.27600 | 0.1350 | 179.951 | 206710.0 | 0.440 | C#/Db |
10 | 11 | 4 | 43 | 2021-05-07--2021-05-14 | Levitating (feat. DaBaby) | 23518010 | Dua Lipa | 27142474.0 | 463CkQjx2Zk1yXoBuierM9 | ['dance pop', 'pop', 'uk pop'] | ... | 0.702 | 0.825 | -3.787 | 0.0601 | 0.00883 | 0.0674 | 102.977 | 203064.0 | 0.915 | F#/Gb |
12 | 13 | 5 | 3 | 2021-07-09--2021-07-16 | Permission to Dance | 22062812 | BTS | 37106176.0 | 0LThjFY2iTtNdd4wviwVV2 | ['k-pop', 'k-pop boy group'] | ... | 0.702 | 0.741 | -5.330 | 0.0427 | 0.00544 | 0.3370 | 124.925 | 187585.0 | 0.646 | A |
13 | 14 | 1 | 19 | 2021-04-02--2021-04-09 | Peaches (feat. Daniel Caesar & Giveon) | 20294457 | Justin Bieber | 48504126.0 | 4iJyoBOLtHqaGxP12qzhQI | ['canadian pop', 'pop', 'post-teen pop'] | ... | 0.677 | 0.696 | -6.181 | 0.1190 | 0.32100 | 0.4200 | 90.030 | 198082.0 | 0.464 | C |
5 rows × 23 columns
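To see what isin is doing, here is a small sketch with made-up rows (demo_df, top, and demo_top are hypothetical names). It keeps only the rows whose Artist is one of the two most frequent artists.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Artist": ["A", "B", "A", "C", "A", "B", "D"],
    "Streams": [10, 20, 30, 40, 50, 60, 70],
})

top = demo_df["Artist"].value_counts().iloc[:2]  # the two most frequent artists
mask = demo_df["Artist"].isin(top.index)         # True where the artist is one of those two
demo_top = demo_df[mask]

print(list(top.index))
print(demo_top)
```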
alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()"
)
Let’s add color to the chart, using the average number of Streams for each artist. In this example, mean is a special function in Altair, just like count.
Spotify bar chart
alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()",
    color = "mean(Streams)"
)
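The quantities behind the bar heights and the colors can be computed directly in pandas, which is a useful way to check what count() and mean(Streams) are doing. This is a sketch on made-up rows (demo_df and summary are hypothetical names), not a cell from the original notebook.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "Artist": ["A", "B", "A", "B", "A"],
    "Streams": [10, 20, 30, 60, 50],
})

# One row per artist: how many songs, and the average number of streams.
summary = demo_df.groupby("Artist")["Streams"].agg(["count", "mean"])
print(summary)
```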
Exercise
Copy the above histogram code, and replace mean with sum. Suddenly the colors are less interesting. Why do you think that is?
Interactive example
We end with an example just for inspiration. One of the distinguishing features of Altair is its support for interactivity. If you click and drag on the chart below, the points in the region you select will gain color.
brush = alt.selection_interval(empty='none')
chart = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.condition(brush,
        alt.Color('Acousticness:Q', scale=alt.Scale(scheme='turbo',reverse=True)),
        alt.value("lightgrey")),
).add_selection(
    brush,
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)
chart