First two examples of Altair Charts¶

We give two examples, one produced with random data from NumPy, and one using a Kaggle dataset about top 2020-2021 Spotify songs.

import altair as alt
import pandas as pd
import numpy as np
rng = np.random.default_rng()

Basic example with random data¶

We first make a \(20 \times 4\) NumPy array of random integers, and then convert it to a pandas DataFrame.

A = rng.integers(0,100,size = (20,4))
rand_df = pd.DataFrame(A, columns = ["a","b","c","d"])

rand_df

	a	b	c	d
0	72	41	17	31
1	28	73	48	16
2	2	72	45	26
3	20	54	97	58
4	23	85	23	41
5	33	64	86	37
6	3	30	38	83
7	45	37	56	31
8	59	66	53	74
9	79	48	51	69
10	3	52	35	95
11	35	40	47	21
12	49	59	60	86
13	7	75	51	90
14	31	12	38	60
15	35	23	60	5
16	29	99	52	37
17	33	29	73	44
18	19	99	40	13
19	18	43	28	20
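Because rng was created without a seed, the numbers above change every time the notebook is run. The following sketch shows one way to get the same table on every run. The seed value 0 and the names rng_seeded, A_seeded, and seeded_df are assumptions made for this illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: the seed value 0 is arbitrary; any fixed integer works.
rng_seeded = np.random.default_rng(seed=0)

# Same construction as above: integers from 0 (inclusive) to 100 (exclusive).
A_seeded = rng_seeded.integers(0, 100, size=(20, 4))
seeded_df = pd.DataFrame(A_seeded, columns=["a", "b", "c", "d"])

# A second generator with the same seed produces the same array.
A_again = np.random.default_rng(seed=0).integers(0, 100, size=(20, 4))

print(seeded_df.shape)
print((A_seeded == A_again).all())
```

With a fixed seed, the rows you see should match the rows a classmate sees when running the same cell with the same NumPy version.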

Each row in the DataFrame will correspond to a point in our chart. The values in the \(a\) column and \(b\) column correspond to the \(x\)-coordinate and the \(y\)-coordinate, respectively.

Here we use mark_line to connect the datapoints with lines.

The syntax for making an Altair chart can be intimidating. The faster you can get comfortable with it, the better.

alt.Chart(rand_df).mark_line().encode(
    x = "a",
    y = "b"
)

The same data, but with disks drawn instead of lines: we changed from mark_line to mark_circle.

alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b"
)

By looking at the chart, how many points have an a value less than 30? Let’s verify that in the dataset.

(rand_df["a"] < 30).sum()

Here are those points explicitly.

rand_df[rand_df["a"] < 30]

	a	b	c	d
1	28	73	48	16
2	2	72	45	26
3	20	54	97	58
4	23	85	23	41
6	3	30	38	83
10	3	52	35	95
13	7	75	51	90
16	29	99	52	37
18	19	99	40	13
19	18	43	28	20
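The two expressions above both rely on a Boolean Series. Here is a minimal sketch of what happens at each step, using a hypothetical four-row DataFrame (small_df is made up for this illustration, with values copied by hand from a few rows of the table above).

```python
import pandas as pd

# Hypothetical four-row DataFrame for illustration.
small_df = pd.DataFrame({"a": [72, 28, 2, 45], "b": [41, 73, 72, 37]})

# Comparing a column to a number gives one True/False value per row.
mask = small_df["a"] < 30

# True counts as 1 and False as 0, so sum() counts the True values.
count_below = mask.sum()

# Indexing with the mask keeps only the rows where the mask is True.
subset = small_df[mask]

print(mask.tolist())
print(count_below)
print(subset)
```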

Here is another chart, where we color the points using column c, and we change the size of the points using column d. We add a tooltip showing all the values, so put your mouse over a point to see its values of a,b,c,d.

alt.Chart(rand_df).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    tooltip = ["a","b","c","d"],
)

Exercise

Put your mouse over one of the points in the above chart.

How is the underlying data for that point reflected in its location, size, and color?
Can you find the row in rand_df corresponding to this point?
Choose another row in the DataFrame. What point does it correspond to in the chart?

For your convenience, the original pandas DataFrame is shown below.

rand_df

	a	b	c	d
0	72	41	17	31
1	28	73	48	16
2	2	72	45	26
3	20	54	97	58
4	23	85	23	41
5	33	64	86	37
6	3	30	38	83
7	45	37	56	31
8	59	66	53	74
9	79	48	51	69
10	3	52	35	95
11	35	40	47	21
12	49	59	60	86
13	7	75	51	90
14	31	12	38	60
15	35	23	60	5
16	29	99	52	37
17	33	29	73	44
18	19	99	40	13
19	18	43	28	20

Another example with more points, and where we add an opacity channel.

A = rng.integers(0,100,size = (1000,5))
rand_df2 = pd.DataFrame(A, columns = ["a","b","c","d","e"])

alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = "e",
    tooltip = ["a","b","c","d","e"],
)

Exercise

I don’t like the d and e parts of the legend. Can you figure out how to remove them, by mimicking an example from the documentation?

It’s hard to recognize the opacity in the above example. Let’s change the DataFrame so that all the points to the left of \(a = 40\) are only 15% opaque, and the rest of the points are 100% opaque. We add scale=None to the opacity channel so that the values in column e are used directly as opacities, which gives us complete control over the opacity.

rand_df2["e"] = 1.0 # use a float so that 0.15 can be stored in the same column
rand_df2.loc[(rand_df2["a"] < 40),"e"] = 0.15

alt.Chart(rand_df2).mark_circle().encode(
    x = "a",
    y = "b",
    color = "c",
    size = "d",
    opacity = alt.Opacity("e",scale=None),
    tooltip = ["a","b","c","d","e"],
)
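The two assignment lines used to build column e can also be written in a single step with np.where. This is only an alternative sketch, shown on a hypothetical four-row DataFrame called demo; it is not how the notebook builds the column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two points left of a = 40 and two points at or right of it.
demo = pd.DataFrame({"a": [10, 39, 40, 85]})

# Where the condition holds use 0.15, everywhere else use 1.0.
demo["e"] = np.where(demo["a"] < 40, 0.15, 1.0)

print(demo)
```

Note that the comparison is strict, so a point with a equal to 40 stays fully opaque.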

Example with data from Spotify¶

Here we use the spotify_dataset.csv file from Canvas. The dataset originally came from Kaggle here. The Kaggle page includes a description of the columns.

We perform some “cleaning” of the dataset. By the end of Math 10, everything in the following cell should be understandable, but for now, you shouldn’t worry about the details of this “cleaning”.

Important: You may need to change the path from ../data/spotify_dataset.csv, depending on where you have this csv file stored.

df = pd.read_csv("../data/spotify_dataset.csv") # change path if necessary
df = df.replace(" ",np.nan) # blank entries become missing values
df["Streams"] = df["Streams"].str.replace(",","") # remove the thousands separators
# convert columns 5, 7, and 12 through 21 from strings to numbers
num_cols = df.columns[[5,7] + list(range(12,22))]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

df.head()

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord
0	1	1	8	2021-07-23--2021-07-30	Beggin'	48633449	Måneskin	3377762.0	3Wrjm47oTz2sjIgck11l5e	['indie rock italiano', 'italian pop']	...	0.714	0.800	-4.808	0.0504	0.1270	0.3590	134.002	211560.0	0.589	B
1	2	2	3	2021-07-23--2021-07-30	STAY (with Justin Bieber)	47248719	The Kid LAROI	2230022.0	5HCyWlXZPP0y6Gqq8TgA20	['australian hip hop']	...	0.591	0.764	-5.484	0.0483	0.0383	0.1030	169.928	141806.0	0.478	C#/Db
2	3	1	11	2021-06-25--2021-07-02	good 4 u	40162559	Olivia Rodrigo	6266514.0	4ZtFanR9U6ndgddUvNcjcG	['pop']	...	0.563	0.664	-5.044	0.1540	0.3350	0.0849	166.928	178147.0	0.688	A
3	4	3	5	2021-07-02--2021-07-09	Bad Habits	37799456	Ed Sheeran	83293380.0	6PQ88X9TkUIAUIZJHW2upE	['pop', 'uk pop']	...	0.808	0.897	-3.712	0.0348	0.0469	0.3640	126.026	231041.0	0.591	B
4	5	5	1	2021-07-23--2021-07-30	INDUSTRY BABY (feat. Jack Harlow)	33948454	Lil Nas X	5473565.0	27NovPIUIRrOZoCHxABJwK	['lgbtq+ hip hop', 'pop rap']	...	0.736	0.704	-7.409	0.0615	0.0203	0.0501	149.995	212000.0	0.894	D#/Eb

5 rows × 23 columns
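To see what the cleaning steps do without needing the csv file, here is a minimal sketch on a hypothetical three-row DataFrame. The name raw and its values are made up for this illustration; only the three operations (replace, str.replace, pd.to_numeric) come from the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: numbers stored as strings, with one blank entry.
raw = pd.DataFrame({
    "Streams": ["48,633,449", "47,248,719", "1,000"],
    "Energy": ["0.8", " ", "0.664"],
})

# Step 1: a cell containing a single space becomes a missing value.
cleaned = raw.replace(" ", np.nan)

# Step 2: remove the commas, then convert the strings to numbers.
cleaned["Streams"] = pd.to_numeric(cleaned["Streams"].str.replace(",", ""))
cleaned["Energy"] = pd.to_numeric(cleaned["Energy"])

print(cleaned)
print(cleaned.dtypes)
```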

By default, Altair raises an error if the DataFrame has more than 5000 rows, so for a larger dataset we would need to do some data preprocessing before giving the DataFrame to Altair. But in this case, there are only 1556 rows.

df.shape

(1556, 23)
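For a dataset that does exceed 5000 rows, one simple kind of preprocessing is to plot a random sample of the rows. The sketch below is one possible approach, not the only one. The DataFrame big_df is hypothetical, and the variable max_rows is just a name chosen here for the 5000-row limit mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with 6000 rows, too many for Altair's default limit.
big_df = pd.DataFrame({"x": np.arange(6000), "y": np.arange(6000) % 7})

max_rows = 5000  # the row limit discussed above

# Keep a random sample only when the DataFrame is too long.
if len(big_df) > max_rows:
    plot_df = big_df.sample(n=max_rows, random_state=0)
else:
    plot_df = big_df

print(plot_df.shape)
```

The resulting plot_df could then be passed to alt.Chart in place of the full DataFrame.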

If a column has type object, that often means it is a string, even if the values look numerical. If you’re having a hard time plotting data, make sure the values are numbers and not strings. The main point of the data-cleaning we did above was to make more of the columns numerical. Of course, a column like Song Name is never going to be numerical.

df.dtypes

Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                        int64
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
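When a DataFrame has many columns, it can help to list the non-numeric ones directly instead of reading through the dtypes output. A minimal sketch, using a hypothetical DataFrame named mixed in which the Streams column was accidentally left as strings:

```python
import pandas as pd

# Hypothetical data: "Streams" looks numeric but is stored as strings.
mixed = pd.DataFrame({
    "Streams": ["100", "250"],
    "Tempo": [134.0, 169.9],
    "Artist": ["A", "B"],
})

# Columns whose dtype is not numeric.
non_numeric = mixed.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# After conversion, "Streams" is numeric and only "Artist" remains.
mixed["Streams"] = pd.to_numeric(mixed["Streams"])
still_non_numeric = mixed.select_dtypes(exclude="number").columns.tolist()
print(still_non_numeric)
```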

Scatter plot¶

The following Altair chart is just like what we made above with our random DataFrame. We again use the column names to specify which parts of the data to use. Before we used column names like “a” and “b”. Here the column names are more descriptive, like “Energy” and “Loudness”.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = 'Acousticness',
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

One of my favorite customizations in Altair is to use a more interesting color scheme. Here is an example using the color scheme “goldred”. You can find more color options in the Vega documentation.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred")),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Sometimes the colors look more natural if they are reversed. We do that by adding reverse=True in the alt.Scale component.

alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Acousticness',scale=alt.Scale(scheme="goldred",reverse=True)),
    tooltip = ["Artist","Song Name","Release Date","Chord"]
)

Histogram¶

Here is an example of how to make a histogram using Altair. The height of each bar indicates how many rows of the DataFrame fall in that category (here, how many songs each artist has in the dataset). The count() entry is not the name of a column. Instead, it is a special Altair aggregation that counts how many rows share the same value on the other axis.

alt.Chart(df).mark_bar().encode(
    x = "Artist",
    y = "count()"
)

There are so many artists, this chart is pretty difficult to interpret. Let’s restrict ourselves to the top artists.

Here are the top 19 artists. (Why 19 rather than 20? No great reason, but this particular chart looks better with 19.)

top_artists = df.Artist.value_counts()[:19]
top_artists

Taylor Swift          52
Lil Uzi Vert          32
Justin Bieber         32
Juice WRLD            30
Pop Smoke             29
BTS                   29
Bad Bunny             28
Eminem                22
The Weeknd            21
Ariana Grande         20
Drake                 19
Billie Eilish         18
Selena Gomez          17
J. Cole               16
Doja Cat              16
Dua Lipa              15
Lady Gaga             14
Tyler, The Creator    14
DaBaby                14
Name: Artist, dtype: int64

Let’s make our Altair chart using the sub-DataFrame with just these 19 top artists. We make this using a new pandas method, isin.

df_top = df[df.Artist.isin(top_artists.index)]
df_top.head()

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord
6	7	3	16	2021-05-14--2021-05-21	Kiss Me More (feat. SZA)	29356736	Doja Cat	8640063.0	748mdHapucXQri7IAO8yFK	['dance pop', 'pop']	...	0.762	0.701	-3.541	0.0286	0.23500	0.1230	110.968	208867.0	0.742	G#/Ab
8	9	3	8	2021-06-18--2021-06-25	Yonaguni	25030128	Bad Bunny	36142273.0	2JPLbjOn0wPCngEot2STUS	['latin', 'reggaeton', 'trap latino']	...	0.644	0.648	-4.601	0.1180	0.27600	0.1350	179.951	206710.0	0.440	C#/Db
10	11	4	43	2021-05-07--2021-05-14	Levitating (feat. DaBaby)	23518010	Dua Lipa	27142474.0	463CkQjx2Zk1yXoBuierM9	['dance pop', 'pop', 'uk pop']	...	0.702	0.825	-3.787	0.0601	0.00883	0.0674	102.977	203064.0	0.915	F#/Gb
12	13	5	3	2021-07-09--2021-07-16	Permission to Dance	22062812	BTS	37106176.0	0LThjFY2iTtNdd4wviwVV2	['k-pop', 'k-pop boy group']	...	0.702	0.741	-5.330	0.0427	0.00544	0.3370	124.925	187585.0	0.646	A
13	14	1	19	2021-04-02--2021-04-09	Peaches (feat. Daniel Caesar & Giveon)	20294457	Justin Bieber	48504126.0	4iJyoBOLtHqaGxP12qzhQI	['canadian pop', 'pop', 'post-teen pop']	...	0.677	0.696	-6.181	0.1190	0.32100	0.4200	90.030	198082.0	0.464	C

5 rows × 23 columns
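The isin method returns True for each entry that appears in the given collection and False otherwise. A minimal sketch on a hypothetical Series (the names artists and keep are made up for this illustration):

```python
import pandas as pd

# Hypothetical Series of artist names.
artists = pd.Series(["Drake", "BTS", "Unknown Band", "Drake"])

# The names we want to keep.
keep = ["Drake", "BTS"]

# One True/False value per entry of the Series.
in_keep = artists.isin(keep)
print(in_keep.tolist())

# Boolean indexing keeps the entries where the value is True.
print(artists[in_keep].tolist())
```

In the notebook, top_artists.index plays the role of keep, since the index of a value_counts result holds the artist names.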

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()"
)

Let’s add color to the chart, using the average number of Streams for each artist. In this example, mean is a special function in Altair, just like count.

Spotify bar chart¶

alt.Chart(df_top).mark_bar().encode(
    x = "Artist",
    y = "count()",
    color = "mean(Streams)"
)
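The numbers behind this chart can be approximated directly in pandas: count() corresponds to the number of rows per artist, and mean(Streams) to the average of the Streams column within each artist. The following sketch uses a hypothetical four-row DataFrame named songs; it illustrates the two aggregations and is not a description of how Altair computes them internally.

```python
import pandas as pd

# Hypothetical data: two artists with two songs each.
songs = pd.DataFrame({
    "Artist": ["X", "Y", "X", "Y"],
    "Streams": [10, 40, 30, 20],
})

# One row per artist: the number of songs and the average Streams.
summary = songs.groupby("Artist")["Streams"].agg(["count", "mean"])
print(summary)
```

In the bar chart, the count column would set the bar heights and the mean column would set the bar colors.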

Exercise

Copy the above histogram code, and replace mean with sum. Suddenly the colors are less interesting. Why do you think that is?

Interactive example¶

We end with an example just for inspiration. One of the distinguishing features of Altair is its support for interactivity. If you click and drag on the chart below, the points in the region you select will gain color.

brush = alt.selection_interval(empty='none')

chart = alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.condition(brush,
                          alt.Color('Acousticness:Q', scale=alt.Scale(scheme='turbo',reverse=True)),
                          alt.value("lightgrey")),
).add_selection(
    brush,
).properties(
    width = 720,
    height = 450,
    title="Spotify dataset from Kaggle"
)

chart

UC Irvine Math 10
