Introduction to Altair

The first time you use Altair, you will need to install it. From a lab computer:

  • Open Anaconda Navigator

  • Click on Environments (left side)

  • Change the dropdown menu from “Installed” to “All” (it might already be on “All”)

  • Search Packages for “Altair”

  • Click the checkbox next to Altair (if it already shows a green checkbox, then it’s already installed)

  • Click “Apply” at the bottom

  • Wait

  • Switch back from “Environments” to “Home”

  • Open Jupyter Notebook
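Once Jupyter Notebook is open, a quick check like the following can confirm that the installation worked. This is only a minimal sketch using the Python standard library; the variable names are made up for illustration.

```python
import importlib.util

# find_spec returns None when the package cannot be found in this environment
altair_spec = importlib.util.find_spec("altair")
altair_installed = altair_spec is not None

if altair_installed:
    import altair as alt
    print("Altair version:", alt.__version__)
else:
    print("Altair is not installed in this environment.")
```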

Altair is just one of many plotting libraries. It is my favorite, but it is definitely not the most famous; it might not even be the fifth most famous. In Math 10, we will also use Matplotlib and Seaborn. Some others that we probably will not use are Plotly and Holoviews (Holoviews probably isn’t technically a plotting library, but let’s ignore that).

# If you see ModuleNotFoundError: No module named 'altair'
# then you have not yet installed Altair.
# Follow the steps above, or an alternative like pip install altair

import altair as alt
import pandas as pd
import numpy as np
rng = np.random.default_rng()

Basic example with random data

The syntax for Altair can be intimidating the first time you see it (and the fifth time you see it). Here is some reference information; skip it for now, and come back to it when you need it.

  • Options for what to draw: Marks

  • Different “channels”, like color, opacity, and size. You can see a list of channels in the Altair documentation.

  • Color schemes: see the list of Vega color schemes, and an example of how to use a color scheme in Altair.

  • Sometimes it helps to explicitly tell Altair what type of data it is: encoding types

  • If you find browsing examples easier than reading documentation, check out the Altair example gallery

# Making some data: 6000 rows and 4 columns of random integers from 0 to 99
A = rng.integers(0, 100, size=(6000, 4))
rand_df = pd.DataFrame(A, columns=["a", "b", "c", "d"])
rand_df
       a   b   c   d
0     96  58  10  40
1     99  75   2  37
2     13  54  16   3
3     79  32   6  11
4     71  77  50  53
...   ..  ..  ..  ..
5995  30  97  58  55
5996  25  55  27  97
5997  18  29  17  59
5998  85  43  70  23
5999  94  92  35  90

6000 rows × 4 columns

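# By default, Altair raises an error if the data has more than 5000 rows,
# so we plot only the first 5000.
alt.Chart(rand_df.iloc[:5000]).mark_circle().encode(
    x="a",
    y="b"
)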

Example with data from Spotify

Download the spotify_dataset.csv file from Canvas, and put it somewhere you can access from Jupyter (maybe in the same folder as this notebook).

The dataset originally came from Kaggle, where you can find a description of the columns.

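# Change the path if necessary
# Doing some small "data cleaning": converting some columns from strings to numbers.
df = pd.read_csv("../data/spotify_dataset.csv")
df = df.replace(" ", np.nan)  # blank entries become missing values
df["Streams"] = df["Streams"].str.replace(",", "")  # remove the thousands separators
# Columns 5 and 7 are Streams and Artist Followers; 12 to 21 are Popularity through Valence.
# Assigning by column name makes sure the converted dtypes replace the old string columns.
num_cols = list(df.columns[[5, 7]]) + list(df.columns[12:22])
df[num_cols] = df[num_cols].apply(pd.to_numeric, axis=0)
df.dtypes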
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                        int64
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
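To see what each cleaning step does, it can help to try the same steps on a tiny DataFrame. The values below are made up to imitate the raw file (numbers stored as strings, commas as thousands separators, a single space for a missing entry); they are not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Made-up values that imitate the raw file
raw = pd.DataFrame({
    "Streams": ["48,633,449", "7,993,078", "5,363,493"],
    "Energy": ["0.8", " ", "0.13"],
})

clean = raw.replace(" ", np.nan)                           # blank -> missing value
clean["Streams"] = clean["Streams"].str.replace(",", "")   # "7,993,078" -> "7993078"
clean[["Streams", "Energy"]] = clean[["Streams", "Energy"]].apply(pd.to_numeric)

print(clean.dtypes)
```

A column with a missing value ends up as a float column, because NaN is itself a float. That is presumably why Artist Followers shows up as float64 above while Streams is int64.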
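# Scatter plot of Energy vs. Loudness, colored by Energy using the "turbo" color scheme.
# Hovering over a point shows the song name and artist.
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Energy', scale=alt.Scale(scheme='turbo')),
    tooltip = ["Song Name", "Artist"]
).properties(
    width = 400
)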
# Boolean indexing: keep only the rows where Loudness is less than -20
df[df["Loudness"] < -20]
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
670 671 127 1 2020-12-18--2020-12-25 Carol of the Bells 7993078 Mykola Dmytrovych Leontovych, John Williams 1024868.0 4tHqQMWSqmL6YjXwsqthDI ['soundtrack'] ... 0.418 0.106 -22.507 0.0448 0.994 0.179 46.718 85267.0 0.8000 G#/Ab
1499 1500 100 1 2020-01-17--2020-01-24 Alfred - Interlude 8030151 Eminem 46814751.0 4EmunTy7kNBYQivOa8F6b8 ['detroit hip hop', 'hip hop', 'rap'] ... 0.429 0.231 -20.430 0.4020 0.878 0.279 74.545 30133.0 0.9140 F
1546 1547 143 1 2019-12-27--2020-01-03 JACKBOYS 5363493 JACKBOYS 437907.0 62zKJrpbLxz6InR3tGyr7o ['rap', 'trap'] ... 0.413 0.130 -25.166 0.0336 0.900 0.111 123.342 46837.0 0.0676 C

3 rows × 23 columns

# Sort from loudest to quietest; rows with a missing Loudness are placed at the end
df.sort_values("Loudness", ascending=False)
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord
1096 1097 129 4 2020-06-05--2020-06-12 Na Raba Toma Tapão 4396629 Niack 352402.0 0AGS6ZRgzobrazmCi6pYMe ['funk carioca'] ... 0.962 0.787 1.509 0.0554 0.666 0.176 130.003 165231.0 0.968 D#/Eb
107 108 59 11 2021-06-04--2021-06-11 Sal y Perrea 6635076 Sech 8758283.0 5u7twkeask1VIyDeNTElSU ['latin', 'panamanian pop', 'reggaeton', 'trap... ... 0.786 0.899 -0.515 0.1270 0.192 0.299 90.025 216510.0 0.813 A#/Bb
651 652 166 11 2020-10-23--2020-10-30 Investe Em Mim 5191001 Jonas Esticado 1293055.0 15k1TDabqSEmyXOwMq9RM7 ['forro', 'sertanejo', 'sertanejo pop', 'serta... ... 0.632 0.953 -1.283 0.0325 0.317 0.125 160.061 186533.0 0.798 A
1159 1160 62 4 2020-05-08--2020-05-15 eight(Prod.&Feat. SUGA of BTS) 4764954 IU 4939593.0 0pYacDCZuRhcrwGUA5nTBe ['k-pop'] ... 0.676 0.869 -1.573 0.0423 0.115 0.132 120.029 167573.0 0.594 C#/Db
1249 1250 192 3 2020-03-27--2020-04-03 Cradles 4218201 Sub Urban 908351.0 1y4jsQt7MjnZhiD1L6qFBC ['modern indie pop'] ... 0.534 0.589 -1.865 0.3250 0.256 0.176 78.616 209829.0 0.632 C#/Db
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
750 751 19 20 2020-07-31--2020-08-07 Agua (with J Balvin) - Music From "Sponge On T... 5358940 Tainy NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
784 785 76 14 2020-09-04--2020-09-11 Lean (feat. Towy, Osquel, Beltito & Sammy & Fa... 4739241 Super Yei, Jone Quest NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
876 877 164 4 2020-09-18--2020-09-25 +Linda 4964708 Dalex NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1140 1141 131 1 2020-05-29--2020-06-05 In meinem Benz 5494500 AK AUSSERKONTROLLE, Bonez MC NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1538 1539 176 1 2020-01-03--2020-01-10 fuck, i'm lonely (with Anne-Marie) - from “13 ... 4856458 Lauv NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

1556 rows × 23 columns
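The filtering and sorting above treat missing values differently: the filter leaves them out, while the sort keeps them and puts them last. Here is a sketch on a small made-up DataFrame (the song names and loudness values are invented) that shows both behaviors, along with one way to drop the missing rows before sorting.

```python
import numpy as np
import pandas as pd

# Made-up loudness values, including one missing entry
songs = pd.DataFrame({
    "Song Name": ["w", "x", "y", "z"],
    "Loudness": [-5.2, np.nan, -22.5, -1.3],
})

# Boolean indexing: keep only the rows where the condition is True.
# A comparison with NaN is False, so the missing row is left out here.
quiet = songs[songs["Loudness"] < -20]

# Sorting: missing values go last by default (na_position="last")
loudest_first = songs.sort_values("Loudness", ascending=False)

# Removing the missing rows first gives a table without NaN at the end
loudest_known = songs.dropna(subset=["Loudness"]).sort_values("Loudness", ascending=False)

print(loudest_first)
```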