Introduction to Altair
The first time you use Altair, you will need to install it. From a lab computer:
Open Anaconda Navigator
Click on Environments (left side)
Change the dropdown menu from “Installed” to “All” (it might already be on “All”)
Search Packages for “Altair”
Click the checkbox next to Altair (if it already shows a green checkbox, then it’s already installed)
Click “Apply” at the bottom
Wait for the installation to finish
Switch back from “Environments” to “Home”
Open Jupyter Notebook
Altair is just one of many plotting libraries. It is my favorite, but it is definitely not the most famous. It might not even be the fifth most famous. In Math 10, we will also use Matplotlib and Seaborn. Some others that we probably will not use are Plotly and Holoviews (actually Holoviews probably isn’t technically a plotting library, but let’s ignore that).
# If you see ModuleNotFoundError: No module named 'altair'
# then you have not yet installed Altair.
# Follow the steps above, or an alternative like pip install altair
import altair as alt
import pandas as pd
import numpy as np
rng = np.random.default_rng()
Basic example with random data
The syntax for Altair can be intimidating the first time you see it (and the fifth time you see it). Here is some information. Skip this information for now, and come back to it when you need it.
Options for what to draw: Marks
Different “channels”, like color, opacity, and size. You can see a list of channels here: Documentation.
Here is a list of color schemes: Vega color schemes and an example of how to use a color scheme in Altair.
Sometimes it helps to explicitly tell Altair what type of data it is: encoding types
If you find browsing examples easier than reading documentation, check out the Altair example gallery
# Making some data
A = rng.integers(0,100,size = (6000,4))
rand_df = pd.DataFrame(A, columns = ["a","b","c","d"])
rand_df
 | a | b | c | d |
---|---|---|---|---|
0 | 96 | 58 | 10 | 40 |
1 | 99 | 75 | 2 | 37 |
2 | 13 | 54 | 16 | 3 |
3 | 79 | 32 | 6 | 11 |
4 | 71 | 77 | 50 | 53 |
... | ... | ... | ... | ... |
5995 | 30 | 97 | 58 | 55 |
5996 | 25 | 55 | 27 | 97 |
5997 | 18 | 29 | 17 | 59 |
5998 | 85 | 43 | 70 | 23 |
5999 | 94 | 92 | 35 | 90 |
6000 rows × 4 columns
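# By default, Altair raises an error for data with more than 5000 rows,
# so we use only the first 5000 rows here.
alt.Chart(rand_df.iloc[:5000]).mark_circle().encode(
    x = "a",
    y = "b"
)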
Example with data from Spotify
Download the spotify_dataset.csv file from Canvas, and put it somewhere you can access from Jupyter (maybe in the same folder as this notebook).
The dataset originally came from Kaggle here. You can go there to see a description of the columns.
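# Change the path if necessary
df = pd.read_csv("../data/spotify_dataset.csv")
# Doing some small "data cleaning": converting some columns from strings to numbers.
df = df.replace(" ",np.nan)  # blank entries become missing values
df["Streams"] = df["Streams"].str.replace(",","")  # remove the thousands separators
# Columns 5 and 7 are Streams and Artist Followers; columns 12 to 21 are Popularity through Valence.
# We assign one column at a time by name, because newer versions of pandas
# may keep the old string dtype when assigning through iloc.
for col in list(df.columns[[5,7]]) + list(df.columns[12:22]):
    df[col] = pd.to_numeric(df[col])
df.dtypes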
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams int64
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
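The cleaning steps above are easier to follow on a tiny example. The sketch below applies the same three ideas (blank strings become missing values, commas are removed, strings are converted to numbers) to a made-up DataFrame. The name toy and all of its values are invented for this illustration; they are not taken from the Spotify file.

```python
import numpy as np
import pandas as pd

# Made-up data: numbers stored as strings, with commas and a blank entry
toy = pd.DataFrame({
    "Song": ["x", "y", "z"],
    "Streams": ["1,234,567", "89,000", "4,321"],
    "Energy": ["0.8", " ", "0.59"],
})

# Step 1: a single space means "missing", so replace it with NaN
toy = toy.replace(" ", np.nan)

# Step 2: remove the commas, which would otherwise make the conversion fail
toy["Streams"] = toy["Streams"].str.replace(",", "")

# Step 3: convert the strings to numbers
toy["Streams"] = pd.to_numeric(toy["Streams"])
toy["Energy"] = pd.to_numeric(toy["Energy"])

print(toy.dtypes)
```

After these steps, Streams holds integers and Energy holds floats, with the blank entry showing up as a missing value. A column that contains missing values cannot be an integer column in standard pandas, which is presumably why Artist Followers shows up as float64 above.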
alt.Chart(df).mark_circle().encode(
    x = "Energy",
    y = "Loudness",
    color = alt.Color('Energy', scale=alt.Scale(scheme='turbo')),
    tooltip = ["Song Name", "Artist"]
).properties(
    width = 400
)
df[df["Loudness"] < -20]
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
670 | 671 | 127 | 1 | 2020-12-18--2020-12-25 | Carol of the Bells | 7993078 | Mykola Dmytrovych Leontovych, John Williams | 1024868.0 | 4tHqQMWSqmL6YjXwsqthDI | ['soundtrack'] | ... | 0.418 | 0.106 | -22.507 | 0.0448 | 0.994 | 0.179 | 46.718 | 85267.0 | 0.8000 | G#/Ab |
1499 | 1500 | 100 | 1 | 2020-01-17--2020-01-24 | Alfred - Interlude | 8030151 | Eminem | 46814751.0 | 4EmunTy7kNBYQivOa8F6b8 | ['detroit hip hop', 'hip hop', 'rap'] | ... | 0.429 | 0.231 | -20.430 | 0.4020 | 0.878 | 0.279 | 74.545 | 30133.0 | 0.9140 | F |
1546 | 1547 | 143 | 1 | 2019-12-27--2020-01-03 | JACKBOYS | 5363493 | JACKBOYS | 437907.0 | 62zKJrpbLxz6InR3tGyr7o | ['rap', 'trap'] | ... | 0.413 | 0.130 | -25.166 | 0.0336 | 0.900 | 0.111 | 123.342 | 46837.0 | 0.0676 | C |
3 rows × 23 columns
df.sort_values("Loudness", ascending=False)
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1096 | 1097 | 129 | 4 | 2020-06-05--2020-06-12 | Na Raba Toma Tapão | 4396629 | Niack | 352402.0 | 0AGS6ZRgzobrazmCi6pYMe | ['funk carioca'] | ... | 0.962 | 0.787 | 1.509 | 0.0554 | 0.666 | 0.176 | 130.003 | 165231.0 | 0.968 | D#/Eb |
107 | 108 | 59 | 11 | 2021-06-04--2021-06-11 | Sal y Perrea | 6635076 | Sech | 8758283.0 | 5u7twkeask1VIyDeNTElSU | ['latin', 'panamanian pop', 'reggaeton', 'trap... | ... | 0.786 | 0.899 | -0.515 | 0.1270 | 0.192 | 0.299 | 90.025 | 216510.0 | 0.813 | A#/Bb |
651 | 652 | 166 | 11 | 2020-10-23--2020-10-30 | Investe Em Mim | 5191001 | Jonas Esticado | 1293055.0 | 15k1TDabqSEmyXOwMq9RM7 | ['forro', 'sertanejo', 'sertanejo pop', 'serta... | ... | 0.632 | 0.953 | -1.283 | 0.0325 | 0.317 | 0.125 | 160.061 | 186533.0 | 0.798 | A |
1159 | 1160 | 62 | 4 | 2020-05-08--2020-05-15 | eight(Prod.&Feat. SUGA of BTS) | 4764954 | IU | 4939593.0 | 0pYacDCZuRhcrwGUA5nTBe | ['k-pop'] | ... | 0.676 | 0.869 | -1.573 | 0.0423 | 0.115 | 0.132 | 120.029 | 167573.0 | 0.594 | C#/Db |
1249 | 1250 | 192 | 3 | 2020-03-27--2020-04-03 | Cradles | 4218201 | Sub Urban | 908351.0 | 1y4jsQt7MjnZhiD1L6qFBC | ['modern indie pop'] | ... | 0.534 | 0.589 | -1.865 | 0.3250 | 0.256 | 0.176 | 78.616 | 209829.0 | 0.632 | C#/Db |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
750 | 751 | 19 | 20 | 2020-07-31--2020-08-07 | Agua (with J Balvin) - Music From "Sponge On T... | 5358940 | Tainy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
784 | 785 | 76 | 14 | 2020-09-04--2020-09-11 | Lean (feat. Towy, Osquel, Beltito & Sammy & Fa... | 4739241 | Super Yei, Jone Quest | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
876 | 877 | 164 | 4 | 2020-09-18--2020-09-25 | +Linda | 4964708 | Dalex | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1140 | 1141 | 131 | 1 | 2020-05-29--2020-06-05 | In meinem Benz | 5494500 | AK AUSSERKONTROLLE, Bonez MC | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1538 | 1539 | 176 | 1 | 2020-01-03--2020-01-10 | fuck, i'm lonely (with Anne-Marie) - from “13 ... | 4856458 | Lauv | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1556 rows × 23 columns
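The last two outputs use Boolean indexing and sorting. Notice that the rows with missing Loudness values end up at the bottom of the sorted DataFrame, even though we sorted from largest to smallest. Here is a minimal sketch of both operations on made-up data. The name toy and its values are invented for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing Loudness value
toy = pd.DataFrame({
    "Song Name": ["A", "B", "C", "D"],
    "Loudness": [-22.5, -3.1, np.nan, -25.2],
})

# Boolean indexing: keep the rows where the condition is True.
# A comparison with NaN is False, so row C is not kept.
quiet = toy[toy["Loudness"] < -20]

# Sorting: by default, missing values are placed last,
# whether ascending is True or False.
ranked = toy.sort_values("Loudness", ascending=False)

# To leave out the missing values, drop those rows before sorting
ranked_clean = toy.dropna(subset=["Loudness"]).sort_values("Loudness", ascending=False)

print(quiet)
print(ranked)
print(ranked_clean)
```

If you would rather see the missing rows first, `sort_values` accepts `na_position="first"`.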