Visualization in Python

Recording of lecture from 1/19/2022

The most important visualization library in Math 10 is Altair. Today I want to introduce Altair and two similar libraries, Seaborn and Plotly Express. All three are based on a concept called the Grammar of Graphics, which I believe was introduced in the book The Grammar of Graphics. The book is free to download from campus or over VPN.

The most famous visualization library in Python is Matplotlib. We won’t talk about Matplotlib today. It is quite different from the libraries we will discuss, even though Seaborn is built on top of it.

import numpy as np
import pandas as pd
import altair as alt
import seaborn as sns
import plotly.express as px
np.arange(0,1.1,0.25)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

Here is the “by hand” way to make a pandas DataFrame.

Using np.arange is a little difficult in this context because we need to make sure its length is the same as the length of the other columns.

df = pd.DataFrame({"a":[3,1,4,2],"b":[10,5,6,8],"c":["first","second","third","fourth"],
    "d":np.arange(0.2,1.1,0.25)})
df
   a   b       c     d
0  3  10   first  0.20
1  1   5  second  0.45
2  4   6   third  0.70
3  2   8  fourth  0.95
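
One way to sidestep the length issue is np.linspace, which takes the number of points rather than a step size. This is only a sketch of an alternative, not what the lecture used; the list name labels and the endpoint 0.95 are chosen here to match the column above.

```python
import numpy as np

labels = ["first", "second", "third", "fourth"]

# np.arange takes a step size, so the resulting length has to be worked out separately.
from_arange = np.arange(0.2, 1.1, 0.25)

# np.linspace takes the number of points, so the length can be tied to another column.
from_linspace = np.linspace(0.2, 0.95, len(labels))

print(len(from_arange), len(from_linspace))
print(np.allclose(from_arange, from_linspace))
```

Both arrays have four entries here, but only the linspace call states that length explicitly.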

Put your mouse over one of the points to see the effect of the tooltip.

alt.Chart(df).mark_circle().encode(
    x = "a",
    y = "b",
    color = "d",
    size = "d",
    tooltip = ["a","c"]
)
alt.Chart(df).mark_bar().encode(
    x = "a",
    y = "b"
)
alt.Chart(df).mark_bar(width=30).encode(
    x = "a",
    y = "b"
)

To make a scatter plot in Altair, you use mark_circle. In Seaborn, you use scatterplot. The syntax is very similar.

sns.scatterplot(
    data = df,
    x = "a",
    y = "b",
    hue = "d",
    size = "d",
)
<AxesSubplot:xlabel='a', ylabel='b'>
[Figure: Seaborn scatter plot of b against a, with hue and size set by d]

The same thing for Plotly Express.

px.scatter(
    data_frame=df,
    x = "a",
    y = "b",
    color = "d",
    size = "d",
)

Penguins dataset from Seaborn

df = sns.load_dataset("penguins")
df.head()
  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
3  Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
df.shape
(344, 7)
df.dtypes
species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object
alt.Chart(df).mark_circle().encode(
    x = "bill_length_mm",
    y = "bill_depth_mm",
    color = "species"
)
px.scatter(
    data_frame=df,
    x = "bill_length_mm",
    y = "bill_depth_mm",
    color = "species"
)
sns.scatterplot(
    data = df,
    x = "bill_length_mm",
    y = "bill_depth_mm",
    hue = "species"
)
<AxesSubplot:xlabel='bill_length_mm', ylabel='bill_depth_mm'>
[Figure: Seaborn scatter plot of bill_depth_mm against bill_length_mm, with hue set by species]

By default, the Altair axes will include 0. If you don’t want the axes forced to include 0, the code gets a little longer.

alt.Chart(df).mark_circle().encode(
    x = alt.X("bill_length_mm",scale = alt.Scale(zero=False)),
    y = alt.Y("bill_depth_mm", scale = alt.Scale(zero=False)),
    color = "species"
)

Adding a tooltip that includes all the data.

alt.Chart(df).mark_circle().encode(
    x = alt.X("bill_length_mm",scale = alt.Scale(zero=False)),
    y = alt.Y("bill_depth_mm", scale = alt.Scale(zero=False)),
    color = "species",
    size = "body_mass_g",
    opacity = "body_mass_g",
    tooltip = list(df.columns)
)

Plotting just the data from rows 200 to 300.

alt.Chart(df[200:300]).mark_circle().encode(
    x = alt.X("bill_length_mm",scale = alt.Scale(zero=False)),
    y = alt.Y("bill_depth_mm", scale = alt.Scale(zero=False)),
    color = "species",
    size = "body_mass_g",
    opacity = "body_mass_g",
    tooltip = list(df.columns)
)

df[200:300] and df.iloc[200:300] mean the same thing: both select rows by position, and the first is just an abbreviation for the second.
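
Here is a small check of that claim, using a made-up ten-row DataFrame (df_demo is a stand-in, not the penguins data). For contrast it also shows .loc, which slices by label and includes the endpoint.

```python
import pandas as pd

# A small stand-in DataFrame with the default integer index 0, 1, ..., 9.
df_demo = pd.DataFrame({"x": range(10)})

by_brackets = df_demo[2:5]    # integer slice in brackets: rows by position
by_iloc = df_demo.iloc[2:5]   # the explicit positional version
by_loc = df_demo.loc[2:5]     # label-based: the endpoint 5 is included

print(by_brackets.equals(by_iloc))
print(len(by_iloc), len(by_loc))
```

With the default index, the .loc version has one extra row, so it is worth being deliberate about which one you use.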

Can you find the point on the above plot that corresponds to row 200?

df.iloc[200:300]
       species  island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
200  Chinstrap   Dream            51.5           18.7              187.0       3250.0    Male
201  Chinstrap   Dream            49.8           17.3              198.0       3675.0  Female
202  Chinstrap   Dream            48.1           16.4              199.0       3325.0  Female
203  Chinstrap   Dream            51.4           19.0              201.0       3950.0    Male
204  Chinstrap   Dream            45.7           17.3              193.0       3600.0  Female
...        ...     ...             ...            ...                ...          ...     ...
295     Gentoo  Biscoe            48.6           16.0              230.0       5800.0    Male
296     Gentoo  Biscoe            47.5           14.2              209.0       4600.0  Female
297     Gentoo  Biscoe            51.1           16.3              220.0       6000.0    Male
298     Gentoo  Biscoe            45.2           13.8              215.0       4750.0  Female
299     Gentoo  Biscoe            45.2           16.4              223.0       5950.0    Male

100 rows × 7 columns

df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')
type(df.columns)
pandas.core.indexes.base.Index
list(df.columns)
['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex']

The best way to convert df.columns from a pandas Index into a list is to use list(df.columns). Just for practice, we also convert it into a list using a list comprehension.

[c for c in df.columns]
['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex']
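
A list comprehension becomes more useful than list(df.columns) once a condition is added. As a sketch, the following keeps only the columns whose names end in "_mm". The DataFrame df_demo is an empty stand-in that just reuses the penguins column names, so nothing needs to be downloaded.

```python
import pandas as pd

# Empty stand-in DataFrame with the same column names as the penguins dataset.
df_demo = pd.DataFrame(columns=[
    "species", "island", "bill_length_mm", "bill_depth_mm",
    "flipper_length_mm", "body_mass_g", "sex",
])

# Keep only the columns measured in millimeters.
mm_cols = [c for c in df_demo.columns if c.endswith("_mm")]
print(mm_cols)
```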

Seaborn includes other datasets besides penguins that we could have loaded instead.

sns.get_dataset_names()
['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'taxis',
 'tips',
 'titanic']
df_tips = sns.load_dataset("tips")
df_tips
     total_bill   tip     sex  smoker   day    time  size
  0       16.99  1.01  Female      No   Sun  Dinner     2
  1       10.34  1.66    Male      No   Sun  Dinner     3
  2       21.01  3.50    Male      No   Sun  Dinner     3
  3       23.68  3.31    Male      No   Sun  Dinner     2
  4       24.59  3.61  Female      No   Sun  Dinner     4
...         ...   ...     ...     ...   ...     ...   ...
239       29.03  5.92    Male      No   Sat  Dinner     3
240       27.18  2.00  Female     Yes   Sat  Dinner     2
241       22.67  2.00    Male     Yes   Sat  Dinner     2
242       17.82  1.75    Male      No   Sat  Dinner     2
243       18.78  3.00  Female      No  Thur  Dinner     2

244 rows × 7 columns

We can save this using the to_csv method. If you don’t want the row names included (the “index”), then set index=False.

df_tips.to_csv("tips.csv", index=False)
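
To see what the index argument changes, here is a small sketch with a two-row stand-in DataFrame (df_demo is made up for illustration). Calling to_csv without a file name returns the csv text as a string, which makes the two versions easy to compare.

```python
import pandas as pd

df_demo = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [1.01, 1.66]})

# With no file name, to_csv returns the csv contents as a string.
with_index = df_demo.to_csv()                # default is index=True
without_index = df_demo.to_csv(index=False)

# Compare the header lines of the two versions.
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```

With the index included, the header line starts with a comma because the index column has no name.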

In Deepnote, if you click on the corresponding csv file in the files section, it will automatically sort the rows. Here is how to do the same thing using pandas.

df_tips.sort_values("total_bill",ascending=False)
     total_bill    tip     sex  smoker   day    time  size
170       50.81  10.00    Male     Yes   Sat  Dinner     3
212       48.33   9.00    Male      No   Sat  Dinner     4
 59       48.27   6.73    Male      No   Sat  Dinner     4
156       48.17   5.00    Male      No   Sun  Dinner     6
182       45.35   3.50    Male     Yes   Sun  Dinner     3
...         ...    ...     ...     ...   ...     ...   ...
149        7.51   2.00    Male      No  Thur   Lunch     2
111        7.25   1.00  Female      No   Sat  Dinner     1
172        7.25   5.15    Male     Yes   Sun  Dinner     2
 92        5.75   1.00  Female     Yes   Fri  Dinner     2
 67        3.07   1.00  Female     Yes   Sat  Dinner     1

244 rows × 7 columns
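
Notice in the output above that the row labels (170, 212, ...) travel with the rows. The sketch below, using a three-row stand-in DataFrame (df_demo is made up for illustration), also shows that sort_values returns a new DataFrame and leaves the original alone.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "total_bill": [16.99, 50.81, 3.07],
    "tip": [1.01, 10.00, 1.00],
})

# sort_values returns a new, sorted DataFrame.
df_sorted = df_demo.sort_values("total_bill", ascending=False)

print(list(df_sorted.index))  # original row labels, in sorted order
print(list(df_demo.index))    # the original DataFrame is unchanged
```

If you want to keep the sorted version, assign the result to a variable as done here.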