Week 3 Monday#

Worksheet 5 is posted.
Like last week, I am going to be working in a “Private” Deepnote notebook and will upload it after class. The Deepnote support group is investigating why the hardware keeps resetting on me and that will hopefully be resolved this week.
Unlike NumPy and pandas, the data visualization library we use (Altair) would need to be installed on the lab computers. (That’s not difficult, but it would need to be done on each machine.) So we will benefit from using Deepnote for this portion.
Hanson is here to help.

Plotting based on the Grammar of Graphics#

If you’ve already seen one plotting library in Python, it was probably Matplotlib. Matplotlib is the most flexible and most widely used Python plotting library. In Math 10, our main interest is in using Python for Data Science, and for that, there are some specialty plotting libraries that will get us nice results much faster than Matplotlib.

Here we will introduce the plotting library we will use most often in Math 10, Altair, along with two more plotting libraries, Seaborn and Plotly. (Of these, Seaborn is probably the most famous.)

These three libraries are very similar to each other (not so similar to Matplotlib, although Seaborn is built on top of Matplotlib), and I believe all three are based on a notion called the Grammar of Graphics. (Here is the book The Grammar of Graphics, which is freely available to download from on campus or using VPN. There is also a widely used, and I think older, plotting library for the R statistical software that uses the same conventions, ggplot2.)

Here is the basic setup for Altair, Seaborn, and Plotly:

We have a pandas DataFrame, and each row in the DataFrame corresponds to one observation (i.e., to one instance, to one data point).
Each column in the DataFrame corresponds to a variable (also called a dimension, or a field).
To produce the visualizations, we encode different columns from the DataFrame into visual properties of the chart (like the x-coordinate, or the color).
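As a rough illustration of this setup (a sketch, not how any of these libraries is implemented), the code below builds a tiny DataFrame from three rows of the mpg sample shown in the warm-up, then prints what each row would contribute to each visual channel. The `cars` DataFrame and the `encoding` dictionary are hypothetical names used only for this illustration.

```python
import pandas as pd

# Three observations (rows) and three variables (columns),
# copied by hand from the mpg sample in these notes.
cars = pd.DataFrame({
    "mpg": [14.5, 22.0, 15.0],
    "weight": [4215, 2945, 3693],
    "origin": ["usa", "europe", "usa"],
})

# Hypothetical encoding: which column drives which visual property.
encoding = {"x": "weight", "y": "mpg", "color": "origin"}

# Each row becomes one mark; each channel reads its value from one column.
for _, row in cars.iterrows():
    visual = {channel: row[column] for channel, column in encoding.items()}
    print(visual)
```

The plotting libraries do this bookkeeping for us: we only name the columns, and the library turns the values into positions and colors.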

Altair tries to choose default values that produce high-quality visualizations; this greatly reduces the need for fine-tuning. But there is also a huge amount of customization possible. As one example, here are the named color schemes available in Altair.

Warm-up: first look at the `mpg` dataset#

Load the mpg dataset from the Seaborn library.

import seaborn as sns

Here is a list of all the datasets included with Seaborn.

sns.get_dataset_names()

['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'dowjones',
 'exercise',
 'flights',
 'fmri',
 'geyser',
 'glue',
 'healthexp',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'seaice',
 'taxis',
 'tips',
 'titanic']

df = sns.load_dataset("mpg")

df.shape

(398, 9)

df.sample(3)

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
190	14.5	8	351.0	152.0	4215	12.8	76	usa	ford gran torino
179	22.0	4	121.0	98.0	2945	14.5	75	europe	volvo 244dl
1	15.0	8	350.0	165.0	3693	11.5	70	usa	buick skylark 320

How many “origin” values are there in this dataset?

df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')

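len(df["origin"])

398

This is just the number of rows, not the number of distinct values, so it does not answer the question.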

The unique method is defined for pandas Series but not for the whole DataFrame.

df.unique()

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[8], line 1
----> 1 df.unique()

File ~/mambaforge/envs/math10s23/lib/python3.9/site-packages/pandas/core/generic.py:5902, in NDFrame.__getattr__(self, name)
   5895 if (
   5896     name not in self._internal_names_set
   5897     and name not in self._metadata
   5898     and name not in self._accessors
   5899     and self._info_axis._can_hold_identifiers_and_holds_name(name)
   5900 ):
   5901     return self[name]
-> 5902 return object.__getattribute__(self, name)

AttributeError: 'DataFrame' object has no attribute 'unique'

df["origin"].unique()

array(['usa', 'japan', 'europe'], dtype=object)

len(df["origin"].unique())

3

df["origin"].value_counts()

usa       249
japan      79
europe     70
Name: origin, dtype: int64

len(df["origin"].value_counts())

3
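A more direct way to count distinct values is the pandas Series method nunique. Here is a minimal sketch on a small hand-made DataFrame (`df_small` is a hypothetical stand-in holding a few rows copied from the mpg dataset, so the counts differ from the full data).

```python
import pandas as pd

# Hypothetical miniature version of the mpg data (five rows copied by hand).
df_small = pd.DataFrame({
    "origin": ["usa", "europe", "usa", "europe", "europe"],
    "weight": [4215, 2945, 3693, 1835, 2672],
})

# Number of distinct values in the column.
n_unique = df_small["origin"].nunique()

# How many rows there are for each distinct value.
counts = df_small["origin"].value_counts()

print(n_unique)
print(counts)
```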

How does the average weight of a car differ across these origins? Use the DataFrame method groupby (which we have not seen yet).

Here is an example of the “object-oriented programming” approach of having special-purpose objects. Here we have a DataFrameGroupBy object that is probably not used anywhere else.

df.groupby("origin")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x13e2df940>

This special object has a mean method, which reports the average of each numeric column within each “origin” group. The result is a whole DataFrame. (We pass numeric_only=True so that text columns such as “name” are skipped; recent versions of pandas no longer skip them by default.)

df.groupby("origin").mean(numeric_only=True)

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year
origin
europe	27.891429	4.157143	109.142857	80.558824	2423.300000	16.787143	75.814286
japan	30.450633	4.101266	102.708861	79.835443	2221.227848	16.172152	77.443038
usa	20.083534	6.248996	245.901606	119.048980	3361.931727	15.033735	75.610442

Here we get a single column from that DataFrame.

df.groupby("origin").mean(numeric_only=True)["weight"]

origin
europe    2423.300000
japan     2221.227848
usa       3361.931727
Name: weight, dtype: float64
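An alternative is to select the column before calling mean, so that only the “weight” averages are computed and the text columns never come into play. Here is a minimal sketch, assuming a small hand-made DataFrame (`df_small` is hypothetical, with five rows copied from the mpg dataset, so the averages differ from the full data).

```python
import pandas as pd

# Hypothetical miniature version of the mpg data (five rows copied by hand).
df_small = pd.DataFrame({
    "origin": ["usa", "europe", "usa", "europe", "europe"],
    "weight": [4215, 2945, 3693, 1835, 2672],
    "name": [
        "ford gran torino",
        "volvo 244dl",
        "buick skylark 320",
        "volkswagen 1131 deluxe sedan",
        "peugeot 504",
    ],
})

# Select the column from the grouped object first, then average.
# The result is a pandas Series indexed by origin.
mean_weight = df_small.groupby("origin")["weight"].mean()
print(mean_weight)
```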

Can you calculate that same average weight for “europe” using Boolean indexing?

Here we get the sub-DataFrame containing only cars with origin equal to “europe”.

df_sub = df[df["origin"] == "europe"]
df_sub

	mpg	cylinders	displacement	horsepower	weight	acceleration	model_year	origin	name
19	26.0	4	97.0	46.0	1835	20.5	70	europe	volkswagen 1131 deluxe sedan
20	25.0	4	110.0	87.0	2672	17.5	70	europe	peugeot 504
21	24.0	4	107.0	90.0	2430	14.5	70	europe	audi 100 ls
22	25.0	4	104.0	95.0	2375	17.5	70	europe	saab 99e
23	26.0	4	121.0	113.0	2234	12.5	70	europe	bmw 2002
...	...	...	...	...	...	...	...	...	...
354	34.5	4	100.0	NaN	2320	15.8	81	europe	renault 18i
359	28.1	4	141.0	80.0	3230	20.4	81	europe	peugeot 505s turbo diesel
360	30.7	6	145.0	76.0	3160	19.6	81	europe	volvo diesel
375	36.0	4	105.0	74.0	1980	15.3	82	europe	volkswagen rabbit l
394	44.0	4	97.0	52.0	2130	24.6	82	europe	vw pickup

70 rows × 9 columns

Now we get the “weight” column from that sub-DataFrame.

df_sub["weight"]

19     1835
20     2672
21     2430
22     2375
23     2234
       ... 
354    2320
359    3230
360    3160
375    1980
394    2130
Name: weight, Length: 70, dtype: int64

Now we call the mean method of this pandas Series.

df_sub["weight"].mean()

2423.3
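The same steps can be combined into a single expression using loc with a Boolean Series for the rows and a column name for the columns. The sketch below checks that this agrees with the groupby approach; it uses a hypothetical hand-made `df_small` (five rows copied from the mpg dataset), so the value is not the 2423.3 from the full data.

```python
import pandas as pd

# Hypothetical miniature version of the mpg data (five rows copied by hand).
df_small = pd.DataFrame({
    "origin": ["usa", "europe", "usa", "europe", "europe"],
    "weight": [4215, 2945, 3693, 1835, 2672],
})

# Boolean Series: True exactly for the rows with origin "europe".
is_europe = df_small["origin"] == "europe"

# Keep the True rows and the "weight" column, then average.
europe_mean = df_small.loc[is_europe, "weight"].mean()

# The groupby approach should give the same number.
grouped_mean = df_small.groupby("origin")["weight"].mean()["europe"]

print(europe_mean, grouped_mean)
```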

Something to think about: how does the miles-per-gallon change based on the weight? I don’t think we know yet how to answer that in a concise way using pandas.

Visualizing the data using Altair#

To make visualizations in all of these libraries, we encode columns in the dataset to various visual channels in the chart.

import altair as alt

Plot this data using a scatter plot (denoted by mark_circle() in Altair). Encode the “weight” column in the x-coordinate and the “mpg” column in the y-coordinate.

Here we get just a single point, because with no encodings all the circles are drawn at the same position. We haven’t told Altair how to relate the data to the visualization. All Altair knows at this point is that we are using the DataFrame df and that we are drawing the marks as filled-in circles.

alt.Chart(df).mark_circle()

Here is a reminder of what columns we have to work with. If the capitalization or spelling is wrong, the plotting will not work.

df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')

Here we allow the points to have different x-coordinates, corresponding to the weight.

alt.Chart(df).mark_circle().encode(
    x = "weight"
)

Now we also encode the y-coordinate. From this vantage point, it’s clear (and it’s also intuitively clear) that as the weight increases, the miles-per-gallon decreases.

alt.Chart(df).mark_circle().encode(
    x = "weight",
    y = "mpg"
)

Add a color channel to the chart, encoding the “origin” value.

alt.Chart(df).mark_circle().encode(
    x = "weight",
    y = "mpg",
    color = "origin"
)

Add a tooltip to the chart, including the weight, mpg, origin, model year, and the name of the car.

Notice how if you move your mouse over a point in the chart, you will see all the requested information. Each drawn point should be thought of as corresponding to one row in the original DataFrame.

alt.Chart(df).mark_circle().encode(
    x = "weight",
    y = "mpg",
    color = "origin",
    tooltip = ["weight", "mpg", "origin", "model_year", "name"]
)

Visualizing the data using Seaborn#

We won’t use Seaborn or Plotly Express much if at all in Math 10 after Worksheet 5, but I want you to see how similar they are to Altair. If you like Seaborn or Plotly Express, think about using it extensively as an “extra” component in the course project at the end of the class.

Make a similar chart (using the xy-axes and color but not the tooltip) using Seaborn.

import seaborn as sns

The syntax is a little different (for example, here we use the keyword argument hue rather than color), but the approach is exactly the same: we specify how to encode columns in our DataFrame as visual channels in the plot.

sns.scatterplot(
    data=df,
    x="weight",
    y="mpg",
    hue="origin"
)

<Axes: xlabel='weight', ylabel='mpg'>

[Figure: Seaborn scatter plot of mpg against weight, with points colored by origin.]

Visualizing the data using Plotly Express#

Make a similar chart using Plotly Express.

import plotly.express as px

Again, the syntax is a little different, but the idea is exactly the same.

px.scatter(
    data_frame=df,
    x="weight",
    y="mpg",
    color="origin"
)

Time to work on worksheets#

Worksheet 5 is posted. Time to work on that or Worksheets 3-4 from last week.
Hanson and I are here to help.