Week 3 Friday

Announcements

  • There was a typo in the original version of Worksheet 6: the Boolean indexing question should refer to “Midterm 2”, not to “Midterm 1”.

  • Midterm 1 is on Monday of Week 5. I’ll post a sample midterm during Week 4. Come to class next week to get the “note card” you can fill out with handwritten notes for the midterm.

  • I’m trying a shared notebook again (Deepnote thinks they fixed the issue).

  • I had some trouble installing the newest version of Altair this morning, so we’re going to use the “old” syntax and the version of Altair that’s already installed automatically in Deepnote. (In other words, we won’t use !pip install today.)

  • I’ve tried to make this lecture a little shorter so that there is time for you to practice with the material at the end.

  • Yufei is here to help with the worksheets and course material.

import seaborn as sns
df = sns.load_dataset("penguins")

The axis keyword argument

This is the only description of the axis keyword argument that I have found works in every example:

The axis keyword argument indicates which axis labels are (potentially) changing. If the row labels could change, then we use axis=0. If the column labels could change, then we use axis=1.

Here is a way to remember which is 0 and which is 1. When we check the shape of a DataFrame, we get a tuple. The 0 entry in the tuple is the number of rows (changing row labels corresponds to axis=0) and the 1 entry in the tuple corresponds to columns (changing column labels corresponds to axis=1). For example, the following is telling us that df has 344 rows and 7 columns.

df.shape
(344, 7)
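As a minimal sketch of this rule, here is a toy DataFrame (hypothetical data, not the penguins dataset) where we check the shape and then call the pandas method sum with each axis value. With axis=0 the row labels are the ones that disappear, and with axis=1 the column labels are the ones that disappear.

```python
import pandas as pd

# Hypothetical toy data (not the penguins dataset): 3 rows, 2 columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Position 0 of the shape tuple counts rows, position 1 counts columns.
n_rows, n_cols = toy.shape

# axis=0: the row labels change (they vanish), leaving one value per column.
col_sums = toy.sum(axis=0)

# axis=1: the column labels change (they vanish), leaving one value per row.
row_sums = toy.sum(axis=1)

print(toy.shape)             # (3, 2)
print(list(col_sums.index))  # ['a', 'b']
print(list(row_sums.index))  # [0, 1, 2]
```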
  • In the penguins dataset, change the column named “island” so it is named “location” and change the column named “sex” so it is named “gender”. Use the pandas DataFrame method rename, and input a Python dictionary.

df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

The following has no effect. The object inside the parentheses is an example of a Python dictionary. This is a very important data type that is built into Python.

df.rename({"island": "location", "sex":"gender"})
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

The above is the same as the following (meaning axis=0 is the default). The following code looks through all of the row labels (0, 1, through 343), and if it finds a row named “island” or “sex”, it changes that name. No row has either name, so nothing changes.

df.rename({"island": "location", "sex":"gender"}, axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

We are trying to change the column names, so we should instead use the argument axis=1. Notice how the “island” and “sex” columns have been renamed.

df.rename({"island": "location", "sex":"gender"}, axis=1)
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

Here is an example of renaming one of the row labels. Notice how we switch to axis=0.

df.rename({3: "Chris Davis"}, axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
Chris Davis Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

It’s important to point out that we haven’t changed df itself. A hint that we haven’t changed df is that DataFrames were displayed as the result of our code. The code was creating new DataFrames, not changing the original DataFrame.

# so far, we haven't changed df itself
df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns
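The same behavior can be checked on a small example. This is only a sketch using a hypothetical two-row DataFrame called toy (not the penguins data): the default axis=0 finds no matching row labels, axis=1 renames the columns, and in both cases the original DataFrame is left untouched.

```python
import pandas as pd

# Hypothetical toy data with the same column names we are renaming.
toy = pd.DataFrame({"island": ["Biscoe", "Dream"], "sex": ["Male", "Female"]})

# Default axis=0: pandas looks for ROW labels named "island" or "sex".
# The row labels are 0 and 1, so nothing is renamed.
same = toy.rename({"island": "location", "sex": "gender"})

# axis=1: pandas looks through the COLUMN labels instead.
renamed = toy.rename({"island": "location", "sex": "gender"}, axis=1)

print(list(same.columns))     # ['island', 'sex']
print(list(renamed.columns))  # ['location', 'gender']
print(list(toy.columns))      # ['island', 'sex'] (toy itself has not changed)
```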

If we want to change df itself, we should assign the result back to df. (Warning: this can be dangerous; it would be safer to call this DataFrame something different, like df2. If you make a mistake at this stage, it’s very possible to destroy df, in which case you will need to restart the notebook.)

# alternative: df.rename({"island": "location", "sex":"gender"}, axis=1, inplace=True), with no "df =" in front
df = df.rename({"island": "location", "sex":"gender"}, axis=1)

Another option to accomplish the same thing would be to use the inplace keyword argument:

df.rename({"island": "location", "sex":"gender"}, axis=1, inplace=True)

Notice how we do not put df= at the beginning when using the inplace keyword argument. How would you know that option existed? You could check the documentation for the rename method, as below.

help(df.rename)
Help on method rename in module pandas.core.frame:

rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') method of pandas.core.frame.DataFrame instance
    Alter axes labels.
    
    Function / dict values must be unique (1-to-1). Labels not contained in
    a dict / Series will be left as-is. Extra labels listed don't throw an
    error.
    
    See the :ref:`user guide <basics.rename>` for more.
    
    Parameters
    ----------
    mapper : dict-like or function
        Dict-like or function transformations to apply to
        that axis' values. Use either ``mapper`` and ``axis`` to
        specify the axis to target with ``mapper``, or ``index`` and
        ``columns``.
    index : dict-like or function
        Alternative to specifying axis (``mapper, axis=0``
        is equivalent to ``index=mapper``).
    columns : dict-like or function
        Alternative to specifying axis (``mapper, axis=1``
        is equivalent to ``columns=mapper``).
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Axis to target with ``mapper``. Can be either the axis name
        ('index', 'columns') or number (0, 1). The default is 'index'.
    copy : bool, default True
        Also copy underlying data.
    inplace : bool, default False
        Whether to return a new DataFrame. If True then value of copy is
        ignored.
    level : int or level name, default None
        In case of a MultiIndex, only rename labels in the specified
        level.
    errors : {'ignore', 'raise'}, default 'ignore'
        If 'raise', raise a `KeyError` when a dict-like `mapper`, `index`,
        or `columns` contains labels that are not present in the Index
        being transformed.
        If 'ignore', existing keys will be renamed and extra keys will be
        ignored.
    
    Returns
    -------
    DataFrame or None
        DataFrame with the renamed axis labels or None if ``inplace=True``.
    
    Raises
    ------
    KeyError
        If any of the labels is not found in the selected axis and
        "errors='raise'".
    
    See Also
    --------
    DataFrame.rename_axis : Set the name of the axis.
    
    Examples
    --------
    ``DataFrame.rename`` supports two calling conventions
    
    * ``(index=index_mapper, columns=columns_mapper, ...)``
    * ``(mapper, axis={'index', 'columns'}, ...)``
    
    We *highly* recommend using keyword arguments to clarify your
    intent.
    
    Rename columns using a mapping:
    
    >>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
    >>> df.rename(columns={"A": "a", "B": "c"})
       a  c
    0  1  4
    1  2  5
    2  3  6
    
    Rename index using a mapping:
    
    >>> df.rename(index={0: "x", 1: "y", 2: "z"})
       A  B
    x  1  4
    y  2  5
    z  3  6
    
    Cast index labels to a different type:
    
    >>> df.index
    RangeIndex(start=0, stop=3, step=1)
    >>> df.rename(index=str).index
    Index(['0', '1', '2'], dtype='object')
    
    >>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
    Traceback (most recent call last):
    KeyError: ['C'] not found in axis
    
    Using axis-style parameters:
    
    >>> df.rename(str.lower, axis='columns')
       a  b
    0  1  4
    1  2  5
    2  3  6
    
    >>> df.rename({1: 2, 2: 4}, axis='index')
       A  B
    0  1  4
    2  2  5
    4  3  6
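Two things from this documentation are worth trying on a small example: the columns keyword is an alternative to passing axis=1, and with inplace=True the method returns None (which is why we do not write df = in front). The following is a sketch on a hypothetical one-row DataFrame called toy.

```python
import pandas as pd

# Hypothetical toy data, just to illustrate the two calling conventions.
toy = pd.DataFrame({"island": ["Biscoe"], "sex": ["Male"]})
mapping = {"island": "location", "sex": "gender"}

# These two calls are documented as equivalent.
by_axis = toy.rename(mapping, axis=1)
by_keyword = toy.rename(columns=mapping)
print(by_axis.equals(by_keyword))  # True

# With inplace=True, toy itself is modified and the return value is None.
result = toy.rename(columns=mapping, inplace=True)
print(result)             # None
print(list(toy.columns))  # ['location', 'gender']
```

Writing df = df.rename(..., inplace=True) would therefore replace df with None, which is one of the easy ways to destroy df by mistake.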

Delete all the rows which contain missing values. (There is a shorter way to do this, but the following is a nice example of some useful pandas tools.)

  • First make a Boolean DataFrame indicating whether the data value is missing, using the isna method.

Notice how there are missing values. Here I am “slicing” from row 1 (inclusive, that’s not the top row) to row 5 (exclusive) just for a different example. I would usually use something like df.head(5) instead. Notice that this slicing refers to rows, not to columns.

df[1:5]
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

Here is the row from the slice above that has missing values.

# row labeled 3
df.loc[3]
species                 Adelie
location             Torgersen
bill_length_mm             NaN
bill_depth_mm              NaN
flipper_length_mm          NaN
body_mass_g                NaN
gender                     NaN
Name: 3, dtype: object

This is just a reminder of how df.loc works. We need to use the column label.

df.loc[3, "location"]
'Torgersen'

Notice the True values in the row labeled 3 in the following. Those True values correspond to the missing values we just saw.

df.isna()
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False True True True True True
4 False False False False False False False
... ... ... ... ... ... ... ...
339 False False True True True True True
340 False False False False False False False
341 False False False False False False False
342 False False False False False False False
343 False False False False False False False

344 rows × 7 columns

  • Then apply any with a suitable axis keyword argument to determine which rows have any missing data.

In this case, we want to keep the row labels the same, but we are getting rid of the column labels completely. That’s why we use axis=1 in this case. Also notice the True in row 3.

# notice: row names didn't change, column names disappeared
df.isna().any(axis=1)
0      False
1      False
2      False
3       True
4      False
       ...  
339     True
340    False
341    False
342    False
343    False
Length: 344, dtype: bool

If we use axis=0 instead, then we are keeping the column labels the same, and finding out which columns have any missing values.

df.isna().any(axis=0)
species              False
location             False
bill_length_mm        True
bill_depth_mm         True
flipper_length_mm     True
body_mass_g           True
gender                True
dtype: bool

There is also an all method (in contrast to the any method we are using). Notice how row 3 is now False, because it is not the case that all of the values are missing in this row.

df.isna().all(axis=1)
0      False
1      False
2      False
3      False
4      False
       ...  
339    False
340    False
341    False
342    False
343    False
Length: 344, dtype: bool
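To see the difference between any and all at a glance, here is a sketch on a hypothetical three-row DataFrame called toy, where one row is complete and two rows are only partly missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, rows 1 and 2 are partly missing.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo"],
        "bill_length_mm": [39.1, np.nan, np.nan],
        "gender": ["Male", np.nan, "Female"],
    }
)

missing = toy.isna()

# axis=1 collapses the column labels, leaving one Boolean per row.
any_missing = missing.any(axis=1)  # True if at least one value in the row is missing
all_missing = missing.all(axis=1)  # True only if every value in the row is missing

print(any_missing.tolist())  # [False, True, True]
print(all_missing.tolist())  # [False, False, False]
```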
  • Now use Boolean indexing like usual. You might need to take a negation, using tilde ~.

If we plug in exactly what we have above, we will be doing the exact opposite of what we want. This is keeping the rows that have any missing values.

# opposite of what we want
df[df.isna().any(axis=1)]
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
3 Adelie Torgersen NaN NaN NaN NaN NaN
8 Adelie Torgersen 34.1 18.1 193.0 3475.0 NaN
9 Adelie Torgersen 42.0 20.2 190.0 4250.0 NaN
10 Adelie Torgersen 37.8 17.1 186.0 3300.0 NaN
11 Adelie Torgersen 37.8 17.3 180.0 3700.0 NaN
47 Adelie Dream 37.5 18.9 179.0 2975.0 NaN
246 Gentoo Biscoe 44.5 14.3 216.0 4100.0 NaN
286 Gentoo Biscoe 46.2 14.4 214.0 4650.0 NaN
324 Gentoo Biscoe 47.3 13.8 216.0 4725.0 NaN
336 Gentoo Biscoe 44.5 15.7 217.0 4875.0 NaN
339 Gentoo Biscoe NaN NaN NaN NaN NaN

So we should take the negation, using tilde ~. Here we are keeping the rows which do not have any missing values.

# get rid of rows with missing data
df = df[~df.isna().any(axis=1)]
df
species location bill_length_mm bill_depth_mm flipper_length_mm body_mass_g gender
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
5 Adelie Torgersen 39.3 20.6 190.0 3650.0 Male
... ... ... ... ... ... ... ...
338 Gentoo Biscoe 47.2 13.7 214.0 4925.0 Female
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

333 rows × 7 columns

  • Be sure to save the resulting DataFrame with the same name df. It should now have 333 rows.

df.shape
(333, 7)

Facet charts

  • Display an Altair scatter chart showing bill length for the x-axis, flipper length for the y-axis, and color using species.

df.columns
Index(['species', 'location', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'gender'],
      dtype='object')
import altair as alt
alt.Chart(df).mark_circle().encode(
    x="bill_length_mm",
    y="flipper_length_mm",
    color="species"
)
  • Use the domain 30-to-60 for the x-axis and 170-to-240 for the y-axis. (I had some trouble this morning installing the new version of Altair, so let’s use the old syntax.)

Here we are using a Python tuple data type to specify the domains.

# old syntax, didn't pip install today
alt.Chart(df).mark_circle().encode(
    x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
    y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
    color="species"
)
  • What data encoding type makes the most sense for “species”: Quantitative, Ordinal, or Nominal? Does adding that abbreviation change the appearance of the chart?

Changing from "species" to "species:N" does not have any effect, because when there are strings in the column, Altair automatically defaults to a Nominal data type.

alt.Chart(df).mark_circle().encode(
    x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
    y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
    color="species:N"
)
  • What happens if you try to use the “Ordinal” encoding type for the x-axis? (Get rid of the scale part for this.)

Notice how different this looks. Also notice how the gap between 34 and 34.4 is the same as the gap between 36.6 and 36.7. By using the Ordinal data type, ":O", we are telling Altair to treat these as distinct categories, and that the numerical difference between the values is not important.

This chart definitely looks worse than with the default Quantitative encoding.

alt.Chart(df).mark_circle().encode(
    x=alt.X("bill_length_mm:O"),
    y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
    color="species"
)
  • Make a facet chart where the penguins are divided according to gender. (Go back to “Quantitative” encoding for the x channel.)

Here the data is divided by “gender”, and the different genders are put into different rows. That is what the row="gender" part means.

alt.Chart(df).mark_circle().encode(
    x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
    y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
    color="species:N",
    row="gender"
)

Here is the same thing, but putting different genders into different columns. This would be a good choice if you wanted to compare the flipper lengths between genders. If instead you wanted to compare the bill lengths between genders, then I think it would make more sense to use the above vertical facet chart.

alt.Chart(df).mark_circle().encode(
    x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
    y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
    color="species:N",
    column="gender"
)
  • Assume you want to compare flipper length. Would it make more sense to have the sub-charts appear in the same row or the same column?

Time to work on Worksheets 5-6

  • Yufei is here to help.

If you’re already finished with the worksheets, try redoing our example of dropping the rows with missing values, using each of the following approaches. (You’ll need to re-import the data, so the missing values are back.)

  • Using the pandas DataFrame method dropna and a suitable axis argument. (This is the best approach.)

  • Using the pandas DataFrame method apply, a suitable axis argument, and a function which takes as input a row and as output returns True if the row has any missing values. (This isn’t the “right” approach for dropping missing values, but apply can be used in a wide variety of contexts. We will study apply maybe as soon as Week 4.)
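If you get stuck, here is one possible sketch of both approaches on a hypothetical toy DataFrame (try it on the penguins data yourself). Rows are being removed, so the row labels are the ones changing in dropna, which is why it gets axis=0. In apply, the function receives one row at a time and the column labels disappear, which is why it gets axis=1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 each contain a missing value.
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "bill_length_mm": [39.1, np.nan, 46.8, 50.4],
        "gender": ["Male", "Female", np.nan, "Male"],
    }
)

# Approach 1: dropna removes every row that contains a missing value.
dropped = toy.dropna(axis=0)

# Approach 2: apply a function to each row (axis=1) that reports
# whether the row has any missing values, then use Boolean indexing.
has_missing = toy.apply(lambda row: row.isna().any(), axis=1)
via_apply = toy[~has_missing]

print(list(dropped.index))        # [0, 3]
print(dropped.equals(via_apply))  # True
```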