Week 3 Friday#
Announcements#
There was a typo in the original version of Worksheet 6: the Boolean indexing question should refer to “Midterm 2”, not to “Midterm 1”.
Midterm 1 is on Monday of Week 5. I’ll post a sample midterm during Week 4. Come to class next week to get the “note card” you can fill out with handwritten notes for the midterm.
I’m trying a shared notebook again (Deepnote thinks they fixed the issue).
I had some trouble installing the newest version of Altair this morning, so we’re going to use the “old” syntax and the version of Altair that’s already installed automatically in Deepnote. (In other words, we won’t use
!pip install
today.)
I’ve tried to make this lecture a little shorter so that there is time for you to practice with the material at the end.
Yufei is here to help with the worksheets and course material.
import seaborn as sns
df = sns.load_dataset("penguins")
The axis
keyword argument#
This is the only description of the axis keyword argument that I have found that works in every example:
The axis keyword argument indicates which axis labels are (potentially) changing. If the row labels could change, then we use axis=0. If the column labels could change, then we use axis=1.
Here is a way to remember which is 0
and which is 1
. When we check the shape
of a DataFrame, we get a tuple. The 0
entry in the tuple is the number of rows (changing row labels corresponds to axis=0
) and the 1
entry in the tuple corresponds to columns (changing column labels corresponds to axis=1
). For example, the following is telling us that df
has 344 rows and 7 columns.
df.shape
(344, 7)
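As a quick check of this convention, here is a minimal sketch on a made-up DataFrame (the name `toy` and its values are hypothetical, not the penguins data). Entry 0 of the shape tuple counts rows and entry 1 counts columns.

```python
import pandas as pd

# Hypothetical toy data: 3 rows and 2 columns
toy = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

n_rows = toy.shape[0]  # entry 0 of the tuple: number of rows (axis=0)
n_cols = toy.shape[1]  # entry 1 of the tuple: number of columns (axis=1)
print(n_rows, n_cols)
```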
In the penguins dataset, change the column named “island” so it is named “location” and change the column named “sex” so it is named “gender”. Use the pandas DataFrame method
rename
, and input a Python dictionary.
df.columns
Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'sex'],
dtype='object')
The following has no effect. The object inside the parentheses is an example of a Python dictionary. This is a very important data type that is built into Python.
df.rename({"island": "location", "sex":"gender"})
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
The code above is equivalent to the following (meaning axis=0 is the default). The following code looks through all of the row labels (0, 1, through 343), and if it finds a row named “island” or “sex”, it changes that name. No row has either name, so nothing is renamed.
df.rename({"island": "location", "sex":"gender"}, axis=0)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
We are trying to change the column names, so we should instead use the argument axis=1
. Notice how the “island” and “sex” columns have been renamed.
df.rename({"island": "location", "sex":"gender"}, axis=1)
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
Here is an example of renaming one of the row labels. Notice how we switch to axis=0
.
df.rename({3: "Chris Davis"}, axis=0)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
Chris Davis | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
It’s important to point out that we haven’t changed df
itself. A hint that we haven’t changed df
is that DataFrames were displayed as the result of our code. The code was creating new DataFrames, not changing the original DataFrame.
# so far, we haven't changed df itself
df
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
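The same behavior can be checked on a small made-up DataFrame. This is only a sketch: the names `toy` and `toy2` and the values are hypothetical stand-ins for the penguins data.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({"island": ["Dream", "Biscoe"], "sex": ["Male", "Female"]})

# rename returns a new DataFrame; here we store it under a new name
toy2 = toy.rename({"island": "location", "sex": "gender"}, axis=1)

print(list(toy.columns))   # the original column labels are unchanged
print(list(toy2.columns))  # the new DataFrame has the renamed labels
```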
If we want to change df
itself, we should assign the result back to df
. (Warning: this can be dangerous; it would be safer to call this DataFrame something different, like df2
. If you make a mistake at this stage, it’s very possible to destroy df
, in which case you will need to restart the notebook.)
# alternative: df.rename({"island": "location", "sex":"gender"}, axis=1, inplace=True), with no "df =" in front
df = df.rename({"island": "location", "sex":"gender"}, axis=1)
Another option to accomplish the same thing would be to use the inplace
keyword argument:
df.rename({"island": "location", "sex":"gender"}, axis=1, inplace=True)
Notice how we do not put df=
at the beginning when using the inplace
keyword argument. How would you know that option existed? You could check the documentation for the rename
method, as below.
help(df.rename)
Help on method rename in module pandas.core.frame:
rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore') method of pandas.core.frame.DataFrame instance
Alter axes labels.
Function / dict values must be unique (1-to-1). Labels not contained in
a dict / Series will be left as-is. Extra labels listed don't throw an
error.
See the :ref:`user guide <basics.rename>` for more.
Parameters
----------
mapper : dict-like or function
Dict-like or function transformations to apply to
that axis' values. Use either ``mapper`` and ``axis`` to
specify the axis to target with ``mapper``, or ``index`` and
``columns``.
index : dict-like or function
Alternative to specifying axis (``mapper, axis=0``
is equivalent to ``index=mapper``).
columns : dict-like or function
Alternative to specifying axis (``mapper, axis=1``
is equivalent to ``columns=mapper``).
axis : {0 or 'index', 1 or 'columns'}, default 0
Axis to target with ``mapper``. Can be either the axis name
('index', 'columns') or number (0, 1). The default is 'index'.
copy : bool, default True
Also copy underlying data.
inplace : bool, default False
Whether to return a new DataFrame. If True then value of copy is
ignored.
level : int or level name, default None
In case of a MultiIndex, only rename labels in the specified
level.
errors : {'ignore', 'raise'}, default 'ignore'
If 'raise', raise a `KeyError` when a dict-like `mapper`, `index`,
or `columns` contains labels that are not present in the Index
being transformed.
If 'ignore', existing keys will be renamed and extra keys will be
ignored.
Returns
-------
DataFrame or None
DataFrame with the renamed axis labels or None if ``inplace=True``.
Raises
------
KeyError
If any of the labels is not found in the selected axis and
"errors='raise'".
See Also
--------
DataFrame.rename_axis : Set the name of the axis.
Examples
--------
``DataFrame.rename`` supports two calling conventions
* ``(index=index_mapper, columns=columns_mapper, ...)``
* ``(mapper, axis={'index', 'columns'}, ...)``
We *highly* recommend using keyword arguments to clarify your
intent.
Rename columns using a mapping:
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
a c
0 1 4
1 2 5
2 3 6
Rename index using a mapping:
>>> df.rename(index={0: "x", 1: "y", 2: "z"})
A B
x 1 4
y 2 5
z 3 6
Cast index labels to a different type:
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
Traceback (most recent call last):
KeyError: ['C'] not found in axis
Using axis-style parameters:
>>> df.rename(str.lower, axis='columns')
a b
0 1 4
1 2 5
2 3 6
>>> df.rename({1: 2, 2: 4}, axis='index')
A B
0 1 4
2 2 5
4 3 6
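The help text describes two calling conventions and says that inplace=True makes the method return None. Here is a minimal sketch checking both claims on a made-up DataFrame (the names `toy` and `mapper` are hypothetical).

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins DataFrame
toy = pd.DataFrame({"island": ["Dream", "Biscoe"], "sex": ["Male", "Female"]})
mapper = {"island": "location", "sex": "gender"}

# The two calling conventions described in the help text
by_axis = toy.rename(mapper, axis=1)
by_keyword = toy.rename(columns=mapper)
print(by_axis.equals(by_keyword))

# With inplace=True the method returns None and changes toy itself
result = toy.rename(columns=mapper, inplace=True)
print(result)
print(list(toy.columns))
```

Because the return value is None when inplace=True is used, writing `toy = toy.rename(..., inplace=True)` would replace the DataFrame with None.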
Delete all the rows which contain missing values. (There is a shorter way to do this, but the following is a nice example of some useful pandas tools.)
First make a Boolean DataFrame indicating whether the data value is missing, using the
isna
method.
Notice how there are missing values. Here I am “slicing” from row 1 (inclusive, that’s not the top row) to row 5 (exclusive) just for a different example. I would usually use something like df.head(5)
instead. Notice that this slicing refers to rows, not to columns.
df[1:5]
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
Here is the row we saw above that has missing values.
# row labeled 3
df.loc[3]
species Adelie
location Torgersen
bill_length_mm NaN
bill_depth_mm NaN
flipper_length_mm NaN
body_mass_g NaN
gender NaN
Name: 3, dtype: object
This is just a reminder of how df.loc
works. We need to use the column label.
df.loc[3, "location"]
'Torgersen'
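Here is a small sketch of the same two ideas, row slicing and df.loc, on a made-up DataFrame (the name `toy` and its values are hypothetical).

```python
import pandas as pd

# Hypothetical toy data with the default row labels 0, 1, 2, 3
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "location": ["Torgersen", "Dream", "Biscoe", "Biscoe"],
})

rows = toy[1:3]                     # slicing selects rows 1 and 2 (3 is excluded)
one_row = toy.loc[2]                # the row labeled 2, returned as a Series
one_value = toy.loc[2, "location"]  # row label first, then column label
print(one_value)
```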
Notice the True
values in the row labeled 3
in the following. Those True
values correspond to the missing values we just saw.
df.isna()
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | True | True | True | True | True |
4 | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... |
339 | False | False | True | True | True | True | True |
340 | False | False | False | False | False | False | False |
341 | False | False | False | False | False | False | False |
342 | False | False | False | False | False | False | False |
343 | False | False | False | False | False | False | False |
344 rows × 7 columns
Then apply any with a suitable axis keyword argument to determine which rows have any missing data.
In this case, we want to keep the row labels the same, but we are getting rid of the column labels completely. That’s why we use axis=1
in this case. Also notice the True
in row 3
.
# notice: row names didn't change, column names disappeared
df.isna().any(axis=1)
0 False
1 False
2 False
3 True
4 False
...
339 True
340 False
341 False
342 False
343 False
Length: 344, dtype: bool
If we use axis=0
instead, then we are keeping the column labels the same, and finding out which columns have any missing values.
df.isna().any(axis=0)
species False
location False
bill_length_mm True
bill_depth_mm True
flipper_length_mm True
body_mass_g True
gender True
dtype: bool
There is also an all method (in contrast to the any method we are using). Notice how row 3 is now False, because it is not the case that all of the values in this row are missing.
df.isna().all(axis=1)
0 False
1 False
2 False
3 False
4 False
...
339 False
340 False
341 False
342 False
343 False
Length: 344, dtype: bool
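To see any and all side by side with both axis choices, here is a minimal sketch on a made-up DataFrame small enough to check by hand (the name `toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing one value, row 2 is missing both
toy = pd.DataFrame({"A": [1.0, np.nan, np.nan], "B": [4.0, 5.0, np.nan]})
missing = toy.isna()

any_by_row = missing.any(axis=1)  # column labels disappear, one value per row
all_by_row = missing.all(axis=1)  # True only if every value in the row is missing
any_by_col = missing.any(axis=0)  # row labels disappear, one value per column

print(any_by_row.tolist())
print(all_by_row.tolist())
print(any_by_col.tolist())
```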
Now use Boolean indexing like usual. You might need to take a negation, using tilde
~
.
If we plug in exactly what we have above, we will be doing the exact opposite of what we want. This is keeping the rows that have any missing values.
# opposite of what we want
df[df.isna().any(axis=1)]
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
8 | Adelie | Torgersen | 34.1 | 18.1 | 193.0 | 3475.0 | NaN |
9 | Adelie | Torgersen | 42.0 | 20.2 | 190.0 | 4250.0 | NaN |
10 | Adelie | Torgersen | 37.8 | 17.1 | 186.0 | 3300.0 | NaN |
11 | Adelie | Torgersen | 37.8 | 17.3 | 180.0 | 3700.0 | NaN |
47 | Adelie | Dream | 37.5 | 18.9 | 179.0 | 2975.0 | NaN |
246 | Gentoo | Biscoe | 44.5 | 14.3 | 216.0 | 4100.0 | NaN |
286 | Gentoo | Biscoe | 46.2 | 14.4 | 214.0 | 4650.0 | NaN |
324 | Gentoo | Biscoe | 47.3 | 13.8 | 216.0 | 4725.0 | NaN |
336 | Gentoo | Biscoe | 44.5 | 15.7 | 217.0 | 4875.0 | NaN |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
So we should take the negation, using tilde ~
. Here we are keeping the rows which do not have any missing values.
# get rid of rows with missing data
df = df[~df.isna().any(axis=1)]
df
| | species | location | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | gender |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
5 | Adelie | Torgersen | 39.3 | 20.6 | 190.0 | 3650.0 | Male |
... | ... | ... | ... | ... | ... | ... | ... |
338 | Gentoo | Biscoe | 47.2 | 13.7 | 214.0 | 4925.0 | Female |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
333 rows × 7 columns
Be sure to save the resulting DataFrame with the same name
df
. It should now have 333 rows.
df.shape
(333, 7)
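As a sanity check on the negation step, here is a sketch on a made-up DataFrame (the name `toy` and its values are hypothetical). It compares the negated mask with the same condition phrased directly using notna and all, which is my own rephrasing and not something we used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only row 0 has no missing values
toy = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, np.nan]})

has_missing = toy.isna().any(axis=1)  # True for rows with any missing value
complete = toy.notna().all(axis=1)    # True for rows with no missing values

# The tilde flips every True/False, so the two masks agree
print((~has_missing).equals(complete))

kept = toy[~has_missing]  # Boolean indexing keeps only the complete rows
print(kept.shape)
```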
Facet charts#
Display an Altair scatter chart showing bill length for the x-axis, flipper length for the y-axis, and color using species.
df.columns
Index(['species', 'location', 'bill_length_mm', 'bill_depth_mm',
'flipper_length_mm', 'body_mass_g', 'gender'],
dtype='object')
import altair as alt
alt.Chart(df).mark_circle().encode(
x="bill_length_mm",
y="flipper_length_mm",
color="species"
)
Use the domain 30-to-60 for the x-axis and 170-to-240 for the y-axis. (I had some trouble this morning installing the new version of Altair, so let’s use the old syntax.)
Here we are using a Python tuple
data type to specify the domains.
# old syntax, didn't pip install today
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
color="species"
)
What data encoding type makes the most sense for “species”: Quantitative, Ordinal, or Nominal? Does adding that abbreviation change the appearance of the chart?
Changing from "species"
to "species:N"
does not have any effect, because when there are strings in the column, Altair automatically defaults to a Nominal data type.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
color="species:N"
)
What happens if you try to use the “Ordinal” encoding type for the x-axis? (Get rid of the
scale
part for this.)
Notice how different this looks. Also notice how the gap between 34 and 34.4 is the same as the gap between 36.6 and 36.7. By using the Ordinal data type, ":O"
, we are telling Altair to treat these as distinct categories, and that the numerical difference between the values is not important.
This chart definitely looks worse than with the default Quantitative encoding.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_length_mm:O"),
y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
color="species"
)
Make a facet chart where the penguins are divided according to gender. (Go back to “Quantitative” encoding for the x channel.)
Here the data is divided by “gender”, and the different genders are put into different rows. That is what the row="gender"
part means.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
color="species:N",
row="gender"
)
Here is the same thing, but putting different genders into different columns. This would be a good choice if you wanted to compare the flipper lengths between genders. If instead you wanted to compare the bill lengths between genders, then I think it would make more sense to use the above vertical facet chart.
alt.Chart(df).mark_circle().encode(
x=alt.X("bill_length_mm", scale=alt.Scale(domain=(30,60))),
y=alt.Y("flipper_length_mm", scale=alt.Scale(domain=(170,240))),
color="species:N",
column="gender"
)
Assume you want to compare flipper length. Would it make more sense to have the sub-charts appear in the same row or the same column?
Time to work on Worksheets 5-6#
Yufei is here to help.
If you’re already finished with the worksheets, try to redo our example of dropping the rows with missing values using each of the following. (You’ll need to re-import the data, so the missing values are back.)
Using the pandas DataFrame method dropna and a suitable axis argument. (This is the best approach.)
Using the pandas DataFrame method apply, a suitable axis argument, and a function which takes as input a row and as output returns True if the row has any missing values. (This isn’t the “right” approach for dropping missing values, but apply can be used in a wide variety of contexts. We will study apply maybe as soon as Week 4.)