Week 5 Monday#
Announcements#
Video quizzes have been turned into “Practice Quizzes” (same links as before on Canvas). Scores on these practice quizzes won’t affect your grade, and you can take them as many times as you want. (About half the quizzes have multiple versions.)
No videos or video quizzes this Friday. After the midterm, we’ll start the Machine Learning portion of Math 10.
Midterm Wednesday. Covers material up to and including Week 4 Friday’s lecture and videos.
Pick up a notecard if you haven’t already. Hand-written notes, written on both sides, can be used during the midterm.
Sample midterm posted on the “Week 5” page on Canvas. I’ll ask William to go over as much of the sample midterm as possible on Tuesday. Solutions posted Tuesday afternoon at the latest. (If you want a hint on a specific question before then, please ask on Ed Discussion or come to my office hours.)
I have office hours after class, 11am, next door in ALP 3610.
`map` vs `applymap` vs `apply`#
Here are some important facts about these methods.

- `map` is a Series method. `applymap` and `apply` are DataFrame methods. (Secretly `apply` can also be used on a Series, but we haven’t covered that.)
- All three take a function as an input.
- `map` and `applymap` apply that function to every entry.
- `apply` applies the function to an entire row (`axis=1`) or to an entire column (`axis=0`).
- I think everyone would agree `apply` is the most difficult to understand of the three.
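As a quick contrast of the three methods, here is a minimal sketch on a small hypothetical DataFrame (`df_demo` is made up for illustration; it is not the penguins data):

```python
import pandas as pd

# a tiny hypothetical DataFrame, just to contrast the three methods
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# map: a Series method, applied to every entry of one column
doubled_a = df_demo["a"].map(lambda x: 2 * x)

# applymap: a DataFrame method, applied to every entry
doubled = df_demo.applymap(lambda x: 2 * x)

# apply: a DataFrame method, applied to a whole column (axis=0) or row (axis=1)
col_sums = df_demo.apply(lambda col: col.sum(), axis=0)
row_sums = df_demo.apply(lambda row: row.sum(), axis=1)
```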
import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins")
The following attempts to multiply every element in `df` by `2`. It doesn’t work because `2*x` is not a function in Python.
df.applymap(2*x)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [5], line 1
----> 1 df.applymap(2*x)
NameError: name 'x' is not defined
Here we use a lambda function to define the function \(x \leadsto 2x\). Notice how this works even for string values.
df.applymap(lambda x: 2*x)
  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | AdelieAdelie | TorgersenTorgersen | 78.2 | 37.4 | 362.0 | 7500.0 | MaleMale |
1 | AdelieAdelie | TorgersenTorgersen | 79.0 | 34.8 | 372.0 | 7600.0 | FemaleFemale |
2 | AdelieAdelie | TorgersenTorgersen | 80.6 | 36.0 | 390.0 | 6500.0 | FemaleFemale |
3 | AdelieAdelie | TorgersenTorgersen | NaN | NaN | NaN | NaN | NaN |
4 | AdelieAdelie | TorgersenTorgersen | 73.4 | 38.6 | 386.0 | 6900.0 | FemaleFemale |
... | ... | ... | ... | ... | ... | ... | ... |
339 | GentooGentoo | BiscoeBiscoe | NaN | NaN | NaN | NaN | NaN |
340 | GentooGentoo | BiscoeBiscoe | 93.6 | 28.6 | 430.0 | 9700.0 | FemaleFemale |
341 | GentooGentoo | BiscoeBiscoe | 100.8 | 31.4 | 444.0 | 11500.0 | MaleMale |
342 | GentooGentoo | BiscoeBiscoe | 90.4 | 29.6 | 424.0 | 10400.0 | FemaleFemale |
343 | GentooGentoo | BiscoeBiscoe | 99.8 | 32.2 | 426.0 | 10800.0 | MaleMale |
344 rows × 7 columns
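The doubled strings in the table above (like “AdelieAdelie”) come from Python’s string multiplication: multiplying a string by `2` concatenates it with itself.

```python
# multiplying a string by an integer repeats it, which explains "AdelieAdelie"
s = "Adelie"
doubled = 2 * s
```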
Here is an example of getting a sub-DataFrame using `iloc`. Notice how the doubled values above are not reflected. That is because we did not save the result. (A hint that nothing was saved: the result got displayed to the screen.)
df.iloc[:2, 1:4]
  | island | bill_length_mm | bill_depth_mm |
---|---|---|---|
0 | Torgersen | 39.1 | 18.7 |
1 | Torgersen | 39.5 | 17.4 |
Sometimes the `inplace` keyword argument can be used, but not for `applymap`.
df.applymap(lambda x: 2*x, inplace=True)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [8], line 1
----> 1 df.applymap(lambda x: 2*x, inplace=True)
TypeError: applymap() got an unexpected keyword argument 'inplace'
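Since `applymap` has no `inplace` option, the usual pattern is to save the result by assigning it to a name. A minimal sketch on a small hypothetical DataFrame (so we don’t overwrite the penguins data):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# applymap has no inplace keyword; assign the result back instead
df_demo = df_demo.applymap(lambda x: 2 * x)
```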
Here is a reminder of the `rename` method. Notice how we are also specifying `axis=0`, which tells pandas that it is the row labels that should be changed.
df.rename({2:"New row name"}, axis=0)
  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
New row name | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
... | ... | ... | ... | ... | ... | ... | ... |
339 | Gentoo | Biscoe | NaN | NaN | NaN | NaN | NaN |
340 | Gentoo | Biscoe | 46.8 | 14.3 | 215.0 | 4850.0 | Female |
341 | Gentoo | Biscoe | 50.4 | 15.7 | 222.0 | 5750.0 | Male |
342 | Gentoo | Biscoe | 45.2 | 14.8 | 212.0 | 5200.0 | Female |
343 | Gentoo | Biscoe | 49.9 | 16.1 | 213.0 | 5400.0 | Male |
344 rows × 7 columns
Again, there was no change to `df` itself. (The row `2` is still labeled `2`.)
df[:4]
  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
Here we use the `inplace` keyword argument. Notice how nothing gets displayed.
# nothing got displayed; df changed
df.rename({2:"New row name"}, axis=0, inplace=True)
Now the row name did change.
df[:4]
  | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|---|
0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
New row name | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
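The two ways of keeping a `rename` can be summarized on a small hypothetical DataFrame: either save the returned copy, or use `inplace=True`.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2, 3]})

# option 1: rename returns a modified copy, which we save under a new name
df_new = df_demo.rename({2: "New row name"}, axis=0)

# option 2: inplace=True modifies df_demo itself (and returns None)
df_demo.rename({2: "New row name"}, axis=0, inplace=True)
```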
Now we switch to talking about `apply`. The use of the `axis` keyword argument is definitely confusing. See the Week 4 videos for the most consistent description I know of when to use `axis=0` and when to use `axis=1`.
df.shape
(344, 7)
Here we are keeping the column names the same. The number `344` represents the length of each column.
df.apply(len, axis=0)
species 344
island 344
bill_length_mm 344
bill_depth_mm 344
flipper_length_mm 344
body_mass_g 344
sex 344
dtype: int64
The number `7` here represents the length of each row.
df.apply(len, axis=1)
0 7
1 7
New row name 7
3 7
4 7
..
339 7
340 7
341 7
342 7
343 7
Length: 344, dtype: int64
Here I’ll try to convince you that `apply` did not change the row labels: the output has the same index as `df`.
df.apply(len, axis=1).index
Index([ 0, 1, 'New row name', 3,
4, 5, 6, 7,
8, 9,
...
334, 335, 336, 337,
338, 339, 340, 341,
342, 343],
dtype='object', length=344)
df.index
Index([ 0, 1, 'New row name', 3,
4, 5, 6, 7,
8, 9,
...
334, 335, 336, 337,
338, 339, 340, 341,
342, 343],
dtype='object', length=344)
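Rather than eyeballing the two indexes, we can check that they match programmatically with the `equals` method of a pandas Index. A quick sketch on a small hypothetical DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
result = df_demo.apply(len, axis=1)

# apply keeps the row labels, so the two indexes are identical
same_index = result.index.equals(df_demo.index)
```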
Example of `apply`#
Using `apply`, subtract a suitable constant from each of the numeric columns in the penguins DataFrame so that each numeric column has mean 0. (Use a different constant for each column.) Use `pandas.api.types.is_numeric_dtype` to determine whether or not a column is numeric.
# step 1: find the numeric columns
pd.api.types.is_numeric_dtype(df["species"])
False
pd.api.types.is_numeric_dtype(df["bill_length_mm"])
True
It is cumbersome to have to write `pd.api.types.is_numeric_dtype` each time, so we can import the function separately.
from pandas.api.types import is_numeric_dtype
is_numeric_dtype(df["bill_length_mm"])
True
is_numeric_dtype(float)
True
Here is a first step toward the list comprehension, where we just get the list version of `df.columns` (we are not removing anything yet).
# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns]
['species',
'island',
'bill_length_mm',
'bill_depth_mm',
'flipper_length_mm',
'body_mass_g',
'sex']
Now we add an `if` condition to keep just the numeric columns. Notice how above we typed `is_numeric_dtype(df["bill_length_mm"])`, and `c` is some value like `"bill_length_mm"`, so we should use `df[c]`.
# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns if is_numeric_dtype(df[c]) == True]
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
The `== True` is not doing anything here, so we can remove it.
num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
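As an aside, pandas also has a built-in shortcut for this: `DataFrame.select_dtypes` returns the numeric sub-DataFrame directly, without a list comprehension. A minimal sketch on a small hypothetical DataFrame:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df_demo = pd.DataFrame({"s": ["a", "b"], "x": [1.0, 2.0], "n": [3, 4]})

# the list-comprehension approach from above
num_cols_demo = [c for c in df_demo.columns if is_numeric_dtype(df_demo[c])]

# built-in equivalent: keep only the numeric columns
num_df = df_demo.select_dtypes("number")
```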
There was a question about whether `c` should be in quotation marks. It is an important distinction. If we put `c` in quotation marks, `"c"`, then we just get the letter “c”, and the value of the variable `c` is irrelevant.
for c in df.columns:
    print("c")
c
c
c
c
c
c
c
Here is the same code without quotation marks around `c`.
for c in df.columns:
    print(c)
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex
Here we get the sub-DataFrame consisting of the numeric columns, using `df.loc`.
df.loc[:, num_cols]
  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | 39.5 | 17.4 | 186.0 | 3800.0 |
New row name | 40.3 | 18.0 | 195.0 | 3250.0 |
3 | NaN | NaN | NaN | NaN |
4 | 36.7 | 19.3 | 193.0 | 3450.0 |
... | ... | ... | ... | ... |
339 | NaN | NaN | NaN | NaN |
340 | 46.8 | 14.3 | 215.0 | 4850.0 |
341 | 50.4 | 15.7 | 222.0 | 5750.0 |
342 | 45.2 | 14.8 | 212.0 | 5200.0 |
343 | 49.9 | 16.1 | 213.0 | 5400.0 |
344 rows × 4 columns
The following is an abbreviation for the same thing.
df[num_cols]
  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | 39.5 | 17.4 | 186.0 | 3800.0 |
New row name | 40.3 | 18.0 | 195.0 | 3250.0 |
3 | NaN | NaN | NaN | NaN |
4 | 36.7 | 19.3 | 193.0 | 3450.0 |
... | ... | ... | ... | ... |
339 | NaN | NaN | NaN | NaN |
340 | 46.8 | 14.3 | 215.0 | 4850.0 |
341 | 50.4 | 15.7 | 222.0 | 5750.0 |
342 | 45.2 | 14.8 | 212.0 | 5200.0 |
343 | 49.9 | 16.1 | 213.0 | 5400.0 |
344 rows × 4 columns
Let’s save this sub-DataFrame.
df_sub = df[num_cols]
df_sub
  | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g |
---|---|---|---|---|
0 | 39.1 | 18.7 | 181.0 | 3750.0 |
1 | 39.5 | 17.4 | 186.0 | 3800.0 |
New row name | 40.3 | 18.0 | 195.0 | 3250.0 |
3 | NaN | NaN | NaN | NaN |
4 | 36.7 | 19.3 | 193.0 | 3450.0 |
... | ... | ... | ... | ... |
339 | NaN | NaN | NaN | NaN |
340 | 46.8 | 14.3 | 215.0 | 4850.0 |
341 | 50.4 | 15.7 | 222.0 | 5750.0 |
342 | 45.2 | 14.8 | 212.0 | 5200.0 |
343 | 49.9 | 16.1 | 213.0 | 5400.0 |
344 rows × 4 columns
We want to use some code like the following.
# Subtract a constant from each column so the mean is 0
df_sub.apply(???, axis=???)
Here is the correct formula.
# We're plugging in a column at a time, so use axis=0
df2 = df_sub.apply(lambda x: x-x.mean(), axis=0)
The code `x-x.mean()` can be confusing, because `x` represents a pandas Series while `x.mean()` is a single number. Here is a specific example of subtracting a number from a pandas Series in Python.
df["bill_length_mm"] - 30
0 9.1
1 9.5
New row name 10.3
3 NaN
4 6.7
...
339 NaN
340 16.8
341 20.4
342 15.2
343 19.9
Name: bill_length_mm, Length: 344, dtype: float64
We wanted the columns to have mean 0. The following shows that, up to numerical precision issues, the means of the new columns are indeed 0 (or at least very close to 0).
df2.mean(axis=0)
bill_length_mm 6.648353e-16
bill_depth_mm -2.825550e-15
flipper_length_mm -1.130220e-14
body_mass_g 8.509892e-14
dtype: float64
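The whole mean-centering computation can be checked end-to-end on a tiny hypothetical DataFrame:

```python
import pandas as pd

df_small = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 60.0]})

# axis=0: plug in one column at a time, subtracting that column's mean
centered = df_small.apply(lambda col: col - col.mean(), axis=0)

# every column of the result should have mean (essentially) zero
means = centered.mean(axis=0)
```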
More examples#
This doesn’t always work (for example, some websites block it), but pandas can often read tables directly from websites. The `pd.read_html` function returns a list of the HTML tables (as DataFrames) that pandas can find on the page.

Here are some websites for which `pd.read_html` does work.
https://www.usclimatedata.com/climate/irvine/california/united-states/usca2494
https://www.usclimatedata.com/climate/new-york/new-york/united-states/usny0996
link = "https://www.usclimatedata.com/climate/irvine/california/united-states/usca2494"
The function `pd.read_html` returns a list of DataFrames.
df_list = pd.read_html(link)
df_list[0]
  | Unnamed: 0 | JanJa | FebFe | MarMa | AprAp | MayMa | JunJu |
---|---|---|---|---|---|---|---|
0 | Average high in ºF Av. high Hi | 65.00 | 65.00 | 67.00 | 71.00 | 73.00 | 76.00 |
1 | Average low in ºF Av. low Lo | 47.00 | 48.00 | 49.00 | 51.00 | 56.00 | 59.00 |
2 | Av. precipitation in inch Av. precip. Pre. | 2.73 | 3.63 | 2.44 | 0.86 | 0.23 | 0.11 |
df_list[1]
  | Unnamed: 0 | JulJu | AugAu | SepSe | OctOc | NovNo | DecDe |
---|---|---|---|---|---|---|---|
0 | Average high in ºF Av. high Hi | 82.00 | 83.00 | 81.00 | 76.00 | 69.00 | 64.00 |
1 | Average low in ºF Av. low Lo | 62.00 | 63.00 | 61.00 | 57.00 | 51.00 | 47.00 |
2 | Av. precipitation in inch Av. precip. Pre. | 0.03 | 0.05 | 0.25 | 0.65 | 1.13 | 2.27 |
For the above two DataFrames, would it make more sense to stack them side-by-side or on top of each other? Definitely side-by-side, because we want the months to go from January to December. Since we are changing the columns axis (the number of columns goes from 7 to 14), we specify `axis=1`.
.
pd.concat((df_list[0], df_list[1]), axis=1)
  | Unnamed: 0 | JanJa | FebFe | MarMa | AprAp | MayMa | JunJu | Unnamed: 0 | JulJu | AugAu | SepSe | OctOc | NovNo | DecDe |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Average high in ºF Av. high Hi | 65.00 | 65.00 | 67.00 | 71.00 | 73.00 | 76.00 | Average high in ºF Av. high Hi | 82.00 | 83.00 | 81.00 | 76.00 | 69.00 | 64.00 |
1 | Average low in ºF Av. low Lo | 47.00 | 48.00 | 49.00 | 51.00 | 56.00 | 59.00 | Average low in ºF Av. low Lo | 62.00 | 63.00 | 61.00 | 57.00 | 51.00 | 47.00 |
2 | Av. precipitation in inch Av. precip. Pre. | 2.73 | 3.63 | 2.44 | 0.86 | 0.23 | 0.11 | Av. precipitation in inch Av. precip. Pre. | 0.03 | 0.05 | 0.25 | 0.65 | 1.13 | 2.27 |
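The effect of `axis=1` in `pd.concat` can be seen on two small hypothetical DataFrames: the column counts add up while the row count stays the same.

```python
import pandas as pd

left = pd.DataFrame({"Jan": [65.0], "Feb": [65.0]})
right = pd.DataFrame({"Jul": [82.0], "Aug": [83.0]})

# axis=1 stacks side-by-side: 2 columns + 2 columns = 4 columns, still 1 row
wide = pd.concat((left, right), axis=1)
```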