Week 5 Monday#

Announcements#

  • Video quizzes have been turned into “Practice Quizzes” (same links as before on Canvas). Scores on these practice quizzes won’t affect your grade, and you can take them as many times as you want. (About half the quizzes have multiple versions.)

  • No videos or video quizzes this Friday. After the midterm, we’ll start the Machine Learning portion of Math 10.

  • Midterm Wednesday. Covers material up to and including Week 4 Friday’s lecture and videos.

  • Pick up a notecard if you haven’t already. Hand-written notes, written on both sides, can be used during the midterm.

  • Sample midterm posted on the “Week 5” page on Canvas. I’ll ask William to go over as much of the sample midterm as possible on Tuesday. Solutions posted Tuesday afternoon at the latest. (If you want a hint on a specific question before then, please ask on Ed Discussion or come to my office hours.)

  • I have office hours after class, 11am, next door in ALP 3610.

map vs applymap vs apply#

Here are some important facts about these methods.

  • map is a Series method. applymap and apply are DataFrame methods. (Secretly apply can also be used on a Series, but we haven’t covered that.)

  • All three take a function as an input.

  • map and applymap apply that function on every entry. apply applies the function on an entire row (axis=1) or on an entire column (axis=0).

  • I think everyone would agree apply is the most difficult to understand of the three.
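These distinctions are easier to see on a small made-up DataFrame (the toy data and lambda functions below are just for illustration, not part of the penguins example):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# map: Series method, applied entry-by-entry to one column
doubled_col = toy["a"].map(lambda x: 2*x)

# applymap: DataFrame method, applied entry-by-entry to every entry
doubled_all = toy.applymap(lambda x: 2*x)

# apply: DataFrame method, applied to a whole column at a time (axis=0)
col_sums = toy.apply(lambda col: col.sum(), axis=0)
```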

import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins")

The following attempts to multiply every element in df by 2. It doesn’t work because 2*x is an expression, not a function, and the variable x has not been defined, so Python raises a NameError.

df.applymap(2*x)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 df.applymap(2*x)

NameError: name 'x' is not defined

Here we use a lambda function to define the function \(x \leadsto 2x\). Notice how this works even for string values.

df.applymap(lambda x: 2*x)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 AdelieAdelie TorgersenTorgersen 78.2 37.4 362.0 7500.0 MaleMale
1 AdelieAdelie TorgersenTorgersen 79.0 34.8 372.0 7600.0 FemaleFemale
2 AdelieAdelie TorgersenTorgersen 80.6 36.0 390.0 6500.0 FemaleFemale
3 AdelieAdelie TorgersenTorgersen NaN NaN NaN NaN NaN
4 AdelieAdelie TorgersenTorgersen 73.4 38.6 386.0 6900.0 FemaleFemale
... ... ... ... ... ... ... ...
339 GentooGentoo BiscoeBiscoe NaN NaN NaN NaN NaN
340 GentooGentoo BiscoeBiscoe 93.6 28.6 430.0 9700.0 FemaleFemale
341 GentooGentoo BiscoeBiscoe 100.8 31.4 444.0 11500.0 MaleMale
342 GentooGentoo BiscoeBiscoe 90.4 29.6 424.0 10400.0 FemaleFemale
343 GentooGentoo BiscoeBiscoe 99.8 32.2 426.0 10800.0 MaleMale

344 rows × 7 columns
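The doubled strings come from plain Python behavior, not anything pandas-specific: multiplying a string by an integer repeats the string.

```python
# Multiplying a string by an integer repeats it,
# which is why applymap turned "Adelie" into "AdelieAdelie".
doubled = 2 * "Adelie"
```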

Here is an example of getting a sub-DataFrame using iloc. Notice that the doubled values from above are not reflected. That is because we never saved the result of applymap. (A hint that df itself was unchanged: the result got displayed to the screen instead of being assigned to a variable.)

df.iloc[:2, 1:4]
island bill_length_mm bill_depth_mm
0 Torgersen 39.1 18.7
1 Torgersen 39.5 17.4

Many pandas methods accept an inplace keyword argument, but applymap is not one of them.

df.applymap(lambda x: 2*x, inplace=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 df.applymap(lambda x: 2*x, inplace=True)

TypeError: applymap() got an unexpected keyword argument 'inplace'
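Since applymap does not support inplace, the usual pattern is to save the returned DataFrame yourself. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# applymap returns a new DataFrame; assign it back to keep the change
toy = toy.applymap(lambda x: 2*x)
```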

Here is a reminder of the rename method. Notice how we are also specifying axis=0, which tells pandas that it is the row labels which should be changed.

df.rename({2:"New row name"}, axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
New row name Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

Again, there was no change to df itself. (The row 2 is still called 2.)

df[:4]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN

Here we use the inplace keyword argument. Notice how nothing gets displayed.

# nothing got displayed; df changed
df.rename({2:"New row name"}, axis=0, inplace=True)

Now the row name did change.

df[:4]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
New row name Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN

Now we switch to talking about apply. The use of the axis keyword argument is definitely confusing. See the Week 4 videos for the most consistent description I know for when to use axis=0 and when to use axis=1.

df.shape
(344, 7)

With axis=0, the function gets applied to one column at a time, so the result is labeled by the column names. The number 344 represents the length of each column.

df.apply(len, axis=0)
species              344
island               344
bill_length_mm       344
bill_depth_mm        344
flipper_length_mm    344
body_mass_g          344
sex                  344
dtype: int64

With axis=1, the function gets applied to one row at a time, so the result is labeled by the row index. The number 7 here represents the length of each row.

df.apply(len, axis=1)
0               7
1               7
New row name    7
3               7
4               7
               ..
339             7
340             7
341             7
342             7
343             7
Length: 344, dtype: int64

Here I’ll try to convince you that the row names did not change.

df.apply(len, axis=1).index
Index([             0,              1, 'New row name',              3,
                    4,              5,              6,              7,
                    8,              9,
       ...
                  334,            335,            336,            337,
                  338,            339,            340,            341,
                  342,            343],
      dtype='object', length=344)
df.index
Index([             0,              1, 'New row name',              3,
                    4,              5,              6,              7,
                    8,              9,
       ...
                  334,            335,            336,            337,
                  338,            339,            340,            341,
                  342,            343],
      dtype='object', length=344)
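As a compact illustration of the axis convention on a made-up DataFrame (toy and sum are just for demonstration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: each *column* is passed to the function,
# so the result is labeled by the column names.
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: each *row* is passed to the function,
# so the result is labeled by the row index.
row_sums = toy.apply(lambda row: row.sum(), axis=1)
```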

Example of apply#

  • Using apply, subtract a suitable constant from all the numeric columns in the Penguins DataFrame so each numeric column has mean 0. (Use a different constant for each column.) Use pandas.api.types.is_numeric_dtype to determine if a column is numeric or not.

# step 1: find the numeric columns
pd.api.types.is_numeric_dtype(df["species"])
False
pd.api.types.is_numeric_dtype(df["bill_length_mm"])
True

It is cumbersome to have to write pd.api.types.is_numeric_dtype each time, so we can import it separately.

from pandas.api.types import is_numeric_dtype
is_numeric_dtype(df["bill_length_mm"])
True
is_numeric_dtype(float)
True

Here is a first step toward the list comprehension, where we just get the list version of df.columns (we are not removing anything yet).

# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns]
['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex']

Now we add an if condition to just keep the numeric columns. Notice how above we typed is_numeric_dtype(df["bill_length_mm"]), and c is some value like "bill_length_mm", so we should use df[c].

# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns if is_numeric_dtype(df[c]) == True]
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

The == True is not doing anything here, so we can remove it.

num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
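As an aside (not what the exercise asks for, which is is_numeric_dtype), pandas has a built-in shortcut, select_dtypes, that finds the numeric columns without a list comprehension. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b"], "x": [1.0, 2.0], "n": [3, 4]})

# select_dtypes("number") keeps only the numeric columns
numeric_part = toy.select_dtypes("number")
num_cols2 = list(numeric_part.columns)
```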

There was a question about whether c should be in quotation marks. It is an important distinction. If we put c in quotation marks, "c", then we just get the letter “c”, and the value of the variable c is irrelevant.

for c in df.columns:
    print("c")
c
c
c
c
c
c
c

Here is the same code without quotation marks around c.

for c in df.columns:
    print(c)
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex

Here we get the sub-DataFrame consisting of the numeric columns, using df.loc.

df.loc[:, num_cols]
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

The following is an abbreviation for the same thing.

df[num_cols]
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

Let’s save this sub-DataFrame.

df_sub = df[num_cols]
df_sub
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

We want to use some code like the following.

# Subtract a constant from each column so the mean is 0
df_sub.apply(???, axis=???)

Here is the correct formula.

# We're plugging in a column at a time, so use axis=0
df2 = df_sub.apply(lambda x: x-x.mean(), axis=0)

The code x-x.mean() can be confusing, because x represents a pandas Series while x.mean() represents a single number. Here is a specific example of subtracting a number from a pandas Series; pandas broadcasts the subtraction to every entry.

df["bill_length_mm"] - 30
0                9.1
1                9.5
New row name    10.3
3                NaN
4                6.7
                ... 
339              NaN
340             16.8
341             20.4
342             15.2
343             19.9
Name: bill_length_mm, Length: 344, dtype: float64

We want the columns to have mean 0. The following confirms that, up to floating-point precision, these means are 0 (each value shown is extremely close to 0).

df2.mean(axis=0)
bill_length_mm       6.648353e-16
bill_depth_mm       -2.825550e-15
flipper_length_mm   -1.130220e-14
body_mass_g          8.509892e-14
dtype: float64
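As an aside (an alternative approach, not the apply-based one the exercise asks for), pandas broadcasting can center all the columns at once: subtracting df_sub.mean(), which is a Series indexed by column name, lines up with the columns automatically. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# toy.mean() is a Series indexed by column name;
# subtracting it broadcasts across the columns.
centered = toy - toy.mean()
```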

More examples#

This doesn’t always work (for example, some websites block it), but sometimes pandas can read tables directly from websites. The pd.read_html function returns a list of the HTML tables (as DataFrames) that pandas can find on the page.

Here is a website for which pd.read_html does work.

link = "https://www.usclimatedata.com/climate/irvine/california/united-states/usca2494"

The function pd.read_html returns a list of DataFrames.

df_list = pd.read_html(link)
df_list[0]
Unnamed: 0 JanJa FebFe MarMa AprAp MayMa JunJu
0 Average high in ºF Av. high Hi 65.00 65.00 67.00 71.00 73.00 76.00
1 Average low in ºF Av. low Lo 47.00 48.00 49.00 51.00 56.00 59.00
2 Av. precipitation in inch Av. precip. Pre. 2.73 3.63 2.44 0.86 0.23 0.11
df_list[1]
Unnamed: 0 JulJu AugAu SepSe OctOc NovNo DecDe
0 Average high in ºF Av. high Hi 82.00 83.00 81.00 76.00 69.00 64.00
1 Average low in ºF Av. low Lo 62.00 63.00 61.00 57.00 51.00 47.00
2 Av. precipitation in inch Av. precip. Pre. 0.03 0.05 0.25 0.65 1.13 2.27

For the above two DataFrames, would it make more sense to stack them side-by-side or on top of each other? Definitely side-by-side, because we want the months to go from January to December. Since we are changing the columns axis (the number of columns goes from 7 to 14), we specify axis=1.

pd.concat((df_list[0], df_list[1]), axis=1)
Unnamed: 0 JanJa FebFe MarMa AprAp MayMa JunJu Unnamed: 0 JulJu AugAu SepSe OctOc NovNo DecDe
0 Average high in ºF Av. high Hi 65.00 65.00 67.00 71.00 73.00 76.00 Average high in ºF Av. high Hi 82.00 83.00 81.00 76.00 69.00 64.00
1 Average low in ºF Av. low Lo 47.00 48.00 49.00 51.00 56.00 59.00 Average low in ºF Av. low Lo 62.00 63.00 61.00 57.00 51.00 47.00
2 Av. precipitation in inch Av. precip. Pre. 2.73 3.63 2.44 0.86 0.23 0.11 Av. precipitation in inch Av. precip. Pre. 0.03 0.05 0.25 0.65 1.13 2.27
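One blemish in the concatenated result is the duplicated “Unnamed: 0” label column. One standard way to drop repeated column labels is a boolean mask from columns.duplicated. A sketch using made-up stand-ins for df_list[0] and df_list[1] (since read_html needs network access):

```python
import pandas as pd

# Toy stand-ins for the two climate tables
left = pd.DataFrame({"Unnamed: 0": ["high", "low"], "Jan": [65, 47]})
right = pd.DataFrame({"Unnamed: 0": ["high", "low"], "Jul": [82, 62]})

combined = pd.concat((left, right), axis=1)

# Keep only the first occurrence of each column label
deduped = combined.loc[:, ~combined.columns.duplicated()]
```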