Week 5 Monday#

Announcements#

  • Video quizzes have been turned into “Practice Quizzes” (same links as before on Canvas). Scores on these practice quizzes won’t affect your grade, and you can take them as many times as you want. (About half the quizzes have multiple versions.)

  • No videos or video quizzes this Friday. After the midterm, we’ll start the Machine Learning portion of Math 10.

  • Midterm Wednesday. Covers material up to and including Week 4 Friday’s lecture and videos.

  • Pick up a notecard if you haven’t already. Hand-written notes, written on both sides, can be used during the midterm.

  • Sample midterm posted on the “Week 5” page on Canvas. I’ll ask William to go over as much of the sample midterm as possible on Tuesday. Solutions posted Tuesday afternoon at the latest. (If you want a hint on a specific question before then, please ask on Ed Discussion or come to my office hours.)

  • I have office hours after class, 11am, next door in ALP 3610.

map vs applymap vs apply#

Here are some important facts about these methods.

  • map is a Series method. applymap and apply are DataFrame methods. (Secretly apply can also be used on a Series, but we haven’t covered that.)

  • All three take a function as an input.

  • map and applymap apply that function on every entry. apply applies the function on an entire row (axis=1) or on an entire column (axis=0).

  • I think everyone would agree apply is the most difficult to understand of the three.
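These distinctions are easier to see on a small made-up DataFrame (the toy data and lambda functions below are just for illustration, not part of the penguins example):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# map: Series method, applied entry-by-entry to one column
doubled_col = toy["a"].map(lambda x: 2*x)

# applymap: DataFrame method, applied entry-by-entry to every entry
doubled_all = toy.applymap(lambda x: 2*x)

# apply: DataFrame method, applied to a whole column at a time (axis=0)
col_sums = toy.apply(lambda col: col.sum(), axis=0)
```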

import pandas as pd
import altair as alt
import seaborn as sns
df = sns.load_dataset("penguins")

The following attempts to multiply every element in df by 2. It doesn’t work because 2*x is an expression, not a function, and the variable x has not been defined, so Python raises a NameError.

df.applymap(2*x)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [5], line 1
----> 1 df.applymap(2*x)

NameError: name 'x' is not defined

Here we use a lambda function to define the function \(x \leadsto 2x\). Notice how this works even for string values.

df.applymap(lambda x: 2*x)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 AdelieAdelie TorgersenTorgersen 78.2 37.4 362.0 7500.0 MaleMale
1 AdelieAdelie TorgersenTorgersen 79.0 34.8 372.0 7600.0 FemaleFemale
2 AdelieAdelie TorgersenTorgersen 80.6 36.0 390.0 6500.0 FemaleFemale
3 AdelieAdelie TorgersenTorgersen NaN NaN NaN NaN NaN
4 AdelieAdelie TorgersenTorgersen 73.4 38.6 386.0 6900.0 FemaleFemale
... ... ... ... ... ... ... ...
339 GentooGentoo BiscoeBiscoe NaN NaN NaN NaN NaN
340 GentooGentoo BiscoeBiscoe 93.6 28.6 430.0 9700.0 FemaleFemale
341 GentooGentoo BiscoeBiscoe 100.8 31.4 444.0 11500.0 MaleMale
342 GentooGentoo BiscoeBiscoe 90.4 29.6 424.0 10400.0 FemaleFemale
343 GentooGentoo BiscoeBiscoe 99.8 32.2 426.0 10800.0 MaleMale

344 rows × 7 columns
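The doubled strings come from plain Python behavior, not anything pandas-specific: multiplying a string by an integer repeats the string.

```python
# Multiplying a string by an integer repeats it,
# which is why applymap turned "Adelie" into "AdelieAdelie".
doubled = 2 * "Adelie"
```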

Here is an example of getting a sub-DataFrame using iloc. Notice that the doubled values from above are not reflected. That is because we never saved the result of applymap. (A hint that df itself was unchanged: the result got displayed to the screen instead of being assigned to a variable.)

df.iloc[:2, 1:4]
island bill_length_mm bill_depth_mm
0 Torgersen 39.1 18.7
1 Torgersen 39.5 17.4

Many pandas methods accept an inplace keyword argument, but applymap is not one of them.

df.applymap(lambda x: 2*x, inplace=True)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [8], line 1
----> 1 df.applymap(lambda x: 2*x, inplace=True)

TypeError: applymap() got an unexpected keyword argument 'inplace'
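Since applymap does not support inplace, the usual pattern is to save the returned DataFrame yourself. A minimal sketch on a made-up DataFrame:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# applymap returns a new DataFrame; assign it back to keep the change
toy = toy.applymap(lambda x: 2*x)
```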

Here is a reminder of the rename method. Notice how we are also specifying axis=0, which tells pandas that it is the row labels which should be changed.

df.rename({2:"New row name"}, axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
New row name Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

Again, there was no change to df itself. (The row 2 is still called 2.)

df[:4]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN

Here we use the inplace keyword argument. Notice how nothing gets displayed.

# nothing got displayed; df changed
df.rename({2:"New row name"}, axis=0, inplace=True)

Now the row name did change.

df[:4]
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
New row name Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN

Now we switch to talking about apply. The use of the axis keyword argument is definitely confusing. See the Week 4 videos for the most consistent description I know for when to use axis=0 and when to use axis=1.

df.shape
(344, 7)

With axis=0, the function gets applied to one column at a time, so the result is labeled by the column names. The number 344 represents the length of each column.

df.apply(len, axis=0)
species              344
island               344
bill_length_mm       344
bill_depth_mm        344
flipper_length_mm    344
body_mass_g          344
sex                  344
dtype: int64

With axis=1, the function gets applied to one row at a time, so the result is labeled by the row index. The number 7 here represents the length of each row.

df.apply(len, axis=1)
0               7
1               7
New row name    7
3               7
4               7
               ..
339             7
340             7
341             7
342             7
343             7
Length: 344, dtype: int64

Here I’ll try to convince you that the row names did not change.

df.apply(len, axis=1).index
Index([             0,              1, 'New row name',              3,
                    4,              5,              6,              7,
                    8,              9,
       ...
                  334,            335,            336,            337,
                  338,            339,            340,            341,
                  342,            343],
      dtype='object', length=344)
df.index
Index([             0,              1, 'New row name',              3,
                    4,              5,              6,              7,
                    8,              9,
       ...
                  334,            335,            336,            337,
                  338,            339,            340,            341,
                  342,            343],
      dtype='object', length=344)
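As a compact illustration of the axis convention on a made-up DataFrame (toy and sum are just for demonstration):

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0: each *column* is passed to the function,
# so the result is labeled by the column names.
col_sums = toy.apply(lambda col: col.sum(), axis=0)

# axis=1: each *row* is passed to the function,
# so the result is labeled by the row index.
row_sums = toy.apply(lambda row: row.sum(), axis=1)
```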

Example of apply#

  • Using apply, subtract a suitable constant from all the numeric columns in the Penguins DataFrame so each numeric column has mean 0. (Use a different constant for each column.) Use pandas.api.types.is_numeric_dtype to determine if a column is numeric or not.

# step 1: find the numeric columns
pd.api.types.is_numeric_dtype(df["species"])
False
pd.api.types.is_numeric_dtype(df["bill_length_mm"])
True

It is cumbersome to have to write pd.api.types.is_numeric_dtype each time, so we can import it separately.

from pandas.api.types import is_numeric_dtype
is_numeric_dtype(df["bill_length_mm"])
True
is_numeric_dtype(float)
True

Here is a first step toward the list comprehension, where we just get the list version of df.columns (we are not removing anything yet).

# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns]
['species',
 'island',
 'bill_length_mm',
 'bill_depth_mm',
 'flipper_length_mm',
 'body_mass_g',
 'sex']

Now we add an if condition to just keep the numeric columns. Notice how above we typed is_numeric_dtype(df["bill_length_mm"]), and c is some value like "bill_length_mm", so we should use df[c].

# Using list comprehension, get a list of all the numeric columns in df
[c for c in df.columns if is_numeric_dtype(df[c]) == True]
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']

The == True is not doing anything here, so we can remove it.

num_cols = [c for c in df.columns if is_numeric_dtype(df[c])]
num_cols
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
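As an aside (not what the exercise asks for, which is is_numeric_dtype), pandas has a built-in shortcut, select_dtypes, that finds the numeric columns without a list comprehension. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"name": ["a", "b"], "x": [1.0, 2.0], "n": [3, 4]})

# select_dtypes("number") keeps only the numeric columns
numeric_part = toy.select_dtypes("number")
num_cols2 = list(numeric_part.columns)
```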

There was a question about whether c should be in quotation marks. It is an important distinction. If we put c in quotation marks, "c", then we just get the letter “c”, and the value of the variable c is irrelevant.

for c in df.columns:
    print("c")
c
c
c
c
c
c
c

Here is the same code without quotation marks around c.

for c in df.columns:
    print(c)
species
island
bill_length_mm
bill_depth_mm
flipper_length_mm
body_mass_g
sex

Here we get the sub-DataFrame consisting of the numeric columns, using df.loc.

df.loc[:, num_cols]
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

The following is an abbreviation for the same thing.

df[num_cols]
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

Let’s save this sub-DataFrame.

df_sub = df[num_cols]
df_sub
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
0 39.1 18.7 181.0 3750.0
1 39.5 17.4 186.0 3800.0
New row name 40.3 18.0 195.0 3250.0
3 NaN NaN NaN NaN
4 36.7 19.3 193.0 3450.0
... ... ... ... ...
339 NaN NaN NaN NaN
340 46.8 14.3 215.0 4850.0
341 50.4 15.7 222.0 5750.0
342 45.2 14.8 212.0 5200.0
343 49.9 16.1 213.0 5400.0

344 rows × 4 columns

We want to use some code like the following.

# Subtract a constant from each column so the mean is 0
df_sub.apply(???, axis=???)

Here is the correct formula.

# We're plugging in a column at a time, so use axis=0
df2 = df_sub.apply(lambda x: x-x.mean(), axis=0)

The code x-x.mean() can be confusing, because x represents a pandas Series while x.mean() represents a single number. Here is a specific example of subtracting a number from a pandas Series; pandas broadcasts the subtraction to every entry.

df["bill_length_mm"] - 30
0                9.1
1                9.5
New row name    10.3
3                NaN
4                6.7
                ... 
339              NaN
340             16.8
341             20.4
342             15.2
343             19.9
Name: bill_length_mm, Length: 344, dtype: float64

We want the columns to have mean 0. The following confirms that, up to floating-point precision, these means are 0 (each value shown is extremely close to 0).

df2.mean(axis=0)
bill_length_mm       6.648353e-16
bill_depth_mm       -2.825550e-15
flipper_length_mm   -1.130220e-14
body_mass_g          8.509892e-14
dtype: float64
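As an aside (an alternative approach, not the apply-based one the exercise asks for), pandas broadcasting can center all the columns at once: subtracting df_sub.mean(), which is a Series indexed by column name, lines up with the columns automatically. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# toy.mean() is a Series indexed by column name;
# subtracting it broadcasts across the columns.
centered = toy - toy.mean()
```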

More examples#

This doesn’t always work (for example, some websites block it), but sometimes pandas can read tables directly from websites. The pd.read_html function returns a list of the HTML tables (as DataFrames) that pandas can find on the page.

Here is a website for which pd.read_html does work.

link = "https://www.usclimatedata.com/climate/irvine/california/united-states/usca2494"

The function pd.read_html returns a list of DataFrames.

df_list = pd.read_html(link)
df_list[0]
Unnamed: 0 JanJa FebFe MarMa AprAp MayMa JunJu
0 Average high in ºF Av. high Hi 65.00 65.00 67.00 71.00 73.00 76.00
1 Average low in ºF Av. low Lo 47.00 48.00 49.00 51.00 56.00 59.00
2 Av. precipitation in inch Av. precip. Pre. 2.73 3.63 2.44 0.86 0.23 0.11
df_list[1]
Unnamed: 0 JulJu AugAu SepSe OctOc NovNo DecDe
0 Average high in ºF Av. high Hi 82.00 83.00 81.00 76.00 69.00 64.00
1 Average low in ºF Av. low Lo 62.00 63.00 61.00 57.00 51.00 47.00
2 Av. precipitation in inch Av. precip. Pre. 0.03 0.05 0.25 0.65 1.13 2.27

For the above two DataFrames, would it make more sense to stack them side-by-side or on top of each other? Definitely side-by-side, because we want the months to go from January to December. Since we are changing the columns axis (the number of columns goes from 7 to 14), we specify axis=1.

pd.concat((df_list[0], df_list[1]), axis=1)
Unnamed: 0 JanJa FebFe MarMa AprAp MayMa JunJu Unnamed: 0 JulJu AugAu SepSe OctOc NovNo DecDe
0 Average high in ºF Av. high Hi 65.00 65.00 67.00 71.00 73.00 76.00 Average high in ºF Av. high Hi 82.00 83.00 81.00 76.00 69.00 64.00
1 Average low in ºF Av. low Lo 47.00 48.00 49.00 51.00 56.00 59.00 Average low in ºF Av. low Lo 62.00 63.00 61.00 57.00 51.00 47.00
2 Av. precipitation in inch Av. precip. Pre. 2.73 3.63 2.44 0.86 0.23 0.11 Av. precipitation in inch Av. precip. Pre. 0.03 0.05 0.25 0.65 1.13 2.27
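One blemish in the concatenated result is the duplicated “Unnamed: 0” label column. One standard way to drop repeated column labels is a boolean mask from columns.duplicated. A sketch using made-up stand-ins for df_list[0] and df_list[1] (since read_html needs network access):

```python
import pandas as pd

# Toy stand-ins for the two climate tables
left = pd.DataFrame({"Unnamed: 0": ["high", "low"], "Jan": [65, 47]})
right = pd.DataFrame({"Unnamed: 0": ["high", "low"], "Jul": [82, 62]})

combined = pd.concat((left, right), axis=1)

# Keep only the first occurrence of each column label
deduped = combined.loc[:, ~combined.columns.duplicated()]
```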