Week 2 Wednesday#
Announcements#
Change to Wednesday Office Hours Location: Let’s try ALP 2800 this Wednesday (today) at 1pm.
Worksheet 4 uses some material from Worksheet 3, so definitely start with Worksheet 3.
Reminder: Week 2 videos and video quizzes are due before lecture on Friday.
On Friday I hope to introduce plotting in Python, briefly covering each of Matplotlib (similar to plotting in Matlab), Altair, Seaborn, and Plotly (Altair, Seaborn, and Plotly are similar to each other).
import pandas as pd
df = pd.read_csv("indexData.csv")
Review with Boolean indexing and map#
Define df_sub to be the sub-DataFrame of df which corresponds to the Hong Kong stock exchange (abbreviation: “HSI”) in 2008 (i.e., the last two characters in the “Date” column are “08”).
Comment. It would be more robust to convert the “Date” column to datetime type, like you need to do in Worksheet 3 and in Worksheet 4, but what we’re doing here is easier. Do not use pd.to_datetime(df["Date"]) in this particular example, because the dates from before 2000 will be parsed incorrectly (a two-digit year like “65” gets read as 2065, not 1965).
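As a quick illustration of the pitfall (a sketch; the exact two-digit-year cutoff can depend on the parser version):
# A two-digit year like "65" gets read as 2065, not 1965
pd.to_datetime("12/31/65")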
Let’s remind ourselves which columns are in this dataset.
df.columns
Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Don’t try to solve a question like this one all at once. Instead, break it into parts. Here we make a first Boolean Series.
# Boolean Series for Hong Kong stock exchange
ser1 = (df["Name"] == "HSI")
ser1
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Name, Length: 112457, dtype: bool
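Before moving on, here is a quick optional sanity check that our comparison matched some rows. (True counts as 1 in a sum, so this counts how many rows correspond to HSI.)
# How many rows matched? True counts as 1 in a sum.
ser1.sum()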
Here we get ready to make the second Boolean Series.
df["Date"].map(lambda s: s[-2:])
0 65
1 66
2 66
3 66
4 66
...
112452 21
112453 21
112454 21
112455 21
112456 21
Name: Date, Length: 112457, dtype: object
Why do we get False at the top of the following output, since four of these rows are equal to 66? The problem is that the Series contains strings, not integers. (Notice how the dtype is reported as object, which in this case is a hint that the elements are strings.)
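The same mismatch happens in plain Python, which is exactly what goes wrong in the next comparison:
# A string is never equal to an integer
"66" == 66  # evaluates to False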
df["Date"].map(lambda s: s[-2:]) == 66
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Date, Length: 112457, dtype: bool
Here we are careful to use a string instead of an integer when we check for equality. We also switch to "08", since that is the year we are interested in.
ser2 = (df["Date"].map(lambda s: s[-2:]) == "08")
ser2
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Date, Length: 112457, dtype: bool
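As an aside, the same Boolean Series can be built with pandas’ .str accessor, which slices every string in the Series at once; this sketch should be equivalent to the map version above.
# Equivalent to the map approach above
ser2_alt = (df["Date"].str[-2:] == "08")
ser2_alt.equals(ser2)  # expect True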
Now we are finally ready to use Boolean indexing. Remember that you can combine two Boolean Series elementwise using &.
# Boolean indexing
df_sub = df[ser1 & ser2]
df_sub
 | Name | Date | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|---|---|
32076 | HSI | 1/2/08 | 27632.19922 | 27853.59961 | 27299.44922 | 27560.51953 | 27560.51953 | 1.232143e+09 |
32077 | HSI | 1/3/08 | 27050.02930 | 27223.71094 | 26864.13086 | 26887.27930 | 26887.27930 | 2.442744e+09 |
32078 | HSI | 1/4/08 | 27004.33984 | 27596.85938 | 26994.84961 | 27519.68945 | 27519.68945 | 2.375522e+09 |
32079 | HSI | 1/7/08 | 26962.53906 | 27186.07031 | 26698.53906 | 27179.49023 | 27179.49023 | 2.452932e+09 |
32080 | HSI | 1/8/08 | 27466.96094 | 27637.59961 | 27088.69922 | 27112.90039 | 27112.90039 | 2.492361e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
32318 | HSI | 12/23/08 | 14472.03027 | 14491.75000 | 14084.86035 | 14220.79004 | 14220.79004 | 1.494155e+09 |
32319 | HSI | 12/24/08 | 13855.88965 | 14300.70020 | 13855.88965 | 14184.13965 | 14184.13965 | 8.654987e+08 |
32320 | HSI | 12/29/08 | 14080.86035 | 14332.01953 | 13924.32031 | 14328.48047 | 14328.48047 | 1.046622e+09 |
32321 | HSI | 12/30/08 | 14476.74023 | 14513.48047 | 14189.08008 | 14235.50000 | 14235.50000 | 1.441716e+09 |
32322 | HSI | 12/31/08 | 14377.50977 | 14527.09961 | 14302.15039 | 14387.48047 | 14387.48047 | 1.612196e+09 |
247 rows × 8 columns
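As an aside, the two Series can also be written inline. If you do that, keep the parentheses around each comparison, because & binds more tightly than == in Python. A one-line sketch equivalent to the above:
# Same sub-DataFrame in one line; the parentheses are required
df_sub = df[(df["Name"] == "HSI") & (df["Date"].map(lambda s: s[-2:]) == "08")]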
Using copy to prevent a specific warning#
Create a new column in df_sub, named “Month”, corresponding to the month as a string like “July”. What warning shows up? How can we avoid this warning in the future?
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
/tmp/ipykernel_88/2393649687.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
Briefly, this warning is showing up because df_sub might not be its own DataFrame, but instead might be a “view” of a portion of the original df DataFrame.
Don’t worry about that distinction. Instead, know that we can avoid this warning by returning to the line where we created df_sub and using the copy method.
df_sub = df[ser1 & ser2].copy()
Now the exact same call as above works without raising a warning.
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
Let’s check that we really do have a new “Month” column.
df_sub.head(3)
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32076 | HSI | 1/2/08 | 27632.19922 | 27853.59961 | 27299.44922 | 27560.51953 | 27560.51953 | 1.232143e+09 | January |
32077 | HSI | 1/3/08 | 27050.02930 | 27223.71094 | 26864.13086 | 26887.27930 | 26887.27930 | 2.442744e+09 | January |
32078 | HSI | 1/4/08 | 27004.33984 | 27596.85938 | 26994.84961 | 27519.68945 | 27519.68945 | 2.375522e+09 | January |
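As an aside, another way to avoid the warning is to build the copy and the new column in one chained expression with the assign method (a sketch equivalent to the two steps above, not something we need for this course):
# assign returns a brand-new DataFrame, so there is no view to warn about
df_sub = df[ser1 & ser2].assign(
    Month=lambda d: pd.to_datetime(d["Date"]).dt.month_name()
)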
Finding missing values with isna#
Is there any missing data in this sub-DataFrame? In how many rows? In which rows?
Missing values in pandas and in NumPy are denoted by some variant of np.nan, which stands for “not a number”. We can detect them by using the isna method.
df_sub.isna()
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32076 | False | False | False | False | False | False | False | False | False |
32077 | False | False | False | False | False | False | False | False | False |
32078 | False | False | False | False | False | False | False | False | False |
32079 | False | False | False | False | False | False | False | False | False |
32080 | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32318 | False | False | False | False | False | False | False | False | False |
32319 | False | False | False | False | False | False | False | False | False |
32320 | False | False | False | False | False | False | False | False | False |
32321 | False | False | False | False | False | False | False | False | False |
32322 | False | False | False | False | False | False | False | False | False |
247 rows × 9 columns
If all we care about is whether there are any missing values in a row, we can use the axis=1 keyword argument to the any method. We may talk about axis=1 more later; for now, just know that it means we are plugging in one row at a time (as opposed to one column at a time).
# Which rows have missing data?
df_sub.isna().any(axis=1)
32076 False
32077 False
32078 False
32079 False
32080 False
...
32318 False
32319 False
32320 False
32321 False
32322 False
Length: 247, dtype: bool
Here we count how many rows in df_sub have missing data.
# How many rows have missing data?
df_sub.isna().any(axis=1).sum()
2
Here we use Boolean indexing to keep only those rows with missing data.
# Use Boolean indexing to keep the rows with missing data
df_sub[df_sub.isna().any(axis=1)]
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32222 | HSI | 8/6/08 | NaN | NaN | NaN | NaN | NaN | NaN | August |
32234 | HSI | 8/22/08 | NaN | NaN | NaN | NaN | NaN | NaN | August |
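If we wanted to discard these two rows rather than just locate them, pandas provides the dropna method (shown as a sketch only; we keep the rows for the rest of this example).
# Drop every row that contains any missing value
df_clean = df_sub.dropna()
df_clean.shape  # expect (245, 9), i.e., 247 rows minus the 2 above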
Back to the axis keyword argument. Here is what happens if we use axis=0, so that we are plugging in one column at a time.
# axis=0: which columns have missing data?
df_sub.isna().any(axis=0)
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
Month False
dtype: bool
Aside: How can you take the elementwise negation of a Boolean Series? For example, which columns do not have missing values?
To take an elementwise negation in NumPy or pandas, use the tilde symbol, ~.
~df_sub.isna().any(axis=0)
Name True
Date True
Open False
High False
Low False
Close False
Adj Close False
Volume False
Month True
dtype: bool
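One natural use of the negated Series is Boolean indexing along the columns, via .loc; this sketch keeps only the columns with no missing values.
# Keep only the columns that have no missing values
df_sub.loc[:, ~df_sub.isna().any(axis=0)]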
The DataFrame method groupby#
When I first learned the groupby method, I found it very confusing. I wanted to know: is df_sub.groupby("Month") a DataFrame? A list of DataFrames? But this isn’t really the correct object-oriented perspective. In fact, df_sub.groupby("Month") is its own special type of object, a pandas GroupBy object.
There are two things I want you to know about GroupBy objects.
1. If you apply a method like mean, what is the result?
2. If you iterate over a GroupBy object, what values do you get?
Call df_sub.groupby("Month").mean(). What information is this conveying?
For example, in the following output, the 24608 at the top left means that for the month of April, the average value in the “Open” column was approximately 24608.
df_sub.groupby("Month").mean()
Month | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
April | 24608.502325 | 24777.395183 | 24403.666760 | 24596.926526 | 24596.926526 | 2.360510e+09 |
August | 21486.547287 | 21691.237459 | 21236.333779 | 21434.854236 | 21434.854236 | 1.836207e+09 |
December | 14588.742839 | 14802.577615 | 14370.430571 | 14592.815198 | 14592.815198 | 1.962145e+09 |
February | 23876.495271 | 24103.789063 | 23586.759458 | 23847.546773 | 23847.546773 | 2.225153e+09 |
January | 25576.140359 | 25891.002664 | 25035.877931 | 25401.379350 | 25401.379350 | 3.070506e+09 |
July | 22088.403942 | 22258.997692 | 21906.613015 | 22078.589755 | 22078.589755 | 2.053009e+09 |
June | 23206.869825 | 23401.861035 | 23038.578906 | 23183.498829 | 23183.498829 | 1.868362e+09 |
March | 22594.925268 | 22825.850535 | 22261.735815 | 22560.548829 | 22560.548829 | 2.601644e+09 |
May | 25292.243262 | 25431.831249 | 25042.192090 | 25236.711426 | 25236.711426 | 1.968866e+09 |
November | 13544.692481 | 13854.303028 | 13281.539013 | 13572.811964 | 13572.811964 | 2.527160e+09 |
October | 14952.218191 | 15347.115188 | 14501.354307 | 14924.740467 | 14924.740467 | 3.728888e+09 |
September | 19454.927733 | 19645.481491 | 19079.342216 | 19369.610027 | 19369.610027 | 2.950194e+09 |
What if you don’t want the months sorted alphabetically?
It’s often helpful that the values get sorted, but in this case, we don’t want them sorted alphabetically. By using the sort=False keyword argument, we tell pandas to keep whatever the original DataFrame order was.
df_sub.groupby("Month", sort=False).mean()
Month | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
January | 25576.140359 | 25891.002664 | 25035.877931 | 25401.379350 | 25401.379350 | 3.070506e+09 |
February | 23876.495271 | 24103.789063 | 23586.759458 | 23847.546773 | 23847.546773 | 2.225153e+09 |
March | 22594.925268 | 22825.850535 | 22261.735815 | 22560.548829 | 22560.548829 | 2.601644e+09 |
April | 24608.502325 | 24777.395183 | 24403.666760 | 24596.926526 | 24596.926526 | 2.360510e+09 |
May | 25292.243262 | 25431.831249 | 25042.192090 | 25236.711426 | 25236.711426 | 1.968866e+09 |
June | 23206.869825 | 23401.861035 | 23038.578906 | 23183.498829 | 23183.498829 | 1.868362e+09 |
July | 22088.403942 | 22258.997692 | 21906.613015 | 22078.589755 | 22078.589755 | 2.053009e+09 |
August | 21486.547287 | 21691.237459 | 21236.333779 | 21434.854236 | 21434.854236 | 1.836207e+09 |
September | 19454.927733 | 19645.481491 | 19079.342216 | 19369.610027 | 19369.610027 | 2.950194e+09 |
October | 14952.218191 | 15347.115188 | 14501.354307 | 14924.740467 | 14924.740467 | 3.728888e+09 |
November | 13544.692481 | 13854.303028 | 13281.539013 | 13572.811964 | 13572.811964 | 2.527160e+09 |
December | 14588.742839 | 14802.577615 | 14370.430571 | 14592.815198 | 14592.815198 | 1.962145e+09 |
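If you only care about one column, you can select it before calling mean. For example, this sketch computes just the average closing price for each month, keeping the calendar order:
# Average of the "Close" column alone, one value per month
df_sub.groupby("Month", sort=False)["Close"].mean()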
Call the following code, replacing ??? with appropriate values. What information is this conveying?
for gp, df_mini in df_sub.groupby("Month"):
print(???"The month is {gp} and the number of rows is ???.")
We just raced through this example, but I hope it gives a hint of what happens when you iterate over a pandas GroupBy object. The first value, which I call gp, is the group label (here, the month name). The second value, which I call df_mini, is the sub-DataFrame corresponding to that particular value of “Month”. For example, because February is the shortest month, the February mini DataFrame has the fewest rows.
for gp, df_mini in df_sub.groupby("Month"):
print(f"The month is {gp} and the number of rows is {df_mini.shape[0]}.")
The month is April and the number of rows is 21.
The month is August and the number of rows is 21.
The month is December and the number of rows is 21.
The month is February and the number of rows is 19.
The month is January and the number of rows is 22.
The month is July and the number of rows is 22.
The month is June and the number of rows is 20.
The month is March and the number of rows is 19.
The month is May and the number of rows is 20.
The month is November and the number of rows is 20.
The month is October and the number of rows is 21.
The month is September and the number of rows is 21.
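One related aside: if you want the sub-DataFrame for a single group without writing a loop, GroupBy objects also have a get_group method (a sketch):
# The February sub-DataFrame; should match the 19 rows reported above
df_sub.groupby("Month").get_group("February").shape  # expect (19, 9)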