Week 2 Wednesday#

Announcements#

  • Change to Wednesday Office Hours Location: Let’s try ALP 2800 this Wednesday (today) at 1pm.

  • Worksheet 4 uses some material from Worksheet 3, so definitely start with Worksheet 3.

  • Reminder: Week 2 videos and video quizzes are due before lecture on Friday.

  • On Friday I hope to introduce plotting in Python, briefly covering each of Matplotlib (similar to plotting in Matlab), Altair, Seaborn, and Plotly (the last three are similar to each other).

import pandas as pd
df = pd.read_csv("indexData.csv")

Review of Boolean indexing and map#

  • Define df_sub to be the sub-DataFrame of df corresponding to the Hong Kong stock exchange (abbreviated “HSI” in the “Name” column) in 2008 (i.e., the last two characters in the “Date” column are “08”).

Comment. It would be more robust to convert the “Date” column to datetime type, as you need to do in Worksheet 3 and in Worksheet 4, but what we’re doing here is easier. Do not use pd.to_datetime(df["Date"]) in this particular example, because two-digit years like “65” would be interpreted as 2065 rather than 1965, so the oldest dates in this file would be incorrect.
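
To see the problem in isolation (a tiny illustration, not something you need for this exercise), pandas typically places a two-digit year like “66” in the 2000s:

pd.to_datetime("1/3/66")   # likely Timestamp('2066-01-03 00:00:00'), not 1966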

Let’s remind ourselves which columns are in this dataset.

df.columns
Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

Don’t try to solve a question like this one all at once. Instead, break it into parts. Here we make a first Boolean Series.

# Boolean Series for Hong Kong stock exchange
ser1 = (df["Name"] == "HSI")
ser1
0         False
1         False
2         False
3         False
4         False
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Name, Length: 112457, dtype: bool

Here we get ready to make the second Boolean Series.

df["Date"].map(lambda s: s[-2:])
0         65
1         66
2         66
3         66
4         66
          ..
112452    21
112453    21
112454    21
112455    21
112456    21
Name: Date, Length: 112457, dtype: object

In the following comparison, why do we get False at the top, even though four of the first five rows look like 66? The problem is that the Series contains strings, not integers. (Notice how the dtype is reported as object, which in this case is a hint that the elements are strings.)
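
In plain Python terms (a one-line sketch of the same issue):

"66" == 66   # False: a string is never equal to an integer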

df["Date"].map(lambda s: s[-2:]) == 66
0         False
1         False
2         False
3         False
4         False
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Date, Length: 112457, dtype: bool

Here we are careful to use a string instead of an integer when we check for equality. We also switch to "08", since that is the year we are interested in.

ser2 = (df["Date"].map(lambda s: s[-2:]) == "08")
ser2
0         False
1         False
2         False
3         False
4         False
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Date, Length: 112457, dtype: bool
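
As a side note (just a sketch of an alternative, not needed for the exercise), the same Series can be built without a lambda function by using pandas’ .str accessor; the name ser2_alt below is only for illustration.

# Alternative to the map/lambda version above
ser2_alt = (df["Date"].str[-2:] == "08")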

Now we are finally ready to use Boolean indexing. Remember that you can combine two Boolean Series elementwise using &.

# Boolean indexing
df_sub = df[ser1 & ser2]
df_sub
Name Date Open High Low Close Adj Close Volume
32076 HSI 1/2/08 27632.19922 27853.59961 27299.44922 27560.51953 27560.51953 1.232143e+09
32077 HSI 1/3/08 27050.02930 27223.71094 26864.13086 26887.27930 26887.27930 2.442744e+09
32078 HSI 1/4/08 27004.33984 27596.85938 26994.84961 27519.68945 27519.68945 2.375522e+09
32079 HSI 1/7/08 26962.53906 27186.07031 26698.53906 27179.49023 27179.49023 2.452932e+09
32080 HSI 1/8/08 27466.96094 27637.59961 27088.69922 27112.90039 27112.90039 2.492361e+09
... ... ... ... ... ... ... ... ...
32318 HSI 12/23/08 14472.03027 14491.75000 14084.86035 14220.79004 14220.79004 1.494155e+09
32319 HSI 12/24/08 13855.88965 14300.70020 13855.88965 14184.13965 14184.13965 8.654987e+08
32320 HSI 12/29/08 14080.86035 14332.01953 13924.32031 14328.48047 14328.48047 1.046622e+09
32321 HSI 12/30/08 14476.74023 14513.48047 14189.08008 14235.50000 14235.50000 1.441716e+09
32322 HSI 12/31/08 14377.50977 14527.09961 14302.15039 14387.48047 14387.48047 1.612196e+09

247 rows × 8 columns
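
As a quick sanity check (a sketch; the Series method unique, which lists the distinct values, hasn’t appeared above), every remaining row should be for HSI and every date should end in “08”.

df_sub["Name"].unique()                        # expect array(['HSI'], dtype=object)
df_sub["Date"].map(lambda s: s[-2:]).unique()  # expect array(['08'], dtype=object)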

Using copy to prevent a specific warning#

  • Create a new column in df_sub, named “Month”, corresponding to the month, as a string like “July”. What warning shows up? How can we avoid this warning in the future?

df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
/tmp/ipykernel_88/2393649687.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()

Briefly, this warning is showing up because df_sub might not be its own DataFrame, but instead might be a “view” of a portion of the original df DataFrame.

Don’t worry about the details. Instead, know that we can avoid this warning by returning to the line where we created df_sub and using the copy method.

df_sub = df[ser1 & ser2].copy()

Now the exact same call as above works without raising a warning.

df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()

Let’s check that we really do have a new “Month” column.

df_sub.head(3)
Name Date Open High Low Close Adj Close Volume Month
32076 HSI 1/2/08 27632.19922 27853.59961 27299.44922 27560.51953 27560.51953 1.232143e+09 January
32077 HSI 1/3/08 27050.02930 27223.71094 26864.13086 26887.27930 26887.27930 2.442744e+09 January
32078 HSI 1/4/08 27004.33984 27596.85938 26994.84961 27519.68945 27519.68945 2.375522e+09 January

Finding missing values with isna#

  • Is there any missing data in this sub-DataFrame? In how many rows? In what rows?

Missing values in pandas and in NumPy are denoted by some variant of np.nan, which stands for “not a number”. We can detect them by using the isna method.
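
The reason we need a dedicated method at all (a quick sketch): np.nan is not equal to anything, not even to itself, so an ordinary == comparison cannot find missing values.

import numpy as np

np.nan == np.nan   # False, so == cannot detect missing values
pd.isna(np.nan)    # True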

df_sub.isna()
Name Date Open High Low Close Adj Close Volume Month
32076 False False False False False False False False False
32077 False False False False False False False False False
32078 False False False False False False False False False
32079 False False False False False False False False False
32080 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
32318 False False False False False False False False False
32319 False False False False False False False False False
32320 False False False False False False False False False
32321 False False False False False False False False False
32322 False False False False False False False False False

247 rows × 9 columns

If all we care about is whether a row contains any missing values, we can use the axis=1 keyword argument to the any method. We may talk about axis=1 more later, but for now, just know that it means we are plugging in one row at a time (as opposed to one column at a time).

# Which rows have missing data?
df_sub.isna().any(axis=1)
32076    False
32077    False
32078    False
32079    False
32080    False
         ...  
32318    False
32319    False
32320    False
32321    False
32322    False
Length: 247, dtype: bool

Here we count how many rows in df_sub have missing data.

# How many rows have missing data?
df_sub.isna().any(axis=1).sum()
2

Here we use Boolean indexing to keep only those rows with missing data.

# Use Boolean indexing to keep the rows with missing data
df_sub[df_sub.isna().any(axis=1)]
Name Date Open High Low Close Adj Close Volume Month
32222 HSI 8/6/08 NaN NaN NaN NaN NaN NaN August
32234 HSI 8/22/08 NaN NaN NaN NaN NaN NaN August

Back to the axis keyword argument. Here is what happens if we use axis=0, so we are plugging in one column at a time.

# axis=0 Which columns have missing data?
df_sub.isna().any(axis=0)
Name         False
Date         False
Open          True
High          True
Low           True
Close         True
Adj Close     True
Volume        True
Month        False
dtype: bool
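
If we instead want to know how many missing values each column has (a small sketch; judging from the two rows found above, each of the numeric columns should report 2), we can replace any with sum.

# How many missing values in each column?
df_sub.isna().sum(axis=0)
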
  • Aside: How can you take the element-wise negation of a Boolean Series? For example, which columns do not have missing values?

To take an elementwise negation in NumPy or pandas, use the tilde symbol, ~.

~df_sub.isna().any(axis=0)
Name          True
Date          True
Open         False
High         False
Low          False
Close        False
Adj Close    False
Volume       False
Month         True
dtype: bool

The DataFrame method groupby#

When I first learned the groupby method, I found it very confusing. I wanted to know, is df_sub.groupby("Month") a DataFrame? A list of DataFrames? But this isn’t really the correct object-oriented perspective. In fact, df_sub.groupby("Month") is just its own special type of object, a pandas GroupBy object.
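
You can check this directly (a quick sketch; the exact class name can vary slightly between pandas versions).

type(df_sub.groupby("Month"))   # something like pandas.core.groupby.generic.DataFrameGroupBy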

There are two things I want you to know about GroupBy objects.

  1. If you apply a method like mean, what is the result?

  2. If you iterate over a GroupBy object, what values do you get?

  • Call df_sub.groupby("Month").mean(). What information is this conveying?

For example, in the following, the 24608 at the top left means that for the month of April, the average value in the “Open” column was approximately 24608.

df_sub.groupby("Month").mean()
Open High Low Close Adj Close Volume
Month
April 24608.502325 24777.395183 24403.666760 24596.926526 24596.926526 2.360510e+09
August 21486.547287 21691.237459 21236.333779 21434.854236 21434.854236 1.836207e+09
December 14588.742839 14802.577615 14370.430571 14592.815198 14592.815198 1.962145e+09
February 23876.495271 24103.789063 23586.759458 23847.546773 23847.546773 2.225153e+09
January 25576.140359 25891.002664 25035.877931 25401.379350 25401.379350 3.070506e+09
July 22088.403942 22258.997692 21906.613015 22078.589755 22078.589755 2.053009e+09
June 23206.869825 23401.861035 23038.578906 23183.498829 23183.498829 1.868362e+09
March 22594.925268 22825.850535 22261.735815 22560.548829 22560.548829 2.601644e+09
May 25292.243262 25431.831249 25042.192090 25236.711426 25236.711426 1.968866e+09
November 13544.692481 13854.303028 13281.539013 13572.811964 13572.811964 2.527160e+09
October 14952.218191 15347.115188 14501.354307 14924.740467 14924.740467 3.728888e+09
September 19454.927733 19645.481491 19079.342216 19369.610027 19369.610027 2.950194e+09

  • What if you don’t want the months sorted alphabetically?

It’s often helpful that the values get sorted, but in this case, we don’t want them sorted alphabetically. By using the sort=False keyword argument, we tell pandas to keep whatever the original DataFrame order was.

df_sub.groupby("Month", sort=False).mean()
Open High Low Close Adj Close Volume
Month
January 25576.140359 25891.002664 25035.877931 25401.379350 25401.379350 3.070506e+09
February 23876.495271 24103.789063 23586.759458 23847.546773 23847.546773 2.225153e+09
March 22594.925268 22825.850535 22261.735815 22560.548829 22560.548829 2.601644e+09
April 24608.502325 24777.395183 24403.666760 24596.926526 24596.926526 2.360510e+09
May 25292.243262 25431.831249 25042.192090 25236.711426 25236.711426 1.968866e+09
June 23206.869825 23401.861035 23038.578906 23183.498829 23183.498829 1.868362e+09
July 22088.403942 22258.997692 21906.613015 22078.589755 22078.589755 2.053009e+09
August 21486.547287 21691.237459 21236.333779 21434.854236 21434.854236 1.836207e+09
September 19454.927733 19645.481491 19079.342216 19369.610027 19369.610027 2.950194e+09
October 14952.218191 15347.115188 14501.354307 14924.740467 14924.740467 3.728888e+09
November 13544.692481 13854.303028 13281.539013 13572.811964 13572.811964 2.527160e+09
December 14588.742839 14802.577615 14370.430571 14592.815198 14592.815198 1.962145e+09

  • Call the following code, replacing ??? with appropriate values. What information is this conveying?

for gp, df_mini in df_sub.groupby("Month"):
    print(???"The month is {gp} and the number of rows is ???.")

We just raced through this example, but I hope it gives a hint of what happens when you iterate over a pandas GroupBy object. The value I call df_mini is the sub-DataFrame corresponding to one particular group (one value of “Month” in this case). For example, because February is the shortest month, its mini DataFrame has one of the smallest row counts (19 rows, tied with March here).

for gp, df_mini in df_sub.groupby("Month"):
    print(f"The month is {gp} and the number of rows is {df_mini.shape[0]}.")
The month is April and the number of rows is 21.
The month is August and the number of rows is 21.
The month is December and the number of rows is 21.
The month is February and the number of rows is 19.
The month is January and the number of rows is 22.
The month is July and the number of rows is 22.
The month is June and the number of rows is 20.
The month is March and the number of rows is 19.
The month is May and the number of rows is 20.
The month is November and the number of rows is 20.
The month is October and the number of rows is 21.
The month is September and the number of rows is 21.
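
As a final sketch (one extra method, not covered above), the same per-month row counts can be computed without writing a loop, using the size method of the GroupBy object.

# Number of rows in each group, without a loop
df_sub.groupby("Month").size()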