Week 2 Wednesday#
Announcements#
Change to Wednesday Office Hours Location: Let’s try ALP 2800 this Wednesday (today) at 1pm.
Worksheet 4 uses some material from Worksheet 3, so definitely start with Worksheet 3.
Reminder: Week 2 videos and video quizzes are due before lecture on Friday.
On Friday I hope to introduce plotting in Python, briefly covering each of Matplotlib (similar to plotting in Matlab), Altair, Seaborn, and Plotly (Altair, Seaborn, and Plotly are similar to each other).
import pandas as pd
df = pd.read_csv("indexData.csv")
Review with Boolean indexing and map#
Define df_sub to be the sub-DataFrame of df which corresponds to the Hong Kong stock exchange (abbreviation: “HSI”) in 2008 (i.e., the last two characters in the “Date” column are “08”).
Comment. It would be more robust to convert the “Date” column to datetime type, like you need to do in Worksheet 3 and in Worksheet 4, but what we’re doing here is easier. Do not use pd.to_datetime(df["Date"]) in this particular example, because the dates from before 2000 will be parsed incorrectly (a two-digit year like “65” gets read as 2065, not 1965).
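As a quick illustration of the pitfall (a sketch; the exact two-digit-year cutoff can depend on the parser version):
# A two-digit year like "65" gets read as 2065, not 1965
pd.to_datetime("12/31/65")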
Let’s remind ourselves which columns are in this dataset.
df.columns
Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
Don’t try to solve a question like this one all at once. Instead, break it into parts. Here we make a first Boolean Series.
# Boolean Series for Hong Kong stock exchange
ser1 = (df["Name"] == "HSI")
ser1
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Name, Length: 112457, dtype: bool
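Before moving on, here is a quick optional sanity check that our comparison matched some rows. (True counts as 1 in a sum, so this counts how many rows correspond to HSI.)
# How many rows matched? True counts as 1 in a sum.
ser1.sum()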
Here we get ready to make the second Boolean Series.
df["Date"].map(lambda s: s[-2:])
0 65
1 66
2 66
3 66
4 66
...
112452 21
112453 21
112454 21
112455 21
112456 21
Name: Date, Length: 112457, dtype: object
Why do we get False at the top of the following output, since four of these rows are equal to 66? The problem is that the Series contains strings, not integers. (Notice how the dtype is reported as object, which in this case is a hint that the elements are strings.)
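The same mismatch happens in plain Python, which is exactly what goes wrong in the next comparison:
# A string is never equal to an integer
"66" == 66  # evaluates to False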
df["Date"].map(lambda s: s[-2:]) == 66
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Date, Length: 112457, dtype: bool
Here we are careful to use a string instead of an integer when we check for equality. We also switch to "08", since that is the year we are interested in.
ser2 = (df["Date"].map(lambda s: s[-2:]) == "08")
ser2
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Date, Length: 112457, dtype: bool
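As an aside, the same Boolean Series can be built with pandas’ .str accessor, which slices every string in the Series at once; this sketch should be equivalent to the map version above.
# Equivalent to the map approach above
ser2_alt = (df["Date"].str[-2:] == "08")
ser2_alt.equals(ser2)  # expect True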
Now we are finally ready to use Boolean indexing. Remember that you can combine two Boolean Series elementwise using &.
# Boolean indexing
df_sub = df[ser1 & ser2]
df_sub
 | Name | Date | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|---|---|
32076 | HSI | 1/2/08 | 27632.19922 | 27853.59961 | 27299.44922 | 27560.51953 | 27560.51953 | 1.232143e+09 |
32077 | HSI | 1/3/08 | 27050.02930 | 27223.71094 | 26864.13086 | 26887.27930 | 26887.27930 | 2.442744e+09 |
32078 | HSI | 1/4/08 | 27004.33984 | 27596.85938 | 26994.84961 | 27519.68945 | 27519.68945 | 2.375522e+09 |
32079 | HSI | 1/7/08 | 26962.53906 | 27186.07031 | 26698.53906 | 27179.49023 | 27179.49023 | 2.452932e+09 |
32080 | HSI | 1/8/08 | 27466.96094 | 27637.59961 | 27088.69922 | 27112.90039 | 27112.90039 | 2.492361e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
32318 | HSI | 12/23/08 | 14472.03027 | 14491.75000 | 14084.86035 | 14220.79004 | 14220.79004 | 1.494155e+09 |
32319 | HSI | 12/24/08 | 13855.88965 | 14300.70020 | 13855.88965 | 14184.13965 | 14184.13965 | 8.654987e+08 |
32320 | HSI | 12/29/08 | 14080.86035 | 14332.01953 | 13924.32031 | 14328.48047 | 14328.48047 | 1.046622e+09 |
32321 | HSI | 12/30/08 | 14476.74023 | 14513.48047 | 14189.08008 | 14235.50000 | 14235.50000 | 1.441716e+09 |
32322 | HSI | 12/31/08 | 14377.50977 | 14527.09961 | 14302.15039 | 14387.48047 | 14387.48047 | 1.612196e+09 |
247 rows × 8 columns
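As an aside, the two Series can also be written inline. If you do that, keep the parentheses around each comparison, because & binds more tightly than == in Python. A one-line sketch equivalent to the above:
# Same sub-DataFrame in one line; the parentheses are required
df_sub = df[(df["Name"] == "HSI") & (df["Date"].map(lambda s: s[-2:]) == "08")]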
Using copy to prevent a specific warning#
Create a new column in df_sub, named “Month”, corresponding to the month as a string like “July”. What warning shows up? How can we avoid this warning in the future?
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
/tmp/ipykernel_88/2393649687.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
Briefly, this warning is showing up because df_sub might not be its own DataFrame, but instead might be a “view” of a portion of the original df DataFrame.
Don’t worry about that distinction. Instead, know that we can avoid this warning by returning to the line where we created df_sub and using the copy method.
df_sub = df[ser1 & ser2].copy()
Now the exact same call as above works without raising a warning.
df_sub["Month"] = pd.to_datetime(df_sub["Date"]).dt.month_name()
Let’s check that we really do have a new “Month” column.
df_sub.head(3)
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32076 | HSI | 1/2/08 | 27632.19922 | 27853.59961 | 27299.44922 | 27560.51953 | 27560.51953 | 1.232143e+09 | January |
32077 | HSI | 1/3/08 | 27050.02930 | 27223.71094 | 26864.13086 | 26887.27930 | 26887.27930 | 2.442744e+09 | January |
32078 | HSI | 1/4/08 | 27004.33984 | 27596.85938 | 26994.84961 | 27519.68945 | 27519.68945 | 2.375522e+09 | January |
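As an aside, another way to avoid the warning is to build the copy and the new column in one chained expression with the assign method (a sketch equivalent to the two steps above, not something we need for this course):
# assign returns a brand-new DataFrame, so there is no view to warn about
df_sub = df[ser1 & ser2].assign(
    Month=lambda d: pd.to_datetime(d["Date"]).dt.month_name()
)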
Finding missing values with isna#
Is there any missing data in this sub-DataFrame? In how many rows? In which rows?
Missing values in pandas and in NumPy are denoted by some variant of np.nan, which stands for “not a number”. We can detect them by using the isna method.
df_sub.isna()
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32076 | False | False | False | False | False | False | False | False | False |
32077 | False | False | False | False | False | False | False | False | False |
32078 | False | False | False | False | False | False | False | False | False |
32079 | False | False | False | False | False | False | False | False | False |
32080 | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32318 | False | False | False | False | False | False | False | False | False |
32319 | False | False | False | False | False | False | False | False | False |
32320 | False | False | False | False | False | False | False | False | False |
32321 | False | False | False | False | False | False | False | False | False |
32322 | False | False | False | False | False | False | False | False | False |
247 rows × 9 columns
If all we care about is whether there are any missing values in a row, we can use the axis=1 keyword argument to the any method. We may talk about axis=1 more later; for now, just know that it means we are plugging in one row at a time (as opposed to one column at a time).
# Which rows have missing data?
df_sub.isna().any(axis=1)
32076 False
32077 False
32078 False
32079 False
32080 False
...
32318 False
32319 False
32320 False
32321 False
32322 False
Length: 247, dtype: bool
Here we count how many rows in df_sub have missing data.
# How many rows have missing data?
df_sub.isna().any(axis=1).sum()
2
Here we use Boolean indexing to keep only those rows with missing data.
# Use Boolean indexing to keep the rows with missing data
df_sub[df_sub.isna().any(axis=1)]
 | Name | Date | Open | High | Low | Close | Adj Close | Volume | Month |
---|---|---|---|---|---|---|---|---|---|
32222 | HSI | 8/6/08 | NaN | NaN | NaN | NaN | NaN | NaN | August |
32234 | HSI | 8/22/08 | NaN | NaN | NaN | NaN | NaN | NaN | August |
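If we wanted to discard these two rows rather than just locate them, pandas provides the dropna method (shown as a sketch only; we keep the rows for the rest of this example).
# Drop every row that contains any missing value
df_clean = df_sub.dropna()
df_clean.shape  # expect (245, 9), i.e., 247 rows minus the 2 above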
Back to the axis keyword argument. Here is what happens if we use axis=0, so that we are plugging in one column at a time.
# axis=0: which columns have missing data?
df_sub.isna().any(axis=0)
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
Month False
dtype: bool
Aside: How can you take the elementwise negation of a Boolean Series? For example, which columns do not have missing values?
To take an elementwise negation in NumPy or pandas, use the tilde symbol, ~.
~df_sub.isna().any(axis=0)
Name True
Date True
Open False
High False
Low False
Close False
Adj Close False
Volume False
Month True
dtype: bool
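One natural use of the negated Series is Boolean indexing along the columns, via .loc; this sketch keeps only the columns with no missing values.
# Keep only the columns that have no missing values
df_sub.loc[:, ~df_sub.isna().any(axis=0)]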
The DataFrame method groupby#
When I first learned the groupby method, I found it very confusing. I wanted to know: is df_sub.groupby("Month") a DataFrame? A list of DataFrames? But this isn’t really the correct object-oriented perspective. In fact, df_sub.groupby("Month") is its own special type of object, a pandas GroupBy object.
There are two things I want you to know about GroupBy objects.
1. If you apply a method like mean, what is the result?
2. If you iterate over a GroupBy object, what values do you get?
Call df_sub.groupby("Month").mean(). What information is this conveying?
For example, in the following output, the 24608 at the top left means that for the month of April, the average value in the “Open” column was approximately 24608.
df_sub.groupby("Month").mean()
Month | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
April | 24608.502325 | 24777.395183 | 24403.666760 | 24596.926526 | 24596.926526 | 2.360510e+09 |
August | 21486.547287 | 21691.237459 | 21236.333779 | 21434.854236 | 21434.854236 | 1.836207e+09 |
December | 14588.742839 | 14802.577615 | 14370.430571 | 14592.815198 | 14592.815198 | 1.962145e+09 |
February | 23876.495271 | 24103.789063 | 23586.759458 | 23847.546773 | 23847.546773 | 2.225153e+09 |
January | 25576.140359 | 25891.002664 | 25035.877931 | 25401.379350 | 25401.379350 | 3.070506e+09 |
July | 22088.403942 | 22258.997692 | 21906.613015 | 22078.589755 | 22078.589755 | 2.053009e+09 |
June | 23206.869825 | 23401.861035 | 23038.578906 | 23183.498829 | 23183.498829 | 1.868362e+09 |
March | 22594.925268 | 22825.850535 | 22261.735815 | 22560.548829 | 22560.548829 | 2.601644e+09 |
May | 25292.243262 | 25431.831249 | 25042.192090 | 25236.711426 | 25236.711426 | 1.968866e+09 |
November | 13544.692481 | 13854.303028 | 13281.539013 | 13572.811964 | 13572.811964 | 2.527160e+09 |
October | 14952.218191 | 15347.115188 | 14501.354307 | 14924.740467 | 14924.740467 | 3.728888e+09 |
September | 19454.927733 | 19645.481491 | 19079.342216 | 19369.610027 | 19369.610027 | 2.950194e+09 |
What if you don’t want the months sorted alphabetically?
It’s often helpful that the values get sorted, but in this case, we don’t want them sorted alphabetically. By using the sort=False keyword argument, we tell pandas to keep whatever the original DataFrame order was.
df_sub.groupby("Month", sort=False).mean()
Month | Open | High | Low | Close | Adj Close | Volume |
---|---|---|---|---|---|---|
January | 25576.140359 | 25891.002664 | 25035.877931 | 25401.379350 | 25401.379350 | 3.070506e+09 |
February | 23876.495271 | 24103.789063 | 23586.759458 | 23847.546773 | 23847.546773 | 2.225153e+09 |
March | 22594.925268 | 22825.850535 | 22261.735815 | 22560.548829 | 22560.548829 | 2.601644e+09 |
April | 24608.502325 | 24777.395183 | 24403.666760 | 24596.926526 | 24596.926526 | 2.360510e+09 |
May | 25292.243262 | 25431.831249 | 25042.192090 | 25236.711426 | 25236.711426 | 1.968866e+09 |
June | 23206.869825 | 23401.861035 | 23038.578906 | 23183.498829 | 23183.498829 | 1.868362e+09 |
July | 22088.403942 | 22258.997692 | 21906.613015 | 22078.589755 | 22078.589755 | 2.053009e+09 |
August | 21486.547287 | 21691.237459 | 21236.333779 | 21434.854236 | 21434.854236 | 1.836207e+09 |
September | 19454.927733 | 19645.481491 | 19079.342216 | 19369.610027 | 19369.610027 | 2.950194e+09 |
October | 14952.218191 | 15347.115188 | 14501.354307 | 14924.740467 | 14924.740467 | 3.728888e+09 |
November | 13544.692481 | 13854.303028 | 13281.539013 | 13572.811964 | 13572.811964 | 2.527160e+09 |
December | 14588.742839 | 14802.577615 | 14370.430571 | 14592.815198 | 14592.815198 | 1.962145e+09 |
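If you only care about one column, you can select it before calling mean. For example, this sketch computes just the average closing price for each month, keeping the calendar order:
# Average of the "Close" column alone, one value per month
df_sub.groupby("Month", sort=False)["Close"].mean()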
Call the following code, replacing ??? with appropriate values. What information is this conveying?
for gp, df_mini in df_sub.groupby("Month"):
print(???"The month is {gp} and the number of rows is ???.")
We just raced through this example, but I hope it gives a hint of what happens when you iterate over a pandas GroupBy object. The first value, which I call gp, is the group label (here, the month name). The second value, which I call df_mini, is the sub-DataFrame corresponding to that particular value of “Month”. For example, because February is the shortest month, the February mini DataFrame has the fewest rows.
for gp, df_mini in df_sub.groupby("Month"):
print(f"The month is {gp} and the number of rows is {df_mini.shape[0]}.")
The month is April and the number of rows is 21.
The month is August and the number of rows is 21.
The month is December and the number of rows is 21.
The month is February and the number of rows is 19.
The month is January and the number of rows is 22.
The month is July and the number of rows is 22.
The month is June and the number of rows is 20.
The month is March and the number of rows is 19.
The month is May and the number of rows is 20.
The month is November and the number of rows is 20.
The month is October and the number of rows is 21.
The month is September and the number of rows is 21.
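One related aside: if you want the sub-DataFrame for a single group without writing a loop, GroupBy objects also have a get_group method (a sketch):
# The February sub-DataFrame; should match the 19 rows reported above
df_sub.groupby("Month").get_group("February").shape  # expect (19, 9)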