Week 2 Friday#

Plan for today.

  • 20-30 minutes: Lecture

  • Remaining time: work on Worksheets 3-4. Due Tuesday before Discussion Section. Chupeng and I are here to help.

  • Reminder: Videos and video quizzes are due. Try to finish them after class if you haven’t already.

import pandas as pd
df = pd.read_csv("indexData.csv")
df_info = pd.read_csv("indexInfo.csv")

Practice with isna, any, and list comprehension#

  • Make a Boolean Series indicating which columns in the indexData.csv file contain missing data. Use isna and any with an appropriate axis keyword argument. Name that Boolean Series bool_ser.

The best way to answer these sorts of questions is to build up the solution one step at a time. Here is the original DataFrame.

df
Name Date Open High Low Close Adj Close Volume
0 NYA 12/31/65 528.690002 528.690002 528.690002 528.690002 528.690002 0.0
1 NYA 1/3/66 527.210022 527.210022 527.210022 527.210022 527.210022 0.0
2 NYA 1/4/66 527.840027 527.840027 527.840027 527.840027 527.840027 0.0
3 NYA 1/5/66 531.119995 531.119995 531.119995 531.119995 531.119995 0.0
4 NYA 1/6/66 532.070007 532.070007 532.070007 532.070007 532.070007 0.0
... ... ... ... ... ... ... ... ...
112452 N100 5/27/21 1241.119995 1251.910034 1241.119995 1247.069946 1247.069946 379696400.0
112453 N100 5/28/21 1249.469971 1259.209961 1249.030029 1256.599976 1256.599976 160773400.0
112454 N100 5/31/21 1256.079956 1258.880005 1248.140015 1248.930054 1248.930054 91173700.0
112455 N100 6/1/21 1254.609985 1265.660034 1254.609985 1258.579956 1258.579956 155179900.0
112456 N100 6/2/21 1258.489990 1263.709961 1258.239990 1263.619995 1263.619995 148465000.0

112457 rows × 8 columns

Here is the result of calling the isna method. We see False everywhere because there are no missing values in the portion of the DataFrame shown.

df.isna()
Name Date Open High Low Close Adj Close Volume
0 False False False False False False False False
1 False False False False False False False False
2 False False False False False False False False
3 False False False False False False False False
4 False False False False False False False False
... ... ... ... ... ... ... ... ...
112452 False False False False False False False False
112453 False False False False False False False False
112454 False False False False False False False False
112455 False False False False False False False False
112456 False False False False False False False False

112457 rows × 8 columns

We now go from the pandas DataFrame we see above to a pandas Series. The axis=0 says that we should plug in one full column at a time. (We may talk later about a more general way to understand this axis keyword argument. For now, just memorize that axis=0 means to plug in a full column at a time.)

df.isna().any(axis=0)
Name         False
Date         False
Open          True
High          True
Low           True
Close         True
Adj Close     True
Volume        True
dtype: bool
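If it helps, here is a sketch of what this computation is doing behind the scenes: plugging in one full column at a time and asking whether that column contains any missing values. (The DataFrame here is a small made-up example, not the real indexData.csv.)

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with one missing value (not the real indexData.csv)
df_demo = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 4.0]})

# Manual equivalent of df_demo.isna().any(axis=0):
# plug in one full column at a time and ask whether it has any missing values
manual = pd.Series({c: df_demo[c].isna().any() for c in df_demo.columns})
print(manual)
```

The dictionary comprehension builds one Boolean value per column, which is exactly what any(axis=0) produces in a single call.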

If this any method is difficult to understand, it may be clearer on a smaller DataFrame.

df_small = pd.DataFrame([[False, True], [False, False]], columns=["A","B"])
df_small
A B
0 False True
1 False False

There was a question about data types of columns. That could mean a few different things. The entries in the columns are of type bool.

df_small.dtypes
A    bool
B    bool
dtype: object
df_small.columns
Index(['A', 'B'], dtype='object')

The names of the columns are of type str, for string.

[type(col) for col in df_small.columns]
[str, str]

The full column is a pandas Series.

type(df_small["A"])
pandas.core.series.Series

Back to the original topic, which was about how this any method works.

df_small
A B
0 False True
1 False False

Here we go through each column, one at a time, and check if it contains any True values. Notice how we don’t use isna in this case. The isna was to get Boolean values, but here we already have Boolean values.

df_small.any(axis=0)
A    False
B     True
dtype: bool

Equally useful is the all method. In this case, we get False for both columns, because neither column is entirely True.

df_small.all(axis=0)
A    False
B    False
dtype: bool

If we use axis=1, then we are plugging in a full row at a time. (Notice that the rows are labeled as 0 and 1, whereas we specified when we created df_small that the columns should be labeled as "A" and "B".)

df_small.any(axis=1)
0     True
1    False
dtype: bool
df_small
A B
0 False True
1 False False
  • Using list comprehension and bool_ser.index, make a list of all the column names which contain missing data.

Here is a construction of bool_ser. This is a pandas Series whose index contains the names of the columns and whose values are True or False, depending on whether that column had missing data or not.

bool_ser = df.isna().any(axis=0)
bool_ser
Name         False
Date         False
Open          True
High          True
Low           True
Close         True
Adj Close     True
Volume        True
dtype: bool

Here is an example of list comprehension. This just creates a list with every column name.

[c for c in bool_ser.index]
['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

We only want the names of the columns which have missing data, so we add an if condition. The following is correct but it is not good style, because bool_ser[c] == True is exactly the same as bool_ser[c] itself.

[c for c in bool_ser.index if bool_ser[c] == True]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

Here is the more elegant version.

# Better style
[c for c in bool_ser.index if bool_ser[c]]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
  • Find the same columns using Boolean indexing. (If you really want it to be a list, wrap the answer in the list constructor.)

Here is a reminder of what bool_ser looked like. This is a good candidate for Boolean indexing, because it is a Boolean Series.

bool_ser
Name         False
Date         False
Open          True
High          True
Low           True
Close         True
Adj Close     True
Volume        True
dtype: bool

Here we use Boolean indexing to keep only the key/value pairs for which the value is True.

# Boolean indexing, both the keys and the values
bool_ser[bool_ser]
Open         True
High         True
Low          True
Close        True
Adj Close    True
Volume       True
dtype: bool

If we only care about the left-hand terms (the index, which I also call the keys), we can use the index attribute.

# Boolean indexing then index (only keep the keys, only keep the index)
bool_ser[bool_ser].index
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

If that’s confusing, this example may be clearer, because the values look more interesting.

test_series = pd.Series({"a": 3, "c": 10, "g": -5})
test_series
a     3
c    10
g    -5
dtype: int64

Notice how the values (3, 10, -5) disappear when we ask for the index.

test_series.index
Index(['a', 'c', 'g'], dtype='object')

I wasn’t totally sure this would work because I don’t need to remove the keys very often, but we can get the values by accessing the values attribute. (I also thought it might need parentheses like a method, but it doesn’t.)

test_series.values
array([ 3, 10, -5])

Notice how bool_ser[bool_ser].index was not a list. If you really want it to be a list (there isn’t much advantage to making it a list), then you can wrap it in the list constructor function.

# If you insist on it being a list
list(bool_ser[bool_ser].index)
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
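As a side note, the Index object also has a tolist method that performs the same conversion. A quick sketch on a made-up Boolean Series (demo_ser here is just a stand-in for bool_ser):

```python
import pandas as pd

# Made-up Boolean Series standing in for bool_ser
demo_ser = pd.Series({"Name": False, "Open": True, "Volume": True})

# Index objects have a tolist method, equivalent to wrapping in list()
cols = demo_ser[demo_ser].index.tolist()
print(cols)  # ['Open', 'Volume']
```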
  • Make the same list using a list comprehension of the form

[c for c in df.columns if ???]

Here we don’t need an axis keyword argument in the any method, because df[c].isna() is a pandas Series, which has only one dimension, so there is no axis to specify. (There are definitely other ways to do this; df.isna().any(axis=0)[c] would also work.)

[c for c in df.columns if df[c].isna().any()]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']

Boolean indexing vs groupby#

Here is the other DataFrame we loaded.

df_info
Region Exchange Index Currency
0 United States New York Stock Exchange NYA USD
1 United States NASDAQ IXIC USD
2 Hong Kong Hong Kong Stock Exchange HSI HKD
3 China Shanghai Stock Exchange 000001.SS CNY
4 Japan Tokyo Stock Exchange N225 JPY
5 Europe Euronext N100 EUR
6 China Shenzhen Stock Exchange 399001.SZ CNY
7 Canada Toronto Stock Exchange GSPTSE CAD
8 India National Stock Exchange of India NSEI INR
9 Germany Frankfurt Stock Exchange GDAXI EUR
10 Korea Korea Exchange KS11 KRW
11 Switzerland SIX Swiss Exchange SSMI CHF
12 Taiwan Taiwan Stock Exchange TWII TWD
13 South Africa Johannesburg Stock Exchange J203.JO ZAR
  • Using a for loop, Boolean indexing, and an f-string (but no groupby), for each currency in df_info, print how many rows in the dataset use that currency.

Notice how there are some repetitions.

for cur in df_info["Currency"]:
    print(cur)
USD
USD
HKD
CNY
JPY
EUR
CNY
CAD
INR
EUR
KRW
CHF
TWD
ZAR

We can use df_info["Currency"].unique() if we don’t want repeated currencies.

for cur in df_info["Currency"].unique():
    print(cur)
USD
HKD
CNY
JPY
EUR
CAD
INR
KRW
CHF
TWD
ZAR

The following is similar to what you are asked to do in Worksheet 3. (In Worksheet 3, you are asked to provide more information than just the number of rows.)

# Hint for the end of Worksheet 3
for cur in df_info["Currency"].unique():
    df_sub = df_info[df_info["Currency"] == cur]
    print(f"The currency is {cur} and the number of rows is {len(df_sub)}")
The currency is USD and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is JPY and the number of rows is 1
The currency is EUR and the number of rows is 2
The currency is CAD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is ZAR and the number of rows is 1
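As an aside we didn't cover in class, pandas also has a value_counts method that computes these counts in one step. Here is a sketch using a made-up currency Series (a stand-in for df_info["Currency"]):

```python
import pandas as pd

# Made-up currency column; a stand-in for df_info["Currency"]
cur_ser = pd.Series(["USD", "USD", "EUR", "CAD"])

# value_counts returns a Series: unique values in the index, counts as the values
counts = cur_ser.value_counts()
for cur in counts.index:
    print(f"The currency is {cur} and the number of rows is {counts[cur]}")
```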
  • Print the same information, again using a for loop, this time using a pandas GroupBy object and its count method.

The object df_info.groupby("Currency") by itself is not very useful.

df_info.groupby("Currency")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f13b0a05310>

We need to do something with this GroupBy object, like iterate over it using a for loop, or, as in the following example, call the count method. This tells us how many non-missing entries each column has within each group; since df_info has no missing values, that is the same as the number of rows.

df_info.groupby("Currency").count()
Region Exchange Index
Currency
CAD 1 1 1
CHF 1 1 1
CNY 2 2 2
EUR 2 2 2
HKD 1 1 1
INR 1 1 1
JPY 1 1 1
KRW 1 1 1
TWD 1 1 1
USD 2 2 2
ZAR 1 1 1
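In fact, count reports the number of non-missing entries in each column, not the raw number of rows. A small made-up example showing the difference, along with the size method, which counts rows unconditionally:

```python
import numpy as np
import pandas as pd

# Made-up DataFrame: group "a" has a missing value in column "v"
df_demo = pd.DataFrame({"g": ["a", "a", "b"], "v": [1.0, np.nan, 3.0]})

# count reports non-missing entries, so group "a" shows 1 even though it has 2 rows
counts = df_demo.groupby("g").count()
print(counts)

# size counts rows regardless of missing values
sizes = df_demo.groupby("g").size()
print(sizes)
```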

Notice how df_info.groupby("Currency").count() is a pandas DataFrame. So we can get the “Region” column from that DataFrame just like usual, by using square brackets.

gb_ser = df_info.groupby("Currency").count()["Region"]
gb_ser
Currency
CAD    1
CHF    1
CNY    2
EUR    2
HKD    1
INR    1
JPY    1
KRW    1
TWD    1
USD    2
ZAR    1
Name: Region, dtype: int64

This gb_ser pandas Series has all the information we want.

for cur in gb_ser.index:
    print(f"The currency is {cur} and the number of rows is {gb_ser[cur]}")
The currency is CAD and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is EUR and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is JPY and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is USD and the number of rows is 2
The currency is ZAR and the number of rows is 1
  • Print the same information, this time iterating over the GroupBy object as follows.

for cur, df_sub in ???:
    print(???)

We didn’t get here!
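Since we ran out of time in class, here is one possible completion, as a sketch on a made-up stand-in for df_info (the variable names are only suggestions). The key fact is that iterating over a GroupBy object yields pairs: the group key and the corresponding sub-DataFrame.

```python
import pandas as pd

# Made-up stand-in for df_info
df_info_demo = pd.DataFrame({
    "Region": ["United States", "United States", "Hong Kong", "China", "China"],
    "Currency": ["USD", "USD", "HKD", "CNY", "CNY"],
})

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs
for cur, df_sub in df_info_demo.groupby("Currency"):
    print(f"The currency is {cur} and the number of rows is {len(df_sub)}")
```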