Week 2 Friday#
Plan for today:
- 20-30 minutes: lecture.
- Remaining time: work on Worksheets 3-4, due Tuesday before Discussion Section. Chupeng and I are here to help.
- Reminder: videos and video quizzes are due. Try to finish them after class if you haven’t already.
import pandas as pd
df = pd.read_csv("indexData.csv")
df_info = pd.read_csv("indexInfo.csv")
Practice with isna, any, and list comprehension#
Make a Boolean Series indicating which columns in the indexData.csv file contain missing data. Use isna and any with an appropriate axis keyword argument. Name that Boolean Series bool_ser.
The best way to answer these sorts of questions is to build up the solution one step at a time. Here is the original DataFrame.
df
| | Name | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|
0 | NYA | 12/31/65 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 0.0 |
1 | NYA | 1/3/66 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 0.0 |
2 | NYA | 1/4/66 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 0.0 |
3 | NYA | 1/5/66 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 0.0 |
4 | NYA | 1/6/66 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112452 | N100 | 5/27/21 | 1241.119995 | 1251.910034 | 1241.119995 | 1247.069946 | 1247.069946 | 379696400.0 |
112453 | N100 | 5/28/21 | 1249.469971 | 1259.209961 | 1249.030029 | 1256.599976 | 1256.599976 | 160773400.0 |
112454 | N100 | 5/31/21 | 1256.079956 | 1258.880005 | 1248.140015 | 1248.930054 | 1248.930054 | 91173700.0 |
112455 | N100 | 6/1/21 | 1254.609985 | 1265.660034 | 1254.609985 | 1258.579956 | 1258.579956 | 155179900.0 |
112456 | N100 | 6/2/21 | 1258.489990 | 1263.709961 | 1258.239990 | 1263.619995 | 1263.619995 | 148465000.0 |
112457 rows × 8 columns
Here is the result of calling the isna method. We see False because there weren’t any missing values in the portion of the DataFrame we could see.
df.isna()
| | Name | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112452 | False | False | False | False | False | False | False | False |
112453 | False | False | False | False | False | False | False | False |
112454 | False | False | False | False | False | False | False | False |
112455 | False | False | False | False | False | False | False | False |
112456 | False | False | False | False | False | False | False | False |
112457 rows × 8 columns
We now go from the pandas DataFrame we see above to a pandas Series. The axis=0 says that we should plug in one full column at a time. (We may talk later about a more general way to understand this axis keyword argument. For now, just memorize that axis=0 means to plug in a full column at a time.)
df.isna().any(axis=0)
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
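A closely related pattern, which goes beyond what the exercise asks for, is to count the missing values in each column rather than just checking whether any exist. Here is a sketch (we didn’t run this in class):

# Number of missing values in each column (0 means the column is complete)
df.isna().sum(axis=0)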
If this any method is difficult to understand, maybe it will be clearer on a smaller DataFrame.
df_small = pd.DataFrame([[False, True], [False, False]], columns=["A","B"])
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
There was a question about data types of columns. That could mean a few different things. The entries in the columns are of type bool.
df_small.dtypes
A bool
B bool
dtype: object
df_small.columns
Index(['A', 'B'], dtype='object')
The names of the columns are of type str, for string.
[type(col) for col in df_small.columns]
[str, str]
The full column is a pandas Series.
type(df_small["A"])
pandas.core.series.Series
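We can also ask for the dtype of a single column directly; for this small DataFrame it should again be bool. A quick sketch:

# The dtype of a single column (a pandas Series)
df_small["A"].dtype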
Back to the original topic, which was about how this any method works.
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
Here we go through each column, one at a time, and check if it contains any True values. Notice how we don’t use isna in this case. The isna was to get Boolean values, but here we already have Boolean values.
df_small.any(axis=0)
A False
B True
dtype: bool
Equally useful to the any method is the all method. In this case, we get False for the values, because no column is entirely True.
df_small.all(axis=0)
A False
B False
dtype: bool
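If it helps, here is a rough translation of these two calls into Python’s built-in any and all functions using a list comprehension. This is only a sketch for intuition, not how we would normally write it in pandas.

# Column-by-column version of df_small.any(axis=0); should give [False, True]
[any(df_small[c]) for c in df_small.columns]

# Column-by-column version of df_small.all(axis=0); should give [False, False]
[all(df_small[c]) for c in df_small.columns]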
If we use axis=1, then we are plugging in a full row at a time. (Notice that the rows are labeled as 0 and 1, whereas we specified when we created df_small that the columns should be labeled as "A" and "B".)
df_small.any(axis=1)
0 True
1 False
dtype: bool
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
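As an aside, this axis=1 version combines naturally with Boolean indexing on rows. For example, something like the following sketch should keep only the rows of df that are missing at least one value (we did not run this in class):

# Rows of df with at least one missing value
df[df.isna().any(axis=1)]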
Using list comprehension and bool_ser.index, make a list of all the column names which contain missing data.
Here is a construction of bool_ser. This is a pandas Series whose index contains the names of the columns and whose values are True or False, depending on whether that column had missing data or not.
bool_ser = df.isna().any(axis=0)
bool_ser
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
Here is an example of list comprehension. This just creates a list with every column name.
[c for c in bool_ser.index]
['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
We only want the names of the columns which have missing data, so we add an if condition. The following is correct, but it is not good style, because bool_ser[c] == True is the exact same as bool_ser[c] itself.
[c for c in bool_ser.index if bool_ser[c] == True]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Here is the more elegant version.
# Better style
[c for c in bool_ser.index if bool_ser[c]]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Find the same columns using Boolean indexing. (If you really want it to be a list, wrap the answer in the list constructor.)
Here is a reminder of what bool_ser looked like. This is a good candidate for Boolean indexing, because it is a Boolean Series.
bool_ser
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
Here we use Boolean indexing to keep only the key/value pairs for which the value is True.
# Boolean indexing, both the keys and the values
bool_ser[bool_ser]
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
If we only care about the left-hand terms (the index, which I also call the keys), we can use the index attribute.
# Boolean indexing then index (only keep the keys, only keep the index)
bool_ser[bool_ser].index
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
If that’s confusing, maybe this example will be clearer, because the values look more interesting.
test_series = pd.Series({"a": 3, "c": 10, "g": -5})
test_series
a 3
c 10
g -5
dtype: int64
Notice how the values (3, 10, -5) disappear when we ask for the index.
test_series.index
Index(['a', 'c', 'g'], dtype='object')
I wasn’t totally sure this would work, because I don’t often need to remove the keys, but we can get the values by accessing the values attribute. (I also thought it might need parentheses, like a method call, but it doesn’t.)
test_series.values
array([ 3, 10, -5])
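As a side note, my understanding is that the pandas documentation now recommends the to_numpy method over the values attribute; it should return the same NumPy array here.

# An equivalent way to get the values as a NumPy array
test_series.to_numpy()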
Notice how bool_ser[bool_ser].index was not a list. If you really want it to be a list (there isn’t much advantage to making it a list), then you can wrap it in the list constructor function.
# If you insist on it being a list
list(bool_ser[bool_ser].index)
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Make the same list using a list comprehension of the form
[c for c in df.columns if ???]
Here we don’t need an axis keyword argument in the any method, because df[c].isna() is a pandas Series, which has only one dimension, so there is no dimension to specify when we call any. (There are definitely other ways to do this. Probably df.isna().any(axis=0)[c] would also work.)
[c for c in df.columns if df[c].isna().any()]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
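Yet another option, sketched here for comparison, is to use Boolean indexing directly on df.columns, since df.columns is an Index and supports Boolean indexing just like a Series does.

# Boolean indexing applied directly to the columns Index
df.columns[df.isna().any(axis=0)]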
Boolean indexing vs groupby#
Here is the other DataFrame we loaded.
df_info
| | Region | Exchange | Index | Currency |
|---|---|---|---|---|
0 | United States | New York Stock Exchange | NYA | USD |
1 | United States | NASDAQ | IXIC | USD |
2 | Hong Kong | Hong Kong Stock Exchange | HSI | HKD |
3 | China | Shanghai Stock Exchange | 000001.SS | CNY |
4 | Japan | Tokyo Stock Exchange | N225 | JPY |
5 | Europe | Euronext | N100 | EUR |
6 | China | Shenzhen Stock Exchange | 399001.SZ | CNY |
7 | Canada | Toronto Stock Exchange | GSPTSE | CAD |
8 | India | National Stock Exchange of India | NSEI | INR |
9 | Germany | Frankfurt Stock Exchange | GDAXI | EUR |
10 | Korea | Korea Exchange | KS11 | KRW |
11 | Switzerland | SIX Swiss Exchange | SSMI | CHF |
12 | Taiwan | Taiwan Stock Exchange | TWII | TWD |
13 | South Africa | Johannesburg Stock Exchange | J203.JO | ZAR |
Using a for loop, Boolean indexing, and an f-string (but no groupby), for each currency in df_info, print how many rows in the dataset use that currency.
Notice how there are some repetitions.
for cur in df_info["Currency"]:
print(cur)
USD
USD
HKD
CNY
JPY
EUR
CNY
CAD
INR
EUR
KRW
CHF
TWD
ZAR
We can use df_info["Currency"].unique() if we don’t want repeated currencies.
for cur in df_info["Currency"].unique():
print(cur)
USD
HKD
CNY
JPY
EUR
CAD
INR
KRW
CHF
TWD
ZAR
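Related sketch: if all we want is how many distinct currencies there are, the nunique method returns that count directly.

# Number of distinct currencies (the length of the unique array above)
df_info["Currency"].nunique()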
The following is similar to what you are asked to do in Worksheet 3. (In Worksheet 3, you are asked to provide more information than just the number of rows.)
# Hint for the end of Worksheet 3
for cur in df_info["Currency"].unique():
df_sub = df_info[df_info["Currency"] == cur]
print(f"The currency is {cur} and the number of rows is {len(df_sub)}")
The currency is USD and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is JPY and the number of rows is 1
The currency is EUR and the number of rows is 2
The currency is CAD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is ZAR and the number of rows is 1
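For comparison (this is not what the exercise asks for), pandas has a built-in shortcut for exactly this kind of counting, the value_counts method. A sketch:

# How many rows use each currency, sorted from most common to least common
df_info["Currency"].value_counts()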
Print the same information, again using a for loop, this time using a pandas GroupBy object and its count method.
The object df_info.groupby("Currency") by itself is not very useful.
df_info.groupby("Currency")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f13b0a05310>
We need to do something with this GroupBy object, like iterate over it using a for loop, or, as in the following example, call the count method. This tells us how many rows are in each group. (More precisely, count reports the number of non-missing values in each column for each group; since df_info has no missing values, that is the same as the number of rows here.)
df_info.groupby("Currency").count()
| Currency | Region | Exchange | Index |
|---|---|---|---|
CAD | 1 | 1 | 1 |
CHF | 1 | 1 | 1 |
CNY | 2 | 2 | 2 |
EUR | 2 | 2 | 2 |
HKD | 1 | 1 | 1 |
INR | 1 | 1 | 1 |
JPY | 1 | 1 | 1 |
KRW | 1 | 1 | 1 |
TWD | 1 | 1 | 1 |
USD | 2 | 2 | 2 |
ZAR | 1 | 1 | 1 |
Notice how df_info.groupby("Currency").count() is a pandas DataFrame. So we can get the “Region” column from that DataFrame just like usual, by using square brackets.
gb_ser = df_info.groupby("Currency").count()["Region"]
gb_ser
Currency
CAD 1
CHF 1
CNY 2
EUR 2
HKD 1
INR 1
JPY 1
KRW 1
TWD 1
USD 2
ZAR 1
Name: Region, dtype: int64
This gb_ser pandas Series has all the information we want.
for cur in gb_ser.index:
    print(f"The currency is {cur} and the number of rows is {gb_ser[cur]}")
The currency is CAD and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is EUR and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is JPY and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is USD and the number of rows is 2
The currency is ZAR and the number of rows is 1
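Another option, sketched here, is the size method of the GroupBy object, which should report the number of rows in each group without our having to single out a column first.

# Number of rows per currency, without selecting a specific column
df_info.groupby("Currency").size()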
Print the same information, this time iterating over the GroupBy object as follows.
for cur, df_sub in ???:
    print(???)
We didn’t get here!
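For reference, here is a sketch of one way the blanks might be filled in. Iterating over a GroupBy object yields pairs consisting of the group key and the corresponding sub-DataFrame.

# Sketch: iterating over the GroupBy object yields (key, sub-DataFrame) pairs
for cur, df_sub in df_info.groupby("Currency"):
    print(f"The currency is {cur} and the number of rows is {len(df_sub)}")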