Week 2 Friday#
Plan for today:
- 20-30 minutes: lecture.
- Remaining time: work on Worksheets 3-4, due Tuesday before Discussion Section. Chupeng and I are here to help.
- Reminder: videos and video quizzes are due. Try to finish them after class if you haven’t already.
import pandas as pd
df = pd.read_csv("indexData.csv")
df_info = pd.read_csv("indexInfo.csv")
Practice with isna, any, and list comprehension#
Make a Boolean Series indicating which columns in the indexData.csv file contain missing data. Use isna and any with an appropriate axis keyword argument. Name that Boolean Series bool_ser.
The best way to answer these sorts of questions is to build up the solution one step at a time. Here is the original DataFrame.
df
| | Name | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|
0 | NYA | 12/31/65 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 0.0 |
1 | NYA | 1/3/66 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 0.0 |
2 | NYA | 1/4/66 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 0.0 |
3 | NYA | 1/5/66 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 0.0 |
4 | NYA | 1/6/66 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112452 | N100 | 5/27/21 | 1241.119995 | 1251.910034 | 1241.119995 | 1247.069946 | 1247.069946 | 379696400.0 |
112453 | N100 | 5/28/21 | 1249.469971 | 1259.209961 | 1249.030029 | 1256.599976 | 1256.599976 | 160773400.0 |
112454 | N100 | 5/31/21 | 1256.079956 | 1258.880005 | 1248.140015 | 1248.930054 | 1248.930054 | 91173700.0 |
112455 | N100 | 6/1/21 | 1254.609985 | 1265.660034 | 1254.609985 | 1258.579956 | 1258.579956 | 155179900.0 |
112456 | N100 | 6/2/21 | 1258.489990 | 1263.709961 | 1258.239990 | 1263.619995 | 1263.619995 | 148465000.0 |
112457 rows × 8 columns
Here is the result of calling the isna method. We see False because there weren’t any missing values in the portion of the DataFrame we could see.
df.isna()
| | Name | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112452 | False | False | False | False | False | False | False | False |
112453 | False | False | False | False | False | False | False | False |
112454 | False | False | False | False | False | False | False | False |
112455 | False | False | False | False | False | False | False | False |
112456 | False | False | False | False | False | False | False | False |
112457 rows × 8 columns
We now go from the pandas DataFrame we see above to a pandas Series. The axis=0 says that we should plug in one full column at a time. (We may talk later about a more general way to understand this axis keyword argument. For now, just memorize that axis=0 means to plug in a full column at a time.)
df.isna().any(axis=0)
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
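A closely related pattern, which goes beyond what the exercise asks for, is to count the missing values in each column rather than just checking whether any exist. Here is a sketch (we didn’t run this in class):

# Number of missing values in each column (0 means the column is complete)
df.isna().sum(axis=0)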
If this any method is difficult to understand, maybe it will be clearer on a smaller DataFrame.
df_small = pd.DataFrame([[False, True], [False, False]], columns=["A","B"])
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
There was a question about data types of columns. That could mean a few different things. The entries in the columns are of type bool.
df_small.dtypes
A bool
B bool
dtype: object
df_small.columns
Index(['A', 'B'], dtype='object')
The names of the columns are of type str, for string.
[type(col) for col in df_small.columns]
[str, str]
The full column is a pandas Series.
type(df_small["A"])
pandas.core.series.Series
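We can also ask for the dtype of a single column directly; for this small DataFrame it should again be bool. A quick sketch:

# The dtype of a single column (a pandas Series)
df_small["A"].dtype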
Back to the original topic, which was about how this any method works.
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
Here we go through each column, one at a time, and check if it contains any True values. Notice how we don’t use isna in this case. The isna was to get Boolean values, but here we already have Boolean values.
df_small.any(axis=0)
A False
B True
dtype: bool
Equally useful to the any method is the all method. In this case, we get False for the values, because no column is entirely True.
df_small.all(axis=0)
A False
B False
dtype: bool
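If it helps, here is a rough translation of these two calls into Python’s built-in any and all functions using a list comprehension. This is only a sketch for intuition, not how we would normally write it in pandas.

# Column-by-column version of df_small.any(axis=0); should give [False, True]
[any(df_small[c]) for c in df_small.columns]

# Column-by-column version of df_small.all(axis=0); should give [False, False]
[all(df_small[c]) for c in df_small.columns]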
If we use axis=1, then we are plugging in a full row at a time. (Notice that the rows are labeled as 0 and 1, whereas we specified when we created df_small that the columns should be labeled as "A" and "B".)
df_small.any(axis=1)
0 True
1 False
dtype: bool
df_small
| | A | B |
|---|---|---|
0 | False | True |
1 | False | False |
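As an aside, this axis=1 version combines naturally with Boolean indexing on rows. For example, something like the following sketch should keep only the rows of df that are missing at least one value (we did not run this in class):

# Rows of df with at least one missing value
df[df.isna().any(axis=1)]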
Using list comprehension and bool_ser.index, make a list of all the column names which contain missing data.
Here is a construction of bool_ser. This is a pandas Series whose index contains the names of the columns and whose values are True or False, depending on whether that column had missing data or not.
bool_ser = df.isna().any(axis=0)
bool_ser
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
Here is an example of list comprehension. This just creates a list with every column name.
[c for c in bool_ser.index]
['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
We only want the names of the columns which have missing data, so we add an if condition. The following is correct, but it is not good style, because bool_ser[c] == True is the exact same as bool_ser[c] itself.
[c for c in bool_ser.index if bool_ser[c] == True]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Here is the more elegant version.
# Better style
[c for c in bool_ser.index if bool_ser[c]]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Find the same columns using Boolean indexing. (If you really want it to be a list, wrap the answer in the list constructor.)
Here is a reminder of what bool_ser looked like. This is a good candidate for Boolean indexing, because it is a Boolean Series.
bool_ser
Name False
Date False
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
Here we use Boolean indexing to keep only the key/value pairs for which the value is True.
# Boolean indexing, both the keys and the values
bool_ser[bool_ser]
Open True
High True
Low True
Close True
Adj Close True
Volume True
dtype: bool
If we only care about the left-hand terms (the index, which I also call the keys), we can use the index attribute.
# Boolean indexing then index (only keep the keys, only keep the index)
bool_ser[bool_ser].index
Index(['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
If that’s confusing, maybe this example will be clearer, because the values look more interesting.
test_series = pd.Series({"a": 3, "c": 10, "g": -5})
test_series
a 3
c 10
g -5
dtype: int64
Notice how the values (3, 10, -5) disappear when we ask for the index.
test_series.index
Index(['a', 'c', 'g'], dtype='object')
I wasn’t totally sure this would work, because I don’t often need to remove the keys, but we can get the values by accessing the values attribute. (I also thought it might need parentheses, like a method call, but it doesn’t.)
test_series.values
array([ 3, 10, -5])
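As a side note, my understanding is that the pandas documentation now recommends the to_numpy method over the values attribute; it should return the same NumPy array here.

# An equivalent way to get the values as a NumPy array
test_series.to_numpy()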
Notice how bool_ser[bool_ser].index was not a list. If you really want it to be a list (there isn’t much advantage to making it a list), then you can wrap it in the list constructor function.
# If you insist on it being a list
list(bool_ser[bool_ser].index)
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
Make the same list using a list comprehension of the form
[c for c in df.columns if ???]
Here we don’t need an axis keyword argument in the any method, because df[c].isna() is a pandas Series, which has only one dimension, so there is no dimension to specify when we call any. (There are definitely other ways to do this. Probably df.isna().any(axis=0)[c] would also work.)
[c for c in df.columns if df[c].isna().any()]
['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']
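Yet another option, sketched here for comparison, is to use Boolean indexing directly on df.columns, since df.columns is an Index and supports Boolean indexing just like a Series does.

# Boolean indexing applied directly to the columns Index
df.columns[df.isna().any(axis=0)]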
Boolean indexing vs groupby#
Here is the other DataFrame we loaded.
df_info
| | Region | Exchange | Index | Currency |
|---|---|---|---|---|
0 | United States | New York Stock Exchange | NYA | USD |
1 | United States | NASDAQ | IXIC | USD |
2 | Hong Kong | Hong Kong Stock Exchange | HSI | HKD |
3 | China | Shanghai Stock Exchange | 000001.SS | CNY |
4 | Japan | Tokyo Stock Exchange | N225 | JPY |
5 | Europe | Euronext | N100 | EUR |
6 | China | Shenzhen Stock Exchange | 399001.SZ | CNY |
7 | Canada | Toronto Stock Exchange | GSPTSE | CAD |
8 | India | National Stock Exchange of India | NSEI | INR |
9 | Germany | Frankfurt Stock Exchange | GDAXI | EUR |
10 | Korea | Korea Exchange | KS11 | KRW |
11 | Switzerland | SIX Swiss Exchange | SSMI | CHF |
12 | Taiwan | Taiwan Stock Exchange | TWII | TWD |
13 | South Africa | Johannesburg Stock Exchange | J203.JO | ZAR |
Using a for loop, Boolean indexing, and an f-string (but no groupby), for each currency in df_info, print how many rows in the dataset use that currency.
Notice how there are some repetitions.
for cur in df_info["Currency"]:
print(cur)
USD
USD
HKD
CNY
JPY
EUR
CNY
CAD
INR
EUR
KRW
CHF
TWD
ZAR
We can use df_info["Currency"].unique() if we don’t want repeated currencies.
for cur in df_info["Currency"].unique():
print(cur)
USD
HKD
CNY
JPY
EUR
CAD
INR
KRW
CHF
TWD
ZAR
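Related sketch: if all we want is how many distinct currencies there are, the nunique method returns that count directly.

# Number of distinct currencies (the length of the unique array above)
df_info["Currency"].nunique()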
The following is similar to what you are asked to do in Worksheet 3. (In Worksheet 3, you are asked to provide more information than just the number of rows.)
# Hint for the end of Worksheet 3
for cur in df_info["Currency"].unique():
df_sub = df_info[df_info["Currency"] == cur]
print(f"The currency is {cur} and the number of rows is {len(df_sub)}")
The currency is USD and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is JPY and the number of rows is 1
The currency is EUR and the number of rows is 2
The currency is CAD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is ZAR and the number of rows is 1
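For comparison (this is not what the exercise asks for), pandas has a built-in shortcut for exactly this kind of counting, the value_counts method. A sketch:

# How many rows use each currency, sorted from most common to least common
df_info["Currency"].value_counts()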
Print the same information, again using a for loop, this time using a pandas GroupBy object and its count method.
The object df_info.groupby("Currency") by itself is not very useful.
df_info.groupby("Currency")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f13b0a05310>
We need to do something with this GroupBy object, like iterate over it using a for loop, or, as in the following example, call the count method. This tells us how many rows are in each group. (More precisely, count reports the number of non-missing values in each column for each group; since df_info has no missing values, that is the same as the number of rows here.)
df_info.groupby("Currency").count()
| Currency | Region | Exchange | Index |
|---|---|---|---|
CAD | 1 | 1 | 1 |
CHF | 1 | 1 | 1 |
CNY | 2 | 2 | 2 |
EUR | 2 | 2 | 2 |
HKD | 1 | 1 | 1 |
INR | 1 | 1 | 1 |
JPY | 1 | 1 | 1 |
KRW | 1 | 1 | 1 |
TWD | 1 | 1 | 1 |
USD | 2 | 2 | 2 |
ZAR | 1 | 1 | 1 |
Notice how df_info.groupby("Currency").count() is a pandas DataFrame. So we can get the “Region” column from that DataFrame just like usual, by using square brackets.
gb_ser = df_info.groupby("Currency").count()["Region"]
gb_ser
Currency
CAD 1
CHF 1
CNY 2
EUR 2
HKD 1
INR 1
JPY 1
KRW 1
TWD 1
USD 2
ZAR 1
Name: Region, dtype: int64
This gb_ser pandas Series has all the information we want.
for cur in gb_ser.index:
    print(f"The currency is {cur} and the number of rows is {gb_ser[cur]}")
The currency is CAD and the number of rows is 1
The currency is CHF and the number of rows is 1
The currency is CNY and the number of rows is 2
The currency is EUR and the number of rows is 2
The currency is HKD and the number of rows is 1
The currency is INR and the number of rows is 1
The currency is JPY and the number of rows is 1
The currency is KRW and the number of rows is 1
The currency is TWD and the number of rows is 1
The currency is USD and the number of rows is 2
The currency is ZAR and the number of rows is 1
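Another option, sketched here, is the size method of the GroupBy object, which should report the number of rows in each group without our having to single out a column first.

# Number of rows per currency, without selecting a specific column
df_info.groupby("Currency").size()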
Print the same information, this time iterating over the GroupBy object as follows.
for cur, df_sub in ???:
    print(???)
We didn’t get here!
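For reference, here is a sketch of one way the blanks might be filled in. Iterating over a GroupBy object yields pairs consisting of the group key and the corresponding sub-DataFrame.

# Sketch: iterating over the GroupBy object yields (key, sub-DataFrame) pairs
for cur, df_sub in df_info.groupby("Currency"):
    print(f"The currency is {cur} and the number of rows is {len(df_sub)}")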