Week 1 Monday#

Announcements#

  • My office hours this quarter will be Monday 11am ALP 3610 (next door), Wednesday 1pm ALP 3600, and by appointment. I can’t reserve these computer labs for office hours; I’ll send an email to announce any last-minute changes to the location.

  • Worksheet 0 is due Tuesday before discussion section. (Need groupmates? Today’s class is a good time to find them!)

  • Videos and video quizzes for this week have been posted. They are due before lecture on Friday. The video quizzes are open book, open computer, and can be taken two times. It’s fine to discuss them with classmates.

  • Today will be about half lecture and half time to work on Worksheet 1. One of our two Learning Assistants, Alicia, is here to help during the work time.

  • Usually Tuesday discussion will have an in-class quiz, but not this week. Some new material will be introduced in Tuesday discussion this week, so be sure to attend!

import pandas as pd
df = pd.read_csv("vend.csv")

Warm-up: Two useful pandas Series methods#

Reminder: If we want to get a column named "Product" out of a pandas DataFrame, we can use the following syntax. The result is a pandas Series.

df["Product"]
0       Red Bull - Energy Drink - Sugar Free
1       Red Bull - Energy Drink - Sugar Free
2           Takis - Hot Chilli Pepper & Lime
3           Takis - Hot Chilli Pepper & Lime
4       Red Bull - Energy Drink - Sugar Free
                        ...                 
6440            Lindens - Chocolate Chippers
6441          Wonderful Pistachios - Variety
6442          Hungry Buddha - Chocolate Chip
6443                     Snapple Tea - Lemon
6444                Goldfish Baked - Cheddar
Name: Product, Length: 6445, dtype: object

Here is the first of the two pandas Series methods we want to introduce. (Think of a method like a function, but it’s a function that is attached to a specific object. In this case, the value_counts method is attached to our pandas Series.) This value_counts method tells us how often each value occurs.

df["Product"].value_counts()
Coca Cola - Zero Sugar                           517
Monster Energy Original                          385
Poland Springs Water                             300
KitKat - Crisp Wafers                            260
Sunkist Soda - Orange                            217
                                                ... 
Nature's Garden Trail Mix - Heart Healthy Mix      1
Nature's Garden Trail Mix - Omega 3                1
Larabar - Peanut Butter chocolate Chip             1
Nature's Valley Crunchy Granola Bar - Oats &       1
Nature's Valley Chewy Granola Bar - Fruit & N      1
Name: Product, Length: 170, dtype: int64

We can also tell how many items appear in the “Product” column, 170, by looking at the length of the above Series. If all you care about is what items occur, and you don’t care about how often they occur, then it makes sense to use a different pandas Series method, unique.

df["Product"].unique()
array(['Red Bull - Energy Drink - Sugar Free',
       'Takis - Hot Chilli Pepper & Lime',
       'Bai Antioxidant - Brasilia BB',
       "Miss Vickie's Potato Chip - Sea Salt & Vinega",
       "Miss Vickie's Potato Chip - Lime & Cracked Pe",
       'Monster Energy Original',
       'Seapoint Farms Dry Roasted Edamame - Wasabi',
       'Snapple Diet Tea - Lemon', 'Skinny Pop Popcorn',
       'Stretch Island Fruit Leathers Snacks - Variet',
       'KitKat - Crisp Wafers', 'Wonderful Pistachios - Variety',
       'Spindrift - Sparkling Water  Lime',
       "Miss Vickie's Potato Chip - Sea Salt Original",
       'Jacks Links Bar Beef Strip',
       'Vitamin Water - XXX Acai BB Pomegranate',
       'Cheetos - Fleming Hot Crunchy',
       'Pure Organic Fruit Bar - Strawberry Banana', 'Oreo Mini',
       'Nutter Butter Bites', 'Jack Links Beef Steak Original',
       "Robert Irvine's - Fit Crunch -  Chocolate Pea",
       'Good Health Veggie Stix - Zesty Ranch',
       'Cheetos Baked - Flaming Hot',
       'S. Pellegrino Sparkling Mineral Water', 'Mini Chips Ahoy',
       'Newtons Fig Bar', 'SunChips Multigrain - Salsa',
       'Bubly - Grapefruit', 'Poland Springs Water',
       'Chum Fruit Bites - Mango/Strawberry',
       'Snapple Diet Tea - Raspberry',
       'Zevia Zero Calorie Energy - Mango Ginger',
       'V8 Vegetable Juice - Original', 'BodyArmor LYTE  - Peach Mango',
       'SunChips Multigrain - Harvest Cheddar',
       'Good Health Veggie Stix - Sea Salt',
       'Snapple Diet Tea - Peach Tea', 'Keto Krisp - Almond/Chocolate',
       'Larabar - Peanut Butter chocolate Chip',
       'Bai Antioxidant - Molokai Coconut', 'Monster Energy  Zero Ultra',
       'Chesters Fries Flaming hot',
       "Miss Vickie's Potato Chip - Jalapeno", 'Coca Cola - Zero Sugar',
       'Red Bull - Original', 'Ruffles Baked  - Cheddar & Sour Cream',
       'Ritz Bits Cheese', 'CheezIt - Original', 'Takis FUGEO mini',
       'BodyArmor LYTE  - Berry Punch',
       "Miss Vickie's Potato Chip - Smokehouse BBQ",
       'Vitamin Water - Ice Cool Blue Lavender',
       'CheezIt - White Cheddar',
       "Nature's Valley Crunchy Granola Bar - Oats &",
       'Lays Baked - Original', 'Bai Antioxidant - Zambia Bingcherry',
       'Bubly - Cherry', 'Belvita Breakfast Cookies - Blueberry',
       'Spindrift - Sparkling Water Raspberry Lime',
       'SunChips Multigrain - Original', 'Popchips Potato - Barbeque',
       "Snyder's of Hanover Pretzel - Cheddar Cheese",
       'Pure Organic Pineapple Passionfruit',
       'Keto Krisp - Almond Butter',
       "Nature's Valley Chewy Granola Bar - Fruit & N",
       'Vitamin Water - Squeezed Lemonade', 'Popchips Potato - Sea Salt',
       'CheezIt - Double Cheese', 'Lays Baked - Barbecue',
       'Sunkist Soda - Orange', 'Funyuns - Flaming Hot',
       'Munchies Snack mix Flaming Hot',
       'NuGo Dark Chocolate - Choc Chip', 'Oreo Single Server 6 ct',
       'Cheetos Flaming Hot Limon Crunchy',
       'Oberto Specialty Meats - Original', 'Doritos Famin Hot Nacho',
       'Kinder - Bueno - Crispy Creamy Chocolate',
       'Belvita Snack Packs - Blueberry',
       'RXBAR - Protein Bar - Peanut Butter', 'Snapple Tea - Raspberry',
       'KIND Bar - Dark Choc Almond SeaSalt', 'Fritos - Original',
       'NuGo Dark Chocolate - Pretzel Sea Salt',
       'Cheetos Baked - Crunchy Cheese',
       'CheezIt - Cheddar Sour Cream and Onion',
       'RXBAR - Protein Bar - Chocolate Sea Salt',
       'Starbucks Doubleshot Energy - Mocha', 'Bubly - Blackberry',
       'Doritos Nacho Cheese', 'Cheetos - Crunchy',
       'Spindrift - Sparkling Water - Pineapple', 'Iberia Coconut Water',
       'Doritos Cool Ranch', 'Lays - Classic',
       'Genius Keto Bar - Chocolate Dream', nan,
       'Lenny & Larrys - SnickerDoodle',
       'Autumns Granola Bar - Cinnamon Almond',
       'SunChips Multigrain - French Onion',
       'Bai Antioxidant - Kula Watermelon', 'Honest Tea - Berry Hibiscus',
       'Hungry Buddha - Chocolate Chip', 'Remedy Kombucha - Ginger Lemon',
       'Doritos Dinamita Chile Lemon',
       'Starbucks Refresher - Real Coconut Water',
       'Hungry Buddha - Coconut Cocoa',
       'Made in Nature - Mangoes Sweet & Tangy',
       'Belvita Snack Packs - Cinnamon',
       'Keto Bar - Creamy Peanut Butter Chocolate',
       'Skippy Peanut Butter - Go Paks',
       'Belvita Snack Bites - Cinnamon BS Go Paks',
       'Pop Corners - Kettle Corn',
       'Bai Antioxidant - Kupang strawberry kiwi',
       'Quaker Caramel Rice Crisps', 'Belvita Snack Packs - Chocolate',
       'Pop Corners - Sea Salt', 'Pop Corners - Spicy Queso',
       'Quaker Apple Cinnamon Rice Crisps',
       'Bai Antioxidant - Tanzania Lemon Super Tea',
       'Quaker ButterMilk Ranch Rice Crisps',
       'Bai Lemonade - Burundi Blueberry', 'Pop Corners - White Cheddar',
       'Bubly - Mango', 'Nutri Grain Apple Cinnamon',
       'Quaker Cheddar Rice Crisps', 'Cheetos Puffs',
       'Cheetos Cheddar Jalapeno', 'Bubly - Lime',
       "Nature's Garden Trail Mix - Cranberry Healthy",
       'BodyArmor LYTE - Strawberry Lemonade',
       'Keep Healthy - Chocolate Fudge', 'CheezIt - Go Paks',
       'Lenny & Larrys - Chocolate Chip',
       "Snyder's of Hanover Pretzel Pieces - Honey Mu",
       'Belvita Breakfast Biscuit Bites - Blueberry',
       "Snyder's of Hanover Pretzel Pieces - Hot Buff",
       'Ice Breakers Ice Cubes - PepperMint', 'Goldfish Baked - Cheddar',
       'Doritos Spicy Nacho', 'Pop Corners - Sweet & Salty Kettle',
       'Mini Chips Ahoy - Go Paks', 'Oreo Mini - Go Paks',
       'Teddy Grahams - Go Paks',
       'BodyArmor  LYTE - Blueberry Pomegranate',
       'Starbucks Doubleshot Energy - Coffee',
       "Snyder's of Hanover Pretzel Pieces - Cheddar",
       "Nature's Bakery Fig Bar - Blueberry",
       "Nature's Bakery Fig Bar - Raspberry",
       'Canada Dry - Ginger Ale & Lemonde', 'Canada Dry - Ginger Ale',
       'Snapple Tea - Lemon', 'Popchips Potato - Sour Cream & Onion',
       'Pringles - Cheddar Cheese', 'Pringles - Sour Cream & Onion',
       'Coca Cola - Regular', 'Snapple Tea - Peach',
       'Zone Perfect - Chocolate Chip Cookie Dough',
       "Nature's Garden Trail Mix - Omega 3",
       'S. Pellegrino Essenza Mineral Water - Blood O',
       'Snapple Lemonade - black berry', 'Snapple Lemonade - watermelon',
       'Snapple Lemonade - pineapple', 'Zevia Organic Tea - Peach',
       "Nature's Garden Trail Mix - Heart Healthy Mix",
       'Keto Bar - Chocolate Dream', 'Lindens - Chocolate Chippers',
       'Pure Organic Fruit Bar - Raspberry Lemonade',
       'TruBar - Cookie Dough', 'TruBar - Daydreaming about Donuts'],
      dtype=object)

Here we check how many items occur, by calling the len function on df["Product"].unique(). Notice how len goes at the front of this expression; that’s why we’re calling len a function rather than a method.

len(df["Product"].unique())
171

Notice how this answer, 171, is one more than our length of 170 for the value_counts output. This is presumably because the unique method includes the possibility of a missing value, and the value_counts method does not. (A student later pointed out to me that the unique output does include nan, which stands for “not a number”, and represents missing values.)

Indexing for pandas Series#

Indexing for pandas Series and especially for pandas DataFrames takes some getting used to. What various types of indexing represents is largely something that needs to be memorized.

Here is a reminder of what the top three rows in our DataFrame look like.

df.head(3)
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022

Here is the code we used last week to get the zeroth element (remember that counting in Python starts with zero) out of the “RPrice” column.

# Get the zeroth element out of the RPrice column
df["RPrice"].iloc[0]
3.5

It’s very reasonable to ask why we used this strange .iloc[0] syntax, when some similar versions also work in this case. For example, .loc[0] also works in this case.

df["RPrice"].loc[0]
3.5

In fact, even [0] also works in this case.

df["RPrice"][0]
3.5

The .iloc[0] version is really the correct version, and the other two only work by coincidence. Let’s look more closely at this pandas Series (this column from the DataFrame) df["RPrice"]. Notice there are seemingly two columns of numbers. The left-hand column shows what’s called the index of the pandas Series, and the right-hand column (which should be thought of as the more important column) contains the values of the pandas Series. The elements in the index should be thought of as labels. For example, 0 is the label for the float 3.5 at the top.

df["RPrice"]
0       3.5
1       3.5
2       1.5
3       1.5
4       3.5
       ... 
6440    2.0
6441    2.0
6442    2.0
6443    2.5
6444    1.5
Name: RPrice, Length: 6445, dtype: float64

Let’s try sorting these prices, from biggest to smallest. We use the sort_values method, which by default sorts the values from smallest to biggest, so we include what’s called a keyword argument ascending=False to tell pandas to instead sort them in decreasing order.

s = df["RPrice"].sort_values(ascending=False)

Notice how the index for the new Series s seems to be in a scrambled order now. That is because the index has not been sorted, only the prices.

s
5704    5.0
4680    5.0
3275    5.0
4134    5.0
2889    5.0
       ... 
1491    1.0
1837    1.0
1054    1.0
340     1.0
2088    1.0
Name: RPrice, Length: 6445, dtype: float64

The 3.5 we saw earlier at the top still has the same label 0, and if we call s[0], we do not get the top element in s, instead we get the element (or theoretically elements) with label 0.

s[0]
3.5

The same goes for loc. Think of loc as indexing by label and think of iloc as indexing by integer position.

s.loc[0]
3.5

For this Series s, to get the top value, the indexing we need to use is iloc.

s.iloc[0]
5.0

As an aside, if you need to access the labels directly, it is stored in the index attribute of the pandas Series.

s.index
Int64Index([5704, 4680, 3275, 4134, 2889, 2635, 4152, 4868, 5795, 3191,
            ...
            1829, 1058, 1492, 1831, 1832, 1491, 1837, 1054,  340, 2088],
           dtype='int64', length=6445)

We’ll see the exact same loc vs iloc distinction in the context of pandas DataFrames. There also are some additional ways to index on DataFrames, related to the fact that DataFrames are two-dimensional objects, as opposed to Series which are one-dimensional objects.

Indexing for pandas DataFrames#

The loc and iloc indexing works basically the same for pandas DataFrames. Look down a few cells at our df DataFrame. If we go to row 2 (counting from zero) and column 4 (counting from zero), do you agree that we reach the value 'Takis - Hot Chilli Pepper & Lime'? (Do not include the column names when you start counting, that should be considered as a header, and do not include the index (0, 1, 2, …) at the left-most side when counting.)

df.iloc[2,4]
'Takis - Hot Chilli Pepper & Lime'

If we instead want to index using the labels instead of the integer positions, then we use loc instead of iloc. The row value stays the same since it is both the label and the integer position (that was what caused the confusion above with the df["RPrice"] Series), but now the column value changes from the integer 4 to the string "Product".

df.loc[2,"Product"]
'Takis - Hot Chilli Pepper & Lime'
df
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6440 Processed VJ300320692 EB Public Library EB Public Library x1380 Lindens - Chocolate Chippers Food 15603201222 Wednesday, August 31, 2022 Credit 122 2.0 1 122 2.0 1 2.0 6.0 8/31/2022
6441 Processed VJ300320692 EB Public Library EB Public Library x1380 Wonderful Pistachios - Variety Food 15603201222 Wednesday, August 31, 2022 Credit 131 2.0 1 131 2.0 1 2.0 6.0 8/31/2022
6442 Processed VJ300320692 EB Public Library EB Public Library x1380 Hungry Buddha - Chocolate Chip Food 15603201222 Wednesday, August 31, 2022 Credit 137 2.0 1 137 2.0 1 2.0 6.0 8/31/2022
6443 Processed VJ300320609 GuttenPlans GuttenPlans x1367 Snapple Tea - Lemon Non Carbonated 15603853105 Wednesday, August 31, 2022 Credit 145 2.5 1 145 2.5 1 2.5 2.5 8/31/2022
6444 Processed VJ300320692 EB Public Library EB Public Library x1380 Goldfish Baked - Cheddar Food 15603921383 Wednesday, August 31, 2022 Cash 125 1.5 1 125 1.5 1 1.5 1.5 8/31/2022

6445 rows × 18 columns

At the very top of the DataFrame, we see the column names displayed. It’s often useful to have access to these column names, and they are stored in the DataFrame’s columns attribute. (Aside. Notice that this df.columns is not a list nor a pandas Series nor any other data type we have met before. Python is filled with many special-purpose types of objects. This df.columns is a pandas Index object, but I don’t think we will use any special features of this type of object in Math 10, so we will mostly ignore it.)

df.columns
Index(['Status', 'Device ID', 'Location', 'Machine', 'Product', 'Category',
       'Transaction', 'TransDate', 'Type', 'RCoil', 'RPrice', 'RQty', 'MCoil',
       'MPrice', 'MQty', 'LineTotal', 'TransTotal', 'Prcd Date'],
      dtype='object')

At the very left-hand side of the DataFrame, the row labels are displayed. These labels are stored in the index attribute. (Aside. I expected when I evaluated this to see something like Index([0, 1, 2, 3, ..., 6443, 6445]), but instead we see RangeIndex(start=0, stop=6445, step=1). The distinction is not important for us; presumably the one used by pandas in this case is more memory efficient, since it does not need to store all the explicit values.)

These row labels are important, but we won’t use them quite as often as we will use the column labels.

df.index
RangeIndex(start=0, stop=6445, step=1)

Let’s see some more types of indexing for pandas DataFrames. There is no way to know how this works in advance. Some of these conventions seem a little contradictory to me (the fact that some types of indexing access columns and some types of indexing access rows), and it’s just something you need to memorize.

If you want to access a sub-DataFrame containing only certain columns, you can pass a list of those column names. Repetitions are fine. Notice how we have two pairs of square brackets. The outer pair starts the indexing, and the inner pair creates a Python list.

df[["RPrice", "RPrice", "Location"]]
RPrice RPrice Location
0 3.5 3.5 Brunswick Sq Mall
1 3.5 3.5 Brunswick Sq Mall
2 1.5 1.5 Brunswick Sq Mall
3 1.5 1.5 Brunswick Sq Mall
4 3.5 3.5 Brunswick Sq Mall
... ... ... ...
6440 2.0 2.0 EB Public Library
6441 2.0 2.0 EB Public Library
6442 2.0 2.0 EB Public Library
6443 2.5 2.5 GuttenPlans
6444 1.5 1.5 EB Public Library

6445 rows × 3 columns

Here is an example of what is called slicing. Notice how this accesses rows, not columns. The following is short-hand for “get the first 7 rows”.

df[:7]
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
5 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Bai Antioxidant - Brasilia BB Non Carbonated 14517568743 Sunday, January 2, 2022 Cash 146 2.5 1 146 2.5 1 2.5 2.5 1/2/2022
6 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Miss Vickie's Potato Chip - Sea Salt & Vinega Food 14518731524 Monday, January 3, 2022 Cash 114 1.5 1 114 1.5 1 1.5 1.5 1/2/2022

Above when we wrote df[:7], that was an abbreviation for df[0:7], so the following returns the exact same sub-DataFrame. Related to the fact that numbering in Python starts at zero, when we specify endpoints in Python, usually the left endpoint is included but the right endpoint is not included. In this case, the zeroth row is included but the seventh row (i.e., the row at integer position 7, it would also be reasonable to call this the eighth row) is not.

df[0:7]
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
5 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Bai Antioxidant - Brasilia BB Non Carbonated 14517568743 Sunday, January 2, 2022 Cash 146 2.5 1 146 2.5 1 2.5 2.5 1/2/2022
6 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Miss Vickie's Potato Chip - Sea Salt & Vinega Food 14518731524 Monday, January 3, 2022 Cash 114 1.5 1 114 1.5 1 1.5 1.5 1/2/2022

Let’s confirm that we really are getting seven rows. If you call the function len on a pandas DataFrame, you get the number of rows. (That’s different from the Matlab convention. In Matlab, calling length returns the length of the longest dimension.)

len(df[:7])
7

Another way to find the number of rows is to first get the tuple representing the shape of the DataFrame.

df[:7].shape
(7, 18)

And then get the zeroth element out of this tuple, which represents the number of rows. (If instead we used .shape[1], that would get the number of columns.)

df[:7].shape[0]
7

Boolean indexing#

The most powerful type of indexing for us will be what is called Boolean indexing. Let’s see an example on a smaller DataFrame. Here we make df2 by hand, instead of by reading in the contents of a csv file. This isn’t the most important part of this section, but I’ll just point out that we are passing a dictionary to the DataFrame constructor function, pd.DataFrame. The keys in this dictionary will be the column names, and the values in this dictionary will be the values in the columns. There are many other things we could pass to pd.DataFrame instead of a dictionary.

df2 = pd.DataFrame({
    "A": [5,1,6,1,2,3,8],
    "B": [2,4,2,8,3,-1,0],
    "C": [5, 1.1, 3.4, 5.5, 2.1, 4.2, -3.5]
})

Here is the resulting DataFrame, df2.

df2
A B C
0 5 2 5.0
1 1 4 1.1
2 6 2 3.4
3 1 8 5.5
4 2 3 2.1
5 3 -1 4.2
6 8 0 -3.5

The point of this section is not constructing DataFrames but instead the point is how to use Boolean indexing. Before we get to Boolean indexing, here is an example of a Boolean Series. It represents, in which positions is the value in column “A” strictly greater than 4?

df2["A"] > 4
0     True
1    False
2     True
3    False
4    False
5    False
6     True
Name: A, dtype: bool

Here is an example of Boolean indexing. Inside of the square brackets is the exact Boolean Series we saw above. We are forming a sub-DataFrame consisting of all the rows in which the Boolean Series is True. In other words, we are forming the sub-DataFrame of df2 consisting of all rows where the value in the “A” column is strictly greater than 4.

It is worth looking at this and the next example slowly, until you understand how they work.

df2[df2["A"] > 4]
A B C
0 5 2 5.0
2 6 2 3.4
6 8 0 -3.5

We can use the same type of Boolean indexing with a Series instead of a DataFrame on the outside. Here we are getting all the values in the “B” column corresponding to rows where the value in the “A” column is strictly greater than 4.

df2["B"][df2["A"] > 4]
0    2
2    2
6    0
Name: B, dtype: int64

Here is a reminder of how the original DataFrame looked. Try to convince yourself that the pandas Series output by the above cell is exactly the sub-Series of column “B” corresponding to rows where the “A” column value is strictly greater than 4.

df2
A B C
0 5 2 5.0
1 1 4 1.1
2 6 2 3.4
3 1 8 5.5
4 2 3 2.1
5 3 -1 4.2
6 8 0 -3.5

Let’s see a more advanced example, still using Boolean indexing. Here is the goal. (This will help with a problem on the corresponding worksheet.)

  • Find the sub-DataFrame where the value in column “C” is strictly greater than 2 and where the value in column “A” column is not equal to 1.

Our strategy will be to make one Boolean Series for the column “C” restriction, then make another Boolean Series for the column “A” restriction, and then to combine them.

df2["C"] > 2
0     True
1    False
2     True
3     True
4     True
5     True
6    False
Name: C, dtype: bool

To check “not equal” in Python, we can use !=.

df2["A"] != 1
0     True
1    False
2     True
3    False
4     True
5     True
6     True
Name: A, dtype: bool

We now create the Boolean Series where the previous two Boolean Series are both True. If you’ve done some Python programming before, you might expect to use the word and, and that is correct in base Python, but in pandas (and in NumPy) usually we want instead to use the ampersand symbol, &.

# Warning: in pandas, use &, in base Python, use and
(df2["C"] > 2) & (df2["A"] != 1)
0     True
1    False
2     True
3    False
4     True
5     True
6    False
dtype: bool

We now use Boolean indexing with the Boolean Series we just created. The following looks complicated, but the part inside the brackets is exactly the above code.

df2[(df2["C"] > 2) & (df2["A"] != 1)]
A B C
0 5 2 5.0
2 6 2 3.4
4 2 3 2.1
5 3 -1 4.2

The above output represents the sub-DataFrame of df2 consisting of all rows for which the “C” value is strictly greater than 2 and for which the “A” value is not equal to 1.