Week 1 Monday
Contents
Week 1 Monday#
Announcements#
My office hours this quarter will be Monday 11am ALP 3610 (next door), Wednesday 1pm ALP 3600, and by appointment. I can’t reserve these computer labs for office hours; I’ll send an email to announce any last-minute changes to the location.
Worksheet 0 is due Tuesday before discussion section. (Need groupmates? Today’s class is a good time to find them!)
Videos and video quizzes for this week have been posted. They are due before lecture on Friday. The video quizzes are open book, open computer, and can be taken two times. It’s fine to discuss them with classmates.
Today will be about half lecture and half time to work on Worksheet 1. One of our two Learning Assistants, Alicia, is here to help during the work time.
Usually Tuesday discussion will have an in-class quiz, but not this week. Some new material will be introduced in Tuesday discussion this week, so be sure to attend!
import pandas as pd
df = pd.read_csv("vend.csv")
Warm-up: Two useful pandas Series methods#
Reminder: If we want to get a column named "Product"
out of a pandas DataFrame, we can use the following syntax. The result is a pandas Series.
df["Product"]
0 Red Bull - Energy Drink - Sugar Free
1 Red Bull - Energy Drink - Sugar Free
2 Takis - Hot Chilli Pepper & Lime
3 Takis - Hot Chilli Pepper & Lime
4 Red Bull - Energy Drink - Sugar Free
...
6440 Lindens - Chocolate Chippers
6441 Wonderful Pistachios - Variety
6442 Hungry Buddha - Chocolate Chip
6443 Snapple Tea - Lemon
6444 Goldfish Baked - Cheddar
Name: Product, Length: 6445, dtype: object
Here is the first of the two pandas Series methods we want to introduce. (Think of a method like a function, but it’s a function that is attached to a specific object. In this case, the value_counts
method is attached to our pandas Series.) This value_counts
method tells us how often each value occurs.
df["Product"].value_counts()
Coca Cola - Zero Sugar 517
Monster Energy Original 385
Poland Springs Water 300
KitKat - Crisp Wafers 260
Sunkist Soda - Orange 217
...
Nature's Garden Trail Mix - Heart Healthy Mix 1
Nature's Garden Trail Mix - Omega 3 1
Larabar - Peanut Butter chocolate Chip 1
Nature's Valley Crunchy Granola Bar - Oats & 1
Nature's Valley Chewy Granola Bar - Fruit & N 1
Name: Product, Length: 170, dtype: int64
We can also tell how many items appear in the “Product” column, 170, by looking at the length of the above Series. If all you care about is what items occur, and you don’t care about how often they occur, then it makes sense to use a different pandas Series method, unique
.
df["Product"].unique()
array(['Red Bull - Energy Drink - Sugar Free',
'Takis - Hot Chilli Pepper & Lime',
'Bai Antioxidant - Brasilia BB',
"Miss Vickie's Potato Chip - Sea Salt & Vinega",
"Miss Vickie's Potato Chip - Lime & Cracked Pe",
'Monster Energy Original',
'Seapoint Farms Dry Roasted Edamame - Wasabi',
'Snapple Diet Tea - Lemon', 'Skinny Pop Popcorn',
'Stretch Island Fruit Leathers Snacks - Variet',
'KitKat - Crisp Wafers', 'Wonderful Pistachios - Variety',
'Spindrift - Sparkling Water Lime',
"Miss Vickie's Potato Chip - Sea Salt Original",
'Jacks Links Bar Beef Strip',
'Vitamin Water - XXX Acai BB Pomegranate',
'Cheetos - Fleming Hot Crunchy',
'Pure Organic Fruit Bar - Strawberry Banana', 'Oreo Mini',
'Nutter Butter Bites', 'Jack Links Beef Steak Original',
"Robert Irvine's - Fit Crunch - Chocolate Pea",
'Good Health Veggie Stix - Zesty Ranch',
'Cheetos Baked - Flaming Hot',
'S. Pellegrino Sparkling Mineral Water', 'Mini Chips Ahoy',
'Newtons Fig Bar', 'SunChips Multigrain - Salsa',
'Bubly - Grapefruit', 'Poland Springs Water',
'Chum Fruit Bites - Mango/Strawberry',
'Snapple Diet Tea - Raspberry',
'Zevia Zero Calorie Energy - Mango Ginger',
'V8 Vegetable Juice - Original', 'BodyArmor LYTE - Peach Mango',
'SunChips Multigrain - Harvest Cheddar',
'Good Health Veggie Stix - Sea Salt',
'Snapple Diet Tea - Peach Tea', 'Keto Krisp - Almond/Chocolate',
'Larabar - Peanut Butter chocolate Chip',
'Bai Antioxidant - Molokai Coconut', 'Monster Energy Zero Ultra',
'Chesters Fries Flaming hot',
"Miss Vickie's Potato Chip - Jalapeno", 'Coca Cola - Zero Sugar',
'Red Bull - Original', 'Ruffles Baked - Cheddar & Sour Cream',
'Ritz Bits Cheese', 'CheezIt - Original', 'Takis FUGEO mini',
'BodyArmor LYTE - Berry Punch',
"Miss Vickie's Potato Chip - Smokehouse BBQ",
'Vitamin Water - Ice Cool Blue Lavender',
'CheezIt - White Cheddar',
"Nature's Valley Crunchy Granola Bar - Oats &",
'Lays Baked - Original', 'Bai Antioxidant - Zambia Bingcherry',
'Bubly - Cherry', 'Belvita Breakfast Cookies - Blueberry',
'Spindrift - Sparkling Water Raspberry Lime',
'SunChips Multigrain - Original', 'Popchips Potato - Barbeque',
"Snyder's of Hanover Pretzel - Cheddar Cheese",
'Pure Organic Pineapple Passionfruit',
'Keto Krisp - Almond Butter',
"Nature's Valley Chewy Granola Bar - Fruit & N",
'Vitamin Water - Squeezed Lemonade', 'Popchips Potato - Sea Salt',
'CheezIt - Double Cheese', 'Lays Baked - Barbecue',
'Sunkist Soda - Orange', 'Funyuns - Flaming Hot',
'Munchies Snack mix Flaming Hot',
'NuGo Dark Chocolate - Choc Chip', 'Oreo Single Server 6 ct',
'Cheetos Flaming Hot Limon Crunchy',
'Oberto Specialty Meats - Original', 'Doritos Famin Hot Nacho',
'Kinder - Bueno - Crispy Creamy Chocolate',
'Belvita Snack Packs - Blueberry',
'RXBAR - Protein Bar - Peanut Butter', 'Snapple Tea - Raspberry',
'KIND Bar - Dark Choc Almond SeaSalt', 'Fritos - Original',
'NuGo Dark Chocolate - Pretzel Sea Salt',
'Cheetos Baked - Crunchy Cheese',
'CheezIt - Cheddar Sour Cream and Onion',
'RXBAR - Protein Bar - Chocolate Sea Salt',
'Starbucks Doubleshot Energy - Mocha', 'Bubly - Blackberry',
'Doritos Nacho Cheese', 'Cheetos - Crunchy',
'Spindrift - Sparkling Water - Pineapple', 'Iberia Coconut Water',
'Doritos Cool Ranch', 'Lays - Classic',
'Genius Keto Bar - Chocolate Dream', nan,
'Lenny & Larrys - SnickerDoodle',
'Autumns Granola Bar - Cinnamon Almond',
'SunChips Multigrain - French Onion',
'Bai Antioxidant - Kula Watermelon', 'Honest Tea - Berry Hibiscus',
'Hungry Buddha - Chocolate Chip', 'Remedy Kombucha - Ginger Lemon',
'Doritos Dinamita Chile Lemon',
'Starbucks Refresher - Real Coconut Water',
'Hungry Buddha - Coconut Cocoa',
'Made in Nature - Mangoes Sweet & Tangy',
'Belvita Snack Packs - Cinnamon',
'Keto Bar - Creamy Peanut Butter Chocolate',
'Skippy Peanut Butter - Go Paks',
'Belvita Snack Bites - Cinnamon BS Go Paks',
'Pop Corners - Kettle Corn',
'Bai Antioxidant - Kupang strawberry kiwi',
'Quaker Caramel Rice Crisps', 'Belvita Snack Packs - Chocolate',
'Pop Corners - Sea Salt', 'Pop Corners - Spicy Queso',
'Quaker Apple Cinnamon Rice Crisps',
'Bai Antioxidant - Tanzania Lemon Super Tea',
'Quaker ButterMilk Ranch Rice Crisps',
'Bai Lemonade - Burundi Blueberry', 'Pop Corners - White Cheddar',
'Bubly - Mango', 'Nutri Grain Apple Cinnamon',
'Quaker Cheddar Rice Crisps', 'Cheetos Puffs',
'Cheetos Cheddar Jalapeno', 'Bubly - Lime',
"Nature's Garden Trail Mix - Cranberry Healthy",
'BodyArmor LYTE - Strawberry Lemonade',
'Keep Healthy - Chocolate Fudge', 'CheezIt - Go Paks',
'Lenny & Larrys - Chocolate Chip',
"Snyder's of Hanover Pretzel Pieces - Honey Mu",
'Belvita Breakfast Biscuit Bites - Blueberry',
"Snyder's of Hanover Pretzel Pieces - Hot Buff",
'Ice Breakers Ice Cubes - PepperMint', 'Goldfish Baked - Cheddar',
'Doritos Spicy Nacho', 'Pop Corners - Sweet & Salty Kettle',
'Mini Chips Ahoy - Go Paks', 'Oreo Mini - Go Paks',
'Teddy Grahams - Go Paks',
'BodyArmor LYTE - Blueberry Pomegranate',
'Starbucks Doubleshot Energy - Coffee',
"Snyder's of Hanover Pretzel Pieces - Cheddar",
"Nature's Bakery Fig Bar - Blueberry",
"Nature's Bakery Fig Bar - Raspberry",
'Canada Dry - Ginger Ale & Lemonde', 'Canada Dry - Ginger Ale',
'Snapple Tea - Lemon', 'Popchips Potato - Sour Cream & Onion',
'Pringles - Cheddar Cheese', 'Pringles - Sour Cream & Onion',
'Coca Cola - Regular', 'Snapple Tea - Peach',
'Zone Perfect - Chocolate Chip Cookie Dough',
"Nature's Garden Trail Mix - Omega 3",
'S. Pellegrino Essenza Mineral Water - Blood O',
'Snapple Lemonade - black berry', 'Snapple Lemonade - watermelon',
'Snapple Lemonade - pineapple', 'Zevia Organic Tea - Peach',
"Nature's Garden Trail Mix - Heart Healthy Mix",
'Keto Bar - Chocolate Dream', 'Lindens - Chocolate Chippers',
'Pure Organic Fruit Bar - Raspberry Lemonade',
'TruBar - Cookie Dough', 'TruBar - Daydreaming about Donuts'],
dtype=object)
Here we check how many items occur, by calling the len
function on df["Product"].unique()
. Notice how len
goes at the front of this expression; that’s why we’re calling len
a function rather than a method.
len(df["Product"].unique())
171
Notice how this answer, 171, is one more than our length of 170 for the value_counts
output. This is presumably because the unique
method includes the possibility of a missing value, and the value_counts
method does not. (A student later pointed out to me that the unique
output does include nan
, which stands for “not a number”, and represents missing values.)
Indexing for pandas Series#
Indexing for pandas Series and especially for pandas DataFrames takes some getting used to. What various types of indexing represents is largely something that needs to be memorized.
Here is a reminder of what the top three rows in our DataFrame look like.
df.head(3)
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
Here is the code we used last week to get the zeroth element (remember that counting in Python starts with zero) out of the “RPrice” column.
# Get the zeroth element out of the RPrice column
df["RPrice"].iloc[0]
3.5
It’s very reasonable to ask why we used this strange .iloc[0]
syntax, when some similar versions also work in this case. For example, .loc[0]
also works in this case.
df["RPrice"].loc[0]
3.5
In fact, even [0]
also works in this case.
df["RPrice"][0]
3.5
The .iloc[0]
version is really the correct version, and the other two only work by coincidence. Let’s look more closely at this pandas Series (this column from the DataFrame) df["RPrice"]
. Notice there are seemingly two columns of numbers. The left-hand column shows what’s called the index of the pandas Series, and the right-hand column (which should be thought of as the more important column) contains the values of the pandas Series. The elements in the index should be thought of as labels. For example, 0
is the label for the float 3.5
at the top.
df["RPrice"]
0 3.5
1 3.5
2 1.5
3 1.5
4 3.5
...
6440 2.0
6441 2.0
6442 2.0
6443 2.5
6444 1.5
Name: RPrice, Length: 6445, dtype: float64
Let’s try sorting these prices, from biggest to smallest. We use the sort_values
method, which by default sorts the values from smallest to biggest, so we include what’s called a keyword argument ascending=False
to tell pandas to instead sort them in decreasing order.
s = df["RPrice"].sort_values(ascending=False)
Notice how the index for the new Series s
seems to be in a scrambled order now. That is because the index has not been sorted, only the prices.
s
5704 5.0
4680 5.0
3275 5.0
4134 5.0
2889 5.0
...
1491 1.0
1837 1.0
1054 1.0
340 1.0
2088 1.0
Name: RPrice, Length: 6445, dtype: float64
The 3.5
we saw earlier at the top still has the same label 0
, and if we call s[0]
, we do not get the top element in s
, instead we get the element (or theoretically elements) with label 0
.
s[0]
3.5
The same goes for loc
. Think of loc
as indexing by label and think of iloc
as indexing by integer position.
s.loc[0]
3.5
For this Series s
, to get the top value, the indexing we need to use is iloc
.
s.iloc[0]
5.0
As an aside, if you need to access the labels directly, it is stored in the index
attribute of the pandas Series.
s.index
Int64Index([5704, 4680, 3275, 4134, 2889, 2635, 4152, 4868, 5795, 3191,
...
1829, 1058, 1492, 1831, 1832, 1491, 1837, 1054, 340, 2088],
dtype='int64', length=6445)
We’ll see the exact same loc
vs iloc
distinction in the context of pandas DataFrames. There also are some additional ways to index on DataFrames, related to the fact that DataFrames are two-dimensional objects, as opposed to Series which are one-dimensional objects.
Indexing for pandas DataFrames#
The loc
and iloc
indexing works basically the same for pandas DataFrames. Look down a few cells at our df
DataFrame. If we go to row 2
(counting from zero) and column 4
(counting from zero), do you agree that we reach the value 'Takis - Hot Chilli Pepper & Lime'
? (Do not include the column names when you start counting, that should be considered as a header, and do not include the index (0, 1, 2, …) at the left-most side when counting.)
df.iloc[2,4]
'Takis - Hot Chilli Pepper & Lime'
If we instead want to index using the labels instead of the integer positions, then we use loc
instead of iloc
. The row value stays the same since it is both the label and the integer position (that was what caused the confusion above with the df["RPrice"]
Series), but now the column value changes from the integer 4
to the string "Product"
.
df.loc[2,"Product"]
'Takis - Hot Chilli Pepper & Lime'
df
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6440 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lindens - Chocolate Chippers | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 122 | 2.0 | 1 | 122 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6441 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Wonderful Pistachios - Variety | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 131 | 2.0 | 1 | 131 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6442 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Hungry Buddha - Chocolate Chip | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 137 | 2.0 | 1 | 137 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6443 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Snapple Tea - Lemon | Non Carbonated | 15603853105 | Wednesday, August 31, 2022 | Credit | 145 | 2.5 | 1 | 145 | 2.5 | 1 | 2.5 | 2.5 | 8/31/2022 |
6444 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15603921383 | Wednesday, August 31, 2022 | Cash | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/31/2022 |
6445 rows × 18 columns
At the very top of the DataFrame, we see the column names displayed. It’s often useful to have access to these column names, and they are stored in the DataFrame’s columns
attribute. (Aside. Notice that this df.columns
is not a list nor a pandas Series nor any other data type we have met before. Python is filled with many special-purpose types of objects. This df.columns
is a pandas Index object, but I don’t think we will use any special features of this type of object in Math 10, so we will mostly ignore it.)
df.columns
Index(['Status', 'Device ID', 'Location', 'Machine', 'Product', 'Category',
'Transaction', 'TransDate', 'Type', 'RCoil', 'RPrice', 'RQty', 'MCoil',
'MPrice', 'MQty', 'LineTotal', 'TransTotal', 'Prcd Date'],
dtype='object')
At the very left-hand side of the DataFrame, the row labels are displayed. These labels are stored in the index
attribute. (Aside. I expected when I evaluated this to see something like Index([0, 1, 2, 3, ..., 6443, 6445])
, but instead we see RangeIndex(start=0, stop=6445, step=1)
. The distinction is not important for us; presumably the one used by pandas in this case is more memory efficient, since it does not need to store all the explicit values.)
These row labels are important, but we won’t use them quite as often as we will use the column labels.
df.index
RangeIndex(start=0, stop=6445, step=1)
Let’s see some more types of indexing for pandas DataFrames. There is no way to know how this works in advance. Some of these conventions seem a little contradictory to me (the fact that some types of indexing access columns and some types of indexing access rows), and it’s just something you need to memorize.
If you want to access a sub-DataFrame containing only certain columns, you can pass a list of those column names. Repetitions are fine. Notice how we have two pairs of square brackets. The outer pair starts the indexing, and the inner pair creates a Python list.
df[["RPrice", "RPrice", "Location"]]
RPrice | RPrice | Location | |
---|---|---|---|
0 | 3.5 | 3.5 | Brunswick Sq Mall |
1 | 3.5 | 3.5 | Brunswick Sq Mall |
2 | 1.5 | 1.5 | Brunswick Sq Mall |
3 | 1.5 | 1.5 | Brunswick Sq Mall |
4 | 3.5 | 3.5 | Brunswick Sq Mall |
... | ... | ... | ... |
6440 | 2.0 | 2.0 | EB Public Library |
6441 | 2.0 | 2.0 | EB Public Library |
6442 | 2.0 | 2.0 | EB Public Library |
6443 | 2.5 | 2.5 | GuttenPlans |
6444 | 1.5 | 1.5 | EB Public Library |
6445 rows × 3 columns
Here is an example of what is called slicing. Notice how this accesses rows, not columns. The following is short-hand for “get the first 7 rows”.
df[:7]
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
5 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Bai Antioxidant - Brasilia BB | Non Carbonated | 14517568743 | Sunday, January 2, 2022 | Cash | 146 | 2.5 | 1 | 146 | 2.5 | 1 | 2.5 | 2.5 | 1/2/2022 |
6 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Miss Vickie's Potato Chip - Sea Salt & Vinega | Food | 14518731524 | Monday, January 3, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 1/2/2022 |
Above when we wrote df[:7]
, that was an abbreviation for df[0:7]
, so the following returns the exact same sub-DataFrame. Related to the fact that numbering in Python starts at zero, when we specify endpoints in Python, usually the left endpoint is included but the right endpoint is not included. In this case, the zeroth row is included but the seventh row (i.e., the row at integer position 7
, it would also be reasonable to call this the eighth row) is not.
df[0:7]
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
5 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Bai Antioxidant - Brasilia BB | Non Carbonated | 14517568743 | Sunday, January 2, 2022 | Cash | 146 | 2.5 | 1 | 146 | 2.5 | 1 | 2.5 | 2.5 | 1/2/2022 |
6 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Miss Vickie's Potato Chip - Sea Salt & Vinega | Food | 14518731524 | Monday, January 3, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 1/2/2022 |
Let’s confirm that we really are getting seven rows. If you call the function len
on a pandas DataFrame, you get the number of rows. (That’s different from the Matlab convention. In Matlab, calling length
returns the length of the longest dimension.)
len(df[:7])
7
Another way to find the number of rows is to first get the tuple representing the shape of the DataFrame.
df[:7].shape
(7, 18)
And then get the zeroth element out of this tuple, which represents the number of rows. (If instead we used .shape[1]
, that would get the number of columns.)
df[:7].shape[0]
7
Boolean indexing#
The most powerful type of indexing for us will be what is called Boolean indexing. Let’s see an example on a smaller DataFrame. Here we make df2
by hand, instead of by reading in the contents of a csv file. This isn’t the most important part of this section, but I’ll just point out that we are passing a dictionary to the DataFrame constructor function, pd.DataFrame
. The keys in this dictionary will be the column names, and the values in this dictionary will be the values in the columns. There are many other things we could pass to pd.DataFrame
instead of a dictionary.
df2 = pd.DataFrame({
"A": [5,1,6,1,2,3,8],
"B": [2,4,2,8,3,-1,0],
"C": [5, 1.1, 3.4, 5.5, 2.1, 4.2, -3.5]
})
Here is the resulting DataFrame, df2
.
df2
A | B | C | |
---|---|---|---|
0 | 5 | 2 | 5.0 |
1 | 1 | 4 | 1.1 |
2 | 6 | 2 | 3.4 |
3 | 1 | 8 | 5.5 |
4 | 2 | 3 | 2.1 |
5 | 3 | -1 | 4.2 |
6 | 8 | 0 | -3.5 |
The point of this section is not constructing DataFrames but instead the point is how to use Boolean indexing. Before we get to Boolean indexing, here is an example of a Boolean Series. It represents, in which positions is the value in column “A” strictly greater than 4?
df2["A"] > 4
0 True
1 False
2 True
3 False
4 False
5 False
6 True
Name: A, dtype: bool
Here is an example of Boolean indexing. Inside of the square brackets is the exact Boolean Series we saw above. We are forming a sub-DataFrame consisting of all the rows in which the Boolean Series is True
. In other words, we are forming the sub-DataFrame of df2
consisting of all rows where the value in the “A” column is strictly greater than 4.
It is worth looking at this and the next example slowly, until you understand how they work.
df2[df2["A"] > 4]
A | B | C | |
---|---|---|---|
0 | 5 | 2 | 5.0 |
2 | 6 | 2 | 3.4 |
6 | 8 | 0 | -3.5 |
We can use the same type of Boolean indexing with a Series instead of a DataFrame on the outside. Here we are getting all the values in the “B” column corresponding to rows where the value in the “A” column is strictly greater than 4.
df2["B"][df2["A"] > 4]
0 2
2 2
6 0
Name: B, dtype: int64
Here is a reminder of how the original DataFrame looked. Try to convince yourself that the pandas Series output by the above cell is exactly the sub-Series of column “B” corresponding to rows where the “A” column value is strictly greater than 4.
df2
A | B | C | |
---|---|---|---|
0 | 5 | 2 | 5.0 |
1 | 1 | 4 | 1.1 |
2 | 6 | 2 | 3.4 |
3 | 1 | 8 | 5.5 |
4 | 2 | 3 | 2.1 |
5 | 3 | -1 | 4.2 |
6 | 8 | 0 | -3.5 |
Let’s see a more advanced example, still using Boolean indexing. Here is the goal. (This will help with a problem on the corresponding worksheet.)
Find the sub-DataFrame where the value in column “C” is strictly greater than 2 and where the value in column “A” column is not equal to 1.
Our strategy will be to make one Boolean Series for the column “C” restriction, then make another Boolean Series for the column “A” restriction, and then to combine them.
df2["C"] > 2
0 True
1 False
2 True
3 True
4 True
5 True
6 False
Name: C, dtype: bool
To check “not equal” in Python, we can use !=
.
df2["A"] != 1
0 True
1 False
2 True
3 False
4 True
5 True
6 True
Name: A, dtype: bool
We now create the Boolean Series where the previous two Boolean Series are both True
. If you’ve done some Python programming before, you might expect to use the word and
, and that is correct in base Python, but in pandas (and in NumPy) usually we want instead to use the ampersand symbol, &
.
# Warning: in pandas, use &, in base Python, use and
(df2["C"] > 2) & (df2["A"] != 1)
0 True
1 False
2 True
3 False
4 True
5 True
6 False
dtype: bool
We now use Boolean indexing with the Boolean Series we just created. The following looks complicated, but the part inside the brackets is exactly the above code.
df2[(df2["C"] > 2) & (df2["A"] != 1)]
A | B | C | |
---|---|---|---|
0 | 5 | 2 | 5.0 |
2 | 6 | 2 | 3.4 |
4 | 2 | 3 | 2.1 |
5 | 3 | -1 | 4.2 |
The above output represents the sub-DataFrame of df2
consisting of all rows for which the “C” value is strictly greater than 2 and for which the “A” value is not equal to 1.