Week 1 Tuesday#
Overall question for today (we will need some new techniques to answer it).
In the vending machines dataset, how many transactions occurred on Saturday?
We’ll split our solution into two parts.
How to determine if a string corresponds to Saturday.
How to determine how many elements in a pandas Series correspond to Saturday.
import pandas as pd
df = pd.read_csv("vend.csv")
Let’s remind ourselves what this DataFrame looks like. In the following, we’ve made a common mistake, and you’ll be well-served to be able to recognize it.
df.head
<bound method NDFrame.head of Status Device ID Location Machine \
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
... ... ... ... ...
6440 Processed VJ300320692 EB Public Library EB Public Library x1380
6441 Processed VJ300320692 EB Public Library EB Public Library x1380
6442 Processed VJ300320692 EB Public Library EB Public Library x1380
6443 Processed VJ300320609 GuttenPlans GuttenPlans x1367
6444 Processed VJ300320692 EB Public Library EB Public Library x1380
Product Category Transaction \
0 Red Bull - Energy Drink - Sugar Free Carbonated 14515778905
1 Red Bull - Energy Drink - Sugar Free Carbonated 14516018629
2 Takis - Hot Chilli Pepper & Lime Food 14516018629
3 Takis - Hot Chilli Pepper & Lime Food 14516020373
4 Red Bull - Energy Drink - Sugar Free Carbonated 14516021756
... ... ... ...
6440 Lindens - Chocolate Chippers Food 15603201222
6441 Wonderful Pistachios - Variety Food 15603201222
6442 Hungry Buddha - Chocolate Chip Food 15603201222
6443 Snapple Tea - Lemon Non Carbonated 15603853105
6444 Goldfish Baked - Cheddar Food 15603921383
TransDate Type RCoil RPrice RQty MCoil MPrice \
0 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
1 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
2 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5
3 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5
4 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
... ... ... ... ... ... ... ...
6440 Wednesday, August 31, 2022 Credit 122 2.0 1 122 2.0
6441 Wednesday, August 31, 2022 Credit 131 2.0 1 131 2.0
6442 Wednesday, August 31, 2022 Credit 137 2.0 1 137 2.0
6443 Wednesday, August 31, 2022 Credit 145 2.5 1 145 2.5
6444 Wednesday, August 31, 2022 Cash 125 1.5 1 125 1.5
MQty LineTotal TransTotal Prcd Date
0 1 3.5 3.5 1/1/2022
1 1 3.5 5.0 1/1/2022
2 1 1.5 5.0 1/1/2022
3 1 1.5 1.5 1/1/2022
4 1 3.5 3.5 1/1/2022
... ... ... ... ...
6440 1 2.0 6.0 8/31/2022
6441 1 2.0 6.0 8/31/2022
6442 1 2.0 6.0 8/31/2022
6443 1 2.5 2.5 8/31/2022
6444 1 1.5 1.5 8/31/2022
[6445 rows x 18 columns]>
If you look at the very first few words displayed in the output above, <bound method ..., that is a hint that we have forgotten to call the method by using parentheses. We need to use df.head() rather than df.head. (The former, df.head(), treats head as a method, which is correct. The latter, df.head, treats head as an attribute.)
In a Math 10 quiz or exam, you’re not expected to have memorized when parentheses are necessary, but if you see an incorrect output like the above, you might be expected to recognize what has gone wrong. And when completing the worksheets, it is very important to be able to correct these sorts of errors.
Here is a correct use of df.head()
.
df.head()
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
Here is the full DataFrame.
df
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6440 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lindens - Chocolate Chippers | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 122 | 2.0 | 1 | 122 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6441 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Wonderful Pistachios - Variety | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 131 | 2.0 | 1 | 131 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6442 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Hungry Buddha - Chocolate Chip | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 137 | 2.0 | 1 | 137 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6443 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Snapple Tea - Lemon | Non Carbonated | 15603853105 | Wednesday, August 31, 2022 | Credit | 145 | 2.5 | 1 | 145 | 2.5 | 1 | 2.5 | 2.5 | 8/31/2022 |
6444 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15603921383 | Wednesday, August 31, 2022 | Cash | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/31/2022 |
6445 rows × 18 columns
Using string methods#
Our overall goal is to count how many transactions occurred on Saturday. We’ll first see how to determine if a specific string represents Saturday, and then we’ll see how to use the same approach on an entire pandas Series.
As humans, it’s obvious to us that the following string represents a Saturday, but how can we programmatically get Python to recognize that?
s = "Saturday, January 1, 2022"
Scan through these string methods and attributes, skipping the “dunder methods” that begin and end with double underscores. (In fact, you should just skip everything that starts with even one underscore.) Do you see a method or attribute that could be used to determine if s contains the word “Saturday”?
dir(s)
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
A student suggested we try the find method, which is a great idea. I don’t know that method well, so let’s check its help documentation. The following doesn’t work.
help(find)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [9], line 1
----> 1 help(find)
NameError: name 'find' is not defined
The reason help(find) raised an error is that find is not defined independently in Python; instead, it is bound to the string object s. So we need to call help(s.find) instead. Notice that we do not use help(s.find()), because that would attempt to call find before we ever get to the help portion.
help(s.find)
Help on built-in function find:
find(...) method of builtins.str instance
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
We want to input the substring (denoted sub in the help documentation) that we want to find. Here is a reminder of the value of the string s.
s
'Saturday, January 1, 2022'
Now we call the find method of s. We don’t need the optional arguments denoted by start and end in the documentation. The following call outputs 0, which tells us that the substring was found at the very beginning (integer position 0) of s.
s.find("Saturday")
0
These string searches are case sensitive, so "Day" does not occur in s.
s.find("Day")
-1
But "day" does occur, and it starts at integer position 5.
s.find("day")
5
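Since these Boolean-flavored questions can also be phrased with Python’s in operator, here is a short sketch (not from lecture) comparing the two approaches.

```python
# Comparing str.find with the `in` operator on our date string.
s = "Saturday, January 1, 2022"

# find returns the starting index of the substring, or -1 if absent.
print(s.find("Saturday"))  # 0
print(s.find("day"))       # 5
print(s.find("Day"))       # -1 (searches are case sensitive)

# The `in` operator returns a Boolean directly, which is often all we need.
print("Saturday" in s)     # True
```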
If we’re confident that "Saturday" will always occur at the beginning of the string, then another option is to use the startswith method. Instead of returning a location, it returns a Boolean value of True or False.
s.startswith("Saturday")
True
It’s important to remember that True is capitalized and that it does not use quotation marks (because it is not a string).
True
True
Let’s see an example of calling this same method on a value taken directly from the DataFrame. The following is a string, just like our s variable.
df.loc[20, "TransDate"]
'Monday, January 3, 2022'
This string does not start with "Saturday", so the following outputs False.
df.loc[20, "TransDate"].startswith("Saturday")
False
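To package this check for reuse, we might write a small helper function like the following sketch. (The name is_saturday is our own invention, not part of the course materials or the dataset.)

```python
def is_saturday(date_string):
    """Return True if the date string begins with "Saturday"."""
    return date_string.startswith("Saturday")

print(is_saturday("Saturday, January 1, 2022"))  # True
print(is_saturday("Monday, January 3, 2022"))    # False
```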
Using string methods with a Series#
We now know how to determine if an individual string contains "Saturday" or starts with "Saturday". How can we use that same approach on an entire pandas Series of strings?
myseries = df["TransDate"]
Here is the “TransDate” column.
myseries
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
4 Saturday, January 1, 2022
...
6440 Wednesday, August 31, 2022
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object
When you ask Python what the type of a string is, it outputs str.
type("Saturday")
str
Given a pandas Series containing strings, there is what is called an “accessor attribute” str which gives us access to Python string methods. The type of myseries.str is not so important; we evaluate the following just to show that it is a special type of object: it’s not a string and not a pandas Series. (Much of Python is based on “Object Oriented Programming”, and one effect of this is that Python libraries like pandas are filled with special objects that serve very specific purposes.)
type(myseries.str)
pandas.core.strings.accessor.StringMethods
Notice how similar the attributes and methods listed here are to the methods and attributes defined on strings. They are not exactly the same, but they’re very similar. (If you called dir on a pandas Series itself, you would see almost no overlap with string methods, especially if you don’t count the underscore methods.)
dir(myseries.str)
['__annotations__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__frozen',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_data',
'_doc_args',
'_freeze',
'_get_series_list',
'_index',
'_inferred_dtype',
'_is_categorical',
'_is_string',
'_name',
'_orig',
'_parent',
'_validate',
'_wrap_result',
'capitalize',
'casefold',
'cat',
'center',
'contains',
'count',
'decode',
'encode',
'endswith',
'extract',
'extractall',
'find',
'findall',
'fullmatch',
'get',
'get_dummies',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'islower',
'isnumeric',
'isspace',
'istitle',
'isupper',
'join',
'len',
'ljust',
'lower',
'lstrip',
'match',
'normalize',
'pad',
'partition',
'repeat',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'slice',
'slice_replace',
'split',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'wrap',
'zfill']
A quick reminder on slicing syntax. We’ve only seen this for a DataFrame, but it works the same way for a Series. Here we are getting the first four values of the pandas Series myseries.
myseries[:4]
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
Name: TransDate, dtype: object
This is as good a time as any to mention negative indexing notation. Here is a way to get the last 4 values of something (whether a pandas Series, as in this case, or a pandas DataFrame, or a Python list, etc.).
myseries[-4:]
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, dtype: object
Using this string accessor attribute, str, we can use the find method just like we used it above. We see 0 at the top, because "Saturday" occurs starting at position 0 in the top few values, and we see -1 at the bottom, because "Saturday" does not occur in the last few values, so the find call fails there.
myseries.str.find("Saturday")
0 0
1 0
2 0
3 0
4 0
..
6440 -1
6441 -1
6442 -1
6443 -1
6444 -1
Name: TransDate, Length: 6445, dtype: int64
Or we can use the startswith method. Instead of returning a Series of integer values, this returns a Series of Boolean values.
myseries.str.startswith("Saturday")
0 True
1 True
2 True
3 True
4 True
...
6440 False
6441 False
6442 False
6443 False
6444 False
Name: TransDate, Length: 6445, dtype: bool
Our overall goal is to count, and that counting is somewhat more natural with the Boolean values. The reason is that True behaves like 1 in Python (as in Matlab) and False behaves like 0, so if we add up Boolean values, the result is exactly the number of times True occurred.
True+True+False
2
Even though True and 1 are not literally the same object in Python, True == 1 evaluates to True.
True == 1
True
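As a small sketch of this behavior, note that bool is actually a subclass of int in Python, which is exactly why sums of Booleans count the Trues.

```python
bools = [True, True, False, True]

# Each True contributes 1 to the sum, each False contributes 0.
print(sum(bools))             # 3

# bool is a subclass of int, which explains this behavior.
print(isinstance(True, int))  # True
print(issubclass(bool, int))  # True
```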
A pandas Series, like myseries.str.startswith("Saturday"), has a sum method, which adds up all the values in the Series. So the following is the first answer to our overall question: how many transaction dates correspond to Saturday?
myseries.str.startswith("Saturday").sum()
684
Here is another, very similar, example solution using Boolean indexing. We first get the sub-DataFrame of df containing only those rows corresponding to a transaction that occurred on Saturday. Notice that what goes inside the square brackets is our Boolean Series from above.
# Boolean indexing
df[myseries.str.startswith("Saturday")]
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6322 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15586314095 | Saturday, August 27, 2022 | Credit | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
6323 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lenny & Larrys - SnickerDoodle | Food | 15586432746 | Saturday, August 27, 2022 | Cash | 157 | 3.0 | 1 | 157 | 3.0 | 1 | 3.0 | 3.0 | 8/27/2022 |
6324 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | SunChips Multigrain - Harvest Cheddar | Food | 15586827743 | Saturday, August 27, 2022 | Cash | 110 | 1.5 | 1 | 110 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
6325 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Poland Springs Water | Water | 15586849358 | Saturday, August 27, 2022 | Credit | 143 | 2.0 | 1 | 143 | 2.0 | 1 | 2.0 | 2.0 | 8/27/2022 |
6326 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Pop Corners - White Cheddar | Food | 15586848996 | Saturday, August 27, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
684 rows × 18 columns
The answer to our question (which we already found above using the sum method) can now be found by determining the number of rows in the sub-DataFrame. The sub-DataFrame has 684 rows and 18 columns.
df[myseries.str.startswith("Saturday")].shape
(684, 18)
If we just want the number of rows, we can extract the number at position 0 from the shape tuple.
df[myseries.str.startswith("Saturday")].shape[0]
684
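Both counting strategies can be seen on a tiny hand-made Series. (The data below is made up for illustration, not read from vend.csv.)

```python
import pandas as pd

dates = pd.Series([
    "Saturday, January 1, 2022",
    "Saturday, January 1, 2022",
    "Monday, January 3, 2022",
])

mask = dates.str.startswith("Saturday")

# Strategy 1: sum the Boolean Series directly.
print(mask.sum())            # 2

# Strategy 2: Boolean indexing, then count rows with shape[0].
print(dates[mask].shape[0])  # 2
```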
Using timestamp methods#
It was easy to tell (at least as a human, if not as a computer) that our string s above corresponded to Saturday, because the word “Saturday” was in the string s.
s
'Saturday, January 1, 2022'
A lot of the time, dates will be provided to us in a more compact format. For example, even in this same dataset, the dates in the “Prcd Date” column are written differently. Here we access the date in the row labeled 20 and the column labeled "Prcd Date" using .loc.
t = df.loc[20, "Prcd Date"]
It’s much more difficult to tell what day of the week this date corresponds to.
t
'1/3/2022'
As a first step, we will use the to_datetime function in pandas to convert t into a Timestamp. The following is not the correct syntax, because we have not told Python where to find the to_datetime function.
# use the to_datetime function in pandas
to_datetime(t)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [38], line 2
1 # use the to_datetime function in pandas
----> 2 to_datetime(t)
NameError: name 'to_datetime' is not defined
The correct syntax is pd.to_datetime, which tells Python to get the to_datetime function from the pandas library.
ts = pd.to_datetime(t)
ts
Timestamp('2022-01-03 00:00:00')
The variable t is still a string.
type(t)
str
The variable ts that we just defined is a pandas Timestamp. Because it is a new type of object specifically related to a time, it has lots of new functionality.
type(ts)
pandas._libs.tslibs.timestamps.Timestamp
As usual, we can see all of the methods and attributes defined on ts by using Python’s dir function. Scroll through these and try to decide which ones will tell us that the date represents a Monday.
dir(ts)
['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_cython__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__setstate_cython__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_date_repr',
'_repr_base',
'_round',
'_short_repr',
'_time_repr',
'asm8',
'astimezone',
'ceil',
'combine',
'ctime',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'dst',
'floor',
'fold',
'freq',
'freqstr',
'fromisocalendar',
'fromisoformat',
'fromordinal',
'fromtimestamp',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'isoformat',
'isoweekday',
'max',
'microsecond',
'min',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'now',
'quarter',
'replace',
'resolution',
'round',
'second',
'strftime',
'strptime',
'time',
'timestamp',
'timetuple',
'timetz',
'to_datetime64',
'to_julian_date',
'to_numpy',
'to_period',
'to_pydatetime',
'today',
'toordinal',
'tz',
'tz_convert',
'tz_localize',
'tzinfo',
'tzname',
'utcfromtimestamp',
'utcnow',
'utcoffset',
'utctimetuple',
'value',
'week',
'weekday',
'weekofyear',
'year']
A good guess is to use day, but that tells us the day of the month. Because ts represents January 3rd, 2022, the day attribute returns the integer 3.
ts.day
3
Warning. As far as I know, there is no good way to know in advance which of these are methods and which are attributes. (And you certainly won’t be tested on that.) The important thing is to recognize when you have called one incorrectly. Here we try calling day as a method. The error says “'int' object is not callable”. That is because ts.day evaluates to 3, so calling ts.day() is like calling 3(), which explains this error message.
ts.day()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [46], line 1
----> 1 ts.day()
TypeError: 'int' object is not callable
Another try is day_of_week, and that is correct, but it’s difficult to know in advance which day of the week corresponds to 0. Let’s keep looking.
ts.day_of_week
0
The dayofweek attribute (note that there are no underscores) returns the same integer 0. We are looking for something that returns the string "Monday".
ts.dayofweek
0
How about day_name?
ts.day_name
<function Timestamp.day_name>
It looks like that didn’t work. This weird output is very similar to (but shorter than) the df.head output we saw above. This output is a sign that we should be calling day_name as a method, not as an attribute. The following works.
ts.day_name()
'Monday'
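Here is a short recap sketch of the three Timestamp attributes and methods we just tried, all on the same date:

```python
import pandas as pd

ts = pd.to_datetime("1/3/2022")

print(ts.day)          # 3 (day of the month; an attribute)
print(ts.day_of_week)  # 0 (Monday corresponds to 0; an attribute)
print(ts.day_name())   # 'Monday' (a method, so parentheses are required)
```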
Aside. You might wonder why I call day_name() a method instead of a function. In general, if a function is bound to a Python object, like this day_name is bound to ts, it is usually called a method. As small evidence of that, notice in the documentation how both the word “function” and the word “method” are used.
help(ts.day_name)
Help on built-in function day_name:
day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
Return the day name of the Timestamp with specified locale.
Parameters
----------
locale : str, default None (English locale)
Locale determining the language in which to return the day name.
Returns
-------
str
It’s nice that we can get the day of the week from a date. If you had to implement that kind of functionality from scratch, it would be very complicated.
Let’s try it again. Why doesn’t the following work?
"9/27/2022".day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [52], line 1
----> 1 "9/27/2022".day_name()
AttributeError: 'str' object has no attribute 'day_name'
The reason the above call to day_name() did not work is that strings do not have a day_name method. (Why would they? What should "Christopher".day_name() return?)
To use this day_name method, we have to first convert the string into a Timestamp.
pd.to_datetime("9/27/2022").day_name()
'Tuesday'
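As an aside, Python’s standard library datetime module can do the same conversion, if we specify the date format explicitly. This is just a sketch for comparison; in this course we use pandas.

```python
from datetime import datetime

# strptime parses a string into a datetime, given an explicit format.
d = datetime.strptime("9/27/2022", "%m/%d/%Y")

# %A in strftime produces the full weekday name.
print(d.strftime("%A"))  # Tuesday
```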
Using timestamp methods with a Series#
Let’s return to our original question (which we already answered above using string methods) of how many transactions occurred on a Saturday. Remember that the relevant column is the “TransDate” column.
df["TransDate"]
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
4 Saturday, January 1, 2022
...
6440 Wednesday, August 31, 2022
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object
Just like we used str to access string methods on a pandas Series, here we use dt (which presumably stands for datetime) to access Timestamp methods on a pandas Series. But we can’t use df["TransDate"].dt directly. The following mistake is basically the same as the "9/27/2022".day_name() mistake above.
dir(df["TransDate"].dt)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [55], line 1
----> 1 dir(df["TransDate"].dt)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5461, in NDFrame.__getattr__(self, name)
5454 # Note: obj.x will always call obj.__getattribute__('x') prior to
5455 # calling obj.__getattr__('x').
5456 if (
5457 name in self._internal_names_set
5458 or name in self._metadata
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/accessor.py:180, in CachedAccessor.__get__(self, obj, cls)
177 if obj is None:
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
183 # We need to use object.__setattr__ because we overwrite __setattr__ on
184 # NDFrame
185 object.__setattr__(obj, self._name, accessor_obj)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/accessors.py:494, in CombinedDatetimelikeProperties.__new__(cls, data)
491 elif is_period_dtype(data.dtype):
492 return PeriodProperties(data, orig)
--> 494 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
We first need to convert from strings to Timestamps using pd.to_datetime. Notice how the dtype changes. (I’m not sure if there is a difference between the Timestamp objects we saw above and the type referred to below as datetime64[ns]. I think it is safe to think of them as the same.)
pd.to_datetime(df["TransDate"])
0 2022-01-01
1 2022-01-01
2 2022-01-01
3 2022-01-01
4 2022-01-01
...
6440 2022-08-31
6441 2022-08-31
6442 2022-08-31
6443 2022-08-31
6444 2022-08-31
Name: TransDate, Length: 6445, dtype: datetime64[ns]
Now we can call the Python dir function, like I tried to do above. Notice how similar these methods are to the Timestamp attributes and methods we saw in the previous section.
dir(pd.to_datetime(df["TransDate"]).dt)
['__annotations__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__frozen',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_accessors',
'_add_delegate_accessors',
'_constructor',
'_delegate_method',
'_delegate_property_get',
'_delegate_property_set',
'_dir_additions',
'_dir_deletions',
'_freeze',
'_get_values',
'_hidden_attrs',
'_parent',
'_reset_cache',
'ceil',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'floor',
'freq',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'microsecond',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'quarter',
'round',
'second',
'strftime',
'time',
'timetz',
'to_period',
'to_pydatetime',
'tz',
'tz_convert',
'tz_localize',
'week',
'weekday',
'weekofyear',
'year']
In particular, using this dt accessor attribute, we can call the day_name method.
pd.to_datetime(df["TransDate"]).dt.day_name()
0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
...
6440 Wednesday
6441 Wednesday
6442 Wednesday
6443 Wednesday
6444 Wednesday
Name: TransDate, Length: 6445, dtype: object
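The same dt.day_name() pattern works on any datetime Series; here is a minimal self-contained sketch with a few hand-picked dates (the day names below match what we saw in the dataset):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1/1/2022", "1/3/2022", "8/31/2022"]))

# The dt accessor applies Timestamp methods elementwise.
print(dates.dt.day_name().tolist())  # ['Saturday', 'Monday', 'Wednesday']
```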
Let’s now see how often "Saturday" occurs. Here is the same mistake we have made several times above: we are missing the parentheses on value_counts.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts
<bound method IndexOpsMixin.value_counts of 0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
...
6440 Wednesday
6441 Wednesday
6442 Wednesday
6443 Wednesday
6444 Wednesday
Name: TransDate, Length: 6445, dtype: object>
Here is a correct call of value_counts. It returns a Series whose index contains the days of the week and whose values indicate how often each occurs.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()
Wednesday 1120
Tuesday 1113
Thursday 1106
Monday 1023
Friday 1016
Saturday 684
Sunday 383
Name: TransDate, dtype: int64
To access the value corresponding to the key "Saturday", we can use indexing.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()["Saturday"]
684
We could also use .loc indexing.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().loc["Saturday"]
684
If we’re willing to count to see at what position "Saturday" occurs, we could also use iloc indexing. This iloc approach is definitely the worst approach in this case, but it is useful to know if you want to get, for example, the top value.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().iloc[5]
684
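One more option, not used in lecture: compare the day-name Series to "Saturday" and sum, which skips value_counts entirely. A sketch on made-up data:

```python
import pandas as pd

days = pd.Series(["Saturday", "Saturday", "Monday", "Wednesday"])

# Comparing a Series to a string produces a Boolean Series,
# and summing it counts the matches.
print((days == "Saturday").sum())  # 2
```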