Week 1 Tuesday#

Overall question for today (we will need some new techniques to answer it).

  • In the vending machines dataset, how many transactions occurred on Saturday?

We’ll split our solution into two parts.

  1. How to determine if a string corresponds to Saturday.

  2. How to determine how many elements in a pandas Series correspond to Saturday.

import pandas as pd
df = pd.read_csv("vend.csv")

Let’s remind ourselves what this DataFrame looks like. In the following, we’ve made a common mistake, and you’ll be well served if you can recognize it.

df.head
<bound method NDFrame.head of          Status    Device ID           Location                  Machine  \
0     Processed  VJ300320611  Brunswick Sq Mall     BSQ Mall x1366 - ATT   
1     Processed  VJ300320611  Brunswick Sq Mall     BSQ Mall x1366 - ATT   
2     Processed  VJ300320611  Brunswick Sq Mall     BSQ Mall x1366 - ATT   
3     Processed  VJ300320611  Brunswick Sq Mall     BSQ Mall x1366 - ATT   
4     Processed  VJ300320611  Brunswick Sq Mall     BSQ Mall x1366 - ATT   
...         ...          ...                ...                      ...   
6440  Processed  VJ300320692  EB Public Library  EB Public Library x1380   
6441  Processed  VJ300320692  EB Public Library  EB Public Library x1380   
6442  Processed  VJ300320692  EB Public Library  EB Public Library x1380   
6443  Processed  VJ300320609        GuttenPlans        GuttenPlans x1367   
6444  Processed  VJ300320692  EB Public Library  EB Public Library x1380   

                                   Product        Category  Transaction  \
0     Red Bull - Energy Drink - Sugar Free      Carbonated  14515778905   
1     Red Bull - Energy Drink - Sugar Free      Carbonated  14516018629   
2         Takis - Hot Chilli Pepper & Lime            Food  14516018629   
3         Takis - Hot Chilli Pepper & Lime            Food  14516020373   
4     Red Bull - Energy Drink - Sugar Free      Carbonated  14516021756   
...                                    ...             ...          ...   
6440          Lindens - Chocolate Chippers            Food  15603201222   
6441        Wonderful Pistachios - Variety            Food  15603201222   
6442        Hungry Buddha - Chocolate Chip            Food  15603201222   
6443                   Snapple Tea - Lemon  Non Carbonated  15603853105   
6444              Goldfish Baked - Cheddar            Food  15603921383   

                       TransDate    Type  RCoil  RPrice  RQty  MCoil  MPrice  \
0      Saturday, January 1, 2022  Credit    148     3.5     1    148     3.5   
1      Saturday, January 1, 2022  Credit    148     3.5     1    148     3.5   
2      Saturday, January 1, 2022  Credit    123     1.5     1    123     1.5   
3      Saturday, January 1, 2022  Credit    123     1.5     1    123     1.5   
4      Saturday, January 1, 2022  Credit    148     3.5     1    148     3.5   
...                          ...     ...    ...     ...   ...    ...     ...   
6440  Wednesday, August 31, 2022  Credit    122     2.0     1    122     2.0   
6441  Wednesday, August 31, 2022  Credit    131     2.0     1    131     2.0   
6442  Wednesday, August 31, 2022  Credit    137     2.0     1    137     2.0   
6443  Wednesday, August 31, 2022  Credit    145     2.5     1    145     2.5   
6444  Wednesday, August 31, 2022    Cash    125     1.5     1    125     1.5   

      MQty  LineTotal  TransTotal  Prcd Date  
0        1        3.5         3.5   1/1/2022  
1        1        3.5         5.0   1/1/2022  
2        1        1.5         5.0   1/1/2022  
3        1        1.5         1.5   1/1/2022  
4        1        3.5         3.5   1/1/2022  
...    ...        ...         ...        ...  
6440     1        2.0         6.0  8/31/2022  
6441     1        2.0         6.0  8/31/2022  
6442     1        2.0         6.0  8/31/2022  
6443     1        2.5         2.5  8/31/2022  
6444     1        1.5         1.5  8/31/2022  

[6445 rows x 18 columns]>

If you look at the first few words of the output above, <bound method ..., that is a hint that we have forgotten to evaluate the function by using parentheses. We need df.head() rather than df.head. (The former, df.head(), treats head as a method, which is correct. The latter, df.head, treats head as an attribute.)

In a Math 10 quiz or exam, you’re not expected to have memorized when parentheses are necessary, but if you see an incorrect output like the above, you might be expected to recognize what has gone wrong. And when completing the worksheets, it is very important to be able to correct these sorts of errors.
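One way to check whether a name refers to a method or to plain data is Python’s built-in callable function. Here is a small sketch using a toy string rather than the real DataFrame:

```python
s = "hello"

# A method is callable; you should invoke it with parentheses.
print(callable(s.upper))  # True  -> use s.upper()

# The string itself is plain data, not something you can call.
print(callable(s))        # False
```

If you accidentally call a non-callable object, you get a TypeError like the "'int' object is not callable" error we’ll see later in these notes.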

Here is a correct use of df.head().

df.head()
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022

Here is the full DataFrame.

df
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6440 Processed VJ300320692 EB Public Library EB Public Library x1380 Lindens - Chocolate Chippers Food 15603201222 Wednesday, August 31, 2022 Credit 122 2.0 1 122 2.0 1 2.0 6.0 8/31/2022
6441 Processed VJ300320692 EB Public Library EB Public Library x1380 Wonderful Pistachios - Variety Food 15603201222 Wednesday, August 31, 2022 Credit 131 2.0 1 131 2.0 1 2.0 6.0 8/31/2022
6442 Processed VJ300320692 EB Public Library EB Public Library x1380 Hungry Buddha - Chocolate Chip Food 15603201222 Wednesday, August 31, 2022 Credit 137 2.0 1 137 2.0 1 2.0 6.0 8/31/2022
6443 Processed VJ300320609 GuttenPlans GuttenPlans x1367 Snapple Tea - Lemon Non Carbonated 15603853105 Wednesday, August 31, 2022 Credit 145 2.5 1 145 2.5 1 2.5 2.5 8/31/2022
6444 Processed VJ300320692 EB Public Library EB Public Library x1380 Goldfish Baked - Cheddar Food 15603921383 Wednesday, August 31, 2022 Cash 125 1.5 1 125 1.5 1 1.5 1.5 8/31/2022

6445 rows × 18 columns

Using string methods#

Our overall goal is to count how many transactions occurred on Saturday. We’ll first see how to determine if a specific string represents Saturday, and then we’ll see how to use the same approach on an entire pandas Series.

As humans, it’s obvious to us that the following string represents a Saturday, but how can we programmatically get Python to recognize that?

s = "Saturday, January 1, 2022"

Scan through these string methods and attributes, skipping the “dunder methods” that begin and end with double underscores. (In fact, you should just skip everything that starts with even one underscore.) Do you see a method or attribute that could be used to determine if s contains the word “Saturday”?

dir(s)
['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']

A student suggested we try the find method, which is a great idea. I don’t know that method well, so let’s check its help documentation. The following doesn’t work.

help(find)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [9], line 1
----> 1 help(find)

NameError: name 'find' is not defined

The reason help(find) raised an error is that find is not defined independently in Python; instead, it is bound to the string object s. So we need to call help(s.find) instead. Notice that we do not use help(s.find()), because that would attempt to call find before we ever get to the help portion.

help(s.find)
Help on built-in function find:

find(...) method of builtins.str instance
    S.find(sub[, start[, end]]) -> int
    
    Return the lowest index in S where substring sub is found,
    such that sub is contained within S[start:end].  Optional
    arguments start and end are interpreted as in slice notation.
    
    Return -1 on failure.

We want to input the substring (denoted sub in the help documentation) that we want to find. Here is a reminder of the value of the string s.

s
'Saturday, January 1, 2022'

Now we call the find method of s. We don’t need to use the optional arguments denoted by start and end in the documentation. The following call outputs 0, which is telling us that the substring was found at the very beginning (integer position 0) of s.

s.find("Saturday")
0

These string searches are case sensitive, so "Day" does not occur in s.

s.find("Day")
-1

But "day" does occur, and it starts at integer position 5.

s.find("day")
5
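If we only care whether a substring occurs at all, and not where it occurs, Python’s in operator gives a Boolean directly. A quick sketch using the same string:

```python
s = "Saturday, January 1, 2022"

print("Saturday" in s)  # True
print("Day" in s)       # False: the check is case sensitive
print("day" in s)       # True: "day" appears inside "Saturday"
```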

If we’re confident that "Saturday" will always occur at the beginning of the string, then another option is to use the startswith method. Instead of returning a location, it returns a Boolean value of True or False.

s.startswith("Saturday")
True

It’s important to remember that True is capitalized and that it does not use quotation marks (because it is not a string).

True
True

Let’s see an example of calling this same method on a value taken directly from the DataFrame. The following is a string, just like our s variable.

df.loc[20, "TransDate"]
'Monday, January 3, 2022'

This string does not start with "Saturday", so the following outputs False.

df.loc[20, "TransDate"].startswith("Saturday")
False

Using string methods with a Series#

We now know how to determine if an individual string contains "Saturday" or starts with "Saturday". How can we use that same approach on an entire pandas Series of strings?

myseries = df["TransDate"]

Here is the “TransDate” column.

myseries
0        Saturday, January 1, 2022
1        Saturday, January 1, 2022
2        Saturday, January 1, 2022
3        Saturday, January 1, 2022
4        Saturday, January 1, 2022
                   ...            
6440    Wednesday, August 31, 2022
6441    Wednesday, August 31, 2022
6442    Wednesday, August 31, 2022
6443    Wednesday, August 31, 2022
6444    Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object

When you ask Python for the type of a string, it outputs str.

type("Saturday")
str

Given a pandas Series containing strings, there is what is called an “accessor attribute”, str, which gives us access to Python-style string methods. The type of myseries.str is not so important; we evaluate the following just to show that it is some special type of object: it’s not a string and not a pandas Series. (Much of Python is based on “Object Oriented Programming”, and an effect of this is that Python libraries like pandas are filled with special objects that serve very specific purposes.)

type(myseries.str)
pandas.core.strings.accessor.StringMethods

Notice how similar the attributes and methods listed here are to the methods and attributes defined on strings. They are not exactly the same, but they’re very similar. (If you tried calling dir on a pandas Series, you would see almost no overlap between its methods and string methods, especially if you don’t count the underscore methods.)

dir(myseries.str)
['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__frozen',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_data',
 '_doc_args',
 '_freeze',
 '_get_series_list',
 '_index',
 '_inferred_dtype',
 '_is_categorical',
 '_is_string',
 '_name',
 '_orig',
 '_parent',
 '_validate',
 '_wrap_result',
 'capitalize',
 'casefold',
 'cat',
 'center',
 'contains',
 'count',
 'decode',
 'encode',
 'endswith',
 'extract',
 'extractall',
 'find',
 'findall',
 'fullmatch',
 'get',
 'get_dummies',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'islower',
 'isnumeric',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'len',
 'ljust',
 'lower',
 'lstrip',
 'match',
 'normalize',
 'pad',
 'partition',
 'repeat',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'slice',
 'slice_replace',
 'split',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'wrap',
 'zfill']

A quick reminder on slicing syntax. We’ve only seen this for a DataFrame, but it works the same way for a Series. Here we are getting the first four values of the pandas Series myseries.

myseries[:4]
0    Saturday, January 1, 2022
1    Saturday, January 1, 2022
2    Saturday, January 1, 2022
3    Saturday, January 1, 2022
Name: TransDate, dtype: object

This is as good a time as any to mention the negative indexing notation. Here is a way to get the last 4 values of something (whether a pandas Series as in this case, or a pandas DataFrame, or a Python list, etc).

myseries[-4:]
6441    Wednesday, August 31, 2022
6442    Wednesday, August 31, 2022
6443    Wednesday, August 31, 2022
6444    Wednesday, August 31, 2022
Name: TransDate, dtype: object

Using this string accessor attribute, str, we can use the find method, just like we used it above. We see 0 at the top, because "Saturday" occurs starting at position 0 in the top few values, and we see -1 at the bottom, because "Saturday" does not occur in the last few values, so the find call fails in that case.

myseries.str.find("Saturday")
0       0
1       0
2       0
3       0
4       0
       ..
6440   -1
6441   -1
6442   -1
6443   -1
6444   -1
Name: TransDate, Length: 6445, dtype: int64

Or we can use the startswith method. Instead of returning a Series of integer values, this returns a Series of Boolean values.

myseries.str.startswith("Saturday")
0        True
1        True
2        True
3        True
4        True
        ...  
6440    False
6441    False
6442    False
6443    False
6444    False
Name: TransDate, Length: 6445, dtype: bool

Our overall goal is to count, and that counting is somewhat more natural with the Boolean values. The reason is that True behaves like 1 in Python (or in Matlab) and False behaves like 0, so if we add up Boolean values, the result is exactly the number of times True occurred.

True+True+False
2

Even though True and 1 are not literally the same object in Python, True == 1 evaluates to True.

True == 1
True
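In fact, bool is a subclass of int in Python, which is exactly why Boolean values can be added like numbers:

```python
# True is an int under the hood, with value 1; False has value 0.
print(isinstance(True, int))  # True
print(True + True + False)    # 2
```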

A pandas Series, like myseries.str.startswith("Saturday"), has a sum method, which adds up all the values in the Series. So the following is the first answer to our overall question, how many transaction dates correspond to Saturday?

myseries.str.startswith("Saturday").sum()
684
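As an aside, the str accessor also has a contains method, which returns Booleans like startswith but matches anywhere in the string, not just at the beginning. A sketch on a small made-up stand-in for the “TransDate” column (not the real data):

```python
import pandas as pd

# Hypothetical miniature version of the "TransDate" column
dates = pd.Series(["Saturday, January 1, 2022",
                   "Monday, January 3, 2022",
                   "Saturday, January 8, 2022"])

print(dates.str.contains("Saturday").sum())  # 2
```

For this dataset either method gives the same count, since the day name always appears at the start of the string.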

Here is another example solution, very similar, using Boolean indexing. We first get the sub-DataFrame of df containing only those rows which correspond to a transaction that occurred on Saturday. Notice how what goes inside the square brackets is our Boolean Series from above.

# Boolean indexing
df[myseries.str.startswith("Saturday")]
Status Device ID Location Machine Product Category Transaction TransDate Type RCoil RPrice RQty MCoil MPrice MQty LineTotal TransTotal Prcd Date
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14515778905 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516018629 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 5.0 1/1/2022
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516018629 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 5.0 1/1/2022
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Takis - Hot Chilli Pepper & Lime Food 14516020373 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5 1 1.5 1.5 1/1/2022
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Red Bull - Energy Drink - Sugar Free Carbonated 14516021756 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5 1 3.5 3.5 1/1/2022
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6322 Processed VJ300320692 EB Public Library EB Public Library x1380 Goldfish Baked - Cheddar Food 15586314095 Saturday, August 27, 2022 Credit 125 1.5 1 125 1.5 1 1.5 1.5 8/27/2022
6323 Processed VJ300320692 EB Public Library EB Public Library x1380 Lenny & Larrys - SnickerDoodle Food 15586432746 Saturday, August 27, 2022 Cash 157 3.0 1 157 3.0 1 3.0 3.0 8/27/2022
6324 Processed VJ300320692 EB Public Library EB Public Library x1380 SunChips Multigrain - Harvest Cheddar Food 15586827743 Saturday, August 27, 2022 Cash 110 1.5 1 110 1.5 1 1.5 1.5 8/27/2022
6325 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT Poland Springs Water Water 15586849358 Saturday, August 27, 2022 Credit 143 2.0 1 143 2.0 1 2.0 2.0 8/27/2022
6326 Processed VJ300205292 Brunswick Sq Mall BSQ Mall x1364 - Zales Pop Corners - White Cheddar Food 15586848996 Saturday, August 27, 2022 Cash 114 1.5 1 114 1.5 1 1.5 1.5 8/27/2022

684 rows × 18 columns

The answer to our question (which we already found above using the sum method) can now be found by determining the number of rows in the sub-DataFrame. The sub-DataFrame has 684 rows and 18 columns.

df[myseries.str.startswith("Saturday")].shape
(684, 18)

If we just want the number of rows, we can extract the number at position 0 from the shape tuple.

df[myseries.str.startswith("Saturday")].shape[0]
684
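A slightly shorter equivalent: the built-in len function also reports the number of rows of a DataFrame. Here is a sketch on a hypothetical two-row miniature of our DataFrame:

```python
import pandas as pd

# Hypothetical miniature version of the vending-machine DataFrame
df_toy = pd.DataFrame({"TransDate": ["Saturday, January 1, 2022",
                                     "Monday, January 3, 2022"]})

sub = df_toy[df_toy["TransDate"].str.startswith("Saturday")]
print(sub.shape[0])  # 1
print(len(sub))      # 1, the same number of rows
```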

Using timestamp methods#

It was easy to tell (at least as a human, if not as a computer) that our string s above corresponded to Saturday, because the word “Saturday” was in the string s.

s
'Saturday, January 1, 2022'

A lot of times, dates will be provided to us in a more compact format. For example, even in this same dataset, the dates in the “Prcd Date” column are written differently. Here we access the date in the row labeled 20 and the column labeled "Prcd Date" using .loc.

t = df.loc[20, "Prcd Date"]

It’s much more difficult to tell what day of the week this date corresponds to.

t
'1/3/2022'

As a first step, we will use the to_datetime function in pandas to convert t into a Timestamp. The following is not the correct syntax, because we have not told Python where to find the to_datetime function.

# use the to_datetime function in pandas
to_datetime(t)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [38], line 2
      1 # use the to_datetime function in pandas
----> 2 to_datetime(t)

NameError: name 'to_datetime' is not defined

The correct syntax to use is pd.to_datetime, which tells Python to get the to_datetime function from the pandas library.

ts = pd.to_datetime(t)
ts
Timestamp('2022-01-03 00:00:00')

The variable t is still a string.

type(t)
str

The variable ts that we just defined is a pandas Timestamp. Because it is a new type of object that is specifically related to a time, it will have lots of new functionality.

type(ts)
pandas._libs.tslibs.timestamps.Timestamp

Like usual, we can see all of the methods and attributes defined on ts by using Python’s dir function. Scroll through these and try to decide which ones will tell us that the date represents a Monday.

dir(ts)
['__add__',
 '__array_priority__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__radd__',
 '__reduce__',
 '__reduce_cython__',
 '__reduce_ex__',
 '__repr__',
 '__rsub__',
 '__setattr__',
 '__setstate__',
 '__setstate_cython__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__weakref__',
 '_date_repr',
 '_repr_base',
 '_round',
 '_short_repr',
 '_time_repr',
 'asm8',
 'astimezone',
 'ceil',
 'combine',
 'ctime',
 'date',
 'day',
 'day_name',
 'day_of_week',
 'day_of_year',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'dst',
 'floor',
 'fold',
 'freq',
 'freqstr',
 'fromisocalendar',
 'fromisoformat',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'isocalendar',
 'isoformat',
 'isoweekday',
 'max',
 'microsecond',
 'min',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'normalize',
 'now',
 'quarter',
 'replace',
 'resolution',
 'round',
 'second',
 'strftime',
 'strptime',
 'time',
 'timestamp',
 'timetuple',
 'timetz',
 'to_datetime64',
 'to_julian_date',
 'to_numpy',
 'to_period',
 'to_pydatetime',
 'today',
 'toordinal',
 'tz',
 'tz_convert',
 'tz_localize',
 'tzinfo',
 'tzname',
 'utcfromtimestamp',
 'utcnow',
 'utcoffset',
 'utctimetuple',
 'value',
 'week',
 'weekday',
 'weekofyear',
 'year']

A good guess is to use day, but that tells us the day of the month. Because ts represents January 3rd 2022, the day attribute returns the integer 3.

ts.day
3

Warning. As far as I know, there is no good way to know in advance which of these are methods and which are attributes. (And you certainly won’t be tested on that.) The important thing is to recognize when you have called one incorrectly. Here we try calling day as a method. The error says “‘int’ object is not callable”. That is because ts.day evaluates to 3, so calling ts.day() is like calling 3(), which explains this error message.

ts.day()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [46], line 1
----> 1 ts.day()

TypeError: 'int' object is not callable

Another try is day_of_week, and that is correct, but it’s difficult to know in advance which day of the week corresponds to 0. Let’s keep looking.

ts.day_of_week
0

The dayofweek attribute (note that there are no underscores) returns the same integer 0. We are looking for something that returns the string "Monday".

ts.dayofweek
0
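For reference, pandas numbers the days of the week with Monday as 0 through Sunday as 6. A short check using the same dates from this lecture:

```python
import pandas as pd

ts = pd.to_datetime("1/3/2022")  # January 3, 2022 was a Monday
print(ts.dayofweek)              # 0: pandas counts Monday as 0, ..., Sunday as 6

print(pd.to_datetime("1/1/2022").dayofweek)  # 5: January 1, 2022 was a Saturday
```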

How about day_name?

ts.day_name
<function Timestamp.day_name>

It looks like that didn’t work. This weird output is very similar to (but shorter than) the df.head output we saw up above.

This output is a sign that we should be calling day_name as a method, not as an attribute. The following works.

ts.day_name()
'Monday'

Aside. You might wonder why I call day_name() a method instead of a function. In general, if a function is bound to a Python object, like how this day_name is bound to ts, then it is usually called a method. As small evidence of that, notice in the documentation how both the word “function” and the word “method” are used.

help(ts.day_name)
Help on built-in function day_name:

day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
    Return the day name of the Timestamp with specified locale.
    
    Parameters
    ----------
    locale : str, default None (English locale)
        Locale determining the language in which to return the day name.
    
    Returns
    -------
    str

It’s nice that we can get day of the week from a date. If you had to implement that kind of functionality on its own, it would be very complicated.

Let’s try it again. Why doesn’t the following work?

"9/27/2022".day_name()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [52], line 1
----> 1 "9/27/2022".day_name()

AttributeError: 'str' object has no attribute 'day_name'

The reason the above call to day_name() did not work is that strings do not have a day_name method. (Why would they? What should "Christopher".day_name() return?)

To use this day_name method, we have to first convert the string into a Timestamp.

pd.to_datetime("9/27/2022").day_name()
'Tuesday'

Using timestamp methods with a Series#

Let’s return to our original question (which we already answered above using string methods) of how many transactions occurred on a Saturday. Remember that the relevant column is the “TransDate” column.

df["TransDate"]
0        Saturday, January 1, 2022
1        Saturday, January 1, 2022
2        Saturday, January 1, 2022
3        Saturday, January 1, 2022
4        Saturday, January 1, 2022
                   ...            
6440    Wednesday, August 31, 2022
6441    Wednesday, August 31, 2022
6442    Wednesday, August 31, 2022
6443    Wednesday, August 31, 2022
6444    Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object

Just like we used str to access string methods on a pandas Series, here we use dt (which presumably stands for datetime) to access Timestamp methods on a pandas Series. But we can’t use df["TransDate"].dt directly. The following mistake is basically the same as the "9/27/2022".day_name() mistake above.

dir(df["TransDate"].dt)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In [55], line 1
----> 1 dir(df["TransDate"].dt)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5461, in NDFrame.__getattr__(self, name)
   5454 # Note: obj.x will always call obj.__getattribute__('x') prior to
   5455 # calling obj.__getattr__('x').
   5456 if (
   5457     name in self._internal_names_set
   5458     or name in self._metadata
   5459     or name in self._accessors
   5460 ):
-> 5461     return object.__getattribute__(self, name)
   5462 else:
   5463     if self._info_axis._can_hold_identifiers_and_holds_name(name):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/accessor.py:180, in CachedAccessor.__get__(self, obj, cls)
    177 if obj is None:
    178     # we're accessing the attribute of the class, i.e., Dataset.geo
    179     return self._accessor
--> 180 accessor_obj = self._accessor(obj)
    181 # Replace the property with the accessor object. Inspired by:
    182 # https://www.pydanny.com/cached-property.html
    183 # We need to use object.__setattr__ because we overwrite __setattr__ on
    184 # NDFrame
    185 object.__setattr__(obj, self._name, accessor_obj)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/accessors.py:494, in CombinedDatetimelikeProperties.__new__(cls, data)
    491 elif is_period_dtype(data.dtype):
    492     return PeriodProperties(data, orig)
--> 494 raise AttributeError("Can only use .dt accessor with datetimelike values")

AttributeError: Can only use .dt accessor with datetimelike values

We first need to convert from strings to Timestamps using pd.to_datetime. Notice how the dtype changes. (I’m not sure if there is a difference between the Timestamp objects we saw above and the type referred to below as datetime64[ns]. I think it is safe to think of them as the same.)

pd.to_datetime(df["TransDate"])
0      2022-01-01
1      2022-01-01
2      2022-01-01
3      2022-01-01
4      2022-01-01
          ...    
6440   2022-08-31
6441   2022-08-31
6442   2022-08-31
6443   2022-08-31
6444   2022-08-31
Name: TransDate, Length: 6445, dtype: datetime64[ns]

Now we can call the Python dir function, like I tried to do above. Notice how similar these methods are to the Timestamp attributes and methods we saw in the previous section.

dir(pd.to_datetime(df["TransDate"]).dt)
['__annotations__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__frozen',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_accessors',
 '_add_delegate_accessors',
 '_constructor',
 '_delegate_method',
 '_delegate_property_get',
 '_delegate_property_set',
 '_dir_additions',
 '_dir_deletions',
 '_freeze',
 '_get_values',
 '_hidden_attrs',
 '_parent',
 '_reset_cache',
 'ceil',
 'date',
 'day',
 'day_name',
 'day_of_week',
 'day_of_year',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'floor',
 'freq',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'isocalendar',
 'microsecond',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'normalize',
 'quarter',
 'round',
 'second',
 'strftime',
 'time',
 'timetz',
 'to_period',
 'to_pydatetime',
 'tz',
 'tz_convert',
 'tz_localize',
 'week',
 'weekday',
 'weekofyear',
 'year']

In particular, using this dt accessor attribute, we can call the day_name method.

pd.to_datetime(df["TransDate"]).dt.day_name()
0        Saturday
1        Saturday
2        Saturday
3        Saturday
4        Saturday
          ...    
6440    Wednesday
6441    Wednesday
6442    Wednesday
6443    Wednesday
6444    Wednesday
Name: TransDate, Length: 6445, dtype: object

Let’s now see how often "Saturday" occurs. Here is the same mistake we have made several times above. We are missing the parentheses from value_counts.

pd.to_datetime(df["TransDate"]).dt.day_name().value_counts
<bound method IndexOpsMixin.value_counts of 0        Saturday
1        Saturday
2        Saturday
3        Saturday
4        Saturday
          ...    
6440    Wednesday
6441    Wednesday
6442    Wednesday
6443    Wednesday
6444    Wednesday
Name: TransDate, Length: 6445, dtype: object>

Here is a correct call of value_counts. This returns a Series. The index of this Series contains the days of the week, and the values of this Series indicate how often they occur.

pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()
Wednesday    1120
Tuesday      1113
Thursday     1106
Monday       1023
Friday       1016
Saturday      684
Sunday        383
Name: TransDate, dtype: int64

To access the value corresponding to the key "Saturday", we can use indexing.

pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()["Saturday"]
684

We could also use the .loc indexing.

pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().loc["Saturday"]
684
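Yet another equivalent way to count: compare the Series of day names to "Saturday" directly and sum the resulting Booleans, just as we did with startswith. A sketch on a small made-up Series of dates in the compact “Prcd Date” format (not the real column):

```python
import pandas as pd

# Hypothetical stand-in for a column of compact date strings
dates = pd.to_datetime(pd.Series(["1/1/2022", "1/3/2022", "1/8/2022"]))
names = dates.dt.day_name()

print((names == "Saturday").sum())  # 2: Jan 1 and Jan 8, 2022 were Saturdays
```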

If we’re willing to count to see at what position "Saturday" occurs, we can also use iloc indexing. This iloc approach is definitely the worst approach in this case, but it is useful to know if you want to get, for example, the top value.

pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().iloc[5]
684
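Since value_counts sorts its results in descending order by default, the most common value can be pulled out without counting positions at all. A sketch on made-up data:

```python
import pandas as pd

days = pd.Series(["Wed", "Wed", "Sat", "Mon", "Wed"])  # made-up values
counts = days.value_counts()

print(counts.index[0])  # the most common value, here "Wed"
print(counts.idxmax())  # same answer, via the idxmax method
```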