Week 1 Tuesday#
Overall question for today (we will need some new techniques to answer it).
In the vending machines dataset, how many transactions occurred on Saturday?
We’ll split our solution into two parts.
How to determine if a string corresponds to Saturday.
How to determine how many elements in a pandas Series correspond to Saturday.
import pandas as pd
df = pd.read_csv("vend.csv")
Let’s remind ourselves what this DataFrame looks like. In the following, we’ve made a common mistake, and you’ll be well-served to be able to recognize it.
df.head
<bound method NDFrame.head of Status Device ID Location Machine \
0 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
1 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
2 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
3 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
4 Processed VJ300320611 Brunswick Sq Mall BSQ Mall x1366 - ATT
... ... ... ... ...
6440 Processed VJ300320692 EB Public Library EB Public Library x1380
6441 Processed VJ300320692 EB Public Library EB Public Library x1380
6442 Processed VJ300320692 EB Public Library EB Public Library x1380
6443 Processed VJ300320609 GuttenPlans GuttenPlans x1367
6444 Processed VJ300320692 EB Public Library EB Public Library x1380
Product Category Transaction \
0 Red Bull - Energy Drink - Sugar Free Carbonated 14515778905
1 Red Bull - Energy Drink - Sugar Free Carbonated 14516018629
2 Takis - Hot Chilli Pepper & Lime Food 14516018629
3 Takis - Hot Chilli Pepper & Lime Food 14516020373
4 Red Bull - Energy Drink - Sugar Free Carbonated 14516021756
... ... ... ...
6440 Lindens - Chocolate Chippers Food 15603201222
6441 Wonderful Pistachios - Variety Food 15603201222
6442 Hungry Buddha - Chocolate Chip Food 15603201222
6443 Snapple Tea - Lemon Non Carbonated 15603853105
6444 Goldfish Baked - Cheddar Food 15603921383
TransDate Type RCoil RPrice RQty MCoil MPrice \
0 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
1 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
2 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5
3 Saturday, January 1, 2022 Credit 123 1.5 1 123 1.5
4 Saturday, January 1, 2022 Credit 148 3.5 1 148 3.5
... ... ... ... ... ... ... ...
6440 Wednesday, August 31, 2022 Credit 122 2.0 1 122 2.0
6441 Wednesday, August 31, 2022 Credit 131 2.0 1 131 2.0
6442 Wednesday, August 31, 2022 Credit 137 2.0 1 137 2.0
6443 Wednesday, August 31, 2022 Credit 145 2.5 1 145 2.5
6444 Wednesday, August 31, 2022 Cash 125 1.5 1 125 1.5
MQty LineTotal TransTotal Prcd Date
0 1 3.5 3.5 1/1/2022
1 1 3.5 5.0 1/1/2022
2 1 1.5 5.0 1/1/2022
3 1 1.5 1.5 1/1/2022
4 1 3.5 3.5 1/1/2022
... ... ... ... ...
6440 1 2.0 6.0 8/31/2022
6441 1 2.0 6.0 8/31/2022
6442 1 2.0 6.0 8/31/2022
6443 1 2.5 2.5 8/31/2022
6444 1 1.5 1.5 8/31/2022
[6445 rows x 18 columns]>
If you look at the very first few words displayed in the output above, <bound method ..., that is a hint that we have forgotten to call the method by using parentheses. We need to use df.head() rather than df.head. (The former, df.head(), treats head as a method, which is correct. The latter, df.head, treats head as an attribute.)
In a Math 10 quiz or exam, you’re not expected to have memorized when parentheses are necessary, but if you see an incorrect output like the above, you might be expected to recognize what has gone wrong. And when completing the worksheets, it is very important to be able to correct these sorts of errors.
Here is a correct use of df.head()
.
df.head()
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
Here is the full DataFrame.
df
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6440 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lindens - Chocolate Chippers | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 122 | 2.0 | 1 | 122 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6441 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Wonderful Pistachios - Variety | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 131 | 2.0 | 1 | 131 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6442 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Hungry Buddha - Chocolate Chip | Food | 15603201222 | Wednesday, August 31, 2022 | Credit | 137 | 2.0 | 1 | 137 | 2.0 | 1 | 2.0 | 6.0 | 8/31/2022 |
6443 | Processed | VJ300320609 | GuttenPlans | GuttenPlans x1367 | Snapple Tea - Lemon | Non Carbonated | 15603853105 | Wednesday, August 31, 2022 | Credit | 145 | 2.5 | 1 | 145 | 2.5 | 1 | 2.5 | 2.5 | 8/31/2022 |
6444 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15603921383 | Wednesday, August 31, 2022 | Cash | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/31/2022 |
6445 rows × 18 columns
Using string methods#
Our overall goal is to count how many transactions occurred on Saturday. We’ll first see how to determine if a specific string represents Saturday, and then we’ll see how to use the same approach on an entire pandas Series.
As humans, it’s obvious to us that the following string represents a Saturday, but how can we programmatically get Python to recognize that?
s = "Saturday, January 1, 2022"
Scan through these string methods and attributes, skipping the “dunder methods” that begin and end with double underscores. (In fact, you should just skip everything that starts with even one underscore.) Do you see a method or attribute that could be used to determine if s contains the word “Saturday”?
dir(s)
['__add__',
'__class__',
'__contains__',
'__delattr__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__getitem__',
'__getnewargs__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__len__',
'__lt__',
'__mod__',
'__mul__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__rmod__',
'__rmul__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'capitalize',
'casefold',
'center',
'count',
'encode',
'endswith',
'expandtabs',
'find',
'format',
'format_map',
'index',
'isalnum',
'isalpha',
'isascii',
'isdecimal',
'isdigit',
'isidentifier',
'islower',
'isnumeric',
'isprintable',
'isspace',
'istitle',
'isupper',
'join',
'ljust',
'lower',
'lstrip',
'maketrans',
'partition',
'removeprefix',
'removesuffix',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'split',
'splitlines',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'zfill']
A student suggested we try the find method, which is a great idea. I don’t know that method well, so let’s check its help documentation. The following doesn’t work.
help(find)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [9], line 1
----> 1 help(find)
NameError: name 'find' is not defined
The reason help(find) raised an error is that find is not defined independently in Python; instead, it is bound to the string object s. So we need to call help(s.find) instead. Notice that we do not use help(s.find()), because that would attempt to call find before we ever get to the help portion.
help(s.find)
Help on built-in function find:
find(...) method of builtins.str instance
S.find(sub[, start[, end]]) -> int
Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end]. Optional
arguments start and end are interpreted as in slice notation.
Return -1 on failure.
We want to input the substring (denoted sub in the help documentation) that we want to find. Here is a reminder of the value of the string s.
s
'Saturday, January 1, 2022'
Now we call the find method of s. We don’t need the optional arguments denoted by start and end in the documentation. The following call outputs 0, which tells us that the substring was found at the very beginning (integer position 0) of s.
s.find("Saturday")
0
These string searches are case sensitive, so "Day" does not occur in s.
s.find("Day")
-1
But "day" does occur, and it starts at integer position 5.
s.find("day")
5
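Since these Boolean-flavored questions can also be phrased with Python’s in operator, here is a short sketch (not from lecture) comparing the two approaches.

```python
# Comparing str.find with the `in` operator on our date string.
s = "Saturday, January 1, 2022"

# find returns the starting index of the substring, or -1 if absent.
print(s.find("Saturday"))  # 0
print(s.find("day"))       # 5
print(s.find("Day"))       # -1 (searches are case sensitive)

# The `in` operator returns a Boolean directly, which is often all we need.
print("Saturday" in s)     # True
```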
If we’re confident that "Saturday" will always occur at the beginning of the string, then another option is to use the startswith method. Instead of returning a location, it returns a Boolean value of True or False.
s.startswith("Saturday")
True
It’s important to remember that True is capitalized and that it does not use quotation marks (because it is not a string).
True
True
Let’s see an example of calling this same method on a value taken directly from the DataFrame. The following is a string, just like our s variable.
df.loc[20, "TransDate"]
'Monday, January 3, 2022'
This string does not start with "Saturday", so the following outputs False.
df.loc[20, "TransDate"].startswith("Saturday")
False
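To package this check for reuse, we might write a small helper function like the following sketch. (The name is_saturday is our own invention, not part of the course materials or the dataset.)

```python
def is_saturday(date_string):
    """Return True if the date string begins with "Saturday"."""
    return date_string.startswith("Saturday")

print(is_saturday("Saturday, January 1, 2022"))  # True
print(is_saturday("Monday, January 3, 2022"))    # False
```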
Using string methods with a Series#
We now know how to determine if an individual string contains "Saturday" or starts with "Saturday". How can we use that same approach on an entire pandas Series of strings?
myseries = df["TransDate"]
Here is the “TransDate” column.
myseries
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
4 Saturday, January 1, 2022
...
6440 Wednesday, August 31, 2022
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object
When you ask Python what the type of a string is, it outputs str.
type("Saturday")
str
Given a pandas Series containing strings, there is what is called an “accessor attribute” str which gives us access to Python string methods. The type of myseries.str is not so important; we evaluate the following just to show that it is a special type of object: it’s not a string and not a pandas Series. (Much of Python is based on “Object Oriented Programming”, and one effect of this is that Python libraries like pandas are filled with special objects that serve very specific purposes.)
type(myseries.str)
pandas.core.strings.accessor.StringMethods
Notice how similar the attributes and methods listed here are to the methods and attributes defined on strings. They are not exactly the same, but they’re very similar. (If you called dir on a pandas Series itself, you would see almost no overlap with string methods, especially if you don’t count the underscore methods.)
dir(myseries.str)
['__annotations__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__frozen',
'__ge__',
'__getattribute__',
'__getitem__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__iter__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_data',
'_doc_args',
'_freeze',
'_get_series_list',
'_index',
'_inferred_dtype',
'_is_categorical',
'_is_string',
'_name',
'_orig',
'_parent',
'_validate',
'_wrap_result',
'capitalize',
'casefold',
'cat',
'center',
'contains',
'count',
'decode',
'encode',
'endswith',
'extract',
'extractall',
'find',
'findall',
'fullmatch',
'get',
'get_dummies',
'index',
'isalnum',
'isalpha',
'isdecimal',
'isdigit',
'islower',
'isnumeric',
'isspace',
'istitle',
'isupper',
'join',
'len',
'ljust',
'lower',
'lstrip',
'match',
'normalize',
'pad',
'partition',
'repeat',
'replace',
'rfind',
'rindex',
'rjust',
'rpartition',
'rsplit',
'rstrip',
'slice',
'slice_replace',
'split',
'startswith',
'strip',
'swapcase',
'title',
'translate',
'upper',
'wrap',
'zfill']
A quick reminder on slicing syntax. We’ve only seen this for a DataFrame, but it works the same way for a Series. Here we are getting the first four values of the pandas Series myseries.
myseries[:4]
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
Name: TransDate, dtype: object
This is as good a time as any to mention negative indexing notation. Here is a way to get the last 4 values of something (whether a pandas Series, as in this case, or a pandas DataFrame, or a Python list, etc.).
myseries[-4:]
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, dtype: object
Using this string accessor attribute, str, we can use the find method just like we used it above. We see 0 at the top, because "Saturday" occurs starting at position 0 in the top few values, and we see -1 at the bottom, because "Saturday" does not occur in the last few values, so the find call fails there.
myseries.str.find("Saturday")
0 0
1 0
2 0
3 0
4 0
..
6440 -1
6441 -1
6442 -1
6443 -1
6444 -1
Name: TransDate, Length: 6445, dtype: int64
Or we can use the startswith method. Instead of returning a Series of integer values, this returns a Series of Boolean values.
myseries.str.startswith("Saturday")
0 True
1 True
2 True
3 True
4 True
...
6440 False
6441 False
6442 False
6443 False
6444 False
Name: TransDate, Length: 6445, dtype: bool
Our overall goal is to count, and that counting is somewhat more natural with the Boolean values. The reason is that True behaves like 1 in Python (as in Matlab) and False behaves like 0, so if we add up Boolean values, the result is exactly the number of times True occurred.
True+True+False
2
Even though True and 1 are not literally the same object in Python, True == 1 evaluates to True.
True == 1
True
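As a small sketch of this behavior, note that bool is actually a subclass of int in Python, which is exactly why sums of Booleans count the Trues.

```python
bools = [True, True, False, True]

# Each True contributes 1 to the sum, each False contributes 0.
print(sum(bools))             # 3

# bool is a subclass of int, which explains this behavior.
print(isinstance(True, int))  # True
print(issubclass(bool, int))  # True
```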
A pandas Series, like myseries.str.startswith("Saturday"), has a sum method, which adds up all the values in the Series. So the following is the first answer to our overall question: how many transaction dates correspond to Saturday?
myseries.str.startswith("Saturday").sum()
684
Here is another, very similar, example solution using Boolean indexing. We first get the sub-DataFrame of df containing only those rows corresponding to a transaction that occurred on Saturday. Notice that what goes inside the square brackets is our Boolean Series from above.
# Boolean indexing
df[myseries.str.startswith("Saturday")]
Status | Device ID | Location | Machine | Product | Category | Transaction | TransDate | Type | RCoil | RPrice | RQty | MCoil | MPrice | MQty | LineTotal | TransTotal | Prcd Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14515778905 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
1 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516018629 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 5.0 | 1/1/2022 |
2 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516018629 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 5.0 | 1/1/2022 |
3 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Takis - Hot Chilli Pepper & Lime | Food | 14516020373 | Saturday, January 1, 2022 | Credit | 123 | 1.5 | 1 | 123 | 1.5 | 1 | 1.5 | 1.5 | 1/1/2022 |
4 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Red Bull - Energy Drink - Sugar Free | Carbonated | 14516021756 | Saturday, January 1, 2022 | Credit | 148 | 3.5 | 1 | 148 | 3.5 | 1 | 3.5 | 3.5 | 1/1/2022 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
6322 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Goldfish Baked - Cheddar | Food | 15586314095 | Saturday, August 27, 2022 | Credit | 125 | 1.5 | 1 | 125 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
6323 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | Lenny & Larrys - SnickerDoodle | Food | 15586432746 | Saturday, August 27, 2022 | Cash | 157 | 3.0 | 1 | 157 | 3.0 | 1 | 3.0 | 3.0 | 8/27/2022 |
6324 | Processed | VJ300320692 | EB Public Library | EB Public Library x1380 | SunChips Multigrain - Harvest Cheddar | Food | 15586827743 | Saturday, August 27, 2022 | Cash | 110 | 1.5 | 1 | 110 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
6325 | Processed | VJ300320611 | Brunswick Sq Mall | BSQ Mall x1366 - ATT | Poland Springs Water | Water | 15586849358 | Saturday, August 27, 2022 | Credit | 143 | 2.0 | 1 | 143 | 2.0 | 1 | 2.0 | 2.0 | 8/27/2022 |
6326 | Processed | VJ300205292 | Brunswick Sq Mall | BSQ Mall x1364 - Zales | Pop Corners - White Cheddar | Food | 15586848996 | Saturday, August 27, 2022 | Cash | 114 | 1.5 | 1 | 114 | 1.5 | 1 | 1.5 | 1.5 | 8/27/2022 |
684 rows × 18 columns
The answer to our question (which we already found above using the sum method) can now be found by determining the number of rows in the sub-DataFrame. The sub-DataFrame has 684 rows and 18 columns.
df[myseries.str.startswith("Saturday")].shape
(684, 18)
If we just want the number of rows, we can extract the number at position 0 from the shape tuple.
df[myseries.str.startswith("Saturday")].shape[0]
684
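Both counting strategies can be seen on a tiny hand-made Series. (The data below is made up for illustration, not read from vend.csv.)

```python
import pandas as pd

dates = pd.Series([
    "Saturday, January 1, 2022",
    "Saturday, January 1, 2022",
    "Monday, January 3, 2022",
])

mask = dates.str.startswith("Saturday")

# Strategy 1: sum the Boolean Series directly.
print(mask.sum())            # 2

# Strategy 2: Boolean indexing, then count rows with shape[0].
print(dates[mask].shape[0])  # 2
```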
Using timestamp methods#
It was easy to tell (at least as a human, if not as a computer) that our string s above corresponded to Saturday, because the word “Saturday” was in the string s.
s
'Saturday, January 1, 2022'
A lot of the time, dates will be provided to us in a more compact format. For example, even in this same dataset, the dates in the “Prcd Date” column are written differently. Here we access the date in the row labeled 20 and the column labeled "Prcd Date" using .loc.
t = df.loc[20, "Prcd Date"]
It’s much more difficult to tell what day of the week this date corresponds to.
t
'1/3/2022'
As a first step, we will use the to_datetime function in pandas to convert t into a Timestamp. The following is not the correct syntax, because we have not told Python where to find the to_datetime function.
# use the to_datetime function in pandas
to_datetime(t)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [38], line 2
1 # use the to_datetime function in pandas
----> 2 to_datetime(t)
NameError: name 'to_datetime' is not defined
The correct syntax is pd.to_datetime, which tells Python to get the to_datetime function from the pandas library.
ts = pd.to_datetime(t)
ts
Timestamp('2022-01-03 00:00:00')
The variable t is still a string.
type(t)
str
The variable ts that we just defined is a pandas Timestamp. Because it is a new type of object specifically related to a time, it has lots of new functionality.
type(ts)
pandas._libs.tslibs.timestamps.Timestamp
As usual, we can see all of the methods and attributes defined on ts by using Python’s dir function. Scroll through these and try to decide which ones will tell us that the date represents a Monday.
dir(ts)
['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_cython__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__setstate_cython__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_date_repr',
'_repr_base',
'_round',
'_short_repr',
'_time_repr',
'asm8',
'astimezone',
'ceil',
'combine',
'ctime',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'dst',
'floor',
'fold',
'freq',
'freqstr',
'fromisocalendar',
'fromisoformat',
'fromordinal',
'fromtimestamp',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'isoformat',
'isoweekday',
'max',
'microsecond',
'min',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'now',
'quarter',
'replace',
'resolution',
'round',
'second',
'strftime',
'strptime',
'time',
'timestamp',
'timetuple',
'timetz',
'to_datetime64',
'to_julian_date',
'to_numpy',
'to_period',
'to_pydatetime',
'today',
'toordinal',
'tz',
'tz_convert',
'tz_localize',
'tzinfo',
'tzname',
'utcfromtimestamp',
'utcnow',
'utcoffset',
'utctimetuple',
'value',
'week',
'weekday',
'weekofyear',
'year']
A good guess is to use day, but that tells us the day of the month. Because ts represents January 3rd, 2022, the day attribute returns the integer 3.
ts.day
3
Warning. As far as I know, there is no good way to know in advance which of these are methods and which are attributes. (And you certainly won’t be tested on that.) The important thing is to recognize when you have called one incorrectly. Here we try calling day as a method. The error says “'int' object is not callable”. That is because ts.day evaluates to 3, so calling ts.day() is like calling 3(), which explains this error message.
ts.day()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [46], line 1
----> 1 ts.day()
TypeError: 'int' object is not callable
Another try is day_of_week, and that is correct, but it’s difficult to know in advance which day of the week corresponds to 0. Let’s keep looking.
ts.day_of_week
0
The dayofweek attribute (note that there are no underscores) returns the same integer 0. We are looking for something that returns the string "Monday".
ts.dayofweek
0
How about day_name?
ts.day_name
<function Timestamp.day_name>
It looks like that didn’t work. This weird output is very similar to (but shorter than) the df.head output we saw above. This output is a sign that we should be calling day_name as a method, not as an attribute. The following works.
ts.day_name()
'Monday'
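Here is a short recap sketch of the three Timestamp attributes and methods we just tried, all on the same date:

```python
import pandas as pd

ts = pd.to_datetime("1/3/2022")

print(ts.day)          # 3 (day of the month; an attribute)
print(ts.day_of_week)  # 0 (Monday corresponds to 0; an attribute)
print(ts.day_name())   # 'Monday' (a method, so parentheses are required)
```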
Aside. You might wonder why I call day_name() a method instead of a function. In general, if a function is bound to a Python object, like this day_name is bound to ts, it is usually called a method. As small evidence of that, notice in the documentation how both the word “function” and the word “method” are used.
help(ts.day_name)
Help on built-in function day_name:
day_name(...) method of pandas._libs.tslibs.timestamps.Timestamp instance
Return the day name of the Timestamp with specified locale.
Parameters
----------
locale : str, default None (English locale)
Locale determining the language in which to return the day name.
Returns
-------
str
It’s nice that we can get the day of the week from a date. If you had to implement that kind of functionality from scratch, it would be very complicated.
Let’s try it again. Why doesn’t the following work?
"9/27/2022".day_name()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [52], line 1
----> 1 "9/27/2022".day_name()
AttributeError: 'str' object has no attribute 'day_name'
The reason the above call to day_name() did not work is that strings do not have a day_name method. (Why would they? What should "Christopher".day_name() return?)
To use this day_name method, we have to first convert the string into a Timestamp.
pd.to_datetime("9/27/2022").day_name()
'Tuesday'
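As an aside, Python’s standard library datetime module can do the same conversion, if we specify the date format explicitly. This is just a sketch for comparison; in this course we use pandas.

```python
from datetime import datetime

# strptime parses a string into a datetime, given an explicit format.
d = datetime.strptime("9/27/2022", "%m/%d/%Y")

# %A in strftime produces the full weekday name.
print(d.strftime("%A"))  # Tuesday
```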
Using timestamp methods with a Series#
Let’s return to our original question (which we already answered above using string methods) of how many transactions occurred on a Saturday. Remember that the relevant column is the “TransDate” column.
df["TransDate"]
0 Saturday, January 1, 2022
1 Saturday, January 1, 2022
2 Saturday, January 1, 2022
3 Saturday, January 1, 2022
4 Saturday, January 1, 2022
...
6440 Wednesday, August 31, 2022
6441 Wednesday, August 31, 2022
6442 Wednesday, August 31, 2022
6443 Wednesday, August 31, 2022
6444 Wednesday, August 31, 2022
Name: TransDate, Length: 6445, dtype: object
Just like we used str to access string methods on a pandas Series, here we use dt (which presumably stands for datetime) to access Timestamp methods on a pandas Series. But we can’t use df["TransDate"].dt directly. The following mistake is basically the same as the "9/27/2022".day_name() mistake above.
dir(df["TransDate"].dt)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [55], line 1
----> 1 dir(df["TransDate"].dt)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:5461, in NDFrame.__getattr__(self, name)
5454 # Note: obj.x will always call obj.__getattribute__('x') prior to
5455 # calling obj.__getattr__('x').
5456 if (
5457 name in self._internal_names_set
5458 or name in self._metadata
5459 or name in self._accessors
5460 ):
-> 5461 return object.__getattribute__(self, name)
5462 else:
5463 if self._info_axis._can_hold_identifiers_and_holds_name(name):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/accessor.py:180, in CachedAccessor.__get__(self, obj, cls)
177 if obj is None:
178 # we're accessing the attribute of the class, i.e., Dataset.geo
179 return self._accessor
--> 180 accessor_obj = self._accessor(obj)
181 # Replace the property with the accessor object. Inspired by:
182 # https://www.pydanny.com/cached-property.html
183 # We need to use object.__setattr__ because we overwrite __setattr__ on
184 # NDFrame
185 object.__setattr__(obj, self._name, accessor_obj)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/accessors.py:494, in CombinedDatetimelikeProperties.__new__(cls, data)
491 elif is_period_dtype(data.dtype):
492 return PeriodProperties(data, orig)
--> 494 raise AttributeError("Can only use .dt accessor with datetimelike values")
AttributeError: Can only use .dt accessor with datetimelike values
We first need to convert from strings to Timestamps using pd.to_datetime. Notice how the dtype changes. (I’m not sure if there is a difference between the Timestamp objects we saw above and the type referred to below as datetime64[ns]. I think it is safe to think of them as the same.)
pd.to_datetime(df["TransDate"])
0 2022-01-01
1 2022-01-01
2 2022-01-01
3 2022-01-01
4 2022-01-01
...
6440 2022-08-31
6441 2022-08-31
6442 2022-08-31
6443 2022-08-31
6444 2022-08-31
Name: TransDate, Length: 6445, dtype: datetime64[ns]
Now we can call the Python dir function, like I tried to do above. Notice how similar these methods are to the Timestamp attributes and methods we saw in the previous section.
dir(pd.to_datetime(df["TransDate"]).dt)
['__annotations__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__frozen',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__reduce__',
'__reduce_ex__',
'__repr__',
'__setattr__',
'__sizeof__',
'__str__',
'__subclasshook__',
'__weakref__',
'_accessors',
'_add_delegate_accessors',
'_constructor',
'_delegate_method',
'_delegate_property_get',
'_delegate_property_set',
'_dir_additions',
'_dir_deletions',
'_freeze',
'_get_values',
'_hidden_attrs',
'_parent',
'_reset_cache',
'ceil',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'floor',
'freq',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'microsecond',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'quarter',
'round',
'second',
'strftime',
'time',
'timetz',
'to_period',
'to_pydatetime',
'tz',
'tz_convert',
'tz_localize',
'week',
'weekday',
'weekofyear',
'year']
In particular, using this dt accessor attribute, we can call the day_name method.
pd.to_datetime(df["TransDate"]).dt.day_name()
0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
...
6440 Wednesday
6441 Wednesday
6442 Wednesday
6443 Wednesday
6444 Wednesday
Name: TransDate, Length: 6445, dtype: object
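The same dt.day_name() pattern works on any datetime Series; here is a minimal self-contained sketch with a few hand-picked dates (the day names below match what we saw in the dataset):

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["1/1/2022", "1/3/2022", "8/31/2022"]))

# The dt accessor applies Timestamp methods elementwise.
print(dates.dt.day_name().tolist())  # ['Saturday', 'Monday', 'Wednesday']
```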
Let’s now see how often "Saturday" occurs. Here is the same mistake we have made several times above: we are missing the parentheses on value_counts.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts
<bound method IndexOpsMixin.value_counts of 0 Saturday
1 Saturday
2 Saturday
3 Saturday
4 Saturday
...
6440 Wednesday
6441 Wednesday
6442 Wednesday
6443 Wednesday
6444 Wednesday
Name: TransDate, Length: 6445, dtype: object>
Here is a correct call of value_counts. It returns a Series whose index contains the days of the week and whose values indicate how often each occurs.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()
Wednesday 1120
Tuesday 1113
Thursday 1106
Monday 1023
Friday 1016
Saturday 684
Sunday 383
Name: TransDate, dtype: int64
To access the value corresponding to the key "Saturday", we can use indexing.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts()["Saturday"]
684
We could also use .loc indexing.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().loc["Saturday"]
684
If we’re willing to count to see at what position "Saturday" occurs, we could also use iloc indexing. This iloc approach is definitely the worst approach in this case, but it is useful to know if you want to get, for example, the top value.
pd.to_datetime(df["TransDate"]).dt.day_name().value_counts().iloc[5]
684
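One more option, not used in lecture: compare the day-name Series to "Saturday" and sum, which skips value_counts entirely. A sketch on made-up data:

```python
import pandas as pd

days = pd.Series(["Saturday", "Saturday", "Monday", "Wednesday"])

# Comparing a Series to a string produces a Boolean Series,
# and summing it counts the matches.
print((days == "Saturday").sum())  # 2
```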