Week 1 Wednesday

Most of today’s class will be time to work on Worksheet 2, which is due next Tuesday before discussion section. Alicia, Chupeng, and I are here to help.

Announcements

  • I have office hours in here (ALP 3600) today at 1pm. I usually leave after about 20 minutes if nobody is around, so try to come before 1:20pm!

  • I added annotations to the Deepnote notebooks in the course notes. If you want to review what we covered on Monday or Tuesday, I recommend looking in the course notes instead of in Deepnote. I don’t know if I’ll always find time to add annotations, but if there is a particular notebook you’re stuck on, remind me on Ed Discussion to add annotations, or let me know what part is confusing.

  • No new material should be presented today or Thursday, so if you’re feeling overwhelmed with how much has been introduced, this is a great chance to catch up. By far the best way to learn this material is to try using it yourself. That’s a big reason for the worksheets.

  • Videos and video quizzes are due Friday before lecture.

  • Our first in-class quiz will be next week on Tuesday in discussion section. It hasn’t been written yet, but it will very likely include at least one question involving Boolean indexing and at least one question involving either the str accessor attribute or the dt accessor attribute. The quizzes are closed book and closed computer.

Warm-up

  • In the attached vending machine file, what date occurs most often in the “Prcd Date” column? (Don’t worry about ties; just get the top date that occurs when using value_counts.)

  • What day of the week is that?

import pandas as pd
df = pd.read_csv("vend.csv")
df["Prcd Date"]
0        1/1/2022
1        1/1/2022
2        1/1/2022
3        1/1/2022
4        1/1/2022
          ...    
6440    8/31/2022
6441    8/31/2022
6442    8/31/2022
6443    8/31/2022
6444    8/31/2022
Name: Prcd Date, Length: 6445, dtype: object

It’s easy enough to find the most frequent value by looking at the pandas Series produced by calling value_counts(), but it is better to also have a “programmatic” way to get this value.

myseries = df["Prcd Date"].value_counts()
myseries
3/30/2022    123
7/22/2022    117
7/30/2022    108
7/7/2022      83
6/26/2022     77
            ... 
8/14/2022      2
1/9/2022       2
7/17/2022      1
4/17/2022      1
1/29/2022      1
Name: Prcd Date, Length: 241, dtype: int64

If we use iloc, that will get us the value, which is not what we want. We want the corresponding key, which is "3/30/2022" in this case.

myseries.iloc[0]
123

I’m very surprised that the following works, instead of raising a KeyError. The explanation is that, when the keys are not integers (here they are strings), pandas treats an integer in plain square brackets as a position, the way iloc does. This fallback is deprecated in newer versions of pandas, so we should not use this kind of indexing with a pandas Series; use iloc when you mean a position.

myseries[0]
123

This is the error I was expecting the previous cell to raise.

myseries.loc[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
   3080 try:
-> 3081     return self._engine.get_loc(casted_key)
   3082 except KeyError as err:

File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:101, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:4554, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:4562, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [8], line 1
----> 1 myseries.loc[0]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
    892 axis = self.axis or 0
    894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1124, in _LocIndexer._getitem_axis(self, key, axis)
   1122 # fall thru to straight lookup
   1123 self._validate_key(key, axis)
-> 1124 return self._get_label(key, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocIndexer._get_label(self, label, axis)
   1071 def _get_label(self, label, axis: int):
   1072     # GH#5667 this will fail if the label is not present in the axis.
-> 1073     return self.obj.xs(label, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:3739, in NDFrame.xs(self, key, axis, level, drop_level)
   3737         raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3738 else:
-> 3739     loc = index.get_loc(key)
   3741     if isinstance(loc, np.ndarray):
   3742         if loc.dtype == np.bool_:

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
   3081         return self._engine.get_loc(casted_key)
   3082     except KeyError as err:
-> 3083         raise KeyError(key) from err
   3085 if tolerance is not None:
   3086     tolerance = self._convert_tolerance(tolerance, np.asarray(key))

KeyError: 0

Here is the correct way to use loc. Notice (by looking in the above series) how the value corresponding to this key is 2.

myseries.loc["8/14/2022"]
2

Usually loc is not needed with a pandas Series, because you can just use square brackets directly to access a value by the key.

myseries["8/14/2022"]
2
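As a quick recap of the three kinds of lookup above, here is a minimal sketch on a small made-up Series. The name counts and its values are assumptions for illustration; they only imitate the shape of myseries.

```python
import pandas as pd

# Hypothetical toy data standing in for the value_counts output above.
counts = pd.Series([123, 117, 2], index=["3/30/2022", "7/22/2022", "8/14/2022"])

by_position = counts.iloc[0]         # iloc: look up by integer position
by_label = counts.loc["8/14/2022"]   # loc: look up by key (label)
by_brackets = counts["8/14/2022"]    # plain brackets with a string key act like loc

print(by_position, by_label, by_brackets)

# loc with a key that is not in the index raises a KeyError.
try:
    counts.loc[0]
except KeyError as err:
    print("KeyError:", err)
```

The habit this is meant to reinforce: use iloc when you mean a position, and use loc (or plain brackets) when you mean a key.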

Using iloc with this sort of key will raise an error, because iloc works with integer positions (a single integer, a slice, a list of integers, or a Boolean array), never with keys.

myseries.iloc["8/14/2022"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [12], line 1
----> 1 myseries.iloc["8/14/2022"]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
    892 axis = self.axis or 0
    894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1498, in _iLocIndexer._getitem_axis(self, key, axis)
   1496 key = item_from_zerodim(key)
   1497 if not is_integer(key):
-> 1498     raise TypeError("Cannot index by location index with a non-integer key")
   1500 # validate the location
   1501 self._validate_integer(key, axis)

TypeError: Cannot index by location index with a non-integer key
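For contrast with the error above, here is a small sketch of inputs that iloc does accept. The Series counts is made-up data, not the real vending machine counts.

```python
import pandas as pd

# Hypothetical toy data; the index holds strings, just like myseries.
counts = pd.Series(
    [123, 117, 108, 2],
    index=["3/30/2022", "7/22/2022", "7/30/2022", "8/14/2022"],
)

first_two = counts.iloc[:2]     # a slice of positions
picked = counts.iloc[[0, 3]]    # a list of positions
last = counts.iloc[-1]          # negative positions count from the end

# A string is a key, not a position, so iloc rejects it.
try:
    counts.iloc["8/14/2022"]
except TypeError as err:
    print("TypeError:", err)
```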

Here is a reminder of what myseries contains.

myseries
3/30/2022    123
7/22/2022    117
7/30/2022    108
7/7/2022      83
6/26/2022     77
            ... 
8/14/2022      2
1/9/2022       2
7/17/2022      1
4/17/2022      1
1/29/2022      1
Name: Prcd Date, Length: 241, dtype: int64

Let’s finally get to accessing the most frequent date, 3/30/2022, using this series. We can use the index attribute to get all of these keys.

myseries.index
Index(['3/30/2022', '7/22/2022', '7/30/2022', '7/7/2022', '6/26/2022',
       '7/23/2022', '7/14/2022', '4/22/2022', '8/18/2022', '7/13/2022',
       ...
       '7/24/2022', '7/4/2022', '1/2/2022', '8/21/2022', '7/3/2022',
       '8/14/2022', '1/9/2022', '7/17/2022', '4/17/2022', '1/29/2022'],
      dtype='object', length=241)

The two most important pandas data types are the pandas Series and the pandas DataFrame. There is also a pandas Index data type, but for our purposes we can mostly treat it like a list (one difference is that its entries cannot be changed in place).

type(myseries.index)
pandas.core.indexes.base.Index
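Here is a short sketch of what “like a list” means in practice for an Index, along with the main way it differs. The Index idx is built by hand from made-up dates for illustration.

```python
import pandas as pd

# Hypothetical Index, similar to myseries.index but much shorter.
idx = pd.Index(["3/30/2022", "7/22/2022", "7/30/2022"])

first = idx[0]                  # integer indexing works like a list
n = len(idx)                    # so does len
has_date = "7/22/2022" in idx   # and so does the "in" operator
as_list = list(idx)             # convert to a real list if you need one

# Unlike a list, an Index cannot be modified in place.
try:
    idx[0] = "1/1/2022"
except TypeError as err:
    print("TypeError:", err)
```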

We want the initial element in this index, so we use indexing to access it.

myseries.index[0]
'3/30/2022'
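Putting the steps together, here is a self-contained sketch on made-up data, since vend.csv is not reproduced here. It also shows idxmax, which is one alternative way to get the key of the largest count; that alternative is my addition and was not used in class.

```python
import pandas as pd

# Hypothetical dates; "3/30/2022" is the unique most frequent value.
dates = pd.Series(
    ["3/30/2022", "1/1/2022", "3/30/2022", "7/22/2022", "3/30/2022", "7/22/2022"]
)

counts = dates.value_counts()   # sorted from most frequent to least frequent
top_date = counts.index[0]      # key of the first (largest) count
top_count = counts.iloc[0]      # the count itself

# idxmax returns the key of the largest value, without relying on the sort order.
also_top = counts.idxmax()

print(top_date, top_count)
```

As in the warm-up question, this sketch does not worry about ties.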

Now we can work on the second question, which was, what day of the week does this date correspond to? This is harder to answer than on Tuesday, because the day of the week is not shown in the string.

mydate = myseries.index[0]

Notice that, for now, mydate really is a string.

type(mydate)
str

We’d rather this value be something that has functionality related to being a date, so we will convert it to a Timestamp using pd.to_datetime.

pd.to_datetime(mydate)
Timestamp('2022-03-30 00:00:00')

It really is a pandas Timestamp.

type(pd.to_datetime(mydate))
pandas._libs.tslibs.timestamps.Timestamp

Now we can use the same day_name method we used on Tuesday.

pd.to_datetime(mydate).day_name()
'Wednesday'
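The same idea extends from a single string to a whole column by combining pd.to_datetime with the dt accessor. The sketch below uses a short made-up Series in place of the “Prcd Date” column, and the format string is an assumption based on how the dates are displayed above.

```python
import pandas as pd

# Hypothetical stand-in for the "Prcd Date" column.
dates = pd.Series(["1/1/2022", "3/30/2022", "8/31/2022"])

# Convert the strings to datetimes; the format is month/day/year.
as_dates = pd.to_datetime(dates, format="%m/%d/%Y")

# The dt accessor applies day_name to every entry at once.
day_names = as_dates.dt.day_name()
print(day_names)
```

Notice that a single Timestamp uses day_name directly, while a Series of datetimes needs the dt accessor first.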

If you don’t remember the name of this method, you can always browse the options by using the Python dir function.

dir(pd.to_datetime(mydate))
['__add__',
 '__array_priority__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__radd__',
 '__reduce__',
 '__reduce_cython__',
 '__reduce_ex__',
 '__repr__',
 '__rsub__',
 '__setattr__',
 '__setstate__',
 '__setstate_cython__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__weakref__',
 '_date_repr',
 '_repr_base',
 '_round',
 '_short_repr',
 '_time_repr',
 'asm8',
 'astimezone',
 'ceil',
 'combine',
 'ctime',
 'date',
 'day',
 'day_name',
 'day_of_week',
 'day_of_year',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'dst',
 'floor',
 'fold',
 'freq',
 'freqstr',
 'fromisocalendar',
 'fromisoformat',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 'is_year_end',
 'is_year_start',
 'isocalendar',
 'isoformat',
 'isoweekday',
 'max',
 'microsecond',
 'min',
 'minute',
 'month',
 'month_name',
 'nanosecond',
 'normalize',
 'now',
 'quarter',
 'replace',
 'resolution',
 'round',
 'second',
 'strftime',
 'strptime',
 'time',
 'timestamp',
 'timetuple',
 'timetz',
 'to_datetime64',
 'to_julian_date',
 'to_numpy',
 'to_period',
 'to_pydatetime',
 'today',
 'toordinal',
 'tz',
 'tz_convert',
 'tz_localize',
 'tzinfo',
 'tzname',
 'utcfromtimestamp',
 'utcnow',
 'utcoffset',
 'utctimetuple',
 'value',
 'week',
 'weekday',
 'weekofyear',
 'year']
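The list above is long, and many entries start with an underscore. One way to narrow it down is a list comprehension, as in the following sketch; the variable names are only for illustration.

```python
import pandas as pd

ts = pd.to_datetime("3/30/2022")

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(ts) if not name.startswith("_")]

# Narrow further to names containing "name", to help find day_name.
name_related = [name for name in public_names if "name" in name]
print(name_related)
```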