Week 1 Wednesday
Most of today’s class will be time to work on Worksheet 2, which is due Tuesday next week before discussion section. Alicia, Chupeng, and I are here to help.
Announcements
I have office hours in here (ALP 3600) today at 1pm. I usually leave after about 20 minutes if nobody is around, so try to come before 1:20pm!
I added annotations to the Deepnote notebooks in the course notes. If you want to review what we covered on Monday or Tuesday, I recommend looking in the course notes instead of in Deepnote. I don’t know if I’ll always find time to add annotations, but if there is a particular notebook you’re stuck on, remind me on Ed Discussion to add annotations, or let me know what part is confusing.
No new material should be presented today or Thursday, so if you’re feeling overwhelmed with how much has been introduced, this is a great chance to catch up. By far the best way to learn this material is to try using it yourself. That’s a big reason for the worksheets.
Videos and video quizzes are due Friday before lecture.
Our first in-class quiz will be next week on Tuesday in discussion section. It hasn’t been written yet, but it will very likely include at least one question involving Boolean indexing and at least one question involving either the str accessor attribute or the dt accessor attribute. The quizzes are closed book and closed computer.
Warm-up
In the attached vending machine file, what date occurs most often in the “Prcd Date” column? (Don’t worry about ties; just get the top date that occurs when using value_counts.)
What day of the week is that?
import pandas as pd
df = pd.read_csv("vend.csv")
df["Prcd Date"]
0 1/1/2022
1 1/1/2022
2 1/1/2022
3 1/1/2022
4 1/1/2022
...
6440 8/31/2022
6441 8/31/2022
6442 8/31/2022
6443 8/31/2022
6444 8/31/2022
Name: Prcd Date, Length: 6445, dtype: object
It’s easy enough to find the most frequent value by looking at the pandas Series produced by calling value_counts(), but it is better to also have a “programmatic” way to get this value.
myseries = df["Prcd Date"].value_counts()
myseries
3/30/2022 123
7/22/2022 117
7/30/2022 108
7/7/2022 83
6/26/2022 77
...
8/14/2022 2
1/9/2022 2
7/17/2022 1
4/17/2022 1
1/29/2022 1
Name: Prcd Date, Length: 241, dtype: int64
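To see the structure of a value_counts result on a smaller scale, here is a minimal sketch using a made-up toy series (the name toy_dates and its contents are hypothetical, not taken from vend.csv). The point is that the distinct dates end up in the index and the counts end up in the values, sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical toy data standing in for the "Prcd Date" column.
toy_dates = pd.Series(
    ["1/1/2022", "3/30/2022", "3/30/2022", "8/31/2022", "3/30/2022", "1/1/2022"]
)

toy_counts = toy_dates.value_counts()

# The index holds the distinct dates; the values hold how often each occurs.
print(toy_counts.index.tolist())
print(toy_counts.tolist())
```

So the date we are after is a key (an index entry), while the count is the value stored under that key.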
If we use iloc, that will get us the value, which is not what we want. We want the corresponding key, which is "3/30/2022" in this case.
myseries.iloc[0]
123
I’m very surprised that the following works instead of raising a key error. My best guess is that, when the keys are strings and you input an integer (without using loc or iloc), pandas falls back to positional indexing, as if you had used iloc. I’m not sure that’s the full explanation, and newer versions of pandas warn against relying on this fallback, so at least for now, we should not use this kind of indexing with a pandas Series.
myseries[0]
123
This is the error I was expecting the previous cell to raise.
myseries.loc[0]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
3080 try:
-> 3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:101, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:4554, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:4562, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [8], line 1
----> 1 myseries.loc[0]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
892 axis = self.axis or 0
894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1124, in _LocIndexer._getitem_axis(self, key, axis)
1122 # fall thru to straight lookup
1123 self._validate_key(key, axis)
-> 1124 return self._get_label(key, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocIndexer._get_label(self, label, axis)
1071 def _get_label(self, label, axis: int):
1072 # GH#5667 this will fail if the label is not present in the axis.
-> 1073 return self.obj.xs(label, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:3739, in NDFrame.xs(self, key, axis, level, drop_level)
3737 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
3738 else:
-> 3739 loc = index.get_loc(key)
3741 if isinstance(loc, np.ndarray):
3742 if loc.dtype == np.bool_:
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
-> 3083 raise KeyError(key) from err
3085 if tolerance is not None:
3086 tolerance = self._convert_tolerance(tolerance, np.asarray(key))
KeyError: 0
Here is the correct way to use loc. Notice (by looking in the above series) how the value corresponding to this key is 2.
myseries.loc["8/14/2022"]
2
Usually loc is not needed with a pandas Series, because you can just use square brackets directly to access a value by the key.
myseries["8/14/2022"]
2
Using iloc with this sort of key will raise an error, because iloc looks up entries by integer position, not by label.
myseries.iloc["8/14/2022"]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In [12], line 1
----> 1 myseries.iloc["8/14/2022"]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
892 axis = self.axis or 0
894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1498, in _iLocIndexer._getitem_axis(self, key, axis)
1496 key = item_from_zerodim(key)
1497 if not is_integer(key):
-> 1498 raise TypeError("Cannot index by location index with a non-integer key")
1500 # validate the location
1501 self._validate_integer(key, axis)
TypeError: Cannot index by location index with a non-integer key
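The cells above can be condensed into one small sketch. It uses a hypothetical three-entry series named toy (not the real myseries) and catches the two errors so that everything runs in a single cell. It deliberately leaves out the bare-integer case, toy[0], since that is the one we decided not to rely on.

```python
import pandas as pd

# Hypothetical toy series with string keys, mimicking a value_counts result.
toy = pd.Series([123, 117, 2], index=["3/30/2022", "7/22/2022", "8/14/2022"])

by_label = toy.loc["8/14/2022"]   # label-based lookup
by_brackets = toy["8/14/2022"]    # square brackets also accept a label
by_position = toy.iloc[0]         # position-based lookup

# loc with an integer fails here because 0 is not one of the labels.
try:
    toy.loc[0]
    loc_error = None
except KeyError as err:
    loc_error = type(err).__name__

# iloc with a string fails because iloc expects integer positions.
try:
    toy.iloc["8/14/2022"]
    iloc_error = None
except TypeError as err:
    iloc_error = type(err).__name__

print(by_label, by_brackets, by_position, loc_error, iloc_error)
```

A rough rule of thumb: loc and plain square brackets go with keys, and iloc goes with integer positions.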
Here is a reminder of what myseries contains.
myseries
3/30/2022 123
7/22/2022 117
7/30/2022 108
7/7/2022 83
6/26/2022 77
...
8/14/2022 2
1/9/2022 2
7/17/2022 1
4/17/2022 1
1/29/2022 1
Name: Prcd Date, Length: 241, dtype: int64
Let’s finally get to accessing the most frequent date, 3/30/2022, using this series. We can use the index attribute to get all of these keys.
myseries.index
Index(['3/30/2022', '7/22/2022', '7/30/2022', '7/7/2022', '6/26/2022',
'7/23/2022', '7/14/2022', '4/22/2022', '8/18/2022', '7/13/2022',
...
'7/24/2022', '7/4/2022', '1/2/2022', '8/21/2022', '7/3/2022',
'8/14/2022', '1/9/2022', '7/17/2022', '4/17/2022', '1/29/2022'],
dtype='object', length=241)
The two most important pandas data types are the pandas Series and the pandas DataFrame data types. There is also a pandas Index data type, but I think we can just pretend this is a list and everything will work fine.
type(myseries.index)
pandas.core.indexes.base.Index
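Pretending an Index is a list works well for reading values out of it. The following sketch, with a hypothetical toy_index, shows where the analogy holds and one place where it does not: an Index cannot be modified in place.

```python
import pandas as pd

# Hypothetical keys, standing in for myseries.index.
toy_index = pd.Index(["3/30/2022", "7/22/2022", "7/30/2022"])

first = toy_index[0]        # integer indexing works like a list
first_two = toy_index[:2]   # slicing returns another Index, not a list
as_list = list(toy_index)   # convert explicitly when a real list is needed

# Unlike a list, an Index is immutable, so item assignment raises TypeError.
try:
    toy_index[0] = "1/1/2022"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(first, list(first_two), as_list, assignment_failed)
```

If you ever do need to change the entries, converting with list(...) first is probably the simplest option.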
We want the initial element in this index, so we use indexing to access it.
myseries.index[0]
'3/30/2022'
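Taking index[0] relies on value_counts having sorted the counts from largest to smallest. As an alternative that does not depend on the order, the Series method idxmax returns the key of the largest value. Here is a small sketch with hypothetical counts; it was not part of what we did in class.

```python
import pandas as pd

# Hypothetical counts, mimicking the top of myseries.
toy_counts = pd.Series(
    [123, 117, 108], index=["3/30/2022", "7/22/2022", "7/30/2022"]
)

top_via_index = toy_counts.index[0]    # relies on the descending sort
top_via_idxmax = toy_counts.idxmax()   # key of the largest value

# Reverse the order: index[0] changes, but idxmax still finds the same key.
reversed_counts = toy_counts.iloc[::-1]
top_after_reversal = reversed_counts.idxmax()

print(top_via_index, top_via_idxmax, reversed_counts.index[0], top_after_reversal)
```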
Now we can work on the second question, which was, what day of the week does this date correspond to? This is harder to answer than on Tuesday, because the day of the week is not shown in the string.
mydate = myseries.index[0]
Notice that, for now, mydate really is a string.
type(mydate)
str
We’d rather this value be something that has functionality related to being a date, so we will convert it to a Timestamp using pd.to_datetime.
pd.to_datetime(mydate)
Timestamp('2022-03-30 00:00:00')
It really is a pandas Timestamp.
type(pd.to_datetime(mydate))
pandas._libs.tslibs.timestamps.Timestamp
Now we can use the same day_name method we used on Tuesday.
pd.to_datetime(mydate).day_name()
'Wednesday'
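The same idea extends from a single string to a whole column by way of the dt accessor. The sketch below uses a hypothetical three-row DataFrame named toy_df rather than the real file, and the format string is my assumption that the dates are written month/day/year.

```python
import pandas as pd

# Hypothetical toy DataFrame with dates written like those in "Prcd Date".
toy_df = pd.DataFrame({"Prcd Date": ["1/1/2022", "3/30/2022", "8/31/2022"]})

# Assumption: the strings are month/day/year, so we say so explicitly.
dates = pd.to_datetime(toy_df["Prcd Date"], format="%m/%d/%Y")

# The dt accessor applies day_name to every entry of the Series.
day_names = dates.dt.day_name()

print(day_names.tolist())
```

For a single Timestamp we call day_name directly, while for a Series of datetimes we go through dt first.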
If you don’t remember the name of this method, you can always browse the options by using the Python dir function.
dir(pd.to_datetime(mydate))
['__add__',
'__array_priority__',
'__class__',
'__delattr__',
'__dict__',
'__dir__',
'__doc__',
'__eq__',
'__format__',
'__ge__',
'__getattribute__',
'__gt__',
'__hash__',
'__init__',
'__init_subclass__',
'__le__',
'__lt__',
'__module__',
'__ne__',
'__new__',
'__pyx_vtable__',
'__radd__',
'__reduce__',
'__reduce_cython__',
'__reduce_ex__',
'__repr__',
'__rsub__',
'__setattr__',
'__setstate__',
'__setstate_cython__',
'__sizeof__',
'__str__',
'__sub__',
'__subclasshook__',
'__weakref__',
'_date_repr',
'_repr_base',
'_round',
'_short_repr',
'_time_repr',
'asm8',
'astimezone',
'ceil',
'combine',
'ctime',
'date',
'day',
'day_name',
'day_of_week',
'day_of_year',
'dayofweek',
'dayofyear',
'days_in_month',
'daysinmonth',
'dst',
'floor',
'fold',
'freq',
'freqstr',
'fromisocalendar',
'fromisoformat',
'fromordinal',
'fromtimestamp',
'hour',
'is_leap_year',
'is_month_end',
'is_month_start',
'is_quarter_end',
'is_quarter_start',
'is_year_end',
'is_year_start',
'isocalendar',
'isoformat',
'isoweekday',
'max',
'microsecond',
'min',
'minute',
'month',
'month_name',
'nanosecond',
'normalize',
'now',
'quarter',
'replace',
'resolution',
'round',
'second',
'strftime',
'strptime',
'time',
'timestamp',
'timetuple',
'timetz',
'to_datetime64',
'to_julian_date',
'to_numpy',
'to_period',
'to_pydatetime',
'today',
'toordinal',
'tz',
'tz_convert',
'tz_localize',
'tzinfo',
'tzname',
'utcfromtimestamp',
'utcnow',
'utcoffset',
'utctimetuple',
'value',
'week',
'weekday',
'weekofyear',
'year']
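That list is long, and the names beginning with underscores are rarely what we want. One possible way to shorten it is a list comprehension that filters the names, sketched below. The variable names ts, public_names, and day_related are just for illustration.

```python
import pandas as pd

ts = pd.to_datetime("3/30/2022")

# Keep only names that do not start with an underscore.
public_names = [name for name in dir(ts) if not name.startswith("_")]

# Narrow further to names mentioning "day" to find day_name quickly.
day_related = [name for name in public_names if "day" in name]

print(day_related)
```

The exact names listed may vary with the pandas version, but day_name should be among them.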