Week 2 Monday
Announcements
In-class Quiz 1 is in discussion tomorrow. It is based on the material up to and including Worksheet 2. (So the material from last week’s videos will not be covered, at least not directly.)
Worksheet 1 and Worksheet 2 due before discussion tomorrow.
I have office hours after class, 11am, next door in ALP 3610.
Friday videos and video quizzes for this week are posted.
The main goal today is to briefly introduce two topics that appear on Worksheet 3: f-strings and map. We will also briefly see how to time a computation.
import pandas as pd
f-strings
f-strings are a relatively recent addition to Python. We are working in Python 3.9 (how do I know?) and they were added in Python 3.6, which was released in 2016. For this reason, you will often see Python code (even code written by experts) that uses the older format method, so it’s important to recognize both.
Define name to be "Chris" and day to be today’s date as a pandas Timestamp object.
name = "Chris"
day = pd.to_datetime("10/3/2022")
Let’s check that day really is a Timestamp and not a string.
day
Timestamp('2022-10-03 00:00:00')
type(day)
pandas._libs.tslibs.timestamps.Timestamp
Print "Hello, Chris, how are you doing on Monday?" using name and day and the string method format.
Even though the f-string approach is more elegant and more readable than the format approach, it is still useful to be able to recognize the format approach. Code from before 2016 (and even a lot of modern Python code) will use the format approach.
# old way
s = "Hello, {}, how are you doing on {}?".format(name, day)
print(s)
Hello, Chris, how are you doing on 2022-10-03 00:00:00?
The above showed the full Timestamp, not the day of the week. Here we fix that with the day_name method. Notice how we don’t need the dt accessor, because we are not working with a pandas Series. We only use .dt and .str when working with a pandas Series.
# don't need `dt` because `day` is not a Series
s = "Hello, {}, how are you doing on {}?".format(name, day.day_name())
print(s)
Hello, Chris, how are you doing on Monday?
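To make the contrast concrete, here is a minimal sketch of the same method called on a single Timestamp and on a Series of Timestamps. The names single_name, dates, and series_names are made up for this illustration.

```python
import pandas as pd

# A single Timestamp: call the method directly, no accessor needed
day = pd.to_datetime("10/3/2022")
single_name = day.day_name()

# A Series of Timestamps: the same method is reached through .dt
dates = pd.Series(pd.to_datetime(["10/3/2022", "10/4/2022"]))
series_names = dates.dt.day_name()

print(single_name)
print(series_names.tolist())
```

Calling dates.day_name() without .dt would raise an AttributeError, because the method lives on the accessor and not on the Series itself.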
Print "Hello, Chris, how are you doing today on October 3?" using name and day and f-strings. (You can also have the f-string make the conversion to “Monday” automatically, using a strftime format code.)
The biggest drawback of the format approach is that you can’t read the text in order (you have to jump to the right to find what goes inside the brackets {}). Also, when there are lots of variables, you need to count them and that can get confusing.
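As a side note, the format method also accepts named fields, which removes some of the counting but still keeps the values far from the text. This is only a sketch; weekday is a hypothetical stand-in for day.day_name(), and the field names who and when are made up.

```python
name = "Chris"
weekday = "Monday"  # hypothetical stand-in for day.day_name()

# Positional fields: the arguments must be listed in the same order as the {}
s_positional = "Hello, {}, how are you doing on {}?".format(name, weekday)

# Named fields: the order of the keyword arguments no longer matters
s_named = "Hello, {who}, how are you doing on {when}?".format(when=weekday, who=name)

print(s_positional)
print(s_named)
```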
The f-string approach is much more readable. Notice how we add the letter f before the quotation marks.
# new way, f-string way
s1 = f"Hello, {name}, how are you doing on {day.day_name()}?"
print(s1)
Hello, Chris, how are you doing on Monday?
The type of s1 in Python is still just an ordinary string.
type(s1)
str
The following is not nearly as important for Math 10 as the day_name method, but just for fun, here is a way to use a format code instead of day_name to display the day of the week. (See the link above for other options.)
# just for fun, day_name is more important
f"Hello, {name}, how are you doing on {day:%A}?"
'Hello, Chris, how are you doing on Monday?'
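The exercise statement mentions "October 3", which the cells above do not produce. One possible way to get it, sketched here under the assumption that month names are in English (the default locale), is to combine the %B format code with the day attribute of the Timestamp. The name s2 is made up for this illustration.

```python
import pandas as pd

name = "Chris"
day = pd.to_datetime("10/3/2022")

# %B gives the full month name; day.day is the day of the month as an integer,
# which avoids the zero-padding that the %d format code would add ("03")
s2 = f"Hello, {name}, how are you doing today on {day:%B} {day.day}?"
print(s2)
```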
Adding a column to a pandas DataFrame
Here we briefly see two things.
How to insert a new column into a DataFrame.
How to use map to apply a function to every value in a pandas Series.
The string split method converts from a string to a list. If you don’t pass any arguments to the method, meaning you call split() with empty parentheses, then it will divide the string at all whitespace.
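Here is a short sketch of how split behaves with and without an argument. The variable names are made up for this illustration.

```python
text = "Hello, how are you doing?"

# No argument: split at any run of whitespace
words = text.split()

# With an argument: split only at that exact separator
parts = text.split(",")

# Runs of spaces, tabs, and newlines all count as a single separator
spaced = "a   b\tc\n".split()

print(words)
print(parts)
print(spaced)
```

Notice that the punctuation stays attached to the words, which is why "Hello," and "doing?" show up in the column below.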
df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split()
    }
)
df
| | A | B |
|---|---|---|
| 0 | 3 | Hello, |
| 1 | 1 | how |
| 2 | 4 | are |
| 3 | 1 | you |
| 4 | 5 | doing? |
Making a new column with the same value in every position is easy.
Make a new column “C” that contains 4.5 in every position.
df["C"] = 4.5
Here we verify that there really is a new column in df.
df
| | A | B | C |
|---|---|---|---|
| 0 | 3 | Hello, | 4.5 |
| 1 | 1 | how | 4.5 |
| 2 | 4 | are | 4.5 |
| 3 | 1 | you | 4.5 |
| 4 | 5 | doing? | 4.5 |
Remember that if we use the notation df[???], without loc or iloc, that will access a column by its label.
df["B"]
0 Hello,
1 how
2 are
3 you
4 doing?
Name: B, dtype: object
Using an integer position here doesn’t work. (This would work if there were a column with label 1.)
# looking in columns
df[1]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
3080 try:
-> 3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:101, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:4554, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:4562, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 1
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [17], line 1
----> 1 df[1]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/frame.py:3024, in DataFrame.__getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
3081 return self._engine.get_loc(casted_key)
3082 except KeyError as err:
-> 3083 raise KeyError(key) from err
3085 if tolerance is not None:
3086 tolerance = self._convert_tolerance(tolerance, np.asarray(key))
KeyError: 1
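For comparison, here is a sketch of two situations where an integer does work: iloc, which accesses by position, and a DataFrame whose column labels really are integers. The DataFrame df_int and the names second_col and by_label are hypothetical, made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split()
    }
)

# iloc uses integer positions: all rows, column at position 1
second_col = df.iloc[:, 1]

# Hypothetical DataFrame whose column labels are the integers 0 and 1
df_int = pd.DataFrame({0: [3, 1, 4], 1: ["x", "y", "z"]})
by_label = df_int[1]  # works, because 1 is a column label here

print(second_col.tolist())
print(by_label.tolist())
```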
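If instead you want to get a row, not a column, you should use loc. The call below raises a KeyError because loc with a single label searches the row labels, and no row of df is labeled "B".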
# looking in rows
df.loc["B"]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In [18], line 1
----> 1 df.loc["B"]
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
892 axis = self.axis or 0
894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1124, in _LocIndexer._getitem_axis(self, key, axis)
1122 # fall thru to straight lookup
1123 self._validate_key(key, axis)
-> 1124 return self._get_label(key, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocIndexer._get_label(self, label, axis)
1071 def _get_label(self, label, axis: int):
1072 # GH#5667 this will fail if the label is not present in the axis.
-> 1073 return self.obj.xs(label, axis=axis)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:3739, in NDFrame.xs(self, key, axis, level, drop_level)
3737 raise TypeError(f"Expected label or tuple of labels, got {key}") from e
3738 else:
-> 3739 loc = index.get_loc(key)
3741 if isinstance(loc, np.ndarray):
3742 if loc.dtype == np.bool_:
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/range.py:354, in RangeIndex.get_loc(self, key, method, tolerance)
352 except ValueError as err:
353 raise KeyError(key) from err
--> 354 raise KeyError(key)
355 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 'B'
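By contrast, loc succeeds when the label is an actual row label. Here is a minimal sketch using a fresh copy of the first two columns of df; the name row is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split()
    }
)

# The row labels are 0, 1, 2, 3, 4, so "B" is not among them
print("B" in df.index)

# 1 is a row label, so this returns that row as a pandas Series
row = df.loc[1]
print(row["A"], row["B"])
```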
I don’t use this approach very often, but here is the loc way to get a column.
df.loc[:,"B"]
0 Hello,
1 how
2 are
3 you
4 doing?
Name: B, dtype: object
We can also create a new column using this approach. (I wasn’t 100% sure this would work, since I always use the abbreviation without loc, but it did work.)
df.loc[:, "New Column"] = 3
Now we can see that two new columns have been added to df.
df
| | A | B | C | New Column |
|---|---|---|---|---|
| 0 | 3 | Hello, | 4.5 | 3 |
| 1 | 1 | how | 4.5 | 3 |
| 2 | 4 | are | 4.5 | 3 |
| 3 | 1 | you | 4.5 | 3 |
| 4 | 5 | doing? | 4.5 | 3 |
Using the pandas Series map method and a lambda function, make a new column “D” that contains the first two characters of each string in column “B”.
Here is a reminder of what df["B"] looks like. Each value is a string in this pandas Series.
df["B"]
0 Hello,
1 how
2 are
3 you
4 doing?
Name: B, dtype: object
The map method is a common place where we use lambda functions. (Any function will work; it doesn’t have to be a lambda function.) The input to map is the function we want to apply to every value. Here we use slicing to get the first two characters of every string in the “B” column.
df["B"].map(lambda s: s[:2])
0 He
1 ho
2 ar
3 yo
4 do
Name: B, dtype: object
Here we assign that pandas Series to a new column.
df["D"] = df["B"].map(lambda s: s[:2])
df
| | A | B | C | New Column | D |
|---|---|---|---|---|---|
| 0 | 3 | Hello, | 4.5 | 3 | He |
| 1 | 1 | how | 4.5 | 3 | ho |
| 2 | 4 | are | 4.5 | 3 | ar |
| 3 | 1 | you | 4.5 | 3 | yo |
| 4 | 5 | doing? | 4.5 | 3 | do |
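Since any function works with map, here is a sketch of the same computation using an ordinary named function, together with the .str accessor as another possible route. The function name first_two and the variable names are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split()
    }
)

def first_two(s):
    """Return the first two characters of the string s."""
    return s[:2]

# Pass the function itself to map (no parentheses after first_two)
via_function = df["B"].map(first_two)

# The .str accessor supports slicing on a Series of strings
via_str = df["B"].str[:2]

print(via_function.tolist())
print(via_str.tolist())
```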
Make a new column “E” that is equal to column “A” divided by pi. (Get pi from the NumPy library.) Try this using map and also by dividing the column directly. The direct method (without map) is more efficient (at least for large Series) and more readable.
Unlike Matlab and Mathematica, Python does not have a built-in constant pi. (In general, Python has much less built-in math functionality.)
pi
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In [26], line 1
----> 1 pi
NameError: name 'pi' is not defined
We will get the definition of pi from the NumPy library. The NumPy library is very important, and is in the background of many efficient pandas computations.
import numpy as np
np.pi
3.141592653589793
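As a side note, the standard library math module also provides pi once it is imported. We will stick with NumPy in this course, but this small check shows that the two constants are the same floating point number. The name same is made up for this illustration.

```python
import math
import numpy as np

# Both constants are ordinary Python floats with the same value
same = (math.pi == np.pi)

print(math.pi)
print(same)
```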
Don’t overuse map. Even though the following works, there is a much simpler approach.
df["E"] = df["A"].map(lambda x: x/np.pi)
df
| | A | B | C | New Column | D | E |
|---|---|---|---|---|---|---|
| 0 | 3 | Hello, | 4.5 | 3 | He | 0.954930 |
| 1 | 1 | how | 4.5 | 3 | ho | 0.318310 |
| 2 | 4 | are | 4.5 | 3 | ar | 1.273240 |
| 3 | 1 | you | 4.5 | 3 | yo | 0.318310 |
| 4 | 5 | doing? | 4.5 | 3 | do | 1.591549 |
Here is the better approach. We just divide the column by np.pi, and pandas automatically “broadcasts” the operation to each entry in the column.
# better way
df["E"] = df["A"]/np.pi
df
| | A | B | C | New Column | D | E |
|---|---|---|---|---|---|---|
| 0 | 3 | Hello, | 4.5 | 3 | He | 0.954930 |
| 1 | 1 | how | 4.5 | 3 | ho | 0.318310 |
| 2 | 4 | are | 4.5 | 3 | ar | 1.273240 |
| 3 | 1 | you | 4.5 | 3 | yo | 0.318310 |
| 4 | 5 | doing? | 4.5 | 3 | do | 1.591549 |
As a quick aside, if you want to time a computation, one way is to put %%time at the top of the cell. With a much larger column (say with ten million numbers in it), we would find that this broadcasting approach is faster.
%%time
df["E"] = df["A"]/np.pi
CPU times: user 1.16 ms, sys: 0 ns, total: 1.16 ms
Wall time: 1.08 ms
Check that operating on the column directly really is more efficient for longer Series. Make a pandas Series ser whose values are the even numbers from 0 up to (but not including) 10**7. Put %%time at the top of the cell (this is called a Jupyter or IPython “magic” function) and time how long it takes to divide each value in ser by pi. Try with both methods from above.
We didn’t get here!
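For anyone who wants to try the comparison, here is a rough sketch that runs as plain Python. It makes two assumptions that differ from the exercise: it uses time.perf_counter from the standard library instead of the %%time magic (which only works in Jupyter or IPython), and it stops at 10**6 instead of 10**7 so that it runs quickly. The function name time_division is made up for this illustration, and the timings will vary from machine to machine.

```python
import time

import numpy as np
import pandas as pd

def time_division(ser):
    """Return (map_seconds, direct_seconds) for dividing ser by pi."""
    start = time.perf_counter()
    via_map = ser.map(lambda x: x / np.pi)  # one Python function call per value
    map_seconds = time.perf_counter() - start

    start = time.perf_counter()
    direct = ser / np.pi  # a single broadcast operation
    direct_seconds = time.perf_counter() - start

    # Sanity check: both approaches should produce the same values
    assert np.allclose(via_map, direct)
    return map_seconds, direct_seconds

# Even numbers from 0 up to (but not including) 10**6; the exercise uses 10**7
ser = pd.Series(range(0, 10**6, 2))
map_seconds, direct_seconds = time_division(ser)

print(f"map:    {map_seconds:.4f} seconds")
print(f"direct: {direct_seconds:.4f} seconds")
```

On most machines we would expect the direct division to be noticeably faster, and the gap to grow as the Series gets longer.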