Week 2 Monday

Announcements

  • In-class Quiz 1 is in discussion tomorrow. It is based on the material up to and including Worksheet 2. (So the material from last week’s videos will not be covered, at least not directly.)

  • Worksheet 1 and Worksheet 2 are due before discussion tomorrow.

  • I have office hours after class, 11am, next door in ALP 3610.

  • Friday videos and video quizzes for this week are posted.

The main goal today is to briefly introduce two topics that appear on Worksheet 3: f-strings and map. We will also briefly see how to time a computation.

import pandas as pd

f-strings

f-strings are a relatively recent addition to Python. We are working in Python 3.9 (how do I know?) and they were added in Python 3.6, which was released in 2016. For this reason, you will often see Python code (even code written by experts) that uses the previous format method, so it’s important to recognize both.
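
One way to answer the "how do I know?" question is to ask Python itself. Here is a minimal sketch using the standard-library sys module. The variable names are only for illustration, and the version printed depends on your own environment.

```python
import sys

# version_info is a named tuple; its values depend on the environment you run in
major, minor = sys.version_info.major, sys.version_info.minor
print(f"Running Python {major}.{minor}")

# f-strings need Python 3.6 or later
supports_fstrings = sys.version_info >= (3, 6)
print(supports_fstrings)
```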

  • Define name to be "Chris" and day to be today’s date as a pandas Timestamp object.

name = "Chris"
day = pd.to_datetime("10/3/2022")

Let’s check that day really is a Timestamp and not a string.

day
Timestamp('2022-10-03 00:00:00')
type(day)
pandas._libs.tslibs.timestamps.Timestamp
  • Print "Hello, Chris, how are you doing on Monday?" using name and day and the string method format.

Even though the f-string approach is more elegant and more readable than the format approach, it is still useful to be able to recognize the format approach. Code from before 2016 (and even a lot of modern Python code) will use the format approach.

# old way
s = "Hello, {}, how are you doing on {}?".format(name, day)
print(s)
Hello, Chris, how are you doing on 2022-10-03 00:00:00?

The above showed the Timestamp, not the day of the week. Here we fix that. Notice how we don’t need the dt accessor, because we are not working with a pandas Series. We only use .dt and .str when working with a pandas Series.

# don't need `dt` because `day` is not a Series
s = "Hello, {}, how are you doing on {}?".format(name, day.day_name())
print(s)
Hello, Chris, how are you doing on Monday?
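
For contrast, here is a small sketch of the Series case, where the dt accessor is needed. The Series days is a made-up example, not something from the lecture.

```python
import pandas as pd

# A hypothetical Series holding two Timestamps
days = pd.Series(pd.to_datetime(["10/3/2022", "10/4/2022"]))

# For a Series, go through the dt accessor
names = days.dt.day_name()
print(names.tolist())  # ['Monday', 'Tuesday']

# For a single Timestamp, call the method directly
single = days[0].day_name()
print(single)  # Monday
```
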
  • Print "Hello, Chris, how are you doing today on October 3?" using name and day and f-strings. (You can also have the f-string make the conversion to “Monday” automatically, using a strftime format code.)

The biggest drawback of the format method is that you can’t read the text in order (you have to jump to the right to find what goes inside the brackets {}). Also, when there are lots of variables, you need to count them to match each pair of brackets to its argument, and that can get confusing.
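
The counting problem can be eased somewhat within format itself by using named fields. Here is a short sketch. The variables weekday and course are invented for this example.

```python
name = "Chris"
weekday = "Monday"
course = "Math 10"  # hypothetical extra variable

# Positional: each {} has to be matched to the argument order by counting
s_positional = "Hello, {}, welcome to {} on {}.".format(name, course, weekday)

# Named fields: easier to match up, but each name gets written twice
s_named = "Hello, {name}, welcome to {course} on {weekday}.".format(
    name=name, course=course, weekday=weekday
)

print(s_positional)
print(s_named)
```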

The f-string approach is much more readable. Notice how we add the letter f before the quotation marks.

# new way, f-string way
s1 = f"Hello, {name}, how are you doing on {day.day_name()}?"
print(s1)
Hello, Chris, how are you doing on Monday?

The type of s1 in Python is still just an ordinary string.

type(s1)
str

The following is not nearly as important for Math 10 as the day_name method, but just for fun, here is a way to use a strftime format code instead of day_name to display the day of the week. (See the strftime format codes in the Python documentation for other options.)

# just for fun, day_name is more important
f"Hello, {name}, how are you doing on {day:%A}?"
'Hello, Chris, how are you doing on Monday?'
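
The exercise above asks for "October 3", which the cells so far do not produce. One possible way, sketched here, combines the %B format code (full month name) with the day attribute of the Timestamp. The month name assumes the default English locale.

```python
import pandas as pd

name = "Chris"
day = pd.to_datetime("10/3/2022")

# %B is the strftime code for the full month name;
# day.day is the day of the month as an integer
s2 = f"Hello, {name}, how are you doing today on {day:%B} {day.day}?"
print(s2)  # Hello, Chris, how are you doing today on October 3?
```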

Adding a column to a pandas DataFrame

Here we briefly see two things.

  • How to insert a new column into a DataFrame.

  • How to use map to apply a function to every value in a pandas Series.

The string split method converts from a string to a list. If you don’t pass any arguments to the method, meaning you call split() with empty parentheses, then it will divide the string at all whitespace.
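
Here is a quick sketch of split on its own, with and without an argument. The variable names are just for illustration.

```python
text = "Hello, how are you doing?"

# No argument: split at any run of whitespace
words = text.split()
print(words)  # ['Hello,', 'how', 'are', 'you', 'doing?']

# With an argument: split only at that separator
parts = text.split(",")
print(parts)  # ['Hello', ' how are you doing?']
```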

df = pd.DataFrame(
    {
        "A": [3,1,4,1,5],
        "B": "Hello, how are you doing?".split()
    }
)

df
A B
0 3 Hello,
1 1 how
2 4 are
3 1 you
4 5 doing?

Making a new column with the same value in every position is easy.

  • Make a new column “C” that contains 4.5 in every position.

df["C"] = 4.5

Here we verify that there really is a new column in df.

df
A B C
0 3 Hello, 4.5
1 1 how 4.5
2 4 are 4.5
3 1 you 4.5
4 5 doing? 4.5

Remember that if we use the notation df[???], without loc or iloc, that will access a column by its label.

df["B"]
0    Hello,
1       how
2       are
3       you
4    doing?
Name: B, dtype: object

Using an integer position here doesn’t work. (This would work if there were a column with label 1.)

# looking in columns
df[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3081, in Index.get_loc(self, key, method, tolerance)
   3080 try:
-> 3081     return self._engine.get_loc(casted_key)
   3082 except KeyError as err:

File pandas/_libs/index.pyx:70, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:101, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:4554, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:4562, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [17], line 1
----> 1 df[1]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/frame.py:3024, in DataFrame.__getitem__(self, key)
   3022 if self.columns.nlevels > 1:
   3023     return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
   3025 if is_integer(indexer):
   3026     indexer = [indexer]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/base.py:3083, in Index.get_loc(self, key, method, tolerance)
   3081         return self._engine.get_loc(casted_key)
   3082     except KeyError as err:
-> 3083         raise KeyError(key) from err
   3085 if tolerance is not None:
   3086     tolerance = self._convert_tolerance(tolerance, np.asarray(key))

KeyError: 1
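
If the goal is to get a column by its integer position, iloc is the tool for that. Here is a small sketch, which rebuilds the first two columns of df so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split(),
    }
)

# All rows, column at integer position 1 (the second column, "B")
col = df.iloc[:, 1]
print(col.tolist())  # ['Hello,', 'how', 'are', 'you', 'doing?']
```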

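If instead you want to get a row, not a column, you should use loc. In the following cell, loc looks for a row with the label "B". The rows of df are labeled 0 through 4, so there is no such row and we get a KeyError.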

# looking in rows
df.loc["B"]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In [18], line 1
----> 1 df.loc["B"]

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:895, in _LocationIndexer.__getitem__(self, key)
    892 axis = self.axis or 0
    894 maybe_callable = com.apply_if_callable(key, self.obj)
--> 895 return self._getitem_axis(maybe_callable, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1124, in _LocIndexer._getitem_axis(self, key, axis)
   1122 # fall thru to straight lookup
   1123 self._validate_key(key, axis)
-> 1124 return self._get_label(key, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexing.py:1073, in _LocIndexer._get_label(self, label, axis)
   1071 def _get_label(self, label, axis: int):
   1072     # GH#5667 this will fail if the label is not present in the axis.
-> 1073     return self.obj.xs(label, axis=axis)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/generic.py:3739, in NDFrame.xs(self, key, axis, level, drop_level)
   3737         raise TypeError(f"Expected label or tuple of labels, got {key}") from e
   3738 else:
-> 3739     loc = index.get_loc(key)
   3741     if isinstance(loc, np.ndarray):
   3742         if loc.dtype == np.bool_:

File /shared-libs/python3.9/py/lib/python3.9/site-packages/pandas/core/indexes/range.py:354, in RangeIndex.get_loc(self, key, method, tolerance)
    352         except ValueError as err:
    353             raise KeyError(key) from err
--> 354     raise KeyError(key)
    355 return super().get_loc(key, method=method, tolerance=tolerance)

KeyError: 'B'
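
For comparison, here is a sketch of loc with a label that does exist among the rows (the default row labels are the integers 0 through 4). The DataFrame is rebuilt with its first two columns so that the block runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [3, 1, 4, 1, 5],
        "B": "Hello, how are you doing?".split(),
    }
)

# The row labels are 0, 1, 2, 3, 4, so the label 1 exists
row = df.loc[1]
print(row["A"], row["B"])  # 1 how
```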

I don’t use this approach very often, but here is the loc way to get a column.

df.loc[:,"B"]
0    Hello,
1       how
2       are
3       you
4    doing?
Name: B, dtype: object

We can also create a new column using this approach. (I wasn’t 100% sure this would work, since I always use the abbreviation without loc, but it did work.)

df.loc[:, "New Column"] = 3

Now we can see that two new columns have been added to df.

df
A B C New Column
0 3 Hello, 4.5 3
1 1 how 4.5 3
2 4 are 4.5 3
3 1 you 4.5 3
4 5 doing? 4.5 3
  • Using the pandas Series map method and a lambda function, make a new column “D” that contains the first two characters of each string in column “B”.

Here is a reminder of what df["B"] looks like. Each value is a string in this pandas Series.

df["B"]
0    Hello,
1       how
2       are
3       you
4    doing?
Name: B, dtype: object

The map method is a common place where we use lambda functions. (Any function will work; it doesn’t have to be a lambda function.) The input to map is the function we want to apply to every value. Here we use slicing to get the first two characters of every string in the “B” column.

df["B"].map(lambda s: s[:2])
0    He
1    ho
2    ar
3    yo
4    do
Name: B, dtype: object
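
Since any function works with map, the same result can be sketched with a named function. The function first_two is a name made up for this example. The str accessor offers another route to the same slice.

```python
import pandas as pd

ser = pd.Series("Hello, how are you doing?".split(), name="B")

def first_two(s):
    """Return the first two characters of the string s."""
    return s[:2]

with_def = ser.map(first_two)           # named function
with_lambda = ser.map(lambda s: s[:2])  # lambda function, as above
with_str = ser.str[:2]                  # slicing through the str accessor

print(with_def.tolist())  # ['He', 'ho', 'ar', 'yo', 'do']
```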

Here we assign that pandas Series to a new column.

df["D"] = df["B"].map(lambda s: s[:2])
df
A B C New Column D
0 3 Hello, 4.5 3 He
1 1 how 4.5 3 ho
2 4 are 4.5 3 ar
3 1 you 4.5 3 yo
4 5 doing? 4.5 3 do
  • Make a new column “E” that is equal to column “A” divided by pi. (Get pi from the NumPy library.) Try this using map and also by dividing the column directly. The direct method (without map) is more efficient (at least for large Series) and more readable.

Unlike Matlab and Mathematica, Python does not have a built-in constant pi. (In general, Python has much less built-in math functionality.)

pi
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In [26], line 1
----> 1 pi

NameError: name 'pi' is not defined

We will get the definition of pi from the NumPy library. The NumPy library is very important, and is in the background of many efficient pandas computations.

import numpy as np
np.pi
3.141592653589793
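
NumPy is not the only source of pi. The math module in Python's standard library also provides it, although it has to be imported as well. A quick sketch:

```python
import math

import numpy as np

print(math.pi)  # 3.141592653589793

# Both constants are the same floating point number
same = (math.pi == np.pi)
print(same)  # True
```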

Don’t overuse map. Even though the following works, there is a much simpler approach.

df["E"] = df["A"].map(lambda x: x/np.pi)
df
A B C New Column D E
0 3 Hello, 4.5 3 He 0.954930
1 1 how 4.5 3 ho 0.318310
2 4 are 4.5 3 ar 1.273240
3 1 you 4.5 3 yo 0.318310
4 5 doing? 4.5 3 do 1.591549

Here is the better approach. We just divide the column by np.pi, and pandas automatically “broadcasts” the operation to each entry in the column.

# better way
df["E"] = df["A"]/np.pi
df
A B C New Column D E
0 3 Hello, 4.5 3 He 0.954930
1 1 how 4.5 3 ho 0.318310
2 4 are 4.5 3 ar 1.273240
3 1 you 4.5 3 yo 0.318310
4 5 doing? 4.5 3 do 1.591549

As a quick aside, if you want to time a computation, one way is to put %%time at the top of the cell. With a much larger column (say with ten million numbers in it), we would find that this broadcasting approach is faster.

%%time
df["E"] = df["A"]/np.pi
CPU times: user 1.16 ms, sys: 0 ns, total: 1.16 ms
Wall time: 1.08 ms
  • Check that operating on the column directly really is more efficient for longer Series. Make a pandas Series ser whose values are the even numbers from 0 up to (but not including) 10**7. Put %%time at the top of the cell (this is called a Jupyter or IPython “magic” function) and time how long it takes to divide each value in ser by pi. Try with both methods from above.

We didn’t get here!
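
For reference, here is one way the check could be sketched. It uses time.perf_counter from the standard library rather than %%time, so that it also runs outside a notebook. That substitution is an assumption made for this sketch, not what the worksheet asks for. Exact timings will vary from machine to machine, so no expected output is shown.

```python
import time

import numpy as np
import pandas as pd

# Even numbers from 0 up to (but not including) 10**7
ser = pd.Series(np.arange(0, 10**7, 2))

# Method 1: map with a lambda function
start = time.perf_counter()
result_map = ser.map(lambda x: x / np.pi)
time_map = time.perf_counter() - start

# Method 2: divide the whole Series directly
start = time.perf_counter()
result_direct = ser / np.pi
time_direct = time.perf_counter() - start

print(f"map:    {time_map:.3f} seconds")
print(f"direct: {time_direct:.3f} seconds")
```

In a notebook, the same comparison can be made by putting each method in its own cell with %%time at the top.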