Week 1 Friday

Announcements#

We’ll start on the pandas library next week. It is the most important library in Math 10.
Annotated versions of the Monday and Wednesday files are in the course notes.
For now, I’m going to keep using Deepnote to distribute files, but I will lecture locally (not “in the cloud”) using Anaconda Navigator’s Jupyter notebooks from the lab computer.
Worksheets 1 and 2 are posted in Deepnote. For Worksheet 1, to get full points, please work in a group (2-3 students total). For Worksheet 2, it’s up to you whether you work in a group or not.
Yufei, one of our three LAs, is here to help.
If you’re stuck on something and not able to ask in person, try asking on Ed Discussion (linked from Canvas).
I have another class at 2pm across campus, so unfortunately I can basically never stay around to answer questions.

More complex example of Boolean indexing#

Define arr as follows. We will then create the subarray of arr containing the rows which have at least two 2s using the following strategy.

import numpy as np

rng = np.random.default_rng(seed=100)
arr = rng.integers(0, 5, size=(10,3))
arr

array([[3, 4, 0],
       [2, 0, 1],
       [2, 0, 2],
       [4, 4, 2],
       [2, 3, 4],
       [4, 0, 3],
       [3, 0, 2],
       [4, 3, 1],
       [1, 3, 0],
       [2, 2, 2]], dtype=int64)

Make a 10x3 Boolean array indicating where arr is equal to 2.

arr == 2

array([[False, False, False],
       [ True, False, False],
       [ True, False,  True],
       [False, False,  True],
       [ True, False, False],
       [False, False, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [ True,  True,  True]])

Use the sum method with axis=1 to find how many 2s there are in each row.

cts = (arr == 2).sum(axis=1)
cts

array([0, 1, 2, 1, 1, 0, 1, 0, 0, 3])

Use Boolean indexing to create the subarray of arr containing only the rows which have at least two 2s.

# Boolean array
cts > 1 # cts >= 2

array([False, False,  True, False, False, False, False, False, False,
        True])

Here we “keep” the rows corresponding to the True values, that is, we keep the rows that have at least two 2s.

arr[cts > 1]

array([[2, 0, 2],
       [2, 2, 2]], dtype=int64)

The following means exactly the same thing, but it is a little harder to read because we’ve replaced cts with the definition of cts.

arr[(arr == 2).sum(axis=1) > 1]

array([[2, 0, 2],
       [2, 2, 2]], dtype=int64)
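
The three steps above can be collected into one short sketch. To keep it runnable on its own, the array is typed in by hand from the output shown earlier instead of being generated with rng. The names mask and sub are only illustrative.

```python
import numpy as np

# Same 10x3 array as above, entered by hand so the example is self-contained
arr = np.array([[3, 4, 0],
                [2, 0, 1],
                [2, 0, 2],
                [4, 4, 2],
                [2, 3, 4],
                [4, 0, 3],
                [3, 0, 2],
                [4, 3, 1],
                [1, 3, 0],
                [2, 2, 2]])

mask = (arr == 2)        # step 1: 10x3 Boolean array
cts = mask.sum(axis=1)   # step 2: number of 2s in each row (True counts as 1)
sub = arr[cts >= 2]      # step 3: keep the rows with at least two 2s

print(cts)  # [0 1 2 1 1 0 1 0 0 3]
print(sub)  # the rows [2 0 2] and [2 2 2]
```

Giving each intermediate step a name makes it easy to inspect the pieces one at a time if the final result looks wrong.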

We’ve seen that we can use Boolean arrays to keep certain rows. We can also use a list of indices. Here we get the row at index 9 repeated four times.

arr[[9,9,9,9]]

array([[2, 2, 2],
       [2, 2, 2],
       [2, 2, 2],
       [2, 2, 2]], dtype=int64)

Here’s a reminder of what arr is.

arr

array([[3, 4, 0],
       [2, 0, 1],
       [2, 0, 2],
       [4, 4, 2],
       [2, 3, 4],
       [4, 0, 3],
       [3, 0, 2],
       [4, 3, 1],
       [1, 3, 0],
       [2, 2, 2]], dtype=int64)

Why did we use the double square brackets above? The outer square brackets are for indexing. The inner square brackets are for a list. Why do we need the list inside? Here is what happens if we omit the inner square brackets. The 8 gets us to the row at index 8, and the 1 gets to the element at index 1 in that row. (Remember that numbering in Python starts at 0.)

arr[8,1]

3

On the other hand, if we use a list on the inside for indexing, then we get the row at index 8 followed by the row at index 1.

arr[[8,1]]

array([[1, 3, 0],
       [2, 0, 1]], dtype=int64)
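
Here is a small sketch of the same distinction on a hand-made 3x3 array (the name small and its entries are made up for illustration). The entries are chosen so that each value shows its own row and column position.

```python
import numpy as np

# Hypothetical 3x3 array: the tens digit encodes the row, the ones digit the column
small = np.array([[10, 11, 12],
                  [20, 21, 22],
                  [30, 31, 32]])

# Two integers separated by a comma: row index 2, column index 1 -> a single entry
print(small[2, 1])          # 31

# One list of integers: the row at index 2 followed by the row at index 1
print(small[[2, 1]])        # rows [30 31 32] and [20 21 22]
print(small[[2, 1]].shape)  # (2, 3)
```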

There was a question about whether we can do the same sort of thing with columns. Here is the shortest way I know, although there could be a more concise way to do the same thing. The colon means “all rows”, and the list says which columns to take, in which order.

arr[:, [0,1,1,1,0]]

array([[3, 4, 4, 4, 3],
       [2, 0, 0, 0, 2],
       [2, 0, 0, 0, 2],
       [4, 4, 4, 4, 4],
       [2, 3, 3, 3, 2],
       [4, 0, 0, 0, 4],
       [3, 0, 0, 0, 3],
       [4, 3, 3, 3, 4],
       [1, 3, 3, 3, 1],
       [2, 2, 2, 2, 2]], dtype=int64)
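
As far as I know, the same colon syntax also accepts a Boolean array in the column position, so columns can be kept or discarded just like rows. The following is a minimal sketch on a made-up 2x3 array; the names small and colmax are only illustrative.

```python
import numpy as np

# Hypothetical 2x3 array (the first two rows of arr above)
small = np.array([[3, 4, 0],
                  [2, 0, 1]])

# Keep the columns at index 0 and 2, discard the column at index 1
keep = np.array([True, False, True])
print(small[:, keep])         # rows [3 0] and [2 1]

# Keep the columns whose maximum is at least 3
colmax = small.max(axis=0)    # [3 4 1]
print(small[:, colmax >= 3])  # rows [3 4] and [2 0]
```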

Here is a more basic example of Boolean indexing, this time with a one-dimensional array.

b = np.array([2,1,4,1,5,2])

We are keeping three of these values and discarding the rest.

b[[True, False, False, False, True, True]]

array([2, 5, 2])
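
In practice the list of True and False values is usually computed from the array rather than typed out. A minimal sketch using the same b (the threshold 1 is just an example):

```python
import numpy as np

b = np.array([2, 1, 4, 1, 5, 2])

mask = b > 1    # one True/False value per entry of b
print(mask)     # [ True False  True False  True  True]
print(b[mask])  # [2 4 5 2]
```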

Another example of the `axis` keyword argument#

What is the result of evaluating the following?
arr.max()
arr.max(axis=0)
arr.max(axis=1)

arr

array([[3, 4, 0],
       [2, 0, 1],
       [2, 0, 2],
       [4, 4, 2],
       [2, 3, 4],
       [4, 0, 3],
       [3, 0, 2],
       [4, 3, 1],
       [1, 3, 0],
       [2, 2, 2]], dtype=int64)

If we use max() without any axis argument, it returns the overall maximum in the array. This max (like sum above) is an example of a method. A method is a type of function in Python, namely a function which is attached to an object. This max method and the sum method above are attached to NumPy array objects.

# max method
arr.max()

4

For now, don’t worry about memorizing the difference between axis=0 and axis=1, but you should recognize that one of them is getting the maximum of each column, and one of them is getting the maximum of each row.

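The strategy I use to remember is that axis=0 gives permission to change the row index, so the computation runs down each column and we get one result per column. (Why does “rows” correspond to axis=0? When we say an array is 10x3, the 10 refers to rows, and it comes first, in position 0.)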

The following is saying that all three columns have a maximum of 4.

arr.max(axis=0)

array([4, 4, 4], dtype=int64)

The following is length 10 because there are 10 rows. The top row has a maximum of 4, the next row has a maximum of 2, and so on.

arr.max(axis=1)

array([4, 2, 2, 4, 4, 4, 3, 4, 3, 2], dtype=int64)
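
One way to check which axis is which is to look at shapes: the axis we name is the one that disappears from the result. Here is a small sketch on a made-up 2x3 array (the name small is only illustrative).

```python
import numpy as np

# Hypothetical 2x3 array: [[0 1 2], [3 4 5]]
small = np.arange(6).reshape(2, 3)

print(small.max())              # 5, the overall maximum

col_max = small.max(axis=0)     # the rows axis (length 2) is collapsed
print(col_max, col_max.shape)   # [3 4 5] (3,)  -> one value per column

row_max = small.max(axis=1)     # the columns axis (length 3) is collapsed
print(row_max, row_max.shape)   # [2 5] (2,)    -> one value per row
```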

There is also a min method.

arr.min(axis=1)

array([0, 0, 0, 2, 2, 0, 0, 1, 0, 2], dtype=int64)

There is also a mean method. Here we are finding the mean of each row.

arr.mean(axis=1)

array([2.33333333, 1.        , 1.33333333, 3.33333333, 3.        ,
       2.33333333, 1.66666667, 2.66666667, 1.33333333, 2.        ])

In how many rows of arr is the maximum entry in that row 2 or less?

First we find the maximum in each row.

rowmax = arr.max(axis=1)
rowmax

array([4, 2, 2, 4, 4, 4, 3, 4, 3, 2], dtype=int64)

Now we create a Boolean array, indicating if the maximum is less than or equal to 2 or not.

(rowmax <= 2)

array([False,  True,  True, False, False, False, False, False, False,
        True])

Now we count. (As a general rule of thumb, if we’re counting, a sum will probably be used. If we’re getting a subarray or sub-DataFrame, Boolean indexing will probably be used.)

sum(rowmax <= 2)

3

Here is the Boolean indexing version of what we just did. The 3 on the previous line corresponds to this array having 3 rows.

# Boolean indexing version
arr[rowmax <= 2]

array([[2, 0, 1],
       [2, 0, 2],
       [2, 2, 2]], dtype=int64)
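
For reference, here are a few equivalent ways to do the counting step. This is only a sketch: rowmax is typed in by hand from the output above, and which version to prefer is largely a matter of taste, although the NumPy versions are the ones designed for arrays.

```python
import numpy as np

# Row maximums copied from the output above
rowmax = np.array([4, 2, 2, 4, 4, 4, 3, 4, 3, 2])

mask = rowmax <= 2

print(sum(mask))               # 3, Python's built-in sum
print(mask.sum())              # 3, the NumPy sum method
print(np.count_nonzero(mask))  # 3, counts the True values directly
```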

Functions in Python#

Write a function getsub which takes two inputs, a NumPy array arr and an integer n, and as output returns the subarray of arr containing all rows with at least two entries equal to n.

Here is the syntax for writing a function with two input arguments. Be sure you are using the return statement to indicate what should be returned.

def getsub(arr, n):
    cts = (arr == n).sum(axis=1)
    subarr = arr[cts >= 2] # Boolean indexing
    return subarr

Let’s check this, but now with a much bigger array (100 rows) than what we were using above.

rng = np.random.default_rng(seed=100)
arr = rng.integers(0, 5, size=(100,3))

arr

array([[3, 4, 0],
       [2, 0, 1],
       [2, 0, 2],
       [4, 4, 2],
       [2, 3, 4],
       [4, 0, 3],
       [3, 0, 2],
       [4, 3, 1],
       [1, 3, 0],
       [2, 2, 2],
       [3, 2, 3],
       [4, 0, 2],
       [0, 3, 0],
       [2, 1, 4],
       [2, 4, 4],
       [1, 2, 1],
       [4, 4, 2],
       [3, 4, 3],
       [3, 2, 4],
       [1, 3, 0],
       [3, 3, 0],
       [0, 4, 1],
       [4, 3, 0],
       [0, 0, 4],
       [2, 0, 1],
       [2, 4, 2],
       [3, 1, 4],
       [4, 2, 1],
       [4, 2, 3],
       [4, 2, 3],
       [2, 1, 1],
       [0, 0, 4],
       [4, 2, 0],
       [2, 1, 2],
       [2, 0, 3],
       [2, 4, 2],
       [3, 2, 1],
       [0, 2, 3],
       [0, 0, 4],
       [1, 3, 0],
       [0, 3, 0],
       [2, 4, 2],
       [4, 0, 0],
       [2, 2, 1],
       [1, 1, 3],
       [1, 3, 3],
       [1, 3, 1],
       [1, 0, 3],
       [4, 2, 3],
       [3, 1, 1],
       [1, 4, 2],
       [3, 4, 0],
       [3, 2, 4],
       [3, 3, 3],
       [4, 2, 1],
       [1, 4, 3],
       [2, 3, 2],
       [2, 1, 0],
       [1, 3, 3],
       [3, 1, 4],
       [2, 4, 3],
       [1, 0, 4],
       [2, 3, 3],
       [4, 3, 2],
       [3, 4, 1],
       [2, 1, 1],
       [3, 4, 3],
       [1, 1, 1],
       [1, 0, 2],
       [0, 3, 1],
       [2, 1, 2],
       [3, 2, 0],
       [0, 4, 3],
       [4, 2, 4],
       [4, 3, 1],
       [3, 3, 3],
       [4, 4, 1],
       [4, 4, 1],
       [3, 4, 3],
       [4, 0, 3],
       [3, 3, 3],
       [3, 0, 3],
       [1, 4, 1],
       [1, 0, 2],
       [3, 1, 1],
       [3, 2, 0],
       [3, 2, 4],
       [2, 4, 1],
       [3, 2, 0],
       [1, 2, 4],
       [3, 1, 3],
       [4, 0, 4],
       [1, 1, 1],
       [1, 0, 3],
       [4, 1, 3],
       [0, 3, 1],
       [3, 0, 1],
       [2, 2, 4],
       [0, 2, 2],
       [0, 1, 3]], dtype=int64)

No entry of this array equals 5 (rng.integers(0, 5, ...) excludes the upper endpoint 5), so the following returns an empty array with 0 rows and 3 columns.

getsub(arr,5)

array([], shape=(0, 3), dtype=int64)

Here is the sub-array containing all the rows which have at least two 4s.

getsub(arr,4)

array([[4, 4, 2],
       [2, 4, 4],
       [4, 4, 2],
       [4, 2, 4],
       [4, 4, 1],
       [4, 4, 1],
       [4, 0, 4]], dtype=int64)
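
Rather than checking 100 rows by eye, we can ask NumPy to confirm the two properties we care about. The following is a rough sketch of such a check (the names sub and counts_in_sub are only illustrative). It tests properties of the output rather than specific values, so it does not depend on which random numbers were generated.

```python
import numpy as np

def getsub(arr, n):
    cts = (arr == n).sum(axis=1)
    subarr = arr[cts >= 2]  # Boolean indexing
    return subarr

rng = np.random.default_rng(seed=100)
arr = rng.integers(0, 5, size=(100, 3))

sub = getsub(arr, 4)

# Property 1: every row that was kept has at least two 4s
counts_in_sub = (sub == 4).sum(axis=1)
print(bool((counts_in_sub >= 2).all()))

# Property 2: no qualifying row was dropped, so the number of rows kept
# equals the number of rows of arr with at least two 4s
print(len(sub) == ((arr == 4).sum(axis=1) >= 2).sum())
```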

(If there’s extra time.)

Can you write the same function using a for loop? (Warning. This for-loop approach is definitely worse. It is both less elegant and less efficient. It’s just for practice with some different Python syntax.)
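
Here is one possible answer, offered as a sketch rather than as the intended solution. The name getsub_loop and the special handling of the empty case are my own choices. As the warning says, this is slower and longer than the Boolean indexing version.

```python
import numpy as np

def getsub_loop(arr, n):
    keep = []                   # rows we decide to keep
    for row in arr:             # iterating over a 2D array gives its rows
        count = 0
        for x in row:           # count the entries of this row equal to n
            if x == n:
                count += 1
        if count >= 2:
            keep.append(row)
    if len(keep) == 0:
        # Match the vectorized version: 0 rows, same number of columns
        return np.empty((0, arr.shape[1]), dtype=arr.dtype)
    return np.array(keep)

# The 10x3 array from the start of these notes, entered by hand
arr = np.array([[3, 4, 0], [2, 0, 1], [2, 0, 2], [4, 4, 2], [2, 3, 4],
                [4, 0, 3], [3, 0, 2], [4, 3, 1], [1, 3, 0], [2, 2, 2]])

print(getsub_loop(arr, 2))        # rows [2 0 2] and [2 2 2]
print(getsub_loop(arr, 5).shape)  # (0, 3)
```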

Time to work on Worksheets 1-2#

Yufei and I are here to help.
I have another class at 2pm, so if you have questions, ask me now rather than after class!