NumPy and pandas

Review of NumPy and examples using pandas.

Recording of lecture from 1/12/2022

Warm-up exercise

  1. Define an 8x4 NumPy array A of random integers between 1 and 10 (inclusive).

  2. Each row of A has four columns. Let [x,y,z,w] denote one of these rows. What is the probability that x > y?

import numpy as np
rng = np.random.default_rng()
help(rng.integers)
Help on built-in function integers:

integers(...) method of numpy.random._generator.Generator instance
    integers(low, high=None, size=None, dtype=np.int64, endpoint=False)
    
    Return random integers from `low` (inclusive) to `high` (exclusive), or
    if endpoint=True, `low` (inclusive) to `high` (inclusive). Replaces
    `RandomState.randint` (with endpoint=False) and
    `RandomState.random_integers` (with endpoint=True)
    
    Return random integers from the "discrete uniform" distribution of
    the specified dtype. If `high` is None (the default), then results are
    from 0 to `low`.
    
    Parameters
    ----------
    low : int or array-like of ints
        Lowest (signed) integers to be drawn from the distribution (unless
        ``high=None``, in which case this parameter is 0 and this value is
        used for `high`).
    high : int or array-like of ints, optional
        If provided, one above the largest (signed) integer to be drawn
        from the distribution (see above for behavior if ``high=None``).
        If array-like, must contain integer values
    size : int or tuple of ints, optional
        Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
        ``m * n * k`` samples are drawn.  Default is None, in which case a
        single value is returned.
    dtype : dtype, optional
        Desired dtype of the result. Byteorder must be native.
        The default value is np.int64.
    endpoint : bool, optional
        If true, sample from the interval [low, high] instead of the
        default [low, high)
        Defaults to False
    
    Returns
    -------
    out : int or ndarray of ints
        `size`-shaped array of random integers from the appropriate
        distribution, or a single such random int if `size` not provided.
    
    Notes
    -----
    When using broadcasting with uint64 dtypes, the maximum value (2**64)
    cannot be represented as a standard integer type. The high array (or
    low if high is None) must have object dtype, e.g., array([2**64]).
    
    Examples
    --------
    >>> rng = np.random.default_rng()
    >>> rng.integers(2, size=10)
    array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0])  # random
    >>> rng.integers(1, size=10)
    array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
    
    Generate a 2 x 4 array of ints between 0 and 4, inclusive:
    
    >>> rng.integers(5, size=(2, 4))
    array([[4, 0, 2, 1],
           [3, 2, 2, 0]])  # random
    
    Generate a 1 x 3 array with 3 different upper bounds
    
    >>> rng.integers(1, [3, 5, 10])
    array([2, 2, 9])  # random
    
    Generate a 1 by 3 array with 3 different lower bounds
    
    >>> rng.integers([1, 5, 7], 10)
    array([9, 8, 7])  # random
    
    Generate a 2 by 4 array using broadcasting with dtype of uint8
    
    >>> rng.integers([1, 3, 5, 7], [[10], [20]], dtype=np.uint8)
    array([[ 8,  6,  9,  7],
           [ 1, 16,  9, 12]], dtype=uint8)  # random
    
    References
    ----------
    .. [1] Daniel Lemire., "Fast Random Integer Generation in an Interval",
           ACM Transactions on Modeling and Computer Simulation 29 (1), 2019,
           http://arxiv.org/abs/1805.10941.
A = rng.integers(1,11,size=(8,4))
A
array([[ 4,  8,  5,  5],
       [ 8,  2,  2,  8],
       [ 1,  3,  7,  1],
       [ 4,  6,  6,  5],
       [ 7,  9,  7,  7],
       [ 7,  3,  9,  8],
       [ 6,  1,  3, 10],
       [ 9,  4,  2,  7]])
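
The help text above also describes an `endpoint` argument. As a small sketch (the seed `0` and the name `A_alt` are arbitrary choices made here, not part of the lecture), the same range 1 to 10 can be requested with `high=10` and `endpoint=True` instead of `high=11`.

```python
import numpy as np

# The seed is an arbitrary choice for this sketch; it only makes the run repeatable.
rng = np.random.default_rng(0)

# high=10 with endpoint=True samples from 1, ..., 10,
# the same range as rng.integers(1, 11, size=(8, 4)).
A_alt = rng.integers(1, 10, size=(8, 4), endpoint=True)

print(A_alt.shape)
print(A_alt.min() >= 1, A_alt.max() <= 10)
```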

In mathematics, it doesn’t make sense to ask if a vector is strictly greater than another vector. In NumPy, this comparison is done elementwise.

A[:,0] > A[:,1]
array([False,  True, False, False, False,  True,  True,  True])

Equality works the same way: NumPy arrays are compared elementwise.

A[:,0] == A[:,1]
array([False, False, False, False, False, False, False, False])

With lists instead of NumPy arrays, equality means “are the lists exactly the same, with the same elements in the same positions?”

[4,2,3] == [4,2,3]
True
[4,2,3] == [4,3,2]
False
np.array([1,2,3]) == np.array([4,2,3])
array([False,  True,  True])
set([4,2,3]) == set([4,3,2,2,4,2,2])
True
[1,2,3] == [4,2,3]
False
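
To get a single True/False answer for two NumPy arrays, similar to list equality, one option is `np.array_equal`. This is a short sketch; the names `u` and `v` are introduced here for illustration only.

```python
import numpy as np

u = np.array([4, 2, 3])
v = np.array([4, 3, 2])

# Elementwise comparison returns a Boolean array, one entry per position.
print(u == v)

# np.array_equal returns one Boolean: True only if the shapes and all entries match.
print(np.array_equal(u, u.copy()))
print(np.array_equal(u, v))
```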

This next cell produces an example of a Boolean array.

A[:,0] > A[:,1]
array([False,  True, False, False, False,  True,  True,  True])

We count how often True appears using np.count_nonzero.

np.count_nonzero(A[:,0] > A[:,1])
4

We think of each row as being one “experiment”. We can find the number of rows using len.

# number of experiments = number of rows
len(A)
8

We estimate the probability using “number of successes”/“number of experiments”. The estimate won’t be accurate yet, because we are using so few experiments.

# prob estimate using len(A) experiments
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.5
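
Since True counts as 1 and False as 0, the mean of a Boolean array is the same fraction of successes. The following is a sketch of that alternative, not the approach used in the lecture; the seed and the names `successes`, `estimate_mean`, and `estimate_ratio` are assumptions made here.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
A = rng.integers(1, 11, size=(8, 4))

successes = A[:, 0] > A[:, 1]  # Boolean array, one entry per row

# True counts as 1 and False as 0, so the mean equals successes / experiments.
estimate_mean = successes.mean()
estimate_ratio = np.count_nonzero(successes) / len(A)

print(estimate_mean, estimate_ratio)
```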

Using ten million experiments.

A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.449869

If we repeat the same computation with a new random array, we should get a very similar answer, but it won’t be exactly the same, since these are estimates using random experiments.

A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.4499494
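
For comparison, the exact probability can be found by listing all 100 equally likely pairs (x, y). This sketch is not part of the lecture; the names `pairs`, `favorable`, and `exact` are introduced here for illustration.

```python
from fractions import Fraction

# All 100 equally likely (x, y) pairs with entries from 1 to 10.
pairs = [(x, y) for x in range(1, 11) for y in range(1, 11)]

# Pairs where x is strictly greater than y.
favorable = [(x, y) for (x, y) in pairs if x > y]

exact = Fraction(len(favorable), len(pairs))
print(len(favorable), len(pairs), float(exact))  # 45 100 0.45
```

The same value follows by symmetry: P(x > y) = P(x < y) and P(x = y) = 1/10, so P(x > y) = (1 - 1/10)/2 = 0.45, which is consistent with the two simulated estimates.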

pandas

pandas is probably the most important Python library for Math 10. Essentially everything we did earlier in this notebook can also be done in pandas, and pandas has a lot of extra functionality that will help us work with datasets.

import pandas as pd
A = rng.integers(1,11,size=(8,4))
type(A)
numpy.ndarray
A.shape
(8, 4)

We convert this NumPy array to a pandas DataFrame. (Make sure you capitalize DataFrame correctly.)

df = pd.DataFrame(A)
df
    0   1   2   3
0   9   3  10   5
1   7   7   5   2
2   8   4  10   2
3   1   5   3   1
4   9   1  10   7
5   8  10   4   9
6   9   7   9   2
7  10  10   2  10

The syntax for getting the zeroth column of a pandas DataFrame is a little longer than the NumPy syntax.

# zeroth column of df
df.iloc[:,0]
0     9
1     7
2     8
3     1
4     9
5     8
6     9
7    10
Name: 0, dtype: int64

This column is a pandas Series.

type(df.iloc[:,0])
pandas.core.series.Series

We can compare the entries in these columns elementwise, just like we did using NumPy.

df.iloc[:,0] > df.iloc[:,1]
0     True
1    False
2     True
3    False
4     True
5    False
6     True
7    False
dtype: bool

Here is the most efficient way I know to count Trues in a pandas Boolean Series.

(df.iloc[:,0] > df.iloc[:,1]).sum()
4

We can again get the number of rows using len.

len(df)
8
df.shape
(8, 4)

Here is the probability estimate.

# Not using enough experiments
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.5

Here we increase the number of experiments, but we forget to change df.

# forgot to update df
A = rng.integers(1,11,size=(10**7,4))
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.5
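
The estimate did not change because `df` was built from the earlier 8x4 array. Assigning a new array to the name `A` does not touch `df`. A minimal sketch of this, with an arbitrary seed and a smaller array than in the lecture:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

A = rng.integers(1, 11, size=(8, 4))
df = pd.DataFrame(A)

# Rebinding the name A creates a new array; df still holds the old 8 rows.
A = rng.integers(1, 11, size=(1000, 4))

print(A.shape)   # (1000, 4)
print(df.shape)  # (8, 4)
```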

Here is the correct version.

A = rng.integers(1,11,size=(10**7,4))
df = pd.DataFrame(A)
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.4500289
A = rng.integers(1,11,size=(8,4))
df = pd.DataFrame(A)
A
array([[ 1,  8,  6,  3],
       [ 7,  5,  2,  7],
       [ 9,  6,  3,  7],
       [ 8,  2,  7,  8],
       [ 4,  9,  2,  8],
       [ 4,  7,  1,  7],
       [10,  2,  6, 10],
       [ 3,  8,  3,  7]])
df
    0  1  2   3
0   1  8  6   3
1   7  5  2   7
2   9  6  3   7
3   8  2  7   8
4   4  9  2   8
5   4  7  1   7
6  10  2  6  10
7   3  8  3   7

Changing column names.

df.columns = ["a","b","m","chris"]
df
    a  b  m  chris
0   1  8  6      3
1   7  5  2      7
2   9  6  3      7
3   8  2  7      8
4   4  9  2      8
5   4  7  1      7
6  10  2  6     10
7   3  8  3      7
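
Column names can also be given when the DataFrame is created, using the `columns` argument of `pd.DataFrame`. A short sketch, where the seed and the name `df2` are assumptions made here:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
A = rng.integers(1, 11, size=(8, 4))

# Name the columns at construction time instead of assigning df.columns afterwards.
df2 = pd.DataFrame(A, columns=["a", "b", "m", "chris"])

print(list(df2.columns))  # ['a', 'b', 'm', 'chris']
```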

There are two similar operations, df.loc and df.iloc. The operation df.loc refers to rows and columns by their names (labels), whereas df.iloc refers to rows and columns by their integer positions.

df.loc[:,"b"]
0    8
1    5
2    6
3    2
4    9
5    7
6    2
7    8
Name: b, dtype: int64
df.iloc[:,1]
0    8
1    5
2    6
3    2
4    9
5    7
6    2
7    8
Name: b, dtype: int64
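
In the DataFrame above, the row labels happen to equal the row positions, so the difference between the two only shows up in the columns. The following sketch uses a small hypothetical DataFrame `small` (not from the lecture) whose row labels are 10, 20, 30.

```python
import pandas as pd

small = pd.DataFrame(
    {"a": [1, 7, 9], "b": [8, 5, 6]},
    index=[10, 20, 30],
)

print(small.loc[20, "b"])  # row label 20, column label "b" -> 5
print(small.iloc[1, 1])    # row position 1, column position 1 -> 5
print(small.iloc[0, 0])    # row position 0, column position 0 -> 1

# small.loc[0, "a"] would raise a KeyError, because no row has the label 0.
```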

There is a common shortcut for referring to a column by its name.

# abbreviation
df["b"]
0    8
1    5
2    6
3    2
4    9
5    7
6    2
7    8
Name: b, dtype: int64

This next command says: give me the rows at positions 1, 2, and 3 (the right endpoint 4 is not included) in the column at position 2, counting from 0.

df.iloc[1:4,2]
1    2
2    3
3    7
Name: m, dtype: int64

Somewhat confusingly, right endpoints are included when using loc.

df.loc[1:4,"m"]
1    2
2    3
3    7
4    2
Name: m, dtype: int64
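
One likely reason for this convention is that labels need not be integers, so there is not always an obvious "next label" to use as an excluded endpoint. A sketch with a hypothetical DataFrame `small`, slicing columns by name and by position:

```python
import pandas as pd

small = pd.DataFrame({"a": [1, 7], "b": [8, 5], "m": [6, 2], "chris": [3, 7]})

# Label-based slice: both endpoints "a" and "m" are included.
by_label = small.loc[:, "a":"m"]

# Position-based slice: position 2 is excluded, so only positions 0 and 1 remain.
by_position = small.iloc[:, 0:2]

print(list(by_label.columns))     # ['a', 'b', 'm']
print(list(by_position.columns))  # ['a', 'b']
```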

You can use this same sort of notation to set values.

df
    a  b  m  chris
0   1  8  6      3
1   7  5  2      7
2   9  6  3      7
3   8  2  7      8
4   4  9  2      8
5   4  7  1      7
6  10  2  6     10
7   3  8  3      7
df.iloc[1:4,2] = -1000
df
    a  b     m  chris
0   1  8     6      3
1   7  5 -1000      7
2   9  6 -1000      7
3   8  2 -1000      8
4   4  9     2      8
5   4  7     1      7
6  10  2     6     10
7   3  8     3      7

That same sort of notation also works for NumPy arrays.

B = np.array(df)
B
array([[    1,     8,     6,     3],
       [    7,     5, -1000,     7],
       [    9,     6, -1000,     7],
       [    8,     2, -1000,     8],
       [    4,     9,     2,     8],
       [    4,     7,     1,     7],
       [   10,     2,     6,    10],
       [    3,     8,     3,     7]])
B[1:4,0] = 3
B
array([[    1,     8,     6,     3],
       [    3,     5, -1000,     7],
       [    3,     6, -1000,     7],
       [    3,     2, -1000,     8],
       [    4,     9,     2,     8],
       [    4,     7,     1,     7],
       [   10,     2,     6,    10],
       [    3,     8,     3,     7]])
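
Note that `np.array(df)` copies the data by default, so assigning into `B` does not change `df`. A small sketch with a hypothetical two-row DataFrame `small`:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({"a": [1, 7], "b": [8, 5]})

B = np.array(small)  # np.array copies its input by default
B[:, 0] = 3          # modify the copy only

print(B[:, 0].tolist())     # [3, 3]
print(small["a"].tolist())  # [1, 7]
```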

You can also set multiple different values. The following says, in the 1st column (remember that we start counting at 0), set the elements from the 5th, 6th, 7th rows to be 100, 200, 300, respectively.

B[5:,1] = [100,200,300]
B
array([[    1,     8,     6,     3],
       [    3,     5, -1000,     7],
       [    3,     6, -1000,     7],
       [    3,     2, -1000,     8],
       [    4,     9,     2,     8],
       [    4,   100,     1,     7],
       [   10,   200,     6,    10],
       [    3,   300,     3,     7]])
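
When setting several values at once, the number of values has to match the number of selected entries (or be a single value, which is repeated). The following sketch uses an array of zeros rather than the lecture's `B`, and the flag `mismatch_raised` is introduced here only to record what happens.

```python
import numpy as np

B = np.zeros((8, 4), dtype=int)

B[5:, 1] = [100, 200, 300]  # three rows, three values: works
B[5:, 2] = -1               # a single value is repeated in all three rows

try:
    B[5:, 3] = [100, 200]   # three rows but only two values: shape mismatch
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(B[5:, 1].tolist(), B[5:, 2].tolist(), mismatch_raised)
```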