NumPy and pandas
Review of NumPy and examples using pandas.
Recording of lecture from 1/12/2022
Warm-up exercise
Define an 8x4 NumPy array A of random integers between 1 and 10 (inclusive).
Each row of A has four columns. Let [x,y,z,w] denote one of these rows. What is the probability that x > y?
import numpy as np
rng = np.random.default_rng()
help(rng.integers)
Help on built-in function integers:
integers(...) method of numpy.random._generator.Generator instance
integers(low, high=None, size=None, dtype=np.int64, endpoint=False)
Return random integers from `low` (inclusive) to `high` (exclusive), or
if endpoint=True, `low` (inclusive) to `high` (inclusive). Replaces
`RandomState.randint` (with endpoint=False) and
`RandomState.random_integers` (with endpoint=True)
Return random integers from the "discrete uniform" distribution of
the specified dtype. If `high` is None (the default), then results are
from 0 to `low`.
Parameters
----------
low : int or array-like of ints
Lowest (signed) integers to be drawn from the distribution (unless
``high=None``, in which case this parameter is 0 and this value is
used for `high`).
high : int or array-like of ints, optional
If provided, one above the largest (signed) integer to be drawn
from the distribution (see above for behavior if ``high=None``).
If array-like, must contain integer values
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., ``(m, n, k)``, then
``m * n * k`` samples are drawn. Default is None, in which case a
single value is returned.
dtype : dtype, optional
Desired dtype of the result. Byteorder must be native.
The default value is np.int64.
endpoint : bool, optional
If true, sample from the interval [low, high] instead of the
default [low, high)
Defaults to False
Returns
-------
out : int or ndarray of ints
`size`-shaped array of random integers from the appropriate
distribution, or a single such random int if `size` not provided.
Notes
-----
When using broadcasting with uint64 dtypes, the maximum value (2**64)
cannot be represented as a standard integer type. The high array (or
low if high is None) must have object dtype, e.g., array([2**64]).
Examples
--------
>>> rng = np.random.default_rng()
>>> rng.integers(2, size=10)
array([1, 0, 0, 0, 1, 1, 0, 0, 1, 0]) # random
>>> rng.integers(1, size=10)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Generate a 2 x 4 array of ints between 0 and 4, inclusive:
>>> rng.integers(5, size=(2, 4))
array([[4, 0, 2, 1],
[3, 2, 2, 0]]) # random
Generate a 1 x 3 array with 3 different upper bounds
>>> rng.integers(1, [3, 5, 10])
array([2, 2, 9]) # random
Generate a 1 by 3 array with 3 different lower bounds
>>> rng.integers([1, 5, 7], 10)
array([9, 8, 7]) # random
Generate a 2 by 4 array using broadcasting with dtype of uint8
>>> rng.integers([1, 3, 5, 7], [[10], [20]], dtype=np.uint8)
array([[ 8, 6, 9, 7],
[ 1, 16, 9, 12]], dtype=uint8) # random
References
----------
.. [1] Daniel Lemire., "Fast Random Integer Generation in an Interval",
ACM Transactions on Modeling and Computer Simulation 29 (1), 2019,
http://arxiv.org/abs/1805.10941.
A = rng.integers(1,11,size=(8,4))
A
array([[ 4, 8, 5, 5],
[ 8, 2, 2, 8],
[ 1, 3, 7, 1],
[ 4, 6, 6, 5],
[ 7, 9, 7, 7],
[ 7, 3, 9, 8],
[ 6, 1, 3, 10],
[ 9, 4, 2, 7]])
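The help text above also describes an `endpoint` parameter. As a rough sketch, the two calls below should both draw from the integers 1 through 10; the seed and the variable names are only for illustration and are not part of the lecture.

```python
import numpy as np

# The seed is an arbitrary choice so the sketch can be rerun consistently.
rng = np.random.default_rng(seed=0)

# By default `high` is exclusive, so 11 is needed to allow the value 10.
A_exclusive = rng.integers(1, 11, size=(8, 4))

# With endpoint=True, `high` is inclusive, so 10 is the largest possible value.
A_inclusive = rng.integers(1, 10, size=(8, 4), endpoint=True)

print(A_exclusive.shape, A_exclusive.min(), A_exclusive.max())
print(A_inclusive.shape, A_inclusive.min(), A_inclusive.max())
```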
In mathematics, it doesn’t make sense to ask if a vector is strictly greater than another vector. In NumPy, this comparison is done elementwise.
A[:,0] > A[:,1]
array([False, True, False, False, False, True, True, True])
It’s the same with equality: they are compared elementwise.
A[:,0] == A[:,1]
array([False, False, False, False, False, False, False, False])
With a list instead of an np.array, equality means “are the lists exactly the same, with the same elements in the same positions?”
[4,2,3] == [4,2,3]
True
[4,2,3] == [4,3,2]
False
np.array([1,2,3]) == np.array([4,2,3])
array([False, True, True])
set([4,2,3]) == set([4,3,2,2,4,2,2])
True
[1,2,3] == [4,2,3]
False
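If a single True/False answer is wanted for two NumPy arrays, one option is `np.array_equal`, or reducing the elementwise result with `.all()`. A minimal sketch with made-up arrays `u` and `v`:

```python
import numpy as np

u = np.array([4, 2, 3])
v = np.array([4, 3, 2])

# One Boolean per position.
elementwise = (u == v)

# A single Boolean for the whole arrays, similar to list equality.
whole_array = np.array_equal(u, v)
also_whole = bool((u == v).all())

print(elementwise)              # [ True False False]
print(whole_array, also_whole)  # False False
```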
This next cell produces an example of a Boolean array.
A[:,0] > A[:,1]
array([False, True, False, False, False, True, True, True])
Counting how often True appears.
np.count_nonzero(A[:,0] > A[:,1])
4
We think of each row as being one “experiment”. We can find the number of rows using len.
# number of experiments = number of rows
len(A)
8
We estimate the probability using “number of successes”/“number of experiments”. It won’t be accurate yet, because we are using so few experiments.
# prob estimate using len(A) experiments
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.5
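Because True counts as 1 and False as 0, the mean of a Boolean array is the proportion of True entries, so the same estimate can be obtained in one step. A small sketch, with the 8x4 array from above typed in by hand so the result is reproducible:

```python
import numpy as np

# The array A printed earlier in these notes, hard-coded for reproducibility.
A = np.array([
    [4, 8, 5, 5],
    [8, 2, 2, 8],
    [1, 3, 7, 1],
    [4, 6, 6, 5],
    [7, 9, 7, 7],
    [7, 3, 9, 8],
    [6, 1, 3, 10],
    [9, 4, 2, 7],
])

mask = A[:, 0] > A[:, 1]

# Two equivalent ways to get the proportion of True entries.
by_count = np.count_nonzero(mask) / len(A)
by_mean = mask.mean()

print(by_count, by_mean)  # 0.5 0.5
```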
Using ten million experiments.
A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.449869
If we do the same thing, we should get a very similar answer, but it won’t be exactly the same, since these are estimates using random experiments.
A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)
0.4499494
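For comparison, the exact probability can be found by listing all 100 equally likely (x, y) pairs and counting those with x > y. The following is only a sketch using the standard library; the variable names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

values = range(1, 11)  # the integers 1 through 10

# All ordered pairs (x, y); each pair is equally likely.
pairs = list(product(values, repeat=2))
successes = sum(1 for x, y in pairs if x > y)

exact = Fraction(successes, len(pairs))
print(successes, len(pairs))  # 45 100
print(exact, float(exact))    # 9/20 0.45
```

Both simulated estimates above are within about 0.0002 of this value.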
pandas
pandas is probably the most important Python library for Math 10. Essentially everything we did earlier in this notebook, we can also do in pandas. The library pandas also has a lot of extra functionality that will help us work with datasets.
import pandas as pd
A = rng.integers(1,11,size=(8,4))
type(A)
numpy.ndarray
A.shape
(8, 4)
We convert this NumPy array to a pandas DataFrame. (Make sure you capitalize DataFrame correctly.)
df = pd.DataFrame(A)
df
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 9 | 3 | 10 | 5 |
| 1 | 7 | 7 | 5 | 2 |
| 2 | 8 | 4 | 10 | 2 |
| 3 | 1 | 5 | 3 | 1 |
| 4 | 9 | 1 | 10 | 7 |
| 5 | 8 | 10 | 4 | 9 |
| 6 | 9 | 7 | 9 | 2 |
| 7 | 10 | 10 | 2 | 10 |
The syntax for getting the zeroth column of a pandas DataFrame is a little longer than the NumPy syntax.
# zeroth column of df
df.iloc[:,0]
0 9
1 7
2 8
3 1
4 9
5 8
6 9
7 10
Name: 0, dtype: int64
This column is a pandas Series.
type(df.iloc[:,0])
pandas.core.series.Series
We can compare the entries in these columns elementwise, just like we did using NumPy.
df.iloc[:,0] > df.iloc[:,1]
0 True
1 False
2 True
3 False
4 True
5 False
6 True
7 False
dtype: bool
Here is the most efficient way I know to count Trues in a pandas Boolean Series.
(df.iloc[:,0] > df.iloc[:,1]).sum()
4
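The `sum` works because True is treated as 1 and False as 0. As a sketch, here are a few equivalent counts on the 8-row DataFrame shown above, with its values typed in by hand; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# The values of df shown earlier, hard-coded for reproducibility.
df = pd.DataFrame([
    [9, 3, 10, 5],
    [7, 7, 5, 2],
    [8, 4, 10, 2],
    [1, 5, 3, 1],
    [9, 1, 10, 7],
    [8, 10, 4, 9],
    [9, 7, 9, 2],
    [10, 10, 2, 10],
])

mask = df.iloc[:, 0] > df.iloc[:, 1]

count_true = mask.sum()            # True counts as 1
count_false = (~mask).sum()        # ~ flips True and False
count_np = np.count_nonzero(mask)  # the NumPy function also accepts a Series

print(count_true, count_false, count_np)  # 4 4 4
```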
We can again get the number of rows using len.
len(df)
8
df.shape
(8, 4)
Here is the probability estimate.
# Not using enough experiments
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.5
Here we increase the number of experiments, but we forget to change df, so the estimate is still computed from the old 8-row DataFrame.
# forgot to update df
A = rng.integers(1,11,size=(10**7,4))
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.5
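The reason is that assigning a new array to the name A does not touch the DataFrame that was built from the old array. A short sketch of this, using 10**5 rows instead of 10**7 only to keep it fast:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng()

A = rng.integers(1, 11, size=(8, 4))
df = pd.DataFrame(A)

# Rebinding the name A to a new, larger array leaves df as it was.
A = rng.integers(1, 11, size=(10**5, 4))

print(A.shape)   # (100000, 4)
print(df.shape)  # (8, 4)
```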
Here is the correct version.
A = rng.integers(1,11,size=(10**7,4))
df = pd.DataFrame(A)
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)
0.4500289
A = rng.integers(1,11,size=(8,4))
df = pd.DataFrame(A)
A
array([[ 1, 8, 6, 3],
[ 7, 5, 2, 7],
[ 9, 6, 3, 7],
[ 8, 2, 7, 8],
[ 4, 9, 2, 8],
[ 4, 7, 1, 7],
[10, 2, 6, 10],
[ 3, 8, 3, 7]])
df
| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 1 | 8 | 6 | 3 |
| 1 | 7 | 5 | 2 | 7 |
| 2 | 9 | 6 | 3 | 7 |
| 3 | 8 | 2 | 7 | 8 |
| 4 | 4 | 9 | 2 | 8 |
| 5 | 4 | 7 | 1 | 7 |
| 6 | 10 | 2 | 6 | 10 |
| 7 | 3 | 8 | 3 | 7 |
Changing column names.
df.columns = ["a","b","m","chris"]
df
| | a | b | m | chris |
|---|---|---|---|---|
| 0 | 1 | 8 | 6 | 3 |
| 1 | 7 | 5 | 2 | 7 |
| 2 | 9 | 6 | 3 | 7 |
| 3 | 8 | 2 | 7 | 8 |
| 4 | 4 | 9 | 2 | 8 |
| 5 | 4 | 7 | 1 | 7 |
| 6 | 10 | 2 | 6 | 10 |
| 7 | 3 | 8 | 3 | 7 |
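Column names can also be supplied when the DataFrame is created, or changed afterwards with `rename`. The sketch below uses only the first two rows of the array above; `df_named` and `df_renamed` are illustrative names.

```python
import numpy as np
import pandas as pd

A = np.array([[1, 8, 6, 3],
              [7, 5, 2, 7]])

# Option 1: give the names when constructing the DataFrame.
df_named = pd.DataFrame(A, columns=["a", "b", "m", "chris"])

# Option 2: map old names to new names; rename returns a new DataFrame.
df_renamed = pd.DataFrame(A).rename(columns={0: "a", 1: "b", 2: "m", 3: "chris"})

print(list(df_named.columns))    # ['a', 'b', 'm', 'chris']
print(list(df_renamed.columns))  # ['a', 'b', 'm', 'chris']
```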
There are two similar operations, df.loc and df.iloc. The operation df.loc refers to rows and columns by their names (labels), whereas df.iloc refers to rows and columns by their integer position.
df.loc[:,"b"]
0 8
1 5
2 6
3 2
4 9
5 7
6 2
7 8
Name: b, dtype: int64
df.iloc[:,1]
0 8
1 5
2 6
3 2
4 9
5 7
6 2
7 8
Name: b, dtype: int64
There is a common shortcut for referring to a column by its name.
# abbreviation
df["b"]
0 8
1 5
2 6
3 2
4 9
5 7
6 2
7 8
Name: b, dtype: int64
This next command says, give me the rows at positions 1 through 3 (the right endpoint 4 is not included) in the column at position 2, counting from 0.
df.iloc[1:4,2]
1 2
2 3
3 7
Name: m, dtype: int64
Somewhat confusingly, right endpoints are included when using loc.
df.loc[1:4,"m"]
1 2
2 3
3 7
4 2
Name: m, dtype: int64
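The difference between labels and positions is easier to see when the row labels are not 0, 1, 2, and so on. In the sketch below the row labels 10 through 50 are hypothetical, chosen only to make the two kinds of slicing visibly different.

```python
import pandas as pd

# Hypothetical row labels 10, 20, ..., 50 instead of the default 0, 1, ...
df = pd.DataFrame({"m": [6, 2, 3, 7, 2]}, index=[10, 20, 30, 40, 50])

# iloc slices by position and excludes the right endpoint.
by_position = df.iloc[1:4, 0]

# loc slices by label and includes the right endpoint.
by_label = df.loc[20:40, "m"]

print(by_position.tolist())  # [2, 3, 7]
print(by_label.tolist())     # [2, 3, 7]
```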
You can use this same sort of notation to set values.
df
| | a | b | m | chris |
|---|---|---|---|---|
| 0 | 1 | 8 | 6 | 3 |
| 1 | 7 | 5 | 2 | 7 |
| 2 | 9 | 6 | 3 | 7 |
| 3 | 8 | 2 | 7 | 8 |
| 4 | 4 | 9 | 2 | 8 |
| 5 | 4 | 7 | 1 | 7 |
| 6 | 10 | 2 | 6 | 10 |
| 7 | 3 | 8 | 3 | 7 |
df.iloc[1:4,2] = -1000
df
| | a | b | m | chris |
|---|---|---|---|---|
| 0 | 1 | 8 | 6 | 3 |
| 1 | 7 | 5 | -1000 | 7 |
| 2 | 9 | 6 | -1000 | 7 |
| 3 | 8 | 2 | -1000 | 8 |
| 4 | 4 | 9 | 2 | 8 |
| 5 | 4 | 7 | 1 | 7 |
| 6 | 10 | 2 | 6 | 10 |
| 7 | 3 | 8 | 3 | 7 |
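The same assignment can be written with loc, keeping in mind that loc includes the right endpoint. A minimal sketch on a smaller two-column DataFrame (the data is a hand-typed subset of the one above):

```python
import pandas as pd

data = {"a": [1, 7, 9, 8, 4], "m": [6, 2, 3, 7, 2]}
df_pos = pd.DataFrame(data)
df_lab = pd.DataFrame(data)

# By position: rows 1, 2, 3 of the column at position 1 (which is "m").
df_pos.iloc[1:4, 1] = -1000

# By label: rows labeled 1 through 3 (3 is included) of column "m".
df_lab.loc[1:3, "m"] = -1000

print(df_pos["m"].tolist())  # [6, -1000, -1000, -1000, 2]
print(df_lab["m"].tolist())  # [6, -1000, -1000, -1000, 2]
```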
That same sort of notation also works for NumPy arrays.
B = np.array(df)
B
array([[ 1, 8, 6, 3],
[ 7, 5, -1000, 7],
[ 9, 6, -1000, 7],
[ 8, 2, -1000, 8],
[ 4, 9, 2, 8],
[ 4, 7, 1, 7],
[ 10, 2, 6, 10],
[ 3, 8, 3, 7]])
B[1:4,0] = 3
B
array([[ 1, 8, 6, 3],
[ 3, 5, -1000, 7],
[ 3, 6, -1000, 7],
[ 3, 2, -1000, 8],
[ 4, 9, 2, 8],
[ 4, 7, 1, 7],
[ 10, 2, 6, 10],
[ 3, 8, 3, 7]])
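Note that `np.array` copies its input by default, so changing B should not change df. A small sketch with a made-up 3-row DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 7, 9], "b": [8, 5, 6]})

# np.array makes a copy by default, so B has its own data.
B = np.array(df)
B[1:, 0] = 3

print(B[:, 0].tolist())  # [1, 3, 3]
print(df["a"].tolist())  # [1, 7, 9]
```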
You can also set multiple different values. The following says, in the 1st column (remember that we start counting at 0), set the elements from the 5th, 6th, 7th rows to be 100, 200, 300, respectively.
B[5:,1] = [100,200,300]
B
array([[ 1, 8, 6, 3],
[ 3, 5, -1000, 7],
[ 3, 6, -1000, 7],
[ 3, 2, -1000, 8],
[ 4, 9, 2, 8],
[ 4, 100, 1, 7],
[ 10, 200, 6, 10],
[ 3, 300, 3, 7]])
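For this kind of assignment, the number of values on the right has to match the number of selected entries, unless a single value is given, in which case it is repeated. A sketch using an array of zeros in place of B:

```python
import numpy as np

B = np.zeros((8, 4), dtype=int)

# Three selected entries (rows 5, 6, 7), three values: this works.
B[5:, 1] = [100, 200, 300]

# Three selected entries but only two values: NumPy raises a ValueError.
raised = False
try:
    B[5:, 1] = [100, 200]
except ValueError as err:
    raised = True
    print("ValueError:", err)

# A single value is broadcast to every selected entry.
B[5:, 0] = 7

print(B[5:, :2])
```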