# NumPy and pandas

Review of NumPy and examples using pandas.

[Recording of lecture from 1/12/2022](https://uci.zoom.us/rec/share/ZX9cbbE7zeJR-MRWu9Rmj1_r7IMlliQOe01P27fln1RgxddHgchdl8x6HYmFKnvU.DVx5i65Sb_b3JrC_)

## Warm-up exercise

1. Define an 8x4 NumPy array A of random integers between 1 and 10 (inclusive).

2. Each row of A has four columns.  Let [x,y,z,w] denote one of these rows.  What is the probability that x > y?

In [7]:
import numpy as np

In [8]:
rng = np.random.default_rng()

In [6]:
help(rng.integers)

In [87]:
A = rng.integers(1,11,size=(8,4))
A

array([[ 4,  3,  2, 10],
       [ 8,  4,  1,  6],
       [ 5,  8,  4,  8],
       [10,  6, 10,  8],
       [ 9, 10,  6,  6],
       [ 3,  2,  4,  3],
       [ 9,  2, 10,  5],
       [ 6,  6,  7,  2]])
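The upper bound 11 appears above because `rng.integers` excludes its upper bound by default. Here is a minimal sketch of two related options. The seed `0` and the names `rng_demo`, `A_excl`, `A_incl` are arbitrary choices for illustration and are not from the lecture.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible;
# the lecture uses an unseeded generator.
rng_demo = np.random.default_rng(0)

# Default behavior: the upper bound is excluded, so values lie in 1, ..., 10.
A_excl = rng_demo.integers(1, 11, size=(8, 4))

# endpoint=True makes the upper bound inclusive, so values again lie in 1, ..., 10.
A_incl = rng_demo.integers(1, 10, size=(8, 4), endpoint=True)

print(A_excl.min(), A_excl.max())
print(A_incl.min(), A_incl.max())
```

Both calls sample from the same range. Which spelling to use is a matter of taste.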

In mathematics, it doesn't make sense to ask if a vector is strictly greater than another vector.  In NumPy, this comparison is done *elementwise*.

In [88]:
A[:,0] > A[:,1]

array([ True,  True, False,  True, False,  True,  True, False])

The same holds for equality: NumPy arrays are compared elementwise.

In [89]:
A[:,0] == A[:,1]

array([False, False, False, False, False, False, False,  True])

With a `list` instead of a NumPy array, equality means "are the lists exactly the same, with the same elements in the same positions?"

In [23]:
[4,2,3] == [4,2,3]

True

In [24]:
[4,2,3] == [4,3,2]

False

In [22]:
np.array([1,2,3]) == np.array([4,2,3])

array([False,  True,  True])

In [26]:
set([4,2,3]) == set([4,3,2,2,4,2,2])

True

In [21]:
[1,2,3] == [4,2,3]

False
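If we want a single `True`/`False` answer for two NumPy arrays, as we get for lists, one possible approach is to reduce the elementwise result. The sketch below reuses the arrays from the comparison above. The variable names are chosen only for illustration.

```python
import numpy as np

u = np.array([1, 2, 3])
v = np.array([4, 2, 3])

elementwise = (u == v)              # Boolean array, one entry per position
all_equal = (u == v).all()          # True only if every position matches
same_array = np.array_equal(u, v)   # also requires the shapes to agree

print(elementwise.tolist())  # [False, True, True]
print(all_equal)             # False
print(same_array)            # False
```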

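This next cell produces an example of a Boolean array.  (The array `A` was regenerated before this cell ran, so the output differs from the comparison above.)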

In [12]:
A[:,0] > A[:,1]

array([False, False, False,  True,  True, False,  True,  True])

We can count how often `True` appears using `np.count_nonzero`.

In [1]:
np.count_nonzero(A[:,0] > A[:,1])

4

We think of each row as being one "experiment".  We can find the number of rows using `len`.

In [14]:
# number of experiments = number of rows
len(A)

8

We estimate the probability using "number of successes"/"number of experiments".  It won't be accurate yet, because we are using so few experiments.

In [15]:
# prob estimate using len(A) experiments
np.count_nonzero(A[:,0] > A[:,1])/len(A)

0.5

Now we use ten million experiments.

In [17]:
A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)

0.4501296

If we repeat the same computation with a fresh random array, we should get a very similar answer, but not exactly the same one, since these are estimates based on random experiments.

In [3]:
A = rng.integers(1,11,size=(10**7,4))
np.count_nonzero(A[:,0] > A[:,1])/len(A)

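The estimates above can be compared with the exact answer. Assuming x and y are independent and uniform on 1, ..., 10 (which is how `rng.integers` draws them), all 100 pairs (x, y) are equally likely. Ten of them are ties, and the remaining 90 split evenly between x > y and x < y, which gives 45/100. The sketch below checks this by enumerating the pairs. The names `values`, `favorable` and `exact_prob` are chosen for illustration.

```python
import numpy as np

values = np.arange(1, 11)             # the possible values 1, ..., 10
x, y = np.meshgrid(values, values)    # all 100 equally likely (x, y) pairs

favorable = np.count_nonzero(x > y)   # pairs with x strictly greater than y
exact_prob = favorable / x.size

print(favorable, exact_prob)  # 45 0.45
```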

## pandas

The library pandas is probably the most important Python library for Math 10.  Essentially everything we did earlier in this notebook can also be done in pandas, and pandas has a lot of extra functionality that will help us work with datasets.

In [9]:
import pandas as pd

In [10]:
A = rng.integers(1,11,size=(8,4))
type(A)

numpy.ndarray

In [11]:
A.shape

(8, 4)

We convert this NumPy array to a pandas DataFrame.  (Make sure you capitalize DataFrame correctly.)

In [13]:
df = pd.DataFrame(A)
df

   0  1  2  3
0  9  8  5  3
1  6  8  8  4
2  3  5  6  8
3  9  7  9  1
4  5  6  9  6
5  6  7  5  5
6  3  2  5  8
7  1  6  4  9


The syntax for getting the zeroth column of a pandas DataFrame is a little longer than the NumPy syntax.

In [14]:
# zeroth column of df
df.iloc[:,0]

0    9
1    6
2    3
3    9
4    5
5    6
6    3
7    1
Name: 0, dtype: int64

This column is a pandas Series.

In [15]:
type(df.iloc[:,0])

pandas.core.series.Series

We can compare the entries in these columns elementwise, just like we did using NumPy.

In [16]:
df.iloc[:,0] > df.iloc[:,1]

0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
dtype: bool

Here is the most efficient way I know to count `True`s in a pandas Boolean Series.

In [17]:
(df.iloc[:,0] > df.iloc[:,1]).sum()

3

We can again get the number of rows using `len`.

In [19]:
len(df)

8

In [20]:
df.shape

(8, 4)

Here is the probability estimate.

In [23]:
# Not using enough experiments
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)

0.375
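Since `True` counts as 1 and `False` as 0, the proportion of `True` entries is also the mean of the Boolean Series. Here is a small sketch using the Boolean values from the comparison output above. The names `s`, `num_true` and `frac_true` are chosen for illustration.

```python
import pandas as pd

# Boolean values copied from the comparison output above.
s = pd.Series([True, False, False, True, False, False, True, False])

num_true = s.sum()     # number of True entries
frac_true = s.mean()   # proportion of True entries, same as s.sum() / len(s)

print(num_true, frac_true)  # 3 0.375
```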

Here we increase the number of experiments, but we forget to change `df`.

In [25]:
# forgot to update df
A = rng.integers(1,11,size=(10**7,4))
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)

0.375

Here is the correct version.

In [26]:
A = rng.integers(1,11,size=(10**7,4))
df = pd.DataFrame(A)
((df.iloc[:,0] > df.iloc[:,1]).sum())/len(df)

0.4500086

In [27]:
A = rng.integers(1,11,size=(8,4))
df = pd.DataFrame(A)

In [28]:
A

array([[ 2, 10,  3,  1],
       [ 4,  9,  7,  1],
       [ 3,  2,  5,  6],
       [ 1,  5,  7,  9],
       [ 3,  1,  7,  1],
       [ 3,  5,  2,  9],
       [ 9, 10,  3,  1],
       [ 2,  7,  7,  2]])

In [29]:
df

   0   1  2  3
0  2  10  3  1
1  4   9  7  1
2  3   2  5  6
3  1   5  7  9
4  3   1  7  1
5  3   5  2  9
6  9  10  3  1
7  2   7  7  2


We can change the column names by assigning a list of new names to `df.columns`.

In [30]:
df.columns = ["a","b","m","chris"]

In [31]:
df

   a   b  m  chris
0  2  10  3      1
1  4   9  7      1
2  3   2  5      6
3  1   5  7      9
4  3   1  7      1
5  3   5  2      9
6  9  10  3      1
7  2   7  7      2
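Assigning to `df.columns` is not the only way to name columns. As a sketch of two alternatives, the names can be passed when the DataFrame is created, or selected columns can be renamed with `rename`. The small array `A_demo` below reuses the first two rows shown above, and the names `df1` and `df2` are chosen for illustration.

```python
import numpy as np
import pandas as pd

A_demo = np.array([[2, 10, 3, 1],
                   [4, 9, 7, 1]])

# Option 1: name the columns when the DataFrame is created.
df1 = pd.DataFrame(A_demo, columns=["a", "b", "m", "chris"])

# Option 2: rename only some columns; rename returns a new DataFrame.
df2 = pd.DataFrame(A_demo).rename(columns={0: "a", 3: "chris"})

print(list(df1.columns))  # ['a', 'b', 'm', 'chris']
print(list(df2.columns))
```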


There are two similar indexers, `df.loc` and `df.iloc`.  The indexer `df.loc` refers to rows and columns by their labels (names), whereas `df.iloc` refers to rows and columns by their integer positions, counting from 0.

In [32]:
df.loc[:,"b"]

0    10
1     9
2     2
3     5
4     1
5     5
6    10
7     7
Name: b, dtype: int64

In [33]:
df.iloc[:,1]

0    10
1     9
2     2
3     5
4     1
5     5
6    10
7     7
Name: b, dtype: int64
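The two outputs above agree because the row labels of `df` happen to be 0, 1, 2, ..., the same as the row positions. The difference shows up when the labels are something else. The sketch below uses a hypothetical small DataFrame `df_demo` with row labels 10, 20, 30, which is not part of the lecture.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels are not 0, 1, 2, ...
df_demo = pd.DataFrame(
    {"a": [2, 4, 3], "b": [10, 9, 2]},
    index=[10, 20, 30],
)

by_label = df_demo.loc[20, "b"]     # row labeled 20, column named "b"
by_position = df_demo.iloc[1, 1]    # row at position 1, column at position 1

print(by_label, by_position)  # 9 9

# There is no row labeled 1, so loc raises a KeyError.
try:
    df_demo.loc[1, "b"]
    found = True
except KeyError:
    found = False
    print("no row is labeled 1")
```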

There is a common shortcut for referring to a column by its name.

In [34]:
# abbreviation
df["b"]

0    10
1     9
2     2
3     5
4     1
5     5
6    10
7     7
Name: b, dtype: int64

This next command says, give me the rows at positions 1, 2, 3 (the right endpoint 4 is not included) in the column at position 2.  Remember that we start counting at 0.

In [35]:
df.iloc[1:4,2]

1    7
2    5
3    7
Name: m, dtype: int64

Somewhat confusingly, right endpoints are included when using `loc`.

In [64]:
df.loc[1:4,"m"]

1    7
2    5
3    7
4    7
Name: m, dtype: int64
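Here is a side-by-side sketch of the two slicing conventions, using the values of the column `"m"` from the DataFrame displayed after the columns were renamed. The Series `t` with string labels is a hypothetical example added to suggest why label slices include the right endpoint: with labels like `"x"`, `"y"`, `"z"` there is no natural "one past the end" label to write.

```python
import pandas as pd

# Values of column "m" as displayed after renaming the columns.
s = pd.Series([3, 7, 5, 7, 7, 2, 3, 7], name="m")

pos_slice = s.iloc[1:4]    # positions 1, 2, 3 (position 4 excluded)
label_slice = s.loc[1:4]   # labels 1, 2, 3, 4 (label 4 included)

print(pos_slice.tolist())    # [7, 5, 7]
print(label_slice.tolist())  # [7, 5, 7, 7]

# Hypothetical Series with string labels.
t = pd.Series([10, 20, 30], index=["x", "y", "z"])
print(t.loc["x":"y"].tolist())  # [10, 20]
```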

You can use this same sort of notation to set values.

In [41]:
df

   a   b  m  chris
0  2  10  3      1
1  4   9  7      1
2  3   2  5      6
3  1   5  7      9
4  3   1  7      1
5  3   5  2      9
6  9  10  3      1
7  2   7  7      2


In [36]:
df.iloc[1:4,2] = -1000

In [37]:
df

   a   b     m  chris
0  2  10     3      1
1  4   9 -1000      1
2  3   2 -1000      6
3  1   5 -1000      9
4  3   1     7      1
5  3   5     2      9
6  9  10     3      1
7  2   7     7      2
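The same assignment can also be written with labels instead of positions. This is a sketch on a hypothetical smaller DataFrame `df_demo` (the first five rows of columns `"a"` and `"m"` before the assignment). Note that the label slice `1:3` includes row 3, so it selects the same rows as the position slice `1:4`.

```python
import pandas as pd

# Hypothetical smaller DataFrame, for illustration only.
df_demo = pd.DataFrame({"a": [2, 4, 3, 1, 3], "m": [3, 7, 5, 7, 7]})

# Label-based assignment: rows labeled 1, 2, 3 in the column named "m".
df_demo.loc[1:3, "m"] = -1000

print(df_demo["m"].tolist())  # [3, -1000, -1000, -1000, 7]
print(df_demo["a"].tolist())  # [2, 4, 3, 1, 3]
```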


That same sort of notation also works for NumPy arrays.

In [38]:
B = np.array(df)
B

array([[    2,    10,     3,     1],
       [    4,     9, -1000,     1],
       [    3,     2, -1000,     6],
       [    1,     5, -1000,     9],
       [    3,     1,     7,     1],
       [    3,     5,     2,     9],
       [    9,    10,     3,     1],
       [    2,     7,     7,     2]])

In [39]:
B[1:4,0] = 3

In [40]:
B

array([[    2,    10,     3,     1],
       [    3,     9, -1000,     1],
       [    3,     2, -1000,     6],
       [    3,     5, -1000,     9],
       [    3,     1,     7,     1],
       [    3,     5,     2,     9],
       [    9,    10,     3,     1],
       [    2,     7,     7,     2]])
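Because `np.array` copies its input by default, changing `B` does not change `df`. The sketch below shows the same independence on a hypothetical small DataFrame `df_demo`, using `to_numpy(copy=True)` to request the copy explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical small DataFrame, for illustration only.
df_demo = pd.DataFrame({"a": [2, 4, 3], "b": [10, 9, 2]})

# copy=True requests an array that does not share memory with df_demo.
B_demo = df_demo.to_numpy(copy=True)
B_demo[1:, 0] = 3   # change the array only

print(B_demo[:, 0].tolist())   # [2, 3, 3]
print(df_demo["a"].tolist())   # [2, 4, 3]
```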

You can also set multiple different values.  The following says, in the 1st column (remember that we start counting at 0), set the elements from the 5th, 6th, 7th rows to be 100, 200, 300, respectively.

In [42]:
B[5:,1] = [100,200,300]

In [43]:
B

array([[    2,    10,     3,     1],
       [    3,     9, -1000,     1],
       [    3,     2, -1000,     6],
       [    3,     5, -1000,     9],
       [    3,     1,     7,     1],
       [    3,   100,     2,     9],
       [    9,   200,     3,     1],
       [    2,   300,     7,     2]])
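When assigning several values at once, the number of values has to match the number of selected entries (a single value, as in `B[1:4,0] = 3`, is simply repeated). Here is a sketch of what happens otherwise, using a hypothetical array of zeros `B_demo` with the same shape as `B`.

```python
import numpy as np

# Hypothetical array with the same shape as B.
B_demo = np.zeros((8, 4), dtype=int)

B_demo[5:, 1] = [100, 200, 300]   # three rows, three values: fine
B_demo[1:4, 0] = 3                # one value, repeated for each selected entry

try:
    B_demo[5:, 1] = [100, 200]    # three rows but only two values
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)

print(B_demo[5:, 1].tolist())     # [100, 200, 300]
```

The failed assignment leaves the array unchanged.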