Introduction to pandas¶

Download the csv file “indexData.csv” from Files->Datasets on Canvas. Save the file somewhere you can access on the computer. This “indexData.csv” comes from here, but I changed the name of the first column so you should use the version on Canvas, not the version on Kaggle.

Goals for today:

Introduce dictionaries, an important data type.
Introduce the library pandas (for working with spreadsheet-like data)
Similarity: both of these involve “named” data entries.

Dictionaries¶

The main goal of this notebook is to introduce the library pandas, which is arguably the most important library we will work with in Math 10. There are two fundamental data types defined by pandas, Series and DataFrame. The Series data type in particular has many similarities to dict, for dictionary, which is an important and standard data type in Python. So before talking about pandas, we introduce this dictionary data type.

roles = {"chris": "teacher", "yasmeen": "TA", 
         "xingshuo": "LA", "chupeng": "LA"}

type(roles)

dict

roles["yasmeen"]

'TA'

roles[xingshuo]

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3930756786.py in <module>
----> 1 roles[xingshuo]

NameError: name 'xingshuo' is not defined

c = "xingshuo"

roles[c]

'LA'

roles[(3,1)] = 17

roles[[2,5]] = 10

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/2435546372.py in <module>
----> 1 roles[[2,5]] = 10

TypeError: unhashable type: 'list'

roles

{'chris': 'teacher',
 'yasmeen': 'TA',
 'xingshuo': 'LA',
 'chupeng': 'LA',
 (3, 1): 17}

roles['chris'] = 'professor'

# Notice that there cannot be repeated keys: there is only one "chris" key
roles

{'chris': 'professor',
 'yasmeen': 'TA',
 'xingshuo': 'LA',
 'chupeng': 'LA',
 (3, 1): 17}

my_list = [3,1,5]

my_list[2] = 10

my_list

[3, 1, 10]

my_list[7] = 0

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/46238093.py in <module>
----> 1 my_list[7] = 0

IndexError: list assignment index out of range

pandas¶

import pandas as pd
import numpy as np

A = np.arange(40).reshape((10,4))
A

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31],
       [32, 33, 34, 35],
       [36, 37, 38, 39]])

A.shape

(10, 4)

np.arange(40).reshape((-1,5))

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39]])

np.arange(40).reshape((-1,3))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/4274071939.py in <module>
----> 1 np.arange(40).reshape((-1,3))

ValueError: cannot reshape array of size 40 into shape (3)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31],
       [32, 33, 34, 35],
       [36, 37, 38, 39]])

list(A)

[array([0, 1, 2, 3]),
 array([4, 5, 6, 7]),
 array([ 8,  9, 10, 11]),
 array([12, 13, 14, 15]),
 array([16, 17, 18, 19]),
 array([20, 21, 22, 23]),
 array([24, 25, 26, 27]),
 array([28, 29, 30, 31]),
 array([32, 33, 34, 35]),
 array([36, 37, 38, 39])]

df = pd.DataFrame(A)
df

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

type(df)

pandas.core.frame.DataFrame

df = pd.DataFrame(A,columns=["a","b","e","a"])
df

	a	b	e	a
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

list("hi there chris davis")

['h',
 'i',
 ' ',
 't',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'c',
 'h',
 'r',
 'i',
 's',
 ' ',
 'd',
 'a',
 'v',
 'i',
 's']

df = pd.DataFrame(A,columns=list("hi there chris davis"))
df

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3021520278.py in <module>
----> 1 df = pd.DataFrame(A,columns=list("hi there chris davis"))
      2 df

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    670                 )
    671             else:
--> 672                 mgr = ndarray_to_mgr(
    673                     data,
    674                     index,

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    322     )
    323 
--> 324     _check_values_indices_shape_match(values, index, columns)
    325 
    326     if typ == "array":

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
    391         passed = values.shape
    392         implied = (len(index), len(columns))
--> 393         raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
    394 
    395 

ValueError: Shape of passed values is (10, 4), indices imply (10, 20)

df = pd.DataFrame(A,columns=list("abcd"))
df

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

df.index = list("abcdefghij")

df.columns = [10,2,5,8]

df

	10	2	5	8
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11
d	12	13	14	15
e	16	17	18	19
f	20	21	22	23
g	24	25	26	27
h	28	29	30	31
i	32	33	34	35
j	36	37	38	39

s = df[8]
s

a     3
b     7
c    11
d    15
e    19
f    23
g    27
h    31
i    35
j    39
Name: 8, dtype: int64

type(s)

pandas.core.series.Series

s["c"]

dict(s)

{'a': 3,
 'b': 7,
 'c': 11,
 'd': 15,
 'e': 19,
 'f': 23,
 'g': 27,
 'h': 31,
 'i': 35,
 'j': 39}

# Your path will be different!
df = pd.read_csv("../data/indexData.csv")

df

	Name	Date	Open	High	Low	Close	Adj Close	Volume
0	NYA	12/31/65	528.690002	528.690002	528.690002	528.690002	528.690002	0.0
1	NYA	1/3/66	527.210022	527.210022	527.210022	527.210022	527.210022	0.0
2	NYA	1/4/66	527.840027	527.840027	527.840027	527.840027	527.840027	0.0
3	NYA	1/5/66	531.119995	531.119995	531.119995	531.119995	531.119995	0.0
4	NYA	1/6/66	532.070007	532.070007	532.070007	532.070007	532.070007	0.0
...	...	...	...	...	...	...	...	...
112452	N100	5/27/21	1241.119995	1251.910034	1241.119995	1247.069946	1247.069946	379696400.0
112453	N100	5/28/21	1249.469971	1259.209961	1249.030029	1256.599976	1256.599976	160773400.0
112454	N100	5/31/21	1256.079956	1258.880005	1248.140015	1248.930054	1248.930054	91173700.0
112455	N100	6/1/21	1254.609985	1265.660034	1254.609985	1258.579956	1258.579956	155179900.0
112456	N100	6/2/21	1258.489990	1263.709961	1258.239990	1263.619995	1263.619995	148465000.0

112457 rows × 8 columns

Let’s find how many days was the high - low for NYSE bigger than 5% of the open value

# restrict to the NYA rows
df["Name"] == "NYA"

0          True
1          True
2          True
3          True
4          True
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Name, Length: 112457, dtype: bool

# syntax: dataframe [ Boolean series]
df_nya = df[df["Name"] == "NYA"]
df_nya

	Name	Date	Open	High	Low	Close	Adj Close	Volume
0	NYA	12/31/65	528.690002	528.690002	528.690002	528.690002	528.690002	0.000000e+00
1	NYA	1/3/66	527.210022	527.210022	527.210022	527.210022	527.210022	0.000000e+00
2	NYA	1/4/66	527.840027	527.840027	527.840027	527.840027	527.840027	0.000000e+00
3	NYA	1/5/66	531.119995	531.119995	531.119995	531.119995	531.119995	0.000000e+00
4	NYA	1/6/66	532.070007	532.070007	532.070007	532.070007	532.070007	0.000000e+00
...	...	...	...	...	...	...	...	...
13943	NYA	5/24/21	16375.000000	16508.519530	16375.000000	16464.689450	16464.689450	2.947400e+09
13944	NYA	5/25/21	16464.689450	16525.810550	16375.150390	16390.189450	16390.189450	3.420870e+09
13945	NYA	5/26/21	16390.189450	16466.339840	16388.320310	16451.960940	16451.960940	3.674490e+09
13946	NYA	5/27/21	16451.960940	16546.359380	16451.960940	16531.949220	16531.949220	5.201110e+09
13947	NYA	5/28/21	16531.949220	16588.689450	16531.949220	16555.660160	16555.660160	4.199270e+09

13948 rows × 8 columns

df_nya.shape

(13948, 8)

df_nya["High"]

        528.690002
        527.210022
        527.840027
        531.119995
        532.070007
             ...     
  16508.519530
  16525.810550
  16466.339840
  16546.359380
  16588.689450
Name: High, Length: 13948, dtype: float64

df_nya.High

        528.690002
        527.210022
        527.840027
        531.119995
        532.070007
             ...     
  16508.519530
  16525.810550
  16466.339840
  16546.359380
  16588.689450
Name: High, Length: 13948, dtype: float64

df_nya.columns

Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

This notation df_nya.High (which could be called “attribute syntax”) can be very convenient since it’s shorter than typing df_nya["High"], but it doesn’t always work. For example, it doesn’t work if there is a space in the column name or if the column name begins with a number.

df_nya.Adj Close

  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3122134759.py", line 1
    df_nya.Adj Close
               ^
SyntaxError: invalid syntax

df_nya["Adj Close"]

        528.690002
        527.210022
        527.840027
        531.119995
        532.070007
             ...     
  16464.689450
  16390.189450
  16451.960940
  16531.949220
  16555.660160
Name: Adj Close, Length: 13948, dtype: float64

df_nya["High"] - df_nya["Low"]

        0.00000
        0.00000
        0.00000
        0.00000
        0.00000
           ...    
  133.51953
  150.66016
   78.01953
   94.39844
   56.74023
Length: 13948, dtype: float64

(df_nya["High"] - df_nya["Low"])/df_nya["Open"]

      0.000000
      0.000000
      0.000000
      0.000000
      0.000000
           ...   
  0.008154
  0.009151
  0.004760
  0.005738
  0.003432
Length: 13948, dtype: float64

(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05

      False
      False
      False
      False
      False
         ...  
  False
  False
  False
  False
  False
Length: 13948, dtype: bool

It’s worth staring at this next line until you understand what it’s saying. The part inside the square brackets is telling pandas which of the rows to keep.

df_nya[(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05]

	Name	Date	Open	High	Low	Close	Adj Close	Volume
10586	NYA	1/23/08	8661.059570	8806.099609	8343.620117	8805.679688	8805.679688	3.241600e+08
10749	NYA	9/15/08	8091.569824	8091.569824	7680.149902	7680.149902	7680.149902	8.279510e+09
10752	NYA	9/18/08	7440.390137	7806.410156	7318.839844	7775.160156	7775.160156	1.008269e+10
10753	NYA	9/19/08	7775.169922	8201.240234	7775.169922	8187.129883	8187.129883	9.387170e+09
10759	NYA	9/29/08	7890.370117	7890.370117	7167.770020	7204.009766	7204.009766	7.305060e+09
...	...	...	...	...	...	...	...	...
13644	NYA	3/17/20	9726.179688	10154.160160	9484.650391	10063.360350	10063.360350	8.358500e+09
13645	NYA	3/18/20	9562.490234	9666.570313	8953.599609	9384.599609	9384.599609	8.755780e+09
13646	NYA	3/19/20	9313.759766	9622.209961	9048.360352	9461.309570	9461.309570	7.946710e+09
13647	NYA	3/20/20	9577.900391	9676.940430	9108.150391	9133.160156	9133.160156	9.044690e+09
13650	NYA	3/25/20	9749.799805	10303.900390	9591.740234	9961.379883	9961.379883	8.285670e+09

66 rows × 8 columns

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	e	a
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	10	2	5	8
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11
d	12	13	14	15
e	16	17	18	19
f	20	21	22	23
g	24	25	26	27
h	28	29	30	31
i	32	33	34	35
j	36	37	38	39

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	e	a
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	10	2	5	8
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11
d	12	13	14	15
e	16	17	18	19
f	20	21	22	23
g	24	25	26	27
h	28	29	30	31
i	32	33	34	35
j	36	37	38	39

UC Irvine Math 10

Introduction to pandas¶

Dictionaries¶

pandas¶

	0	1	2	3
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	e	a
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15
4	16	17	18	19
5	20	21	22	23
6	24	25	26	27
7	28	29	30	31
8	32	33	34	35
9	36	37	38	39

	10	2	5	8
a	0	1	2	3
b	4	5	6	7
c	8	9	10	11
d	12	13	14	15
e	16	17	18	19
f	20	21	22	23
g	24	25	26	27
h	28	29	30	31
i	32	33	34	35
j	36	37	38	39