Introduction to pandas

Download the csv file “indexData.csv” from Files->Datasets on Canvas. Save the file somewhere you can access on the computer. This “indexData.csv” comes from here, but I changed the name of the first column so you should use the version on Canvas, not the version on Kaggle.

Goals for today:

  • Introduce dictionaries, an important data type.

  • Introduce the library pandas (for working with spreadsheet-like data)

  • Similarity: both of these involve “named” data entries.

Dictionaries

The main goal of this notebook is to introduce the library pandas, which is arguably the most important library we will work with in Math 10. There are two fundamental data types defined by pandas, Series and DataFrame. The Series data type in particular has many similarities to dict, for dictionary, which is an important and standard data type in Python. So before talking about pandas, we introduce this dictionary data type.

roles = {"chris": "teacher", "yasmeen": "TA", 
         "xingshuo": "LA", "chupeng": "LA"}
type(roles)
dict
roles["yasmeen"]
'TA'
roles[xingshuo]
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3930756786.py in <module>
----> 1 roles[xingshuo]

NameError: name 'xingshuo' is not defined
c = "xingshuo"
roles[c]
'LA'
roles[(3,1)] = 17
roles[[2,5]] = 10
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/2435546372.py in <module>
----> 1 roles[[2,5]] = 10

TypeError: unhashable type: 'list'
roles
{'chris': 'teacher',
 'yasmeen': 'TA',
 'xingshuo': 'LA',
 'chupeng': 'LA',
 (3, 1): 17}
roles['chris'] = 'professor'
# Notice that there cannot be repeated keys: there is only one "chris" key
roles
{'chris': 'professor',
 'yasmeen': 'TA',
 'xingshuo': 'LA',
 'chupeng': 'LA',
 (3, 1): 17}
my_list = [3,1,5]
my_list[2] = 10
my_list
[3, 1, 10]
my_list[7] = 0
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/46238093.py in <module>
----> 1 my_list[7] = 0

IndexError: list assignment index out of range

pandas

import pandas as pd
import numpy as np
A = np.arange(40).reshape((10,4))
A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31],
       [32, 33, 34, 35],
       [36, 37, 38, 39]])
A.shape
(10, 4)
np.arange(40).reshape((-1,5))
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34],
       [35, 36, 37, 38, 39]])
np.arange(40).reshape((-1,3))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/4274071939.py in <module>
----> 1 np.arange(40).reshape((-1,3))

ValueError: cannot reshape array of size 40 into shape (3)
A
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31],
       [32, 33, 34, 35],
       [36, 37, 38, 39]])
list(A)
[array([0, 1, 2, 3]),
 array([4, 5, 6, 7]),
 array([ 8,  9, 10, 11]),
 array([12, 13, 14, 15]),
 array([16, 17, 18, 19]),
 array([20, 21, 22, 23]),
 array([24, 25, 26, 27]),
 array([28, 29, 30, 31]),
 array([32, 33, 34, 35]),
 array([36, 37, 38, 39])]
df = pd.DataFrame(A)
df
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
8 32 33 34 35
9 36 37 38 39
type(df)
pandas.core.frame.DataFrame
df = pd.DataFrame(A,columns=["a","b","e","a"])
df
a b e a
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
8 32 33 34 35
9 36 37 38 39
list("hi there chris davis")
['h',
 'i',
 ' ',
 't',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'c',
 'h',
 'r',
 'i',
 's',
 ' ',
 'd',
 'a',
 'v',
 'i',
 's']
df = pd.DataFrame(A,columns=list("hi there chris davis"))
df
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3021520278.py in <module>
----> 1 df = pd.DataFrame(A,columns=list("hi there chris davis"))
      2 df

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    670                 )
    671             else:
--> 672                 mgr = ndarray_to_mgr(
    673                     data,
    674                     index,

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
    322     )
    323 
--> 324     _check_values_indices_shape_match(values, index, columns)
    325 
    326     if typ == "array":

~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
    391         passed = values.shape
    392         implied = (len(index), len(columns))
--> 393         raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
    394 
    395 

ValueError: Shape of passed values is (10, 4), indices imply (10, 20)
df = pd.DataFrame(A,columns=list("abcd"))
df
a b c d
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
5 20 21 22 23
6 24 25 26 27
7 28 29 30 31
8 32 33 34 35
9 36 37 38 39
df.index = list("abcdefghij")
df.columns = [10,2,5,8]
df
10 2 5 8
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
e 16 17 18 19
f 20 21 22 23
g 24 25 26 27
h 28 29 30 31
i 32 33 34 35
j 36 37 38 39
s = df[8]
s
a     3
b     7
c    11
d    15
e    19
f    23
g    27
h    31
i    35
j    39
Name: 8, dtype: int64
type(s)
pandas.core.series.Series
s["c"]
11
dict(s)
{'a': 3,
 'b': 7,
 'c': 11,
 'd': 15,
 'e': 19,
 'f': 23,
 'g': 27,
 'h': 31,
 'i': 35,
 'j': 39}
# Your path will be different!
df = pd.read_csv("../data/indexData.csv")
df
Name Date Open High Low Close Adj Close Volume
0 NYA 12/31/65 528.690002 528.690002 528.690002 528.690002 528.690002 0.0
1 NYA 1/3/66 527.210022 527.210022 527.210022 527.210022 527.210022 0.0
2 NYA 1/4/66 527.840027 527.840027 527.840027 527.840027 527.840027 0.0
3 NYA 1/5/66 531.119995 531.119995 531.119995 531.119995 531.119995 0.0
4 NYA 1/6/66 532.070007 532.070007 532.070007 532.070007 532.070007 0.0
... ... ... ... ... ... ... ... ...
112452 N100 5/27/21 1241.119995 1251.910034 1241.119995 1247.069946 1247.069946 379696400.0
112453 N100 5/28/21 1249.469971 1259.209961 1249.030029 1256.599976 1256.599976 160773400.0
112454 N100 5/31/21 1256.079956 1258.880005 1248.140015 1248.930054 1248.930054 91173700.0
112455 N100 6/1/21 1254.609985 1265.660034 1254.609985 1258.579956 1258.579956 155179900.0
112456 N100 6/2/21 1258.489990 1263.709961 1258.239990 1263.619995 1263.619995 148465000.0

112457 rows × 8 columns

Let’s find how many days was the high - low for NYSE bigger than 5% of the open value

# restrict to the NYA rows
df["Name"] == "NYA"
0          True
1          True
2          True
3          True
4          True
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Name, Length: 112457, dtype: bool
# syntax: dataframe [ Boolean series]
df_nya = df[df["Name"] == "NYA"]
df_nya
Name Date Open High Low Close Adj Close Volume
0 NYA 12/31/65 528.690002 528.690002 528.690002 528.690002 528.690002 0.000000e+00
1 NYA 1/3/66 527.210022 527.210022 527.210022 527.210022 527.210022 0.000000e+00
2 NYA 1/4/66 527.840027 527.840027 527.840027 527.840027 527.840027 0.000000e+00
3 NYA 1/5/66 531.119995 531.119995 531.119995 531.119995 531.119995 0.000000e+00
4 NYA 1/6/66 532.070007 532.070007 532.070007 532.070007 532.070007 0.000000e+00
... ... ... ... ... ... ... ... ...
13943 NYA 5/24/21 16375.000000 16508.519530 16375.000000 16464.689450 16464.689450 2.947400e+09
13944 NYA 5/25/21 16464.689450 16525.810550 16375.150390 16390.189450 16390.189450 3.420870e+09
13945 NYA 5/26/21 16390.189450 16466.339840 16388.320310 16451.960940 16451.960940 3.674490e+09
13946 NYA 5/27/21 16451.960940 16546.359380 16451.960940 16531.949220 16531.949220 5.201110e+09
13947 NYA 5/28/21 16531.949220 16588.689450 16531.949220 16555.660160 16555.660160 4.199270e+09

13948 rows × 8 columns

df_nya.shape
(13948, 8)
df_nya["High"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
13943    16508.519530
13944    16525.810550
13945    16466.339840
13946    16546.359380
13947    16588.689450
Name: High, Length: 13948, dtype: float64
df_nya.High
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
13943    16508.519530
13944    16525.810550
13945    16466.339840
13946    16546.359380
13947    16588.689450
Name: High, Length: 13948, dtype: float64
df_nya.columns
Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')

This notation df_nya.High (which could be called “attribute syntax”) can be very convenient since it’s shorter than typing df_nya["High"], but it doesn’t always work. For example, it doesn’t work if there is a space in the column name or if the column name begins with a number.

df_nya.Adj Close
  File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3122134759.py", line 1
    df_nya.Adj Close
               ^
SyntaxError: invalid syntax
df_nya["Adj Close"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
13943    16464.689450
13944    16390.189450
13945    16451.960940
13946    16531.949220
13947    16555.660160
Name: Adj Close, Length: 13948, dtype: float64
df_nya["High"] - df_nya["Low"]
0          0.00000
1          0.00000
2          0.00000
3          0.00000
4          0.00000
           ...    
13943    133.51953
13944    150.66016
13945     78.01953
13946     94.39844
13947     56.74023
Length: 13948, dtype: float64
(df_nya["High"] - df_nya["Low"])/df_nya["Open"]
0        0.000000
1        0.000000
2        0.000000
3        0.000000
4        0.000000
           ...   
13943    0.008154
13944    0.009151
13945    0.004760
13946    0.005738
13947    0.003432
Length: 13948, dtype: float64
(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05
0        False
1        False
2        False
3        False
4        False
         ...  
13943    False
13944    False
13945    False
13946    False
13947    False
Length: 13948, dtype: bool

It’s worth staring at this next line until you understand what it’s saying. The part inside the square brackets is telling pandas which of the rows to keep.

df_nya[(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05]
Name Date Open High Low Close Adj Close Volume
10586 NYA 1/23/08 8661.059570 8806.099609 8343.620117 8805.679688 8805.679688 3.241600e+08
10749 NYA 9/15/08 8091.569824 8091.569824 7680.149902 7680.149902 7680.149902 8.279510e+09
10752 NYA 9/18/08 7440.390137 7806.410156 7318.839844 7775.160156 7775.160156 1.008269e+10
10753 NYA 9/19/08 7775.169922 8201.240234 7775.169922 8187.129883 8187.129883 9.387170e+09
10759 NYA 9/29/08 7890.370117 7890.370117 7167.770020 7204.009766 7204.009766 7.305060e+09
... ... ... ... ... ... ... ... ...
13644 NYA 3/17/20 9726.179688 10154.160160 9484.650391 10063.360350 10063.360350 8.358500e+09
13645 NYA 3/18/20 9562.490234 9666.570313 8953.599609 9384.599609 9384.599609 8.755780e+09
13646 NYA 3/19/20 9313.759766 9622.209961 9048.360352 9461.309570 9461.309570 7.946710e+09
13647 NYA 3/20/20 9577.900391 9676.940430 9108.150391 9133.160156 9133.160156 9.044690e+09
13650 NYA 3/25/20 9749.799805 10303.900390 9591.740234 9961.379883 9961.379883 8.285670e+09

66 rows × 8 columns