Introduction to pandas¶
Download the csv file “indexData.csv” from Files->Datasets on Canvas. Save the file somewhere you can access on the computer. This “indexData.csv” comes from here, but I changed the name of the first column so you should use the version on Canvas, not the version on Kaggle.
Goals for today:
Introduce dictionaries, an important data type.
Introduce the library pandas (for working with spreadsheet-like data)
Similarity: both of these involve “named” data entries.
Dictionaries¶
The main goal of this notebook is to introduce the library pandas, which is arguably the most important library we will work with in Math 10. There are two fundamental data types defined by pandas, Series
and DataFrame
. The Series
data type in particular has many similarities to dict
, for dictionary, which is an important and standard data type in Python. So before talking about pandas, we introduce this dictionary data type.
roles = {"chris": "teacher", "yasmeen": "TA",
"xingshuo": "LA", "chupeng": "LA"}
type(roles)
dict
roles["yasmeen"]
'TA'
roles[xingshuo]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3930756786.py in <module>
----> 1 roles[xingshuo]
NameError: name 'xingshuo' is not defined
c = "xingshuo"
roles[c]
'LA'
roles[(3,1)] = 17
roles[[2,5]] = 10
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/2435546372.py in <module>
----> 1 roles[[2,5]] = 10
TypeError: unhashable type: 'list'
roles
{'chris': 'teacher',
'yasmeen': 'TA',
'xingshuo': 'LA',
'chupeng': 'LA',
(3, 1): 17}
roles['chris'] = 'professor'
# Notice that there cannot be repeated keys: there is only one "chris" key
roles
{'chris': 'professor',
'yasmeen': 'TA',
'xingshuo': 'LA',
'chupeng': 'LA',
(3, 1): 17}
my_list = [3,1,5]
my_list[2] = 10
my_list
[3, 1, 10]
my_list[7] = 0
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/46238093.py in <module>
----> 1 my_list[7] = 0
IndexError: list assignment index out of range
pandas¶
import pandas as pd
import numpy as np
A = np.arange(40).reshape((10,4))
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31],
[32, 33, 34, 35],
[36, 37, 38, 39]])
A.shape
(10, 4)
np.arange(40).reshape((-1,5))
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24],
[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39]])
np.arange(40).reshape((-1,3))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/4274071939.py in <module>
----> 1 np.arange(40).reshape((-1,3))
ValueError: cannot reshape array of size 40 into shape (3)
A
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23],
[24, 25, 26, 27],
[28, 29, 30, 31],
[32, 33, 34, 35],
[36, 37, 38, 39]])
list(A)
[array([0, 1, 2, 3]),
array([4, 5, 6, 7]),
array([ 8, 9, 10, 11]),
array([12, 13, 14, 15]),
array([16, 17, 18, 19]),
array([20, 21, 22, 23]),
array([24, 25, 26, 27]),
array([28, 29, 30, 31]),
array([32, 33, 34, 35]),
array([36, 37, 38, 39])]
df = pd.DataFrame(A)
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
5 | 20 | 21 | 22 | 23 |
6 | 24 | 25 | 26 | 27 |
7 | 28 | 29 | 30 | 31 |
8 | 32 | 33 | 34 | 35 |
9 | 36 | 37 | 38 | 39 |
type(df)
pandas.core.frame.DataFrame
df = pd.DataFrame(A,columns=["a","b","e","a"])
df
a | b | e | a | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
5 | 20 | 21 | 22 | 23 |
6 | 24 | 25 | 26 | 27 |
7 | 28 | 29 | 30 | 31 |
8 | 32 | 33 | 34 | 35 |
9 | 36 | 37 | 38 | 39 |
list("hi there chris davis")
['h',
'i',
' ',
't',
'h',
'e',
'r',
'e',
' ',
'c',
'h',
'r',
'i',
's',
' ',
'd',
'a',
'v',
'i',
's']
df = pd.DataFrame(A,columns=list("hi there chris davis"))
df
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3021520278.py in <module>
----> 1 df = pd.DataFrame(A,columns=list("hi there chris davis"))
2 df
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
670 )
671 else:
--> 672 mgr = ndarray_to_mgr(
673 data,
674 index,
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in ndarray_to_mgr(values, index, columns, dtype, copy, typ)
322 )
323
--> 324 _check_values_indices_shape_match(values, index, columns)
325
326 if typ == "array":
~/opt/anaconda3/envs/book/lib/python3.8/site-packages/pandas/core/internals/construction.py in _check_values_indices_shape_match(values, index, columns)
391 passed = values.shape
392 implied = (len(index), len(columns))
--> 393 raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
394
395
ValueError: Shape of passed values is (10, 4), indices imply (10, 20)
df = pd.DataFrame(A,columns=list("abcd"))
df
a | b | c | d | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
5 | 20 | 21 | 22 | 23 |
6 | 24 | 25 | 26 | 27 |
7 | 28 | 29 | 30 | 31 |
8 | 32 | 33 | 34 | 35 |
9 | 36 | 37 | 38 | 39 |
df.index = list("abcdefghij")
df.columns = [10,2,5,8]
df
10 | 2 | 5 | 8 | |
---|---|---|---|---|
a | 0 | 1 | 2 | 3 |
b | 4 | 5 | 6 | 7 |
c | 8 | 9 | 10 | 11 |
d | 12 | 13 | 14 | 15 |
e | 16 | 17 | 18 | 19 |
f | 20 | 21 | 22 | 23 |
g | 24 | 25 | 26 | 27 |
h | 28 | 29 | 30 | 31 |
i | 32 | 33 | 34 | 35 |
j | 36 | 37 | 38 | 39 |
s = df[8]
s
a 3
b 7
c 11
d 15
e 19
f 23
g 27
h 31
i 35
j 39
Name: 8, dtype: int64
type(s)
pandas.core.series.Series
s["c"]
11
dict(s)
{'a': 3,
'b': 7,
'c': 11,
'd': 15,
'e': 19,
'f': 23,
'g': 27,
'h': 31,
'i': 35,
'j': 39}
# Your path will be different!
df = pd.read_csv("../data/indexData.csv")
df
Name | Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | NYA | 12/31/65 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 0.0 |
1 | NYA | 1/3/66 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 0.0 |
2 | NYA | 1/4/66 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 0.0 |
3 | NYA | 1/5/66 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 0.0 |
4 | NYA | 1/6/66 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
112452 | N100 | 5/27/21 | 1241.119995 | 1251.910034 | 1241.119995 | 1247.069946 | 1247.069946 | 379696400.0 |
112453 | N100 | 5/28/21 | 1249.469971 | 1259.209961 | 1249.030029 | 1256.599976 | 1256.599976 | 160773400.0 |
112454 | N100 | 5/31/21 | 1256.079956 | 1258.880005 | 1248.140015 | 1248.930054 | 1248.930054 | 91173700.0 |
112455 | N100 | 6/1/21 | 1254.609985 | 1265.660034 | 1254.609985 | 1258.579956 | 1258.579956 | 155179900.0 |
112456 | N100 | 6/2/21 | 1258.489990 | 1263.709961 | 1258.239990 | 1263.619995 | 1263.619995 | 148465000.0 |
112457 rows × 8 columns
Let’s find how many days was the high - low for NYSE bigger than 5% of the open value
# restrict to the NYA rows
df["Name"] == "NYA"
0 True
1 True
2 True
3 True
4 True
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Name, Length: 112457, dtype: bool
# syntax: dataframe [ Boolean series]
df_nya = df[df["Name"] == "NYA"]
df_nya
Name | Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | NYA | 12/31/65 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 0.000000e+00 |
1 | NYA | 1/3/66 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 0.000000e+00 |
2 | NYA | 1/4/66 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 0.000000e+00 |
3 | NYA | 1/5/66 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 0.000000e+00 |
4 | NYA | 1/6/66 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 0.000000e+00 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13943 | NYA | 5/24/21 | 16375.000000 | 16508.519530 | 16375.000000 | 16464.689450 | 16464.689450 | 2.947400e+09 |
13944 | NYA | 5/25/21 | 16464.689450 | 16525.810550 | 16375.150390 | 16390.189450 | 16390.189450 | 3.420870e+09 |
13945 | NYA | 5/26/21 | 16390.189450 | 16466.339840 | 16388.320310 | 16451.960940 | 16451.960940 | 3.674490e+09 |
13946 | NYA | 5/27/21 | 16451.960940 | 16546.359380 | 16451.960940 | 16531.949220 | 16531.949220 | 5.201110e+09 |
13947 | NYA | 5/28/21 | 16531.949220 | 16588.689450 | 16531.949220 | 16555.660160 | 16555.660160 | 4.199270e+09 |
13948 rows × 8 columns
df_nya.shape
(13948, 8)
df_nya["High"]
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
13943 16508.519530
13944 16525.810550
13945 16466.339840
13946 16546.359380
13947 16588.689450
Name: High, Length: 13948, dtype: float64
df_nya.High
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
13943 16508.519530
13944 16525.810550
13945 16466.339840
13946 16546.359380
13947 16588.689450
Name: High, Length: 13948, dtype: float64
df_nya.columns
Index(['Name', 'Date', 'Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'], dtype='object')
This notation df_nya.High
(which could be called “attribute syntax”) can be very convenient since it’s shorter than typing df_nya["High"]
, but it doesn’t always work. For example, it doesn’t work if there is a space in the column name or if the column name begins with a number.
df_nya.Adj Close
File "/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_83255/3122134759.py", line 1
df_nya.Adj Close
^
SyntaxError: invalid syntax
df_nya["Adj Close"]
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
13943 16464.689450
13944 16390.189450
13945 16451.960940
13946 16531.949220
13947 16555.660160
Name: Adj Close, Length: 13948, dtype: float64
df_nya["High"] - df_nya["Low"]
0 0.00000
1 0.00000
2 0.00000
3 0.00000
4 0.00000
...
13943 133.51953
13944 150.66016
13945 78.01953
13946 94.39844
13947 56.74023
Length: 13948, dtype: float64
(df_nya["High"] - df_nya["Low"])/df_nya["Open"]
0 0.000000
1 0.000000
2 0.000000
3 0.000000
4 0.000000
...
13943 0.008154
13944 0.009151
13945 0.004760
13946 0.005738
13947 0.003432
Length: 13948, dtype: float64
(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05
0 False
1 False
2 False
3 False
4 False
...
13943 False
13944 False
13945 False
13946 False
13947 False
Length: 13948, dtype: bool
It’s worth staring at this next line until you understand what it’s saying. The part inside the square brackets is telling pandas which of the rows to keep.
df_nya[(df_nya["High"] - df_nya["Low"])/df_nya["Open"] > 0.05]
Name | Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|---|
10586 | NYA | 1/23/08 | 8661.059570 | 8806.099609 | 8343.620117 | 8805.679688 | 8805.679688 | 3.241600e+08 |
10749 | NYA | 9/15/08 | 8091.569824 | 8091.569824 | 7680.149902 | 7680.149902 | 7680.149902 | 8.279510e+09 |
10752 | NYA | 9/18/08 | 7440.390137 | 7806.410156 | 7318.839844 | 7775.160156 | 7775.160156 | 1.008269e+10 |
10753 | NYA | 9/19/08 | 7775.169922 | 8201.240234 | 7775.169922 | 8187.129883 | 8187.129883 | 9.387170e+09 |
10759 | NYA | 9/29/08 | 7890.370117 | 7890.370117 | 7167.770020 | 7204.009766 | 7204.009766 | 7.305060e+09 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
13644 | NYA | 3/17/20 | 9726.179688 | 10154.160160 | 9484.650391 | 10063.360350 | 10063.360350 | 8.358500e+09 |
13645 | NYA | 3/18/20 | 9562.490234 | 9666.570313 | 8953.599609 | 9384.599609 | 9384.599609 | 8.755780e+09 |
13646 | NYA | 3/19/20 | 9313.759766 | 9622.209961 | 9048.360352 | 9461.309570 | 9461.309570 | 7.946710e+09 |
13647 | NYA | 3/20/20 | 9577.900391 | 9676.940430 | 9108.150391 | 9133.160156 | 9133.160156 | 9.044690e+09 |
13650 | NYA | 3/25/20 | 9749.799805 | 10303.900390 | 9591.740234 | 9961.379883 | 9961.379883 | 8.285670e+09 |
66 rows × 8 columns