Introduction to pandas

Making our own pandas DataFrame

Usually we will import data (for example, from a csv file or from a Python library), but here we make our own data. It starts out as a list of lists.

sample_data = [[2,3],[5,1],[10,1]]
type(sample_data)
list

Counting in Python starts at 0.

sample_data[0]
[2, 3]
sample_data[1]
[5, 1]
sample_data[1][0]
5

Here we import the pandas library and give it the abbreviation pd. In theory, we could use any abbreviation we want (or not use an abbreviation at all), but pd is the convention and we will always follow it.

import pandas as pd

Here we convert sample_data into a pandas DataFrame.

pd.DataFrame(sample_data)
    0   1
0   2   3
1   5   1
2  10   1

Notice how, by default, the rows are named 0, 1, 2 and the columns are named 0, 1. Here we specify labels for the rows and the columns.

df = pd.DataFrame(sample_data, columns = ["col0","col1"], index = ["a","b","c"])
df
   col0  col1
a     2     3
b     5     1
c    10     1
type(df)
pandas.core.frame.DataFrame

Here is an example of getting the column named “col0”.

df["col0"]
a     2
b     5
c    10
Name: col0, dtype: int64
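
A single column of a DataFrame is not itself a DataFrame; if we check its type, we should see that it is a pandas Series.

type(df["col0"])
pandas.core.series.Series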

Here is an example of getting the entry in the “b” row and the “col0” column.

df.loc["b","col0"]
5

Above we used sample_data[1][0] to get this same value. We can do something similar in pandas if we use iloc (integer location) instead of loc.

df.iloc[1,0]
5
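
One more difference worth knowing: slices with loc include both endpoints, while slices with iloc exclude the stop position (like ordinary Python slicing). Each of the following should give the same two-row Series, containing the rows labeled "a" and "b".

df.loc["a":"b", "col0"]   # by label; both endpoints are included
df.iloc[0:2, 0]           # by position; the stop position 2 is excluded
a    2
b    5
Name: col0, dtype: int64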

Importing data from a csv file

Here is the more typical way to access data, in this case by importing data from a csv file.

df = pd.read_csv("../data/indexData.csv")

This is a much bigger DataFrame than what we worked with above. It has 112,457 rows and 8 columns.

df.shape
(112457, 8)
df.iloc[10040,3]
7601.709961
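
The column at position 3 is named "High" (we will see the column names below), and since the default row labels produced by read_csv match the positions, we should be able to get the same value using loc with the column name.

df.loc[10040, "High"]
7601.709961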

Here we get the row at position 1 (remember that we start counting from 0). As will often be the case, the default label for this row is the same as its position, so in this case we can use either loc or iloc.

df.iloc[1]
Name                NYA
Date             1/3/66
Open         527.210022
High         527.210022
Low          527.210022
Close        527.210022
Adj Close    527.210022
Volume              0.0
Name: 1, dtype: object
df.loc[1]
Name                NYA
Date             1/3/66
Open         527.210022
High         527.210022
Low          527.210022
Close        527.210022
Adj Close    527.210022
Volume              0.0
Name: 1, dtype: object

Here are the first 5 rows of the DataFrame.

df.head()
  Name      Date        Open        High         Low       Close   Adj Close  Volume
0  NYA  12/31/65  528.690002  528.690002  528.690002  528.690002  528.690002     0.0
1  NYA    1/3/66  527.210022  527.210022  527.210022  527.210022  527.210022     0.0
2  NYA    1/4/66  527.840027  527.840027  527.840027  527.840027  527.840027     0.0
3  NYA    1/5/66  531.119995  531.119995  531.119995  531.119995  531.119995     0.0
4  NYA    1/6/66  532.070007  532.070007  532.070007  532.070007  532.070007     0.0
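
Both head and tail take an optional number of rows. For example, the following should show just the first two rows above and the last five rows of the DataFrame, respectively.

df.head(2)   # just the first two rows
df.tail()    # the last five rows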

Here are a bunch of different ways to get the column labelled “High”.

df["High"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
df.loc[:,"High"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64

We can’t use iloc with “High” since “High” is not an integer.

df.iloc[:,"High"]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    753             try:
--> 754                 self._validate_key(k, i)
    755             except ValueError as err:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1425         else:
-> 1426             raise ValueError(f"Can only index by location with a [{self._valid_types}]")
   1427 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_57194/2234844422.py in <module>
----> 1 df.iloc[:,"High"]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1504     def _getitem_tuple(self, tup: tuple):
   1505 
-> 1506         self._has_valid_tuple(tup)
   1507         with suppress(IndexingError):
   1508             return self._getitem_lowerdim(tup)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    757                     "Location based indexing can only have "
    758                     f"[{self._valid_types}] types"
--> 759                 ) from err
    760 
    761     def _is_nested_tuple_indexer(self, tup: tuple) -> bool:

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
df.iloc[:,3]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
df.High
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
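
The last version, attribute access like df.High, only works when the column name is a valid Python identifier. For example, the "Adj Close" column contains a space, so we would have to use the square brackets there.

df["Adj Close"]   # brackets are required here; df.Adj Close would be a syntax error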

Preview of Boolean indexing

First we find all the different values in the “Name” column.

df["Name"].unique()
array(['NYA', 'IXIC', 'HSI', '000001.SS', 'GSPTSE', '399001.SZ', 'NSEI',
       'GDAXI', 'KS11', 'SSMI', 'TWII', 'J203.JO', 'N225', 'N100'],
      dtype=object)
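
A related method, value_counts, should tell us how many rows there are for each of these names.

df["Name"].value_counts()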

The following is an example of what’s called a Boolean Series.

df["Name"] == "NSEI"
0         False
1         False
2         False
3         False
4         False
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Name, Length: 112457, dtype: bool
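
Since True counts as 1 and False counts as 0, summing this Boolean Series tells us how many rows have "NSEI" in the "Name" column; it should match the 3381 rows we find below.

(df["Name"] == "NSEI").sum()
3381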

If we plug that Boolean Series into the square brackets, as in df[...], then we keep only the rows where the value is True. In this case, we are keeping only the rows where the “Name” value is “NSEI”.

# boolean indexing
df[df["Name"] == "NSEI"]
       Name     Date          Open          High           Low         Close     Adj Close    Volume
58055  NSEI  9/17/07   4518.450195   4549.049805   4482.850098   4494.649902   4494.649902       0.0
58056  NSEI  9/18/07   4494.100098   4551.799805   4481.549805   4546.200195   4546.200195       0.0
58057  NSEI  9/19/07   4550.250000   4739.000000   4550.250000   4732.350098   4732.350098       0.0
58058  NSEI  9/20/07   4734.850098   4760.850098   4721.149902   4747.549805   4747.549805       0.0
58059  NSEI  9/21/07   4752.950195   4855.700195   4733.700195   4837.549805   4837.549805       0.0
...     ...      ...           ...           ...           ...           ...           ...       ...
61431  NSEI  5/25/21  15291.750000  15293.849610  15163.400390  15208.450200  15208.450200  441500.0
61432  NSEI  5/26/21  15257.049810  15319.900390  15194.950200  15301.450200  15301.450200  384000.0
61433  NSEI  5/27/21  15323.950200  15384.549810  15272.500000  15337.849610  15337.849610       0.0
61434  NSEI  5/28/21  15421.200200  15469.650390  15394.750000  15435.650390  15435.650390  468100.0
61435  NSEI  5/31/21  15437.750000  15606.349610  15374.000000  15582.799810  15582.799810  436000.0

3381 rows × 8 columns
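
Notice that the rows in this smaller DataFrame keep their original labels (58055, 58056, and so on), so the labels no longer match the positions. For example, if we save the result (the name df_sub below is just for illustration), its first row can be reached either by position with iloc or by its label with loc, while df_sub.loc[0] would now raise a KeyError.

df_sub = df[df["Name"] == "NSEI"]   # df_sub is an illustrative name for the filtered DataFrame
df_sub.iloc[0]      # the first row of the smaller DataFrame, by position
df_sub.loc[58055]   # the same row, by its label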