Introduction to pandas
Contents
Introduction to pandas¶
Making our own pandas DataFrame¶
Usually we will import data (like from a csv file or from a Python library), but here we make our own data. It starts out as a list
.
sample_data = [[2,3],[5,1],[10,1]]
type(sample_data)
list
Counting in Python starts at 0.
sample_data[0]
[2, 3]
sample_data[1]
[5, 1]
sample_data[1][0]
5
Here we import the pandas library and give it the abbreviation pd
. In theory, we could use any abbreviation we want (or not use an abbreviation), but pd
is the convention and we will always follow that convention.
import pandas as pd
Here we convert sample_data
into a pandas DataFrame.
pd.DataFrame(sample_data)
0 | 1 | |
---|---|---|
0 | 2 | 3 |
1 | 5 | 1 |
2 | 10 | 1 |
Notice how the rows are named 0,1,2 and the columns are named 0,1, by default. Here we specify labels for the rows and the columns.
df = pd.DataFrame(sample_data, columns = ["col0","col1"], index = ["a","b","c"])
df
col0 | col1 | |
---|---|---|
a | 2 | 3 |
b | 5 | 1 |
c | 10 | 1 |
type(df)
pandas.core.frame.DataFrame
Here is an example of getting the column named “col0”.
df["col0"]
a 2
b 5
c 10
Name: col0, dtype: int64
Here is an example of getting the entry in the “b” row and the “col0” column.
df.loc["b","col0"]
5
Above we used sample_data[1][0]
to get this element. We can do something similar in pandas, if we use iloc
instead of loc
.
df.iloc[1,0]
5
Importing data from a csv file¶
Here is the more typical way to access data, in this case by importing data from a csv file.
df = pd.read_csv("../data/indexData.csv")
This is a much bigger DataFrame than what we worked with above. It has 112,457 rows and 8 columns.
df.shape
(112457, 8)
df.iloc[10040,3]
7601.709961
Here is getting the row at position 1 (remember that we start counting from 0). As will often be the case, the default label for this row is the same as its position, so in this case we can use either loc
or iloc
.
df.iloc[1]
Name NYA
Date 1/3/66
Open 527.210022
High 527.210022
Low 527.210022
Close 527.210022
Adj Close 527.210022
Volume 0.0
Name: 1, dtype: object
df.loc[1]
Name NYA
Date 1/3/66
Open 527.210022
High 527.210022
Low 527.210022
Close 527.210022
Adj Close 527.210022
Volume 0.0
Name: 1, dtype: object
Here are the first 5 rows of the DataFrame.
df.head()
Name | Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|---|
0 | NYA | 12/31/65 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 528.690002 | 0.0 |
1 | NYA | 1/3/66 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 527.210022 | 0.0 |
2 | NYA | 1/4/66 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 527.840027 | 0.0 |
3 | NYA | 1/5/66 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 531.119995 | 0.0 |
4 | NYA | 1/6/66 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 532.070007 | 0.0 |
Here are a bunch of different ways to get the column labelled “High”.
df["High"]
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
112452 1251.910034
112453 1259.209961
112454 1258.880005
112455 1265.660034
112456 1263.709961
Name: High, Length: 112457, dtype: float64
df.loc[:,"High"]
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
112452 1251.910034
112453 1259.209961
112454 1258.880005
112455 1265.660034
112456 1263.709961
Name: High, Length: 112457, dtype: float64
We can’t use iloc
with “High” since “High” is not an integer.
df.iloc[:,"High"]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
753 try:
--> 754 self._validate_key(k, i)
755 except ValueError as err:
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
1425 else:
-> 1426 raise ValueError(f"Can only index by location with a [{self._valid_types}]")
1427
ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_57194/2234844422.py in <module>
----> 1 df.iloc[:,"High"]
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
923 with suppress(KeyError, IndexError):
924 return self.obj._get_value(*key, takeable=self._takeable)
--> 925 return self._getitem_tuple(key)
926 else:
927 # we by definition only have the 0th axis
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
1504 def _getitem_tuple(self, tup: tuple):
1505
-> 1506 self._has_valid_tuple(tup)
1507 with suppress(IndexingError):
1508 return self._getitem_lowerdim(tup)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
757 "Location based indexing can only have "
758 f"[{self._valid_types}] types"
--> 759 ) from err
760
761 def _is_nested_tuple_indexer(self, tup: tuple) -> bool:
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
df.iloc[:,3]
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
112452 1251.910034
112453 1259.209961
112454 1258.880005
112455 1265.660034
112456 1263.709961
Name: High, Length: 112457, dtype: float64
df.High
0 528.690002
1 527.210022
2 527.840027
3 531.119995
4 532.070007
...
112452 1251.910034
112453 1259.209961
112454 1258.880005
112455 1265.660034
112456 1263.709961
Name: High, Length: 112457, dtype: float64
Preview of Boolean indexing¶
First we find all the different values in the “Name” column.
df["Name"].unique()
array(['NYA', 'IXIC', 'HSI', '000001.SS', 'GSPTSE', '399001.SZ', 'NSEI',
'GDAXI', 'KS11', 'SSMI', 'TWII', 'J203.JO', 'N225', 'N100'],
dtype=object)
The following is an example of what’s called a Boolean Series.
df["Name"] == "NSEI"
0 False
1 False
2 False
3 False
4 False
...
112452 False
112453 False
112454 False
112455 False
112456 False
Name: Name, Length: 112457, dtype: bool
If we plug that Boolean Series into df[???]
, then we keep only the rows where the value is True
. In this case, we are keeping only the rows where the “Name” value is “NSEI”.
# boolean indexing
df[df["Name"] == "NSEI"]
Name | Date | Open | High | Low | Close | Adj Close | Volume | |
---|---|---|---|---|---|---|---|---|
58055 | NSEI | 9/17/07 | 4518.450195 | 4549.049805 | 4482.850098 | 4494.649902 | 4494.649902 | 0.0 |
58056 | NSEI | 9/18/07 | 4494.100098 | 4551.799805 | 4481.549805 | 4546.200195 | 4546.200195 | 0.0 |
58057 | NSEI | 9/19/07 | 4550.250000 | 4739.000000 | 4550.250000 | 4732.350098 | 4732.350098 | 0.0 |
58058 | NSEI | 9/20/07 | 4734.850098 | 4760.850098 | 4721.149902 | 4747.549805 | 4747.549805 | 0.0 |
58059 | NSEI | 9/21/07 | 4752.950195 | 4855.700195 | 4733.700195 | 4837.549805 | 4837.549805 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
61431 | NSEI | 5/25/21 | 15291.750000 | 15293.849610 | 15163.400390 | 15208.450200 | 15208.450200 | 441500.0 |
61432 | NSEI | 5/26/21 | 15257.049810 | 15319.900390 | 15194.950200 | 15301.450200 | 15301.450200 | 384000.0 |
61433 | NSEI | 5/27/21 | 15323.950200 | 15384.549810 | 15272.500000 | 15337.849610 | 15337.849610 | 0.0 |
61434 | NSEI | 5/28/21 | 15421.200200 | 15469.650390 | 15394.750000 | 15435.650390 | 15435.650390 | 468100.0 |
61435 | NSEI | 5/31/21 | 15437.750000 | 15606.349610 | 15374.000000 | 15582.799810 | 15582.799810 | 436000.0 |
3381 rows × 8 columns