Introduction to pandas

Making our own pandas DataFrame

Usually we will import data (for example, from a csv file or from a Python library), but here we make our own data. It starts out as a list of lists.

sample_data = [[2,3],[5,1],[10,1]]
type(sample_data)
list

Counting in Python starts at 0.

sample_data[0]
[2, 3]
sample_data[1]
[5, 1]
sample_data[1][0]
5

Here we import the pandas library and give it the abbreviation pd. In theory, we could use any abbreviation we want (or not use an abbreviation at all), but pd is the convention and we will always follow it.

import pandas as pd

Here we convert sample_data into a pandas DataFrame.

pd.DataFrame(sample_data)
    0   1
0   2   3
1   5   1
2  10   1

Notice how, by default, the rows are named 0, 1, 2 and the columns are named 0, 1. Here we specify labels for the rows and the columns.

df = pd.DataFrame(sample_data, columns = ["col0","col1"], index = ["a","b","c"])
df
   col0  col1
a     2     3
b     5     1
c    10     1
type(df)
pandas.core.frame.DataFrame

Here is an example of getting the column named “col0”.

df["col0"]
a     2
b     5
c    10
Name: col0, dtype: int64
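
A single column of a DataFrame is not itself a DataFrame; if we check its type, we should see that it is a pandas Series.

type(df["col0"])
pandas.core.series.Series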

Here is an example of getting the entry in the “b” row and the “col0” column.

df.loc["b","col0"]
5

Above we used sample_data[1][0] to get this same value. We can do something similar in pandas if we use iloc (integer location) instead of loc.

df.iloc[1,0]
5
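
One more difference worth knowing: slices with loc include both endpoints, while slices with iloc exclude the stop position (like ordinary Python slicing). Each of the following should give the same two-row Series, containing the rows labeled "a" and "b".

df.loc["a":"b", "col0"]   # by label; both endpoints are included
df.iloc[0:2, 0]           # by position; the stop position 2 is excluded
a    2
b    5
Name: col0, dtype: int64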

Importing data from a csv file

Here is the more typical way to access data, in this case by importing data from a csv file.

df = pd.read_csv("../data/indexData.csv")

This is a much bigger DataFrame than what we worked with above. It has 112,457 rows and 8 columns.

df.shape
(112457, 8)
df.iloc[10040,3]
7601.709961
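
The column at position 3 is named "High" (we will see the column names below), and since the default row labels produced by read_csv match the positions, we should be able to get the same value using loc with the column name.

df.loc[10040, "High"]
7601.709961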

Here we get the row at position 1 (remember that we start counting from 0). As will often be the case, the default label for this row is the same as its position, so in this case we can use either loc or iloc.

df.iloc[1]
Name                NYA
Date             1/3/66
Open         527.210022
High         527.210022
Low          527.210022
Close        527.210022
Adj Close    527.210022
Volume              0.0
Name: 1, dtype: object
df.loc[1]
Name                NYA
Date             1/3/66
Open         527.210022
High         527.210022
Low          527.210022
Close        527.210022
Adj Close    527.210022
Volume              0.0
Name: 1, dtype: object

Here are the first 5 rows of the DataFrame.

df.head()
  Name      Date        Open        High         Low       Close   Adj Close  Volume
0  NYA  12/31/65  528.690002  528.690002  528.690002  528.690002  528.690002     0.0
1  NYA    1/3/66  527.210022  527.210022  527.210022  527.210022  527.210022     0.0
2  NYA    1/4/66  527.840027  527.840027  527.840027  527.840027  527.840027     0.0
3  NYA    1/5/66  531.119995  531.119995  531.119995  531.119995  531.119995     0.0
4  NYA    1/6/66  532.070007  532.070007  532.070007  532.070007  532.070007     0.0
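
Both head and tail take an optional number of rows. For example, the following should show just the first two rows above and the last five rows of the DataFrame, respectively.

df.head(2)   # just the first two rows
df.tail()    # the last five rows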

Here are a bunch of different ways to get the column labelled “High”.

df["High"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
df.loc[:,"High"]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64

We can’t use iloc with “High” since “High” is not an integer.

df.iloc[:,"High"]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    753             try:
--> 754                 self._validate_key(k, i)
    755             except ValueError as err:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
   1425         else:
-> 1426             raise ValueError(f"Can only index by location with a [{self._valid_types}]")
   1427 

ValueError: Can only index by location with a [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array]

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_57194/2234844422.py in <module>
----> 1 df.iloc[:,"High"]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    923                 with suppress(KeyError, IndexError):
    924                     return self.obj._get_value(*key, takeable=self._takeable)
--> 925             return self._getitem_tuple(key)
    926         else:
    927             # we by definition only have the 0th axis

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
   1504     def _getitem_tuple(self, tup: tuple):
   1505 
-> 1506         self._has_valid_tuple(tup)
   1507         with suppress(IndexingError):
   1508             return self._getitem_lowerdim(tup)

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
    757                     "Location based indexing can only have "
    758                     f"[{self._valid_types}] types"
--> 759                 ) from err
    760 
    761     def _is_nested_tuple_indexer(self, tup: tuple) -> bool:

ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
df.iloc[:,3]
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
df.High
0          528.690002
1          527.210022
2          527.840027
3          531.119995
4          532.070007
             ...     
112452    1251.910034
112453    1259.209961
112454    1258.880005
112455    1265.660034
112456    1263.709961
Name: High, Length: 112457, dtype: float64
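
The last version, attribute access like df.High, only works when the column name is a valid Python identifier. For example, the "Adj Close" column contains a space, so we would have to use the square brackets there.

df["Adj Close"]   # brackets are required here; df.Adj Close would be a syntax error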

Preview of Boolean indexing

First we find all the different values in the “Name” column.

df["Name"].unique()
array(['NYA', 'IXIC', 'HSI', '000001.SS', 'GSPTSE', '399001.SZ', 'NSEI',
       'GDAXI', 'KS11', 'SSMI', 'TWII', 'J203.JO', 'N225', 'N100'],
      dtype=object)
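
A related method, value_counts, should tell us how many rows there are for each of these names.

df["Name"].value_counts()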

The following is an example of what’s called a Boolean Series.

df["Name"] == "NSEI"
0         False
1         False
2         False
3         False
4         False
          ...  
112452    False
112453    False
112454    False
112455    False
112456    False
Name: Name, Length: 112457, dtype: bool
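
Since True counts as 1 and False counts as 0, summing this Boolean Series tells us how many rows have "NSEI" in the "Name" column; it should match the 3381 rows we find below.

(df["Name"] == "NSEI").sum()
3381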

If we plug that Boolean Series into the square brackets, as in df[...], then we keep only the rows where the value is True. In this case, we are keeping only the rows where the “Name” value is “NSEI”.

# boolean indexing
df[df["Name"] == "NSEI"]
       Name     Date          Open          High           Low         Close     Adj Close    Volume
58055  NSEI  9/17/07   4518.450195   4549.049805   4482.850098   4494.649902   4494.649902       0.0
58056  NSEI  9/18/07   4494.100098   4551.799805   4481.549805   4546.200195   4546.200195       0.0
58057  NSEI  9/19/07   4550.250000   4739.000000   4550.250000   4732.350098   4732.350098       0.0
58058  NSEI  9/20/07   4734.850098   4760.850098   4721.149902   4747.549805   4747.549805       0.0
58059  NSEI  9/21/07   4752.950195   4855.700195   4733.700195   4837.549805   4837.549805       0.0
...     ...      ...           ...           ...           ...           ...           ...       ...
61431  NSEI  5/25/21  15291.750000  15293.849610  15163.400390  15208.450200  15208.450200  441500.0
61432  NSEI  5/26/21  15257.049810  15319.900390  15194.950200  15301.450200  15301.450200  384000.0
61433  NSEI  5/27/21  15323.950200  15384.549810  15272.500000  15337.849610  15337.849610       0.0
61434  NSEI  5/28/21  15421.200200  15469.650390  15394.750000  15435.650390  15435.650390  468100.0
61435  NSEI  5/31/21  15437.750000  15606.349610  15374.000000  15582.799810  15582.799810  436000.0

3381 rows × 8 columns
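
Notice that the rows in this smaller DataFrame keep their original labels (58055, 58056, and so on), so the labels no longer match the positions. For example, if we save the result (the name df_sub below is just for illustration), its first row can be reached either by position with iloc or by its label with loc, while df_sub.loc[0] would now raise a KeyError.

df_sub = df[df["Name"] == "NSEI"]   # df_sub is an illustrative name for the filtered DataFrame
df_sub.iloc[0]      # the first row of the smaller DataFrame, by position
df_sub.loc[58055]   # the same row, by its label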