Reading a csv file¶

pandas is probably the most important library for Math 10.

Exploring the data¶

Here are some first things you might try when importing a new dataset.

import pandas as pd

On my personal computer, the cars.csv file is located in a different folder from this notebook. If it’s in the same folder (like it is on Deepnote), you can just type pd.read_csv("cars.csv").

df = pd.read_csv("../data/cars.csv")

Viewing the first 5 rows of the dataset. You should think of each row as corresponding to one instance or one data point.

df.head()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA

Or the first 10 rows.

df.head(10)

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
5	ford galaxie 500	15.0	8	429.0	198.0	4341	10.0	1970-01-01	USA
6	chevrolet impala	14.0	8	454.0	220.0	4354	9.0	1970-01-01	USA
7	plymouth fury iii	14.0	8	440.0	215.0	4312	8.5	1970-01-01	USA
8	pontiac catalina	14.0	8	455.0	225.0	4425	10.0	1970-01-01	USA
9	amc ambassador dpl	15.0	8	390.0	190.0	3850	8.5	1970-01-01	USA

The number of rows and columns.

df.shape

(406, 9)

The names of the columns.

df.columns

Index(['Name', 'Miles_per_Gallon', 'Cylinders', 'Displacement', 'Horsepower',
       'Weight_in_lbs', 'Acceleration', 'Year', 'Origin'],
      dtype='object')

Some data about the numeric columns.

df.describe()

	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration
count	398.000000	406.000000	406.000000	400.000000	406.000000	406.000000
mean	23.514573	5.475369	194.779557	105.082500	2979.413793	15.519704
std	7.815984	1.712160	104.922458	38.768779	847.004328	2.803359
min	9.000000	3.000000	68.000000	46.000000	1613.000000	8.000000
25%	17.500000	4.000000	105.000000	75.750000	2226.500000	13.700000
50%	23.000000	4.000000	151.000000	95.000000	2822.500000	15.500000
75%	29.000000	8.000000	302.000000	130.000000	3618.250000	17.175000
max	46.600000	8.000000	455.000000	230.000000	5140.000000	24.800000

Some more information. You can use this next information to determine which columns have missing values.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB

The data types of the columns.

df.dtypes

Name                 object
Miles_per_Gallon    float64
Cylinders             int64
Displacement        float64
Horsepower          float64
Weight_in_lbs         int64
Acceleration        float64
Year                 object
Origin               object
dtype: object

You can see how the numeric columns are correlated with each other. These correlation values range between -1 and 1, with 1 meaning the two columns are perfectly correlated.

df.corr()

	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration
Miles_per_Gallon	1.000000	-0.775396	-0.804203	-0.778427	-0.831741	0.420289
Cylinders	-0.775396	1.000000	0.951787	0.844158	0.895220	-0.522452
Displacement	-0.804203	0.951787	1.000000	0.898326	0.932475	-0.557984
Horsepower	-0.778427	0.844158	0.898326	1.000000	0.866586	-0.697124
Weight_in_lbs	-0.831741	0.895220	0.932475	0.866586	1.000000	-0.430086
Acceleration	0.420289	-0.522452	-0.557984	-0.697124	-0.430086	1.000000

Indexing¶

There are many different ways to select data within a pandas DataFrame. The best way to remember them is to practice using them.

# Reminder of how df starts
df.head()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA

The entry in the 2nd row, 3rd column (remember we start counting at 0).

df.iloc[2,3]

318.0

The entry in the row with label 2 and the column with label Displacement. (Notice that the index and the label is the same for the rows; this is not uncommon.)

df.loc[2,"Displacement"]

318.0

You can get an entire row or column with the same syntax, using a colon : to represent “all rows or all columns”.

# The row at index 2.
df.iloc[2,:]

Name                plymouth satellite
Miles_per_Gallon                  18.0
Cylinders                            8
Displacement                     318.0
Horsepower                       150.0
Weight_in_lbs                     3436
Acceleration                      11.0
Year                        1970-01-01
Origin                             USA
Name: 2, dtype: object

# The column at index 3.
df.iloc[:,3]

    307.0
    350.0
    318.0
    304.0
    302.0
       ...  
  140.0
   97.0
  135.0
  120.0
  119.0
Name: Displacement, Length: 406, dtype: float64

# The column with label "Displacement".
df.loc[:,"Displacement"]

    307.0
    350.0
    318.0
    304.0
    302.0
       ...  
  140.0
   97.0
  135.0
  120.0
  119.0
Name: Displacement, Length: 406, dtype: float64

There is an abbreviation for getting a certain column, using its label.

df["Displacement"]

    307.0
    350.0
    318.0
    304.0
    302.0
       ...  
  140.0
   97.0
  135.0
  120.0
  119.0
Name: Displacement, Length: 406, dtype: float64

The next abbreviation does not always work, but can be a further shortcut. It is called “attribute” access. The subtleties of attribute access won’t be important for us in Math 10; you can read about those subtleties in the pandas documentation.

df.Displacement

    307.0
    350.0
    318.0
    304.0
    302.0
       ...  
  140.0
   97.0
  135.0
  120.0
  119.0
Name: Displacement, Length: 406, dtype: float64

Missing values¶

An important concept (especially when working with real-world datasets) is the concept of missing data. This particular dataset has missing values in the Miles_per_Gallon column and in the Horsepower column. In this DataFrame, the missing data is denoted by the NumPy object np.nan which stands for “not a number”.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB

We create a Boolean DataFrame using the method isna(). This DataFrame will be True where there are null values.

df.isna()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...
401	False	False	False	False	False	False	False	False	False
402	False	False	False	False	False	False	False	False	False
403	False	False	False	False	False	False	False	False	False
404	False	False	False	False	False	False	False	False	False
405	False	False	False	False	False	False	False	False	False

406 rows × 9 columns

Let’s find where the np.nan values are in the Miles_per_Gallon column.

df[df["Miles_per_Gallon"].isna()]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
10	citroen ds-21 pallas	NaN	4	133.0	115.0	3090	17.5	1970-01-01	Europe
11	chevrolet chevelle concours (sw)	NaN	8	350.0	165.0	4142	11.5	1970-01-01	USA
12	ford torino (sw)	NaN	8	351.0	153.0	4034	11.0	1970-01-01	USA
13	plymouth satellite (sw)	NaN	8	383.0	175.0	4166	10.5	1970-01-01	USA
14	amc rebel sst (sw)	NaN	8	360.0	175.0	3850	11.0	1970-01-01	USA
17	ford mustang boss 302	NaN	8	302.0	140.0	3353	8.0	1970-01-01	USA
39	volkswagen super beetle 117	NaN	4	97.0	48.0	1978	20.0	1971-01-01	Europe
367	saab 900s	NaN	4	121.0	110.0	2800	15.4	1982-01-01	Europe

The same thing for the Horsepower column.

df[df["Horsepower"].isna()]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
38	ford pinto	25.0	4	98.0	NaN	2046	19.0	1971-01-01	USA
133	ford maverick	21.0	6	200.0	NaN	2875	17.0	1974-01-01	USA
337	renault lecar deluxe	40.9	4	85.0	NaN	1835	17.3	1980-01-01	Europe
343	ford mustang cobra	23.6	4	140.0	NaN	2905	14.3	1980-01-01	USA
361	renault 18i	34.5	4	100.0	NaN	2320	15.8	1982-01-01	Europe
382	amc concord dl	23.0	4	151.0	NaN	3035	20.5	1982-01-01	USA

In Python, logical or is usually spelled out.

True or True

True

True or False

True

False or False

False

The equivalent of or in pandas is denoted with a vertical line |, which is sometimes called “pipe”.

df[df["Miles_per_Gallon"].isna() | df["Horsepower"].isna()]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
10	citroen ds-21 pallas	NaN	4	133.0	115.0	3090	17.5	1970-01-01	Europe
11	chevrolet chevelle concours (sw)	NaN	8	350.0	165.0	4142	11.5	1970-01-01	USA
12	ford torino (sw)	NaN	8	351.0	153.0	4034	11.0	1970-01-01	USA
13	plymouth satellite (sw)	NaN	8	383.0	175.0	4166	10.5	1970-01-01	USA
14	amc rebel sst (sw)	NaN	8	360.0	175.0	3850	11.0	1970-01-01	USA
17	ford mustang boss 302	NaN	8	302.0	140.0	3353	8.0	1970-01-01	USA
38	ford pinto	25.0	4	98.0	NaN	2046	19.0	1971-01-01	USA
39	volkswagen super beetle 117	NaN	4	97.0	48.0	1978	20.0	1971-01-01	Europe
133	ford maverick	21.0	6	200.0	NaN	2875	17.0	1974-01-01	USA
337	renault lecar deluxe	40.9	4	85.0	NaN	1835	17.3	1980-01-01	Europe
343	ford mustang cobra	23.6	4	140.0	NaN	2905	14.3	1980-01-01	USA
361	renault 18i	34.5	4	100.0	NaN	2320	15.8	1982-01-01	Europe
367	saab 900s	NaN	4	121.0	110.0	2800	15.4	1982-01-01	Europe
382	amc concord dl	23.0	4	151.0	NaN	3035	20.5	1982-01-01	USA

A fancier and more robust method is to use any. In this example, axis = 1 is saying, look one row at a time. So df.isna().any(axis=1) is asking if there are any missing values in the entire row.

df[df.isna().any(axis=1)]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
10	citroen ds-21 pallas	NaN	4	133.0	115.0	3090	17.5	1970-01-01	Europe
11	chevrolet chevelle concours (sw)	NaN	8	350.0	165.0	4142	11.5	1970-01-01	USA
12	ford torino (sw)	NaN	8	351.0	153.0	4034	11.0	1970-01-01	USA
13	plymouth satellite (sw)	NaN	8	383.0	175.0	4166	10.5	1970-01-01	USA
14	amc rebel sst (sw)	NaN	8	360.0	175.0	3850	11.0	1970-01-01	USA
17	ford mustang boss 302	NaN	8	302.0	140.0	3353	8.0	1970-01-01	USA
38	ford pinto	25.0	4	98.0	NaN	2046	19.0	1971-01-01	USA
39	volkswagen super beetle 117	NaN	4	97.0	48.0	1978	20.0	1971-01-01	Europe
133	ford maverick	21.0	6	200.0	NaN	2875	17.0	1974-01-01	USA
337	renault lecar deluxe	40.9	4	85.0	NaN	1835	17.3	1980-01-01	Europe
343	ford mustang cobra	23.6	4	140.0	NaN	2905	14.3	1980-01-01	USA
361	renault 18i	34.5	4	100.0	NaN	2320	15.8	1982-01-01	Europe
367	saab 900s	NaN	4	121.0	110.0	2800	15.4	1982-01-01	Europe
382	amc concord dl	23.0	4	151.0	NaN	3035	20.5	1982-01-01	USA

If we were to instead use axis=0, it would ask if there were any missing values in the entire column.

df.isna().any(axis=0)

Name                False
Miles_per_Gallon     True
Cylinders           False
Displacement        False
Horsepower           True
Weight_in_lbs       False
Acceleration        False
Year                False
Origin              False
dtype: bool

UC Irvine Math 10 W22

Reading a csv file

Contents

Reading a csv file¶

Exploring the data¶

Indexing¶

Missing values¶