Reading a csv file

pandas is probably the most important library for Math 10.

Exploring the data

Here are some first things you might try when importing a new dataset.

import pandas as pd

On my personal computer, the cars.csv file is located in a different folder from this notebook. If it’s in the same folder (like it is on Deepnote), you can just type pd.read_csv("cars.csv").

df = pd.read_csv("../data/cars.csv")

Viewing the first 5 rows of the dataset. You should think of each row as corresponding to one instance or one data point.

df.head()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA

Or the first 10 rows.

df.head(10)
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA
5 ford galaxie 500 15.0 8 429.0 198.0 4341 10.0 1970-01-01 USA
6 chevrolet impala 14.0 8 454.0 220.0 4354 9.0 1970-01-01 USA
7 plymouth fury iii 14.0 8 440.0 215.0 4312 8.5 1970-01-01 USA
8 pontiac catalina 14.0 8 455.0 225.0 4425 10.0 1970-01-01 USA
9 amc ambassador dpl 15.0 8 390.0 190.0 3850 8.5 1970-01-01 USA

The number of rows and columns.

df.shape
(406, 9)

The names of the columns.

df.columns
Index(['Name', 'Miles_per_Gallon', 'Cylinders', 'Displacement', 'Horsepower',
       'Weight_in_lbs', 'Acceleration', 'Year', 'Origin'],
      dtype='object')

Some data about the numeric columns.

df.describe()
Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration
count 398.000000 406.000000 406.000000 400.000000 406.000000 406.000000
mean 23.514573 5.475369 194.779557 105.082500 2979.413793 15.519704
std 7.815984 1.712160 104.922458 38.768779 847.004328 2.803359
min 9.000000 3.000000 68.000000 46.000000 1613.000000 8.000000
25% 17.500000 4.000000 105.000000 75.750000 2226.500000 13.700000
50% 23.000000 4.000000 151.000000 95.000000 2822.500000 15.500000
75% 29.000000 8.000000 302.000000 130.000000 3618.250000 17.175000
max 46.600000 8.000000 455.000000 230.000000 5140.000000 24.800000

Some more information. You can use this next information to determine which columns have missing values.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB

The data types of the columns.

df.dtypes
Name                 object
Miles_per_Gallon    float64
Cylinders             int64
Displacement        float64
Horsepower          float64
Weight_in_lbs         int64
Acceleration        float64
Year                 object
Origin               object
dtype: object

You can see how the numeric columns are correlated with each other. These correlation values range between -1 and 1, with 1 meaning the two columns are perfectly correlated.

df.corr()
Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration
Miles_per_Gallon 1.000000 -0.775396 -0.804203 -0.778427 -0.831741 0.420289
Cylinders -0.775396 1.000000 0.951787 0.844158 0.895220 -0.522452
Displacement -0.804203 0.951787 1.000000 0.898326 0.932475 -0.557984
Horsepower -0.778427 0.844158 0.898326 1.000000 0.866586 -0.697124
Weight_in_lbs -0.831741 0.895220 0.932475 0.866586 1.000000 -0.430086
Acceleration 0.420289 -0.522452 -0.557984 -0.697124 -0.430086 1.000000

Indexing

There are many different ways to select data within a pandas DataFrame. The best way to remember them is to practice using them.

# Reminder of how df starts
df.head()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA

The entry in the 2nd row, 3rd column (remember we start counting at 0).

df.iloc[2,3]
318.0

The entry in the row with label 2 and the column with label Displacement. (Notice that the index and the label is the same for the rows; this is not uncommon.)

df.loc[2,"Displacement"]
318.0

You can get an entire row or column with the same syntax, using a colon : to represent “all rows or all columns”.

# The row at index 2.
df.iloc[2,:]
Name                plymouth satellite
Miles_per_Gallon                  18.0
Cylinders                            8
Displacement                     318.0
Horsepower                       150.0
Weight_in_lbs                     3436
Acceleration                      11.0
Year                        1970-01-01
Origin                             USA
Name: 2, dtype: object
# The column at index 3.
df.iloc[:,3]
0      307.0
1      350.0
2      318.0
3      304.0
4      302.0
       ...  
401    140.0
402     97.0
403    135.0
404    120.0
405    119.0
Name: Displacement, Length: 406, dtype: float64
# The column with label "Displacement".
df.loc[:,"Displacement"]
0      307.0
1      350.0
2      318.0
3      304.0
4      302.0
       ...  
401    140.0
402     97.0
403    135.0
404    120.0
405    119.0
Name: Displacement, Length: 406, dtype: float64

There is an abbreviation for getting a certain column, using its label.

df["Displacement"]
0      307.0
1      350.0
2      318.0
3      304.0
4      302.0
       ...  
401    140.0
402     97.0
403    135.0
404    120.0
405    119.0
Name: Displacement, Length: 406, dtype: float64

The next abbreviation does not always work, but can be a further shortcut. It is called “attribute” access. The subtleties of attribute access won’t be important for us in Math 10; you can read about those subtleties in the pandas documentation.

df.Displacement
0      307.0
1      350.0
2      318.0
3      304.0
4      302.0
       ...  
401    140.0
402     97.0
403    135.0
404    120.0
405    119.0
Name: Displacement, Length: 406, dtype: float64

Missing values

An important concept (especially when working with real-world datasets) is the concept of missing data. This particular dataset has missing values in the Miles_per_Gallon column and in the Horsepower column. In this DataFrame, the missing data is denoted by the NumPy object np.nan which stands for “not a number”.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB

We create a Boolean DataFrame using the method isna(). This DataFrame will be True where there are null values.

df.isna()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
401 False False False False False False False False False
402 False False False False False False False False False
403 False False False False False False False False False
404 False False False False False False False False False
405 False False False False False False False False False

406 rows × 9 columns

Let’s find where the np.nan values are in the Miles_per_Gallon column.

df[df["Miles_per_Gallon"].isna()]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
10 citroen ds-21 pallas NaN 4 133.0 115.0 3090 17.5 1970-01-01 Europe
11 chevrolet chevelle concours (sw) NaN 8 350.0 165.0 4142 11.5 1970-01-01 USA
12 ford torino (sw) NaN 8 351.0 153.0 4034 11.0 1970-01-01 USA
13 plymouth satellite (sw) NaN 8 383.0 175.0 4166 10.5 1970-01-01 USA
14 amc rebel sst (sw) NaN 8 360.0 175.0 3850 11.0 1970-01-01 USA
17 ford mustang boss 302 NaN 8 302.0 140.0 3353 8.0 1970-01-01 USA
39 volkswagen super beetle 117 NaN 4 97.0 48.0 1978 20.0 1971-01-01 Europe
367 saab 900s NaN 4 121.0 110.0 2800 15.4 1982-01-01 Europe

The same thing for the Horsepower column.

df[df["Horsepower"].isna()]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
38 ford pinto 25.0 4 98.0 NaN 2046 19.0 1971-01-01 USA
133 ford maverick 21.0 6 200.0 NaN 2875 17.0 1974-01-01 USA
337 renault lecar deluxe 40.9 4 85.0 NaN 1835 17.3 1980-01-01 Europe
343 ford mustang cobra 23.6 4 140.0 NaN 2905 14.3 1980-01-01 USA
361 renault 18i 34.5 4 100.0 NaN 2320 15.8 1982-01-01 Europe
382 amc concord dl 23.0 4 151.0 NaN 3035 20.5 1982-01-01 USA

In Python, logical or is usually spelled out.

True or True
True
True or False
True
False or False
False

The equivalent of or in pandas is denoted with a vertical line |, which is sometimes called “pipe”.

df[df["Miles_per_Gallon"].isna() | df["Horsepower"].isna()]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
10 citroen ds-21 pallas NaN 4 133.0 115.0 3090 17.5 1970-01-01 Europe
11 chevrolet chevelle concours (sw) NaN 8 350.0 165.0 4142 11.5 1970-01-01 USA
12 ford torino (sw) NaN 8 351.0 153.0 4034 11.0 1970-01-01 USA
13 plymouth satellite (sw) NaN 8 383.0 175.0 4166 10.5 1970-01-01 USA
14 amc rebel sst (sw) NaN 8 360.0 175.0 3850 11.0 1970-01-01 USA
17 ford mustang boss 302 NaN 8 302.0 140.0 3353 8.0 1970-01-01 USA
38 ford pinto 25.0 4 98.0 NaN 2046 19.0 1971-01-01 USA
39 volkswagen super beetle 117 NaN 4 97.0 48.0 1978 20.0 1971-01-01 Europe
133 ford maverick 21.0 6 200.0 NaN 2875 17.0 1974-01-01 USA
337 renault lecar deluxe 40.9 4 85.0 NaN 1835 17.3 1980-01-01 Europe
343 ford mustang cobra 23.6 4 140.0 NaN 2905 14.3 1980-01-01 USA
361 renault 18i 34.5 4 100.0 NaN 2320 15.8 1982-01-01 Europe
367 saab 900s NaN 4 121.0 110.0 2800 15.4 1982-01-01 Europe
382 amc concord dl 23.0 4 151.0 NaN 3035 20.5 1982-01-01 USA

A fancier and more robust method is to use any. In this example, axis = 1 is saying, look one row at a time. So df.isna().any(axis=1) is asking if there are any missing values in the entire row.

df[df.isna().any(axis=1)]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
10 citroen ds-21 pallas NaN 4 133.0 115.0 3090 17.5 1970-01-01 Europe
11 chevrolet chevelle concours (sw) NaN 8 350.0 165.0 4142 11.5 1970-01-01 USA
12 ford torino (sw) NaN 8 351.0 153.0 4034 11.0 1970-01-01 USA
13 plymouth satellite (sw) NaN 8 383.0 175.0 4166 10.5 1970-01-01 USA
14 amc rebel sst (sw) NaN 8 360.0 175.0 3850 11.0 1970-01-01 USA
17 ford mustang boss 302 NaN 8 302.0 140.0 3353 8.0 1970-01-01 USA
38 ford pinto 25.0 4 98.0 NaN 2046 19.0 1971-01-01 USA
39 volkswagen super beetle 117 NaN 4 97.0 48.0 1978 20.0 1971-01-01 Europe
133 ford maverick 21.0 6 200.0 NaN 2875 17.0 1974-01-01 USA
337 renault lecar deluxe 40.9 4 85.0 NaN 1835 17.3 1980-01-01 Europe
343 ford mustang cobra 23.6 4 140.0 NaN 2905 14.3 1980-01-01 USA
361 renault 18i 34.5 4 100.0 NaN 2320 15.8 1982-01-01 Europe
367 saab 900s NaN 4 121.0 110.0 2800 15.4 1982-01-01 Europe
382 amc concord dl 23.0 4 151.0 NaN 3035 20.5 1982-01-01 USA

If we were to instead use axis=0, it would ask if there were any missing values in the entire column.

df.isna().any(axis=0)
Name                False
Miles_per_Gallon     True
Cylinders           False
Displacement        False
Horsepower           True
Weight_in_lbs       False
Acceleration        False
Year                False
Origin              False
dtype: bool