Reading a csv file
Contents
Reading a csv file¶
pandas is probably the most important library for Math 10.
Exploring the data¶
Here are some first things you might try when importing a new dataset.
import pandas as pd
On my personal computer, the cars.csv
file is located in a different folder from this notebook. If it’s in the same folder (like it is on Deepnote), you can just type pd.read_csv("cars.csv")
.
df = pd.read_csv("../data/cars.csv")
Viewing the first 5 rows of the dataset. You should think of each row as corresponding to one instance or one data point.
df.head()
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
Or the first 10 rows.
df.head(10)
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
5 | ford galaxie 500 | 15.0 | 8 | 429.0 | 198.0 | 4341 | 10.0 | 1970-01-01 | USA |
6 | chevrolet impala | 14.0 | 8 | 454.0 | 220.0 | 4354 | 9.0 | 1970-01-01 | USA |
7 | plymouth fury iii | 14.0 | 8 | 440.0 | 215.0 | 4312 | 8.5 | 1970-01-01 | USA |
8 | pontiac catalina | 14.0 | 8 | 455.0 | 225.0 | 4425 | 10.0 | 1970-01-01 | USA |
9 | amc ambassador dpl | 15.0 | 8 | 390.0 | 190.0 | 3850 | 8.5 | 1970-01-01 | USA |
The number of rows and columns.
df.shape
(406, 9)
The names of the columns.
df.columns
Index(['Name', 'Miles_per_Gallon', 'Cylinders', 'Displacement', 'Horsepower',
'Weight_in_lbs', 'Acceleration', 'Year', 'Origin'],
dtype='object')
Some data about the numeric columns.
df.describe()
Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | |
---|---|---|---|---|---|---|
count | 398.000000 | 406.000000 | 406.000000 | 400.000000 | 406.000000 | 406.000000 |
mean | 23.514573 | 5.475369 | 194.779557 | 105.082500 | 2979.413793 | 15.519704 |
std | 7.815984 | 1.712160 | 104.922458 | 38.768779 | 847.004328 | 2.803359 |
min | 9.000000 | 3.000000 | 68.000000 | 46.000000 | 1613.000000 | 8.000000 |
25% | 17.500000 | 4.000000 | 105.000000 | 75.750000 | 2226.500000 | 13.700000 |
50% | 23.000000 | 4.000000 | 151.000000 | 95.000000 | 2822.500000 | 15.500000 |
75% | 29.000000 | 8.000000 | 302.000000 | 130.000000 | 3618.250000 | 17.175000 |
max | 46.600000 | 8.000000 | 455.000000 | 230.000000 | 5140.000000 | 24.800000 |
Some more information. You can use this next information to determine which columns have missing values.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 406 non-null object
1 Miles_per_Gallon 398 non-null float64
2 Cylinders 406 non-null int64
3 Displacement 406 non-null float64
4 Horsepower 400 non-null float64
5 Weight_in_lbs 406 non-null int64
6 Acceleration 406 non-null float64
7 Year 406 non-null object
8 Origin 406 non-null object
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB
The data types of the columns.
df.dtypes
Name object
Miles_per_Gallon float64
Cylinders int64
Displacement float64
Horsepower float64
Weight_in_lbs int64
Acceleration float64
Year object
Origin object
dtype: object
You can see how the numeric columns are correlated with each other. These correlation values range between -1 and 1, with 1 meaning the two columns are perfectly correlated.
df.corr()
Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | |
---|---|---|---|---|---|---|
Miles_per_Gallon | 1.000000 | -0.775396 | -0.804203 | -0.778427 | -0.831741 | 0.420289 |
Cylinders | -0.775396 | 1.000000 | 0.951787 | 0.844158 | 0.895220 | -0.522452 |
Displacement | -0.804203 | 0.951787 | 1.000000 | 0.898326 | 0.932475 | -0.557984 |
Horsepower | -0.778427 | 0.844158 | 0.898326 | 1.000000 | 0.866586 | -0.697124 |
Weight_in_lbs | -0.831741 | 0.895220 | 0.932475 | 0.866586 | 1.000000 | -0.430086 |
Acceleration | 0.420289 | -0.522452 | -0.557984 | -0.697124 | -0.430086 | 1.000000 |
Indexing¶
There are many different ways to select data within a pandas DataFrame. The best way to remember them is to practice using them.
# Reminder of how df starts
df.head()
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 1970-01-01 | USA |
1 | buick skylark 320 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 1970-01-01 | USA |
2 | plymouth satellite | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 1970-01-01 | USA |
3 | amc rebel sst | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 1970-01-01 | USA |
4 | ford torino | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 1970-01-01 | USA |
The entry in the 2nd row, 3rd column (remember we start counting at 0).
df.iloc[2,3]
318.0
The entry in the row with label 2 and the column with label Displacement. (Notice that the index and the label is the same for the rows; this is not uncommon.)
df.loc[2,"Displacement"]
318.0
You can get an entire row or column with the same syntax, using a colon :
to represent “all rows or all columns”.
# The row at index 2.
df.iloc[2,:]
Name plymouth satellite
Miles_per_Gallon 18.0
Cylinders 8
Displacement 318.0
Horsepower 150.0
Weight_in_lbs 3436
Acceleration 11.0
Year 1970-01-01
Origin USA
Name: 2, dtype: object
# The column at index 3.
df.iloc[:,3]
0 307.0
1 350.0
2 318.0
3 304.0
4 302.0
...
401 140.0
402 97.0
403 135.0
404 120.0
405 119.0
Name: Displacement, Length: 406, dtype: float64
# The column with label "Displacement".
df.loc[:,"Displacement"]
0 307.0
1 350.0
2 318.0
3 304.0
4 302.0
...
401 140.0
402 97.0
403 135.0
404 120.0
405 119.0
Name: Displacement, Length: 406, dtype: float64
There is an abbreviation for getting a certain column, using its label.
df["Displacement"]
0 307.0
1 350.0
2 318.0
3 304.0
4 302.0
...
401 140.0
402 97.0
403 135.0
404 120.0
405 119.0
Name: Displacement, Length: 406, dtype: float64
The next abbreviation does not always work, but can be a further shortcut. It is called “attribute” access. The subtleties of attribute access won’t be important for us in Math 10; you can read about those subtleties in the pandas documentation.
df.Displacement
0 307.0
1 350.0
2 318.0
3 304.0
4 302.0
...
401 140.0
402 97.0
403 135.0
404 120.0
405 119.0
Name: Displacement, Length: 406, dtype: float64
Missing values¶
An important concept (especially when working with real-world datasets) is the concept of missing data. This particular dataset has missing values in the Miles_per_Gallon
column and in the Horsepower
column. In this DataFrame, the missing data is denoted by the NumPy object np.nan
which stands for “not a number”.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 406 non-null object
1 Miles_per_Gallon 398 non-null float64
2 Cylinders 406 non-null int64
3 Displacement 406 non-null float64
4 Horsepower 400 non-null float64
5 Weight_in_lbs 406 non-null int64
6 Acceleration 406 non-null float64
7 Year 406 non-null object
8 Origin 406 non-null object
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB
We create a Boolean DataFrame using the method isna()
. This DataFrame will be True
where there are null values.
df.isna()
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
401 | False | False | False | False | False | False | False | False | False |
402 | False | False | False | False | False | False | False | False | False |
403 | False | False | False | False | False | False | False | False | False |
404 | False | False | False | False | False | False | False | False | False |
405 | False | False | False | False | False | False | False | False | False |
406 rows × 9 columns
Let’s find where the np.nan
values are in the Miles_per_Gallon
column.
df[df["Miles_per_Gallon"].isna()]
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
10 | citroen ds-21 pallas | NaN | 4 | 133.0 | 115.0 | 3090 | 17.5 | 1970-01-01 | Europe |
11 | chevrolet chevelle concours (sw) | NaN | 8 | 350.0 | 165.0 | 4142 | 11.5 | 1970-01-01 | USA |
12 | ford torino (sw) | NaN | 8 | 351.0 | 153.0 | 4034 | 11.0 | 1970-01-01 | USA |
13 | plymouth satellite (sw) | NaN | 8 | 383.0 | 175.0 | 4166 | 10.5 | 1970-01-01 | USA |
14 | amc rebel sst (sw) | NaN | 8 | 360.0 | 175.0 | 3850 | 11.0 | 1970-01-01 | USA |
17 | ford mustang boss 302 | NaN | 8 | 302.0 | 140.0 | 3353 | 8.0 | 1970-01-01 | USA |
39 | volkswagen super beetle 117 | NaN | 4 | 97.0 | 48.0 | 1978 | 20.0 | 1971-01-01 | Europe |
367 | saab 900s | NaN | 4 | 121.0 | 110.0 | 2800 | 15.4 | 1982-01-01 | Europe |
The same thing for the Horsepower
column.
df[df["Horsepower"].isna()]
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
38 | ford pinto | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 1971-01-01 | USA |
133 | ford maverick | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 1974-01-01 | USA |
337 | renault lecar deluxe | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 1980-01-01 | Europe |
343 | ford mustang cobra | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 1980-01-01 | USA |
361 | renault 18i | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 1982-01-01 | Europe |
382 | amc concord dl | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 1982-01-01 | USA |
In Python, logical or
is usually spelled out.
True or True
True
True or False
True
False or False
False
The equivalent of or
in pandas is denoted with a vertical line |
, which is sometimes called “pipe”.
df[df["Miles_per_Gallon"].isna() | df["Horsepower"].isna()]
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
10 | citroen ds-21 pallas | NaN | 4 | 133.0 | 115.0 | 3090 | 17.5 | 1970-01-01 | Europe |
11 | chevrolet chevelle concours (sw) | NaN | 8 | 350.0 | 165.0 | 4142 | 11.5 | 1970-01-01 | USA |
12 | ford torino (sw) | NaN | 8 | 351.0 | 153.0 | 4034 | 11.0 | 1970-01-01 | USA |
13 | plymouth satellite (sw) | NaN | 8 | 383.0 | 175.0 | 4166 | 10.5 | 1970-01-01 | USA |
14 | amc rebel sst (sw) | NaN | 8 | 360.0 | 175.0 | 3850 | 11.0 | 1970-01-01 | USA |
17 | ford mustang boss 302 | NaN | 8 | 302.0 | 140.0 | 3353 | 8.0 | 1970-01-01 | USA |
38 | ford pinto | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 1971-01-01 | USA |
39 | volkswagen super beetle 117 | NaN | 4 | 97.0 | 48.0 | 1978 | 20.0 | 1971-01-01 | Europe |
133 | ford maverick | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 1974-01-01 | USA |
337 | renault lecar deluxe | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 1980-01-01 | Europe |
343 | ford mustang cobra | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 1980-01-01 | USA |
361 | renault 18i | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 1982-01-01 | Europe |
367 | saab 900s | NaN | 4 | 121.0 | 110.0 | 2800 | 15.4 | 1982-01-01 | Europe |
382 | amc concord dl | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 1982-01-01 | USA |
A fancier and more robust method is to use any
. In this example, axis = 1
is saying, look one row at a time. So df.isna().any(axis=1)
is asking if there are any missing values in the entire row.
df[df.isna().any(axis=1)]
Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower | Weight_in_lbs | Acceleration | Year | Origin | |
---|---|---|---|---|---|---|---|---|---|
10 | citroen ds-21 pallas | NaN | 4 | 133.0 | 115.0 | 3090 | 17.5 | 1970-01-01 | Europe |
11 | chevrolet chevelle concours (sw) | NaN | 8 | 350.0 | 165.0 | 4142 | 11.5 | 1970-01-01 | USA |
12 | ford torino (sw) | NaN | 8 | 351.0 | 153.0 | 4034 | 11.0 | 1970-01-01 | USA |
13 | plymouth satellite (sw) | NaN | 8 | 383.0 | 175.0 | 4166 | 10.5 | 1970-01-01 | USA |
14 | amc rebel sst (sw) | NaN | 8 | 360.0 | 175.0 | 3850 | 11.0 | 1970-01-01 | USA |
17 | ford mustang boss 302 | NaN | 8 | 302.0 | 140.0 | 3353 | 8.0 | 1970-01-01 | USA |
38 | ford pinto | 25.0 | 4 | 98.0 | NaN | 2046 | 19.0 | 1971-01-01 | USA |
39 | volkswagen super beetle 117 | NaN | 4 | 97.0 | 48.0 | 1978 | 20.0 | 1971-01-01 | Europe |
133 | ford maverick | 21.0 | 6 | 200.0 | NaN | 2875 | 17.0 | 1974-01-01 | USA |
337 | renault lecar deluxe | 40.9 | 4 | 85.0 | NaN | 1835 | 17.3 | 1980-01-01 | Europe |
343 | ford mustang cobra | 23.6 | 4 | 140.0 | NaN | 2905 | 14.3 | 1980-01-01 | USA |
361 | renault 18i | 34.5 | 4 | 100.0 | NaN | 2320 | 15.8 | 1982-01-01 | Europe |
367 | saab 900s | NaN | 4 | 121.0 | 110.0 | 2800 | 15.4 | 1982-01-01 | Europe |
382 | amc concord dl | 23.0 | 4 | 151.0 | NaN | 3035 | 20.5 | 1982-01-01 | USA |
If we were to instead use axis=0
, it would ask if there were any missing values in the entire column.
df.isna().any(axis=0)
Name False
Miles_per_Gallon True
Cylinders False
Displacement False
Horsepower True
Weight_in_lbs False
Acceleration False
Year False
Origin False
dtype: bool