{ "cells": [ { "cell_type": "markdown", "id": "e75b6a78", "metadata": {}, "source": [ "# Reading a csv file\n", "\n", "pandas is probably the most important library for Math 10.\n", "\n", "## Exploring the data\n", "\n", "Here are some first things you might try when importing a new dataset." ] }, { "cell_type": "code", "execution_count": 2, "id": "adedba05", "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "id": "29baa6fc", "metadata": {}, "source": [ "On my personal computer, the `cars.csv` file is located in a different folder from this notebook. If it's in the same folder (like it is on Deepnote), you can just type `pd.read_csv(\"cars.csv\")`." ] }, { "cell_type": "code", "execution_count": 3, "id": "30e81e66", "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv(\"../data/cars.csv\")" ] }, { "cell_type": "markdown", "id": "a76296af", "metadata": {}, "source": [ "Viewing the first 5 rows of the dataset. You should think of each row as corresponding to one *instance* or one *data point*." ] }, { "cell_type": "code", "execution_count": 5, "id": "ba8bdc56", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0chevrolet chevelle malibu18.08307.0130.0350412.01970-01-01USA
1buick skylark 32015.08350.0165.0369311.51970-01-01USA
2plymouth satellite18.08318.0150.0343611.01970-01-01USA
3amc rebel sst16.08304.0150.0343312.01970-01-01USA
4ford torino17.08302.0140.0344910.51970-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "id": "708763f7", "metadata": {}, "source": [ "Or the first 10 rows." ] }, { "cell_type": "code", "execution_count": 6, "id": "06917891", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0chevrolet chevelle malibu18.08307.0130.0350412.01970-01-01USA
1buick skylark 32015.08350.0165.0369311.51970-01-01USA
2plymouth satellite18.08318.0150.0343611.01970-01-01USA
3amc rebel sst16.08304.0150.0343312.01970-01-01USA
4ford torino17.08302.0140.0344910.51970-01-01USA
5ford galaxie 50015.08429.0198.0434110.01970-01-01USA
6chevrolet impala14.08454.0220.043549.01970-01-01USA
7plymouth fury iii14.08440.0215.043128.51970-01-01USA
8pontiac catalina14.08455.0225.0442510.01970-01-01USA
9amc ambassador dpl15.08390.0190.038508.51970-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "5 ford galaxie 500 15.0 8 429.0 \n", "6 chevrolet impala 14.0 8 454.0 \n", "7 plymouth fury iii 14.0 8 440.0 \n", "8 pontiac catalina 14.0 8 455.0 \n", "9 amc ambassador dpl 15.0 8 390.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA \n", "5 198.0 4341 10.0 1970-01-01 USA \n", "6 220.0 4354 9.0 1970-01-01 USA \n", "7 215.0 4312 8.5 1970-01-01 USA \n", "8 225.0 4425 10.0 1970-01-01 USA \n", "9 190.0 3850 8.5 1970-01-01 USA " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(10)" ] }, { "cell_type": "markdown", "id": "84431b9a", "metadata": {}, "source": [ "The number of rows and columns." ] }, { "cell_type": "code", "execution_count": 26, "id": "abe63773", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(406, 9)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "id": "acca1bce", "metadata": {}, "source": [ "The names of the columns." ] }, { "cell_type": "code", "execution_count": 25, "id": "5aa722f1", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['Name', 'Miles_per_Gallon', 'Cylinders', 'Displacement', 'Horsepower',\n", " 'Weight_in_lbs', 'Acceleration', 'Year', 'Origin'],\n", " dtype='object')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "id": "8817be95", "metadata": {}, "source": [ "Some data about the numeric columns." ] }, { "cell_type": "code", "execution_count": 7, "id": "b369d59c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Miles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAcceleration
count398.000000406.000000406.000000400.000000406.000000406.000000
mean23.5145735.475369194.779557105.0825002979.41379315.519704
std7.8159841.712160104.92245838.768779847.0043282.803359
min9.0000003.00000068.00000046.0000001613.0000008.000000
25%17.5000004.000000105.00000075.7500002226.50000013.700000
50%23.0000004.000000151.00000095.0000002822.50000015.500000
75%29.0000008.000000302.000000130.0000003618.25000017.175000
max46.6000008.000000455.000000230.0000005140.00000024.800000
\n", "
" ], "text/plain": [ " Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs \\\n", "count 398.000000 406.000000 406.000000 400.000000 406.000000 \n", "mean 23.514573 5.475369 194.779557 105.082500 2979.413793 \n", "std 7.815984 1.712160 104.922458 38.768779 847.004328 \n", "min 9.000000 3.000000 68.000000 46.000000 1613.000000 \n", "25% 17.500000 4.000000 105.000000 75.750000 2226.500000 \n", "50% 23.000000 4.000000 151.000000 95.000000 2822.500000 \n", "75% 29.000000 8.000000 302.000000 130.000000 3618.250000 \n", "max 46.600000 8.000000 455.000000 230.000000 5140.000000 \n", "\n", " Acceleration \n", "count 406.000000 \n", "mean 15.519704 \n", "std 2.803359 \n", "min 8.000000 \n", "25% 13.700000 \n", "50% 15.500000 \n", "75% 17.175000 \n", "max 24.800000 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "id": "aee8413f", "metadata": {}, "source": [ "Some more information. You can use this next information to determine which columns have missing values." ] }, { "cell_type": "code", "execution_count": 8, "id": "c8ac533d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 406 entries, 0 to 405\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Name 406 non-null object \n", " 1 Miles_per_Gallon 398 non-null float64\n", " 2 Cylinders 406 non-null int64 \n", " 3 Displacement 406 non-null float64\n", " 4 Horsepower 400 non-null float64\n", " 5 Weight_in_lbs 406 non-null int64 \n", " 6 Acceleration 406 non-null float64\n", " 7 Year 406 non-null object \n", " 8 Origin 406 non-null object \n", "dtypes: float64(4), int64(2), object(3)\n", "memory usage: 28.7+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "ac815bd6", "metadata": {}, "source": [ "The data types of the columns." ] }, { "cell_type": "code", "execution_count": 10, "id": "73a407c8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name object\n", "Miles_per_Gallon float64\n", "Cylinders int64\n", "Displacement float64\n", "Horsepower float64\n", "Weight_in_lbs int64\n", "Acceleration float64\n", "Year object\n", "Origin object\n", "dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "id": "c34943f9", "metadata": {}, "source": [ "You can see how the numeric columns are *correlated* with each other. These correlation values range between -1 and 1, with 1 meaning the two columns are perfectly correlated." ] }, { "cell_type": "code", "execution_count": 11, "id": "725afe82", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Miles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAcceleration
Miles_per_Gallon1.000000-0.775396-0.804203-0.778427-0.8317410.420289
Cylinders-0.7753961.0000000.9517870.8441580.895220-0.522452
Displacement-0.8042030.9517871.0000000.8983260.932475-0.557984
Horsepower-0.7784270.8441580.8983261.0000000.866586-0.697124
Weight_in_lbs-0.8317410.8952200.9324750.8665861.000000-0.430086
Acceleration0.420289-0.522452-0.557984-0.697124-0.4300861.000000
\n", "
" ], "text/plain": [ " Miles_per_Gallon Cylinders Displacement Horsepower \\\n", "Miles_per_Gallon 1.000000 -0.775396 -0.804203 -0.778427 \n", "Cylinders -0.775396 1.000000 0.951787 0.844158 \n", "Displacement -0.804203 0.951787 1.000000 0.898326 \n", "Horsepower -0.778427 0.844158 0.898326 1.000000 \n", "Weight_in_lbs -0.831741 0.895220 0.932475 0.866586 \n", "Acceleration 0.420289 -0.522452 -0.557984 -0.697124 \n", "\n", " Weight_in_lbs Acceleration \n", "Miles_per_Gallon -0.831741 0.420289 \n", "Cylinders 0.895220 -0.522452 \n", "Displacement 0.932475 -0.557984 \n", "Horsepower 0.866586 -0.697124 \n", "Weight_in_lbs 1.000000 -0.430086 \n", "Acceleration -0.430086 1.000000 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.corr()" ] }, { "cell_type": "markdown", "id": "6ce37d8c", "metadata": {}, "source": [ "## Indexing\n", "\n", "There are many different ways to select data within a pandas DataFrame. The best way to remember them is to practice using them." ] }, { "cell_type": "code", "execution_count": 15, "id": "383459dd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0chevrolet chevelle malibu18.08307.0130.0350412.01970-01-01USA
1buick skylark 32015.08350.0165.0369311.51970-01-01USA
2plymouth satellite18.08318.0150.0343611.01970-01-01USA
3amc rebel sst16.08304.0150.0343312.01970-01-01USA
4ford torino17.08302.0140.0344910.51970-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "0 chevrolet chevelle malibu 18.0 8 307.0 \n", "1 buick skylark 320 15.0 8 350.0 \n", "2 plymouth satellite 18.0 8 318.0 \n", "3 amc rebel sst 16.0 8 304.0 \n", "4 ford torino 17.0 8 302.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "0 130.0 3504 12.0 1970-01-01 USA \n", "1 165.0 3693 11.5 1970-01-01 USA \n", "2 150.0 3436 11.0 1970-01-01 USA \n", "3 150.0 3433 12.0 1970-01-01 USA \n", "4 140.0 3449 10.5 1970-01-01 USA " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Reminder of how df starts\n", "df.head()" ] }, { "cell_type": "markdown", "id": "9d09d097", "metadata": {}, "source": [ "The entry in the 2nd row, 3rd column (remember we start counting at 0)." ] }, { "cell_type": "code", "execution_count": 14, "id": "166c715d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "318.0" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[2,3]" ] }, { "cell_type": "markdown", "id": "7238bd01", "metadata": {}, "source": [ "The entry in the row with label 2 and the column with label Displacement. (Notice that the index and the label is the same for the rows; this is not uncommon.)" ] }, { "cell_type": "code", "execution_count": 16, "id": "ae8225fb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "318.0" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[2,\"Displacement\"]" ] }, { "cell_type": "markdown", "id": "12afa509", "metadata": {}, "source": [ "You can get an entire row or column with the same syntax, using a colon `:` to represent \"all rows or all columns\"." ] }, { "cell_type": "code", "execution_count": 19, "id": "1c7bb2da", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name plymouth satellite\n", "Miles_per_Gallon 18.0\n", "Cylinders 8\n", "Displacement 318.0\n", "Horsepower 150.0\n", "Weight_in_lbs 3436\n", "Acceleration 11.0\n", "Year 1970-01-01\n", "Origin USA\n", "Name: 2, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The row at index 2.\n", "df.iloc[2,:]" ] }, { "cell_type": "code", "execution_count": 21, "id": "53e6f231", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 307.0\n", "1 350.0\n", "2 318.0\n", "3 304.0\n", "4 302.0\n", " ... \n", "401 140.0\n", "402 97.0\n", "403 135.0\n", "404 120.0\n", "405 119.0\n", "Name: Displacement, Length: 406, dtype: float64" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The column at index 3.\n", "df.iloc[:,3]" ] }, { "cell_type": "code", "execution_count": 22, "id": "0c75693b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 307.0\n", "1 350.0\n", "2 318.0\n", "3 304.0\n", "4 302.0\n", " ... \n", "401 140.0\n", "402 97.0\n", "403 135.0\n", "404 120.0\n", "405 119.0\n", "Name: Displacement, Length: 406, dtype: float64" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The column with label \"Displacement\".\n", "df.loc[:,\"Displacement\"]" ] }, { "cell_type": "markdown", "id": "26376ae9", "metadata": {}, "source": [ "There is an abbreviation for getting a certain column, using its label." ] }, { "cell_type": "code", "execution_count": 23, "id": "8ea619c3", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 307.0\n", "1 350.0\n", "2 318.0\n", "3 304.0\n", "4 302.0\n", " ... \n", "401 140.0\n", "402 97.0\n", "403 135.0\n", "404 120.0\n", "405 119.0\n", "Name: Displacement, Length: 406, dtype: float64" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"Displacement\"]" ] }, { "cell_type": "markdown", "id": "52a688e1", "metadata": {}, "source": [ "The next abbreviation does not always work, but can be a further shortcut. It is called \"attribute\" access. The subtleties of attribute access won't be important for us in Math 10; you can read about those subtleties in the [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#attribute-access)." ] }, { "cell_type": "code", "execution_count": 24, "id": "637ba347", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 307.0\n", "1 350.0\n", "2 318.0\n", "3 304.0\n", "4 302.0\n", " ... \n", "401 140.0\n", "402 97.0\n", "403 135.0\n", "404 120.0\n", "405 119.0\n", "Name: Displacement, Length: 406, dtype: float64" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Displacement" ] }, { "cell_type": "markdown", "id": "e3ac71d1", "metadata": {}, "source": [ "## Missing values\n", "\n", "An important concept (especially when working with real-world datasets) is the concept of missing data. This particular dataset has missing values in the `Miles_per_Gallon` column and in the `Horsepower` column. In this DataFrame, the missing data is denoted by the NumPy object `np.nan` which stands for \"not a number\"." ] }, { "cell_type": "code", "execution_count": 27, "id": "8e661f88", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "RangeIndex: 406 entries, 0 to 405\n", "Data columns (total 9 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Name 406 non-null object \n", " 1 Miles_per_Gallon 398 non-null float64\n", " 2 Cylinders 406 non-null int64 \n", " 3 Displacement 406 non-null float64\n", " 4 Horsepower 400 non-null float64\n", " 5 Weight_in_lbs 406 non-null int64 \n", " 6 Acceleration 406 non-null float64\n", " 7 Year 406 non-null object \n", " 8 Origin 406 non-null object \n", "dtypes: float64(4), int64(2), object(3)\n", "memory usage: 28.7+ KB\n" ] } ], "source": [ "df.info()" ] }, { "cell_type": "markdown", "id": "b577d9bf", "metadata": {}, "source": [ "We create a Boolean DataFrame using the method `isna()`. This DataFrame will be `True` where there are null values." ] }, { "cell_type": "code", "execution_count": 29, "id": "8f26c9a1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
0FalseFalseFalseFalseFalseFalseFalseFalseFalse
1FalseFalseFalseFalseFalseFalseFalseFalseFalse
2FalseFalseFalseFalseFalseFalseFalseFalseFalse
3FalseFalseFalseFalseFalseFalseFalseFalseFalse
4FalseFalseFalseFalseFalseFalseFalseFalseFalse
..............................
401FalseFalseFalseFalseFalseFalseFalseFalseFalse
402FalseFalseFalseFalseFalseFalseFalseFalseFalse
403FalseFalseFalseFalseFalseFalseFalseFalseFalse
404FalseFalseFalseFalseFalseFalseFalseFalseFalse
405FalseFalseFalseFalseFalseFalseFalseFalseFalse
\n", "

406 rows × 9 columns

\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement Horsepower \\\n", "0 False False False False False \n", "1 False False False False False \n", "2 False False False False False \n", "3 False False False False False \n", "4 False False False False False \n", ".. ... ... ... ... ... \n", "401 False False False False False \n", "402 False False False False False \n", "403 False False False False False \n", "404 False False False False False \n", "405 False False False False False \n", "\n", " Weight_in_lbs Acceleration Year Origin \n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", ".. ... ... ... ... \n", "401 False False False False \n", "402 False False False False \n", "403 False False False False \n", "404 False False False False \n", "405 False False False False \n", "\n", "[406 rows x 9 columns]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna()" ] }, { "cell_type": "markdown", "id": "b7ac19ab", "metadata": {}, "source": [ "Let's find where the `np.nan` values are in the `Miles_per_Gallon` column." ] }, { "cell_type": "code", "execution_count": 31, "id": "ba4f53ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
10citroen ds-21 pallasNaN4133.0115.0309017.51970-01-01Europe
11chevrolet chevelle concours (sw)NaN8350.0165.0414211.51970-01-01USA
12ford torino (sw)NaN8351.0153.0403411.01970-01-01USA
13plymouth satellite (sw)NaN8383.0175.0416610.51970-01-01USA
14amc rebel sst (sw)NaN8360.0175.0385011.01970-01-01USA
17ford mustang boss 302NaN8302.0140.033538.01970-01-01USA
39volkswagen super beetle 117NaN497.048.0197820.01971-01-01Europe
367saab 900sNaN4121.0110.0280015.41982-01-01Europe
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders \\\n", "10 citroen ds-21 pallas NaN 4 \n", "11 chevrolet chevelle concours (sw) NaN 8 \n", "12 ford torino (sw) NaN 8 \n", "13 plymouth satellite (sw) NaN 8 \n", "14 amc rebel sst (sw) NaN 8 \n", "17 ford mustang boss 302 NaN 8 \n", "39 volkswagen super beetle 117 NaN 4 \n", "367 saab 900s NaN 4 \n", "\n", " Displacement Horsepower Weight_in_lbs Acceleration Year Origin \n", "10 133.0 115.0 3090 17.5 1970-01-01 Europe \n", "11 350.0 165.0 4142 11.5 1970-01-01 USA \n", "12 351.0 153.0 4034 11.0 1970-01-01 USA \n", "13 383.0 175.0 4166 10.5 1970-01-01 USA \n", "14 360.0 175.0 3850 11.0 1970-01-01 USA \n", "17 302.0 140.0 3353 8.0 1970-01-01 USA \n", "39 97.0 48.0 1978 20.0 1971-01-01 Europe \n", "367 121.0 110.0 2800 15.4 1982-01-01 Europe " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Miles_per_Gallon\"].isna()]" ] }, { "cell_type": "markdown", "id": "8b36f968", "metadata": {}, "source": [ "The same thing for the `Horsepower` column." ] }, { "cell_type": "code", "execution_count": 33, "id": "f92199c1", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
38ford pinto25.0498.0NaN204619.01971-01-01USA
133ford maverick21.06200.0NaN287517.01974-01-01USA
337renault lecar deluxe40.9485.0NaN183517.31980-01-01Europe
343ford mustang cobra23.64140.0NaN290514.31980-01-01USA
361renault 18i34.54100.0NaN232015.81982-01-01Europe
382amc concord dl23.04151.0NaN303520.51982-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders Displacement \\\n", "38 ford pinto 25.0 4 98.0 \n", "133 ford maverick 21.0 6 200.0 \n", "337 renault lecar deluxe 40.9 4 85.0 \n", "343 ford mustang cobra 23.6 4 140.0 \n", "361 renault 18i 34.5 4 100.0 \n", "382 amc concord dl 23.0 4 151.0 \n", "\n", " Horsepower Weight_in_lbs Acceleration Year Origin \n", "38 NaN 2046 19.0 1971-01-01 USA \n", "133 NaN 2875 17.0 1974-01-01 USA \n", "337 NaN 1835 17.3 1980-01-01 Europe \n", "343 NaN 2905 14.3 1980-01-01 USA \n", "361 NaN 2320 15.8 1982-01-01 Europe \n", "382 NaN 3035 20.5 1982-01-01 USA " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Horsepower\"].isna()]" ] }, { "cell_type": "markdown", "id": "3ede598c", "metadata": {}, "source": [ "In Python, logical `or` is usually spelled out." ] }, { "cell_type": "code", "execution_count": 36, "id": "3f25ffbe", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "True or True" ] }, { "cell_type": "code", "execution_count": 34, "id": "4f48158d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "True or False" ] }, { "cell_type": "code", "execution_count": 35, "id": "fdc432f5", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "False or False" ] }, { "cell_type": "markdown", "id": "85463260", "metadata": {}, "source": [ "The equivalent of `or` in pandas is denoted with a vertical line `|`, which is sometimes called \"pipe\"." ] }, { "cell_type": "code", "execution_count": 37, "id": "cede5541", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
10citroen ds-21 pallasNaN4133.0115.0309017.51970-01-01Europe
11chevrolet chevelle concours (sw)NaN8350.0165.0414211.51970-01-01USA
12ford torino (sw)NaN8351.0153.0403411.01970-01-01USA
13plymouth satellite (sw)NaN8383.0175.0416610.51970-01-01USA
14amc rebel sst (sw)NaN8360.0175.0385011.01970-01-01USA
17ford mustang boss 302NaN8302.0140.033538.01970-01-01USA
38ford pinto25.0498.0NaN204619.01971-01-01USA
39volkswagen super beetle 117NaN497.048.0197820.01971-01-01Europe
133ford maverick21.06200.0NaN287517.01974-01-01USA
337renault lecar deluxe40.9485.0NaN183517.31980-01-01Europe
343ford mustang cobra23.64140.0NaN290514.31980-01-01USA
361renault 18i34.54100.0NaN232015.81982-01-01Europe
367saab 900sNaN4121.0110.0280015.41982-01-01Europe
382amc concord dl23.04151.0NaN303520.51982-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders \\\n", "10 citroen ds-21 pallas NaN 4 \n", "11 chevrolet chevelle concours (sw) NaN 8 \n", "12 ford torino (sw) NaN 8 \n", "13 plymouth satellite (sw) NaN 8 \n", "14 amc rebel sst (sw) NaN 8 \n", "17 ford mustang boss 302 NaN 8 \n", "38 ford pinto 25.0 4 \n", "39 volkswagen super beetle 117 NaN 4 \n", "133 ford maverick 21.0 6 \n", "337 renault lecar deluxe 40.9 4 \n", "343 ford mustang cobra 23.6 4 \n", "361 renault 18i 34.5 4 \n", "367 saab 900s NaN 4 \n", "382 amc concord dl 23.0 4 \n", "\n", " Displacement Horsepower Weight_in_lbs Acceleration Year Origin \n", "10 133.0 115.0 3090 17.5 1970-01-01 Europe \n", "11 350.0 165.0 4142 11.5 1970-01-01 USA \n", "12 351.0 153.0 4034 11.0 1970-01-01 USA \n", "13 383.0 175.0 4166 10.5 1970-01-01 USA \n", "14 360.0 175.0 3850 11.0 1970-01-01 USA \n", "17 302.0 140.0 3353 8.0 1970-01-01 USA \n", "38 98.0 NaN 2046 19.0 1971-01-01 USA \n", "39 97.0 48.0 1978 20.0 1971-01-01 Europe \n", "133 200.0 NaN 2875 17.0 1974-01-01 USA \n", "337 85.0 NaN 1835 17.3 1980-01-01 Europe \n", "343 140.0 NaN 2905 14.3 1980-01-01 USA \n", "361 100.0 NaN 2320 15.8 1982-01-01 Europe \n", "367 121.0 110.0 2800 15.4 1982-01-01 Europe \n", "382 151.0 NaN 3035 20.5 1982-01-01 USA " ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df[\"Miles_per_Gallon\"].isna() | df[\"Horsepower\"].isna()]" ] }, { "cell_type": "markdown", "id": "0c678fb5", "metadata": {}, "source": [ "A fancier and more robust method is to use `any`. In this example, `axis = 1` is saying, look one row at a time. So `df.isna().any(axis=1)` is asking if there are any missing values in the entire row." ] }, { "cell_type": "code", "execution_count": 42, "id": "e66582bd", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameMiles_per_GallonCylindersDisplacementHorsepowerWeight_in_lbsAccelerationYearOrigin
10citroen ds-21 pallasNaN4133.0115.0309017.51970-01-01Europe
11chevrolet chevelle concours (sw)NaN8350.0165.0414211.51970-01-01USA
12ford torino (sw)NaN8351.0153.0403411.01970-01-01USA
13plymouth satellite (sw)NaN8383.0175.0416610.51970-01-01USA
14amc rebel sst (sw)NaN8360.0175.0385011.01970-01-01USA
17ford mustang boss 302NaN8302.0140.033538.01970-01-01USA
38ford pinto25.0498.0NaN204619.01971-01-01USA
39volkswagen super beetle 117NaN497.048.0197820.01971-01-01Europe
133ford maverick21.06200.0NaN287517.01974-01-01USA
337renault lecar deluxe40.9485.0NaN183517.31980-01-01Europe
343ford mustang cobra23.64140.0NaN290514.31980-01-01USA
361renault 18i34.54100.0NaN232015.81982-01-01Europe
367saab 900sNaN4121.0110.0280015.41982-01-01Europe
382amc concord dl23.04151.0NaN303520.51982-01-01USA
\n", "
" ], "text/plain": [ " Name Miles_per_Gallon Cylinders \\\n", "10 citroen ds-21 pallas NaN 4 \n", "11 chevrolet chevelle concours (sw) NaN 8 \n", "12 ford torino (sw) NaN 8 \n", "13 plymouth satellite (sw) NaN 8 \n", "14 amc rebel sst (sw) NaN 8 \n", "17 ford mustang boss 302 NaN 8 \n", "38 ford pinto 25.0 4 \n", "39 volkswagen super beetle 117 NaN 4 \n", "133 ford maverick 21.0 6 \n", "337 renault lecar deluxe 40.9 4 \n", "343 ford mustang cobra 23.6 4 \n", "361 renault 18i 34.5 4 \n", "367 saab 900s NaN 4 \n", "382 amc concord dl 23.0 4 \n", "\n", " Displacement Horsepower Weight_in_lbs Acceleration Year Origin \n", "10 133.0 115.0 3090 17.5 1970-01-01 Europe \n", "11 350.0 165.0 4142 11.5 1970-01-01 USA \n", "12 351.0 153.0 4034 11.0 1970-01-01 USA \n", "13 383.0 175.0 4166 10.5 1970-01-01 USA \n", "14 360.0 175.0 3850 11.0 1970-01-01 USA \n", "17 302.0 140.0 3353 8.0 1970-01-01 USA \n", "38 98.0 NaN 2046 19.0 1971-01-01 USA \n", "39 97.0 48.0 1978 20.0 1971-01-01 Europe \n", "133 200.0 NaN 2875 17.0 1974-01-01 USA \n", "337 85.0 NaN 1835 17.3 1980-01-01 Europe \n", "343 140.0 NaN 2905 14.3 1980-01-01 USA \n", "361 100.0 NaN 2320 15.8 1982-01-01 Europe \n", "367 121.0 110.0 2800 15.4 1982-01-01 Europe \n", "382 151.0 NaN 3035 20.5 1982-01-01 USA " ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.isna().any(axis=1)]" ] }, { "cell_type": "markdown", "id": "0847d550", "metadata": {}, "source": [ "If we were to instead use `axis=0`, it would ask if there were any missing values in the entire column." ] }, { "cell_type": "code", "execution_count": 44, "id": "e27fd3a7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Name False\n", "Miles_per_Gallon True\n", "Cylinders False\n", "Displacement False\n", "Horsepower True\n", "Weight_in_lbs False\n", "Acceleration False\n", "Year False\n", "Origin False\n", "dtype: bool" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().any(axis=0)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 5 }