Working with missing data

Today we covered a variety of shorter topics. The most important topic is how to work with missing data.

Warm-up question

  • What can go inside the brackets? df[???]

(Notice that there is no loc or iloc in this code.)

import pandas as pd

We import a fake (artificial) dataset.

df = pd.read_csv("../data/sampleData.csv")
df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

The most basic way to use the syntax df[???] is to access a column of a DataFrame. The column is returned as a pandas Series.

df["B"]
0   -2.3
1    1.0
2    5.0
3   -4.0
4    2.8
5    6.0
6    1.0
7   -1.0
Name: B, dtype: float64
type(df["B"])
pandas.core.series.Series
s = df["B"]
s.index
RangeIndex(start=0, stop=8, step=1)
s[5]
6.0

pandas Series are like fancier versions of the dictionary data type in Python. Here we convert the pandas Series s into a Python dictionary d.

d = dict(s)
d
{0: -2.3, 1: 1.0, 2: 5.0, 3: -4.0, 4: 2.8, 5: 6.0, 6: 1.0, 7: -1.0}

Notice how similar the content is.

s
0   -2.3
1    1.0
2    5.0
3   -4.0
4    2.8
5    6.0
6    1.0
7   -1.0
Name: B, dtype: float64

We access elements in a Python dict the same way we access them in a pandas Series, using square brackets.

d[5]
6.0

Here is a simpler example of a Python dictionary.

d2 = {"chris":"instructor","yasmeen":"ta"}
d2["chris"]
'instructor'
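
One way a Series is “fancier” than a dictionary: it supports vectorized arithmetic and reductions on the whole collection at once. Here is a small sketch using the Series s from above.

# Vectorized operations act on every entry at once; a plain dict cannot do this.
s + 1      # adds 1 to each value in column "B"
s.mean()   # 1.0625, the average of column "B"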

Another use of the syntax df[???] is to provide a list of column names.

df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14
df[["A","C"]]
A C
0 3 -1
1 2.1 0
2 ? ?
3 12.3 4.1
4 7 3
5 5 ?
6 ? 1.8
7 2 3.14

You can even repeat a column multiple times.

df[["A","C","A"]]
A C A
0 3 -1 3
1 2.1 0 2.1
2 ? ? ?
3 12.3 4.1 12.3
4 7 3 7
5 5 ? 5
6 ? 1.8 ?
7 2 3.14 2
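
A related subtlety worth knowing: a list containing a single column name still returns a DataFrame, not a Series.

type(df[["A"]])   # pandas.core.frame.DataFrame (a one-column DataFrame)
type(df["A"])     # pandas.core.series.Series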

Another use of the syntax df[???] is for what is called slicing.

df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

Here we get the first 3 rows in the DataFrame.

# slicing
df[:3]
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?

Alternate methods:

df.head(3)
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
df.iloc[:3]
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
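
This slicing follows the usual Python slice conventions. A quick sketch of some other slices:

df[2:5]   # rows at positions 2, 3, 4
df[::2]   # every other row
df[-3:]   # the last three rows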

The last use of the syntax df[???] we’ll cover, and the one most relevant to this week’s material, is Boolean indexing.

df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

In the following, we find all the rows where the “A” column is equal to “?”. But how does this actually work?

# Boolean indexing
df[df["A"]=="?"]
A B C
2 ? 5.0 ?
6 ? 1.0 1.8

Let’s look more closely at what is going on inside the square brackets. This is an example of a Boolean Series.

# Boolean Series
df["A"]=="?"
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
Name: A, dtype: bool

Think of the Boolean Series as telling Python which rows to keep. In this case, only rows 2 and 6 are kept.

df[df["A"]=="?"]
A B C
2 ? 5.0 ?
6 ? 1.0 1.8

Usually this will arise from a Boolean Series, but it’s also fine to use a list of True and False values. Here we get rid of the row labeled 3.

df[[True, True, True, False, True, True, True, True]]
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

Here is an example of making a small DataFrame from a Python dictionary. The keys of the dictionary become the column names, and the values become the contents of the columns.

df3 = pd.DataFrame({"col0":[3,1,4],"col1":[2,2,5]})
df3
col0 col1
0 3 2
1 1 2
2 4 5
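
As an aside, the same DataFrame could be built row-by-row from a list of lists, supplying the column names separately. Here is an equivalent sketch:

# Equivalent construction from rows instead of columns.
pd.DataFrame([[3, 2], [1, 2], [4, 5]], columns=["col0", "col1"])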

The following does not work because the Boolean list does not have length matching the number of rows. (Warning: this can be confusing, because a list of strings is interpreted as column names rather than as rows to keep, so in that case the length has nothing to do with the number of rows.)

df3[[True, False]]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_45289/2048590108.py in <module>
----> 1 df3[[True, False]]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3447         # Do we have a (boolean) 1d indexer?
   3448         if com.is_bool_indexer(key):
-> 3449             return self._getitem_bool_array(key)
   3450 
   3451         # We are left with two options: a single key, and a collection of keys,

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_bool_array(self, key)
   3495         elif len(key) != len(self.index):
   3496             raise ValueError(
-> 3497                 f"Item wrong length {len(key)} instead of {len(self.index)}."
   3498             )
   3499 

ValueError: Item wrong length 2 instead of 3.

Here we keep the first and last rows.

df3[[True, False, True]]
col0 col1
0 3 2
2 4 5
df3
col0 col1
0 3 2
1 1 2
2 4 5

That example is a little artificial, but it is the same reasoning that’s used with Boolean indexing. Here we make a Boolean Series.

df3["col0"] > 2
0     True
1    False
2     True
Name: col0, dtype: bool

And now we provide that Boolean Series inside the square brackets. This tells pandas which rows to keep. We are keeping the rows which have a value strictly greater than 2 in “col0”.

# Boolean indexing
df3[df3["col0"] > 2]
col0 col1
0 3 2
2 4 5

Logic in pandas

Let’s return to our original DataFrame.

df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

Say we want to know which rows have a “?” anywhere (in any of the columns). Here is one solution, but it is not DRY (“Don’t Repeat Yourself”). The vertical bar | denotes elementwise “or” in pandas.

(df["A"]=="?") | (df["B"]=="?") | (df["C"]=="?") 
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7    False
dtype: bool
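
The companion operator & denotes elementwise “and”. Here is a small sketch using df3 from above; as with |, each comparison needs its own parentheses.

df3[(df3["col0"] > 2) & (df3["col1"] > 2)]   # keeps only row 2, where both conditions hold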

Aside: The parentheses are important. Comparison operators like == have lower precedence than |, so without parentheses Python first tries to evaluate "?" | df["B"], which raises a TypeError.

df["A"]=="?" | df["B"]=="?" | df["C"]=="?" 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    301         #  (xint or xbool) and (yint or bool)
--> 302         result = op(x, y)
    303     except TypeError:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/roperator.py in ror_(left, right)
     55 def ror_(left, right):
---> 56     return operator.or_(right, left)
     57 

TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    315             try:
--> 316                 result = libops.scalar_binop(x, y, op)
    317             except (

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/ops.pyx in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_45289/2004769874.py in <module>
----> 1 df["A"]=="?" | df["B"]=="?" | df["C"]=="?"

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arraylike.py in __ror__(self, other)
     72     @unpack_zerodim_and_defer("__ror__")
     73     def __ror__(self, other):
---> 74         return self._logical_method(other, roperator.ror_)
     75 
     76     @unpack_zerodim_and_defer("__xor__")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/series.py in _logical_method(self, other, op)
   5511         rvalues = extract_array(other, extract_numpy=True, extract_range=True)
   5512 
-> 5513         res_values = ops.logical_op(lvalues, rvalues, op)
   5514         return self._construct_result(res_values, name=res_name)
   5515 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in logical_op(left, right, op)
    390         filler = fill_int if is_self_int_dtype and is_other_int_dtype else fill_bool
    391 
--> 392         res_values = na_logical_op(lvalues, rvalues, op)
    393         # error: Cannot call function of unknown type
    394         res_values = filler(res_values)  # type: ignore[operator]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    326                     f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    327                     f"and scalar of type [{typ}]"
--> 328                 ) from err
    329 
    330     return result.reshape(x.shape)

TypeError: Cannot perform 'ror_' with a dtyped [float64] array and scalar of type [bool]

Let’s take a closer look at how this syntax works.

(df["A"]=="?") | (df["B"]=="?") | (df["C"]=="?") 
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7    False
dtype: bool

Here is an example of a Boolean DataFrame. In each position, it says whether the DataFrame df had the value “?”.

df=="?"
A B C
0 False False False
1 False False False
2 True False True
3 False False False
4 False False False
5 False False True
6 True False False
7 False False False

Here is a method that finds whether there are any True values in each row. The keyword argument axis=1 tells pandas to look one row at a time. Notice how this produces the same Boolean Series as the non-DRY version above with the vertical bars |.

# DRY code
(df=="?").any(axis=1)
0    False
1    False
2     True
3    False
4    False
5     True
6     True
7    False
dtype: bool

If we instead use axis=0, then we are determining if there are any True values in each column, one column at a time.

(df=="?").any(axis=0)
A     True
B    False
C     True
dtype: bool

We can also use all instead of any, if we want to know if all of the values are True.

(df=="?").all(axis=0)
A    False
B    False
C    False
dtype: bool
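
Since True counts as 1 when summing, the same Boolean DataFrame can also count the “?”s in each column:

(df=="?").sum(axis=0)   # columns "A" and "C" each contain two "?"s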

Problems with this DataFrame

The “?”s in this DataFrame cause a lot of problems. They are supposed to represent missing values, but pandas doesn’t know that.

df
A B C
0 3 -2.3 -1
1 2.1 1.0 0
2 ? 5.0 ?
3 12.3 -4.0 4.1
4 7 2.8 3
5 5 6.0 ?
6 ? 1.0 1.8
7 2 -1.0 3.14

The values in column “A” look like numbers, but because of the “?” values, pandas is treating them as strings. For example, if we try to use the sum method, the values get concatenated.

df["A"].sum()
'32.1?12.375?2'

The “B” column is better because it doesn’t have any “?”s in it.

df.dtypes
A     object
B    float64
C     object
dtype: object

This is how + works when the two objects are strings in Python.

"3"+"2"
'32'

The describe method also omits two of the three columns, because pandas thinks they contain strings.

df.describe()
B
count 8.000000
mean 1.062500
std 3.466137
min -4.000000
25% -1.325000
50% 1.000000
75% 3.350000
max 6.000000

Often pandas handles missing values automatically. In this case it does not, so when importing we tell pandas explicitly that missing values are denoted by “?”.

df4 = pd.read_csv("../data/sampleData.csv", na_values="?")

Notice how those “?” values are now represented as NaN.

df4
A B C
0 3.0 -2.3 -1.00
1 2.1 1.0 0.00
2 NaN 5.0 NaN
3 12.3 -4.0 4.10
4 7.0 2.8 3.00
5 5.0 6.0 NaN
6 NaN 1.0 1.80
7 2.0 -1.0 3.14

Now sum works correctly.

df4["A"].sum()
31.4
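
If the data had already been loaded without na_values, one alternative sketch is to coerce the columns after the fact: errors="coerce" tells pd.to_numeric to turn anything unparseable (like "?") into NaN. (The name df_fixed is just for illustration.)

# A sketch of fixing the dtypes after importing.
df_fixed = df.apply(pd.to_numeric, errors="coerce")
df_fixed["A"].sum()   # 31.4, the same as df4["A"].sum()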

We can use the axis argument the same as before. Here axis=0 is saying to add up the columns one at a time.

df4.sum(axis=0)
A    31.40
B     8.50
C    11.04
dtype: float64

Here we are adding the values in each row, one row at a time, because of the axis=1 argument.

df4.sum(axis=1)
0    -0.30
1     3.10
2     5.00
3    12.40
4    12.80
5    11.00
6     2.80
7     4.14
dtype: float64
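
Notice that row 2 sums to 5.0: by default these methods skip missing values (skipna=True). Passing skipna=False makes any row containing NaN sum to NaN.

df4.sum(axis=1, skipna=False)   # rows 2, 5, and 6 become NaN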

There are lots of different methods which behave similarly. For example, here we compute the median of each row.

df4.median(axis=1)
0   -1.0
1    1.0
2    5.0
3    4.1
4    3.0
5    5.5
6    1.4
7    2.0
dtype: float64

NaN

Let’s look more closely at these missing values.

df4.loc[2,"A"]
nan
type(df4.loc[2,"A"])
numpy.float64

This constant is part of the important NumPy library.

import numpy as np

If we want to make this “not a number” value ourselves, we can use the following syntax.

np.nan
nan
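
One subtlety about this value: nan is not equal to itself, which is why we test for missing values with functions like isna rather than with ==.

np.nan == np.nan   # False
pd.isna(np.nan)    # True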

Missing values in a real dataset

So far today, all of our examples came from a small artificial dataset. Here we import a slightly bigger (still relatively small) dataset with real data in it. Notice how we don’t need to specify the na_values keyword argument, because pandas correctly identifies the missing data automatically.

cars = pd.read_csv("../data/cars.csv")

Using the info method, we can see that, for example, there are 8 null values (i.e., missing values) in the “Miles_per_Gallon” column.

cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB
cars["Miles_per_Gallon"].isna()
0      False
1      False
2      False
3      False
4      False
       ...  
401    False
402    False
403    False
404    False
405    False
Name: Miles_per_Gallon, Length: 406, dtype: bool

Here is a quick way to count those missing values.

cars["Miles_per_Gallon"].isna().sum()
8
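
The same idea applied to the whole DataFrame counts the missing values in every column at once. Based on the info output above, this would report 8 for “Miles_per_Gallon”, 6 for “Horsepower”, and 0 for every other column.

cars.isna().sum()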

Here we use Boolean indexing to find the rows where the “Miles_per_Gallon” value is missing.

cars[cars["Miles_per_Gallon"].isna()]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
10 citroen ds-21 pallas NaN 4 133.0 115.0 3090 17.5 1970-01-01 Europe
11 chevrolet chevelle concours (sw) NaN 8 350.0 165.0 4142 11.5 1970-01-01 USA
12 ford torino (sw) NaN 8 351.0 153.0 4034 11.0 1970-01-01 USA
13 plymouth satellite (sw) NaN 8 383.0 175.0 4166 10.5 1970-01-01 USA
14 amc rebel sst (sw) NaN 8 360.0 175.0 3850 11.0 1970-01-01 USA
17 ford mustang boss 302 NaN 8 302.0 140.0 3353 8.0 1970-01-01 USA
39 volkswagen super beetle 117 NaN 4 97.0 48.0 1978 20.0 1971-01-01 Europe
367 saab 900s NaN 4 121.0 110.0 2800 15.4 1982-01-01 Europe

A common task is to remove all the rows which are missing any values. Here is a quick way to do that. (This is displaying a new DataFrame; the way it’s written, it does not change cars.)

cars.dropna()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA
... ... ... ... ... ... ... ... ... ...
401 ford mustang gl 27.0 4 140.0 86.0 2790 15.6 1982-01-01 USA
402 vw pickup 44.0 4 97.0 52.0 2130 24.6 1982-01-01 Europe
403 dodge rampage 32.0 4 135.0 84.0 2295 11.6 1982-01-01 USA
404 ford ranger 28.0 4 120.0 79.0 2625 18.6 1982-01-01 USA
405 chevy s-10 31.0 4 119.0 82.0 2720 19.4 1982-01-01 USA

392 rows × 9 columns
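
As an aside, dropna accepts keyword arguments; for example, subset restricts which columns are checked for missing values. A quick sketch:

# Drop only the rows that are missing a "Horsepower" value.
cars.dropna(subset=["Horsepower"])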

Here is another way to do the same thing, using any.

cars.isna()
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 False False False False False False False False False
1 False False False False False False False False False
2 False False False False False False False False False
3 False False False False False False False False False
4 False False False False False False False False False
... ... ... ... ... ... ... ... ... ...
401 False False False False False False False False False
402 False False False False False False False False False
403 False False False False False False False False False
404 False False False False False False False False False
405 False False False False False False False False False

406 rows × 9 columns

We want to identify rows where any value is missing. Since we are working one row at a time, we use axis=1. Here is a Boolean Series.

cars.isna().any(axis=1)
0      False
1      False
2      False
3      False
4      False
       ...  
401    False
402    False
403    False
404    False
405    False
Length: 406, dtype: bool

Here we use Boolean indexing to keep only those rows which are missing data.

cars[cars.isna().any(axis=1)]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
10 citroen ds-21 pallas NaN 4 133.0 115.0 3090 17.5 1970-01-01 Europe
11 chevrolet chevelle concours (sw) NaN 8 350.0 165.0 4142 11.5 1970-01-01 USA
12 ford torino (sw) NaN 8 351.0 153.0 4034 11.0 1970-01-01 USA
13 plymouth satellite (sw) NaN 8 383.0 175.0 4166 10.5 1970-01-01 USA
14 amc rebel sst (sw) NaN 8 360.0 175.0 3850 11.0 1970-01-01 USA
17 ford mustang boss 302 NaN 8 302.0 140.0 3353 8.0 1970-01-01 USA
38 ford pinto 25.0 4 98.0 NaN 2046 19.0 1971-01-01 USA
39 volkswagen super beetle 117 NaN 4 97.0 48.0 1978 20.0 1971-01-01 Europe
133 ford maverick 21.0 6 200.0 NaN 2875 17.0 1974-01-01 USA
337 renault lecar deluxe 40.9 4 85.0 NaN 1835 17.3 1980-01-01 Europe
343 ford mustang cobra 23.6 4 140.0 NaN 2905 14.3 1980-01-01 USA
361 renault 18i 34.5 4 100.0 NaN 2320 15.8 1982-01-01 Europe
367 saab 900s NaN 4 121.0 110.0 2800 15.4 1982-01-01 Europe
382 amc concord dl 23.0 4 151.0 NaN 3035 20.5 1982-01-01 USA

A more natural thing is to get rid of those rows. We can negate our Boolean Series by using ~.

# negation in pandas
cars[~cars.isna().any(axis=1)]
Name Miles_per_Gallon Cylinders Displacement Horsepower Weight_in_lbs Acceleration Year Origin
0 chevrolet chevelle malibu 18.0 8 307.0 130.0 3504 12.0 1970-01-01 USA
1 buick skylark 320 15.0 8 350.0 165.0 3693 11.5 1970-01-01 USA
2 plymouth satellite 18.0 8 318.0 150.0 3436 11.0 1970-01-01 USA
3 amc rebel sst 16.0 8 304.0 150.0 3433 12.0 1970-01-01 USA
4 ford torino 17.0 8 302.0 140.0 3449 10.5 1970-01-01 USA
... ... ... ... ... ... ... ... ... ...
401 ford mustang gl 27.0 4 140.0 86.0 2790 15.6 1982-01-01 USA
402 vw pickup 44.0 4 97.0 52.0 2130 24.6 1982-01-01 Europe
403 dodge rampage 32.0 4 135.0 84.0 2295 11.6 1982-01-01 USA
404 ford ranger 28.0 4 120.0 79.0 2625 18.6 1982-01-01 USA
405 chevy s-10 31.0 4 119.0 82.0 2720 19.4 1982-01-01 USA

392 rows × 9 columns
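
As a sanity check, this negated Boolean indexing should produce exactly the same DataFrame as cars.dropna() above.

cars.dropna().equals(cars[~cars.isna().any(axis=1)])   # True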