Working with missing data¶

Today we covered a variety of shorter topics. The most important topic is how to work with missing data.

Warm-up question¶

What can go inside the brackets? df[???]

(Notice that there is no loc or iloc in this code.)

import pandas as pd

We import a fake (artificial) dataset.

df = pd.read_csv("../data/sampleData.csv")

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

The most basic way to use the syntax df[???] is to access a column of a DataFrame. The column is returned as a pandas Series.

df["B"]

 -2.3
  1.0
  5.0
 -4.0
  2.8
  6.0
  1.0
 -1.0
Name: B, dtype: float64

type(df["B"])

pandas.core.series.Series

s = df["B"]

s.index

RangeIndex(start=0, stop=8, step=1)

s[5]

6.0

pandas Series are like fancier versions of the dictionary data type in Python. Here we convert the pandas Series s into a Python dictionary d.

d = dict(s)
d

{0: -2.3, 1: 1.0, 2: 5.0, 3: -4.0, 4: 2.8, 5: 6.0, 6: 1.0, 7: -1.0}

Notice how similar the content is.

 -2.3
  1.0
  5.0
 -4.0
  2.8
  6.0
  1.0
 -1.0
Name: B, dtype: float64

We access elements in a Python dict the same we access them in a pandas Series, using square brackets.

d[5]

6.0

Here is a simpler example of a Python dictionary.

d2 = {"chris":"instructor","yasmeen":"ta"}

d2["chris"]

'instructor'

Another use of the syntax df[???] is to provide a list of column names.

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

df[["A","C"]]

	A	C
0	3	-1
1	2.1	0
2	?	?
3	12.3	4.1
4	7	3
5	5	?
6	?	1.8
7	2	3.14

You can even repeat a column multiple times.

df[["A","C","A"]]

	A	C	A
0	3	-1	3
1	2.1	0	2.1
2	?	?	?
3	12.3	4.1	12.3
4	7	3	7
5	5	?	5
6	?	1.8	?
7	2	3.14	2

Another use of the syntax df[???] is for what is called slicing.

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

Here we get the first 3 rows in the DataFrame.

# slicing
df[:3]

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?

Alternate methods:

df.head(3)

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?

df.iloc[:3]

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?

The last example we’ll cover with the syntax df[???], and the version most relevant for this last week’s material, is Boolean indexing.

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

In the following, we find all the rows where the “A” column is equal to ?, but how does this actually work?

# Boolean indexing
df[df["A"]=="?"]

	A	B	C
2	?	5.0	?
6	?	1.0	1.8

Let’s look more closely at what is going inside the square brackets. This is an example of a Boolean Series.

# Boolean Series
df["A"]=="?"

  False
  False
   True
  False
  False
  False
   True
  False
Name: A, dtype: bool

Think of the Boolean Series as telling Python which rows to keep. In this case, only rows 2 and 6 are kept.

df[df["A"]=="?"]

	A	B	C
2	?	5.0	?
6	?	1.0	1.8

Usually this will arise from a Boolean Series, but it’s also fine to use a list of True and False values. Here we get rid of the row labeled 3.

df[[True, True, True, False, True, True, True, True]]

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

Here is an example of making a small DataFrame from a Python dictionary. Here the keys of the dictionary are the names of the columns, and the values of the dictionary are the values in the rows.

df3 = pd.DataFrame({"col0":[3,1,4],"col1":[2,2,5]})

df3

	col0	col1
0	3	2
1	1	2
2	4	5

The following does not work because the inner list does not have length matching the number of rows. (Warning. It can be confusing, because if you are providing strings rather than Boolean values, then the length should match the number of columns!)

df3[[True, False]]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_45289/2048590108.py in <module>
----> 1 df3[[True, False]]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3447         # Do we have a (boolean) 1d indexer?
   3448         if com.is_bool_indexer(key):
-> 3449             return self._getitem_bool_array(key)
   3450 
   3451         # We are left with two options: a single key, and a collection of keys,

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_bool_array(self, key)
   3495         elif len(key) != len(self.index):
   3496             raise ValueError(
-> 3497                 f"Item wrong length {len(key)} instead of {len(self.index)}."
   3498             )
   3499 

ValueError: Item wrong length 2 instead of 3.

Here we keep the first and last rows.

df3[[True, False,True]]

	col0	col1
0	3	2
2	4	5

df3

	col0	col1
0	3	2
1	1	2
2	4	5

That example is a little artificial, but it is the same reasoning that’s used with Boolean indexing. Here we make a Boolean Series.

df3["col0"] > 2

   True
  False
   True
Name: col0, dtype: bool

And now we provide that Boolean Series inside the square brackets. This tells pandas which rows to keep. We are keeping the rows which have a value strictly greater than 2 in “col0”.

# Boolean indexing
df3[df3["col0"] > 2]

	col0	col1
0	3	2
2	4	5

Logic in pandas¶

Let’s return to our original DataFrame.

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

Say we want to know which rows have a “?” anywhere (in any of the columns). Here is one solution, but it is not DRY. The vertical bar | denotes “or” in pandas.

(df["A"]=="?") | (df["B"]=="?") | (df["C"]=="?") 

  False
  False
   True
  False
  False
   True
   True
  False
dtype: bool

Aside: The parentheses are important.

df["A"]=="?" | df["B"]=="?" | df["C"]=="?" 

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    301         #  (xint or xbool) and (yint or bool)
--> 302         result = op(x, y)
    303     except TypeError:

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/roperator.py in ror_(left, right)
     55 def ror_(left, right):
---> 56     return operator.or_(right, left)
     57 

TypeError: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    315             try:
--> 316                 result = libops.scalar_binop(x, y, op)
    317             except (

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/ops.pyx in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_45289/2004769874.py in <module>
----> 1 df["A"]=="?" | df["B"]=="?" | df["C"]=="?"

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/common.py in new_method(self, other)
     67         other = item_from_zerodim(other)
     68 
---> 69         return method(self, other)
     70 
     71     return new_method

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arraylike.py in __ror__(self, other)
     72     @unpack_zerodim_and_defer("__ror__")
     73     def __ror__(self, other):
---> 74         return self._logical_method(other, roperator.ror_)
     75 
     76     @unpack_zerodim_and_defer("__xor__")

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/series.py in _logical_method(self, other, op)
   5511         rvalues = extract_array(other, extract_numpy=True, extract_range=True)
   5512 
-> 5513         res_values = ops.logical_op(lvalues, rvalues, op)
   5514         return self._construct_result(res_values, name=res_name)
   5515 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in logical_op(left, right, op)
    390         filler = fill_int if is_self_int_dtype and is_other_int_dtype else fill_bool
    391 
--> 392         res_values = na_logical_op(lvalues, rvalues, op)
    393         # error: Cannot call function of unknown type
    394         res_values = filler(res_values)  # type: ignore[operator]

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    326                     f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    327                     f"and scalar of type [{typ}]"
--> 328                 ) from err
    329 
    330     return result.reshape(x.shape)

TypeError: Cannot perform 'ror_' with a dtyped [float64] array and scalar of type [bool]

Let’s take a closer look at how this syntax works.

(df["A"]=="?") | (df["B"]=="?") | (df["C"]=="?") 

  False
  False
   True
  False
  False
   True
   True
  False
dtype: bool

Here is an example of a Boolean DataFrame. In each position, it says whether the DataFrame df had the value “?”.

df=="?"

	A	B	C
0	False	False	False
1	False	False	False
2	True	False	True
3	False	False	False
4	False	False	False
5	False	False	True
6	True	False	False
7	False	False	False

Here is a method to find if there are any True values along a row. The keyword argument axis=1 is telling pandas to look one row at a time. Notice how this produces the same Boolean Series as the non-Dry version above, with the vertical bars |.

# DRY code
(df=="?").any(axis=1)

  False
  False
   True
  False
  False
   True
   True
  False
dtype: bool

If we instead use axis=0, then we are determining if there are any True values in each column, one column at a time.

(df=="?").any(axis=0)

A     True
B    False
C     True
dtype: bool

We can also use all instead of any, if we want to know if all of the values are True.

(df=="?").all(axis=0)

A    False
B    False
C    False
dtype: bool

Problems with this DataFrame¶

The “?”s in this DataFrame cause a lot of problems. They are supposed to represent missing values, but pandas doesn’t know that.

df

	A	B	C
0	3	-2.3	-1
1	2.1	1.0	0
2	?	5.0	?
3	12.3	-4.0	4.1
4	7	2.8	3
5	5	6.0	?
6	?	1.0	1.8
7	2	-1.0	3.14

The values in column “A” look like numbers, but because of the “?” values, pandas is treating them as strings. For example, if we try to use the sum method, the values get concatenated.

df["A"].sum()

'32.1?12.375?2'

The “B” column is better because it doesn’t have any “?”s in it.

df.dtypes

A     object
B    float64
C     object
dtype: object

This is how + works when the two objects are strings in Python.

"3"+"2"

'32'

The describe method also doesn’t work in two of the three columns, because pandas thinks they are strings.

df.describe()

	B
count	8.000000
mean	1.062500
std	3.466137
min	-4.000000
25%	-1.325000
50%	1.000000
75%	3.350000
max	6.000000

Often missing values are handled automatically by pandas. In this case, they are not handled automatically, so we tell pandas explicitly when importing that the missing values are denoted by “?”.

df4 = pd.read_csv("../data/sampleData.csv", na_values="?")

Notice how those “?” values have gotten represented as NaN.

df4

	A	B	C
0	3.0	-2.3	-1.00
1	2.1	1.0	0.00
2	NaN	5.0	NaN
3	12.3	-4.0	4.10
4	7.0	2.8	3.00
5	5.0	6.0	NaN
6	NaN	1.0	1.80
7	2.0	-1.0	3.14

Now sum works correctly.

df4["A"].sum()

31.4

We can use the axis argument the same as before. Here axis=0 is saying to add up the columns one at a time.

df4.sum(axis=0)

A    31.40
B     8.50
C    11.04
dtype: float64

Here we are adding the values in each row, one row at a time, because of the axis=1 argument.

df4.sum(axis=1)

  -0.30
   3.10
   5.00
  12.40
  12.80
  11.00
   2.80
   4.14
dtype: float64

There are lots of different methods which behave similarly. For example, here we compute the median of each row.

df4.median(axis=1)

 -1.0
  1.0
  5.0
  4.1
  3.0
  5.5
  1.4
  2.0
dtype: float64

NaN¶

Let’s look more closely at these missing values.

df4.loc[2,"A"]

nan

type(df4.loc[2,"A"])

numpy.float64

This constant is part of the important NumPy library.

import numpy as np

If we want to make this “not a number” value ourselves, we can use the following syntax.

np.nan

nan

Missing values in a real dataset¶

So far today, all of our examples came from a small artificial dataset. Here we import a slightly bigger (still relatively small) dataset with real data in it. Notice how we don’t need to specify the na_values keyword argument, because pandas correctly identifies the missing data automatically.

cars = pd.read_csv("../data/cars.csv")

Using the info method, we can see that, for example, there are 8 null values (i.e., missing values) in the “Miles_per_Gallon column.

cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Name              406 non-null    object 
 1   Miles_per_Gallon  398 non-null    float64
 2   Cylinders         406 non-null    int64  
 3   Displacement      406 non-null    float64
 4   Horsepower        400 non-null    float64
 5   Weight_in_lbs     406 non-null    int64  
 6   Acceleration      406 non-null    float64
 7   Year              406 non-null    object 
 8   Origin            406 non-null    object 
dtypes: float64(4), int64(2), object(3)
memory usage: 28.7+ KB

cars["Miles_per_Gallon"].isna()

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Name: Miles_per_Gallon, Length: 406, dtype: bool

Here is a quick way to count those missing values.

cars["Miles_per_Gallon"].isna().sum()

Using Boolean indexing to find the rows where the “Miles_per_Gallon” value is missing.

cars[cars["Miles_per_Gallon"].isna()]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
10	citroen ds-21 pallas	NaN	4	133.0	115.0	3090	17.5	1970-01-01	Europe
11	chevrolet chevelle concours (sw)	NaN	8	350.0	165.0	4142	11.5	1970-01-01	USA
12	ford torino (sw)	NaN	8	351.0	153.0	4034	11.0	1970-01-01	USA
13	plymouth satellite (sw)	NaN	8	383.0	175.0	4166	10.5	1970-01-01	USA
14	amc rebel sst (sw)	NaN	8	360.0	175.0	3850	11.0	1970-01-01	USA
17	ford mustang boss 302	NaN	8	302.0	140.0	3353	8.0	1970-01-01	USA
39	volkswagen super beetle 117	NaN	4	97.0	48.0	1978	20.0	1971-01-01	Europe
367	saab 900s	NaN	4	121.0	110.0	2800	15.4	1982-01-01	Europe

A common task is to remove all the rows which are missing any values. Here is a quick way to do that. (This is displaying a new DataFrame; the way it’s written, it does not change cars.)

cars.dropna()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
401	ford mustang gl	27.0	4	140.0	86.0	2790	15.6	1982-01-01	USA
402	vw pickup	44.0	4	97.0	52.0	2130	24.6	1982-01-01	Europe
403	dodge rampage	32.0	4	135.0	84.0	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	79.0	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	82.0	2720	19.4	1982-01-01	USA

392 rows × 9 columns

Here is another way to do the same thing, using any.

cars.isna()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...
401	False	False	False	False	False	False	False	False	False
402	False	False	False	False	False	False	False	False	False
403	False	False	False	False	False	False	False	False	False
404	False	False	False	False	False	False	False	False	False
405	False	False	False	False	False	False	False	False	False

406 rows × 9 columns

We want to identify rows where any value is missing. Since we are working one row at a time, we use axis=1. Here is a Boolean Series.

cars.isna().any(axis=1)

    False
    False
    False
    False
    False
       ...  
  False
  False
  False
  False
  False
Length: 406, dtype: bool

Here we are using Boolean indexing to get keep only those rows which are missing data.

cars[cars.isna().any(axis=1)]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
10	citroen ds-21 pallas	NaN	4	133.0	115.0	3090	17.5	1970-01-01	Europe
11	chevrolet chevelle concours (sw)	NaN	8	350.0	165.0	4142	11.5	1970-01-01	USA
12	ford torino (sw)	NaN	8	351.0	153.0	4034	11.0	1970-01-01	USA
13	plymouth satellite (sw)	NaN	8	383.0	175.0	4166	10.5	1970-01-01	USA
14	amc rebel sst (sw)	NaN	8	360.0	175.0	3850	11.0	1970-01-01	USA
17	ford mustang boss 302	NaN	8	302.0	140.0	3353	8.0	1970-01-01	USA
38	ford pinto	25.0	4	98.0	NaN	2046	19.0	1971-01-01	USA
39	volkswagen super beetle 117	NaN	4	97.0	48.0	1978	20.0	1971-01-01	Europe
133	ford maverick	21.0	6	200.0	NaN	2875	17.0	1974-01-01	USA
337	renault lecar deluxe	40.9	4	85.0	NaN	1835	17.3	1980-01-01	Europe
343	ford mustang cobra	23.6	4	140.0	NaN	2905	14.3	1980-01-01	USA
361	renault 18i	34.5	4	100.0	NaN	2320	15.8	1982-01-01	Europe
367	saab 900s	NaN	4	121.0	110.0	2800	15.4	1982-01-01	Europe
382	amc concord dl	23.0	4	151.0	NaN	3035	20.5	1982-01-01	USA

A more natural thing is to get rid of those rows. We can negate our Boolean Series by using ~.

# negation in pandas
cars[~cars.isna().any(axis=1)]

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA
...	...	...	...	...	...	...	...	...	...
401	ford mustang gl	27.0	4	140.0	86.0	2790	15.6	1982-01-01	USA
402	vw pickup	44.0	4	97.0	52.0	2130	24.6	1982-01-01	Europe
403	dodge rampage	32.0	4	135.0	84.0	2295	11.6	1982-01-01	USA
404	ford ranger	28.0	4	120.0	79.0	2625	18.6	1982-01-01	USA
405	chevy s-10	31.0	4	119.0	82.0	2720	19.4	1982-01-01	USA