{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"cell_id": "5aaa54a06acf4c5f97504a5ba7e94c05",
"deepnote_cell_height": 297.765625,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"# Working with missing data\n",
"\n",
"Today we covered a variety of shorter topics. The most important topic is how to work with missing data."
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "5d7bdd82-8296-4c49-b0d7-dc3eee542733",
"deepnote_cell_height": 144.78125,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Warm-up question\n",
" \n",
"* What can go inside the brackets? `df[???]`\n",
"\n",
"(Notice that there is no `loc` or `iloc` in this code.)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cell_id": "8c20572d905c43dab8a01e29069d4b84",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 0,
"execution_start": 1649102496321,
"source_hash": "9b82ee11",
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We import a fake (artificial) dataset."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"cell_id": "f433429e6ac04cfa8d4b82e46a8448f0",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 2,
"execution_start": 1649102529882,
"source_hash": "e8446f83",
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv(\"../data/sampleData.csv\")"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"cell_id": "4754670996c942718a0e2b2a9dc24883",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 32,
"execution_start": 1649102532504,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most basic way to use the syntax `df[???]` is to access a column of a DataFrame. The column is returned as a pandas Series."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"cell_id": "55a4c233f645461e994210f685991552",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 480,
"execution_start": 1649102598839,
"source_hash": "41abad46",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 -2.3\n",
"1 1.0\n",
"2 5.0\n",
"3 -4.0\n",
"4 2.8\n",
"5 6.0\n",
"6 1.0\n",
"7 -1.0\n",
"Name: B, dtype: float64"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"B\"]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"cell_id": "8449f6076436473399f7c57874338466",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1649102625302,
"source_hash": "71add696",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.series.Series"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df[\"B\"])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"cell_id": "acbfaf219be0417091b60cddbcf9c782",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1649102649500,
"source_hash": "f2e61a57",
"tags": []
},
"outputs": [],
"source": [
"s = df[\"B\"]"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"cell_id": "63ec392b2831477abdd30a3d36ac0f90",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 5,
"execution_start": 1649102663575,
"source_hash": "4a3a057a",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"RangeIndex(start=0, stop=8, step=1)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s.index"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"cell_id": "c171c80a28e049908956fe70adaad747",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1649102689384,
"source_hash": "68ae6f2",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"6.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s[5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas Series are like fancier versions of the dictionary data type in Python. Here we convert the pandas Series `s` into a Python dictionary `d`."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"cell_id": "9701a406501945af866a578e1d16742e",
"deepnote_cell_height": 154.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1649102732781,
"source_hash": "1e4e7cd4",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"{0: -2.3, 1: 1.0, 2: 5.0, 3: -4.0, 4: 2.8, 5: 6.0, 6: 1.0, 7: -1.0}"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d = dict(s)\n",
"d"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how similar the content is."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"cell_id": "ed64c6aafe864f9ca41193fc80700bdd",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1649102735876,
"source_hash": "465dd177",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 -2.3\n",
"1 1.0\n",
"2 5.0\n",
"3 -4.0\n",
"4 2.8\n",
"5 6.0\n",
"6 1.0\n",
"7 -1.0\n",
"Name: B, dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"s"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We access elements in a Python `dict` the same we access them in a pandas `Series`, using square brackets."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"cell_id": "586018bf91a241ab8cce7edc6c60d4a2",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1649102756456,
"source_hash": "359b1ee1",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"6.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d[5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a simpler example of a Python dictionary."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"cell_id": "59a0d48592e04f5b83c73bc0f2d515c4",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1649102848551,
"source_hash": "e8da924b",
"tags": []
},
"outputs": [],
"source": [
"d2 = {\"chris\":\"instructor\",\"yasmeen\":\"ta\"}"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"cell_id": "ec543a4fcd514aaaa544e8d46182c66e",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 23,
"execution_start": 1649102859777,
"source_hash": "9f05ebc5",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'instructor'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"d2[\"chris\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another use of the syntax `df[???]` is to provide a list of column names."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"cell_id": "80a8a4316fbc4322966349bcc351120e",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 14,
"execution_start": 1649102894176,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"cell_id": "6fe6f5085bb74967bbbf0105edd25235",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1649102927653,
"source_hash": "2044d700",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A C\n",
"0 3 -1\n",
"1 2.1 0\n",
"2 ? ?\n",
"3 12.3 4.1\n",
"4 7 3\n",
"5 5 ?\n",
"6 ? 1.8\n",
"7 2 3.14"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[\"A\",\"C\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can even repeat a column multiple times."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"cell_id": "8805b3d51c5d4d5aa6877c4015353731",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1649102948079,
"source_hash": "7cc23c65",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" C | \n",
" A | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -1 | \n",
" 3 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 0 | \n",
" 2.1 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" ? | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" 4.1 | \n",
" 12.3 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 3 | \n",
" 7 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" ? | \n",
" 5 | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.8 | \n",
" ? | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" 3.14 | \n",
" 2 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A C A\n",
"0 3 -1 3\n",
"1 2.1 0 2.1\n",
"2 ? ? ?\n",
"3 12.3 4.1 12.3\n",
"4 7 3 7\n",
"5 5 ? 5\n",
"6 ? 1.8 ?\n",
"7 2 3.14 2"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[\"A\",\"C\",\"A\"]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another use of the syntax `df[???]` is for what is called *slicing*."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"cell_id": "5fd2d1b0a62445d5a4769674607ebc4f",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 11,
"execution_start": 1649103038361,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we get the first 3 rows in the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"cell_id": "a6f42843eeee4211a17557f4767fe534",
"deepnote_cell_height": 329,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1649103047149,
"source_hash": "6d948350",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# slicing\n",
"df[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternate methods:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"cell_id": "b3962d21fff84eb1aa31bbc933fd0ada",
"deepnote_cell_height": 311,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1649103090080,
"source_hash": "3a1ea484",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head(3)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"cell_id": "1f985ee58f304c6291223e4d16d04077",
"deepnote_cell_height": 311,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 18,
"execution_start": 1649103105891,
"source_hash": "d3dff12e",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.iloc[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The last example we'll cover with the syntax `df[???]`, and the version most relevant for this last week's material, is Boolean indexing."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"cell_id": "9a29ab5f6c334b4390d24a93e5b537eb",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1649103189815,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following, we find all the rows where the \"A\" column is equal to `?`, but how does this actually work?"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"cell_id": "9d2944ff897c410fbabb44142e26da49",
"deepnote_cell_height": 288,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 5,
"execution_start": 1649103204902,
"source_hash": "ef7bfe16",
"tags": []
},
"outputs": [
{
"data": {
"application/vnd.deepnote.dataframe.v3+json": {
"column_count": 3,
"columns": [
{
"dtype": "object",
"name": "A",
"stats": {
"categories": [
{
"count": 2,
"name": "?"
}
],
"nan_count": 0,
"unique_count": 1
}
},
{
"dtype": "float64",
"name": "B",
"stats": {
"histogram": [
{
"bin_end": 1.4,
"bin_start": 1,
"count": 1
},
{
"bin_end": 1.8,
"bin_start": 1.4,
"count": 0
},
{
"bin_end": 2.2,
"bin_start": 1.8,
"count": 0
},
{
"bin_end": 2.6,
"bin_start": 2.2,
"count": 0
},
{
"bin_end": 3,
"bin_start": 2.6,
"count": 0
},
{
"bin_end": 3.4000000000000004,
"bin_start": 3,
"count": 0
},
{
"bin_end": 3.8000000000000003,
"bin_start": 3.4000000000000004,
"count": 0
},
{
"bin_end": 4.2,
"bin_start": 3.8000000000000003,
"count": 0
},
{
"bin_end": 4.6,
"bin_start": 4.2,
"count": 0
},
{
"bin_end": 5,
"bin_start": 4.6,
"count": 1
}
],
"max": "5.0",
"min": "1.0",
"nan_count": 0,
"unique_count": 2
}
},
{
"dtype": "object",
"name": "C",
"stats": {
"categories": [
{
"count": 1,
"name": "?"
},
{
"count": 1,
"name": "1.8"
}
],
"nan_count": 0,
"unique_count": 2
}
},
{
"dtype": "int64",
"name": "_deepnote_index_column"
}
],
"row_count": 2,
"rows": [
{
"A": "?",
"B": 5,
"C": "?",
"_deepnote_index_column": 2
},
{
"A": "?",
"B": 1,
"C": "1.8",
"_deepnote_index_column": 6
}
]
},
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"2 ? 5.0 ?\n",
"6 ? 1.0 1.8"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Boolean indexing\n",
"df[df[\"A\"]==\"?\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look more closely at what is going inside the square brackets. This is an example of a Boolean Series."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"cell_id": "12e2c29f0cb8455ab29e39941ba91552",
"deepnote_cell_height": 289.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1649103263654,
"source_hash": "246a5e1f",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 False\n",
"5 False\n",
"6 True\n",
"7 False\n",
"Name: A, dtype: bool"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Boolean Series\n",
"df[\"A\"]==\"?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Think of the Boolean Series as telling Python which rows to keep. In this case, only rows 2 and 6 are kept."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"cell_id": "0177c4fb541c4ea08c1415f0a8ab4622",
"deepnote_cell_height": 270,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1649103315632,
"source_hash": "ef7bfe16",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"2 ? 5.0 ?\n",
"6 ? 1.0 1.8"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[df[\"A\"]==\"?\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Usually this will arise from a Boolean Series, but it's also fine to use a list of `True` and `False` values. Here we get rid of the row labeled 3."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"cell_id": "fc02e96706914e4ab5c86ff594c3f6b2",
"deepnote_cell_height": 475,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1649103342955,
"source_hash": "23af1fe9",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[True, True, True, False, True, True, True, True]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of making a small DataFrame from a Python dictionary. Here the keys of the dictionary are the names of the columns, and the values of the dictionary are the values in the rows."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"cell_id": "55f89463c67445e9bfa5e7198e4663cc",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 2,
"execution_start": 1649103444009,
"source_hash": "76ee9abc",
"tags": []
},
"outputs": [],
"source": [
"df3 = pd.DataFrame({\"col0\":[3,1,4],\"col1\":[2,2,5]})"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"cell_id": "7f3786671475450bbbf9d6970312ebe6",
"deepnote_cell_height": 311,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 6,
"execution_start": 1649103445630,
"source_hash": "816a197e",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col0 | \n",
" col1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col0 col1\n",
"0 3 2\n",
"1 1 2\n",
"2 4 5"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following does not work because the inner list does not have length matching the number of rows. (**Warning**. It can be confusing, because if you are providing strings rather than Boolean values, then the length should match the number of columns!)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"cell_id": "18fe0ab283d6469faa7b7f05561c0f58",
"deepnote_cell_height": 144.1875,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 106,
"execution_start": 1649103481464,
"source_hash": "3fd56677",
"tags": []
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Item wrong length 2 instead of 3.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_44741/2048590108.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf3\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3447\u001b[0m \u001b[0;31m# Do we have a (boolean) 1d indexer?\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3448\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcom\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mis_bool_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3449\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_bool_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3450\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3451\u001b[0m \u001b[0;31m# We are left with two options: a single key, and a collection of keys,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m_getitem_bool_array\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3495\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m!=\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mindex\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3496\u001b[0m raise ValueError(\n\u001b[0;32m-> 3497\u001b[0;31m \u001b[0;34mf\"Item wrong length {len(key)} instead of {len(self.index)}.\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3498\u001b[0m )\n\u001b[1;32m 3499\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Item wrong length 2 instead of 3."
]
}
],
"source": [
"df3[[True, False]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we keep the first and last rows."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"cell_id": "2cbe15565df846bf99a9177badcbd566",
"deepnote_cell_height": 270,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1649103510578,
"source_hash": "d20d3a35",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col0 | \n",
" col1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col0 col1\n",
"0 3 2\n",
"2 4 5"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[[True, False,True]]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"cell_id": "80400f135bb5425f9fac6c84ba213493",
"deepnote_cell_height": 311,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 46,
"execution_start": 1649103530162,
"source_hash": "816a197e",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col0 | \n",
" col1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 1 | \n",
" 1 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col0 col1\n",
"0 3 2\n",
"1 1 2\n",
"2 4 5"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That example is a little artificial, but it is the same reasoning that's used with Boolean indexing. Here we make a Boolean Series."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"cell_id": "6d464049be164180ad77fad3f76840c5",
"deepnote_cell_height": 175.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 14,
"execution_start": 1649103547860,
"source_hash": "d31e785f",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 True\n",
"Name: col0, dtype: bool"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df3[\"col0\"] > 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now we provide that Boolean Series inside the square brackets. This tells pandas which rows to keep. We are keeping the rows which have a value strictly greater than 2 in \"col0\"."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"cell_id": "497bc75f71ef48a7a277440727f47f40",
"deepnote_cell_height": 288,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1649103560707,
"source_hash": "ca239ea5",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" col0 | \n",
" col1 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" 2 | \n",
"
\n",
" \n",
" 2 | \n",
" 4 | \n",
" 5 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" col0 col1\n",
"0 3 2\n",
"2 4 5"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Boolean indexing\n",
"df3[df3[\"col0\"] > 2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Logic in pandas\n",
"\n",
"Let's return to our original DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"cell_id": "1336a409e6114ba4a950749c3f0edd19",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 15,
"execution_start": 1649103616726,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Say we want to know which rows have a \"?\" anywhere (in any of the columns). Here is one solution, but it is not DRY. The vertical bar `|` denotes \"or\" in pandas."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"cell_id": "1335f38034414a63bcef7f1e07d05375",
"deepnote_cell_height": 307.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 5,
"execution_start": 1649103666382,
"source_hash": "c5b2323d",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 False\n",
"5 True\n",
"6 True\n",
"7 False\n",
"dtype: bool"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df[\"A\"]==\"?\") | (df[\"B\"]==\"?\") | (df[\"C\"]==\"?\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Aside**: The parentheses are important."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"cell_id": "3015c57058de4c18a1063acb45be0a95",
"deepnote_cell_height": 144.1875,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1649103753888,
"source_hash": "8417a5f9",
"tags": [
"output_scroll"
]
},
"outputs": [
{
"ename": "TypeError",
"evalue": "Cannot perform 'ror_' with a dtyped [float64] array and scalar of type [bool]",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py\u001b[0m in \u001b[0;36mna_logical_op\u001b[0;34m(x, y, op)\u001b[0m\n\u001b[1;32m 301\u001b[0m \u001b[0;31m# (xint or xbool) and (yint or bool)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 302\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 303\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/roperator.py\u001b[0m in \u001b[0;36mror_\u001b[0;34m(left, right)\u001b[0m\n\u001b[1;32m 55\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mror_\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mleft\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mright\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 56\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0moperator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mor_\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mright\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mleft\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 57\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: ufunc 'bitwise_or' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''",
"\nDuring handling of the above exception, another exception occurred:\n",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py\u001b[0m in \u001b[0;36mna_logical_op\u001b[0;34m(x, y, op)\u001b[0m\n\u001b[1;32m 315\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 316\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlibops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mscalar_binop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 317\u001b[0m except (\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/ops.pyx\u001b[0m in \u001b[0;36mpandas._libs.ops.scalar_binop\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Buffer dtype mismatch, expected 'Python object' but got 'double'",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_44741/2004769874.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"A\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0;34m\"?\"\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"B\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0;34m\"?\"\u001b[0m \u001b[0;34m|\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"C\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m==\u001b[0m\u001b[0;34m\"?\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/common.py\u001b[0m in \u001b[0;36mnew_method\u001b[0;34m(self, other)\u001b[0m\n\u001b[1;32m 67\u001b[0m \u001b[0mother\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mitem_from_zerodim\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 68\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 69\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mmethod\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 70\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 71\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mnew_method\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/arraylike.py\u001b[0m in \u001b[0;36m__ror__\u001b[0;34m(self, other)\u001b[0m\n\u001b[1;32m 72\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0munpack_zerodim_and_defer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"__ror__\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 73\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__ror__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mother\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 74\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_logical_method\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mroperator\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mror_\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 75\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 76\u001b[0m \u001b[0;34m@\u001b[0m\u001b[0munpack_zerodim_and_defer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"__xor__\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36m_logical_method\u001b[0;34m(self, other, op)\u001b[0m\n\u001b[1;32m 5511\u001b[0m \u001b[0mrvalues\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mextract_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mother\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mextract_numpy\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mextract_range\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5512\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 5513\u001b[0;31m \u001b[0mres_values\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlogical_op\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 5514\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_construct_result\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mres_values\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mres_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 5515\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py\u001b[0m in \u001b[0;36mlogical_op\u001b[0;34m(left, right, op)\u001b[0m\n\u001b[1;32m 390\u001b[0m \u001b[0mfiller\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfill_int\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_self_int_dtype\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mis_other_int_dtype\u001b[0m \u001b[0;32melse\u001b[0m \u001b[0mfill_bool\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 391\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 392\u001b[0;31m \u001b[0mres_values\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mna_logical_op\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 393\u001b[0m \u001b[0;31m# error: Cannot call function of unknown type\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 394\u001b[0m \u001b[0mres_values\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfiller\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mres_values\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# type: ignore[operator]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/ops/array_ops.py\u001b[0m in \u001b[0;36mna_logical_op\u001b[0;34m(x, y, op)\u001b[0m\n\u001b[1;32m 326\u001b[0m \u001b[0;34mf\"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 327\u001b[0m \u001b[0;34mf\"and scalar of type [{typ}]\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 328\u001b[0;31m ) from err\n\u001b[0m\u001b[1;32m 329\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 330\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreshape\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: Cannot perform 'ror_' with a dtyped [float64] array and scalar of type [bool]"
]
}
],
"source": [
"df[\"A\"]==\"?\" | df[\"B\"]==\"?\" | df[\"C\"]==\"?\" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take a closer look at how this syntax works."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"cell_id": "29e7f6ac17fd44b3883a525a6bf4d43f",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1649103858923,
"source_hash": "c5b2323d",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 False\n",
"5 True\n",
"6 True\n",
"7 False\n",
"dtype: bool"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df[\"A\"]==\"?\") | (df[\"B\"]==\"?\") | (df[\"C\"]==\"?\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is an example of a Boolean DataFrame. In each position, it says whether the DataFrame `df` had the value \"?\"."
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"cell_id": "d3a47d273a414326a9bddcc9af46a091",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1649103867483,
"source_hash": "c01b8366",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" True | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 3 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 5 | \n",
" False | \n",
" False | \n",
" True | \n",
"
\n",
" \n",
" 6 | \n",
" True | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 7 | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 False False False\n",
"1 False False False\n",
"2 True False True\n",
"3 False False False\n",
"4 False False False\n",
"5 False False True\n",
"6 True False False\n",
"7 False False False"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df==\"?\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a method to find if there are any `True` values along a row. The keyword argument `axis=1` is telling pandas to look one row at a time. Notice how this produces the same Boolean Series as the non-Dry version above, with the vertical bars `|`."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"cell_id": "84f9f3f1374d4d0ea17b59cf93c80a6e",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1649103915327,
"source_hash": "38eeb1d6",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 False\n",
"5 True\n",
"6 True\n",
"7 False\n",
"dtype: bool"
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# DRY code\n",
"(df==\"?\").any(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we instead use `axis=0`, then we are determining if there are any `True` values in each column, one column at a time."
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"cell_id": "a323399c5f96404a9e25489a2b9748bc",
"deepnote_cell_height": 175.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1649103941642,
"source_hash": "b536adb1",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"A True\n",
"B False\n",
"C True\n",
"dtype: bool"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df==\"?\").any(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use `all` instead of `any`, if we want to know if all of the values are `True`."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"cell_id": "e03c20de434b453ea5202096fb9f20b6",
"deepnote_cell_height": 175.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1649104132497,
"source_hash": "c38979c3",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"A False\n",
"B False\n",
"C False\n",
"dtype: bool"
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(df==\"?\").all(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problems with this DataFrame\n",
"\n",
"The \"?\"s in this DataFrame cause a lot of problems. They are supposed to represent missing values, but pandas doesn't know that."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"cell_id": "1d4f446ac16543f49d943403a5f8dc35",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1649104203165,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3 | \n",
" -2.3 | \n",
" -1 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0 | \n",
"
\n",
" \n",
" 2 | \n",
" ? | \n",
" 5.0 | \n",
" ? | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.1 | \n",
"
\n",
" \n",
" 4 | \n",
" 7 | \n",
" 2.8 | \n",
" 3 | \n",
"
\n",
" \n",
" 5 | \n",
" 5 | \n",
" 6.0 | \n",
" ? | \n",
"
\n",
" \n",
" 6 | \n",
" ? | \n",
" 1.0 | \n",
" 1.8 | \n",
"
\n",
" \n",
" 7 | \n",
" 2 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3 -2.3 -1\n",
"1 2.1 1.0 0\n",
"2 ? 5.0 ?\n",
"3 12.3 -4.0 4.1\n",
"4 7 2.8 3\n",
"5 5 6.0 ?\n",
"6 ? 1.0 1.8\n",
"7 2 -1.0 3.14"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values in column \"A\" look like numbers, but because of the \"?\" values, pandas is treating them as strings. For example, if we try to use the `sum` method, the values get concatenated."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"cell_id": "cc2b970d5f4c4028b190a1087105dbb6",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1649104232830,
"source_hash": "3299dadc",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'32.1?12.375?2'"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[\"A\"].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The \"B\" column is better because it doesn't have any \"?\"s in it."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"cell_id": "046aabfac03e4f20b4d5d24dc50cca5d",
"deepnote_cell_height": 175.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 16,
"execution_start": 1649104262434,
"source_hash": "52430027",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"A object\n",
"B float64\n",
"C object\n",
"dtype: object"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is how `+` works when the two objects are strings in Python."
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"cell_id": "15f63b71ad2b49d69b65c3f4bfa58c2c",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1649104281757,
"source_hash": "470eda09",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"'32'"
]
},
"execution_count": 52,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"\"3\"+\"2\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `describe` method also doesn't work in two of the three columns, because pandas thinks they are strings."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"cell_id": "a07b3aee862d486ebcf6f034bac2d1ec",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 17,
"execution_start": 1649104303499,
"source_hash": "f88152d9",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" B | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 8.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 1.062500 | \n",
"
\n",
" \n",
" std | \n",
" 3.466137 | \n",
"
\n",
" \n",
" min | \n",
" -4.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" -1.325000 | \n",
"
\n",
" \n",
" 50% | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 75% | \n",
" 3.350000 | \n",
"
\n",
" \n",
" max | \n",
" 6.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" B\n",
"count 8.000000\n",
"mean 1.062500\n",
"std 3.466137\n",
"min -4.000000\n",
"25% -1.325000\n",
"50% 1.000000\n",
"75% 3.350000\n",
"max 6.000000"
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often missing values are handled automatically by pandas. In this case, they are not handled automatically, so we tell pandas explicitly when importing that the missing values are denoted by \"?\"."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"cell_id": "c90045a79e7048d6952554c9fabc40a9",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1649104390159,
"source_hash": "647887ff",
"tags": []
},
"outputs": [],
"source": [
"df4 = pd.read_csv(\"../data/sampleData.csv\", na_values=\"?\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how those \"?\" values have gotten represented as `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"cell_id": "ec9244eab7324f7f8333a6e1c8a8afd3",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 17,
"execution_start": 1649104391685,
"source_hash": "786d530e",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 3.0 | \n",
" -2.3 | \n",
" -1.00 | \n",
"
\n",
" \n",
" 1 | \n",
" 2.1 | \n",
" 1.0 | \n",
" 0.00 | \n",
"
\n",
" \n",
" 2 | \n",
" NaN | \n",
" 5.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 3 | \n",
" 12.3 | \n",
" -4.0 | \n",
" 4.10 | \n",
"
\n",
" \n",
" 4 | \n",
" 7.0 | \n",
" 2.8 | \n",
" 3.00 | \n",
"
\n",
" \n",
" 5 | \n",
" 5.0 | \n",
" 6.0 | \n",
" NaN | \n",
"
\n",
" \n",
" 6 | \n",
" NaN | \n",
" 1.0 | \n",
" 1.80 | \n",
"
\n",
" \n",
" 7 | \n",
" 2.0 | \n",
" -1.0 | \n",
" 3.14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"0 3.0 -2.3 -1.00\n",
"1 2.1 1.0 0.00\n",
"2 NaN 5.0 NaN\n",
"3 12.3 -4.0 4.10\n",
"4 7.0 2.8 3.00\n",
"5 5.0 6.0 NaN\n",
"6 NaN 1.0 1.80\n",
"7 2.0 -1.0 3.14"
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now `sum` works correctly."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"cell_id": "8b176dbbff794838997e1472525191f8",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1649104414272,
"source_hash": "f571903f",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"31.4"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4[\"A\"].sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the `axis` argument the same as before. Here `axis=0` is saying to add up the columns one at a time."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"cell_id": "80696626331545ff93cee70d354ed80e",
"deepnote_cell_height": 175.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 476,
"execution_start": 1649104443308,
"source_hash": "a7ba9e58",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"A 31.40\n",
"B 8.50\n",
"C 11.04\n",
"dtype: float64"
]
},
"execution_count": 57,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.sum(axis=0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are adding the values in each row, one row at a time, because of the `axis=1` argument."
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"cell_id": "4dd4163aa634456b924c32df3dd10e38",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1649104488854,
"source_hash": "68513aa5",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 -0.30\n",
"1 3.10\n",
"2 5.00\n",
"3 12.40\n",
"4 12.80\n",
"5 11.00\n",
"6 2.80\n",
"7 4.14\n",
"dtype: float64"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.sum(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are lots of different methods which behave similarly. For example, here we compute the `median` of each row."
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"cell_id": "91d85a84fab6424aba4726d39494b770",
"deepnote_cell_height": 271.78125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
174.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1649104511372,
"source_hash": "a64f4a4b",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 -1.0\n",
"1 1.0\n",
"2 5.0\n",
"3 4.1\n",
"4 3.0\n",
"5 5.5\n",
"6 1.4\n",
"7 2.0\n",
"dtype: float64"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.median(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## NaN\n",
"\n",
"Let's look more closely at these missing values."
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"cell_id": "1e3bc538203c490b83418bf8a058438a",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1649104645761,
"source_hash": "389fcc9a",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df4.loc[2,\"A\"]"
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"cell_id": "4fb243f15f294c078c387d5346775514",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 13,
"execution_start": 1649104656229,
"source_hash": "96b8035c",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"numpy.float64"
]
},
"execution_count": 62,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(df4.loc[2,\"A\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This constant is part of the important NumPy library."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"cell_id": "ae34a79ae0f7428989f6b83df9d16c82",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 0,
"execution_start": 1649104709748,
"source_hash": "c2602aa8",
"tags": []
},
"outputs": [],
"source": [
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we want to make this \"not a number\" value ourselves, we can use the following syntax."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"cell_id": "3f4cf77f3960412d81866259f7380a80",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 17,
"execution_start": 1649104715353,
"source_hash": "e200addd",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"nan"
]
},
"execution_count": 65,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.nan"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing values in a real dataset\n",
"\n",
"So far today, all of our examples came from a small artificial dataset. Here we import a slightly bigger (still relatively small) dataset with real data in it. Notice how we don't need to specify the `na_values` keyword argument, because pandas correctly identifies the missing data automatically."
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"cell_id": "c70cd2ecd2b046c78641fa24bec7e42a",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1649104825688,
"source_hash": "cef682a2",
"tags": []
},
"outputs": [],
"source": [
"cars = pd.read_csv(\"../data/cars.csv\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using the `info` method, we can see that, for example, there are 8 null values (i.e., missing values) in the \"Miles_per_Gallon column."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"cell_id": "af03aec7146d4562a8e22b5f458dc93f",
"deepnote_cell_height": 409.828125,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 13,
"execution_start": 1649104836117,
"source_hash": "53e91b83",
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 406 entries, 0 to 405\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Name 406 non-null object \n",
" 1 Miles_per_Gallon 398 non-null float64\n",
" 2 Cylinders 406 non-null int64 \n",
" 3 Displacement 406 non-null float64\n",
" 4 Horsepower 400 non-null float64\n",
" 5 Weight_in_lbs 406 non-null int64 \n",
" 6 Acceleration 406 non-null float64\n",
" 7 Year 406 non-null object \n",
" 8 Origin 406 non-null object \n",
"dtypes: float64(4), int64(2), object(3)\n",
"memory usage: 28.7+ KB\n"
]
}
],
"source": [
"cars.info()"
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"cell_id": "a9dcd462dd8242d0bf9d0768b7391301",
"deepnote_cell_height": 329.375,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
232.375
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 6,
"execution_start": 1649104886482,
"source_hash": "b2d762b9",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"401 False\n",
"402 False\n",
"403 False\n",
"404 False\n",
"405 False\n",
"Name: Miles_per_Gallon, Length: 406, dtype: bool"
]
},
"execution_count": 69,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars[\"Miles_per_Gallon\"].isna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a quick way to count those missing values."
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"cell_id": "e74c2d32fef74d8ea61a3d38bb509e92",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1649104900782,
"source_hash": "51f9069e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"8"
]
},
"execution_count": 70,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars[\"Miles_per_Gallon\"].isna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using Boolean indexing to find the rows where the \"Miles_per_Gallon\" value is missing."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"cell_id": "9913547c6b1144849ee86bf03d02a394",
"deepnote_cell_height": 516,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1649104923859,
"source_hash": "c6185c9a",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Miles_per_Gallon | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" citroen ds-21 pallas | \n",
" NaN | \n",
" 4 | \n",
" 133.0 | \n",
" 115.0 | \n",
" 3090 | \n",
" 17.5 | \n",
" 1970-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 11 | \n",
" chevrolet chevelle concours (sw) | \n",
" NaN | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 4142 | \n",
" 11.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 12 | \n",
" ford torino (sw) | \n",
" NaN | \n",
" 8 | \n",
" 351.0 | \n",
" 153.0 | \n",
" 4034 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 13 | \n",
" plymouth satellite (sw) | \n",
" NaN | \n",
" 8 | \n",
" 383.0 | \n",
" 175.0 | \n",
" 4166 | \n",
" 10.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 14 | \n",
" amc rebel sst (sw) | \n",
" NaN | \n",
" 8 | \n",
" 360.0 | \n",
" 175.0 | \n",
" 3850 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 17 | \n",
" ford mustang boss 302 | \n",
" NaN | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3353 | \n",
" 8.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 39 | \n",
" volkswagen super beetle 117 | \n",
" NaN | \n",
" 4 | \n",
" 97.0 | \n",
" 48.0 | \n",
" 1978 | \n",
" 20.0 | \n",
" 1971-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 367 | \n",
" saab 900s | \n",
" NaN | \n",
" 4 | \n",
" 121.0 | \n",
" 110.0 | \n",
" 2800 | \n",
" 15.4 | \n",
" 1982-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Miles_per_Gallon Cylinders \\\n",
"10 citroen ds-21 pallas NaN 4 \n",
"11 chevrolet chevelle concours (sw) NaN 8 \n",
"12 ford torino (sw) NaN 8 \n",
"13 plymouth satellite (sw) NaN 8 \n",
"14 amc rebel sst (sw) NaN 8 \n",
"17 ford mustang boss 302 NaN 8 \n",
"39 volkswagen super beetle 117 NaN 4 \n",
"367 saab 900s NaN 4 \n",
"\n",
" Displacement Horsepower Weight_in_lbs Acceleration Year Origin \n",
"10 133.0 115.0 3090 17.5 1970-01-01 Europe \n",
"11 350.0 165.0 4142 11.5 1970-01-01 USA \n",
"12 351.0 153.0 4034 11.0 1970-01-01 USA \n",
"13 383.0 175.0 4166 10.5 1970-01-01 USA \n",
"14 360.0 175.0 3850 11.0 1970-01-01 USA \n",
"17 302.0 140.0 3353 8.0 1970-01-01 USA \n",
"39 97.0 48.0 1978 20.0 1971-01-01 Europe \n",
"367 121.0 110.0 2800 15.4 1982-01-01 Europe "
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars[cars[\"Miles_per_Gallon\"].isna()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common task is to remove all the rows which are missing any values. Here is a quick way to do that. (This is displaying a new DataFrame; the way it's written, it does not change `cars`.)"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"cell_id": "53d1abe86db34f7fbb01fbe19702f488",
"deepnote_cell_height": 598,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 27,
"execution_start": 1649105008028,
"source_hash": "8f2e83b8",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Miles_per_Gallon | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chevrolet chevelle malibu | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 1 | \n",
" buick skylark 320 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 2 | \n",
" plymouth satellite | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 3 | \n",
" amc rebel sst | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 4 | \n",
" ford torino | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 401 | \n",
" ford mustang gl | \n",
" 27.0 | \n",
" 4 | \n",
" 140.0 | \n",
" 86.0 | \n",
" 2790 | \n",
" 15.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 402 | \n",
" vw pickup | \n",
" 44.0 | \n",
" 4 | \n",
" 97.0 | \n",
" 52.0 | \n",
" 2130 | \n",
" 24.6 | \n",
" 1982-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 403 | \n",
" dodge rampage | \n",
" 32.0 | \n",
" 4 | \n",
" 135.0 | \n",
" 84.0 | \n",
" 2295 | \n",
" 11.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 404 | \n",
" ford ranger | \n",
" 28.0 | \n",
" 4 | \n",
" 120.0 | \n",
" 79.0 | \n",
" 2625 | \n",
" 18.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 405 | \n",
" chevy s-10 | \n",
" 31.0 | \n",
" 4 | \n",
" 119.0 | \n",
" 82.0 | \n",
" 2720 | \n",
" 19.4 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
"
\n",
"
392 rows × 9 columns
\n",
"
"
],
"text/plain": [
" Name Miles_per_Gallon Cylinders Displacement \\\n",
"0 chevrolet chevelle malibu 18.0 8 307.0 \n",
"1 buick skylark 320 15.0 8 350.0 \n",
"2 plymouth satellite 18.0 8 318.0 \n",
"3 amc rebel sst 16.0 8 304.0 \n",
"4 ford torino 17.0 8 302.0 \n",
".. ... ... ... ... \n",
"401 ford mustang gl 27.0 4 140.0 \n",
"402 vw pickup 44.0 4 97.0 \n",
"403 dodge rampage 32.0 4 135.0 \n",
"404 ford ranger 28.0 4 120.0 \n",
"405 chevy s-10 31.0 4 119.0 \n",
"\n",
" Horsepower Weight_in_lbs Acceleration Year Origin \n",
"0 130.0 3504 12.0 1970-01-01 USA \n",
"1 165.0 3693 11.5 1970-01-01 USA \n",
"2 150.0 3436 11.0 1970-01-01 USA \n",
"3 150.0 3433 12.0 1970-01-01 USA \n",
"4 140.0 3449 10.5 1970-01-01 USA \n",
".. ... ... ... ... ... \n",
"401 86.0 2790 15.6 1982-01-01 USA \n",
"402 52.0 2130 24.6 1982-01-01 Europe \n",
"403 84.0 2295 11.6 1982-01-01 USA \n",
"404 79.0 2625 18.6 1982-01-01 USA \n",
"405 82.0 2720 19.4 1982-01-01 USA \n",
"\n",
"[392 rows x 9 columns]"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.dropna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is another way to do the same thing, using `any`."
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"cell_id": "efcfcfcb195841f0a7064cfeeb269239",
"deepnote_cell_height": 598,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1649105033537,
"source_hash": "9102c47e",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Miles_per_Gallon | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 1 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 2 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 3 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 4 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 401 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 402 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 403 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 404 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
" 405 | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
" False | \n",
"
\n",
" \n",
"
\n",
"
406 rows × 9 columns
\n",
"
"
],
"text/plain": [
" Name Miles_per_Gallon Cylinders Displacement Horsepower \\\n",
"0 False False False False False \n",
"1 False False False False False \n",
"2 False False False False False \n",
"3 False False False False False \n",
"4 False False False False False \n",
".. ... ... ... ... ... \n",
"401 False False False False False \n",
"402 False False False False False \n",
"403 False False False False False \n",
"404 False False False False False \n",
"405 False False False False False \n",
"\n",
" Weight_in_lbs Acceleration Year Origin \n",
"0 False False False False \n",
"1 False False False False \n",
"2 False False False False \n",
"3 False False False False \n",
"4 False False False False \n",
".. ... ... ... ... \n",
"401 False False False False \n",
"402 False False False False \n",
"403 False False False False \n",
"404 False False False False \n",
"405 False False False False \n",
"\n",
"[406 rows x 9 columns]"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.isna()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want to identify rows where any value is missing. Since we are working one row at a time, we use `axis=1`. Here is a Boolean Series."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"cell_id": "f1a4f5e5c7484f7b8905aa215bf28b9c",
"deepnote_cell_height": 329.375,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
232.375
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1649105078776,
"source_hash": "3ad1372b",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 False\n",
"3 False\n",
"4 False\n",
" ... \n",
"401 False\n",
"402 False\n",
"403 False\n",
"404 False\n",
"405 False\n",
"Length: 406, dtype: bool"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars.isna().any(axis=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we are using Boolean indexing to get keep only those rows which are missing data."
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"cell_id": "482c12d8ebe045ca8a8063a3d3e12445",
"deepnote_cell_height": 598,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 46,
"execution_start": 1649105094095,
"source_hash": "8dabe1a0",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Miles_per_Gallon | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
"
\n",
" \n",
" \n",
" \n",
" 10 | \n",
" citroen ds-21 pallas | \n",
" NaN | \n",
" 4 | \n",
" 133.0 | \n",
" 115.0 | \n",
" 3090 | \n",
" 17.5 | \n",
" 1970-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 11 | \n",
" chevrolet chevelle concours (sw) | \n",
" NaN | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 4142 | \n",
" 11.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 12 | \n",
" ford torino (sw) | \n",
" NaN | \n",
" 8 | \n",
" 351.0 | \n",
" 153.0 | \n",
" 4034 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 13 | \n",
" plymouth satellite (sw) | \n",
" NaN | \n",
" 8 | \n",
" 383.0 | \n",
" 175.0 | \n",
" 4166 | \n",
" 10.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 14 | \n",
" amc rebel sst (sw) | \n",
" NaN | \n",
" 8 | \n",
" 360.0 | \n",
" 175.0 | \n",
" 3850 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 17 | \n",
" ford mustang boss 302 | \n",
" NaN | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3353 | \n",
" 8.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 38 | \n",
" ford pinto | \n",
" 25.0 | \n",
" 4 | \n",
" 98.0 | \n",
" NaN | \n",
" 2046 | \n",
" 19.0 | \n",
" 1971-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 39 | \n",
" volkswagen super beetle 117 | \n",
" NaN | \n",
" 4 | \n",
" 97.0 | \n",
" 48.0 | \n",
" 1978 | \n",
" 20.0 | \n",
" 1971-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 133 | \n",
" ford maverick | \n",
" 21.0 | \n",
" 6 | \n",
" 200.0 | \n",
" NaN | \n",
" 2875 | \n",
" 17.0 | \n",
" 1974-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 337 | \n",
" renault lecar deluxe | \n",
" 40.9 | \n",
" 4 | \n",
" 85.0 | \n",
" NaN | \n",
" 1835 | \n",
" 17.3 | \n",
" 1980-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 343 | \n",
" ford mustang cobra | \n",
" 23.6 | \n",
" 4 | \n",
" 140.0 | \n",
" NaN | \n",
" 2905 | \n",
" 14.3 | \n",
" 1980-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 361 | \n",
" renault 18i | \n",
" 34.5 | \n",
" 4 | \n",
" 100.0 | \n",
" NaN | \n",
" 2320 | \n",
" 15.8 | \n",
" 1982-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 367 | \n",
" saab 900s | \n",
" NaN | \n",
" 4 | \n",
" 121.0 | \n",
" 110.0 | \n",
" 2800 | \n",
" 15.4 | \n",
" 1982-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 382 | \n",
" amc concord dl | \n",
" 23.0 | \n",
" 4 | \n",
" 151.0 | \n",
" NaN | \n",
" 3035 | \n",
" 20.5 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name Miles_per_Gallon Cylinders \\\n",
"10 citroen ds-21 pallas NaN 4 \n",
"11 chevrolet chevelle concours (sw) NaN 8 \n",
"12 ford torino (sw) NaN 8 \n",
"13 plymouth satellite (sw) NaN 8 \n",
"14 amc rebel sst (sw) NaN 8 \n",
"17 ford mustang boss 302 NaN 8 \n",
"38 ford pinto 25.0 4 \n",
"39 volkswagen super beetle 117 NaN 4 \n",
"133 ford maverick 21.0 6 \n",
"337 renault lecar deluxe 40.9 4 \n",
"343 ford mustang cobra 23.6 4 \n",
"361 renault 18i 34.5 4 \n",
"367 saab 900s NaN 4 \n",
"382 amc concord dl 23.0 4 \n",
"\n",
" Displacement Horsepower Weight_in_lbs Acceleration Year Origin \n",
"10 133.0 115.0 3090 17.5 1970-01-01 Europe \n",
"11 350.0 165.0 4142 11.5 1970-01-01 USA \n",
"12 351.0 153.0 4034 11.0 1970-01-01 USA \n",
"13 383.0 175.0 4166 10.5 1970-01-01 USA \n",
"14 360.0 175.0 3850 11.0 1970-01-01 USA \n",
"17 302.0 140.0 3353 8.0 1970-01-01 USA \n",
"38 98.0 NaN 2046 19.0 1971-01-01 USA \n",
"39 97.0 48.0 1978 20.0 1971-01-01 Europe \n",
"133 200.0 NaN 2875 17.0 1974-01-01 USA \n",
"337 85.0 NaN 1835 17.3 1980-01-01 Europe \n",
"343 140.0 NaN 2905 14.3 1980-01-01 USA \n",
"361 100.0 NaN 2320 15.8 1982-01-01 Europe \n",
"367 121.0 110.0 2800 15.4 1982-01-01 Europe \n",
"382 151.0 NaN 3035 20.5 1982-01-01 USA "
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cars[cars.isna().any(axis=1)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A more natural thing is to get rid of those rows. We can negate our Boolean Series by using `~`."
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"cell_id": "afcad5bc940d4c718a2de164db65ca90",
"deepnote_cell_height": 616,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 48,
"execution_start": 1649105137634,
"source_hash": "3743334f",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" Miles_per_Gallon | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chevrolet chevelle malibu | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 1 | \n",
" buick skylark 320 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 2 | \n",
" plymouth satellite | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 3 | \n",
" amc rebel sst | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 4 | \n",
" ford torino | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
" ... | \n",
"
\n",
" \n",
" 401 | \n",
" ford mustang gl | \n",
" 27.0 | \n",
" 4 | \n",
" 140.0 | \n",
" 86.0 | \n",
" 2790 | \n",
" 15.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 402 | \n",
" vw pickup | \n",
" 44.0 | \n",
" 4 | \n",
" 97.0 | \n",
" 52.0 | \n",
" 2130 | \n",
" 24.6 | \n",
" 1982-01-01 | \n",
" Europe | \n",
"
\n",
" \n",
" 403 | \n",
" dodge rampage | \n",
" 32.0 | \n",
" 4 | \n",
" 135.0 | \n",
" 84.0 | \n",
" 2295 | \n",
" 11.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 404 | \n",
" ford ranger | \n",
" 28.0 | \n",
" 4 | \n",
" 120.0 | \n",
" 79.0 | \n",
" 2625 | \n",
" 18.6 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
" 405 | \n",
" chevy s-10 | \n",
" 31.0 | \n",
" 4 | \n",
" 119.0 | \n",
" 82.0 | \n",
" 2720 | \n",
" 19.4 | \n",
" 1982-01-01 | \n",
" USA | \n",
"
\n",
" \n",
"
\n",
"
392 rows × 9 columns
\n",
"
"
],
"text/plain": [
" Name Miles_per_Gallon Cylinders Displacement \\\n",
"0 chevrolet chevelle malibu 18.0 8 307.0 \n",
"1 buick skylark 320 15.0 8 350.0 \n",
"2 plymouth satellite 18.0 8 318.0 \n",
"3 amc rebel sst 16.0 8 304.0 \n",
"4 ford torino 17.0 8 302.0 \n",
".. ... ... ... ... \n",
"401 ford mustang gl 27.0 4 140.0 \n",
"402 vw pickup 44.0 4 97.0 \n",
"403 dodge rampage 32.0 4 135.0 \n",
"404 ford ranger 28.0 4 120.0 \n",
"405 chevy s-10 31.0 4 119.0 \n",
"\n",
" Horsepower Weight_in_lbs Acceleration Year Origin \n",
"0 130.0 3504 12.0 1970-01-01 USA \n",
"1 165.0 3693 11.5 1970-01-01 USA \n",
"2 150.0 3436 11.0 1970-01-01 USA \n",
"3 150.0 3433 12.0 1970-01-01 USA \n",
"4 140.0 3449 10.5 1970-01-01 USA \n",
".. ... ... ... ... ... \n",
"401 86.0 2790 15.6 1982-01-01 USA \n",
"402 52.0 2130 24.6 1982-01-01 Europe \n",
"403 84.0 2295 11.6 1982-01-01 USA \n",
"404 79.0 2625 18.6 1982-01-01 USA \n",
"405 82.0 2720 19.4 1982-01-01 USA \n",
"\n",
"[392 rows x 9 columns]"
]
},
"execution_count": 78,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# negation in pandas\n",
"cars[~cars.isna().any(axis=1)]"
]
}
],
"metadata": {
"deepnote": {
"is_reactive": false
},
"deepnote_execution_queue": [],
"deepnote_notebook_id": "a008aa3f-3aaa-4f75-9462-e29841a64137",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}