{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"cell_id": "3aadfa3503dd4ac982c058d93a0dd4d2",
"deepnote_cell_height": 142.796875,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"# K-Means clustering 2\n",
"\n",
"We've seen how to implement K-Means clustering using scikit-learn, but not how the algorithm actually works. The main goal today is to see what the K-Means clustering algorithm is doing."
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "19b3d58aad1740e1934c572e2d3d7d4c",
"deepnote_cell_height": 220.984375,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Warm-up\n",
"\n",
"* Make a DataFrame `df` with two columns, \"miles\" and \"cars\", containing the following five data points in (miles, parking spaces): (0,1), (0,5), (1,0), (1,1), (1,5).\n",
"* For later use, make a copy of `df` called `df2`. Be sure to use `.copy()`.\n",
"* Using K-Means clustering, divide the data in `df` into two clusters. Store the predicted cluster labels in a new column called \"cluster\".\n",
"* Does the result match what you expect?"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cell_id": "371d084fe33d47bab1ccf9202137f064",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 2,
"execution_start": 1651089913761,
"source_hash": "9b82ee11",
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"cell_id": "ebee3555c49d4d9d91425e55e170f8f7",
"deepnote_cell_height": 413,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 6,
"execution_start": 1651090336096,
"source_hash": "aea9a7cb",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" miles cars\n",
"0 0 1\n",
"1 0 5\n",
"2 1 0\n",
"3 1 1\n",
"4 1 5"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=[\"miles\",\"cars\"])\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"cell_id": "df8a3d36652d48ae80f79ee7fdb0b671",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1651090015112,
"source_hash": "61993df1",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" miles cars\n",
"0 0 1\n",
"1 0 5\n",
"2 1 0\n",
"3 1 1\n",
"4 1 5"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Another way\n",
"pd.DataFrame({\"miles\":[0,0,1,1,1],\"cars\":[1,5,0,1,5]})"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"cell_id": "cba53a3bfd114edaafc1b22bf51b715f",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1651090425102,
"source_hash": "5c75adad",
"tags": []
},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"cell_id": "9c7293149a524809bbed4823ca3e19dc",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1651090469185,
"source_hash": "db5c8520",
"tags": []
},
"outputs": [],
"source": [
"# create/instantiate\n",
"kmeans = KMeans(n_clusters=2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"cell_id": "50b3ac7c9bb8499da60a1586be319e50",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 56,
"execution_start": 1651090493854,
"source_hash": "daae2edd",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(n_clusters=2)"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kmeans.fit(df)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"cell_id": "bdb333ae46c241d395c919c1dd650746",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 240,
"execution_start": 1651090500713,
"source_hash": "11b67079",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"array([0, 1, 0, 0, 1], dtype=int32)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kmeans.predict(df)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"cell_id": "58d1b97551f644d0855c30f08bae14fa",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 13,
"execution_start": 1651090539359,
"source_hash": "797d55fc",
"tags": []
},
"outputs": [],
"source": [
"df[\"cluster\"] = kmeans.predict(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster. (The cluster labels themselves are arbitrary: which cluster is called 0 and which is called 1 can change from run to run.)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"cell_id": "8f5788810c044c0d83c03342b54f904f",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1651090557904,
"source_hash": "f804c160",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" miles cars cluster\n",
"0 0 1 0\n",
"1 0 5 1\n",
"2 1 0 0\n",
"3 1 1 0\n",
"4 1 5 1"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df"
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "df31de3c63e942f39efd4ac6cc6032ef",
"deepnote_cell_height": 70,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Importance of scaling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get the sub-DataFrame with just the \"miles\" and \"cars\" columns (not the \"cluster\" column). The following raises a `KeyError`, because pandas looks for a single column whose name is the tuple `('miles', 'cars')`. The column names need to be in a list."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"cell_id": "3b564acdb6e540db89925c40f88f8e74",
"deepnote_cell_height": 144.1875,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 962,
"execution_start": 1651090774319,
"source_hash": "9e8d177e",
"tags": [
"output_scroll"
]
},
"outputs": [
{
"ename": "KeyError",
"evalue": "('miles', 'cars')",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')",
"\nThe above exception was the direct cause of the following exception:\n",
"\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_39671/3269067737.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"miles\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\"cars\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3459\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3460\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')"
]
}
],
"source": [
"df[\"miles\",\"cars\"]"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"cell_id": "e3c05828a0814d099d2fcae515ca9782",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1651090797406,
"source_hash": "d1cea5ef",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" miles cars\n",
"0 0 1\n",
"1 0 5\n",
"2 1 0\n",
"3 1 1\n",
"4 1 5"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[[\"miles\",\"cars\"]]"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"cell_id": "9a86fac2f5a840c6972a72f131e29528",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1651090808299,
"source_hash": "83bae8f7",
"tags": []
},
"outputs": [],
"source": [
"df2 = df[[\"miles\",\"cars\"]].copy()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"cell_id": "8e703e6368ae45898b572518a4ab9401",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1651090811359,
"source_hash": "caa55e2e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" miles cars\n",
"0 0 1\n",
"1 0 5\n",
"2 1 0\n",
"3 1 1\n",
"4 1 5"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we rename the \"miles\" column to \"feet\". The `axis=1` keyword argument says that we are renaming columns rather than rows."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"cell_id": "d1347a65f57a435481a2a91feb8a6805",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 7,
"execution_start": 1651090980023,
"source_hash": "93889720",
"tags": []
},
"outputs": [],
"source": [
"df2 = df2.rename({\"miles\":\"feet\"}, axis=1).copy()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"cell_id": "bbc2fc0870f04ea9b340190da48af6af",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1651091024095,
"source_hash": "6e1b0bf0",
"tags": []
},
"outputs": [],
"source": [
"# same as df2.feet = 5280*df2.feet\n",
"df2.feet *= 5280"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"cell_id": "fce97f421f314ffabd5b92b5f0e5e7e8",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 14,
"execution_start": 1651091044930,
"source_hash": "caa55e2e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" feet cars\n",
"0 0 1\n",
"1 0 5\n",
"2 5280 0\n",
"3 5280 1\n",
"4 5280 5"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that `df2` and `df` describe exactly the same data points. The only difference is the unit of measurement: `df` uses miles for its first column, whereas `df2` uses feet."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"cell_id": "2ca75fa599a7423191c4e07bcf081df2",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 14,
"execution_start": 1651091075742,
"source_hash": "b415181c",
"tags": []
},
"outputs": [],
"source": [
"kmeans2 = KMeans(n_clusters=2)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"cell_id": "d64a274e43804294ac6ffb69e7f86c8f",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 19,
"execution_start": 1651091095681,
"source_hash": "fc106725",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"KMeans(n_clusters=2)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"kmeans2.fit(df2)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"cell_id": "32c7a8fab96a42d7b479f1a4efbf2bc8",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 20,
"execution_start": 1651091197609,
"source_hash": "4b3e8292",
"tags": []
},
"outputs": [],
"source": [
"df2[\"cluster\"] = kmeans2.predict(df2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now it is the \"feet\" column that dominates, instead of the \"cars\" column: a difference of 5280 feet dwarfs a difference of a few cars. This is bad, since the only change was to the unit of measurement. The lesson is that, unless the columns are in the same unit, we should normalize before clustering, for example using `StandardScaler`. (We won't do that normalization today.)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"cell_id": "e6ad1798db8e4dbead20f35e45f05311",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1651091211988,
"source_hash": "caa55e2e",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" feet cars cluster\n",
"0 0 1 0\n",
"1 0 5 0\n",
"2 5280 0 1\n",
"3 5280 1 1\n",
"4 5280 5 1"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df2"
]
},
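{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although we don't normalize in this session, here is a minimal sketch of what it could look like. `StandardScaler` rescales each column to have mean 0 and standard deviation 1, so the miles and feet versions of the data become numerically the same and the choice of unit no longer affects the clustering. The names `df_miles`, `df_feet`, `scaled_miles`, `scaled_feet` and `kmeans_scaled` are illustrative, and `n_init=10`, `random_state=0` are assumptions made only so the sketch is reproducible.\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# The same five points, with distance measured in miles and in feet.\n",
"df_miles = pd.DataFrame({'miles': [0, 0, 1, 1, 1], 'cars': [1, 5, 0, 1, 5]})\n",
"df_feet = pd.DataFrame({'feet': [0, 0, 5280, 5280, 5280], 'cars': [1, 5, 0, 1, 5]})\n",
"\n",
"# Rescale every column to mean 0 and standard deviation 1.\n",
"scaled_miles = StandardScaler().fit_transform(df_miles)\n",
"scaled_feet = StandardScaler().fit_transform(df_feet)\n",
"\n",
"# After scaling, the two versions agree up to rounding error.\n",
"print(np.allclose(scaled_miles, scaled_feet))\n",
"\n",
"# Cluster the scaled data instead of the raw data.\n",
"kmeans_scaled = KMeans(n_clusters=2, n_init=10, random_state=0)\n",
"labels = kmeans_scaled.fit_predict(scaled_feet)\n",
"```\n",
"\n",
"Scaling removes the dependence on units. It does not guarantee that the clusters will again be determined by the \"cars\" column."
]
},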
{
"cell_type": "markdown",
"metadata": {
"cell_id": "81cffe2a187743c7bfe47d23ef05d3f3",
"deepnote_cell_height": 184.1875,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Summary of K-Means\n",
"\n",
"On the whiteboard we described:\n",
"* The K-Means algorithm.\n",
"* How a clustering is evaluated (why is one clustering better than another)."
]
},
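{
"cell_type": "markdown",
"metadata": {},
"source": [
"The whiteboard discussion is not reproduced in this notebook. As a rough sketch, the standard K-Means iteration (often called Lloyd's algorithm) assigns each point to its nearest centroid, moves each centroid to the mean of its assigned points, and repeats. A clustering is commonly scored by its inertia, the sum of squared distances from each point to its assigned centroid, with lower being better; scikit-learn exposes this as the `inertia_` attribute of a fitted `KMeans` object. The function names `assign` and `kmeans_sketch`, the fixed iteration count, and the use of the first two data points as starting centroids are assumptions for illustration; scikit-learn's `KMeans` chooses its starting centroids differently.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def assign(X, centers):\n",
"    # Squared Euclidean distance from every point to every centroid.\n",
"    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)\n",
"    # Label each point with the index of its nearest centroid.\n",
"    return dists.argmin(axis=1)\n",
"\n",
"def kmeans_sketch(X, centers, n_iter=10):\n",
"    X = np.asarray(X, dtype=float)\n",
"    centers = np.asarray(centers, dtype=float).copy()\n",
"    for _ in range(n_iter):\n",
"        labels = assign(X, centers)\n",
"        for j in range(len(centers)):\n",
"            # Move centroid j to the mean of its points (skip empty clusters).\n",
"            if (labels == j).any():\n",
"                centers[j] = X[labels == j].mean(axis=0)\n",
"    labels = assign(X, centers)\n",
"    # Inertia: total squared distance from each point to its centroid.\n",
"    inertia = ((X - centers[labels]) ** 2).sum()\n",
"    return labels, centers, inertia\n",
"\n",
"# The warm-up data as (miles, cars) pairs.\n",
"X = np.array([[0, 1], [0, 5], [1, 0], [1, 1], [1, 5]])\n",
"labels, centers, inertia = kmeans_sketch(X, centers=X[:2])\n",
"print(labels, inertia)\n",
"```\n",
"\n",
"With these starting centroids, the sketch puts the two points with 5 cars in one cluster and the other three points in the other, the same grouping as in the warm-up."
]
},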
{
"cell_type": "markdown",
"metadata": {
"cell_id": "311ec6b6b2234e49bbb49fa41c2de9dd",
"deepnote_cell_height": 401,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Demonstrations\n",
"\n",
"Here is a nice video demonstration of the K-Means clustering algorithm.\n",
"\n",
"VIDEO "
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "1a309d378f4c4bce92edb953f7793d2e",
"deepnote_cell_height": 74.796875,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"Say you are given several points in the plane, and you want to divide the plane into regions according to which of those points is closest. What does that division of the plane look like? Check your answer [here](https://en.wikipedia.org/wiki/Voronoi_diagram#/media/File:Voronoi_growth_euclidean.gif)."
]
}
],
"metadata": {
"deepnote": {},
"deepnote_execution_queue": [],
"deepnote_notebook_id": "10213fad-981e-42a2-b8d8-21ebde5bea90",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}