{ "cells": [ { "cell_type": "markdown", "metadata": { "cell_id": "3aadfa3503dd4ac982c058d93a0dd4d2", "deepnote_cell_height": 142.796875, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "# K-Means clustering 2\n", "\n", "We've seen how to implement K-Means clustering using scikit-learn, but not how the algorithm actually works. The main goal today is to see what the K-Means clustering algorithm is doing." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "19b3d58aad1740e1934c572e2d3d7d4c", "deepnote_cell_height": 220.984375, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Warm-up\n", "\n", "* Make a DataFrame `df` with two columns, \"miles\" and \"cars\", containing the following five data points in (miles, parking spaces): (0,1), (0,5), (1,0), (1,1), (1,5).\n", "* For later use, make a copy of `df` called `df2`. Be sure to use `.copy()`.\n", "* Using K-Means clustering, divide this `df` data into two clusters. Store the data in a new column called \"cluster\".\n", "* Does the result match what you expect?" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cell_id": "371d084fe33d47bab1ccf9202137f064", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 2, "execution_start": 1651089913761, "source_hash": "9b82ee11", "tags": [] }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "cell_id": "ebee3555c49d4d9d91425e55e170f8f7", "deepnote_cell_height": 413, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1651090336096, "source_hash": "aea9a7cb", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
milescars
001
105
210
311
415
\n", "
" ], "text/plain": [ " miles cars\n", "0 0 1\n", "1 0 5\n", "2 1 0\n", "3 1 1\n", "4 1 5" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=[\"miles\",\"cars\"])\n", "df" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "cell_id": "df8a3d36652d48ae80f79ee7fdb0b671", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 4, "execution_start": 1651090015112, "source_hash": "61993df1", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
milescars
001
105
210
311
415
\n", "
" ], "text/plain": [ " miles cars\n", "0 0 1\n", "1 0 5\n", "2 1 0\n", "3 1 1\n", "4 1 5" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Another way\n", "pd.DataFrame({\"miles\":[0,0,1,1,1],\"cars\":[1,5,0,1,5]})" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cell_id": "cba53a3bfd114edaafc1b22bf51b715f", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1651090425102, "source_hash": "5c75adad", "tags": [] }, "outputs": [], "source": [ "from sklearn.cluster import KMeans" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "cell_id": "9c7293149a524809bbed4823ca3e19dc", "deepnote_cell_height": 99, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1651090469185, "source_hash": "db5c8520", "tags": [] }, "outputs": [], "source": [ "# create/instantiate\n", "kmeans = KMeans(n_clusters=2)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "cell_id": "50b3ac7c9bb8499da60a1586be319e50", "deepnote_cell_height": 118.1875, "deepnote_cell_type": "code", "deepnote_output_heights": [ 21.1875 ], "deepnote_to_be_reexecuted": false, "execution_millis": 56, "execution_start": 1651090493854, "source_hash": "daae2edd", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "KMeans(n_clusters=2)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kmeans.fit(df)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cell_id": "bdb333ae46c241d395c919c1dd650746", "deepnote_cell_height": 118.1875, "deepnote_cell_type": "code", "deepnote_output_heights": [ 21.1875 ], "deepnote_to_be_reexecuted": false, "execution_millis": 240, "execution_start": 1651090500713, "source_hash": "11b67079", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 0, 0, 1], dtype=int32)" ] }, "execution_count": 7, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "kmeans.predict(df)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "cell_id": "58d1b97551f644d0855c30f08bae14fa", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 13, "execution_start": 1651090539359, "source_hash": "797d55fc", "tags": [] }, "outputs": [], "source": [ "df[\"cluster\"] = kmeans.predict(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster. (The cluster numberings themselves are random.)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "cell_id": "8f5788810c044c0d83c03342b54f904f", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 19, "execution_start": 1651090557904, "source_hash": "f804c160", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
milescarscluster
0010
1051
2100
3110
4151
\n", "
" ], "text/plain": [ " miles cars cluster\n", "0 0 1 0\n", "1 0 5 1\n", "2 1 0 0\n", "3 1 1 0\n", "4 1 5 1" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "df31de3c63e942f39efd4ac6cc6032ef", "deepnote_cell_height": 70, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Importance of scaling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the sub-DataFrame with just the \"miles\" and \"cars\" columns (not the \"cluster\" column). The following doesn't work: the column names need to be in a list." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "cell_id": "3b564acdb6e540db89925c40f88f8e74", "deepnote_cell_height": 144.1875, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 962, "execution_start": 1651090774319, "source_hash": "9e8d177e", "tags": [ "output_scroll" ] }, "outputs": [ { "ename": "KeyError", "evalue": "('miles', 'cars')", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3360\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m 
\u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')", "\nThe above exception was the direct cause of the following exception:\n", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_39671/3269067737.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"miles\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\"cars\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m 3456\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3457\u001b[0m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3459\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3460\u001b[0m \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m 3361\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3362\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 3365\u001b[0m \u001b[0;32mif\u001b[0m 
\u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')" ] } ], "source": [ "df[\"miles\",\"cars\"]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "cell_id": "e3c05828a0814d099d2fcae515ca9782", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 12, "execution_start": 1651090797406, "source_hash": "d1cea5ef", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
milescars
001
105
210
311
415
\n", "
" ], "text/plain": [ " miles cars\n", "0 0 1\n", "1 0 5\n", "2 1 0\n", "3 1 1\n", "4 1 5" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[[\"miles\",\"cars\"]]" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "cell_id": "9a86fac2f5a840c6972a72f131e29528", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1651090808299, "source_hash": "83bae8f7", "tags": [] }, "outputs": [], "source": [ "df2 = df[[\"miles\",\"cars\"]].copy()" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "cell_id": "8e703e6368ae45898b572518a4ab9401", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 9, "execution_start": 1651090811359, "source_hash": "caa55e2e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
milescars
001
105
210
311
415
\n", "
" ], "text/plain": [ " miles cars\n", "0 0 1\n", "1 0 5\n", "2 1 0\n", "3 1 1\n", "4 1 5" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `axis=1` keyword argument says that we are changing names of columns." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "cell_id": "d1347a65f57a435481a2a91feb8a6805", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 7, "execution_start": 1651090980023, "source_hash": "93889720", "tags": [] }, "outputs": [], "source": [ "df2 = df2.rename({\"miles\":\"feet\"}, axis=1).copy()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "cell_id": "bbc2fc0870f04ea9b340190da48af6af", "deepnote_cell_height": 99, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 9, "execution_start": 1651091024095, "source_hash": "6e1b0bf0", "tags": [] }, "outputs": [], "source": [ "# same as df2.feet = 5280*df2.feet\n", "df2.feet *= 5280" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "cell_id": "fce97f421f314ffabd5b92b5f0e5e7e8", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 14, "execution_start": 1651091044930, "source_hash": "caa55e2e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feetcars
001
105
252800
352801
452805
\n", "
" ], "text/plain": [ " feet cars\n", "0 0 1\n", "1 0 5\n", "2 5280 0\n", "3 5280 1\n", "4 5280 5" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that `df2` and `df1` contain the exact same data. The only difference is the unit of measurement used. `df1` uses \"miles\" for its first column, whereas `df2` uses \"feet\"." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "cell_id": "2ca75fa599a7423191c4e07bcf081df2", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 14, "execution_start": 1651091075742, "source_hash": "b415181c", "tags": [] }, "outputs": [], "source": [ "kmeans2 = KMeans(n_clusters=2)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "cell_id": "d64a274e43804294ac6ffb69e7f86c8f", "deepnote_cell_height": 118.1875, "deepnote_cell_type": "code", "deepnote_output_heights": [ 21.1875 ], "deepnote_to_be_reexecuted": false, "execution_millis": 19, "execution_start": 1651091095681, "source_hash": "fc106725", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "KMeans(n_clusters=2)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kmeans2.fit(df2)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "cell_id": "32c7a8fab96a42d7b479f1a4efbf2bc8", "deepnote_cell_height": 81, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 20, "execution_start": 1651091197609, "source_hash": "4b3e8292", "tags": [] }, "outputs": [], "source": [ "df2[\"cluster\"] = kmeans2.predict(df2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it is the \"feet\" column which is dominating, instead of the \"cars\" column. This is bad, since the only change was to the unit of measurement used. 
The lesson is that, unless the columns are already measured in the same unit, we should rescale them before clustering, for example with scikit-learn's `StandardScaler`. (We won't do that normalization today.)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "cell_id": "e6ad1798db8e4dbead20f35e45f05311", "deepnote_cell_height": 395, "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 9, "execution_start": 1651091211988, "source_hash": "caa55e2e", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feetcarscluster
0010
1050
2528001
3528011
4528051
\n", "
" ], "text/plain": [ " feet cars cluster\n", "0 0 1 0\n", "1 0 5 0\n", "2 5280 0 1\n", "3 5280 1 1\n", "4 5280 5 1" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "81cffe2a187743c7bfe47d23ef05d3f3", "deepnote_cell_height": 184.1875, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Summary of K-Means\n", "\n", "On the whiteboard we described:\n", "* The K-Means algorithm.\n", "* How a clustering is evaluated (why is one cluster better than another)." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "311ec6b6b2234e49bbb49fa41c2de9dd", "deepnote_cell_height": 401, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Demonstrations\n", "\n", "Here is a nice video demonstration of the K-Means clustering algorithm.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "1a309d378f4c4bce92edb953f7793d2e", "deepnote_cell_height": 74.796875, "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "Say you want to divide the plane into different regions, depending on which point is closest. What does that division of the plane look like? Check your answer [here](https://en.wikipedia.org/wiki/Voronoi_diagram#/media/File:Voronoi_growth_euclidean.gif)." ] } ], "metadata": { "deepnote": {}, "deepnote_execution_queue": [], "deepnote_notebook_id": "10213fad-981e-42a2-b8d8-21ebde5bea90", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.12" } }, "nbformat": 4, "nbformat_minor": 4 }