{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "3aadfa3503dd4ac982c058d93a0dd4d2",
    "deepnote_cell_height": 142.796875,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "# K-Means clustering 2\n",
    "\n",
    "We've seen how to implement K-Means clustering using scikit-learn, but not how the algorithm actually works.  The main goal today is to see what the K-Means clustering algorithm is doing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "19b3d58aad1740e1934c572e2d3d7d4c",
    "deepnote_cell_height": 220.984375,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "## Warm-up\n",
    "\n",
    "* Make a DataFrame `df` with two columns, \"miles\" and \"cars\", containing the following five data points in (miles, parking spaces): (0,1), (0,5), (1,0), (1,1), (1,5).\n",
    "* For later use, make a copy of `df` called `df2`.  Be sure to use `.copy()`.\n",
    "* Using K-Means clustering, divide this `df` data into two clusters.  Store the data in a new column called \"cluster\".\n",
    "* Does the result match what you expect?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "cell_id": "371d084fe33d47bab1ccf9202137f064",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 2,
    "execution_start": 1651089913761,
    "source_hash": "9b82ee11",
    "tags": []
   },
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "cell_id": "ebee3555c49d4d9d91425e55e170f8f7",
    "deepnote_cell_height": 413,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 6,
    "execution_start": 1651090336096,
    "source_hash": "aea9a7cb",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>miles</th>\n",
       "      <th>cars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   miles  cars\n",
       "0      0     1\n",
       "1      0     5\n",
       "2      1     0\n",
       "3      1     1\n",
       "4      1     5"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame([[0,1],[0,5],[1,0],[1,1],[1,5]], columns=[\"miles\",\"cars\"])\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "cell_id": "df8a3d36652d48ae80f79ee7fdb0b671",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 4,
    "execution_start": 1651090015112,
    "source_hash": "61993df1",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>miles</th>\n",
       "      <th>cars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   miles  cars\n",
       "0      0     1\n",
       "1      0     5\n",
       "2      1     0\n",
       "3      1     1\n",
       "4      1     5"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Another way\n",
    "pd.DataFrame({\"miles\":[0,0,1,1,1],\"cars\":[1,5,0,1,5]})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "cell_id": "cba53a3bfd114edaafc1b22bf51b715f",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 3,
    "execution_start": 1651090425102,
    "source_hash": "5c75adad",
    "tags": []
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "cell_id": "9c7293149a524809bbed4823ca3e19dc",
    "deepnote_cell_height": 99,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 3,
    "execution_start": 1651090469185,
    "source_hash": "db5c8520",
    "tags": []
   },
   "outputs": [],
   "source": [
    "# create/instantiate\n",
    "kmeans = KMeans(n_clusters=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "cell_id": "50b3ac7c9bb8499da60a1586be319e50",
    "deepnote_cell_height": 118.1875,
    "deepnote_cell_type": "code",
    "deepnote_output_heights": [
     21.1875
    ],
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 56,
    "execution_start": 1651090493854,
    "source_hash": "daae2edd",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KMeans(n_clusters=2)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans.fit(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "cell_id": "bdb333ae46c241d395c919c1dd650746",
    "deepnote_cell_height": 118.1875,
    "deepnote_cell_type": "code",
    "deepnote_output_heights": [
     21.1875
    ],
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 240,
    "execution_start": 1651090500713,
    "source_hash": "11b67079",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0, 1, 0, 0, 1], dtype=int32)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans.predict(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "cell_id": "58d1b97551f644d0855c30f08bae14fa",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 13,
    "execution_start": 1651090539359,
    "source_hash": "797d55fc",
    "tags": []
   },
   "outputs": [],
   "source": [
    "df[\"cluster\"] = kmeans.predict(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the two data points with 5 cars are in one cluster, and the three data points with 0 or 1 cars are in the other cluster.  (The cluster numberings themselves are random.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "cell_id": "8f5788810c044c0d83c03342b54f904f",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 19,
    "execution_start": 1651090557904,
    "source_hash": "f804c160",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>miles</th>\n",
       "      <th>cars</th>\n",
       "      <th>cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   miles  cars  cluster\n",
       "0      0     1        0\n",
       "1      0     5        1\n",
       "2      1     0        0\n",
       "3      1     1        0\n",
       "4      1     5        1"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "df31de3c63e942f39efd4ac6cc6032ef",
    "deepnote_cell_height": 70,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "## Importance of scaling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the sub-DataFrame with just the \"miles\" and \"cars\" columns (not the \"cluster\" column).  The following doesn't work: the column names need to be in a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "cell_id": "3b564acdb6e540db89925c40f88f8e74",
    "deepnote_cell_height": 144.1875,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 962,
    "execution_start": 1651090774319,
    "source_hash": "9e8d177e",
    "tags": [
     "output_scroll"
    ]
   },
   "outputs": [
    {
     "ename": "KeyError",
     "evalue": "('miles', 'cars')",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m   3360\u001b[0m             \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3361\u001b[0;31m                 \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   3362\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/_libs/index.pyx\u001b[0m in \u001b[0;36mpandas._libs.index.IndexEngine.get_loc\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;32mpandas/_libs/hashtable_class_helper.pxi\u001b[0m in \u001b[0;36mpandas._libs.hashtable.PyObjectHashTable.get_item\u001b[0;34m()\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')",
      "\nThe above exception was the direct cause of the following exception:\n",
      "\u001b[0;31mKeyError\u001b[0m                                  Traceback (most recent call last)",
      "\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_39671/3269067737.py\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"miles\"\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\"cars\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/frame.py\u001b[0m in \u001b[0;36m__getitem__\u001b[0;34m(self, key)\u001b[0m\n\u001b[1;32m   3456\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnlevels\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   3457\u001b[0m                 \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_getitem_multilevel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3458\u001b[0;31m             \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcolumns\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   3459\u001b[0m             \u001b[0;32mif\u001b[0m \u001b[0mis_integer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   3460\u001b[0m                 \u001b[0mindexer\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m[\u001b[0m\u001b[0mindexer\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/pandas/core/indexes/base.py\u001b[0m in \u001b[0;36mget_loc\u001b[0;34m(self, key, method, tolerance)\u001b[0m\n\u001b[1;32m   3361\u001b[0m                 \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_loc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcasted_key\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   3362\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mKeyError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 3363\u001b[0;31m                 \u001b[0;32mraise\u001b[0m \u001b[0mKeyError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   3364\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   3365\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mis_scalar\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhasnans\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mKeyError\u001b[0m: ('miles', 'cars')"
     ]
    }
   ],
   "source": [
    "df[\"miles\",\"cars\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "cell_id": "e3c05828a0814d099d2fcae515ca9782",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 12,
    "execution_start": 1651090797406,
    "source_hash": "d1cea5ef",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>miles</th>\n",
       "      <th>cars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   miles  cars\n",
       "0      0     1\n",
       "1      0     5\n",
       "2      1     0\n",
       "3      1     1\n",
       "4      1     5"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[[\"miles\",\"cars\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "cell_id": "9a86fac2f5a840c6972a72f131e29528",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 1,
    "execution_start": 1651090808299,
    "source_hash": "83bae8f7",
    "tags": []
   },
   "outputs": [],
   "source": [
    "df2 = df[[\"miles\",\"cars\"]].copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "cell_id": "8e703e6368ae45898b572518a4ab9401",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 9,
    "execution_start": 1651090811359,
    "source_hash": "caa55e2e",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>miles</th>\n",
       "      <th>cars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   miles  cars\n",
       "0      0     1\n",
       "1      0     5\n",
       "2      1     0\n",
       "3      1     1\n",
       "4      1     5"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `axis=1` keyword argument says that we are changing names of columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "cell_id": "d1347a65f57a435481a2a91feb8a6805",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 7,
    "execution_start": 1651090980023,
    "source_hash": "93889720",
    "tags": []
   },
   "outputs": [],
   "source": [
    "df2 = df2.rename({\"miles\":\"feet\"}, axis=1).copy()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "cell_id": "bbc2fc0870f04ea9b340190da48af6af",
    "deepnote_cell_height": 99,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 9,
    "execution_start": 1651091024095,
    "source_hash": "6e1b0bf0",
    "tags": []
   },
   "outputs": [],
   "source": [
    "# same as df2.feet = 5280*df2.feet\n",
    "df2.feet *= 5280"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "cell_id": "fce97f421f314ffabd5b92b5f0e5e7e8",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 14,
    "execution_start": 1651091044930,
    "source_hash": "caa55e2e",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feet</th>\n",
       "      <th>cars</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5280</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5280</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5280</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   feet  cars\n",
       "0     0     1\n",
       "1     0     5\n",
       "2  5280     0\n",
       "3  5280     1\n",
       "4  5280     5"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that `df2` and `df1` contain the exact same data.  The only difference is the unit of measurement used.  `df1` uses \"miles\" for its first column, whereas `df2` uses \"feet\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "cell_id": "2ca75fa599a7423191c4e07bcf081df2",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 14,
    "execution_start": 1651091075742,
    "source_hash": "b415181c",
    "tags": []
   },
   "outputs": [],
   "source": [
    "kmeans2 = KMeans(n_clusters=2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "cell_id": "d64a274e43804294ac6ffb69e7f86c8f",
    "deepnote_cell_height": 118.1875,
    "deepnote_cell_type": "code",
    "deepnote_output_heights": [
     21.1875
    ],
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 19,
    "execution_start": 1651091095681,
    "source_hash": "fc106725",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "KMeans(n_clusters=2)"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "kmeans2.fit(df2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "cell_id": "32c7a8fab96a42d7b479f1a4efbf2bc8",
    "deepnote_cell_height": 81,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 20,
    "execution_start": 1651091197609,
    "source_hash": "4b3e8292",
    "tags": []
   },
   "outputs": [],
   "source": [
    "df2[\"cluster\"] = kmeans2.predict(df2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now it is the \"feet\" column which is dominating, instead of the \"cars\" column.  This is bad, since the only change was to the unit of measurement used.  The lesson is that, unless the columns are in the same unit, we should normalize using for example `StandardScaler`.  (We won't do that normalization today.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "cell_id": "e6ad1798db8e4dbead20f35e45f05311",
    "deepnote_cell_height": 395,
    "deepnote_cell_type": "code",
    "deepnote_to_be_reexecuted": false,
    "execution_millis": 9,
    "execution_start": 1651091211988,
    "source_hash": "caa55e2e",
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feet</th>\n",
       "      <th>cars</th>\n",
       "      <th>cluster</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5280</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>5280</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5280</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   feet  cars  cluster\n",
       "0     0     1        0\n",
       "1     0     5        0\n",
       "2  5280     0        1\n",
       "3  5280     1        1\n",
       "4  5280     5        1"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "81cffe2a187743c7bfe47d23ef05d3f3",
    "deepnote_cell_height": 184.1875,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "## Summary of K-Means\n",
    "\n",
    "On the whiteboard we described:\n",
    "* The K-Means algorithm.\n",
    "* How a clustering is evaluated (why is one cluster better than another)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "311ec6b6b2234e49bbb49fa41c2de9dd",
    "deepnote_cell_height": 401,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "## Demonstrations\n",
    "\n",
    "Here is a nice video demonstration of the K-Means clustering algorithm.\n",
    "\n",
    "<iframe width=\"560\" height=\"315\" src=\"https://www.youtube.com/embed/5I3Ei69I40s\" title=\"YouTube video player\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen></iframe>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cell_id": "1a309d378f4c4bce92edb953f7793d2e",
    "deepnote_cell_height": 74.796875,
    "deepnote_cell_type": "markdown",
    "tags": []
   },
   "source": [
    "Say you want to divide the plane into different regions, depending on which point is closest.  What does that division of the plane look like?  Check your answer [here](https://en.wikipedia.org/wiki/Voronoi_diagram#/media/File:Voronoi_growth_euclidean.gif)."
   ]
  }
 ],
 "metadata": {
  "deepnote": {},
  "deepnote_execution_queue": [],
  "deepnote_notebook_id": "10213fad-981e-42a2-b8d8-21ebde5bea90",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}