{ "cells": [ { "cell_type": "markdown", "metadata": { "cell_id": "3fbc88f148b34ae4ae03e28a0092f696", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "# Week 8 Wednesday" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "aed216c6bfb74c8589e096aefd583b6e", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Announcements\n", "\n", "* This week's videos posted (due Friday before lecture).\n", "* No discussion section until the strike is resolved.\n", "* If you weren't able to get the pickle file from Worksheet 12 submitted, include a link to your Deepnote notebook as a comment on the Canvas Worksheet 12 assignment (be sure to change the sharing options so I can see your Worksheet 12) and I will try to give partial credit.\n", "* I have office hours today (Wednesday) at 1pm in my office, RH 440J.\n", "* Course project instructions have been added to the [course notes](https://christopherdavisuci.github.io/UCI-Math-10-F22/Proj/CourseProject.html). A \"warm-up\" homework for the course project is Worksheet 16, posted in the Week 9 folder." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "512439688dd24624a6fe75f92c94cdba", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Rough plan for the next few classes:\n", "\n", "Very subject to change.\n", "\n", "* Week 8 Wednesday (today): The U-shaped test error curve\n", "* Week 8 Friday: More on decision trees. (I probably have too much planned for today, so we can catch up on Friday.)\n", "* Week 9 Monday: Random forests (ensembles of decision trees)\n", "* Week 9 Wednesday: Introduction to the course project" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "12d7adec0b5a4ae1a5fcc6b2f47bee2f", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Loading the iris dataset\n", "\n", "The iris dataset (also available from Seaborn) is smaller than most of the datasets we work with in Math 10 (it only contains 150 rows/observations/data points). 
But it is one of the most classic datasets in Machine Learning, so we should see it at some point." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cell_id": "ad060fce9de0420da351b1e0427774df", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 2230, "execution_start": 1668621351436, "source_hash": "22fa9960", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import seaborn as sns\n", "import altair as alt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "cell_id": "972022c8e1e447b592e648e2f997b9aa", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 303, "execution_start": 1668621803112, "source_hash": "a47677b0", "tags": [] }, "outputs": [], "source": [ "df = sns.load_dataset(\"iris\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "cell_id": "1928dc1e40d7429c916648df05a4be4f", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 30, "execution_start": 1668621806184, "source_hash": "c085b6ba", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
05.13.51.40.2setosa
14.93.01.40.2setosa
24.73.21.30.2setosa
34.63.11.50.2setosa
45.03.61.40.2setosa
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "0 5.1 3.5 1.4 0.2 setosa\n", "1 4.9 3.0 1.4 0.2 setosa\n", "2 4.7 3.2 1.3 0.2 setosa\n", "3 4.6 3.1 1.5 0.2 setosa\n", "4 5.0 3.6 1.4 0.2 setosa" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "8c6b7713d56d46a1bfa456c568dbffe3", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Splitting the data into a training set and a test set" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cell_id": "ab6ad3bd1deb4310a352d9db1a290648", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 249, "execution_start": 1668621826231, "source_hash": "746a4dbc", "tags": [] }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "5fb8c75a51424832bdbde353635a4f07", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* Use `train_test_split` to divide the iris data into a training set and a test set. Use 80% of the rows for the training data, and specify `random_state=2`.\n", "* On Monday, we made `X_train, X_test, y_train, y_test`. This time we will just make `df_train` and `df_test` (so the input features and the target are not separated yet)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "cell_id": "190b403d2a864212b186ab06e15f2413", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1668621952106, "source_hash": "9974d74e", "tags": [] }, "outputs": [], "source": [ "df_train, df_test = train_test_split(df, train_size=0.8, random_state=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting training DataFrame is definitely not just the first 80% of rows. In fact, the order of the rows is also scrambled. 
Here are the first 3 rows in the training DataFrame." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "cell_id": "9004a21900f5472888202ccf518995e4", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 244, "execution_start": 1668621957560, "source_hash": "4fc278fa", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
sepal_lengthsepal_widthpetal_lengthpetal_widthspecies
1266.22.84.81.8virginica
235.13.31.70.5setosa
645.62.93.61.3versicolor
\n", "
" ], "text/plain": [ " sepal_length sepal_width petal_length petal_width species\n", "126 6.2 2.8 4.8 1.8 virginica\n", "23 5.1 3.3 1.7 0.5 setosa\n", "64 5.6 2.9 3.6 1.3 versicolor" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason we divide into a training set and a test set is that can help us detect overfitting. In short, if the Machine Learning model performs much better on the training set than on the test set, that is strong evidence of overfitting.\n", "\n", "There was a question of what \"much better\" performance means, and in most scenarios I believe this is more art than science, but here is a picture of the general phenomenon. This picture is taken from the great textbook (freely available on campus via SpringerLink) [Introduction to Statistical Learning](https://link.springer.com/book/10.1007/978-1-4614-7138-7).\n", "\n", "![U-shaped test error curve](../images/Ushape-test-error.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is the image on the right that you should pay attention to. That image is for Mean Squared Error, but any error function (cost function, loss function) would work just as well. We generally start with a less flexible model (in polynomial regression, this means lower degree; in decision trees, this means fewer leaves). As we make the model more flexible, the performance on the training set will always improve (the error will decrease). Typically the test error curve also begins to decrease, but eventually, once overfitting starts, the test error will start to increase again. The resulting curve is called the \"U-shaped test error\" curve. We will see examples of this phenomenon in Worksheet 14 and also in class on Friday." 
] }, { "cell_type": "markdown", "metadata": { "cell_id": "9bf5fc7f373147d09f7368fa0846d1a0", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Visualizing how a decision tree splits the data" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "1f93403efc1d4eb3acfcc50801f04f2b", "deepnote_cell_type": "markdown", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1668569117379, "source_hash": "d28a56d5", "tags": [] }, "source": [ "Our goal is to divide the iris data by species.\n", "\n", "* First we will divide by petal length.\n", "* Then we will divide by petal width.\n", "* Where would you make these divisions?\n", "\n", "Use the following to visualize the data:\n", "```\n", "alt.Chart(???not df).mark_circle(size=50, opacity=1).encode(\n", " x=\"petal_length\",\n", " y=\"petal_width\",\n", " color=alt.Color(\"species\", scale=alt.Scale(domain=[\"versicolor\", \"setosa\", \"virginica\"])),\n", ")\n", "```" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cell_id": "630d8618b4144d469525a59f39812158", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 34, "execution_start": 1668622471349, "source_hash": "693093f6", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(df_train).mark_circle(size=50, opacity=1).encode(\n", " x=\"petal_length\",\n", " y=\"petal_width\",\n", " color=alt.Color(\"species\", scale=alt.Scale(domain=[\"versicolor\", \"setosa\", \"virginica\"])),\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think dividing the region at petal_length 2.5 or petal_width 0.7 would be equally reasonable. We have been told the first division should happen by petal length. This initial division perfectly separates the setosa flowers from the other two species. The next division we are told should be by petal width. A division around 1.7 seems to do a good job (but not a perfect job) dividing the versicolor flowers from the virginica flowers." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "fbc738091f4e43cbb70e2d5844c3424c", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* How many \"leaves\" will the corresponding decision tree have?\n", "\n", "The answer is three. We initially divide the region in half vertically, and then we divide the right-half region again in half horizontally." 
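, "\n", "\n", "As a rough sketch of these two divisions (written by hand, not produced by scikit-learn), the three resulting regions can be expressed as a plain Python function. The function name `classify_by_hand` is hypothetical, and the thresholds 2.5 and 1.7 are just the approximate values read off the chart above.\n", "\n", "```python\n", "def classify_by_hand(petal_length, petal_width):\n", "    # First division (vertical line in the chart): petal length\n", "    if petal_length <= 2.5:\n", "        return 'setosa'\n", "    # Second division (horizontal line, right-hand region only): petal width\n", "    if petal_width <= 1.7:\n", "        return 'versicolor'\n", "    return 'virginica'\n", "\n", "# Petal measurements from the first three rows of df_train shown earlier\n", "print(classify_by_hand(4.8, 1.8))  # virginica\n", "print(classify_by_hand(1.7, 0.5))  # setosa\n", "print(classify_by_hand(3.6, 1.3))  # versicolor\n", "```\n", "\n", "Each `return` statement corresponds to one leaf, which is another way to see that the tree has three leaves."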
] }, { "cell_type": "markdown", "metadata": { "cell_id": "8d6a9188c8fb446ca0e3875070aba53f", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## A Decision tree with two splits\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "cell_id": "8f2d51ffe07047168cea3a5cbbeca0af", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 185, "execution_start": 1668622713603, "source_hash": "2d922c1a", "tags": [] }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "from sklearn.tree import DecisionTreeClassifier, plot_tree\n", "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "803584bd003d4f60a4818bc2c5fda8b9", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* Create an instance of a `DecisionTreeClassifier` with `max_leaf_nodes` as above. Specify `random_state=1` (which will help me know that the first split happens on \"petal_length\"... if it doesn't work, try some other values of `random_state`).\n", "* Fit the classifier to the training data using `cols = [\"petal_length\", \"petal_width\", \"sepal_length\"]` for the input features and using `\"species\"` for the target." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": { "cell_id": "adc3f1b9850444a897196fcae9ede9f7", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1668622720916, "source_hash": "a65b6797", "tags": [] }, "outputs": [], "source": [ "cols = [\"petal_length\", \"petal_width\", \"sepal_length\"]" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "cell_id": "e6d9179fce8a46a0b0c7d4faed7da4e7", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668622787959, "source_hash": "d45611cc", "tags": [] }, "outputs": [], "source": [ "clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "cell_id": "aa48ccbe06e4483f87b6dbe176cd4235", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 8, "execution_start": 1668622833918, "source_hash": "3d265dc1", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
DecisionTreeClassifier(max_leaf_nodes=3, random_state=1)
" ], "text/plain": [ "DecisionTreeClassifier(max_leaf_nodes=3, random_state=1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.fit(df_train[cols], df_train[\"species\"])" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "069ac91ea49c4081a62d695705cbf248", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* Illustrate the resulting tree using the following.\n", "* Does it match what we expected from the Altair chart?\n", "```\n", "fig = plt.figure()\n", "_ = plot_tree(clf, \n", " feature_names=clf.feature_names_in_,\n", " class_names=clf.classes_,\n", " filled=True)\n", "```" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "cell_id": "2ad9d4f50ae74448bcd93229e3e5086f", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 469, "execution_start": 1668622846146, "source_hash": "b6c04216", "tags": [] }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = plt.figure()\n", "_ = plot_tree(clf, \n", " feature_names=clf.feature_names_in_,\n", " class_names=clf.classes_,\n", " filled=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how these values 2.35 and 1.65 closely match what we saw above. Also notice how the setosa flowers are perfectly separated (that corresponds to the `value = [36, 0, 0]` report, whereas there is some overlap in the other two species. If you look back up to the Altair chart, it should seem plausible that the lower-right region (where petal length is greater than 2.35 and where petal width is less than or equal to 1.65) contains exactly 3 of the virginica species. That is represented in the `value = [0, 41, 3]`. The classes are (I believe) always listed in alphabetical order, but if you want to check it, you can also call `clf.classes_`." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "3f8957d7b6644fe2b781f1d71cb38df8", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* What is the depth of the corresponding tree? We can answer by looking at the diagram, or by using the `get_depth` method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "cell_id": "44f13db53e614ac8ae7fdc6242bfdcfa", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 923, "execution_start": 1668623103860, "source_hash": "77dd3aaa", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.get_depth()" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "363df2fcbaca456cb7cd94c26f138868", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* What are the corresponding feature importances? Use the `feature_importances_` attribute." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "cell_id": "055547e8c7e2476390c72c23c761a39e", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1668623168592, "source_hash": "c85582bc", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([0.52311757, 0.47688243, 0. ])" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.feature_importances_" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "cell_id": "d7b5102e6768486cbbf6b9cefa53a396", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 7, "execution_start": 1668623189658, "source_hash": "b515fe8b", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 0.523118\n", "1 0.476882\n", "2 0.000000\n", "dtype: float64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(clf.feature_importances_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use the `feature_names_in_` attribute to know which number corresponds to which column. In scikit-learn, these numbers 0.523 and 0.477 measure each feature's share of the total impurity decrease achieved by the tree's splits (weighted by the number of samples reaching each split), normalized so that they sum to 1. Higher values mean more important (for the decision tree). The 0 value for sepal_length makes sense, because our small decision tree (which only had two splits) never used the sepal length value."
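, "\n", "\n", "For the curious, here is a minimal sketch of how such importances can be recomputed by hand from a fitted tree's `tree_` attribute, under the assumption that importance means the sample-weighted decrease in impurity at each split, normalized to sum to 1. To keep the sketch self-contained, it loads iris with scikit-learn's `load_iris` instead of Seaborn and uses all four numeric columns, so the numbers will not match ours exactly.\n", "\n", "```python\n", "import numpy as np\n", "from sklearn.datasets import load_iris\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "# Assumption: scikit-learn's bundled copy of iris stands in for df_train\n", "X, y = load_iris(return_X_y=True, as_frame=True)\n", "clf = DecisionTreeClassifier(max_leaf_nodes=3, random_state=1).fit(X, y)\n", "\n", "t = clf.tree_\n", "importances = np.zeros(clf.n_features_in_)\n", "for node in range(t.node_count):\n", "    left, right = t.children_left[node], t.children_right[node]\n", "    if left == -1:  # leaf: no split here, so no impurity decrease\n", "        continue\n", "    # Weighted impurity before the split minus weighted impurity after it\n", "    decrease = (\n", "        t.weighted_n_node_samples[node] * t.impurity[node]\n", "        - t.weighted_n_node_samples[left] * t.impurity[left]\n", "        - t.weighted_n_node_samples[right] * t.impurity[right]\n", "    )\n", "    importances[t.feature[node]] += decrease\n", "importances = importances / importances.sum()\n", "\n", "# Compare the hand computation with the attribute reported by scikit-learn\n", "print(np.allclose(importances, clf.feature_importances_))\n", "```\n", "\n", "If the assumption about how the importances are defined is right, the comparison on the last line prints `True`."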
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "cell_id": "389b51fcd04642e294cfe67d4a8a91aa", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 213, "execution_start": 1668623225100, "source_hash": "89e1305f", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "petal_length 0.523118\n", "petal_width 0.476882\n", "sepal_length 0.000000\n", "dtype: float64" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(clf.feature_importances_, index=clf.feature_names_in_)" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "b714c6b0899f43b5973abac694a2df97", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Predictions and predicted probabilities\n", "\n", "* What species will be predicted for an iris with the following (physically impossible) values?\n", "```\n", "{\"petal_length\": 4, \"petal_width\": -5, \"sepal_length\": 3}\n", "```\n", "* What are the corresponding predicted probabilities?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we have to add square brackets around the dictionary; this is like telling pandas that it will be a one-row DataFrame (because the list has only one dictionary in it)." 
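, "\n", "\n", "A small sketch of why the brackets matter, using a hypothetical variable `row` to hold the dictionary: with the brackets pandas builds a one-row DataFrame, and without them pandas raises an error because it cannot tell how many rows the scalar values should fill.\n", "\n", "```python\n", "import pandas as pd\n", "\n", "row = {'petal_length': 4, 'petal_width': -5, 'sepal_length': 3}\n", "\n", "# A list containing one dictionary: one row, with the keys as column names\n", "print(pd.DataFrame([row]).shape)  # (1, 3)\n", "\n", "# The bare dictionary of scalar values is rejected\n", "try:\n", "    pd.DataFrame(row)\n", "except ValueError as e:\n", "    print('ValueError:', e)\n", "```"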
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "cell_id": "6b0e68b30cd348c7ac4c74e31d8f71c5", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 839, "execution_start": 1668623537442, "source_hash": "e8bfc735", "tags": [] }, "outputs": [ { "data": { "application/vnd.deepnote.dataframe.v3+json": { "column_count": 3, "columns": [ { "dtype": "int64", "name": "petal_length", "stats": { "histogram": [ { "bin_end": 3.6, "bin_start": 3.5, "count": 0 }, { "bin_end": 3.7, "bin_start": 3.6, "count": 0 }, { "bin_end": 3.8, "bin_start": 3.7, "count": 0 }, { "bin_end": 3.9, "bin_start": 3.8, "count": 0 }, { "bin_end": 4, "bin_start": 3.9, "count": 0 }, { "bin_end": 4.1, "bin_start": 4, "count": 1 }, { "bin_end": 4.2, "bin_start": 4.1, "count": 0 }, { "bin_end": 4.3, "bin_start": 4.2, "count": 0 }, { "bin_end": 4.4, "bin_start": 4.3, "count": 0 }, { "bin_end": 4.5, "bin_start": 4.4, "count": 0 } ], "max": "4", "min": "4", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "petal_width", "stats": { "histogram": [ { "bin_end": -5.4, "bin_start": -5.5, "count": 0 }, { "bin_end": -5.3, "bin_start": -5.4, "count": 0 }, { "bin_end": -5.2, "bin_start": -5.3, "count": 0 }, { "bin_end": -5.1, "bin_start": -5.2, "count": 0 }, { "bin_end": -5, "bin_start": -5.1, "count": 0 }, { "bin_end": -4.9, "bin_start": -5, "count": 1 }, { "bin_end": -4.8, "bin_start": -4.9, "count": 0 }, { "bin_end": -4.7, "bin_start": -4.8, "count": 0 }, { "bin_end": -4.6, "bin_start": -4.7, "count": 0 }, { "bin_end": -4.5, "bin_start": -4.6, "count": 0 } ], "max": "-5", "min": "-5", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "sepal_length", "stats": { "histogram": [ { "bin_end": 2.6, "bin_start": 2.5, "count": 0 }, { "bin_end": 2.7, "bin_start": 2.6, "count": 0 }, { "bin_end": 2.8, "bin_start": 2.7, "count": 0 }, { "bin_end": 2.9, "bin_start": 2.8, "count": 0 }, { "bin_end": 3, "bin_start": 2.9, "count": 0 }, { "bin_end": 
3.1, "bin_start": 3, "count": 1 }, { "bin_end": 3.2, "bin_start": 3.1, "count": 0 }, { "bin_end": 3.3, "bin_start": 3.2, "count": 0 }, { "bin_end": 3.4, "bin_start": 3.3, "count": 0 }, { "bin_end": 3.5, "bin_start": 3.4, "count": 0 } ], "max": "3", "min": "3", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "_deepnote_index_column" } ], "row_count": 1, "rows": [ { "_deepnote_index_column": "0", "petal_length": "4", "petal_width": "-5", "sepal_length": "3" } ] }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
petal_lengthpetal_widthsepal_length
04-53
\n", "
" ], "text/plain": [ " petal_length petal_width sepal_length\n", "0 4 -5 3" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_mini = pd.DataFrame([{\"petal_length\": 4, \"petal_width\": -5, \"sepal_length\": 3}])\n", "df_mini" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though this flower does not exist in real life (for example because of the -5 value), the classifier has no trouble evaluating it. Because the petal length value is greater than 2.35 and the petal width value is less than 1.65, we are in the exact same lower-right region we were discussing above, which corresponded to a *versicolor* prediction." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "cell_id": "5355f84a04a64057928b1e91c844f9a2", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1668623598908, "source_hash": "50bd758f", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['versicolor'], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict(df_mini)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a more refined prediction, we can get predicted probabilities, rather than just the final answer. The classes will be listed in alphabetical order." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "cell_id": "d944de8bc0464761a5dc41730d730d3b", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 525, "execution_start": 1668623624998, "source_hash": "c1a71e1", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array(['setosa', 'versicolor', 'virginica'], dtype=object)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.classes_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So our classifier thinks there is a 93.2% chance that our flower is in the versicolor species." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "cell_id": "1a15dc9da0ec4bac87e185d3a9ff36df", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 20, "execution_start": 1668623613054, "source_hash": "9501f8ec", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[0. , 0.93181818, 0.06818182]])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clf.predict_proba(df_mini)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Where did those numbers come from? Remember that our lower-right region contained 41 versicolor flowers, 3 virginica flowers, and 0 setosa flowers. For example, that 93.2% corresponds to the probability 41/(0+41+3)." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "cell_id": "a6e9cc14c1b14ce2be29c7872e3caa2a", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 205, "execution_start": 1668623691100, "source_hash": "d1494f49", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.9318181818181818" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "41/44" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We didn't get to the following. 
We will either start there or start with a similar example of the U-shaped test error curve on Friday." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "a266697f8918493b810d73e8abb54874", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## The U-shaped test error curve\n", "\n", "We haven't used `df_test` yet. Here we will. We will use a loss function that is commonly used for classification problems, **log loss** (also called **cross entropy**). If you want to see the mathematical description, I have it in the [Spring 2022 course notes](https://christopherdavisuci.github.io/UCI-Math-10-S22/Week9/Week9-Monday.html#a-loss-function-for-classification), but for us the most important thing is that lower values are better (as with Mean Squared Error and Mean Absolute Error) and that this is a loss function for classification (not for regression)." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "cell_id": "f0e1e33b0278402ba40244da45e7d1cf", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1668608092942, "source_hash": "1e40b449", "tags": [] }, "outputs": [], "source": [ "from sklearn.metrics import log_loss" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "fc160db9396642b7ace57b2a308b1de0", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* Define two empty dictionaries, `train_dict` and `test_dict`.\n", "\n", "For each integer value of `n` from `2` to `9` (inclusive), do the following. \n", "* Instantiate a new `DecisionTreeClassifier` using `max_leaf_nodes` as `n` and using `random_state=1`.\n", "* Fit the classifier using `df_train[cols]` and `df_train[\"species\"]`.\n", "* Using `log_loss` from `sklearn.metrics`, evaluate the log loss (error) between `df_train[\"species\"]` and the predicted probabilities. (These two objects need to be input to `log_loss` in that order, with the true values first, followed by the predicted probabilities.) 
Put this error as a value into the `train_dict` dictionary with the key `n`.\n", "* Do the same thing with `df_test` and `test_dict`." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "b899fe00e290410ca9a168e1999ac005", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "* How do the values in `train_dict` and `test_dict` compare?\n", "* At what point does the classifier seem to begin overfitting the data?" ] } ], "metadata": { "deepnote": {}, "deepnote_execution_queue": [], "deepnote_notebook_id": "1cc2bdb3841f457ba7ebf401214794ca", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }