{ "cells": [ { "cell_type": "markdown", "metadata": { "cell_id": "8b73c08835644ec7972f6e0fd729ccc4", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "# Week 7 Wednesday" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "d3e9a55d47264509ae4756cfa2cc6b62", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Announcements\n", "\n", "* I'm going to move my Wednesday office hours to my office RH 440J (same time: 1pm).\n", "* The next videos and video quizzes are posted. Due Monday of Week 8 (because Friday is a holiday this week).\n", "* Big topics left in the class: overfitting and decision trees/random forests.\n", "* Midterm: Tuesday of Week 9. On Wednesday of Week 9 (day before Thanksgiving) I'll introduce the Course Project. (No Final Exam in Math 10.)" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "e60eb988148c47a2b3a1b0092b66304b", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Including a categorical variable in our linear regression\n", "\n", "Here is some of the code from the last class. We used one-hot encoding to convert the \"origin\" column (which contains strings) into three separate numerical columns (containing only 0s and 1s, like a Boolean Series).\n", "\n", "We haven't performed the linear regression yet using these new columns." 
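, "

For a small standalone picture of what one-hot encoding does, here is a minimal sketch written in plain Python rather than with scikit-learn. The list `origins` is a made-up stand-in for the origin column (an assumption for illustration, not the real dataset), and the alphabetical ordering of the categories mirrors what `OneHotEncoder` does by default for string categories.

```python
# Toy stand-in for the origin column (hypothetical values, not the real dataset).
origins = ['usa', 'japan', 'europe', 'usa', 'japan']

# Sort the distinct categories alphabetically, which matches the default
# ordering scikit-learn's OneHotEncoder uses for string categories.
categories = sorted(set(origins))

# One row per entry, one column per category: 1.0 marks the matching category.
encoded = [[1.0 if value == cat else 0.0 for cat in categories] for value in origins]

# Column names in the same style as the encoder's feature names.
column_names = ['origin_' + cat for cat in categories]

print(column_names)
for value, row in zip(origins, encoded):
    print(value, row)
```

Each row contains exactly one 1.0, in the column matching that row's category.
"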
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "cell_id": "ecf073d8b24047e7a598945d916cd7e0", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1668016928212, "source_hash": "8ecb1ef2", "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "import altair as alt\n", "import seaborn as sns\n", "\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "cell_id": "f879a5b86e0b4a438f07590aaa56734e", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 356, "execution_start": 1668016953935, "source_hash": "edc73efe", "tags": [] }, "outputs": [], "source": [ "df = sns.load_dataset(\"mpg\").dropna(axis=0)\n", "\n", "cols = [\"horsepower\", \"weight\", \"model_year\", \"cylinders\"]" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "cell_id": "7d6327bcb3d14a7ea7a67a129ea9b87f", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1668016970218, "source_hash": "cd72858e", "tags": [] }, "outputs": [], "source": [ "reg = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "cell_id": "b5df299a9e4c46a99c38a19e215b6d80", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 9, "execution_start": 1668016976508, "source_hash": "bf85b034", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg.fit(df[cols], df[\"mpg\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "cell_id": "60b6f66c5a694486a5a0099386819ab6", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 4, "execution_start": 1668016978604, "source_hash": "7f9518fb", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "horsepower -0.003615\n", "weight -0.006275\n", "model_year 0.746632\n", "cylinders -0.127687\n", "dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(reg.coef_, index=reg.feature_names_in_)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "cell_id": "88920404f7ef4521b09462f3ddd6f5fa", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1668017061103, "source_hash": "93fdf3e0", "tags": [] }, "outputs": [], "source": [ "encoder = OneHotEncoder()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "cell_id": "10f4988b14264169aa2c614f23ff378d", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 5, "execution_start": 1668017082547, "source_hash": "c6d39e3d", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
OneHotEncoder()
" ], "text/plain": [ "OneHotEncoder()" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder.fit(df[[\"origin\"]])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "cell_id": "de5957fc1a5c49a1ac10e1e00539a8cd", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1668017084456, "source_hash": "fafa1b", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "['origin_europe', 'origin_japan', 'origin_usa']" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "new_cols = list(encoder.get_feature_names_out())\n", "\n", "new_cols" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "cell_id": "88b8d872614f472db43e29bd0454094a", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 1, "execution_start": 1668017104672, "source_hash": "f91a71af", "tags": [] }, "outputs": [], "source": [ "df2 = df.copy()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "cell_id": "0ff4e0083ec14b6ab6c225e971029d6c", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668017106810, "source_hash": "7cd00360", "tags": [] }, "outputs": [], "source": [ "df2[new_cols] = encoder.transform(df[[\"origin\"]]).toarray()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "cell_id": "1dc952ffc460460d9e8a3904cb5428a0", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 44, "execution_start": 1668017112108, "source_hash": "9123f3f0", "tags": [] }, "outputs": [ { "data": { "application/vnd.deepnote.dataframe.v3+json": { "column_count": 12, "columns": [ { "dtype": "float64", "name": "mpg", "stats": { "histogram": [ { "bin_end": 17.01, "bin_start": 15, "count": 1 }, { "bin_end": 19.02, "bin_start": 17.01, "count": 0 }, { "bin_end": 21.03, "bin_start": 19.02, "count": 0 }, { "bin_end": 
23.04, "bin_start": 21.03, "count": 1 }, { "bin_end": 25.05, "bin_start": 23.04, "count": 0 }, { "bin_end": 27.060000000000002, "bin_start": 25.05, "count": 0 }, { "bin_end": 29.07, "bin_start": 27.060000000000002, "count": 1 }, { "bin_end": 31.080000000000002, "bin_start": 29.07, "count": 1 }, { "bin_end": 33.09, "bin_start": 31.080000000000002, "count": 0 }, { "bin_end": 35.1, "bin_start": 33.09, "count": 1 } ], "max": "35.1", "min": "15.0", "nan_count": 0, "unique_count": 5 } }, { "dtype": "int64", "name": "cylinders", "stats": { "histogram": [ { "bin_end": 4.4, "bin_start": 4, "count": 4 }, { "bin_end": 4.8, "bin_start": 4.4, "count": 0 }, { "bin_end": 5.2, "bin_start": 4.8, "count": 0 }, { "bin_end": 5.6, "bin_start": 5.2, "count": 0 }, { "bin_end": 6, "bin_start": 5.6, "count": 0 }, { "bin_end": 6.4, "bin_start": 6, "count": 0 }, { "bin_end": 6.800000000000001, "bin_start": 6.4, "count": 0 }, { "bin_end": 7.2, "bin_start": 6.800000000000001, "count": 0 }, { "bin_end": 7.6, "bin_start": 7.2, "count": 0 }, { "bin_end": 8, "bin_start": 7.6, "count": 1 } ], "max": "8", "min": "4", "nan_count": 0, "unique_count": 2 } }, { "dtype": "float64", "name": "displacement", "stats": { "histogram": [ { "bin_end": 107.9, "bin_start": 81, "count": 3 }, { "bin_end": 134.8, "bin_start": 107.9, "count": 1 }, { "bin_end": 161.7, "bin_start": 134.8, "count": 0 }, { "bin_end": 188.6, "bin_start": 161.7, "count": 0 }, { "bin_end": 215.5, "bin_start": 188.6, "count": 0 }, { "bin_end": 242.39999999999998, "bin_start": 215.5, "count": 0 }, { "bin_end": 269.29999999999995, "bin_start": 242.39999999999998, "count": 0 }, { "bin_end": 296.2, "bin_start": 269.29999999999995, "count": 0 }, { "bin_end": 323.1, "bin_start": 296.2, "count": 0 }, { "bin_end": 350, "bin_start": 323.1, "count": 1 } ], "max": "350.0", "min": "81.0", "nan_count": 0, "unique_count": 5 } }, { "dtype": "float64", "name": "horsepower", "stats": { "histogram": [ { "bin_end": 70.5, "bin_start": 60, "count": 1 }, { 
"bin_end": 81, "bin_start": 70.5, "count": 2 }, { "bin_end": 91.5, "bin_start": 81, "count": 0 }, { "bin_end": 102, "bin_start": 91.5, "count": 1 }, { "bin_end": 112.5, "bin_start": 102, "count": 0 }, { "bin_end": 123, "bin_start": 112.5, "count": 0 }, { "bin_end": 133.5, "bin_start": 123, "count": 0 }, { "bin_end": 144, "bin_start": 133.5, "count": 0 }, { "bin_end": 154.5, "bin_start": 144, "count": 0 }, { "bin_end": 165, "bin_start": 154.5, "count": 1 } ], "max": "165.0", "min": "60.0", "nan_count": 0, "unique_count": 5 } }, { "dtype": "int64", "name": "weight", "stats": { "histogram": [ { "bin_end": 1953.3, "bin_start": 1760, "count": 1 }, { "bin_end": 2146.6, "bin_start": 1953.3, "count": 1 }, { "bin_end": 2339.9, "bin_start": 2146.6, "count": 1 }, { "bin_end": 2533.2, "bin_start": 2339.9, "count": 1 }, { "bin_end": 2726.5, "bin_start": 2533.2, "count": 0 }, { "bin_end": 2919.8, "bin_start": 2726.5, "count": 0 }, { "bin_end": 3113.1000000000004, "bin_start": 2919.8, "count": 0 }, { "bin_end": 3306.4, "bin_start": 3113.1000000000004, "count": 0 }, { "bin_end": 3499.7, "bin_start": 3306.4, "count": 0 }, { "bin_end": 3693, "bin_start": 3499.7, "count": 1 } ], "max": "3693", "min": "1760", "nan_count": 0, "unique_count": 5 } }, { "dtype": "float64", "name": "acceleration", "stats": { "histogram": [ { "bin_end": 11.96, "bin_start": 11.5, "count": 1 }, { "bin_end": 12.42, "bin_start": 11.96, "count": 0 }, { "bin_end": 12.88, "bin_start": 12.42, "count": 0 }, { "bin_end": 13.34, "bin_start": 12.88, "count": 0 }, { "bin_end": 13.8, "bin_start": 13.34, "count": 0 }, { "bin_end": 14.260000000000002, "bin_start": 13.8, "count": 1 }, { "bin_end": 14.72, "bin_start": 14.260000000000002, "count": 2 }, { "bin_end": 15.180000000000001, "bin_start": 14.72, "count": 0 }, { "bin_end": 15.64, "bin_start": 15.180000000000001, "count": 0 }, { "bin_end": 16.1, "bin_start": 15.64, "count": 1 } ], "max": "16.1", "min": "11.5", "nan_count": 0, "unique_count": 4 } }, { "dtype": "int64", 
"name": "model_year", "stats": { "histogram": [ { "bin_end": 71.1, "bin_start": 70, "count": 1 }, { "bin_end": 72.2, "bin_start": 71.1, "count": 1 }, { "bin_end": 73.3, "bin_start": 72.2, "count": 0 }, { "bin_end": 74.4, "bin_start": 73.3, "count": 1 }, { "bin_end": 75.5, "bin_start": 74.4, "count": 0 }, { "bin_end": 76.6, "bin_start": 75.5, "count": 0 }, { "bin_end": 77.7, "bin_start": 76.6, "count": 1 }, { "bin_end": 78.8, "bin_start": 77.7, "count": 0 }, { "bin_end": 79.9, "bin_start": 78.8, "count": 0 }, { "bin_end": 81, "bin_start": 79.9, "count": 1 } ], "max": "81", "min": "70", "nan_count": 0, "unique_count": 5 } }, { "dtype": "object", "name": "origin", "stats": { "categories": [ { "count": 2, "name": "usa" }, { "count": 2, "name": "japan" }, { "count": 1, "name": "europe" } ], "nan_count": 0, "unique_count": 3 } }, { "dtype": "object", "name": "name", "stats": { "categories": [ { "count": 1, "name": "dodge colt" }, { "count": 1, "name": "volkswagen dasher" }, { "count": 3, "name": "3 others" } ], "nan_count": 0, "unique_count": 5 } }, { "dtype": "float64", "name": "origin_europe", "stats": { "histogram": [ { "bin_end": 0.1, "bin_start": 0, "count": 4 }, { "bin_end": 0.2, "bin_start": 0.1, "count": 0 }, { "bin_end": 0.30000000000000004, "bin_start": 0.2, "count": 0 }, { "bin_end": 0.4, "bin_start": 0.30000000000000004, "count": 0 }, { "bin_end": 0.5, "bin_start": 0.4, "count": 0 }, { "bin_end": 0.6000000000000001, "bin_start": 0.5, "count": 0 }, { "bin_end": 0.7000000000000001, "bin_start": 0.6000000000000001, "count": 0 }, { "bin_end": 0.8, "bin_start": 0.7000000000000001, "count": 0 }, { "bin_end": 0.9, "bin_start": 0.8, "count": 0 }, { "bin_end": 1, "bin_start": 0.9, "count": 1 } ], "max": "1.0", "min": "0.0", "nan_count": 0, "unique_count": 2 } }, { "dtype": "float64", "name": "origin_japan", "stats": { "histogram": [ { "bin_end": 0.1, "bin_start": 0, "count": 3 }, { "bin_end": 0.2, "bin_start": 0.1, "count": 0 }, { "bin_end": 0.30000000000000004, 
"bin_start": 0.2, "count": 0 }, { "bin_end": 0.4, "bin_start": 0.30000000000000004, "count": 0 }, { "bin_end": 0.5, "bin_start": 0.4, "count": 0 }, { "bin_end": 0.6000000000000001, "bin_start": 0.5, "count": 0 }, { "bin_end": 0.7000000000000001, "bin_start": 0.6000000000000001, "count": 0 }, { "bin_end": 0.8, "bin_start": 0.7000000000000001, "count": 0 }, { "bin_end": 0.9, "bin_start": 0.8, "count": 0 }, { "bin_end": 1, "bin_start": 0.9, "count": 2 } ], "max": "1.0", "min": "0.0", "nan_count": 0, "unique_count": 2 } }, { "dtype": "float64", "name": "origin_usa", "stats": { "histogram": [ { "bin_end": 0.1, "bin_start": 0, "count": 3 }, { "bin_end": 0.2, "bin_start": 0.1, "count": 0 }, { "bin_end": 0.30000000000000004, "bin_start": 0.2, "count": 0 }, { "bin_end": 0.4, "bin_start": 0.30000000000000004, "count": 0 }, { "bin_end": 0.5, "bin_start": 0.4, "count": 0 }, { "bin_end": 0.6000000000000001, "bin_start": 0.5, "count": 0 }, { "bin_end": 0.7000000000000001, "bin_start": 0.6000000000000001, "count": 0 }, { "bin_end": 0.8, "bin_start": 0.7000000000000001, "count": 0 }, { "bin_end": 0.9, "bin_start": 0.8, "count": 0 }, { "bin_end": 1, "bin_start": 0.9, "count": 2 } ], "max": "1.0", "min": "0.0", "nan_count": 0, "unique_count": 2 } }, { "dtype": "int64", "name": "_deepnote_index_column" } ], "row_count": 5, "rows": [ { "_deepnote_index_column": "146", "acceleration": "14.5", "cylinders": "4", "displacement": "90.0", "horsepower": "75.0", "model_year": "74", "mpg": "28.0", "name": "dodge colt", "origin": "usa", "origin_europe": "0.0", "origin_japan": "0.0", "origin_usa": "1.0", "weight": "2125" }, { "_deepnote_index_column": "240", "acceleration": "14.1", "cylinders": "4", "displacement": "97.0", "horsepower": "78.0", "model_year": "77", "mpg": "30.5", "name": "volkswagen dasher", "origin": "europe", "origin_europe": "1.0", "origin_japan": "0.0", "origin_usa": "0.0", "weight": "2190" }, { "_deepnote_index_column": "82", "acceleration": "14.5", "cylinders": "4", 
"displacement": "120.0", "horsepower": "97.0", "model_year": "72", "mpg": "23.0", "name": "toyouta corona mark ii (sw)", "origin": "japan", "origin_europe": "0.0", "origin_japan": "1.0", "origin_usa": "0.0", "weight": "2506" }, { "_deepnote_index_column": "1", "acceleration": "11.5", "cylinders": "8", "displacement": "350.0", "horsepower": "165.0", "model_year": "70", "mpg": "15.0", "name": "buick skylark 320", "origin": "usa", "origin_europe": "0.0", "origin_japan": "0.0", "origin_usa": "1.0", "weight": "3693" }, { "_deepnote_index_column": "345", "acceleration": "16.1", "cylinders": "4", "displacement": "81.0", "horsepower": "60.0", "model_year": "81", "mpg": "35.1", "name": "honda civic 1300", "origin": "japan", "origin_europe": "0.0", "origin_japan": "1.0", "origin_usa": "0.0", "weight": "1760" } ] }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
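| | mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | origin_europe | origin_japan | origin_usa |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 146 | 28.0 | 4 | 90.0 | 75.0 | 2125 | 14.5 | 74 | usa | dodge colt | 0.0 | 0.0 | 1.0 |
| 240 | 30.5 | 4 | 97.0 | 78.0 | 2190 | 14.1 | 77 | europe | volkswagen dasher | 1.0 | 0.0 | 0.0 |
| 82 | 23.0 | 4 | 120.0 | 97.0 | 2506 | 14.5 | 72 | japan | toyouta corona mark ii (sw) | 0.0 | 1.0 | 0.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 | 0.0 | 0.0 | 1.0 |
| 345 | 35.1 | 4 | 81.0 | 60.0 | 1760 | 16.1 | 81 | japan | honda civic 1300 | 0.0 | 1.0 | 0.0 |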
\n", "
" ], "text/plain": [ " mpg cylinders displacement horsepower weight acceleration \\\n", "146 28.0 4 90.0 75.0 2125 14.5 \n", "240 30.5 4 97.0 78.0 2190 14.1 \n", "82 23.0 4 120.0 97.0 2506 14.5 \n", "1 15.0 8 350.0 165.0 3693 11.5 \n", "345 35.1 4 81.0 60.0 1760 16.1 \n", "\n", " model_year origin name origin_europe \\\n", "146 74 usa dodge colt 0.0 \n", "240 77 europe volkswagen dasher 1.0 \n", "82 72 japan toyouta corona mark ii (sw) 0.0 \n", "1 70 usa buick skylark 320 0.0 \n", "345 81 japan honda civic 1300 0.0 \n", "\n", " origin_japan origin_usa \n", "146 0.0 1.0 \n", "240 0.0 0.0 \n", "82 1.0 0.0 \n", "1 0.0 1.0 \n", "345 1.0 0.0 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.sample(5, random_state=12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is where the new (Wednesday) material begins. Let's start with a reminder of what the one-hot encoding is doing. There are three total values in the \"origin\" column. We definitely can't use this column \"as is\" for linear regression, because it contains strings. In theory we could replace the strings with numbers, like 0,1,2, but that would enforce an order on the categories, as well as the relative difference between the categories. Instead, at the cost of taking up more space, we make three new columns." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "cell_id": "b9381d54529a493eb7295f298dcd0a5f", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 10, "execution_start": 1668017165754, "source_hash": "b8f82ffd", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0 usa\n", "1 usa\n", "2 usa\n", "3 usa\n", "4 usa\n", " ... 
\n", "393 usa\n", "394 europe\n", "395 usa\n", "396 usa\n", "397 usa\n", "Name: origin, Length: 392, dtype: object" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"origin\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stored as an ordinary NumPy array, the one-hot encoding of this column would be two-thirds zeros (why?). If there were more distinct values, it would be even more dominated by zeros. For this reason, scikit-learn by default saves the result as a memory-efficient \"sparse\" object." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "cell_id": "4808ebc2a9d540b9bc916301a6d08f6b", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 954, "execution_start": 1668017237757, "source_hash": "8eca8651", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "<392x3 sparse matrix of type '<class 'numpy.float64'>'\n", "\twith 392 stored elements in Compressed Sparse Row format>" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder.fit_transform(df[[\"origin\"]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can convert this to a normal NumPy array by using the `toarray` method." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "cell_id": "10e3813fef8a4d3dbbbe4608f10d361b", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 9, "execution_start": 1668017287530, "source_hash": "1d2c4ead", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " ...,\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder.fit_transform(df[[\"origin\"]]).toarray()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a reminder of the last five values in the \"origin\" column." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "cell_id": "75cf2d7bc8324be0bd2509550c7827b2", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 185, "execution_start": 1668017384640, "source_hash": "5de4b0b6", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "393 usa\n", "394 europe\n", "395 usa\n", "396 usa\n", "397 usa\n", "Name: origin, dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[\"origin\"][-5:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here are the corresponding rows of the encoding. Notice that most of the 1 values are in the last column (corresponding to \"usa\"), and the remaining 1 value is in the first column (corresponding to \"europe\"). Be sure you understand how the 5 entries in the \"origin\" column correspond to this 5x3 NumPy array." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "cell_id": "8cc5e0b6c2c049b38d42e9b97fd5ebb4", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 306, "execution_start": 1668017339357, "source_hash": "489b3b2", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([[0., 0., 1.],\n", " [1., 0., 0.],\n", " [0., 0., 1.],\n", " [0., 0., 1.],\n", " [0., 0., 1.]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "encoder.fit_transform(df[[\"origin\"]]).toarray()[-5:]" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "cell_id": "9f53e029efb4448189aa5ecff4d9c0e1", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 8, "execution_start": 1668017408838, "source_hash": "25d43fa0", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',\n", " 'acceleration', 'model_year', 'origin', 'name'],\n", " dtype='object')" ] }, "execution_count": 18, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It wouldn't make sense to use one-hot encoding on the \"name\" column, because almost every row has a unique name." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "cell_id": "41e0efa081b941c38cd7028507e25b2a", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 267, "execution_start": 1668017416010, "source_hash": "f0a8ef91", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "393 ford mustang gl\n", "394 vw pickup\n", "395 dodge rampage\n", "396 ford ranger\n", "397 chevy s-10\n", "Name: name, dtype: object" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.name[-5:]" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "cell_id": "95a842a0a1d24d7eb92cd810c3e49393", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 10, "execution_start": 1668017456202, "source_hash": "ac6dd73d", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "301" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[\"name\"].unique())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's finally see how to perform linear regression using this newly encoded \"origin\" column. We specify that this linear regression object should not learn the intercept. 
(We'll say more about why a little later in this notebook.)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "cell_id": "67108e8910cc41ed953d6db8d9aa7dbf", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 2, "execution_start": 1668017599789, "source_hash": "fad9364d", "tags": [] }, "outputs": [], "source": [ "reg2 = LinearRegression(fit_intercept=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now fit this object using the 4 old columns and the 3 new columns." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "cell_id": "1a6947250f264a73bc4bd5e81bfae81f", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 332, "execution_start": 1668017691165, "source_hash": "990c27df", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
LinearRegression(fit_intercept=False)
" ], "text/plain": [ "LinearRegression(fit_intercept=False)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg2.fit(df2[cols+new_cols], df2[\"mpg\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because we set `fit_intercept=False`, the `intercept_` value is `0`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "cell_id": "8bc4b0a0a5ea49f392373ec27ecada67", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 5, "execution_start": 1668017705309, "source_hash": "b6aa95b2", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "0.0" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg2.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More interesting are the seven coefficients." ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "cell_id": "fb91813d04a64472b8189354b5e6e09a", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1668017767667, "source_hash": "59aadfbd", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([-1.00027917e-02, -5.73669613e-03, 7.57614205e-01, 1.42568914e-01,\n", " -1.55580718e+01, -1.52649392e+01, -1.75931922e+01])" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg2.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here they are, grouped with the corresponding column names. Originally I was going to use `index=cols+new_cols`, but this approach, using `index=reg2.feature_names_in_`, is more robust." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": { "cell_id": "829549b6831149f0a05811ea42fc08d9", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 5, "execution_start": 1668017814543, "source_hash": "a57d653b", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "horsepower -0.010003\n", "weight -0.005737\n", "model_year 0.757614\n", "cylinders 0.142569\n", "origin_europe -15.558072\n", "origin_japan -15.264939\n", "origin_usa -17.593192\n", "dtype: float64" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(reg2.coef_, index=reg2.feature_names_in_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finding those coefficients is difficult, but once we have the coefficients, scikit-learn is not doing anything fancy when it makes its predictions. Let's try to mimic the prediction for a single data point." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "cell_id": "59b99d04c81d4ad8815c2d9adf8591eb", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 4, "execution_start": 1668017848897, "source_hash": "6048c052", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "mpg 18.0\n", "cylinders 6\n", "displacement 250.0\n", "horsepower 105.0\n", "weight 3459\n", "acceleration 16.0\n", "model_year 75\n", "origin usa\n", "name chevrolet nova\n", "Name: 153, dtype: object" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[153]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With some rounding, the following is the computation made by scikit-learn. Look at these numbers and try to understand where they come from. For example, `105` is the \"horsepower\" value, and `-0.01` is the corresponding coefficient found by `reg2`.\n", "\n", "Notice also the `-17.59` at the end. This corresponds to the `origin_usa` value. 
This final term is the same for every car whose \"origin\" is \"usa\". This `-17.59` is not being multiplied by anything (or, if you prefer, it is being multiplied by `1`), so it functions like an intercept: a custom intercept for the \"usa\" cars. This is why we set `fit_intercept=False` when we created `reg2`. Another way to look at it: if we wanted to add, for example, `13` as an intercept, we could instead add `13` to the \"origin_europe\" coefficient, to the \"origin_japan\" coefficient, and to the \"origin_usa\" coefficient; that would have exactly the same effect." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "cell_id": "22513862c3004ac3a416e8f3d08c49a0", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1668018179351, "source_hash": "4ba3be47", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "19.270699999999994" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "105*-0.01+3459*-0.0057+0.757*75+0.142*6+-17.59" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to recover that number using `reg2.predict`. We can't pass the following to `reg2.predict`, because it is a pandas Series and hence one-dimensional, whereas `predict` expects two-dimensional input." 
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "cell_id": "aadfe800a03a41798483b9bf430f3d2e", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 6, "execution_start": 1668018262271, "source_hash": "53d0cbe0", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "horsepower 105.0\n", "weight 3459\n", "model_year 75\n", "cylinders 6\n", "origin_europe 0.0\n", "origin_japan 0.0\n", "origin_usa 1.0\n", "Name: 153, dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[153, cols+new_cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By replacing the integer `153` with the list `[153]`, we get a pandas DataFrame with a single row." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "cell_id": "e6c79294781d49939b18df88de8ca028", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 11, "execution_start": 1668018291126, "source_hash": "aa18f343", "tags": [] }, "outputs": [ { "data": { "application/vnd.deepnote.dataframe.v3+json": { "column_count": 7, "columns": [ { "dtype": "float64", "name": "horsepower", "stats": { "histogram": [ { "bin_end": 104.6, "bin_start": 104.5, "count": 0 }, { "bin_end": 104.7, "bin_start": 104.6, "count": 0 }, { "bin_end": 104.8, "bin_start": 104.7, "count": 0 }, { "bin_end": 104.9, "bin_start": 104.8, "count": 0 }, { "bin_end": 105, "bin_start": 104.9, "count": 0 }, { "bin_end": 105.1, "bin_start": 105, "count": 1 }, { "bin_end": 105.2, "bin_start": 105.1, "count": 0 }, { "bin_end": 105.3, "bin_start": 105.2, "count": 0 }, { "bin_end": 105.4, "bin_start": 105.3, "count": 0 }, { "bin_end": 105.5, "bin_start": 105.4, "count": 0 } ], "max": "105.0", "min": "105.0", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "weight", "stats": { "histogram": [ { "bin_end": 3458.6, "bin_start": 3458.5, "count": 0 }, { "bin_end": 3458.7, "bin_start": 3458.6, "count": 
0 }, { "bin_end": 3458.8, "bin_start": 3458.7, "count": 0 }, { "bin_end": 3458.9, "bin_start": 3458.8, "count": 0 }, { "bin_end": 3459, "bin_start": 3458.9, "count": 0 }, { "bin_end": 3459.1, "bin_start": 3459, "count": 1 }, { "bin_end": 3459.2, "bin_start": 3459.1, "count": 0 }, { "bin_end": 3459.3, "bin_start": 3459.2, "count": 0 }, { "bin_end": 3459.4, "bin_start": 3459.3, "count": 0 }, { "bin_end": 3459.5, "bin_start": 3459.4, "count": 0 } ], "max": "3459", "min": "3459", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "model_year", "stats": { "histogram": [ { "bin_end": 74.6, "bin_start": 74.5, "count": 0 }, { "bin_end": 74.7, "bin_start": 74.6, "count": 0 }, { "bin_end": 74.8, "bin_start": 74.7, "count": 0 }, { "bin_end": 74.9, "bin_start": 74.8, "count": 0 }, { "bin_end": 75, "bin_start": 74.9, "count": 0 }, { "bin_end": 75.1, "bin_start": 75, "count": 1 }, { "bin_end": 75.2, "bin_start": 75.1, "count": 0 }, { "bin_end": 75.3, "bin_start": 75.2, "count": 0 }, { "bin_end": 75.4, "bin_start": 75.3, "count": 0 }, { "bin_end": 75.5, "bin_start": 75.4, "count": 0 } ], "max": "75", "min": "75", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "cylinders", "stats": { "histogram": [ { "bin_end": 5.6, "bin_start": 5.5, "count": 0 }, { "bin_end": 5.7, "bin_start": 5.6, "count": 0 }, { "bin_end": 5.8, "bin_start": 5.7, "count": 0 }, { "bin_end": 5.9, "bin_start": 5.8, "count": 0 }, { "bin_end": 6, "bin_start": 5.9, "count": 0 }, { "bin_end": 6.1, "bin_start": 6, "count": 1 }, { "bin_end": 6.2, "bin_start": 6.1, "count": 0 }, { "bin_end": 6.3, "bin_start": 6.2, "count": 0 }, { "bin_end": 6.4, "bin_start": 6.3, "count": 0 }, { "bin_end": 6.5, "bin_start": 6.4, "count": 0 } ], "max": "6", "min": "6", "nan_count": 0, "unique_count": 1 } }, { "dtype": "float64", "name": "origin_europe", "stats": { "histogram": [ { "bin_end": -0.4, "bin_start": -0.5, "count": 0 }, { "bin_end": -0.3, "bin_start": -0.4, "count": 0 }, { "bin_end": 
-0.19999999999999996, "bin_start": -0.3, "count": 0 }, { "bin_end": -0.09999999999999998, "bin_start": -0.19999999999999996, "count": 0 }, { "bin_end": 0, "bin_start": -0.09999999999999998, "count": 0 }, { "bin_end": 0.10000000000000009, "bin_start": 0, "count": 1 }, { "bin_end": 0.20000000000000007, "bin_start": 0.10000000000000009, "count": 0 }, { "bin_end": 0.30000000000000004, "bin_start": 0.20000000000000007, "count": 0 }, { "bin_end": 0.4, "bin_start": 0.30000000000000004, "count": 0 }, { "bin_end": 0.5, "bin_start": 0.4, "count": 0 } ], "max": "0.0", "min": "0.0", "nan_count": 0, "unique_count": 1 } }, { "dtype": "float64", "name": "origin_japan", "stats": { "histogram": [ { "bin_end": -0.4, "bin_start": -0.5, "count": 0 }, { "bin_end": -0.3, "bin_start": -0.4, "count": 0 }, { "bin_end": -0.19999999999999996, "bin_start": -0.3, "count": 0 }, { "bin_end": -0.09999999999999998, "bin_start": -0.19999999999999996, "count": 0 }, { "bin_end": 0, "bin_start": -0.09999999999999998, "count": 0 }, { "bin_end": 0.10000000000000009, "bin_start": 0, "count": 1 }, { "bin_end": 0.20000000000000007, "bin_start": 0.10000000000000009, "count": 0 }, { "bin_end": 0.30000000000000004, "bin_start": 0.20000000000000007, "count": 0 }, { "bin_end": 0.4, "bin_start": 0.30000000000000004, "count": 0 }, { "bin_end": 0.5, "bin_start": 0.4, "count": 0 } ], "max": "0.0", "min": "0.0", "nan_count": 0, "unique_count": 1 } }, { "dtype": "float64", "name": "origin_usa", "stats": { "histogram": [ { "bin_end": 0.6, "bin_start": 0.5, "count": 0 }, { "bin_end": 0.7, "bin_start": 0.6, "count": 0 }, { "bin_end": 0.8, "bin_start": 0.7, "count": 0 }, { "bin_end": 0.9, "bin_start": 0.8, "count": 0 }, { "bin_end": 1, "bin_start": 0.9, "count": 0 }, { "bin_end": 1.1, "bin_start": 1, "count": 1 }, { "bin_end": 1.2000000000000002, "bin_start": 1.1, "count": 0 }, { "bin_end": 1.3, "bin_start": 1.2000000000000002, "count": 0 }, { "bin_end": 1.4, "bin_start": 1.3, "count": 0 }, { "bin_end": 1.5, "bin_start": 
1.4, "count": 0 } ], "max": "1.0", "min": "1.0", "nan_count": 0, "unique_count": 1 } }, { "dtype": "int64", "name": "_deepnote_index_column" } ], "row_count": 1, "rows": [ { "_deepnote_index_column": "153", "cylinders": "6", "horsepower": "105.0", "model_year": "75", "origin_europe": "0.0", "origin_japan": "0.0", "origin_usa": "1.0", "weight": "3459" } ] }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
horsepower weight model_year cylinders origin_europe origin_japan origin_usa
153 105.0 3459 75 6 0.0 0.0 1.0
\n", "
" ], "text/plain": [ " horsepower weight model_year cylinders origin_europe origin_japan \\\n", "153 105.0 3459 75 6 0.0 0.0 \n", "\n", " origin_usa \n", "153 1.0 " ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2.loc[[153], cols+new_cols]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use `reg2.predict`. The resulting value isn't exactly the same as above (`19.19` instead of `19.27`), but I think this distinction is just due to the rounding I did when typing out the coefficients." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "cell_id": "2551157f30034912a88a033afeb88ea4", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668018318160, "source_hash": "d274ad1", "tags": [] }, "outputs": [ { "data": { "text/plain": [ "array([19.18976159])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg2.predict(df2.loc[[153], cols+new_cols])" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "277beb319bdd4352a862c2c835a59a2b", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Polynomial regression\n", "\n", "In linear regression, we find the linear function that best fits the data (\"best\" meaning it minimizes the Mean Squared Error, discussed more in this week's videos).\n", "\n", "In polynomial regression, we fix a degree `d`, and then find the polynomial function that best fits the data. We use the same `LinearRegression` class from scikit-learn, and if we use `d=1`, we will get the same results as linear regression.\n", "\n", "* Using polynomial regression with degree `d=3`, model \"mpg\" as a degree 3 polynomial of \"horsepower\". Plot the corresponding predicted values." 
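] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is a minimal sketch of how `PolynomialFeatures` from `sklearn.preprocessing` could build the power columns for a degree 3 fit. It is not the code used in class: the arrays `x` and `y` are synthetic stand-ins for \"horsepower\" and \"mpg\", and the names `poly`, `X_poly`, `X_manual` and `reg_poly` are made up for this illustration. The last lines check that the generated columns agree with powers computed by hand." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# Synthetic stand-in data (an assumption, not the real mpg dataset)\n",
"rng = np.random.default_rng(0)\n",
"x = rng.uniform(50, 200, size=(100, 1))\n",
"y = 60 - 0.5*x[:, 0] + 0.0015*x[:, 0]**2 + rng.normal(0, 1, size=100)\n",
"\n",
"# include_bias=False omits the constant column; LinearRegression fits the intercept itself\n",
"poly = PolynomialFeatures(degree=3, include_bias=False)\n",
"X_poly = poly.fit_transform(x)  # columns: x, x**2, x**3\n",
"\n",
"reg_poly = LinearRegression()\n",
"reg_poly.fit(X_poly, y)\n",
"\n",
"# The same columns computed by hand, as in the basic approach\n",
"X_manual = np.hstack([x, x**2, x**3])\n",
"np.allclose(X_poly, X_manual)"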
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I had intended to use `PolynomialFeatures` from `sklearn.preprocessing`, but because we were low on time, I used a more basic approach." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "cell_id": "f9f973f180d44f4a8640a81de815b9fe", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 92, "execution_start": 1668019402553, "source_hash": "6c23d5c5", "tags": [] }, "outputs": [ { "data": { "application/vnd.deepnote.dataframe.v3+json": { "column_count": 9, "columns": [ { "dtype": "float64", "name": "mpg", "stats": { "histogram": [ { "bin_end": 15.3, "bin_start": 15, "count": 1 }, { "bin_end": 15.6, "bin_start": 15.3, "count": 0 }, { "bin_end": 15.9, "bin_start": 15.6, "count": 0 }, { "bin_end": 16.2, "bin_start": 15.9, "count": 0 }, { "bin_end": 16.5, "bin_start": 16.2, "count": 0 }, { "bin_end": 16.8, "bin_start": 16.5, "count": 0 }, { "bin_end": 17.1, "bin_start": 16.8, "count": 0 }, { "bin_end": 17.4, "bin_start": 17.1, "count": 0 }, { "bin_end": 17.7, "bin_start": 17.4, "count": 0 }, { "bin_end": 18, "bin_start": 17.7, "count": 2 } ], "max": "18.0", "min": "15.0", "nan_count": 0, "unique_count": 2 } }, { "dtype": "int64", "name": "cylinders", "stats": { "histogram": [ { "bin_end": 7.6, "bin_start": 7.5, "count": 0 }, { "bin_end": 7.7, "bin_start": 7.6, "count": 0 }, { "bin_end": 7.8, "bin_start": 7.7, "count": 0 }, { "bin_end": 7.9, "bin_start": 7.8, "count": 0 }, { "bin_end": 8, "bin_start": 7.9, "count": 0 }, { "bin_end": 8.1, "bin_start": 8, "count": 3 }, { "bin_end": 8.2, "bin_start": 8.1, "count": 0 }, { "bin_end": 8.3, "bin_start": 8.2, "count": 0 }, { "bin_end": 8.4, "bin_start": 8.3, "count": 0 }, { "bin_end": 8.5, "bin_start": 8.4, "count": 0 } ], "max": "8", "min": "8", "nan_count": 0, "unique_count": 1 } }, { "dtype": "float64", "name": "displacement", "stats": { "histogram": [ { "bin_end": 311.3, "bin_start": 307, "count": 1 }, { "bin_end": 
315.6, "bin_start": 311.3, "count": 0 }, { "bin_end": 319.9, "bin_start": 315.6, "count": 1 }, { "bin_end": 324.2, "bin_start": 319.9, "count": 0 }, { "bin_end": 328.5, "bin_start": 324.2, "count": 0 }, { "bin_end": 332.8, "bin_start": 328.5, "count": 0 }, { "bin_end": 337.1, "bin_start": 332.8, "count": 0 }, { "bin_end": 341.4, "bin_start": 337.1, "count": 0 }, { "bin_end": 345.7, "bin_start": 341.4, "count": 0 }, { "bin_end": 350, "bin_start": 345.7, "count": 1 } ], "max": "350.0", "min": "307.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "horsepower", "stats": { "histogram": [ { "bin_end": 133.5, "bin_start": 130, "count": 1 }, { "bin_end": 137, "bin_start": 133.5, "count": 0 }, { "bin_end": 140.5, "bin_start": 137, "count": 0 }, { "bin_end": 144, "bin_start": 140.5, "count": 0 }, { "bin_end": 147.5, "bin_start": 144, "count": 0 }, { "bin_end": 151, "bin_start": 147.5, "count": 1 }, { "bin_end": 154.5, "bin_start": 151, "count": 0 }, { "bin_end": 158, "bin_start": 154.5, "count": 0 }, { "bin_end": 161.5, "bin_start": 158, "count": 0 }, { "bin_end": 165, "bin_start": 161.5, "count": 1 } ], "max": "165.0", "min": "130.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "weight", "stats": { "histogram": [ { "bin_end": 3461.7, "bin_start": 3436, "count": 1 }, { "bin_end": 3487.4, "bin_start": 3461.7, "count": 0 }, { "bin_end": 3513.1, "bin_start": 3487.4, "count": 1 }, { "bin_end": 3538.8, "bin_start": 3513.1, "count": 0 }, { "bin_end": 3564.5, "bin_start": 3538.8, "count": 0 }, { "bin_end": 3590.2, "bin_start": 3564.5, "count": 0 }, { "bin_end": 3615.9, "bin_start": 3590.2, "count": 0 }, { "bin_end": 3641.6, "bin_start": 3615.9, "count": 0 }, { "bin_end": 3667.3, "bin_start": 3641.6, "count": 0 }, { "bin_end": 3693, "bin_start": 3667.3, "count": 1 } ], "max": "3693", "min": "3436", "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "acceleration", "stats": { "histogram": [ { "bin_end": 11.1, 
"bin_start": 11, "count": 1 }, { "bin_end": 11.2, "bin_start": 11.1, "count": 0 }, { "bin_end": 11.3, "bin_start": 11.2, "count": 0 }, { "bin_end": 11.4, "bin_start": 11.3, "count": 0 }, { "bin_end": 11.5, "bin_start": 11.4, "count": 0 }, { "bin_end": 11.6, "bin_start": 11.5, "count": 1 }, { "bin_end": 11.7, "bin_start": 11.6, "count": 0 }, { "bin_end": 11.8, "bin_start": 11.7, "count": 0 }, { "bin_end": 11.9, "bin_start": 11.8, "count": 0 }, { "bin_end": 12, "bin_start": 11.9, "count": 1 } ], "max": "12.0", "min": "11.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "model_year", "stats": { "histogram": [ { "bin_end": 69.6, "bin_start": 69.5, "count": 0 }, { "bin_end": 69.7, "bin_start": 69.6, "count": 0 }, { "bin_end": 69.8, "bin_start": 69.7, "count": 0 }, { "bin_end": 69.9, "bin_start": 69.8, "count": 0 }, { "bin_end": 70, "bin_start": 69.9, "count": 0 }, { "bin_end": 70.1, "bin_start": 70, "count": 3 }, { "bin_end": 70.2, "bin_start": 70.1, "count": 0 }, { "bin_end": 70.3, "bin_start": 70.2, "count": 0 }, { "bin_end": 70.4, "bin_start": 70.3, "count": 0 }, { "bin_end": 70.5, "bin_start": 70.4, "count": 0 } ], "max": "70", "min": "70", "nan_count": 0, "unique_count": 1 } }, { "dtype": "object", "name": "origin", "stats": { "categories": [ { "count": 3, "name": "usa" } ], "nan_count": 0, "unique_count": 1 } }, { "dtype": "object", "name": "name", "stats": { "categories": [ { "count": 1, "name": "chevrolet chevelle malibu" }, { "count": 1, "name": "buick skylark 320" }, { "count": 1, "name": "plymouth satellite" } ], "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "_deepnote_index_column" } ], "row_count": 3, "rows": [ { "_deepnote_index_column": "0", "acceleration": "12.0", "cylinders": "8", "displacement": "307.0", "horsepower": "130.0", "model_year": "70", "mpg": "18.0", "name": "chevrolet chevelle malibu", "origin": "usa", "weight": "3504" }, { "_deepnote_index_column": "1", "acceleration": "11.5", "cylinders": "8", 
"displacement": "350.0", "horsepower": "165.0", "model_year": "70", "mpg": "15.0", "name": "buick skylark 320", "origin": "usa", "weight": "3693" }, { "_deepnote_index_column": "2", "acceleration": "11.0", "cylinders": "8", "displacement": "318.0", "horsepower": "150.0", "model_year": "70", "mpg": "18.0", "name": "plymouth satellite", "origin": "usa", "weight": "3436" } ] }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
\n", "
" ], "text/plain": [ " mpg cylinders displacement horsepower weight acceleration \\\n", "0 18.0 8 307.0 130.0 3504 12.0 \n", "1 15.0 8 350.0 165.0 3693 11.5 \n", "2 18.0 8 318.0 150.0 3436 11.0 \n", "\n", " model_year origin name \n", "0 70 usa chevrolet chevelle malibu \n", "1 70 usa buick skylark 320 \n", "2 70 usa plymouth satellite " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# There is a fancier approach using PolynomialFeatures\n", "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The main trick is to add columns to the DataFrame corresponding to powers of the \"horsepower\" column. Then we can perform linear regression. Phrased another way, if $x_2 = x^2$, then finding a coefficient of $x_2$ is the same as finding a coefficient of $x^2$.\n", "\n", "With a higher degree, you should use a for loop. I just typed these out by hand to keep things moving faster." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "cell_id": "886dfbdca2a741d8a39b28ea211f2f86", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668019461037, "source_hash": "84c0adc0", "tags": [] }, "outputs": [], "source": [ "df[\"h2\"] = df[\"horsepower\"]**2" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "cell_id": "adf542ff439f41d1985ab783758d18c1", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668019481445, "source_hash": "40d75bf8", "tags": [] }, "outputs": [], "source": [ "df[\"h3\"] = df[\"horsepower\"]**3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we have two new columns on the right. For example, the \"h2\" value of `16900` in the top row corresponds to `130**2`." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": { "cell_id": "3d9a7d56af2841a68fca916c9996dcda", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 37, "execution_start": 1668019484365, "source_hash": "3a1ea484", "tags": [] }, "outputs": [ { "data": { "application/vnd.deepnote.dataframe.v3+json": { "column_count": 11, "columns": [ { "dtype": "float64", "name": "mpg", "stats": { "histogram": [ { "bin_end": 15.3, "bin_start": 15, "count": 1 }, { "bin_end": 15.6, "bin_start": 15.3, "count": 0 }, { "bin_end": 15.9, "bin_start": 15.6, "count": 0 }, { "bin_end": 16.2, "bin_start": 15.9, "count": 0 }, { "bin_end": 16.5, "bin_start": 16.2, "count": 0 }, { "bin_end": 16.8, "bin_start": 16.5, "count": 0 }, { "bin_end": 17.1, "bin_start": 16.8, "count": 0 }, { "bin_end": 17.4, "bin_start": 17.1, "count": 0 }, { "bin_end": 17.7, "bin_start": 17.4, "count": 0 }, { "bin_end": 18, "bin_start": 17.7, "count": 2 } ], "max": "18.0", "min": "15.0", "nan_count": 0, "unique_count": 2 } }, { "dtype": "int64", "name": "cylinders", "stats": { "histogram": [ { "bin_end": 7.6, "bin_start": 7.5, "count": 0 }, { "bin_end": 7.7, "bin_start": 7.6, "count": 0 }, { "bin_end": 7.8, "bin_start": 7.7, "count": 0 }, { "bin_end": 7.9, "bin_start": 7.8, "count": 0 }, { "bin_end": 8, "bin_start": 7.9, "count": 0 }, { "bin_end": 8.1, "bin_start": 8, "count": 3 }, { "bin_end": 8.2, "bin_start": 8.1, "count": 0 }, { "bin_end": 8.3, "bin_start": 8.2, "count": 0 }, { "bin_end": 8.4, "bin_start": 8.3, "count": 0 }, { "bin_end": 8.5, "bin_start": 8.4, "count": 0 } ], "max": "8", "min": "8", "nan_count": 0, "unique_count": 1 } }, { "dtype": "float64", "name": "displacement", "stats": { "histogram": [ { "bin_end": 311.3, "bin_start": 307, "count": 1 }, { "bin_end": 315.6, "bin_start": 311.3, "count": 0 }, { "bin_end": 319.9, "bin_start": 315.6, "count": 1 }, { "bin_end": 324.2, "bin_start": 319.9, "count": 0 }, { "bin_end": 328.5, "bin_start": 324.2, "count": 0 
}, { "bin_end": 332.8, "bin_start": 328.5, "count": 0 }, { "bin_end": 337.1, "bin_start": 332.8, "count": 0 }, { "bin_end": 341.4, "bin_start": 337.1, "count": 0 }, { "bin_end": 345.7, "bin_start": 341.4, "count": 0 }, { "bin_end": 350, "bin_start": 345.7, "count": 1 } ], "max": "350.0", "min": "307.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "horsepower", "stats": { "histogram": [ { "bin_end": 133.5, "bin_start": 130, "count": 1 }, { "bin_end": 137, "bin_start": 133.5, "count": 0 }, { "bin_end": 140.5, "bin_start": 137, "count": 0 }, { "bin_end": 144, "bin_start": 140.5, "count": 0 }, { "bin_end": 147.5, "bin_start": 144, "count": 0 }, { "bin_end": 151, "bin_start": 147.5, "count": 1 }, { "bin_end": 154.5, "bin_start": 151, "count": 0 }, { "bin_end": 158, "bin_start": 154.5, "count": 0 }, { "bin_end": 161.5, "bin_start": 158, "count": 0 }, { "bin_end": 165, "bin_start": 161.5, "count": 1 } ], "max": "165.0", "min": "130.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "weight", "stats": { "histogram": [ { "bin_end": 3461.7, "bin_start": 3436, "count": 1 }, { "bin_end": 3487.4, "bin_start": 3461.7, "count": 0 }, { "bin_end": 3513.1, "bin_start": 3487.4, "count": 1 }, { "bin_end": 3538.8, "bin_start": 3513.1, "count": 0 }, { "bin_end": 3564.5, "bin_start": 3538.8, "count": 0 }, { "bin_end": 3590.2, "bin_start": 3564.5, "count": 0 }, { "bin_end": 3615.9, "bin_start": 3590.2, "count": 0 }, { "bin_end": 3641.6, "bin_start": 3615.9, "count": 0 }, { "bin_end": 3667.3, "bin_start": 3641.6, "count": 0 }, { "bin_end": 3693, "bin_start": 3667.3, "count": 1 } ], "max": "3693", "min": "3436", "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "acceleration", "stats": { "histogram": [ { "bin_end": 11.1, "bin_start": 11, "count": 1 }, { "bin_end": 11.2, "bin_start": 11.1, "count": 0 }, { "bin_end": 11.3, "bin_start": 11.2, "count": 0 }, { "bin_end": 11.4, "bin_start": 11.3, "count": 0 }, { "bin_end": 11.5, 
"bin_start": 11.4, "count": 0 }, { "bin_end": 11.6, "bin_start": 11.5, "count": 1 }, { "bin_end": 11.7, "bin_start": 11.6, "count": 0 }, { "bin_end": 11.8, "bin_start": 11.7, "count": 0 }, { "bin_end": 11.9, "bin_start": 11.8, "count": 0 }, { "bin_end": 12, "bin_start": 11.9, "count": 1 } ], "max": "12.0", "min": "11.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "model_year", "stats": { "histogram": [ { "bin_end": 69.6, "bin_start": 69.5, "count": 0 }, { "bin_end": 69.7, "bin_start": 69.6, "count": 0 }, { "bin_end": 69.8, "bin_start": 69.7, "count": 0 }, { "bin_end": 69.9, "bin_start": 69.8, "count": 0 }, { "bin_end": 70, "bin_start": 69.9, "count": 0 }, { "bin_end": 70.1, "bin_start": 70, "count": 3 }, { "bin_end": 70.2, "bin_start": 70.1, "count": 0 }, { "bin_end": 70.3, "bin_start": 70.2, "count": 0 }, { "bin_end": 70.4, "bin_start": 70.3, "count": 0 }, { "bin_end": 70.5, "bin_start": 70.4, "count": 0 } ], "max": "70", "min": "70", "nan_count": 0, "unique_count": 1 } }, { "dtype": "object", "name": "origin", "stats": { "categories": [ { "count": 3, "name": "usa" } ], "nan_count": 0, "unique_count": 1 } }, { "dtype": "object", "name": "name", "stats": { "categories": [ { "count": 1, "name": "chevrolet chevelle malibu" }, { "count": 1, "name": "buick skylark 320" }, { "count": 1, "name": "plymouth satellite" } ], "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "h2", "stats": { "histogram": [ { "bin_end": 17932.5, "bin_start": 16900, "count": 1 }, { "bin_end": 18965, "bin_start": 17932.5, "count": 0 }, { "bin_end": 19997.5, "bin_start": 18965, "count": 0 }, { "bin_end": 21030, "bin_start": 19997.5, "count": 0 }, { "bin_end": 22062.5, "bin_start": 21030, "count": 0 }, { "bin_end": 23095, "bin_start": 22062.5, "count": 1 }, { "bin_end": 24127.5, "bin_start": 23095, "count": 0 }, { "bin_end": 25160, "bin_start": 24127.5, "count": 0 }, { "bin_end": 26192.5, "bin_start": 25160, "count": 0 }, { "bin_end": 27225, "bin_start": 
26192.5, "count": 1 } ], "max": "27225.0", "min": "16900.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "float64", "name": "h3", "stats": { "histogram": [ { "bin_end": 2426512.5, "bin_start": 2197000, "count": 1 }, { "bin_end": 2656025, "bin_start": 2426512.5, "count": 0 }, { "bin_end": 2885537.5, "bin_start": 2656025, "count": 0 }, { "bin_end": 3115050, "bin_start": 2885537.5, "count": 0 }, { "bin_end": 3344562.5, "bin_start": 3115050, "count": 0 }, { "bin_end": 3574075, "bin_start": 3344562.5, "count": 1 }, { "bin_end": 3803587.5, "bin_start": 3574075, "count": 0 }, { "bin_end": 4033100, "bin_start": 3803587.5, "count": 0 }, { "bin_end": 4262612.5, "bin_start": 4033100, "count": 0 }, { "bin_end": 4492125, "bin_start": 4262612.5, "count": 1 } ], "max": "4492125.0", "min": "2197000.0", "nan_count": 0, "unique_count": 3 } }, { "dtype": "int64", "name": "_deepnote_index_column" } ], "row_count": 3, "rows": [ { "_deepnote_index_column": "0", "acceleration": "12.0", "cylinders": "8", "displacement": "307.0", "h2": "16900.0", "h3": "2197000.0", "horsepower": "130.0", "model_year": "70", "mpg": "18.0", "name": "chevrolet chevelle malibu", "origin": "usa", "weight": "3504" }, { "_deepnote_index_column": "1", "acceleration": "11.5", "cylinders": "8", "displacement": "350.0", "h2": "27225.0", "h3": "4492125.0", "horsepower": "165.0", "model_year": "70", "mpg": "15.0", "name": "buick skylark 320", "origin": "usa", "weight": "3693" }, { "_deepnote_index_column": "2", "acceleration": "11.0", "cylinders": "8", "displacement": "318.0", "h2": "22500.0", "h3": "3375000.0", "horsepower": "150.0", "model_year": "70", "mpg": "18.0", "name": "plymouth satellite", "origin": "usa", "weight": "3436" } ] }, "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mpg cylinders displacement horsepower weight acceleration model_year origin name h2 h3
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu 16900.0 2197000.0
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320 27225.0 4492125.0
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite 22500.0 3375000.0
\n", "
" ], "text/plain": [ " mpg cylinders displacement horsepower weight acceleration \\\n", "0 18.0 8 307.0 130.0 3504 12.0 \n", "1 15.0 8 350.0 165.0 3693 11.5 \n", "2 18.0 8 318.0 150.0 3436 11.0 \n", "\n", " model_year origin name h2 h3 \n", "0 70 usa chevrolet chevelle malibu 16900.0 2197000.0 \n", "1 70 usa buick skylark 320 27225.0 4492125.0 \n", "2 70 usa plymouth satellite 22500.0 3375000.0 " ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use Linear Regression with these columns (equivalently, we are using polynomial regression of degree 3 with the \"horsepower\" column).\n", "\n", "It would be more robust to call these instead `[\"h1\", \"h2\", \"h3\"]` and to make the list using list comprehension. That is what you're asked to do on Worksheet 12." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "cell_id": "95975a21c5434017a99132c7179e35c6", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 0, "execution_start": 1668019554259, "source_hash": "4f5df94f", "tags": [] }, "outputs": [], "source": [ "poly_cols = [\"horsepower\", \"h2\", \"h3\"]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "cell_id": "b4bc59b5f81b4763bbedd7bd490ae6d7", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 2, "execution_start": 1668019567931, "source_hash": "dcd14da", "tags": [] }, "outputs": [], "source": [ "reg3 = LinearRegression()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "cell_id": "3abfa9b7fa3041ee921ed4c30a55ce01", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 541, "execution_start": 1668019620340, "source_hash": "8d705eb1", "tags": [] }, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "reg3.fit(df[poly_cols], df[\"mpg\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We add the predicted values to the DataFrame." ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "cell_id": "0adeb43e9b0e47388c253930b87ba5fa", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 3, "execution_start": 1668019642626, "source_hash": "9e87d4c3", "tags": [] }, "outputs": [], "source": [ "df[\"poly_pred\"] = reg3.predict(df[poly_cols])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we plot the predicted values. Take a moment to be impressed that linear regression produced this curve, which is clearly not a straight line. Worksheet 12 will emphasize that as you use higher degrees for polynomial regression, the curves will eventually start to \"overfit\" the data. This notion of overfitting is arguably the most important topic in Machine Learning." ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "cell_id": "a84f891106284603ae286cf72b30bca7", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 733, "execution_start": 1668019696742, "source_hash": "1e2f898c", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = alt.Chart(df).mark_circle().encode(\n", " x=\"horsepower\",\n", " y=\"mpg\",\n", " color=\"origin\"\n", ")\n", "\n", "c1 = alt.Chart(df).mark_line(color=\"black\", size=3).encode(\n", " x=\"horsepower\",\n", " y=\"poly_pred\"\n", ")\n", "\n", "c+c1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Not related to polynomial regression directly, but I wanted to keep the following \"starter\" code present in the notebook. It shows an example of specifying a direct numerical value for `y`. The following could be considered the best \"degree zero\" polynomial for this data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "cell_id": "39db72b4ab9b4dc285298406125c3090", "deepnote_cell_type": "code", "deepnote_to_be_reexecuted": false, "execution_millis": 66, "execution_start": 1667999302865, "source_hash": "cdc639a5", "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "" ], "text/plain": [ "alt.LayerChart(...)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = alt.Chart(df).mark_circle().encode(\n", " x=\"horsepower\",\n", " y=\"mpg\",\n", " color=\"origin\"\n", ")\n", "\n", "c1 = alt.Chart(df).mark_line(color=\"black\", size=3).encode(\n", " x=\"horsepower\",\n", " y=alt.datum(df[\"mpg\"].mean())\n", ")\n", "\n", "c+c1" ] }, { "cell_type": "markdown", "metadata": { "cell_id": "7cb73e74cca54cb48af44f918d3984e8", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "## Warning: Don't misuse polynomial regression\n", "\n", "For some reason, unreasonable cubic models often get shared in the media. The cubic polynomial that \"fits best\" can be interesting to look at, but don't expect it to provide accurate predictions in the future. (This is a symptom of *overfitting*, which is a key concept in Machine Learning that we will return to soon.)\n", "\n", "In the following, from May 2020, if we trusted the linear fit, we would expect Covid deaths to grow without bound at a constant rate forever. If we trusted the cubic fit, we would expect Covid deaths to hit zero (and in fact to become negative) by mid-May." ] }, { "cell_type": "markdown", "metadata": { "cell_id": "b5cbaec191414f998ba4e8467bf93654", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "![Cubic fit to Covid data](../images/Cubic2.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following is another example of a cubic fit. To my eyes, when I look at this data, I do not see any resemblance to the cubic polynomial shown. That cubic polynomial is probably the \"best\" cubic polynomial for this data, but I do not think it is a reasonable model for this data (meaning I don't think any cubic polynomial would model this data well)." 
] }, { "cell_type": "markdown", "metadata": { "cell_id": "71b5d34d95e3487d8c158b4d67006653", "deepnote_cell_type": "markdown", "tags": [] }, "source": [ "![Cubic fit to supreme court confirmation votes](../images/Cubic.png)" ] } ], "metadata": { "deepnote": {}, "deepnote_execution_queue": [], "deepnote_notebook_id": "6dc4b862d39142c79c7e805344b1e21a", "deepnote_persisted_session": { "createdAt": "2022-11-09T05:30:34.086Z" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13" } }, "nbformat": 4, "nbformat_minor": 4 }