{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"cell_id": "e4e4c69a-c97b-49fb-8c83-2a5b1d3471da",
"deepnote_cell_height": 82,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"# Polynomial Regression"
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "484cc5d4acec411686909bdd4e32fedb",
"deepnote_cell_height": 419.796875,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Announcements\n",
"\n",
"The quiz tomorrow will have 3 questions. The topics will be:\n",
"* Performing K-Means clustering using scikit-learn. See the notebooks from Week 5.\n",
"* Performing K-Means clustering by hand (assign to clusters, find centroids, repeat). We discussed this Wednesday of Week 5 (much of the discussion was at the whiteboard). (Here is a [Wikipedia description](https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm_(naive_k-means)) and an animation on [YouTube](https://youtu.be/5I3Ei69I40s).)\n",
"* Using a pandas DataFrame's `apply` method using a lambda function with `axis=0` (apply to one column at a time) or with `axis=1` (apply to one row at a time). (We first introduced `apply` in [this notebook](https://christopherdavisuci.github.io/UCI-Math-10-S22/Week4/Week4-Friday.html#rescaling-2-using-apply), and it also was used in both worksheets last week.)\n",
"\n",
"There isn't time to go over this material today, but Yasmeen will review the material on Tuesday. Yasmeen and I also have Zoom office hours the next two days:\n",
"* Yasmeen: 3:00-5:00pm Monday\n",
"* Chris: 11:00am-12:30pm Tuesday\n",
"\n",
"See Canvas for the Zoom links to the office hours."
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "d00034ccbbba45e3801f0e1c725c75b8",
"deepnote_cell_height": 108.390625,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Linear Regression with the cars dataset\n",
"\n",
"Let's try to express \"Horsepower\" in terms of \"MPG\" (miles per gallon). You should have the intuition that these two columns are negatively correlated."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"cell_id": "90b5eedfa9f141a1bd52062da5e353cb",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 241,
"execution_start": 1651521393043,
"source_hash": "46e6bd77",
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import altair as alt"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"cell_id": "9faa3381415145449d0f49005d638512",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1651521713219,
"source_hash": "8bad964e",
"tags": []
},
"outputs": [],
"source": [
"df = pd.read_csv(\"../data/cars.csv\").dropna()\n",
"df.rename({\"Miles_per_Gallon\":\"MPG\"}, axis=1, inplace=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The correlation matrix below confirms this: the correlation between \"MPG\" and \"Horsepower\" is about -0.78."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"cell_id": "ef8564ffd03d4d038ab48410e0ab75ef",
"deepnote_cell_height": 436,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 39,
"execution_start": 1651521782211,
"source_hash": "f8453a83",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" MPG Cylinders Displacement Horsepower Weight_in_lbs \\\n",
"MPG 1.000000 -0.777618 -0.805127 -0.778427 -0.832244 \n",
"Cylinders -0.777618 1.000000 0.950823 0.842983 0.897527 \n",
"Displacement -0.805127 0.950823 1.000000 0.897257 0.932994 \n",
"Horsepower -0.778427 0.842983 0.897257 1.000000 0.864538 \n",
"Weight_in_lbs -0.832244 0.897527 0.932994 0.864538 1.000000 \n",
"Acceleration 0.423329 -0.504683 -0.543800 -0.689196 -0.416839 \n",
"\n",
" Acceleration \n",
"MPG 0.423329 \n",
"Cylinders -0.504683 \n",
"Displacement -0.543800 \n",
"Horsepower -0.689196 \n",
"Weight_in_lbs -0.416839 \n",
"Acceleration 1.000000 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scatter plot below shows the same relationship: as \"MPG\" increases, \"Horsepower\" tends to decrease."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"cell_id": "7b164998f231434fa587ad62b4abac73",
"deepnote_cell_height": 512,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
361
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 98,
"execution_start": 1651521904194,
"source_hash": "53d73df9",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"alt.Chart(...)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"alt.Chart(df).mark_circle().encode(\n",
" x=\"MPG\",\n",
" y=\"Horsepower\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's investigate the same thing using scikit-learn. We follow the usual pattern:\n",
"* import\n",
"* create/instantiate\n",
"* fit\n",
"* predict/transform"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"cell_id": "df393182a8f24536801765c0778275f6",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1651521986889,
"source_hash": "9527aab5",
"tags": []
},
"outputs": [],
"source": [
"# import\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"cell_id": "dbb72e37f4dd4164a692e59136794f21",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1651522044813,
"source_hash": "7f3894c8",
"tags": []
},
"outputs": [],
"source": [
"# create/instantiate\n",
"reg = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following call raises an error, and it is probably the most common mistake when using scikit-learn. The first argument to `fit` must be two-dimensional. Here that means it must be a DataFrame (`df[[\"MPG\"]]`), not a Series (`df[\"MPG\"]`)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"cell_id": "c64ad619c07740e5b28f28dc39949bce",
"deepnote_cell_height": 720.1875,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 569,
"execution_start": 1651522087773,
"source_hash": "d0df0f93",
"tags": [
"output_scroll"
]
},
"outputs": [
{
"ename": "ValueError",
"evalue": "Expected 2D array, got 1D array instead:\narray=[18. 15. 18. 16. 17. 15. 14. 14. 14. 15. 15. 14. 15. 14.\n 24. 22. 18. 21. 27. 26. 25. 24. 25. 26. 21. 10. 10. 11.\n 9. 27. 28. 25. 19. 16. 17. 19. 18. 14. 14. 14. 14. 12.\n 13. 13. 18. 22. 19. 18. 23. 28. 30. 30. 31. 35. 27. 26.\n 24. 25. 23. 20. 21. 13. 14. 15. 14. 17. 11. 13. 12. 13.\n 19. 15. 13. 13. 14. 18. 22. 21. 26. 22. 28. 23. 28. 27.\n 13. 14. 13. 14. 15. 12. 13. 13. 14. 13. 12. 13. 18. 16.\n 18. 18. 23. 26. 11. 12. 13. 12. 18. 20. 21. 22. 18. 19.\n 21. 26. 15. 16. 29. 24. 20. 19. 15. 24. 20. 11. 20. 19.\n 15. 31. 26. 32. 25. 16. 16. 18. 16. 13. 14. 14. 14. 29.\n 26. 26. 31. 32. 28. 24. 26. 24. 26. 31. 19. 18. 15. 15.\n 16. 15. 16. 14. 17. 16. 15. 18. 21. 20. 13. 29. 23. 20.\n 23. 24. 25. 24. 18. 29. 19. 23. 23. 22. 25. 33. 28. 25.\n 25. 26. 27. 17.5 16. 15.5 14.5 22. 22. 24. 22.5 29. 24.5 29.\n 33. 20. 18. 18.5 17.5 29.5 32. 28. 26.5 20. 13. 19. 19. 16.5\n 16.5 13. 13. 13. 31.5 30. 36. 25.5 33.5 17.5 17. 15.5 15. 17.5\n 20.5 19. 18.5 16. 15.5 15.5 16. 29. 24.5 26. 25.5 30.5 33.5 30.\n 30.5 22. 21.5 21.5 43.1 36.1 32.8 39.4 36.1 19.9 19.4 20.2 19.2 20.5\n 20.2 25.1 20.5 19.4 20.6 20.8 18.6 18.1 19.2 17.7 18.1 17.5 30. 27.5\n 27.2 30.9 21.1 23.2 23.8 23.9 20.3 17. 21.6 16.2 31.5 29.5 21.5 19.8\n 22.3 20.2 20.6 17. 17.6 16.5 18.2 16.9 15.5 19.2 18.5 31.9 34.1 35.7\n 27.4 25.4 23. 27.2 23.9 34.2 34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5\n 38.1 32.1 37.2 28. 26.4 24.3 19.1 34.3 29.8 31.3 37. 32.2 46.6 27.9\n 40.8 44.3 43.4 36.4 30. 44.6 33.8 29.8 32.7 23.7 35. 32.4 27.2 26.6\n 25.8 23.5 30. 39.1 39. 35.1 32.3 37. 37.7 34.1 34.7 34.4 29.9 33.\n 33.7 32.4 32.9 31.6 28.1 30.7 25.4 24.2 22.4 26.6 20.2 17.6 28. 27.\n 34. 31. 29. 27. 24. 36. 37. 31. 38. 36. 36. 36. 34. 38.\n 32. 38. 25. 38. 26. 22. 32. 36. 27. 27. 44. 32. 28. 31. ].\nReshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mValueError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_97000/2122096284.py\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mreg\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"MPG\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"Horsepower\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py\u001b[0m in \u001b[0;36mfit\u001b[0;34m(self, X, y, sample_weight)\u001b[0m\n\u001b[1;32m 661\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 662\u001b[0m X, y = self._validate_data(\n\u001b[0;32m--> 663\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maccept_sparse\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maccept_sparse\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my_numeric\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmulti_output\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mTrue\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 664\u001b[0m )\n\u001b[1;32m 665\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py\u001b[0m in \u001b[0;36m_validate_data\u001b[0;34m(self, X, y, reset, validate_separately, **check_params)\u001b[0m\n\u001b[1;32m 579\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_array\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_y_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 580\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 581\u001b[0;31m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcheck_X_y\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mcheck_params\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 582\u001b[0m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mX\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0my\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 583\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_X_y\u001b[0;34m(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)\u001b[0m\n\u001b[1;32m 974\u001b[0m \u001b[0mensure_min_samples\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mensure_min_samples\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 975\u001b[0m \u001b[0mensure_min_features\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mensure_min_features\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 976\u001b[0;31m \u001b[0mestimator\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mestimator\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 977\u001b[0m )\n\u001b[1;32m 978\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py\u001b[0m in \u001b[0;36mcheck_array\u001b[0;34m(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)\u001b[0m\n\u001b[1;32m 771\u001b[0m \u001b[0;34m\"Reshape your data either using array.reshape(-1, 1) if \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 772\u001b[0m \u001b[0;34m\"your data has a single feature or array.reshape(1, -1) \"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 773\u001b[0;31m \u001b[0;34m\"if it contains a single sample.\"\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0marray\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 774\u001b[0m )\n\u001b[1;32m 775\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mValueError\u001b[0m: Expected 2D array, got 1D array instead:\narray=[18. 15. 18. 16. 17. 15. 14. 14. 14. 15. 15. 14. 15. 14.\n 24. 22. 18. 21. 27. 26. 25. 24. 25. 26. 21. 10. 10. 11.\n 9. 27. 28. 25. 19. 16. 17. 19. 18. 14. 14. 14. 14. 12.\n 13. 13. 18. 22. 19. 18. 23. 28. 30. 30. 31. 35. 27. 26.\n 24. 25. 23. 20. 21. 13. 14. 15. 14. 17. 11. 13. 12. 13.\n 19. 15. 13. 13. 14. 18. 22. 21. 26. 22. 28. 23. 28. 27.\n 13. 14. 13. 14. 15. 12. 13. 13. 14. 13. 12. 13. 18. 16.\n 18. 18. 23. 26. 11. 12. 13. 12. 18. 20. 21. 22. 18. 19.\n 21. 26. 15. 16. 29. 24. 20. 19. 15. 24. 20. 11. 20. 19.\n 15. 31. 26. 32. 25. 16. 16. 18. 16. 13. 14. 14. 14. 29.\n 26. 26. 31. 32. 28. 24. 26. 24. 26. 31. 19. 18. 15. 15.\n 16. 15. 16. 14. 17. 16. 15. 18. 21. 20. 13. 29. 23. 20.\n 23. 24. 25. 24. 18. 29. 19. 23. 23. 22. 25. 33. 28. 25.\n 25. 26. 27. 17.5 16. 15.5 14.5 22. 22. 24. 22.5 29. 24.5 29.\n 33. 20. 18. 18.5 17.5 29.5 32. 28. 26.5 20. 13. 19. 19. 16.5\n 16.5 13. 13. 13. 31.5 30. 36. 25.5 33.5 17.5 17. 15.5 15. 17.5\n 20.5 19. 18.5 16. 15.5 15.5 16. 29. 24.5 26. 25.5 30.5 33.5 30.\n 30.5 22. 21.5 21.5 43.1 36.1 32.8 39.4 36.1 19.9 19.4 20.2 19.2 20.5\n 20.2 25.1 20.5 19.4 20.6 20.8 18.6 18.1 19.2 17.7 18.1 17.5 30. 27.5\n 27.2 30.9 21.1 23.2 23.8 23.9 20.3 17. 21.6 16.2 31.5 29.5 21.5 19.8\n 22.3 20.2 20.6 17. 17.6 16.5 18.2 16.9 15.5 19.2 18.5 31.9 34.1 35.7\n 27.4 25.4 23. 27.2 23.9 34.2 34.5 31.8 37.3 28.4 28.8 26.8 33.5 41.5\n 38.1 32.1 37.2 28. 26.4 24.3 19.1 34.3 29.8 31.3 37. 32.2 46.6 27.9\n 40.8 44.3 43.4 36.4 30. 44.6 33.8 29.8 32.7 23.7 35. 32.4 27.2 26.6\n 25.8 23.5 30. 39.1 39. 35.1 32.3 37. 37.7 34.1 34.7 34.4 29.9 33.\n 33.7 32.4 32.9 31.6 28.1 30.7 25.4 24.2 22.4 26.6 20.2 17.6 28. 27.\n 34. 31. 29. 27. 24. 36. 37. 31. 38. 36. 36. 36. 34. 38.\n 32. 38. 25. 38. 26. 22. 32. 36. 27. 27. 44. 32. 28. 31. ].\nReshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
]
}
],
"source": [
"reg.fit(df[\"MPG\"], df[\"Horsepower\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To fix this error, index with a list of column names. Here the list contains just the single column name \"MPG\", but it still needs to be a list, so that the result is a DataFrame rather than a Series."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"cell_id": "be742da768504ea6b294ab84466c0b73",
"deepnote_cell_height": 136.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1651522144069,
"source_hash": "deb2b2f4",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# fit\n",
"reg.fit(df[[\"MPG\"]], df[\"Horsepower\"])"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"cell_id": "6c71a53a1dcc48889ea97965ac628ce8",
"deepnote_cell_height": 708,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
611
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 9,
"execution_start": 1651522234033,
"source_hash": "3131447d",
"tags": [
"output_scroll"
]
},
"outputs": [
{
"data": {
"text/plain": [
"array([125.3756586 , 136.8923227 , 125.3756586 , 133.05343467,\n",
" 129.21454664, 136.8923227 , 140.73121073, 140.73121073,\n",
" 140.73121073, 136.8923227 , 136.8923227 , 140.73121073,\n",
" 136.8923227 , 140.73121073, 102.34233041, 110.02010647,\n",
" 125.3756586 , 113.8589945 , 90.82566631, 94.66455434,\n",
" 98.50344237, 102.34233041, 98.50344237, 94.66455434,\n",
" 113.8589945 , 156.08676286, 156.08676286, 152.24787483,\n",
" 159.9256509 , 90.82566631, 86.98677828, 98.50344237,\n",
" 121.53677057, 133.05343467, 129.21454664, 121.53677057,\n",
" 125.3756586 , 140.73121073, 140.73121073, 140.73121073,\n",
" 140.73121073, 148.4089868 , 144.57009877, 144.57009877,\n",
" 125.3756586 , 110.02010647, 121.53677057, 125.3756586 ,\n",
" 106.18121844, 86.98677828, 79.30900221, 79.30900221,\n",
" 75.47011418, 60.11456205, 90.82566631, 94.66455434,\n",
" 102.34233041, 98.50344237, 106.18121844, 117.69788254,\n",
" 113.8589945 , 144.57009877, 140.73121073, 136.8923227 ,\n",
" 140.73121073, 129.21454664, 152.24787483, 144.57009877,\n",
" 148.4089868 , 144.57009877, 121.53677057, 136.8923227 ,\n",
" 144.57009877, 144.57009877, 140.73121073, 125.3756586 ,\n",
" 110.02010647, 113.8589945 , 94.66455434, 110.02010647,\n",
" 86.98677828, 106.18121844, 86.98677828, 90.82566631,\n",
" 144.57009877, 140.73121073, 144.57009877, 140.73121073,\n",
" 136.8923227 , 148.4089868 , 144.57009877, 144.57009877,\n",
" 140.73121073, 144.57009877, 148.4089868 , 144.57009877,\n",
" 125.3756586 , 133.05343467, 125.3756586 , 125.3756586 ,\n",
" 106.18121844, 94.66455434, 152.24787483, 148.4089868 ,\n",
" 144.57009877, 148.4089868 , 125.3756586 , 117.69788254,\n",
" 113.8589945 , 110.02010647, 125.3756586 , 121.53677057,\n",
" 113.8589945 , 94.66455434, 136.8923227 , 133.05343467,\n",
" 83.14789024, 102.34233041, 117.69788254, 121.53677057,\n",
" 136.8923227 , 102.34233041, 117.69788254, 152.24787483,\n",
" 117.69788254, 121.53677057, 136.8923227 , 75.47011418,\n",
" 94.66455434, 71.63122615, 98.50344237, 133.05343467,\n",
" 133.05343467, 125.3756586 , 133.05343467, 144.57009877,\n",
" 140.73121073, 140.73121073, 140.73121073, 83.14789024,\n",
" 94.66455434, 94.66455434, 75.47011418, 71.63122615,\n",
" 86.98677828, 102.34233041, 94.66455434, 102.34233041,\n",
" 94.66455434, 75.47011418, 121.53677057, 125.3756586 ,\n",
" 136.8923227 , 136.8923227 , 133.05343467, 136.8923227 ,\n",
" 133.05343467, 140.73121073, 129.21454664, 133.05343467,\n",
" 136.8923227 , 125.3756586 , 113.8589945 , 117.69788254,\n",
" 144.57009877, 83.14789024, 106.18121844, 117.69788254,\n",
" 106.18121844, 102.34233041, 98.50344237, 102.34233041,\n",
" 125.3756586 , 83.14789024, 121.53677057, 106.18121844,\n",
" 106.18121844, 110.02010647, 98.50344237, 67.79233811,\n",
" 86.98677828, 98.50344237, 98.50344237, 94.66455434,\n",
" 90.82566631, 127.29510262, 133.05343467, 134.97287868,\n",
" 138.81176672, 110.02010647, 110.02010647, 102.34233041,\n",
" 108.10066246, 83.14789024, 100.42288639, 83.14789024,\n",
" 67.79233811, 117.69788254, 125.3756586 , 123.45621459,\n",
" 127.29510262, 81.22844623, 71.63122615, 86.98677828,\n",
" 92.74511032, 117.69788254, 144.57009877, 121.53677057,\n",
" 121.53677057, 131.13399065, 131.13399065, 144.57009877,\n",
" 144.57009877, 144.57009877, 73.55067016, 79.30900221,\n",
" 56.27567401, 96.58399836, 65.8728941 , 127.29510262,\n",
" 129.21454664, 134.97287868, 136.8923227 , 127.29510262,\n",
" 115.77843852, 121.53677057, 123.45621459, 133.05343467,\n",
" 134.97287868, 134.97287868, 133.05343467, 83.14789024,\n",
" 100.42288639, 94.66455434, 96.58399836, 77.38955819,\n",
" 65.8728941 , 79.30900221, 77.38955819, 110.02010647,\n",
" 111.93955049, 111.93955049, 29.01956898, 55.89178521,\n",
" 68.56011572, 43.2234547 , 55.89178521, 118.08177134,\n",
" 120.00121536, 116.93010493, 120.76899296, 115.77843852,\n",
" 116.93010493, 98.11955357, 115.77843852, 120.00121536,\n",
" 115.39454972, 114.62677211, 123.07232578, 124.9917698 ,\n",
" 120.76899296, 126.52732501, 124.9917698 , 127.29510262,\n",
" 79.30900221, 88.90622229, 90.0578887 , 75.85400298,\n",
" 113.4751057 , 105.41344083, 103.11010801, 102.72621921,\n",
" 116.54621613, 129.21454664, 111.55566168, 132.28565706,\n",
" 73.55067016, 81.22844623, 111.93955049, 118.46566014,\n",
" 108.86844006, 116.93010493, 115.39454972, 129.21454664,\n",
" 126.91121382, 131.13399065, 124.607881 , 129.59843544,\n",
" 134.97287868, 120.76899296, 123.45621459, 72.01511495,\n",
" 63.56956128, 57.42734042, 89.2901111 , 96.96788716,\n",
" 106.18121844, 90.0578887 , 102.72621921, 63.18567247,\n",
" 62.03400606, 72.39900375, 51.28511957, 85.45122306,\n",
" 83.91566785, 91.59344391, 65.8728941 , 35.16178983,\n",
" 48.21400915, 71.24733734, 51.66900838, 86.98677828,\n",
" 93.12899913, 101.190664 , 121.15288177, 62.80178367,\n",
" 80.07677982, 74.31844777, 52.43678598, 70.86344854,\n",
" 15.58346087, 87.37066708, 37.84901146, 24.41290334,\n",
" 27.86790257, 54.7401188 , 79.30900221, 23.26123693,\n",
" 64.72122769, 80.07677982, 68.94400452, 103.49399682,\n",
" 60.11456205, 70.09567093, 90.0578887 , 92.36122152,\n",
" 95.43233195, 104.26177442, 79.30900221, 44.37512111,\n",
" 44.75900992, 59.73067324, 70.47955974, 52.43678598,\n",
" 49.74956436, 63.56956128, 61.26622846, 62.41789487,\n",
" 79.69289101, 67.79233811, 65.10511649, 70.09567093,\n",
" 68.17622692, 73.16678136, 86.60288947, 76.62178059,\n",
" 96.96788716, 101.5745528 , 108.48455126, 92.36122152,\n",
" 116.93010493, 126.91121382, 86.98677828, 90.82566631,\n",
" 63.95345008, 75.47011418, 83.14789024, 90.82566631,\n",
" 102.34233041, 56.27567401, 52.43678598, 75.47011418,\n",
" 48.59789795, 56.27567401, 56.27567401, 56.27567401,\n",
" 63.95345008, 48.59789795, 71.63122615, 48.59789795,\n",
" 98.50344237, 48.59789795, 94.66455434, 110.02010647,\n",
" 71.63122615, 56.27567401, 90.82566631, 90.82566631,\n",
" 25.56456975, 71.63122615, 86.98677828, 75.47011418])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.predict(df[[\"MPG\"]])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"cell_id": "4d5ddd856b3045ef905f77ddb85d7abe",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 0,
"execution_start": 1651522285720,
"source_hash": "ca715315",
"tags": []
},
"outputs": [],
"source": [
"#add a column to df\n",
"df[\"Pred\"] = reg.predict(df[[\"MPG\"]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the predicted values show up in the new \"Pred\" column on the right side of the DataFrame. At least the first one is reasonably accurate: the prediction is about 125, whereas the true value is 130."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"cell_id": "83426b460b7b42309e7405744db54d81",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 35,
"execution_start": 1651522290562,
"source_hash": "c085b6ba",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
" Name MPG Cylinders Displacement Horsepower \\\n",
"0 chevrolet chevelle malibu 18.0 8 307.0 130.0 \n",
"1 buick skylark 320 15.0 8 350.0 165.0 \n",
"2 plymouth satellite 18.0 8 318.0 150.0 \n",
"3 amc rebel sst 16.0 8 304.0 150.0 \n",
"4 ford torino 17.0 8 302.0 140.0 \n",
"\n",
" Weight_in_lbs Acceleration Year Origin Pred \n",
"0 3504 12.0 1970-01-01 USA 125.375659 \n",
"1 3693 11.5 1970-01-01 USA 136.892323 \n",
"2 3436 11.0 1970-01-01 USA 125.375659 \n",
"3 3433 12.0 1970-01-01 USA 133.053435 \n",
"4 3449 10.5 1970-01-01 USA 129.214547 "
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"cell_id": "2734f1e9b88942638532d4697410cdd7",
"deepnote_cell_height": 135,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 6,
"execution_start": 1651522373075,
"source_hash": "928ce873",
"tags": []
},
"outputs": [],
"source": [
"c = alt.Chart(df).mark_circle().encode(\n",
" x=\"MPG\",\n",
" y=\"Horsepower\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We eventually want to plot the predicted values using a red line. In the following chart `c1`, we use the same \"y\" column as before, so for now the data is exactly the same."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"cell_id": "a000ff9a84a64f8e830afa9b584838d8",
"deepnote_cell_height": 135,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 2,
"execution_start": 1651522461587,
"source_hash": "c3c5ad2f",
"tags": []
},
"outputs": [],
"source": [
"c1 = alt.Chart(df).mark_line(color=\"red\").encode(\n",
" x=\"MPG\",\n",
" y=\"Horsepower\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can display two Altair charts in the same position, one \"layered\" over the other, using `+`."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"cell_id": "d3274ab35d294b1a8a918b91847a7e86",
"deepnote_cell_height": 458,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
361
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 39,
"execution_start": 1651522480061,
"source_hash": "7875ad2a",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c+c1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's try the same thing, but using the \"Pred\" column for the y-values."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"cell_id": "82aa18b06c1d46258c70a4ffe85c1fe6",
"deepnote_cell_height": 135,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1651522522960,
"source_hash": "c2f31038",
"tags": []
},
"outputs": [],
"source": [
"c1 = alt.Chart(df).mark_line(color=\"red\").encode(\n",
" x=\"MPG\",\n",
" y=\"Pred\"\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following shows the line of best fit for this data. (We will describe what is meant by \"best\" in a later class.)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"cell_id": "18d1ef8f08a442c59380d6e77b59ad4d",
"deepnote_cell_height": 458,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
361
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 31,
"execution_start": 1651522529210,
"source_hash": "7875ad2a",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c+c1"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"cell_id": "cec6120bb65e4579acb7e214b87f70a2",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1651522610487,
"source_hash": "d2db0808",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"sklearn.linear_model._base.LinearRegression"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(reg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can find the slope of the line using `reg.coef_`. It is an array with one entry per input column, so here it has length one. The negative slope matches the negative correlation between these two variables."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"cell_id": "352eae8f95484e0682a8369b6c7580e7",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 3,
"execution_start": 1651522655700,
"source_hash": "c0bd65f9",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"array([-3.83888803])"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following is the y-intercept. In scikit-learn, attributes that store values learned during `fit` are usually named with a trailing underscore: for example, `intercept_` rather than the plain `intercept`."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"cell_id": "a98375595bec4444b47ac50aa0b9a430",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 12,
"execution_start": 1651522718207,
"source_hash": "1661cbe7",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"194.47564319018676"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.intercept_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Do you see how these values of -3.84 and 194.5 are reflected in the above line?"
]
},
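{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to check this, as a sketch using rounded versions of the two values above: the red line is approximately the graph of\n",
"\n",
"$$\\text{Pred} \\approx 194.4756 - 3.8389 \\cdot \\text{MPG}.$$\n",
"\n",
"For example, the first row of the DataFrame has MPG equal to $18$, and\n",
"\n",
"$$194.4756 - 3.8389 \\cdot 18 = 194.4756 - 69.1002 = 125.3754,$$\n",
"\n",
"which agrees, up to the rounding of the coefficients, with the value $125.3757$ stored in the \"Pred\" column for that row."
]
},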
{
"cell_type": "markdown",
"metadata": {
"cell_id": "8638d340de824944a8e5755bee09e134",
"deepnote_cell_height": 108.390625,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Polynomial Regression with the cars dataset\n",
"\n",
"Motto: Polynomial regression is no more difficult than linear regression.\n",
"\n",
"We discussed at the whiteboard how finding coefficients in a polynomial can be viewed as a special case of linear regression. (The reverse is also true, but here we are using the fact that we know how to find linear regression coefficients, to find polynomial coefficients.)\n",
"\n",
"We just need to add some new columns to our DataFrame. Here we use f-strings and a for-loop to keep our code Pythonic and DRY."
]
},
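{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the whiteboard idea concrete, here is a sketch in symbols. Write $H$ for horsepower and $M$ for miles per gallon, and let $\\beta_0, \\ldots, \\beta_3$ denote the unknown coefficients (this notation is introduced here just for illustration). A cubic model has the form\n",
"\n",
"$$\n",
"H \\approx \\beta_0 + \\beta_1 M + \\beta_2 M^2 + \\beta_3 M^3.\n",
"$$\n",
"\n",
"If we treat $x_1 = M$, $x_2 = M^2$, and $x_3 = M^3$ as three separate input columns, this becomes\n",
"\n",
"$$\n",
"H \\approx \\beta_0 + \\beta_1 x_1 + \\beta_2 x_2 + \\beta_3 x_3,\n",
"$$\n",
"\n",
"which is linear in the coefficients, so it is the kind of problem that linear regression with three input columns solves."
]
},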
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"cell_id": "1ff308f0529645018f30a2454dc9adba",
"deepnote_cell_height": 99,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1651523408703,
"source_hash": "b0305040",
"tags": []
},
"outputs": [],
"source": [
"for i in range(1,4):\n",
" df[f\"MPG{i}\"] = df[\"MPG\"]**i"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice how the true MPG value in the first row is 18, and so at the end we see the values $18$, $18^2$, and $18^3$. This method would easily adapt to higher degrees. (That is the main benefit of not just typing the three columns out one at a time.)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"cell_id": "7300737ae5284ddda75b3429cfd97096",
"deepnote_cell_height": 395,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 319,
"execution_start": 1651523415590,
"source_hash": "c085b6ba",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Name | \n",
" MPG | \n",
" Cylinders | \n",
" Displacement | \n",
" Horsepower | \n",
" Weight_in_lbs | \n",
" Acceleration | \n",
" Year | \n",
" Origin | \n",
" Pred | \n",
" MPG1 | \n",
" MPG2 | \n",
" MPG3 | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" chevrolet chevelle malibu | \n",
" 18.0 | \n",
" 8 | \n",
" 307.0 | \n",
" 130.0 | \n",
" 3504 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
" 125.375659 | \n",
" 18.0 | \n",
" 324.0 | \n",
" 5832.0 | \n",
"
\n",
" \n",
" 1 | \n",
" buick skylark 320 | \n",
" 15.0 | \n",
" 8 | \n",
" 350.0 | \n",
" 165.0 | \n",
" 3693 | \n",
" 11.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
" 136.892323 | \n",
" 15.0 | \n",
" 225.0 | \n",
" 3375.0 | \n",
"
\n",
" \n",
" 2 | \n",
" plymouth satellite | \n",
" 18.0 | \n",
" 8 | \n",
" 318.0 | \n",
" 150.0 | \n",
" 3436 | \n",
" 11.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
" 125.375659 | \n",
" 18.0 | \n",
" 324.0 | \n",
" 5832.0 | \n",
"
\n",
" \n",
" 3 | \n",
" amc rebel sst | \n",
" 16.0 | \n",
" 8 | \n",
" 304.0 | \n",
" 150.0 | \n",
" 3433 | \n",
" 12.0 | \n",
" 1970-01-01 | \n",
" USA | \n",
" 133.053435 | \n",
" 16.0 | \n",
" 256.0 | \n",
" 4096.0 | \n",
"
\n",
" \n",
" 4 | \n",
" ford torino | \n",
" 17.0 | \n",
" 8 | \n",
" 302.0 | \n",
" 140.0 | \n",
" 3449 | \n",
" 10.5 | \n",
" 1970-01-01 | \n",
" USA | \n",
" 129.214547 | \n",
" 17.0 | \n",
" 289.0 | \n",
" 4913.0 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Name MPG Cylinders Displacement Horsepower \\\n",
"0 chevrolet chevelle malibu 18.0 8 307.0 130.0 \n",
"1 buick skylark 320 15.0 8 350.0 165.0 \n",
"2 plymouth satellite 18.0 8 318.0 150.0 \n",
"3 amc rebel sst 16.0 8 304.0 150.0 \n",
"4 ford torino 17.0 8 302.0 140.0 \n",
"\n",
" Weight_in_lbs Acceleration Year Origin Pred MPG1 MPG2 \\\n",
"0 3504 12.0 1970-01-01 USA 125.375659 18.0 324.0 \n",
"1 3693 11.5 1970-01-01 USA 136.892323 15.0 225.0 \n",
"2 3436 11.0 1970-01-01 USA 125.375659 18.0 324.0 \n",
"3 3433 12.0 1970-01-01 USA 133.053435 16.0 256.0 \n",
"4 3449 10.5 1970-01-01 USA 129.214547 17.0 289.0 \n",
"\n",
" MPG3 \n",
"0 5832.0 \n",
"1 3375.0 \n",
"2 5832.0 \n",
"3 4096.0 \n",
"4 4913.0 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
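{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, the same powers can also be produced with scikit-learn's `PolynomialFeatures`. This is not the method used in this notebook; it is shown only as a sketch for comparison, and the names `mpg_demo`, `poly`, and `powers_demo` are made up for the illustration. The input is the MPG column from the five rows displayed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"# MPG values from the first five rows shown above, as a single column\n",
"mpg_demo = np.array([[18.0], [15.0], [18.0], [16.0], [17.0]])\n",
"\n",
"# include_bias=False leaves out the constant column of ones\n",
"poly = PolynomialFeatures(degree=3, include_bias=False)\n",
"powers_demo = poly.fit_transform(mpg_demo)\n",
"\n",
"# Each row holds the value, its square, and its cube\n",
"powers_demo"
]
},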
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"cell_id": "7cd29a850e09421a912d2d78158cb4e5",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 1,
"execution_start": 1651523460186,
"source_hash": "cd72858e",
"tags": []
},
"outputs": [],
"source": [
"reg = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we make a list of the three columns we're interested in."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"cell_id": "d5a26d8ecc094af8b32037bdd9a11a87",
"deepnote_cell_height": 117,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 5,
"execution_start": 1651523635340,
"source_hash": "34dfec1",
"tags": []
},
"outputs": [],
"source": [
"# cols = [\"MPG1\",\"MPG2\",\"MPG3\"]\n",
"# df.columns[-3:]\n",
"cols = [f\"MPG{i}\" for i in range(1,4)]"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"cell_id": "745fc65397b44f9892487048d578579a",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 10,
"execution_start": 1651523671068,
"source_hash": "5936082",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(cols)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `cols` is already a list, it's important to use `df[cols]` instead of `df[[cols]]`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"cell_id": "71e5cfa2041b4415827f01927914be00",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 5,
"execution_start": 1651523704540,
"source_hash": "8a2ef894",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.fit(df[cols],df[\"Horsepower\"])"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"cell_id": "2b601fcf797a45f692ba543f0881de68",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1651523711172,
"source_hash": "c0bd65f9",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"array([-3.01312437e+01, 8.66793237e-01, -8.50566252e-03])"
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.coef_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a nice way to display which value corresponds to which coefficient."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"cell_id": "839fcca8ff0c428db9018abb5df78d6a",
"deepnote_cell_height": 175.765625,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
78.78125
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 8,
"execution_start": 1651523767956,
"source_hash": "77d2cb3a",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"MPG1 -30.131244\n",
"MPG2 0.866793\n",
"MPG3 -0.008506\n",
"dtype: float64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.Series(reg.coef_, index=cols)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"cell_id": "4ce4449f5a0b40eab701531ec3293b47",
"deepnote_cell_height": 118.1875,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
21.1875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 542,
"execution_start": 1651523824470,
"source_hash": "1661cbe7",
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"429.5814606301147"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"reg.intercept_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above values can be interpreted as saying that our model estimates the following formula, where H stands for horsepower and where M stands for miles per gallon.\n",
"\n",
"$$\n",
"H \\approx -0.0085 \\cdot M^3 + 0.86 \\cdot M^2 - 30.1 \\cdot M + 429.6\n",
"$$"
]
},
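{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see the formula in action, here is a small sketch that evaluates the cubic at $M = 18$ (the MPG of the first car). The coefficients are copied by hand from the outputs above, and the names `coeffs`, `h18`, and `h18_manual` are made up for this illustration. NumPy's `polyval` expects the coefficients ordered from the highest degree down to the constant term."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"# Copied from reg.coef_ and reg.intercept_ above: cubic, quadratic, linear, constant\n",
"coeffs = [-8.50566252e-03, 8.66793237e-01, -3.01312437e+01, 429.5814606301147]\n",
"\n",
"# Estimated horsepower for a car that gets 18 miles per gallon\n",
"h18 = np.polyval(coeffs, 18.0)\n",
"\n",
"# The same value written out term by term\n",
"h18_manual = coeffs[0]*18**3 + coeffs[1]*18**2 + coeffs[2]*18 + coeffs[3]\n",
"\n",
"print(h18, h18_manual)"
]
},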
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's add the values predicted by our model into the DataFrame."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"cell_id": "eda6977ff18d4cfe999fcef968ed0f38",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 4,
"execution_start": 1651524035997,
"source_hash": "dcf1800d",
"tags": []
},
"outputs": [],
"source": [
"df[\"Pred3\"] = reg.predict(df[cols])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's get a new Altair chart. We can save some typing by copying our linear chart `c1`."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"cell_id": "203e3aec79ab4a5fbaf1976a0c33b35d",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_to_be_reexecuted": false,
"execution_millis": 6,
"execution_start": 1651524068900,
"source_hash": "f8024a29",
"tags": []
},
"outputs": [],
"source": [
"c3 = c1.copy()"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"cell_id": "44a2b8569f5f48d49c435dc6b626162c",
"deepnote_cell_height": 81,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
361
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 2,
"execution_start": 1651524126282,
"source_hash": "a4ba0f29",
"tags": []
},
"outputs": [],
"source": [
"c3 = c3.encode(y=\"Pred3\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Warning: it often happens in Machine Learning that the better the fit appears to be, the worse it will perform on future data (due to *overfitting*). So don't be too impressed by the accuracy of this fit."
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"cell_id": "646e1edbfe5a428da8059d9f89659ec3",
"deepnote_cell_height": 457.953125,
"deepnote_cell_type": "code",
"deepnote_output_heights": [
360.96875
],
"deepnote_to_be_reexecuted": false,
"execution_millis": 50,
"execution_start": 1651524135911,
"source_hash": "eba6d419",
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
""
],
"text/plain": [
"alt.LayerChart(...)"
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c+c3"
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "0fb1964d0cdb4b44a04424a365f22644",
"deepnote_cell_height": 153.1875,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"## Warning: Don't misuse polynomial regression\n",
"\n",
"For some reason, unreasonable cubic models often get shared in the media. The cubic polynomial that \"fits best\" can be interesting to look at, but don't expect it to provide accurate predictions in the future. (This is a symptom of *overfitting*, which is a key concept in Machine Learning that we will return to soon.)\n",
"\n",
"I don't know of any natural occurring phenomenon that follows a cubic pattern, so don't be deceived by the closeness of a cubic model fit."
]
},
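{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a minimal sketch of why a close fit is weak evidence, using made-up data (a linear trend plus random noise), not data from any of the examples in this notebook. The helper names `powers` and `mse` are made up for the illustration. The cubic model contains the linear model as a special case, so its error on the data used for fitting can never be larger than the linear model's error. There is no such guarantee on new data, especially outside the range used for fitting."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"rng = np.random.default_rng(0)\n",
"\n",
"# Made-up data: a truly linear trend plus noise\n",
"x_train = np.linspace(0, 10, 20)\n",
"y_train = 3 * x_train + 5 + rng.normal(scale=4, size=x_train.size)\n",
"\n",
"# New points outside the range used for fitting\n",
"x_new = np.linspace(10, 15, 20)\n",
"y_new = 3 * x_new + 5 + rng.normal(scale=4, size=x_new.size)\n",
"\n",
"def powers(x, degree):\n",
"    # Columns x, x**2, ..., x**degree\n",
"    return np.column_stack([x**i for i in range(1, degree + 1)])\n",
"\n",
"def mse(model, x, y, degree):\n",
"    # Mean squared error of the model's predictions\n",
"    return np.mean((model.predict(powers(x, degree)) - y) ** 2)\n",
"\n",
"train_mse = {}\n",
"new_mse = {}\n",
"for degree in (1, 3):\n",
"    model = LinearRegression().fit(powers(x_train, degree), y_train)\n",
"    train_mse[degree] = mse(model, x_train, y_train, degree)\n",
"    new_mse[degree] = mse(model, x_new, y_new, degree)\n",
"\n",
"print(\"MSE on the fitting data:\", train_mse)\n",
"print(\"MSE on the new data:\", new_mse)"
]
},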
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example from FiveThirtyEight\n",
"\n",
"Here a cubic model is being used to model the percentage of Yes confirmation votes as a function of time."
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "c690324d502f4420bfe9ce0eeda61296",
"deepnote_cell_height": 1041.453125,
"deepnote_cell_type": "markdown",
"tags": []
},
"source": [
"*(Figure not shown: the FiveThirtyEight chart of this cubic model.)*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example from Trump's Covid-19 team"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a cubic model to predict deaths from Covid-19. This particular cubic model suggests deaths will get to 0 in mid-May 2020."
]
},
{
"cell_type": "markdown",
"metadata": {
"cell_id": "4e53376b10074481ab3b6aa3ccb9144d",
"deepnote_cell_height": 635.25,
"deepnote_cell_type": "markdown",
"owner_user_id": "02be19f8-8497-4212-b8d0-46ca9f1d48b9",
"tags": []
},
"source": [
"*(Figure not shown: the chart of this cubic model for Covid-19 deaths.)*"
]
}
],
"metadata": {
"deepnote": {},
"deepnote_execution_queue": [],
"deepnote_notebook_id": "a77823bb-2b5b-436d-9a2e-689e1064d274",
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}