Week 5 Tuesday#

One of the main goals of Machine Learning is to make predictions based on data. We will do this using both real-world and artificial datasets. Today we’ll see how to make an artificial dataset for linear regression.

References:

  • make_regression: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html

  • altair.X: https://altair-viz.github.io/user_guide/generated/channels/altair.X.html

  • Binning in Altair: https://altair-viz.github.io/user_guide/transform/bin.html

import pandas as pd
import altair as alt

Generating “perfect” linear data#

  • Import the make_regression function from sklearn.datasets.

from sklearn.datasets import make_regression

Generate artificial data for linear regression using the following parameters. (You can use help(make_regression) to check the documentation.)

  • There should be 2000 instances (also called observations).

  • There should be 1 input feature (also called a predictor).

  • Use the default value of one target dimension.

  • Set the random_state keyword argument to be 110.

  • Use the coef keyword argument so that the true linear regression coefficient is returned.

Store the output of this make_regression call with the variable name temp.

temp = make_regression(n_samples=2000, n_features=1, random_state=110, coef=True)
temp
(array([[-0.6660065 ],
        [-0.81590969],
        [-1.42480518],
        ...,
        [ 0.09257261],
        [ 1.7407368 ],
        [ 0.38394328]]),
 array([-17.04573353, -20.88234742, -36.46638477, ...,   2.36929821,
         44.55232112,   9.826623  ]),
 array(25.59394451))

The output is a little strange.

  • What is the data type of temp? What is its length? What do the entries represent?

type(temp)
tuple
len(temp)
3
type(temp[0])
numpy.ndarray
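
So temp is a tuple of length 3. According to the make_regression documentation, its entries are the input array X (here with shape (2000, 1)), the target array y (length 2000), and, because we set coef=True, the true coefficient of the underlying linear model.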

Instead of saving the output of make_regression as a single value, it is more common to use what is called tuple-unpacking.
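
For example, here is tuple unpacking with a generic tuple (unrelated to our dataset):

a, b, c = (1, 2, 3)  # assigns a=1, b=2, c=3 in one statement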

  • Run the make_regression code again, and unpack the output into the variables X, y, and m.

(Notice that X is a capital letter and y is a lower-case letter. That is a reminder that X is two-dimensional and y is one-dimensional.)

X, y, m = make_regression(n_samples=2000, n_features=1, random_state=110, coef=True)
X
array([[-0.6660065 ],
       [-0.81590969],
       [-1.42480518],
       ...,
       [ 0.09257261],
       [ 1.7407368 ],
       [ 0.38394328]])
y
array([-17.04573353, -20.88234742, -36.46638477, ...,   2.36929821,
        44.55232112,   9.826623  ])
m
array(25.59394451)
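
(Notice that m is a zero-dimensional NumPy array rather than an ordinary Python number. If you ever need a plain number, float(m) will convert it.)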
  • Put X and y together into a single pandas DataFrame df with column names "x" (lower-case) and "y".

df = pd.DataFrame({
    "x": X[:, 0],
    "y": y
})
  • Using Altair, make a scatter plot of this data.

alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

Generating “noisy” linear data#

Real-world data will not be perfectly linear, so we definitely need algorithms that work on more general datasets. We can add (Gaussian, aka normally distributed) random noise to this data with the noise keyword argument.
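
Conceptually, the recipe is the same as before, with Gaussian noise added at the end. Here is a simplified sketch of that recipe (not scikit-learn’s actual implementation), using a made-up coefficient of 25 for illustration:

import numpy as np

rng = np.random.default_rng(0)  # hypothetical random generator for this sketch
x_sketch = rng.normal(size=2000)
# target = coefficient * input + bias + Gaussian noise with standard deviation 4
y_sketch = 25 * x_sketch + (-15) + rng.normal(scale=4, size=2000)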

Repeat everything above, but include the following parameters when calling make_regression.

  • Set noise to be 4. (This corresponds to the standard deviation of the random noise we are adding.)

  • Set bias to be -15. (This subtracts 15 from every y-value.)

X, y, m = make_regression(
    n_samples=2000, 
    n_features=1, 
    random_state=110, 
    coef=True,
    noise=4,
    bias=-15,
)
df = pd.DataFrame({
    "x": X[:, 0],
    "y": y
})
  • Store the resulting Altair chart with the variable name c1.

c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)

c1

The real data (before adding the random noise) has the form \(y = mx+b\), where \(b\) is the bias we chose, -15.

  • Looking at the new Altair chart, can you estimate the coefficient m? Compare it to the actual value, which was returned by make_regression because we set coef=True.

m
array(25.59394451)
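
As an optional sanity check (not part of the exercise), we could also estimate the slope and intercept directly from the noisy data using NumPy’s least-squares line fit; the estimates should be close to m (about 25.6) and the bias, -15.

import numpy as np

# np.polyfit returns coefficients from the highest degree down, so for degree 1
# the first value is the estimated slope and the second the estimated intercept.
slope_est, intercept_est = np.polyfit(df["x"], df["y"], 1)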
  • Define a new column in df named "y_true" which holds the “true” linear outputs. (Your formula should use m and -15 and df["x"], but it should not use df["y"].)

df["y_true"] = m*df["x"]+-15
df.head
<bound method NDFrame.head of              x          y     y_true
0    -1.323266 -47.666158 -48.867601
1     0.662905   7.643532   1.966344
2    -0.329451 -20.589195 -23.431957
3     0.418390  -3.768828  -4.291743
4    -0.409929 -33.252898 -25.491703
...        ...        ...        ...
1995  0.507076  -1.399337  -2.021922
1996  0.377814  -4.605596  -5.330252
1997 -1.093173 -40.605049 -42.978600
1998 -0.840307 -33.793690 -36.506774
1999  1.021482  11.154310  11.143756

[2000 rows x 3 columns]>
  • Define a new Altair chart c2 using the following code:

c2 = alt.Chart(df).mark_line().encode(
    x="x",
    y="y_true",
    color=alt.value("red")
)

c2
  • Display c2 on top of c1, by using c1+c2.

c1+c2
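
(The + operator layers Altair charts: c1+c2 is shorthand for alt.layer(c1, c2).)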

Recognizing the noise parameter#

  • Define a new column named "residual" in the DataFrame, which is the difference between the "y" column and the "y_true" column.

df["residual"] = df["y"] - df["y_true"]
  • What is the standard deviation of this new column? Is it what you expected?

df["residual"].std()
3.9867761198352816
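This is approximately 4, matching the noise=4 standard deviation we specified; the "residual" column is recovering exactly the Gaussian noise that was added to the data.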
  • Make a histogram for these residual values using the following code.

alt.Chart(df).mark_bar().encode(
    x="residual",
    y="count()"
)

The chart looks pretty bad because the values are not being grouped into bins; each distinct residual value gets its own bar.

  • Change the x-channel specification to the following, and display the chart again.

x = alt.X("residual", bin=alt.Bin(maxbins=100)),

Does the data look normally distributed (for example, with an approximate bell-curve shape)?

alt.Chart(df).mark_bar().encode(
    x = alt.X("residual", bin=alt.Bin(maxbins=100)),
    y="count()"
)
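
Since the noise was drawn from a normal distribution with standard deviation 4, the histogram should show an approximate bell curve centered at 0, with nearly all residuals within three standard deviations (about ±12).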