Week 5 Tuesday#
One of the main goals of Machine Learning is to make predictions based on data. We will do this using both real-world and artificial datasets. Today we’ll see how to make an artificial dataset for linear regression.
References:
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html
https://altair-viz.github.io/user_guide/generated/channels/altair.X.html
https://altair-viz.github.io/user_guide/transform/bin.html
import pandas as pd
import altair as alt
Generating “perfect” linear data#
Import the make_regression function from sklearn.datasets.
from sklearn.datasets import make_regression
Generate artificial data for linear regression using the following parameters. (You can use help(make_regression)
to check the documentation.)
There should be 2000 instances (also called observations).
There should be 1 input feature (also called a predictor).
Use the default value of one target dimension.
Set the random_state keyword argument to be 110.
Use the coef keyword argument so that the true linear regression coefficient is returned.
Store the output of this make_regression call with the variable name temp.
temp = make_regression(n_samples=2000, n_features=1, random_state=110, coef=True)
temp
(array([[-0.6660065 ],
[-0.81590969],
[-1.42480518],
...,
[ 0.09257261],
[ 1.7407368 ],
[ 0.38394328]]),
array([-17.04573353, -20.88234742, -36.46638477, ..., 2.36929821,
44.55232112, 9.826623 ]),
array(25.59394451))
The output is a little strange.
What is the data type of temp? What is its length? What do the entries represent?
type(temp)
tuple
len(temp)
3
type(temp[0])
numpy.ndarray
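One way to see what the three entries represent is to inspect their shapes (this quick check is our own addition, using only the temp variable defined above):
# temp[0] is the input array X, temp[1] is the target array y,
# and temp[2] is the true coefficient (a 0-dimensional array)
temp[0].shape, temp[1].shape, temp[2].shape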
Instead of saving the output of make_regression as a single value, it is more common to use what is called tuple unpacking.
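As a minimal Python illustration of tuple unpacking (the variable names here are arbitrary):
# Each element of the tuple gets assigned to its own variable name
a, b, c = (1, 2, 3)
a, b, c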
Run the make_regression code again, and unpack the output into the variables X, y, and m.
(Notice that X is a capital letter and y is a lower-case letter. That is a reminder that X is two-dimensional and y is one-dimensional.)
X, y, m = make_regression(n_samples=2000, n_features=1, random_state=110, coef=True)
X
array([[-0.6660065 ],
[-0.81590969],
[-1.42480518],
...,
[ 0.09257261],
[ 1.7407368 ],
[ 0.38394328]])
y
array([-17.04573353, -20.88234742, -36.46638477, ..., 2.36929821,
44.55232112, 9.826623 ])
m
array(25.59394451)
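Notice that m is a zero-dimensional NumPy array rather than an ordinary number; if a plain Python float is ever needed, it can be converted, for example like this (our own aside, not part of the exercise):
# Convert the 0-dimensional array holding the true coefficient to a Python float
float(m)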
Put X and y together into a single pandas DataFrame df with column names "x" (lower-case) and "y".
df = pd.DataFrame({
    "x": X[:, 0],
    "y": y
})
Using Altair, make a scatter plot of this data.
alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
Generating “noisy” linear data#
Real-world data will not be perfectly linear, so we definitely need algorithms that work on more general datasets. We can add (Gaussian, aka normally distributed) random noise to this data with the noise
keyword argument.
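Roughly speaking, setting noise=4 behaves as if normally distributed values with standard deviation 4 were added to each noiseless output. Here is a small sketch of that idea using NumPy; this is only an illustration of the concept, not scikit-learn's actual internal code, and the seed is arbitrary:
import numpy as np
rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
y_clean = m * X[:, 0]           # the "perfect" linear outputs from above
y_noisy = y_clean + rng.normal(loc=0, scale=4, size=y_clean.shape)  # add Gaussian noise with std 4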
Repeat everything above, but include the following parameters when calling make_regression.
Set noise to be 4. (This corresponds to the standard deviation of the random noise we are adding.)
Set bias to be -15. (This subtracts 15 from every y-value.)
X, y, m = make_regression(
    n_samples=2000,
    n_features=1,
    random_state=110,
    coef=True,
    noise=4,
    bias=-15,
)
df = pd.DataFrame({
    "x": X[:, 0],
    "y": y
})
Store the resulting Altair chart with the variable name c1.
c1 = alt.Chart(df).mark_circle().encode(
    x="x",
    y="y"
)
c1
The real data (before adding the random noise) has the form \(y = mx+b\), where \(b\) is the bias we chose, -15.
Looking at the new Altair chart, can you estimate the coefficient m? Compare it to the actual value, which was output by make_regression because we set coef=True.
m
array(25.59394451)
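One way to check a visual estimate numerically is to fit a straight line to the noisy data, for example with a least-squares fit using np.polyfit; using NumPy here is our own choice, not part of the original exercise:
import numpy as np
# Least-squares fit of y ≈ slope*x + intercept to the noisy data;
# the slope should be close to m and the intercept close to the bias of -15
slope, intercept = np.polyfit(df["x"], df["y"], 1)
slope, intercept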
Define a new column in df named "y_true" which holds the “true” linear outputs. (Your formula should use m and -15 and df["x"], but it should not use df["y"].)
df["y_true"] = m*df["x"]+-15
df.head()
          x          y     y_true
0 -1.323266 -47.666158 -48.867601
1  0.662905   7.643532   1.966344
2 -0.329451 -20.589195 -23.431957
3  0.418390  -3.768828  -4.291743
4 -0.409929 -33.252898 -25.491703
Define a new Altair chart c2 using the following code:
c2 = alt.Chart(df).mark_line().encode(
    x="x",
    y="y_true",
    color=alt.value("red")
)
c2 = alt.Chart(df).mark_line().encode(
    x="x",
    y="y_true",
    color=alt.value("red")
)
c2
Display c2 on top of c1, by using c1+c2.
c1+c2
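The + operator is Altair's shorthand for layering charts; the same figure can also be produced explicitly with alt.layer:
# Equivalent to c1+c2: layer the red line chart on top of the scatter plot
alt.layer(c1, c2)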
Recognizing the noise parameter#
Define a new column named "residual" in the DataFrame, which is the difference between the "y" column and the "y_true" column.
df["residual"] = df["y"] - df["y_true"]
What is the standard deviation of this new column? Is it what you expected?
df["residual"].std()
3.9867761198352816
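The standard deviation is close to the noise value of 4 that we chose. Since the residuals are exactly the Gaussian noise that was added, their mean should also be near 0; a quick check:
# Residuals should be roughly mean-0 with standard deviation near 4
df["residual"].mean(), df["residual"].std()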
Make a histogram for these residual values using the following code.
alt.Chart(df).mark_bar().encode(
    x="residual",
    y="count()"
)
alt.Chart(df).mark_bar().encode(
    x="residual",
    y="count()"
)
The chart looks pretty bad because it is not grouping any of the bars together.
Change the x-channel specification to the following, and display the chart again.
x = alt.X("residual", bin=alt.Bin(maxbins=100)),
Does the data look normally distributed (for example, with an approximate bell-curve shape)?
alt.Chart(df).mark_bar().encode(
    x = alt.X("residual", bin=alt.Bin(maxbins=100)),
    y="count()"
)
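The maxbins argument only caps how many bins Altair may use; if you would rather control the bin width directly, alt.Bin also accepts a step argument, for example (a variation we are adding for illustration):
# Bin the residuals into bins of width 1 instead of capping the number of bins
alt.Chart(df).mark_bar().encode(
    x=alt.X("residual", bin=alt.Bin(step=1)),
    y="count()"
)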