Performance measures for machine learning
import pandas as pd
import altair as alt
Warm-up question
Here is a dataset with only 4 data points. Which of the following linear models better fits the data?
Line A: \(f(x) = 2x\)
Line B: \(g(x) = -2\)
Start out by plotting both of these lines, together with the data, on the domain \((-2,5)\).
df = pd.DataFrame({"x":range(4), "y":[0,2,-10,6]})
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
c = alt.Chart(df).mark_circle(size=100).encode(
x="x",
y="y"
)
c
df_line = pd.DataFrame({"x":[-2,5]})
df_lineA = df_line.copy()
df_lineB = df_line.copy()
We could define a function like the following (and this is the best way if the function has a complex definition).
def f(x):
    return 2*x
f(10)
20
For such a simple function, it is better to use a lambda function.
f = lambda x: 2*x
f(10)
20
type(f)
function
Now we do the same thing for our other model. This is a constant function that always outputs -2.
g = lambda x: -2
g(7)
-2
df_lineA
|   | x |
|---|---|
| 0 | -2 |
| 1 | 5 |
An alternative to the following would be `df_lineA["y"] = f(df_lineA["x"])`, but I think the following way is more intuitive.
df_lineA["y"] = df_lineA["x"].map(f)
df_lineA
|   | x | y |
|---|---|---|
| 0 | -2 | -4 |
| 1 | 5 | 10 |
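As a quick check (a small aside, not part of the original notebook), the vectorized alternative mentioned above produces exactly the same column, because `2*x` broadcasts over a pandas Series:

```python
# Apply f directly to the Series instead of using .map.
alt_y = f(df_lineA["x"])
(alt_y == df_lineA["y"]).all()  # True
```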
df_lineB["y"] = df_lineB["x"].map(g)
df_lineB
|   | x | y |
|---|---|---|
| 0 | -2 | -2 |
| 1 | 5 | -2 |
c_lineA = alt.Chart(df_lineA).mark_line(color="red").encode(
x="x",
y="y"
)
c_lineB = alt.Chart(df_lineB).mark_line(color="black").encode(
x="x",
y="y"
)
Looking at the two models in the following picture, which do you think fits the data better?
c+c_lineA+c_lineB
Add predicted columns to `df`
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
df["PredA"] = df["x"].map(f)
df["PredB"] = df["x"].map(g)
df
|   | x | y | PredA | PredB |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -2 |
| 1 | 1 | 2 | 2 | -2 |
| 2 | 2 | -10 | 4 | -2 |
| 3 | 3 | 6 | 6 | -2 |
It turns out that which model is better depends heavily on what performance measure you use. Here are two of the most famous, `mean_squared_error` (abbreviated MSE) and `mean_absolute_error` (abbreviated MAE). When scikit-learn performs linear regression, it uses the Mean Squared Error.
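For reference, if \(y_i\) denotes the true values, \(\hat{y}_i\) the predicted values, and \(n\) the number of data points, these two measures are defined by

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|.
\]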
from sklearn.metrics import mean_squared_error, mean_absolute_error
Check that you can make the following sorts of computations by hand.
mean_squared_error(df["y"], df["PredA"])
49.0
mean_squared_error(df["y"], df["PredB"])
37.0
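Here is one way to make that by-hand check (a small sketch using the residuals read off from the table above):

```python
# Residuals (y - prediction) from the table above.
resid_A = [0 - 0, 2 - 2, -10 - 4, 6 - 6]              # [0, 0, -14, 0]
resid_B = [0 - (-2), 2 - (-2), -10 - (-2), 6 - (-2)]  # [2, 4, -8, 8]
print(sum(r**2 for r in resid_A) / 4)  # 196/4 = 49.0; the outlier at x=2 contributes everything
print(sum(r**2 for r in resid_B) / 4)  # (4 + 16 + 64 + 64)/4 = 37.0
```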
With loss functions (also called cost functions or error functions; these terms all mean the same thing), smaller is better, meaning more accurate.
If we use the MSE metric, the horizontal black line is the better model. Because the errors are squared, MSE punishes large errors (like Line A's miss at the outlier) heavily.
mean_absolute_error(df["y"], df["PredA"])
3.5
mean_absolute_error(df["y"], df["PredB"])
5.5
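The same by-hand check works for MAE (again a small sketch using the table above):

```python
# Absolute residuals |y - prediction| from the table above.
abs_A = [abs(0 - 0), abs(2 - 2), abs(-10 - 4), abs(6 - 6)]              # [0, 0, 14, 0]
abs_B = [abs(0 - (-2)), abs(2 - (-2)), abs(-10 - (-2)), abs(6 - (-2))]  # [2, 4, 8, 8]
print(sum(abs_A) / 4)  # 14/4 = 3.5
print(sum(abs_B) / 4)  # 22/4 = 5.5
```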
If we use the MAE metric, then the red line is the better model, since the Mean Absolute Error is lower for the “PredA” column.
Simulated data
The following data was simulated from the function \(f(x) = c_2 x^2 + c_1 x + c_0\), using the coefficients given below. The exact outputs were computed (in the `y_true` column), and then some random noise (following a normal distribution) was added to each point (in the `y` column).
(You can see the code that simulated this data in the `HelperNotebook.ipynb` file. It includes some things we have done before, like adding new columns corresponding to powers of existing columns, and it also includes some things we haven’t discussed, like generating normally distributed random data using `default_rng` from NumPy.)
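In case it helps, here is a minimal sketch of how data like this could be generated. This is not the actual code from `HelperNotebook.ipynb`; the sample size, the range of \(x\), the random seed, and the noise scale are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)                     # hypothetical seed
x = rng.uniform(-5, 7, size=100)                        # hypothetical range and sample size
y_true = 1.7 * x**2 + 4.1 * x - 23.4                    # c2*x^2 + c1*x + c0, the exact outputs
y = y_true + rng.normal(loc=0, scale=20, size=x.size)   # hypothetical noise scale

sim = pd.DataFrame({"x": x, "y_true": y_true, "y": y})
for i in range(1, 11):                                  # power columns x1, ..., x10
    sim[f"x{i}"] = x**i
```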
df = pd.read_csv("../data/sim_data.csv")
c0 = -23.4
c1 = 4.1
c2 = 1.7
df.head()
|   | x | y_true | y | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.329208 | -18.207589 | -117.484900 | -3.329208 | 11.083626 | -36.899694 | 122.846756 | -408.982395 | 1361.587441 | -4533.007730 | 1.509133e+04 | -5.024216e+04 | 1.672666e+05 |
| 1 | 6.465018 | 74.160562 | 73.954907 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 472050.384357 | 3.051814e+06 | 1.973004e+07 | 1.275550e+08 |
| 2 | -4.478046 | -7.670062 | -13.810089 | -4.478046 | 20.052899 | -89.797810 | 402.118751 | -1800.706392 | 8063.646628 | -36109.383086 | 1.616995e+05 | -7.240978e+05 | 3.242544e+06 |
| 3 | 2.043272 | -7.925152 | 19.461182 | 2.043272 | 4.174960 | 8.530580 | 17.430295 | 35.614834 | 72.770792 | 148.690523 | 3.038152e+02 | 6.207771e+02 | 1.268416e+03 |
| 4 | 4.850593 | 36.485466 | 22.375230 | 4.850593 | 23.528255 | 114.125996 | 553.578791 | 2685.185564 | 13024.743051 | 63177.731115 | 3.064495e+05 | 1.486462e+06 | 7.210222e+06 |
The true data lies perfectly on a parabola (from a degree 2 polynomial).
alt.Chart(df).mark_circle().encode(
x="x",
y="y_true",
tooltip=["x"]
)
For the simulated data, we added some (significant) random noise, so the underlying parabola is much less apparent. This noisy data is stored in the “y” column.
alt.Chart(df).mark_circle().encode(
x="x",
y="y",
tooltip = ["x","y","y_true"]
)
Different powers of \(x\) were also added as new columns: \(x^1\), \(x^2\), …, \(x^{10}\).
max_deg = 10
cols = [f"x{i}" for i in range(1, max_deg+1)]
cols
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
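As an aside, scikit-learn can also generate this kind of power column automatically. Here is a minimal sketch using `PolynomialFeatures` (this is not necessarily how the helper notebook built the columns):

```python
from sklearn.preprocessing import PolynomialFeatures

# Build x^1 through x^10 from the single "x" column;
# include_bias=False omits the constant x^0 column.
poly = PolynomialFeatures(degree=max_deg, include_bias=False)
powers = poly.fit_transform(df[["x"]])
powers.shape  # (number of rows, 10)
```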
Let’s see if we can recover the secret coefficients `c0`, `c1`, and `c2` using the “y_true” column. You shouldn’t be too impressed by this, because the “y_true” column holds the values without the random noise.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
Let’s just use the first two columns (degree 1 and degree 2, no higher degree terms).
sub_cols = cols[:2]
sub_cols
['x1', 'x2']
reg.fit(df[sub_cols], df["y_true"])
LinearRegression()
We don’t even have to call `reg.predict` in this case, because we don’t care so much about the outputs; instead we care about the coefficients.
reg.coef_
array([4.1, 1.7])
These match the true coefficients exactly.
c1
4.1
c2
1.7
The same is true (up to a slight numerical precision issue) for the constant term.
reg.intercept_
-23.40000000000001
c0
-23.4
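As one more check (a small sketch, not part of the original notebook), the fitted model should reproduce the noiseless column almost exactly, so its Mean Squared Error against “y_true” should be essentially 0 (up to floating-point error):

```python
# Expect a value extremely close to 0, since the fitted coefficients match the true ones.
mean_squared_error(df["y_true"], reg.predict(df[sub_cols]))
```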
Let’s try the same thing using a polynomial of degree 6.
sub_cols2 = cols[:6]
reg2 = LinearRegression()
reg2.fit(df[sub_cols2], df["y_true"])
LinearRegression()
Again, linear regression gives essentially perfect results in this case. It recovers the true coefficients for the degree 1 and degree 2 terms, and the higher-degree coefficients are essentially 0.
reg2.coef_
array([ 4.10000000e+00, 1.70000000e+00, -1.74892652e-13, -6.55378529e-15,
1.38777878e-16, 0.00000000e+00])
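To make the near-zero entries easier to read, one could round the coefficient array (a small aside, not part of the original notebook):

```python
# Rounding suppresses the tiny floating-point coefficients on the higher-degree terms.
reg2.coef_.round(8)
```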