Performance measures for machine learning

import pandas as pd
import altair as alt

Warm-up question

Here is a dataset with only 4 data points. Which of the following linear models better fits the data?

  • Line A: \(f(x) = 2x\)

  • Line B: \(g(x) = -2\)

Start out by plotting both of these lines, together with the data, on the domain \((-2,5)\).

df = pd.DataFrame({"x":range(4), "y":[0,2,-10,6]})

df
x y
0 0 0
1 1 2
2 2 -10
3 3 6
c = alt.Chart(df).mark_circle(size=100).encode(
    x="x",
    y="y"
)

c
df_line = pd.DataFrame({"x":[-2,5]})
df_lineA = df_line.copy()
df_lineB = df_line.copy()

We could define a function like the following (and this is the best way if the function has a complex definition).

def f(x):
    return 2*x
f(10)
20

For such a simple function, it is better to use a lambda function.

f = lambda x: 2*x
f(10)
20
type(f)
function

Now we do the same thing for our other model. This is a constant function that always outputs -2.

g = lambda x: -2
g(7)
-2
df_lineA
x
0 -2
1 5

An alternative to the following would be df_lineA["y"] = f(df_lineA["x"]), but I think the following way is more intuitive.

df_lineA["y"] = df_lineA["x"].map(f)
df_lineA
x y
0 -2 -4
1 5 10
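As a quick check, the vectorized alternative mentioned above gives the same values, because 2*x is applied elementwise when x is a pandas Series. (This uses only the f and df_lineA defined above.)

# should evaluate to True: both approaches produce the same y values
(df_lineA["x"].map(f) == f(df_lineA["x"])).all()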
df_lineB["y"] = df_lineB["x"].map(g)
df_lineB
x y
0 -2 -2
1 5 -2
c_lineA = alt.Chart(df_lineA).mark_line(color="red").encode(
    x="x",
    y="y"
)
c_lineB = alt.Chart(df_lineB).mark_line(color="black").encode(
    x="x",
    y="y"
)

Looking at the two models in the following picture, which do you think fits the data better?

c+c_lineA+c_lineB

Add predicted columns to df

df
x y
0 0 0
1 1 2
2 2 -10
3 3 6
df["PredA"] = df["x"].map(f)
df["PredB"] = df["x"].map(g)
df
x y PredA PredB
0 0 0 0 -2
1 1 2 2 -2
2 2 -10 4 -2
3 3 6 6 -2

It turns out that which model is better depends heavily on which performance measure you use. Here are two of the most famous: mean_squared_error (abbreviated MSE) and mean_absolute_error (abbreviated MAE). When scikit-learn performs linear regression, it minimizes the Mean Squared Error.
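In symbols, if the true values are \(y_1, \ldots, y_n\) and the predictions are \(\hat{y}_1, \ldots, \hat{y}_n\), then \(\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\) and \(\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|\).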

from sklearn.metrics import mean_squared_error, mean_absolute_error

Check that you can make the following sorts of computations by hand.

mean_squared_error(df["y"], df["PredA"])
49.0
mean_squared_error(df["y"], df["PredB"])
37.0
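For example, for Line A the only nonzero residual is at \(x = 2\), where the model predicts 4 but the data value is -10. Writing out the Mean Squared Error for Line A by hand:

# residuals for Line A are 0, 0, -14, 0; square them and average
(0**2 + 0**2 + (-14)**2 + 0**2) / 4
49.0

For Line B the residuals are 2, 4, -8, 8, so its MSE is \((4 + 16 + 64 + 64)/4 = 37\).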

With loss functions (also called cost functions or error functions; these all mean the same thing), smaller is better: a lower value means a more accurate model.

If we use the MSE metric, the horizontal black line is the better model. MSE punishes outliers heavily: because the errors are squared, Line A's single residual of 14 contributes 196 to the sum, which dominates everything else.

mean_absolute_error(df["y"], df["PredA"])
3.5
mean_absolute_error(df["y"], df["PredB"])
5.5
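The same kind of by-hand check works for the Mean Absolute Error: the absolute residuals for Line A are 0, 0, 14, 0, and for Line B they are 2, 4, 8, 8.

# MAE for Line A by hand: average of the absolute residuals
(0 + 0 + 14 + 0) / 4
3.5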

If we use the MAE metric, then the red line is the better model, since the Mean Absolute Error is lower for the “PredA” column.

Simulated data

The following data was simulated from the function \(f(x) = c_2 x^2 + c_1 x + c_0\), using the coefficients defined in the next cell. The exact outputs were computed (the y_true column), and then some random noise (following a normal distribution) was added to each of the points (the y column).

(You can see the code that simulated this data in the HelperNotebook.ipynb file. It includes some things we have done before, like adding new columns corresponding to powers of existing columns, and it also includes some things we haven’t discussed, like generating normally distributed random data using default_rng from NumPy.)
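Here is a rough sketch of what that simulation might look like. The sample size, the range of the x-values, the noise scale, and the seed are all guesses here; the actual values are in HelperNotebook.ipynb.

from numpy.random import default_rng

rng = default_rng()                            # the helper notebook may use a fixed seed
c0, c1, c2 = -23.4, 4.1, 1.7                   # the coefficients defined below
x = rng.uniform(-5, 7, size=50)                # hypothetical sample size and x-range
y_true = c2*x**2 + c1*x + c0                   # exact values on the parabola
y = y_true + rng.normal(0, 30, size=x.size)    # normal noise with a hypothetical scale
df_sim = pd.DataFrame({"x": x, "y_true": y_true, "y": y})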

df = pd.read_csv("../data/sim_data.csv")
c0 = -23.4
c1 = 4.1
c2 = 1.7
df.head()
x y_true y x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
0 -3.329208 -18.207589 -117.484900 -3.329208 11.083626 -36.899694 122.846756 -408.982395 1361.587441 -4533.007730 1.509133e+04 -5.024216e+04 1.672666e+05
1 6.465018 74.160562 73.954907 6.465018 41.796463 270.214901 1746.944309 11294.027098 73016.092970 472050.384357 3.051814e+06 1.973004e+07 1.275550e+08
2 -4.478046 -7.670062 -13.810089 -4.478046 20.052899 -89.797810 402.118751 -1800.706392 8063.646628 -36109.383086 1.616995e+05 -7.240978e+05 3.242544e+06
3 2.043272 -7.925152 19.461182 2.043272 4.174960 8.530580 17.430295 35.614834 72.770792 148.690523 3.038152e+02 6.207771e+02 1.268416e+03
4 4.850593 36.485466 22.375230 4.850593 23.528255 114.125996 553.578791 2685.185564 13024.743051 63177.731115 3.064495e+05 1.486462e+06 7.210222e+06

The true data lies perfectly on a parabola (from a degree 2 polynomial).

alt.Chart(df).mark_circle().encode(
    x="x",
    y="y_true",
    tooltip=["x"]
)

In the simulated data, we added some significant random noise, so the underlying parabola is much less apparent. This noisy data is stored in the “y” column.

alt.Chart(df).mark_circle().encode(
    x="x",
    y="y",
    tooltip = ["x","y","y_true"]
)

Different powers of \(x\) were also added as new columns: \(x^1\), \(x^2\), …, \(x^{10}\).

max_deg = 10
cols = [f"x{i}" for i in range(1, max_deg+1)]
cols
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
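(The power columns were already present in the CSV, but they could be reconstructed from the “x” column like this, which is presumably similar to what the helper notebook did.)

# build the columns x1, ..., x10 as powers of the x column
for i in range(1, max_deg+1):
    df[f"x{i}"] = df["x"]**i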

Let’s see if we can recover the secret coefficients c0, c1, and c2 using the “y_true” column. You shouldn’t be too impressed by this, because the “y_true” column holds the values without the random noise.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()

Let’s just use the first two columns (degree 1 and degree 2, no higher degree terms).

sub_cols = cols[:2]
sub_cols
['x1', 'x2']
reg.fit(df[sub_cols], df["y_true"])
LinearRegression()

We don’t even have to call reg.predict in this case, because we don’t care so much about the outputs; instead we care about the coefficients.

reg.coef_
array([4.1, 1.7])

These match the true coefficients exactly.

c1
4.1
c2
1.7

The same is true (up to some slight numerical precision issue) for the constant term.

reg.intercept_
-23.40000000000001
c0
-23.4

Let’s try the same thing using a polynomial of degree 6.

sub_cols2 = cols[:6]
reg2 = LinearRegression()
reg2.fit(df[sub_cols2], df["y_true"])
LinearRegression()

Again, linear regression gets essentially perfect results in this case: the degree 1 and degree 2 coefficients match the true values, and the higher-degree coefficients are numerically close to 0.

reg2.coef_
array([ 4.10000000e+00,  1.70000000e+00, -1.74892652e-13, -6.55378529e-15,
        1.38777878e-16,  0.00000000e+00])
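
To tie this back to the performance measures from earlier, we could also check that the degree 6 model has essentially zero error on the noiseless data (not exactly zero, because of floating-point round-off).

# should be essentially 0, since y_true lies exactly on a degree 2 polynomial
mean_squared_error(df["y_true"], reg2.predict(df[sub_cols2]))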