Performance measures for machine learning
import pandas as pd
import altair as alt
Warm-up question
Here is a dataset with only 4 data points. Which of the following linear models better fits the data?
Line A: \(f(x) = 2x\)
Line B: \(g(x) = -2\)
Start out by plotting both of these lines, together with the data, on the domain \((-2,5)\).
df = pd.DataFrame({"x":range(4), "y":[0,2,-10,6]})
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
c = alt.Chart(df).mark_circle(size=100).encode(
x="x",
y="y"
)
c
df_line = pd.DataFrame({"x":[-2,5]})
df_lineA = df_line.copy()
df_lineB = df_line.copy()
We could define a function like the following (and this is the best way if the function has a complex definition).
def f(x):
    return 2*x
f(10)
20
For such a simple function, it is better to use a lambda function.
f = lambda x: 2*x
f(10)
20
type(f)
function
Now we do the same thing for our other model. This is a constant function that always outputs -2.
g = lambda x: -2
g(7)
-2
df_lineA
|   | x |
|---|---|
| 0 | -2 |
| 1 | 5 |
An alternative to the following would be `df_lineA["y"] = f(df_lineA["x"])`, but I think the following way is more intuitive.
df_lineA["y"] = df_lineA["x"].map(f)
df_lineA
|   | x | y |
|---|---|---|
| 0 | -2 | -4 |
| 1 | 5 | 10 |
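As a quick check (a small aside, not part of the original notebook), the vectorized alternative mentioned above produces exactly the same column, because `2*x` broadcasts over a pandas Series:

```python
# Apply f directly to the Series instead of using .map.
alt_y = f(df_lineA["x"])
(alt_y == df_lineA["y"]).all()  # True
```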
df_lineB["y"] = df_lineB["x"].map(g)
df_lineB
|   | x | y |
|---|---|---|
| 0 | -2 | -2 |
| 1 | 5 | -2 |
c_lineA = alt.Chart(df_lineA).mark_line(color="red").encode(
x="x",
y="y"
)
c_lineB = alt.Chart(df_lineB).mark_line(color="black").encode(
x="x",
y="y"
)
Looking at the two models in the following picture, which do you think fits the data better?
c+c_lineA+c_lineB
Add predicted columns to `df`
df
|   | x | y |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | -10 |
| 3 | 3 | 6 |
df["PredA"] = df["x"].map(f)
df["PredB"] = df["x"].map(g)
df
|   | x | y | PredA | PredB |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -2 |
| 1 | 1 | 2 | 2 | -2 |
| 2 | 2 | -10 | 4 | -2 |
| 3 | 3 | 6 | 6 | -2 |
It turns out that which model is better depends heavily on what performance measure you use. Here are two of the most famous, `mean_squared_error` (abbreviated MSE) and `mean_absolute_error` (abbreviated MAE). When scikit-learn performs linear regression, it uses the Mean Squared Error.
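For reference, if \(y_i\) denotes the true values, \(\hat{y}_i\) the predicted values, and \(n\) the number of data points, these two measures are defined by

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|.
\]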
from sklearn.metrics import mean_squared_error, mean_absolute_error
Check that you can make the following sorts of computations by hand.
mean_squared_error(df["y"], df["PredA"])
49.0
mean_squared_error(df["y"], df["PredB"])
37.0
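Here is one way to make that by-hand check (a small sketch using the residuals read off from the table above):

```python
# Residuals (y - prediction) from the table above.
resid_A = [0 - 0, 2 - 2, -10 - 4, 6 - 6]              # [0, 0, -14, 0]
resid_B = [0 - (-2), 2 - (-2), -10 - (-2), 6 - (-2)]  # [2, 4, -8, 8]
print(sum(r**2 for r in resid_A) / 4)  # 196/4 = 49.0; the outlier at x=2 contributes everything
print(sum(r**2 for r in resid_B) / 4)  # (4 + 16 + 64 + 64)/4 = 37.0
```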
With loss functions (also called cost functions or error functions; these terms all mean the same thing), smaller is better, meaning more accurate.
If we use the MSE metric, the horizontal black line is the better model. Because the errors are squared, MSE punishes large errors (like Line A's miss at the outlier) heavily.
mean_absolute_error(df["y"], df["PredA"])
3.5
mean_absolute_error(df["y"], df["PredB"])
5.5
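The same by-hand check works for MAE (again a small sketch using the table above):

```python
# Absolute residuals |y - prediction| from the table above.
abs_A = [abs(0 - 0), abs(2 - 2), abs(-10 - 4), abs(6 - 6)]              # [0, 0, 14, 0]
abs_B = [abs(0 - (-2)), abs(2 - (-2)), abs(-10 - (-2)), abs(6 - (-2))]  # [2, 4, 8, 8]
print(sum(abs_A) / 4)  # 14/4 = 3.5
print(sum(abs_B) / 4)  # 22/4 = 5.5
```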
If we use the MAE metric, then the red line is the better model, since the Mean Absolute Error is lower for the “PredA” column.
Simulated data
The following data was simulated from the function \(f(x) = c_2 x^2 + c_1 x + c_0\), using the coefficients given below. The exact outputs were computed (in the `y_true` column), and then some random noise (following a normal distribution) was added to each point (in the `y` column).
(You can see the code that simulated this data in the `HelperNotebook.ipynb` file. It includes some things we have done before, like adding new columns corresponding to powers of existing columns, and it also includes some things we haven’t discussed, like generating normally distributed random data using `default_rng` from NumPy.)
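In case it helps, here is a minimal sketch of how data like this could be generated. This is not the actual code from `HelperNotebook.ipynb`; the sample size, the range of \(x\), the random seed, and the noise scale are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)                     # hypothetical seed
x = rng.uniform(-5, 7, size=100)                        # hypothetical range and sample size
y_true = 1.7 * x**2 + 4.1 * x - 23.4                    # c2*x^2 + c1*x + c0, the exact outputs
y = y_true + rng.normal(loc=0, scale=20, size=x.size)   # hypothetical noise scale

sim = pd.DataFrame({"x": x, "y_true": y_true, "y": y})
for i in range(1, 11):                                  # power columns x1, ..., x10
    sim[f"x{i}"] = x**i
```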
df = pd.read_csv("../data/sim_data.csv")
c0 = -23.4
c1 = 4.1
c2 = 1.7
df.head()
|   | x | y_true | y | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.329208 | -18.207589 | -117.484900 | -3.329208 | 11.083626 | -36.899694 | 122.846756 | -408.982395 | 1361.587441 | -4533.007730 | 1.509133e+04 | -5.024216e+04 | 1.672666e+05 |
| 1 | 6.465018 | 74.160562 | 73.954907 | 6.465018 | 41.796463 | 270.214901 | 1746.944309 | 11294.027098 | 73016.092970 | 472050.384357 | 3.051814e+06 | 1.973004e+07 | 1.275550e+08 |
| 2 | -4.478046 | -7.670062 | -13.810089 | -4.478046 | 20.052899 | -89.797810 | 402.118751 | -1800.706392 | 8063.646628 | -36109.383086 | 1.616995e+05 | -7.240978e+05 | 3.242544e+06 |
| 3 | 2.043272 | -7.925152 | 19.461182 | 2.043272 | 4.174960 | 8.530580 | 17.430295 | 35.614834 | 72.770792 | 148.690523 | 3.038152e+02 | 6.207771e+02 | 1.268416e+03 |
| 4 | 4.850593 | 36.485466 | 22.375230 | 4.850593 | 23.528255 | 114.125996 | 553.578791 | 2685.185564 | 13024.743051 | 63177.731115 | 3.064495e+05 | 1.486462e+06 | 7.210222e+06 |
The true data lies perfectly on a parabola (from a degree 2 polynomial).
alt.Chart(df).mark_circle().encode(
x="x",
y="y_true",
tooltip=["x"]
)
For the simulated data, we added some (significant) random noise, so the underlying parabola is much less apparent. This noisy data is stored in the “y” column.
alt.Chart(df).mark_circle().encode(
x="x",
y="y",
tooltip = ["x","y","y_true"]
)
Different powers of \(x\) were also added as new columns: \(x^1\), \(x^2\), …, \(x^{10}\).
max_deg = 10
cols = [f"x{i}" for i in range(1, max_deg+1)]
cols
['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10']
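As an aside, scikit-learn can also generate this kind of power column automatically. Here is a minimal sketch using `PolynomialFeatures` (this is not necessarily how the helper notebook built the columns):

```python
from sklearn.preprocessing import PolynomialFeatures

# Build x^1 through x^10 from the single "x" column;
# include_bias=False omits the constant x^0 column.
poly = PolynomialFeatures(degree=max_deg, include_bias=False)
powers = poly.fit_transform(df[["x"]])
powers.shape  # (number of rows, 10)
```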
Let’s see if we can recover the secret coefficients `c0`, `c1`, and `c2` using the “y_true” column. You shouldn’t be too impressed by this, because the “y_true” column holds the values without the random noise.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
Let’s just use the first two columns (degree 1 and degree 2, no higher degree terms).
sub_cols = cols[:2]
sub_cols
['x1', 'x2']
reg.fit(df[sub_cols], df["y_true"])
LinearRegression()
We don’t even have to call `reg.predict` in this case, because we don’t care so much about the outputs; instead we care about the coefficients.
reg.coef_
array([4.1, 1.7])
These match the true coefficients exactly.
c1
4.1
c2
1.7
The same is true (up to a slight numerical precision issue) for the constant term.
reg.intercept_
-23.40000000000001
c0
-23.4
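As one more check (a small sketch, not part of the original notebook), the fitted model should reproduce the noiseless column almost exactly, so its Mean Squared Error against “y_true” should be essentially 0 (up to floating-point error):

```python
# Expect a value extremely close to 0, since the fitted coefficients match the true ones.
mean_squared_error(df["y_true"], reg.predict(df[sub_cols]))
```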
Let’s try the same thing using a polynomial of degree 6.
sub_cols2 = cols[:6]
reg2 = LinearRegression()
reg2.fit(df[sub_cols2], df["y_true"])
LinearRegression()
Again, linear regression gives essentially perfect results in this case. It recovers the true coefficients for the degree 1 and degree 2 terms, and the higher-degree coefficients are essentially 0.
reg2.coef_
array([ 4.10000000e+00, 1.70000000e+00, -1.74892652e-13, -6.55378529e-15,
1.38777878e-16, 0.00000000e+00])
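To make the near-zero entries easier to read, one could round the coefficient array (a small aside, not part of the original notebook):

```python
# Rounding suppresses the tiny floating-point coefficients on the higher-degree terms.
reg2.coef_.round(8)
```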