Introduction to scikit-learn

A recording is available on YuJa

Before that video (and this notebook) we spent about 15 minutes introducing the Machine Learning portion of Math 10. The most important concept covered was that of a cost function (also called a loss function), which can be used to measure the performance of a model. For example, when deciding which line (or plane, or …) best fits data using linear regression, the word best means the equation that minimizes the cost function. Natural choices of cost function for linear regression are Mean Squared Error (MSE) and Mean Absolute Error (MAE).
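As a small illustration of those two cost functions, here is a sketch computing MSE and MAE by hand with NumPy. The numbers are made up for the example; they are not from the Spotify data.

```python
import numpy as np

# hypothetical true values and model predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.5, 3.8])

mse = ((y_true - y_pred) ** 2).mean()  # Mean Squared Error
mae = np.abs(y_true - y_pred).mean()   # Mean Absolute Error
print(mse, mae)
```

Both measure how far the predictions are from the truth; MSE penalizes large errors more heavily because of the squaring.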

import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
alt.Chart(df).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

If that syntax looks strange, this comparison may make it look more familiar.

from numpy.random import default_rng
# vs our usual rng = np.random.default_rng()
rng = default_rng()

The following is a very common error. The problem is that df["Acousticness"] is one-dimensional, and scikit-learn expects the first input to fit to be two-dimensional. (It's fine, and probably required here, for the second input, df["Energy"], to be one-dimensional.)

reg.fit(df["Acousticness"], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/897263380.py in <module>
----> 1 reg.fit(df["Acousticness"], df["Energy"])

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    660         accept_sparse = False if self.positive else ["csr", "csc", "coo"]
    661 
--> 662         X, y = self._validate_data(
    663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    962         raise ValueError("y cannot be None")
    963 
--> 964     X = check_array(
    965         X,
    966         accept_sparse=accept_sparse,

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    767             # If input is 1D raise error
    768             if array.ndim == 1:
--> 769                 raise ValueError(
    770                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    771                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
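To see the dimension issue concretely, here is a sketch with a toy DataFrame (hypothetical numbers, not the Spotify file): single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame.

```python
import pandas as pd

# toy DataFrame standing in for the Spotify data
toy = pd.DataFrame({"Acousticness": [0.1, 0.2, 0.3], "Energy": [0.9, 0.8, 0.7]})

print(toy["Acousticness"].shape)    # 1-dimensional Series
print(toy[["Acousticness"]].shape)  # 2-dimensional DataFrame, one column
```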
# Double brackets say to return a DataFrame, not a Series
df[["Acousticness"]]
Acousticness
0 0.12700
1 0.03830
2 0.33500
3 0.04690
4 0.02030
... ...
1551 0.00261
1552 0.24000
1553 0.18400
1554 0.24900
1555 0.43300

1556 rows × 1 columns

df[["Acousticness","Song Name", "Artist"]]
Acousticness Song Name Artist
0 0.12700 Beggin' Måneskin
1 0.03830 STAY (with Justin Bieber) The Kid LAROI
2 0.33500 good 4 u Olivia Rodrigo
3 0.04690 Bad Habits Ed Sheeran
4 0.02030 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X
... ... ... ...
1551 0.00261 New Rules Dua Lipa
1552 0.24000 Cheirosa - Ao Vivo Jorge & Mateus
1553 0.18400 Havana (feat. Young Thug) Camila Cabello
1554 0.24900 Surtada - Remix Brega Funk Dadá Boladão, Tati Zaqui, OIK
1555 0.43300 Lover (Remix) [feat. Shawn Mendes] Taylor Swift

1556 rows × 3 columns

Another common error: not removing missing values before calling fit.

reg.fit(df[["Acousticness"]], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/3812016578.py in <module>
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    660         accept_sparse = False if self.positive else ["csr", "csc", "coo"]
    661 
--> 662         X, y = self._validate_data(
    663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    962         raise ValueError("y cannot be None")
    963 
--> 964     X = check_array(
    965         X,
    966         accept_sparse=accept_sparse,

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    798 
    799         if force_all_finite:
--> 800             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    801 
    802     if ensure_min_samples > 0:

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
df_clean = df[~df.isna().any(axis=1)]
df_clean.shape
(1545, 23)
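The same cleaning can also be done with the built-in dropna method. A sketch on toy data (hypothetical, not the Spotify file) showing the two approaches agree:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

kept = toy[~toy.isna().any(axis=1)]  # the approach used above
same = toy.dropna()                  # built-in shortcut: drop rows with any NaN

print(kept.equals(same))  # True
```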
reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])
LinearRegression()

Compare with the scatter plot above: this coefficient (the slope) and this intercept should look very reasonable.

reg.coef_
array([-0.35010056])
reg.intercept_
0.7205632317364903
reg.predict(df_clean[["Acousticness"]])
array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
       0.56896969])
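Under the hood, predict is just computing coef_ * x + intercept_ for each row. A sketch verifying that by hand on toy data (hypothetical numbers, not the fitted Spotify model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

toy_reg = LinearRegression().fit(x, y)

# recompute the predictions by hand from the fitted slope and intercept
manual = toy_reg.coef_[0] * x.ravel() + toy_reg.intercept_
print(np.allclose(manual, toy_reg.predict(x)))  # True
```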
# could fix this warning using .copy() earlier
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/2402964677.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
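As the comment above hints, the warning goes away if we make an explicit copy when creating the cleaned DataFrame. A sketch of that pattern on toy data (not the Spotify file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# .copy() makes a brand-new DataFrame, so assigning a column later
# does not modify (or warn about modifying) a slice of the original
toy_clean = toy[~toy.isna().any(axis=1)].copy()
toy_clean["pred"] = 2 * toy_clean["x"]
print(toy_clean)
```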
df_clean
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord pred
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.800 -4.808 0.0504 0.12700 0.3590 134.002 211560.0 0.589 B 0.676100
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db 0.707154
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A 0.603280
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B 0.704144
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb 0.713456
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A 0.719649
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B 0.636539
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D 0.656145
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F 0.633388
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G 0.568970

1545 rows × 24 columns

c1 = alt.Chart(df_clean).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)

This is our first time using mark_line, which is used to draw line charts (like the default plot in MATLAB). This approach is a little inefficient, because it uses all 1545 rows of df_clean to draw what is a single straight line; two points would suffice.

c2 = alt.Chart(df_clean).mark_line(color="red").encode(
    x = "Acousticness",
    y = "pred"
)

Unlike c1|c2, which places the charts side by side, c1+c2 layers the charts on top of each other.

c1+c2
df_clean.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord', 'pred'],
      dtype='object')

Linear regression works basically the same with multiple input variables.

reg2 = LinearRegression()
df_clean[["Acousticness","Speechiness","Valence"]]
Acousticness Speechiness Valence
0 0.12700 0.0504 0.589
1 0.03830 0.0483 0.478
2 0.33500 0.1540 0.688
3 0.04690 0.0348 0.591
4 0.02030 0.0615 0.894
... ... ... ...
1551 0.00261 0.0694 0.608
1552 0.24000 0.0851 0.714
1553 0.18400 0.0300 0.394
1554 0.24900 0.0587 0.881
1555 0.43300 0.0640 0.422

1545 rows × 3 columns

reg2.fit(df_clean[["Acousticness","Speechiness","Valence"]], df_clean["Energy"])
LinearRegression()
reg2.coef_
array([-0.33557117, -0.08205736,  0.21893852])
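With multiple inputs, predict computes the dot product of each row with coef_ and then adds intercept_. A sketch checking that on toy data (random numbers, not the Spotify model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0  # exact linear relationship

toy_reg = LinearRegression().fit(X, y)

# row-by-row dot product with the coefficients, plus the intercept
manual = X @ toy_reg.coef_ + toy_reg.intercept_
print(np.allclose(manual, toy_reg.predict(X)))  # True
```

Because the toy data is exactly linear, the fit also recovers the coefficients (1, -2, 0.5) essentially exactly.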

Interpretability

Linear regression is not the fanciest machine learning model, but it is probably the most interpretable one. For example, the coefficients above suggest that higher Acousticness correlates with less Energy, while higher Valence correlates with more Energy. (I'm not using "correlates" in a technical sense.)

reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]])
array([0.69660978, 0.70224509, 0.63998475, ..., 0.63646318, 0.71891907,
       0.55624629])
from sklearn.metrics import mean_squared_error
mean_squared_error(df_clean["Energy"],df_clean["pred"])
0.01841456463873169
((df_clean["Energy"] - df_clean["pred"])**2).mean()
0.01841456463873169
mean_squared_error(reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]]), df_clean["Energy"])
0.015904580028495662
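We described MAE above as the other natural cost function for linear regression; scikit-learn provides it as mean_absolute_error, with the same call pattern as mean_squared_error. A sketch on toy numbers (not the Spotify data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# average of the absolute errors |0.5|, |0|, |1|
print(mean_absolute_error(y_true, y_pred))  # 0.5
```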