Introduction to scikit-learn

A recording is available on YuJa

Before that video (and this notebook) we spent about 15 minutes introducing the Machine Learning portion of Math 10. The most important concept covered was that of a cost function (also called a loss function), which can be used to measure the performance of a model. For example, when deciding which line (or plane, or …) best fits data using linear regression, the word best means the equation that minimizes the cost function. Natural choices of cost function for linear regression are Mean Squared Error (MSE) and Mean Absolute Error (MAE).
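As a small illustration of those two cost functions, here is a sketch computing MSE and MAE by hand with NumPy. The numbers are made up for the example; they are not from the Spotify data.

```python
import numpy as np

# hypothetical true values and model predictions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.5, 3.8])

mse = ((y_true - y_pred) ** 2).mean()  # Mean Squared Error
mae = np.abs(y_true - y_pred).mean()   # Mean Absolute Error
print(mse, mae)
```

Both measure how far the predictions are from the truth; MSE penalizes large errors more heavily because of the squaring.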

import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
alt.Chart(df).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()

If that syntax looks strange, this comparison may make it look more familiar.

from numpy.random import default_rng
# vs our usual rng = np.random.default_rng()
rng = default_rng()

The following is a very common error. The problem is that df["Acousticness"] is one-dimensional, and scikit-learn expects the first input to fit to be two-dimensional. (It's fine, and probably required here, for the second input, df["Energy"], to be one-dimensional.)

reg.fit(df["Acousticness"], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/897263380.py in <module>
----> 1 reg.fit(df["Acousticness"], df["Energy"])

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    660         accept_sparse = False if self.positive else ["csr", "csc", "coo"]
    661 
--> 662         X, y = self._validate_data(
    663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    962         raise ValueError("y cannot be None")
    963 
--> 964     X = check_array(
    965         X,
    966         accept_sparse=accept_sparse,

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    767             # If input is 1D raise error
    768             if array.ndim == 1:
--> 769                 raise ValueError(
    770                     "Expected 2D array, got 1D array instead:\narray={}.\n"
    771                     "Reshape your data either using array.reshape(-1, 1) if "

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
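To see the dimension issue concretely, here is a sketch with a toy DataFrame (hypothetical numbers, not the Spotify file): single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame.

```python
import pandas as pd

# toy DataFrame standing in for the Spotify data
toy = pd.DataFrame({"Acousticness": [0.1, 0.2, 0.3], "Energy": [0.9, 0.8, 0.7]})

print(toy["Acousticness"].shape)    # 1-dimensional Series
print(toy[["Acousticness"]].shape)  # 2-dimensional DataFrame, one column
```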
# Double brackets say to return a DataFrame, not a Series
df[["Acousticness"]]
Acousticness
0 0.12700
1 0.03830
2 0.33500
3 0.04690
4 0.02030
... ...
1551 0.00261
1552 0.24000
1553 0.18400
1554 0.24900
1555 0.43300

1556 rows × 1 columns

df[["Acousticness","Song Name", "Artist"]]
Acousticness Song Name Artist
0 0.12700 Beggin' Måneskin
1 0.03830 STAY (with Justin Bieber) The Kid LAROI
2 0.33500 good 4 u Olivia Rodrigo
3 0.04690 Bad Habits Ed Sheeran
4 0.02030 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X
... ... ... ...
1551 0.00261 New Rules Dua Lipa
1552 0.24000 Cheirosa - Ao Vivo Jorge & Mateus
1553 0.18400 Havana (feat. Young Thug) Camila Cabello
1554 0.24900 Surtada - Remix Brega Funk Dadá Boladão, Tati Zaqui, OIK
1555 0.43300 Lover (Remix) [feat. Shawn Mendes] Taylor Swift

1556 rows × 3 columns

Another common error: not removing missing values before calling fit.

reg.fit(df[["Acousticness"]], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/3812016578.py in <module>
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    660         accept_sparse = False if self.positive else ["csr", "csc", "coo"]
    661 
--> 662         X, y = self._validate_data(
    663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    962         raise ValueError("y cannot be None")
    963 
--> 964     X = check_array(
    965         X,
    966         accept_sparse=accept_sparse,

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    798 
    799         if force_all_finite:
--> 800             _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
    801 
    802     if ensure_min_samples > 0:

~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
    112         ):
    113             type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114             raise ValueError(
    115                 msg_err.format(
    116                     type_err, msg_dtype if msg_dtype is not None else X.dtype

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
df_clean = df[~df.isna().any(axis=1)]
df_clean.shape
(1545, 23)
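The same cleaning can also be done with the built-in dropna method. A sketch on toy data (hypothetical, not the Spotify file) showing the two approaches agree:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

kept = toy[~toy.isna().any(axis=1)]  # the approach used above
same = toy.dropna()                  # built-in shortcut: drop rows with any NaN

print(kept.equals(same))  # True
```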
reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])
LinearRegression()

Compare with the scatter plot above: this coefficient (the slope) and this intercept should look very reasonable.

reg.coef_
array([-0.35010056])
reg.intercept_
0.7205632317364903
reg.predict(df_clean[["Acousticness"]])
array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
       0.56896969])
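Under the hood, predict is just computing coef_ * x + intercept_ for each row. A sketch verifying that by hand on toy data (hypothetical numbers, not the fitted Spotify model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])  # exactly y = 2x + 1

toy_reg = LinearRegression().fit(x, y)

# recompute the predictions by hand from the fitted slope and intercept
manual = toy_reg.coef_[0] * x.ravel() + toy_reg.intercept_
print(np.allclose(manual, toy_reg.predict(x)))  # True
```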
# could fix this warning using .copy() earlier
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/2402964677.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
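As the comment above hints, the warning goes away if we make an explicit copy when creating the cleaned DataFrame. A sketch of that pattern on toy data (not the Spotify file):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

# .copy() makes a brand-new DataFrame, so assigning a column later
# does not modify (or warn about modifying) a slice of the original
toy_clean = toy[~toy.isna().any(axis=1)].copy()
toy_clean["pred"] = 2 * toy_clean["x"]
print(toy_clean)
```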
df_clean
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord pred
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.800 -4.808 0.0504 0.12700 0.3590 134.002 211560.0 0.589 B 0.676100
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db 0.707154
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A 0.603280
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B 0.704144
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb 0.713456
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A 0.719649
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B 0.636539
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D 0.656145
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F 0.633388
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G 0.568970

1545 rows × 24 columns

c1 = alt.Chart(df_clean).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)

This is our first time using mark_line, which is used to draw line charts (like the default plot in MATLAB). This approach is a little inefficient, because it uses all 1545 rows of df_clean to draw what is a single straight line; two points would suffice.

c2 = alt.Chart(df_clean).mark_line(color="red").encode(
    x = "Acousticness",
    y = "pred"
)

Unlike c1|c2, which places the charts side by side, c1+c2 layers the charts on top of each other.

c1+c2
df_clean.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
       'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
       'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
       'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
       'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
       'Chord', 'pred'],
      dtype='object')

Linear regression works basically the same with multiple input variables.

reg2 = LinearRegression()
df_clean[["Acousticness","Speechiness","Valence"]]
Acousticness Speechiness Valence
0 0.12700 0.0504 0.589
1 0.03830 0.0483 0.478
2 0.33500 0.1540 0.688
3 0.04690 0.0348 0.591
4 0.02030 0.0615 0.894
... ... ... ...
1551 0.00261 0.0694 0.608
1552 0.24000 0.0851 0.714
1553 0.18400 0.0300 0.394
1554 0.24900 0.0587 0.881
1555 0.43300 0.0640 0.422

1545 rows × 3 columns

reg2.fit(df_clean[["Acousticness","Speechiness","Valence"]], df_clean["Energy"])
LinearRegression()
reg2.coef_
array([-0.33557117, -0.08205736,  0.21893852])
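With multiple inputs, predict computes the dot product of each row with coef_ and then adds intercept_. A sketch checking that on toy data (random numbers, not the Spotify model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0  # exact linear relationship

toy_reg = LinearRegression().fit(X, y)

# row-by-row dot product with the coefficients, plus the intercept
manual = X @ toy_reg.coef_ + toy_reg.intercept_
print(np.allclose(manual, toy_reg.predict(X)))  # True
```

Because the toy data is exactly linear, the fit also recovers the coefficients (1, -2, 0.5) essentially exactly.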

Interpretability

Linear regression is not the fanciest machine learning model, but it is probably the most interpretable one. For example, the coefficients above suggest that higher Acousticness correlates with less Energy, while higher Valence correlates with more Energy. (I'm not using "correlates" in a technical sense.)

reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]])
array([0.69660978, 0.70224509, 0.63998475, ..., 0.63646318, 0.71891907,
       0.55624629])
from sklearn.metrics import mean_squared_error
mean_squared_error(df_clean["Energy"],df_clean["pred"])
0.01841456463873169
((df_clean["Energy"] - df_clean["pred"])**2).mean()
0.01841456463873169
mean_squared_error(reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]]), df_clean["Energy"])
0.015904580028495662
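We described MAE above as the other natural cost function for linear regression; scikit-learn provides it as mean_absolute_error, with the same call pattern as mean_squared_error. A sketch on toy numbers (not the Spotify data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

# average of the absolute errors |0.5|, |0|, |1|
print(mean_absolute_error(y_true, y_pred))  # 0.5
```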