Introduction to scikit-learn
A recording is available on YuJa
Before that video (and this notebook) we spent about 15 minutes introducing the Machine Learning portion of Math 10. The most important concept covered was that of a cost function (or loss function), which measures the performance of a model. For example, when deciding which line (or plane, or …) best fits data using linear regression, "best" means the equation that minimizes the cost function. Natural choices of cost function for linear regression are Mean Squared Error (MSE) and Mean Absolute Error (MAE).
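Both cost functions are easy to compute directly. Here is a minimal sketch with made-up numbers (nothing below comes from the Spotify dataset):

```python
import numpy as np

# Hypothetical true values and predictions, just to illustrate the two cost functions
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

mse = ((y_true - y_pred) ** 2).mean()  # Mean Squared Error
mae = np.abs(y_true - y_pred).mean()   # Mean Absolute Error
print(mse, mae)  # 0.25 0.5
```

MSE punishes large errors more heavily (because of the squaring), while MAE treats all errors proportionally.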
import numpy as np
import pandas as pd
import altair as alt
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ")
alt.Chart(df).mark_circle().encode(
x = "Acousticness",
y = "Energy"
)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
If that syntax looks strange, maybe this will make it look more familiar.
from numpy.random import default_rng
# vs our usual rng = np.random.default_rng()
rng = default_rng()
The following is a very common error. The problem is that df["Acousticness"] is one-dimensional, and scikit-learn wants something two-dimensional. (It's fine, and maybe required, for the second input, df["Energy"], to be one-dimensional.)
reg.fit(df["Acousticness"], df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/897263380.py in <module>
----> 1 reg.fit(df["Acousticness"], df["Energy"])
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
660 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
661
--> 662 X, y = self._validate_data(
663 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
664 )
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
579 y = check_array(y, **check_y_params)
580 else:
--> 581 X, y = check_X_y(X, y, **check_params)
582 out = X, y
583
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
962 raise ValueError("y cannot be None")
963
--> 964 X = check_array(
965 X,
966 accept_sparse=accept_sparse,
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
767 # If input is 1D raise error
768 if array.ndim == 1:
--> 769 raise ValueError(
770 "Expected 2D array, got 1D array instead:\narray={}.\n"
771 "Reshape your data either using array.reshape(-1, 1) if "
ValueError: Expected 2D array, got 1D array instead:
array=[0.127 0.0383 0.335 ... 0.184 0.249 0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
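The error message itself suggests the fix for a raw NumPy array: reshape(-1, 1) turns a one-dimensional array into a single column. A quick sketch with made-up values:

```python
import numpy as np

x = np.array([0.127, 0.0383, 0.335])
print(x.shape)  # (3,) -- one-dimensional

X = x.reshape(-1, 1)  # one column, -1 means "as many rows as needed"
print(X.shape)  # (3, 1) -- two-dimensional
```

For a pandas Series there is an easier option: use double brackets to get a one-column DataFrame instead.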
# Double brackets say to return a DataFrame, not a Series
df[["Acousticness"]]
| | Acousticness |
|---|---|
| 0 | 0.12700 |
| 1 | 0.03830 |
| 2 | 0.33500 |
| 3 | 0.04690 |
| 4 | 0.02030 |
| ... | ... |
| 1551 | 0.00261 |
| 1552 | 0.24000 |
| 1553 | 0.18400 |
| 1554 | 0.24900 |
| 1555 | 0.43300 |

1556 rows × 1 columns
df[["Acousticness","Song Name", "Artist"]]
| | Acousticness | Song Name | Artist |
|---|---|---|---|
| 0 | 0.12700 | Beggin' | Måneskin |
| 1 | 0.03830 | STAY (with Justin Bieber) | The Kid LAROI |
| 2 | 0.33500 | good 4 u | Olivia Rodrigo |
| 3 | 0.04690 | Bad Habits | Ed Sheeran |
| 4 | 0.02030 | INDUSTRY BABY (feat. Jack Harlow) | Lil Nas X |
| ... | ... | ... | ... |
| 1551 | 0.00261 | New Rules | Dua Lipa |
| 1552 | 0.24000 | Cheirosa - Ao Vivo | Jorge & Mateus |
| 1553 | 0.18400 | Havana (feat. Young Thug) | Camila Cabello |
| 1554 | 0.24900 | Surtada - Remix Brega Funk | Dadá Boladão, Tati Zaqui, OIK |
| 1555 | 0.43300 | Lover (Remix) [feat. Shawn Mendes] | Taylor Swift |

1556 rows × 3 columns
Another common error: not removing missing values first.
reg.fit(df[["Acousticness"]], df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/3812016578.py in <module>
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
660 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
661
--> 662 X, y = self._validate_data(
663 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
664 )
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
579 y = check_array(y, **check_y_params)
580 else:
--> 581 X, y = check_X_y(X, y, **check_params)
582 out = X, y
583
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
962 raise ValueError("y cannot be None")
963
--> 964 X = check_array(
965 X,
966 accept_sparse=accept_sparse,
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
798
799 if force_all_finite:
--> 800 _assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
801
802 if ensure_min_samples > 0:
~/miniconda3/envs/math11/lib/python3.9/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
112 ):
113 type_err = "infinity" if allow_nan else "NaN, infinity"
--> 114 raise ValueError(
115 msg_err.format(
116 type_err, msg_dtype if msg_dtype is not None else X.dtype
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
df_clean = df[~df.isna().any(axis=1)]
df_clean.shape
(1545, 23)
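As an aside, pandas has a built-in shortcut that should produce the same result as the boolean-mask approach above. A sketch on a tiny made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

kept_mask = df[~df.isna().any(axis=1)]  # same pattern as above
kept_dropna = df.dropna()               # built-in equivalent

print(kept_mask.equals(kept_dropna))  # True
```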
reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])
LinearRegression()
Look at the scatter plot above. This coefficient and this intercept should look very reasonable.
reg.coef_
array([-0.35010056])
reg.intercept_
0.7205632317364903
reg.predict(df_clean[["Acousticness"]])
array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
0.56896969])
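Under the hood, predict is just coef_ * x + intercept_. Here is a sketch verifying that on synthetic data (the exact Spotify numbers depend on the dataset, so made-up data is used instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((50, 1))          # one input feature, 50 samples
y = 2.0 * X[:, 0] + 1.0          # exact linear relationship

reg = LinearRegression().fit(X, y)

# Recompute the predictions by hand from the fitted coefficient and intercept
manual = reg.coef_[0] * X[:, 0] + reg.intercept_
print(np.allclose(manual, reg.predict(X)))  # True
```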
# could fix this warning using .copy() earlier
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_5535/2402964677.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
df_clean
| | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B | 0.676100 |
| 1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db | 0.707154 |
| 2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A | 0.603280 |
| 3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B | 0.704144 |
| 4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb | 0.713456 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A | 0.719649 |
| 1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B | 0.636539 |
| 1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D | 0.656145 |
| 1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F | 0.633388 |
| 1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G | 0.568970 |

1545 rows × 24 columns
c1 = alt.Chart(df_clean).mark_circle().encode(
x = "Acousticness",
y = "Energy"
)
This is our first time using mark_line, which is used to draw line charts (like the default plot in MATLAB). This method is probably a little inefficient, because it uses all 1545 rows of df_clean to draw a straight line.
c2 = alt.Chart(df_clean).mark_line(color="red").encode(
x = "Acousticness",
y = "pred"
)
Unlike c1|c2, which puts the charts side-by-side, c1+c2 layers the charts on top of each other.
c1+c2
df_clean.columns
Index(['Index', 'Highest Charting Position', 'Number of Times Charted',
'Week of Highest Charting', 'Song Name', 'Streams', 'Artist',
'Artist Followers', 'Song ID', 'Genre', 'Release Date', 'Weeks Charted',
'Popularity', 'Danceability', 'Energy', 'Loudness', 'Speechiness',
'Acousticness', 'Liveness', 'Tempo', 'Duration (ms)', 'Valence',
'Chord', 'pred'],
dtype='object')
Linear regression works basically the same with multiple input variables.
reg2 = LinearRegression()
df_clean[["Acousticness","Speechiness","Valence"]]
| | Acousticness | Speechiness | Valence |
|---|---|---|---|
| 0 | 0.12700 | 0.0504 | 0.589 |
| 1 | 0.03830 | 0.0483 | 0.478 |
| 2 | 0.33500 | 0.1540 | 0.688 |
| 3 | 0.04690 | 0.0348 | 0.591 |
| 4 | 0.02030 | 0.0615 | 0.894 |
| ... | ... | ... | ... |
| 1551 | 0.00261 | 0.0694 | 0.608 |
| 1552 | 0.24000 | 0.0851 | 0.714 |
| 1553 | 0.18400 | 0.0300 | 0.394 |
| 1554 | 0.24900 | 0.0587 | 0.881 |
| 1555 | 0.43300 | 0.0640 | 0.422 |

1545 rows × 3 columns
reg2.fit(df_clean[["Acousticness","Speechiness","Valence"]], df_clean["Energy"])
LinearRegression()
reg2.coef_
array([-0.33557117, -0.08205736, 0.21893852])
Interpretability
Linear regression is not the fanciest machine learning model, but it is probably the most interpretable. For example, the coefficients above suggest that Acousticness correlates with less energy, while Valence correlates with more energy. (I'm not using "correlates" in a technical sense.)
reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]])
array([0.69660978, 0.70224509, 0.63998475, ..., 0.63646318, 0.71891907,
0.55624629])
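With several input features, the prediction is a matrix product: X @ coef_ plus intercept_. A sketch on synthetic three-feature data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((40, 3))                    # three input features, like the cell above
y = X @ np.array([-0.3, -0.1, 0.2]) + 0.7  # exact linear relationship

reg2 = LinearRegression().fit(X, y)

# The single-feature formula coef_ * x + intercept_ generalizes to a matrix product
manual = X @ reg2.coef_ + reg2.intercept_
print(np.allclose(manual, reg2.predict(X)))  # True
```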
from sklearn.metrics import mean_squared_error
mean_squared_error(df_clean["Energy"],df_clean["pred"])
0.01841456463873169
((df_clean["Energy"] - df_clean["pred"])**2).mean()
0.01841456463873169
mean_squared_error(reg2.predict(df_clean[["Acousticness","Speechiness","Valence"]]), df_clean["Energy"])
0.015904580028495662
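MAE, mentioned at the top of the notebook, has an analogous helper, sklearn.metrics.mean_absolute_error. A sketch with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 1.5, 3.5, 3.5])

print(mean_squared_error(y_true, y_pred))   # 0.25
print(mean_absolute_error(y_true, y_pred))  # 0.5
```

Note that both metrics are symmetric in their two arguments, which is why swapping the order (as in the reg2 cell above) gives the same answer.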