Week 5 Thursday

import numpy as np
import pandas as pd
import altair as alt

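Here is an elegant way to let pandas know how the missing values are represented in our dataset: pass na_values=" " to pd.read_csv, so that entries consisting of a single space are treated as missing while the file is read. This is much more efficient than replacing them afterwards using applymap and a lambda function.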

df = pd.read_csv("spotify_dataset.csv", na_values=" ")

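Notice how the numeric columns now have numeric data types (we didn’t need to use pd.to_numeric). The Streams column is still object because its values contain comma separators, such as 48,633,449.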

df.dtypes

Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
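The effect of na_values can be seen on a tiny in-memory CSV. This is only a sketch: the three-row file and the names csv_text, raw and parsed are made up for illustration, and a single space is assumed to mark a missing value, as in the call above.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; the " " in row B stands in for a missing value.
csv_text = "Song,Energy\nA,0.8\nB, \nC,0.5\n"

# Without na_values, the " " entry keeps the whole column non-numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the entry is read as NaN and the column is numeric.
parsed = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["Energy"].dtype)
print(parsed["Energy"].dtype)
print(parsed["Energy"].isna().sum())  # number of missing entries
```

With the column already numeric, no later conversion step is needed.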

Plot Acousticness vs Energy.

alt.Chart(df).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)

If we fit a linear model to this data (with “Acousticness” as our only predictor and with “Energy” as our target), would you expect the coefficient to be positive or negative? (You can use either the graph or intuition. Think of “Acousticness” as the opposite of, for example, electric guitars.)

Check your answer using the LinearRegression class from scikit-learn.

from sklearn.linear_model import LinearRegression

reg = LinearRegression()

The following is a very common error. The problem is that df["Acousticness"] is one-dimensional, and scikit-learn wants something two-dimensional for the input features. (It’s fine for the second input, df["Energy"], to be one-dimensional.)

reg.fit(df["Acousticness"], df["Energy"])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [8], line 1
----> 1 reg.fit(df["Acousticness"], df["Energy"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
n_jobs_ = self.n_jobs
accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
   X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
)
sample_weight = _check_sample_weight(
   sample_weight, X, dtype=X.dtype, only_non_negative=True
)
X, y, X_offset, y_offset, X_scale = _preprocess_data(
   X,
   y,
   (...)
   sample_weight=sample_weight,
)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
       y = check_array(y, input_name="y", **check_y_params)
   else:
--> 596         X, y = check_X_y(X, y, **check_params)
   out = X, y
if not no_val_X and check_params.get("ensure_2d", True):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
       estimator_name = _check_estimator_name(estimator)
   raise ValueError(
       f"{estimator_name} requires y to be passed, but the target y is None"
   )
-> 1074 X = check_array(
   X,
   accept_sparse=accept_sparse,
   accept_large_sparse=accept_large_sparse,
   dtype=dtype,
   order=order,
   copy=copy,
   force_all_finite=force_all_finite,
   ensure_2d=ensure_2d,
   allow_nd=allow_nd,
   ensure_min_samples=ensure_min_samples,
   ensure_min_features=ensure_min_features,
   estimator=estimator,
   input_name="X",
)
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
check_consistent_length(X, y)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
   # If input is 1D raise error
   if array.ndim == 1:
--> 879         raise ValueError(
           "Expected 2D array, got 1D array instead:\narray={}.\n"
           "Reshape your data either using array.reshape(-1, 1) if "
           "your data has a single feature or array.reshape(1, -1) "
           "if it contains a single sample.".format(array)
       )
if dtype_numeric and array.dtype.kind in "USV":
   raise ValueError(
       "dtype='numeric' is not compatible with arrays of bytes/strings."
       "Convert your data to numeric values explicitly instead."
   )

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

df["Acousticness"]

0       0.12700
1       0.03830
2       0.33500
3       0.04690
4       0.02030
         ...   
1551    0.00261
1552    0.24000
1553    0.18400
1554    0.24900
1555    0.43300
Name: Acousticness, Length: 1556, dtype: float64

df[["Acousticness"]]

	Acousticness
0	0.12700
1	0.03830
2	0.33500
3	0.04690
4	0.02030
...	...
1551	0.00261
1552	0.24000
1553	0.18400
1554	0.24900
1555	0.43300

1556 rows × 1 columns

type(df["Acousticness"])

pandas.core.series.Series

type(df[["Acousticness"]])

pandas.core.frame.DataFrame

df[["Acousticness","Song Name","Artist"]]

	Acousticness	Song Name	Artist
0	0.12700	Beggin'	Måneskin
1	0.03830	STAY (with Justin Bieber)	The Kid LAROI
2	0.33500	good 4 u	Olivia Rodrigo
3	0.04690	Bad Habits	Ed Sheeran
4	0.02030	INDUSTRY BABY (feat. Jack Harlow)	Lil Nas X
...	...	...	...
1551	0.00261	New Rules	Dua Lipa
1552	0.24000	Cheirosa - Ao Vivo	Jorge & Mateus
1553	0.18400	Havana (feat. Young Thug)	Camila Cabello
1554	0.24900	Surtada - Remix Brega Funk	Dadá Boladão, Tati Zaqui, OIK
1555	0.43300	Lover (Remix) [feat. Shawn Mendes]	Taylor Swift

1556 rows × 3 columns
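The difference between single and double brackets shows up directly in the shape attribute. The sketch below uses a made-up four-row DataFrame (toy, x_2d and reg_toy are illustrative names, not part of the Spotify data) and also shows the reshape(-1, 1) route that the error message suggests for NumPy arrays.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (values are made up): y = 2x + 1 exactly.
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

print(toy["x"].shape)    # (4,)   one-dimensional Series
print(toy[["x"]].shape)  # (4, 1) two-dimensional DataFrame with one column

# For a NumPy array, reshape(-1, 1) produces the same two-dimensional layout.
x_2d = toy["x"].to_numpy().reshape(-1, 1)
print(x_2d.shape)        # (4, 1)

# Two-dimensional features, one-dimensional target.
reg_toy = LinearRegression()
reg_toy.fit(toy[["x"]], toy["y"])
print(reg_toy.coef_, reg_toy.intercept_)
```

Passing the one-column DataFrame has the side benefit that scikit-learn can keep track of the column name.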

Another common error is forgetting to remove missing values.

reg.fit(df[["Acousticness"]], df["Energy"])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [15], line 1
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
n_jobs_ = self.n_jobs
accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
   X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
)
sample_weight = _check_sample_weight(
   sample_weight, X, dtype=X.dtype, only_non_negative=True
)
X, y, X_offset, y_offset, X_scale = _preprocess_data(
   X,
   y,
   (...)
   sample_weight=sample_weight,
)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
       y = check_array(y, input_name="y", **check_y_params)
   else:
--> 596         X, y = check_X_y(X, y, **check_params)
   out = X, y
if not no_val_X and check_params.get("ensure_2d", True):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
       estimator_name = _check_estimator_name(estimator)
   raise ValueError(
       f"{estimator_name} requires y to be passed, but the target y is None"
   )
-> 1074 X = check_array(
   X,
   accept_sparse=accept_sparse,
   accept_large_sparse=accept_large_sparse,
   dtype=dtype,
   order=order,
   copy=copy,
   force_all_finite=force_all_finite,
   ensure_2d=ensure_2d,
   allow_nd=allow_nd,
   ensure_min_samples=ensure_min_samples,
   ensure_min_features=ensure_min_features,
   estimator=estimator,
   input_name="X",
)
y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
check_consistent_length(X, y)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:899, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
       raise ValueError(
           "Found array with dim %d. %s expected <= 2."
           % (array.ndim, estimator_name)
       )
   if force_all_finite:
--> 899         _assert_all_finite(
           array,
           input_name=input_name,
           estimator_name=estimator_name,
           allow_nan=force_all_finite == "allow-nan",
       )
if ensure_min_samples > 0:
   n_samples = _num_samples(array)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:146, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
       if (
           not allow_nan
           and estimator_name
   (...)
           # Improve the error message on how to handle missing values in
           # scikit-learn.
           msg_err += (
               f"\n{estimator_name} does not accept missing values"
               " encoded as NaN natively. For supervised learning, you might want"
   (...)
               "#estimators-that-handle-nan-values"
           )
--> 146         raise ValueError(msg_err)
# for object dtype data, we only check for NaNs (GH-13254)
elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

df_clean = df.dropna(axis = 0)

df.shape

(1556, 23)

df_clean.shape

(1545, 23)
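Calling dropna(axis = 0) removes a row if any of its columns is missing. A narrower alternative, not used in these notes, is the subset argument of dropna, which only checks the listed columns. The sketch below uses a made-up four-row DataFrame (toy, all_cols and model_cols are illustrative names) to compare the two.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column, in different rows.
toy = pd.DataFrame({
    "Acousticness": [0.1, np.nan, 0.3, 0.4],
    "Energy": [0.9, 0.8, np.nan, 0.6],
    "Genre": ["pop", "rock", "pop", None],
})

# Drop rows with a missing value in any column.
all_cols = toy.dropna(axis=0)

# Drop rows only if a column used by the model is missing.
model_cols = toy.dropna(axis=0, subset=["Acousticness", "Energy"])

print(toy.isna().sum())  # missing values per column
print(all_cols.shape)    # (1, 3)
print(model_cols.shape)  # (2, 3)
```

Which version is preferable depends on whether the other columns will be needed later.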

reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])

LinearRegression()

reg.coef_

array([-0.35010056])

reg.intercept_

0.7205632317364904

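Look at the scatter plot above. This coefficient and this intercept should look very reasonable: the intercept of about 0.72 is the predicted Energy when Acousticness is 0, and the negative coefficient of about -0.35 matches the downward trend of the points.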

What are the predicted outputs?

reg.predict(df_clean[["Acousticness"]])

array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
       0.56896969])
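With a single predictor, each predicted value is just intercept + coefficient × input. For the first row, 0.7205632 + (-0.35010056)(0.127) ≈ 0.6761005, which agrees with the first entry of the array above. The sketch below checks the same identity on synthetic data; the DataFrame toy and its coefficients are made up, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a negative trend plus a little noise (made-up values).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
toy = pd.DataFrame({
    "Acousticness": x,
    "Energy": 0.7 - 0.35 * x + rng.normal(0, 0.05, size=50),
})

reg_toy = LinearRegression().fit(toy[["Acousticness"]], toy["Energy"])

# Prediction computed by hand from the fitted intercept and coefficient.
by_hand = reg_toy.intercept_ + reg_toy.coef_[0] * toy["Acousticness"]

# Prediction computed by scikit-learn.
by_sklearn = reg_toy.predict(toy[["Acousticness"]])

print(np.allclose(by_hand, by_sklearn))
```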

Add a new column named “pred” with these predicted values. A warning (not an error) will show up because we didn’t use copy() above when dropping the rows with missing values.

df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])

/tmp/ipykernel_74/632255384.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])

df_clean2 = df.dropna(axis = 0).copy()

df_clean2["pred"] = reg.predict(df_clean2[["Acousticness"]])
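Calling copy() makes the cleaned DataFrame an independent object, so adding a column to it is unambiguous and leaves the original untouched. Here is a minimal sketch on a made-up three-row DataFrame (toy and toy_clean are illustrative names).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0]})

# Drop the row with the missing value and take an explicit copy.
toy_clean = toy.dropna(axis=0).copy()

# Adding a column to the copy does not touch the original DataFrame.
toy_clean["double"] = 2 * toy_clean["a"]

print(list(toy.columns))        # ['a']
print(list(toy_clean.columns))  # ['a', 'double']
```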

df_clean

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord	pred
0	1	1	8	2021-07-23--2021-07-30	Beggin'	48,633,449	Måneskin	3377762.0	3Wrjm47oTz2sjIgck11l5e	['indie rock italiano', 'italian pop']	...	0.800	-4.808	0.0504	0.12700	0.3590	134.002	211560.0	0.589	B	0.676100
1	2	2	3	2021-07-23--2021-07-30	STAY (with Justin Bieber)	47,248,719	The Kid LAROI	2230022.0	5HCyWlXZPP0y6Gqq8TgA20	['australian hip hop']	...	0.764	-5.484	0.0483	0.03830	0.1030	169.928	141806.0	0.478	C#/Db	0.707154
2	3	1	11	2021-06-25--2021-07-02	good 4 u	40,162,559	Olivia Rodrigo	6266514.0	4ZtFanR9U6ndgddUvNcjcG	['pop']	...	0.664	-5.044	0.1540	0.33500	0.0849	166.928	178147.0	0.688	A	0.603280
3	4	3	5	2021-07-02--2021-07-09	Bad Habits	37,799,456	Ed Sheeran	83293380.0	6PQ88X9TkUIAUIZJHW2upE	['pop', 'uk pop']	...	0.897	-3.712	0.0348	0.04690	0.3640	126.026	231041.0	0.591	B	0.704144
4	5	5	1	2021-07-23--2021-07-30	INDUSTRY BABY (feat. Jack Harlow)	33,948,454	Lil Nas X	5473565.0	27NovPIUIRrOZoCHxABJwK	['lgbtq+ hip hop', 'pop rap']	...	0.704	-7.409	0.0615	0.02030	0.0501	149.995	212000.0	0.894	D#/Eb	0.713456
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1551	1552	195	1	2019-12-27--2020-01-03	New Rules	4,630,675	Dua Lipa	27167675.0	2ekn2ttSfGqwhhate0LSR0	['dance pop', 'pop', 'uk pop']	...	0.700	-6.021	0.0694	0.00261	0.1530	116.073	209320.0	0.608	A	0.719649
1552	1553	196	1	2019-12-27--2020-01-03	Cheirosa - Ao Vivo	4,623,030	Jorge & Mateus	15019109.0	2PWjKmjyTZeDpmOUa3a5da	['sertanejo', 'sertanejo universitario']	...	0.870	-3.123	0.0851	0.24000	0.3330	152.370	181930.0	0.714	B	0.636539
1553	1554	197	1	2019-12-27--2020-01-03	Havana (feat. Young Thug)	4,620,876	Camila Cabello	22698747.0	1rfofaqEpACxVEHIZBJe6W	['dance pop', 'electropop', 'pop', 'post-teen ...	...	0.523	-4.333	0.0300	0.18400	0.1320	104.988	217307.0	0.394	D	0.656145
1554	1555	198	1	2019-12-27--2020-01-03	Surtada - Remix Brega Funk	4,607,385	Dadá Boladão, Tati Zaqui, OIK	208630.0	5F8ffc8KWKNawllr5WsW0r	['brega funk', 'funk carioca']	...	0.550	-7.026	0.0587	0.24900	0.1820	154.064	152784.0	0.881	F	0.633388
1555	1556	199	1	2019-12-27--2020-01-03	Lover (Remix) [feat. Shawn Mendes]	4,595,450	Taylor Swift	42227614.0	3i9UVldZOE0aD0JnyfAZZ0	['pop', 'post-teen pop']	...	0.603	-7.176	0.0640	0.43300	0.0862	205.272	221307.0	0.422	G	0.568970

1545 rows × 24 columns

Plot the fitted values using a red mark_line in Altair. Layer the two charts on top of each other.

c1 = alt.Chart(df_clean).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)

c2 = alt.Chart(df_clean).mark_line(color="red").encode(
    x = "Acousticness",
    y = "pred"
)

c1+c2

Linear regression works basically the same with multiple input variables. Perform linear regression using “Acousticness”, “Speechiness”, “Valence” as our input features.

reg2 = LinearRegression()

df_clean[["Acousticness", "Speechiness", "Valence"]]

	Acousticness	Speechiness	Valence
0	0.12700	0.0504	0.589
1	0.03830	0.0483	0.478
2	0.33500	0.1540	0.688
3	0.04690	0.0348	0.591
4	0.02030	0.0615	0.894
...	...	...	...
1551	0.00261	0.0694	0.608
1552	0.24000	0.0851	0.714
1553	0.18400	0.0300	0.394
1554	0.24900	0.0587	0.881
1555	0.43300	0.0640	0.422

1545 rows × 3 columns

reg2.fit(df_clean[["Acousticness", "Speechiness", "Valence"]],df_clean["Energy"])

LinearRegression()

reg2.coef_

array([-0.33557117, -0.08205736,  0.21893852])

Linear regression is not the fanciest machine learning model, but it is probably the most interpretable. For example, the coefficients above suggest that, with the other two features held fixed, higher Acousticness generally corresponds to less energy, while higher Valence corresponds to more energy.
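The entries of coef_ come in the same order as the input columns, so pairing them with the column names makes them easier to read. The sketch below does this with a pandas Series on synthetic, noise-free data; the DataFrame toy and the coefficients -0.3, -0.1 and 0.2 are made up for illustration and are not the Spotify results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

cols = ["Acousticness", "Speechiness", "Valence"]

# Synthetic features and a target built from known (made-up) coefficients.
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.uniform(0, 1, size=(200, 3)), columns=cols)
toy["Energy"] = (
    0.6
    - 0.3 * toy["Acousticness"]
    - 0.1 * toy["Speechiness"]
    + 0.2 * toy["Valence"]
)

reg_toy = LinearRegression().fit(toy[cols], toy["Energy"])

# Label each coefficient with the feature it belongs to.
coefs = pd.Series(reg_toy.coef_, index=cols)
print(coefs)
print(reg_toy.intercept_)
```

Because the target has no noise, the fit recovers the coefficients used to build it.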

df_clean.corr(numeric_only=True)

	Index	Highest Charting Position	Number of Times Charted	Artist Followers	Popularity	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	pred
Index	1.000000	0.251112	-0.360013	0.090887	-0.333683	0.126254	-0.018813	-0.014321	0.109213	-0.063914	0.028976	0.023622	-0.023481	-0.054632	0.063914
Highest Charting Position	0.251112	1.000000	-0.417748	-0.233723	-0.164167	0.017149	0.063026	0.032166	0.041248	-0.012924	0.012718	0.026235	-0.033956	0.045362	0.012924
Number of Times Charted	-0.360013	-0.417748	1.000000	0.027458	0.232796	0.027026	-0.061139	0.031225	-0.060216	0.046651	-0.058436	-0.048307	0.033980	0.021570	-0.046651
Artist Followers	0.090887	-0.233723	0.027458	1.000000	0.104358	-0.097576	-0.065613	-0.033264	-0.072968	0.023830	-0.012491	-0.019881	0.142145	-0.108804	-0.023830
Popularity	-0.333683	-0.164167	0.232796	0.104358	1.000000	0.028435	0.094691	0.158767	-0.032091	-0.091245	-0.029460	-0.024951	0.082096	-0.000953	0.091245
Danceability	0.126254	0.017149	0.027026	-0.097576	0.028435	1.000000	0.142130	0.234928	0.237394	-0.316798	-0.114518	-0.040219	-0.101390	0.361627	0.316798
Energy	-0.018813	0.063026	-0.061139	-0.065613	0.094691	0.142130	1.000000	0.732616	0.023989	-0.542399	0.124693	0.113352	0.056624	0.356325	0.542399
Loudness	-0.014321	0.032166	0.031225	-0.033264	0.158767	0.234928	0.732616	1.000000	-0.018823	-0.477431	0.043141	0.104371	0.075262	0.298762	0.477431
Speechiness	0.109213	0.041248	-0.060216	-0.072968	-0.032091	0.237394	0.023989	-0.018823	1.000000	-0.131436	0.072774	0.111255	-0.089895	0.038032	0.131436
Acousticness	-0.063914	-0.012924	0.046651	0.023830	-0.091245	-0.316798	-0.542399	-0.477431	-0.131436	1.000000	-0.005469	-0.061632	-0.046010	-0.096997	-1.000000
Liveness	0.028976	0.012718	-0.058436	-0.012491	-0.029460	-0.114518	0.124693	0.043141	0.072774	-0.005469	1.000000	-0.018265	0.019685	0.007882	0.005469
Tempo	0.023622	0.026235	-0.048307	-0.019881	-0.024951	-0.040219	0.113352	0.104371	0.111255	-0.061632	-0.018265	1.000000	-0.004671	0.057563	0.061632
Duration (ms)	-0.023481	-0.033956	0.033980	0.142145	0.082096	-0.101390	0.056624	0.075262	-0.089895	-0.046010	0.019685	-0.004671	1.000000	-0.119981	0.046010
Valence	-0.054632	0.045362	0.021570	-0.108804	-0.000953	0.361627	0.356325	0.298762	0.038032	-0.096997	0.007882	0.057563	-0.119981	1.000000	0.096997
pred	0.063914	0.012924	-0.046651	-0.023830	0.091245	0.316798	0.542399	0.477431	0.131436	-1.000000	0.005469	0.061632	0.046010	0.096997	1.000000
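In the table above, the pred row is the Acousticness row with every sign flipped, and the correlation between the two is exactly -1. That is what one would expect, since pred is a linear function of Acousticness with a negative slope. The sketch below reproduces the effect on synthetic data; the DataFrame toy, the column Other, and the numbers 0.72 and -0.35 are illustrative stand-ins.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "Acousticness": rng.uniform(0, 1, size=100),
    "Other": rng.normal(size=100),  # unrelated made-up column
})

# A linear function of Acousticness with a negative slope.
toy["pred"] = 0.72 - 0.35 * toy["Acousticness"]

corr = toy.corr()
print(corr.loc["Acousticness", "pred"])

# Correlations with pred mirror those with Acousticness, with the sign flipped.
print(corr.loc["Other", "Acousticness"], corr.loc["Other", "pred"])
```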

References:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
https://docs.python.org/3/library/copy.html
https://realpython.com/copying-python-objects/