Week 5 Thursday

import numpy as np
import pandas as pd
import altair as alt

Here is an elegant way to tell pandas how missing values are represented in our dataset: pass the na_values argument to read_csv. This is much more efficient than using applymap and a lambda function.

df = pd.read_csv("spotify_dataset.csv", na_values=" ")

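Notice how the data types of the numeric feature columns, such as “Energy” and “Acousticness”, are correct (we didn’t need to use pd.to_numeric). The “Streams” column is still object because its values contain commas.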

df.dtypes
Index                          int64
Highest Charting Position      int64
Number of Times Charted        int64
Week of Highest Charting      object
Song Name                     object
Streams                       object
Artist                        object
Artist Followers             float64
Song ID                       object
Genre                         object
Release Date                  object
Weeks Charted                 object
Popularity                   float64
Danceability                 float64
Energy                       float64
Loudness                     float64
Speechiness                  float64
Acousticness                 float64
Liveness                     float64
Tempo                        float64
Duration (ms)                float64
Valence                      float64
Chord                         object
dtype: object
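
As a small illustration of what na_values changes, the sketch below reads a made-up three-row CSV twice. The column names and values are invented for this example and are not taken from spotify_dataset.csv. Without na_values, the single-space entry keeps the column from being parsed as numbers; with na_values=" ", the entry becomes NaN and the column is float64.

```python
import io
import pandas as pd

# Hypothetical CSV text: the second row uses a single space for a missing value.
csv_text = "Song,Energy\nA,0.8\nB, \nC,0.6\n"

# Default parsing: the " " entry forces the column to hold strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Telling pandas that " " means "missing" lets the column be parsed as floats.
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw["Energy"].dtype)
print(clean["Energy"].dtype)
print(clean["Energy"].isna().sum())
```
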
  • Plot Acousticness vs Energy.

alt.Chart(df).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)

If we fit a linear model to this data (with “Acousticness” as our only predictor and “Energy” as our target), would you expect the coefficient to be positive or negative? (You can use either the graph or intuition. Think of “Acousticness” as the opposite of, for example, electric guitars.)

  • Check your answer using the LinearRegression from scikit-learn.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()

The following is a very common error. The problem is that df["Acousticness"] is a one-dimensional pandas Series, and scikit-learn expects the input features to be two-dimensional (one row per sample, one column per feature). It is fine for the second input, df["Energy"], to be one-dimensional.

reg.fit(df["Acousticness"], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [8], line 1
----> 1 reg.fit(df["Acousticness"], df["Energy"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
    680 n_jobs_ = self.n_jobs
    682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
    685     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    686 )
    688 sample_weight = _check_sample_weight(
    689     sample_weight, X, dtype=X.dtype, only_non_negative=True
    690 )
    692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    693     X,
    694     y,
   (...)
    698     sample_weight=sample_weight,
    699 )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    594         y = check_array(y, input_name="y", **check_y_params)
    595     else:
--> 596         X, y = check_X_y(X, y, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1069         estimator_name = _check_estimator_name(estimator)
   1070     raise ValueError(
   1071         f"{estimator_name} requires y to be passed, but the target y is None"
   1072     )
-> 1074 X = check_array(
   1075     X,
   1076     accept_sparse=accept_sparse,
   1077     accept_large_sparse=accept_large_sparse,
   1078     dtype=dtype,
   1079     order=order,
   1080     copy=copy,
   1081     force_all_finite=force_all_finite,
   1082     ensure_2d=ensure_2d,
   1083     allow_nd=allow_nd,
   1084     ensure_min_samples=ensure_min_samples,
   1085     ensure_min_features=ensure_min_features,
   1086     estimator=estimator,
   1087     input_name="X",
   1088 )
   1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1092 check_consistent_length(X, y)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    877     # If input is 1D raise error
    878     if array.ndim == 1:
--> 879         raise ValueError(
    880             "Expected 2D array, got 1D array instead:\narray={}.\n"
    881             "Reshape your data either using array.reshape(-1, 1) if "
    882             "your data has a single feature or array.reshape(1, -1) "
    883             "if it contains a single sample.".format(array)
    884         )
    886 if dtype_numeric and array.dtype.kind in "USV":
    887     raise ValueError(
    888         "dtype='numeric' is not compatible with arrays of bytes/strings."
    889         "Convert your data to numeric values explicitly instead."
    890     )

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
df["Acousticness"]
0       0.12700
1       0.03830
2       0.33500
3       0.04690
4       0.02030
         ...   
1551    0.00261
1552    0.24000
1553    0.18400
1554    0.24900
1555    0.43300
Name: Acousticness, Length: 1556, dtype: float64
df[["Acousticness"]]
Acousticness
0 0.12700
1 0.03830
2 0.33500
3 0.04690
4 0.02030
... ...
1551 0.00261
1552 0.24000
1553 0.18400
1554 0.24900
1555 0.43300

1556 rows × 1 columns

type(df["Acousticness"])
pandas.core.series.Series
type(df[["Acousticness"]])
pandas.core.frame.DataFrame
df[["Acousticness","Song Name","Artist"]]
Acousticness Song Name Artist
0 0.12700 Beggin' Måneskin
1 0.03830 STAY (with Justin Bieber) The Kid LAROI
2 0.33500 good 4 u Olivia Rodrigo
3 0.04690 Bad Habits Ed Sheeran
4 0.02030 INDUSTRY BABY (feat. Jack Harlow) Lil Nas X
... ... ... ...
1551 0.00261 New Rules Dua Lipa
1552 0.24000 Cheirosa - Ao Vivo Jorge & Mateus
1553 0.18400 Havana (feat. Young Thug) Camila Cabello
1554 0.24900 Surtada - Remix Brega Funk Dadá Boladão, Tati Zaqui, OIK
1555 0.43300 Lover (Remix) [feat. Shawn Mendes] Taylor Swift

1556 rows × 3 columns
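
The difference between single and double brackets can be checked on a tiny example. This is a minimal sketch using an invented four-row DataFrame named toy (not the Spotify data): single brackets give a one-dimensional Series, double brackets give a two-dimensional DataFrame, and only the latter is accepted as the first argument of fit.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data lying exactly on the line y = 1 + 2x.
toy = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

series_x = toy["x"]    # single brackets: pandas Series, shape (4,)
frame_x = toy[["x"]]   # double brackets: pandas DataFrame, shape (4, 1)
print(series_x.shape, frame_x.shape)

# Passing the Series as the features raises the "Expected 2D array" ValueError.
try:
    LinearRegression().fit(series_x, toy["y"])
    series_failed = False
except ValueError:
    series_failed = True

# Passing the one-column DataFrame works; the target may stay one-dimensional.
reg_toy = LinearRegression().fit(frame_x, toy["y"])
print(reg_toy.coef_, reg_toy.intercept_)
```

Calling series_x.to_frame() is another way to get the same one-column DataFrame.
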

Another common error is forgetting to remove missing values before fitting.

reg.fit(df[["Acousticness"]], df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [15], line 1
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
    680 n_jobs_ = self.n_jobs
    682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
    685     X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    686 )
    688 sample_weight = _check_sample_weight(
    689     sample_weight, X, dtype=X.dtype, only_non_negative=True
    690 )
    692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
    693     X,
    694     y,
   (...)
    698     sample_weight=sample_weight,
    699 )

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
    594         y = check_array(y, input_name="y", **check_y_params)
    595     else:
--> 596         X, y = check_X_y(X, y, **check_params)
    597     out = X, y
    599 if not no_val_X and check_params.get("ensure_2d", True):

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
   1069         estimator_name = _check_estimator_name(estimator)
   1070     raise ValueError(
   1071         f"{estimator_name} requires y to be passed, but the target y is None"
   1072     )
-> 1074 X = check_array(
   1075     X,
   1076     accept_sparse=accept_sparse,
   1077     accept_large_sparse=accept_large_sparse,
   1078     dtype=dtype,
   1079     order=order,
   1080     copy=copy,
   1081     force_all_finite=force_all_finite,
   1082     ensure_2d=ensure_2d,
   1083     allow_nd=allow_nd,
   1084     ensure_min_samples=ensure_min_samples,
   1085     ensure_min_features=ensure_min_features,
   1086     estimator=estimator,
   1087     input_name="X",
   1088 )
   1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
   1092 check_consistent_length(X, y)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:899, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
    893         raise ValueError(
    894             "Found array with dim %d. %s expected <= 2."
    895             % (array.ndim, estimator_name)
    896         )
    898     if force_all_finite:
--> 899         _assert_all_finite(
    900             array,
    901             input_name=input_name,
    902             estimator_name=estimator_name,
    903             allow_nan=force_all_finite == "allow-nan",
    904         )
    906 if ensure_min_samples > 0:
    907     n_samples = _num_samples(array)

File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:146, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
    124         if (
    125             not allow_nan
    126             and estimator_name
   (...)
    130             # Improve the error message on how to handle missing values in
    131             # scikit-learn.
    132             msg_err += (
    133                 f"\n{estimator_name} does not accept missing values"
    134                 " encoded as NaN natively. For supervised learning, you might want"
   (...)
    144                 "#estimators-that-handle-nan-values"
    145             )
--> 146         raise ValueError(msg_err)
    148 # for object dtype data, we only check for NaNs (GH-13254)
    149 elif X.dtype == np.dtype("object") and not allow_nan:

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
df_clean = df.dropna(axis = 0)
df.shape
(1556, 23)
df_clean.shape
(1545, 23)
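
The call dropna(axis = 0) above drops a row if any of its columns is missing. One possible alternative, which is not what these notes do, is to pass subset so that only the columns used in the regression are checked. The sketch below uses an invented four-row DataFrame named toy to show how the two choices can keep different numbers of rows.

```python
import numpy as np
import pandas as pd

# Invented data with missing values in different columns.
toy = pd.DataFrame({
    "Acousticness": [0.1, np.nan, 0.3, 0.4],
    "Energy": [0.9, 0.8, np.nan, 0.5],
    "Genre": ["pop", "rock", "pop", None],
})

# Drop a row if ANY column is missing (the approach used in these notes).
all_cols = toy.dropna(axis=0)

# Drop a row only if one of the listed columns is missing.
used_cols = toy.dropna(subset=["Acousticness", "Energy"])

print(all_cols.shape)
print(used_cols.shape)
```
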
reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])
LinearRegression()
reg.coef_
array([-0.35010056])
reg.intercept_
0.7205632317364904

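Look at the scatter plot above. This coefficient and this intercept should look very reasonable: the fitted line starts at about 0.72 when Acousticness is 0 and drops by about 0.35 as Acousticness goes from 0 to 1.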

What are the predicted outputs?

reg.predict(df_clean[["Acousticness"]])
array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
       0.56896969])
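
For a model with a single feature, each predicted value is just intercept + coefficient × Acousticness. As a quick check, the sketch below plugs in the coefficient and intercept printed above together with the first three Acousticness values shown earlier; the variable names are chosen here for illustration.

```python
import numpy as np

# Values copied from the reg.coef_ and reg.intercept_ outputs above.
coef = -0.35010056
intercept = 0.7205632317364904

# First three Acousticness values from the dataset, as displayed earlier.
acousticness = np.array([0.127, 0.0383, 0.335])

# The linear model's prediction, computed by hand.
manual_pred = intercept + coef * acousticness
print(manual_pred)
```

Up to rounding, these agree with the first three entries returned by reg.predict.
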

Add a new column named “pred” with these predicted values. A warning (not an error) will show up because we didn’t use copy() above when dropping the rows with missing values.

df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
/tmp/ipykernel_74/632255384.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
df_clean2 = df.dropna(axis = 0).copy()
df_clean2["pred"] = reg.predict(df_clean2[["Acousticness"]])
df_clean
Index Highest Charting Position Number of Times Charted Week of Highest Charting Song Name Streams Artist Artist Followers Song ID Genre ... Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence Chord pred
0 1 1 8 2021-07-23--2021-07-30 Beggin' 48,633,449 Måneskin 3377762.0 3Wrjm47oTz2sjIgck11l5e ['indie rock italiano', 'italian pop'] ... 0.800 -4.808 0.0504 0.12700 0.3590 134.002 211560.0 0.589 B 0.676100
1 2 2 3 2021-07-23--2021-07-30 STAY (with Justin Bieber) 47,248,719 The Kid LAROI 2230022.0 5HCyWlXZPP0y6Gqq8TgA20 ['australian hip hop'] ... 0.764 -5.484 0.0483 0.03830 0.1030 169.928 141806.0 0.478 C#/Db 0.707154
2 3 1 11 2021-06-25--2021-07-02 good 4 u 40,162,559 Olivia Rodrigo 6266514.0 4ZtFanR9U6ndgddUvNcjcG ['pop'] ... 0.664 -5.044 0.1540 0.33500 0.0849 166.928 178147.0 0.688 A 0.603280
3 4 3 5 2021-07-02--2021-07-09 Bad Habits 37,799,456 Ed Sheeran 83293380.0 6PQ88X9TkUIAUIZJHW2upE ['pop', 'uk pop'] ... 0.897 -3.712 0.0348 0.04690 0.3640 126.026 231041.0 0.591 B 0.704144
4 5 5 1 2021-07-23--2021-07-30 INDUSTRY BABY (feat. Jack Harlow) 33,948,454 Lil Nas X 5473565.0 27NovPIUIRrOZoCHxABJwK ['lgbtq+ hip hop', 'pop rap'] ... 0.704 -7.409 0.0615 0.02030 0.0501 149.995 212000.0 0.894 D#/Eb 0.713456
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1551 1552 195 1 2019-12-27--2020-01-03 New Rules 4,630,675 Dua Lipa 27167675.0 2ekn2ttSfGqwhhate0LSR0 ['dance pop', 'pop', 'uk pop'] ... 0.700 -6.021 0.0694 0.00261 0.1530 116.073 209320.0 0.608 A 0.719649
1552 1553 196 1 2019-12-27--2020-01-03 Cheirosa - Ao Vivo 4,623,030 Jorge & Mateus 15019109.0 2PWjKmjyTZeDpmOUa3a5da ['sertanejo', 'sertanejo universitario'] ... 0.870 -3.123 0.0851 0.24000 0.3330 152.370 181930.0 0.714 B 0.636539
1553 1554 197 1 2019-12-27--2020-01-03 Havana (feat. Young Thug) 4,620,876 Camila Cabello 22698747.0 1rfofaqEpACxVEHIZBJe6W ['dance pop', 'electropop', 'pop', 'post-teen ... ... 0.523 -4.333 0.0300 0.18400 0.1320 104.988 217307.0 0.394 D 0.656145
1554 1555 198 1 2019-12-27--2020-01-03 Surtada - Remix Brega Funk 4,607,385 Dadá Boladão, Tati Zaqui, OIK 208630.0 5F8ffc8KWKNawllr5WsW0r ['brega funk', 'funk carioca'] ... 0.550 -7.026 0.0587 0.24900 0.1820 154.064 152784.0 0.881 F 0.633388
1555 1556 199 1 2019-12-27--2020-01-03 Lover (Remix) [feat. Shawn Mendes] 4,595,450 Taylor Swift 42227614.0 3i9UVldZOE0aD0JnyfAZZ0 ['pop', 'post-teen pop'] ... 0.603 -7.176 0.0640 0.43300 0.0862 205.272 221307.0 0.422 G 0.568970

1545 rows × 24 columns
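
The effect of calling copy() after dropna can be seen on a small invented example. In this sketch, toy and toy_clean are hypothetical names: the new column is added to the independent copy, and the original DataFrame is left unchanged.

```python
import numpy as np
import pandas as pd

# Invented data with one missing value.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, 6.0]})

# Drop the incomplete row and make an explicit, independent copy.
toy_clean = toy.dropna(axis=0).copy()

# Adding a column to the copy does not touch the original DataFrame.
toy_clean["pred"] = 2 * toy_clean["x"]

print(toy.columns.tolist())
print(toy_clean)
```
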

Plot the fit values using a red mark_line in Altair. Layer the two charts on top of each other.

c1 = alt.Chart(df_clean).mark_circle().encode(
    x = "Acousticness",
    y = "Energy"
)
c2 = alt.Chart(df_clean).mark_line(color="red").encode(
    x = "Acousticness",
    y = "pred"
)
c1+c2

Linear regression works basically the same with multiple input variables. Perform linear regression using “Acousticness”, “Speechiness”, and “Valence” as our input features.

reg2 = LinearRegression()
df_clean[["Acousticness", "Speechiness", "Valence"]]
Acousticness Speechiness Valence
0 0.12700 0.0504 0.589
1 0.03830 0.0483 0.478
2 0.33500 0.1540 0.688
3 0.04690 0.0348 0.591
4 0.02030 0.0615 0.894
... ... ... ...
1551 0.00261 0.0694 0.608
1552 0.24000 0.0851 0.714
1553 0.18400 0.0300 0.394
1554 0.24900 0.0587 0.881
1555 0.43300 0.0640 0.422

1545 rows × 3 columns

reg2.fit(df_clean[["Acousticness", "Speechiness", "Valence"]],df_clean["Energy"])
LinearRegression()
reg2.coef_
array([-0.33557117, -0.08205736,  0.21893852])

Linear regression is not the fanciest machine learning model, but it is probably the most interpretable model. For example, the coefficients above suggest that Acousticness generally corresponds to having less energy, while Valence corresponds to having more energy.
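
The entries of coef_ come in the same order as the feature columns passed to fit, so it can help to label them. The sketch below is an illustration on synthetic data, not the Spotify dataset: the target is built from known coefficients (chosen arbitrarily here), and the fitted model recovers them.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["Acousticness", "Speechiness", "Valence"]

# Synthetic features in [0, 1) and a target built from known coefficients.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, 3)), columns=features)
toy["Energy"] = (
    0.7
    - 0.3 * toy["Acousticness"]
    - 0.1 * toy["Speechiness"]
    + 0.2 * toy["Valence"]
)

reg_toy = LinearRegression().fit(toy[features], toy["Energy"])

# Pair each coefficient with the name of the feature it multiplies.
coef_table = pd.Series(reg_toy.coef_, index=features)
print(coef_table)
print(reg_toy.intercept_)
```
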

df_clean.corr()
Index Highest Charting Position Number of Times Charted Artist Followers Popularity Danceability Energy Loudness Speechiness Acousticness Liveness Tempo Duration (ms) Valence pred
Index 1.000000 0.251112 -0.360013 0.090887 -0.333683 0.126254 -0.018813 -0.014321 0.109213 -0.063914 0.028976 0.023622 -0.023481 -0.054632 0.063914
Highest Charting Position 0.251112 1.000000 -0.417748 -0.233723 -0.164167 0.017149 0.063026 0.032166 0.041248 -0.012924 0.012718 0.026235 -0.033956 0.045362 0.012924
Number of Times Charted -0.360013 -0.417748 1.000000 0.027458 0.232796 0.027026 -0.061139 0.031225 -0.060216 0.046651 -0.058436 -0.048307 0.033980 0.021570 -0.046651
Artist Followers 0.090887 -0.233723 0.027458 1.000000 0.104358 -0.097576 -0.065613 -0.033264 -0.072968 0.023830 -0.012491 -0.019881 0.142145 -0.108804 -0.023830
Popularity -0.333683 -0.164167 0.232796 0.104358 1.000000 0.028435 0.094691 0.158767 -0.032091 -0.091245 -0.029460 -0.024951 0.082096 -0.000953 0.091245
Danceability 0.126254 0.017149 0.027026 -0.097576 0.028435 1.000000 0.142130 0.234928 0.237394 -0.316798 -0.114518 -0.040219 -0.101390 0.361627 0.316798
Energy -0.018813 0.063026 -0.061139 -0.065613 0.094691 0.142130 1.000000 0.732616 0.023989 -0.542399 0.124693 0.113352 0.056624 0.356325 0.542399
Loudness -0.014321 0.032166 0.031225 -0.033264 0.158767 0.234928 0.732616 1.000000 -0.018823 -0.477431 0.043141 0.104371 0.075262 0.298762 0.477431
Speechiness 0.109213 0.041248 -0.060216 -0.072968 -0.032091 0.237394 0.023989 -0.018823 1.000000 -0.131436 0.072774 0.111255 -0.089895 0.038032 0.131436
Acousticness -0.063914 -0.012924 0.046651 0.023830 -0.091245 -0.316798 -0.542399 -0.477431 -0.131436 1.000000 -0.005469 -0.061632 -0.046010 -0.096997 -1.000000
Liveness 0.028976 0.012718 -0.058436 -0.012491 -0.029460 -0.114518 0.124693 0.043141 0.072774 -0.005469 1.000000 -0.018265 0.019685 0.007882 0.005469
Tempo 0.023622 0.026235 -0.048307 -0.019881 -0.024951 -0.040219 0.113352 0.104371 0.111255 -0.061632 -0.018265 1.000000 -0.004671 0.057563 0.061632
Duration (ms) -0.023481 -0.033956 0.033980 0.142145 0.082096 -0.101390 0.056624 0.075262 -0.089895 -0.046010 0.019685 -0.004671 1.000000 -0.119981 0.046010
Valence -0.054632 0.045362 0.021570 -0.108804 -0.000953 0.361627 0.356325 0.298762 0.038032 -0.096997 0.007882 0.057563 -0.119981 1.000000 0.096997
pred 0.063914 0.012924 -0.046651 -0.023830 0.091245 0.316798 0.542399 0.477431 0.131436 -1.000000 0.005469 0.061632 0.046010 0.096997 1.000000
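
Two details of this table may be worth a note. The pred column is an exact linear function of Acousticness with a negative slope, which is why their correlation is -1 and why every other entry in the pred row is the negative of the corresponding Acousticness entry. Also, in pandas 2.0 and later, corr() no longer drops non-numeric columns by default, so a call like the one above would likely need numeric_only=True. The sketch below uses an invented four-row DataFrame and rounded coefficients to illustrate both points.

```python
import pandas as pd

# Invented data; "Artist" is a non-numeric column, as in the real dataset.
toy = pd.DataFrame({
    "Acousticness": [0.1, 0.4, 0.2, 0.9],
    "Energy": [0.8, 0.6, 0.75, 0.3],
    "Artist": ["a", "b", "c", "d"],
})

# A prediction column that is a linear function of Acousticness
# with a negative slope (rounded values, for illustration only).
toy["pred"] = 0.72 - 0.35 * toy["Acousticness"]

# numeric_only=True restricts the correlation matrix to numeric columns.
corr = toy.corr(numeric_only=True)
print(corr)
```
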

References:

  • https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
  • https://docs.python.org/3/library/copy.html
  • https://realpython.com/copying-python-objects/