Week 5 Thursday
import numpy as np
import pandas as pd
import altair as alt
Here is an elegant way to tell pandas how missing values are represented in our dataset: the na_values argument of pd.read_csv. This is much more efficient than using applymap and a lambda function.
df = pd.read_csv("spotify_dataset.csv", na_values=" ")
Notice that the numeric feature columns, such as “Acousticness” and “Energy”, now have the correct data type float64 (we didn’t need to use pd.to_numeric).
df.dtypes
Index int64
Highest Charting Position int64
Number of Times Charted int64
Week of Highest Charting object
Song Name object
Streams object
Artist object
Artist Followers float64
Song ID object
Genre object
Release Date object
Weeks Charted object
Popularity float64
Danceability float64
Energy float64
Loudness float64
Speechiness float64
Acousticness float64
Liveness float64
Tempo float64
Duration (ms) float64
Valence float64
Chord object
dtype: object
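As a small illustration of what na_values does, the sketch below reads a tiny made-up CSV (not the Spotify file) in which one entry is a single space. The column names and values are invented for the example.

```python
import io

import pandas as pd

# Hypothetical miniature CSV: the Energy entry for "song_b" is a single space.
csv_text = "Song,Energy\nsong_a,0.8\nsong_b, \nsong_c,0.5\n"

# Without na_values, pandas is not told that " " means "missing".
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values=" ", the single space is parsed as NaN,
# so the Energy column can be stored as floats.
clean = pd.read_csv(io.StringIO(csv_text), na_values=" ")

print(raw.dtypes)
print(clean.dtypes)
print(clean["Energy"].isna().sum())
```

Comparing the two printed dtypes shows whether the single space prevented the column from being read as numeric.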
Plot “Energy” against “Acousticness” (Acousticness on the x-axis, Energy on the y-axis).
alt.Chart(df).mark_circle().encode(
x = "Acousticness",
y = "Energy"
)
If we fit a linear model to this data (with “Acousticness” as our only predictor and “Energy” as our target), would you expect the coefficient to be positive or negative? (You can use either the graph or intuition. “Acousticness” describes how acoustic a song is, as opposed to relying on, for example, electric guitars.)
Check your answer using LinearRegression from scikit-learn.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
The following is a very common error. The problem is that df["Acousticness"] is one-dimensional (a pandas Series), and scikit-learn wants the first input to be two-dimensional. (It’s fine, and most common, for the second input, df["Energy"], to be one-dimensional.)
reg.fit(df["Acousticness"], df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [8], line 1
----> 1 reg.fit(df["Acousticness"], df["Energy"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
680 n_jobs_ = self.n_jobs
682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
685 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
686 )
688 sample_weight = _check_sample_weight(
689 sample_weight, X, dtype=X.dtype, only_non_negative=True
690 )
692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
693 X,
694 y,
(...)
698 sample_weight=sample_weight,
699 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
594 y = check_array(y, input_name="y", **check_y_params)
595 else:
--> 596 X, y = check_X_y(X, y, **check_params)
597 out = X, y
599 if not no_val_X and check_params.get("ensure_2d", True):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1069 estimator_name = _check_estimator_name(estimator)
1070 raise ValueError(
1071 f"{estimator_name} requires y to be passed, but the target y is None"
1072 )
-> 1074 X = check_array(
1075 X,
1076 accept_sparse=accept_sparse,
1077 accept_large_sparse=accept_large_sparse,
1078 dtype=dtype,
1079 order=order,
1080 copy=copy,
1081 force_all_finite=force_all_finite,
1082 ensure_2d=ensure_2d,
1083 allow_nd=allow_nd,
1084 ensure_min_samples=ensure_min_samples,
1085 ensure_min_features=ensure_min_features,
1086 estimator=estimator,
1087 input_name="X",
1088 )
1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1092 check_consistent_length(X, y)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:879, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
877 # If input is 1D raise error
878 if array.ndim == 1:
--> 879 raise ValueError(
880 "Expected 2D array, got 1D array instead:\narray={}.\n"
881 "Reshape your data either using array.reshape(-1, 1) if "
882 "your data has a single feature or array.reshape(1, -1) "
883 "if it contains a single sample.".format(array)
884 )
886 if dtype_numeric and array.dtype.kind in "USV":
887 raise ValueError(
888 "dtype='numeric' is not compatible with arrays of bytes/strings."
889 "Convert your data to numeric values explicitly instead."
890 )
ValueError: Expected 2D array, got 1D array instead:
array=[0.127 0.0383 0.335 ... 0.184 0.249 0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
df["Acousticness"]
0 0.12700
1 0.03830
2 0.33500
3 0.04690
4 0.02030
...
1551 0.00261
1552 0.24000
1553 0.18400
1554 0.24900
1555 0.43300
Name: Acousticness, Length: 1556, dtype: float64
df[["Acousticness"]]
 | Acousticness |
---|---|
0 | 0.12700 |
1 | 0.03830 |
2 | 0.33500 |
3 | 0.04690 |
4 | 0.02030 |
... | ... |
1551 | 0.00261 |
1552 | 0.24000 |
1553 | 0.18400 |
1554 | 0.24900 |
1555 | 0.43300 |
1556 rows × 1 columns
type(df["Acousticness"])
pandas.core.series.Series
type(df[["Acousticness"]])
pandas.core.frame.DataFrame
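The difference between the two kinds of brackets can also be checked through the ndim and shape attributes. The sketch below uses a small made-up DataFrame rather than the Spotify data; to_frame and reshape(-1, 1) are shown as two alternative ways to get a two-dimensional object, the second being the one suggested in the error message.

```python
import pandas as pd

# Hypothetical toy data standing in for one column of the real dataset.
toy = pd.DataFrame({"Acousticness": [0.1, 0.4, 0.9]})

series = toy["Acousticness"]    # single brackets: pandas Series
frame = toy[["Acousticness"]]   # double brackets: one-column DataFrame

print(series.ndim, series.shape)  # 1 (3,)
print(frame.ndim, frame.shape)    # 2 (3, 1)

# Two other ways to get a 2D object with a single column.
frame_alt = series.to_frame()
array_2d = series.to_numpy().reshape(-1, 1)
print(frame_alt.shape, array_2d.shape)
```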
df[["Acousticness","Song Name","Artist"]]
 | Acousticness | Song Name | Artist |
---|---|---|---|
0 | 0.12700 | Beggin' | Måneskin |
1 | 0.03830 | STAY (with Justin Bieber) | The Kid LAROI |
2 | 0.33500 | good 4 u | Olivia Rodrigo |
3 | 0.04690 | Bad Habits | Ed Sheeran |
4 | 0.02030 | INDUSTRY BABY (feat. Jack Harlow) | Lil Nas X |
... | ... | ... | ... |
1551 | 0.00261 | New Rules | Dua Lipa |
1552 | 0.24000 | Cheirosa - Ao Vivo | Jorge & Mateus |
1553 | 0.18400 | Havana (feat. Young Thug) | Camila Cabello |
1554 | 0.24900 | Surtada - Remix Brega Funk | Dadá Boladão, Tati Zaqui, OIK |
1555 | 0.43300 | Lover (Remix) [feat. Shawn Mendes] | Taylor Swift |
1556 rows × 3 columns
Another common error is forgetting to remove missing values: LinearRegression does not accept input that contains NaN.
reg.fit(df[["Acousticness"]], df["Energy"])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [15], line 1
----> 1 reg.fit(df[["Acousticness"]], df["Energy"])
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/linear_model/_base.py:684, in LinearRegression.fit(self, X, y, sample_weight)
680 n_jobs_ = self.n_jobs
682 accept_sparse = False if self.positive else ["csr", "csc", "coo"]
--> 684 X, y = self._validate_data(
685 X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
686 )
688 sample_weight = _check_sample_weight(
689 sample_weight, X, dtype=X.dtype, only_non_negative=True
690 )
692 X, y, X_offset, y_offset, X_scale = _preprocess_data(
693 X,
694 y,
(...)
698 sample_weight=sample_weight,
699 )
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/base.py:596, in BaseEstimator._validate_data(self, X, y, reset, validate_separately, **check_params)
594 y = check_array(y, input_name="y", **check_y_params)
595 else:
--> 596 X, y = check_X_y(X, y, **check_params)
597 out = X, y
599 if not no_val_X and check_params.get("ensure_2d", True):
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:1074, in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
1069 estimator_name = _check_estimator_name(estimator)
1070 raise ValueError(
1071 f"{estimator_name} requires y to be passed, but the target y is None"
1072 )
-> 1074 X = check_array(
1075 X,
1076 accept_sparse=accept_sparse,
1077 accept_large_sparse=accept_large_sparse,
1078 dtype=dtype,
1079 order=order,
1080 copy=copy,
1081 force_all_finite=force_all_finite,
1082 ensure_2d=ensure_2d,
1083 allow_nd=allow_nd,
1084 ensure_min_samples=ensure_min_samples,
1085 ensure_min_features=ensure_min_features,
1086 estimator=estimator,
1087 input_name="X",
1088 )
1090 y = _check_y(y, multi_output=multi_output, y_numeric=y_numeric, estimator=estimator)
1092 check_consistent_length(X, y)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:899, in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator, input_name)
893 raise ValueError(
894 "Found array with dim %d. %s expected <= 2."
895 % (array.ndim, estimator_name)
896 )
898 if force_all_finite:
--> 899 _assert_all_finite(
900 array,
901 input_name=input_name,
902 estimator_name=estimator_name,
903 allow_nan=force_all_finite == "allow-nan",
904 )
906 if ensure_min_samples > 0:
907 n_samples = _num_samples(array)
File /shared-libs/python3.9/py/lib/python3.9/site-packages/sklearn/utils/validation.py:146, in _assert_all_finite(X, allow_nan, msg_dtype, estimator_name, input_name)
124 if (
125 not allow_nan
126 and estimator_name
(...)
130 # Improve the error message on how to handle missing values in
131 # scikit-learn.
132 msg_err += (
133 f"\n{estimator_name} does not accept missing values"
134 " encoded as NaN natively. For supervised learning, you might want"
(...)
144 "#estimators-that-handle-nan-values"
145 )
--> 146 raise ValueError(msg_err)
148 # for object dtype data, we only check for NaNs (GH-13254)
149 elif X.dtype == np.dtype("object") and not allow_nan:
ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
df_clean = df.dropna(axis = 0)
df.shape
(1556, 23)
df_clean.shape
(1545, 23)
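Calling dropna(axis = 0) removes every row that has a missing value in any column. A possible alternative, not used in these notes, is to pass the subset argument so that only the columns needed for the model are checked. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in a column we use,
# and one missing value in a column we do not use.
toy = pd.DataFrame({
    "Acousticness": [0.1, np.nan, 0.9, 0.3],
    "Energy": [0.8, 0.6, 0.2, 0.7],
    "Genre": ["pop", "rock", None, "rap"],
})

# Drops every row containing at least one missing value (as in the text).
all_clean = toy.dropna(axis=0)

# Drops only rows that are missing a value in the listed columns.
subset_clean = toy.dropna(subset=["Acousticness", "Energy"])

print(all_clean.shape)     # (2, 3)
print(subset_clean.shape)  # (3, 3)
```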
reg.fit(df_clean[["Acousticness"]], df_clean["Energy"])
LinearRegression()
reg.coef_
array([-0.35010056])
reg.intercept_
0.7205632317364904
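Look at the scatter plot above. The coefficient (about -0.35) and the intercept (about 0.72) should look very reasonable: the fitted line starts at an Energy of about 0.72 when Acousticness is 0 and slopes downward as Acousticness increases.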
What are the predicted outputs?
reg.predict(df_clean[["Acousticness"]])
array([0.67610046, 0.70715438, 0.60327954, ..., 0.65614473, 0.63338819,
0.56896969])
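These predictions are simply intercept + coefficient * Acousticness. As a check, the sketch below redoes the arithmetic by hand for the first three songs, using the coefficient and intercept copied from the outputs above; the results should agree with the first three predicted values up to rounding.

```python
import numpy as np

# Values copied from the outputs above.
coef = -0.35010056
intercept = 0.7205632317364904

# Acousticness of the first three songs in the dataset.
acousticness = np.array([0.127, 0.0383, 0.335])

# A fitted linear model predicts intercept + coefficient * input.
manual_pred = intercept + coef * acousticness
print(manual_pred)
```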
Add a new column named “pred” with these predicted values. A warning (not an error) will show up because we didn’t use copy() above when dropping the rows with missing values.
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
/tmp/ipykernel_74/632255384.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_clean["pred"] = reg.predict(df_clean[["Acousticness"]])
df_clean2 = df.dropna(axis = 0).copy()
df_clean2["pred"] = reg.predict(df_clean2[["Acousticness"]])
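Calling copy() after dropna makes the result an independent DataFrame, so adding a column to it leaves the original untouched. A minimal sketch on made-up data, where a fixed slope and intercept stand in for reg.predict:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.DataFrame({
    "Acousticness": [0.1, np.nan, 0.9],
    "Energy": [0.8, 0.6, 0.2],
})

# copy() makes the result an independent DataFrame.
toy_clean = toy.dropna(axis=0).copy()

# Stand-in for reg.predict: one value per remaining row.
toy_clean["pred"] = 0.72 - 0.35 * toy_clean["Acousticness"]

print(toy_clean.columns.tolist())  # ['Acousticness', 'Energy', 'pred']
print(toy.columns.tolist())        # ['Acousticness', 'Energy']
```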
df_clean
 | Index | Highest Charting Position | Number of Times Charted | Week of Highest Charting | Song Name | Streams | Artist | Artist Followers | Song ID | Genre | ... | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | Chord | pred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 8 | 2021-07-23--2021-07-30 | Beggin' | 48,633,449 | Måneskin | 3377762.0 | 3Wrjm47oTz2sjIgck11l5e | ['indie rock italiano', 'italian pop'] | ... | 0.800 | -4.808 | 0.0504 | 0.12700 | 0.3590 | 134.002 | 211560.0 | 0.589 | B | 0.676100 |
1 | 2 | 2 | 3 | 2021-07-23--2021-07-30 | STAY (with Justin Bieber) | 47,248,719 | The Kid LAROI | 2230022.0 | 5HCyWlXZPP0y6Gqq8TgA20 | ['australian hip hop'] | ... | 0.764 | -5.484 | 0.0483 | 0.03830 | 0.1030 | 169.928 | 141806.0 | 0.478 | C#/Db | 0.707154 |
2 | 3 | 1 | 11 | 2021-06-25--2021-07-02 | good 4 u | 40,162,559 | Olivia Rodrigo | 6266514.0 | 4ZtFanR9U6ndgddUvNcjcG | ['pop'] | ... | 0.664 | -5.044 | 0.1540 | 0.33500 | 0.0849 | 166.928 | 178147.0 | 0.688 | A | 0.603280 |
3 | 4 | 3 | 5 | 2021-07-02--2021-07-09 | Bad Habits | 37,799,456 | Ed Sheeran | 83293380.0 | 6PQ88X9TkUIAUIZJHW2upE | ['pop', 'uk pop'] | ... | 0.897 | -3.712 | 0.0348 | 0.04690 | 0.3640 | 126.026 | 231041.0 | 0.591 | B | 0.704144 |
4 | 5 | 5 | 1 | 2021-07-23--2021-07-30 | INDUSTRY BABY (feat. Jack Harlow) | 33,948,454 | Lil Nas X | 5473565.0 | 27NovPIUIRrOZoCHxABJwK | ['lgbtq+ hip hop', 'pop rap'] | ... | 0.704 | -7.409 | 0.0615 | 0.02030 | 0.0501 | 149.995 | 212000.0 | 0.894 | D#/Eb | 0.713456 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1551 | 1552 | 195 | 1 | 2019-12-27--2020-01-03 | New Rules | 4,630,675 | Dua Lipa | 27167675.0 | 2ekn2ttSfGqwhhate0LSR0 | ['dance pop', 'pop', 'uk pop'] | ... | 0.700 | -6.021 | 0.0694 | 0.00261 | 0.1530 | 116.073 | 209320.0 | 0.608 | A | 0.719649 |
1552 | 1553 | 196 | 1 | 2019-12-27--2020-01-03 | Cheirosa - Ao Vivo | 4,623,030 | Jorge & Mateus | 15019109.0 | 2PWjKmjyTZeDpmOUa3a5da | ['sertanejo', 'sertanejo universitario'] | ... | 0.870 | -3.123 | 0.0851 | 0.24000 | 0.3330 | 152.370 | 181930.0 | 0.714 | B | 0.636539 |
1553 | 1554 | 197 | 1 | 2019-12-27--2020-01-03 | Havana (feat. Young Thug) | 4,620,876 | Camila Cabello | 22698747.0 | 1rfofaqEpACxVEHIZBJe6W | ['dance pop', 'electropop', 'pop', 'post-teen ... | ... | 0.523 | -4.333 | 0.0300 | 0.18400 | 0.1320 | 104.988 | 217307.0 | 0.394 | D | 0.656145 |
1554 | 1555 | 198 | 1 | 2019-12-27--2020-01-03 | Surtada - Remix Brega Funk | 4,607,385 | Dadá Boladão, Tati Zaqui, OIK | 208630.0 | 5F8ffc8KWKNawllr5WsW0r | ['brega funk', 'funk carioca'] | ... | 0.550 | -7.026 | 0.0587 | 0.24900 | 0.1820 | 154.064 | 152784.0 | 0.881 | F | 0.633388 |
1555 | 1556 | 199 | 1 | 2019-12-27--2020-01-03 | Lover (Remix) [feat. Shawn Mendes] | 4,595,450 | Taylor Swift | 42227614.0 | 3i9UVldZOE0aD0JnyfAZZ0 | ['pop', 'post-teen pop'] | ... | 0.603 | -7.176 | 0.0640 | 0.43300 | 0.0862 | 205.272 | 221307.0 | 0.422 | G | 0.568970 |
1545 rows × 24 columns
Plot the fitted values using a red mark_line in Altair. Layer the two charts on top of each other.
c1 = alt.Chart(df_clean).mark_circle().encode(
x = "Acousticness",
y = "Energy"
)
c2 = alt.Chart(df_clean).mark_line(color="red").encode(
x = "Acousticness",
y = "pred"
)
c1+c2
Linear regression works basically the same way with multiple input variables. Perform linear regression using “Acousticness”, “Speechiness”, and “Valence” as our input features, again with “Energy” as the target.
reg2 = LinearRegression()
df_clean[["Acousticness", "Speechiness", "Valence"]]
 | Acousticness | Speechiness | Valence |
---|---|---|---|
0 | 0.12700 | 0.0504 | 0.589 |
1 | 0.03830 | 0.0483 | 0.478 |
2 | 0.33500 | 0.1540 | 0.688 |
3 | 0.04690 | 0.0348 | 0.591 |
4 | 0.02030 | 0.0615 | 0.894 |
... | ... | ... | ... |
1551 | 0.00261 | 0.0694 | 0.608 |
1552 | 0.24000 | 0.0851 | 0.714 |
1553 | 0.18400 | 0.0300 | 0.394 |
1554 | 0.24900 | 0.0587 | 0.881 |
1555 | 0.43300 | 0.0640 | 0.422 |
1545 rows × 3 columns
reg2.fit(df_clean[["Acousticness", "Speechiness", "Valence"]],df_clean["Energy"])
LinearRegression()
reg2.coef_
array([-0.33557117, -0.08205736, 0.21893852])
Linear regression is not the fanciest machine learning model, but it is probably the most interpretable model. For example, the coefficients above suggest that, with the other two features held fixed, higher Acousticness generally corresponds to less energy, while higher Valence corresponds to more energy.
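To make the interpretation concrete: with several features, each prediction is the intercept plus the sum of each coefficient times the corresponding feature value. The sketch below checks this on synthetic data with invented column names and known coefficients (no noise), so the fitted values can be compared with the ones used to build the data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data with known coefficients and no noise.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, 3)), columns=["a", "b", "c"])
toy["y"] = 1.0 - 2.0 * toy["a"] + 0.5 * toy["b"] + 3.0 * toy["c"]

model = LinearRegression()
model.fit(toy[["a", "b", "c"]], toy["y"])

# Each prediction is intercept + (coefficients dotted with the feature values).
manual = model.intercept_ + toy[["a", "b", "c"]].to_numpy() @ model.coef_

print(model.coef_)       # approximately [-2.   0.5  3. ]
print(model.intercept_)  # approximately 1.0
print(np.allclose(manual, model.predict(toy[["a", "b", "c"]])))
```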
df_clean.corr()
 | Index | Highest Charting Position | Number of Times Charted | Artist Followers | Popularity | Danceability | Energy | Loudness | Speechiness | Acousticness | Liveness | Tempo | Duration (ms) | Valence | pred |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Index | 1.000000 | 0.251112 | -0.360013 | 0.090887 | -0.333683 | 0.126254 | -0.018813 | -0.014321 | 0.109213 | -0.063914 | 0.028976 | 0.023622 | -0.023481 | -0.054632 | 0.063914 |
Highest Charting Position | 0.251112 | 1.000000 | -0.417748 | -0.233723 | -0.164167 | 0.017149 | 0.063026 | 0.032166 | 0.041248 | -0.012924 | 0.012718 | 0.026235 | -0.033956 | 0.045362 | 0.012924 |
Number of Times Charted | -0.360013 | -0.417748 | 1.000000 | 0.027458 | 0.232796 | 0.027026 | -0.061139 | 0.031225 | -0.060216 | 0.046651 | -0.058436 | -0.048307 | 0.033980 | 0.021570 | -0.046651 |
Artist Followers | 0.090887 | -0.233723 | 0.027458 | 1.000000 | 0.104358 | -0.097576 | -0.065613 | -0.033264 | -0.072968 | 0.023830 | -0.012491 | -0.019881 | 0.142145 | -0.108804 | -0.023830 |
Popularity | -0.333683 | -0.164167 | 0.232796 | 0.104358 | 1.000000 | 0.028435 | 0.094691 | 0.158767 | -0.032091 | -0.091245 | -0.029460 | -0.024951 | 0.082096 | -0.000953 | 0.091245 |
Danceability | 0.126254 | 0.017149 | 0.027026 | -0.097576 | 0.028435 | 1.000000 | 0.142130 | 0.234928 | 0.237394 | -0.316798 | -0.114518 | -0.040219 | -0.101390 | 0.361627 | 0.316798 |
Energy | -0.018813 | 0.063026 | -0.061139 | -0.065613 | 0.094691 | 0.142130 | 1.000000 | 0.732616 | 0.023989 | -0.542399 | 0.124693 | 0.113352 | 0.056624 | 0.356325 | 0.542399 |
Loudness | -0.014321 | 0.032166 | 0.031225 | -0.033264 | 0.158767 | 0.234928 | 0.732616 | 1.000000 | -0.018823 | -0.477431 | 0.043141 | 0.104371 | 0.075262 | 0.298762 | 0.477431 |
Speechiness | 0.109213 | 0.041248 | -0.060216 | -0.072968 | -0.032091 | 0.237394 | 0.023989 | -0.018823 | 1.000000 | -0.131436 | 0.072774 | 0.111255 | -0.089895 | 0.038032 | 0.131436 |
Acousticness | -0.063914 | -0.012924 | 0.046651 | 0.023830 | -0.091245 | -0.316798 | -0.542399 | -0.477431 | -0.131436 | 1.000000 | -0.005469 | -0.061632 | -0.046010 | -0.096997 | -1.000000 |
Liveness | 0.028976 | 0.012718 | -0.058436 | -0.012491 | -0.029460 | -0.114518 | 0.124693 | 0.043141 | 0.072774 | -0.005469 | 1.000000 | -0.018265 | 0.019685 | 0.007882 | 0.005469 |
Tempo | 0.023622 | 0.026235 | -0.048307 | -0.019881 | -0.024951 | -0.040219 | 0.113352 | 0.104371 | 0.111255 | -0.061632 | -0.018265 | 1.000000 | -0.004671 | 0.057563 | 0.061632 |
Duration (ms) | -0.023481 | -0.033956 | 0.033980 | 0.142145 | 0.082096 | -0.101390 | 0.056624 | 0.075262 | -0.089895 | -0.046010 | 0.019685 | -0.004671 | 1.000000 | -0.119981 | 0.046010 |
Valence | -0.054632 | 0.045362 | 0.021570 | -0.108804 | -0.000953 | 0.361627 | 0.356325 | 0.298762 | 0.038032 | -0.096997 | 0.007882 | 0.057563 | -0.119981 | 1.000000 | 0.096997 |
pred | 0.063914 | 0.012924 | -0.046651 | -0.023830 | 0.091245 | 0.316798 | 0.542399 | 0.477431 | 0.131436 | -1.000000 | 0.005469 | 0.061632 | 0.046010 | 0.096997 | 1.000000 |
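One thing worth noticing in this table: “pred” has correlation -1 with “Acousticness”, because pred is a linear function of Acousticness with a negative slope. As a result, every entry in the pred column is the negative of the corresponding Acousticness entry. The sketch below reproduces this on made-up data. It passes numeric_only=True, which is an addition here rather than part of the original call: the argument is available in pandas 1.5 and later, and in pandas 2.0 and later corr raises an error on non-numeric columns without it.

```python
import pandas as pd

# Hypothetical toy data, including one non-numeric column.
toy = pd.DataFrame({
    "Acousticness": [0.1, 0.4, 0.9, 0.3],
    "Energy": [0.8, 0.6, 0.2, 0.75],
    "Artist": ["a", "b", "c", "d"],
})

# A linear function of Acousticness with negative slope (stand-in for reg.predict).
toy["pred"] = 0.72 - 0.35 * toy["Acousticness"]

# numeric_only=True skips the non-numeric "Artist" column.
corr = toy.corr(numeric_only=True)
print(corr)
```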
References:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html
https://docs.python.org/3/library/copy.html
https://realpython.com/copying-python-objects/