Week 5 Videos

Unpacking

  • Warm-up task: Define ncol to be the number of columns in the Spotify dataset.

import pandas as pd
df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()
t = df.shape
t[1]
23
_, ncol = df.shape
ncol
23
_
1545
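(The 1545 displayed above is the number of rows of the dataset, which was assigned to the throwaway name _.) As a quick aside, not from the video, here is a minimal sketch of a few other common unpacking patterns in plain Python; the variable names are purely illustrative.

# A 2-tuple unpacks into two names, just like df.shape above.
point = (3, 4)
x0, y0 = point        # x0 is 3, y0 is 4

# Extended unpacking with * collects the leftover values into a list.
first, *rest = [10, 20, 30, 40]   # first is 10, rest is [20, 30, 40]

# Swapping two variables uses the same mechanism.
a, b = 1, 2
a, b = b, a           # now a is 2 and b is 1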
import matplotlib.pyplot as plt
t = plt.subplots()
(figure output: an empty Figure with one blank Axes)
type(t)
tuple
type(t[0])
matplotlib.figure.Figure
type(t[1])
matplotlib.axes._subplots.AxesSubplot
# tuple unpacking
fig, ax = plt.subplots()
(figure output: an empty Figure with one blank Axes)
type(fig)
matplotlib.figure.Figure
type(ax)
matplotlib.axes._subplots.AxesSubplot
# groupby yields one (group name, sub-DataFrame) pair per artist;
# tuple unpacking assigns the name to a and the sub-DataFrame to b.
for a, b in df.groupby("Artist"):
    print(a)
    display(b)
    break
*NSYNC
(displayed: a 1 rows × 23 columns DataFrame — the single *NSYNC song, "Merry Christmas, Happy Holidays")
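As another hedged sketch, not from the video, the same (name, sub-DataFrame) unpacking can be used to build something from every group rather than just inspecting the first one; the dictionary name song_counts is made up for illustration.

# Count how many songs each artist has, using the same unpacking pattern.
song_counts = {}
for artist, sub_df in df.groupby("Artist"):
    song_counts[artist] = len(sub_df)   # number of rows for this artist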

Generating data for linear regression

from sklearn.datasets import make_regression
help(make_regression)
Help on function make_regression in module sklearn.datasets._samples_generator:

make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)
    Generate a random regression problem.
    
    The input set can either be well conditioned (by default) or have a low
    rank-fat tail singular profile. See :func:`make_low_rank_matrix` for
    more details.
    
    The output is generated by applying a (potentially biased) random linear
    regression model with `n_informative` nonzero regressors to the previously
    generated input and some gaussian centered noise with some adjustable
    scale.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int, default=100
        The number of samples.
    
    n_features : int, default=100
        The number of features.
    
    n_informative : int, default=10
        The number of informative features, i.e., the number of features used
        to build the linear model used to generate the output.
    
    n_targets : int, default=1
        The number of regression targets, i.e., the dimension of the y output
        vector associated with a sample. By default, the output is a scalar.
    
    bias : float, default=0.0
        The bias term in the underlying linear model.
    
    effective_rank : int, default=None
        if not None:
            The approximate number of singular vectors required to explain most
            of the input data by linear combinations. Using this kind of
            singular spectrum in the input allows the generator to reproduce
            the correlations often observed in practice.
        if None:
            The input set is well conditioned, centered and gaussian with
            unit variance.
    
    tail_strength : float, default=0.5
        The relative importance of the fat noisy tail of the singular values
        profile if `effective_rank` is not None. When a float, it should be
        between 0 and 1.
    
    noise : float, default=0.0
        The standard deviation of the gaussian noise applied to the output.
    
    shuffle : bool, default=True
        Shuffle the samples and the features.
    
    coef : bool, default=False
        If True, the coefficients of the underlying linear model are returned.
    
    random_state : int, RandomState instance or None, default=None
        Determines random number generation for dataset creation. Pass an int
        for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    
    Returns
    -------
    X : ndarray of shape (n_samples, n_features)
        The input samples.
    
    y : ndarray of shape (n_samples,) or (n_samples, n_targets)
        The output values.
    
    coef : ndarray of shape (n_features,) or (n_features, n_targets)
        The coefficient of the underlying linear model. It is returned only if
        coef is True.
t = make_regression(n_features=1)
type(t)
tuple
len(t)
2
type(t[0])
numpy.ndarray
t[0].shape
(100, 1)
type(t[1])
numpy.ndarray
t[1].shape
(100,)
fig, ax = plt.subplots()
ax.plot(t[0], t[1])
[<matplotlib.lines.Line2D at 0x7f8e59a4eb10>]
(figure output: line plot of t[1] against t[0])
fig, ax = plt.subplots()
ax.scatter(t[0], t[1])
<matplotlib.collections.PathCollection at 0x7f8e59a8d610>
(figure output: scatter plot of t[1] against t[0])
X,y = make_regression(n_samples = 10, n_features=1, coef=True)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_90918/737216033.py in <module>
----> 1 X,y = make_regression(n_samples = 10, n_features=1, coef=True)

ValueError: too many values to unpack (expected 2)
X,y,m = make_regression(n_samples = 10, n_features=1, coef=True)
m
array(37.61300837)
fig, ax = plt.subplots()
ax.scatter(X,y)
<matplotlib.collections.PathCollection at 0x7f8e59a8d590>
(figure output: scatter plot of y against X)
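Since noise and bias were left at their defaults of 0, every point should lie exactly on the line through the origin with slope m. A minimal check of that, not from the video:

import numpy as np
# With the defaults noise=0 and bias=0, y is exactly m times each x value.
np.allclose(y, m * X.flatten())   # expected: True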

Linear regression using scikit-learn

X,y,m = make_regression(n_samples = 10, n_features=1, coef=True, noise=5, bias=-17.4)
fig, ax = plt.subplots()
ax.scatter(X,y)
<matplotlib.collections.PathCollection at 0x7f8e4b0f75d0>
(figure output: scatter plot of y against X)
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
type(reg)
sklearn.linear_model._base.LinearRegression
reg.fit(X,y)
LinearRegression()
reg.coef_
array([80.74305618])
m
array(80.45838668)
reg.intercept_
-18.956015204674802
pred = reg.predict(X)
pred
array([ -64.36020915,  -71.53057174,   77.7462882 ,  -38.87491328,
          9.15078824,   -1.67347246,  101.57870959,   26.11546954,
       -119.41690634,   30.87178162])
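As a hedged sanity check, not in the video, predict is just the fitted line applied to each x value:

import numpy as np
# reg.predict(X) should equal coef * x + intercept for every sample.
np.allclose(pred, reg.coef_[0] * X.flatten() + reg.intercept_)   # expected: True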
fig, ax = plt.subplots()
ax.scatter(X,y)
ax.plot(X,pred)
[<matplotlib.lines.Line2D at 0x7f8e4b37aa50>]
(figure output: scatter plot of y against X with the fitted line drawn through it)
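As a cross-check, not from the video, the same slope and intercept can be recovered with NumPy's polyfit; this is only a sketch of an alternative, not what the course uses.

import numpy as np
# A degree-1 polynomial fit returns the coefficients (slope, intercept),
# which we unpack with the same tuple-unpacking idea from above.
slope, intercept = np.polyfit(X.flatten(), y, 1)
slope, intercept   # should be close to reg.coef_[0] and reg.intercept_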

Linear regression with a real dataset

  • Find the line of best fit for “Acousticness” vs “Energy” from the Spotify dataset.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()
import altair as alt
alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy"
)
reg = LinearRegression()
reg.fit(df["Acousticness"],df["Energy"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_90918/2831094536.py in <module>
----> 1 reg.fit(df["Acousticness"],df["Energy"])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    661 
    662         X, y = self._validate_data(
--> 663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )
    665 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    974         ensure_min_samples=ensure_min_samples,
    975         ensure_min_features=ensure_min_features,
--> 976         estimator=estimator,
    977     )
    978 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    771                     "Reshape your data either using array.reshape(-1, 1) if "
    772                     "your data has a single feature or array.reshape(1, -1) "
--> 773                     "if it contains a single sample.".format(array)
    774                 )
    775 

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
df["Acousticness"]
0       0.12700
1       0.03830
2       0.33500
3       0.04690
4       0.02030
         ...   
1551    0.00261
1552    0.24000
1553    0.18400
1554    0.24900
1555    0.43300
Name: Acousticness, Length: 1545, dtype: float64
df[["Acousticness"]]
      Acousticness
0          0.12700
1          0.03830
2          0.33500
3          0.04690
4          0.02030
...            ...
1551       0.00261
1552       0.24000
1553       0.18400
1554       0.24900
1555       0.43300

1545 rows × 1 columns
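The error message above also suggests reshape(-1, 1); here is a hedged sketch of that alternative (the double-bracket DataFrame used below accomplishes the same thing):

# Alternative fix: reshape the 1D values into a 2D array with one column.
X_2d = df["Acousticness"].values.reshape(-1, 1)
X_2d.shape   # (1545, 1)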

reg.fit(df[["Acousticness"]],df["Energy"])
LinearRegression()
reg.intercept_
0.7205632317364903
reg.coef_
array([-0.35010056])

So the fitted line predicts \(y \approx -0.35\cdot x + 0.72\), where \(x\) is Acousticness and \(y\) is Energy.
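To see this fit the way we drew the matplotlib version earlier, one possible sketch, not from the video, adds a prediction column and layers a line chart on top of the scatter plot; the column name "Pred" is made up here.

# Sketch: draw the fitted line on top of the Altair scatter plot.
df["Pred"] = reg.predict(df[["Acousticness"]])
alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy"
) + alt.Chart(df).mark_line(color="red").encode(
    x="Acousticness",
    y="Pred"
)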