Week 5 Videos

Unpacking

Warm-up task: Define ncol to be the number of columns in the Spotify dataset.

import pandas as pd

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()

t = df.shape

t[1]

_, ncol = df.shape

ncol

import matplotlib.pyplot as plt

t = plt.subplots()

[Figure: an empty set of Matplotlib axes]

type(t)

tuple

type(t[0])

matplotlib.figure.Figure

type(t[1])

matplotlib.axes._subplots.AxesSubplot

# tuple unpacking
fig, ax = plt.subplots()

[Figure: an empty set of Matplotlib axes]

type(fig)

matplotlib.figure.Figure

type(ax)

matplotlib.axes._subplots.AxesSubplot

for a,b in df.groupby("Artist"):
    print(a)
    display(b)
    break

*NSYNC

	Index	Highest Charting Position	Number of Times Charted	Week of Highest Charting	Song Name	Streams	Artist	Artist Followers	Song ID	Genre	...	Danceability	Energy	Loudness	Speechiness	Acousticness	Liveness	Tempo	Duration (ms)	Valence	Chord
690	691	184	1	2020-12-18--2020-12-25	Merry Christmas, Happy Holidays	6,635,128	*NSYNC	1564750.0	4v9WbaxW8HdjqfUiWYWsII	['boy band', 'dance pop', 'pop', 'post-teen pop']	...	0.643	0.939	-3.967	0.0463	0.104	0.881	104.999	255307.0	0.756	F

1 rows × 23 columns
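The loop above works because iterating over a pandas groupby object yields (key, sub-DataFrame) pairs, which can be unpacked directly in the for statement. Here is a small sketch with made-up toy data (the artists "A", "B", "C" and the stream counts are hypothetical, not from the Spotify dataset).

```python
import pandas as pd

# Toy data (hypothetical values, not from the Spotify dataset).
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C"],
    "Streams": [10, 20, 30, 40],
})

# Each item yielded by groupby is a (key, sub-DataFrame) tuple,
# so it can be unpacked into two names in the for statement.
sizes = {}
for artist, sub_df in toy.groupby("Artist"):
    sizes[artist] = len(sub_df)

print(sizes)
```

Each sub-DataFrame contains only the rows for that key, which is why the first group displayed above has a single row.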

Generating data for linear regression

from sklearn.datasets import make_regression

help(make_regression)

Help on function make_regression in module sklearn.datasets._samples_generator:

make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)
    Generate a random regression problem.
    
    The input set can either be well conditioned (by default) or have a low
    rank-fat tail singular profile. See :func:`make_low_rank_matrix` for
    more details.
    
    The output is generated by applying a (potentially biased) random linear
    regression model with `n_informative` nonzero regressors to the previously
    generated input and some gaussian centered noise with some adjustable
    scale.
    
    Read more in the :ref:`User Guide <sample_generators>`.
    
    Parameters
    ----------
    n_samples : int, default=100
        The number of samples.
    
    n_features : int, default=100
        The number of features.
    
    n_informative : int, default=10
        The number of informative features, i.e., the number of features used
        to build the linear model used to generate the output.
    
    n_targets : int, default=1
        The number of regression targets, i.e., the dimension of the y output
        vector associated with a sample. By default, the output is a scalar.
    
    bias : float, default=0.0
        The bias term in the underlying linear model.
    
    effective_rank : int, default=None
        if not None:
            The approximate number of singular vectors required to explain most
            of the input data by linear combinations. Using this kind of
            singular spectrum in the input allows the generator to reproduce
            the correlations often observed in practice.
        if None:
            The input set is well conditioned, centered and gaussian with
            unit variance.
    
    tail_strength : float, default=0.5
        The relative importance of the fat noisy tail of the singular values
        profile if `effective_rank` is not None. When a float, it should be
        between 0 and 1.
    
    noise : float, default=0.0
        The standard deviation of the gaussian noise applied to the output.
    
    shuffle : bool, default=True
        Shuffle the samples and the features.
    
    coef : bool, default=False
        If True, the coefficients of the underlying linear model are returned.
    
    random_state : int, RandomState instance or None, default=None
        Determines random number generation for dataset creation. Pass an int
        for reproducible output across multiple function calls.
        See :term:`Glossary <random_state>`.
    
    Returns
    -------
    X : ndarray of shape (n_samples, n_features)
        The input samples.
    
    y : ndarray of shape (n_samples,) or (n_samples, n_targets)
        The output values.
    
    coef : ndarray of shape (n_features,) or (n_features, n_targets)
        The coefficient of the underlying linear model. It is returned only if
        coef is True.

t = make_regression(n_features=1)

type(t)

tuple

len(t)

2

type(t[0])

numpy.ndarray

t[0].shape

(100, 1)

type(t[1])

numpy.ndarray

t[1].shape

(100,)

fig, ax = plt.subplots()
ax.plot(t[0], t[1])

[<matplotlib.lines.Line2D at 0x7f8e59a4eb10>]

[Figure: line plot of t[1] against t[0]]

fig, ax = plt.subplots()
ax.scatter(t[0], t[1])

<matplotlib.collections.PathCollection at 0x7f8e59a8d610>

[Figure: scatter plot of t[1] against t[0]]

X,y = make_regression(n_samples = 10, n_features=1, coef=True)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_90918/737216033.py in <module>
----> 1 X,y = make_regression(n_samples = 10, n_features=1, coef=True)

ValueError: too many values to unpack (expected 2)

X,y,m = make_regression(n_samples = 10, n_features=1, coef=True)

m

array(37.61300837)
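With coef=True, make_regression returns three values (X, y, and the coefficient), which is why unpacking into two names raised the ValueError above. The following sketch checks what the third value means: with the default noise=0.0 and bias=0.0, every point should lie on the line through the origin with that slope. The random_state=0 argument is added here only for reproducibility and is not part of the original call.

```python
import numpy as np
from sklearn.datasets import make_regression

# random_state is fixed only so this sketch is reproducible (an assumption).
X, y, m = make_regression(n_samples=10, n_features=1, coef=True, random_state=0)

# X is 2D with one column, y is 1D.
print(X.shape, y.shape)

# With noise=0.0 and bias=0.0 (the defaults), y should equal m * x exactly
# up to floating-point error.
on_line = np.allclose(y, m * X[:, 0])
print(on_line)
```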

fig, ax = plt.subplots()
ax.scatter(X,y)

<matplotlib.collections.PathCollection at 0x7f8e59a8d590>

[Figure: scatter plot of y against X]

Linear regression using scikit-learn

X,y,m = make_regression(n_samples = 10, n_features=1, coef=True, noise=5, bias=-17.4)

fig, ax = plt.subplots()
ax.scatter(X,y)

<matplotlib.collections.PathCollection at 0x7f8e4b0f75d0>

[Figure: scatter plot of y against X]

from sklearn.linear_model import LinearRegression

reg = LinearRegression()

type(reg)

sklearn.linear_model._base.LinearRegression

reg.fit(X,y)

LinearRegression()

reg.coef_

array([80.74305618])

m

array(80.45838668)

reg.intercept_

-18.956015204674802
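The fitted intercept (about -18.96) is near the bias=-17.4 used to generate the data but does not equal it, presumably because noise=5 perturbs the points. As a sanity check, the sketch below repeats the experiment with noise=0, where the fit should recover the generating slope and bias up to floating-point error. Both noise=0 and random_state=0 are choices made for this sketch, not the values used above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# noise=0 and random_state=0 are assumptions made for this illustration.
X, y, m = make_regression(
    n_samples=10, n_features=1, coef=True, bias=-17.4, noise=0, random_state=0
)

reg = LinearRegression()
reg.fit(X, y)

# Without noise, the fitted slope and intercept should match the
# true coefficient m and the bias used to generate the data.
slope_ok = np.allclose(reg.coef_[0], m)
intercept_ok = np.allclose(reg.intercept_, -17.4)
print(slope_ok, intercept_ok)
```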

pred = reg.predict(X)

pred

array([ -64.36020915,  -71.53057174,   77.7462882 ,  -38.87491328,
          9.15078824,   -1.67347246,  101.57870959,   26.11546954,
       -119.41690634,   30.87178162])

fig, ax = plt.subplots()
ax.scatter(X,y)
ax.plot(X,pred)

[<matplotlib.lines.Line2D at 0x7f8e4b37aa50>]

[Figure: scatter plot of y against X, with the line through the predicted values]
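The values returned by reg.predict are the heights of the fitted line at each input, which is why plotting them with ax.plot draws a straight line. The sketch below checks this by recomputing the predictions by hand from coef_ and intercept_. The data are generated with NumPy under made-up parameters (slope 3.0, intercept -1.0, noise scale 0.5), and the name by_hand is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration: a noisy line with slope 3 and intercept -1.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 1))
y = 3.0 * X[:, 0] - 1.0 + rng.normal(scale=0.5, size=10)

reg = LinearRegression()
reg.fit(X, y)
pred = reg.predict(X)

# Recompute the predictions directly from the fitted slope and intercept.
by_hand = reg.coef_[0] * X[:, 0] + reg.intercept_
same = np.allclose(pred, by_hand)
print(same)
```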

Linear regression with a real dataset

Find the line of best fit for “Acousticness” vs “Energy” from the Spotify dataset.

df = pd.read_csv("../data/spotify_dataset.csv", na_values=" ").dropna()

import altair as alt

alt.Chart(df).mark_circle().encode(
    x="Acousticness",
    y="Energy"
)

reg = LinearRegression()

reg.fit(df["Acousticness"],df["Energy"])

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/8j/gshrlmtn7dg4qtztj4d4t_w40000gn/T/ipykernel_90918/2831094536.py in <module>
----> 1 reg.fit(df["Acousticness"],df["Energy"])

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/linear_model/_base.py in fit(self, X, y, sample_weight)
    661 
    662         X, y = self._validate_data(
--> 663             X, y, accept_sparse=accept_sparse, y_numeric=True, multi_output=True
    664         )
    665 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
    579                 y = check_array(y, **check_y_params)
    580             else:
--> 581                 X, y = check_X_y(X, y, **check_params)
    582             out = X, y
    583 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, estimator)
    974         ensure_min_samples=ensure_min_samples,
    975         ensure_min_features=ensure_min_features,
--> 976         estimator=estimator,
    977     )
    978 

~/miniconda3/envs/math10s22/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    771                     "Reshape your data either using array.reshape(-1, 1) if "
    772                     "your data has a single feature or array.reshape(1, -1) "
--> 773                     "if it contains a single sample.".format(array)
    774                 )
    775 

ValueError: Expected 2D array, got 1D array instead:
array=[0.127  0.0383 0.335  ... 0.184  0.249  0.433 ].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

df["Acousticness"]

0       0.12700
1       0.03830
2       0.33500
3       0.04690
4       0.02030
         ...
1551    0.00261
1552    0.24000
1553    0.18400
1554    0.24900
1555    0.43300
Name: Acousticness, Length: 1545, dtype: float64

df[["Acousticness"]]

	Acousticness
0	0.12700
1	0.03830
2	0.33500
3	0.04690
4	0.02030
...	...
1551	0.00261
1552	0.24000
1553	0.18400
1554	0.24900
1555	0.43300

1545 rows × 1 columns
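The difference between the two displays above is the number of dimensions: single brackets give a one-dimensional Series, while double brackets give a two-dimensional DataFrame with one column, which is the shape scikit-learn expects for its input features. A minimal sketch with a three-row toy DataFrame (hypothetical values) makes the shapes explicit, and also shows the reshape(-1, 1) route suggested by the error message.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Spotify dataset).
toy = pd.DataFrame({
    "Acousticness": [0.1, 0.4, 0.8],
    "Energy": [0.7, 0.6, 0.4],
})

series = toy["Acousticness"]    # single brackets: 1D Series
frame = toy[["Acousticness"]]   # double brackets: 2D one-column DataFrame

print(series.shape, frame.shape)

# The alternative suggested by the error message: reshape a 1D array
# into a 2D array with a single column.
arr = series.to_numpy().reshape(-1, 1)
print(arr.shape)
```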

reg.fit(df[["Acousticness"]],df["Energy"])

LinearRegression()

reg.intercept_

0.7205632317364903

reg.coef_

array([-0.35010056])

The line of best fit is approximately \(y = -0.35\cdot x + 0.72\), where \(x\) is Acousticness and \(y\) is the predicted Energy.
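As a rough illustration of how the line \(y = -0.35\cdot x + 0.72\) could be used, the sketch below evaluates it at a few Acousticness values. It uses the rounded coefficients rather than the fitted reg object, and predict_energy is a hypothetical helper name introduced only for this example.

```python
# Rounded slope and intercept taken from reg.coef_ and reg.intercept_ above.
slope, intercept = -0.35, 0.72

def predict_energy(acousticness):
    """Predicted Energy for a given Acousticness under the rounded line."""
    return slope * acousticness + intercept

# Evaluate the line at a few sample Acousticness values.
for x in (0.0, 0.5, 1.0):
    print(x, round(predict_energy(x), 3))
```

The negative slope means that, under this model, songs with higher Acousticness are predicted to have lower Energy.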
