Week 8 Wednesday

Yuja recording

Before the recording, we went over at the board various components related to neural networks and PyTorch, and in particular we worked through an example of performing gradient descent.

The goal of today’s class is to get more comfortable with the various components involved in building and training a neural network using PyTorch.

import numpy as np
import torch
from torch import nn
from torchvision.transforms import ToTensor

Gradient descent

Gradient descent can be used to try to find a minimum of any differentiable function. (Often it will only find a local minimum, not a global minimum, even if a global minimum exists.) We usually use gradient descent for very complicated functions, but here we give an example of performing gradient descent to attempt to find a minimum of the function

\[ f(x,y) = (x-3)^2 + (y+2)^2 + 8. \]
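For later reference, the gradient of this function is

\[ \nabla f(x,y) = \bigl(2(x-3),\, 2(y+2)\bigr), \]

so, for example, the gradient at the point (10,10) is (14,24).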

We call this function loss_fn so that the syntax is the same as what we’re used to in PyTorch.

loss_fn = lambda t: (t[0] - 3)**2 + (t[1] + 2)**2 + 8 

To perform gradient descent, you need to begin with an initial guess. We guess (10,10) and then gradually adjust it, hoping to move towards a minimum. Notice the decimal point after the first 10; this is a shortcut for telling PyTorch that these values should be treated as floats.

a = torch.tensor([10.,10], requires_grad=True)
a
tensor([10., 10.], requires_grad=True)
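As a quick aside (an added check, not from the recording): the decimal point is what makes this a floating-point tensor, and requires_grad only works for floating-point (and complex) tensors.

torch.tensor([10, 10]).dtype    # torch.int64, because there is no decimal point
torch.tensor([10., 10]).dtype   # torch.float32
# torch.tensor([10, 10], requires_grad=True) would raise a RuntimeError,
# since integer tensors can't require gradients.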
loss_fn([10,10])
201
loss_fn(a)
tensor(201., grad_fn=<AddBackward0>)
type(loss_fn)
function

Because we specified requires_grad=True as a keyword argument, we will be able to find gradients of computations involving a. There isn’t any gradient yet because we haven’t computed one, so a.grad in the next cell is None (which is why no output is displayed).

a.grad

Here we define a stochastic gradient descent optimizer, as usual in PyTorch. The first input is usually something like model.parameters(). Here we try to use a itself as the first argument. That is almost right, but we need to put it in a list (or some other type of iterable).

optimizer = torch.optim.SGD(a, lr = 0.1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [8], in <module>
----> 1 optimizer = torch.optim.SGD(a, lr = 0.1)

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/torch/optim/sgd.py:95, in SGD.__init__(self, params, lr, momentum, dampening, weight_decay, nesterov)
     93 if nesterov and (momentum <= 0 or dampening != 0):
     94     raise ValueError("Nesterov momentum requires a momentum and zero dampening")
---> 95 super(SGD, self).__init__(params, defaults)

File ~/miniconda3/envs/torch/lib/python3.8/site-packages/torch/optim/optimizer.py:40, in Optimizer.__init__(self, params, defaults)
     37 self._hook_for_profile()
     39 if isinstance(params, torch.Tensor):
---> 40     raise TypeError("params argument given to the optimizer should be "
     41                     "an iterable of Tensors or dicts, but got " +
     42                     torch.typename(params))
     44 self.state = defaultdict(dict)
     45 self.param_groups = []

TypeError: params argument given to the optimizer should be an iterable of Tensors or dicts, but got torch.FloatTensor
optimizer = torch.optim.SGD([a], lr = 0.1)
loss = loss_fn(a)

This next optimizer.zero_grad() is not important yet, but it is good to get in the habit of calling it, because otherwise gradients from multiple computations will accumulate, and we want to start fresh each time. (There is a short illustration of this accumulation below, right after the gradient computation.)

optimizer.zero_grad()
type(loss)
torch.Tensor

Next we compute the gradient. This typically uses an algorithm called backpropagation, which is where the name backward comes from.

loss.backward()
a
tensor([10., 10.], requires_grad=True)

Now the grad attribute of a has a value. You should be able to compute this value by hand in this case, since our loss_fn is so simple.

a.grad
tensor([14., 24.])
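As an aside (an added illustration, not from the recording), here is the accumulation that optimizer.zero_grad() is meant to prevent. If we compute the loss and call backward() twice without zeroing in between, the second gradient gets added onto the stored one.

# b is a fresh copy of the same starting point, used only for this illustration
b = torch.tensor([10., 10], requires_grad=True)
loss_fn(b).backward()
b.grad    # tensor([14., 24.]), the same gradient as for a
loss_fn(b).backward()    # no zero_grad() in between
b.grad    # now tensor([28., 48.]): the new gradient was added onto the old one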

Now optimizer.step() adds a multiple (the learning rate lr) of the negative gradient to a. Again, you should be able to compute this by hand in this case. The formula is

\[ a \leadsto a - \mathrm{lr} \cdot \nabla f(a). \]
optimizer.step()
a
tensor([8.6000, 7.6000], requires_grad=True)
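As a check, this matches the hand computation:

\[ a - \mathrm{lr} \cdot \nabla f(a) = (10,10) - 0.1\,(14,24) = (8.6,\ 7.6). \]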

In the loop below, notice how the value of a approaches the minimizer (3,-2), and how loss approaches the minimum value of our loss_fn, which is 8. (The only reason we’re using the terms loss and loss_fn is that those are the terms we usually use in PyTorch. In this case, loss_fn is just an ordinary two-variable function, like in Math 2D, which we are trying to minimize.)

epochs = 20
a = torch.tensor([10.,10], requires_grad=True)
optimizer = torch.optim.SGD([a], lr = 0.1)
for i in range(epochs):
    loss = loss_fn(a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Epoch " + str(i))
    print(a)
    print(loss)
    print("")
Epoch 0
tensor([8.6000, 7.6000], requires_grad=True)
tensor(201., grad_fn=<AddBackward0>)

Epoch 1
tensor([7.4800, 5.6800], requires_grad=True)
tensor(131.5200, grad_fn=<AddBackward0>)

Epoch 2
tensor([6.5840, 4.1440], requires_grad=True)
tensor(87.0528, grad_fn=<AddBackward0>)

Epoch 3
tensor([5.8672, 2.9152], requires_grad=True)
tensor(58.5938, grad_fn=<AddBackward0>)

Epoch 4
tensor([5.2938, 1.9322], requires_grad=True)
tensor(40.3800, grad_fn=<AddBackward0>)

Epoch 5
tensor([4.8350, 1.1457], requires_grad=True)
tensor(28.7232, grad_fn=<AddBackward0>)

Epoch 6
tensor([4.4680, 0.5166], requires_grad=True)
tensor(21.2629, grad_fn=<AddBackward0>)

Epoch 7
tensor([4.1744, 0.0133], requires_grad=True)
tensor(16.4882, grad_fn=<AddBackward0>)

Epoch 8
tensor([ 3.9395, -0.3894], requires_grad=True)
tensor(13.4325, grad_fn=<AddBackward0>)

Epoch 9
tensor([ 3.7516, -0.7115], requires_grad=True)
tensor(11.4768, grad_fn=<AddBackward0>)

Epoch 10
tensor([ 3.6013, -0.9692], requires_grad=True)
tensor(10.2251, grad_fn=<AddBackward0>)

Epoch 11
tensor([ 3.4810, -1.1754], requires_grad=True)
tensor(9.4241, grad_fn=<AddBackward0>)

Epoch 12
tensor([ 3.3848, -1.3403], requires_grad=True)
tensor(8.9114, grad_fn=<AddBackward0>)

Epoch 13
tensor([ 3.3079, -1.4722], requires_grad=True)
tensor(8.5833, grad_fn=<AddBackward0>)

Epoch 14
tensor([ 3.2463, -1.5778], requires_grad=True)
tensor(8.3733, grad_fn=<AddBackward0>)

Epoch 15
tensor([ 3.1970, -1.6622], requires_grad=True)
tensor(8.2389, grad_fn=<AddBackward0>)

Epoch 16
tensor([ 3.1576, -1.7298], requires_grad=True)
tensor(8.1529, grad_fn=<AddBackward0>)

Epoch 17
tensor([ 3.1261, -1.7838], requires_grad=True)
tensor(8.0979, grad_fn=<AddBackward0>)

Epoch 18
tensor([ 3.1009, -1.8271], requires_grad=True)
tensor(8.0626, grad_fn=<AddBackward0>)

Epoch 19
tensor([ 3.0807, -1.8616], requires_grad=True)
tensor(8.0401, grad_fn=<AddBackward0>)
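As a side calculation (not from the recording, but it follows directly from the formula for f): each gradient step sends the x-coordinate to x - 0.1*2(x-3) = 3 + 0.8(x-3), so every epoch multiplies the distance to the minimizer by 1 - 2*lr = 0.8 in each coordinate. After 20 epochs, the x-coordinate should be about

\[ 3 + 7 \cdot 0.8^{20} \approx 3.08, \]

which matches the last output above.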

If we want a to approach the minimizer (3,-2) faster, we can make the learning rate bigger, but here is an example of what can go wrong if we make the learning rate too big.

epochs = 20
a = torch.tensor([10.,10], requires_grad=True)
optimizer = torch.optim.SGD([a], lr = 10)
for i in range(epochs):
    loss = loss_fn(a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Epoch " + str(i))
    print(a)
    print(loss)
    print("")
Epoch 0
tensor([-130., -230.], requires_grad=True)
tensor(201., grad_fn=<AddBackward0>)

Epoch 1
tensor([2530., 4330.], requires_grad=True)
tensor(69681., grad_fn=<AddBackward0>)

Epoch 2
tensor([-48010., -82310.], requires_grad=True)
tensor(25151960., grad_fn=<AddBackward0>)

Epoch 3
tensor([ 912250., 1563850.], requires_grad=True)
tensor(9.0799e+09, grad_fn=<AddBackward0>)

Epoch 4
tensor([-17332690., -29713190.], requires_grad=True)
tensor(3.2778e+12, grad_fn=<AddBackward0>)

Epoch 5
tensor([3.2932e+08, 5.6455e+08], requires_grad=True)
tensor(1.1833e+15, grad_fn=<AddBackward0>)

Epoch 6
tensor([-6.2571e+09, -1.0726e+10], requires_grad=True)
tensor(4.2717e+17, grad_fn=<AddBackward0>)

Epoch 7
tensor([1.1888e+11, 2.0380e+11], requires_grad=True)
tensor(1.5421e+20, grad_fn=<AddBackward0>)

Epoch 8
tensor([-2.2588e+12, -3.8723e+12], requires_grad=True)
tensor(5.5669e+22, grad_fn=<AddBackward0>)

Epoch 9
tensor([4.2917e+13, 7.3573e+13], requires_grad=True)
tensor(2.0097e+25, grad_fn=<AddBackward0>)

Epoch 10
tensor([-8.1543e+14, -1.3979e+15], requires_grad=True)
tensor(7.2549e+27, grad_fn=<AddBackward0>)

Epoch 11
tensor([1.5493e+16, 2.6560e+16], requires_grad=True)
tensor(2.6190e+30, grad_fn=<AddBackward0>)

Epoch 12
tensor([-2.9437e+17, -5.0464e+17], requires_grad=True)
tensor(9.4546e+32, grad_fn=<AddBackward0>)

Epoch 13
tensor([5.5930e+18, 9.5881e+18], requires_grad=True)
tensor(3.4131e+35, grad_fn=<AddBackward0>)

Epoch 14
tensor([-1.0627e+20, -1.8217e+20], requires_grad=True)
tensor(1.2321e+38, grad_fn=<AddBackward0>)

Epoch 15
tensor([2.0191e+21, 3.4613e+21], requires_grad=True)
tensor(inf, grad_fn=<AddBackward0>)

Epoch 16
tensor([-3.8363e+22, -6.5765e+22], requires_grad=True)
tensor(inf, grad_fn=<AddBackward0>)

Epoch 17
tensor([7.2889e+23, 1.2495e+24], requires_grad=True)
tensor(inf, grad_fn=<AddBackward0>)

Epoch 18
tensor([-1.3849e+25, -2.3741e+25], requires_grad=True)
tensor(inf, grad_fn=<AddBackward0>)

Epoch 19
tensor([2.6313e+26, 4.5108e+26], requires_grad=True)
tensor(inf, grad_fn=<AddBackward0>)
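Why does lr = 10 diverge? By the same side calculation as above, each step multiplies the distance to the minimizer by 1 - 2*lr, which here is -19. So every epoch, a ends up about 19 times farther from (3,-2) and on the opposite side of it, which is exactly the alternating, exploding pattern in the output. For this particular loss_fn, gradient descent converges only when |1 - 2*lr| < 1, i.e., when 0 < lr < 1.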

Here is an example with what seems to be a good choice of lr.

epochs = 20
a = torch.tensor([10.,10], requires_grad=True)
optimizer = torch.optim.SGD([a], lr = 0.25)
for i in range(epochs):
    loss = loss_fn(a)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Epoch " + str(i))
    print(a)
    print(loss)
    print("")
Epoch 0
tensor([6.5000, 4.0000], requires_grad=True)
tensor(201., grad_fn=<AddBackward0>)

Epoch 1
tensor([4.7500, 1.0000], requires_grad=True)
tensor(56.2500, grad_fn=<AddBackward0>)

Epoch 2
tensor([ 3.8750, -0.5000], requires_grad=True)
tensor(20.0625, grad_fn=<AddBackward0>)

Epoch 3
tensor([ 3.4375, -1.2500], requires_grad=True)
tensor(11.0156, grad_fn=<AddBackward0>)

Epoch 4
tensor([ 3.2188, -1.6250], requires_grad=True)
tensor(8.7539, grad_fn=<AddBackward0>)

Epoch 5
tensor([ 3.1094, -1.8125], requires_grad=True)
tensor(8.1885, grad_fn=<AddBackward0>)

Epoch 6
tensor([ 3.0547, -1.9062], requires_grad=True)
tensor(8.0471, grad_fn=<AddBackward0>)

Epoch 7
tensor([ 3.0273, -1.9531], requires_grad=True)
tensor(8.0118, grad_fn=<AddBackward0>)

Epoch 8
tensor([ 3.0137, -1.9766], requires_grad=True)
tensor(8.0029, grad_fn=<AddBackward0>)

Epoch 9
tensor([ 3.0068, -1.9883], requires_grad=True)
tensor(8.0007, grad_fn=<AddBackward0>)

Epoch 10
tensor([ 3.0034, -1.9941], requires_grad=True)
tensor(8.0002, grad_fn=<AddBackward0>)

Epoch 11
tensor([ 3.0017, -1.9971], requires_grad=True)
tensor(8.0000, grad_fn=<AddBackward0>)

Epoch 12
tensor([ 3.0009, -1.9985], requires_grad=True)
tensor(8.0000, grad_fn=<AddBackward0>)

Epoch 13
tensor([ 3.0004, -1.9993], requires_grad=True)
tensor(8.0000, grad_fn=<AddBackward0>)

Epoch 14
tensor([ 3.0002, -1.9996], requires_grad=True)
tensor(8.0000, grad_fn=<AddBackward0>)

Epoch 15
tensor([ 3.0001, -1.9998], requires_grad=True)
tensor(8., grad_fn=<AddBackward0>)

Epoch 16
tensor([ 3.0001, -1.9999], requires_grad=True)
tensor(8., grad_fn=<AddBackward0>)

Epoch 17
tensor([ 3.0000, -2.0000], requires_grad=True)
tensor(8., grad_fn=<AddBackward0>)

Epoch 18
tensor([ 3.0000, -2.0000], requires_grad=True)
tensor(8., grad_fn=<AddBackward0>)

Epoch 19
tensor([ 3.0000, -2.0000], requires_grad=True)
tensor(8., grad_fn=<AddBackward0>)
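With lr = 0.25, the factor 1 - 2*lr from the side calculation above is 0.5, so the distance to (3,-2) is cut in half every epoch, which is why the convergence is so fast. In fact, because our loss_fn is such a simple quadratic, lr = 0.5 makes that factor 0 and lands on the minimizer in a single step. Here is a small sketch (an addition, not from the recording) checking that; don't expect this one-step behavior for a realistic loss function.

a = torch.tensor([10., 10], requires_grad=True)
optimizer = torch.optim.SGD([a], lr=0.5)
loss = loss_fn(a)       # tensor(201., ...), as before
optimizer.zero_grad()
loss.backward()         # a.grad should be tensor([14., 24.])
optimizer.step()        # a - 0.5*(14, 24) = (3, -2)
a                       # should be tensor([ 3., -2.], requires_grad=True)
loss_fn(a)              # should be the minimum value, 8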