NN II – Compute Graph, Backprop and Training
Introduction
In this lecture we’ll gradually build out a lightweight neural network training framework reminiscent of PyTorch.
We build:
- A simple neural network engine we call Value: a class that wraps numbers and math operators and includes useful attributes and methods for implementing the forward pass and backpropagation. (~63 lines of code)
We’ll provide (and explain)
- A simple compute graph visualization function. (34 lines of code)
- A small set of helper functions that easily define a neuron, layer and multi-layer perceptron (MLP). (84 lines of code)
With that we can implement a neural network training loop, and see how similar it is to a PyTorch implementation.
The code is based on Andrej Karpathy’s micrograd.
Artificial Neuron
Recall we looked at an artificial neuron in the Neural Networks I lecture.
From cs231n
The common artificial neuron
- collects one or more inputs,
- each multiplied by a unique weight
- sums the weighted inputs
- adds a bias
- then applies a nonlinear activation function
Activation Functions
The activation function is typically some nonlinear function that compresses its input in some way. Historically, the most common choices have been the sigmoid and \(\tanh()\) functions.
Without a nonlinear activation function, a neural network collapses to just a linear model and loses all its expressive power. To dive deeper, see the Universal Approximation Theorem in Chapter 3 of Understanding Deep Learning.
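To make this concrete, here is a small numpy sketch (ours, with made-up weights, not part of the lecture code) showing that two stacked linear layers with no activation in between collapse to a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)  # layer 1 (made up)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)  # layer 2 (made up)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2       # no activation in between

W, b = W2 @ W1, W2 @ b1 + b2              # the equivalent single layer
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True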
The sigmoid function is
\[ \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}. \]
The hyperbolic tangent, \(\tanh\), is essentially the sigmoid function shifted and scaled to the range \([-1, 1]\). The tanh function is
\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}. \]
A more common activation function these days, which is also more efficient to compute, is the Rectified Linear Unit, or ReLU.
\[ \textrm{ReLU}(x) = \mathrm{max}(0, x) \]
There are many other variations. See for example PyTorch Non-linear Activations.
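As a quick check of the definitions above, here is a small numpy sketch (ours, not part of the lecture code) of the three activation functions, verifying numerically that \(\tanh\) is a shifted and scaled sigmoid:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

x = np.linspace(-3, 3, 7)
# tanh is a shifted, scaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(tanh(x), 2 * sigmoid(2 * x) - 1))   # True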
Relating Back to an Earlier Lecture
Question
What does
\[ \mathrm{sigmoid}\left(\sum_i w_i x_i + b\right) \hspace{10pt} \textrm{where} \hspace{10pt} \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} \]
remind you of?
Answer
How about the Logistic Regression model?
In fact, just like in Logistic Regression, we use the sigmoid function on the last layer of a neural network that is doing binary classification to output the probabilities.
So Logistic Regression is similar to one neuron with a sigmoid activation function.
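The correspondence is easy to see in code. Here is a minimal sketch of a single sigmoid neuron, with made-up weights and bias, computing exactly the logistic regression prediction:

import numpy as np

w = np.array([0.5, -1.2, 0.3])   # hypothetical weights
b = 0.1                          # hypothetical bias
x = np.array([1.0, 2.0, -0.5])   # one input example

z = np.dot(w, x) + b             # weighted sum plus bias
p = 1 / (1 + np.exp(-z))         # sigmoid activation
print(p)                         # the predicted probability, as in logistic regression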
Fully Connected Network (FCN)
- Layer: Multiple artificial neurons acting on the same inputs.
- Network: Multiple cascading layers produce one or more outputs.
The above network has
- 3 inputs,
- three hidden layers of 4, 2 and 3 neurons, respectively, followed by
- one layer that produces two continuous value outputs.
This is a regression model that is predicting two continuous valued outputs.
FCN Matrix Equations
We can express the matrix equations for a \(K\) layer network as
\[ \begin{aligned} \mathbf{h}_1 &= \mathbf{a}(\boldsymbol{\beta}_0 + \boldsymbol{\Omega}_0 \mathbf{x}) \\ \mathbf{h}_2 &= \mathbf{a}(\boldsymbol{\beta}_1 + \boldsymbol{\Omega}_1 \mathbf{h}_1) \\ \vdots \\ \mathbf{h}_K &= \mathbf{a}(\boldsymbol{\beta}_{K-1} + \boldsymbol{\Omega}_{K-1} \mathbf{h}_{K-1}) \\ \mathbf{\hat{y}} &= \boldsymbol{\beta}_K + \boldsymbol{\Omega}_K \mathbf{h}_K \\ \end{aligned} \]
where \(\mathbf{a}(\cdot)\) is the element-wise activation function.
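Note that the forward pass above is just a loop of matrix multiplications. Here is a compact numpy sketch (ours, with random made-up parameters) matching the layer sizes of the network pictured above, i.e., 3 inputs, hidden layers of 4, 2 and 3 neurons, and 2 outputs:

import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(0)
sizes = [3, 4, 2, 3, 2]
Omegas = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
betas = [rng.normal(size=m) for m in sizes[1:]]

x = rng.normal(size=3)
h = x
for Omega, beta in zip(Omegas[:-1], betas[:-1]):
    h = relu(beta + Omega @ h)         # hidden layers use the activation
y_hat = betas[-1] + Omegas[-1] @ h     # the last layer is linear
print(y_hat)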
The square loss function is
\[ L = \sum_{i=1}^N \ell_i = \sum_{i=1}^N \|\mathbf{y}_i - \mathbf{\hat{y}}_i\|_2^2. \]
The total loss is the sum of the losses over the \(N\) training examples.
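For example, a small numpy computation of this loss on hypothetical targets and predictions:

import numpy as np

y = np.array([[1.0, 0.0], [0.5, -1.0], [2.0, 1.0]])      # targets y_i (N=3)
yhat = np.array([[0.9, 0.1], [0.4, -1.2], [1.8, 1.1]])   # predictions

L = np.sum(np.sum((y - yhat)**2, axis=1))   # sum_i ||y_i - yhat_i||^2
print(L)   # ≈ 0.12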
What we want to do is find weights in the NN that minimize the above loss function. We do this by
- finding the partial derivatives of \(L\) with respect to each parameter in the network
- updating each parameter using the negative of its partial derivative
In other words, we compute
\[ \frac{\partial L}{\partial \boldsymbol{\beta}_k}, \frac{\partial L}{\partial \boldsymbol{\Omega}_k}, \frac{\partial L}{\partial \boldsymbol{\beta}_{k-1}}, \frac{\partial L}{\partial \boldsymbol{\Omega}_{k-1}}, \ldots, \frac{\partial L}{\partial \boldsymbol{\beta}_0}, \frac{\partial L}{\partial \boldsymbol{\Omega}_0}. \]
In the notation above, \(\boldsymbol{\beta}_k\) and \(\mathbf{\Omega}_k\) are the bias vectors and weight matrices of the \(k\)-th layer.
Just as with partial derivatives with respect to scalars, there are well-defined rules for how to compute the partial derivatives of vector and matrix expressions. PyTorch and TensorFlow implement these calculations.
The update rules for the parameters are given by
\[ \boldsymbol{\beta}_k \leftarrow \boldsymbol{\beta}_k - \eta\frac{\partial L}{\partial \boldsymbol{\beta}_k}, \]
\[ \boldsymbol{\Omega}_k \leftarrow \boldsymbol{\Omega}_k - \eta\frac{\partial L}{\partial \boldsymbol{\Omega}_k}, \]
where \(\eta\) is the learning rate.
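A minimal numpy sketch of one such update step, with made-up parameter and gradient values standing in for what backpropagation would supply:

import numpy as np

eta = 0.01  # learning rate
Omega_k = np.array([[0.5, -0.2], [0.1, 0.8]])      # made-up weights
beta_k = np.array([0.0, 0.3])                      # made-up biases
dL_dOmega_k = np.array([[1.0, -2.0], [0.5, 0.0]])  # pretend backprop output
dL_dbeta_k = np.array([0.2, -0.1])

Omega_k -= eta * dL_dOmega_k   # step opposite the gradient
beta_k -= eta * dL_dbeta_k
print(Omega_k, beta_k)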
Computation Graph
The way we are going to differentiate more complex functions is to first build a “computation graph.”
We’ll see that we can “propagate backwards” through the graph to calculate the gradients of the loss function with respect to the parameters.
It’s a scalable approach employed by TensorFlow and PyTorch, and in fact we’ll follow PyTorch’s interface definitions.
Building the Value Class
To do that we will
- build a data wrapper as a class called Value,
- gradually build in all the functionality we need to define a Multi-Layer Neural Network (a.k.a. Multi-Layer Perceptron),
- and train the full network.
This is similar to how PyTorch defines its Tensor class.
First, the class has only a simple initialization method and a representation method.
# Value version 1
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"
Which we can instantiate and evaluate as follows.
a = Value(4.0)
a
Value(data=4.0)
If you are not familiar with python classes, there are a few things to note here.
- The property self is just a pointer to the object itself.
- The __init__ method is called when you initialize a class object.
- The __repr__ method is how you represent the class object as a string.
Implementing Addition
The Value object doesn’t do much yet except for taking a value and printing it. We’d also like to do things like addition and other operations with these objects, but…
a = Value(4.0)
b = Value(-3.0)

try:
    a+b
except Exception as e:
    print("Uh oh!", e)
else:
    print("It worked!")
Uh oh! unsupported operand type(s) for +: 'Value' and 'Value'
When python tries to add two objects a and b, internally it will call a.__add__(b). We have to add the __add__() method.
# Value version 2
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data)
        return out
Now we try adding two Value objects again.
a = Value(4.0)
b = Value(-3.0)

try:
    a+b
except Exception as e:
    print("Uh oh!", e)
else:
    print("It worked!")
It worked!
Which, as mentioned, is equivalent to calling the __add__ method on a.
a.__add__(b)
Value(data=1.0)
Implementing More Operations
Similarly, we can define a multiplication operation and a ReLU function.
import numpy as np

# Value version 3
class Value:
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"

    def __add__(self, other): # self + other
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data)
        return out

    def __mul__(self, other): # self * other
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data)
        return out

    def relu(self):
        out = Value(np.maximum(0, self.data))
        return out
Now we can use the additional operations.
a = Value(4.0)
b = Value(-3.0)
c = Value(8.0)

d = a*b + c
d
Value(data=-4.0)
d.relu()
Value(data=0.0)
Internally, python will call __mul__ on a, then __add__ on the temporary product object.
(a.__mul__(b)).__add__(c)
Value(data=-4.0)
Child Nodes
In order to calculate the gradients, we will need to capture the computation graphs.
To do that, we’ll need to store pointers to the operands of each operation as a tuple of child nodes in the initializer.
# Value version 4
class Value:
    #                        vvvvvvvvvvvv
    def __init__(self, data, _children=()):
        self.data = data
        self._prev = set(_children)

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other)) # store tuple of children
        return out                         # ^^^^^^^^^^^^^

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other)) # store tuple of children
        return out                          # ^^^^^^^^^^^^^

    def relu(self):
        out = Value(np.maximum(0, self.data), (self,))
        return out                            # ^^^^^^^
Let’s instantiate a few Value objects and perform some operations with them.
a = Value(4.0)
b = Value(-3.0)
c = Value(8.0)

d = a*b
e = d + c
We can now see the children, i.e., the operands that produced the output value, by printing the _prev attribute. The name _prev might not be intuitive yet, but it will make more sense when we view these operations as a graph.
d._prev
{Value(data=-3.0), Value(data=4.0)}
e._prev
{Value(data=-12.0), Value(data=8.0)}
Child Operations
Now we’ve recorded pointers to the child nodes. It would be helpful to also record the operator used.
We’ll also add labels for convenience.
# Value version 5
class Value:
    #                                      vvvvvvv  vvvvvvvv
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self._prev = set(_children)
        self._op = _op # store the operation that created this node
        self.label = label # label for the node

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+') # store operator used
        return out                                      # ^^^

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*') # store operator used
        return out                                      # ^^^

    def relu(self):                                #  vvvvvv
        out = Value(np.maximum(0, self.data), (self,), 'ReLU')
        return out
Let’s again instantiate a few Value objects and do some operations with them.
a = Value(4.0, label='a')
b = Value(-3.0, label='b')
c = Value(8.0, label='c')

d = a*b ; d.label = 'd'
e = d + c ; e.label = 'e'
We can now inspect the child operands and the operations used to create the current node value.
d._prev, d._op, d.label
({Value(data=-3.0), Value(data=4.0)}, '*', 'd')
e._prev, e._op, e.label
({Value(data=-12.0), Value(data=8.0)}, '+', 'e')
The Compute Graph
We now have enough information stored about the compute graph to visualize it.
These are two functions to walk the graph and build sets of all nodes and edges (trace) and then draw them as a directed graph (draw_dot).
Show the code for trace() and draw_dot()
# draw_dot version 1
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and set of all edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{ %s | data %.4f }" % (n.label, n.data), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot
a = Value(4.0, label='a')
b = Value(-3.0, label='b')
c = Value(8.0, label='c')

d = a*b ; d.label = 'd'
e = d + c ; e.label = 'e'
draw_dot(e)
Note that every value object becomes a node in the graph. The operators are also represented as a kind of fake node so they can be visualized too.
nodes, edges = trace(e)
print("Nodes: ", nodes)
print("Edges: ", edges)
Nodes: {Value(data=-12.0), Value(data=-3.0), Value(data=4.0), Value(data=8.0), Value(data=-4.0)}
Edges: {(Value(data=8.0), Value(data=-4.0)), (Value(data=4.0), Value(data=-12.0)), (Value(data=-3.0), Value(data=-12.0)), (Value(data=-12.0), Value(data=-4.0))}
Let’s add one more operation, or stage, to the compute graph.
a = Value(4.0, label='a')
b = Value(-3.0, label='b')
c = Value(8.0, label='c')

d = a*b; d.label = 'd'
e = d + c; e.label = 'e'
f = Value(2.0, label='f')

L = e*f; L.label = 'L'
draw_dot(L)
Recap
So far we’ve:
- built a Value class and associated data structures,
- captured a computational graph, and
- calculated the output based on the inputs and operations.
We’ll call this the forward pass (or forward propagation).
We now need to calculate the gradient of the loss function \(L\) with respect to the parameters of the model.
Next we’ll update our Value class to capture the partial derivative at each node relative to L.
Calculating Gradients
We now add a gradient member variable, grad, to our class to store the partial derivative of the output node with respect to this node.
# Value version 6
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 # default to 0 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
        self._prev = set(_children)
        self._op = _op # store the operation that created this node
        self.label = label # label for the node

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')
        return out

    def relu(self):
        out = Value(np.maximum(0, self.data), (self,), 'ReLU')
        return out
We update draw_dot() to show grad in the node info.
Show the code for draw_dot()
# draw_dot version 2
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and set of all edges in a graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'}) # LR = left to right

    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for any value in the graph, create a rectangular ('record') node for it
        dot.node(name = uid, label = "{ %s | data %.4f | grad %.4f }" % (n.label, n.data, n.grad), shape='record')
        if n._op:
            # if this value is a result of some operation, create an op node for it
            dot.node(name = uid + n._op, label = n._op)
            # and connect this node to it
            dot.edge(uid + n._op, uid)

    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)

    return dot
We also reinitialize and redraw the computational graph.
a = Value(4.0, label='a')
b = Value(-3.0, label='b')
c = Value(8.0, label='c')

d = a*b; d.label = 'd'
e = d + c; e.label = 'e'
f = Value(2.0, label='f')

L = e*f; L.label = 'L'
draw_dot(L)
We have placeholders for the gradients, but they are currently all zero.
Manual Gradient Calculation
Before we start implementing backpropagation, it is helpful to manually calculate some gradients to better understand the procedure.
For the node \(L\), we trivially calculate \(\frac{dL}{dL} = 1\).
We can easily see this from the limit ratio perspective,
\[ \frac{dL}{dL} = \lim_{h \rightarrow 0} \frac{ (L+h) - L }{h} = \frac{h}{h} = 1 \]
Let’s set the gradient of the last node L to 1.
L.grad = 1.0
If we go backwards a step in the graph, we see that \(L=e*f\), so we calculate
\[ \frac{\partial{L}}{\partial{e}} = \frac{\partial}{\partial{e}} (e\times f) = f, \]
and
\[ \frac{\partial{L}}{\partial{f}} = \frac{\partial}{\partial{f}} (e\times f) = e. \]
So we just assign the gradient value to the value of the other operand node.
e.grad = f.data
f.grad = e.data
For products, the partial derivative w.r.t. one operand is simply the other operand.
We needed the node values, e.g., e.data and f.data, in order to update the gradients. Remember, we calculate all the node values in the forward pass.
We again redraw the above computational graph.
draw_dot(L)
How do the parameters \(e\) and \(f\) influence \(L\)? Here’s a function to vary (or wiggle) them and see the effect on \(L\).
# Try uncommenting `e += h` or `f += h` and calling `wiggle()` then `wiggle(1.0)`
# to see the influence of e or f on L
def wiggle(h = 0.0):
    a = Value(4.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(8.0, label='c')

    d = a*b; d.label = 'd'
    e = d + c; e.label = 'e'
    # e += h
    f = Value(2.0, label='f')
    f += h

    L = e*f; L.label = 'L'
    print(L)
Let’s invoke it with no change \(0\) and then a change of \(1\).
wiggle(0.0)
Value(data=-8.0)
wiggle(1.0)
Value(data=-12.0)
Question
What does \(L\) change by if we change \(f\) by 1?
How much would \(L\) change if we changed \(e\) by 1?
Comment out the line `f += h` and uncomment `# e += h` to check your answer.
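A wiggle of 1.0 is a coarse probe; with a tiny h the same idea estimates the derivative itself. Here is a quick numerical check (ours, using plain floats rather than Value objects):

# estimate dL/df numerically with a small finite difference
def L_of(f_shift=0.0):
    a, b, c = 4.0, -3.0, 8.0
    d = a * b
    e = d + c
    f = 2.0 + f_shift
    return e * f

h = 1e-6
print((L_of(h) - L_of(0.0)) / h)   # ≈ -4.0 = e, matching dL/df = e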
Propagating Back
draw_dot(L)
Now we want to calculate
\[ \frac{\partial{L}}{\partial{c}}, \]
i.e., we want to know how much \(L\) varies if we vary \(c\), or how \(c\) influences \(L\).
Looking at the graph again we see that \(c\) influences \(e\) and \(e\) influences \(L\), so we should be able to see the ripple effect of \(c\) on \(L\).
\[ c \rightarrow e \rightarrow L. \]
We have that \(e = c + d\), and so we calculate
\[ \frac{\partial{e}}{\partial{c}} = \frac{\partial{}}{\partial{c}} (d + c) = 1. \]
For addition, the partial derivative w.r.t. one operand is 1.
Question
\[ c \rightarrow e \rightarrow L \]
Now we know \(\partial{L}/\partial{e}\) and we also know \(\partial{e}/\partial{c}\),
How do we get \(\partial{L}/\partial{c}\)?
The Chain Rule
To paraphrase from the Wikipedia page on Chain rule, if a variable \(L\) depends on the variable \(e\), which itself depends on the variable \(c\), then \(L\) depends on \(c\) as well, via the intermediate variable \(e\).
In this case, the chain rule is expressed as
\[ \frac{\partial L}{\partial c} = \frac{\partial L}{\partial e} \cdot \frac{\partial e}{\partial c}, \]
and
\[ \left.\frac{\partial L}{\partial c}\right|_{c} = \left.\frac{\partial L}{\partial e}\right|_{e(c)}\cdot \left. \frac{\partial e}{\partial c}\right|_{c}. \]
The above notation indicates at which points the derivatives have to be evaluated.
We evaluate the derivatives at the specific values of the variables that we calculated in the forward pass.
Since we’ve established that
\[ \frac{\partial{e}}{\partial{c}} = 1, \]
then
\[ \frac{\partial L}{\partial c} = \frac{\partial L}{\partial e} \cdot \frac{\partial{e}}{\partial{c}} = \frac{\partial L}{\partial e} \cdot 1. \]
In the case of an operand in an addition operation, we just copy the gradient of the parent node.
Or put another way:
With the addition operator, we just route the parent gradient to the child.
So let’s assign the gradient of e to d and c.
d.grad = e.grad
c.grad = e.grad
And redraw the graph.
How do \(c\) and \(d\) influence \(L\)?
# Try uncommenting `c += h` or `d += h` and calling `wiggle()` then `wiggle(1.0)`
# to see the influence of c or d on L
def wiggle(h = 0.0):
    a = Value(4.0, label='a')
    b = Value(-3.0, label='b')
    c = Value(8.0, label='c')
    c += h

    d = a*b; d.label = 'd'
    # d += h
    e = d + c; e.label = 'e'
    f = Value(2.0, label='f')

    L = e*f; L.label = 'L'
    print(L)
Let’s look at the effect c has on L.
wiggle(0.0)
Value(data=-8.0)
wiggle(1.0)
Value(data=-6.0)
Change the code to see the effect of d on L.
Propagating Back Again
Now we want to calculate \(\frac{\partial{L}}{\partial{b}}\) and \(\frac{\partial{L}}{\partial{a}}\). We already have \(\frac{\partial{L}}{\partial{d}}\), and we know that
\[ \frac{\partial{d}}{\partial{b}} = \frac{\partial{}}{\partial{b}}(a\cdot b) = a, \]
so again from the chain rule
\[ \frac{\partial{L}}{\partial{b}} = \frac{\partial{L}}{\partial{d}} \cdot \frac{\partial{d}}{\partial{b}} = \frac{\partial{L}}{\partial{d}} \cdot a. \]
If we expanded all the partial derivatives, we would get
\[ \frac{\partial{L}}{\partial{b}} = \left(\frac{\partial{L}}{\partial{e}} \cdot \frac{\partial{e}}{\partial{d}}\right) \cdot \frac{\partial{d}}{\partial{b}}. \]
Let’s assign the gradient values to b and a.
b.grad = a.data * d.grad
a.grad = b.data * d.grad
draw_dot(L)
We’ve traversed all the way back to the inputs and calculated all the partial derivatives.
Now we can see the effects of varying b and a.
# Try uncommenting `a += h` or `b += h` and calling `wiggle()` then `wiggle(1.0)`
# to see the influence of a or b on L
def wiggle(h = 0.0):
    a = Value(4.0, label='a')
    # a += h
    b = Value(-3.0, label='b')
    b += h
    c = Value(8.0, label='c')

    d = a*b; d.label = 'd'

    e = d + c; e.label = 'e'
    f = Value(2.0, label='f')

    L = e*f; L.label = 'L'
    print(L)
Let’s look at the effect b has on L if we wiggle it by 0 and 1.
wiggle(0.0)
Value(data=-8.0)
wiggle(1.0)
Value(data=0.0)
Change the code to see the effect of a on L.
Recap
As you saw, we recursively went backwards through the computation graph, combining the local gradients with the gradients calculated so far to get the partial derivatives.
We propagated this calculation backwards through the graph.
Of course, in practice, we will only need the gradients on the parameters, not the inputs, so we won’t bother calculating them on inputs.
PyTorch lets you control this by setting the requires_grad flag on Tensors.
This is the essence of Back Propagation.
Model Optimization
A Step in Optimization
Let’s take a look at the graph again.
draw_dot(L)
Assume we want the value of L to decrease.
We are free to change the values of the leaf nodes – all the other nodes are derived from them.
The leaf nodes are \(a, b, c\) and \(f\).
In practice we only update the parameter leaf nodes, not the input leaf nodes, but we’ll ignore that distinction temporarily for this example.
Let’s check the current value of L.
# remind ourselves what L is
print(L.data)
-8.0
As we showed before, we want to nudge each of those leaf nodes by the negative of the gradient, multiplied by a step size, \(\eta\), i.e.,
\[ w_{n+1} = w_n - \eta\frac{\partial{L}}{\partial{w_n}}, \]
where \(n\) is the step number.
# nudge all the leaf nodes along the negative direction of the gradient
step_size = 0.01 # also called eta above

a.data -= step_size * a.grad
b.data -= step_size * b.grad
c.data -= step_size * c.grad
f.data -= step_size * f.grad

# recompute the forward pass
d = a*b
e = d + c
L = e*f
print(L.data)
-9.230591999999998
Observe that the value of L has indeed decreased.
A Single Neuron
Let’s programmatically define a single neuron with
- two inputs
- two weights (1 for each input)
- a bias
- the ReLU activation function
Recall the neuron figure above.
# inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

# bias of the neuron
b = Value(6.8813735870195432, label='b')

x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'

x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1w1 + x2w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.relu(); o.label = 'o'
We can draw this compute graph:
draw_dot(o)
The only new operation we’ve added is the ReLU, so let’s take a quick look at how we differentiate the ReLU.
Like before, for the output node, o:
\[ \frac{\partial o}{\partial o} = 1. \]
o.grad = 1.0
ReLU is technically not differentiable at 0, but in practice we implement the derivative as 0 when \(x \le 0\) and 1 when \(x > 0\).
n.grad = (o.data > 0) * o.grad # = 0 when o.data <= 0; = o.grad when o.data > 0
Coding Backpropagation
We’ll update our Value class once more to support the backward pass.
There’s a
- private _backward() function in each operator that implements the local step of the chain rule, and
- a backward() function in the class that topologically sorts the graph and calls the operator _backward() functions, starting at the end of the graph and going backward.
# version 7
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 # default to 0, no impact on the output
        self._backward = lambda: None # by default backward doesn't do anything
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data}, grad={self.grad})"

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def relu(self):
        out = Value(np.maximum(0, self.data), (self,), 'ReLU')
        # out = Value(0 if self.data < 0 else self.data, (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def backward(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo):
            v._backward()
We redefined the class so we have to reinitialize the objects and run the operations again.
This constitutes the forward pass.
# inputs x1, x2
x1 = Value(2.0, label='x1')
x2 = Value(0.0, label='x2')

# weights w1, w2
w1 = Value(-3.0, label='w1')
w2 = Value(1.0, label='w2')

# bias of the neuron
#b = Value(6.7, label='b')
b = Value(6.8813735870195432, label='b')

x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'

x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1w1 + x2w2'
n = x1w1x2w2 + b; n.label = 'n'
o = n.relu(); o.label = 'o'
So we’ve filled the data values for all the nodes, but haven’t calculated the gradients.
draw_dot(o)
All we have to do is call the backward() method of the last node…
o.backward()
and voila! We have all the gradients.
draw_dot(o)
Accumulating the Gradients
You might have noticed that we are accumulating the gradients.
This is because we calculate the gradient for each sample in a batch and then sum the per-sample contributions to get the gradient for the whole batch.
The risk now is that if you don’t zero the gradients for the next iteration, you will have incorrect gradients.
Always remember to zero the gradient after each iteration where you update the parameters!
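Here is a minimal sketch of what goes wrong: run a second forward and backward pass without zeroing, and the new gradient accumulates onto the stale one.

x = Value(2.0, label='x')
y = x * 3.0
y.backward()
print(x.grad)   # 3.0 -- correct

y = x * 3.0     # a fresh forward pass, but x.grad was never zeroed
y.backward()
print(x.grad)   # 6.0 -- the new gradient accumulated onto the stale one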
Accumulating the Gradients – Special Cases
There are also special cases to handle, such as when the same Value object appears as both operands, like
a = Value(3.0, label='a')
b = a + a ; b.label = 'b'

b.backward()
a
Value(data=3.0, grad=2.0)
If we didn’t have the accumulation (the += in _backward), we would get a.grad = 1 instead, because the second operand’s contribution would overwrite the first.
The other case is when a node feeds into multiple different operations.
a = Value(-2.0, label='a')
b = Value(3.0, label='b')
d = a * b ; d.label = 'd'
e = a + b ; e.label = 'e'
f = d * e ; f.label = 'f'
f.backward()

draw_dot(f)
We can verify that the gradients are correct analytically.
To find the partial derivative \(\frac{\partial f}{\partial a}\), we first need to define \(f\) in terms of \(a\) and \(b\).
Given \[ \begin{aligned} d &= a \times b, \\ e &= a + b, \\ f &= d \times e \end{aligned} \]
then \(f\) can be expanded as \[ \begin{aligned} f &= (a \times b) \times (a + b), \\ &= a^2 \times b + a \times b^2. \end{aligned} \]
Next, we find the partial derivative of \(f\) with respect to \(a\) \[ \frac{\partial f}{\partial a} = 2a \times b + b^2. \]
Finally, we plug in the given values \(a = -2.0\) and \(b = 3.0\) \[ \begin{aligned} \frac{\partial f}{\partial a} &= 2(-2.0) \times 3.0 + 3.0^2, \\ \frac{\partial f}{\partial a} &= -12.0 + 9.0 , \\ \frac{\partial f}{\partial a} &= -3.0. \end{aligned} \]
So the partial derivative \(\frac{\partial f}{\partial a}\) for the value \(a = -2.0\) is \(-3.0\).
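We can also check the hand calculation against the graph we just built (assuming the cell above ran f.backward()):

print(a.grad)   # -3.0, matching the analytic result
print(b.grad)   # df/db = a**2 + 2*a*b = 4.0 - 12.0 = -8.0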
Enhancements to the Value Class
There are still some useful operations that Value doesn’t support, so to be more complete we have the final version of the Value class below.
We added:
- __radd__ for when the Value object is the right operand of an add
- __rmul__ for when the Value object is the right operand of a product
- __pow__ to support the ** operator
- plus some others you can see below
The Value Class with additional operators
# version 8
class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0 # default to 0, no impact on the output
        self._backward = lambda: None # by default backward doesn't do anything
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), '+')

        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward

        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), '*')

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward

        return out

    def __pow__(self, other):
        """Adding support for ** operator, which we'll need for the
        squared loss function"""
        assert isinstance(other, (int, float)), "only supporting int/float powers for now"
        out = Value(self.data**other, (self,), f'**{other}')

        def _backward():
            self.grad += (other * self.data**(other-1)) * out.grad
        out._backward = _backward

        return out

    def relu(self):
        out = Value(np.maximum(0, self.data), (self,), 'ReLU')

        def _backward():
            self.grad += (out.data > 0) * out.grad
        out._backward = _backward

        return out

    def __neg__(self): # -self
        return self * -1

    def __radd__(self, other): # other + self
        return self + other

    def __sub__(self, other): # self - other
        return self + (-other)

    def __rsub__(self, other): # other - self
        return other + (-self)

    def __rmul__(self, other): # other * self
        return self * other

    def __truediv__(self, other): # self / other
        return self * other**-1

    def __rtruediv__(self, other): # other / self
        return other * self**-1

    def backward(self):
        # topological order all of the children in the graph
        topo = []
        visited = set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # go one variable at a time and apply the chain rule to get its gradient
        self.grad = 1
        for v in reversed(topo):
            v._backward()

    def __repr__(self):
        """Return a string representation of the object for display"""
        return f"Value(data={self.data}, grad={self.grad})"
Comparing to PyTorch
We’re using a class implementation that resembles the PyTorch implementation, and in fact we can compare our implementation with PyTorch.
import torch

x1 = torch.Tensor([2.0]).double() ; x1.requires_grad = True
x2 = torch.Tensor([0.0]).double() ; x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double() ; w1.requires_grad = True
w2 = torch.Tensor([1.0]).double() ; w2.requires_grad = True
b = torch.Tensor([6.8813735870195432]).double() ; b.requires_grad = True
n = x1*w1 + x2*w2 + b
o = torch.relu(n)

print(o.data.item())
o.backward()

print('---')
print('x2.grad', x2.grad.item())
print('w2.grad', w2.grad.item())
print('x1.grad', x1.grad.item())
print('w1.grad', w1.grad.item())
0.881373405456543
---
x2.grad 1.0
w2.grad 0.0
x1.grad -3.0
w1.grad 2.0
By default, tensors don’t store gradients and so won’t support backprop, so we explicitly set requires_grad = True.
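A small sketch of the difference:

t = torch.tensor([2.0])   # requires_grad defaults to False
print(t.requires_grad)    # False -- no gradient will be tracked

u = torch.tensor([2.0], requires_grad=True)
(u * 3).backward()        # backward() works on a one-element result
print(u.grad)             # tensor([3.])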
Building a Neural Network
Neural Network Modules
Now we’ll define some classes which help us build out a small neural network.
Module – A base class.
Neuron – Implements a single linear or nonlinear neuron with nin inputs.
Layer – Implements a layer of the network consisting of nout neurons, each taking nin inputs.
MLP – A Multi-Layer Perceptron that implements len(nouts) layers of neurons.
Each class can calculate a forward pass and enumerate all its parameters.
import random

# we assume that Value class is already defined

class Module:
    """Define a Neural Network Module base class """

    def zero_grad(self):
        """When we run in a training loop, we'll need to zero out all the gradients
        since they are defined to accumulate in the backwards passes."""
        for p in self.parameters():
            p.grad = 0

    def parameters(self):
        return []

class Neuron(Module):
    """Define a Neuron as a subclass of Module"""

    def __init__(self, nin, nonlin=True):
        """Randomly initialize a set of weights, one for each input, and initialize the bias to zero."""
        self.w = [Value(random.uniform(-1,1)) for _ in range(nin)]
        self.b = Value(0.0)
        self.nonlin = nonlin

    def __call__(self, x):
        """Implement the forward pass of the neuron"""
        act = sum((wi*xi for wi,xi in zip(self.w, x)), self.b)
        return act.relu() if self.nonlin else act

    def parameters(self):
        return self.w + [self.b]

    def __repr__(self):
        return f"{'ReLU' if self.nonlin else 'Linear'}Neuron({len(self.w)})"

class Layer(Module):
    """Define a Layer of Network as a subclass of Module"""

    def __init__(self, nin, nout, **kwargs):
        """Initialize nout Neurons, each with nin inputs"""
        self.neurons = [Neuron(nin, **kwargs) for _ in range(nout)]

    def __call__(self, x):
        """Forward pass each neuron in the layer"""
        out = [n(x) for n in self.neurons]
        return out[0] if len(out) == 1 else out

    def parameters(self):
        return [p for n in self.neurons for p in n.parameters()]

    def __repr__(self):
        return f"Layer of [{', '.join(str(n) for n in self.neurons)}]"

class MLP(Module):
    """Define a Multi-Layer Perceptron"""

    def __init__(self, nin: int, nouts: list):
        """
        Initialize the Multi-Layer Perceptron, by initializing each layer
        then initializing each neuron of each layer.

        Parameters:
            nin: Number of inputs (int)
            nouts: A list of the number of neurons in each layer
        """
        sz = [nin] + nouts
        # self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
        # Create a list of layer objects for this MLP. All but the last layer
        # have ReLU activations. The last layer is linear.
        self.layers = [Layer(sz[i], sz[i+1], nonlin=i!=len(nouts)-1) for i in range(len(nouts))]

    def __call__(self, x):
        """Forward pass through the MLP"""
        for layer in self.layers:
            x = layer(x)
        return x

    def parameters(self):
        """Recursively retrieve the parameters of the MLP"""
        return [p for layer in self.layers for p in layer.parameters()]

    def __repr__(self):
        return f"MLP of [{', '.join(str(layer) for layer in self.layers)}]"
Feel free to uncomment lines of code in the following cell to see the docstrings for each class.
# help(Module)
# Module.__doc__
# help(Neuron)
# help(Layer)
# help(MLP)
Initialize and Evaluate a Neuron
# 2 inputs
x = [2.0, 3.0]

# initialize neuron with 2 inputs
n = Neuron(2, nonlin=False)

# evaluate our neuron with our 2 inputs
n(x)
Value(data=-0.7986044827378151, grad=0.0)
n
LinearNeuron(2)
# list the 2 weights and the bias
n.parameters()
[Value(data=0.13662034202253892, grad=0.0),
Value(data=-0.35728172226096433, grad=0.0),
Value(data=0.0, grad=0.0)]
Initialize and Evaluate a Layer
# same 2 inputs again
x = [2.0, 3.0]

# Now initialize a layer of 3 neurons, each with 2 inputs
l = Layer(2, 3, nonlin=False)

# Evaluate our layer of neurons with the 2 inputs
l(x)
[Value(data=2.075707642189308, grad=0.0),
Value(data=3.317750385687485, grad=0.0),
Value(data=1.3166915978118992, grad=0.0)]
l
Layer of [LinearNeuron(2), LinearNeuron(2), LinearNeuron(2)]
l.parameters()
[Value(data=-0.15375774313175783, grad=0.0),
Value(data=0.7944077094842745, grad=0.0),
Value(data=0.0, grad=0.0),
Value(data=0.7480309600826056, grad=0.0),
Value(data=0.6072294885074245, grad=0.0),
Value(data=0.0, grad=0.0),
Value(data=-0.5977201965507355, grad=0.0),
Value(data=0.8373773303044567, grad=0.0),
Value(data=0.0, grad=0.0)]
Initialize and Evaluate an MLP
We’ll instantiate an MLP like the picture below.
x = [2.0, 3.0, -1.0]
m = MLP(3, [4, 4, 1])
m(x)
Value(data=3.882975961981558, grad=0.0)
m
MLP of [Layer of [ReLUNeuron(3), ReLUNeuron(3), ReLUNeuron(3), ReLUNeuron(3)], Layer of [ReLUNeuron(4), ReLUNeuron(4), ReLUNeuron(4), ReLUNeuron(4)], Layer of [LinearNeuron(4)]]
# m.parameters()
draw_dot(m(x))
Training Loop
In order to run the training loop, let’s define inputs, targets, and initialize the MLP.
# Define 4 different sets of inputs
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0]
]

# For each input set, we have a desired target value -- binary classification
# ys = [1.0, -1.0, -1.0, 1.0]
ys = [1.0, 0.0, 0.0, 1.0]

# Manually seed the Random Number Generator for Reproducibility
# You can comment the next line out to see the variability
random.seed(1)

# Initialize an MLP with random weights
m = MLP(3, [4, 4, 1])
Now we can run the training loop.
losses = []
niters = 100
step_size = 0.01

for k in range(niters):
    # Training Step 1: forward pass
    ypred = [m(x) for x in xs]

    # Training Step 2: Calculate the loss
    loss = sum((yout - ygt)**2 for ygt, yout in zip(ys, ypred))
    losses.append(loss.data)

    # Training Step 3: Zero the gradients and run the backward pass
    m.zero_grad()
    loss.backward()

    # Training Step 4: Update parameters
    for p in m.parameters():
        p.data += -step_size * p.grad

    # print(k, loss.data)

print("Final Loss: ", loss.data)
Final Loss: 0.0073864224020347095
Let’s plot the loss per iteration.
import matplotlib.pyplot as plt

plt.plot(losses)
plt.ylabel("Loss")
plt.xlabel("Iterations")
plt.title("Loss Per Iteration")
Text(0.5, 1.0, 'Loss Per Iteration')
Let’s look at the ground truth and predictions.
print(type(ypred))
print(ypred[0].data)
list(zip(ys, [y.data for y in ypred]))
<class 'list'>
1.0120099058492713
[(1.0, np.float64(1.0120099058492713)),
(0.0, np.float64(0.07244688900469765)),
(0.0, np.float64(-0.03539087511088844)),
(1.0, np.float64(0.9727765028706291))]
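Since the targets are 0.0 and 1.0, we could turn these raw outputs into class predictions with a hypothetical 0.5 threshold (a sketch, not part of the original training loop):

preds = [1.0 if y.data > 0.5 else 0.0 for y in ypred]
print(list(zip(ys, preds)))   # all four predictions match the targets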
Build and Train the MLP in PyTorch
Define the MLP in PyTorch.
import torch
from torch import nn
from torch.optim import SGD
# Manually seed the Random Number Generator for Reproducibility
# You can comment the next line out to see the variability
torch.manual_seed(99)

# Step 1: Define the MLP model
class ptMLP(nn.Module):
    def __init__(self):
        super(ptMLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(3, 4),
            nn.ReLU(),
            nn.Linear(4, 4),
            nn.ReLU(),
            nn.Linear(4, 1)
        )

    def forward(self, x):
        return self.layers(x)

model = ptMLP()
print(model)
ptMLP(
(layers): Sequential(
(0): Linear(in_features=3, out_features=4, bias=True)
(1): ReLU()
(2): Linear(in_features=4, out_features=4, bias=True)
(3): ReLU()
(4): Linear(in_features=4, out_features=1, bias=True)
)
)
Define a loss function and an optimizer.
# Step 2: Define a loss function and an optimizer
criterion = nn.MSELoss(reduction='sum')
optimizer = SGD(model.parameters(), lr=0.01)
Recreate the tiny dataset.
# Step 3: Create a tiny dataset
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0]
]

# we had to reshape ys into a column (one target per row) for torch.tensor
ys_transpose = [[1.0],
                [0.0],
                [0.0],
                [1.0]]

inputs = torch.tensor(xs)
outputs = torch.tensor(ys_transpose)
Now run the training loop.
# Step 4: Write the training loop
losses = []
niters = 100

for epoch in range(niters):
    # Training Step 1: Forward pass
    predictions = model(inputs)

    # Training Step 2: Calculate the loss
    loss = criterion(predictions, outputs)

    # Training Step 3: Zero the gradient and run backward pass
    optimizer.zero_grad()
    loss.backward()

    # Training Step 4: Update parameters
    optimizer.step()

    losses.append(loss.item())
    # print(f'Epoch {epoch+1}, Loss: {loss.item()}')

print(f'Final Loss: {loss.item()}')
Final Loss: 0.06534025073051453
Let’s plot the loss per iteration.
plt.plot(losses)
plt.ylabel("Loss")
plt.xlabel("Iterations")
plt.title("Loss Per Iteration")
Text(0.5, 1.0, 'Loss Per Iteration')
Let’s look at the predictions.
list(zip(ys, [y.item() for y in predictions]))
[(1.0, 1.130600094795227),
(0.0, 0.03639908879995346),
(0.0, 0.017568498849868774),
(1.0, 0.7840131521224976)]
To Dig a Little Deeper
Summary
We have now covered the following topics:
- an overview of the wide applications of neural networks
- loss functions
- developed the notion of gradient descent first by intuition, then in the univariate case, then the multivariate case
- defined artificial neurons
- implemented a computation graph and visualization
- implemented the chain rule as backpropagation on the computation graph
- defined Neuron, Layer and MLP modules which completes a homegrown Neural Network Framework
- trained a small MLP on a tiny dataset
- implemented the same neural network architecture in PyTorch