# Backpropagation
WIP
{ Dunno if this is completely correct, I'm learning this as I'm writing it. There may be errors. ~drummyfish }
Backpropagation, or backprop, is an [algorithm](algorithm.md), based on the chain rule of derivation, used in training [neural networks](neural_network.md); it computes the partial derivatives (the [gradient](gradient.md)) of the network's error function so that we can perform a [gradient descent](gradient_descent.md), i.e. update the weights towards lowering the network's error. It computes the analytical derivative (theoretically you could estimate a derivative numerically, but that's not so accurate and can be too computationally expensive). It is called backpropagation because it works backwards and propagates the error from the output towards the input, due to how the chain rule works, and it's efficient thanks to reusing already computed values.
## Details
Consider the following neural network:
```
        w000   w100
      x0------y0------z0
        \    / \    / \
         \  /   \  /   \
          \/w010 \/w110 \_E
          /\w001 /\w101 /
         /  \   /  \   /
        /    \ /    \ /
      x1------y1------z1
        w011   w111
```
It has an input layer (neurons *x0*, *x1*), a hidden layer (neurons *y0*, *y1*) and an output layer (neurons *z0*, *z1*). For simplicity there are no biases (biases can easily be added as input neurons that are always on). At the end there is a total error *E* computed from the network's output against the desired output (training data).
Let's say the total error is computed as the squared error: *E = squared_error(z0) + squared_error(z1) = 1/2 * (z0 - z0_desired)^2 + 1/2 * (z1 - z1_desired)^2*.
We can see each non-input neuron as a function. E.g. the neuron *z0* is a function *z0(x) = a(z0s(x))* where:
- *z0s* is the sum of inputs to the neuron, in this case *z0s(x) = w100 * y0(x) + w110 * y1(x)*
- *a* is the activation function, let's suppose the normally used [logistic function](logistic_function.md) *a(x) = 1/(1 + e^(-x))*.
If you don't know what the fuck is going on see [neural networks](neural_network.md) first.
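To make this a bit more concrete, here is a possible sketch of the forward pass of this little network in C (all the specific weight, input and desired output values are made up just for illustration):
```
#include <stdio.h>
#include <math.h>

double a(double x) { return 1.0 / (1.0 + exp(-x)); } // activation (logistic function)

int main(void)
{
  // made up weights, just for illustration:
  double w000 = 0.1, w010 = 0.3, // x0, x1 -> y0
         w001 = 0.2, w011 = 0.4, // x0, x1 -> y1
         w100 = 0.5, w110 = 0.7, // y0, y1 -> z0
         w101 = 0.6, w111 = 0.8; // y0, y1 -> z1

  double x0 = 1.0, x1 = 0.0,               // example input
         z0Desired = 0.0, z1Desired = 1.0; // desired output (training data)

  double y0 = a(w000 * x0 + w010 * x1),    // hidden layer
         y1 = a(w001 * x0 + w011 * x1),
         z0 = a(w100 * y0 + w110 * y1),    // output layer
         z1 = a(w101 * y0 + w111 * y1);

  double E = 0.5 * (z0 - z0Desired) * (z0 - z0Desired) + // total (squared) error
             0.5 * (z1 - z1Desired) * (z1 - z1Desired);

  printf("z0 = %f, z1 = %f, E = %f\n",z0,z1,E);

  return 0;
}
```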
What is our goal now? To find the **[partial derivative](partial_derivative.md) of the whole network's total error function** (at the current point defined by the weights), or in other words the **gradient** at the current point. I.e. from the point of view of the total error (which is just a number output by this system), the network is a function of 8 variables (the weights *w000*, *w001*, ...) and we want to find the derivative of this function with respect to each of these variables (that's what a partial derivative is) at the current point (i.e. with the current values of the weights). This will, for each of these variables, tell us how much (at what rate and in which direction) the total error changes if we change that variable by a certain amount. Why do we need to know this? So that we can do a [gradient descent](gradient_descent.md), i.e. this information is kind of a direction in which we want to move (change the weights) towards lowering the total error (making the network compute results which are closer to the training data).
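As said above, these derivatives could also be estimated numerically, which is slow but a nice way to check our math later on. Just as an illustration, a possible sketch of such an estimate with small differences (the sample and weight values are again made up):
```
#include <stdio.h>
#include <math.h>

double a(double x) { return 1.0 / (1.0 + exp(-x)); }

// total error of the network as a function of its 8 weights
// (order: w000, w010, w001, w011, w100, w110, w101, w111)
double E(const double w[8])
{
  double x0 = 1.0, x1 = 0.0, z0Desired = 0.0, z1Desired = 1.0; // made up sample

  double y0 = a(w[0] * x0 + w[1] * x1),
         y1 = a(w[2] * x0 + w[3] * x1),
         z0 = a(w[4] * y0 + w[5] * y1),
         z1 = a(w[6] * y0 + w[7] * y1);

  return 0.5 * (z0 - z0Desired) * (z0 - z0Desired) +
         0.5 * (z1 - z1Desired) * (z1 - z1Desired);
}

int main(void)
{
  double w[8] = {0.1, 0.3, 0.2, 0.4, 0.5, 0.7, 0.6, 0.8}, // made up weights
         epsilon = 0.0001;

  for (int i = 0; i < 8; ++i) // estimate each partial derivative
  {
    double backup = w[i], ePlus, eMinus;

    w[i] = backup + epsilon; ePlus = E(w);
    w[i] = backup - epsilon; eMinus = E(w);
    w[i] = backup;

    printf("dE/dw[%d] ~= %f\n",i,(ePlus - eMinus) / (2 * epsilon));
  }

  return 0;
}
```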
Backpropagation is based on the **chain rule**, a rule of derivation that equates the derivative of a function composition (functions inside other functions) to a product of derivatives. This is important because by converting the derivatives to a product we will be able to **reuse** the individual factors and so compute very efficiently and quickly.
Let's write the derivative of *f(x)* with respect to *x* as *D{f(x),x}*. The chain rule says that:
*D{f(g(x)),x} = D{f(g(x)),g(x)} * D{g(x),x}*
Notice that this can be applied to any number of composed functions; the product chain just becomes longer.
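Just to see the rule in action, here is a small sketch that checks it numerically on an arbitrarily chosen composition of three functions:
```
#include <stdio.h>
#include <math.h>

/* Checks the chain rule on f(g(h(x))) where f(x) = sin(x), g(x) = x * x and
   h(x) = 3 * x, i.e. D{f(g(h(x))),x} = cos(h(x) * h(x)) * (2 * h(x)) * 3. */

int main(void)
{
  double x = 0.7, epsilon = 0.0001;

  double h = 3 * x; // h(x)

  double chainRule = cos(h * h) * (2 * h) * 3; // product of the three derivatives

  double numerical = // numerical estimate of the same derivative, for comparison
    (sin(3 * (x + epsilon) * 3 * (x + epsilon)) -
     sin(3 * (x - epsilon) * 3 * (x - epsilon))) / (2 * epsilon);

  printf("chain rule: %f, numerical estimate: %f\n",chainRule,numerical);

  return 0;
}
```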
Let's get to the computation. Backpropagation works by going "backwards" from the output towards the input. So, let's start by computing the derivative against the weight *w100*. It will be a specific number; let's call it *'w100*. Derivative of a sum is equal to the sum of derivatives:
*'w100 = D{E,w100} = D{squared_error(z0),w100} + D{squared_error(z1),w100} = D{squared_error(z0),w100} + 0*
(The second part of this sum became 0 because with respect to *w100* it is a constant.)
Now we can continue and utilize the chain rule:
*'w100 = D{E,w100} = D{squared_error(z0),w100} = D{squared_error(a(z0s)),w100} = D{squared_error(z0),z0} * D{a(z0s),z0s} * D{z0s,w100}*
We'll now skip the intermediate steps, they should be easy if you can do derivatives (the three factors come out as *D{squared_error(z0),z0} = z0 - z0_desired*, *D{a(z0s),z0s} = z0 * (1 - z0)* and *D{z0s,w100} = y0*). The final result is:
*'w100 = (z0 - z0_desired) * z0 * (1 - z0) * y0*
At any specific moment during training, values of all the variables in this formula are known to us so we can plug them in and get a specific number.
**Now we have computed the derivative against w100**. In the same way we can compute *'w101*, *'w110* and *'w111* (weights leading to the output layer).
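As a sketch, the computation of these four derivatives might look something like this (assuming *y0*, *y1*, *z0* and *z1* have already been computed by a forward pass like the one shown above; the numbers plugged in are made up):
```
#include <stdio.h>

/* Computes the derivatives of E against the four weights leading to the
   output layer, given values already obtained by a forward pass. */
void outputLayerDerivatives(
  double y0, double y1,               // hidden layer outputs
  double z0, double z1,               // network outputs
  double z0Desired, double z1Desired, // desired outputs (training data)
  double d[4])                        // result: 'w100, 'w110, 'w101, 'w111
{
  double dz0 = (z0 - z0Desired) * z0 * (1 - z0), // D{squared_error(z0),z0} * D{a(z0s),z0s}
         dz1 = (z1 - z1Desired) * z1 * (1 - z1); // the same for z1

  d[0] = dz0 * y0; // 'w100
  d[1] = dz0 * y1; // 'w110
  d[2] = dz1 * y0; // 'w101
  d[3] = dz1 * y1; // 'w111
}

int main(void)
{
  double d[4];

  // made up numbers standing in for forward pass results:
  outputLayerDerivatives(0.57,0.60,0.67,0.70,0.0,1.0,d);

  printf("'w100 = %f, 'w110 = %f, 'w101 = %f, 'w111 = %f\n",d[0],d[1],d[2],d[3]);

  return 0;
}
```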
Now let's compute the derivative in respect to *w000*, i.e. the number *'w000*. We will proceed similarly but the computation will be different because the weight *w000* affects both output neurons (*z0* and *z1*). Again, we'll use the chain rule.
*'w000 = D{E,w000} = D{E,y0} * D{a(y0s),y0s} * D{y0s,w000}*
*D{E,y0} = D{squared_error(z0),y0} + D{squared_error(z1),y0}*
Let's compute the first part of the sum:
*D{squared_error(z0),y0} = D{squared_error(z0),z0s} * D{z0s,y0}*
*D{squared_error(z0),z0s} = D{squared_error(z0),z0} * D{a(z0s),z0s}*
Note that this last equation uses already computed values which we can reuse. Finally:
*D{z0s,y0} = D{w100 * y0 + w110 * y1,y0} = w100*
And we get:
*D{squared_error(z0),y0} = D{squared_error(z0),z0} * D{a(z0s),z0s} * w100*
And so on until we get all the derivatives.
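Putting it all together, here is a possible sketch of the whole computation for our small network, showing how values computed for the output layer get reused by the hidden layer (weights, input and desired output are again made up):
```
#include <stdio.h>
#include <math.h>

double a(double x) { return 1.0 / (1.0 + exp(-x)); } // logistic activation

int main(void)
{
  // made up weights, input and desired output:
  double w000 = 0.1, w010 = 0.3, w001 = 0.2, w011 = 0.4, // input -> hidden
         w100 = 0.5, w110 = 0.7, w101 = 0.6, w111 = 0.8; // hidden -> output

  double x0 = 1.0, x1 = 0.0, z0Desired = 0.0, z1Desired = 1.0;

  // forward pass:
  double y0 = a(w000 * x0 + w010 * x1),
         y1 = a(w001 * x0 + w011 * x1),
         z0 = a(w100 * y0 + w110 * y1),
         z1 = a(w101 * y0 + w111 * y1);

  // backward pass, output layer first:
  double dz0 = (z0 - z0Desired) * z0 * (1 - z0), // D{E,z0s}, computed once
         dz1 = (z1 - z1Desired) * z1 * (1 - z1); // D{E,z1s}

  double dw100 = dz0 * y0, dw110 = dz0 * y1,
         dw101 = dz1 * y0, dw111 = dz1 * y1;

  // hidden layer, reusing dz0 and dz1 (this reuse is the point of backpropagation):
  double dy0 = (dz0 * w100 + dz1 * w101) * y0 * (1 - y0), // D{E,y0s}
         dy1 = (dz0 * w110 + dz1 * w111) * y1 * (1 - y1); // D{E,y1s}

  double dw000 = dy0 * x0, dw010 = dy0 * x1,
         dw001 = dy1 * x0, dw011 = dy1 * x1;

  printf("'w000 = %f, 'w010 = %f, 'w001 = %f, 'w011 = %f\n",dw000,dw010,dw001,dw011);
  printf("'w100 = %f, 'w110 = %f, 'w101 = %f, 'w111 = %f\n",dw100,dw110,dw101,dw111);

  return 0;
}
```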
Once we have them, we multiply them all by some value (**learning rate**, a distance by which we move in the computed direction) and subtract them from the current weights, which performs the gradient descent and lowers the total error.
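As a sketch, assuming we keep the weights and their computed derivatives in two arrays in the same order, the update step may look like this (the learning rate value is up to us, e.g. 0.5):
```
// one gradient descent step: shift each weight a bit against its derivative
void gradientDescentStep(double *weights, const double *derivatives, int count,
  double learningRate)
{
  for (int i = 0; i < count; ++i)
    weights[i] -= learningRate * derivatives[i]; // move towards lower E
}
```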
Note that here we've only used one training sample, i.e. the error *E* was computed from the network against a single desired output. If more examples are used in a single update step, they are usually somehow averaged.
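One possible way the averaging might look, as a sketch; *Sample* and *computeDerivatives* here are hypothetical, the latter standing for the whole computation shown above:
```
typedef struct { double x0, x1, z0Desired, z1Desired; } Sample; // one training example

// hypothetical function standing for the computation shown above:
// fills d[8] with the 8 derivatives for one training sample
void computeDerivatives(const Sample *sample, double d[8]);

// averages the derivatives over a whole batch of samples
void averagedDerivatives(const Sample *samples, int sampleCount, double d[8])
{
  for (int i = 0; i < 8; ++i)
    d[i] = 0;

  for (int i = 0; i < sampleCount; ++i)
  {
    double sampleD[8];

    computeDerivatives(&samples[i],sampleD); // backpropagation for one sample

    for (int j = 0; j < 8; ++j)
      d[j] += sampleD[j];
  }

  for (int i = 0; i < 8; ++i)
    d[i] /= sampleCount; // the average
}
```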