Update
parent
0e54363e3c
commit
20b099eac2
1 changed files with 18 additions and 12 deletions
|
@ -27,32 +27,38 @@ Let's say the total error is computed as the squared error: *E = squared_error(z
|
|||
|
||||
What is our goal now? To find the **[partial derivatives](partial_derivative.md) of the whole network's total error function** (at the current point defined by the weights and biases), in other words the **gradient** at the current point. From the point of view of the total error (which is just a number output by this system), the network is a function of 10 variables (the weights *w000*, *w001*, ... and the biases *b0* and *b1*), and we want to find the derivative of this function with respect to each of these variables (that's what a partial derivative is) at the current point (i.e. with the current values of the weights and biases). For each of these variables this will tell us how much (at what rate and in which direction) the total error changes if we change that variable by a small amount. Why do we need to know this? So that we can do [gradient descent](gradient_descent.md): the gradient points in the direction in which the total error grows fastest, so moving the weights and biases in the opposite direction lowers the total error (making the network compute results which are closer to the training data).
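To make the idea of "one partial derivative per variable, then step against the gradient" concrete, here is a minimal Python sketch. It is not the network from this article: `toy_error` is a made-up two-variable error function, and `numerical_gradient`, `gradient_descent_step` and `learning_rate` are hypothetical names chosen for illustration. The gradient is estimated by nudging each variable a little, which is slow but easy to follow; backpropagation computes the same numbers analytically.

```python
def numerical_gradient(error, params, h=1e-6):
    """Estimate the partial derivative of error with respect to each parameter."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += h      # nudge only the i-th variable up
        down[i] -= h    # and down, keeping the others fixed
        grad.append((error(up) - error(down)) / (2.0 * h))
    return grad

def gradient_descent_step(error, params, learning_rate=0.1):
    """Move every parameter a small step against its partial derivative."""
    grad = numerical_gradient(error, params)
    return [p - learning_rate * g for p, g in zip(params, grad)]

def toy_error(p):
    # Made-up error function (an assumption for this demo), minimal at (1, -2).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

params_before = [0.0, 0.0]
params_after = gradient_descent_step(toy_error, params_before)
error_before = toy_error(params_before)
error_after = toy_error(params_after)
print(error_before, error_after)  # the error should get smaller after the step
```

Repeating the step many times moves the parameters closer and closer to the minimum of the error function, provided the learning rate is small enough.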
|
||||
|
||||
Backpropagation works by going "backwards" from the output towards the input. So, let's start by computing the derivative against the weight *w100*.
|
||||
Backpropagation works by going "backwards" from the output towards the input. So, let's start by computing the derivative against the weight *w100*. It will be a specific number; let's call it *'w100*. The derivative of a sum is equal to the sum of the derivatives:
|
||||
|
||||
*derivative(E,w100) = derivative(squared_error(z0),w100) + derivative(squared_error(z1),w100) = derivative(squared_error(z0),w100) + 0*
|
||||
*'w100 = derivative(E,w100) = derivative(squared_error(z0),w100) + derivative(squared_error(z1),w100) = derivative(squared_error(z0),w100) + 0*
|
||||
|
||||
Notice that the second part of the sum (*derivative(squared_error(z1),w100)*) became 0 because when differentiating with respect to *w100*, this expression is seen as a constant (it doesn't depend on *w100*) and the derivative of a constant is 0. Now let's continue. We will now utilize the **chain rule**, which is a rule of differentiation that says:
|
||||
|
||||
*derivative(f(g(x)),x) = derivative(derivative(f(g(x)),g(x))) * derivative(g(x),x)*
|
||||
*derivative(f(g(x)),x) = derivative(f(g(x)),g(x)) * derivative(g(x),x)*
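The chain rule can be sanity-checked numerically. The following Python sketch picks two arbitrary functions for the purpose (*f = sin* and *g(x) = x^2*, both assumptions made only for this demo, not part of the network) and compares the chain-rule result with a finite-difference estimate of the derivative of *f(g(x))*.

```python
import math

def f(u):
    return math.sin(u)

def g(x):
    return x * x

def df_du(u):
    # derivative of f with respect to its own argument
    return math.cos(u)

def dg_dx(x):
    # derivative of g with respect to x
    return 2.0 * x

def chain_rule_derivative(x):
    # derivative(f(g(x)),x) = derivative(f(g(x)),g(x)) * derivative(g(x),x)
    return df_du(g(x)) * dg_dx(x)

def numerical_derivative(func, x, h=1e-6):
    # central finite difference, used only as an independent check
    return (func(x + h) - func(x - h)) / (2.0 * h)

x_test = 0.7
chain_value = chain_rule_derivative(x_test)
numeric_value = numerical_derivative(lambda v: f(g(v)), x_test)
print(chain_value, numeric_value)  # the two numbers should agree closely
```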
|
||||
|
||||
In order to simplify the following equation let *T = w100 * y0(x) + w110 * y1(x) + b1*. Applying this rule to the above gives us (for demonstration with all intermediate steps):
|
||||
*derivative(E,w100) = derivative(squared_error(z0),w100) = derivative(squared_error(activation(w100 * y0(x) + w110 * y1(x) + b1)),w100) = derivative(squared_error(z0),z0) * derivative(activation(T),T) * derivative(w100 * y0(x) + w110 * y1(x) + b1,w110)*
|
||||
In order to simplify the following equation let *T = w100 * y0 + w110 * y1 + b1*. Applying the chain rule to the above gives us (for demonstration with all intermediate steps):
|
||||
|
||||
Now we can compute all the three parts of the sum:
|
||||
*'w100 = derivative(E,w100) = derivative(squared_error(z0),w100) = derivative(squared_error(activation(w100 * y0 + w110 * y1 + b1)),w100) = derivative(squared_error(z0),z0) * derivative(activation(T),T) * derivative(w100 * y0 + w110 * y1 + b1,w100)*
|
||||
|
||||
*derivative(squared_error(z0),z0) = derivative(1/2 * (z0 - z0_desired)^2,z0) = z0 - z0_desired*
|
||||
Now we can compute all three parts of the product:
|
||||
|
||||
*derivative(activation(T),T) = derivative(1/(1 + e^(-T)),T) = activation(T) * (1 - activation(T)) = z0 * (1 - z0)*
|
||||
*derivative(squared_error(z0),z0) = derivative(1/2 * (z0 - z0_desired)^(2),z0) = z0 - z0_desired*
|
||||
|
||||
*derivative(w100 * y0(x) + w110 * y1(x) + b1,w100) = y0(x)*
|
||||
*derivative(activation(T),T) = derivative(1/(1 + e^(-T)),T) = activation(T) * (1 - activation(T)) = z0 * (1 - z0)*
|
||||
|
||||
**Now we have computed the derivative against w100**, i.e. we have a formula with only variables whose values we know, so we can plug the values in and compute the derivative as a number *w100'*:
|
||||
*derivative(w100 * y0 + w110 * y1 + b1,w100) = y0*
|
||||
|
||||
*w100' = (derivative(1/2 * (z0 - z0_desired)^2,z0) = z0 - z0_desired) * derivative(1/(1 + e^(-T)),T) = z0 * (1 - z0) * y0(x)*
|
||||
**Now we have computed the derivative against w100**, the final formula is:
|
||||
|
||||
*'w100 = (z0 - z0_desired) * (z0 * (1 - z0)) * y0*
|
||||
|
||||
At any specific moment during training, values of all the variables in this formula are known to us so we can plug them in and get a specific number.
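As a check of this kind of formula, the following Python sketch computes the derivative against *w100* and compares it with a finite-difference estimate. It rests on stated assumptions: the activation is taken to be the sigmoid *1/(1 + e^(-T))*, the error of one output is *1/2 * (z - z_desired)^2*, the second output is *z1 = activation(w101 * y0 + w111 * y1 + b1)*, and all numeric values are made up. Under these assumptions the three factors of the product are *(z0 - z0_desired)*, *z0 * (1 - z0)* and *y0*.

```python
import math

def activation(t):
    # Sigmoid, assumed here to have the form 1 / (1 + e^(-t)).
    return 1.0 / (1.0 + math.exp(-t))

def total_error(w, b1, y0, y1, z0_desired, z1_desired):
    # w is a dict holding the output layer weights w100, w110, w101, w111.
    z0 = activation(w["w100"] * y0 + w["w110"] * y1 + b1)
    z1 = activation(w["w101"] * y0 + w["w111"] * y1 + b1)
    return 0.5 * (z0 - z0_desired) ** 2 + 0.5 * (z1 - z1_desired) ** 2

def derivative_w100(w, b1, y0, y1, z0_desired):
    # Product of the three chain rule factors.
    z0 = activation(w["w100"] * y0 + w["w110"] * y1 + b1)
    return (z0 - z0_desired) * (z0 * (1.0 - z0)) * y0

# Made-up values, only for the demonstration.
weights = {"w100": 0.3, "w110": -0.5, "w101": 0.8, "w111": 0.1}
b1, y0, y1 = 0.2, 0.6, 0.9
z0_desired, z1_desired = 1.0, 0.0

analytic_w100 = derivative_w100(weights, b1, y0, y1, z0_desired)

# Independent check: nudge w100 slightly and see how the total error changes.
h = 1e-6
w_up = dict(weights, w100=weights["w100"] + h)
w_down = dict(weights, w100=weights["w100"] - h)
numeric_w100 = (total_error(w_up, b1, y0, y1, z0_desired, z1_desired)
                - total_error(w_down, b1, y0, y1, z0_desired, z1_desired)) / (2.0 * h)

print(analytic_w100, numeric_w100)  # the two numbers should agree closely
```

Note that the finite-difference check needs two full evaluations of the network per weight, whereas the formula reuses values (*z0*, *y0*) that are already known from the forward pass.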
|
||||
|
||||
In the same way we can compute *'w101*, *'w110* and *'w111* (the other weights leading to the output layer).
|
||||
|
||||
Now let's compute the derivative with respect to *w000*, i.e. the number *'w000*. We will proceed similarly, but the computation will be different because the weight *w000* affects both output neurons (*z0* and *z1*).
|
||||
|
||||
TO BE CONTINUED