Back propagation with TensorFlow
(Updated for TensorFlow 1.0 on March 6th, 2017)
When I first read about neural networks in Michael Nielsen’s Neural Networks and Deep Learning, I was excited to find a good source that explains the material along with actual code. However, there was a rather steep jump between the part that describes the basic math and the part that implements it, and it was especially apparent in the numpy-based code that implements backward propagation.
So, in order to explain it better to myself, and to learn about TensorFlow in the process, I took it upon myself to implement the first network in the book using TensorFlow, in two ways: first, by manually defining the back propagation step, and second, by letting TensorFlow do the hard work using automatic differentiation.
Setup
First, we need to load TensorFlow and set up the basic parts of the graph: inputs (\(a_0\), \(y\)) and states (\(w_1\), \(b_1\), \(w_2\), \(b_2\)).

The sigmoid function
Our sigmoid function, although provided by TensorFlow’s extensive function library, is shown here for reference:
\[ \sigma(z) = \frac{1}{1+e^{-z}} \]

The forward propagation
Provided that the input image is given by the \(a_0\) matrix, calculating forward propagation for multiple images at a time can be done with simple matrix multiplication, defined as such:
\[ \begin{align} & z_1 = a_0 \cdot w_1 + b_1 \\ & a_1 = \sigma(z_1) \\ & z_2 = a_1 \cdot w_2 + b_2 \\ & a_2 = \sigma(z_2) \\ \end{align} \]
Given a tensor of multiple images, this can be done in TensorFlow for all of them at the same time (thanks to ‘broadcasting’), so the above gets a one-to-one translation in TensorFlow:

Difference
The input provides \(y\) as the test for the accuracy of the network’s output, so we compute the following vector:
\[ \begin{align} & \nabla a = a_2 - y \\ \end{align} \]

The sigmoid prime function
Here’s the derivative of the sigmoid function from above, which will be needed during the backward propagation:
\[ \sigma'(z) = \sigma(z)(1 - \sigma(z)) \]

Backward propagation
The most complicated part is the backward propagation. First, we need to compute the deltas of the weights and biases. In the original book, the Python code was a bit puzzling, but here we can describe the same algorithm in a functional, stateless way. Writing \(\odot\) for element-wise multiplication and \(\cdot\) for matrix multiplication:
\[ \begin{align} & \nabla z_2 = \nabla a \odot \sigma'(z_2) \\ & \nabla b_2 = \nabla z_2 \\ & \nabla w_2 = a_1^T \cdot \nabla z_2 \\ & \\ & \nabla a_1 = \nabla z_2 \cdot w_2^T \\ & \nabla z_1 = \nabla a_1 \odot \sigma'(z_1) \\ & \nabla b_1 = \nabla z_1 \\ & \nabla w_1 = a_0^T \cdot \nabla z_1 \\ \end{align} \]
It also translates one-to-one:

Updating the network
We take the computed \(\nabla\)s and update the weights in one step. Note that the following does not precisely match the book: I have omitted the constant \(1/n\) divider. It is not really needed here, since it can be folded into \(\eta\) itself.
\[ \begin{align} & w_1 \leftarrow w_1 - \eta \cdot \nabla w_1 \\ & b_1 \leftarrow b_1 - \eta \cdot \nabla b_1 \\ & w_2 \leftarrow w_2 - \eta \cdot \nabla w_2 \\ & b_2 \leftarrow b_2 - \eta \cdot \nabla b_2 \\ \end{align} \]
In TensorFlow, this translates to a list of assignments:

Running and testing the training process
The following trains the network and tests it along the way, using mini-batches of 10. Here, I chose to test with 1000 items from the test set, every 1000 mini-batches.

Running it shows that it manages to train the network, as we quickly get 923 correct out of 1000 tests.

Automatic differentiation
The great part about TensorFlow is its ability to derive the step function on its own. So, instead of the rather complicated ‘Backward propagation’ and ‘Updating the network’ steps given above for educational purposes (sections 1.5 and 1.6), we can simply write:
Step function alternative:

And observe that the training still works.