Computing symbolic gradient vectors with plain Haskell
While writing my previous post, I was curious how easy it would be to implement TensorFlow‘s automatic differentiation for back propagation. In TensorFlow’s web site they call it ’automatic differentiation’ but in fact they probably do ‘symbolic differentiation’, as mentioned in their white paper. The difference between the two relates to whether the differentiation is done during the original computation or beforehand. It makes sense to do the latter, because then you can maintain a separate computational graph of the back propagation to perform the updates.
I’ve looked into this topic in the Haskell ecosystem, and found many useful and extensive libraries, namely ad by Edward Kmett. However, to use these libraries or understand many of the blog posts on the subject requires some advanced Haskell, and I was wondering whether one can get going with differentiation using very basic and lean use of Haskell.
So, how easy it would be to compute gradients of singleoutput functions, using Haskell with only the basic arsenal at our hands?
Data and imports
First, we perform some imports and declare the basic data type to hold our expression tree:

We shall be able to derive symbolic gradients for any function built with this data type.
Poor man’s prettyprinting
One cannot go by without a nice String representation:

This implementation is basic in so that a sequence of summations will bear a horrible representation similar to (x1 + (x2 + (x3 + (...))))
 however it’s enough to get us going.
Sample
The Wikipedia page for Automatic Differentiation demonstrates with the following function:
\[ f(x_1, x_2) = \sin x_1 + x_1x_2 \]
It should be easy enough to represent it with our Haskell data, and use fshow
from above:

Gradient
The gradient
function below takes an expression, and returns a map from each term number to the expression that computes it. The definition of the function is recursive and based on known simple derivation rules:

The interesting parts are where Map.unionWith
is used for addition and multiplication. Notice how easily the Mul
part relates to the known derivation rule:
\[(f(x)g(x))' = g(x)f'(x) + g'(x)f(x)\]
The documentation for Data.Map can tell about Map.map
and Map.unionWith
.
Small helpers
Before testing it, we’ll add just two helper functions. The first function simplifies expressions by getting rid of the \(1.0\) literals we have added.

The second function will do all the work at the program’s top level to compute the gradient and print it:

Does it work?

Looks that it does. We have arrived at the same results as Wikipedia.
\[ \begin{align} & \frac{∂ f}{∂ x_1} = \cos x_1 + x_2 \\ & \frac{∂ f}{∂ x_2} = x_1 \\ \end{align} \]
Will it work with something more complex?

Comparing with Wolfram Alpha, it seems to get it right.
End note
Advance extensions of what I illusrated here can add a considerable amount of functionality and ease of use. We will definitely need to support matrices for instance, if we would like to derive a backpropagation graph. You can browse the ad package to get some ideas.