Automatic Differentiation

Suppose that f: \mathbb{R} \to \mathbb{R} is a function whose definition is too complicated for us to feasibly differentiate symbolically. Perhaps the most straightforward way to approximate the derivative of f is to calculate the difference quotient

\begin{align*} \frac{f(x + \epsilon) - f(x)}{\epsilon} \end{align*}

for a small value of \epsilon. However, this approach is very inaccurate because the subtraction step is ill-conditioned.

Exercise
Use difference quotients to approximate the derivative of f(x) = x^2 at x = \frac{2}{3}, with \epsilon = 2^k as k ranges from -60 to -20. What is the least error over these values of k? How does that error compare to machine epsilon?

Solution. The block

diffquotient(f,x,ϵ) = (f(x+ϵ) - f(x))/ϵ  # forward difference quotient
# least error over ϵ = 2^k for k from -60 to -20; the true derivative is 4/3
m = minimum([abs(diffquotient(x->x^2,2/3,2.0^k) - 4/3) for k = -60:-20])

returns 2.48 \times 10^{-9}. This error is more than ten million times larger than we could hope for just from roundoff error:

m / (nextfloat(4/3) - 4/3)  # ratio of the least error to the float spacing near 4/3

The problem with difference quotients is that the relative error of f(x+\epsilon) - f(x) blows up as \epsilon \to 0 (due to catastrophic cancellation). Larger values of \epsilon are inaccurate because of truncation error. Even at the optimal value of \epsilon, the precision is still poor.
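To see both failure modes concretely, we can tabulate the error for a few step sizes (a quick sketch in the spirit of the exercise above; the helper name dq is ours):

dq(f, x, ϵ) = (f(x + ϵ) - f(x)) / ϵ   # forward difference quotient
# tiny ϵ suffers catastrophic cancellation, large ϵ suffers truncation
# error, and the best accuracy sits in between (around ϵ ≈ 2^-26)
for k in (-50, -40, -30, -26, -20, -10, -5)
    err = abs(dq(x -> x^2, 2/3, 2.0^k) - 4/3)
    println("ϵ = 2^", k, ":  error ≈ ", err)
end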

On the other hand, the problem of calculating the derivative of f is well-conditioned as long as the condition number |xf''(x)/f'(x)| of f'(x) isn't too large. So the difference quotient algorithm is unstable, and we can hope for a stable alternative.
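For example, if f(x) = x^2, then the condition number is |x \cdot 2/(2x)| = 1, so computing f'(x) = 2x is perfectly well-conditioned, even though the difference quotient above loses about half of the available precision.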

Another straightforward alternative to difference quotients is to calculate the derivative symbolically, the way introductory calculus students learn to do it. For example, the derivative of x^2 is 2x. However, this approach becomes untenable for sufficiently complicated functions, which are typical in modern machine learning applications.

Indeed, there is an approach to derivative computation which is precise, fast, and scalable: automatic differentiation. The idea is to replace Float64s with objects called dual numbers, which track values and derivatives at the same time.

One concrete way to realize dual numbers is to use the matrix

\begin{align*}\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\end{align*}

in place of x in the program that computes f(x). This requires that any internal calculations performed by f are able to handle 2 \times 2 matrices as well as plain numbers. The matrix resulting from this calculation will be equal to

\begin{align*}\begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix},\end{align*}

allowing us to read off the derivative as the top-right entry:

f(x) = x^2 + 8x
f([5 1; 0 5])   # returns [65 18; 0 65]; the top-right entry is f'(5) = 18

Exercise
In this exercise, we will explain why

\begin{align*} f\left(\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\right) = \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix}, \end{align*}

for any polynomial f.

  • Check that the above equation holds for the identity function (the function which returns its input) and for the function which returns the multiplicative identity (in other words, 1 for real numbers, or the identity matrix for matrices).
  • Check that if the above equation holds for two differentiable functions f and g, then it holds for the sum f+g and the product fg.
  • Explain why the above equation holds for any polynomial function f(x).

Solution.

  • If f is the identity function, then both sides of the above equation reduce to \begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}. If f returns the multiplicative identity, then both sides reduce to the identity matrix.
  • We have

\begin{align*} \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix} \begin{bmatrix} g(x) & g'(x) \\ 0 & g(x) \end{bmatrix} = \begin{bmatrix} f(x)g(x) & f'(x)g(x) + f(x)g'(x) \\ 0 & f(x)g(x) \end{bmatrix}, \end{align*}

which in turn is equal to \begin{bmatrix} f(x)g(x) & (f(x)g(x))' \\ 0 & f(x)g(x) \end{bmatrix} by the product rule. The result for f+g works similarly, with linearity of the derivative in place of the product rule.

  • The set of functions which satisfies the above equation includes 1 and x and is closed under multiplication and addition. Therefore, this set of functions at least includes all polynomials.
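As a sanity check, we can verify the identity numerically for a particular polynomial, say f(x) = x^3 - 2x (a small illustration; the function names are ours):

f(x) = x^3 - 2x               # an example polynomial
fprime(x) = 3x^2 - 2          # its derivative, computed by hand
x = 1.5
f([x 1; 0 x]) ≈ [f(x) fprime(x); 0 f(x)]   # true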

While the exercise above only addresses polynomial functions f, the relationship f\left(\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\right) = \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix} actually holds for many more functions, because many common functions may be described as limits of polynomials: if A is a matrix, then

\begin{align*}\operatorname{e}^A = 1 + A + \frac{1}{2}A^2 + \frac{1}{6}A^3 + \cdots\end{align*}

Since the identity f\left(\begin{bmatrix} x & 1 \\ 0 & x \end{bmatrix}\right) = \begin{bmatrix} f(x) & f'(x) \\ 0 & f(x) \end{bmatrix} is true for every truncation of the sum on the right-hand side, it's true for the exponential function as well.
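We can check this claim directly, since Julia's exp function already knows how to exponentiate a matrix (a quick verification, not part of the original text):

using LinearAlgebra
x = 2.0
exp([x 1; 0 x])   # ≈ [exp(2) exp(2); 0 exp(2)], since (eˣ)' = eˣ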

Exercise
Use automatic differentiation to find the derivative of f(x) = (x^4 - 2x^3 - x^2 + 3x - 1)e^{-x^2/4} at the point x = 2. Compare your answer to the true value of f'(2). Hint: You'll want to define f using

using LinearAlgebra
f(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - I)

where I is an object in Julia which is defined to behave like the multiplicative identity (note that subtracting the identity matrix is the appropriate matrix analogue of subtracting 1 from a real number).

Also, to help check your answer, here's the symbolic derivative of f:

df(t) = (-t^5 + 2*t^4 + 9*t^3 - 15*t^2 - 3*t + 6) * exp(-t^2/4)/2

Solution. We define f as suggested and then calculate f([2 1; 0 2])[1,2]. The result agrees exactly with df(2) and differs from df(big(2)) by about 7.46 \times 10^{-17}. So we see that automatic differentiation gives a major improvement over the difference quotient approach in this instance.
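Spelled out, the whole computation looks like this (a sketch of the solution just described):

using LinearAlgebra
f(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - I)
df(t) = (-t^5 + 2*t^4 + 9*t^3 - 15*t^2 - 3*t + 6) * exp(-t^2/4)/2
d = f([2.0 1.0; 0.0 2.0])[1,2]   # derivative read off the dual matrix
d - df(2)                        # 0.0: agrees exactly in Float64
d - df(big(2))                   # ≈ 7.46 × 10⁻¹⁷: near machine precision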

In practice, you will usually want to use a library to perform automatic differentiation, because ensuring suitable dual-number-awareness of all of the functions called by f can be a daunting task. Julia has several packages for this purpose, including ForwardDiff. In Python you can use autograd, which works with most NumPy functions. (We note that these libraries don't actually use 2 \times 2 matrices to represent dual numbers; they introduce a custom type with the same behavior.)
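For instance, with ForwardDiff the derivative from the last exercise reduces to a one-line call (assuming the package has been installed):

using ForwardDiff
f(t) = exp(-t^2/4) * (t^4 - 2t^3 - t^2 + 3t - 1)   # plain 1, not I: t is a scalar here
ForwardDiff.derivative(f, 2.0)                     # matches the dual-matrix result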
