Multivariable Calculus: Differentiation

Reading time: ~45 min

Differentiating a single-variable function involves answering the question: near a given point, how much does the value of the function change per unit change in the input? In the higher-dimensional setting, the question must be made more specific, since the change in output depends not only on how much the input is changed but also on the direction of the change in input.

Consider, for example, the function f(x,y) which returns the altitude of the point on earth with latitude x and longitude y. If the point (x,y) identifies a point on a sloping hillside, then there are some directions in which f increases, others in which f decreases, and two directions in which f neither increases nor decreases (these are the directions along the hill's contour lines, as you would see represented on a map).

Partial derivatives

The simplest directions for inquiring about the instantaneous rate of change of f are those along the axes: The partial derivative \frac{\partial f}{\partial x}(x_0,y_0) of a function f(x,y) at a point (x_0,y_0) is the slope of the graph of f in the x-direction at the point (x_0,y_0). In other words, it's the slope of the intersection of the graph of f with the plane y=y_0. The partial derivative \frac{\partial f}{\partial x}(x_0,y_0) may also be denoted f_x(x_0,y_0).
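To make this concrete, here is a minimal sketch in Python with SymPy, using a hypothetical example function (not one from the text), that computes both partial derivatives symbolically and evaluates them at a point:

```python
import sympy as sp

# Hypothetical example function: f(x, y) = x^2*y + sin(y)
x, y = sp.symbols('x y')
f = x**2 * y + sp.sin(y)

f_x = sp.diff(f, x)   # partial derivative with respect to x: 2*x*y
f_y = sp.diff(f, y)   # partial derivative with respect to y: x**2 + cos(y)

# Slopes of the graph in the x- and y-directions at the point (x0, y0) = (1, 2)
print(f_x.subs({x: 1, y: 2}))   # 4
print(f_y.subs({x: 1, y: 2}))   # cos(2) + 1
```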

Exercise
Consider the function f whose graph is shown. Determine the sign of f_x(1,1) and the sign of f_y(1,1).

Solution. If we increase x a little while holding y constant, then f decreases. Therefore, f_x(1,1) < 0. If we increase y a little while holding x constant, then f increases. Therefore, f_y(1,1) > 0.

Graphically, the partial derivative with respect to x at a point is equal to the slope of the trace of the graph in the "y =\text{constant}" plane passing through that point. Similarly, the partial derivative with respect to y at a point is equal to the slope of the trace of the graph in the "x =\text{constant}" plane passing through that point.

We can partial-differentiate multiple times, and it turns out that the order in which we apply these partial differentiation operations doesn't matter, as long as the mixed partial derivatives are continuous. This fact is called Clairaut's theorem.

Exercise
Consider the function f(x,y) = \mathrm{e}^{xy}\sin(y). Show that differentiating with respect to x and then with respect to y gives the same result as differentiating with respect to y and then with respect to x.

Solution. The partial derivative of f with respect to x is y\mathrm{e}^{xy}\sin(y), and the derivative of that with respect to y is \mathrm{e}^{xy} (x y \sin(y) + \sin(y) + y \cos(y)). The partial derivative of f with respect to y is \mathrm{e}^{xy} (x \sin(y) + \cos(y)), and the derivative of that with respect to x is \mathrm{e}^{xy} (x y \sin(y) + \sin(y) + y \cos(y)). Therefore, the conclusion of Clairaut's theorem is satisfied in this case.
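The same check can be carried out symbolically; the following sketch assumes Python with SymPy and simply compares the two mixed partials of this particular f:

```python
import sympy as sp

x, y = sp.symbols('x y')
f = sp.exp(x*y) * sp.sin(y)

f_xy = sp.diff(f, x, y)   # differentiate with respect to x, then y
f_yx = sp.diff(f, y, x)   # differentiate with respect to y, then x

print(sp.simplify(f_xy - f_yx))   # 0, as Clairaut's theorem predicts
```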

Differentiability

A single-variable function f is differentiable at a point a if and only if its graph looks increasingly like that of a non-vertical line as you zoom in around the point. In other words, f is differentiable at a if and only if there's a linear function L such that \frac{f(x) - L(x)}{x-a} goes to 0 as x \to a.

Likewise, a function of two variables is said to be differentiable at a point if its graph looks like a plane when you zoom in sufficiently around the point; that is, f is differentiable at (a,b) if

\begin{align*}\lim_{(x,y) \to (a,b)}\frac{f(x,y) - c_0 - c_1(x-a) - c_2(y-b)}{|[x,y] - [a,b]|} = 0\end{align*}

for some real numbers c_0, c_1, and c_2. If such a linear function c_0 + c_1(x-a) + c_2(y-b) exists, then its coefficients are necessarily c_0 = f(a,b), c_1 = f_x(a,b), and c_2 = f_y(a,b).

The function f is differentiable at the point shown, because its graph looks increasingly like the dark green plane shown as you zoom in around the point.

So, the equation of the plane tangent to the graph of a differentiable function f at the point (a,b,f(a,b)) is given by

\begin{align*}z = f(a,b) + f_x(a,b)(x-a) + f_y(a,b)(y-b).\end{align*}

This equation says how f behaves for values of (x,y) very close to (a,b): the output changes by the x-change x-a times f's sensitivity to changes in x (namely f_x(a,b)) plus the y-change times f's sensitivity to changes in y (namely f_y(a,b)).
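The following numerical sketch (plain Python, with a hypothetical function and hand-computed partial derivatives) illustrates this: near (a,b), the tangent plane tracks f to better than first order, so the error shrinks quadratically as the point approaches (a,b).

```python
# Hypothetical example: f(x, y) = x**2 + 3*x*y, with hand-computed partial derivatives
f   = lambda x, y: x**2 + 3*x*y
f_x = lambda x, y: 2*x + 3*y     # sensitivity of f to changes in x
f_y = lambda x, y: 3*x           # sensitivity of f to changes in y

a, b = 1.0, 2.0
# Tangent plane to the graph of f at (a, b, f(a, b))
L = lambda x, y: f(a, b) + f_x(a, b)*(x - a) + f_y(a, b)*(y - b)

# Near (a, b), the difference between f and the tangent plane vanishes faster than h
for h in [1e-1, 1e-2, 1e-3]:
    print(h, abs(f(a + h, b + h) - L(a + h, b + h)))   # error shrinks like h**2
```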

Gradient

Once we know how a differentiable function f: \mathbb{R}^d \to \mathbb{R} changes in the coordinate-axis directions, we can use the formula z = f(a,b) + f_x(a,b)(x-a) + f_y(a,b)(y-b) to succinctly express how it changes in any direction: we form the gradient \nabla f of f by putting all of its partial derivatives together into a vector. Then, for any unit vector \mathbf{u}, the rate of change of f in the \mathbf{u} direction is equal to \nabla f\cdot \mathbf{u}.

Since \nabla f\cdot \mathbf{u} = |\nabla f| \cos \theta, where \theta is the angle between \nabla f and \mathbf{u}, the direction of the gradient is the direction in which f increases most rapidly. The direction opposite to the gradient is the direction of maximum decrease, and the directions orthogonal to these are the ones in which f is constant.
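As an illustration, this sketch (Python with NumPy; the example gradient is hypothetical) computes a directional derivative as \nabla f\cdot \mathbf{u} along with the direction and rate of steepest increase:

```python
import numpy as np

# Hypothetical example: f(x, y) = x**2 * y, so grad f = [2*x*y, x**2]
grad_f = lambda x, y: np.array([2*x*y, x**2])

g = grad_f(1.0, 2.0)               # gradient at (1, 2): [4, 1]
u = np.array([3.0, 4.0]) / 5.0     # a unit vector

rate_in_u = g @ u                  # rate of change of f in the u direction
steepest  = g / np.linalg.norm(g)  # direction of fastest increase
max_rate  = np.linalg.norm(g)      # rate of change in that direction

print(rate_in_u, steepest, max_rate)
```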

Exercise
Suppose that f:\mathbb{R}^2 \to \mathbb{R} is a differentiable function at the point (a,b)\in \mathbb{R}^2 and that its instantaneous rates of change in the directions \mathbf{u} and \mathbf{v} are known. Show that if \mathbf{u} and \mathbf{v} are not parallel, then it is always possible to infer f's rates of change in the coordinate-axis directions.

Solution. The problem stipulates that we are given equations of the form

\begin{align*}u_1 f_x(a,b) + u_2 f_y(a,b) &= c_1 \\ v_1 f_x(a,b) + v_2 f_y(a,b) &= c_2\end{align*}

for some numbers u_1, u_2, c_1, v_1, v_2, c_2. This system may be written in matrix form as

\begin{align*}\begin{bmatrix} u_1 & u_2 \\ v_1 & v_2 \end{bmatrix} \begin{bmatrix} f_x(a,b) \\ f_y(a,b) \end{bmatrix} = \begin{bmatrix} c_1 \\ c_2 \end{bmatrix}.\end{align*}

Since \mathbf{u} and \mathbf{v} are not parallel, they span \mathbb{R}^2. Therefore, the matrix \begin{bmatrix} u_1 & u_2 \\ v_1 & v_2 \end{bmatrix} is invertible, and the solution of this system is \begin{bmatrix} f_x(a,b) \\ f_y(a,b) \end{bmatrix} = \begin{bmatrix} u_1 & u_2 \\ v_1 & v_2 \end{bmatrix}^{-1}\begin{bmatrix} c_1 \\ c_2 \end{bmatrix}.
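A numerical version of this argument (Python with NumPy; the directions and rates below are made-up data) recovers f_x(a,b) and f_y(a,b) by solving the 2\times 2 system:

```python
import numpy as np

# Made-up data: unit directions u, v and the known rates of change of f along them
u = np.array([1.0,  1.0]) / np.sqrt(2)
v = np.array([1.0, -1.0]) / np.sqrt(2)
c = np.array([3.0,  1.0])          # c1 = grad f . u,  c2 = grad f . v

A = np.vstack([u, v])              # rows are the direction vectors
fx, fy = np.linalg.solve(A, c)     # recover f_x(a, b) and f_y(a, b)
print(fx, fy)                      # approximately 2.83 and 1.41
```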

Exercise
Consider a function f from \mathbb{R}^2 to \mathbb{R} and a point where f is differentiable with nonzero gradient. The number of directions in which f increases maximally from that point is ____. The number of directions in which f decreases maximally from that point is ____. The number of directions in which f remains approximately constant is ____.

Solution. f increases maximally in the direction of its gradient and decreases maximally in the opposite direction. It remains approximately constant in the two directions orthogonal to its gradient.

Exercise
Consider a function f from \mathbb{R}^3 to \mathbb{R} and a point where f is differentiable with nonzero gradient. The number of directions in which f increases maximally from that point is ____. The number of directions in which f decreases maximally from that point is ____. The number of directions in which f remains approximately constant is ____.

Solution. f increases maximally in the direction of its gradient and decreases maximally in the opposite direction. It remains approximately constant in the plane of directions orthogonal to its gradient. Since a plane contains infinitely many directions, the number of directions in which f remains approximately constant is infinite.

Second-order differentiation

We can take the notion of a gradient, which measures the linear change of a function, up a degree. The Hessian of a function f: \mathbb{R}^n \to \mathbb{R} is defined to be the matrix

\begin{align*}\mathbf{H}(\mathbf{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1\,\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1\,\partial x_n} \\[2.2ex] \frac{\partial^2 f}{\partial x_2\,\partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2\,\partial x_n} \\[2.2ex] \vdots & \vdots & \ddots & \vdots \\[2.2ex] \frac{\partial^2 f}{\partial x_n\,\partial x_1} & \frac{\partial^2 f}{\partial x_n\,\partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}\end{align*}
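For instance, the Hessian of a hypothetical two-variable function can be computed symbolically; this sketch assumes Python with SymPy:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2')
f = sp.exp(x1) * sp.cos(x2)          # hypothetical example function

H = sp.hessian(f, (x1, x2))          # matrix of all second partial derivatives
print(H)
# Matrix([[exp(x1)*cos(x2), -exp(x1)*sin(x2)],
#         [-exp(x1)*sin(x2), -exp(x1)*cos(x2)]])
```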

The best quadratic approximation to the graph of a twice-differentiable function f:\mathbb{R}^n \to \mathbb{R} at the origin is

\begin{align*}Q(\mathbf{x}) = f(\mathbf{0}) + (\nabla f(\mathbf{0}))'\mathbf{x}+ \frac{1}{2}\mathbf{x}' \mathbf{H}(\mathbf{0})\mathbf{x}.\end{align*}

The same is true at points \mathbf{a} other than the origin if we evaluate the gradient and Hessian at \mathbf{a} instead of \mathbf{0} and if we replace \mathbf{x} with \mathbf{x}-\mathbf{a}.

Q is the best quadratic approximation of f at the origin.
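The sketch below (Python with NumPy; the function, gradient, and Hessian are a hypothetical hand-worked example) checks that Q matches f to better than second order near the origin:

```python
import numpy as np

# Hypothetical hand-worked example: f(x1, x2) = exp(x1) + x1*x2**2 near the origin
f = lambda x: np.exp(x[0]) + x[0] * x[1]**2

f0    = 1.0                                   # f(0)
grad0 = np.array([1.0, 0.0])                  # gradient of f at 0, computed by hand
H0    = np.array([[1.0, 0.0],
                  [0.0, 0.0]])                # Hessian of f at 0, computed by hand

Q = lambda x: f0 + grad0 @ x + 0.5 * x @ H0 @ x

for h in [1e-1, 1e-2, 1e-3]:
    x = np.array([h, h])
    print(h, abs(f(x) - Q(x)))                # error shrinks like h**3
```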

Exercise
Suppose that a,b,c,d,e and f are real numbers and that f(x,y) = a + bx + cy + dx^2 + exy + fy^2. Show that the quadratic approximation of f at the origin is equal to f.

Solution. The gradient of f evaluated at the origin is [b, c], so the linear approximation of f is

\begin{align*}f(0,0) + f_x(0,0)\, x + f_y(0,0)\, y = a + bx + cy.\end{align*}

The Hessian is \begin{bmatrix} 2d & e \\ e & 2f \end{bmatrix}, so the quadratic terms in the quadratic approximation are

\begin{align*}\frac{1}{2}(2dx^2 + exy + exy + 2fy^2) = dx^2 + exy + fy^2,\end{align*}

as desired.

We can combine the ideas of quadratic approximation and diagonalization to gain sharp insight into the shape of a function's graph at a point where the gradient is zero. Since the Hessian matrix H is symmetric by Clairaut's theorem, the spectral theorem implies that it is orthogonally diagonalizable.

With V\Lambda V' as the diagonalization of H, the quadratic term in the quadratic approximation becomes

\begin{align*}\frac{1}{2} \mathbf{x}' V \Lambda V' \mathbf{x} = \frac{1}{2} (V'\mathbf{x})' \Lambda (V' \mathbf{x})\end{align*}

Since the components of V'\mathbf{x} are the coordinates of \mathbf{x} with respect to the basis given by the columns \mathbf{v}_1, \ldots, \mathbf{v}_n of V, the quadratic term may be written as

\begin{align*}\frac{1}{2} (\lambda_1\tilde{x}_1^2 + \lambda_2\tilde{x}_2^2 + \cdots + \lambda_n\tilde{x}_n^2),\end{align*}

where [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n] is the vector of coordinates of [x_1, x_2, \ldots, x_n] with respect to the basis given by the columns of V.

Writing the quadratic approximation of f in the form \frac{1}{2}(\lambda_1\tilde{x}_1^2 + \lambda_2\tilde{x}_2^2 + \cdots + \lambda_n\tilde{x}_n^2) is powerful because it presents the changes in f as a sum of n separate changes, each of which is as simple as the parabola y = ax^2.
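Here is a minimal sketch (Python with NumPy, using a made-up symmetric Hessian) confirming that \frac{1}{2}\mathbf{x}' \mathbf{H}\mathbf{x} equals \frac{1}{2}(\lambda_1\tilde{x}_1^2 + \lambda_2\tilde{x}_2^2) once we pass to eigenvector coordinates:

```python
import numpy as np

# Made-up Hessian (symmetric, as Clairaut's theorem guarantees)
H = np.array([[3.0, 1.0],
              [1.0, 3.0]])

lam, V = np.linalg.eigh(H)            # eigenvalues and orthonormal eigenvector columns
x = np.array([0.5, -0.2])
x_tilde = V.T @ x                     # coordinates of x in the eigenvector basis

quad_original = 0.5 * x @ H @ x
quad_diagonal = 0.5 * np.sum(lam * x_tilde**2)
print(quad_original, quad_diagonal)   # same value in both coordinate systems: 0.335
```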

If \lambda_1 is negative, then the graph of f is shaped like a down-turned parabola along the \mathbf{v}_1 axis. If it's positive, then the graph of f is shaped like an up-turned parabola along that axis.

Exercise
Consider a point (x_1, \ldots, x_n) where f has zero gradient and a Hessian with eigenvalues \lambda_1, \ldots, \lambda_n.

If all of the eigenvalues are positive, then f is ____ at (x_1, \ldots, x_n) than at nearby points.

If all of the eigenvalues are negative, then f is ____ at (x_1, \ldots, x_n) than at nearby points.

If some eigenvalues are positive and some are negative, then f increases as you move away from (x_1, \ldots, x_n) in some directions and ____ in other directions.

If every slice of the graph of f is convex, then f has a local minimum at that point.

If every slice is concave, then f has a local maximum there.

If there are both concave and convex slices, then f has a saddle point there, as the sketch below illustrates.
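These three cases can be read off from the signs of the Hessian's eigenvalues. The sketch below (Python with NumPy; classify_critical_point is a hypothetical helper, not a library function) implements that test:

```python
import numpy as np

def classify_critical_point(H, tol=1e-10):
    """Classify a point with zero gradient using the eigenvalues of the Hessian H."""
    lam = np.linalg.eigvalsh(H)            # eigenvalues of the symmetric matrix H
    if np.all(lam > tol):
        return "local minimum"             # graph curves upward in every direction
    if np.all(lam < -tol):
        return "local maximum"             # graph curves downward in every direction
    if np.any(lam > tol) and np.any(lam < -tol):
        return "saddle point"              # up in some directions, down in others
    return "inconclusive"                  # some eigenvalue is (numerically) zero

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 5.0]])))    # local minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -5.0]])))   # saddle point
```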

In addition to helping distinguish local minima, local maxima, and saddle points, the diagonalized Hessian can also help us recognize ravines in the graph of f. This idea arises in the context of numerical optimization methods for deep learning.

The minimum of this function is in a long, narrow valley with steep sides.

Exercise
Suppose that f:\mathbb{R}^2 \to \mathbb{R} has zero gradient at a given point, and suppose that its Hessian matrix at that point has eigenvalues \lambda_1 and \lambda_2. How can you recognize based on the values of \lambda_1 and \lambda_2 whether the graph of f is ravine-shaped?

Solution. If \lambda_1 and \lambda_2 are both positive, with one close to zero and the other very large, then the graph of f will be ravine-shaped. That's because the steep increase in one direction corresponds to one of the eigenvalues being very large, and the shallow increase in the orthogonal direction is indicated by the other eigenvalue being very small.
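As a rough illustration (Python with NumPy; the ravine function f(x,y) = x^2 + 100y^2 is a made-up example), the eigenvalues of the Hessian at the minimum differ by a factor of 100, which is exactly the one-steep-direction, one-shallow-direction signature described above:

```python
import numpy as np

# Made-up ravine: f(x, y) = x**2 + 100*y**2, a long narrow valley along the x-axis
H = np.array([[2.0,   0.0],
              [0.0, 200.0]])          # Hessian at the minimum (0, 0)

lam = np.linalg.eigvalsh(H)
print(lam)                            # [  2. 200.]
print(lam.max() / lam.min())          # ratio 100: one steep direction, one shallow one
```

This large eigenvalue ratio is what makes such landscapes awkward for simple gradient-based optimizers, which is why the idea comes up in the context of deep learning.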
