Custom autodiff part 5: a final example
Table of Contents
This post is Part 5/5 in a series on how to derive custom rules for differentiating through functions in autodiff frameworks like PyTorch, JAX, or Flux. In Part 1 we went over the basic idea of automatic differentiation as applied to functions with a single vector-valued input and output. In Part 2 we took a step back to go over linear maps, their adjoints, and how this relates to automatic differentiation. In Part 3 we learned how to derive custom rules for forward mode automatic differentiation, and then walked through a number of examples of doing so. In Part 4 we did the same for reverse-mode autodiff. Here, we’ll end with one last example worked end-to-end to see the full process: differentiating through an optimization problem.
As a reminder, here’s the outline of this series:
- Part 1: Autodiff basics: forward- and reverse-mode
- Part 2: Linear maps and adjoints
- Part 3: Deriving forward-mode AD rules (w/ examples)
- Part 4: Deriving reverse-mode AD rules (w/ examples)
- Part 5: Putting it all together: differentiating constrained optimization
For the code, I’ll once again be using my fork of “micrograd” to demonstrate implementations of the examples. Micrograd is a minimal autodiff engine developed by Andrej Karpathy as a hands-on explanation of how reverse-mode automatic differentiation works. It’s not intended for any heavy-duty machine learning or scientific computing, but it’s easy to understand and work with, and the basic mechanics are very similar to what you would need in order to implement one of these rules yourself in something like PyTorch or JAX.
Equality-constrained quadratic program #
The example problem in this post is an equality-constrained quadratic program (QP), defined mathematically as
Here and , so (symmetric) and . That is, there are “decision variables” and constraints.
As with the Lyapunov example, it’s easiest to view the QP solve as a function defined implicitly by its optimality conditions. These can be derived from the Lagrangian of the optimization problem. Introducing the Lagrange multipliers , the Lagrangian is
The optimality conditions are determined by setting the variations of the Lagrangian with respect to and equal to zero in turn. For the variation with respect to :
and the variation for the result is just the constraint:
Note that in deriving the first equation I transposed the last term of the Lagrangian so that wasn’t transposed: .
Combined, we can view these two as a linear system of equations implicitly defining as a function of the various parameters :
In contrast to inequality-constrained quadratic or more general nonlinear programs, the inequality-constrained QP can be solved in one shot by solving this symmetric linear system. Even better, Q and A are often sparse, so it can be solved efficiently with iterative linear system solvers or sparse direct solvers. Systems of this form are also often solved as subproblems of iterative solution methods for more general constrained optimization problems.
We could continue working with this as a function with four inputs and two outputs, but now that we see the structure it is easier to just treat it as a linear system solve and re-use our results from Part 3, example 2. That is, let
forming a square system of equations .
Then the pushforward is
which we can expand to the forward-mode rule
Given the primals and tangent input values , this tells us how to compute the tangent outputs by solving a (symmetric) linear system of equations. In fact, comparing the structure of the tangent problem with the original optimality conditions, we can see that this is just another quadratic program with different values for the vectors and ! As for the Lyapunov equations in Parts 3 and 4, we can re-use our original solver code and/or matrix factorizations to compute the pushforward.
The corresponding pullback could also be derived by reusing the results for the linear system solve. However, this case does require a little special treatment due to the symmetric structure of . Because the matrix is symmetric, the input space is not the space of matrices but symmetric matrices. As a result, the adjoint values must also be symmetric.
We can derive a pullback that enforces this by going back to the standard procedure of taking an inner product of the pushforward with an arbitrary adjoint value: in this case . In matrix notation this looks like the following:
but we could also write it using inner product notation:
As usual, we want to manipulate these inner products to isolate the tangent values (i.e. isolate everything with a “dot” superscript in the second operand). The algebra here isn’t anything new, so let’s skip to the result:
From this we can read off equations for each of the adjoint “bar” values:
What do we do with this? So far, we’ve seen many cases where the adjoints end up used as an intermediate stage of the calculation: we solve some “adjoint problem” first to determine and then use these to determine the result of the pullback. In this case, we will be given primals and adjoint outputs and will need to compute the adjoint inputs .
We can rewrite this set of equations in the two-step form by first writing a system of equations for the unknowns in terms of the inputs to the pullback:
Comparing again with the optimality conditions we originally derived, we see this is a third quadratic program! In fact, it has the same and matrices, but replaces the vectors and . Once again, this can be taken advantage of by reusing matrix factorizations.
From there we have everything we need to compute the adjoint inputs . The last wrinkle is the symmetry of . The formula above has , but there is no guarantee that this rank-1 matrix will be symmetric. We can instead enforce that it is symmetric by taking its symmetric part: . This probably amounts to a projection onto the input space of symmetric matrices, but honestly I don’t know if that’s technically true or not. The final equations for the adjoint inputs given the intermediate adjoint values are
Here’s a simple implementation in micrograd using OSQP that doesn’t take advantage of sparse matrices, but does reuse matrix factorizations by keeping the same OSQP object:
def solve_qp(Q, c, A, b, **settings):
# Solve the equality-constrained quadratic program
# min 0.5 x^T Q x + c^T x
# s.t. A x = b
# Initialize the OSQP solver
solver = osqp.OSQP()
P_sp = scipy_sparse.csc_matrix(Q.data)
A_sp = scipy_sparse.csc_matrix(A.data)
solver.setup(P=P_sp, q=c.data, A=A_sp, l=b.data, u=b.data, **settings)
results = solver.solve()
x = Array(results.x, (Q, c, A, b), 'solve_qp') # Solution
y = Array(results.y, (Q, c, A, b), 'solve_qp') # Lagrange multipliers
def _backward():
x_bar, y_bar = x.grad, y.grad
# Solve the adjoint system using the same OSQP solver
solver.update(q=-x_bar, l=y_bar, u=y_bar)
adj_results = solver.solve()
w_x, w_y = adj_results.x, adj_results.y
# Compute the adjoint inputs
Q_bar = -0.5 * (w_x @ x.data.T + x.data @ w_x.T)
c_bar = -w_x
A_bar = -(np.outer(w_y, x.data) + np.outer(y.data, w_x))
b_bar = w_y
# Accumulate the adjoint inputs in the gradients
Q.grad += Q_bar
c.grad += c_bar
A.grad += A_bar
b.grad += b_bar
x._backward = _backward
y._backward = _backward
return x, y
As usual, the naive mathematical derivation isn’t always the most practically efficient. For more on this example, see a 2017 paper from Amos & Kolter that derives forward- and reverse-mode rules for the more general inequality-constrained quadratic program using the Karush-Kuhn-Tucker (KKT) conditions, including an efficient algorithm for the reverse-mode computation. You can compare the equations here to Eq (8) in that paper (by taking the case where there are no inequality constraints).
Final thoughts #
At this point hopefully you’ve been able to work through these examples and build up some confidence in deriving custom rules for automatic differentiation. Being able to do this helps release you from the bounds of “what has already been implemented” and also lets you look out for opportunities to do something more efficient numerically.