In PyTorch, you can easily compute derivatives of a scalar-valued variable
f w.r.t. to a variable
param by calling
f.backward(). This computes the Jacobian
∂(f) / ∂(param)
that has shape
f is a reduction of a batched scalar
fs of shape
[N], then BackPACK is capable to compute the individual gradients for
each scalar with its
BatchGrad extension. This yields the Jacobian
∂(fs) / ∂(param) of shape
This example demonstrates how to compute the Jacobian of a tensor-valued
fs, here for the example of a batched vector of shape
[N, C], whose Jacobian has shape
[N, C, *param.shape].
We will use the batched vector-valued output of a simple MLP as tensor
fs that should be differentiated w.r.t. the model parameters
param_1, param_2, .... For
param_i, this leads to a Jacobian
∂(fs) / ∂(param_i) of shape
[N, C, *param_i.shape].
Let’s start by importing the required functionality and write a setup function to create our synthetic data.
import itertools from math import sqrt from typing import List, Tuple import matplotlib.pyplot as plt from torch import Tensor, allclose, cat, manual_seed, rand, zeros, zeros_like from torch.autograd import grad from torch.nn import Linear, MSELoss, ReLU, Sequential from backpack import backpack, extend, extensions # architecture specifications N = 15 D_in = 10 D_hidden = 7 C = 5 def setup() -> Tuple[Sequential, Tensor]: """Create a simple MLP with ReLU activations and its synthetic input. Returns: A simple MLP and a tensor that can be fed to it. """ X = rand(N, D_in) model = Sequential(Linear(D_in, D_hidden), ReLU(), Linear(D_hidden, C)) return model, X
First, let’s compute the Jacobians with PyTorch’s
autograd to verify
To do that, we need to differentiate per component of
fs. This means
that we will differentiate multiple times through its graph, therefore we
need to set
manual_seed(0) model, X = setup() fs = model(X) autograd_jacobians = [zeros(fs.shape + param.shape) for param in model.parameters()] for n, c in itertools.product(range(N), range(C)): grads_n_c = grad(fs[n, c], model.parameters(), retain_graph=True) for param_idx, param_grad in enumerate(grads_n_c): autograd_jacobians[param_idx][n, c, :] = param_grad
Let’s visualize the Jacobians by flattening the dimensions stemming from
fs and from
param_i, and by concatenating them along the
<matplotlib.colorbar.Colorbar object at 0x7f8f28e34550>
In the following, we will compute the same Jacobian tensor lists with BackPACK. To compare our results, we will use the following helper function:
def compare_tensor_lists( tensor_list1: List[Tensor], tensor_list_2: List[Tensor] ) -> None: """Checks equality of two tensor lists. Args: tensor_list1: First tensor list. tensor_list2: Second tensor list. Raises: ValueError: If the two tensor lists don't match. """ if len(tensor_list1) != len(tensor_list_2): raise ValueError("Tensor lists have different length.") for tensor1, tensor2 in zip(tensor_list1, tensor_list_2): if tensor1.shape != tensor2.shape: raise ValueError("Tensors have different sizes.") if not allclose(tensor1, tensor2): raise ValueError("Tensors have different values.") print("Both tensor lists match.")
Next, we will present two approaches to compute such Jacobians with BackPACK.
You can imagine the first one as carrying out the for-loop over
parallel, and the second one as carrying out both for loops over
in parallel. The first approach relies on a first-order extension, the second
one on a second-order extension. This means that while the first approach
works on quite general graphs, for the second one to work your graph must be
As described in the introduction, BackPACK’s
BatchGrad extension can
compute Jacobians of batched scalars. We can therefore compute the
fs[:, c] in one iteration, parallelizing the Jacobian
computation over the batch axis. For the full Jacobian, this requires
C backpropagations, hence we need to tell both
BackPACK to retain the graph.
Let’s do that in code, and check the result:
manual_seed(0) model, X = setup() model = extend(model) fs = model(X) backpack_first_jacobians = [zeros(fs.shape + p.shape) for p in model.parameters()] for c in range(C): with backpack(extensions.BatchGrad(), retain_graph=True): f = fs[:, c].sum() f.backward(retain_graph=True) for param_idx, param in enumerate(model.parameters()): backpack_first_jacobians[param_idx][:, c, :] = param.grad_batch print("Comparing batched Jacobian from autograd with BackPACK (via BatchGrad):") compare_tensor_lists(autograd_jacobians, backpack_first_jacobians)
Comparing batched Jacobian from autograd with BackPACK (via BatchGrad): Both tensor lists match.
The second approach uses BackPACK’s
extension. It computes the matrix square root of the GGN/Fisher.
This approach uses that after feeding
fs through a square loss with
reduction='sum', the GGN’s square root is the desired Jacobian up to
a normalization factor of √2 (to find out more, read Section 2 of [Dangel,
2021]), and a transposition due to
Like that, we get the Jacobian in a single backward pass and don’t have to retain the graph:
manual_seed(0) model, X = setup() model = extend(model) loss_func = extend(MSELoss(reduction="sum")) fs = model(X) fs_labels = zeros_like(fs) # can contain arbitrary values. loss = loss_func(fs, fs_labels) with backpack(extensions.SqrtGGNExact()): loss.backward() backpack_second_jacobians = [ param.sqrt_ggn_exact.transpose(0, 1) / sqrt(2) for param in model.parameters() ] print("Comparing batched Jacobian from autograd with BackPACK (via SqrtGGNExact):") compare_tensor_lists(autograd_jacobians, backpack_second_jacobians)
Comparing batched Jacobian from autograd with BackPACK (via SqrtGGNExact): Both tensor lists match.
Total running time of the script: ( 0 minutes 0.190 seconds)