Available Extensions

First-order extensions

First-order extensions make it easier to extract information from the gradients that are already being backpropagated through the computational graph. They do not backpropagate additional information and add only a small overhead. The implemented extensions are:

BatchGrad: the individual gradients, rather than the sum over the samples.
SumGradSquared: the second moment of the individual gradients.
Variance: the variance of the individual gradients.
BatchL2Grad: the squared L2 norm of the individual gradients.

backpack.extensions.BatchGrad(subsampling: List[int] = None)

Individual gradients for each sample in a minibatch.

Stores the output in grad_batch as a [N x ...] tensor, where N is the batch size and ... is the shape of the gradient.

If subsampling is specified, N is replaced by the number of active samples.

Note

Beware of scaling issues

The individual gradients depend on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. BatchGrad will return

[g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ,
[¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.

The concept of individual gradients is only meaningful if the objective is a sum of independent functions (no batchnorm).
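The scaling note can be checked by hand on a toy problem. The following is a minimal plain-Python sketch that does not call BackPACK: the scalar parameter w, the data xs and ys, and the per-sample loss fᵢ(w) = ½ (w xᵢ − yᵢ)² are all assumptions chosen for illustration.

```python
# Toy setup (assumed for illustration): scalar parameter w and
# per-sample loss f_i(w) = 0.5 * (w * x_i - y_i) ** 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 4.0]
w = 0.5
N = len(xs)


def individual_gradients(w, xs, ys, reduction):
    """Per-sample contributions to d(loss)/dw for a 'sum' or 'mean' loss."""
    if reduction not in ("sum", "mean"):
        raise ValueError(f"unknown reduction: {reduction}")
    scale = 1.0 if reduction == "sum" else 1.0 / len(xs)
    # d f_i / d w = (w * x_i - y_i) * x_i, scaled by the reduction factor
    return [scale * (w * x - y) * x for x, y in zip(xs, ys)]


grads_sum = individual_gradients(w, xs, ys, "sum")    # [g_1, ..., g_N]
grads_mean = individual_gradients(w, xs, ys, "mean")  # [g_1 / N, ..., g_N / N]

print(grads_sum)
print(grads_mean)
```

In both cases the entries add up to the gradient of the respective overall loss, so a quantity computed from a mean-reduced loss has to be multiplied by N to recover the unscaled gᵢ.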

backpack.extensions.BatchL2Grad()

The squared L2 norm of individual gradients in the minibatch.

Stores the output in batch_l2 as a tensor of size [N], where N is the batch size.

Note

Beware of scaling issues

The individual squared L2 norms depend on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. BatchL2Grad will return the squared L2 norms of

[g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ,
[¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
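Because the norm is squared, the factor ¹/ₙ from a mean-reduced loss enters as ¹/ₙ². A small plain-Python sketch, using made-up individual gradients rather than BackPACK output, illustrates this:

```python
# Hypothetical individual gradients g_1, ..., g_N for a loss with "sum" reduction.
grads_sum = [[3.0, 4.0], [1.0, 2.0], [0.0, -2.0], [6.0, 8.0]]
N = len(grads_sum)


def squared_l2_norm(vec):
    """Squared Euclidean norm of a flat vector."""
    return sum(v * v for v in vec)


# One squared norm per sample, i.e. a result of size [N].
norms_sum = [squared_l2_norm(g) for g in grads_sum]

# With "mean" reduction every individual gradient carries a factor 1/N ...
grads_mean = [[v / N for v in g] for g in grads_sum]
# ... so every squared norm carries a factor 1/N**2.
norms_mean = [squared_l2_norm(g) for g in grads_mean]

print(norms_sum)
print(norms_mean)
```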

backpack.extensions.SumGradSquared()

The sum of the squared individual gradients, i.e. the second moment of the gradient.

Stores the output in sum_grad_squared, which has the same dimensions as the gradient.

Note

Beware of scaling issues

The second moment depends on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. SumGradSquared will return the sum of the element-wise squares of

[g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ,
[¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.

backpack.extensions.Variance()

Estimates the variance of the gradient using the samples in the minibatch.

Stores the output in variance, which has the same dimensions as the gradient.

Note

Beware of scaling issues

The variance depends on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. Variance will return the variance of the vectors

[g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ,
[¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
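The element-wise second moment and variance can be related in a few lines. The sketch below is plain Python on made-up individual gradients and assumes the population (biased) variance, computed as the mean of the squares minus the square of the mean; the text above does not state which normalization is used, so treat that choice as an assumption.

```python
import statistics

# Hypothetical individual gradients for a loss with "sum" reduction.
grads = [[3.0, 4.0], [1.0, 2.0], [0.0, -2.0], [6.0, 8.0]]
N = len(grads)


def second_moment_and_variance(grads):
    """Element-wise sum of squares and population variance over the samples."""
    n, dim = len(grads), len(grads[0])
    sum_sq = [sum(g[j] ** 2 for g in grads) for j in range(dim)]
    total = [sum(g[j] for g in grads) for j in range(dim)]
    # Assumed normalization: E[g^2] - (E[g])^2 with expectations over the batch.
    var = [sum_sq[j] / n - (total[j] / n) ** 2 for j in range(dim)]
    return sum_sq, var


sum_grad_squared, variance = second_moment_and_variance(grads)

# Cross-check the first coordinate against the standard library.
reference = statistics.pvariance([g[0] for g in grads])

# A "mean" reduction rescales each gradient by 1/N and both results by 1/N**2.
grads_mean = [[v / N for v in g] for g in grads]
sum_grad_squared_mean, variance_mean = second_moment_and_variance(grads_mean)

print(sum_grad_squared, variance)
print(sum_grad_squared_mean, variance_mean)
```

Both results have the same shape as a single gradient, in line with the statement that the stored quantities have the same dimensions as the gradient.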

Second-order extensions

Second-order extensions propagate additional information through the graph to extract structural or local approximations to second-order information. They are more expensive to run than a standard gradient backpropagation. The implemented extensions are:

The diagonal of the Generalized Gauss-Newton (GGN)/Fisher information, using exact computation (DiagGGNExact) or a Monte-Carlo approximation (DiagGGNMC).
Kronecker block-diagonal approximations of the GGN/Fisher (KFAC, KFRA, KFLR).
The diagonal of the Hessian (DiagHessian).
The symmetric (square root) factorization of the GGN/Fisher information, using exact computation (SqrtGGNExact) or a Monte-Carlo (MC) approximation (SqrtGGNMC).

backpack.extensions.DiagGGNMC(mc_samples: int = 1)

Diagonal of the Generalized Gauss-Newton/Fisher.

Uses a Monte-Carlo approximation of the Hessian of the loss w.r.t. the model output.

Stores the output in diag_ggn_mc, which has the same dimensions as the gradient.

For a more precise but slower alternative, see backpack.extensions.DiagGGNExact().
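To make the difference between the exact and the Monte-Carlo loss Hessian concrete, here is a plain-Python sketch for softmax cross-entropy. It assumes one common sampling construction: the Hessian w.r.t. the logits is diag(p) − p pᵀ, and it equals the expectation of (p − e_y)(p − e_y)ᵀ over labels y drawn from the model's own prediction p. Whether this matches the library's internals in every detail is not stated in this section, and the function names are illustrative only.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax of a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_hessian(p):
    """Hessian of softmax cross-entropy w.r.t. the logits: diag(p) - p p^T."""
    C = len(p)
    return [[(p[i] if i == j else 0.0) - p[i] * p[j] for j in range(C)]
            for i in range(C)]


def mc_hessian(p, mc_samples, rng):
    """Average of (p - e_y)(p - e_y)^T over labels y sampled from p."""
    C = len(p)
    H = [[0.0] * C for _ in range(C)]
    for _ in range(mc_samples):
        y = rng.choices(range(C), weights=p)[0]
        s = [p[i] - (1.0 if i == y else 0.0) for i in range(C)]
        for i in range(C):
            for j in range(C):
                H[i][j] += s[i] * s[j] / mc_samples
    return H


def max_abs_diff(A, B):
    """Largest absolute entry-wise difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


p = softmax([1.0, 0.0, -1.0])
H_exact = exact_hessian(p)
H_mc_one = mc_hessian(p, 1, random.Random(0))
H_mc_many = mc_hessian(p, 20000, random.Random(0))

print(max_abs_diff(H_exact, H_mc_one))   # a single sample is a noisy estimate
print(max_abs_diff(H_exact, H_mc_many))  # the error shrinks with more samples
```

A larger mc_samples reduces the noise at a proportionally higher cost, which is the precision versus speed trade-off mentioned above.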

backpack.extensions.DiagGGNExact()

Diagonal of the Generalized Gauss-Newton/Fisher.

Uses the exact Hessian of the loss w.r.t. the model output.

Stores the output in diag_ggn_exact, which has the same dimensions as the gradient.

For a faster but less precise alternative, see backpack.extensions.DiagGGNMC().

backpack.extensions.BatchDiagGGNMC(mc_samples: int = 1)

Individual diagonal of the Generalized Gauss-Newton/Fisher.

Uses a Monte-Carlo approximation of the Hessian of the loss w.r.t. the model output.

Stores the output in diag_ggn_mc_batch as a [N x ...] tensor, where N is the batch size and ... is the shape of the gradient.

For a more precise but slower alternative, see backpack.extensions.BatchDiagGGNExact().

backpack.extensions.BatchDiagGGNExact()

Individual diagonal of the Generalized Gauss-Newton/Fisher.

Uses the exact Hessian of the loss w.r.t. the model output.

Stores the output in diag_ggn_exact_batch as a [N x ...] tensor, where N is the batch size and ... is the shape of the gradient.

backpack.extensions.KFAC(mc_samples=1)

Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using Monte-Carlo sampling.

Stores the output in kfac as a list of Kronecker factors.

If there is only one element, the item represents the GGN/Fisher approximation itself.
If there are multiple elements, they are arranged in the order such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation.
The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.

Note

The literature uses column-stacking as the vectorization convention, but torch defaults to a row-major storage scheme for tensors. The order of factors might therefore differ from the presentation in the literature.

Implements the procedures described by

Optimizing Neural Networks with Kronecker-factored Approximate Curvature by James Martens and Roger Grosse, 2015.
A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016
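The note on vectorization conventions can be illustrated without any library. The sketch below uses plain Python and two made-up factors, A (acting on the input dimension) and B (acting on the output dimension), for a weight W of shape [out, in]. It only demonstrates the general identity that row-major flattening pairs with B ⊗ A, while column-stacking pairs with A ⊗ B; it makes no claim about which factor appears first in the stored list.

```python
def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def matmul(A, B):
    """Plain matrix-matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def flatten_row_major(W):
    """Row-major flattening, as used for tensor storage in torch."""
    return [w for row in W for w in row]


def flatten_col_major(W):
    """Column-stacking, the convention common in the literature."""
    return [W[i][j] for j in range(len(W[0])) for i in range(len(W))]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]                    # shape [out=2, in=3]
B = [[2.0, 1.0], [1.0, 3.0]]                              # hypothetical [out, out] factor
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [2.0, 1.0, 4.0]]   # hypothetical [in, in] factor

target = matmul(matmul(B, W), transpose(A))               # B W A^T

# The same linear map, written for the two flattening conventions.
row_major = matvec(kron(B, A), flatten_row_major(W))
col_major = matvec(kron(A, B), flatten_col_major(W))

print(row_major == flatten_row_major(target))
print(col_major == flatten_col_major(target))
print(len(kron(B, A)), "== number of entries in W")
```

The Kronecker product has 2 · 3 = 6 rows, matching the statement that the product of the factors' row dimensions yields the dimension of the layer parameter.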

backpack.extensions.KFLR()

Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output.

Stores the output in kflr as a list of Kronecker factors.

If there is only one element, the item represents the GGN/Fisher approximation itself.
If there are multiple elements, they are arranged in the order such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation.
The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.

Note

The literature uses column-stacking as the vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.

Implements the procedures described by

Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.

Extended for convolutions following

A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016

backpack.extensions.KFRA()

Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher using the full Hessian of the loss function w.r.t. the model output and averaging after every backpropagation step.

Stores the output in kfra as a list of Kronecker factors.

If there is only one element, the item represents the GGN/Fisher approximation itself.
If there are multiple elements, they are arranged in the order such that their Kronecker product represents the Generalized Gauss-Newton/Fisher approximation.
The dimension of the factors depends on the layer, but the product of all row dimensions (or column dimensions) yields the dimension of the layer parameter.

Note

The literature uses column-stacking as the vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of factors differs from the presentation in the literature.

Implements the procedures described by

Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.

Extended for convolutions following

A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016

backpack.extensions.DiagHessian()

BackPACK extension that computes the Hessian diagonal.

Stores the output in diag_h, which has the same dimensions as the gradient.

Warning

Very expensive on networks with activations that are not piecewise linear.

backpack.extensions.BatchDiagHessian()

BackPACK extension that computes the per-sample (individual) Hessian diagonal.

Stores the output in diag_h_batch as a [N x ...] tensor, where N is the batch size and ... is the parameter shape.

Warning

Very expensive on networks with activations that are not piecewise linear.

backpack.extensions.SqrtGGNExact(subsampling: List[int] = None)

Exact matrix square root of the generalized Gauss-Newton/Fisher.

Uses the exact Hessian of the loss w.r.t. the model output.

Stores the output in sqrt_ggn_exact, which has shape [C, N, param.shape], where C is the model output dimension (number of classes for classification problems) and N is the batch size. If sub-sampling is enabled, N is replaced by the number of active samples, len(subsampling).

For a faster but less precise alternative, see backpack.extensions.SqrtGGNMC().

Note

(Relation to the GGN/Fisher) For each parameter, param.sqrt_ggn_exact can be viewed as a [C * N, param.numel()] matrix. Concatenating this matrix over all parameters results in a matrix Vᵀ, which is the GGN/Fisher’s matrix square root, i.e. G = V Vᵀ.
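The reshaping described in this note can be written out for a single parameter. The sketch below is plain Python with made-up numbers: sqrt_ggn stands in for a [C, N, D] factor of one parameter with D entries, so the resulting G is only that parameter's diagonal block, not the full GGN over all parameters.

```python
# Hypothetical square-root factor of one parameter, shape [C, N, D].
C, N, D = 2, 2, 3
sqrt_ggn = [
    [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]],   # c = 0, samples n = 0, 1
    [[0.0, 3.0, 1.0], [2.0, 0.0, -1.0]],   # c = 1, samples n = 0, 1
]

# View the factor as a [C * N, D] matrix; this plays the role of V^T.
Vt = [row for per_output in sqrt_ggn for row in per_output]

# G = V V^T has shape [D, D]: a sum of outer products of the rows of V^T.
G = [[sum(r[i] * r[j] for r in Vt) for j in range(D)] for i in range(D)]

# The diagonal of G needs no [D, D] matrix: square the entries and sum over C and N.
diag_from_sqrt = [sum(r[i] ** 2 for r in Vt) for i in range(D)]

print(G)
print(diag_from_sqrt)
```

Since G is a sum of outer products, it is symmetric and positive semi-definite by construction.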

backpack.extensions.SqrtGGNMC(mc_samples: int = 1, subsampling: List[int] = None)

Approximate matrix square root of the generalized Gauss-Newton/Fisher.

Uses a Monte-Carlo (MC) approximation of the Hessian of the loss w.r.t. the model output.

Stores the output in sqrt_ggn_mc, which has shape [M, N, param.shape], where M is the number of Monte-Carlo samples and N is the batch size. If sub-sampling is enabled, N is replaced by the number of active samples, len(subsampling).

For a more precise but slower alternative, see backpack.extensions.SqrtGGNExact().

Note

(Relation to the GGN/Fisher) For each parameter, param.sqrt_ggn_mc can be viewed as a [M * N, param.numel()] matrix. Concatenating this matrix over all parameters results in a matrix Vᵀ, which is the approximate GGN/Fisher’s matrix square root, i.e. G ≈ V Vᵀ.

Block-diagonal curvature products

These extensions do not compute information directly, but give access to functions to compute matrix-matrix products with block-diagonal approximations of the Hessian.

Extensions propagate functions through the computation graph. In contrast to standard gradient computation, the graph is retained during backpropagation (this results in higher memory consumption). The cost of one matrix-vector multiplication is on the order of one backward pass.

Implemented extensions are matrix-free curvature-matrix multiplication with the block-diagonal of the Hessian, generalized Gauss-Newton (GGN)/Fisher, and positive-curvature Hessian. They are formalized by the concept of Hessian backpropagation, described in:

Modular Block-diagonal Curvature Approximations for Feedforward Architectures by Felix Dangel, Stefan Harmeling, Philipp Hennig, 2020.

backpack.extensions.HMP(savefield='hmp')

Matrix-free multiplication with the block-diagonal Hessian.

Stores the multiplication function in hmp.

For a parameter of shape [...] the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal Hessian.
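A function with this [V, ...] calling convention is enough to run iterative algorithms that never look at the matrix itself. The plain-Python sketch below mimics the convention with a stand-in: make_block_mp and the explicit 2 x 2 block are hypothetical and exist only so the example runs without BackPACK, whereas the stored function works matrix-free.

```python
import math


def make_block_mp(block):
    """Stand-in for a stored multiplication function (hypothetical helper).

    The returned function takes V stacked vectors, shape [V, D], and returns
    their products with `block`, again with shape [V, D].
    """
    def mp(mat):
        return [[sum(h * x for h, x in zip(row, vec)) for row in block]
                for vec in mat]
    return mp


hessian_block = [[2.0, 1.0], [1.0, 2.0]]   # made-up curvature block
mp = make_block_mp(hessian_block)

# Multiply V = 3 vectors in one call; each slice is treated independently.
out = mp([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])


def power_iteration(mp, dim, steps=50):
    """Estimate the largest eigenvalue using only matrix-vector products."""
    vec = [1.0] + [0.0] * (dim - 1)
    for _ in range(steps):
        (prod,) = mp([vec])                      # V = 1
        norm = math.sqrt(sum(p * p for p in prod))
        vec = [p / norm for p in prod]
    (prod,) = mp([vec])
    return sum(v * p for v, p in zip(vec, prod))  # Rayleigh quotient


top_eigenvalue = power_iteration(mp, dim=2)
print(out)
print(top_eigenvalue)
```

Each call to mp corresponds to one product with the curvature block, so the cost of such an algorithm is roughly the number of iterations times the cost of one multiplication.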

backpack.extensions.GGNMP(savefield='ggnmp')

Matrix-free multiplication with the block-diagonal generalized Gauss-Newton/Fisher.

Stores the multiplication function in ggnmp.

For a parameter of shape [...] the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal GGN/Fisher.

backpack.extensions.PCHMP(savefield='pchmp', modify='clip')

Matrix-free multiplication with the block-diagonal positive-curvature Hessian (PCH).

Stores the multiplication function in pchmp.

For a parameter of shape [...] the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal positive-curvature Hessian.

The PCH is proposed in

BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks by Sheng-Wei Chen, Chun-Nan Chou and Edward Y. Chang, 2018.

There are different concavity-eliminating modifications which can be selected by the modify argument (“abs” or “clip”).

Note

The name positive-curvature Hessian may be misleading. While the PCH is always positive semi-definite (PSD), it does not refer to the projection of the exact Hessian onto the space of PSD matrices.
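As a rough illustration of the two modify options, the sketch below assumes that the modification acts element-wise on diagonal curvature terms that can be negative (for example, terms arising from the second derivative of an activation function). This reading is an assumption; the text above only names the options, and modify_residual is a hypothetical helper, not part of the library.

```python
def modify_residual(values, modify):
    """Remove negative entries from diagonal curvature terms (illustrative only)."""
    if modify == "abs":
        return [abs(v) for v in values]        # flip the sign of negative terms
    if modify == "clip":
        return [max(v, 0.0) for v in values]   # set negative terms to zero
    raise ValueError(f"unknown modification: {modify}")


residual = [1.5, -0.5, 0.0, -2.0]   # made-up diagonal terms, some negative

print(modify_residual(residual, "abs"))
print(modify_residual(residual, "clip"))
```

Both variants leave only non-negative terms, but they act on individual terms during backpropagation rather than on the eigenvalues of the assembled Hessian, which is consistent with the result not being the PSD projection of the exact Hessian.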