Available Extensions¶
First-order extensions¶
First-order extensions make it easier to extract information from the gradients that are already being backpropagated through the computational graph. They do not backpropagate additional information and have small overhead. The implemented extensions are:
- BatchGrad: the individual gradients, rather than their sum over the samples
- SumGradSquared: the second moment of the individual gradients
- Variance: the variance of the individual gradients
- BatchL2Grad: the L2 norm of the individual gradients

backpack.extensions.BatchGrad()¶
Individual gradients for each sample in a minibatch.
Stores the output in grad_batch as an [N x ...] tensor, where N is the batch size and ... is the shape of the gradient.
Note: beware of the scaling issue. The individual gradients depend on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. BatchGrad will return [g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ, and [¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
The concept of individual gradients is only meaningful if the objective is a sum of independent functions (no batchnorm).
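To make the scaling convention concrete, here is a hand-computed sketch in plain Python (no BackPACK); the scalar model and data values are made up for illustration.

```python
# Per-sample losses f_i = (w * x_i - y_i)**2 for a scalar weight w, so the
# individual gradients are g_i = 2 * (w * x_i - y_i) * x_i.
w = 0.5
xs = [1.0, 2.0, 3.0]
ys = [1.0, 1.0, 2.0]
n = len(xs)

g = [2 * (w * x - y) * x for x, y in zip(xs, ys)]  # g_1, ..., g_n

batch_grad_sum = g                      # returned if the loss is the sum of the f_i
batch_grad_mean = [gi / n for gi in g]  # returned if the loss is the mean

# In both conventions, summing the slices recovers the overall gradient.
grad_of_sum = sum(batch_grad_sum)    # d/dw of sum_i f_i
grad_of_mean = sum(batch_grad_mean)  # d/dw of (1/n) sum_i f_i
print(grad_of_sum, grad_of_mean)     # -4.0 and -4/3
```

The slices returned for the mean loss are a factor 1/n smaller, even though the per-sample losses fᵢ are identical in both cases.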

backpack.extensions.BatchL2Grad()¶
The squared L2 norm of the individual gradients in the minibatch.
Stores the output in batch_l2 as a tensor of size [N], where N is the batch size.
Note: beware of the scaling issue. The individual L2 norms depend on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. BatchL2Grad will return the squared L2 norms of [g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ, and of [¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
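As a shape sketch in plain Python (the gradient values are made up): given a stack of individual gradients, batch_l2 holds one squared norm per sample.

```python
# Hypothetical [N x D] stack of individual gradients (N = 3 samples, D = 2).
grad_batch = [[1.0, 2.0], [0.0, -3.0], [2.0, 2.0]]

# batch_l2[i] is the squared Euclidean norm of the i-th individual gradient.
batch_l2 = [sum(g * g for g in gi) for gi in grad_batch]
print(batch_l2)  # [5.0, 9.0, 8.0]
```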

backpack.extensions.SumGradSquared()¶
The sum of the squared individual gradients, or second moment of the gradient.
Stores the output in sum_grad_squared. Same dimension as the gradient.
Note: beware of the scaling issue. The second moment depends on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. SumGradSquared will return the sum of the elementwise squares of [g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ, and of [¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
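In contrast to batch_l2, the sum runs over the samples rather than over the parameter entries, so the result has the gradient's shape. A plain-Python sketch with made-up values:

```python
# Hypothetical [N x D] stack of individual gradients (N = 3 samples, D = 2).
grad_batch = [[1.0, 2.0], [0.0, -3.0], [2.0, 2.0]]
D = len(grad_batch[0])

# sum_grad_squared[d] sums the squares of the d-th entry over all samples,
# so it has the same shape as the gradient itself.
sum_grad_squared = [sum(gi[d] ** 2 for gi in grad_batch) for d in range(D)]
print(sum_grad_squared)  # [5.0, 17.0]
```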

backpack.extensions.Variance()¶
Estimates the variance of the gradient using the samples in the minibatch.
Stores the output in variance. Same dimension as the gradient.
Note: beware of the scaling issue. The variance depends on the scaling of the overall function. Let fᵢ be the loss of the i-th sample, with gradient gᵢ. Variance will return the variance of the vectors [g₁, …, gₙ] if the loss is a sum, ∑ᵢ₌₁ⁿ fᵢ, and of [¹/ₙ g₁, …, ¹/ₙ gₙ] if the loss is a mean, ¹/ₙ ∑ᵢ₌₁ⁿ fᵢ.
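The elementwise variance can be sketched from the stack of individual gradients via the identity Var[g] = E[g²] − E[g]² (assuming the population 1/N normalization; made-up values, no BackPACK):

```python
# Hypothetical [N x D] stack of individual gradients (N = 3 samples, D = 2).
grad_batch = [[1.0, 2.0], [0.0, -3.0], [2.0, 2.0]]
n = len(grad_batch)
D = len(grad_batch[0])

mean = [sum(gi[d] for gi in grad_batch) / n for d in range(D)]
second_moment = [sum(gi[d] ** 2 for gi in grad_batch) / n for d in range(D)]

# Elementwise variance: E[g^2] - E[g]^2, with population (1/N) normalization.
variance = [second_moment[d] - mean[d] ** 2 for d in range(D)]
print(variance)
```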
Second-order extensions¶
Second-order extensions propagate additional information through the graph to extract structural or local approximations to second-order information. They are more expensive to run than a standard gradient backpropagation. The implemented extensions are:
- The diagonal of the Generalized Gauss-Newton (GGN)/Fisher information, using exact computation (DiagGGNExact) or a Monte-Carlo approximation (DiagGGNMC).
- Kronecker block-diagonal approximations of the GGN/Fisher: KFAC, KFRA, KFLR.
- The diagonal of the Hessian: DiagHessian.

backpack.extensions.DiagGGNMC(mc_samples=1)¶
Diagonal of the Generalized Gauss-Newton/Fisher. Uses a Monte-Carlo approximation of the Hessian of the loss w.r.t. the model output.
Stores the output in diag_ggn_mc; has the same dimensions as the gradient.
For a more precise but slower alternative, see backpack.extensions.DiagGGNExact().
Args:
- mc_samples (int, optional): number of Monte-Carlo samples. Default: 1.

backpack.extensions.DiagGGNExact()¶
Diagonal of the Generalized Gauss-Newton/Fisher. Uses the exact Hessian of the loss w.r.t. the model output.
Stores the output in diag_ggn_exact; has the same dimensions as the gradient.
For a faster but less precise alternative, see backpack.extensions.DiagGGNMC().
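For intuition, here is a hand computation of the (one-entry) GGN diagonal for a scalar linear model with squared loss; the model and data are made up for illustration.

```python
# GGN = J^T H J, with J = dz/dw the Jacobian of the model output and H the
# Hessian of the loss w.r.t. the model output z. For z_i = w * x_i and loss
# l(z, y) = (z - y)**2 we have J_i = x_i and H = 2, so the (1x1) GGN
# "diagonal" is sum_i 2 * x_i**2, independent of the labels y_i.
xs = [1.0, 2.0, 3.0]
diag_ggn = sum(2.0 * x * x for x in xs)
print(diag_ggn)  # 28.0
```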

backpack.extensions.KFAC(mc_samples=1)¶
Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher, using Monte-Carlo sampling.
Stores the output in kfac as a list of Kronecker factors.
- If there is only one element, the item represents the GGN/Fisher approximation itself.
- If there are multiple elements, they are arranged such that their Kronecker product represents the GGN/Fisher approximation.
- The dimensions of the factors depend on the layer, but the product of all row dimensions (or all column dimensions) yields the dimension of the layer parameter.
Note: the literature uses column-stacking as vectorization convention, but torch defaults to a row-major storage scheme for tensors, so the order of the factors may differ from the presentation in the literature.
Implements the procedures described by:
- Optimizing Neural Networks with Kronecker-factored Approximate Curvature by James Martens and Roger Grosse, 2015.
- A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.
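A minimal sketch of how two Kronecker factors combine (plain Python with nested lists; the factor values and shapes are made up): the product of the factors' row dimensions gives the row dimension of the represented curvature block.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    p, q = len(b), len(b[0])
    return [
        [a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(q)]
        for i in range(len(a)) for k in range(p)
    ]

A = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2x2 factor
B = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical 2x2 factor

# The represented block is 4x4 -- matching a layer with 2 * 2 = 4 parameters,
# while only 2 * (2*2) numbers are stored in the factors.
G = kron(A, B)
print(len(G), len(G[0]))  # 4 4
```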

backpack.extensions.KFLR()¶
Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher, using the full Hessian of the loss function w.r.t. the model output.
Stores the output in kflr as a list of Kronecker factors.
- If there is only one element, the item represents the GGN/Fisher approximation itself.
- If there are multiple elements, they are arranged such that their Kronecker product represents the GGN/Fisher approximation.
- The dimensions of the factors depend on the layer, but the product of all row dimensions (or all column dimensions) yields the dimension of the layer parameter.
Note: the literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of the factors differs from the presentation in the literature.
Implements the procedures described by:
- Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
Extended for convolutions following:
- A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.

backpack.extensions.KFRA()¶
Approximate Kronecker factorization of the Generalized Gauss-Newton/Fisher, using the full Hessian of the loss function w.r.t. the model output and averaging after every backpropagation step.
Stores the output in kfra as a list of Kronecker factors.
- If there is only one element, the item represents the GGN/Fisher approximation itself.
- If there are multiple elements, they are arranged such that their Kronecker product represents the GGN/Fisher approximation.
- The dimensions of the factors depend on the layer, but the product of all row dimensions (or all column dimensions) yields the dimension of the layer parameter.
Note: the literature uses column-stacking as vectorization convention. This is in contrast to the default row-major storage scheme of tensors in torch. Therefore, the order of the factors differs from the presentation in the literature.
Implements the procedures described by:
- Practical Gauss-Newton Optimisation for Deep Learning by Aleksandar Botev, Hippolyt Ritter and David Barber, 2017.
Extended for convolutions following:
- A Kronecker-factored approximate Fisher matrix for convolution layers by Roger Grosse and James Martens, 2016.

backpack.extensions.DiagHessian()¶
Diagonal of the Hessian.
Stores the output in diag_h; has the same dimensions as the gradient.
Warning: very expensive on networks with non-piecewise-linear activations.
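To see why the exact Hessian diagonal differs from the GGN diagonal on nonlinear networks, here is a scalar chain-rule sketch; the "network" and loss functions are made up for illustration.

```python
# Scalar composition f(w) = l(phi(w)) with a nonlinear "network" phi(w) = w**2
# and loss l(z) = (z - 1)**2. The chain rule gives
#   f''(w) = phi'(w)**2 * l''(phi(w)) + l'(phi(w)) * phi''(w),
# and the GGN keeps only the first (always positive semi-definite) term.
w = 2.0
phi = w ** 2          # phi(w)   = 4
dphi = 2 * w          # phi'(w)  = 4
ddphi = 2.0           # phi''(w) = 2
dl = 2 * (phi - 1.0)  # l'(z)    = 2 (z - 1) = 6
ddl = 2.0             # l''(z)   = 2

hessian_diag = dphi ** 2 * ddl + dl * ddphi  # 32 + 12 = 44
ggn_diag = dphi ** 2 * ddl                   # 32
print(hessian_diag, ggn_diag)
```

For a piecewise-linear phi, the second term vanishes (phi'' = 0 almost everywhere) and the two diagonals coincide.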
Block-diagonal curvature products¶
These extensions do not compute information directly, but give access to functions that compute matrix-matrix products with block-diagonal approximations of the Hessian.
Extensions propagate functions through the computation graph. In contrast to standard gradient computation, the graph is retained during backpropagation (this results in higher memory consumption). The cost of one matrix-vector multiplication is on the order of one backward pass.
Implemented extensions are matrix-free curvature-matrix multiplication with the block diagonal of the Hessian, generalized Gauss-Newton (GGN)/Fisher, and positive-curvature Hessian. They are formalized by the concept of Hessian backpropagation, described in:
- Modular Block-diagonal Curvature Approximations for Feedforward Architectures by Felix Dangel, Stefan Harmeling, Philipp Hennig, 2020.

backpack.extensions.HMP(savefield='hmp')¶
Matrix-free multiplication with the block-diagonal Hessian.
Stores the multiplication function in hmp.
For a parameter of shape [...], the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal Hessian.
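The [V, ...] calling convention can be sketched in plain Python, with a hard-coded 2x2 matrix standing in for the Hessian block of a hypothetical two-parameter layer:

```python
# Made-up 2x2 "Hessian block" for a layer with two parameters.
H = [[2.0, 1.0], [1.0, 3.0]]

def hmp(vecs):
    """Multiply each slice of a [V, 2] stack with the 2x2 block H."""
    return [[sum(H[r][c] * v[c] for c in range(2)) for r in range(2)] for v in vecs]

# V = 2 slices: multiplying the two basis vectors recovers the columns of H,
# even though a real matrix-free implementation never forms H explicitly.
out = hmp([[1.0, 0.0], [0.0, 1.0]])
print(out)  # [[2.0, 1.0], [1.0, 3.0]]
```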

backpack.extensions.GGNMP(savefield='ggnmp')¶
Matrix-free multiplication with the block-diagonal generalized Gauss-Newton/Fisher.
Stores the multiplication function in ggnmp.
For a parameter of shape [...], the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal GGN/Fisher.

backpack.extensions.PCHMP(savefield='pchmp', modify='clip')¶
Matrix-free multiplication with the block-diagonal positive-curvature Hessian (PCH).
Stores the multiplication function in pchmp.
For a parameter of shape [...], the function receives and returns a tensor of shape [V, ...]. Each vector slice across the leading dimension is multiplied with the block-diagonal positive-curvature Hessian.
The PCH is proposed in:
- BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks by Sheng-Wei Chen, Chun-Nan Chou and Edward Y. Chang, 2018.
There are different concavity-eliminating modifications, which can be selected by the modify argument ("abs" or "clip").
Note: the name positive-curvature Hessian may be misleading. While the PCH is always positive semi-definite (PSD), it is not the projection of the exact Hessian onto the space of PSD matrices.
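As an illustration of the two modes on a diagonalized negative-curvature term (assuming, as in the BDA-PCH paper, that "clip" zeroes negative eigenvalues and "abs" takes their magnitude; the eigenvalues are made up):

```python
# Hypothetical eigenvalues of a curvature term, one of them negative.
eigenvalues = [3.0, -1.0, 0.5]

clipped = [max(e, 0.0) for e in eigenvalues]  # "clip": negative curvature -> 0
absolute = [abs(e) for e in eigenvalues]      # "abs":  negative curvature flipped
print(clipped, absolute)  # [3.0, 0.0, 0.5] [3.0, 1.0, 0.5]
```

Either modification yields a PSD matrix, but they generally differ from each other and from the PSD projection of the exact Hessian.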