Grouping
========

When applying a conflict-resolving aggregator such as :class:`~torchjd.aggregation.GradVac` in
multi-task learning, the cosine similarities between task gradients can be computed at different
granularities. The `Gradient Vaccine paper <https://arxiv.org/pdf/2010.05874>`_ introduces four
strategies, each partitioning the shared parameter vector differently:

1. **Together** (baseline): one group covering all shared parameters. Corresponds to the
   ``whole_model`` strategy in the paper.

2. **Per network**: one group per top-level sub-network (e.g. encoder and decoder separately).
   Corresponds to the ``enc_dec`` strategy in the paper.

3. **Per layer**: one group per leaf module of the encoder. Corresponds to the ``all_layer``
   strategy in the paper.

4. **Per tensor**: one group per individual parameter tensor. Corresponds to the ``all_matrix``
   strategy in the paper.

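To see why the granularity matters, note that two task gradients can agree overall while still conflicting within a single block. The following sketch (plain PyTorch, independent of TorchJD; the vectors and block boundaries are made up for illustration) compares the whole-vector cosine similarity with per-block similarities:

```python
import torch


def cos(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return torch.nn.functional.cosine_similarity(a, b, dim=0)


# Two task "gradients" over a shared parameter vector made of two blocks
# (think: the flattened weight and bias of one layer).
g1 = torch.tensor([1.0, 1.0, 1.0, 1.0])
g2 = torch.tensor([1.0, 1.0, -1.0, 0.5])

blocks = [(0, 2), (2, 4)]  # per-tensor grouping: two blocks of size 2

whole = cos(g1, g2)  # one similarity at whole-model granularity
per_block = [cos(g1[i:j], g2[i:j]) for i, j in blocks]

# The full vectors are positively aligned, yet the second block conflicts.
print(f"whole: {whole:.3f}")
print("per block:", [f"{c:.3f}" for c in per_block])
```

A coarser grouping can thus hide a conflict that a finer grouping exposes, which is exactly the trade-off the four strategies navigate.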
In TorchJD, grouping is achieved by calling :func:`~torchjd.autojac.jac_to_grad` once per group
after :func:`~torchjd.autojac.mtl_backward`, with a dedicated aggregator instance per group.
For :class:`~torchjd.aggregation.Stateful` aggregators, each instance independently maintains its
own state (e.g. the EMA state :math:`\hat{\phi}` in :class:`~torchjd.aggregation.GradVac`),
matching the per-block targets from the original paper.

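The independence of the per-group state can be pictured with a toy EMA tracker. This is a hypothetical stand-in for a stateful aggregator's internals, not the TorchJD API; the class name and update rule are assumptions for illustration only:

```python
import torch


class EmaState:
    """Toy stand-in for a stateful aggregator's EMA state (not the TorchJD API)."""

    def __init__(self, beta: float = 0.9):
        self.beta = beta
        self.value = None  # lazily initialised EMA, e.g. of observed cosine similarities

    def update(self, observation: torch.Tensor) -> torch.Tensor:
        if self.value is None:
            self.value = observation.clone()
        else:
            self.value = self.beta * self.value + (1 - self.beta) * observation
        return self.value


# One instance per group: each group's state evolves on its own.
groups = {"encoder": EmaState(), "decoder": EmaState()}
groups["encoder"].update(torch.tensor(0.8))
groups["decoder"].update(torch.tensor(-0.2))
groups["encoder"].update(torch.tensor(0.0))

print(groups["encoder"].value)  # moved from 0.8 toward 0.0 by the second update
print(groups["decoder"].value)  # untouched by the encoder group's updates
```

This mirrors why a dedicated aggregator instance per group is required: sharing one instance across groups would entangle their states.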
.. note::
    The grouping is orthogonal to the choice of
    :func:`~torchjd.autojac.backward` vs :func:`~torchjd.autojac.mtl_backward`. Those functions
    determine *which* parameters receive Jacobians; grouping then determines *how* those Jacobians
    are partitioned for aggregation.

.. note::
    The examples below use :class:`~torchjd.aggregation.GradVac`, but the same pattern applies to
    any :class:`~torchjd.aggregation.Aggregator`.

1. Together
-----------

A single :class:`~torchjd.aggregation.Aggregator` instance aggregates all shared parameters
together. Cosine similarities are computed between the full task gradient vectors.

.. testcode::
    :emphasize-lines: 14, 21

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = ...  # shared encoder (definition omitted in this excerpt)
    task1_head = ...  # task-specific heads (omitted)
    task2_head = ...
    optimizer = ...  # optimizer over all parameters (omitted)

    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    aggregator = GradVac()

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        loss1 = loss_fn(task1_head(features), y1)
        loss2 = loss_fn(task2_head(features), y2)
        mtl_backward([loss1, loss2], features=features)
        jac_to_grad(encoder.parameters(), aggregator)
        optimizer.step()
        optimizer.zero_grad()

2. Per network
--------------

One :class:`~torchjd.aggregation.Aggregator` instance per top-level sub-network. Here the model
is split into an encoder and a decoder; cosine similarities are computed separately within each.
Passing ``features=dec_out`` to :func:`~torchjd.autojac.mtl_backward` causes both sub-networks
to receive Jacobians, which are then aggregated independently.

.. testcode::
    :emphasize-lines: 8-9, 15-16, 24-25

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    task1_head = ...  # task-specific heads (definitions omitted in this excerpt)
    task2_head = ...
    encoder = ...  # shared encoder
    decoder = ...  # shared decoder
    optimizer = ...  # optimizer over all parameters (omitted)

    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    encoder_aggregator = GradVac()
    decoder_aggregator = GradVac()

    for x, y1, y2 in zip(inputs, t1, t2):
        enc_out = encoder(x)
        dec_out = decoder(enc_out)
        loss1 = loss_fn(task1_head(dec_out), y1)
        loss2 = loss_fn(task2_head(dec_out), y2)
        mtl_backward([loss1, loss2], features=dec_out)
        jac_to_grad(encoder.parameters(), encoder_aggregator)
        jac_to_grad(decoder.parameters(), decoder_aggregator)
        optimizer.step()
        optimizer.zero_grad()

3. Per layer
------------

One :class:`~torchjd.aggregation.Aggregator` instance per leaf module. Cosine similarities are
computed per layer between the task gradients.

.. testcode::
    :emphasize-lines: 14-15, 22-23

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = ...  # shared encoder (definition omitted in this excerpt)
    task1_head = ...  # task-specific heads (omitted)
    task2_head = ...
    optimizer = ...  # optimizer over all parameters (omitted)

    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    leaf_layers = [m for m in encoder.modules() if list(m.parameters()) and not list(m.children())]
    aggregators = [GradVac() for _ in leaf_layers]

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        loss1 = loss_fn(task1_head(features), y1)
        loss2 = loss_fn(task2_head(features), y2)
        mtl_backward([loss1, loss2], features=features)
        for layer, aggregator in zip(leaf_layers, aggregators):
            jac_to_grad(layer.parameters(), aggregator)
        optimizer.step()
        optimizer.zero_grad()

4. Per tensor
-------------

One :class:`~torchjd.aggregation.Aggregator` instance per individual parameter tensor. Cosine
similarities are computed per tensor between the task gradients (e.g. the weights and biases of
each layer are treated as separate groups).

.. testcode::
    :emphasize-lines: 14-15, 22-23

    import torch
    from torch.nn import Linear, MSELoss, ReLU, Sequential
    from torchjd.aggregation import GradVac
    from torchjd.autojac import jac_to_grad, mtl_backward

    encoder = ...  # shared encoder (definition omitted in this excerpt)
    task1_head = ...  # task-specific heads (omitted)
    task2_head = ...
    optimizer = ...  # optimizer over all parameters (omitted)

    loss_fn = MSELoss()
    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)

    shared_params = list(encoder.parameters())
    aggregators = [GradVac() for _ in shared_params]

    for x, y1, y2 in zip(inputs, t1, t2):
        features = encoder(x)
        loss1 = loss_fn(task1_head(features), y1)
        loss2 = loss_fn(task2_head(features), y2)
        mtl_backward([loss1, loss2], features=features)
        for param, aggregator in zip(shared_params, aggregators):
            jac_to_grad([param], aggregator)
        optimizer.step()
        optimizer.zero_grad()