SimplexLab
diff --git a/stable/_sources/docs/aggregation/gradvac.rst.txt b/stable/_sources/docs/aggregation/gradvac.rst.txt
Lines changed: 14 additions & 0 deletions
diff --git a/stable/_sources/docs/aggregation/index.rst.txt b/stable/_sources/docs/aggregation/index.rst.txt
Lines changed: 5 additions & 0 deletions
diff --git a/stable/_sources/examples/amp.rst.txt b/stable/_sources/examples/amp.rst.txt
Lines changed: 1 addition & 1 deletion
diff --git a/stable/_sources/examples/basic_usage.rst.txt b/stable/_sources/examples/basic_usage.rst.txt
Lines changed: 8 additions & 8 deletions
diff --git a/stable/_sources/examples/grouping.rst.txt b/stable/_sources/examples/grouping.rst.txt
Lines changed: 177 additions & 0 deletions
diff --git a/stable/_sources/examples/index.rst.txt b/stable/_sources/examples/index.rst.txt
Lines changed: 4 additions & 0 deletions
diff --git a/stable/_sources/examples/iwmtl.rst.txt b/stable/_sources/examples/iwmtl.rst.txt
Lines changed: 1 addition & 1 deletion
diff --git a/stable/_sources/examples/iwrm.rst.txt b/stable/_sources/examples/iwrm.rst.txt
Lines changed: 3 additions & 3 deletions
diff --git a/stable/_sources/examples/lightning_integration.rst.txt b/stable/_sources/examples/lightning_integration.rst.txt
Lines changed: 12 additions & 1 deletion
diff --git a/stable/_sources/examples/monitoring.rst.txt b/stable/_sources/examples/monitoring.rst.txt
Lines changed: 25 additions & 1 deletion
@@ -0,0 +1,14 @@
+:hide-toc:
+
+GradVac
+=======
+
+.. autoclass:: torchjd.aggregation.GradVac
+    :members:
+    :undoc-members:
+    :exclude-members: forward, eps, beta
+
+.. autoclass:: torchjd.aggregation.GradVacWeighting
+    :members:
+    :undoc-members:
+    :exclude-members: forward, eps, beta
@@ -22,6 +22,10 @@ Abstract base classes
     :undoc-members:
     :exclude-members: forward
 
+.. autoclass:: torchjd.aggregation.Stateful
+    :members:
+    :undoc-members:
+
 
 .. toctree::
     :hidden:
@@ -35,6 +39,7 @@ Abstract base classes
     dualproj.rst
     flattening.rst
     graddrop.rst
+    gradvac.rst
     imtl_g.rst
     krum.rst
     mean.rst
 
@@ -11,7 +11,7 @@ case, the losses) should preferably be scaled with a `GradScaler
 <https://pytorch.org/docs/stable/amp.html#gradient-scaling>`_ to avoid gradient underflow. The
 following example shows the resulting code for a multi-task learning use-case.
 
-.. code-block:: python
+.. testcode::
     :emphasize-lines: 2, 17, 27, 34-35, 37-38
 
     import torch
 
@@ -12,7 +12,7 @@ the parameters are updated using the resulting aggregation.
 
 Import several classes from ``torch`` and ``torchjd``:
 
-.. code-block:: python
+.. testcode::
 
     import torch
     from torch.nn import Linear, MSELoss, ReLU, Sequential
@@ -24,14 +24,14 @@ Import several classes from ``torch`` and ``torchjd``:
 
 Define the model and the optimizer, as usual:
 
-.. code-block:: python
+.. testcode::
 
     model = Sequential(Linear(10, 5), ReLU(), Linear(5, 2))
     optimizer = SGD(model.parameters(), lr=0.1)
 
 Define the aggregator that will be used to combine the Jacobian matrix:
 
-.. code-block:: python
+.. testcode::
 
     aggregator = UPGrad()
 
@@ -41,7 +41,7 @@ negatively affected by the update.
 
 Now that everything is defined, we can train the model. Define the input and the associated target:
 
-.. code-block:: python
+.. testcode::
 
     input = torch.randn(16, 10)  # Batch of 16 random input vectors of length 10
     target1 = torch.randn(16)  # First batch of 16 targets
@@ -51,7 +51,7 @@ Here, we generate fake inputs and labels for the sake of the example.
 
 We can now compute the losses associated with each element of the batch.
 
-.. code-block:: python
+.. testcode::
 
     loss_fn = MSELoss()
     output = model(input)
@@ -62,7 +62,7 @@ The last steps are similar to gradient descent-based optimization, but using the
 
 Perform the Jacobian descent backward pass:
 
-.. code-block:: python
+.. testcode::
 
     autojac.backward([loss1, loss2])
     jac_to_grad(model.parameters(), aggregator)
@@ -73,14 +73,14 @@ field of the parameters. It also deletes the ``.jac`` fields to save some memory.
 
 Update each parameter based on its ``.grad`` field, using the ``optimizer``:
 
-.. code-block:: python
+.. testcode::
 
     optimizer.step()
 
 The model's parameters have been updated!
 
 As usual, you should now reset the ``.grad`` field of each model parameter:
 
-.. code-block:: python
+.. testcode::
 
     optimizer.zero_grad()
@@ -0,0 +1,177 @@
+Grouping
+========
+
+Aggregation can be performed independently on groups of parameters, at different granularities.
+The `Gradient Vaccine paper <https://arxiv.org/pdf/2010.05874>`_ introduces four strategies to
+partition the parameters:
+
+1. **Together** (baseline): one group covering all parameters. Corresponds to the `whole_model`
+   strategy in the paper.
+
+2. **Per network**: one group per top-level sub-network (e.g. encoder and decoder separately).
+   Corresponds to the `enc_dec` strategy in the paper.
+
+3. **Per layer**: one group per leaf module of the network. Corresponds to the `all_layer` strategy
+   in the paper.
+
+4. **Per parameter**: one group per individual parameter tensor. Corresponds to the `all_matrix`
+   strategy in the paper.
+
+In TorchJD, grouping is achieved by calling :func:`~torchjd.autojac.jac_to_grad` once per group
+after :func:`~torchjd.autojac.backward` or :func:`~torchjd.autojac.mtl_backward`, with a dedicated
+aggregator instance per group. For :class:`~torchjd.aggregation.Stateful` aggregators, each instance
+should independently maintain its own state (e.g. the EMA :math:`\hat{\phi}` state in
+:class:`~torchjd.aggregation.GradVac`, matching the per-block targets from the original paper).
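
To illustrate why each group needs its own stateful instance, here is a minimal pure-Python sketch. The ``EmaTracker`` class is hypothetical and is not part of TorchJD. It only mimics a stateful aggregator by keeping an exponential moving average of the similarities it sees, using an assumed update of the form ``phi_hat = (1 - beta) * phi_hat + beta * phi``. The similarity values are made up for the example.

```python
# Hypothetical stand-in for a stateful aggregator (NOT the TorchJD GradVac class):
# it only tracks an exponential moving average (EMA) of the similarities it observes.
class EmaTracker:
    def __init__(self, beta: float = 0.5) -> None:
        self.beta = beta
        self.phi_hat = 0.0  # EMA state, initialised to zero

    def update(self, phi: float) -> float:
        # Assumed EMA update rule: blend the previous state with the new observation.
        self.phi_hat = (1 - self.beta) * self.phi_hat + self.beta * phi
        return self.phi_hat


# Made-up cosine similarities observed in two parameter groups over three steps.
encoder_sims = [0.8, 0.8, 0.8]
decoder_sims = [-0.4, -0.4, -0.4]

# One instance per group: each EMA converges towards its own group's similarity.
enc_tracker, dec_tracker = EmaTracker(), EmaTracker()
for enc_phi, dec_phi in zip(encoder_sims, decoder_sims):
    enc_tracker.update(enc_phi)
    dec_tracker.update(dec_phi)

# A single instance reused for both groups: the two signals are blended together.
shared = EmaTracker()
for enc_phi, dec_phi in zip(encoder_sims, decoder_sims):
    shared.update(enc_phi)
    shared.update(dec_phi)

print(f"encoder EMA: {enc_tracker.phi_hat:.4f}")
print(f"decoder EMA: {dec_tracker.phi_hat:.4f}")
print(f"shared EMA:  {shared.phi_hat:.4f}")
```

With separate instances, the encoder state stays positive and the decoder state stays negative. The shared instance mixes both, so its state describes neither group.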
+
+.. note::
+    Grouping is orthogonal to the choice between :func:`~torchjd.autojac.backward` and
+    :func:`~torchjd.autojac.mtl_backward`. Those functions determine *which* parameters receive
+    Jacobians; grouping then determines *how* those Jacobians are partitioned for aggregation.
+
+.. note::
+    The examples below use :class:`~torchjd.aggregation.GradVac`, but the same pattern applies to
+    any :class:`~torchjd.aggregation.Aggregator`.
+
+1. Together
+-----------
+
+A single :class:`~torchjd.aggregation.Aggregator` instance aggregates all shared parameters
+together. Cosine similarities are computed between the full task gradient vectors.
+
+.. testcode::
+    :emphasize-lines: 14, 21
+
+    import torch
+    from torch.nn import Linear, MSELoss, ReLU, Sequential
+    from torch.optim import SGD
+
+    from torchjd.aggregation import GradVac
+    from torchjd.autojac import jac_to_grad, mtl_backward
+
+    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
+    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
+    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
+    loss_fn = MSELoss()
+    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)
+
+    aggregator = GradVac()
+
+    for x, y1, y2 in zip(inputs, t1, t2):
+        features = encoder(x)
+        loss1 = loss_fn(task1_head(features), y1)
+        loss2 = loss_fn(task2_head(features), y2)
+        mtl_backward([loss1, loss2], features=features)
+        jac_to_grad(encoder.parameters(), aggregator)
+        optimizer.step()
+        optimizer.zero_grad()
+
+2. Per network
+--------------
+
+One :class:`~torchjd.aggregation.Aggregator` instance per top-level sub-network. Here the model
+is split into an encoder and a decoder; cosine similarities are computed separately within each.
+Passing ``features=dec_out`` to :func:`~torchjd.autojac.mtl_backward` causes both sub-networks
+to receive Jacobians, which are then aggregated independently.
+
+.. testcode::
+    :emphasize-lines: 8-9, 15-16, 24-25
+
+    import torch
+    from torch.nn import Linear, MSELoss, ReLU, Sequential
+    from torch.optim import SGD
+
+    from torchjd.aggregation import GradVac
+    from torchjd.autojac import jac_to_grad, mtl_backward
+
+    encoder = Sequential(Linear(10, 5), ReLU())
+    decoder = Sequential(Linear(5, 3), ReLU())
+    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
+    optimizer = SGD([*encoder.parameters(), *decoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
+    loss_fn = MSELoss()
+    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)
+
+    encoder_aggregator = GradVac()
+    decoder_aggregator = GradVac()
+
+    for x, y1, y2 in zip(inputs, t1, t2):
+        enc_out = encoder(x)
+        dec_out = decoder(enc_out)
+        loss1 = loss_fn(task1_head(dec_out), y1)
+        loss2 = loss_fn(task2_head(dec_out), y2)
+        mtl_backward([loss1, loss2], features=dec_out)
+        jac_to_grad(encoder.parameters(), encoder_aggregator)
+        jac_to_grad(decoder.parameters(), decoder_aggregator)
+        optimizer.step()
+        optimizer.zero_grad()
+
+3. Per layer
+------------
+
+One :class:`~torchjd.aggregation.Aggregator` instance per leaf module. Cosine similarities are
+computed per-layer between the task gradients.
+
+.. testcode::
+    :emphasize-lines: 14-15, 22-23
+
+    import torch
+    from torch.nn import Linear, MSELoss, ReLU, Sequential
+    from torch.optim import SGD
+
+    from torchjd.aggregation import GradVac
+    from torchjd.autojac import jac_to_grad, mtl_backward
+
+    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
+    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
+    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
+    loss_fn = MSELoss()
+    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)
+
+    leaf_layers = [m for m in encoder.modules() if list(m.parameters()) and not list(m.children())]
+    aggregators = [GradVac() for _ in leaf_layers]
+
+    for x, y1, y2 in zip(inputs, t1, t2):
+        features = encoder(x)
+        loss1 = loss_fn(task1_head(features), y1)
+        loss2 = loss_fn(task2_head(features), y2)
+        mtl_backward([loss1, loss2], features=features)
+        for layer, aggregator in zip(leaf_layers, aggregators):
+            jac_to_grad(layer.parameters(), aggregator)
+        optimizer.step()
+        optimizer.zero_grad()
+
+4. Per parameter
+----------------
+
+One :class:`~torchjd.aggregation.Aggregator` instance per individual parameter tensor. Cosine
+similarities are computed per-tensor between the task gradients (e.g. weights and biases of each
+layer are treated as separate groups).
+
+.. testcode::
+    :emphasize-lines: 14-15, 22-23
+
+    import torch
+    from torch.nn import Linear, MSELoss, ReLU, Sequential
+    from torch.optim import SGD
+
+    from torchjd.aggregation import GradVac
+    from torchjd.autojac import jac_to_grad, mtl_backward
+
+    encoder = Sequential(Linear(10, 5), ReLU(), Linear(5, 3), ReLU())
+    task1_head, task2_head = Linear(3, 1), Linear(3, 1)
+    optimizer = SGD([*encoder.parameters(), *task1_head.parameters(), *task2_head.parameters()], lr=0.1)
+    loss_fn = MSELoss()
+    inputs, t1, t2 = torch.randn(8, 16, 10), torch.randn(8, 16, 1), torch.randn(8, 16, 1)
+
+    shared_params = list(encoder.parameters())
+    aggregators = [GradVac() for _ in shared_params]
+
+    for x, y1, y2 in zip(inputs, t1, t2):
+        features = encoder(x)
+        loss1 = loss_fn(task1_head(features), y1)
+        loss2 = loss_fn(task2_head(features), y2)
+        mtl_backward([loss1, loss2], features=features)
+        for param, aggregator in zip(shared_params, aggregators):
+            jac_to_grad([param], aggregator)
+        optimizer.step()
+        optimizer.zero_grad()
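
To see why the choice of granularity matters in the four variants above, the following pure-Python sketch compares the cosine similarity of two task gradients computed over the full vector with the one computed per group. The gradient values and the group names ``layer1`` and ``layer2`` are made up for illustration, and the code does not use TorchJD.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up task gradients, split into two hypothetical parameter groups.
task1_grad = {"layer1": [1.0, 0.0], "layer2": [0.0, 1.0]}
task2_grad = {"layer1": [1.0, 0.0], "layer2": [0.0, -1.0]}

# "Together": one similarity over the concatenation of all groups.
whole = cosine(
    task1_grad["layer1"] + task1_grad["layer2"],
    task2_grad["layer1"] + task2_grad["layer2"],
)

# "Per layer": one similarity per group.
per_group = {name: cosine(task1_grad[name], task2_grad[name]) for name in task1_grad}

print(f"whole model: {whole:.1f}")
for name, value in per_group.items():
    print(f"{name}: {value:.1f}")
```

In this toy case the full gradient vectors are orthogonal, so no conflict is visible at the whole-model level, while the per-group view shows perfect agreement in one group and a direct conflict in the other.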
@@ -29,6 +29,9 @@ This section contains some usage examples for TorchJD.
 - :doc:`PyTorch Lightning Integration <lightning_integration>` showcases how to combine
   TorchJD with PyTorch Lightning, by providing an example implementation of a multi-task
   ``LightningModule`` optimized by Jacobian descent.
+- :doc:`Grouping <grouping>` shows how to apply an aggregator independently per parameter group
+  (e.g. per layer), so that conflict resolution happens at a finer granularity than the full
+  parameter vector.
 - :doc:`Automatic Mixed Precision <amp>` shows how to combine mixed precision training with TorchJD.
 
 .. toctree::
@@ -43,3 +46,4 @@ This section contains some usage examples for TorchJD.
     monitoring.rst
     lightning_integration.rst
     amp.rst
+    grouping.rst
@@ -9,7 +9,7 @@ this Gramian to reweight the gradients and resolve conflict entirely.
 
 The following example shows how to do that.
 
-.. code-block:: python
+.. testcode::
     :emphasize-lines: 5-6, 18-20, 31-32, 34-35, 37-38, 40-41
 
     import torch
 
@@ -41,7 +41,7 @@ batch of data. When minimizing per-instance losses (IWRM), we use either autojac
 .. tab-set::
     .. tab-item:: autograd (baseline)
 
-        .. code-block:: python
+        .. testcode::
 
             import torch
             from torch.nn import Linear, MSELoss, ReLU, Sequential
@@ -75,7 +75,7 @@ batch of data. When minimizing per-instance losses (IWRM), we use either autojac
 
     .. tab-item:: autojac
 
-        .. code-block:: python
+        .. testcode::
             :emphasize-lines: 5-6, 12, 16, 21-23
 
             import torch
@@ -110,7 +110,7 @@ batch of data. When minimizing per-instance losses (IWRM), we use either autojac
 
     .. tab-item:: autogram (recommended)
 
-        .. code-block:: python
+        .. testcode::
             :emphasize-lines: 5-6, 12, 16-17, 21-24
 
             import torch
 
@@ -10,7 +10,18 @@ The following code example demonstrates a basic multi-task learning setup using
 :class:`~lightning.pytorch.core.LightningModule` that will call :doc:`mtl_backward
 <../docs/autojac/mtl_backward>` at each training iteration.
 
-.. code-block:: python
+.. testsetup::
+
+    import warnings
+    import logging
+    from lightning.fabric.utilities.warnings import PossibleUserWarning
+
+    logging.disable(logging.INFO)
+    warnings.filterwarnings("ignore", category=DeprecationWarning)
+    warnings.filterwarnings("ignore", category=FutureWarning)
+    warnings.filterwarnings("ignore", category=PossibleUserWarning)
+
+.. testcode::
     :emphasize-lines: 9-10, 18, 31-32
 
     import torch
 
@@ -14,7 +14,12 @@ Jacobian descent is doing something different than gradient descent. With
 :doc:`UPGrad <../docs/aggregation/upgrad>`, this happens when the original gradients conflict (i.e.
 they have a negative inner product).
 
-.. code-block:: python
+.. testsetup::
+
+    import torch
+    torch.manual_seed(0)
+
+.. testcode::
     :emphasize-lines: 9-11, 13-18, 33-34
 
     import torch
@@ -67,3 +72,22 @@ they have a negative inner product).
         jac_to_grad(shared_module.parameters(), aggregator)
         optimizer.step()
         optimizer.zero_grad()
+
+.. testoutput::
+
+    Weights: tensor([0.5000, 0.5000])
+    Cosine similarity: 1.0000
+    Weights: tensor([0.5000, 0.5000])
+    Cosine similarity: 1.0000
+    Weights: tensor([0.5000, 0.5000])
+    Cosine similarity: 1.0000
+    Weights: tensor([0.6618, 1.0554])
+    Cosine similarity: 0.9249
+    Weights: tensor([0.6569, 1.2146])
+    Cosine similarity: 0.8661
+    Weights: tensor([0.5004, 0.5060])
+    Cosine similarity: 1.0000
+    Weights: tensor([0.5000, 0.5000])
+    Cosine similarity: 1.0000
+    Weights: tensor([0.5746, 1.1607])
+    Cosine similarity: 0.9301