Add gradient accumulation to Llama3 recipe (#1386)

savitha-eng · web-flow · commit bcb127bbfc22 · 2025-12-30T07:11:41.000Z
### Description

Implements gradient accumulation for the Llama3 Native TE recipe, following the pattern from ESM2 PR #1254. This enables training with larger effective batch sizes without increasing GPU memory usage by accumulating gradients across multiple microbatches before performing an optimizer step.

**Key Changes:**

- **perf_logger.py**: Added `log_micro_step()` method to track metrics across microbatches, updated `log_step()` signature to use accumulated metrics, added configurable `pad_token_id` parameter (defaults to 1)
- **train_ddp.py**: Implemented gradient accumulation loop with `model.no_sync()` for efficiency, added validation for `grad_acc_steps >= 1`
- **train_fsdp2.py**: Implemented gradient accumulation loop (without `model.no_sync()` as FSDP2 handles synchronization internally), added validation
- **defaults.yaml**: Added `grad_acc_steps` parameter (default: 1 for backward compatibility)
- **test_gradient_accumulation.py**: Added golden value test that validates mathematical correctness of gradient accumulation

**Validation:**

Lingua1B DCLM Benchmark trained with Grad_Acc=4, 2 Nodes, MBS=4 -> GBS=256
https://api.wandb.ai/links/clara-discovery/5laqf4gm

Has matching loss curve:

<img width="3268" height="1454" alt="image" src="https://github.com/user-attachments/assets/55e5505f-5527-4a11-9a0d-9958eea046f0" />

DDP Results: https://api.wandb.ai/links/clara-discovery/6ncxn9n4

- DDP Training Loss curves for single node & 4 node training runs are similar with varying levels of gradient accumulation (grad acc=1, grad acc=2, grad acc=4) for a mbs=4:

<img width="1260" height="641" alt="image" src="https://github.com/user-attachments/assets/02e610a7-704a-469b-97c0-fd6615c35cea" />

FSDP2 Results: https://api.wandb.ai/links/clara-discovery/lcvrsgm8

- FSDP2 Training Loss Curves for single node and 4 node training runs are similar with and without gradient accumulation:

<img width="1265" height="627" alt="image" src="https://github.com/user-attachments/assets/0576bb6f-de0b-47b6-b305-9437366dd451" />

Golden value test confirms that `micro_batch=1, grad_acc=2` produces mathematically identical gradients to `micro_batch=2, grad_acc=1`.

**References:**

Adapts the gradient accumulation implementation from ESM2: #1254

#### Usage

##### Without gradient accumulation (default, backward compatible)

    python train_fsdp2.py --config-name L2_lingua_1b

##### With gradient accumulation (reduce memory usage)

    python train_fsdp2.py \
      --config-name L2_lingua_1b \
      dataset.micro_batch_size=2 \
      grad_acc_steps=2

##### Effective batch size formula: effective_batch = micro_batch_size × num_gpus × grad_acc_steps

##### Example: 2 × 16 × 2 = 64 samples per optimizer step

**Benefits:**

- Enables larger effective batch sizes on memory-constrained GPUs
- Allows training larger models by reducing micro batch size
- Maintains identical training dynamics to larger microbatches
- Backward compatible: `grad_acc_steps=1` behaves as before

#### Type of changes

- [x] New feature (non-breaking change which adds functionality)

### CI Pipeline Configuration

- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes to validate gradient accumulation doesn't break existing tests

### Pre-submit Checklist

- [x] I have tested these changes locally (single-GPU validation on SLURM)
- [x] I have updated the documentation accordingly (inline comments, test docstrings)
- [x] I have added/updated tests as needed (test_gradient_accumulation.py with golden value test)
- [x] All existing tests pass successfully (pre-commit hooks pass, golden value test passes)

### Testing Notes

**Golden Value Test:**

    pytest bionemo-recipes/recipes/llama3_native_te/tests/test_gradient_accumulation.py -v

Validates that gradient accumulation produces mathematically equivalent gradients by comparing:

- Loss values (within 1% tolerance)
- Gradient norms (within 1% tolerance)
- Individual parameter gradients (within 0.1% tolerance)

**Integration Testing:**

Testing with Lingua-1B benchmark on DCLM dataset - loss curves match

---------

Signed-off-by: savitha-eng <savithas@nvidia.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
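
The equivalence that the golden value test checks, and the effective batch size formula from the Usage section, can be illustrated without any framework. The following is only a minimal sketch, not code from this PR: the tiny linear model, the mean squared error loss, and the helper names `mse_grad`, `accumulated_grad`, and `effective_batch` are hypothetical stand-ins chosen for illustration. It assumes equal-sized microbatches and a loss that is a mean over samples, which is the case where dividing each microbatch loss by `grad_acc_steps` reproduces the full-batch gradient.

```python
import math


def mse_grad(w, b, xs, ys):
    """Gradient of the mean squared error of y_hat = w * x + b over one batch."""
    n = len(xs)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def accumulated_grad(w, b, xs, ys, micro_batch_size, grad_acc_steps):
    """Sum the gradients of (microbatch loss / grad_acc_steps) over all microbatches."""
    if grad_acc_steps < 1:
        raise ValueError("grad_acc_steps must be >= 1")
    if len(xs) != micro_batch_size * grad_acc_steps:
        raise ValueError("batch must split evenly into microbatches")
    acc_w, acc_b = 0.0, 0.0
    for i in range(grad_acc_steps):
        lo = i * micro_batch_size
        hi = lo + micro_batch_size
        grad_w, grad_b = mse_grad(w, b, xs[lo:hi], ys[lo:hi])
        # Scaling the loss by 1 / grad_acc_steps scales its gradient the same way.
        acc_w += grad_w / grad_acc_steps
        acc_b += grad_b / grad_acc_steps
    return acc_w, acc_b


def grads_match(a, b, rel_tol=1e-9):
    """Compare two gradient tuples up to floating-point rounding."""
    return all(math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-12) for x, y in zip(a, b))


def effective_batch(micro_batch_size, num_gpus, grad_acc_steps):
    """Samples contributing to one optimizer step."""
    return micro_batch_size * num_gpus * grad_acc_steps
```

Calling `accumulated_grad(..., micro_batch_size=2, grad_acc_steps=2)` on four samples should agree with `mse_grad` on the same four samples up to rounding, and `effective_batch(2, 16, 2)` gives the 64 samples quoted in the Usage example. In a real model the match is only approximate because of floating-point summation order, which is presumably why the golden value test uses tolerances rather than exact equality.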
diff --git a/bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml b/bionemo-recipes/recipes/llama3_native_te/hydra_config/defaults.yaml
@@ -4,6 +4,7 @@ config_name_or_path: ??? # E.g., meta-llama/Llama-3.2-1B or ./model_configs/meta
 config_kwargs: {}
 
 num_train_steps: ???
+grad_acc_steps: 1  # Gradient accumulation steps - effective batch = micro_batch_size * num_gpus * grad_acc_steps
 
 use_meta_device: true
 
diff --git a/bionemo-recipes/recipes/llama3_native_te/perf_logger.py b/bionemo-recipes/recipes/llama3_native_te/perf_logger.py
@@ -51,7 +51,6 @@ def __init__(self, dist_config: DistributedConfig, args: DictConfig):
         self.min_loss = float("inf")
 
         self.logging_frequency = args.logger.frequency
-        # Track whether to collect memory stats (disabled by default for max performance)
 
         metrics_dict = {
             "train/loss": torchmetrics.MeanMetric(),
@@ -80,44 +79,64 @@ def __init__(self, dist_config: DistributedConfig, args: DictConfig):
                 self._profiler = setup_profiler(args, self._wandb_run)
                 self._profiler.__enter__()
 
+        # Gradient accumulation tracking
+        self.num_tokens = 0
+        self.num_unpadded_tokens = 0
+        self.running_loss = 0.0
+        self.grad_acc_step_count = 0
+
+    def log_micro_step(self, batch: dict[str, torch.Tensor], outputs: CausalLMOutputWithPast):
+        """Store data on micro step for gradient accumulation metrics.
+
+        Args:
+            batch: The batch of data for the micro step.
+            outputs: The outputs of the micro step.
+        """
+        self.grad_acc_step_count += 1
+        self.num_tokens += batch["input_ids"].numel()
+        # Use attention_mask to count unpadded tokens (works for both BSHD and THD)
+        if "attention_mask" in batch:
+            self.num_unpadded_tokens += batch["attention_mask"].sum().item()
+        else:
+            # Fallback for pure sequence packing with no padding: all tokens are unpadded
+            self.num_unpadded_tokens += batch["input_ids"].numel()
+        self.running_loss += outputs.loss.item()
+
     def log_step(
         self,
         step: int,
-        batch: dict[str, torch.Tensor],
-        outputs: CausalLMOutputWithPast,
         grad_norm: float,
         lr: float,
     ):
         """Log a step to the logger and wandb.
 
         Args:
             step: The step number.
-            batch: The batch of data for the step.
-            outputs: The outputs of the step.
             grad_norm: The gradient norm of the step.
             lr: The learning rate of the step.
         """
-        num_tokens = batch["input_ids"].numel()
-        if "attention_mask" in batch:
-            num_unpadded_tokens = batch["attention_mask"].sum().item()
-        else:
-            num_unpadded_tokens = num_tokens
-
-        self.min_loss = min(self.min_loss, outputs.loss.item())
+        # Use accumulated metrics from gradient accumulation
+        assert self.grad_acc_step_count > 0, (
+            f"Gradient accumulation steps ({self.grad_acc_step_count}) must be greater than 0, "
+            f"and can be incremented by log_micro_step()."
+        )
+
+        avg_loss = self.running_loss / self.grad_acc_step_count
+        self.min_loss = min(self.min_loss, avg_loss)
         step_time, self.previous_step_time = time.perf_counter() - self.previous_step_time, time.perf_counter()
 
-        self.metrics["train/loss"].update(outputs.loss)
+        self.metrics["train/loss"].update(avg_loss)
         self.metrics["train/learning_rate"].update(lr)
         self.metrics["train/grad_norm"].update(grad_norm)
         self.metrics["train/step_time"].update(step_time)
-        self.metrics["train/tokens_per_second_per_gpu"].update(num_tokens / step_time)
-        self.metrics["train/unpadded_tokens_per_second_per_gpu"].update(num_unpadded_tokens / step_time)
-        self.metrics["train/total_unpadded_tokens_per_batch"].update(num_unpadded_tokens / self.logging_frequency)
+        self.metrics["train/tokens_per_second_per_gpu"].update(self.num_tokens / step_time)
+        self.metrics["train/unpadded_tokens_per_second_per_gpu"].update(self.num_unpadded_tokens / step_time)
+        self.metrics["train/total_unpadded_tokens_per_batch"].update(self.num_unpadded_tokens / self.logging_frequency)
 
         if self._profiler is not None:
             self._profiler.step()
 
-        if (step + 1) % self.logging_frequency == 0:
+        if step % self.logging_frequency == 0 and step > 0:
             memory_allocated = torch.cuda.memory_allocated() / (1024**3)
             self.metrics["train/gpu_memory_allocated_max_gb"].update(memory_allocated)
             self.metrics["train/gpu_memory_allocated_mean_gb"].update(memory_allocated)
@@ -129,11 +148,17 @@ def log_step(
             if self._dist_config.is_main_process():
                 wandb.log(metrics, step=step)
                 self._progress_bar.update(self.logging_frequency)
-                self._progress_bar.set_postfix({"loss": outputs.loss.item()})
+                self._progress_bar.set_postfix({"loss": avg_loss})
 
             if self._dist_config.local_rank == 0:
                 logger.info(", ".join([f"{k.split('/')[1]}: {v:.3g}" for k, v in metrics.items()]))
 
+        # Reset gradient accumulation tracking for next step
+        self.num_tokens = 0
+        self.num_unpadded_tokens = 0
+        self.running_loss = 0.0
+        self.grad_acc_step_count = 0
+
     def finish(self):
         """Finish the logger and close the progress bar."""
         if self._profiler is not None:
diff --git a/bionemo-recipes/recipes/llama3_native_te/tests/test_train.py b/bionemo-recipes/recipes/llama3_native_te/tests/test_train.py
@@ -62,6 +62,26 @@ def test_sanity_convergence_ddp_te(tmp_path, recipe_path):
     assert final_loss < 2.0, f"Final loss {final_loss} is too high, expected < 2.0"
 
 
+def test_sanity_convergence_ddp_te_grad_acc(tmp_path, recipe_path):
+    """Test DDP training with gradient accumulation."""
+    with initialize_config_dir(config_dir=str(recipe_path / "hydra_config"), version_base="1.2"):
+        sanity_config = compose(
+            config_name="L0_sanity",
+            overrides=[
+                f"+wandb.dir={tmp_path}",
+                f"checkpoint.ckpt_dir={tmp_path}",
+                "checkpoint.resume_from_checkpoint=false",
+                "grad_acc_steps=2",
+            ],
+        )
+
+    final_loss = main_ddp(sanity_config)
+    gc.collect()
+    torch.cuda.empty_cache()
+
+    assert final_loss < 2.0, f"Final loss {final_loss} is too high, expected < 2.0"
+
+
 def test_sanity_convergence_ddp_hf(tmp_path, recipe_path):
     """Test that DDP training converges on mock genomic data.
 
@@ -146,6 +166,50 @@ def test_sanity_convergence_fsdp2_te_thd(tmp_path, recipe_path):
     assert final_loss < 2.0, f"Final loss {final_loss} is too high, expected < 2.0"
 
 
+def test_sanity_convergence_fsdp2_te_bshd_grad_acc(tmp_path, recipe_path):
+    """Test FSDP2 training with BSHD format and gradient accumulation."""
+    with initialize_config_dir(config_dir=str(recipe_path / "hydra_config"), version_base="1.2"):
+        sanity_config = compose(
+            config_name="L0_sanity",
+            overrides=[
+                f"+wandb.dir={tmp_path}",
+                f"checkpoint.ckpt_dir={tmp_path}",
+                "checkpoint.resume_from_checkpoint=false",
+                "config_kwargs.attn_input_format=bshd",
+                "grad_acc_steps=2",
+            ],
+        )
+
+    final_loss = main_fsdp2(sanity_config)
+    gc.collect()
+    torch.cuda.empty_cache()
+
+    assert final_loss < 2.0, f"Final loss {final_loss} is too high, expected < 2.0"
+
+
+def test_sanity_convergence_fsdp2_te_thd_grad_acc(tmp_path, recipe_path):
+    """Test FSDP2 training with THD format and gradient accumulation."""
+    with initialize_config_dir(config_dir=str(recipe_path / "hydra_config"), version_base="1.2"):
+        sanity_config = compose(
+            config_name="L0_sanity",
+            overrides=[
+                f"+wandb.dir={tmp_path}",
+                f"checkpoint.ckpt_dir={tmp_path}",
+                "checkpoint.resume_from_checkpoint=false",
+                "use_sequence_packing=true",
+                "config_kwargs.attn_input_format=thd",
+                "dataset.max_seq_length=1024",
+                "grad_acc_steps=2",
+            ],
+        )
+
+    final_loss = main_fsdp2(sanity_config)
+    gc.collect()
+    torch.cuda.empty_cache()
+
+    assert final_loss < 2.0, f"Final loss {final_loss} is too high, expected < 2.0"
+
+
 def test_sanity_convergence_fsdp2_hf(tmp_path, recipe_path):
     """Test that FSDP2 training converges on mock genomic data.
 
diff --git a/bionemo-recipes/recipes/llama3_native_te/train_ddp.py b/bionemo-recipes/recipes/llama3_native_te/train_ddp.py
@@ -14,6 +14,7 @@
 # limitations under the License.
 
 import logging
+from contextlib import nullcontext
 from pathlib import Path
 
 import hydra
@@ -119,50 +120,59 @@ def main(args: DictConfig) -> float | None:
 
     # Training loop
     step = start_step
+    micro_step = 0
     while step < args.num_train_steps:
         for batch in train_dataloader:
             batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}  # noqa PLW2901
 
-            # Forward pass with mixed precision.
-            with transformer_engine.pytorch.fp8_autocast(enabled=args.fp8_config.enabled, fp8_recipe=fp8_recipe):
-                outputs = model(**batch)
-
-            # Backward pass.
-            loss = outputs.loss
-            loss.backward()
-
-            # Compute and clip gradient norms.
-            total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
-
-            # Step optimizer.
-            optimizer.step()
-            scheduler.step()
-            optimizer.zero_grad()
-
-            perf_logger.log_step(
-                step=step,
-                batch=batch,
-                outputs=outputs,
-                grad_norm=total_norm,
-                lr=optimizer.param_groups[0]["lr"],
-            )
-
-            if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps):
-                save_checkpoint_ddp(
-                    model=model,
-                    optimizer=optimizer,
-                    scheduler=scheduler,
-                    ckpt_path=ckpt_path,
+            micro_step += 1
+            # Use no_sync to prevent gradient synchronization until the last microbatch
+            with model.no_sync() if micro_step % args.grad_acc_steps != 0 else nullcontext():
+                # Forward pass with mixed precision.
+                with transformer_engine.pytorch.fp8_autocast(enabled=args.fp8_config.enabled, fp8_recipe=fp8_recipe):
+                    outputs = model(**batch)
+
+                # Backward pass - scale loss by grad_acc_steps for proper gradient averaging
+                loss = outputs.loss / args.grad_acc_steps
+                loss.backward()
+
+                # Log microbatch step data for accumulation metrics
+                perf_logger.log_micro_step(batch=batch, outputs=outputs)
+
+            # Gradient accumulation - only step optimizer after accumulating gradients
+            if micro_step % args.grad_acc_steps == 0:
+                micro_step = 0
+
+                # Compute and clip gradient norms.
+                total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
+
+                # Step optimizer.
+                optimizer.step()
+                scheduler.step()
+                optimizer.zero_grad()
+
+                perf_logger.log_step(
                     step=step,
-                    epoch=epoch,
-                    dist_config=dist_config,
-                    dataloader=train_dataloader if args.dataset.use_stateful_dataloader else None,
-                    max_checkpoints=args.checkpoint.max_checkpoints,
+                    grad_norm=total_norm,
+                    lr=optimizer.param_groups[0]["lr"],
                 )
 
-            step += 1
-            if step >= args.num_train_steps:
-                break
+                if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps):
+                    save_checkpoint_ddp(
+                        model=model,
+                        optimizer=optimizer,
+                        scheduler=scheduler,
+                        ckpt_path=ckpt_path,
+                        step=step,
+                        epoch=epoch,
+                        dist_config=dist_config,
+                        dataloader=train_dataloader if args.dataset.use_stateful_dataloader else None,
+                        max_checkpoints=args.checkpoint.max_checkpoints,
+                    )
+
+                step += 1
+                if step >= args.num_train_steps:
+                    break
 
         # Dataloader exhausted, incrementing epoch
         epoch += 1
diff --git a/bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py b/bionemo-recipes/recipes/llama3_native_te/train_fsdp2.py
@@ -134,52 +134,60 @@ def main(args: DictConfig) -> float | None:
     # Training loop
     logger.info(f"Starting training loop from step {start_step} to {args.num_train_steps}")
     step = start_step
+    micro_step = 0
     while step < args.num_train_steps:
         for batch in train_dataloader:
             batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in batch.items()}  # noqa: PLW2901
 
+            micro_step += 1
+
             # Forward pass with mixed precision.
             with transformer_engine.pytorch.fp8_autocast(enabled=args.fp8_config.enabled, fp8_recipe=fp8_recipe):
                 outputs = model(**batch)
 
-            # Backward pass.
-            loss = outputs.loss
+            # Backward pass - scale loss by grad_acc_steps for proper gradient averaging
+            loss = outputs.loss / args.grad_acc_steps
             loss.backward()
 
-            # Compute and clip gradient norms.
-            total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
-
-            # Step optimizer.
-            optimizer.step()
-            scheduler.step()
-            optimizer.zero_grad()
-
-            perf_logger.log_step(
-                step=step,
-                batch=batch,
-                outputs=outputs,
-                grad_norm=total_norm,
-                lr=optimizer.param_groups[0]["lr"],
-            )
-
-            if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps):
-                save_checkpoint_fsdp2(
-                    model=model,
-                    optimizer=optimizer,
-                    scheduler=scheduler,
-                    ckpt_path=ckpt_path,
+            # Log microbatch step data for accumulation metrics
+            perf_logger.log_micro_step(batch=batch, outputs=outputs)
+
+            # Gradient accumulation - only step optimizer after accumulating gradients
+            if micro_step % args.grad_acc_steps == 0:
+                micro_step = 0
+
+                # Compute and clip gradient norms.
+                total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).item()
+
+                # Step optimizer.
+                optimizer.step()
+                scheduler.step()
+                optimizer.zero_grad()
+
+                perf_logger.log_step(
                     step=step,
-                    epoch=epoch,
-                    dist_config=dist_config,
-                    dataloader=train_dataloader if args.dataset.use_stateful_dataloader else None,
-                    process_group=device_mesh.get_group("dp"),
-                    max_checkpoints=args.checkpoint.max_checkpoints,
-                    async_save=args.checkpoint.async_save,
+                    grad_norm=total_norm,
+                    lr=optimizer.param_groups[0]["lr"],
                 )
 
-            step += 1
-            if step >= args.num_train_steps:
-                break
+                if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps):
+                    save_checkpoint_fsdp2(
+                        model=model,
+                        optimizer=optimizer,
+                        scheduler=scheduler,
+                        ckpt_path=ckpt_path,
+                        step=step,
+                        epoch=epoch,
+                        dist_config=dist_config,
+                        dataloader=train_dataloader if args.dataset.use_stateful_dataloader else None,
+                        process_group=device_mesh.get_group("dp"),
+                        max_checkpoints=args.checkpoint.max_checkpoints,
+                        async_save=args.checkpoint.async_save,
+                    )
+
+                step += 1
+                if step >= args.num_train_steps:
+                    break
 
         # Dataloader exhausted, incrementing epoch
         epoch += 1
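
The control flow shared by both training loops above can be traced with a small framework-free simulation. This is a sketch under stated assumptions, not code from the recipe: `accumulation_schedule` is a hypothetical helper, and a flat count of microbatches stands in for the dataloader iterated across epochs. It mirrors the `micro_step` counter from the diff, which is defined outside the epoch loop and only reset at an accumulation boundary, so a partially filled accumulation window carries over into the next epoch. In the DDP loop the `sync` flag corresponds to running outside `model.no_sync()`.

```python
def accumulation_schedule(num_microbatches, grad_acc_steps, num_train_steps, start_step=0):
    """Trace which microbatches sync gradients and trigger an optimizer step."""
    if grad_acc_steps < 1:
        raise ValueError("grad_acc_steps must be >= 1")
    events = []
    step = start_step
    micro_step = 0
    for _ in range(num_microbatches):
        if step >= num_train_steps:
            break  # training budget reached; remaining microbatches are unused
        micro_step += 1
        # Only the last microbatch of a window syncs and steps the optimizer.
        boundary = micro_step % grad_acc_steps == 0
        events.append({"step": step, "sync": boundary, "optimizer_step": boundary})
        if boundary:
            micro_step = 0
            step += 1
    return events
```

With `grad_acc_steps=1` every microbatch is a boundary, which matches the backward-compatible behavior described in the commit message. With `grad_acc_steps=2`, the optimizer steps on every second microbatch and `step` advances only then.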