Free cached GPU memory before AR validation to avoid OOM

yeyu-nvidia · claude · yeyu-nvidia · commit ec61f24f71f9 · 2026-03-19T19:43:05.000-07:00
With LoRA co-training, the model carries extra parameters and optimizer
states (LoRA A/B + Adam moments), reducing the headroom available for
the validation forward passes. Call torch.cuda.empty_cache() before
validate_ar() to release unused cached allocations without affecting
any live tensors (parameters, optimizer states, gradients).
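
A minimal sketch of the effect, not the code in this commit: the
hypothetical helper free_cached_gpu_memory() below measures how many
bytes of reserved memory the caching allocator hands back when
empty_cache() is called. It assumes PyTorch may or may not be installed
and returns 0 when PyTorch or CUDA is unavailable.

```python
def free_cached_gpu_memory():
    """Release unused cached CUDA blocks and return the bytes freed.

    Hypothetical helper for illustration only. Live tensors (parameters,
    optimizer states, gradients) stay allocated; only blocks the caching
    allocator holds but is not using are returned to the driver.
    """
    try:
        import torch
    except ImportError:
        # Assumption: without PyTorch there is nothing to release.
        return 0
    if not torch.cuda.is_available():
        return 0
    reserved_before = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()
    reserved_after = torch.cuda.memory_reserved()
    return reserved_before - reserved_after


if __name__ == "__main__":
    freed = free_cached_gpu_memory()
    print(f"Released {freed / 2**20:.1f} MiB of cached GPU memory")
```

The gap between torch.cuda.memory_reserved() and
torch.cuda.memory_allocated() is the cache that empty_cache() can
release, so the call cannot help when live tensors alone exceed the
available memory.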

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
diff --git a/examples/speculative_decoding/eagle_utils.py b/examples/speculative_decoding/eagle_utils.py
@@ -260,6 +260,7 @@ def on_step_end(self, args, state, control, **kwargs):
             return control
         if state.global_step % self.ar_validate_steps == 0 and state.global_step > 0:
             print_rank_0("Running AR validation...")
+            torch.cuda.empty_cache()
             try:
                 ars = validate_ar(
                     model=kwargs["model"],