
Commit 321dcea

abrichr and claude authored
fix: configurable max_grad_norm, lower default lr, remove premature deprecation (#255)
Three changes based on client training results (grad_norm=101, 0.00 eval delta):

1. Add max_grad_norm to TrainingConfig (it was hardcoded to 1.0). When grad_norm >> max_grad_norm, gradients are clipped to a near-random direction and training makes no progress despite non-zero loss. The trainer now warns when grad_norm exceeds 10x the clip threshold.
2. Lower the default learning_rate from 5e-6 to 1e-6. With grad_norm=101 and lr=5e-6, the effective step size overshoots; lr=1e-6 with max_grad_norm=1.0 gives stable updates.
3. Remove the "standalone trainer is deprecated" warning. It was premature: TRL's rollout_func doesn't support multimodal VLMs (issue #5120), so the standalone trainer remains the production training path until TRL PR #5323 merges.

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
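The arithmetic behind change 2 can be sketched in isolation. A minimal, hypothetical helper (`clipped_step_size` is not project code) captures the single-step view: once the gradient norm exceeds the clip threshold, the clipped norm saturates at `max_grad_norm`, so the update magnitude is roughly `lr * max_grad_norm`:

```python
# Rough single-step view of norm clipping (hypothetical helper, not project
# code): the effective gradient norm after clipping is min(grad_norm, max_norm),
# so the update magnitude scales as lr * min(grad_norm, max_grad_norm).
def clipped_step_size(grad_norm: float, lr: float, max_grad_norm: float) -> float:
    effective_norm = min(grad_norm, max_grad_norm)
    return lr * effective_norm

# Client-reported grad_norm=101 from the commit message:
old = clipped_step_size(101.0, lr=5e-6, max_grad_norm=1.0)  # pre-commit defaults
new = clipped_step_size(101.0, lr=1e-6, max_grad_norm=1.0)  # post-commit defaults
```

Under this view the commit shrinks the worst-case clipped step from 5e-6 to 1e-6, a 5x reduction, without touching the clip threshold.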
1 parent a5f290d commit 321dcea

File tree

2 files changed: +22 / -5 lines changed


openadapt_evals/training/standalone/config.py
Lines changed: 6 additions & 1 deletion

@@ -47,7 +47,12 @@ class TrainingConfig:
     task_dir: str | None = None
     screen_size: tuple[int, int] = (1920, 1080)
     stuck_window: int = 3
-    learning_rate: float = 5e-6
+    learning_rate: float = 1e-6
+    # Maximum gradient norm for clipping. Critical for stable training:
+    # grad_norm > 100 means gradients are dominated by clipping direction
+    # (effectively random) rather than the actual gradient signal. Lower
+    # values (0.5-1.0) stabilize training at the cost of slower learning.
+    max_grad_norm: float = 1.0
     num_training_steps: int = 1000
     save_every_steps: int = 50
     output_dir: str = "checkpoints/grpo"
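For context, the field above lives in a plain dataclass. A minimal stand-in (mirroring only the fields visible in this diff, not the full TrainingConfig) shows how callers pick up or override the new default:

```python
from dataclasses import dataclass

# Minimal stand-in for the config shown above; only fields visible in this
# diff are reproduced, with the new post-commit defaults.
@dataclass
class TrainingConfig:
    learning_rate: float = 1e-6
    max_grad_norm: float = 1.0   # new field; previously hardcoded in the trainer
    num_training_steps: int = 1000

# Callers who previously relied on the hardcoded max_norm=1.0 can now tune it:
cfg = TrainingConfig(max_grad_norm=0.5)
```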

openadapt_evals/training/standalone/trainer.py
Lines changed: 16 additions & 4 deletions

@@ -457,7 +457,18 @@ def _training_step(self, rollouts: list[Rollout]) -> dict[str, float]:
             l = self._compute_rollout_loss(r, a, 1.0 / n)
             losses.append(l)
         grad_norm = torch.nn.utils.clip_grad_norm_(
-            [p for p in self._model.parameters() if p.requires_grad], max_norm=1.0)
+            [p for p in self._model.parameters() if p.requires_grad],
+            max_norm=self._config.max_grad_norm,
+        )
+        gn = grad_norm.item() if hasattr(grad_norm, "item") else float(grad_norm)
+        if gn > 10 * self._config.max_grad_norm:
+            logger.warning(
+                "grad_norm=%.1f is %.0fx the clip threshold (%.1f). "
+                "Gradients are dominated by clipping, not learning signal. "
+                "Consider lowering learning_rate (current: %.1e).",
+                gn, gn / self._config.max_grad_norm,
+                self._config.max_grad_norm, self._config.learning_rate,
+            )
         self._optimizer.step()

         avg_loss = sum(losses) / max(n, 1)

@@ -485,9 +496,10 @@ def train(self) -> str:
         """Run GRPO training loop. Returns path to final checkpoint."""
         import torch

-        logger.warning(
-            "The standalone GRPO trainer is deprecated. Use scripts/train_trl_grpo.py "
-            "with TRL's GRPOTrainer instead. See docs/eval_results/ for migration guide."
+        logger.info(
+            "Using standalone GRPO trainer. This is the production training "
+            "path for VLM agents with dynamic screenshots. TRL migration "
+            "pending multimodal environment_factory support (TRL PR #5323)."
         )

         self._load_task_configs()

0 commit comments
