get_optimizer: respect learning_rate_schedule_steps config knob

Pooya Moradi · Pooya Moradi · commit 6052513aef98 · 2026-06-01T17:06:42.000Z
base.yml documents learning_rate_schedule_steps as the LR schedule shape
control ("By default the length of the schedule is set to the number of
steps", but configurable to a longer/different value). The post_train RL
get_optimizer ignored this knob and always used max_train_steps directly,
silently dropping any non-default value.

This matters for GPU<->TPU recipe parity: when reproducing a GPU recipe
on TPU with a different NUM_BATCHES, the LR schedule SHAPE has to stay
the same (e.g., warmup=50, decay=500, like NeMo-RL's
lr_warmup_iters/lr_decay_iters) regardless of how many TPU steps you
run. Without this fix the schedule stretches to the run length, so the
integrated LR scales linearly with NUM_BATCHES.

Backward-compatible: default learning_rate_schedule_steps=-1 (or unset)
falls back to max_train_steps, identical to old behavior.
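
A rough illustration of the two points above, as a plain-Python sketch
rather than the optax code path. The helper names, peak=1.0, and
warmup_fraction=0.1 are assumptions made for the example, and
warmup_cosine_lr is only a stand-in for a warmup + cosine-decay
schedule, not optax's implementation.

```python
import math


def resolve_schedule_steps(configured, max_train_steps):
  """Fallback described above: unset or non-positive -> max_train_steps."""
  if configured is None or configured <= 0:
    return max_train_steps
  return configured


def warmup_cosine_lr(step, peak, warmup_steps, decay_steps):
  """Stand-in schedule: linear 0 -> peak, then cosine peak -> 0 at decay_steps."""
  if step < warmup_steps:
    return peak * step / warmup_steps
  span = max(decay_steps - warmup_steps, 1)
  progress = min((step - warmup_steps) / span, 1.0)
  return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


def integrated_lr(max_train_steps, configured=-1, peak=1.0, warmup_fraction=0.1):
  """Sum of the per-step LR over a run of max_train_steps steps."""
  schedule_steps = resolve_schedule_steps(configured, max_train_steps)
  warmup_steps = int(warmup_fraction * schedule_steps)
  return sum(
      warmup_cosine_lr(s, peak, warmup_steps, schedule_steps)
      for s in range(max_train_steps)
  )


# Knob left at -1 (old behavior): the schedule stretches with the run,
# so doubling the run length doubles the integrated LR.
old_ratio = integrated_lr(1000) / integrated_lr(500)

# Knob pinned at 500: a 250-step run follows the same curve as the
# first 250 steps of a 500-step schedule (warmup=50, decay=500).
short_run = integrated_lr(250, configured=500)
prefix_of_full = sum(warmup_cosine_lr(s, 1.0, 50, 500) for s in range(250))

print(old_ratio, short_run, prefix_of_full)
```

With the knob pinned, changing the number of TPU steps only changes
where the run stops on the curve, not the curve itself.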
diff --git a/src/maxtext/trainers/post_train/rl/utils_rl.py b/src/maxtext/trainers/post_train/rl/utils_rl.py
@@ -531,15 +531,28 @@ def check_correctness(extracted_response: str, acceptable_answers: list[str], tm
 
 
 def get_optimizer(tmvp_config: Any, max_train_steps: int) -> optax.GradientTransformation:
-  """Function to obtain an optax optimizer, currently we use adamw."""
+  """Function to obtain an optax optimizer, currently we use adamw.
+
+  Schedule shape is controlled by `learning_rate_schedule_steps` when set
+  (>0); this decouples warmup/decay shape from training length so the same
+  schedule can be applied across runs of different num_batches. Default
+  (-1) falls back to `max_train_steps` for backward compatibility — matches
+  the documented behavior of base.yml's `learning_rate_schedule_steps: -1`
+  ("By default the length of the schedule is set to the number of steps").
+  """
+  schedule_steps = getattr(tmvp_config, "learning_rate_schedule_steps", -1)
+  if schedule_steps is None or schedule_steps <= 0:
+    schedule_steps = max_train_steps
   schedule = optax.schedules.warmup_cosine_decay_schedule(
       init_value=0.0,
       peak_value=tmvp_config.learning_rate,
       # Linearly increase learning rate from 0. to learning_rate in the first
-      # warmup_steps_fraction training steps, and then gradually decrease the
-      # learning rate to 0 using cosine scheduler.
-      warmup_steps=int(tmvp_config.warmup_steps_fraction * max_train_steps),
-      decay_steps=max_train_steps,
+      # warmup_steps_fraction × schedule_steps steps, then cosine-decay to 0
+      # over the remaining schedule_steps. When schedule_steps > max_train_steps
+      # the run ends partway through the schedule (useful for matching a fixed
+      # GPU LR schedule across TPU runs with different num_batches).
+      warmup_steps=int(tmvp_config.warmup_steps_fraction * schedule_steps),
+      decay_steps=schedule_steps,
       end_value=0.0,
   )