R3 delta replay picks #2647
Draft
samsja wants to merge 13 commits into
Squashed from origin/r3-delta (tip 5c94833, which extends the earlier 3799bda with 'Support branched routed expert deltas' for cases where the routed-experts payload diverges across siblings in a group).

Adapts delta replay to main's deferred routed-experts chunk concat: the first step starts at 0; extended steps use prefix_len - 1; row 0 fills the boundary, and the remaining rows append as the new suffix.

Bumps router wheel pin to local-path. Bumps deps/verifiers gitlink to d39cc5876. Adds four debug configs for router-replay validation.

Co-Authored-By: S1ro1 <matej.sirovatka@gmail.com>
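
The offset rule in the commit above can be read as the following sketch. It is only one interpretation of the description, not the code in this PR: `delta_start` and `apply_delta` are hypothetical names, rows are plain Python values standing in for per-token routed-expert rows, and the boundary row is assumed to sit at index `prefix_len - 1`.

```python
def delta_start(prefix_len: int, is_first_step: bool) -> int:
    """Offset at which a step's delta rows begin (assumed reading of the commit)."""
    # First step: the delta covers everything, so it starts at row 0.
    # Extended step: the delta starts one row before the end of the prefix,
    # so its row 0 lands on the boundary row.
    return 0 if is_first_step else prefix_len - 1


def apply_delta(rows: list, delta_rows: list, start: int) -> list:
    """Write delta_rows into rows beginning at index `start`.

    Row 0 of the delta fills the boundary position; the remaining rows
    become the new suffix.
    """
    if start < 0 or start > len(rows):
        raise ValueError("delta start is outside the stored rows")
    merged = list(rows)
    # Slice assignment replaces everything from `start` onward, which both
    # fills the boundary row and appends the rest.
    merged[start:] = delta_rows
    return merged


# Step 1: nothing stored yet, delta starts at 0.
step1 = apply_delta([], ["r0", "r1", "r2"], delta_start(0, True))
# Step 2: prefix_len is 3, so the delta starts at row 2 (the boundary).
step2 = apply_delta(step1, ["r2*", "r3", "r4"], delta_start(len(step1), False))
```

Here `step2` ends up with five rows: the first two untouched, the boundary row replaced by the delta's row 0, and two new suffix rows.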
The first-match-wins loop over active_samples picks the wrong sample when one active prefix is a strict prefix of another. This can happen after a compaction/rollback step whose prompt is shorter than an existing sample's prefix and whose completion re-generates the same tokens and extends past them: the new sample's prefix then starts with the older sample's prefix, and any later step that extends the new sample also satisfies the slice check against the older one.

When that happens, extend_sample folds the newer sample's generated tokens into the older sample as user-input tokens (mask=False, logprob=0) and leaves the newer sample stale: a silent Exact-Prefix invariant violation.

Switch to longest-match: strictly more specific, never worse than first-match when only one prefix matches.

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit 0e239d1)
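
A minimal sketch of the difference between the two selection rules, assuming prefixes and prompts are plain lists of token ids. The function names `pick_first_match` and `pick_longest_match` are illustrative and are not taken from the PR.

```python
from typing import Optional, Sequence


def pick_first_match(prefixes: Sequence[Sequence[int]],
                     prompt: Sequence[int]) -> Optional[int]:
    """Old behaviour (as described): return the first prefix that matches."""
    for i, prefix in enumerate(prefixes):
        if list(prompt[:len(prefix)]) == list(prefix):
            return i
    return None


def pick_longest_match(prefixes: Sequence[Sequence[int]],
                       prompt: Sequence[int]) -> Optional[int]:
    """New behaviour (as described): return the longest matching prefix."""
    best = None
    for i, prefix in enumerate(prefixes):
        if list(prompt[:len(prefix)]) != list(prefix):
            continue
        if best is None or len(prefix) > len(prefixes[best]):
            best = i
    return best


# The older sample's prefix is a strict prefix of the newer sample's prefix.
older = [1, 2, 3]
newer = [1, 2, 3, 4, 5]
prompt = [1, 2, 3, 4, 5, 6]  # a step that extends the newer sample

first = pick_first_match([older, newer], prompt)
longest = pick_longest_match([older, newer], prompt)
```

In this example `first` is the index of the older sample (the wrong extension target), while `longest` is the index of the newer one. When only one prefix matches, both functions return the same index.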
When more than one active prefix matches a step's prompt, log a warning with the example id, step index, set of matching prefix lengths, total active prefixes, and the prompt length.

Longest-match still picks the correct extension; the warning just surfaces the rare ambiguous case so it's debuggable if it starts showing up in real rollouts (e.g. from compaction/rollback turns).

Co-authored-by: Cursor <cursoragent@cursor.com>
(cherry picked from commit ca38614)
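
The warning could look roughly like the sketch below. The logger name, the function `select_prefix`, its parameters, and the message format are all assumptions made for illustration; only the list of logged fields comes from the commit message.

```python
import logging
from typing import Optional, Sequence

logger = logging.getLogger("prefix_match_sketch")  # hypothetical logger name


def select_prefix(prefixes: Sequence[Sequence[int]], prompt: Sequence[int],
                  example_id: str, step_index: int) -> Optional[int]:
    """Pick the longest matching prefix and warn if the match was ambiguous."""
    matches = [i for i, p in enumerate(prefixes)
               if list(prompt[:len(p)]) == list(p)]
    if not matches:
        return None
    if len(matches) > 1:
        # Fields mirror the commit message: example id, step index, matching
        # prefix lengths, total active prefixes, prompt length.
        logger.warning(
            "ambiguous prefix match: example_id=%s step=%d "
            "matching_prefix_lens=%s active_prefixes=%d prompt_len=%d",
            example_id, step_index,
            sorted(len(prefixes[i]) for i in matches),
            len(prefixes), len(prompt),
        )
    # Longest match still decides which sample is extended.
    return max(matches, key=lambda i: len(prefixes[i]))
```

The warning does not change the selection; it only records that more than one candidate existed.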
Add slurm.cleanup_grace_period_seconds (default 3600) so that when a component exits (completion, crash, or SIGTERM), the multi-node RL and inference sbatch teardown sends SIGTERM and then waits up to the grace period for the remaining processes to exit before force-killing and releasing the allocation. This gives in-flight work, notably trainer checkpoint writes, a bounded window to flush.

The wait ends as soon as all processes exit, so it is only an upper bound; set to 0 for the previous immediate force-kill behavior.

Closes #2664

Co-authored-by: Cursor <cursoragent@cursor.com>
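
The SIGTERM-then-wait pattern described in this commit can be illustrated in Python as below. The real teardown lives in the sbatch templates, so this is an analogy under assumptions rather than the PR's implementation; `terminate_with_grace` and its parameters are hypothetical names.

```python
import subprocess
import time
from typing import List, Optional


def terminate_with_grace(procs: List[subprocess.Popen],
                         grace_period: float) -> List[Optional[int]]:
    """SIGTERM all live processes, wait up to grace_period seconds in total,
    then force-kill whatever is still running."""
    for proc in procs:
        if proc.poll() is None:
            proc.terminate()  # sends SIGTERM on POSIX

    # One shared deadline: the grace period is an upper bound for the whole
    # group, and the loop finishes early if every process exits sooner.
    deadline = time.monotonic() + grace_period
    for proc in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            proc.wait(timeout=remaining)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX
            proc.wait()
    return [proc.returncode for proc in procs]
```

With `grace_period=0`, any process that has not already exited is killed immediately, which corresponds to the previous behavior mentioned above.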
Drop the _seconds suffix; the unit is documented in the field docstring.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
The previous SIGTERM-then-wait approach didn't help the target case (inference dies while the trainer is mid-checkpoint on another node): that teardown is driven by `srun --kill-on-bad-exit=1`, which reaps the trainer task via SLURM's own KillWait path and never runs our in-task grace loop.

Instead, on a non-zero exit the failing node now stays alive (signalling nothing) for the grace period before propagating the exit. Because --kill-on-bad-exit only fires when a task exits, holding the failing task keeps peer nodes' checkpointing trainers running untouched until they flush. Clean (zero-exit) completion is unaffected.

Scope to multi_node_rl only; the inference-only template has no trainer checkpoints to protect, so it reverts to immediate teardown.

Co-authored-by: Cursor <cursoragent@cursor.com>
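
The hold-on-failure idea reduces to a small piece of control flow, sketched here in Python. The actual logic is in the multi_node_rl sbatch template; `exit_code_after_hold` and its injectable `sleep` argument are hypothetical and exist only to make the behavior easy to see and test.

```python
import time
from typing import Callable


def exit_code_after_hold(exit_code: int, grace_period: float,
                         sleep: Callable[[float], None] = time.sleep) -> int:
    """Return the exit code to propagate, after holding on failure.

    A non-zero exit is held for grace_period seconds without signalling any
    other process, so the task itself stays alive during that window. A zero
    exit is returned immediately.
    """
    if exit_code != 0 and grace_period > 0:
        sleep(grace_period)
    return exit_code


# Record sleeps instead of actually waiting.
slept = []
ok = exit_code_after_hold(0, 3600, sleep=slept.append)
failed = exit_code_after_hold(1, 3600, sleep=slept.append)
```

After these two calls `slept` contains a single entry of 3600: only the failing task was held, and its original exit code is still what gets propagated afterwards.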
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
No description provided.