Fix GRPO OLMo-core bookkeeping PG deadlock + Qwen3 parity tweaks#1708
Open
finbarrtimbers wants to merge 7 commits into
Open
Fix GRPO OLMo-core bookkeeping PG deadlock + Qwen3 parity tweaks#1708finbarrtimbers wants to merge 7 commits into
finbarrtimbers wants to merge 7 commits into
Conversation
…step (https://github.com/allenai/open-instruct/pull/0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s://github.com/allenai/open-instruct/pull/0). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ng PG deadlock; switch dapo_math_oc eval datasets to allenai/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Opus 4.7 <noreply@anthropic.com>
…eply@anthropic.com>
Contributor
There was a problem hiding this comment.
Code Review
This pull request updates the ai2-olmo-core dependency, adjusts training scripts, changes the metrics collection interval, and introduces a keepalive metric to prevent potential deadlocks. It also enables full-precision RoPE for Qwen models. Feedback suggests adding a safety check for self.trainer to prevent AttributeError during initialization or dry runs, and making the Qwen configuration name check case-insensitive.
| f"Available config names: {', '.join(available_configs)}" | ||
| ) | ||
| extra_kwargs = {} | ||
| if config_name.startswith("qwen"): |
Contributor
There was a problem hiding this comment.
The check config_name.startswith("qwen") is case-sensitive. If a configuration name starts with an uppercase letter (e.g., "Qwen"), this check will fail to apply rope_full_precision. It is safer to perform a case-insensitive check using .lower().
Suggested change
| if config_name.startswith("qwen"): | |
| if config_name.lower().startswith("qwen"): |
…on, fp32 cos/sin) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
reduce_metricsdeadlock in GRPO + OLMo-core (olmo_core_train_modules.py): record a_metrics_keepalivemetric on every rank every step, outside thenum_steps > 0gate. OLMo-core's_log_metricsskips submission whenself._metricsis empty (trainer.py:1395); under--active_sampling --inflight_updatesa learner rank can reach a save-time flush with no recorded metrics while peers have some, desyncing the (untagged) bookkeeping process group's collective stream and cross-matchingreduce_metricsagainst_check_if_canceled. That cross-match hangs gloo for its full 30-min timeout and kills training (verified previously in rune6yft2fs).finbarr/tie-lmto enable tied word embeddings for Qwen3, use fp32 RoPE in Qwen3 OLMo-core configs, and switch GRPO OLMo-core metrics-collect-interval to 1 so logs match grpo_fast cadence.qwen3_4b_dapo_math_oc.sheval datasets toallenai/aime_2025_openinstruct/allenai/brumo_2025_openinstruct, and drop a stale--budgetflag fromqwen3_4b_dapo_math.sh.Test plan
make style && make qualityclean.qwen3_4b_dapo_math_oc.sh) launched on Beaker: previously hung 30 min onreduce_metricsat the step-200 save. With the fix, the step-200 checkpoint saves in 32 seconds and training continues past step 240+.Runs:
GPU_TESTS=bypass
🤖 Generated with Claude Code