Skip to content

Fix GRPO OLMo-core bookkeeping PG deadlock + Qwen3 parity tweaks#1708

Open
finbarrtimbers wants to merge 7 commits into
mainfrom
fix-olmo-issues
Open

Fix GRPO OLMo-core bookkeeping PG deadlock + Qwen3 parity tweaks#1708
finbarrtimbers wants to merge 7 commits into
mainfrom
fix-olmo-issues

Conversation

@finbarrtimbers
Copy link
Copy Markdown
Collaborator

Summary

  • Fix reduce_metrics deadlock in GRPO + OLMo-core (olmo_core_train_modules.py): record a _metrics_keepalive metric on every rank every step, outside the num_steps > 0 gate. OLMo-core's _log_metrics skips submission when self._metrics is empty (trainer.py:1395); under --active_sampling --inflight_updates a learner rank can reach a save-time flush with no recorded metrics while peers have some, desyncing the (untagged) bookkeeping process group's collective stream and cross-matching reduce_metrics against _check_if_canceled. That cross-match hangs gloo for its full 30-min timeout and kills training (verified previously in run e6yft2fs).
  • Qwen3 OLMo-core parity (carried from prior work on this branch): bump OLMo-core to finbarr/tie-lm to enable tied word embeddings for Qwen3, use fp32 RoPE in Qwen3 OLMo-core configs, and switch GRPO OLMo-core metrics-collect-interval to 1 so logs match grpo_fast cadence.
  • Switch qwen3_4b_dapo_math_oc.sh eval datasets to allenai/aime_2025_openinstruct / allenai/brumo_2025_openinstruct, and drop a stale --budget flag from qwen3_4b_dapo_math.sh.

Test plan

  • make style && make quality clean.
  • Reproducer (qwen3_4b_dapo_math_oc.sh) launched on Beaker: previously hung 30 min on reduce_metrics at the step-200 save. With the fix, the step-200 checkpoint saves in 32 seconds and training continues past step 240+.

Runs:

  1. Deadlock-fix verification (currently running, past step 240): Beaker

GPU_TESTS=bypass

🤖 Generated with Claude Code

…ng PG deadlock; switch dapo_math_oc eval datasets to allenai/ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
… Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the ai2-olmo-core dependency, adjusts training scripts, changes the metrics collection interval, and introduces a keepalive metric to prevent potential deadlocks. It also enables full-precision RoPE for Qwen models. Feedback suggests adding a safety check for self.trainer to prevent AttributeError during initialization or dry runs, and making the Qwen configuration name check case-insensitive.

Comment thread open_instruct/olmo_core_train_modules.py
Comment thread open_instruct/olmo_core_utils.py Outdated
f"Available config names: {', '.join(available_configs)}"
)
extra_kwargs = {}
if config_name.startswith("qwen"):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check config_name.startswith("qwen") is case-sensitive. If a configuration name starts with an uppercase letter (e.g., "Qwen"), this check will fail to apply rope_full_precision. It is safer to perform a case-insensitive check using .lower().

Suggested change
if config_name.startswith("qwen"):
if config_name.lower().startswith("qwen"):

…on, fp32 cos/sin) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant