LC Anneal by dirkgr · Pull Request #626 · allenai/OLMo-core

dirkgr · 2026-02-26T23:38:11Z

There are four changes in here:

A script that runs long-context (LC) anneals, with LC data, on Olmo 3 models
A generalization / tightening up of the code that disables parallelism when running with train_single
--launch.debug, which sets CUDA_LAUNCH_BLOCKING
A fix to dump_training_batch.py, so that it does the right thing with datasets that shuffle themselves. It also annotates each batch so you can see which file it came from.

dirkgr · 2026-02-27T00:01:08Z

Code review

Found 2 issues:

Missing ep_config in train_single parallelism-disabling loop. The loop disables dp_config, tp_config, cp_config, and pp_config but omits ep_config (expert parallelism). TransformerTrainModule.__init__ checks all five (including ep_config) and raises OLMoConfigurationError if any are non-None when not running distributed, so any MoE script using train_single would crash instead of silently disabling EP.

OLMo-core/src/olmo_core/internal/experiment.py

Line 149 in fcaa57c

for parallelism_style in ["dp_config", "tp_config", "cp_config", "pp_config"]:
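The fix this comment asks for can be sketched as follows, assuming the parallelism settings are optional attributes on a train-module config object. `FakeTrainModuleConfig` and `disable_parallelism` are hypothetical stand-ins for illustration, not OLMo-core's actual classes; only the five attribute names come from the review above.

```python
from dataclasses import dataclass
from typing import Any, List, Optional

# All five parallelism attributes named in the review, including ep_config.
PARALLELISM_ATTRS = ("dp_config", "tp_config", "cp_config", "pp_config", "ep_config")


@dataclass
class FakeTrainModuleConfig:
    """Hypothetical stand-in for a train-module config (not the real class)."""

    dp_config: Optional[Any] = None
    tp_config: Optional[Any] = None
    cp_config: Optional[Any] = None
    pp_config: Optional[Any] = None
    ep_config: Optional[Any] = None


def disable_parallelism(config: Any) -> List[str]:
    """Set every parallelism attribute to None and return the names that changed."""
    disabled = []
    for name in PARALLELISM_ATTRS:
        if getattr(config, name, None) is not None:
            setattr(config, name, None)
            disabled.append(name)
    return disabled
```

Keeping the attribute names in one tuple means a future parallelism style only has to be added in one place, which is the kind of omission this comment flags.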

NCCL_DEBUG=INFO dropped from --debug flag. The old _build_config() set both CUDA_LAUNCH_BLOCKING=1 and NCCL_DEBUG=INFO when --debug was passed. The refactored BeakerLaunchConfig.default_env_vars only sets CUDA_LAUNCH_BLOCKING=1, silently losing NCCL debug output.

OLMo-core/src/olmo_core/launch/beaker.py

Lines 430 to 431 in fcaa57c

    
        if self.debug:
            env_vars.append(("CUDA_LAUNCH_BLOCKING", "1"))

Generated with Claude Code

The loop that disables parallelism configs for train_single was missing ep_config (expert parallelism), causing MoE scripts to crash with OLMoConfigurationError instead of gracefully disabling EP.

When the --debug flag handling was moved from _build_config() into BeakerLaunchConfig.default_env_vars, NCCL_DEBUG=INFO was accidentally dropped. Only CUDA_LAUNCH_BLOCKING=1 was carried over.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

epwalsh · 2026-03-03T00:02:35Z

+    # Pre-compute source_sizes from the already-cached file_sizes to avoid a second round of
+    # ~960 concurrent HEAD requests to R2 that overwhelms the connection pool. The file_sizes
+    # property was already populated during prepare(), and source_sizes is just file_sizes
+    # divided by item_size, but it's a separate property that would re-query every path.
+    from olmo_core.data.numpy_dataset import NumpyPackedFSLDataset
+
+    if isinstance(dataset, NumpyPackedFSLDataset):
+        item_size = dataset.dtype(0).itemsize
+        dataset._source_sizes = [s // item_size for s in dataset.file_sizes]


How about fixing this here instead:

OLMo-core/src/olmo_core/data/numpy_dataset.py

Line 1105 in cc940d6

self._source_sizes = self.map(lambda path, _: get_file_size(path) // item_size)
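A rough sketch of what fixing this inside the dataset could look like, assuming the dataset caches its file sizes after the first lookup. `CachedSizesDataset` and its constructor arguments are hypothetical; the real `NumpyPackedFSLDataset` may be organized differently.

```python
from typing import Callable, List, Optional, Sequence


class CachedSizesDataset:
    """Hypothetical stand-in for a dataset whose files live on remote storage."""

    def __init__(
        self,
        paths: Sequence[str],
        get_file_size: Callable[[str], int],
        item_size: int,
    ) -> None:
        self.paths = list(paths)
        self._get_file_size = get_file_size
        self.item_size = item_size
        self._file_sizes: Optional[List[int]] = None
        self._source_sizes: Optional[List[int]] = None

    @property
    def file_sizes(self) -> List[int]:
        # One size lookup per path, done once and then cached.
        if self._file_sizes is None:
            self._file_sizes = [self._get_file_size(p) for p in self.paths]
        return self._file_sizes

    @property
    def source_sizes(self) -> List[int]:
        # Derived from the cached byte sizes, so no path is queried again.
        if self._source_sizes is None:
            self._source_sizes = [s // self.item_size for s in self.file_sizes]
        return self._source_sizes
```

With this shape, callers such as the script in this PR would not need to assign to a private attribute from outside the class.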

epwalsh · 2026-03-03T00:03:47Z

            env_vars.append((OLMO_SHARED_FS_ENV_VAR, "1"))
+        if self.debug:
+            env_vars.append(("CUDA_LAUNCH_BLOCKING", "1"))
+            env_vars.append(("NCCL_DEBUG", "INFO"))


This is already in the default env vars:

OLMo-core/src/olmo_core/launch/beaker.py

Line 416 in cc940d6

("NCCL_DEBUG", "INFO"),

which kinda annoys me TBH. I think we should default to WARN
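One way to reconcile the two points above, sketched under the assumption that the default NCCL level becomes WARN and --debug raises it to INFO. `build_env_vars` is a hypothetical helper for illustration, not the actual `BeakerLaunchConfig.default_env_vars`.

```python
from typing import List, Tuple


def build_env_vars(debug: bool = False) -> List[Tuple[str, str]]:
    """Return (name, value) pairs, setting NCCL_DEBUG exactly once."""
    # Assumption: WARN by default, INFO only when debugging.
    env_vars = [("NCCL_DEBUG", "INFO" if debug else "WARN")]
    if debug:
        # Makes CUDA kernel launches synchronous so errors surface at the call site.
        env_vars.append(("CUDA_LAUNCH_BLOCKING", "1"))
    return env_vars
```

Setting each variable in a single place avoids the duplicate `NCCL_DEBUG` entry that this comment points out.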

epwalsh · 2026-03-03T00:07:51Z

+log = logging.getLogger(__name__)
+
+
+if __name__ == "__main__":


Kind of odd and error-prone to define the whole script globally in the if __name__ == "__main__" block. It's easy to mix up locals and globals, and adds unnecessary indentation.
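The structure being suggested here is roughly the following. This is a generic sketch, not the PR's script: `resolve_run_name` and the default name "lc-anneal" are made up for illustration.

```python
import logging
import sys
from typing import List, Optional

log = logging.getLogger(__name__)


def resolve_run_name(argv: List[str]) -> str:
    """Pick the run name from the arguments (hypothetical helper)."""
    return argv[0] if argv else "lc-anneal"


def main(argv: Optional[List[str]] = None) -> None:
    # Everything here is a local, so it cannot be confused with module globals.
    args = list(sys.argv[1:] if argv is None else argv)
    run_name = resolve_run_name(args)
    log.info("Starting run %s", run_name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()
```

The guard block stays two lines long, and the script body loses one level of indentation.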

dirkgr and others added 19 commits February 20, 2026 11:00

Adds a script that performs a long context anneal of an arbitrary 7B …

e937f73

…checkpoint

Calculate the Yarn factor

d0c6174

Unused imports

8e005cb

Can't specify this this way.

8298620

Use named constants

6165277

Don't eval

5b0676b

Longmino is only on R2

db8eb7a

Don't load trainer state

0a10fc5

Formatting

1e0a739

More memory

a27325a

Revert "More memory"

f858ed1

This reverts commit a27325a.

Adds a way to set CUDA_LAUNCH_BLOCKING

c57744f

Verify data mixes at the right time

3188147

More general way of disabling parallelism

a8a8cf1

Fix dump_training_batch when reading from a slow source

90f3566

Add source file information to every batch

2b90fd9

Merge branch 'main' into dirkg/LCAnneal

52ca4a4

Fix style and type errors for CI

53e29ff

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add remaining changelog entries for LC anneal PR

fcaa57c

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

dirkgr marked this pull request as ready for review February 26, 2026 23:53

dirkgr requested review from epwalsh and tyler-romero February 26, 2026 23:53

dirkgr and others added 3 commits February 26, 2026 16:02

Fix style for ep_config addition

cc940d6

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

epwalsh suggested changes Mar 3, 2026

LC Anneal #626
dirkgr wants to merge 22 commits into main from dirkg/LCAnneal
