Skip to content

LC Anneal#626

Open
dirkgr wants to merge 22 commits into
mainfrom
dirkg/LCAnneal
Open

LC Anneal#626
dirkgr wants to merge 22 commits into
mainfrom
dirkg/LCAnneal

Conversation

@dirkgr
Copy link
Copy Markdown
Contributor

@dirkgr dirkgr commented Feb 26, 2026

There are four changes in here:

  • A script that does LC anneals, with LC data, on Olmo 3 models
  • A generalization / tightening up of the code that disables parallelism when running with train_single
  • --launch.debug, which sets CUDA_LAUNCH_BLOCKING
  • Fixing dump_training_batch.py, so that it does the right thing with datasets that shuffle themselves. Also adds an annotation to the batches so you can see which file it came from.

@dirkgr dirkgr marked this pull request as ready for review February 26, 2026 23:53
@cursor
Copy link
Copy Markdown

cursor Bot commented Feb 26, 2026

You have run out of free Bugbot PR reviews for this billing cycle. This will reset on March 3.

To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

@dirkgr
Copy link
Copy Markdown
Contributor Author

dirkgr commented Feb 27, 2026

Code review

Found 2 issues:

  1. Missing ep_config in train_single parallelism-disabling loop. The loop disables dp_config, tp_config, cp_config, and pp_config but omits ep_config (expert parallelism). TransformerTrainModule.__init__ checks all five (including ep_config) and raises OLMoConfigurationError if any are non-None when not running distributed, so any MoE script using train_single would crash instead of silently disabling EP.

for parallelism_style in ["dp_config", "tp_config", "cp_config", "pp_config"]:

  1. NCCL_DEBUG=INFO dropped from --debug flag. The old _build_config() set both CUDA_LAUNCH_BLOCKING=1 and NCCL_DEBUG=INFO when --debug was passed. The refactored BeakerLaunchConfig.default_env_vars only sets CUDA_LAUNCH_BLOCKING=1, silently losing NCCL debug output.

if self.debug:
env_vars.append(("CUDA_LAUNCH_BLOCKING", "1"))

Generated with Claude Code

dirkgr and others added 3 commits February 26, 2026 16:02
The loop that disables parallelism configs for train_single was missing
ep_config (expert parallelism), causing MoE scripts to crash with
OLMoConfigurationError instead of gracefully disabling EP.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the --debug flag handling was moved from _build_config() into
BeakerLaunchConfig.default_env_vars, NCCL_DEBUG=INFO was accidentally
dropped. Only CUDA_LAUNCH_BLOCKING=1 was carried over.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment on lines +209 to +217
# Pre-compute source_sizes from the already-cached file_sizes to avoid a second round of
# ~960 concurrent HEAD requests to R2 that overwhelms the connection pool. The file_sizes
# property was already populated during prepare(), and source_sizes is just file_sizes
# divided by item_size, but it's a separate property that would re-query every path.
from olmo_core.data.numpy_dataset import NumpyPackedFSLDataset

if isinstance(dataset, NumpyPackedFSLDataset):
item_size = dataset.dtype(0).itemsize
dataset._source_sizes = [s // item_size for s in dataset.file_sizes]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about fixing this here instead:

self._source_sizes = self.map(lambda path, _: get_file_size(path) // item_size)

env_vars.append((OLMO_SHARED_FS_ENV_VAR, "1"))
if self.debug:
env_vars.append(("CUDA_LAUNCH_BLOCKING", "1"))
env_vars.append(("NCCL_DEBUG", "INFO"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already in the default env vars:

("NCCL_DEBUG", "INFO"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which kinda annoys me TBH. I think we should default to WARN

log = logging.getLogger(__name__)


if __name__ == "__main__":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of odd and error-prone to define the whole script globally in the if __name__ == "__main__" block. It's easy to mix up locals and globals, and adds unnecessary indentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants