LC Anneal#626
Conversation
This reverts commit a27325a.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
You have run out of free Bugbot PR reviews for this billing cycle. This will reset on March 3. To receive reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial. |
Code reviewFound 2 issues:
OLMo-core/src/olmo_core/launch/beaker.py Lines 430 to 431 in fcaa57c Generated with Claude Code |
The loop that disables parallelism configs for train_single was missing ep_config (expert parallelism), causing MoE scripts to crash with OLMoConfigurationError instead of gracefully disabling EP. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When the --debug flag handling was moved from _build_config() into BeakerLaunchConfig.default_env_vars, NCCL_DEBUG=INFO was accidentally dropped. Only CUDA_LAUNCH_BLOCKING=1 was carried over. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
| # Pre-compute source_sizes from the already-cached file_sizes to avoid a second round of | ||
| # ~960 concurrent HEAD requests to R2 that overwhelms the connection pool. The file_sizes | ||
| # property was already populated during prepare(), and source_sizes is just file_sizes | ||
| # divided by item_size, but it's a separate property that would re-query every path. | ||
| from olmo_core.data.numpy_dataset import NumpyPackedFSLDataset | ||
|
|
||
| if isinstance(dataset, NumpyPackedFSLDataset): | ||
| item_size = dataset.dtype(0).itemsize | ||
| dataset._source_sizes = [s // item_size for s in dataset.file_sizes] |
There was a problem hiding this comment.
How about fixing this here instead:
OLMo-core/src/olmo_core/data/numpy_dataset.py
Line 1105 in cc940d6
| env_vars.append((OLMO_SHARED_FS_ENV_VAR, "1")) | ||
| if self.debug: | ||
| env_vars.append(("CUDA_LAUNCH_BLOCKING", "1")) | ||
| env_vars.append(("NCCL_DEBUG", "INFO")) |
There was a problem hiding this comment.
This is already in the default env vars:
OLMo-core/src/olmo_core/launch/beaker.py
Line 416 in cc940d6
There was a problem hiding this comment.
which kinda annoys me TBH. I think we should default to WARN
| log = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| if __name__ == "__main__": |
There was a problem hiding this comment.
Kind of odd and error-prone to define the whole script globally in the if __name__ == "__main__" block. It's easy to mix up locals and globals, and adds unnecessary indentation.
There are four changes in here:
train_single--launch.debug, which setsCUDA_LAUNCH_BLOCKINGdump_training_batch.py, so that it does the right thing with datasets that shuffle themselves. Also adds an annotation to the batches so you can see which file it came from.