TL;DR
When checkpoint.save_consolidated: true is enabled and the base HF model uses non-standard shard filenames (e.g., Qwen3.5 series: model.safetensors-NNNNN-of-NNNNN.safetensors), nemo_automodel's filename parser silently collapses all weights to a single output file. This causes the inline consolidate_safetensors_files_on_every_rank to assign all work to a single rank, leaving the other N−1 ranks idle at an ALLREDUCE barrier — while still holding their GPUs. For a 397B-A17B model on 32×8=256 GPUs, this manifests as a 10-minute NCCL collective timeout (job-crashing) AND ~256 GPU-hours of waste per failed run. Even when the write succeeds, the design holds all GPUs idle while one rank writes ~750 GB single-handedly to Lustre.
Environment
- nemo_automodel: 0.4.0+8972eeb2 (NeMo Automodel container 26_04 + GitHub main branch)
- transformers: 5.5.0
- torch: 2.11.0a0+eb65b36914.nv26.02, CUDA 13.1
- Strategy: fsdp2, PP=8, EP=32, ep_size=32
- Affected models: Qwen/Qwen3.5-397B-A17B, Qwen/Qwen3.5-35B-A3B (both model_type: qwen3_5_moe)
- Storage: Lustre
Reproduction
    checkpoint:
      save_consolidated: true
    dist_env:
      timeout_minutes: 10  # default
    distributed:
      strategy: fsdp2
      ep_size: 32
Train on any Qwen3.5 model. At the first checkpoint step, all ranks hang on a 1-element ALLREDUCE (the barrier after consolidation). NCCL watchdog kills the job at 10 min.
Root-cause chain
- Upstream Qwen ships Qwen3.5 to HF Hub with non-standard shard names. The file listing at https://huggingface.co/Qwen/Qwen3.5-35B-A3B/tree/main shows model.safetensors-00001-of-00014.safetensors (note the doubled .safetensors), not the community-standard model-00001-of-00014.safetensors. HF transformers handles this fine because it reads model.safetensors.index.json rather than parsing filenames.
- _extract_file_index in nemo_automodel/components/checkpoint/_backports/hf_storage.py parses filenames, not index.json:
    parts = basename.split("-")
    if "model" in parts:  # strict element equality
        idx_pos = parts.index("model") + 1
        ...
    return 1  # fallback
For Qwen3.5 files, basename.split("-") yields ["model.safetensors", "00001", "of", "00094.safetensors"]. "model" in parts is False (string equality, not substring). Every shard silently falls back to index 1.
- _maybe_build_consolidated_index in checkpointing.py: because qwen3_5_moe is not in MODELS_REQUIRING_TENSOR_MERGING (conversion_mapping.py), the code takes the else branch and passes the corrupted mapping ({every_fqn: 1}) straight through.
- consolidate_safetensors_files_on_every_rank distributes by idx % world_size:
    for idx in unique_indices:
        if idx % world_size == rank:
            indices_for_this_rank.append(idx)
With unique_indices = {1} and world_size = 256, only rank 1 is assigned work. Ranks 0 and 2–255 enter the barrier and wait.
- A single rank then serially writes one ~750 GB safetensors file to Lustre. Even on a healthy fabric, that is a 20–30 min single-process write. The 10-min timeout_minutes default fires first → all ranks crash → 256 GPUs are billed for the whole prep + train + hang window with zero useful work.
Concrete evidence from our run: 256 GPUs × (~30 min hang + ~30 min unrecoverable build time) ≈ 256 GPU-hours wasted in a single failed checkpoint. Cluster admins flagged the idle utilization.
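The chain above can be reproduced without a cluster. The following is a minimal sketch, not the nemo_automodel source: extract_file_index_sketch is a hypothetical stand-in that follows the quoted snippet, and the elided branch is assumed to parse the element after "model" as an integer. The rank assignment mirrors the idx % world_size loop quoted above.

```python
import os


def extract_file_index_sketch(filename: str) -> int:
    """Hypothetical re-creation of the parsing behaviour described above."""
    basename = os.path.basename(filename)
    parts = basename.split("-")
    if "model" in parts:  # list membership: needs an element equal to "model"
        idx_pos = parts.index("model") + 1
        # Assumption: the elided branch converts the next element to an int.
        if idx_pos < len(parts) and parts[idx_pos].isdigit():
            return int(parts[idx_pos])
    return 1  # silent fallback


def ranks_with_work(unique_indices: set, world_size: int) -> list:
    """Ranks that receive at least one output file under idx % world_size."""
    return sorted({idx % world_size for idx in unique_indices})


# Same 94 shards, named in the standard style and in the Qwen3.5 style.
standard = [f"model-{i:05d}-of-00094.safetensors" for i in range(1, 95)]
qwen35 = [f"model.safetensors-{i:05d}-of-00094.safetensors" for i in range(1, 95)]

standard_indices = {extract_file_index_sketch(f) for f in standard}
qwen35_indices = {extract_file_index_sketch(f) for f in qwen35}

print(len(standard_indices))                 # 94 distinct output files
print(qwen35_indices)                        # {1}
print(ranks_with_work(qwen35_indices, 256))  # [1]
```

With the standard names the sketch yields 94 distinct indices; with the Qwen3.5 names every shard lands on index 1, so rank 1 is the only rank with work.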
Why this is the central design concern, not a peripheral parser quirk
Even if the filename parser worked perfectly and produced the upstream's natural 94 output files for the 397B model, the design would still leave 256 − 94 = 162 GPUs idle at the barrier for the duration of the consolidate write. The deeper issue is that save_consolidated=true couples checkpoint export latency to GPU-allocation latency. Any file-system slowness or any reduction in unique_indices becomes pure GPU-hour waste.
This shows up disproportionately at scale:
- 35B models barely notice — 1 rank can finish a ~70 GB single-file write inside the 10-min budget.
- 400B+ models reliably fail and waste an entire allocation.
- Small clusters notice less; thousand-GPU jobs lose serious money per failed save.
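The idle-rank arithmetic can be made concrete. This sketch assumes output-file indices run from 1 to num_files and reuses the idx % world_size assignment from the root-cause chain. The helper names and the 25-minute write time are illustrative assumptions, not measurements.

```python
def busy_ranks(num_files: int, world_size: int) -> int:
    """Number of ranks that own at least one file under idx % world_size."""
    return len({idx % world_size for idx in range(1, num_files + 1)})


def idle_gpu_hours(num_files: int, world_size: int, write_minutes: float) -> float:
    """GPU-hours spent waiting at the barrier while the busy ranks write."""
    idle = world_size - busy_ranks(num_files, world_size)
    return idle * write_minutes / 60.0


print(256 - busy_ranks(94, 256))     # 162 idle ranks with a working parser
print(256 - busy_ranks(1, 256))      # 255 idle ranks with the collapsed mapping
print(idle_gpu_hours(1, 256, 25.0))  # 106.25 GPU-hours for an assumed 25-min write
```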
Proposed feature requests (any subset would help)
- Make consolidation a true offline / CPU-only step by default. The training job writes only sharded checkpoints; a separate CPU-only CLI or sbatch job does the HF-format export, so GPUs are released at the end of training.
- At minimum, decouple the consolidation barrier from the GPU allocation. After the training loop's last step, optionally dist.destroy_process_group() and free CUDA contexts before entering consolidation, so the Slurm allocation can be released or repurposed.
- Robust upstream-naming support. Replace _extract_file_index with a path that reads model.safetensors.index.json directly; that file is authoritative, whereas parsing filenames is best-effort. Or at minimum accept the Qwen3.5 variant: re.search(r'(\d+)-of-\d+\.safetensors$', basename).
- Add qwen3_5_moe to MODELS_REQUIRING_TENSOR_MERGING if it actually requires tensor merging (the model has grouped-experts), which would route through _equally_divide_layers(num_shards, keys) and split across more output files even if _extract_file_index regresses.
- Warn loudly when _extract_file_index falls back to 1 for more than one input filename; this is currently a silent footgun.
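The index.json, regex, and warning proposals could be combined roughly as follows. This is a sketch under assumptions, not a patch against nemo_automodel: shard_index_from_name and build_fqn_to_index are hypothetical names, the regex is the one suggested above, and the input is the weight_map dictionary (tensor name to shard filename) found in model.safetensors.index.json.

```python
import logging
import re
from typing import Dict, Optional

logger = logging.getLogger(__name__)

# Regex proposed in the list above; matches both naming styles.
_SHARD_RE = re.compile(r"(\d+)-of-\d+\.safetensors$")


def shard_index_from_name(filename: str) -> Optional[int]:
    """Return the shard number from a filename, or None if it has none."""
    match = _SHARD_RE.search(filename)
    return int(match.group(1)) if match else None


def build_fqn_to_index(weight_map: Dict[str, str]) -> Dict[str, int]:
    """Map each tensor name to an output-file index using index.json data."""
    filenames = sorted(set(weight_map.values()))
    file_to_index = {}
    for position, name in enumerate(filenames, start=1):
        parsed = shard_index_from_name(name)
        # Fall back to the sorted position rather than a constant 1.
        file_to_index[name] = parsed if parsed is not None else position
    distinct = len(set(file_to_index.values()))
    if len(filenames) > 1 and distinct < len(filenames):
        # Loud signal instead of a silent collapse onto one output file.
        logger.warning(
            "%d shard files map to only %d distinct indices",
            len(filenames), distinct,
        )
    return {fqn: file_to_index[name] for fqn, name in weight_map.items()}


weight_map = {
    "model.embed_tokens.weight": "model.safetensors-00001-of-00002.safetensors",
    "lm_head.weight": "model.safetensors-00002-of-00002.safetensors",
}
print(build_fqn_to_index(weight_map))
```

In practice the weight_map would come from json.load on the index file. Using the sorted position as a fallback keeps distinct shards on distinct indices even when a filename carries no parsable number.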