
Feature Request: Make save_consolidated parallelism resilient to non-standard upstream shard filenames (Qwen3.5 family) — currently wastes hundreds of GPU-hours on hangs #2279

@linmuchuiyang

TL;DR

When checkpoint.save_consolidated: true is enabled and the base HF model uses non-standard shard filenames (e.g., Qwen3.5 series: model.safetensors-NNNNN-of-NNNNN.safetensors), nemo_automodel's filename parser silently maps every weight to a single output file. As a result, the inline consolidate_safetensors_files_on_every_rank assigns all work to one rank and leaves the other N−1 ranks idle at an ALLREDUCE barrier while they still hold their GPUs. For a 397B-A17B model on 32×8 = 256 GPUs, this manifests as a job-crashing 10-minute NCCL collective timeout and ~256 GPU-hours of waste per failed run. Even when the write succeeds, the design holds all GPUs idle while one rank writes ~750 GB to Lustre on its own.

Environment

  • nemo_automodel: 0.4.0+8972eeb2 (NeMo Automodel container 26_04 + github main branch)
  • transformers: 5.5.0
  • torch: 2.11.0a0+eb65b36914.nv26.02, CUDA 13.1
  • Strategy: fsdp2, PP=8, EP=32, ep_size=32
  • Affected models: Qwen/Qwen3.5-397B-A17B, Qwen/Qwen3.5-35B-A3B (both model_type: qwen3_5_moe)
  • Storage: Lustre

Reproduction

checkpoint:
  save_consolidated: true
dist_env:
  timeout_minutes: 10   # default
distributed:
  strategy: fsdp2
  ep_size: 32

Train on any Qwen3.5 model. At the first checkpoint step, all ranks hang on a 1-element ALLREDUCE (the barrier after consolidation). The NCCL watchdog kills the job at 10 min.

Root-cause chain

  1. Upstream Qwen ships Qwen3.5 to HF Hub with non-standard shard names. Verified on https://huggingface.co/Qwen/Qwen3.5-35B-A3B/tree/main : files are model.safetensors-00001-of-00014.safetensors (note the doubled .safetensors), not the community-standard model-00001-of-00014.safetensors. HF transformers handles this fine because it reads model.safetensors.index.json rather than parsing filenames.

  2. _extract_file_index in nemo_automodel/components/checkpoint/_backports/hf_storage.py parses filenames, not index.json:

    parts = basename.split("-")
    if "model" in parts:           # strict element equality
        idx_pos = parts.index("model") + 1
        ...
    return 1                       # fallback

    For Qwen3.5 files, basename.split("-") yields ["model.safetensors", "00001", "of", "00094.safetensors"]. "model" in parts is False (string equality, not substring). Every shard silently falls back to index 1.

  3. _maybe_build_consolidated_index in checkpointing.py: because qwen3_5_moe is not in MODELS_REQUIRING_TENSOR_MERGING (conversion_mapping.py), the code takes the else branch and passes the corrupted mapping ({every_fqn: 1}) straight through.

  4. consolidate_safetensors_files_on_every_rank distributes by idx % world_size:

    for idx in unique_indices:
        if idx % world_size == rank:
            indices_for_this_rank.append(idx)

    With unique_indices = {1} and world_size = 256, only rank 1 is assigned work. Ranks 0 and 2–255 enter the barrier and wait.

  5. A single rank then serially writes one ~750 GB safetensors file to Lustre. Even on a healthy fabric, that is a 20–30 min single-process write. The 10-min timeout_minutes default fires first → all ranks crash → 256 GPUs are billed for the whole prep + train + hang window with zero useful work.
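Steps 2 and 4 of the chain can be reproduced in isolation. The sketch below is a simplified stand-in, not the actual nemo_automodel code: naive_file_index and ranks_with_work are hypothetical names, and the elided part of the parser is assumed to convert the token that follows "model" to an integer.

```python
def naive_file_index(basename: str) -> int:
    """Simplified stand-in for the filename parser described in step 2."""
    parts = basename.split("-")
    if "model" in parts:  # list membership: needs an element equal to "model"
        idx_pos = parts.index("model") + 1
        # Assumption: the elided logic parses the token right after "model".
        if idx_pos < len(parts) and parts[idx_pos].isdigit():
            return int(parts[idx_pos])
    return 1  # silent fallback


def ranks_with_work(unique_indices: set, world_size: int) -> set:
    """Ranks that receive at least one output file under idx % world_size."""
    return {idx % world_size for idx in unique_indices}


# 94 shard names in the community-standard and the Qwen3.5 naming schemes.
standard = [f"model-{i:05d}-of-00094.safetensors" for i in range(1, 95)]
qwen35 = [f"model.safetensors-{i:05d}-of-00094.safetensors" for i in range(1, 95)]

standard_indices = {naive_file_index(name) for name in standard}
qwen35_indices = {naive_file_index(name) for name in qwen35}

print(len(standard_indices))  # 94 distinct output files
print(qwen35_indices)         # {1}: every shard hit the fallback
print(ranks_with_work(qwen35_indices, world_size=256))  # {1}
```

With the standard names the same modulo rule spreads the work over 94 ranks; with the Qwen3.5 names the set of indices collapses to {1} before the distribution step ever runs.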

Concrete evidence from our run: 256 GPUs × (~30 min hang + ~30 min unrecoverable build time) ≈ 256 GPU-hours wasted in a single failed checkpoint. Cluster admins flagged the idle utilization.

Why this is the central design concern, not a peripheral parser quirk

Even if the filename parser worked perfectly and produced the upstream's natural 94 output files for the 397B model, the design would still leave 256 − 94 = 162 GPUs idle at the barrier for the duration of the consolidation write. The deeper issue is that save_consolidated: true ties checkpoint export latency to GPU allocation time: every rank stays allocated until the slowest writer finishes. Any file-system slowness or any reduction in unique_indices becomes pure GPU-hour waste.
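The idle-rank count can be checked with a few lines. This is a minimal sketch that assumes output-file indices run contiguously from 1 to the number of files and are dealt out by idx % world_size as in step 4; idle_ranks is a hypothetical helper name.

```python
def idle_ranks(num_output_files: int, world_size: int) -> int:
    """Count ranks that receive no output file under idx % world_size."""
    busy = {idx % world_size for idx in range(1, num_output_files + 1)}
    return world_size - len(busy)


print(idle_ranks(94, 256))  # 162: best case with a working parser
print(idle_ranks(1, 256))   # 255: the collapsed mapping seen with Qwen3.5
```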

This shows up disproportionately at scale:

  • 35B models barely notice — 1 rank can finish a ~70 GB single-file write inside the 10-min budget.
  • 400B+ models reliably fail and waste an entire allocation.
  • Small clusters notice less; thousand-GPU jobs lose serious money per failed save.

Proposed feature requests (any subset would help)

  1. Make consolidation a true offline / CPU-only step by default. The training job writes only sharded checkpoints; a separate CPU-only CLI/sbatch job does the HF-format export. GPUs are released at the end of training.

  2. At minimum, decouple the consolidation barrier from the GPU allocation. After the training loop's last step, optionally call dist.destroy_process_group() and free CUDA contexts before entering consolidation, so the Slurm allocation can be released or repurposed.

  3. Robust upstream-naming support. Replace _extract_file_index with a path that reads model.safetensors.index.json directly — that file is authoritative, parsing filenames is best-effort. Or at minimum accept the Qwen3.5 variant: re.search(r'(\d+)-of-\d+\.safetensors$', basename).

  4. Add qwen3_5_moe to MODELS_REQUIRING_TENSOR_MERGING if it actually requires tensor merging (the model has grouped-experts), which would route through _equally_divide_layers(num_shards, keys) and split across more output files even if _extract_file_index regresses.

  5. Warn loudly when _extract_file_index falls back to 1 for >1 input filename — currently a silent footgun.
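Proposals 3 and 5 could be combined roughly as follows. This is a sketch under stated assumptions, not a patch against nemo_automodel: shard_index and fqn_to_file_index are hypothetical names, the regex is the one suggested in proposal 3, and the positional fallback is one possible policy among several.

```python
import re
import warnings
from typing import Dict, Optional

# Pattern from proposal 3: matches both "model-00001-of-00014.safetensors"
# and "model.safetensors-00001-of-00014.safetensors".
_SHARD_RE = re.compile(r"(\d+)-of-\d+\.safetensors$")


def shard_index(basename: str) -> Optional[int]:
    """Return the shard number encoded in the filename, or None if absent."""
    match = _SHARD_RE.search(basename)
    return int(match.group(1)) if match else None


def fqn_to_file_index(weight_map: Dict[str, str]) -> Dict[str, int]:
    """Map each weight name to an output-file index from an index.json weight_map."""
    filenames = sorted(set(weight_map.values()))
    parsed = {name: shard_index(name) for name in filenames}
    unparsed = [name for name, idx in parsed.items() if idx is None]
    if unparsed and len(filenames) > 1:
        # Proposal 5: make the fallback loud instead of silent.
        warnings.warn(
            f"{len(unparsed)} of {len(filenames)} shard filenames could not be "
            "parsed; falling back to positional indices.",
            stacklevel=2,
        )
        # Assumed fallback policy: position in sorted order, which still keeps
        # one output file per input shard instead of collapsing to index 1.
        parsed = {name: pos for pos, name in enumerate(filenames, start=1)}
    elif unparsed:
        parsed = {filenames[0]: 1}  # a single unsharded file is index 1
    return {fqn: parsed[name] for fqn, name in weight_map.items()}


example = {
    "a.weight": "model.safetensors-00001-of-00002.safetensors",
    "b.weight": "model.safetensors-00002-of-00002.safetensors",
}
print(fqn_to_file_index(example))  # {'a.weight': 1, 'b.weight': 2}
```

In practice the weight_map argument would come from the "weight_map" key of model.safetensors.index.json, so the set of shard files is taken from the authoritative index and the filename is only used to recover a stable number for each shard.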
