fix: Baichuan2 checkpoint robustness test CI failures (NVIDIA-NeMo#1727)
* fix: checkpoint robustness test CI failures
- Add trust_remote_code: true to baichuan ci.checkpoint_robustness
- Add hf_device_map_auto: true to nemotron nano configs
- Bump robustness global_batch_size 16→32 for multi-node compatibility
- Remove hardcoded trust_remote_code=False that broke tokenizer loading
- Fix dotted keys in ci.checkpoint_robustness being silently ignored
(e.g. distributed.tp_size, dataset.limit_dataset_samples)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
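The dotted-key fix above can be sketched as a small helper (hypothetical name; the real parsing lives in the CI config loader) that expands flat dotted override keys into nested dicts instead of leaving them as literal keys that nothing reads:

```python
def expand_dotted_keys(flat: dict) -> dict:
    """Expand {'distributed.tp_size': 2} into {'distributed': {'tp_size': 2}}.

    Keys like 'distributed.tp_size' are otherwise stored verbatim and
    silently ignored by consumers that expect nested sections.
    """
    nested: dict = {}
    for key, value in flat.items():
        node = nested
        parts = key.split(".")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested
```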
* fix: Baichuan2 checkpoint robustness test CI failures
- Register MLP-only TP plan for BaichuanForCausalLM (NormHead is not
nn.Linear, W_pack has non-interleaved QKV layout — both incompatible
with ColwiseParallel)
- Fix HF remote code meta-tensor issue: RotaryEmbedding creates
inv_freq/cos_cached/sin_cached as plain attributes that stay on meta
device; added _fix_meta_rotary_embeddings helper for Phase 4
- Set appropriate KL/loss thresholds for Baichuan2 with TP=2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
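A minimal sketch of the meta-tensor repair described above, assuming the core of the helper is just scanning module `__dict__` entries for plain tensor attributes left on the meta device (parameters and registered buffers are handled by the normal materialization path, but plain attributes like `inv_freq`/`cos_cached`/`sin_cached` are not). The zero-fill here is a placeholder; the real helper would recompute the rotary tables from the model config:

```python
import torch
import torch.nn as nn


def fix_meta_rotary_embeddings(model: nn.Module) -> None:
    """Re-materialize plain tensor attributes stranded on the meta device.

    Remote-code RotaryEmbedding modules create inv_freq/cos_cached/sin_cached
    as ordinary attributes, so to_empty()/load_state_dict() never move them
    off meta. Detect and recreate them explicitly (placeholder zeros here;
    the actual fix recomputes the rotary tables).
    """
    for module in model.modules():
        for name, value in list(vars(module).items()):
            if isinstance(value, torch.Tensor) and value.is_meta:
                setattr(module, name, torch.zeros(value.shape, dtype=value.dtype))
```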
* fix: Baichuan2 PEFT checkpoint robustness test CI failures
- Apply _fix_meta_rotary_embeddings to PEFT base model loading path
- Add KL/loss thresholds to baichuan_2_7b_squad_peft.yaml CI config
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
* fix: remove unused cross-TP/resume settings from Baichuan2 PEFT config
Cross-TP and resume assertions are skipped for PEFT models in the test.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
* fix: add gc.collect() before torch.cuda.empty_cache() in checkpoint robustness test
FSDP2/DTensor circular references prevented GPU memory from being freed
between test phases, causing OOM on large models (e.g. Nemotron Super 120B)
when Phase 4 tries to reload via vanilla HF with device_map="auto".
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
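The reason the extra `gc.collect()` matters: `torch.cuda.empty_cache()` only returns cached blocks whose tensors have actually been freed, and tensors held alive by uncollected reference cycles never reach that state. A minimal sketch of the cleanup step:

```python
import gc

import torch


def release_gpu_memory() -> None:
    """Break FSDP2/DTensor reference cycles, then return cached CUDA blocks.

    empty_cache() alone cannot free blocks still referenced by
    unreachable-but-uncollected tensors, so run the cycle collector first.
    """
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```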
* fix: PEFT checkpoint restore for MoE models with activation checkpointing
- Strip _checkpoint_wrapped_module. from FQNs in _get_peft_state_dict and
_set_peft_state_dict to match DCP's normalization. Without this, expert
LoRA weights are silently skipped on reload when activation checkpointing
is enabled (keys mismatch), causing KL divergence of ~0.5.
- Wire up no_check_hf flag to skip Phase 4 vanilla HF check when configured
- Qwen3 MoE 30B LoRA: reduce to 1 node, add no_check_hf
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
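The FQN normalization above amounts to stripping the activation-checkpointing wrapper segment from every key before comparing saved and in-memory state dicts, mirroring what DCP does. A sketch (hypothetical helper name):

```python
_AC_PREFIX = "_checkpoint_wrapped_module."


def normalize_fqn(fqn: str) -> str:
    """Strip activation-checkpointing wrapper segments from an FQN.

    Without this, keys saved from a wrapped module (containing
    '_checkpoint_wrapped_module.') never match the unwrapped keys on
    reload, and the corresponding weights are silently skipped.
    """
    return fqn.replace(_AC_PREFIX, "")
```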
* fix: Qwen3 MoE PEFT adapter HF compatibility via ParamWrapper format
Save Qwen3 MoE expert LoRA adapters in PEFT v0.18+ ParamWrapper format
so PeftModel.from_pretrained() can load them directly. Previously, adapters
were saved with per-expert individual keys (experts.0.gate_proj.lora_A.weight),
which vanilla HF could not load because Qwen3 MoE stores experts as fused
nn.Parameter tensors (experts.gate_up_proj) rather than one nn.Module per expert.
The new format (default, v4_compatible=False) uses target_parameters in
adapter_config.json and 2D fused LoRA tensors matching ParamWrapper's
expected key layout. Legacy per-expert format is preserved when
v4_compatible=True.
Also: reduce Qwen3 MoE CI from 2 nodes to 1, remove dead no_check_hf
parsing from test, clean up _extract_target_modules helpers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
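An illustrative sketch of the key-layout change: legacy per-expert LoRA keys collapse into a single 2D tensor keyed off the fused parameter. The function name, the stacking axis, and the fused key shown here are assumptions for illustration; the real layout follows PEFT's ParamWrapper expectations:

```python
import torch


def fuse_expert_lora_A(per_expert: dict, num_experts: int) -> dict:
    """Collapse legacy per-expert LoRA A keys into one fused 2D tensor.

    Legacy:  experts.{i}.gate_proj.lora_A.weight  (one [r, in] matrix each)
    Fused:   experts.gate_up_proj.lora_A.weight   (stacked along dim 0)
    """
    mats = [
        per_expert[f"experts.{i}.gate_proj.lora_A.weight"]
        for i in range(num_experts)
    ]
    return {"experts.gate_up_proj.lora_A.weight": torch.cat(mats, dim=0)}
```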
* fix: remove debug print statement from checkpoint robustness test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
---------
Signed-off-by: adil-a <adil.asif2000@hotmail.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>