Commit 31405ac

HuiyingLi and claude authored
docs(nemotron-omni): use device_map fast path for SFT inference (#2126)
The SFT inference snippet under Step 4 instantiated a 30B model on CPU via `AutoModel.from_config()` solely to read its concrete `trust_remote_code` class, then re-loaded weights through that class. On the v3 dump this CPU instantiation alone takes ~5 minutes.

Verified locally on `auto2604rc4`, against both the base v3 dump and a consolidated SFT checkpoint, that `AutoModel.from_pretrained(CKPT, trust_remote_code=True, dtype=torch.bfloat16, device_map={"": torch.cuda.current_device()})` resolves to `NemotronH_Nano_Omni_Reasoning_V3` correctly and produces structured `<s_total>...</s_total>` output; the `from_config` round-trip and the `all_tied_weights_keys` patch are no longer needed. The LoRA section already uses the same fast path.

Total inference setup drops from ~5 min to ~85 s on the consolidated dump.

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bb209de · commit 31405ac
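A minimal sketch of the verification described in the commit message, assuming `CKPT` points at a consolidated checkpoint directory (the path below is a placeholder; the class-name and device checks simply mirror the commit's notes):

```python
import torch
from transformers import AutoModel

CKPT = "<checkpoint_dir>/LOWEST_VAL/model/consolidated"  # placeholder path

# Fast path: AutoModel resolves the trust_remote_code class directly, and
# device_map streams weights onto the current GPU, skipping CPU instantiation.
model = AutoModel.from_pretrained(
    CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)

# Per the commit's verification notes: the auto class should resolve straight
# to the remote-code class, and weights should land on GPU in bf16.
assert type(model).__name__ == "NemotronH_Nano_Omni_Reasoning_V3"
param = next(model.parameters())
assert param.device.type == "cuda" and param.dtype == torch.bfloat16
```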

1 file changed: docs/guides/vlm/nemotron-omni.md (8 additions, 9 deletions)
```diff
@@ -313,7 +313,7 @@ to spot-check structured output.
 ```python
 import torch
 import json
-from transformers import AutoConfig, AutoModel, AutoProcessor
+from transformers import AutoModel, AutoProcessor
 from datasets import load_dataset
 from nemo_automodel.components.datasets.vlm.utils import json2token
 
@@ -323,19 +323,18 @@ CKPT = "<checkpoint_dir>/LOWEST_VAL/model/consolidated"
 processor = AutoProcessor.from_pretrained(CKPT, trust_remote_code=True)
 tokenizer = processor.tokenizer
 
-# Resolve the trust_remote_code model class via from_config, then load weights.
-# Using AutoModel.from_pretrained directly can mis-route on v3 dumps.
-config = AutoConfig.from_pretrained(CKPT, trust_remote_code=True)
-model_class = type(AutoModel.from_config(config, trust_remote_code=True))
-if not hasattr(model_class, "all_tied_weights_keys"):
-    model_class.all_tied_weights_keys = {}
-model = model_class.from_pretrained(CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16)
+# `device_map` streams weights directly to GPU; skipping the AutoModel.from_config
+# CPU-instantiation step saves ~5 min on the 30B v3 dump.
+model = AutoModel.from_pretrained(
+    CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16,
+    device_map={"": torch.cuda.current_device()},
+)
 
 # Reset RADIO's `summary_idxs` (non-persistent buffer; can be a meta tensor after load)
 if hasattr(model, "vision_model") and hasattr(model.vision_model, "radio_model"):
     model.vision_model.radio_model.summary_idxs = None
 
-model = model.cuda().eval()
+model.eval()
 
 # Load dataset
 dataset = load_dataset("naver-clova-ix/cord-v2")
```
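A rough sketch for reproducing the setup-time claim; the ~85 s figure is from the commit message, and `CKPT` is again a placeholder:

```python
import time

import torch
from transformers import AutoModel

CKPT = "<checkpoint_dir>/LOWEST_VAL/model/consolidated"  # placeholder path

t0 = time.perf_counter()
model = AutoModel.from_pretrained(
    CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
).eval()
# ~85 s on the consolidated dump per the commit, vs ~5 min via the old
# from_config round-trip that built the 30B model on CPU first.
print(f"inference setup: {time.perf_counter() - t0:.0f}s")
```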
