docs(nemotron-omni): use device_map fast path for SFT inference #2126
Merged
Conversation
The SFT inference snippet under Step 4 instantiated a 30B model on CPU via
`AutoModel.from_config()` solely to read its concrete `trust_remote_code`
class, then re-loaded weights through that class. On the v3 dump this CPU
instantiation alone takes ~5 minutes.
Verified locally on `auto2604rc4` against both the base v3 dump and a consolidated SFT checkpoint: `AutoModel.from_pretrained(CKPT, trust_remote_code=True, dtype=torch.bfloat16, device_map={"": torch.cuda.current_device()})` resolves to `NemotronH_Nano_Omni_Reasoning_V3` correctly and produces structured `<s_total>...</s_total>` output, so the `from_config` round-trip and the `all_tied_weights_keys` patch are no longer needed. The LoRA section already uses the same fast path.
Total inference setup drops from ~5 min to ~85 s on the consolidated dump.
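For reference, a minimal sketch of the fast path this PR documents (the checkpoint path is a placeholder; the call itself mirrors the one quoted above):

```python
import torch
from transformers import AutoModel

CKPT = "/path/to/consolidated/sft/checkpoint"  # placeholder path

# One call resolves the concrete trust_remote_code class from the checkpoint
# and materializes the bf16 weights directly on the current GPU, skipping the
# ~5 minute CPU instantiation entirely.
model = AutoModel.from_pretrained(
    CKPT,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
model.eval()  # weights are already on GPU, so no .cuda() is needed
```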
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
HuiyingLi added a commit to HuiyingLi/Nemotron that referenced this pull request on May 5, 2026:
Sync the Step 4 SFT inference snippet with the matching change going into NVIDIA-NeMo/Automodel#2126: replace the `AutoConfig.from_pretrained` + `AutoModel.from_config(...)` round-trip + `all_tied_weights_keys` patch with `AutoModel.from_pretrained(..., device_map={"": cuda})`.

Verified on both the base v3 dump and a consolidated SFT checkpoint — class resolves to `NemotronH_Nano_Omni_Reasoning_V3`, structured output unchanged. Drops ~5 min of CPU instantiation from the inference setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
HuiyingLi (Contributor, Author) commented:
/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@HuiyingLi, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
akoumpa approved these changes on May 5, 2026.
akoumpa (Contributor) commented:
/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@akoumpa, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
Summary
Replace the SFT inference snippet's CPU-instantiation workaround with the existing fast path used by the LoRA section.
The current Step 4 — Run Inference SFT example resolves the `trust_remote_code` model class by instantiating a 30B model on CPU with random weights just to read `type(...)`. On the v3 dump that CPU init alone burns ~5 minutes before any real loading happens. The comment claims `AutoModel.from_pretrained` "can mis-route on v3 dumps", but that is no longer true on current `transformers`/`auto26.04` containers.

What changes
Use the same direct-to-GPU pattern the LoRA section already documents (a before/after sketch follows this list):
- Drop the `AutoConfig` import.
- Drop the `from_config` round-trip and the `all_tied_weights_keys` patch.
- Replace `model.cuda().eval()` with `model.eval()` (weights are already on GPU).
- Keep the `summary_idxs = None` reset and the `PROCESSOR_METADATA_KEYS` filter — those are independent.
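A before/after sketch of what the Step 4 snippet change amounts to, paraphrased from the description above rather than copied verbatim from the docs (`CKPT` is a placeholder):

```python
import torch
from transformers import AutoConfig, AutoModel

CKPT = "/path/to/checkpoint"  # placeholder

# Before: a 30B CPU instantiation with random weights, used only to learn
# the concrete trust_remote_code class (~5 min on the v3 dump); the docs
# also patched all_tied_weights_keys around this step.
config = AutoConfig.from_pretrained(CKPT, trust_remote_code=True)
stub = AutoModel.from_config(config, trust_remote_code=True)
model = type(stub).from_pretrained(CKPT, dtype=torch.bfloat16)
model.cuda().eval()

# After: load once, straight to the current GPU.
model = AutoModel.from_pretrained(
    CKPT,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
model.eval()
```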
Verification

Ran the fast path locally on `auto2604rc4` against:
- `nemotron-3-nano-omni-ea1_v2.0` (base v3 dump): resolves to `NemotronH_Nano_VL_V2`
- `cordv2_v3_400_ckpts/LOWEST_VAL/model/consolidated` (SFT): resolves to `NemotronH_Nano_Omni_Reasoning_V3`, structured output `<s_total><s_total_price>45,500</s_total_price>...<s_nm>REAL GANACHE</s_nm>...`

Total Step 4 inference setup drops from ~5 min → ~85 s on the consolidated dump. No mis-routing observed.
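A hypothetical spot-check along the same lines, mirroring what was verified manually (the expected class name comes from the list above):

```python
# Hypothetical spot-check: trust_remote_code should route the consolidated
# SFT checkpoint to the reasoning class rather than the VL class.
resolved = type(model).__name__
assert resolved == "NemotronH_Nano_Omni_Reasoning_V3", f"mis-routed to {resolved}"
print(resolved)
```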
Test plan