
docs(nemotron-omni): use device_map fast path for SFT inference #2126

Merged
akoumpa merged 1 commit into main from huiyingl/docs/nomni-fast-path-inference
May 5, 2026

Conversation

@HuiyingLi
Contributor

Summary

Replace the SFT inference snippet's CPU-instantiation workaround with the existing fast path used by the LoRA section.

The current Step 4 — Run Inference SFT example does:

config = AutoConfig.from_pretrained(CKPT, trust_remote_code=True)
model_class = type(AutoModel.from_config(config, trust_remote_code=True))
if not hasattr(model_class, "all_tied_weights_keys"):
    model_class.all_tied_weights_keys = {}
model = model_class.from_pretrained(CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16)
...
model = model.cuda().eval()

That pattern resolves the trust_remote_code model class by instantiating a 30B model on CPU with random weights just to read type(...). On the v3 dump that CPU init alone burns ~5 minutes before any real loading happens. The comment claims AutoModel.from_pretrained "can mis-route on v3 dumps", but that is no longer true on current transformers / auto26.04 containers.

What changes

Use the same direct-to-GPU pattern the LoRA section already documents:

model = AutoModel.from_pretrained(
    CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
  • Drops AutoConfig import.
  • Drops the from_config round-trip and the all_tied_weights_keys patch.
  • Replaces model.cuda().eval() with model.eval() (weights already on GPU).
  • Keeps the RADIO summary_idxs = None reset and the PROCESSOR_METADATA_KEYS filter — those are independent.
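Putting the bullets above together, the revised Step 4 load path might look roughly like the sketch below. This is an illustration, not the exact docs snippet: `CKPT` is a placeholder path, and the RADIO `summary_idxs` reset and `PROCESSOR_METADATA_KEYS` filtering (which this PR keeps unchanged) are elided.

```python
# Sketch of the fast-path load after this change (assumes a local
# SFT consolidated dump at CKPT; requires a CUDA-capable machine).
import torch
from transformers import AutoModel

CKPT = "/path/to/sft/consolidated"  # hypothetical checkpoint path

# Load weights straight onto the current GPU. No AutoConfig lookup,
# no from_config CPU round-trip, no all_tied_weights_keys patch.
model = AutoModel.from_pretrained(
    CKPT,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
model.eval()  # weights are already on GPU, so no .cuda() call needed
```

The `device_map={"": device}` form maps the root module (the empty-string key) — and therefore the whole model — onto a single device, which is what lets `from_pretrained` skip the CPU materialization step.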

Verification

Ran the fast path locally on auto2604rc4 against:

| Checkpoint | Load time | Class resolved | Generation |
| --- | --- | --- | --- |
| `nemotron-3-nano-omni-ea1_v2.0` (base v3 dump) | ~21 s | `NemotronH_Nano_VL_V2` | OK |
| `cordv2_v3_400_ckpts/LOWEST_VAL/model/consolidated` (SFT) | ~85 s | `NemotronH_Nano_Omni_Reasoning_V3` | `<s_total><s_total_price>45,500</s_total_price>...<s_nm>REAL GANACHE</s_nm>...` |

Total Step 4 inference setup drops from ~5 min → ~85 s on the consolidated dump. No mis-routing observed.

Test plan

  • Render the docs locally and confirm the SFT code block compiles cleanly.
  • Optional: re-run the snippet against an internal SFT consolidated dump on a current container to confirm the same class resolution.

The SFT inference snippet under Step 4 instantiated a 30B model on CPU via
`AutoModel.from_config()` solely to read its concrete `trust_remote_code`
class, then re-loaded weights through that class. On the v3 dump this CPU
instantiation alone takes ~5 minutes.

Verified locally on `auto2604rc4` against both the base v3 dump and a
consolidated SFT checkpoint that `AutoModel.from_pretrained(CKPT,
trust_remote_code=True, dtype=torch.bfloat16, device_map={"":
torch.cuda.current_device()})` resolves to `NemotronH_Nano_Omni_Reasoning_V3`
correctly and produces structured `<s_total>...</s_total>` output — the
`from_config` round-trip and the `all_tied_weights_keys` patch are no longer
needed. The LoRA section already uses the same fast path.

Total inference setup drops from ~5 min to ~85 s on the consolidated dump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

HuiyingLi added a commit to HuiyingLi/Nemotron that referenced this pull request May 5, 2026
Sync the Step 4 SFT inference snippet with the matching change going into
NVIDIA-NeMo/Automodel#2126: replace `AutoConfig.from_pretrained` +
`AutoModel.from_config(...)` round-trip + `all_tied_weights_keys` patch
with `AutoModel.from_pretrained(..., device_map={"": cuda})`. Verified on
both base v3 dump and a consolidated SFT checkpoint — class resolves to
`NemotronH_Nano_Omni_Reasoning_V3`, structured output unchanged.

Drops ~5 min of CPU instantiation from the inference setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@HuiyingLi, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@HuiyingLi
Contributor Author

/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@HuiyingLi, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@akoumpa added the "docs-only" label (With great power comes great responsibility.) May 5, 2026
@akoumpa
Contributor

akoumpa commented May 5, 2026

/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@copy-pr-bot

copy-pr-bot Bot commented May 5, 2026

/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@akoumpa, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@akoumpa enabled auto-merge (squash) May 5, 2026 16:05
@akoumpa disabled auto-merge May 5, 2026 16:05
@akoumpa merged commit 31405ac into main May 5, 2026
4 checks passed
@akoumpa deleted the huiyingl/docs/nomni-fast-path-inference branch May 5, 2026 16:05