docs(nemotron-omni): use device_map fast path for SFT inference #2126
Merged
Conversation
The SFT inference snippet under Step 4 instantiated a 30B model on CPU via
`AutoModel.from_config()` solely to read its concrete `trust_remote_code`
class, then re-loaded weights through that class. On the v3 dump this CPU
instantiation alone takes ~5 minutes.
Verified locally on `auto2604rc4` against both the base v3 dump and a consolidated SFT checkpoint: `AutoModel.from_pretrained(CKPT, trust_remote_code=True, dtype=torch.bfloat16, device_map={"": torch.cuda.current_device()})` resolves to `NemotronH_Nano_Omni_Reasoning_V3` correctly and produces structured `<s_total>...</s_total>` output, so the `from_config` round-trip and the `all_tied_weights_keys` patch are no longer needed. The LoRA section already uses the same fast path.
Total inference setup drops from ~5 min to ~85 s on the consolidated dump.
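For reference, a minimal sketch of the fast path this PR documents (the checkpoint path is a placeholder; the call itself mirrors the one quoted above):

```python
import torch
from transformers import AutoModel

CKPT = "/path/to/consolidated/sft/checkpoint"  # placeholder path

# One call resolves the concrete trust_remote_code class from the checkpoint
# and materializes the bf16 weights directly on the current GPU, skipping the
# ~5 minute CPU instantiation entirely.
model = AutoModel.from_pretrained(
    CKPT,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
model.eval()  # weights are already on GPU, so no .cuda() is needed
```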
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
HuiyingLi added a commit to HuiyingLi/Nemotron that referenced this pull request on May 5, 2026:
Sync the Step 4 SFT inference snippet with the matching change going into NVIDIA-NeMo/Automodel#2126: replace the `AutoConfig.from_pretrained` + `AutoModel.from_config(...)` round-trip + `all_tied_weights_keys` patch with `AutoModel.from_pretrained(..., device_map={"": cuda})`.

Verified on both the base v3 dump and a consolidated SFT checkpoint — class resolves to `NemotronH_Nano_Omni_Reasoning_V3`, structured output unchanged. Drops ~5 min of CPU instantiation from the inference setup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
HuiyingLi (Contributor, Author) commented:
/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@HuiyingLi, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
akoumpa approved these changes on May 5, 2026.
akoumpa (Contributor) commented:
/ok to test e86f150ca47f20c21d78ee6416ad3cad16ceea3c

@akoumpa, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
Summary
Replace the SFT inference snippet's CPU-instantiation workaround with the existing fast path used by the LoRA section.
The current Step 4 — Run Inference SFT example resolves the `trust_remote_code` model class by instantiating a 30B model on CPU with random weights just to read `type(...)`. On the v3 dump that CPU init alone burns ~5 minutes before any real loading happens. The comment claims `AutoModel.from_pretrained` "can mis-route on v3 dumps", but that is no longer true on current `transformers`/`auto26.04` containers.

What changes
Use the same direct-to-GPU pattern the LoRA section already documents (a before/after sketch follows this list):
- Drop the `AutoConfig` import.
- Drop the `from_config` round-trip and the `all_tied_weights_keys` patch.
- Replace `model.cuda().eval()` with `model.eval()` (weights are already on GPU).
- Keep the `summary_idxs = None` reset and the `PROCESSOR_METADATA_KEYS` filter — those are independent.
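A before/after sketch of what the Step 4 snippet change amounts to, paraphrased from the description above rather than copied verbatim from the docs (`CKPT` is a placeholder):

```python
import torch
from transformers import AutoConfig, AutoModel

CKPT = "/path/to/checkpoint"  # placeholder

# Before: a 30B CPU instantiation with random weights, used only to learn
# the concrete trust_remote_code class (~5 min on the v3 dump); the docs
# also patched all_tied_weights_keys around this step.
config = AutoConfig.from_pretrained(CKPT, trust_remote_code=True)
stub = AutoModel.from_config(config, trust_remote_code=True)
model = type(stub).from_pretrained(CKPT, dtype=torch.bfloat16)
model.cuda().eval()

# After: load once, straight to the current GPU.
model = AutoModel.from_pretrained(
    CKPT,
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
model.eval()
```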
Verification

Ran the fast path locally on `auto2604rc4` against:
- `nemotron-3-nano-omni-ea1_v2.0` (base v3 dump): resolves to `NemotronH_Nano_VL_V2`
- `cordv2_v3_400_ckpts/LOWEST_VAL/model/consolidated` (SFT): resolves to `NemotronH_Nano_Omni_Reasoning_V3`, structured output `<s_total><s_total_price>45,500</s_total_price>...<s_nm>REAL GANACHE</s_nm>...`

Total Step 4 inference setup drops from ~5 min → ~85 s on the consolidated dump. No mis-routing observed.
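A hypothetical spot-check along the same lines, mirroring what was verified manually (the expected class name comes from the list above):

```python
# Hypothetical spot-check: trust_remote_code should route the consolidated
# SFT checkpoint to the reasoning class rather than the VL class.
resolved = type(model).__name__
assert resolved == "NemotronH_Nano_Omni_Reasoning_V3", f"mis-routed to {resolved}"
print(resolved)
```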
Test plan