Commit 31405ac
docs(nemotron-omni): use device_map fast path for SFT inference (#2126)
The SFT inference snippet under Step 4 instantiated a 30B model on CPU via
`AutoModel.from_config()` solely to resolve its concrete `trust_remote_code`
class, then re-loaded the weights through that class. On the v3 dump this CPU
instantiation alone took ~5 minutes.
Verified locally on `auto2604rc4` against both the base v3 dump and a
consolidated SFT checkpoint: `AutoModel.from_pretrained(CKPT,
trust_remote_code=True, dtype=torch.bfloat16, device_map={"":
torch.cuda.current_device()})` resolves to `NemotronH_Nano_Omni_Reasoning_V3`
directly and produces structured `<s_total>...</s_total>` output, so the
`from_config` round-trip and the `all_tied_weights_keys` patch are no longer
needed. The LoRA section already uses the same fast path.
Total inference setup drops from ~5 min to ~85 s on the consolidated dump.
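A minimal sketch of the fast-path load described above, factored into a helper for clarity. The CPU fallback and the `fast_load_kwargs` name are illustrative additions, not part of the diff; the actual checkpoint path is elided.

```python
import torch

def fast_load_kwargs() -> dict:
    """kwargs for AutoModel.from_pretrained that skip the CPU
    from_config() round-trip: every weight is mapped straight onto
    the current CUDA device (falling back to CPU when none is visible)."""
    device = torch.cuda.current_device() if torch.cuda.is_available() else "cpu"
    return {
        "trust_remote_code": True,
        "dtype": torch.bfloat16,
        # "" maps the whole module tree to a single device.
        "device_map": {"": device},
    }

# Usage (CKPT is a placeholder for the consolidated SFT checkpoint path):
# from transformers import AutoModel
# model = AutoModel.from_pretrained(CKPT, **fast_load_kwargs())
kwargs = fast_load_kwargs()
```

The single-device `device_map` avoids both the CPU instantiation and any cross-device sharding overhead for a model that fits on one GPU.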
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent bb209de
1 file changed
Lines changed: 8 additions & 9 deletions
(diff body not captured in this export; the edits touch lines 313-341 of the file)