[None][perf] AutoDeploy: fast-path Python-list staging in nest_sequences #254

Draft
MrGeva wants to merge 1 commit into eg/ad-host-opt from eg/ad-nest-seq-staging

MrGeva commented Jun 8, 2026

Summary

Optimizes the per-arg decode input staging in AutoDeploy's InputBuffer.stage (the nest_sequences host path), which py-spy profiling identified as the dominant host hotspot at low concurrency (_prepare_inputs ~23–30% of the decode loop, almost all in per-arg staging).

Python-list inputs (input_ids, cu_seqlen, input_pos, slot_idx, all staged every decode step) previously went through _list_to_tensor (np.array → torch.from_numpy) and were then unwrapped again with two .numpy() calls + np.copyto. This change assigns the list directly into the pinned host numpy view (host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a single pass. The torch.Tensor staging path is unchanged.
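The two paths described above can be sketched roughly as follows. This is an illustration only, not the actual InputBuffer.stage code: the function names `stage_list_old` and `stage_list_new` are hypothetical, the body of the old path is a guess at what the list → array → tensor → array round-trip looks like, and the buffer is pinned only when CUDA is available so that the sketch also runs on CPU-only machines.

```python
import numpy as np
import torch


def stage_list_old(host_view: torch.Tensor, data: list) -> None:
    # Assumed shape of the previous path: list -> np.array -> torch tensor,
    # then two .numpy() unwraps and an np.copyto into the host buffer.
    staged = torch.from_numpy(np.array(data))
    np.copyto(host_view[: len(data)].numpy(), staged.numpy(), casting="same_kind")


def stage_list_new(host_view: torch.Tensor, data: list) -> None:
    # Fast path from the PR description: assign the list straight into the
    # numpy view of the host buffer; numpy casts to the buffer dtype.
    host_view[: len(data)].numpy()[:] = data


# Hypothetical host buffer; pinned memory requires CUDA, so fall back to
# ordinary CPU memory when no GPU is present.
host_view = torch.zeros(16, dtype=torch.int32, pin_memory=torch.cuda.is_available())
stage_list_new(host_view, [11, 12, 13])
print(host_view[:3].tolist())
```

Both functions write into the same memory because `Tensor.numpy()` returns a view that shares storage with the CPU tensor, so no extra copy back into `host_view` is needed.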

Perf

Microbench (tiny c=1 decode args, 4 list args/step): 1.79× faster staging (35.3 → 19.8 µs/step). Output bit-identical to the previous path.

Test

  • Microbench asserts bit-identical host-buffer contents vs the previous path for both list and tensor inputs.
  • End-to-end WS=1 greedy decode (Nemotron-3-Nano NVFP4): deterministic, unchanged output, no regression.
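A microbenchmark of the kind described above might look like the sketch below. It is not the benchmark used for this PR: the argument values, buffer sizes, iteration count, and all function names are placeholders chosen for illustration, and the old path is an assumed reconstruction. Absolute timings will differ by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import torch

# Four small list args per step, mimicking a c=1 decode step (values are made up).
step_args = [[7], [0, 1], [42], [3]]
host_views = [torch.zeros(8, dtype=torch.int32) for _ in step_args]


def stage_step_old() -> None:
    # Assumed previous path: list -> np.array -> tensor -> numpy -> np.copyto.
    for view, data in zip(host_views, step_args):
        staged = torch.from_numpy(np.array(data))
        np.copyto(view[: len(data)].numpy(), staged.numpy(), casting="same_kind")


def stage_step_new() -> None:
    # Fast path: direct list assignment into the numpy view of the buffer.
    for view, data in zip(host_views, step_args):
        view[: len(data)].numpy()[:] = data


def snapshot() -> list:
    # Copy out the current host-buffer contents for comparison.
    return [view.clone() for view in host_views]


def us_per_step(fn, iters: int = 20000) -> float:
    # Average wall-clock time per call, in microseconds.
    return timeit.timeit(fn, number=iters) / iters * 1e6


# Bit-identical check: run the old path, snapshot, clear, run the new path.
stage_step_old()
expected = snapshot()
for view in host_views:
    view.zero_()
stage_step_new()
identical = all(torch.equal(a, b) for a, b in zip(expected, snapshot()))

print(f"old: {us_per_step(stage_step_old):.1f} us/step")
print(f"new: {us_per_step(stage_step_new):.1f} us/step")
print(f"bit-identical: {identical}")
```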

🤖 Generated with Claude Code

The per-arg decode staging path (InputBuffer.stage) converted every Python-list
input via _list_to_tensor (np.array -> torch.from_numpy) and then unwrapped it
again with two .numpy() calls + np.copyto. For the small lists staged on every
decode step at low concurrency (input_ids, cu_seqlen, input_pos, slot_idx) this
list -> array -> tensor -> array round-trip is pure host overhead.

Assign the list directly into the pinned host numpy view
(host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a
single pass. The torch.Tensor staging path is unchanged.

Microbench (tiny c=1 decode args, 4 list args/step): 1.79x faster staging
(35.3 -> 19.8 us/step), output bit-identical to the previous path. Validated
end-to-end with WS=1 greedy decode (deterministic, unchanged output).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

MrGeva commented Jun 8, 2026

Author

End-to-end measurement (honest result)

Microbench (isolated staging): 1.79× faster (35.3 → 19.8 µs/step for 4 c=1 list args), output bit-identical.

TP4 c=1 decode A/B (Nemotron-3-Nano NVFP4, multi_stream_moe:false, sustained greedy decode):

| Config | Mean tok/s |
|---|---|
| baseline (eg/ad-host-opt) | 405.6 |
| + this opt | 405.7 |

No measurable end-to-end gain at TP4 c=1 (within run-to-run noise). The ~15 µs/step saved is ~0.6% of the ~2.46 ms step.
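The ~0.6% figure can be reproduced from the numbers already reported, assuming that at c=1 each decode step produces exactly one token, so the step time is the reciprocal of the throughput. The variable names below are illustrative.

```python
# Inputs taken from the measurements reported in this PR.
old_staging_us = 35.3   # microbench, previous path (us/step)
new_staging_us = 19.8   # microbench, fast path (us/step)
baseline_tok_s = 405.6  # TP4 c=1 baseline throughput

# Assumption: one token per decode step at c=1.
step_ms = 1000.0 / baseline_tok_s
saved_us = old_staging_us - new_staging_us
fraction_saved = saved_us / (step_ms * 1000.0)

print(f"step time: {step_ms:.2f} ms")
print(f"saved: {saved_us:.1f} us/step")
print(f"share of step: {fraction_saved:.2%}")
```

A saving of this size is smaller than the 0.1 tok/s difference between the two measured runs can resolve, which is consistent with the result being within run-to-run noise.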

TP4 c=1 py-spy breakdown: _forward_step ~31%, synchronize (GPU-wait/comms) ~17%, sampler ~28% (shared with PyTorch), _prepare_inputs ~19% — but the staging this PR touches is only ~4% of the loop, and the AD-specific host cost is diffuse across nest_sequences, not concentrated.

Caveat: the production config (multi_stream_moe ON) can't be measured at TP — it deadlocks under CUDA-graph + TP collective (separate root-cause; PR NVIDIA#14917 does not fix that nano_v3 case). This opt is correct/harmless and may help in the production overlap config or at higher concurrency, but isn't a measurable win in the only currently-measurable TP config. Keeping as draft pending that decision.
