[None][perf] AutoDeploy: fast-path Python-list staging in nest_sequences by MrGeva · Pull Request #254 · nv-auto-deploy/TensorRT-LLM

MrGeva · 2026-06-08T16:03:59Z

Summary

Optimizes the per-arg decode input staging in AutoDeploy's InputBuffer.stage (the nest_sequences host path), which py-spy profiling identified as the dominant host hotspot at low concurrency (_prepare_inputs ~23–30% of the decode loop, almost all in per-arg staging).

Python-list inputs (input_ids, cu_seqlen, input_pos, slot_idx — staged every decode step) previously went through _list_to_tensor (np.array → torch.from_numpy) and were then unwrapped again with two .numpy() calls + np.copyto. This change assigns the list directly into the pinned host numpy view (host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a single pass. The torch.Tensor staging path is unchanged.
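The direct-assignment idea can be sketched with NumPy alone. This is an illustration, not the PR's code: a plain `np.ndarray` stands in for the pinned host view (`host_view[:n].numpy()` in the PR), and `stage_list`, the buffer size, and the `int32` dtype are assumptions chosen for the example.

```python
import numpy as np


def stage_list(host_buf: np.ndarray, data: list) -> int:
    """Copy a Python list into the front of a preallocated host buffer.

    Slice assignment lets numpy convert the list and cast it to
    host_buf.dtype in one step, with no intermediate tensor object.
    """
    n = len(data)
    host_buf[:n] = data
    return n


# Hypothetical stand-in for a pinned host buffer; -1 marks untouched slots.
host_buf = np.full(8, -1, dtype=np.int32)
n = stage_list(host_buf, [0, 1, 2])
```

Only the first `n` slots are overwritten, and the buffer keeps its own dtype regardless of what the list elements are.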

Perf

Microbench (tiny c=1 decode args, 4 list args/step): 1.79× faster staging (35.3 → 19.8 µs/step). Output bit-identical to the previous path.

Test

Microbench asserts bit-identical host-buffer contents vs the previous path for both list and tensor inputs.
End-to-end WS=1 greedy decode (Nemotron-3-Nano NVFP4): deterministic, unchanged output, no regression.
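A microbench of this shape could look roughly like the sketch below. It is not the benchmark used in the PR: it is NumPy-only, so it omits the `torch.from_numpy` and `.numpy()` hops of the previous path and covers list inputs only. The argument values, buffer sizes, and function names are hypothetical, and the timings it prints will not match the numbers reported above.

```python
import timeit

import numpy as np

# Four small list args per step, loosely mirroring a c=1 decode step.
# The values are made up for illustration.
args = [[1], [0, 1], [17], [3]]
bufs = [np.zeros(16, dtype=np.int32) for _ in args]


def step_roundtrip():
    # Approximation of the previous path: temporary array, then copy.
    for buf, data in zip(bufs, args):
        tmp = np.array(data, dtype=buf.dtype)
        np.copyto(buf[: len(data)], tmp)


def step_direct():
    # Fast path: assign the list straight into the buffer slice.
    for buf, data in zip(bufs, args):
        buf[: len(data)] = data


def us_per_step(fn, number=20000):
    # Mean wall-clock time per call, in microseconds.
    return timeit.timeit(fn, number=number) / number * 1e6


# Check that both paths leave identical buffer contents before timing.
step_roundtrip()
expected = [b.copy() for b in bufs]
for b in bufs:
    b.fill(0)
step_direct()
identical = all(np.array_equal(e, b) for e, b in zip(expected, bufs))

t_old = us_per_step(step_roundtrip)
t_new = us_per_step(step_direct)
print(f"identical={identical} roundtrip={t_old:.2f} us/step direct={t_new:.2f} us/step")
```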

🤖 Generated with Claude Code

The per-arg decode staging path (InputBuffer.stage) converted every Python-list input via _list_to_tensor (np.array -> torch.from_numpy) and then unwrapped it again with two .numpy() calls + np.copyto. For the small lists staged on every decode step at low concurrency (input_ids, cu_seqlen, input_pos, slot_idx) this list -> array -> tensor -> array round-trip is pure host overhead.

Assign the list directly into the pinned host numpy view (host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a single pass. The torch.Tensor staging path is unchanged.

Microbench (tiny c=1 decode args, 4 list args/step): 1.79x faster staging (35.3 -> 19.8 us/step), output bit-identical to the previous path. Validated end-to-end with WS=1 greedy decode (deterministic, unchanged output).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

MrGeva · 2026-06-08T16:23:24Z

End-to-end measurement (honest result)

Microbench (isolated staging): 1.79× faster (35.3 → 19.8 µs/step for 4 c=1 list args), output bit-identical.

TP4 c=1 decode A/B (Nemotron-3-Nano NVFP4, multi_stream_moe:false, sustained greedy decode):

| Config | mean tok/s |
|---|---|
| baseline (`eg/ad-host-opt`) | 405.6 |
| + this opt | 405.7 |

→ No measurable end-to-end gain at TP4 c=1 (within run-to-run noise). The ~15 µs/step saved is ~0.6% of the ~2.46 ms step.
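The ~0.6% figure follows from the reported numbers. A quick check, assuming one generated token per decode step at c=1 so that step time is the reciprocal of throughput:

```python
# Reported figures from the microbench and the TP4 c=1 A/B run.
baseline_tok_per_s = 405.6
old_us_per_step = 35.3
new_us_per_step = 19.8

# Assumption: one token per decode step at c=1.
step_ms = 1000.0 / baseline_tok_per_s

saved_us = old_us_per_step - new_us_per_step
saved_fraction = (saved_us / 1000.0) / step_ms

print(f"step={step_ms:.3f} ms saved={saved_us:.1f} us share={saved_fraction:.2%}")
```

A saving of well under 1% of the step is smaller than the 0.1 tok/s difference between the two runs can resolve, which is consistent with the A/B result being within noise.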

TP4 c=1 py-spy breakdown: _forward_step ~31%, synchronize (GPU-wait/comms) ~17%, sampler ~28% (shared with PyTorch), _prepare_inputs ~19% — but the staging this PR touches is only ~4% of the loop, and the AD-specific host cost is diffuse across nest_sequences, not concentrated.

Caveat: the production config (multi_stream_moe ON) can't be measured at TP — it deadlocks under CUDA-graph + TP collective (separate root-cause; PR NVIDIA#14917 does not fix that nano_v3 case). This opt is correct/harmless and may help in the production overlap config or at higher concurrency, but isn't a measurable win in the only currently-measurable TP config. Keeping as draft pending that decision.

github-actions Bot assigned MrGeva Jun 8, 2026

MrGeva wants to merge 1 commit into eg/ad-host-opt from eg/ad-nest-seq-staging
