[None][perf] AutoDeploy: fast-path Python-list staging in nest_sequences #254
MrGeva wants to merge 1 commit into
The per-arg decode staging path (InputBuffer.stage) converted every Python-list input via _list_to_tensor (np.array -> torch.from_numpy) and then unwrapped it again with two .numpy() calls + np.copyto. For the small lists staged on every decode step at low concurrency (input_ids, cu_seqlen, input_pos, slot_idx), this list -> array -> tensor -> array round-trip is pure host overhead.

Assign the list directly into the pinned host numpy view (host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a single pass. The torch.Tensor staging path is unchanged.

Microbench (tiny c=1 decode args, 4 list args/step): 1.79x faster staging (35.3 -> 19.8 us/step), output bit-identical to the previous path. Validated end-to-end with WS=1 greedy decode (deterministic, unchanged output).

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
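The difference between the two staging paths can be sketched roughly as follows. This is a simplified illustration, not the AutoDeploy code: a plain numpy array stands in for the pinned host view (the real buffer is a pinned torch tensor accessed through .numpy()), the torch wrap/unwrap steps of the old path are omitted, and the function names stage_via_array and stage_direct are hypothetical.

```python
import numpy as np


def stage_via_array(host_view: np.ndarray, data: list) -> int:
    """Simplified old-style path: list -> temporary array -> copy into buffer."""
    n = len(data)
    tmp = np.array(data)            # extra allocation; dtype is inferred from the list
    np.copyto(host_view[:n], tmp)   # second pass: cast and copy into the buffer
    return n


def stage_direct(host_view: np.ndarray, data: list) -> int:
    """Fast path: assign the list straight into the buffer slice."""
    n = len(data)
    host_view[:n] = data            # numpy converts to the buffer dtype while writing
    return n


# Hypothetical stand-ins for two pinned int32 host buffers.
buf_old = np.zeros(8, dtype=np.int32)
buf_new = np.zeros(8, dtype=np.int32)

stage_via_array(buf_old, [3, 1, 4])
stage_direct(buf_new, [3, 1, 4])
print(np.array_equal(buf_old, buf_new))
```

Both functions leave the same values in the buffer; the fast path just skips the intermediate array.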
End-to-end measurement (honest result)

Microbench (isolated staging): 1.79× faster (35.3 → 19.8 µs/step for 4 c=1 list args), output bit-identical.

TP4 c=1 decode A/B (Nemotron-3-Nano NVFP4,

→ No measurable end-to-end gain at TP4 c=1 (within run-to-run noise). The ~15 µs/step saved is ~0.6% of the ~2.46 ms step.

TP4 c=1 py-spy breakdown:

Caveat: the production config (
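The ~0.6% figure quoted for the end-to-end measurement follows from the reported microbench and step times. A quick check, using only the numbers stated in this PR (variable names are illustrative):

```python
# Figures reported in this PR.
old_us = 35.3        # staging time per step before the change (microseconds)
new_us = 19.8        # staging time per step after the change (microseconds)
step_us = 2460.0     # approximate TP4 c=1 decode step time (~2.46 ms)

saved_us = old_us - new_us               # host time saved per step
share_pct = saved_us / step_us * 100.0   # saving as a share of one decode step

print(f"saved {saved_us:.1f} us/step = {share_pct:.2f}% of the step")
```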
Summary
Optimizes the per-arg decode input staging in AutoDeploy's InputBuffer.stage (the nest_sequences host path), which py-spy profiling identified as the dominant host hotspot at low concurrency (_prepare_inputs ~23–30% of the decode loop, almost all in per-arg staging).

Python-list inputs (input_ids, cu_seqlen, input_pos, slot_idx, all staged every decode step) previously went through _list_to_tensor (np.array → torch.from_numpy) and were then unwrapped again with two .numpy() calls + np.copyto. This change assigns the list directly into the pinned host numpy view (host_view[:n].numpy()[:] = data), letting numpy cast to the buffer dtype in a single pass. The torch.Tensor staging path is unchanged.

Perf
Microbench (tiny c=1 decode args, 4 list args/step): 1.79× faster staging (35.3 → 19.8 µs/step). Output bit-identical to the previous path.
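An isolated staging microbenchmark of this kind could look roughly like the sketch below. It is not the benchmark used for the numbers above: the list contents, buffer sizes, and function names are hypothetical, plain numpy arrays stand in for the pinned host views, and the torch wrap/unwrap of the old path is left out, so absolute timings will differ.

```python
import timeit

import numpy as np

# Hypothetical c=1 decode args: four tiny Python lists staged on every step.
STEP_ARGS = {
    "input_ids": [12345],
    "cu_seqlen": [0, 1],
    "input_pos": [77],
    "slot_idx": [0],
}

# Plain numpy buffers standing in for the pinned host views (assumption).
BUFFERS = {name: np.zeros(16, dtype=np.int32) for name in STEP_ARGS}


def stage_step_via_array() -> None:
    """Simplified old path: build a temporary array, then copy it in."""
    for name, data in STEP_ARGS.items():
        tmp = np.array(data)
        np.copyto(BUFFERS[name][: len(data)], tmp)


def stage_step_direct() -> None:
    """Fast path: assign each list directly into its buffer slice."""
    for name, data in STEP_ARGS.items():
        BUFFERS[name][: len(data)] = data


def us_per_step(fn, number: int = 20_000, repeat: int = 5) -> float:
    """Best-of-`repeat` time per call, in microseconds."""
    best = min(timeit.repeat(fn, number=number, repeat=repeat))
    return best / number * 1e6


if __name__ == "__main__":
    old = us_per_step(stage_step_via_array)
    new = us_per_step(stage_step_direct)
    print(f"via array: {old:.2f} us/step, direct: {new:.2f} us/step")
```

Taking the minimum over several repeats reduces the influence of scheduler noise, which matters when the per-step cost is only tens of microseconds.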
Test
🤖 Generated with Claude Code