Commit 23f04b8

Enable Rust frontend (VLLM_USE_RUST_FRONTEND=1) (#1634)
* Enable Rust frontend (VLLM_USE_RUST_FRONTEND=1)

The Rust frontend does not change the kernels, attention, MoE GEMM, or KV cache, so throughput and TPOT are unaffected. It does improve TTFT, because it reduces the frontend CPU time spent between receiving a request and generating the first token.

* Update perf-changelog

---------

Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
1 parent 21aa356 commit 23f04b8

2 files changed

Lines changed: 9 additions & 0 deletions

benchmarks/single_node/fixed_seq_len/minimaxm2.5_fp4_mi355x.sh

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
 fi
 
 export VLLM_ROCM_USE_AITER=1
+export VLLM_USE_RUST_FRONTEND=1
 EXTRA_VLLM_ARGS=""
 # if [ "$TP" -ge 4 ]; then
 # # AITER CK fused MoE kernels lack compiled tiles for N=intermediate_size/TP

perf-changelog.yaml

Lines changed: 8 additions & 0 deletions
@@ -3523,3 +3523,11 @@
     - "Aligned decode params with Weiliang config: swa-full-tokens-ratio=0.20, max-running-requests=18432, moe-dense-tp-size=1; added prefill enable-dp-lm-head and cuda-graph-max-bs=512"
     - "Remove 4 dominated old configs (4p-dep16-8n, 8p-dep16-12n, 10p-dep16-14n, 12p-dep12-15n) superseded by wide-EP frontier"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1586
+
+- config-keys:
+    - minimaxm2.5-fp4-mi355x-vllm
+  description:
+    - "Enable vLLM Rust request frontend by exporting VLLM_USE_RUST_FRONTEND=1 in benchmarks/single_node/minimaxm2.5_fp4_mi355x.sh (v0.22.0 ROCm image ships the vllm-rs binary, so the flag engages it). Environment-only change; serve flags, TP/EP, attention/kernel settings unchanged"
+    - "The Rust frontend replaces only the Python serving/API layer (HTTP, tokenization, scheduling glue, detokenization) and spawns the same Python EngineCore, so GPU kernels/attention/MoE GEMM/KV cache are untouched"
+    - "A/B sweep (28 single-node points, 1k1k + 8k1k, TP 1/2/4) vs the Python-frontend baseline (run 26696260751): throughput Pareto-neutral (peak tok/s/GPU within <1.5%, frontiers coincident) and TPOT flat (+-0.5%); TTFT improves ~8% at 1k1k and ~22% at 8k1k (every point), the expected signature of lower frontend CPU latency before first token, scaling with input length"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1634
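
The A/B comparison described in the changelog entry (Rust frontend vs. the Python-frontend baseline) could be reproduced by toggling the variable per run instead of exporting it unconditionally. The following is a minimal sketch only. It assumes that leaving VLLM_USE_RUST_FRONTEND unset selects the Python frontend, and `set_frontend` is a hypothetical helper that is not part of the benchmark script in this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): choose the vLLM request
# frontend for one benchmark run by setting or clearing the env variable.
set_frontend() {
    case "${1:-rust}" in
        rust)
            # Same export this commit adds to the benchmark script.
            export VLLM_USE_RUST_FRONTEND=1
            ;;
        python)
            # Assumption: an unset variable means the Python frontend is used.
            unset VLLM_USE_RUST_FRONTEND
            ;;
        *)
            echo "unknown frontend: $1" >&2
            return 1
            ;;
    esac
}
```

Calling `set_frontend python` before the baseline run and `set_frontend rust` before the comparison run keeps everything else in the script (serve flags, TP/EP, kernel settings) identical between the two arms, which matches the environment-only nature of the change.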
