Commit 4f63034

functionstackx, claude, and claude-fix-bot authored
[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 (#1475)
* Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

  Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn

  Same workaround as #1422 (bf16) and #1451 (fp8): bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha.

  See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

* Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
1 parent 8d76685 commit 4f63034

4 files changed

Lines changed: 11 additions & 4 deletions


.github/configs/nvidia-master.yaml

Lines changed: 2 additions & 2 deletions
@@ -2454,7 +2454,7 @@ qwen3.5-fp8-b300-sglang:
     - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }

 qwen3.5-fp4-b300-sglang:
-  image: lmsysorg/sglang:v0.5.11-cu130
+  image: lmsysorg/sglang:v0.5.12-cu130
   model: nvidia/Qwen3.5-397B-A17B-NVFP4
   model-prefix: qwen3.5
   runner: b300
@@ -2475,7 +2475,7 @@ qwen3.5-fp4-b300-sglang:
     - { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }

 qwen3.5-fp4-b300-sglang-mtp:
-  image: lmsysorg/sglang:v0.5.11-cu130
+  image: lmsysorg/sglang:v0.5.12-cu130
   model: nvidia/Qwen3.5-397B-A17B-NVFP4
   model-prefix: qwen3.5
   runner: b300

benchmarks/single_node/qwen3.5_fp4_b300.sh

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 > $SERVER_LOG 2>&1 &

benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 \
 --speculative-algorithm EAGLE \
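
Both B300 scripts apply `--mm-attention-backend triton_attn` unconditionally. A more narrowly scoped variant, sketched here purely for illustration, would add the flag only when the GPU reports compute capability 10.3 (sm_103). The helper name `mm_attention_args` is hypothetical and not part of the repository, and the sketch assumes a driver whose `nvidia-smi` supports the `compute_cap` query field.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): emit the multi-modal attention
# override only for sm_103, the architecture hit by the flash-attn assertion.
mm_attention_args() {
    local compute_cap="$1"   # e.g. "10.3", as printed by nvidia-smi
    if [ "$compute_cap" = "10.3" ]; then
        echo "--mm-attention-backend triton_attn"
    fi
}

# Assumption: nvidia-smi is installed and supports the compute_cap query field.
if command -v nvidia-smi >/dev/null 2>&1; then
    cap="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
    echo "extra launch args: $(mm_attention_args "$cap")"
fi
```

The output of `mm_attention_args` could be appended to `$EXTRA_ARGS` before the server launch. Applying the flag unconditionally, as the commit does, is simpler when the script only ever runs on B300 runners.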

perf-changelog.yaml

Lines changed: 7 additions & 0 deletions
@@ -3036,3 +3036,10 @@
     - "TP=4 shows +3.2% to +16.3% throughput improvement across 1k1k and 8k1k workloads (concurrency 4-256)"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1411

+
+- config-keys:
+    - qwen3.5-fp4-b300-sglang
+    - qwen3.5-fp4-b300-sglang-mtp
+  description:
+    - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475
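
The changelog entry names two config keys that should also exist in `.github/configs/nvidia-master.yaml`. A quick consistency check might look like the sketch below. `check_config_keys` is a hypothetical helper, not something the repository provides, and it assumes each config key sits on its own unindented line followed by a colon, as in the `nvidia-master.yaml` hunk of this commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify each key appears as a top-level key in a YAML file.
# Usage: check_config_keys <yaml-file> <key>...
check_config_keys() {
    local file="$1"; shift
    local key missing=0
    for key in "$@"; do
        # -F matches literally (keys contain dots); -x requires a whole-line match.
        if ! grep -qxF -- "${key}:" "$file"; then
            echo "missing config key: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this commit the call would be `check_config_keys .github/configs/nvidia-master.yaml qwen3.5-fp4-b300-sglang qwen3.5-fp4-b300-sglang-mtp`. It returns non-zero and prints each missing key to stderr if any key is absent.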
