[Klaud Cold] Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130 (#1475)

functionstackx · claude · claude-fix-bot · web-flow · commit 4f630348d98c · 2026-05-20T03:00:35.000-04:00
* Update qwen3.5-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130

  Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn

  Same workaround as #1422 (bf16) and #1451 (fp8): bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha.

  See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.

* Re-trigger sweep (previous Run Sweep run stuck pending with 0 jobs)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: claude-fix-bot <claude-fix-bot@local>
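
Illustrative sketch of the backend split described above (not part of the benchmark scripts). It only assembles and prints the three backend flags taken from this change, so the separation between the text-decoder and vision-encoder attention paths is easy to see. The variable names TEXT_ATTENTION_BACKEND, MM_ATTENTION_BACKEND, MOE_RUNNER_BACKEND and BACKEND_ARGS are hypothetical; the real scripts pass the flags inline to sglang.launch_server.

```shell
#!/bin/sh
# Sketch only: prints the backend flags; it does not launch a server.
set -eu

# Hypothetical variable names, introduced here for illustration.
TEXT_ATTENTION_BACKEND="trtllm_mha"      # text decoder attention (unchanged)
MM_ATTENTION_BACKEND="triton_attn"       # vision encoder attention (the workaround)
MOE_RUNNER_BACKEND="flashinfer_trtllm"   # MoE runner (unchanged)

# Only the multi-modal path is switched; the other two flags keep their
# previous values from the launch command.
BACKEND_ARGS="--attention-backend $TEXT_ATTENTION_BACKEND --mm-attention-backend $MM_ATTENTION_BACKEND --moe-runner-backend $MOE_RUNNER_BACKEND"

echo "$BACKEND_ARGS"
```

Keeping the multi-modal backend in its own variable would presumably make the workaround a one-line revert once the upstream fix lands, but that is a design assumption, not something this change does.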
diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml
@@ -2454,7 +2454,7 @@ qwen3.5-fp8-b300-sglang:
       - { tp: 4, ep: 1, conc-start: 4, conc-end: 256 }
 
 qwen3.5-fp4-b300-sglang:
-  image: lmsysorg/sglang:v0.5.11-cu130
+  image: lmsysorg/sglang:v0.5.12-cu130
   model: nvidia/Qwen3.5-397B-A17B-NVFP4
   model-prefix: qwen3.5
   runner: b300
@@ -2475,7 +2475,7 @@ qwen3.5-fp4-b300-sglang:
       - { tp: 2, ep: 2, conc-start: 4, conc-end: 128 }
 
 qwen3.5-fp4-b300-sglang-mtp:
-  image: lmsysorg/sglang:v0.5.11-cu130
+  image: lmsysorg/sglang:v0.5.12-cu130
   model: nvidia/Qwen3.5-397B-A17B-NVFP4
   model-prefix: qwen3.5
   runner: b300
diff --git a/benchmarks/single_node/qwen3.5_fp4_b300.sh b/benchmarks/single_node/qwen3.5_fp4_b300.sh
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 > $SERVER_LOG 2>&1 &
 
diff --git a/benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh b/benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 \
 --speculative-algorithm EAGLE \
diff --git a/perf-changelog.yaml b/perf-changelog.yaml
@@ -3036,3 +3036,10 @@
     - "TP=4 shows +3.2% to +16.3% throughput improvement across 1k1k and 8k1k workloads (concurrency 4-256)"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1411
   
+
+- config-keys:
+    - qwen3.5-fp4-b300-sglang
+    - qwen3.5-fp4-b300-sglang-mtp
+  description:
+    - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475