
Commit e1d3a18

claude-fix-bot authored and functionstackx committed
fix(qwen3.5_fp8_b300): use --mm-attention-backend triton_attn
Same workaround as PR #1422 — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 + Dao-AILab/flash-attention#2572 for the upstream root cause and the in-flight fix.
1 parent 2315338 commit e1d3a18

2 files changed

Lines changed: 2 additions & 0 deletions


benchmarks/single_node/qwen3.5_fp8_b300.sh

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --kv-cache-dtype fp8_e4m3 \
 --mamba-ssm-dtype bfloat16 \
 --attention-backend trtllm_mha \
+--mm-attention-backend triton_attn \
 --moe-runner-backend flashinfer_trtllm \
 --cuda-graph-max-bs $CONC \
 --max-running-requests $CONC \

benchmarks/single_node/qwen3.5_fp8_b300_mtp.sh

Lines changed: 1 addition & 0 deletions
@@ -40,6 +40,7 @@ SGLANG_ENABLE_SPEC_V2=1 PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --mod
 --kv-cache-dtype fp8_e4m3 \
 --mamba-ssm-dtype bfloat16 \
 --attention-backend trtllm_mha \
+--mm-attention-backend triton_attn \
 --moe-runner-backend flashinfer_trtllm \
 --cuda-graph-max-bs $CONC \
 --max-running-requests $CONC \
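
The split described in the commit message (text decoder on trtllm_mha, vision encoder on triton_attn) can be sketched as a dry run that only prints the flags visible in the hunks above. This is an illustration, not the benchmark script itself: the build_args helper and the CONC default of 64 are hypothetical, and the real scripts pass further arguments to python3 -m sglang.launch_server that the hunks do not show.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the flags from the diff hunk, one token per line.
# It does not launch the server.
set -euo pipefail

# Hypothetical default for illustration; the benchmark harness sets CONC.
CONC="${CONC:-64}"

# build_args is a hypothetical helper, not part of the benchmark scripts.
build_args() {
  local args=(
    --kv-cache-dtype fp8_e4m3
    --mamba-ssm-dtype bfloat16
    --attention-backend trtllm_mha        # text decoder path, unchanged
    --mm-attention-backend triton_attn    # vision encoder path, the workaround
    --moe-runner-backend flashinfer_trtllm
    --cuda-graph-max-bs "$CONC"
    --max-running-requests "$CONC"
  )
  printf '%s\n' "${args[@]}"
}

build_args
```

Keeping the two backend flags on adjacent lines makes it easy to revert the workaround later: once the upstream fix referenced in the commit message lands, only the --mm-attention-backend line needs to go.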
