Commit 108982c

Author: claude-fix-bot
fix(qwen3.5_fp4_b300): use --mm-attention-backend triton_attn
Same workaround as #1422 (bf16) and #1451 (fp8) — bypass the broken flash-attn cute kernel sm_103 assertion in the Qwen-3.5-VL vision encoder by switching only the multi-modal attention path to triton_attn. Text decoder still uses --attention-backend trtllm_mha. See sgl-project/sglang#25564 (root cause: cutedsl Arch enum aliasing on non-cu13 path collapses sm_100..sm_110f range to exclude sm_103) and Dao-AILab/flash-attention#2572 for the upstream fix in flight.
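The commit hard-codes the override in both scripts. If the same script ever had to run on other GPUs, the flag could instead be gated on compute capability. This is a minimal sketch under stated assumptions: `mm_backend_args` is a hypothetical helper that is not part of this commit, and it assumes sm_103 is reported as the string `10.3`.

```shell
# Hypothetical helper (not in this commit): print the multi-modal attention
# override only when the compute capability passed as $1 is 10.3 (sm_103).
# For any other value it prints nothing, leaving the default backend in place.
mm_backend_args() {
    case "$1" in
        10.3) printf '%s\n' '--mm-attention-backend triton_attn' ;;
        *)    : ;;
    esac
}
```

In a launch script the result might be spliced into the command line as `$(mm_backend_args "$COMPUTE_CAP")`, where `COMPUTE_CAP` is an assumed variable that could be filled from something like `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` on drivers that support that query field.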
1 parent 2562dd5 commit 108982c

2 files changed

Lines changed: 2 additions & 2 deletions

benchmarks/single_node/qwen3.5_fp4_b300.sh

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 > $SERVER_LOG 2>&1 &

benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh

Lines changed: 1 addition & 1 deletion
@@ -73,7 +73,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
 --cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
 --mem-fraction-static $MEM_FRAC_STATIC --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
 --context-length $CONTEXT_LENGTH --disable-radix-cache \
---attention-backend trtllm_mha --moe-runner-backend flashinfer_trtllm \
+--attention-backend trtllm_mha --mm-attention-backend triton_attn --moe-runner-backend flashinfer_trtllm \
 $EXTRA_ARGS --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
 --tokenizer-worker-num 6 --stream-interval 30 \
 --speculative-algorithm EAGLE \
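
Because the same one-line change is applied to two scripts, a quick check that both ended up with the intended pair of backends may be useful. This is a sketch only: `has_backend_split` is a hypothetical helper that is not part of this commit. It checks that a given file still sets the text decoder backend and also carries the new multi-modal override.

```shell
# Hypothetical check (not in this commit): succeed only if the file at $1
# contains both the unchanged decoder backend and the new multi-modal override.
# -e keeps grep from treating the leading "--" of the pattern as an option.
has_backend_split() {
    grep -q -e '--attention-backend trtllm_mha' "$1" &&
    grep -q -e '--mm-attention-backend triton_attn' "$1"
}
```

From the repository root this could be run as `has_backend_split benchmarks/single_node/qwen3.5_fp4_b300.sh && has_backend_split benchmarks/single_node/qwen3.5_fp4_b300_mtp.sh`.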
