Commit 4adcfe7

[Opt] GsaOnDevice cuda bugfix & optimization (#659)

# Purpose

What this PR does / why we need it?

1. GsaOnDevice bugfix & optimization
2. Update vllm-adapt-sparse.patch

# Modifications

Does this PR introduce _any_ user-facing change?

- Decode metadata (decode_req_ids, block_table_decode, decode_seq_lens) are constructed in build_sparse_meta
- On external prefix-cache hits, prefix_slot_mapping and prefix_block_ids are rebuilt to ensure that k_hash computation covers the full required prefix
- Decode-only batches are optimized using tensor slicing

# Test

How was this patch tested?

export MODEL_PATH="/home/models/Qwen3-Coder-30B-A3B-Instruct-FP8"
export VLLM_HASH_ATTENTION=1
python examples/offline_inference_gsaondevice.py

<img width="1315" height="350" alt="image" src="https://github.com/user-attachments/assets/b42e8db7-de25-4e01-afc0-aabfbef023d7" />

---------

Co-authored-by: AooooooA-C <chenaozhu@outlook.com>
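
The decode-metadata and decode-only points in the modifications list can be illustrated with a minimal sketch. This is not the code from this commit: `build_decode_meta`, `DecodeMeta`, and the rule "a request is a decode request when it has exactly one scheduled token" are assumptions made for illustration, and plain Python lists stand in for the tensors the real `build_sparse_meta` would handle. Only the three field names come from the commit message.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DecodeMeta:
    # Field names follow the commit message; the container itself is hypothetical.
    decode_req_ids: List[str]
    block_table_decode: List[List[int]]
    decode_seq_lens: List[int]


def build_decode_meta(
    req_ids: List[str],
    num_scheduled_tokens: List[int],
    seq_lens: List[int],
    block_table: List[List[int]],
) -> DecodeMeta:
    """Hypothetical helper: collect per-request decode metadata for one batch."""
    n = len(req_ids)
    # Assumption: a decode step schedules exactly one new token per request.
    is_decode = [t == 1 for t in num_scheduled_tokens]

    if all(is_decode):
        # Decode-only batch: take whole-batch slices, no per-request gather.
        return DecodeMeta(req_ids[:n], block_table[:n], seq_lens[:n])

    # Mixed batch: gather only the rows that belong to decode requests.
    idx = [i for i, d in enumerate(is_decode) if d]
    return DecodeMeta(
        [req_ids[i] for i in idx],
        [block_table[i] for i in idx],
        [seq_lens[i] for i in idx],
    )
```

With PyTorch tensors, the basic slice in the decode-only branch would return a view rather than a copy, which is the kind of saving that "optimized using tensor slicing" suggests; whether the commit does exactly this is not stated in the message.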

1 parent 6746f8b, commit 4adcfe7

5 files changed

ucm
- integration/vllm/patch/0.11.0
  - vllm-adapt-sparse.patch
- sparse
  - base.py
  - esa
    - esa.py
  - gsa_on_device
    - configs
      - gsa_on_device_qwen3_coder_30B_A3B_config.json
    - gsa_on_device.py
