[Opt] GsaOnDevice cuda bugfix & optimization #659

Merged
Infinite666 merged 9 commits into ModelEngine-Group:develop from wangwenxin0312:dev_gsa_device_pr on Jan 27, 2026
Conversation

wangwenxin0312 (Contributor) commented Jan 20, 2026

Purpose

What this PR does / why we need it?
1. GsaOnDevice bugfix & optimization
2. Update vllm-adapt-sparse.patch

Modifications

Does this PR introduce any user-facing change?

  • Decode metadata (decode_req_ids, block_table_decode, decode_seq_lens) is now constructed in build_sparse_meta
  • On external prefix-cache hits, prefix_slot_mapping and prefix_block_ids are rebuilt so that the k_hash computation covers the full required prefix
  • Decode-only batches are optimized via tensor slicing
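
The decode-only slicing and prefix-mapping rebuild described above can be illustrated with a minimal sketch. All names below (`build_decode_meta`, `rebuild_prefix_mapping`, and the assumption that decode requests sit at the front of the batch) are hypothetical stand-ins for illustration, not the PR's actual helpers; plain Python lists stand in for tensors, where the same slicing would produce cheap views instead of per-row copies.

```python
def build_decode_meta(req_ids, block_table, seq_lens, query_lens):
    """Sketch: derive decode-only metadata by slicing batch tensors.

    Assumes decode requests (query_len == 1) are grouped at the front
    of the batch -- a hypothetical layout, for illustration only.
    """
    num_decode = sum(1 for q in query_lens if q == 1)
    # One slice per field instead of gathering row by row.
    decode_req_ids = req_ids[:num_decode]
    block_table_decode = block_table[:num_decode]
    decode_seq_lens = seq_lens[:num_decode]
    return decode_req_ids, block_table_decode, decode_seq_lens


def rebuild_prefix_mapping(prefix_block_ids, block_size, prefix_len):
    """Sketch: rebuild a slot mapping over the full cached prefix after
    an external prefix-cache hit, so k_hash can cover every prefix token.
    """
    # Expand each block id into its contiguous range of slot indices,
    # then truncate to the exact prefix length.
    slot_mapping = [
        block_id * block_size + offset
        for block_id in prefix_block_ids
        for offset in range(block_size)
    ]
    return slot_mapping[:prefix_len]


# Usage: 3 decode requests followed by 1 prefill request.
req_ids = [7, 8, 9, 10]
block_table = [[0, 1], [2, 3], [4, 5], [6, 7]]
seq_lens = [33, 65, 17, 5]
query_lens = [1, 1, 1, 5]
ids, bt, lens = build_decode_meta(req_ids, block_table, seq_lens, query_lens)
# ids == [7, 8, 9]; bt == [[0, 1], [2, 3], [4, 5]]; lens == [33, 65, 17]

slots = rebuild_prefix_mapping([2, 5], block_size=4, prefix_len=6)
# slots == [8, 9, 10, 11, 20, 21]
```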

Test

How was this patch tested?
export MODEL_PATH="/home/models/Qwen3-Coder-30B-A3B-Instruct-FP8"
export VLLM_HASH_ATTENTION=1
python examples/offline_inference_gsaondevice.py
(screenshot of the inference output)

Review comment threads: ucm/sparse/gsa_on_device/gsa_on_device.py (several, most outdated), examples/offline_inference_gsaondevice.py, ucm/integration/vllm/patch/0.9.2/vllm-adapt-sparse.patch

Infinite666 changed the title from "[feat] GsaOnDevice cuda bugfix & optimization" to "[Opt] GsaOnDevice cuda bugfix & optimization" on Jan 21, 2026
Infinite666 previously approved these changes on Jan 27, 2026
Infinite666 merged commit 4adcfe7 into ModelEngine-Group:develop on Jan 27, 2026
6 checks passed

4 participants