Skip to content

[opt]Defer UCM KV Dump Waiting Until Request Completion#989

Merged
qyh111 merged 12 commits into
ModelEngine-Group:developfrom
qyh111:dev_qyh
Jun 15, 2026
Merged

[opt]Defer UCM KV Dump Waiting Until Request Completion#989
qyh111 merged 12 commits into
ModelEngine-Group:developfrom
qyh111:dev_qyh

Conversation

@qyh111

@qyh111 qyh111 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Purpose

Move UCM KV dump completion waiting out of wait_for_save() so model execution is no longer blocked by async dump tasks, while still preserving delayed block release correctness for finished requests.

This also fixes the non-HMA MultiConnector path by forwarding request_finished() from the outer UCMConnector to the actual inner connector, ensuring requests that UCM saves asynchronously are properly marked for delayed free before they are returned from get_finished().

Modifications

  • Added async dump task tracking for UCM Direct and LayerWise connectors.
  • Changed wait_for_save() to submit dump tasks without blocking on task completion.
  • Moved dump completion waiting to get_finished() for requests that have finished and require delayed block release.
  • Added preemption handling to wait only for pending dump tasks related to preempted requests.
  • Added per-task event handle release after dump task completion or submission failure.
  • Removed save-duration metrics from the delayed wait_for_save() path because the timing is no longer accurate after deferring completion.
  • Updated LayerWise save flow to use the same delayed dump completion mechanism.
  • Added UCMConnector.request_finished() forwarding for the non-HMA path so MultiConnector can correctly count UCM async saves.
  • Rebased the branch on latest origin/develop and resolved conflicts with the new layerwise / pipeline store metrics changes.

Test

Tested on 8*NPU on A3 for Minimax2.7
direct connector
16K concurrancy 8 TTFT decrease(3.6605->3.2271)
64K concurrancy 4 TTFT decrease(13.0390->11.4577)
layerwise connector
slight improve almost the same

@qyh111 qyh111 changed the title Dev qyh delay wait for save Jun 1, 2026
@qyh111 qyh111 changed the title delay wait for save [opt]Defer UCM KV Dump Waiting Until Request Completion Jun 1, 2026
Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py
@dante159753

dante159753 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

I found one issue in the CP path: UCMCPConnector.build_connector_meta() does not mirror UCMDirectConnector.build_connector_meta() by adding requests with dump_block_ids to the scheduler-side _async_dump_req_ids set. As a result, request_finished() returns False in the CP path, so vLLM may free the blocks immediately while the worker-side dump task has been deferred until get_finished(). That leaves a window where the blocks can be reused or overwritten before the dump wait completes. Please mirror the DirectConnector tracking logic in this override before returning the metadata.

@qyh111

qyh111 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

I found one issue in the CP path: UCMCPConnector.build_connector_meta() does not mirror UCMDirectConnector.build_connector_meta() by adding requests with dump_block_ids to the scheduler-side _async_dump_req_ids set. As a result, request_finished() returns False in the CP path, so vLLM may free the blocks immediately while the worker-side dump task has been deferred until get_finished(). That leaves a window where the blocks can be reused or overwritten before the dump wait completes. Please mirror the DirectConnector tracking logic in this override before returning the metadata.

I think this does not apply to the CP path. UCMCPConnector does not override build_connector_meta(); its MRO is UCMCPConnector -> UCMLayerWiseConnector -> UCMDirectConnector, and UCMLayerWiseConnector does not override build_connector_meta either. Therefore CP uses UCMDirectConnector.build_connector_meta(), which already adds dump requests to scheduler-side _async_dump_req_ids before returning metadata.

Comment thread ucm/integration/vllm/ucm_connector.py
@qyh111 qyh111 force-pushed the dev_qyh branch 2 times, most recently from 4507ee9 to a1cd30f Compare June 6, 2026 10:24
Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py

@ygwpz ygwpz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up Review (2026-06-13)

I've reviewed the new commits pushed today. The previous concerns from 2026-06-07 have been addressed:

Event handle release (line 930): The finally block in _wait_pending_dump_task ensures cleanup on all paths.

State accumulation (line 958): _async_dump_req_ids is properly cleared in request_finished() (discard) and get_finished() (difference_update).

Type safety (line 939): The isinstance check before getattr properly handles both KVConnectorMetadata | set[str] types.

However, I found 2 NEW issues that need attention:

See inline comments below for details on:

  1. Resource leak when store.check() fails in _poll_pending_dump_tasks
  2. Partial task handling logic in get_finished

Overall, the deferred wait mechanism is well-designed, but these edge cases need handling to prevent resource leaks and ensure correctness.

Comment thread ucm/integration/vllm/ucm_connector.py
Comment thread ucm/integration/vllm/ucm_connector.py
@qyh111 qyh111 merged commit 1bc1bc1 into ModelEngine-Group:develop Jun 15, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants