vLLM unified backend: prefill worker has no drain() — NIXL KV transfer can be torn down mid-flight #9344

@tanmayv25

Describe the Bug

VllmLLMEngine (unified backend, components/src/dynamo/vllm/llm_engine.py) does not override LLMEngine.drain() — it inherits the default no-op. On a prefill worker shutdown (SIGTERM), the unified Worker runs:

  1. discovery unregister
  2. DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS sleep (default 5s)
  3. engine.drain() → no-op for vLLM
  4. engine.cleanup() → engine_client.shutdown()

If a decode peer is mid-NIXL-pull when step 4 fires, vLLM's NixlConnector and the GPU memory backing the prefill worker's KV cache are torn down with the transfer in flight. The decode peer then crashes (issue #7319).
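
The ordering problem can be reproduced without GPUs. The following is a minimal asyncio simulation of the four shutdown steps, not Dynamo code: `FakeEngine`, `serve_transfer`, and `shutdown` are hypothetical names invented for illustration, and the sleep durations are arbitrary. It only shows that a no-op `drain()` lets `cleanup()` run while a transfer is still counted as in flight.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine; not Dynamo's or vLLM's real API."""

    def __init__(self) -> None:
        self.inflight_transfers = 0
        self.torn_down_with_inflight = False

    async def serve_transfer(self, seconds: float) -> None:
        # Models a decode peer pulling KV blocks from this worker.
        self.inflight_transfers += 1
        try:
            await asyncio.sleep(seconds)
        finally:
            self.inflight_transfers -= 1

    async def drain(self) -> None:
        # Mirrors the inherited default: returns immediately.
        return None

    async def cleanup(self) -> None:
        # Record whether memory would be freed under an active transfer.
        self.torn_down_with_inflight = self.inflight_transfers > 0


async def shutdown(engine: FakeEngine, grace_period_secs: float) -> None:
    # Step 1 (discovery unregister) is omitted in this sketch.
    await asyncio.sleep(grace_period_secs)  # step 2: grace period
    await engine.drain()                    # step 3: no-op
    await engine.cleanup()                  # step 4: teardown


async def simulate() -> bool:
    engine = FakeEngine()
    # The transfer outlives the grace period, as a slow pull might.
    transfer = asyncio.create_task(engine.serve_transfer(0.2))
    await asyncio.sleep(0)  # let the transfer task start
    await shutdown(engine, grace_period_secs=0.05)
    await transfer
    return engine.torn_down_with_inflight


if __name__ == "__main__":
    print("cleanup ran with a transfer in flight:", asyncio.run(simulate()))
```

In this model, any transfer that outlasts the grace period is still active at teardown, which is the window the issue describes.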

The legacy python -m dynamo.vllm path has the same gap, so this is at parity with it and is not a regression introduced by the unified abstraction.

Comparison: TRT-LLM has this fix. TrtllmLLMEngine.drain() polls engine.llm.get_stats_async() until numActiveRequests + numQueuedRequests == 0 (or 30s timeout) before returning, mirroring the legacy _make_drain_callback pattern in trtllm/main.py.
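
For reference, the polling pattern described above can be sketched roughly as follows. This is not the TrtllmLLMEngine code: `drain_until_idle` and the `fetch_stats` callable are hypothetical, and the sketch assumes the stats arrive as a mapping carrying the two counters named above. How the real engine obtains and parses its stats is not shown here.

```python
import asyncio
import time
from typing import Awaitable, Callable, Mapping


async def drain_until_idle(
    fetch_stats: Callable[[], Awaitable[Mapping[str, int]]],
    timeout_secs: float = 30.0,
    poll_interval_secs: float = 0.5,
) -> bool:
    """Return True once no requests are active or queued, False on timeout."""
    deadline = time.monotonic() + timeout_secs
    while True:
        stats = await fetch_stats()
        pending = stats.get("numActiveRequests", 0) + stats.get("numQueuedRequests", 0)
        if pending == 0:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Give up so shutdown cannot hang forever on a stuck request.
            return False
        await asyncio.sleep(min(poll_interval_secs, remaining))
```

Returning a boolean rather than raising lets the caller log a timeout and still proceed to cleanup, which keeps shutdown bounded.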

Steps to Reproduce

  1. Launch unified vLLM disagg with prefill + decode on separate GPUs.
  2. Send a steady stream of requests so the prefill worker has in-flight NIXL transfers when shutdown fires.
  3. SIGTERM the prefill worker.
  4. The decode peer's NIXL pull will fail or crash because the prefill side's KV memory was freed mid-transfer.

Expected Behavior

VllmLLMEngine.drain() should poll vLLM's connector for outstanding KV transfers and wait for them to complete (with a sensible timeout, e.g. 30s) before returning. Mirror the TRT-LLM pattern.

The blocker today is that vLLM's NixlConnector may not expose a query API for outstanding transfers. Fixing this would need either a vLLM upstream change or a different strategy (e.g. pause new requests on prefill, drain decode peers, etc.).
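
One shape the "different strategy" could take is worker-side bookkeeping instead of querying the connector. The sketch below is purely illustrative: `TransferTracker` and its methods are hypothetical, and it assumes the worker can learn when a decode peer has finished its pull, a signal that may not exist today. It stops admitting new requests once draining starts, then waits on an event rather than polling.

```python
import asyncio


class TransferTracker:
    """Hypothetical record of KV transfers a prefill worker still owes."""

    def __init__(self) -> None:
        self._outstanding: set[str] = set()
        self._idle = asyncio.Event()
        self._idle.set()  # nothing outstanding at start
        self.accepting = True

    def start(self, request_id: str) -> bool:
        # Refuse new work once draining has begun.
        if not self.accepting:
            return False
        self._outstanding.add(request_id)
        self._idle.clear()
        return True

    def finish(self, request_id: str) -> None:
        # ASSUMPTION: called when the decode peer's pull has completed.
        self._outstanding.discard(request_id)
        if not self._outstanding:
            self._idle.set()

    async def drain(self, timeout_secs: float = 30.0) -> bool:
        """Stop admitting requests and wait for outstanding transfers.

        Returns True if everything finished, False if the timeout expired.
        """
        self.accepting = False
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=timeout_secs)
            return True
        except asyncio.TimeoutError:
            return False
```

A `drain()` override could delegate to such a tracker, but only if the completion signal assumed in `finish()` can actually be obtained from the connector or from the decode side.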

Actual Behavior

drain() is a no-op; cleanup proceeds immediately, freeing GPU memory while transfers may still be in flight.

Environment

  • OS: Ubuntu 24.04.4 LTS (host); Ubuntu container in dynamo:latest-vllm-test
  • Dynamo Runtime Version: branch tanmayv-unified-disagg
  • vLLM Version: v0.20.1
  • CPU Architecture: x86_64
  • CUDA Version: 12.9.1
  • GPU Architecture: NVIDIA RTX 5880 Ada Generation
  • Python Version: 3.12.3

Labels

  • backend::vllm (Relates to the vllm backend)
  • bug (Something isn't working)
  • fault tolerance
  • language::python (Issues/PRs that reference Python code)
  • nixl (Relates to NIXL)
