Describe the Bug
VllmLLMEngine (unified backend, components/src/dynamo/vllm/llm_engine.py) does not override LLMEngine.drain() — it inherits the default no-op. On a prefill worker shutdown (SIGTERM), the unified Worker runs:
1. discovery unregister
2. DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS sleep (default 5s)
3. engine.drain() (no-op for vLLM)
4. engine.cleanup() → engine_client.shutdown()
If a decode peer is mid-NIXL-pull when step 4 fires, vLLM's NixlConnector and the GPU memory backing the prefill worker's KV cache are torn down with the transfer still in flight, and the decode peer crashes (issue #7319).
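The ordering problem can be illustrated with a toy model. Everything in it (`FakeEngine`, `shutdown_sequence`, the `events` log) is hypothetical and only stands in for the unified Worker and engine; it is not the Dynamo code. It shows that when drain() returns immediately, cleanup runs while a transfer is still outstanding.

```python
import asyncio


class FakeEngine:
    """Hypothetical stand-in for an engine that has in-flight KV transfers."""

    def __init__(self, in_flight: int) -> None:
        self.in_flight = in_flight
        self.events: list[str] = []

    async def drain(self) -> None:
        # Mirrors the inherited default: returns without waiting.
        self.events.append("drain (no-op)")

    async def cleanup(self) -> None:
        # Record whether teardown happened with transfers outstanding.
        state = "UNSAFE" if self.in_flight else "safe"
        self.events.append(f"cleanup ({state}, in_flight={self.in_flight})")


async def shutdown_sequence(engine: FakeEngine, grace_secs: float) -> list[str]:
    """Run the four shutdown steps in order and return the event log."""
    engine.events.append("discovery unregister")
    await asyncio.sleep(grace_secs)  # grace period sleep
    engine.events.append("grace period elapsed")
    await engine.drain()
    await engine.cleanup()
    return engine.events
```

Calling `asyncio.run(shutdown_sequence(FakeEngine(in_flight=1), 0.0))` produces a log whose last entry marks the cleanup as unsafe, which is the situation described above.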
The legacy python -m dynamo.vllm path has the same gap, so this is at parity with it and not a regression introduced by the unified abstraction.
Comparison: TRT-LLM has this fix. TrtllmLLMEngine.drain() polls engine.llm.get_stats_async() until numActiveRequests + numQueuedRequests == 0 (or 30s timeout) before returning, mirroring the legacy _make_drain_callback pattern in trtllm/main.py.
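For reference, the poll-until-idle pattern described above can be sketched roughly as follows. This is not the TRT-LLM code: `drain_until_idle` and `fetch_counts` are hypothetical names, and `fetch_counts` is assumed to be some async callable that returns the (active, queued) request counts, however the engine actually reports them.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def drain_until_idle(
    fetch_counts: Callable[[], Awaitable[tuple[int, int]]],
    timeout_secs: float = 30.0,
    poll_interval_secs: float = 0.5,
) -> bool:
    """Return True once active + queued == 0, False if the timeout expires."""
    deadline = time.monotonic() + timeout_secs
    while True:
        # fetch_counts is an assumed hook, not a real engine API.
        active, queued = await fetch_counts()
        if active + queued == 0:
            return True
        if time.monotonic() >= deadline:
            return False
        await asyncio.sleep(poll_interval_secs)
```

Returning a bool instead of raising lets the caller log a timeout and still proceed to cleanup, so shutdown cannot hang indefinitely.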
Steps to Reproduce
- Launch unified vLLM disagg with prefill + decode on separate GPUs.
- Send a steady stream of requests so the prefill worker has in-flight NIXL transfers when shutdown fires.
- SIGTERM the prefill worker.
- The decode peer's NIXL pull will fail or crash because the prefill side's KV memory was freed mid-transfer.
Expected Behavior
VllmLLMEngine.drain() should poll vLLM's connector for outstanding KV transfers and wait for them to complete (with a sensible timeout, e.g. 30s) before returning. Mirror the TRT-LLM pattern.
The blocker today is that vLLM's NixlConnector may not expose a query API for outstanding transfers — would need a vLLM upstream change or a different strategy (e.g. pause new requests on prefill, drain decode peers, etc).
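One possible shape for the "different strategy" is to track in-flight work on the Dynamo side instead of querying the connector. The sketch below is purely illustrative: `TransferTracker` and its methods are hypothetical, and it assumes the worker has some signal for when a decode peer has finished pulling, which is exactly the open question here.

```python
import asyncio
from contextlib import asynccontextmanager


class TransferTracker:
    """Hypothetical Dynamo-side counter of in-flight prefill transfers."""

    def __init__(self) -> None:
        self._in_flight = 0
        self._accepting = True
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight yet

    @asynccontextmanager
    async def track(self):
        # Reject new work once draining has started ("pause new requests").
        if not self._accepting:
            raise RuntimeError("worker is draining; request rejected")
        self._in_flight += 1
        self._idle.clear()
        try:
            yield
        finally:
            # Assumption: leaving this block means the peer's pull finished.
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    async def drain(self, timeout_secs: float = 30.0) -> bool:
        """Stop accepting work, then wait for in-flight work or time out."""
        self._accepting = False
        try:
            await asyncio.wait_for(self._idle.wait(), timeout=timeout_secs)
            return True
        except asyncio.TimeoutError:
            return False
```

A drain() override could then delegate to `tracker.drain()` before cleanup runs. Whether the end of the `track()` block can be tied to the actual end of a NIXL pull depends on what vLLM exposes.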
Actual Behavior
drain() is a no-op; cleanup proceeds immediately, freeing GPU memory while transfers may still be in flight.
Environment
- OS: Ubuntu 24.04.4 LTS (host); Ubuntu container in dynamo:latest-vllm-test
- Dynamo Runtime Version: branch tanmayv-unified-disagg
- vLLM Version: v0.20.1
- CPU Architecture: x86_64
- CUDA Version: 12.9.1
- GPU Architecture: NVIDIA RTX 5880 Ada Generation
- Python Version: 3.12.3
Related
- components/src/dynamo/trtllm/llm_engine.py:407-449 (drain() poll loop)