Describe the Bug
SglangLLMEngine (unified backend, components/src/dynamo/sglang/llm_engine.py) does not override LLMEngine.drain() — it inherits the default no-op. On a prefill worker shutdown (SIGTERM), the unified Worker runs:
1. discovery unregister
2. DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS sleep (default 5s)
3. engine.drain() (no-op for SGLang)
4. engine.cleanup() → cancels _prefill_consume_tasks, then engine.shutdown()
If a decode peer is mid-NIXL-pull on a bootstrap room when step 4 fires, SGLang tears down the engine and the bootstrap room while the transfer is in flight. The decode peer's NIXL connect then fails or returns garbage data (issue #7319).
This is at parity with the legacy python -m dynamo.sglang path — the legacy shutdown.py:install_graceful_shutdown only does discovery unregister + grace period, not transfer draining. Not a regression from the unified abstraction.
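The shutdown ordering described above can be modeled in a few lines. This is only an illustrative sketch: `BaseEngine`, `shutdown_sequence`, and the `unregister` callback are hypothetical stand-ins, not the actual Worker or LLMEngine code. It shows that with a no-op drain(), cleanup runs right after the grace period regardless of in-flight work.

```python
import asyncio


class BaseEngine:
    """Hypothetical stand-in for an engine whose drain() is the default no-op."""

    def __init__(self):
        self.calls = []

    async def drain(self):
        # Default behavior: nothing is awaited, so nothing is drained.
        self.calls.append("drain (no-op)")

    async def cleanup(self):
        self.calls.append("cleanup")


async def shutdown_sequence(engine, unregister, grace_period_secs):
    """Model of the four shutdown steps, in order."""
    await unregister()                      # 1. discovery unregister
    await asyncio.sleep(grace_period_secs)  # 2. grace period sleep
    await engine.drain()                    # 3. drain
    await engine.cleanup()                  # 4. cleanup


async def _demo():
    engine = BaseEngine()

    async def unregister():
        engine.calls.append("unregister")

    # A short grace period keeps the demo fast; the real default is 5s.
    await shutdown_sequence(engine, unregister, grace_period_secs=0.01)
    return engine.calls


calls = asyncio.run(_demo())
print(calls)
```

Nothing between steps 2 and 4 waits on transfers, so the grace period is the only protection a decode peer gets.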
Comparison: TRT-LLM has this fix. TrtllmLLMEngine.drain() polls engine.llm.get_stats_async() until idle. SGLang likely needs an analogous poll on tokenizer_manager's scheduler state (e.g. get_internal_state() or similar) — needs investigation.
Steps to Reproduce
- Launch unified SGLang disagg with prefill + decode on separate GPUs.
- Send a steady stream of requests so the prefill worker has in-flight NIXL transfers when shutdown fires.
- SIGTERM the prefill worker.
- Observe that the decode peer's NIXL pull fails because the prefill side's bootstrap room was torn down mid-transfer.
Expected Behavior
SglangLLMEngine.drain() should poll SGLang's scheduler / tokenizer_manager for outstanding KV transfers and wait until they complete (with a sensible timeout). Add the override and mirror the TRT-LLM pattern.
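A minimal sketch of the kind of bounded poll loop this asks for, assuming some async probe that reports the number of outstanding KV transfers. `drain_until_idle`, `count_inflight`, and the timeout defaults are hypothetical names and values. Which SGLang call can actually supply that count is the open question noted above and is deliberately not shown.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def drain_until_idle(
    count_inflight: Callable[[], Awaitable[int]],
    timeout_secs: float = 30.0,
    poll_interval_secs: float = 0.1,
) -> bool:
    """Poll until count_inflight() reports 0 or the timeout expires.

    Returns True if the engine went idle, False if the timeout was hit.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if await count_inflight() == 0:
            return True
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        # Never sleep past the deadline.
        await asyncio.sleep(min(poll_interval_secs, remaining))


async def _demo():
    pending = [3]

    async def fake_probe():
        # Pretend one transfer finishes per poll.
        pending[0] = max(0, pending[0] - 1)
        return pending[0]

    async def stuck_probe():
        # Pretend a transfer never completes.
        return 1

    drained = await drain_until_idle(fake_probe, 1.0, 0.01)
    timed_out = await drain_until_idle(stuck_probe, 0.05, 0.01)
    return drained, timed_out


drained, timed_out = asyncio.run(_demo())
print(drained, timed_out)
```

Returning a boolean rather than raising lets the caller log a timeout and still proceed to cleanup, so a stuck transfer cannot block shutdown indefinitely.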
Actual Behavior
drain() is a no-op; cleanup proceeds immediately. The cancellation of _prefill_consume_tasks in cleanup() then aborts in-flight prefill streams.
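The effect of cancelling tasks without draining first can be shown with plain asyncio. This is an illustrative sketch only: `fake_transfer` stands in for an in-flight prefill stream and is not Dynamo or SGLang code.

```python
import asyncio


async def fake_transfer(done, secs):
    """Hypothetical stand-in for an in-flight prefill stream."""
    await asyncio.sleep(secs)
    done.append("transfer complete")


async def cleanup_immediately():
    """Cancel right away, as happens when drain() is a no-op."""
    done = []
    task = asyncio.create_task(fake_transfer(done, 0.05))
    await asyncio.sleep(0)  # let the task start running
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return done, task.cancelled()


async def drain_then_cleanup():
    """Wait (bounded) for the task before falling back to cancellation."""
    done = []
    task = asyncio.create_task(fake_transfer(done, 0.05))
    await asyncio.wait([task], timeout=1.0)
    if not task.done():
        task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return done, task.cancelled()


aborted = asyncio.run(cleanup_immediately())
completed = asyncio.run(drain_then_cleanup())
print(aborted, completed)
```

In the first case the transfer never records completion. In the second it finishes because the wait comes before the cancel.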
Environment
- OS: Ubuntu 24.04.4 LTS (host); Ubuntu container in dynamo:latest-sglang-test
- Dynamo Runtime Version: branch tanmayv-unified-disagg
- SGLang Version: (latest in dynamo:latest-sglang-test)
- CPU Architecture: x86_64
- CUDA Version: 12.9.1
- GPU Architecture: NVIDIA RTX 5880 Ada Generation
- Python Version: 3.12.3
Related
- components/src/dynamo/trtllm/llm_engine.py:407-449 (drain() poll loop)