Commit c3b0d63
Decentralized disaggregated deployment architecture (#947)
## Summary
Integrated Mooncake's disaggregated deployment mode into the runner to
provide LightX2V with full **three-stage** disaggregated inference
capability. The inference pipeline can be split into **Encoder**,
**Transformer**, and **Decoder** nodes, where the **VAE Decoder** is
deployed independently on the Decoder node. Both the **Wan** and
**Qwen** model families are supported.
On top of the three-stage foundation, this PR further introduces
**decentralized queue scheduling**: a Controller process hosts RDMA
metadata ring buffers, Transformer and Decoder run as pull-based
workers, and the client only needs **a single HTTP POST to the
Encoder** instead of three sequential requests. Multiple Transformer
workers can be deployed across GPUs for parallel DiT execution.
## Feature Highlights
1. Disaggregated deployment is integrated with the Mooncake engine,
enabling efficient RDMA-based data transfer. Inference I/O can approach
the **theoretical maximum bandwidth** of the GPUs.
2. The **text encoder** component is integrated with **LightLLM**
optimizations. It supports both **kernel-level** and **service-level**
optimizations, for an additional **~30% performance improvement**.
3. Unlike Mooncake's standalone disagg submission, this integration is
implemented within the **local runner**. It currently supports both the
**Wan runner** and the **Qwen runner**.
4. In Mooncake's original disagg approach, each stage runs as a separate
thread within a single process, which couples producers and consumers
tightly and is a poor fit for high-concurrency scenarios. We decouple the
stages into **independent processes**, allowing the three stages
(**encoder + transformer + decoder**) to be deployed on different
machines and different GPUs. Under high concurrency, this improves
throughput.
5. **Decentralized queue scheduling** with RDMA ring buffers
(`RDMABuffer`): a Controller hosts request / phase1 / phase2 metadata
rings; Encoder publishes dispatch metadata after inference; Transformer
and Decoder workers pull tasks from the rings automatically. The client
sends **one HTTP request** to the Encoder instead of three sequential
POSTs.
6. **Multi-Transformer worker parallelism**: multiple Transformer
workers (each with a unique `receiver_engine_rank`) can run on different
GPUs. Requests specify `disagg_phase1_receiver_engine_rank` to target a
specific worker, enabling round-robin or explicit routing.
7. **True RDMA atomics**: `rdma_faa` is upgraded from a read-modify-write
shim to a real `IBV_WR_ATOMIC_FETCH_AND_ADD`, and a new `rdma_cas`
(`IBV_WR_ATOMIC_CMP_AND_SWP`) is added. Both RDMAServer and RDMAClient
now register `REMOTE_ATOMIC` access flags.
8. **Queue metrics & monitoring**: each service (Encoder / Transformer /
Decoder) reports queue depth (`queue_sizes`, `queue_total_pending`,
`all_queues_empty`) via the Reporter's `set_extra_metrics_provider()`
hook, providing real-time pipeline backlog visibility.
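
As a rough illustration of the queue metrics in item 8, the sketch below builds a dictionary with the three reported keys. It assumes the provider is a zero-argument callable returning a plain dict; the actual signature expected by `set_extra_metrics_provider()` is not shown in this PR description, and `make_queue_metrics_provider` and the queue names are hypothetical.

```python
from typing import Callable, Dict, Mapping


def make_queue_metrics_provider(
    get_sizes: Callable[[], Mapping[str, int]],
) -> Callable[[], Dict[str, object]]:
    """Wrap a queue-size getter into a zero-argument metrics callable.

    Hypothetical helper: the real hook's expected signature is an assumption.
    """

    def provider() -> Dict[str, object]:
        sizes = dict(get_sizes())      # snapshot of per-queue depth
        total = sum(sizes.values())    # backlog across all queues
        return {
            "queue_sizes": sizes,
            "queue_total_pending": total,
            "all_queues_empty": total == 0,
        }

    return provider


# Example with made-up queue names and depths.
provider = make_queue_metrics_provider(lambda: {"phase1": 3, "phase2": 0})
metrics = provider()
```

A callable like `provider` could then be handed to the Reporter hook, provided the hook accepts that shape.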
## Disaggregated Architecture (Three-Stage Pipeline)
Based on the `disagg_mode` configuration, the inference pipeline is
physically split into three independent services. Data flows through
**Phase1 (Encoder → Transformer)** and **Phase2 (Transformer →
Decoder)**, requiring **two Mooncake transfers**.
### Encoder Role (`disagg_mode="encoder"`)
- Loads only:
  - Text Encoder
  - Image Encoder (for **I2V / I2I**)
  - VAE Encoder
- Skips:
  - DiT
  - VAE Decoder (handled by the Decoder node in the three-stage setup)
After startup, it performs feature extraction and sends tensors through
Mooncake Phase1 to the Transformer node, including:
- `context`
- `clip_encoder_out`
- `vae_encoder_out`
- `latent_shape`
- (other required intermediate tensors)
### Transformer Role (`disagg_mode="transformer"`)
- Loads only:
  - DiT
- Skips:
  - Encoder
  - VAE Decoder (VAE decoding is handled by the Decoder node)
After startup, it waits for Phase1 data. Upon receiving it, it performs:
- Hash verification
- Input assembly
- Denoising
If `decoder_engine_rank` is configured, it sends the **denoised
latents** to the Decoder node via Mooncake Phase2 and **does not**
perform local VAE decoding.
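
The hash verification step can be pictured as follows. This is a minimal sketch, assuming the sender attaches a SHA-256 digest computed over the serialized tensors; the PR description does not say which hash or serialization is actually used, and `digest_payload` / `verify_payload` are hypothetical names. Raw `bytes` stand in for tensor buffers.

```python
import hashlib
from typing import Dict


def digest_payload(tensors: Dict[str, bytes]) -> str:
    """Hash names and contents in a fixed order so both sides agree."""
    h = hashlib.sha256()
    for name in sorted(tensors):
        data = tensors[name]
        h.update(name.encode("utf-8"))
        h.update(len(data).to_bytes(8, "little"))  # length prefix avoids ambiguity
        h.update(data)
    return h.hexdigest()


def verify_payload(tensors: Dict[str, bytes], expected_hex: str) -> bool:
    """Receiver side: recompute the digest and compare with the sender's."""
    return digest_payload(tensors) == expected_hex


# Sender (Encoder) side, with placeholder buffers.
sent = {"context": b"\x01\x02\x03", "vae_encoder_out": b"\x04\x05"}
tag = digest_payload(sent)

# Receiver (Transformer) side: an intact payload passes, a corrupted one fails.
ok = verify_payload(sent, tag)
corrupted = dict(sent, context=b"\x01\x02\xff")
bad = verify_payload(corrupted, tag)
```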
### Decoder Role (`disagg_mode="decode"`)
- Loads only:
  - VAE Decoder
- Skips:
  - Text/Image Encoder
  - DiT
After startup, it enters a Phase2 receive-and-wait state. When it
receives the latents from the Transformer, it performs:
- VAE decoding
- Saving output videos/images
Both task completion status and result files are stored on the Decoder
node.
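
The role split described above can be summarized as a small lookup. The sketch below is illustrative only: the component names, the task labels, and `components_to_load` are hypothetical, while the `disagg_mode` values are the ones quoted in the section headings.

```python
from typing import Set

# Hypothetical task labels; the description only names I2V / I2I.
IMAGE_CONDITIONED_TASKS = {"i2v", "i2i"}


def components_to_load(disagg_mode: str, task: str = "t2v") -> Set[str]:
    """Return the model components a node would load for its role."""
    if disagg_mode == "encoder":
        loaded = {"text_encoder", "vae_encoder"}
        if task in IMAGE_CONDITIONED_TASKS:
            loaded.add("image_encoder")  # only needed for image-conditioned tasks
        return loaded
    if disagg_mode == "transformer":
        return {"dit"}
    if disagg_mode == "decode":
        return {"vae_decoder"}
    raise ValueError(f"unknown disagg_mode: {disagg_mode!r}")


encoder_parts = components_to_load("encoder", task="i2v")
transformer_parts = components_to_load("transformer")
decoder_parts = components_to_load("decode")
```

No component appears in more than one role, which is what lets each node keep only its own weights in GPU memory.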
## Decentralized Queue Scheduling
### Architecture
```
┌──────────┐ HTTP POST ┌──────────┐ Phase1 RDMA ┌─────────────┐ Phase2 RDMA ┌──────────┐
│ Client │ ──────────→ │ Encoder │ ──────────→ │ Transformer │ ──────────→ │ Decoder │
└──────────┘ │ (GPU 0) │ │ (GPU 1/2/3) │ │ (GPU 0) │
└──────────┘ └─────────────┘ └──────────┘
↑ ↑ ↑
lightx2v.server pull worker ×N pull worker
HTTP port 8002 (qwen_t2i_queue_workers) (qwen_t2i_queue_workers)
│
┌──────────┐
│Controller│ ← RDMA metadata ring buffers (always-on)
└──────────┘
```
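
The pull-based flow in the diagram can be imitated in a single process with threads and in-memory queues. This is a toy model under heavy assumptions: `queue.Queue` stands in for the RDMA metadata rings, strings stand in for tensors, and all Transformer workers share one queue instead of matching slots by rank. `run_pipeline` is a hypothetical name.

```python
import queue
import threading
from typing import List

STOP = None  # sentinel telling a worker to exit


def run_pipeline(prompts: List[str], num_transformers: int = 3) -> List[str]:
    phase1: queue.Queue = queue.Queue()  # Encoder -> Transformer
    phase2: queue.Queue = queue.Queue()  # Transformer -> Decoder
    results: List[str] = []

    def transformer_worker(rank: int) -> None:
        while True:
            item = phase1.get()          # pull: block until work is available
            if item is STOP:
                break
            phase2.put(f"latent({item})@T{rank}")

    def decoder_worker() -> None:
        while True:
            item = phase2.get()
            if item is STOP:
                break
            results.append(f"image[{item}]")

    transformers = [
        threading.Thread(target=transformer_worker, args=(rank,))
        for rank in range(num_transformers)
    ]
    decoder = threading.Thread(target=decoder_worker)
    for t in transformers:
        t.start()
    decoder.start()

    # "Encoder": publish one work item per prompt, then one STOP per worker.
    for prompt in prompts:
        phase1.put(f"features({prompt})")
    for _ in transformers:
        phase1.put(STOP)
    for t in transformers:
        t.join()

    # All latents are queued by now, so the Decoder can be told to finish.
    phase2.put(STOP)
    decoder.join()
    return results


outputs = run_pipeline(["a", "b", "c", "d"], num_transformers=2)
```

Which worker handles which prompt is not deterministic here; only the number of outputs is.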
### How it differs from standard three-stage
| Aspect | Standard three-stage | Decentralized queue |
|--------|---------------------|---------------------|
| **Client calls** | Must POST to Decoder → Transformer → Encoder separately | Single POST to Encoder HTTP |
| **Transformer** | HTTP server, one request at a time | Pull worker, multiple instances consume in parallel |
| **Decoder** | HTTP server | Pull worker, auto-consumes Phase2 |
| **Request routing** | Client explicitly specifies | Encoder writes RDMA ring, workers pull by rank |
| **Result retrieval** | Poll Decoder HTTP | Poll Encoder HTTP |
| **Scaling** | Fixed 1:1:1 ratio | N Transformer workers on N GPUs |
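
For the request routing row, a client that wants to spread load across Transformer workers could pick the target rank in round-robin order. The sketch below assumes ranks 1, 2, and 3 purely for illustration; `assign_ranks` is a hypothetical helper and not part of this PR.

```python
import itertools
from typing import List, Sequence


def assign_ranks(num_requests: int, ranks: Sequence[int]) -> List[int]:
    """Cycle through the available Transformer ranks, one per request."""
    if not ranks:
        raise ValueError("at least one Transformer rank is required")
    cycle = itertools.cycle(ranks)
    return [next(cycle) for _ in range(num_requests)]


# Each value would go into `disagg_phase1_receiver_engine_rank` of one request.
targets = assign_ranks(5, ranks=[1, 2, 3])
```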
### Data flow
1. **Client** POSTs to Encoder HTTP (`/v1/tasks/image/`) with prompt,
`data_bootstrap_room` (unique room ID), and
`disagg_phase1_receiver_engine_rank` (target Transformer rank).
2. **Encoder** runs Text Encoder inference, creates a per-request
Mooncake session, sends feature tensors via Phase1, and publishes
dispatch metadata to the Phase1 RDMA ring.
3. **Transformer** (pull worker) consumes the Phase1 ring slot matching
its rank, initializes Mooncake Phase1 receiver + Phase2 sender, runs DiT
denoising, sends latents via Phase2, and publishes dispatch metadata to
the Phase2 RDMA ring.
4. **Decoder** (pull worker) consumes the Phase2 ring, initializes
Mooncake Phase2 receiver, runs VAE decode, and saves the output image.
5. **Client** polls Encoder's `/v1/tasks/{task_id}/status` until
`completed`.
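
Steps 1 and 5 from the client's point of view might look like the sketch below. Several details are assumptions: the host, the `task_id` field in the POST response, and the `status` field in the polling response are not specified in this description; only the two paths, the port (from the diagram), and the three request fields are. The transport is injectable so the flow can be exercised without a running server.

```python
import json
import time
import urllib.request
from typing import Callable, Dict, Optional

ENCODER_URL = "http://127.0.0.1:8002"  # port from the diagram; host is assumed


def build_task_payload(prompt: str, room: int, transformer_rank: int) -> Dict[str, object]:
    return {
        "prompt": prompt,
        "data_bootstrap_room": room,  # unique room ID per request
        "disagg_phase1_receiver_engine_rank": transformer_rank,
    }


def http_json(url: str, payload: Optional[dict] = None) -> dict:
    """POST when a payload is given, otherwise GET; parse the JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def submit_and_wait(
    payload: dict,
    fetch: Callable[..., dict] = http_json,
    poll_interval: float = 1.0,
    max_polls: int = 600,
) -> dict:
    created = fetch(f"{ENCODER_URL}/v1/tasks/image/", payload)
    task_id = created["task_id"]  # assumed response field
    for _ in range(max_polls):
        status = fetch(f"{ENCODER_URL}/v1/tasks/{task_id}/status")
        if status.get("status") == "completed":  # assumed response field
            return status
        time.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not complete in time")
```

A real client would also want to stop polling on a failure status, which this sketch leaves out.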
### Key components
- **Controller** (`ControllerService.serve_rdma_dispatch_only()`): hosts
three RDMA ring buffers (request / phase1 / phase2). It loads no model
and runs as an always-on background process.
- **RDMABuffer** (`rdma_buffer.py`): shared ring buffer over
`RDMAServer`/`RDMAClient` with slot-level atomic coordination for
multi-producer/multi-consumer JSON dispatch.
- **Pull workers** (`qwen_t2i_queue_workers.py`): Transformer and
Decoder worker loops that consume from RDMA rings via
`disagg_try_consume_phase1()` / `disagg_try_consume_phase2()`, then call
`disagg_transformer_prepare_dispatch()` /
`disagg_decoder_prepare_dispatch()` to set up per-request Mooncake
sessions.
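
To give a feel for slot-level atomic coordination in a multi-producer/multi-consumer ring, here is a local sketch. It is not the `RDMABuffer` implementation: it uses a well-known bounded-queue scheme with per-slot sequence numbers, a lock-protected `AtomicCell` stands in for remote compare-and-swap (returning the old value, as `IBV_WR_ATOMIC_CMP_AND_SWP` does), and rank matching is omitted. All class and method names are hypothetical.

```python
import json
import threading
from typing import List, Optional


class AtomicCell:
    """Local stand-in for a remotely accessible 64-bit word."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        with self._lock:
            return self._value

    def store(self, value: int) -> None:
        with self._lock:
            self._value = value

    def compare_swap(self, expected: int, new: int) -> int:
        """Swap only if the current value equals `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


class MetadataRing:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.head = AtomicCell(0)  # next position to consume
        self.tail = AtomicCell(0)  # next position to publish
        # seq[i] == pos means slot i is free for ticket `pos`;
        # seq[i] == pos + 1 means ticket `pos` has been written and is readable.
        self.seq = [AtomicCell(i) for i in range(capacity)]
        self.slots: List[bytes] = [b""] * capacity

    def try_publish(self, meta: dict) -> bool:
        while True:
            pos = self.tail.load()
            idx = pos % self.capacity
            seq = self.seq[idx].load()
            if seq == pos:
                if self.tail.compare_swap(pos, pos + 1) == pos:
                    self.slots[idx] = json.dumps(meta).encode("utf-8")
                    self.seq[idx].store(pos + 1)  # mark readable
                    return True
            elif seq < pos:
                return False  # slot not yet consumed: ring is full
            # otherwise another producer won the race; retry

    def try_consume(self) -> Optional[dict]:
        while True:
            pos = self.head.load()
            idx = pos % self.capacity
            seq = self.seq[idx].load()
            if seq == pos + 1:
                if self.head.compare_swap(pos, pos + 1) == pos:
                    meta = json.loads(self.slots[idx])
                    self.seq[idx].store(pos + self.capacity)  # free for next lap
                    return meta
            elif seq < pos + 1:
                return None  # nothing published at this position yet
            # otherwise another consumer won the race; retry
```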
---------
Co-authored-by: Gu Shiqiao <77222802+gushiqiao@users.noreply.github.com>
Co-authored-by: LiangLiu <1432249204@qq.com>
Co-authored-by: PengGao <peng.gaoc@gmail.com>
Co-authored-by: Musisoul <106440666+Musisoul@users.noreply.github.com>
Co-authored-by: STwangyingrui <86730325+STwangyingrui@users.noreply.github.com>
Co-authored-by: root <root@pt-de4c35727a1b4d1b9f27f422f06026ec-worker-0.pt-de4c35727a1b4d1b9f27f422f06026ec.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: root <root@pt-9b2035a55fe647eeb007584b238e5077-worker-0.pt-9b2035a55fe647eeb007584b238e5077.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: yihuiwen <617954457@qq.com>
Co-authored-by: yihuiwen <yihuiwen@sensetime.com>
Co-authored-by: sandy <wangshankun2011@hotmail.com>
Co-authored-by: wangshankun <wangshankun@sensetime.com>
Co-authored-by: Ian Thompson <37408934+Naist4869@users.noreply.github.com>
Co-authored-by: Yang Yong (雍洋) <yongyang1030@163.com>
Co-authored-by: qinxinyi <qxy118045534@163.com>
Co-authored-by: WateBear <540295877@qq.com>
Co-authored-by: Watebear <wushuo@bupt.cn>
Co-authored-by: Kane <62586707+Wq-dd@users.noreply.github.com>
Co-authored-by: Zhuguanyu Wu <goatwu0415@gmail.com>
Co-authored-by: XHPlus <xhplus@163.com>
Co-authored-by: Fredy Rivera <fredyriveraacevedo13@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: vivienfanghuagood <89012307+vivienfanghuagood@users.noreply.github.com>
Co-authored-by: triple-mu <gpu@163.com>
Co-authored-by: llmc-reviewer <llmc_reviewer@163.com>
Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
Co-authored-by: Sita Bérété <sita.berete.3@gmail.com>
Co-authored-by: LuoLongZan <2200013198@stu.pku.edu.cn>
Co-authored-by: Vivek Bhakta <vivek@wombo.ai>
Co-authored-by: xiehao <hxie_chn@163.com>
Co-authored-by: root <root@pt-72be2ccd01a14fa18a4b18c6c347f823-worker-0.pt-72be2ccd01a14fa18a4b18c6c347f823.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: Lihuang-a <3189274310@qq.com>
Co-authored-by: Franc1sCai <guanghan@atlasv.com>
Co-authored-by: Wu Ruixiao <62665119+kikidouloveme79@users.noreply.github.com>
Co-authored-by: wrx <kikidouloveme79@users.noreply.github.com>
Co-authored-by: root <root@pt-1566c00962444e589a1c9589088689e2-worker-0.pt-1566c00962444e589a1c9589088689e2.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: storyicon <storyicon@foxmail.com>
Co-authored-by: xjq <xjq314@gmail.com>
Co-authored-by: M4jupitercannon <speedforcy@outlook.com>
Co-authored-by: Chengtao Lv <lvchengtao0319@gmail.com>
Co-authored-by: root <root@pt-0699d18802514bc1b116c156f9ce2bc1-worker-0.pt-0699d18802514bc1b116c156f9ce2bc1.ns-devsft-3460edd0.svc.cluster.local>
Co-authored-by: Harahan <yh4717023@gmail.com>
Co-authored-by: ziyanxzy <109060006+ziyanxzy@users.noreply.github.com>
Co-authored-by: zhtshr <44193225+zhtshr@users.noreply.github.com>
Co-authored-by: jasonzhang517 <yzhang298@e.ntu.edu.sg>
1 parent 69648c8 commit c3b0d63
39 files changed
Lines changed: 4144 additions & 423 deletions
File tree
- configs/disagg
- qwen
- wan
- examples/BeginnerGuide
- EN
- ZH_CN
- lightx2v
- disagg
- services
- models
- runners
- qwen_image
- wan
- schedulers
- server
- scripts
- base
- server/disagg
- qwen
- wan