Commit 2f58cfc

add lightx2v_platform
1 parent 9d90683 commit 2f58cfc

5 files changed

Lines changed: 516 additions & 1 deletion

_articles/LightX2VPlatform.md

Lines changed: 348 additions & 0 deletions
@@ -0,0 +1,348 @@
1+
---
2+
layout: post
3+
title: "LightX2V Multi-Platform Deployment Solutions"
4+
author: "LightX2V Team"
5+
date: 2026-05-19
6+
tags: [Deploy, Multi-Platform Deployment, Non-Nvidia Platform Deployment]
7+
---
8+
9+
Video generation inference has long been tightly coupled to the NVIDIA CUDA ecosystem. FlashAttention, cuBLAS, and NCCL are deeply embedded in the hot path of DiT inference. When deploying LightX2V on domestic or alternative AI accelerators (Cambricon MLU, Ascend NPU, Hygon DCU, MetaX, AMD ROCm, and others), the challenge is not just to "make PyTorch run" but to **align every performance-critical operator** (Attention, quantized MatMul, RMSNorm, RoPE, etc.) with the chip vendor's native kernel APIs.
10+
11+
`lightx2v_platform` is a **standalone functional layer** decoupled from the core `lightx2v` inference engine. Its job is to unify inference interfaces across non-NVIDIA chip backends. To support a new accelerator, you only need to implement the corresponding device abstraction and operator kernels inside `lightx2v_platform`—the upper-level model runners, schedulers, and pipeline logic remain unchanged.
12+
13+
**Table of contents:**
14+
15+
- [Why a Separate Platform Layer?](#why-a-separate-platform-layer)
16+
- [Architecture Overview](#architecture-overview)
17+
- [Core Design: Registry + Template + Environment Variable](#core-design-registry--template--environment-variable)
18+
- [Supported Backends and Operator Coverage](#supported-backends-and-operator-coverage)
19+
- [How It Integrates with LightX2V](#how-it-integrates-with-lightx2v)
20+
- [Quick Start: Running on a Non-NVIDIA Platform](#quick-start-running-on-a-non-nvidia-platform)
21+
- [Porting a New Chip Backend](#porting-a-new-chip-backend)
22+
- [Resources](#resources)
23+
24+
---
25+
26+
## Why a Separate Platform Layer?
27+
28+
LightX2V's core codebase is organized around model structure, scheduling, parallelism, and offload—these concerns are hardware-agnostic in principle. What *is* hardware-specific are the low-level compute kernels:
29+
30+
| Operator Category | Typical NVIDIA Implementation | What Changes on Other Chips |
|---|---|---|
| Attention | FlashAttention / SageAttention | Vendor fusion ops (e.g. `npu_fusion_attention`, `tmo.flash_attention`) |
| Quantized MatMul | CUTLASS / sgl_kernel / vLLM quant | Vendor quant APIs (e.g. `npu_quant_matmul`, `tmo.scaled_matmul`) |
| Normalization | Triton / CUDA kernels | Vendor RMSNorm / LayerNorm |
| RoPE | Custom CUDA | Vendor-specific or fallback to PyTorch |
| Distributed | NCCL | CNCL (MLU), HCCL (NPU), RCCL (ROCm), etc. |
37+
38+
Without a dedicated abstraction layer, every new chip would require scattered `if platform == ...` branches throughout the model code. `lightx2v_platform` solves this by:
39+
40+
1. **Isolating** all chip-specific logic into a single module.
41+
2. **Registering** platform kernels through a unified registry mechanism.
42+
3. **Selecting** the correct implementation at runtime via the `PLATFORM` environment variable and JSON config fields like `self_attn_1_type`.
43+
44+
The result: LightX2V's upper layers always call the same interface (`AttnWeightTemplate.apply`, `MMWeightTemplate.apply`, etc.), regardless of which chip is underneath.
45+
46+
---
47+
48+
## Architecture Overview
49+
50+
![lightx2v_platform architecture overview]({{ site.baseurl }}/assets/LightX2VPlatform/platform_img1.png)
51+
52+
The module has two main parts:
53+
54+
- **`base/`** — Device abstraction. Each chip backend registers a `*Device` class that handles device initialization, availability checks, device name resolution, and distributed backend setup (e.g. NCCL for CUDA, CNCL for MLU, HCCL for NPU).
55+
- **`ops/`** — Operator kernels organized by category (`attn`, `mm`, `norm`, `rope`), with per-platform subdirectories containing chip-specific implementations.
56+
57+
At import time, `set_ai_device.py` reads the `PLATFORM` environment variable, initializes the device, and conditionally loads the corresponding operator modules.
58+
59+
---
60+
61+
## Core Design: Registry + Template + Environment Variable
62+
63+
### 1. Registry Pattern
64+
65+
`registry_factory.py` defines a lightweight `Register` class and six platform-level registries:
66+
67+
```python
PLATFORM_DEVICE_REGISTER = Register()
PLATFORM_ATTN_WEIGHT_REGISTER = Register()
PLATFORM_MM_WEIGHT_REGISTER = Register()
PLATFORM_RMS_WEIGHT_REGISTER = Register()
PLATFORM_LAYERNORM_WEIGHT_REGISTER = Register()
PLATFORM_ROPE_REGISTER = Register()
```
75+
76+
Each chip backend registers its implementations via decorators. For example, Ascend NPU registers its Flash Attention kernel as `"npu_flash_attn"`:
77+
78+
```python
@PLATFORM_ATTN_WEIGHT_REGISTER("npu_flash_attn")
class NpuFlashAttnWeight(AttnWeightTemplate):
    def apply(self, q, k, v, **kwargs):  # remaining parameters elided
        # Dispatch to the Ascend fused attention op; further arguments elided.
        x = torch_npu.npu_fusion_attention(q, k, v, ...)
        return x
```
85+
86+
On the LightX2V side, `lightx2v/utils/registry_factory.py` **merges** the platform registries into the main registries at startup:
87+
88+
```python
ATTN_WEIGHT_REGISTER.merge(PLATFORM_ATTN_WEIGHT_REGISTER)
MM_WEIGHT_REGISTER.merge(PLATFORM_MM_WEIGHT_REGISTER)
RMS_WEIGHT_REGISTER.merge(PLATFORM_RMS_WEIGHT_REGISTER)
LN_WEIGHT_REGISTER.merge(PLATFORM_LAYERNORM_WEIGHT_REGISTER)
ROPE_REGISTER.merge(PLATFORM_ROPE_REGISTER)
```
95+
96+
This means platform kernels appear alongside NVIDIA-native kernels in the same lookup table. The JSON config simply specifies which kernel name to use—no platform-specific branching in model code.
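The decorator-plus-merge mechanics can be sketched in a few lines. This is a minimal illustration and not the actual `Register` implementation in `registry_factory.py`: the class body, the collision check in `merge`, and the two demo kernel classes are assumptions made for the example.

```python
class Register(dict):
    """Hypothetical minimal registry: maps a kernel name to a class."""

    def __call__(self, name):
        # Used as a decorator factory: @REGISTRY("kernel_name")
        def decorator(cls):
            if name in self:
                raise KeyError(f"{name!r} is already registered")
            self[name] = cls
            return cls

        return decorator

    def merge(self, other):
        # Copy every entry of another registry into this one.
        for name, cls in other.items():
            if name in self:
                raise KeyError(f"{name!r} is already registered")
            self[name] = cls


ATTN_WEIGHT_REGISTER = Register()           # stand-in for the main registry
PLATFORM_ATTN_WEIGHT_REGISTER = Register()  # stand-in for the platform registry


@ATTN_WEIGHT_REGISTER("torch_sdpa")
class TorchSdpaAttn:  # demo class, not a real kernel
    pass


@PLATFORM_ATTN_WEIGHT_REGISTER("demo_flash_attn")
class DemoFlashAttn:  # demo class, not a real kernel
    pass


ATTN_WEIGHT_REGISTER.merge(PLATFORM_ATTN_WEIGHT_REGISTER)
kernel_names = sorted(ATTN_WEIGHT_REGISTER)
```

After the merge, a single lookup such as `ATTN_WEIGHT_REGISTER["demo_flash_attn"]` resolves both built-in and platform kernels, which is what lets the JSON config refer to either by name.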
97+
98+
### 2. Template Classes
99+
100+
Each operator category defines an abstract template in `ops/`:
101+
102+
| Template | Location | Key Method |
|---|---|---|
| `AttnWeightTemplate` | `ops/attn/template.py` | `apply(q, k, v, ...)` |
| `MMWeightTemplate` / `MMWeightQuantTemplate` | `ops/mm/template.py` | `load()`, `apply()` |
| `RMSWeightTemplate` | `ops/norm/norm_template.py` | `apply(input_tensor)` |
| `LayerNormWeightTemplate` | `ops/norm/norm_template.py` | `apply(input_tensor)` |
| `RopeTemplate` | `ops/rope/rope_template.py` | `apply(xq, xk, cos_sin_cache)` |
109+
110+
Templates handle the common logic—weight loading, CPU/GPU buffer management, lazy load, state dict serialization—while subclasses only implement the chip-specific `apply()` (and optionally custom `load()` / quantization paths).
111+
112+
For quantized MatMul, `MMWeightQuantTemplate` provides a rich set of built-in weight and activation quantization helpers (`load_int8_perchannel_sym`, `load_fp8_perchannel_sym`, etc.), so platform implementations often only need to plug in the vendor's `act_quant_func` and kernel call.
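As a rough illustration of what "int8 per-channel symmetric" means, the sketch below quantizes each output channel (row) of a weight matrix with its own scale. It is written in plain Python for readability; the function names and the list-based layout are assumptions for this example and do not reproduce LightX2V's helper of a similar name.

```python
def quantize_int8_perchannel_sym(weight):
    """Quantize each row with its own symmetric scale.

    weight: list of rows of floats. Returns (int8 rows, per-row scales).
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        # Symmetric scheme: map [-max_abs, max_abs] onto [-127, 127].
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    # Multiply every quantized value by the scale of its row.
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[0.3, -1.0, 0.2], [1.0, 0.0, -4.0]]
q_rows, scales = quantize_int8_perchannel_sym(weight)
restored = dequantize(q_rows, scales)
```

The largest-magnitude entry of each row maps to ±127, and every other entry is recovered to within half a quantization step of that row's scale.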
113+
114+
### 3. Environment Variable `PLATFORM`
115+
116+
The `PLATFORM` environment variable is the single switch that selects the chip backend:
117+
118+
```bash
# Set exactly one of the following:
export PLATFORM=ascend_npu      # Huawei Ascend 910B
export PLATFORM=cambricon_mlu   # Cambricon MLU590
export PLATFORM=amd_rocm        # AMD MI350
export PLATFORM=hygon_dcu       # Hygon DCU
export PLATFORM=metax_cuda      # MetaX C500
export PLATFORM=musa            # MThreads MUSA
export PLATFORM=enflame_gcu     # Enflame S60 (GCU)
export PLATFORM=intel_xpu       # Intel AIPC PTL
export PLATFORM=iluvatar_cuda   # Iluvatar
# Default (unset): cuda (NVIDIA)
```
130+
131+
The initialization flow in `set_ai_device.py`:
132+
133+
1. Read `PLATFORM` from environment (default: `"cuda"`).
134+
2. Call `init_ai_device(platform)` → look up the device class in `PLATFORM_DEVICE_REGISTER`, set global `AI_DEVICE` and `PLATFORM`.
135+
3. Call `check_ai_device(platform)` → verify the chip runtime is available.
136+
4. Conditionally import platform-specific ops modules (e.g. only load `ops/attn/ascend_npu/` when `PLATFORM=ascend_npu`).
137+
138+
Since `lightx2v/__init__.py` imports `lightx2v_platform.set_ai_device` at package load time, the platform is initialized automatically whenever LightX2V is imported.
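The four-step flow can be sketched as follows. This is a simplified stand-in, not the code in `set_ai_device.py`: the device classes, the plain-dict registry, and the `init_platform` helper are hypothetical names introduced only for illustration.

```python
import os


# Hypothetical stand-ins for the registered device classes.
class CudaDevice:
    @staticmethod
    def is_available():
        return True  # pretend the runtime is present

    @staticmethod
    def get_device():
        return "cuda"


class NpuDevice(CudaDevice):
    @staticmethod
    def get_device():
        return "npu"


DEVICE_REGISTER = {"cuda": CudaDevice, "ascend_npu": NpuDevice}


def init_platform(environ=os.environ):
    # 1. Read PLATFORM, defaulting to "cuda".
    platform = environ.get("PLATFORM", "cuda")
    # 2. Look up the device class for that platform.
    if platform not in DEVICE_REGISTER:
        raise ValueError(f"Unsupported PLATFORM={platform!r}")
    device_cls = DEVICE_REGISTER[platform]
    # 3. Verify that the chip runtime is usable.
    if not device_cls.is_available():
        raise RuntimeError(f"Runtime for {platform!r} is not available")
    # 4. The real module would now import the matching ops packages.
    return platform, device_cls.get_device()


default_result = init_platform({})
npu_result = init_platform({"PLATFORM": "ascend_npu"})
```

Failing early on an unknown or unavailable platform keeps misconfiguration errors at import time rather than deep inside the first kernel call.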
139+
140+
### 4. Global Variables
141+
142+
`base/global_var.py` exposes two module-level globals used throughout LightX2V:
143+
144+
- `AI_DEVICE` — the PyTorch device string (e.g. `"cuda"`, `"npu"`, `"mlu"`, `"xpu"`).
145+
- `PLATFORM` — the platform identifier string (e.g. `"ascend_npu"`, `"cambricon_mlu"`).
146+
147+
All tensor placement in LightX2V references `AI_DEVICE` instead of hardcoded `"cuda"`, enabling transparent multi-platform execution.
148+
149+
---
150+
151+
## Supported Backends and Operator Coverage
152+
153+
Currently supported backends:
154+
155+
| Chip | `PLATFORM` Value | Device String | Distributed Backend |
|---|---|---|---|
| NVIDIA GPU | `cuda` (default) | `cuda` | NCCL |
| Cambricon MLU590 | `cambricon_mlu` | `mlu` | CNCL |
| MetaX C500 | `metax_cuda` | `cuda` | NCCL |
| Hygon DCU | `hygon_dcu` | `cuda` | NCCL |
| Huawei Ascend 910B | `ascend_npu` | `npu` | HCCL |
| AMD ROCm (MI350) | `amd_rocm` | `cuda` | NCCL (RCCL) |
| MThreads MUSA | `musa` | `musa` | MCCL |
| Enflame S60 (GCU) | `enflame_gcu` | `gcu` | ECCL |
| Intel AIPC PTL | `intel_xpu` | `xpu` | CCL |
| Iluvatar | `iluvatar_cuda` | `cuda` | NCCL |
167+
168+
Operator kernel coverage per platform (registered names that can be referenced in JSON configs):
169+
170+
| Platform | Attention | Quantized MatMul | Normalization | RoPE |
|---|---|---|---|---|
| **cambricon_mlu** | `mlu_flash_attn`, `mlu_sage_attn` | `int8-tmo` | `mlu_rms_norm` | |
| **ascend_npu** | `npu_flash_attn` | `int8-npu` | — (use `torch`) | |
| **hygon_dcu** | `flash_attn_hygon_dcu` | `int8-vllm-hygon-dcu` | | |
| **amd_rocm** | `aiter_attn` | via aiter compat layer | | |
| **enflame_gcu** | `flash_attn_enflame_gcu` | | `gcu_layer_norm` | `enflame_wan_rope` |
| **intel_xpu** | `intel_xpu_flash_attn` | `intel_xpu_mm`, `intel_xpu_fp8` | | |
| **iluvatar_cuda** | `iluvatar_flash_attn` | `int8-iluvatar` | `iluvatar_rms_norm` | `iluvatar_wan_rope` |
| **metax_cuda** | `metax_sage_attn2` | — (default CUDA kernels) | | |
| **musa** | — (fallback `torch_sdpa`) | | | |
181+
182+
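Platforms without a custom kernel for a given operator category can fall back to PyTorch native implementations by setting the corresponding `*_type` field in the JSON config to a PyTorch-native kernel name, for example `"torch"` for normalization or `torch_sdpa` for attention, as listed in the table above.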
183+
184+
---
185+
186+
## How It Integrates with LightX2V
187+
188+
The integration follows a clean three-step pattern:
189+
190+
**Step 1 — Platform init at import time**
191+
192+
```python
# lightx2v/__init__.py
import lightx2v_platform.set_ai_device  # triggers device init + ops loading
```
196+
197+
**Step 2 — Registry merge**
198+
199+
Platform kernels are merged into LightX2V's main registries, so model code uses a single lookup path:
200+
201+
```python
# In model weight initialization (simplified)
attn_cls = ATTN_WEIGHT_REGISTER[config["self_attn_1_type"]]
self.self_attn = attn_cls()
```
206+
207+
**Step 3 — Config-driven kernel selection**
208+
209+
Each platform has dedicated JSON configs under `configs/platforms/` that specify which registered kernel to use. For example, Ascend NPU Wan2.1 T2V:
210+
211+
```json
{
    "self_attn_1_type": "npu_flash_attn",
    "cross_attn_1_type": "npu_flash_attn",
    "cross_attn_2_type": "npu_flash_attn",
    "rms_norm_type": "torch",
    "cpu_offload": true,
    "offload_granularity": "model"
}
```
221+
222+
Cambricon MLU uses its own optimized kernels:
223+
224+
```json
{
    "self_attn_1_type": "mlu_sage_attn",
    "cross_attn_1_type": "mlu_sage_attn",
    "cross_attn_2_type": "mlu_sage_attn",
    "rms_norm_type": "mlu_rms_norm"
}
```
232+
233+
This design means LightX2V features like **parallelism**, **offload**, and **disaggregated deployment** work on non-NVIDIA platforms without modification—the platform layer only replaces the compute kernels and device management underneath.
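To illustrate config-driven selection, the sketch below resolves `*_type` fields against a set of known kernel names and fails early on a name that was never registered. The `resolve_kernels` helper and the contents of `REGISTRIES` are hypothetical; in LightX2V the lookup goes through the merged registries rather than these stand-in sets.

```python
import json

# Stand-in registries keyed by config field; the real registries hold classes.
REGISTRIES = {
    "self_attn_1_type": {"npu_flash_attn", "torch_sdpa"},
    "rms_norm_type": {"torch"},
}


def resolve_kernels(config_text):
    """Return {field: kernel name}, raising on unknown kernel names."""
    config = json.loads(config_text)
    resolved = {}
    for field, known in REGISTRIES.items():
        if field not in config:
            continue  # field not set in this config
        name = config[field]
        if name not in known:
            raise KeyError(
                f"{field}={name!r} is not registered; known: {sorted(known)}"
            )
        resolved[field] = name
    return resolved


config_text = (
    '{"self_attn_1_type": "npu_flash_attn", '
    '"rms_norm_type": "torch", "cpu_offload": true}'
)
resolved = resolve_kernels(config_text)
```

A check of this kind is useful when a config written for one platform is reused on another, because a kernel name such as `mlu_sage_attn` only exists once the matching platform ops have been loaded.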
234+
235+
---
236+
237+
## Quick Start: Running on a Non-NVIDIA Platform
238+
239+
Here is a minimal example for running Wan2.1 T2V on Ascend 910B:
240+
241+
```bash
# 1. Set platform and visible devices
export PLATFORM=ascend_npu
export ASCEND_RT_VISIBLE_DEVICES=0

# 2. Run inference with platform-specific config
#    (model_path must point to the local model weights)
python -m lightx2v.infer \
    --model_cls wan2.1 \
    --task t2v \
    --model_path "$model_path" \
    --config_json configs/platforms/ascend_npu/wan_t2v.json \
    --prompt "Two anthropomorphic cats in comfy boxing gear..." \
    --save_result_path output.mp4
```
255+
256+
Key points:
257+
258+
- Always set `PLATFORM` **before** importing LightX2V (or use the provided shell scripts that export it).
259+
- Use the matching config from `configs/platforms/<platform>/`.
260+
- Refer to `scripts/platforms/<platform>/` for complete, tested launch scripts covering Wan, Qwen-Image, Z-Image, and other models.
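When LightX2V is driven from a Python script rather than a shell script, the first key point translates into setting the variable before the first import. The snippet below is a small sketch of that ordering; the `try`/`except` guard is only there so the example also runs where LightX2V is not installed.

```python
import importlib
import os

# PLATFORM must be in the environment before the first import of lightx2v,
# because platform initialization happens at package import time.
os.environ.setdefault("PLATFORM", "ascend_npu")

try:
    lightx2v = importlib.import_module("lightx2v")
except ImportError:
    lightx2v = None  # LightX2V is not installed in this environment

selected_platform = os.environ["PLATFORM"]
```

Using `setdefault` keeps any value already exported by a launch script, so the script and the shell environment do not silently disagree.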
261+
262+
---
263+
264+
## Porting a New Chip Backend
265+
266+
Adding support for a new accelerator requires changes **only inside `lightx2v_platform`**. Here is the step-by-step workflow:
267+
268+
### Step 1: Implement Device Abstraction
269+
270+
Create `base/my_chip.py`:
271+
272+
```python
import torch.distributed as dist

from lightx2v_platform.registry_factory import PLATFORM_DEVICE_REGISTER


@PLATFORM_DEVICE_REGISTER("my_chip")
class MyChipDevice:
    name = "my_chip"

    @staticmethod
    def init_device_env():
        pass  # any chip-specific env setup

    @staticmethod
    def is_available() -> bool:
        # check chip runtime is installed and hardware is present
        ...

    @staticmethod
    def get_device() -> str:
        return "my_device"  # PyTorch device string

    @staticmethod
    def init_parallel_env():
        dist.init_process_group(backend="my_backend")
        ...
```
297+
298+
Register it in `base/__init__.py` (typically by importing the new module there, so that the decorator runs at import time).
299+
300+
### Step 2: Implement Operator Kernels
301+
302+
For each operator category the chip supports, create implementations under `ops/<category>/my_chip/`:
303+
304+
```
ops/
├── attn/my_chip/flash_attn.py   → @PLATFORM_ATTN_WEIGHT_REGISTER("my_chip_flash_attn")
├── mm/my_chip/mm_weight.py      → @PLATFORM_MM_WEIGHT_REGISTER("int8-my_chip")
├── norm/my_chip/rms_norm.py     → @PLATFORM_RMS_WEIGHT_REGISTER("my_chip_rms_norm")
└── rope/my_chip/wan_rope.py     → @PLATFORM_ROPE_REGISTER("my_chip_wan_rope")
```
311+
312+
Each class inherits from the corresponding template and implements the `apply()` method using the vendor's kernel API.
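The shape of such a kernel class can be sketched with a simplified RMSNorm. Everything here is a stand-in: the plain-dict registry, the `register` helper, the constructor signature of `RMSWeightTemplate`, and the pure-Python math are assumptions for illustration, and a real backend would call the vendor's fused kernel inside `apply()`.

```python
from abc import ABC, abstractmethod

PLATFORM_RMS_WEIGHT_REGISTER = {}  # stand-in for the real registry object


def register(name):
    # Hypothetical decorator that mimics registry-based registration.
    def decorator(cls):
        PLATFORM_RMS_WEIGHT_REGISTER[name] = cls
        return cls

    return decorator


class RMSWeightTemplate(ABC):
    """Simplified stand-in; the real template also handles loading and offload."""

    def __init__(self, weight, eps=1e-6):
        self.weight = weight
        self.eps = eps

    @abstractmethod
    def apply(self, input_tensor):
        ...


@register("my_chip_rms_norm")
class MyChipRMSNorm(RMSWeightTemplate):
    def apply(self, input_tensor):
        # Reference math: x / sqrt(mean(x^2) + eps) * weight.
        # A real backend would replace this with the vendor's RMSNorm call.
        mean_sq = sum(v * v for v in input_tensor) / len(input_tensor)
        inv_rms = (mean_sq + self.eps) ** -0.5
        return [v * inv_rms * w for v, w in zip(input_tensor, self.weight)]


norm = PLATFORM_RMS_WEIGHT_REGISTER["my_chip_rms_norm"](weight=[1.0, 1.0])
output = norm.apply([3.0, 4.0])
```

Keeping a slow reference implementation like this next to the vendor call is also a convenient way to check numerical agreement when bringing up a new chip.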
313+
314+
### Step 3: Register Ops Loading
315+
316+
Add a branch in `ops/__init__.py`:
317+
318+
```python
elif PLATFORM == "my_chip":
    from .attn.my_chip import *
    from .mm.my_chip import *
```
323+
324+
### Step 4: Provide Config and Scripts
325+
326+
- Add JSON configs under `configs/platforms/my_chip/`.
327+
- Add launch scripts under `scripts/platforms/my_chip/`.
328+
- Optionally add a Dockerfile under `dockerfiles/platforms/`.
329+
330+
### Step 5: Test
331+
332+
```bash
PLATFORM=my_chip python lightx2v_platform/test/test_device.py
# Then run a full inference with the platform config
```
336+
337+
No changes to `lightx2v/` model code, runners, or schedulers are needed.
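Before running the full test, a quick structural check of the new device class can catch missing methods. The sketch below is not the contents of `test_device.py`; `check_device_contract` and the two example classes are hypothetical, and the list of required methods is taken from the Step 1 example above.

```python
REQUIRED_METHODS = (
    "init_device_env",
    "is_available",
    "get_device",
    "init_parallel_env",
)


def check_device_contract(device_cls):
    """Return a list of problems; an empty list means the class conforms."""
    problems = []
    if not isinstance(getattr(device_cls, "name", None), str):
        problems.append("missing string attribute 'name'")
    for method in REQUIRED_METHODS:
        if not callable(getattr(device_cls, method, None)):
            problems.append(f"missing method {method}()")
    return problems


class GoodDevice:
    name = "my_chip"
    init_device_env = staticmethod(lambda: None)
    is_available = staticmethod(lambda: True)
    get_device = staticmethod(lambda: "my_device")
    init_parallel_env = staticmethod(lambda: None)


class BrokenDevice:
    name = "broken"  # defines none of the required methods


good_problems = check_device_contract(GoodDevice)
broken_problems = check_device_contract(BrokenDevice)
```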
338+
339+
---
340+
341+
## Resources
342+
343+
- **Platform module**: [`LightX2V/lightx2v_platform`](https://github.com/ModelTC/LightX2V/tree/main/lightx2v_platform)
344+
- **Docker environments**: [`dockerfiles/platforms`](https://github.com/ModelTC/LightX2V/tree/main/dockerfiles/platforms)
345+
- **Launch scripts**: [`scripts/platforms`](https://github.com/ModelTC/LightX2V/tree/main/scripts/platforms)
346+
- **Platform configs**: [`configs/platforms`](https://github.com/ModelTC/LightX2V/tree/main/configs/platforms)
347+
348+
`lightx2v_platform` turns multi-chip deployment from a cross-cutting refactor into a localized, registry-driven extension problem. Whether you are running on Cambricon MLU in a data center, Ascend NPU in a cloud cluster, or Intel XPU on a laptop, the same LightX2V pipeline code path applies—you just point `PLATFORM` at the right backend and select the matching config.

_articles/Parallel.md

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@ layout: post
3 3
title: "Parallel Mechanism of LightX2V"
4 4
author: "LightX2V Team"
5 5
date: 2026-05-19
6-
tags: [Parallel]
6+
tags: [Parallel, CFG Parallelism, Ulysses, Ring]
7 7
---
8 8

9 9
## I. Overview of LightX2V's Parallel Mechanism
