| 1 | +--- |
| 2 | +layout: post |
| 3 | +title: "LightX2V Multi-Platform Deployment Solutions" |
| 4 | +author: "LightX2V Team" |
| 5 | +date: 2026-05-19 |
| 6 | +tags: [Deploy, Multi-Platform Deployment, Non-Nvidia Platform Deployment] |
| 7 | +--- |
| 8 | + |
| 9 | +Video generation inference has long been tightly coupled to the NVIDIA CUDA ecosystem. FlashAttention, cuBLAS, and NCCL are deeply embedded in the hot path of DiT inference. When deploying LightX2V on domestic or alternative AI accelerators (Cambricon MLU, Ascend NPU, Hygon DCU, MetaX, AMD ROCm, and others), the challenge is not just to make PyTorch run, but to **align every performance-critical operator** (Attention, quantized MatMul, RMSNorm, RoPE, etc.) with the chip vendor's native kernel APIs. |
| 10 | + |
| 11 | +`lightx2v_platform` is a **standalone functional layer** decoupled from the core `lightx2v` inference engine. Its job is to unify inference interfaces across non-NVIDIA chip backends. To support a new accelerator, you only need to implement the corresponding device abstraction and operator kernels inside `lightx2v_platform`—the upper-level model runners, schedulers, and pipeline logic remain unchanged. |
| 12 | + |
| 13 | +**Table of contents:** |
| 14 | + |
| 15 | +- [Why a Separate Platform Layer?](#why-a-separate-platform-layer) |
| 16 | +- [Architecture Overview](#architecture-overview) |
| 17 | +- [Core Design: Registry + Template + Environment Variable](#core-design-registry--template--environment-variable) |
| 18 | +- [Supported Backends and Operator Coverage](#supported-backends-and-operator-coverage) |
| 19 | +- [How It Integrates with LightX2V](#how-it-integrates-with-lightx2v) |
| 20 | +- [Quick Start: Running on a Non-NVIDIA Platform](#quick-start-running-on-a-non-nvidia-platform) |
| 21 | +- [Porting a New Chip Backend](#porting-a-new-chip-backend) |
| 22 | +- [Resources](#resources) |
| 23 | + |
| 24 | +--- |
| 25 | + |
| 26 | +## Why a Separate Platform Layer? |
| 27 | + |
| 28 | +LightX2V's core codebase is organized around model structure, scheduling, parallelism, and offload—these concerns are hardware-agnostic in principle. What *is* hardware-specific are the low-level compute kernels: |
| 29 | + |
| 30 | +| Operator Category | Typical NVIDIA Implementation | What Changes on Other Chips | |
| 31 | +|---|---|---| |
| 32 | +| Attention | FlashAttention / SageAttention | Vendor fusion ops (e.g. `npu_fusion_attention`, `tmo.flash_attention`) | |
| 33 | +| Quantized MatMul | CUTLASS / sgl_kernel / vLLM quant | Vendor quant APIs (e.g. `npu_quant_matmul`, `tmo.scaled_matmul`) | |
| 34 | +| Normalization | Triton / CUDA kernels | Vendor RMSNorm / LayerNorm | |
| 35 | +| RoPE | Custom CUDA | Vendor-specific or fallback to PyTorch | |
| 36 | +| Distributed | NCCL | CNCL (MLU), HCCL (NPU), RCCL (ROCm), etc. | |
| 37 | + |
| 38 | +Without a dedicated abstraction layer, every new chip would require scattered `if platform == ...` branches throughout the model code. `lightx2v_platform` solves this by: |
| 39 | + |
| 40 | +1. **Isolating** all chip-specific logic into a single module. |
| 41 | +2. **Registering** platform kernels through a unified registry mechanism. |
| 42 | +3. **Selecting** the correct implementation at runtime via the `PLATFORM` environment variable and JSON config fields like `self_attn_1_type`. |
| 43 | + |
| 44 | +The result: LightX2V's upper layers always call the same interface (`AttnWeightTemplate.apply`, `MMWeightTemplate.apply`, etc.), regardless of which chip is underneath. |
| 45 | + |
| 46 | +--- |
| 47 | + |
| 48 | +## Architecture Overview |
| 49 | + |
| 50 | + |
| 51 | + |
| 52 | +The module has two main parts: |
| 53 | + |
| 54 | +- **`base/`** — Device abstraction. Each chip backend registers a `*Device` class that handles device initialization, availability checks, device name resolution, and distributed backend setup (e.g. NCCL for CUDA, CNCL for MLU, HCCL for NPU). |
| 55 | +- **`ops/`** — Operator kernels organized by category (`attn`, `mm`, `norm`, `rope`), with per-platform subdirectories containing chip-specific implementations. |
| 56 | + |
| 57 | +At import time, `set_ai_device.py` reads the `PLATFORM` environment variable, initializes the device, and conditionally loads the corresponding operator modules. |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## Core Design: Registry + Template + Environment Variable |
| 62 | + |
| 63 | +### 1. Registry Pattern |
| 64 | + |
| 65 | +`registry_factory.py` defines a lightweight `Register` class and six platform-level registries: |
| 66 | + |
| 67 | +```python |
| 68 | +PLATFORM_DEVICE_REGISTER = Register() |
| 69 | +PLATFORM_ATTN_WEIGHT_REGISTER = Register() |
| 70 | +PLATFORM_MM_WEIGHT_REGISTER = Register() |
| 71 | +PLATFORM_RMS_WEIGHT_REGISTER = Register() |
| 72 | +PLATFORM_LAYERNORM_WEIGHT_REGISTER = Register() |
| 73 | +PLATFORM_ROPE_REGISTER = Register() |
| 74 | +``` |
| 75 | + |
| 76 | +Each chip backend registers its implementations via decorators. For example, Ascend NPU registers its Flash Attention kernel as `"npu_flash_attn"` (simplified, with the remaining arguments elided as `...`): |
| 77 | + |
| 78 | +```python |
| 79 | +@PLATFORM_ATTN_WEIGHT_REGISTER("npu_flash_attn") |
| 80 | +class NpuFlashAttnWeight(AttnWeightTemplate): |
| 81 | +    def apply(self, q, k, v, ...): |
| 82 | +        x = torch_npu.npu_fusion_attention(q, k, v, ...) |
| 83 | +        return x |
| 84 | +``` |
| 85 | + |
| 86 | +On the LightX2V side, `lightx2v/utils/registry_factory.py` **merges** the platform registries into the main registries at startup: |
| 87 | + |
| 88 | +```python |
| 89 | +ATTN_WEIGHT_REGISTER.merge(PLATFORM_ATTN_WEIGHT_REGISTER) |
| 90 | +MM_WEIGHT_REGISTER.merge(PLATFORM_MM_WEIGHT_REGISTER) |
| 91 | +RMS_WEIGHT_REGISTER.merge(PLATFORM_RMS_WEIGHT_REGISTER) |
| 92 | +LN_WEIGHT_REGISTER.merge(PLATFORM_LAYERNORM_WEIGHT_REGISTER) |
| 93 | +ROPE_REGISTER.merge(PLATFORM_ROPE_REGISTER) |
| 94 | +``` |
| 95 | + |
| 96 | +This means platform kernels appear alongside NVIDIA-native kernels in the same lookup table. The JSON config simply specifies which kernel name to use—no platform-specific branching in model code. |
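
The article shows how the registries are used but not the `Register` class itself. As a rough sketch, assuming a dict-backed registry whose instances act as decorator factories and expose a `merge` method, the mechanism could look like the following. The class body and the `native_attn` key are illustrative assumptions, not the actual LightX2V source.

```python
class Register(dict):
    """Hypothetical minimal registry mapping a kernel name to a class."""

    def __call__(self, key):
        # Calling the registry returns a decorator that stores the class under `key`.
        def decorator(cls):
            if key in self:
                raise KeyError(f"{key!r} is already registered")
            self[key] = cls
            return cls

        return decorator

    def merge(self, other):
        # Copy every entry of `other` into this registry, refusing silent overwrites.
        for key, cls in other.items():
            if key in self:
                raise KeyError(f"{key!r} is already registered")
            self[key] = cls


ATTN_WEIGHT_REGISTER = Register()           # stands in for the main registry
PLATFORM_ATTN_WEIGHT_REGISTER = Register()  # stands in for the platform registry


@ATTN_WEIGHT_REGISTER("native_attn")  # illustrative key, not a real kernel name
class NativeAttnWeight:
    pass


@PLATFORM_ATTN_WEIGHT_REGISTER("npu_flash_attn")
class NpuFlashAttnWeight:
    pass


ATTN_WEIGHT_REGISTER.merge(PLATFORM_ATTN_WEIGHT_REGISTER)

# Model code resolves a kernel by the name given in the JSON config.
attn_cls = ATTN_WEIGHT_REGISTER["npu_flash_attn"]
registered_names = sorted(ATTN_WEIGHT_REGISTER)
```

Raising on duplicate keys is a design choice in this sketch: it surfaces name clashes between native and platform kernels at startup instead of letting one silently shadow the other.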
| 97 | + |
| 98 | +### 2. Template Classes |
| 99 | + |
| 100 | +Each operator category defines an abstract template in `ops/`: |
| 101 | + |
| 102 | +| Template | Location | Key Method | |
| 103 | +|---|---|---| |
| 104 | +| `AttnWeightTemplate` | `ops/attn/template.py` | `apply(q, k, v, ...)` | |
| 105 | +| `MMWeightTemplate` / `MMWeightQuantTemplate` | `ops/mm/template.py` | `load()`, `apply()` | |
| 106 | +| `RMSWeightTemplate` | `ops/norm/norm_template.py` | `apply(input_tensor)` | |
| 107 | +| `LayerNormWeightTemplate` | `ops/norm/norm_template.py` | `apply(input_tensor)` | |
| 108 | +| `RopeTemplate` | `ops/rope/rope_template.py` | `apply(xq, xk, cos_sin_cache)` | |
| 109 | + |
| 110 | +Templates handle the common logic—weight loading, CPU/GPU buffer management, lazy load, state dict serialization—while subclasses only implement the chip-specific `apply()` (and optionally custom `load()` / quantization paths). |
| 111 | + |
| 112 | +For quantized MatMul, `MMWeightQuantTemplate` provides a rich set of built-in weight and activation quantization helpers (`load_int8_perchannel_sym`, `load_fp8_perchannel_sym`, etc.), so platform implementations often only need to plug in the vendor's `act_quant_func` and kernel call. |
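
To make the template/subclass split concrete, here is a small sketch of the pattern for RMSNorm. It uses plain Python lists instead of tensors so that it runs anywhere. The names `RMSWeightTemplateSketch` and `PurePythonRMSNorm`, the `eps` default, and the weight key are all hypothetical; the real `RMSWeightTemplate` handles considerably more (buffers, lazy load, serialization).

```python
import math
from abc import ABC, abstractmethod


class RMSWeightTemplateSketch(ABC):
    """Hypothetical template: owns weight loading, leaves `apply` to subclasses."""

    def __init__(self, weight_name, eps=1e-6):
        self.weight_name = weight_name
        self.eps = eps
        self.weight = None

    def load(self, weight_dict):
        # Common logic shared by every backend: fetch the weight by name.
        self.weight = weight_dict[self.weight_name]

    @abstractmethod
    def apply(self, input_tensor):
        """Backend-specific computation."""


class PurePythonRMSNorm(RMSWeightTemplateSketch):
    """Stand-in for a vendor kernel: RMSNorm over a flat list of floats."""

    def apply(self, input_tensor):
        mean_sq = sum(x * x for x in input_tensor) / len(input_tensor)
        scale = 1.0 / math.sqrt(mean_sq + self.eps)
        return [x * scale * w for x, w in zip(input_tensor, self.weight)]


norm = PurePythonRMSNorm("blocks.0.norm.weight")
norm.load({"blocks.0.norm.weight": [1.0, 1.0]})
out = norm.apply([3.0, 4.0])
```

A vendor backend would replace only the body of `apply` with its native kernel call and inherit everything else.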
| 113 | + |
| 114 | +### 3. Environment Variable `PLATFORM` |
| 115 | + |
| 116 | +The `PLATFORM` environment variable is the single switch that selects the chip backend. Export exactly one of the following values: |
| 117 | + |
| 118 | +```bash |
| 119 | +export PLATFORM=ascend_npu # Huawei Ascend 910B |
| 120 | +export PLATFORM=cambricon_mlu # Cambricon MLU590 |
| 121 | +export PLATFORM=amd_rocm # AMD MI350 |
| 122 | +export PLATFORM=hygon_dcu # Hygon DCU |
| 123 | +export PLATFORM=metax_cuda # MetaX C500 |
| 124 | +export PLATFORM=musa # MThreads MUSA |
| 125 | +export PLATFORM=enflame_gcu # Enflame S60 (GCU) |
| 126 | +export PLATFORM=intel_xpu # Intel AIPC PTL |
| 127 | +export PLATFORM=iluvatar_cuda # Iluvatar |
| 128 | +# Default (unset): cuda (NVIDIA) |
| 129 | +``` |
| 130 | + |
| 131 | +The initialization flow in `set_ai_device.py`: |
| 132 | + |
| 133 | +1. Read `PLATFORM` from environment (default: `"cuda"`). |
| 134 | +2. Call `init_ai_device(platform)` → look up the device class in `PLATFORM_DEVICE_REGISTER`, set global `AI_DEVICE` and `PLATFORM`. |
| 135 | +3. Call `check_ai_device(platform)` → verify the chip runtime is available. |
| 136 | +4. Conditionally import platform-specific ops modules (e.g. only load `ops/attn/ascend_npu/` when `PLATFORM=ascend_npu`). |
| 137 | + |
| 138 | +Since `lightx2v/__init__.py` imports `lightx2v_platform.set_ai_device` at package load time, the platform is initialized automatically whenever LightX2V is imported. |
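
The first three steps of this flow can be sketched as follows. This is an illustration under stated assumptions, not the contents of `set_ai_device.py`: the function names `init_ai_device` and `check_ai_device` come from the text above, but their bodies, the `register_device` helper, the stub device classes, and `select_platform` are hypothetical. Step 4 (the conditional ops import) is omitted.

```python
import os

DEVICE_REGISTER = {}  # hypothetical stand-in for PLATFORM_DEVICE_REGISTER


def register_device(name):
    def decorator(cls):
        DEVICE_REGISTER[name] = cls
        return cls

    return decorator


@register_device("cuda")
class CudaDeviceStub:
    @staticmethod
    def is_available():
        return True  # stub: a real check would query the runtime

    @staticmethod
    def get_device():
        return "cuda"


@register_device("ascend_npu")
class AscendNpuDeviceStub:
    @staticmethod
    def is_available():
        return True  # stub

    @staticmethod
    def get_device():
        return "npu"


def init_ai_device(platform):
    # Step 2: look up the device class and resolve the PyTorch device string.
    if platform not in DEVICE_REGISTER:
        raise ValueError(f"Unknown PLATFORM {platform!r}; known: {sorted(DEVICE_REGISTER)}")
    return DEVICE_REGISTER[platform].get_device()


def check_ai_device(platform):
    # Step 3: fail early if the chip runtime is not usable.
    if not DEVICE_REGISTER[platform].is_available():
        raise RuntimeError(f"Platform {platform!r} is registered but not available")


def select_platform(environ=None):
    # Step 1: read PLATFORM, defaulting to "cuda" when it is unset.
    environ = os.environ if environ is None else environ
    platform = environ.get("PLATFORM", "cuda")
    ai_device = init_ai_device(platform)
    check_ai_device(platform)
    return platform, ai_device


npu_result = select_platform({"PLATFORM": "ascend_npu"})
default_result = select_platform({})
```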
| 139 | + |
| 140 | +### 4. Global Variables |
| 141 | + |
| 142 | +`base/global_var.py` exposes two module-level globals used throughout LightX2V: |
| 143 | + |
| 144 | +- `AI_DEVICE` — the PyTorch device string (e.g. `"cuda"`, `"npu"`, `"mlu"`, `"xpu"`). |
| 145 | +- `PLATFORM` — the platform identifier string (e.g. `"ascend_npu"`, `"cambricon_mlu"`). |
| 146 | + |
| 147 | +All tensor placement in LightX2V references `AI_DEVICE` instead of hardcoded `"cuda"`, enabling transparent multi-platform execution. |
| 148 | + |
| 149 | +--- |
| 150 | + |
| 151 | +## Supported Backends and Operator Coverage |
| 152 | + |
| 153 | +Currently supported backends: |
| 154 | + |
| 155 | +| Chip | `PLATFORM` Value | Device String | Distributed Backend | |
| 156 | +|---|---|---|---| |
| 157 | +| NVIDIA GPU | `cuda` (default) | `cuda` | NCCL | |
| 158 | +| Cambricon MLU590 | `cambricon_mlu` | `mlu` | CNCL | |
| 159 | +| MetaX C500 | `metax_cuda` | `cuda` | NCCL | |
| 160 | +| Hygon DCU | `hygon_dcu` | `cuda` | NCCL | |
| 161 | +| Huawei Ascend 910B | `ascend_npu` | `npu` | HCCL | |
| 162 | +| AMD ROCm (MI350) | `amd_rocm` | `cuda` | NCCL (RCCL) | |
| 163 | +| MThreads MUSA | `musa` | `musa` | MCCL | |
| 164 | +| Enflame S60 (GCU) | `enflame_gcu` | `gcu` | ECCL | |
| 165 | +| Intel AIPC PTL | `intel_xpu` | `xpu` | CCL | |
| 166 | +| Iluvatar | `iluvatar_cuda` | `cuda` | NCCL | |
| 167 | + |
| 168 | +Operator kernel coverage per platform (registered names that can be referenced in JSON configs): |
| 169 | + |
| 170 | +| Platform | Attention | Quantized MatMul | Normalization | RoPE | |
| 171 | +|---|---|---|---|---| |
| 172 | +| **cambricon_mlu** | `mlu_flash_attn`, `mlu_sage_attn` | `int8-tmo` | `mlu_rms_norm` | — | |
| 173 | +| **ascend_npu** | `npu_flash_attn` | `int8-npu` | — (use `torch`) | — | |
| 174 | +| **hygon_dcu** | `flash_attn_hygon_dcu` | `int8-vllm-hygon-dcu` | — | — | |
| 175 | +| **amd_rocm** | `aiter_attn` | via aiter compat layer | — | — | |
| 176 | +| **enflame_gcu** | `flash_attn_enflame_gcu` | — | `gcu_layer_norm` | `enflame_wan_rope` | |
| 177 | +| **intel_xpu** | `intel_xpu_flash_attn` | `intel_xpu_mm`, `intel_xpu_fp8` | — | — | |
| 178 | +| **iluvatar_cuda** | `iluvatar_flash_attn` | `int8-iluvatar` | `iluvatar_rms_norm` | `iluvatar_wan_rope` | |
| 179 | +| **metax_cuda** | `metax_sage_attn2` | — (default CUDA kernels) | — | — | |
| 180 | +| **musa** | — (fallback `torch_sdpa`) | — | — | — | |
| 181 | + |
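| 182 | +Platforms without a custom kernel for a given operator category can fall back to PyTorch-native implementations by setting the corresponding `*_type` field in the JSON config to a generic kernel name, such as `"torch"` for normalization or `torch_sdpa` for attention (as in the `musa` row above). |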
| 183 | + |
| 184 | +--- |
| 185 | + |
| 186 | +## How It Integrates with LightX2V |
| 187 | + |
| 188 | +The integration follows a clean three-step pattern: |
| 189 | + |
| 190 | +**Step 1 — Platform init at import time** |
| 191 | + |
| 192 | +```python |
| 193 | +# lightx2v/__init__.py |
| 194 | +import lightx2v_platform.set_ai_device # triggers device init + ops loading |
| 195 | +``` |
| 196 | + |
| 197 | +**Step 2 — Registry merge** |
| 198 | + |
| 199 | +Platform kernels are merged into LightX2V's main registries, so model code uses a single lookup path: |
| 200 | + |
| 201 | +```python |
| 202 | +# In model weight initialization (simplified) |
| 203 | +attn_cls = ATTN_WEIGHT_REGISTER[config["self_attn_1_type"]] |
| 204 | +self.self_attn = attn_cls() |
| 205 | +``` |
| 206 | + |
| 207 | +**Step 3 — Config-driven kernel selection** |
| 208 | + |
| 209 | +Each platform has dedicated JSON configs under `configs/platforms/` that specify which registered kernel to use. For example, Ascend NPU Wan2.1 T2V: |
| 210 | + |
| 211 | +```json |
| 212 | +{ |
| 213 | +    "self_attn_1_type": "npu_flash_attn", |
| 214 | +    "cross_attn_1_type": "npu_flash_attn", |
| 215 | +    "cross_attn_2_type": "npu_flash_attn", |
| 216 | +    "rms_norm_type": "torch", |
| 217 | +    "cpu_offload": true, |
| 218 | +    "offload_granularity": "model" |
| 219 | +} |
| 220 | +``` |
| 221 | + |
| 222 | +Cambricon MLU uses its own optimized kernels: |
| 223 | + |
| 224 | +```json |
| 225 | +{ |
| 226 | +    "self_attn_1_type": "mlu_sage_attn", |
| 227 | +    "cross_attn_1_type": "mlu_sage_attn", |
| 228 | +    "cross_attn_2_type": "mlu_sage_attn", |
| 229 | +    "rms_norm_type": "mlu_rms_norm" |
| 230 | +} |
| 231 | +``` |
| 232 | + |
| 233 | +This design means LightX2V features like **parallelism**, **offload**, and **disaggregated deployment** work on non-NVIDIA platforms without modification—the platform layer only replaces the compute kernels and device management underneath. |
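
Because kernels are selected purely by name, a config written for one platform can name a kernel that is not registered on another. A startup check can catch this before any weights are loaded. The sketch below is one hypothetical way to do it: the registry contents, the field-to-registry mapping, and `find_unregistered_kernels` are assumptions for illustration, not part of LightX2V.

```python
import json

# Hypothetical registry contents; the real registries hold kernel classes.
ATTN_KERNELS = {"npu_flash_attn", "torch_sdpa"}
RMS_KERNELS = {"torch"}

# Assumed mapping from a config field to the registry that must contain its value.
FIELD_TO_REGISTRY = {
    "self_attn_1_type": ATTN_KERNELS,
    "cross_attn_1_type": ATTN_KERNELS,
    "cross_attn_2_type": ATTN_KERNELS,
    "rms_norm_type": RMS_KERNELS,
}


def find_unregistered_kernels(config):
    """Return {field: value} for every kernel field naming an unregistered kernel."""
    return {
        field: config[field]
        for field, registry in FIELD_TO_REGISTRY.items()
        if field in config and config[field] not in registry
    }


# An Ascend-style config with one field accidentally copied from an MLU config.
config = json.loads("""
{
    "self_attn_1_type": "npu_flash_attn",
    "cross_attn_1_type": "npu_flash_attn",
    "cross_attn_2_type": "mlu_sage_attn",
    "rms_norm_type": "torch",
    "cpu_offload": true
}
""")

problems = find_unregistered_kernels(config)
```

Fields that do not select a kernel, such as `cpu_offload`, are ignored by the check.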
| 234 | + |
| 235 | +--- |
| 236 | + |
| 237 | +## Quick Start: Running on a Non-NVIDIA Platform |
| 238 | + |
| 239 | +Here is a minimal example for running Wan2.1 T2V on Ascend 910B: |
| 240 | + |
| 241 | +```bash |
| 242 | +# 1. Set platform and visible devices |
| 243 | +export PLATFORM=ascend_npu |
| 244 | +export ASCEND_RT_VISIBLE_DEVICES=0 |
| 245 | + |
| 246 | +# 2. Run inference with platform-specific config |
| 247 | +python -m lightx2v.infer \ |
| 248 | +    --model_cls wan2.1 \ |
| 249 | +    --task t2v \ |
| 250 | +    --model_path "$model_path" \ |
| 251 | +    --config_json configs/platforms/ascend_npu/wan_t2v.json \ |
| 252 | +    --prompt "Two anthropomorphic cats in comfy boxing gear..." \ |
| 253 | +    --save_result_path output.mp4 |
| 254 | +``` |
| 255 | + |
| 256 | +Key points: |
| 257 | + |
| 258 | +- Always set `PLATFORM` **before** importing LightX2V (or use the provided shell scripts that export it). |
| 259 | +- Use the matching config from `configs/platforms/<platform>/`. |
| 260 | +- Refer to `scripts/platforms/<platform>/` for complete, tested launch scripts covering Wan, Qwen-Image, Z-Image, and other models. |
| 261 | + |
| 262 | +--- |
| 263 | + |
| 264 | +## Porting a New Chip Backend |
| 265 | + |
| 266 | +Adding support for a new accelerator requires changes **only inside `lightx2v_platform`**. Here is the step-by-step workflow: |
| 267 | + |
| 268 | +### Step 1: Implement Device Abstraction |
| 269 | + |
| 270 | +Create `base/my_chip.py`: |
| 271 | + |
| 272 | +```python |
| 273 | +import torch.distributed as dist |
| 274 | +from lightx2v_platform.registry_factory import PLATFORM_DEVICE_REGISTER |
| 275 | +@PLATFORM_DEVICE_REGISTER("my_chip") |
| 276 | +class MyChipDevice: |
| 277 | +    name = "my_chip" |
| 278 | + |
| 279 | +    @staticmethod |
| 280 | +    def init_device_env(): |
| 281 | +        pass  # any chip-specific env setup |
| 282 | + |
| 283 | +    @staticmethod |
| 284 | +    def is_available() -> bool: |
| 285 | +        # check chip runtime is installed and hardware is present |
| 286 | +        ... |
| 287 | + |
| 288 | +    @staticmethod |
| 289 | +    def get_device() -> str: |
| 290 | +        return "my_device"  # PyTorch device string |
| 291 | + |
| 292 | +    @staticmethod |
| 293 | +    def init_parallel_env(): |
| 294 | +        dist.init_process_group(backend="my_backend") |
| 295 | +        ... |
| 296 | +``` |
| 297 | + |
| 298 | +Import the new module in `base/__init__.py` so that the `@PLATFORM_DEVICE_REGISTER` decorator runs and the device is registered. |
| 299 | + |
| 300 | +### Step 2: Implement Operator Kernels |
| 301 | + |
| 302 | +For each operator category the chip supports, create implementations under `ops/<category>/my_chip/`: |
| 303 | + |
| 304 | +``` |
| 305 | +ops/ |
| 306 | +├── attn/my_chip/flash_attn.py → @PLATFORM_ATTN_WEIGHT_REGISTER("my_chip_flash_attn") |
| 307 | +├── mm/my_chip/mm_weight.py → @PLATFORM_MM_WEIGHT_REGISTER("int8-my_chip") |
| 308 | +├── norm/my_chip/rms_norm.py → @PLATFORM_RMS_WEIGHT_REGISTER("my_chip_rms_norm") |
| 309 | +└── rope/my_chip/wan_rope.py → @PLATFORM_ROPE_REGISTER("my_chip_wan_rope") |
| 310 | +``` |
| 311 | + |
| 312 | +Each class inherits from the corresponding template and implements the `apply()` method using the vendor's kernel API. |
| 313 | + |
| 314 | +### Step 3: Register Ops Loading |
| 315 | + |
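| 316 | +Add a branch in `ops/__init__.py` that imports every operator category you implemented in Step 2 (the example below imports only `attn` and `mm`): |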
| 317 | + |
| 318 | +```python |
| 319 | +elif PLATFORM == "my_chip": |
| 320 | +    from .attn.my_chip import * |
| 321 | +    from .mm.my_chip import * |
| 322 | +``` |
| 323 | + |
| 324 | +### Step 4: Provide Config and Scripts |
| 325 | + |
| 326 | +- Add JSON configs under `configs/platforms/my_chip/`. |
| 327 | +- Add launch scripts under `scripts/platforms/my_chip/`. |
| 328 | +- Optionally add a Dockerfile under `dockerfiles/platforms/`. |
| 329 | + |
| 330 | +### Step 5: Test |
| 331 | + |
| 332 | +```bash |
| 333 | +PLATFORM=my_chip python lightx2v_platform/test/test_device.py |
| 334 | +# Then run a full inference with the platform config |
| 335 | +``` |
| 336 | + |
| 337 | +No changes to `lightx2v/` model code, runners, or schedulers are needed. |
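
Before running on real hardware, it can help to confirm that a new device class exposes the four static methods used in Step 1. The article does not show what `test_device.py` contains, so the following is a separate, hypothetical pre-check; `REQUIRED_METHODS` and `missing_device_methods` are names invented for this sketch.

```python
# Method names taken from the Step 1 device skeleton.
REQUIRED_METHODS = ("init_device_env", "is_available", "get_device", "init_parallel_env")


def missing_device_methods(device_cls):
    """Return the required method names that `device_cls` does not provide as callables."""
    return [
        name
        for name in REQUIRED_METHODS
        if not callable(getattr(device_cls, name, None))
    ]


class MyChipDevice:
    """Deliberately incomplete device class: `init_parallel_env` is not implemented yet."""

    name = "my_chip"

    @staticmethod
    def init_device_env():
        pass

    @staticmethod
    def is_available() -> bool:
        return False  # stub: no real hardware in this sketch

    @staticmethod
    def get_device() -> str:
        return "my_device"


missing = missing_device_methods(MyChipDevice)
```

Here `missing` lists the one method the stub leaves out, which is the kind of gap that would otherwise only surface when a multi-device run calls it.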
| 338 | + |
| 339 | +--- |
| 340 | + |
| 341 | +## Resources |
| 342 | + |
| 343 | +- **Platform module**: [`LightX2V/lightx2v_platform`](https://github.com/ModelTC/LightX2V/tree/main/lightx2v_platform) |
| 344 | +- **Docker environments**: [`dockerfiles/platforms`](https://github.com/ModelTC/LightX2V/tree/main/dockerfiles/platforms) |
| 345 | +- **Launch scripts**: [`scripts/platforms`](https://github.com/ModelTC/LightX2V/tree/main/scripts/platforms) |
| 346 | +- **Platform configs**: [`configs/platforms`](https://github.com/ModelTC/LightX2V/tree/main/configs/platforms) |
| 347 | + |
| 348 | +`lightx2v_platform` turns multi-chip deployment from a cross-cutting refactor into a localized, registry-driven extension problem. Whether you are running on Cambricon MLU in a data center, Ascend NPU in a cloud cluster, or Intel XPU on a laptop, the same LightX2V pipeline code path applies—you just point `PLATFORM` at the right backend and select the matching config. |