To generate images, run the following command:
* For Wan2.2 T2V, use `base_wan_27b.yml`.
* For Wan2.2 I2V, use `base_wan_i2v_27b.yml`.
### Ulysses Attention
MaxDiffusion supports Ulysses attention for WAN TPU inference. Enable it by setting `attention="ulysses"`.
Internally, this follows the Ulysses sequence-parallel attention pattern and trades sequence shards for head shards around the local TPU splash kernel. For background, see [DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models](https://arxiv.org/abs/2309.14509).
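The head-for-sequence trade can be illustrated with a toy NumPy simulation (illustrative only, not the MaxDiffusion implementation): `P` simulated devices each start with a sequence shard of the attention input, and an all-to-all exchange leaves each device holding the full sequence for `H/P` of the heads, so the local attention kernel never sees a sharded sequence.

```python
import numpy as np

# Toy Ulysses resharding with P simulated devices (no TPU needed).
# Global tensor: [seq, heads, head_dim]; initially sharded over seq.
P, S, H, D = 4, 8, 8, 2
x = np.arange(S * H * D).reshape(S, H, D)

# Each "device" starts with a sequence shard of shape [S/P, H, D].
seq_shards = [x[i * (S // P):(i + 1) * (S // P)] for i in range(P)]

# All-to-all: device i sends head block j of its sequence shard to device j,
# and receives the full sequence for its own head block in return.
head_shards = []
for j in range(P):
    blocks = [shard[:, j * (H // P):(j + 1) * (H // P)] for shard in seq_shards]
    head_shards.append(np.concatenate(blocks, axis=0))  # [S, H/P, D]

# After the exchange, each device holds the unsharded sequence for H/P heads.
assert head_shards[0].shape == (S, H // P, D)
assert np.array_equal(head_shards[2], x[:, 4:6])
```

A mirror-image all-to-all after the attention kernel restores the original sequence sharding for the rest of the network.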
To enable Ulysses attention, set the corresponding override in your config YAML or pass it as a command-line override:
```bash
python src/maxdiffusion/generate_wan.py \
src/maxdiffusion/configs/base_wan_i2v_27b.yml \
attention="ulysses" \
ici_context_parallelism=4 \
...
```
Ulysses requires `ici_context_parallelism` greater than 1, and the number of attention heads must be divisible by the context shard count. `flash_block_sizes` remains optional and can still be used for hardware-specific tuning.
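The two constraints above can be sketched as a small validation helper (the function and argument names here are illustrative, not part of the MaxDiffusion API):

```python
def check_ulysses_config(num_heads: int, ici_context_parallelism: int) -> None:
    """Hypothetical check mirroring the documented Ulysses requirements."""
    if ici_context_parallelism <= 1:
        raise ValueError("Ulysses attention requires ici_context_parallelism > 1.")
    if num_heads % ici_context_parallelism != 0:
        raise ValueError(
            f"Number of attention heads ({num_heads}) must be divisible by "
            f"the context shard count ({ici_context_parallelism})."
        )

# 40 heads split cleanly across 4 context shards (10 heads per shard).
check_ulysses_config(num_heads=40, ici_context_parallelism=4)
```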
In our Wan2.2 I2V benchmarks at 40 inference steps, 81 frames, and `720x1280` resolution, Ulysses improved inference time by roughly 10% compared with flash attention, with about 20 s lower latency on the v6e-8 and v7x-8 TPU setups.
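As a rough consistency check on these figures (not an additional measurement), a ~20 s absolute saving at a ~10% relative speedup implies a flash-attention baseline on the order of 200 s per generation:

```python
# Back-of-the-envelope check of the reported benchmark numbers.
saving_s = 20.0   # ~20 s lower latency with Ulysses
speedup = 0.10    # ~10% relative improvement

baseline_s = saving_s / speedup       # implied flash-attention baseline
ulysses_s = baseline_s - saving_s     # implied Ulysses latency
```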
### Caching Mechanisms
Wan 2.x pipelines support several caching strategies to accelerate inference by skipping redundant transformer forward passes. These are **mutually exclusive** — enable only one at a time.
This script will automatically format your code with `pyink` and help you identify…
The full suite of end-to-end tests is in `tests` and `src/maxdiffusion/tests`. We run them on a nightly cadence.
## Profiling
To learn how to enable ML Diagnostics and XProf profiling for your runs, please see our [ML Diagnostics Guide](docs/profiling.md).