To generate images, run the following command:
* For Wan2.2 T2V, use `base_wan_27b.yml`.
* For Wan2.2 I2V, use `base_wan_i2v_27b.yml`.

### Ulysses Attention

MaxDiffusion supports Ulysses attention for WAN TPU inference. Enable it by setting `attention="ulysses"`.

Internally, this follows the Ulysses sequence-parallel attention pattern and trades sequence shards for head shards around the local TPU splash kernel. For background, see [DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models](https://arxiv.org/abs/2309.14509).
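To make the "sequence shards for head shards" trade concrete, here is a minimal single-process simulation of that exchange (an illustrative sketch, not MaxDiffusion's implementation): each of `P` simulated devices starts with a sequence shard of shape `(S/P, H, D)` and, after the exchange, holds every sequence position for `H/P` heads, so a local attention kernel sees the full sequence.

```python
import numpy as np

def ulysses_all_to_all(shards):
    """Simulate the Ulysses exchange: sequence-sharded -> head-sharded.

    `shards` is a list of P arrays of shape (S/P, H, D); the result is a
    list of P arrays of shape (S, H/P, D). Hypothetical helper for
    illustration only.
    """
    P = len(shards)
    H = shards[0].shape[1]
    out = []
    for p in range(P):  # device p gathers its head slice from every peer
        h0, h1 = p * H // P, (p + 1) * H // P
        out.append(np.concatenate([s[:, h0:h1, :] for s in shards], axis=0))
    return out

S, H, D, P = 8, 4, 2, 4
q = np.arange(S * H * D, dtype=np.float32).reshape(S, H, D)
seq_shards = np.split(q, P, axis=0)           # sequence-parallel layout
head_shards = ulysses_all_to_all(seq_shards)  # head-parallel layout
assert head_shards[0].shape == (S, H // P, D)
assert np.array_equal(np.concatenate(head_shards, axis=1), q)  # lossless
```

In a real sharded implementation this exchange is typically a `jax.lax.all_to_all` over the context-parallel mesh axis, applied to Q/K/V before the attention kernel and inverted on the output afterwards.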
To enable Ulysses attention, set the corresponding override in your config YAML or pass it as a command-line override:

```bash
python src/maxdiffusion/generate_wan.py \
  src/maxdiffusion/configs/base_wan_i2v_27b.yml \
  attention="ulysses" \
  ici_context_parallelism=4 \
  ...
```

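Equivalently, the overrides can live in the config YAML itself. A sketch (the key names mirror the CLI overrides above; their exact placement within the file is not shown here):

```yaml
# Sketch of the equivalent YAML overrides; keys mirror the CLI flags above.
attention: "ulysses"
ici_context_parallelism: 4
```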
Ulysses requires `ici_context_parallelism` greater than 1, and the number of attention heads must be divisible by the context shard count. `flash_block_sizes` is optional and can still be used for hardware-specific tuning.
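The two constraints above can be expressed as a small pre-flight check. This is a hypothetical helper, not part of MaxDiffusion; the argument names follow the config keys, and the head count used below is only an example value:

```python
def validate_ulysses(num_attention_heads: int, ici_context_parallelism: int) -> None:
    """Illustrative check of the Ulysses sharding constraints."""
    if ici_context_parallelism <= 1:
        raise ValueError("Ulysses attention requires ici_context_parallelism > 1")
    if num_attention_heads % ici_context_parallelism != 0:
        raise ValueError(
            f"{num_attention_heads} attention heads are not divisible by "
            f"{ici_context_parallelism} context shards"
        )

validate_ulysses(40, 4)  # e.g. 40 heads across 4 context shards is valid
```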
In our Wan2.2 I2V benchmarks at 40 inference steps, 81 frames, and `720x1280` resolution, Ulysses improved inference time by roughly 10% compared with flash attention, corresponding to about 20 s lower latency on the v6e-8 and v7x-8 TPU setups.

### Caching Mechanisms
Wan 2.x pipelines support several caching strategies to accelerate inference by skipping redundant transformer forward passes. These are **mutually exclusive** — enable only one at a time.
This script will automatically format your code with `pyink` and help you identify errors.
The full suite of end-to-end tests is in `tests` and `src/maxdiffusion/tests`. We run them on a nightly cadence.
## Profiling
To learn how to enable ML Diagnostics and XProf profiling for your runs, please see our [ML Diagnostics Guide](docs/profiling.md).