
Commit bd7a7a0

Optimize Helios docs (#13222)

optimize helios docs

1 parent 9254417, commit bd7a7a0

File tree

1 file changed: +3 −4 lines changed


docs/source/en/api/pipelines/helios.md

Lines changed: 3 additions & 4 deletions
````diff
@@ -44,7 +44,7 @@ The example below demonstrates how to generate a video from text optimized for m

 Refer to the [Reduce memory usage](../../optimization/memory) guide for more details about the various memory saving techniques.

-The Helios model below requires ~19GB of VRAM.
+The Helios model below requires ~6GB of VRAM.

 ```py
 import torch
````
````diff
@@ -63,8 +63,7 @@ pipeline = HeliosPipeline.from_pretrained(
 pipeline.enable_group_offload(
     onload_device=torch.device("cuda"),
     offload_device=torch.device("cpu"),
-    offload_type="block_level",
-    num_blocks_per_group=1,
+    offload_type="leaf_level",
     use_stream=True,
     record_stream=True,
 )
````
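For readers skimming the diff: the change switches group offloading from block-level to leaf-level granularity, so weights are on/offloaded per leaf module rather than per block group. Below is a minimal sketch of the leaf-level idea in plain PyTorch; the `enable_leaf_offload` helper and hook names are illustrative, not the diffusers implementation, and both devices are `cpu` here so the sketch runs anywhere (on real hardware the compute device would be `cuda`):

```python
import torch
import torch.nn as nn

def enable_leaf_offload(model: nn.Module, compute: torch.device, offload: torch.device):
    # Leaf-level offloading: each leaf module with parameters is moved to the
    # compute device just before its forward pass and back to the offload
    # device right after, so at most one leaf's weights occupy compute memory.
    for module in model.modules():
        is_leaf = next(module.children(), None) is None
        has_params = any(True for _ in module.parameters(recurse=False))
        if is_leaf and has_params:
            module.to(offload)

            def pre_hook(mod, args):
                mod.to(compute)  # onload just-in-time

            def post_hook(mod, args, output):
                mod.to(offload)  # offload immediately after use
                return output

            module.register_forward_pre_hook(pre_hook)
            module.register_forward_hook(post_hook)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
enable_leaf_offload(model, torch.device("cpu"), torch.device("cpu"))
out = model(torch.randn(2, 8))
print(out.shape)  # torch.Size([2, 4])
```

Leaf-level offloading trades more transfer overhead for a smaller peak footprint than block-level grouping, which is consistent with the VRAM drop noted earlier in this diff.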
````diff
@@ -97,7 +96,7 @@ export_to_video(output, "helios_base_t2v_output.mp4", fps=24)
 </hfoption>
 <hfoption id="inference speed">

-[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Attention Backends](../../optimization/attention_backends) such as FlashAttention and SageAttention can significantly increase speed by optimizing the computation of the attention mechanism. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.
+[Compilation](../../optimization/fp16#torchcompile) is slow the first time but subsequent calls to the pipeline are faster. [Attention Backends](../../optimization/attention_backends) such as FlashAttention and SageAttention can significantly increase speed by optimizing the computation of the attention mechanism. [Context Parallelism](../../training/distributed_inference#context-parallelism) splits the input sequence across multiple devices to enable processing of long contexts in parallel, reducing memory pressure and latency. [Caching](../../optimization/cache) may also speed up inference by storing and reusing intermediate outputs.

 ```py
 import torch
````

0 commit comments
