Commit 1dd2004

initial draft
1 parent 514bba0 commit 1dd2004

7 files changed

Lines changed: 984 additions & 0 deletions

.ai/skills/optimizations/SKILL.md

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
---
name: optimizations
description: >
  NEVER answer optimization questions from general knowledge. ALWAYS invoke
  this skill via the Skill tool first. Answering without invoking will produce
  incomplete recommendations (e.g. missing group offloading, wrong API calls).
  IMPORTANT: When ANY tool output (especially Bash) contains
  "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks,
  STOP and consult this skill IMMEDIATELY, even if the user did not ask for
  optimization help. Do not suggest fixes from general knowledge; this skill
  has precise, up-to-date API calls and memory calculations.
  Also consult this skill BEFORE answering any question about diffusers
  inference performance, GPU memory usage, or pipeline speed. Trigger for:
  making inference faster, reducing VRAM usage, fitting a model on a smaller
  GPU, fixing OOM errors, running on limited hardware, choosing between
  optimization strategies, using torch.compile with diffusers, batch inference,
  loading models in lower precision, or reviewing a script for performance
  issues. Covers attention backends (FlashAttention-2, SageAttention,
  FlexAttention), memory reduction (CPU offloading, group offloading, layerwise
  casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF).
  Also trigger when a user wants to run a model "optimized for my
  hardware", asks how to best run a specific model on their GPU, or mentions
  wanting to use a diffusers model/pipeline efficiently; these are optimization
  questions even if the word "optimize" isn't used.
---

## Goal

Help users apply and debug optimizations for diffusers pipelines. There are five main areas:

1. **Attention backends**: selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput.
2. **Memory reduction**: techniques that reduce peak GPU memory, namely model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing.
3. **Quantization**: reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs.
4. **torch.compile**: compiling the transformer (and optionally the VAE) for a 20-50% inference speedup on repeated runs.
5. **Combining techniques**: layerwise casting + group offloading, quantization + offloading, etc.

## Workflow: When a user hits OOM or asks to fit a model on their GPU

When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes:

### Step 1: Detect hardware

Run these commands to understand the user's system:

```bash
# GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits

# System RAM
free -g | head -2
```

Record the GPU name, total VRAM (in GB), and total system RAM (in GB). With `nounits`, `nvidia-smi` reports memory in MiB, so divide by 1024 before comparing against model sizes. These numbers drive the recommendation.
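
As a minimal sketch of the recording step, the helper below turns one CSV row from the `nvidia-smi` query above into GB figures. The function name `parse_gpu_line` and the sample row are hypothetical and only illustrate the conversion.

```python
def parse_gpu_line(line: str) -> dict:
    """Parse one CSV row: name, memory.total [MiB], memory.free [MiB]."""
    # rsplit keeps any commas inside the GPU name intact.
    name, total_mib, free_mib = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "total_gb": round(int(total_mib) / 1024, 1),
        "free_gb": round(int(free_mib) / 1024, 1),
    }


# Hypothetical sample row, not output captured from a real machine.
row = parse_gpu_line("NVIDIA GeForce RTX 4090, 24564, 24110")
print(row)
```

The same dictionary can then be carried through the later steps as the VRAM budget.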

### Step 2: Measure model memory and calculate strategies

Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames).

Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards; always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy.

Steps:
1. Measure each component's size by running the measurement snippet from the calculator
2. Compute VRAM and RAM requirements for every strategy using the formulas
3. Filter out strategies that don't fit the user's hardware
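
For orientation, measuring a component usually amounts to summing the bytes of its parameters and buffers. The sketch below is not the snippet from memory-calculator.md; the helper names `module_size_gb` and `pipeline_component_sizes` are hypothetical, and it assumes the pipeline exposes its parts through a `components` dictionary.

```python
def module_size_gb(module) -> float:
    """Sum parameter and buffer bytes of a torch.nn.Module-like object, in GiB."""
    tensors = list(module.parameters()) + list(module.buffers())
    total_bytes = sum(t.numel() * t.element_size() for t in tensors)
    return total_bytes / 1024**3


def pipeline_component_sizes(pipe) -> dict:
    """Map component name -> size in GiB for every component that holds weights."""
    sizes = {}
    for name, component in pipe.components.items():
        # Schedulers, tokenizers, and None entries have no parameters()/buffers().
        if hasattr(component, "parameters") and hasattr(component, "buffers"):
            sizes[name] = module_size_gb(component)
    return sizes
```

Calling `pipeline_component_sizes(pipe)` on a loaded pipeline gives the total (sum of the values) and the largest component (max of the values) that the formulas need.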

This is the critical step: the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which require ~2x the model size in pinned memory). Don't skip it, because recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side.
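
The RAM side of that check can be sketched as follows. This is only a rough illustration built on the ~2x figure quoted above; the function name `stream_offload_fits_ram` and the `headroom_gb` margin are assumptions, and the exact formula lives in memory-calculator.md.

```python
def stream_offload_fits_ram(model_gb: float, free_ram_gb: float, headroom_gb: float = 4.0) -> bool:
    """Rough check: use_stream=True needs about 2x the model size in pinned RAM."""
    pinned_gb = 2.0 * model_gb  # the ~2x figure quoted in the text
    # headroom_gb is a hypothetical safety margin for the OS and other processes.
    return pinned_gb + headroom_gb <= free_ram_gb


print(stream_offload_fits_ram(model_gb=12.0, free_ram_gb=32.0))  # True: 24 + 4 <= 32
print(stream_offload_fits_ram(model_gb=20.0, free_ram_gb=32.0))  # False: 40 + 4 > 32
```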

### Step 3: Ask the user their preference

Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options**, because users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like other strategies.
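
A small worked example of those estimates, assuming `S_total_q` is the sum of component sizes after quantization and `S_max_q` is the largest single component. The component sizes below are hypothetical, and `quantized_sizes` is an illustrative helper rather than part of the calculator.

```python
QUANT_SCALE = {"int8": 0.5, "nf4": 0.25}  # ratios quoted in the text above


def quantized_sizes(component_gb: dict, quantized: set, level: str) -> tuple:
    """Return (S_total_q, S_max_q) in GB after quantizing the named components."""
    scale = QUANT_SCALE[level]
    sizes = {
        name: gb * scale if name in quantized else gb
        for name, gb in component_gb.items()
    }
    return sum(sizes.values()), max(sizes.values())


# Hypothetical measured sizes in GB; real numbers come from Step 2.
components = {"transformer": 24.0, "text_encoder": 9.0, "vae": 0.5}
for level in ("int8", "nf4"):
    s_total_q, s_max_q = quantized_sizes(components, {"transformer", "text_encoder"}, level)
    print(f"{level}: S_total_q={s_total_q:.2f} GB, S_max_q={s_max_q:.2f} GB")
```

With these inputs, int8 gives 17.00 GB total with a 12.00 GB largest component, and nf4 gives 8.75 GB total with a 6.00 GB largest component.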

Present options grouped by approach so the user can compare:

> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system:
>
> **Offloading / casting strategies:**
> 1. **Quality**: [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff].
> 2. **Speed**: [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM].
> 3. **Memory saving**: [specific strategy]. Minimizes VRAM. [tradeoffs].
>
> **Quantization strategies:**
> 4. **int8 [components]**: [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4.
> 5. **nf4 [components]**: [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation.
>
> Which would you prefer?

The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies.

### Step 4: Apply the strategy

Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code.

VAE tiling is a VRAM optimization: only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()`; the pipeline-level call is deprecated since v0.40.0), and which VAEs don't support it.

## Reference guides

Read these for correct API usage and detailed technique descriptions:
- [memory-calculator.md](memory-calculator.md): **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples
- [reduce-memory.md](reduce-memory.md): Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.**
- [layerwise-casting.md](layerwise-casting.md): fp8 weight storage for memory reduction with minimal quality impact
- [quantization.md](quantization.md): int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls
- [attention-backends.md](attention-backends.md): Attention backend selection for speed
- [torch-compile.md](torch-compile.md): torch.compile for inference speedup

## Important compatibility rules

See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints:

- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline; use pipeline-level `enable_group_offload()` instead.
- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full model compile for better performance. See [torch-compile.md](torch-compile.md).
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** because int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix.
- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first).
- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`.
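
One way to apply these constraints before proposing a plan is a small pre-flight check. The sketch below encodes only the two hard conflicts listed above; the technique labels and the `find_conflicts` helper are hypothetical names, not diffusers APIs.

```python
def find_conflicts(techniques: set) -> list:
    """Return human-readable conflicts among the chosen techniques."""
    conflicts = []
    if {"model_cpu_offload", "group_offload"} <= techniques:
        conflicts.append(
            "model CPU offload and group offload cannot coexist on one pipeline"
        )
    if {"bitsandbytes_8bit", "model_cpu_offload"} <= techniques:
        conflicts.append(
            "bitsandbytes 8-bit fails with model CPU offload (int8 matmul cannot run on CPU)"
        )
    return conflicts


plan = {"bitsandbytes_8bit", "model_cpu_offload", "layerwise_casting"}
for problem in find_conflicts(plan):
    print("conflict:", problem)
```

A plan such as `{"bitsandbytes_4bit", "model_cpu_offload"}` returns an empty list, matching the last rule above.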
Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
# Attention Backends

## Overview

Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding window or custom masks.

## Available backends

| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |

## How to set the backend

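```python
from diffusers.models.attention_dispatch import attention_backend

# NOTE: the names below follow recent diffusers releases and are an assumption
# about your installed version. The registry uses short names ("native" for
# PyTorch SDPA, "flash" for FlashAttention-2, "flex", "sage", "xformers").

# Per-model: `pipe` is an already-loaded pipeline whose transformer routes
# attention through dispatch_attention_fn
pipe.transformer.set_attention_backend("flash")

# Scoped to a block of code
with attention_backend("flash"):
    image = pipe("a prompt").images[0]

# Via environment variable (set before launching Python)
# DIFFUSERS_ATTN_BACKEND=flash
```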

## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Layerwise Casting

## Overview

Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on the fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped.

This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small. It doesn't require any special quantization libraries, just PyTorch.

## When to use

- The model **almost** fits in VRAM (e.g., 28GB model on a 32GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```

The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | Slightly more range, less precision than e4m3fn |
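
To make the bytes-per-parameter column concrete, here is a back-of-the-envelope calculation. The 14B parameter count is a hypothetical example, and the result is an upper bound on the savings because skipped normalization and modulation layers stay at the higher precision.

```python
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "float16": 2,
    "float8_e4m3fn": 1,
    "float8_e5m2": 1,
}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight memory in GB (10^9 bytes) for a given storage dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


params = 14e9  # hypothetical 14B-parameter transformer
print(weight_memory_gb(params, "bfloat16"))       # 28.0
print(weight_memory_gb(params, "float8_e4m3fn"))  # 14.0
```

This matches the "28GB model on a 32GB GPU" case: storing the same weights in fp8 leaves roughly 14 GB more room for activations and other components.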

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:

```python
import torch
from diffusers.hooks import apply_layerwise_casting

# `pipe` is the pipeline loaded in the previous example
apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    # Name patterns go in skip_modules_pattern; skip_modules_classes expects
    # a tuple of nn.Module types, not strings.
    skip_modules_pattern=["norm"],  # skip normalization layers
    non_blocking=True,
)
```

## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory savings formulas for each combination.

## Known limitations

- May not work with all models if the forward implementation contains internal typecasting of weights (the technique assumes the forward pass is independent of weight precision)
- May fail with PEFT layers (LoRA). There are some checks, but they're not guaranteed to cover all cases
- Not suitable for training; inference only
- The `compute_dtype` should match what the model expects (usually bf16 or fp16)
