113 changes: 113 additions & 0 deletions .ai/skills/optimizations/SKILL.md
@@ -0,0 +1,113 @@
---
name: optimizations
description: >
  NEVER answer optimization questions from general knowledge — ALWAYS invoke
  this skill via the Skill tool first. Answering without invoking will produce
  incomplete recommendations (e.g. missing group offloading, wrong API calls).
  IMPORTANT: When ANY tool output (especially Bash) contains
  "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks,
  STOP and consult this skill IMMEDIATELY — even if the user did not ask for
  optimization help. Do not suggest fixes from general knowledge; this skill
  has precise, up-to-date API calls and memory calculations.
  Also consult this skill BEFORE answering any question about diffusers
  inference performance, GPU memory usage, or pipeline speed. Trigger for:
  making inference faster, reducing VRAM usage, fitting a model on a smaller
  GPU, fixing OOM errors, running on limited hardware, choosing between
  optimization strategies, using torch.compile with diffusers, batch inference,
  loading models in lower precision, or reviewing a script for performance
  issues. Covers attention backends (FlashAttention-2, SageAttention,
  FlexAttention), memory reduction (CPU offloading, group offloading, layerwise
  casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF).
  Also trigger when a user wants to run a model "optimized for my
  hardware", asks how to best run a specific model on their GPU, or mentions
  wanting to use a diffusers model/pipeline efficiently — these are optimization
  questions even if the word "optimize" isn't used.
---

## Goal

Help users apply and debug optimizations for diffusers pipelines. There are five main areas:

1. **Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput.
2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing.
3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs.
4. **torch.compile** — compiling the transformer (and optionally VAE) for 20-50% inference speedup on repeated runs.
5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc.

## Workflow: When a user hits OOM or asks to fit a model on their GPU

When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes:

### Step 1: Detect hardware

Run these commands to understand the user's system:

```bash
# GPU VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits

# System RAM
free -g | head -2
```

Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation.
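
If it is more convenient to collect the same numbers from Python, the sketch below parses the CSV line printed by the `nvidia-smi` query above and reads total system RAM through `os.sysconf`. The helper names (`parse_gpu_csv`, `detect_gpus`, `total_ram_gb`) are illustrative and not part of diffusers. The RAM helper assumes a Linux host, and the parser assumes `nvidia-smi` reports memory in MiB when `nounits` is set.

```python
import os
import subprocess


def parse_gpu_csv(line):
    """Parse one 'name, memory.total, memory.free' CSV line (values in MiB)."""
    # rsplit keeps GPU names that themselves contain commas intact.
    name, total_mib, free_mib = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "total_gb": int(total_mib) / 1024,
        "free_gb": int(free_mib) / 1024,
    }


def detect_gpus():
    """Run the nvidia-smi query from Step 1 and return one dict per GPU."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.total,memory.free",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_csv(line) for line in out.splitlines() if line.strip()]


def total_ram_gb():
    """Total physical RAM in GiB (assumes Linux sysconf names are available)."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
```

Calling `detect_gpus()` and `total_ram_gb()` once at the start gives the VRAM and RAM figures that the later steps compare against.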

### Step 2: Measure model memory and calculate strategies

Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames).

Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy.

Steps:
1. Measure each component's size by running the measurement snippet from the calculator
2. Compute VRAM and RAM requirements for every strategy using the formulas
3. Filter out strategies that don't fit the user's hardware

This is the critical step: the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which require ~2x model size in pinned memory). Don't skip it, because recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side.
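
To illustrate what "measuring" means here, the following is a minimal sketch and not the calculator's own snippet, which remains the reference. It sums the bytes of every parameter and buffer of a component and applies the ~2x pinned-memory rule mentioned above. The function names are hypothetical. The code only relies on `parameters()`, `buffers()`, `numel()` and `element_size()`, which `torch.nn.Module` and `torch.Tensor` provide.

```python
def component_size_gb(module):
    """Size of a torch.nn.Module-like object in GiB (parameters + buffers)."""
    tensors = list(module.parameters()) + list(module.buffers())
    total_bytes = sum(t.numel() * t.element_size() for t in tensors)
    return total_bytes / 1024**3


def pinned_ram_for_streams_gb(model_gb, factor=2.0):
    """Rough pinned-RAM need with use_stream=True, using the ~2x rule above."""
    return factor * model_gb


# Usage on a loaded pipeline (assumes `pipe` and `torch` exist in the session):
# for name, comp in pipe.components.items():
#     if isinstance(comp, torch.nn.Module):
#         print(name, round(component_size_gb(comp), 2), "GiB")
```

Measuring the loaded modules instead of multiplying parameter counts also accounts for buffers and for components loaded in a different dtype from the rest of the pipeline.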

### Step 3: Ask the user their preference

Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like other strategies.
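
As a worked example of the quantization estimate, the sketch below scales the measured size of the quantized components by the factors quoted above and returns the new total and the new largest component. It assumes that `S_total_q` is the sum of all component sizes after quantization, that `S_max_q` is the largest single component, and that the factors apply to sizes measured in 16-bit precision. The component sizes are made-up numbers.

```python
# Factors quoted above, relative to the measured component size.
QUANT_FACTOR = {"int8": 0.5, "nf4": 0.25, "int4": 0.25}


def quantized_sizes(component_gb, quantized, level):
    """Return (S_total_q, S_max_q) after quantizing the named components."""
    factor = QUANT_FACTOR[level]
    sizes = {
        name: gb * factor if name in quantized else gb
        for name, gb in component_gb.items()
    }
    return sum(sizes.values()), max(sizes.values())


# Hypothetical measurements in GB.
measured = {"transformer": 24.0, "text_encoder": 8.0, "vae": 0.5}
s_total_nf4, s_max_nf4 = quantized_sizes(measured, {"transformer"}, "nf4")
s_total_int8, s_max_int8 = quantized_sizes(measured, {"transformer"}, "int8")
```

In this example nf4 shrinks the transformer enough that the unquantized text encoder becomes the largest component, which is one reason to consider quantizing more than one component.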

Present options grouped by approach so the user can compare:

> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system:
>
> **Offloading / casting strategies:**
> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff].
> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM].
> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs].
>
> **Quantization strategies:**
> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4.
> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation.
>
> Which would you prefer?

The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies.
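
The validation step can be sketched as a simple filter. The `Strategy` record and the `headroom_gb` safety margin are assumptions made for this illustration. The VRAM and RAM estimates are expected to come from the calculator formulas, and the numbers below are invented.

```python
from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    vram_gb: float  # estimated peak VRAM from the calculator formulas
    ram_gb: float   # estimated system RAM, including any pinned memory


def strategies_that_fit(strategies, vram_gb, ram_gb, headroom_gb=1.0):
    """Keep only strategies whose estimates fit the detected hardware."""
    return [
        s for s in strategies
        if s.vram_gb + headroom_gb <= vram_gb and s.ram_gb <= ram_gb
    ]


# Hypothetical estimates for a 24 GB GPU with 32 GB of system RAM.
candidates = [
    Strategy("no optimization", vram_gb=32.5, ram_gb=4.0),
    Strategy("model CPU offload", vram_gb=25.0, ram_gb=33.0),
    Strategy("group offload with streams", vram_gb=6.0, ram_gb=65.0),
    Strategy("nf4 transformer", vram_gb=14.5, ram_gb=4.0),
]
viable = strategies_that_fit(candidates, vram_gb=24.0, ram_gb=32.0)
```

Only the entries in `viable` would be presented to the user.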

### Step 4: Apply the strategy

Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code.

VAE tiling is a VRAM optimization — only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()` — pipeline-level is deprecated since v0.40.0), and which VAEs don't support it.

## Reference guides

Read these for correct API usage and detailed technique descriptions:
- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples
- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.**
- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact
- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls
- [attention-backends.md](attention-backends.md) — Attention backend selection for speed
- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup

## Important compatibility rules

See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints:

- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead.
- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full model compile for better performance. See [torch-compile.md](torch-compile.md).
- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix.
- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first).
- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`.
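
The constraints above can be encoded as a small lookup for checking a proposed combination before writing any code. This is a sketch: the technique labels are made-up strings for this example, not diffusers identifiers, and [reduce-memory.md](reduce-memory.md) remains the authoritative list.

```python
# Pairs listed above as incompatible, with the stated reason.
CONFLICTS = {
    frozenset({"model_cpu_offload", "group_offload"}):
        "cannot coexist on the same pipeline",
    frozenset({"bitsandbytes_8bit", "model_cpu_offload"}):
        "int8 matmul cannot run on CPU",
}


def find_conflicts(techniques):
    """Return the reasons for every known-bad pair in the chosen techniques."""
    chosen = set(techniques)
    return [reason for pair, reason in CONFLICTS.items() if pair <= chosen]
```

An empty result means that none of the listed rules is violated. It does not prove that the combination works.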
40 changes: 40 additions & 0 deletions .ai/skills/optimizations/attention-backends.md
@@ -0,0 +1,40 @@
# Attention Backends

## Overview

Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding window or custom masks.

## Available backends

| Backend | Key requirement | Best for |
|---|---|---|
| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels |
| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput |
| `xformers` | `xformers` package | Older GPUs, memory-efficient attention |
| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns |
| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed |

## How to set the backend

```python
from diffusers import set_attention_backend
from diffusers.models.attention_processor import AttnProcessor2_0

# Global default
set_attention_backend("flash_attention_2")

# Per-model (assumes `pipe` is an already-loaded pipeline)
pipe.transformer.set_attn_processor(AttnProcessor2_0())  # torch_sdpa

# Via environment variable
# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2
```

## Debugging attention issues

- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions.
- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel.
- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory.

## Implementation notes

- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically.
- See the attention pattern in the `model-integration` skill for how to implement this in new models.
68 changes: 68 additions & 0 deletions .ai/skills/optimizations/layerwise-casting.md
@@ -0,0 +1,68 @@
# Layerwise Casting

## Overview

Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on-the-fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped.

This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small — it doesn't require any special quantization libraries, just PyTorch.

## When to use

- The model **almost** fits in VRAM (e.g., 28GB model on a 32GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```

The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | Slightly more range, less precision than e4m3fn |
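
The per-parameter figures in the table translate into weight memory as follows. This is back-of-the-envelope arithmetic: the 14-billion parameter count is a hypothetical value chosen to match the 28GB example above, and the fp8 figure is a lower bound because the skipped normalization and modulation layers stay in the higher precision.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Weight memory in GB (10**9 bytes) for a given storage width."""
    return num_params * bytes_per_param / 1e9


num_params = 14_000_000_000  # hypothetical model size
bf16_gb = weight_memory_gb(num_params, 2)  # 2 bytes per parameter
fp8_gb = weight_memory_gb(num_params, 1)   # 1 byte per parameter
saved_gb = bf16_gb - fp8_gb
```

Activations and the temporary upcast weights of the layer being executed come on top of this, so peak VRAM sits somewhat above the fp8 figure.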

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:

```python
import torch
from diffusers.hooks import apply_layerwise_casting

apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    # Regex patterns matched against module names; skips normalization layers.
    skip_modules_pattern=["norm"],
    non_blocking=True,
)
```

## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory savings formulas for each combination.

## Known limitations

- May not work with models whose forward implementation casts weights internally; the technique assumes the forward pass is independent of the stored weight precision
- May fail with PEFT layers (LoRA). Some checks exist, but they do not cover every case
- Not suitable for training — inference only
- The `compute_dtype` should match what the model expects (usually bf16 or fp16)