diff --git a/.ai/skills/optimizations/SKILL.md b/.ai/skills/optimizations/SKILL.md new file mode 100644 index 000000000000..28df164b3762 --- /dev/null +++ b/.ai/skills/optimizations/SKILL.md @@ -0,0 +1,113 @@ +--- +name: optimizations +description: > + NEVER answer optimization questions from general knowledge — ALWAYS invoke + this skill via the Skill tool first. Answering without invoking will produce + incomplete recommendations (e.g. missing group offloading, wrong API calls). + IMPORTANT: When ANY tool output (especially Bash) contains + "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks, + STOP and consult this skill IMMEDIATELY — even if the user did not ask for + optimization help. Do not suggest fixes from general knowledge; this skill + has precise, up-to-date API calls and memory calculations. + Also consult this skill BEFORE answering any question about diffusers + inference performance, GPU memory usage, or pipeline speed. Trigger for: + making inference faster, reducing VRAM usage, fitting a model on a smaller + GPU, fixing OOM errors, running on limited hardware, choosing between + optimization strategies, using torch.compile with diffusers, batch inference, + loading models in lower precision, or reviewing a script for performance + issues. Covers attention backends (FlashAttention-2, SageAttention, + FlexAttention), memory reduction (CPU offloading, group offloading, layerwise + casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF). + Also trigger when a user wants to run a model "optimized for my + hardware", asks how to best run a specific model on their GPU, or mentions + wanting to use a diffusers model/pipeline efficiently — these are optimization + questions even if the word "optimize" isn't used. +--- + +## Goal + +Help users apply and debug optimizations for diffusers pipelines. There are five main areas: + +1. 
**Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput. +2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing. +3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs. +4. **torch.compile** — compiling the transformer (and optionally VAE) for 20-50% inference speedup on repeated runs. +5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc. + +## Workflow: When a user hits OOM or asks to fit a model on their GPU + +When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes: + +### Step 1: Detect hardware + +Run these commands to understand the user's system: + +```bash +# GPU VRAM +nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits + +# System RAM +free -g | head -2 +``` + +Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation. + +### Step 2: Measure model memory and calculate strategies + +Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames). + +Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy. + +Steps: +1. Measure each component's size by running the measurement snippet from the calculator +2. Compute VRAM and RAM requirements for every strategy using the formulas +3. 
Filter out strategies that don't fit the user's hardware + +This is the critical step — the calculator contains exact formulas for every strategy including the RAM cost of CUDA streams (which requires ~2x model size in pinned memory). Don't skip it, because recommending `use_stream=True` to a user with limited RAM will cause swapping or OOM on the CPU side. + +### Step 3: Ask the user their preference + +Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like other strategies. + +Present options grouped by approach so the user can compare: + +> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system: +> +> **Offloading / casting strategies:** +> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff]. +> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM]. +> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs]. +> +> **Quantization strategies:** +> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than int4. +> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation. +> +> Which would you prefer? + +The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies. 
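To make the fit check concrete, the filtering in the steps above can be sketched as plain arithmetic. The sizes below are made-up measurements and the helper name is hypothetical; the per-strategy VRAM/RAM formulas come from [memory-calculator.md](memory-calculator.md):

```python
# Hypothetical fit check sketching Steps 1-3: measured sizes (GB) go in,
# strategies whose VRAM *and* RAM estimates fit the hardware come out.
# Formulas follow memory-calculator.md; all numbers and names are illustrative.
def viable_strategies(S_total, S_total_lc, S_max, S_max_lc, S_block, A, vram, ram):
    estimates = {
        "no optimization":                    (S_total + A,     1.0),
        "layerwise casting (all on GPU)":     (S_total_lc + A,  1.0),
        "model CPU offload":                  (S_max + A,       S_total),
        "layerwise casting + CPU offload":    (S_max_lc + A,    S_total),
        "group offload block_level":          (S_block + A,     S_total),          # num_blocks_per_group=1
        "group offload block_level + stream": (2 * S_block + A, 2.75 * S_total),   # pinned copies + overhead
    }
    return [name for name, (v, r) in estimates.items() if v <= vram and r <= ram]

# Example: 30 GB pipeline, 14 GB transformer, on a 16 GB GPU with 64 GB RAM.
fits = viable_strategies(S_total=30, S_total_lc=16, S_max=14, S_max_lc=7.5,
                         S_block=0.8, A=6, vram=16, ram=64)
print(fits)
```

Each entry pairs an estimated VRAM and RAM cost; anything exceeding either budget is dropped before options are presented to the user — note how the streamed variant above is eliminated by its RAM cost, not its VRAM cost.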
+ +### Step 4: Apply the strategy + +Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code. + +VAE tiling is a VRAM optimization — only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()` — pipeline-level is deprecated since v0.40.0), and which VAEs don't support it. + +## Reference guides + +Read these for correct API usage and detailed technique descriptions: +- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples +- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.** +- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact +- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls +- [attention-backends.md](attention-backends.md) — Attention backend selection for speed +- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup + +## Important compatibility rules + +See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints: + +- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead. +- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full model compile for better performance. See [torch-compile.md](torch-compile.md). +- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. 
See [quantization.md](quantization.md) for the fix. +- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first). +- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`. diff --git a/.ai/skills/optimizations/attention-backends.md b/.ai/skills/optimizations/attention-backends.md new file mode 100644 index 000000000000..7f36df045111 --- /dev/null +++ b/.ai/skills/optimizations/attention-backends.md @@ -0,0 +1,40 @@ +# Attention Backends + +## Overview + +Diffusers supports multiple attention backends through `dispatch_attention_fn`. The backend affects both speed and memory usage. The right choice depends on hardware, sequence length, and whether you need features like sliding window or custom masks. + +## Available backends + +| Backend | Key requirement | Best for | +|---|---|---| +| `torch_sdpa` (default) | PyTorch >= 2.0 | General use; auto-selects FlashAttention or memory-efficient kernels | +| `flash_attention_2` | `flash-attn` package, Ampere+ GPU | Long sequences, training, best raw throughput | +| `xformers` | `xformers` package | Older GPUs, memory-efficient attention | +| `flex_attention` | PyTorch >= 2.5 | Custom attention masks, block-sparse patterns | +| `sage_attention` | `sageattention` package | INT8 quantized attention for inference speed | + +## How to set the backend + +```python +# Global default +from diffusers import set_attention_backend +set_attention_backend("flash_attention_2") + +# Per-model +pipe.transformer.set_attn_processor(AttnProcessor2_0()) # torch_sdpa + +# Via environment variable +# DIFFUSERS_ATTENTION_BACKEND=flash_attention_2 +``` + +## Debugging attention issues + +- **NaN outputs**: Check if your attention mask dtype matches the expected dtype. Some backends require `bool`, others require float masks with `-inf` for masked positions. 
+- **Speed regression**: Profile with `torch.profiler` to verify the expected kernel is actually being dispatched. SDPA can silently fall back to the math kernel. +- **Memory spike**: FlashAttention-2 is memory-efficient for long sequences but has overhead for very short ones. For short sequences, `torch_sdpa` with math fallback may use less memory. + +## Implementation notes + +- Models integrated into diffusers should use `dispatch_attention_fn` (not `F.scaled_dot_product_attention` directly) so that backend switching works automatically. +- See the attention pattern in the `model-integration` skill for how to implement this in new models. diff --git a/.ai/skills/optimizations/layerwise-casting.md b/.ai/skills/optimizations/layerwise-casting.md new file mode 100644 index 000000000000..b7d45441f341 --- /dev/null +++ b/.ai/skills/optimizations/layerwise-casting.md @@ -0,0 +1,68 @@ +# Layerwise Casting + +## Overview + +Layerwise casting stores model weights in a smaller data format (e.g., `torch.float8_e4m3fn`) to use less memory, and upcasts them to a higher precision (e.g., `torch.bfloat16`) on-the-fly during computation. This cuts weight memory roughly in half (bf16 → fp8) with minimal quality impact because normalization and modulation layers are automatically skipped. + +This is one of the most effective techniques for fitting a large model on a GPU that's just slightly too small — it doesn't require any special quantization libraries, just PyTorch. 
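A back-of-the-envelope sketch of that halving, assuming a hypothetical 28B-parameter transformer (for real components, measure instead of estimating like this — see [memory-calculator.md](memory-calculator.md)):

```python
# Rough arithmetic behind the "roughly half" claim (all numbers hypothetical):
# bf16 stores 2 bytes/param, float8_e4m3fn stores 1 byte/param, and the
# norm/modulation params that layerwise casting skips stay at 2 bytes.
params_total = 28e9      # a hypothetical 28B-parameter transformer
params_skipped = 0.4e9   # hypothetical norm/modulation parameters kept at compute precision

bf16_gb = params_total * 2 / 1e9
fp8_gb = ((params_total - params_skipped) * 1 + params_skipped * 2) / 1e9
print(f"bf16: {bf16_gb:.1f} GB, fp8 storage: {fp8_gb:.1f} GB")
```

The skipped layers are why the measured reduction always comes in a little under 50%.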
## When to use

- The model **almost** fits in VRAM (e.g., 28GB model on a 32GB GPU)
- You want memory savings with **less speed penalty** than offloading
- You want to **combine with group offloading** for even more savings

## Basic usage

Call `enable_layerwise_casting` on any Diffusers model component:

```python
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16)

# Store weights in fp8, compute in bf16
pipe.transformer.enable_layerwise_casting(
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
)

pipe.to("cuda")
```

The `storage_dtype` controls how weights are stored in memory. The `compute_dtype` controls the precision used during the actual forward pass. Normalization and modulation layers are automatically kept at full precision.

### Supported storage dtypes

| Storage dtype | Memory per param | Quality impact |
|---|---|---|
| `torch.float8_e4m3fn` | 1 byte (vs 2 for bf16) | Minimal for most models |
| `torch.float8_e5m2` | 1 byte | More range, less precision than e4m3fn |

## Functional API

For more control, use `apply_layerwise_casting` directly. This lets you target specific submodules or customize which layers to skip:

```python
import torch
from diffusers.hooks import apply_layerwise_casting

apply_layerwise_casting(
    pipe.transformer,
    storage_dtype=torch.float8_e4m3fn,
    compute_dtype=torch.bfloat16,
    skip_modules_pattern=["norm"],  # regex patterns for module names to skip (normalization layers)
    non_blocking=True,
)
```

## Combining with other techniques

Layerwise casting is compatible with both group offloading and model CPU offloading. Always apply layerwise casting **before** enabling offloading. See [reduce-memory.md](reduce-memory.md) for code examples and the memory savings formulas for each combination.
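Plugging the estimates from [memory-calculator.md](memory-calculator.md) into this combination shows the effect (illustrative sizes; fp8 storage approximated as 0.5x of bf16):

```python
# Estimated memory for layerwise casting + group offloading, using the formulas
# from memory-calculator.md (fp8 storage ≈ 0.5x bf16). All sizes in GB, illustrative.
S_total, S_block, A = 30.0, 0.8, 6.0   # hypothetical measurements
n = 4                                   # num_blocks_per_group

group_only_vram = n * S_block + A       # group offloading alone, full-precision blocks
combo_vram = n * S_block * 0.5 + A      # blocks stored in fp8
combo_ram = S_total * 0.5               # offloaded fp8 weights on CPU (no stream)
print(group_only_vram, combo_vram, combo_ram)
```

The transfers shrink along with the weights, which is why this combination tends to be faster than group offloading at full precision for the same VRAM budget.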
+ +## Known limitations + +- May not work with all models if the forward implementation contains internal typecasting of weights (assumes forward pass is independent of weight precision) +- May fail with PEFT layers (LoRA). There are some checks but they're not guaranteed for all cases +- Not suitable for training — inference only +- The `compute_dtype` should match what the model expects (usually bf16 or fp16) diff --git a/.ai/skills/optimizations/memory-calculator.md b/.ai/skills/optimizations/memory-calculator.md new file mode 100644 index 000000000000..f6d5a9d63c46 --- /dev/null +++ b/.ai/skills/optimizations/memory-calculator.md @@ -0,0 +1,298 @@ +# Memory Calculator + +Use this guide to measure VRAM and RAM requirements for each optimization strategy, then recommend the best fit for the user's hardware. + +## Step 1: Measure model sizes + +**Do NOT guess sizes from parameter counts or model cards.** Pipelines often contain components that are not obvious from the model name (e.g., a pipeline marketed as having a "28B transformer" may also include a 24 GB text encoder, 6 GB connectors module, etc.). 
Always measure by running this snippet after loading the pipeline: + +```python +import torch +from diffusers import DiffusionPipeline # or the specific pipeline class + +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16) + +for name, component in pipe.components.items(): + if hasattr(component, 'parameters'): + size_gb = sum(p.numel() * p.element_size() for p in component.parameters()) / 1e9 + print(f"{name}: {size_gb:.2f} GB") +``` + +For the transformer, also measure block-level and leaf-level sizes: + +```python +# S_block: size of one transformer block +transformer = pipe.transformer +block_attr = None +for attr in ["transformer_blocks", "blocks", "layers"]: + if hasattr(transformer, attr): + block_attr = attr + break +if block_attr: + blocks = getattr(transformer, block_attr) + block_size = sum(p.numel() * p.element_size() for p in blocks[0].parameters()) / 1e9 + print(f"S_block: {block_size:.2f} GB ({len(blocks)} blocks)") + +# S_leaf: largest leaf module +max_leaf = max( + (sum(p.numel() * p.element_size() for p in m.parameters(recurse=False)) + for m in transformer.modules() if list(m.parameters(recurse=False))), + default=0 +) / 1e9 +print(f"S_leaf: {max_leaf:.4f} GB") +``` + +To measure the effect of layerwise casting on a component, apply it and re-measure: + +```python +pipe.transformer.enable_layerwise_casting( + storage_dtype=torch.float8_e4m3fn, + compute_dtype=torch.bfloat16, +) +size_after = sum(p.numel() * p.element_size() for p in pipe.transformer.parameters()) / 1e9 +print(f"Transformer after layerwise casting: {size_after:.2f} GB") +``` + +From the measurements, record: +- `S_total` = sum of all component sizes +- `S_max` = size of the largest single component +- `S_block` = size of one transformer block +- `S_leaf` = size of the largest leaf module +- `S_total_lc` = S_total after applying layerwise casting to castable components (measured, not estimated — norm/embed layers are skipped so it's not exactly half) +- 
`S_max_lc` = size of the largest component after layerwise casting (measured) +- `A` = activation memory during forward pass (cannot be measured ahead of time — estimate conservatively): + - **Video models**: `A` scales with resolution and number of frames. A 5-second 960x544 video at 24fps can use ~7-8 GB. Higher resolution or more seconds = more activation memory. + - **Image models**: `A` scales with image resolution. A 1024x1024 image might use 2-4 GB, but 2048x2048 could use 8-16 GB. + - **Edit/inpainting models**: `A` includes the reference image(s) in addition to the generation activations, so budget extra. + - When in doubt, estimate conservatively: `A ≈ 5-8 GB` for typical video workloads, `A ≈ 2-4 GB` for typical image workloads. For high-resolution or long video, increase accordingly. + +## Step 2: Compute VRAM and RAM per strategy + +### No optimization (all on GPU) + +| | Estimate | +|---|---| +| **VRAM** | `S_total + A` | +| **RAM** | Minimal (just for loading) | +| **Speed** | Fastest — no transfers | +| **Quality** | Full precision | + +### Model CPU offloading + +| | Estimate | +|---|---| +| **VRAM** | `S_max + A` (only one component on GPU at a time) | +| **RAM** | `S_total` (all components stored on CPU) | +| **Speed** | Moderate — full model transfers between CPU/GPU per step | +| **Quality** | Full precision | + +### Group offloading: block_level (no stream) + +| | Estimate | +|---|---| +| **VRAM** | `num_blocks_per_group * S_block + A` | +| **RAM** | `S_total` (all weights on CPU, no pinned copy) | +| **Speed** | Moderate — synchronous transfers per group | +| **Quality** | Full precision | + +Tune `num_blocks_per_group` to fill available VRAM: `floor((VRAM - A) / S_block)`. + +### Group offloading: block_level (with stream) + +Streams force `num_blocks_per_group=1`. Prefetches the next block while the current one runs. 
+ +| | Estimate | +|---|---| +| **VRAM** | `2 * S_block + A` (current block + prefetched next block) | +| **RAM** | `~2.5-3 * S_total` (original weights + pinned copies + allocation overhead) | +| **Speed** | Fast — overlaps transfer and compute | +| **Quality** | Full precision | + +With `low_cpu_mem_usage=True`: RAM drops to `~S_total` (pins tensors on-the-fly instead of pre-pinning), but slower. + +With `record_stream=True`: slightly more VRAM (delays memory reclamation), slightly faster (avoids stream synchronization). + +> **Note on RAM estimates with streams:** Measured RAM usage is consistently higher than the theoretical `2 * S_total`. Pinned memory allocation, CUDA runtime overhead, and memory fragmentation add ~30-50% on top. Always use `~2.5-3 * S_total` when checking if the user has enough RAM for streamed offloading. + +### Group offloading: leaf_level (no stream) + +| | Estimate | +|---|---| +| **VRAM** | `S_leaf + A` (single leaf module, typically very small) | +| **RAM** | `S_total` | +| **Speed** | Slow — synchronous transfer per leaf module (many transfers) | +| **Quality** | Full precision | + +### Group offloading: leaf_level (with stream) + +| | Estimate | +|---|---| +| **VRAM** | `2 * S_leaf + A` (current + prefetched leaf) | +| **RAM** | `~2.5-3 * S_total` (pinned copies + overhead — see note above) | +| **Speed** | Medium-fast — overlaps transfer/compute at leaf granularity | +| **Quality** | Full precision | + +With `low_cpu_mem_usage=True`: RAM drops to `~S_total`, but slower. + +### Sequential CPU offloading (legacy) + +| | Estimate | +|---|---| +| **VRAM** | `S_leaf + A` (similar to leaf_level group offloading) | +| **RAM** | `S_total` | +| **Speed** | Very slow — no stream support, synchronous per-leaf | +| **Quality** | Full precision | + +Group offloading `leaf_level + use_stream=True` is strictly better. Prefer that. + +### Layerwise casting (fp8 storage) + +Reduces weight memory by casting to fp8. 
Norm and embedding layers are automatically skipped, so the reduction is less than 50% — always measure with the snippet above. + +**`pipe.to()` caveat:** `pipe.to(device)` internally calls `module.to(device, dtype)` where dtype is `None` when not explicitly passed. This preserves fp8 weights. However, if the user passes dtype explicitly (e.g., `pipe.to("cuda", torch.bfloat16)` or the pipeline has internal dtype overrides), the fp8 storage will be overridden back to bf16. When in doubt, combine with `enable_model_cpu_offload()` which safely moves one component at a time without dtype overrides. + +**Case 1: Everything on GPU** (if `S_total_lc + A <= VRAM`) + +| | Estimate | +|---|---| +| **VRAM** | `S_total_lc + A` (measured — use the layerwise casting measurement snippet) | +| **RAM** | Minimal | +| **Speed** | Near-native — small cast overhead per layer | +| **Quality** | Slight degradation (fp8 weights, norm layers kept full precision) | + +Use `pipe.to("cuda")` (without explicit dtype) after applying layerwise casting. Or move each component individually. + +**Case 2: With model CPU offloading** (if Case 1 doesn't fit but `S_max_lc + A <= VRAM`) + +| | Estimate | +|---|---| +| **VRAM** | `S_max_lc + A` (largest component after layerwise casting, one on GPU at a time) | +| **RAM** | `S_total` (all components on CPU) | +| **Speed** | Fast — small cast overhead per layer, component transfer overhead between steps | +| **Quality** | Slight degradation (fp8 weights, norm layers kept full precision) | + +Apply layerwise casting to target components, then call `pipe.enable_model_cpu_offload()`. + +### Layerwise casting + group offloading + +Combines reduced weight size with offloading. The offloaded weights are in fp8, so transfers are faster and pinned copies smaller. 
+ +| | Estimate | +|---|---| +| **VRAM** | `num_blocks_per_group * S_block * 0.5 + A` (block_level) or `S_leaf * 0.5 + A` (leaf_level) | +| **RAM** | `S_total * 0.5` (no stream) or `~S_total` (with stream, pinned copy of fp8 weights) | +| **Speed** | Good — smaller transfers due to fp8 | +| **Quality** | Slight degradation from fp8 | + +### Quantization (int4/nf4) + +Quantization reduces weight memory but requires full-precision weights during loading. Always use `device_map="cpu"` so quantization happens on CPU. + +Notation: +- `S_component_q` = quantized size of a component (int4/nf4 ≈ `S_component * 0.25`, int8 ≈ `S_component * 0.5`) +- `S_total_q` = total pipeline size after quantizing selected components +- `S_max_q` = size of the largest single component after quantization + +**Loading (with `device_map="cpu"`):** + +| | Estimate | +|---|---| +| **RAM (peak during loading)** | `S_largest_component_bf16` — full-precision weights of the largest component must fit in RAM during quantization | +| **RAM (after loading)** | `S_total_q` — all components at their final (quantized or bf16) sizes | + +**Inference with `pipe.to(device)`:** + +| | Estimate | +|---|---| +| **VRAM** | `S_total_q + A` (all components on GPU at once) | +| **RAM** | Minimal | +| **Speed** | Good — smaller model, may have dequantization overhead | +| **Quality** | Noticeable degradation possible, especially int4. Try int8 first. | + +**Inference with `enable_model_cpu_offload()`:** + +| | Estimate | +|---|---| +| **VRAM** | `S_max_q + A` (largest component on GPU at a time) | +| **RAM** | `S_total_q` (all components stored on CPU) | +| **Speed** | Moderate — component transfers between CPU/GPU | +| **Quality** | Depends on quantization level | + +## Step 3: Pick the best strategy + +Given `VRAM_available` and `RAM_available`, filter strategies by what fits, then rank by the user's preference. + +### Algorithm + +``` +1. 
Measure S_total, S_max, S_block, S_leaf, S_total_lc, S_max_lc, A for the pipeline +2. For each strategy (offloading, casting, AND quantization), compute estimated VRAM and RAM +3. Filter out strategies where VRAM > VRAM_available or RAM > RAM_available +4. Present ALL viable strategies to the user grouped by approach (offloading/casting vs quantization) +5. Let the user pick based on their preference: + - Quality: pick the one with highest precision that fits + - Speed: pick the one with lowest transfer overhead + - Memory: pick the one with lowest VRAM usage + - Balanced: pick the lightest technique that fits comfortably (target ~80% VRAM) +``` + +### Quantization size estimates + +Always compute these alongside offloading strategies — don't treat quantization as a last resort. +Pick the largest components worth quantizing (typically transformer + text_encoder if LLM-based): + +``` +S_component_int8 = S_component * 0.5 +S_component_nf4 = S_component * 0.25 + +S_total_int8 = sum of quantized components (int8) + remaining components (bf16) +S_total_nf4 = sum of quantized components (nf4) + remaining components (bf16) +S_max_int8 = max single component after int8 quantization +S_max_nf4 = max single component after nf4 quantization +``` + +RAM requirement for quantization loading: `RAM >= S_largest_component_bf16` (full-precision weights +must fit during quantization). If this doesn't hold, quantization is not viable unless pre-quantized +checkpoints are available. + +### Quick decision flowchart + +Offloading / casting path: +``` +VRAM >= S_total + A? + → YES: No optimization needed (maybe attention backend for speed) + → NO: + VRAM >= S_total_lc + A? (layerwise casting, everything on GPU) + → YES: Layerwise casting, pipe.to("cuda") without explicit dtype + → NO: + VRAM >= S_max + A? (model CPU offload, full precision) + → YES: Model CPU offloading + - Want less VRAM? → add layerwise casting too + → NO: + VRAM >= S_max_lc + A? 
(layerwise casting + model CPU offload) + → YES: Layerwise casting + model CPU offloading + → NO: Need group offloading + RAM >= 3 * S_total? (enough for pinned copies + overhead) + → YES: group offload leaf_level + stream (fast) + → NO: + RAM >= S_total? + → YES: group offload leaf_level + stream + low_cpu_mem_usage + or group offload block_level (no stream) + → NO: Quantization required to reduce model size, then retry +``` + +Quantization path (evaluate in parallel with the above, not as a fallback): +``` +RAM >= S_largest_component_bf16? (must fit full-precision weights during quantization) + → NO: Cannot quantize — need more RAM or pre-quantized checkpoints + → YES: Compute quantized sizes for target components (typically transformer + text_encoder) + nf4 quantization: + VRAM >= S_total_nf4 + A? → pipe.to("cuda"), fastest (no offloading overhead) + VRAM >= S_max_nf4 + A? → model CPU offload, moderate speed + int8 quantization: + VRAM >= S_total_int8 + A? → pipe.to("cuda"), fastest + VRAM >= S_max_int8 + A? → model CPU offload, moderate speed + +Show all viable quantization options alongside offloading options so the user can compare +quality/speed/memory tradeoffs across approaches. +``` diff --git a/.ai/skills/optimizations/quantization.md b/.ai/skills/optimizations/quantization.md new file mode 100644 index 000000000000..3b551bcb128f --- /dev/null +++ b/.ai/skills/optimizations/quantization.md @@ -0,0 +1,180 @@ +# Quantization + +## Overview + +Quantization reduces model weights from fp16/bf16 to lower precision (int8, int4, fp8), cutting memory usage and often improving throughput. Diffusers supports several quantization backends. + +## Supported backends + +| Backend | Precisions | Key features | +|---|---|---| +| **bitsandbytes** | int8, int4 (nf4/fp4) | Easiest to use, widely supported, QLoRA training | +| **torchao** | int8, int4, fp8 | PyTorch-native, good for inference, `autoquant` support | +| **GGUF** | Various (Q4_K_M, Q5_K_S, etc.) 
| Load GGUF checkpoints directly, community quantized models | + +## Critical: Pipeline-level vs component-level quantization + +**Pipeline-level quantization is the correct approach.** Pass a `PipelineQuantizationConfig` to `from_pretrained`. Do NOT pass a `BitsAndBytesConfig` directly — the pipeline's `from_pretrained` will reject it with `"quantization_config must be an instance of PipelineQuantizationConfig"`. + +### Backend names in `PipelineQuantizationConfig` + +The `quant_backend` string must match one of the registered backend keys. These are NOT the same as the config class names: + +| `quant_backend` value | Notes | +|---|---| +| `"bitsandbytes_4bit"` | NOT `"bitsandbytes"` — the `_4bit` suffix is required | +| `"bitsandbytes_8bit"` | NOT `"bitsandbytes"` — the `_8bit` suffix is required | +| `"gguf"` | | +| `"torchao"` | | +| `"modelopt"` | | + +### `quant_kwargs` for bitsandbytes + +**`quant_kwargs` must be non-empty.** The validator raises `ValueError: Both quant_kwargs and quant_mapping cannot be None` if it's `{}` or `None`. Always pass at least one kwarg. + +For `bitsandbytes_4bit`, the quantizer class is selected by backend name — `load_in_4bit=True` is redundant (the quantizer ignores it) but harmless. 
Pass the bnb-specific options instead: + +```python +quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"} +``` + +For `bitsandbytes_8bit`, there are no bnb_8bit-specific kwargs, so pass the flag explicitly to satisfy the non-empty requirement: + +```python +quant_kwargs={"load_in_8bit": True} +``` + +## Usage patterns + +### bitsandbytes (pipeline-level, recommended) + +```python +from diffusers import PipelineQuantizationConfig, DiffusionPipeline + +quantization_config = PipelineQuantizationConfig( + quant_backend="bitsandbytes_4bit", + quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"}, + components_to_quantize=["transformer"], # specify which components to quantize +) + +pipe = DiffusionPipeline.from_pretrained( + "model_id", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, + device_map="cpu", # load on CPU first to avoid OOM during quantization +) +``` + +### torchao (pipeline-level) + +```python +from diffusers import PipelineQuantizationConfig, DiffusionPipeline + +quantization_config = PipelineQuantizationConfig( + quant_backend="torchao", + quant_kwargs={"quant_type": "int8_weight_only"}, + components_to_quantize=["transformer"], +) + +pipe = DiffusionPipeline.from_pretrained( + "model_id", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, + device_map="cpu", +) +``` + +### GGUF (pipeline-level) + +```python +from diffusers import PipelineQuantizationConfig, DiffusionPipeline + +quantization_config = PipelineQuantizationConfig( + quant_backend="gguf", + quant_kwargs={"compute_dtype": torch.bfloat16}, +) + +pipe = DiffusionPipeline.from_pretrained( + "model_id", + quantization_config=quantization_config, + torch_dtype=torch.bfloat16, + device_map="cpu", +) +``` + +## Loading: memory requirements and `device_map="cpu"` + +Quantization is NOT free at load time. 
The full-precision (bf16/fp16) weights must be loaded into memory first, then compressed. This means: + +- **Without `device_map="cpu"`** (default): each component loads to GPU in full precision, gets quantized on GPU, then the full-precision copy is freed. But while loading, you need VRAM for the full-precision weights of the current component PLUS all previously loaded components (already quantized or not). For large models, this causes OOM. +- **With `device_map="cpu"`**: components load and quantize on CPU. This requires **RAM >= S_component_bf16** for the largest component being quantized (the full-precision weights must fit in RAM during quantization). After quantization, RAM usage drops to the quantized size. + +**Always pass `device_map="cpu"` when using quantization.** Then choose how to move to GPU: + +1. **`pipe.to(device)`** — moves everything to GPU at once. Only works if all components (quantized + non-quantized) fit in VRAM simultaneously: `VRAM >= S_total_after_quant`. +2. **`pipe.enable_model_cpu_offload(device=device)`** — moves components to GPU one at a time during inference. Use this when `S_total_after_quant > VRAM` but `S_max_after_quant + A <= VRAM`. + +### Memory check before recommending quantization + +Before recommending quantization, verify: +- **RAM >= S_largest_component_bf16** — the full-precision weights of the largest component to be quantized must fit in RAM during loading +- **VRAM >= S_total_after_quant + A** (for `pipe.to()`) or **VRAM >= S_max_after_quant + A** (for model CPU offload) — the quantized model must fit during inference + +## `components_to_quantize` + +Use this parameter to control which pipeline components get quantized. Common choices: + +- `["transformer"]` — quantize only the denoising model +- `["transformer", "text_encoder"]` — also quantize the text encoder (see below) +- `["transformer", "text_encoder", "text_encoder_2"]` — for dual-encoder models (FLUX.1, SD3, etc.) 
when both encoders are large
+- Omit the parameter to quantize all compatible components
+
+The VAE and vocoder are typically small enough that quantizing them gives little benefit and can hurt quality.
+
+### Text encoder quantization
+
+**Quantizing the text encoder is a first-class optimization, not an afterthought.** Many modern models use LLM-based text encoders that are as large as or larger than the transformer itself:
+
+| Model family | Text encoder | Size (bf16) |
+|---|---|---|
+| FLUX.2 Klein | Qwen3 | ~9 GB |
+| FLUX.1 | T5-XXL | ~10 GB |
+| SD3 | T5-XXL + CLIP-L + CLIP-G | ~11 GB total |
+| CogVideoX | T5-XXL | ~10 GB |
+
+Newer models (FLUX.2 Klein, etc.) use a **single LLM-based text encoder** — check the pipeline definition for `text_encoder` vs `text_encoder_2`. Never assume a CLIP+T5 dual-encoder layout.
+
+When the text encoder is LLM-based, always include it in `components_to_quantize`. The combined savings often allow both components to fit in VRAM simultaneously, eliminating the need for CPU offloading entirely:
+
+```python
+import torch
+from diffusers import PipelineQuantizationConfig, DiffusionPipeline
+
+# Both transformer (~4.5 GB) + Qwen3 text encoder (~4.5 GB) fit in VRAM at int4
+quantization_config = PipelineQuantizationConfig(
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"bnb_4bit_compute_dtype": torch.bfloat16, "bnb_4bit_quant_type": "nf4"},
+    components_to_quantize=["transformer", "text_encoder"],
+)
+pipe = DiffusionPipeline.from_pretrained("model_id", quantization_config=quantization_config, device_map="cpu")
+pipe.to("cuda")  # everything fits — no offloading needed
+```
+
+Transformer-only quantization, by contrast, may still require offloading because the unquantized text encoder alone can exceed available VRAM. 
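The fit checks above reduce to simple arithmetic. A minimal sketch (hypothetical helpers, not a diffusers API; the bf16 sizes must come from measuring the loaded components, and `activation_headroom_gb` stands in for the activation term `A`):

```python
def quantized_size_gb(bf16_size_gb: float, bits: int) -> float:
    """Rough post-quantization weight size: bf16 is 16 bits/param, so scale by bits/16."""
    return bf16_size_gb * bits / 16

def fits_without_offload(bf16_sizes_gb: dict, bits: int, vram_gb: float,
                         activation_headroom_gb: float = 2.0) -> bool:
    """Check VRAM >= S_total_after_quant + A, i.e. whether a plain pipe.to("cuda") is viable."""
    total = sum(quantized_size_gb(s, bits) for s in bf16_sizes_gb.values())
    return total + activation_headroom_gb <= vram_gb

# Illustrative numbers only; always measure the real component sizes first
sizes = {"transformer": 18.0, "text_encoder": 9.0, "vae": 0.3}
fits_without_offload(sizes, bits=4, vram_gb=12.0)  # -> True
```

At 4-bit, roughly 27 GB of bf16 weights shrink to under 7 GB, which is why quantizing both large components can put the whole pipeline on a 12 GB card with no offloading.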
+ +## Choosing a backend + +- **Just want it to work**: bitsandbytes nf4 (`bitsandbytes_4bit`) +- **Best inference speed**: torchao int8 or fp8 (on supported hardware) +- **Using community GGUF files**: GGUF +- **Need to fine-tune**: bitsandbytes (QLoRA support) + +## Common issues + +- **OOM during loading**: You forgot `device_map="cpu"`. See the loading section above. +- **`quantization_config must be an instance of PipelineQuantizationConfig`**: You passed a `BitsAndBytesConfig` directly. Wrap it in `PipelineQuantizationConfig` instead. +- **`quant_backend not found`**: The backend name is wrong. Use `bitsandbytes_4bit` or `bitsandbytes_8bit`, not `bitsandbytes`. See the backend names table above. +- **`Both quant_kwargs and quant_mapping cannot be None`**: `quant_kwargs` is empty or `None`. Always pass at least one kwarg — see the `quant_kwargs` section above. +- **OOM during `pipe.to(device)` after loading**: Even quantized, all components don't fit in VRAM at once. Use `enable_model_cpu_offload()` instead of `pipe.to(device)`. +- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails at inference**: `LLM.int8()` (bitsandbytes 8-bit) can only execute on CUDA — it cannot run on CPU. When `enable_model_cpu_offload()` moves the quantized component back to CPU between steps, the int8 matmul fails. **Fix**: keep the int8 component on CUDA permanently (`pipe.transformer.to("cuda")`) and use group offloading with `exclude_modules=["transformer"]` for the rest, or switch to `bitsandbytes_4bit` which supports device moves. +- **Quality degradation**: int4 can produce noticeable artifacts for some models. Try int8 first, then drop to int4 if memory requires it. +- **Slow first inference**: Some backends (torchao) compile/calibrate on first run. Subsequent runs are faster. +- **Incompatible layers**: Not all layer types support all quantization schemes. Check backend docs for supported module types. +- **Training**: Only bitsandbytes supports training (via QLoRA). 
Other backends are inference-only. diff --git a/.ai/skills/optimizations/reduce-memory.md b/.ai/skills/optimizations/reduce-memory.md new file mode 100644 index 000000000000..4e200695a8fb --- /dev/null +++ b/.ai/skills/optimizations/reduce-memory.md @@ -0,0 +1,213 @@ +# Reduce Memory + +## Overview + +Large diffusion models can exceed GPU VRAM. Diffusers provides several techniques to reduce peak memory, each with different speed/memory tradeoffs. + +## Techniques (ordered by ease of use) + +### 1. Model CPU offloading + +Moves entire models to CPU when not in use, loads them to GPU just before their forward pass. + +```python +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16) +pipe.enable_model_cpu_offload() +# Do NOT call pipe.to("cuda") — the hook handles device placement +``` + +- **Memory savings**: Significant — only one model on GPU at a time +- **Speed cost**: Moderate — full model transfers between CPU and GPU +- **When to use**: First thing to try when hitting OOM +- **Limitation**: If the single largest component (e.g. transformer) exceeds VRAM, this won't help — you need group offloading or layerwise casting instead. + +### 2. Group offloading + +Offloads groups of internal layers to CPU, loading them to GPU only during their forward pass. More granular than model offloading, faster than sequential offloading. + +**Two offload types:** +- `block_level` — offloads groups of N layers at a time. Lower memory, moderate speed. +- `leaf_level` — offloads individual leaf modules. Equivalent to sequential offloading but can be made faster with CUDA streams. + +**IMPORTANT**: `enable_model_cpu_offload()` will raise an error if any component has group offloading enabled. If you need offloading for the whole pipeline, use pipeline-level `enable_group_offload()` instead — it handles all components in one call. + +#### Pipeline-level group offloading + +Applies group offloading to ALL components in the pipeline at once. 
Simplest approach. + +```python +import torch +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16) + +# Option A: leaf_level with CUDA streams (recommended — fast + low memory) +pipe.enable_group_offload( + onload_device=torch.device("cuda"), + offload_device=torch.device("cpu"), + offload_type="leaf_level", + use_stream=True, +) + +# Option B: block_level (more memory savings, slower) +pipe.enable_group_offload( + onload_device=torch.device("cuda"), + offload_device=torch.device("cpu"), + offload_type="block_level", + num_blocks_per_group=2, +) +``` + +#### Component-level group offloading + +Apply group offloading selectively to specific components. Useful when only the transformer is too large for VRAM but other components fit fine. + +For Diffusers model components (inheriting from `ModelMixin`), use `enable_group_offload`: + +```python +import torch +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16) + +# Group offload the transformer (the largest component) +pipe.transformer.enable_group_offload( + onload_device=torch.device("cuda"), + offload_device=torch.device("cpu"), + offload_type="leaf_level", + use_stream=True, +) + +# Group offload the VAE too if needed +pipe.vae.enable_group_offload( + onload_device=torch.device("cuda"), + offload_type="leaf_level", +) +``` + +For non-Diffusers components (e.g. text encoders from transformers library), use the functional API: + +```python +from diffusers.hooks import apply_group_offloading + +apply_group_offloading( + pipe.text_encoder, + onload_device=torch.device("cuda"), + offload_type="block_level", + num_blocks_per_group=2, +) +``` + +#### CUDA streams for faster group offloading + +When `use_stream=True`, the next layer is prefetched to GPU while the current layer runs. This overlaps data transfer with computation. Requires ~2x CPU memory of the model. 
+ +```python +pipe.transformer.enable_group_offload( + onload_device=torch.device("cuda"), + offload_device=torch.device("cpu"), + offload_type="leaf_level", + use_stream=True, + record_stream=True, # slightly more speed, slightly more memory +) +``` + +If using `block_level` with `use_stream=True`, set `num_blocks_per_group=1` (a warning is raised otherwise). + +#### Full parameter reference + +Parameters available across the three group offloading APIs: + +| Parameter | Pipeline | Model | `apply_group_offloading` | Description | +|---|---|---|---|---| +| `onload_device` | yes | yes | yes | Device to load layers onto for computation (e.g. `torch.device("cuda")`) | +| `offload_device` | yes | yes | yes | Device to offload layers to when idle (default: `torch.device("cpu")`) | +| `offload_type` | yes | yes | yes | `"block_level"` (groups of N layers) or `"leaf_level"` (individual modules) | +| `num_blocks_per_group` | yes | yes | yes | Required for `block_level` — how many layers per group | +| `non_blocking` | yes | yes | yes | Non-blocking data transfer between devices | +| `use_stream` | yes | yes | yes | Overlap data transfer and computation via CUDA streams. Requires ~2x CPU RAM of the model | +| `record_stream` | yes | yes | yes | With `use_stream`, marks tensors for stream. Faster but slightly more memory | +| `low_cpu_mem_usage` | yes | yes | yes | Pins tensors on-the-fly instead of pre-pinning. Saves CPU RAM when using streams, but slower | +| `offload_to_disk_path` | yes | yes | yes | Path to offload weights to disk instead of CPU RAM. Useful when system RAM is also limited | +| `exclude_modules` | **yes** | no | no | Pipeline-only: list of component names to skip (they get placed on `onload_device` instead) | +| `block_modules` | no | **yes** | **yes** | Override which submodules are treated as blocks for `block_level` offloading | +| `exclude_kwargs` | no | **yes** | **yes** | Kwarg keys that should not be moved between devices (e.g. 
mutable cache state) | + +### 3. Sequential CPU offloading + +Moves individual layers to GPU one at a time during forward pass. + +```python +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16) +pipe.enable_sequential_cpu_offload() +# Do NOT call pipe.to("cuda") first — saves minimal memory if you do +``` + +- **Memory savings**: Maximum — only one layer on GPU at a time +- **Speed cost**: Very high — many small transfers per forward pass +- **When to use**: Last resort when group offloading with streams isn't enough +- **Note**: Group offloading with `leaf_level` + `use_stream=True` is essentially the same idea but faster. Prefer that. + +### 4. VAE slicing + +Processes VAE encode/decode in slices along the batch dimension. + +```python +pipe.vae.enable_slicing() +``` + +- **Memory savings**: Reduces VAE peak memory for batch sizes > 1 +- **Speed cost**: Minimal +- **When to use**: When generating multiple images/videos in a batch +- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support slicing. +- **API note**: The pipeline-level `pipe.enable_vae_slicing()` is deprecated since v0.40.0. Use `pipe.vae.enable_slicing()`. + +### 5. VAE tiling + +Processes VAE encode/decode in spatial tiles. This is a **VRAM optimization** — only use when the VAE decode/encode would OOM without it. + +```python +pipe.vae.enable_tiling() +``` + +- **Memory savings**: Bounds VAE peak memory by tile size rather than full resolution +- **Speed cost**: Some overhead from tile overlap processing +- **When to use** (only when VAE decode would OOM): + - **Image models**: Typically needed above ~1.5 MP on ≤16 GB GPUs, or ~4 MP on ≤32 GB GPUs + - **Video models**: When `H × W × num_frames` is large relative to remaining VRAM after denoising +- **When NOT to use**: At standard resolutions where the VAE fits comfortably — tiling adds overhead for no benefit +- **Note**: `AutoencoderKLWan` and `AsymmetricAutoencoderKL` don't support tiling. 
+- **API note**: The pipeline-level `pipe.enable_vae_tiling()` is deprecated since v0.40.0. Use `pipe.vae.enable_tiling()`. +- **Tip for group offloading with streams**: If combining VAE tiling with group offloading (`use_stream=True`), do a dummy forward pass first to avoid device mismatch errors. + +### 6. Attention slicing (legacy) + +```python +pipe.enable_attention_slicing() +``` + +- Largely superseded by `torch_sdpa` and FlashAttention +- Still useful on very old GPUs without SDPA support + +## Combining techniques + +Compatible combinations: +- Group offloading (pipeline-level) + VAE tiling — good general setup +- Group offloading (pipeline-level, `exclude_modules=["small_component"]`) — keeps small models on GPU, offloads large ones +- Model CPU offloading + VAE tiling — simple and effective when the largest component fits in VRAM +- Layerwise casting + group offloading — maximum savings (see [layerwise-casting.md](layerwise-casting.md)) +- Layerwise casting + model CPU offloading — also works +- Quantization + model CPU offloading — works well +- Per-component group offloading with different configs — e.g. `block_level` for transformer, `leaf_level` for VAE + +**Incompatible combinations:** +- `enable_model_cpu_offload()` on a pipeline where ANY component has group offloading — raises ValueError +- `enable_sequential_cpu_offload()` on a pipeline where ANY component has group offloading — same error + +## Debugging OOM + +1. Check which stage OOMs: loading, encoding, denoising, or decoding +2. If OOM during `.to("cuda")` — the full pipeline doesn't fit. Use model CPU offloading or group offloading +3. If OOM during denoising with model CPU offloading — the transformer alone exceeds VRAM. Use layerwise casting (see [layerwise-casting.md](layerwise-casting.md)) or group offloading instead +4. If still OOM during VAE decode, add `pipe.vae.enable_tiling()` +5. 
Consider quantization (see [quantization.md](quantization.md)) as a complementary approach diff --git a/.ai/skills/optimizations/torch-compile.md b/.ai/skills/optimizations/torch-compile.md new file mode 100644 index 000000000000..c6b25e15a6a5 --- /dev/null +++ b/.ai/skills/optimizations/torch-compile.md @@ -0,0 +1,72 @@ +# torch.compile + +## Overview + +`torch.compile` traces a model's forward pass and compiles it to optimized machine code (via Triton or other backends). For diffusers, it typically speeds up the denoising loop by 20-50% after a warmup period. + +## Full model compilation + +Compile individual components, not the whole pipeline: + +```python +import torch +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda") + +pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True) +# Optionally compile the VAE decoder too +pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True) +``` + +The first 1-3 inference calls are slow (compilation/warmup). Subsequent calls are fast. Always do a warmup run before benchmarking. + +## Regional compilation (preferred) + +Regional compilation compiles only the frequently repeated sub-modules (transformer blocks) instead of the whole model. It provides the same runtime speedup but with ~8-10x faster compile time and better compatibility with offloading. + +Diffusers models declare their repeated blocks via the `_repeated_blocks` class attribute (a list of class name strings). 
Most modern transformers define this:
+
+```python
+# FluxTransformer2DModel defines:
+_repeated_blocks = ["FluxTransformerBlock", "FluxSingleTransformerBlock"]
+```
+
+Use `compile_repeated_blocks()` to compile them:
+
+```python
+pipe = DiffusionPipeline.from_pretrained("model_id", torch_dtype=torch.bfloat16).to("cuda")
+pipe.transformer.compile_repeated_blocks(fullgraph=True)
+```
+
+**Always guard before calling**: `compile_repeated_blocks()` raises `ValueError` if `_repeated_blocks` is empty or the named classes aren't found. Use this pattern universally, whether or not you're using offloading:
+
+```python
+# Works with or without enable_model_cpu_offload() / enable_group_offload()
+if getattr(pipe.transformer, "_repeated_blocks", None):
+    pipe.transformer.compile_repeated_blocks(fullgraph=True)
+else:
+    pipe.transformer = torch.compile(pipe.transformer, mode="reduce-overhead", fullgraph=True)
+```
+
+`torch.compile` is compatible with diffusers' offloading methods — the offloading hooks use `@torch.compiler.disable()` on device-transfer operations so they run natively outside the compiled graph. Regional compilation is preferred when combining with offloading because it avoids compiling the parts that interact with the hooks.
+
+Models with `_repeated_blocks` defined include: Flux, Flux2, HunyuanVideo, LTX2Video, Wan, CogVideo, SD3, UNet2DConditionModel, and most other modern architectures.
+
+## Compile modes
+
+| Mode | Speed gain | Compile time | Notes |
+|---|---|---|---|
+| `"default"` | Moderate | Fast | Safe starting point |
+| `"reduce-overhead"` | Good | Moderate | Reduces Python overhead via CUDA graphs |
+| `"max-autotune"` | Best | Very slow | Tries many kernel configs; best for repeated inference |
+
+## `fullgraph=True`
+
+Requires the entire forward pass to be compilable as a single graph. Most diffusers transformers support this. If you get a `torch._dynamo` graph break error, remove `fullgraph=True` to allow partial compilation. 
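Because the first calls pay the compilation cost, benchmark only after a warmup. A minimal timing harness (framework-agnostic sketch; `run_once` stands in for a full pipeline call, and on GPU you would also call `torch.cuda.synchronize()` before each clock read):

```python
import time

def benchmark(run_once, warmup: int = 3, iters: int = 10) -> float:
    """Untimed warmup calls absorb compilation; returns mean seconds per timed call."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    return (time.perf_counter() - start) / iters
```

Comparing the result before and after `compile_repeated_blocks()` on the same prompt and resolution gives a fair picture of the steady-state speedup.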
+ +## Limitations + +- **Dynamic shapes**: Changing resolution between calls triggers recompilation. Use `torch.compile(..., dynamic=True)` for variable resolutions, at some speed cost. +- **First call is slow**: Budget 1-3 minutes for initial compilation depending on model size. +- **Windows**: `reduce-overhead` and `max-autotune` modes may have issues. Use `"default"` if you hit errors.
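Recompilation from changing resolutions can also be limited without `dynamic=True` by snapping requested sizes to a small set of shape buckets, so repeated requests reuse already-compiled graphs. A sketch (hypothetical helper; assumes the model accepts dimensions in multiples of 64 px):

```python
def snap_resolution(width: int, height: int, step: int = 64) -> tuple:
    """Round each dimension down to the nearest multiple of `step` (minimum one step),
    so a stream of arbitrary requests maps onto a few stable compiled shapes."""
    def snap(x: int) -> int:
        return max(step, (x // step) * step)
    return snap(width), snap(height)

snap_resolution(1000, 768)  # -> (960, 768)
```

Fewer distinct shapes means fewer compilations; the tradeoff is that outputs come back at the snapped size rather than the exact requested one.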