| 1 | +--- |
| 2 | +name: optimizations |
| 3 | +description: > |
| 4 | + NEVER answer optimization questions from general knowledge — ALWAYS invoke |
| 5 | + this skill via the Skill tool first. Answering without invoking will produce |
| 6 | + incomplete recommendations (e.g. missing group offloading, wrong API calls). |
| 7 | + IMPORTANT: When ANY tool output (especially Bash) contains |
| 8 | + "torch.OutOfMemoryError", "CUDA out of memory", or OOM tracebacks, |
| 9 | + STOP and consult this skill IMMEDIATELY — even if the user did not ask for |
| 10 | + optimization help. Do not suggest fixes from general knowledge; this skill |
| 11 | + has precise, up-to-date API calls and memory calculations. |
| 12 | + Also consult this skill BEFORE answering any question about diffusers |
| 13 | + inference performance, GPU memory usage, or pipeline speed. Trigger for: |
| 14 | + making inference faster, reducing VRAM usage, fitting a model on a smaller |
| 15 | + GPU, fixing OOM errors, running on limited hardware, choosing between |
| 16 | + optimization strategies, using torch.compile with diffusers, batch inference, |
| 17 | + loading models in lower precision, or reviewing a script for performance |
| 18 | + issues. Covers attention backends (FlashAttention-2, SageAttention, |
| 19 | + FlexAttention), memory reduction (CPU offloading, group offloading, layerwise |
| 20 | + casting, VAE slicing/tiling), and quantization (bitsandbytes, torchao, GGUF). |
| 21 | + Also trigger when a user wants to run a model "optimized for my |
| 22 | + hardware", asks how to best run a specific model on their GPU, or mentions |
| 23 | + wanting to use a diffusers model/pipeline efficiently — these are optimization |
| 24 | + questions even if the word "optimize" isn't used. |
| 25 | +--- |
| 26 | + |
| 27 | +## Goal |
| 28 | + |
| 29 | +Help users apply and debug optimizations for diffusers pipelines. There are five main areas: |
| 30 | + |
| 31 | +1. **Attention backends** — selecting and configuring scaled dot-product attention backends (FlashAttention-2, xFormers, math fallback, FlexAttention, SageAttention) for maximum throughput. |
| 32 | +2. **Memory reduction** — techniques to reduce peak GPU memory: model CPU offloading, group offloading, layerwise casting, VAE slicing/tiling, and attention slicing. |
| 33 | +3. **Quantization** — reducing model precision with bitsandbytes, torchao, or GGUF to fit larger models on smaller GPUs. |
| 34 | +4. **torch.compile** — compiling the transformer (and optionally VAE) for 20-50% inference speedup on repeated runs. |
| 35 | +5. **Combining techniques** — layerwise casting + group offloading, quantization + offloading, etc. |
| 36 | + |
| 37 | +## Workflow: When a user hits OOM or asks to fit a model on their GPU |
| 38 | + |
| 39 | +When a user asks how to make a pipeline run on their hardware, or hits an OOM error, follow these steps **in order** before proposing any changes: |
| 40 | + |
| 41 | +### Step 1: Detect hardware |
| 42 | + |
| 43 | +Run these commands to understand the user's system: |
| 44 | + |
| 45 | +```bash |
| 46 | +# GPU VRAM |
| 47 | +nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv,noheader,nounits |
| 48 | + |
| 49 | +# System RAM |
| 50 | +free -g | head -2 |
| 51 | +``` |
| 52 | + |
| 53 | +Record the GPU name, total VRAM (in GB), and total system RAM (in GB). These numbers drive the recommendation. |
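To turn the `nvidia-smi` output above into the GB figures used later, a small parser along these lines could help. This is a minimal sketch: `parse_gpu_csv` is a hypothetical helper name, and it assumes the exact query fields shown above, which `nvidia-smi` reports in MiB when `nounits` is set.

```python
def parse_gpu_csv(line: str) -> dict:
    """Parse one line of the nvidia-smi CSV query shown above.

    Expected fields: name, memory.total, memory.free (memory in MiB).
    """
    # Split from the right so a comma inside the GPU name does not break parsing.
    name, total_mib, free_mib = [field.strip() for field in line.rsplit(",", 2)]
    return {
        "name": name,
        "total_gb": round(int(total_mib) / 1024, 1),
        "free_gb": round(int(free_mib) / 1024, 1),
    }


if __name__ == "__main__":
    # Illustrative input only; real values come from running nvidia-smi.
    print(parse_gpu_csv("NVIDIA GeForce RTX 4090, 24564, 24000"))
```

On a multi-GPU system the query prints one line per device, so the helper would be applied to each line separately.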
| 54 | + |
| 55 | +### Step 2: Measure model memory and calculate strategies |
| 56 | + |
| 57 | +Read the user's script to identify the pipeline class, model ID, `torch_dtype`, and generation params (resolution, frames). |
| 58 | + |
| 59 | +Then **measure actual component sizes** by running a snippet against the loaded pipeline. Do NOT guess sizes from parameter counts or model cards — always measure. See [memory-calculator.md](memory-calculator.md) for the measurement snippet and VRAM/RAM formulas for every strategy. |
| 60 | + |
| 61 | +Steps: |
| 62 | +1. Measure each component's size by running the measurement snippet from the calculator |
| 63 | +2. Compute VRAM and RAM requirements for every strategy using the formulas |
| 64 | +3. Filter out strategies that don't fit the user's hardware |
| 65 | + |
| 66 | +This is the critical step: the calculator contains exact formulas for every strategy, including the RAM cost of CUDA streams (which require ~2x the model size in pinned memory). Don't skip it: recommending `use_stream=True` to a user with limited RAM can cause swapping or OOM on the CPU side. |
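The authoritative measurement snippet lives in memory-calculator.md. For orientation only, a measurement of this kind might look like the sketch below. It assumes each model component exposes `parameters()` and `buffers()` as `torch.nn.Module` does, and that the pipeline exposes a `components` mapping. The helper names `module_size_gb` and `measure_components` are hypothetical.

```python
def module_size_gb(module) -> float:
    """Sum the bytes held by a module's parameters and buffers, in GiB."""
    tensors = list(module.parameters()) + list(module.buffers())
    total_bytes = sum(t.numel() * t.element_size() for t in tensors)
    return total_bytes / 1024**3


def measure_components(pipe) -> dict:
    """Return {component_name: size_in_GiB} for every module-like component."""
    sizes = {}
    for name, component in pipe.components.items():
        # Skip tokenizers, schedulers, and unset (None) components,
        # which do not carry weights in this sense.
        if hasattr(component, "parameters") and hasattr(component, "buffers"):
            sizes[name] = module_size_gb(component)
    return sizes
```

The largest entry and the sum of all entries are the two numbers the fit checks in this workflow rely on.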
| 67 | + |
| 68 | +### Step 3: Ask the user their preference |
| 69 | + |
| 70 | +Present the user with a clear summary of what fits. **Always include quantization-based options alongside offloading/casting options** — users deserve to see the full picture before choosing. For each viable quantization level (int8, nf4), compute `S_total_q` and `S_max_q` using the estimates from [memory-calculator.md](memory-calculator.md) (int4/nf4 ≈ 0.25x, int8 ≈ 0.5x component size), then check fit just like other strategies. |
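As a rough illustration of the quantization estimate described above, the sketch below scales the measured component sizes by the approximate factors quoted here (int8 ≈ 0.5x, nf4 ≈ 0.25x). It assumes `S_total_q` is the sum of component sizes after quantization and `S_max_q` is the largest single component; the exact definitions are in memory-calculator.md. The function name and the example sizes are hypothetical.

```python
# Approximate size factors quoted in this guide, relative to the measured size.
QUANT_FACTORS = {"int8": 0.5, "nf4": 0.25}


def quantized_sizes(component_gb: dict, quantized: set, level: str) -> tuple:
    """Estimate (S_total_q, S_max_q) in GB when `quantized` components use `level`."""
    factor = QUANT_FACTORS[level]
    sizes = {
        name: size * factor if name in quantized else size
        for name, size in component_gb.items()
    }
    return sum(sizes.values()), max(sizes.values())


if __name__ == "__main__":
    # Illustrative sizes only; real values must come from measurement.
    measured = {"transformer": 24.0, "text_encoder": 8.0, "vae": 0.5}
    print(quantized_sizes(measured, {"transformer"}, "nf4"))
```

Note that quantizing only one component can leave a different component as the largest, which changes which strategies fit.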
| 71 | + |
| 72 | +Present options grouped by approach so the user can compare: |
| 73 | + |
| 74 | +> Based on your hardware (**X GB VRAM**, **Y GB RAM**) and the model requirements (~**Z GB** total, largest component ~**W GB**), here are the strategies that fit your system: |
| 75 | +> |
| 76 | +> **Offloading / casting strategies:** |
| 77 | +> 1. **Quality** — [specific strategy]. Full precision, no quality loss. [estimated VRAM / RAM / speed tradeoff]. |
| 78 | +> 2. **Speed** — [specific strategy]. [quality tradeoff]. [estimated VRAM / RAM]. |
| 79 | +> 3. **Memory saving** — [specific strategy]. Minimizes VRAM. [tradeoffs]. |
| 80 | +> |
| 81 | +> **Quantization strategies:** |
| 82 | +> 4. **int8 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Less quality loss than nf4. |
| 83 | +> 5. **nf4 [components]** — [with offloading if needed]. [estimated VRAM / RAM]. Maximum memory savings, some quality degradation. |
| 84 | +> |
| 85 | +> Which would you prefer? |
| 86 | + |
| 87 | +The key difference from a generic recommendation: every option shown should already be validated against the user's actual VRAM and RAM. Don't show options that won't fit. Read [quantization.md](quantization.md) for correct API usage when applying quantization strategies. |
| 88 | + |
| 89 | +### Step 4: Apply the strategy |
| 90 | + |
| 91 | +Propose **specific code changes** to the user's script. Always show the exact code diff. Read [reduce-memory.md](reduce-memory.md) and [layerwise-casting.md](layerwise-casting.md) for correct API usage before writing code. |
| 92 | + |
| 93 | +VAE tiling is a VRAM optimization: only add it when the VAE decode/encode would OOM without it, not by default. See [reduce-memory.md](reduce-memory.md) for thresholds, the correct API (`pipe.vae.enable_tiling()`; the pipeline-level call has been deprecated since v0.40.0), and which VAEs don't support it. |
| 94 | + |
| 95 | +## Reference guides |
| 96 | + |
| 97 | +Read these for correct API usage and detailed technique descriptions: |
| 98 | +- [memory-calculator.md](memory-calculator.md) — **Read this first when recommending strategies.** VRAM/RAM formulas for every technique, decision flowchart, and worked examples |
| 99 | +- [reduce-memory.md](reduce-memory.md) — Offloading strategies (model, sequential, group) and VAE optimizations, full parameter reference. **Authoritative source for compatibility rules.** |
| 100 | +- [layerwise-casting.md](layerwise-casting.md) — fp8 weight storage for memory reduction with minimal quality impact |
| 101 | +- [quantization.md](quantization.md) — int8/int4/fp8 quantization backends, text encoder quantization, common pitfalls |
| 102 | +- [attention-backends.md](attention-backends.md) — Attention backend selection for speed |
| 103 | +- [torch-compile.md](torch-compile.md) — torch.compile for inference speedup |
| 104 | + |
| 105 | +## Important compatibility rules |
| 106 | + |
| 107 | +See [reduce-memory.md](reduce-memory.md) for the full compatibility reference. Key constraints: |
| 108 | + |
| 109 | +- **`enable_model_cpu_offload()` and group offloading cannot coexist** on the same pipeline — use pipeline-level `enable_group_offload()` instead. |
| 110 | +- **`torch.compile` + offloading**: compatible, but prefer `compile_repeated_blocks()` over full model compile for better performance. See [torch-compile.md](torch-compile.md). |
| 111 | +- **`bitsandbytes_8bit` + `enable_model_cpu_offload()` fails** — int8 matmul cannot run on CPU. See [quantization.md](quantization.md) for the fix. |
| 112 | +- **Layerwise casting** can be combined with either group offloading or model CPU offloading (apply casting first). |
| 113 | +- **`bitsandbytes_4bit`** supports device moves and works correctly with `enable_model_cpu_offload()`. |