Changes from all commits · 203 commits
da3f43c
+baseline
jzhang38 Sep 28, 2025
85553c2
+ fp4 linear + fp4 attn, all with 16-bit bwd
jzhang38 Sep 28, 2025
6a58c3a
+ generator sage 3
jzhang38 Sep 28, 2025
9161ef6
update
jzhang38 Sep 29, 2025
8abfe23
update
jzhang38 Sep 29, 2025
a1ab4a7
stash
jzhang38 Sep 30, 2025
2a28c08
update
jzhang38 Oct 2, 2025
032016f
update
jzhang38 Oct 3, 2025
c47be5c
update
jzhang38 Oct 3, 2025
fa91001
update
jzhang38 Oct 4, 2025
130b466
save
jzhang38 Oct 5, 2025
84ed3ce
1005 morning
jzhang38 Oct 5, 2025
18aebfa
update
jzhang38 Oct 7, 2025
c137c9f
add real and fake quant precision tests
RandNMR73 Oct 20, 2025
3a9e4d3
fake quant done
jzhang38 Oct 23, 2025
b628b5f
save
jzhang38 Oct 23, 2025
ebdb160
checkpoint
RandNMR73 Nov 10, 2025
9b12552
nvfp4 utils in progress
RandNMR73 Nov 10, 2025
2fa6fc1
add inference repo
RandNMR73 Nov 11, 2025
fe9e94e
fix DeepGEMM path
RandNMR73 Nov 11, 2025
6a7f08e
fix DeepGEMM path
RandNMR73 Nov 11, 2025
b06c038
qat attn in progress + refactor nvfp4 utils
RandNMR73 Nov 12, 2025
d41912c
fix import
RandNMR73 Nov 12, 2025
9bcc411
checkpoint (qat attn in progress)
RandNMR73 Nov 16, 2025
3b70702
fix masking + causal and non-causal logic in qat attn
RandNMR73 Nov 17, 2025
7afe836
5090 testing
RandNMR73 Dec 24, 2025
8d68a4f
add SageAttn3 with QAT
RandNMR73 Dec 24, 2025
3641487
print sageattn file
RandNMR73 Dec 24, 2025
460df09
adjust sage3 block size to 64x64
RandNMR73 Dec 24, 2025
52480b2
adjus quant kernel block size
RandNMR73 Dec 24, 2025
346bbfa
revert block size to 128x128
RandNMR73 Dec 24, 2025
8678798
add delta_s back
RandNMR73 Dec 24, 2025
f6080f7
add sageattn3 inference example
RandNMR73 Dec 24, 2025
f2eb855
change block size back to 64x64
RandNMR73 Dec 24, 2025
d0e4417
disable delta_s
RandNMR73 Dec 24, 2025
56faaa3
re-enable delta_s since inference is terrible without it
RandNMR73 Dec 25, 2025
36cdddf
bigger batch size + no grad around wrapped_flash_attn
RandNMR73 Dec 25, 2025
efcb3c4
8 gpus
RandNMR73 Dec 25, 2025
ca3b398
6 gpus
RandNMR73 Dec 25, 2025
a34a6df
update training script
RandNMR73 Dec 25, 2025
eb160a2
modify scripts + launch qat run
RandNMR73 Dec 25, 2025
6c3d878
test sage3 inference with smoothing q but no delta s
RandNMR73 Dec 26, 2025
388f6f1
set per_block_mean=True and modify api
RandNMR73 Dec 26, 2025
dbd5bd8
disable delta s
RandNMR73 Dec 26, 2025
30eeec0
try sage3 finetuning again
RandNMR73 Jan 5, 2026
3adfb07
try sage3 finetuning with aligned backward pass
RandNMR73 Jan 7, 2026
d76dfa0
add inference script with custom weights
RandNMR73 Jan 13, 2026
654cc49
change block size to 128x128
RandNMR73 Jan 13, 2026
71ecf4a
fix qat_attn kwargs
RandNMR73 Jan 13, 2026
8d43721
5090 example
RandNMR73 Jan 13, 2026
4cd1297
a bunch of changes
RandNMR73 Jan 22, 2026
e21848a
add benchmarking script + script changes + fix distillation pipeline
RandNMR73 Jan 22, 2026
e332dbb
more scripts
RandNMR73 Jan 22, 2026
bdbf327
[feat] Add Matrix-Game 2.0 (#938)
H1yori233 Dec 20, 2025
2b8c937
[bugfix] Added VSA Padding logic (#944)
loaydatrain Dec 20, 2025
6c4cb78
[misc] Allow manual override of Pipeline class through override_pipel…
SolitaryThinker Dec 20, 2025
fdca8fd
[New Model] Hunyuan1.5 (#943)
JerryZhou54 Dec 21, 2025
b824f95
[docs] small fixes (#947)
RandNMR73 Dec 22, 2025
2f49e50
[feat] add sliding_tile attention triton kernel and ROCM support (#916)
ZiguanWang Dec 23, 2025
f2cced2
[rocm] Add rocm fastvideo docker image (#952)
SolitaryThinker Dec 23, 2025
9211fa8
Add LongCat T2V (Base, Distillation and Refinement) Support to FastVi…
alexzms Dec 23, 2025
8f4cd10
feat: consolidate attention kernels into unified fastvideo-kernel pac…
ShreejithSG Dec 24, 2025
254b49c
[kernel] Reorg and fix fastvideo-kernel (#962)
SolitaryThinker Dec 26, 2025
042c20f
[kernel] Release fastvideo-kernel v0.2.1 (#963)
SolitaryThinker Dec 26, 2025
9ba352f
[docs] refactor attention docs (#964)
SolitaryThinker Dec 26, 2025
1a637e7
[kernel] Fix docker release build for kernel (#965)
SolitaryThinker Dec 27, 2025
9eabb5d
[chore] release v0.1.7 (#955)
SolitaryThinker Dec 27, 2025
05214b5
[feat] Add new feature extractors for fvd (#954)
ketakitank Dec 27, 2025
6eedc0a
[fix]: fix sliding_tile_attn with sdpa(without flash_attn) (#967)
ZiguanWang Dec 29, 2025
bf1892b
[fix]: fix fastvideo-kernel Rocm build and Dockerfile for Rocm (#968)
ZiguanWang Dec 29, 2025
9c3b8ac
[fix]: fix STA trition kernel for AMD RDNA archs (#969)
ZiguanWang Dec 29, 2025
1afbf8c
[misc] Add util script to create diffuser HF repo from custom compone…
SolitaryThinker Dec 30, 2025
6c18dda
[kernel] add turbodiffusion kernels (#972)
SolitaryThinker Dec 30, 2025
e24ea17
[docs]: fix various broken links across the documentation (#979)
kuafou Jan 2, 2026
b894a82
[feat] Support absmax style quantization for FP8 (#981)
XOR-op Jan 2, 2026
f74d6e7
[New Model] Turbodiffusion (#971)
loaydatrain Jan 2, 2026
ac013ac
[feat] support Matrix-Game 2.0 streaming generation (#957)
H1yori233 Jan 3, 2026
a6b738f
[feat] Support text encoder weight override and quantization (#983)
XOR-op Jan 3, 2026
f72ced7
Layer offloading (#966)
Ohm-Rishabh Jan 4, 2026
0db7e3c
[docs] Update docs and README (#975)
SolitaryThinker Jan 4, 2026
5dc8625
[ci] increase ssim and lora inference test timeout (#985)
SolitaryThinker Jan 4, 2026
9d1cb80
[chore] release fastvideo-kernel 0.2.2 (#986)
SolitaryThinker Jan 5, 2026
98452d9
[chore] update wechat QR code (#988)
SolitaryThinker Jan 5, 2026
ae03169
Add LongCat-Video I2V and Video Continuation (Base, Distillation and …
shaoxiongduan Jan 5, 2026
f795892
[misc] pin fastvideo-kernel in .toml file (#989)
SolitaryThinker Jan 5, 2026
490ba38
[feat] add Turbodiffusion I2V pipeline (#984)
loaydatrain Jan 5, 2026
cd5c017
[misc] add pin_cpu_memory false for RTX 4090 (#990)
SolitaryThinker Jan 5, 2026
6b30205
[chore] release 0.1.7 (real) (#980)
SolitaryThinker Jan 5, 2026
44d32ad
[examples] Added longcat-video python api examples (#994)
shaoxiongduan Jan 6, 2026
7381e19
[docs]: add LoRA extraction utilities documentation (#992)
ShreejithSG Jan 6, 2026
3e4b8b2
[bugfix] Add configs for TurboDiffusion T2V/I2V models (#993)
loaydatrain Jan 6, 2026
c6dc7e8
dit
SolitaryThinker Jan 7, 2026
6b9626c
Revert "dit"
SolitaryThinker Jan 7, 2026
5753d92
[ci] temporarily disable turbodiffusion ssim test (#1000)
SolitaryThinker Jan 8, 2026
989f1d5
[CI] Fixed Turbodiffusion I2V CI (#1002)
loaydatrain Jan 13, 2026
d8b9d7c
[feat!] Disable FSDP inference by default (#1001)
XOR-op Jan 13, 2026
0d6c9a5
[misc] [bugfix] unpin 'av' in pyproject (#1009)
SolitaryThinker Jan 13, 2026
ae439b1
[feat] Introduce Cosmos 2.5 Text2World pipeline (#974)
KyleShao1016 Jan 15, 2026
8c99407
[CI] SSIM tests optimization: load all model weights from Modal persi…
alexzms Jan 16, 2026
f78843c
[CI] Fix OOM issues in ssim tests (#1011)
SolitaryThinker Jan 17, 2026
0224ea9
[Bug Fix] Add autograd wrapper for block-sparse attention in fastvide…
alexzms Jan 17, 2026
5ec0e09
[chore] release fastvideo-kernel 0.2.3 (#1018)
SolitaryThinker Jan 17, 2026
bee1a5a
[feat] Hooks API and layerwise offloading for all DiTs (#1006)
XOR-op Jan 17, 2026
9ba9a3f
[kernel] Fix fastvideo-kernel release workflow (#1019)
SolitaryThinker Jan 17, 2026
e2979d6
[kernel] [bugfix] [ci] bump v0.2.4. Fix STA output handling, TurboDif…
alexzms Jan 20, 2026
51bef40
Added LTX-2 Distilled T2V Generation (#1016)
shaoxiongduan Jan 21, 2026
aec9c4d
fix: SP for hunyuanvideo 1.5 (#1026)
XOR-op Jan 21, 2026
5121718
add inference repo
RandNMR73 Nov 11, 2025
ce95729
fix DeepGEMM path
RandNMR73 Nov 11, 2025
900cc8d
checkpoint (qat attn in progress)
RandNMR73 Nov 16, 2025
a7098ad
rebase
RandNMR73 Jan 22, 2026
aa49fb9
add a bunch of scripts + modify sage3 to support turning off two leve…
RandNMR73 Jan 25, 2026
c00fac9
fix random seed in training pipeline bug
RandNMR73 Jan 25, 2026
c8d3a05
update .gitmodules
RandNMR73 Jan 26, 2026
d127e51
updates
RandNMR73 Jan 27, 2026
2e7b3f7
update pyproject.toml
RandNMR73 Jan 27, 2026
afd6769
update pyrpoject.toml again
RandNMR73 Jan 27, 2026
1f39e5c
ignore fastvideo_kernel
RandNMR73 Jan 27, 2026
e57ffd2
update batch inference script
RandNMR73 Jan 27, 2026
dd2096a
fix sage3 api bug
RandNMR73 Jan 27, 2026
0c128c8
ignore fastvideo_kernel
RandNMR73 Jan 27, 2026
72b61ab
add sage3 fwd + bf16 bwd script
RandNMR73 Jan 27, 2026
75d6cf1
update benchmark scripts
RandNMR73 Jan 27, 2026
6f532a1
update benchmarking script attention flop counts
RandNMR73 Jan 27, 2026
643c8e9
fix 14B validation videos
RandNMR73 Jan 27, 2026
33a2d51
update sage3 to use per_block_mean=True
RandNMR73 Jan 27, 2026
fc10481
modified sage3 with cfg=3 inference
RandNMR73 Jan 28, 2026
3aa5b1f
update
RandNMR73 Jan 28, 2026
0543dbd
two level P ablation
RandNMR73 Jan 28, 2026
098ee30
smooth k ablation:
RandNMR73 Jan 28, 2026
58cb2f4
test 2 level quant q, k, v
RandNMR73 Jan 28, 2026
3ee1357
use the product of the two sfs
RandNMR73 Jan 28, 2026
9361c72
remove incorrect two level qkv and revert
RandNMR73 Jan 28, 2026
96609c0
sage3 inference 1.3B baseline
RandNMR73 Jan 28, 2026
a866163
fp4 1.3B inference
RandNMR73 Jan 29, 2026
af03817
add fa2 benchmarking
RandNMR73 Jan 29, 2026
011b9e7
add combined benchmarks
RandNMR73 Jan 29, 2026
c1a6d7f
update benchmark scripts
RandNMR73 Jan 29, 2026
68002a0
update default num heads
RandNMR73 Jan 29, 2026
061c205
vsa with qat training kernel
RandNMR73 Feb 14, 2026
9c3d7e3
clean
RandNMR73 Apr 8, 2026
1cdec40
Merge branch 'matthew/clean' into sync-branch
RandNMR73 Apr 8, 2026
ad099ca
clean
RandNMR73 Apr 8, 2026
bda89a5
fix
RandNMR73 Apr 8, 2026
46cb669
clean
RandNMR73 Apr 8, 2026
088cc8d
clean
RandNMR73 Apr 8, 2026
e3e075a
clean
RandNMR73 Apr 8, 2026
12829a6
clean
RandNMR73 Apr 8, 2026
764fa94
clean
RandNMR73 Apr 8, 2026
7126879
clean
RandNMR73 Apr 8, 2026
6278c9b
clean
RandNMR73 Apr 8, 2026
84f615b
clean
RandNMR73 Apr 8, 2026
0232183
clean
RandNMR73 Apr 8, 2026
026ca65
clean
RandNMR73 Apr 8, 2026
12639e1
clean
RandNMR73 Apr 8, 2026
8af2a5b
clean
RandNMR73 Apr 9, 2026
027c324
clean
RandNMR73 Apr 9, 2026
d96de5f
clean
RandNMR73 Apr 9, 2026
5832ad9
clean
RandNMR73 Apr 9, 2026
4ed79b9
fix
RandNMR73 Apr 9, 2026
43d277e
fix precommit
RandNMR73 Apr 9, 2026
3fd5502
fix
RandNMR73 Apr 9, 2026
8a70dd0
Apply suggestion from @gemini-code-assist[bot]
RandNMR73 Apr 9, 2026
159376f
fix
RandNMR73 Apr 9, 2026
a06fb48
fix
RandNMR73 Apr 9, 2026
9a6e327
fix
RandNMR73 Apr 9, 2026
bf010e8
fix
RandNMR73 Apr 9, 2026
6e7d1a9
fix
RandNMR73 Apr 9, 2026
54b2182
fix
RandNMR73 Apr 9, 2026
1fa1739
fix
RandNMR73 Apr 9, 2026
f313af0
fix
RandNMR73 Apr 9, 2026
f1e7b2a
fix
RandNMR73 Apr 9, 2026
4636a8a
fix
RandNMR73 Apr 9, 2026
2e2899f
fix
RandNMR73 Apr 9, 2026
e4a7074
fix
RandNMR73 Apr 9, 2026
1d3ff73
fix
RandNMR73 Apr 9, 2026
24d3f97
fix
RandNMR73 Apr 9, 2026
a6a1efd
fix
RandNMR73 Apr 9, 2026
5230aaa
fix
RandNMR73 Apr 9, 2026
6e1b285
fix
RandNMR73 Apr 9, 2026
6ccc603
fix
RandNMR73 Apr 9, 2026
5fab762
fix
RandNMR73 Apr 9, 2026
eccf1a9
fix
RandNMR73 Apr 9, 2026
10d99ea
fix
RandNMR73 Apr 9, 2026
bf39c83
fix
RandNMR73 Apr 9, 2026
952b702
fix
RandNMR73 Apr 9, 2026
2619107
fix
RandNMR73 Apr 9, 2026
b46ce2c
fix
RandNMR73 Apr 9, 2026
ee775e8
fix
RandNMR73 Apr 9, 2026
cb5eb77
fix
RandNMR73 Apr 9, 2026
a3a4e04
fix
RandNMR73 Apr 9, 2026
88f9ac1
fix
RandNMR73 Apr 9, 2026
16d8874
all tests passing
RandNMR73 Apr 9, 2026
15bda19
fix
RandNMR73 Apr 9, 2026
b978a92
fix
RandNMR73 Apr 9, 2026
ca96349
fix
RandNMR73 Apr 9, 2026
aafa257
fix
RandNMR73 Apr 9, 2026
12a993e
fix
RandNMR73 Apr 9, 2026
6b42166
fix
RandNMR73 Apr 10, 2026
52e3484
fix
RandNMR73 Apr 10, 2026
42a292e
fix
RandNMR73 Apr 10, 2026
567dd87
Update .gitignore
RandNMR73 Apr 11, 2026
3f818d0
Remove explicit_package_bases from mypy settings
RandNMR73 Apr 11, 2026
Empty file added .codex
Empty file.
318 changes: 318 additions & 0 deletions docs/attention/attn_qat/index.md
@@ -0,0 +1,318 @@
# Attention QAT

Attention QAT in FastVideo covers two related but distinct backends:

- `ATTN_QAT_INFER`: the inference-oriented CUDA kernel path
- `ATTN_QAT_TRAIN`: the training-oriented Triton attention path

Both are selected with `FASTVIDEO_ATTENTION_BACKEND`, but they are not
interchangeable. The main practical split is:

- use `ATTN_QAT_INFER` for standalone inference with the dedicated inference
  kernel
- use `ATTN_QAT_TRAIN` for finetuning, validation during training, or when you
  specifically want to reproduce the training-side attention path
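
A minimal selection sketch (the backend must be chosen per process, before any
FastVideo components are constructed; both values come from this page):

```python
import os

# Pick exactly one backend; the training and inference paths are not
# interchangeable.
os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "ATTN_QAT_INFER"  # or "ATTN_QAT_TRAIN"
```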

## Quick Start

If your goal is "run Wan 2.1 14B with Attention QAT inference weights", this is
the shortest path:

1. Build the in-repo kernel package so FastVideo can import `attn_qat_infer`.
2. Download the Wan 2.1 14B QAT checkpoint.
3. Edit the provided inference example to point at the 14B base model and the
   downloaded QAT safetensors.
4. Run the example with `ATTN_QAT_INFER`.

### Step 1. Build the kernel package

Before using either Attention QAT backend, build the in-repo
`fastvideo-kernel` package from source:

```bash
git submodule update --init --recursive
cd fastvideo-kernel
./build.sh
```

After a successful build:

- `ATTN_QAT_TRAIN` should be able to import `fastvideo_kernel`
- `ATTN_QAT_INFER` should be able to import `attn_qat_infer`

`ATTN_QAT_INFER` currently targets the Blackwell CUDA path under
`fastvideo-kernel/attn_qat_infer/` and requires CUDA 12.8+.
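
A quick import check (a sketch that only verifies the packages are importable
after the build, not that the kernels run):

```python
import importlib

# Both packages should be importable after a successful ./build.sh run.
for pkg in ("fastvideo_kernel", "attn_qat_infer"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: import OK")
    except ImportError as exc:
        print(f"{pkg}: missing ({exc})")
```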

### Step 2. Download the Wan 2.1 14B QAT checkpoint

FastVideo includes a helper script:

- `examples/inference/optimizations/download_14B_qat.sh`

By default it downloads:

- Hugging Face repo: `FastVideo/14B_qat_400`
- local directory: `checkpoints/14B_qat_400`

Prerequisites:

- `huggingface_hub` installed, for example:
  `uv pip install huggingface_hub`
- access to the model repo if it is private or gated:
  `huggingface-cli login`

Run the downloader:

```bash
bash examples/inference/optimizations/download_14B_qat.sh
```

To download into a custom directory:

```bash
bash examples/inference/optimizations/download_14B_qat.sh /path/to/14B_qat_400
```

The script prints a ready-to-copy `init_weights_from_safetensors=...` value at
the end.
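
If you prefer not to use the shell script, the same download can be done
directly with `huggingface_hub` (a sketch that mirrors the script's default
repo and directory):

```python
from huggingface_hub import snapshot_download

# Defaults taken from download_14B_qat.sh.
local_dir = snapshot_download(
    repo_id="FastVideo/14B_qat_400",
    local_dir="checkpoints/14B_qat_400",
)
print(f"init_weights_from_safetensors={local_dir!r}")
```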

### Step 3. Edit the provided inference example

The example to start from is:

- `examples/inference/optimizations/attn_qat_inference_example.py`

Open that file and update these two values:

1. Change the base model from `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` to
   `Wan-AI/Wan2.1-T2V-14B-Diffusers`
2. Replace `init_weights_from_safetensors="safetensors_path"` with the
   directory that contains the downloaded `.safetensors` files

Example:

```python
import os

from fastvideo import VideoGenerator

os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "ATTN_QAT_INFER"

generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    num_gpus=1,
    use_fsdp_inference=True,
    dit_cpu_offload=False,
    vae_cpu_offload=False,
    text_encoder_cpu_offload=True,
    pin_cpu_memory=False,
    init_weights_from_safetensors="checkpoints/14B_qat_400",
)
```

Important:

- the checked-in example currently uses the `1.3B` base model until you edit it
- do not load the 14B QAT weights on top of the `1.3B` base model; the weights
  and model config will not match

### Step 4. Run the inference example

```bash
python examples/inference/optimizations/attn_qat_inference_example.py
```

Generated videos are written to `video_samples/` by default.

## Backend Overview

| Backend | Best for | Package requirement | Primary kernel location |
|---------|----------|---------------------|-------------------------|
| `ATTN_QAT_TRAIN` | finetuning, training-time validation, reproducing the training path | `fastvideo_kernel` | `fastvideo-kernel/python/fastvideo_kernel/triton_kernels/attn_qat_train.py` |
| `ATTN_QAT_INFER` | standalone inference with the dedicated CUDA kernel | `attn_qat_infer` from the in-repo `fastvideo-kernel` checkout | `fastvideo-kernel/attn_qat_infer/` |

FastVideo routes backend selection through:

- `fastvideo/envs.py`
- `fastvideo/platforms/cuda.py`
- `fastvideo/attention/backends/attn_qat_train.py`
- `fastvideo/attention/backends/attn_qat_infer.py`

The legacy training pipeline also contains explicit Attention QAT integration:

- `fastvideo/training/training_pipeline.py`

That pipeline forces generator loading through `ATTN_QAT_TRAIN` when
`FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_TRAIN` or `--generator_4bit_attn` is
enabled.

## Inference Workflows

For standalone inference, prefer `ATTN_QAT_INFER` when the CUDA kernel is
available. Use `ATTN_QAT_TRAIN` for inference only if you intentionally want to
exercise the training-side attention path for debugging or parity checks.

### Minimal Python example

```python
import os

from fastvideo import VideoGenerator

os.environ["FASTVIDEO_ATTENTION_BACKEND"] = "ATTN_QAT_INFER"

generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    num_gpus=1,
)

generator.generate_video(
    "A cinematic close-up of rain on a neon street at night.",
    output_path="video_samples",
    save_video=True,
)
```

### Loading custom safetensors during inference

FastVideo supports loading custom transformer weights through
`init_weights_from_safetensors`.

This value can point to either:

- a directory containing one or more `.safetensors` files
- a single `.safetensors` file

For Wan 2.1 14B QAT inference, the common pattern is:

```python
generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    num_gpus=1,
    use_fsdp_inference=True,
    init_weights_from_safetensors="checkpoints/14B_qat_400",
)
```
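
A single `.safetensors` file is also accepted. The file name below is
hypothetical; use the actual file name inside your checkpoint directory:

```python
generator = VideoGenerator.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    num_gpus=1,
    # Hypothetical file name, for illustration only.
    init_weights_from_safetensors="checkpoints/14B_qat_400/model.safetensors",
)
```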

### CLI example

You can also force the backend from the command line:

```bash
FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_INFER \
fastvideo generate \
  --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
  --num-gpus 1 \
  --sp-size 1 \
  --tp-size 1 \
  --height 480 \
  --width 832 \
  --num-frames 77 \
  --num-inference-steps 50 \
  --guidance-scale 6.0 \
  --prompt "A cinematic close-up of rain on a neon street at night." \
  --output-path outputs_video/
```

If you want to use custom QAT transformer weights from the CLI, pass the same
custom weight override that the Python API uses:

```bash
FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_INFER \
fastvideo generate \
  --model-path Wan-AI/Wan2.1-T2V-14B-Diffusers \
  --init-weights-from-safetensors checkpoints/14B_qat_400 \
  --num-gpus 1 \
  --output-path outputs_video/ \
  --prompt "A cinematic close-up of rain on a neon street at night."
```

## Training Workflows

Today the checked-in Attention QAT training launchers use the legacy training
pipeline in `fastvideo/training/wan_training_pipeline.py`.

### Ready-made launchers

Use the provided SLURM scripts directly:

```bash
sbatch examples/training/finetune/wan_t2v_1.3B/crush_smol/finetune_t2v_qat_attn.sh
sbatch examples/training/finetune/wan_t2v_14B/finetune_t2v_qat_attn.sh
```

Both scripts already set:

```bash
export FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_TRAIN
```

Before launching, update the script-local values that depend on your
environment:

- `WANDB_API_KEY`
- `MODEL_PATH`
- `DATA_DIR`
- `VALIDATION_DATASET_FILE`
- output directory and SLURM resource requests

### What the launchers run

The training scripts eventually invoke:

```bash
torchrun fastvideo/training/wan_training_pipeline.py ...
```

If you are adapting the workflow to your own cluster or running outside SLURM,
the main Attention QAT requirement is still:

```bash
export FASTVIDEO_ATTENTION_BACKEND=ATTN_QAT_TRAIN
```

Then launch the normal Wan training pipeline with your preferred `torchrun`
arguments and training flags.
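
For a Python-driven launch outside SLURM, the overall shape is roughly this
(the GPU count and the elided flags are illustrative, not taken from the
checked-in scripts):

```python
import os
import subprocess

# Force the QAT training backend for the torchrun child processes.
env = dict(os.environ, FASTVIDEO_ATTENTION_BACKEND="ATTN_QAT_TRAIN")

subprocess.run(
    [
        "torchrun",
        "--nproc_per_node=8",  # assumed GPU count
        "fastvideo/training/wan_training_pipeline.py",
        # ... your usual training flags ...
    ],
    env=env,
    check=True,
)
```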

## Where The Code Lives

Use these paths when you want to trace or modify the Attention QAT flow:

| Location | Purpose |
|----------|---------|
| `fastvideo/attention/backends/attn_qat_train.py` | FastVideo wrapper that imports and calls the Triton training kernel |
| `fastvideo/attention/backends/attn_qat_infer.py` | FastVideo wrapper that imports and calls the inference kernel |
| `fastvideo-kernel/CMakeLists.txt` | Kernel build definition that compiles the `attn_qat_infer` inference extensions |
| `fastvideo/platforms/cuda.py` | Chooses the concrete attention backend at runtime |
| `fastvideo/envs.py` | Documents supported `FASTVIDEO_ATTENTION_BACKEND` values |
| `fastvideo/training/training_pipeline.py` | Training-time forcing logic for the generator attention backend |
| `fastvideo-kernel/python/fastvideo_kernel/triton_kernels/attn_qat_train.py` | Triton implementation for `ATTN_QAT_TRAIN` |
| `fastvideo-kernel/attn_qat_infer/api.py` | Python API entrypoint for the inference kernel |
| `fastvideo-kernel/benchmarks/benchmark_*.py` | Kernel-side benchmark scripts for FlashAttn2, SageAttention3, FP4, and comparison plots |
| `fastvideo-kernel/attn_qat_infer/blackwell/api.cu` | CUDA implementation behind `ATTN_QAT_INFER` |
| `fastvideo-kernel/tests/test_attn_qat_train.py` | Kernel-level test coverage for the training path |
| `examples/inference/optimizations/attn_qat_inference_example.py` | Ready-to-edit inference example for custom Attention QAT weights |
| `examples/inference/optimizations/download_14B_qat.sh` | Helper script for downloading the Wan 2.1 14B QAT checkpoint |
| `examples/training/finetune/wan_t2v_1.3B/crush_smol/finetune_t2v_qat_attn.sh` | Ready-to-run Wan 1.3B Attention QAT finetune launcher |
| `examples/training/finetune/wan_t2v_14B/finetune_t2v_qat_attn.sh` | Ready-to-run Wan 14B Attention QAT finetune launcher |

## Troubleshooting

- If `ATTN_QAT_TRAIN` fails to import, verify that `fastvideo-kernel` built
  successfully and exposes `fastvideo_kernel`.
- If `ATTN_QAT_INFER` fails to import, verify that the local build exposes the
  `attn_qat_infer` package.
- If the Wan 2.1 14B example fails after you changed only the checkpoint path,
  make sure you also changed the base model to
  `Wan-AI/Wan2.1-T2V-14B-Diffusers`.
- If you hit issues with CPU memory pressure or obscure CUDA argument errors in
  the example script, try setting `pin_cpu_memory=False`.
- If you want a known-safe fallback for debugging, use
  `FASTVIDEO_ATTENTION_BACKEND=TORCH_SDPA`.

## Related Pages

- [Attention Overview](../index.md)
- [Inference Optimizations](../../inference/optimizations.md)
- [Debugging](../../utilities/debugging.md)
2 changes: 2 additions & 0 deletions docs/attention/index.md
@@ -5,6 +5,8 @@ FastVideo provides highly optimized custom attention kernels to accelerate video
## Supported Kernels

* **[Video Sparse Attention (VSA)](vsa/index.md)**: Sparse attention mechanism selecting top-k blocks.
* **[Attention QAT](attn_qat/index.md)**: Dedicated guide for Attention QAT
  inference, training, checkpoint loading, and troubleshooting.
* **[Sliding Tile Attention (STA)](sta/index.md)**: STA kernel support is kept in
  `fastvideo-kernel`; full FastVideo STA pipeline workflow is archived in
  `sta_do_not_delete`.
14 changes: 5 additions & 9 deletions docs/design/inference_schema_parity_inventory.yaml
@@ -35,6 +35,7 @@ surfaces:
prompt_txt: request.inputs.prompt_path
override_text_encoder_safetensors: generator.pipeline.components.text_encoder_weights
override_text_encoder_quant: generator.engine.quantization.text_encoder_quant
transformer_quant: generator.engine.quantization.transformer_quant
override_transformer_cls_name: generator.pipeline.components.override_transformer_cls_name
init_weights_from_safetensors: generator.pipeline.components.transformer_weights
init_weights_from_safetensors_2: generator.pipeline.components.transformer_2_weights
@@ -345,6 +346,7 @@
num_inference_steps: request.sampling.num_inference_steps
num_inference_steps_sr: request.sampling.num_inference_steps_sr
guidance_scale: request.sampling.guidance_scale
guidance_scale_2: request.sampling.guidance_scale_2
guidance_rescale: request.sampling.guidance_rescale
boundary_ratio: request.sampling.boundary_ratio
sigmas: request.sampling.sigmas
@@ -364,15 +366,7 @@
data_type: "Derived from the request shape and not a public input."

sampling_param_extensions:
moved:
guidance_scale_2:
target: request.sampling.guidance_scale_2
sources:
- fastvideo.configs.sample.lingbotworld.LingBotWorld_SamplingParam
- fastvideo.configs.sample.lingbotworld.Wan2_2_I2V_A14B_SamplingParam
- fastvideo.configs.sample.wan.SelfForcingWan2_2_T2V_A14B_480P_SamplingParam
- fastvideo.configs.sample.wan.Wan2_2_I2V_A14B_SamplingParam
- fastvideo.configs.sample.wan.Wan2_2_T2V_A14B_SamplingParam
moved: {}
profile_owned:
action_list:
target: request.extensions.hunyuangamecraft.action_list
Expand Down Expand Up @@ -597,6 +591,7 @@ cli:
- text_encoder_cpu_offload
- text_encoder_precisions
- torch_compile_kwargs
- transformer_quant
- tp_size
- trust_remote_code
- use_fsdp_inference
@@ -702,6 +697,7 @@
- text_encoder_cpu_offload
- text_encoder_precisions
- torch_compile_kwargs
- transformer_quant
- tp_size
- trust_remote_code
- use_fsdp_inference
5 changes: 5 additions & 0 deletions docs/design/overview.md
@@ -167,6 +167,11 @@ How this maps to FastVideo:

- Attention backends live in `fastvideo/attention/` and can be selected via
  `FASTVIDEO_ATTENTION_BACKEND`.
- SageAttention3 is split into two selectable backends: `SAGE_ATTN_THREE` for
  the regular upstream package and `ATTN_QAT_INFER` for the
  FastVideoKernel-backed inference variant.
- `ATTN_QAT_TRAIN` is a separate FastVideoKernel Triton backend for the QAT
  attention path.
- `LocalAttention` is used for cross-attention and most attention layers.
- `DistributedAttention` is used for full-sequence self-attention in the DiT.
- Tensor-parallel layers live in `fastvideo/layers/`.
2 changes: 2 additions & 0 deletions docs/inference/inference_quick_start.md
@@ -107,6 +107,8 @@ If you encounter CUDA out of memory errors:
(single GPU) or `use_fsdp_inference=True` (multi-GPU)
- Try a smaller model or use distilled versions
- Use `num_gpus` > 1 if multiple GPUs are available
- Try enabling FSDP inference with `use_fsdp_inference=True` (may slow down generation)
- Try enabling DiT layerwise offload with `dit_layerwise_offload=True` (currently supported by only a few models, but typically adds less overhead than FSDP)

### Slow Generation
