Commit 1ae4eea

ynankani authored and kinjalpatel27 committed

Ynankani/fvd benchmark (#1182)
### What does this PR do?

**Type of change:** New benchmark addition

This PR adds the following to `examples/windows/`:

1. **FVD (Fréchet Video Distance) evaluation tool** — a standalone script for computing FVD between two sets of videos using a pre-trained I3D model (Kinetics-400, 1024-dim pooled features).
2. **Directory reorganization** — moved `torch_onnx/` and `qad_example/` under a new `diffusers/` folder for clearer grouping of diffusion model examples.
3. **Minor fixes** — updated the WikiText-2 dataset link in `perplexity_metrics/README.md` (old `wikitext` → `Salesforce/wikitext`).

### Usage

```bash
python compute_fvd.py \
    --ref-dir /path/to/reference/videos \
    --gen-dir /path/to/generated/videos \
    --output results.json \
    --device cuda
```

### Testing

Evaluated the FVD tool on LTX-2.3 outputs, comparing QAD vs PTQ checkpoints (both NVFP4) against a BF16 baseline across 11 VBench dimensions.

### Before your PR is "*Ready for review*"

- Make sure you read and follow the [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`?: N/A
- Did you write any new necessary tests?: N/A
- Did you update the [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Summary by CodeRabbit

- **New Features**
  - Added FVD (Fréchet Video Distance) evaluation: command-line tool, I3D-based feature extraction, PCA option, and quantized comparisons (PTQ vs QAD) with per-category and overall results showing QAD's lower average FVD.
- **Documentation**
  - Expanded benchmarks TOC with "Additional Metrics", added FVD evaluation guide, usage examples, troubleshooting, and updated perplexity dataset reference.
- **Chores**
  - Added Python requirements for the FVD benchmark.

Signed-off-by: ynankani <ynankani@nvidia.com>
Signed-off-by: Yash Nankani <ynankani@nvidia.com>
1 parent 232736b commit 1ae4eea

13 files changed

Lines changed: 938 additions & 2 deletions

File tree

examples/windows/Benchmark.md

Lines changed: 31 additions & 0 deletions
@@ -94,3 +94,34 @@ KL-divergence (Kullback-Leibler divergence) quantifies the distributional differ

*All KL-divergence results above are obtained via PyTorch fake quantization simulation unless otherwise noted. Inference with ONNX Runtime can also be evaluated.*

For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md).

#### 1.2.4 FVD (Fréchet Video Distance)

FVD measures the distributional similarity between two sets of videos using features extracted from a pre-trained I3D model (Kinetics-400). Lower FVD values indicate that the generated videos are closer to the reference. FVD is the standard metric for evaluating video generation quality.

**Learn more about FVD:** [FVD: A New Metric for Video Generation (arXiv)](https://arxiv.org/abs/1812.01717)

- **Reference baseline**: BF16 model outputs
- **Quantized models**: PTQ (Post-Training Quantization) and QAD (Quantization-Aware Distillation)
- **Model**: LTX-2.3 video generation
- **Evaluation dimensions**: [VBench](https://github.com/Vchitect/VBench) benchmark categories
- **Feature extractor**: I3D (1024-dim pooled features from `rgb_imagenet.pt`)

| Category | FVD: PTQ vs BF16 ↓ | FVD: QAD vs BF16 ↓ |
|:---|:---|:---|
| Temporal Flickering | 31.92 | 21.97 |
| Subject Dynamic Motion | 23.44 | 16.28 |
| Multiple Objects | 35.35 | 22.47 |
| Human Action | 30.08 | 21.82 |
| Object Class | 51.51 | 26.86 |
| Color | 36.52 | 25.09 |
| Spatial Relationship | 25.07 | 18.41 |
| Scene Background | 64.92 | 35.69 |
| Appearance Style | 31.08 | 20.82 |
| Temporal Style | 23.61 | 15.85 |
| Overall Consistency | 25.03 | 18.85 |
| **Average** | **34.41** | **22.19** |

QAD consistently outperforms PTQ across all VBench dimensions, achieving a 35% lower average FVD (22.19 vs 34.41).

For detailed instructions on computing FVD, please refer to the [FVD Evaluation Guide](./accuracy_benchmark/fvd_metrics/README.md).

examples/windows/accuracy_benchmark/README.md

Lines changed: 11 additions & 0 deletions
@@ -6,6 +6,7 @@

- [MMLU (Massive Multitask Language Understanding)](#mmlu-massive-multitask-language-understanding)
- [Setup](#setup)
- [Evaluation Methods](#evaluation-methods)
- [Additional Metrics](#additional-metrics)
- [API changes in ONNX Runtime GenAI v0.6](#api-changes-in-onnx-runtime-genai-v06)
- [Troubleshoot](#troubleshoot)

@@ -183,6 +184,16 @@ To evaluate the PyTorch Hugging Face (HF) model, use the `--ep pt` argument.

</details>

## Additional Metrics

| Metric | Directory | Description |
|--------|-----------|-------------|
| **KL Divergence** | [`kl_divergence_metrics/`](kl_divergence_metrics/) | Measures output similarity between two models using KL divergence |
| **Perplexity** | [`perplexity_metrics/`](perplexity_metrics/) | Evaluates language model quality using WikiText-2 perplexity |
| **FVD** | [`fvd_metrics/`](fvd_metrics/) | Computes Fréchet Video Distance between two sets of videos using I3D features |

Each sub-directory contains its own `README.md` with detailed setup and usage instructions.

## API changes in ONNX Runtime GenAI v0.6

In onnxruntime-genai (GenAI) v0.6, `generator.compute_logits()` and `generator_params.input_ids` are deprecated, and a new API `generator.append_tokens(List: token_ids)` is added (see GenAI [PR-867](https://github.com/microsoft/onnxruntime-genai/pull/867) for details).
Lines changed: 229 additions & 0 deletions

@@ -0,0 +1,229 @@

# FVD (Fréchet Video Distance) Evaluation Tool

## Overview

This tool computes the Fréchet Video Distance (FVD) between two sets of videos using a pre-trained I3D model (Kinetics-400, RGB stream). FVD is a distribution-level metric that measures how similar two collections of videos are — lower values indicate more similar distributions (0 = identical).
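Conceptually, FVD fits a Gaussian to each set's pooled I3D features and takes the closed-form Fréchet (2-Wasserstein) distance between the two Gaussians. A minimal NumPy/SciPy sketch of that computation (illustrative only; `frechet_distance` and `fit_gaussian` are not names from this tool's code):

```python
import numpy as np
from scipy import linalg

def fit_gaussian(features):
    """Fit (mean, covariance) to an (N, D) array of pooled I3D features."""
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Closed-form Frechet distance between two Gaussians:
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 @ sigma2)^(1/2))."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, which is discarded.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Two identical feature sets give a distance of 0 (up to floating-point noise), which is why FVD = 0 means identical distributions.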
**Primary Use Cases:**

1. **Model Optimization Validation** — Verify that quantized/pruned video generation models maintain output quality
2. **Precision Analysis** — Compare BF16 vs INT8 vs INT4 generated video outputs
3. **Framework Comparison** — Evaluate outputs across different inference backends

## Key Components

| Script | Purpose |
|--------|---------|
| `compute_fvd.py` | Main script — loads videos, extracts I3D features, computes FVD |
| `i3d_model.py` | I3D (Inception-v1 Inflated 3D) model architecture and weight loading |

### I3D Model Details

- **Architecture**: Inception-v1 inflated to 3D ([Carreira & Zisserman, CVPR 2017](https://arxiv.org/abs/1705.07750))
- **Weights**: `rgb_imagenet.pt` from [pytorch-i3d](https://github.com/piergiaj/pytorch-i3d) (~49 MB, auto-downloaded on first run)
- **Feature dimension**: 1024 (from the final average pooling layer)
- **Input**: 16-frame clips, center-cropped to 224×224, normalized to [-1, 1]
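The input preparation described above (16-frame clips, 224×224 center crop, [-1, 1] range) can be sketched in NumPy as follows — a simplified illustration, not the script's actual function:

```python
import numpy as np

CLIP_LENGTH = 16  # frames per clip fed to I3D
CROP_SIZE = 224   # spatial center-crop size

def preprocess_clip(frames):
    """Center-crop uint8 RGB frames of shape (T, H, W, 3) to 224x224,
    keep the first 16 frames, and rescale pixels from [0, 255] to [-1, 1]."""
    t, h, w, _ = frames.shape
    assert t >= CLIP_LENGTH and h >= CROP_SIZE and w >= CROP_SIZE
    top = (h - CROP_SIZE) // 2
    left = (w - CROP_SIZE) // 2
    clip = frames[:CLIP_LENGTH, top:top + CROP_SIZE, left:left + CROP_SIZE, :]
    return clip.astype(np.float32) / 127.5 - 1.0
```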
## Installation

### 1. Create and Activate a Virtual Environment (Recommended)

```bash
python -m venv fvd_env
source fvd_env/bin/activate  # Linux/macOS
# .\fvd_env\Scripts\Activate.ps1  # Windows PowerShell
```

### 2. Install Requirements

```bash
pip install -r requirements.txt
```

Note: For GPU acceleration, install PyTorch with CUDA support:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu129
```

## Usage Examples

### Quick Start

Compare two directories of videos:

```bash
python compute_fvd.py \
    --ref-dir /path/to/reference/videos \
    --gen-dir /path/to/generated/videos
```

The I3D weights (~49 MB) are downloaded automatically on first run and cached in `~/.cache/fvd/rgb_imagenet.pt`.

### Save Results to JSON

```bash
python compute_fvd.py \
    --ref-dir /path/to/reference/videos \
    --gen-dir /path/to/generated/videos \
    --output results.json
```

### Use a Locally Downloaded I3D Checkpoint

```bash
python compute_fvd.py \
    --ref-dir /path/to/reference/videos \
    --gen-dir /path/to/generated/videos \
    --weights ./rgb_imagenet.pt
```

### Increase Sample Count with Multiple Clips per Video

```bash
python compute_fvd.py \
    --ref-dir /path/to/reference/videos \
    --gen-dir /path/to/generated/videos \
    --clips-per-video 4 \
    --output results.json
```

### Specify Device and Batch Size

```bash
python compute_fvd.py \
    --ref-dir ./real --gen-dir ./fake \
    --device cuda \
    --batch-size 16
```

### Explicit PCA Dimension

```bash
python compute_fvd.py \
    --ref-dir ./real --gen-dir ./fake \
    --pca-dim 64 \
    --output results.json
```

## Configuration Parameters

### Required Parameters

| Parameter | Description |
|-----------|-------------|
| `--ref-dir` | Directory containing reference (real) videos |
| `--gen-dir` | Directory containing generated videos |

### Optional Parameters

| Parameter | Description | Default |
|-----------|-------------|---------|
| `--weights` | Path to I3D weights file | Auto-downloaded `rgb_imagenet.pt` |
| `--device` | Torch device (`cuda`, `cpu`, `cuda:0`) | Auto-detected |
| `--clip-length` | Number of frames per clip | 16 |
| `--clips-per-video` | Number of clips sampled per video | 1 |
| `--batch-size` | Batch size for I3D inference | 8 |
| `--pca-dim` | PCA dimension for features (0 to disable; auto-selected when clips < 1024) | Auto |
| `--output` | Path to save JSON results | None (prints to console) |

### Supported Video Formats

`.mp4`, `.avi`, `.mov`, `.mkv`, `.webm`, `.flv`, `.m4v`

Videos are discovered recursively under the specified directories.

## Expected Output

### Console Output

```text
2025-01-15 10:30:00 | INFO | Device: cuda
2025-01-15 10:30:02 | INFO | I3D model loaded from rgb_imagenet.pt (1024-dim features)
2025-01-15 10:30:02 | INFO | Reference videos: 100
2025-01-15 10:30:02 | INFO | Generated videos: 100
Loading ref: 100%|██████████| 100/100 [00:15<00:00, 6.5video/s]
Loading gen: 100%|██████████| 100/100 [00:14<00:00, 6.8video/s]
2025-01-15 10:30:32 | INFO | Total clips — ref: 100, gen: 100
Extracting ref features: 100%|██████████| 13/13 [00:08<00:00, 1.5it/s]
Extracting gen features: 100%|██████████| 13/13 [00:07<00:00, 1.6it/s]
2025-01-15 10:30:48 | INFO | FVD = 12.3456
```

### JSON Output

```json
{
  "fvd": 12.3456,
  "ref_dir": "/path/to/reference/videos",
  "gen_dir": "/path/to/generated/videos",
  "num_ref_clips": 100,
  "num_gen_clips": 100,
  "clip_length": 16,
  "clips_per_video": 1,
  "feature_dim": 1024,
  "pca_dim": null,
  "model": "I3D (Kinetics-400, 1024-dim pool)"
}
```
## Benchmark Results

### LTX-2.3 Video Generation — PTQ vs QAD (BF16 Reference)

FVD scores comparing PTQ-quantized and QAD-quantized LTX-2.3 video generation outputs against the BF16 baseline, evaluated across [VBench](https://github.com/Vchitect/VBench) dimensions. Lower is better.

| Category | FVD: PTQ vs BF16 ↓ | FVD: QAD vs BF16 ↓ |
|---|---|---|
| Temporal Flickering | 31.92 | 21.97 |
| Subject Dynamic Motion | 23.44 | 16.28 |
| Multiple Objects | 35.35 | 22.47 |
| Human Action | 30.08 | 21.82 |
| Object Class | 51.51 | 26.86 |
| Color | 36.52 | 25.09 |
| Spatial Relationship | 25.07 | 18.41 |
| Scene Background | 64.92 | 35.69 |
| Appearance Style | 31.08 | 20.82 |
| Temporal Style | 23.61 | 15.85 |
| Overall Consistency | 25.03 | 18.85 |
| **Average** | **34.41** | **22.19** |

**Takeaways:**

- QAD consistently outperforms PTQ across all 11 VBench dimensions, with an average FVD of **22.19** vs **34.41** (35% lower).
- The largest gaps are on **Scene Background** (64.92 vs 35.69) and **Object Class** (51.51 vs 26.86), indicating PTQ degrades spatial detail fidelity more than QAD.
- Both methods perform best on **Temporal Style** and **Subject Dynamic Motion**, suggesting temporal dynamics are more robust to quantization.

## Key Insights

- **Lower is better**: FVD = 0 means identical distributions.
- **Sample count matters**: FVD estimates are noisy below ~256 clips; 2048+ clips are recommended for publishable results. Use `--clips-per-video` to increase the sample count.
- **PCA auto-selection**: When the number of clips is less than the feature dimension (1024), PCA is automatically applied to avoid rank-deficient covariance matrices.
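The PCA fallback can be illustrated with a short NumPy sketch: fit principal directions on the pooled feature sets and project both before fitting the Gaussians (illustrative; `pca_project` is not a name from the tool's code, and the internal implementation may differ):

```python
import numpy as np

def pca_project(ref_feats, gen_feats, dim):
    """Project two (N, D) feature sets onto the top `dim` principal
    components fitted on the pooled data, so their covariance matrices
    become full-rank even when N < D (e.g. fewer than 1024 clips)."""
    pooled = np.concatenate([ref_feats, gen_feats], axis=0)
    mean = pooled.mean(axis=0)
    # Rows of vt are principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    basis = vt[:dim].T  # (D, dim) projection matrix
    return (ref_feats - mean) @ basis, (gen_feats - mean) @ basis
```

Note that fitting the projection on the pooled data applies the same basis to both sets, so the reduced features remain directly comparable.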
## Troubleshooting

### CUDA Out of Memory

**Solutions:**

- Reduce the batch size: `--batch-size 2`
- Use CPU: `--device cpu`
- Close other GPU applications

### No Videos Found

Ensure your video files have a supported extension (`.mp4`, `.avi`, etc.) and are located in or under the specified directory. The script searches recursively.

### Noisy / Unstable FVD Values

If FVD values vary significantly between runs, you likely have too few clips. Increase the sample count:

```bash
python compute_fvd.py --ref-dir ./real --gen-dir ./fake --clips-per-video 8
```

## References

- Unterthiner et al., ["FVD: A New Metric for Video Generation"](https://arxiv.org/abs/1812.01717), 2019
- Carreira & Zisserman, ["Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset"](https://arxiv.org/abs/1705.07750), CVPR 2017
- I3D PyTorch weights: [piergiaj/pytorch-i3d](https://github.com/piergiaj/pytorch-i3d)
