
 In-process evaluation suite for video generations. Includes pixel
 metrics (SSIM, PSNR, LPIPS), optical-flow comparisons, the full VBench
-suite, Physics-IQ, and a VLM scorer (VideoScore-2) behind a single
+suite, Physics-IQ, audio metrics, and a VLM scorer behind a single
 registry-driven API.

 ## Install

 | Use case | Install |
 |---|---|
-| Default (common, optical_flow, vbench-light, physics_iq, videoscore2) | `uv pip install -e .[eval]` |
-| Just VBench (12 of 16 sub-metrics) | `uv pip install -e .[eval-vbench]` |
+| Default (common, optical_flow, vbench, physics_iq, videoscore2) | `uv pip install -e .[eval]` |
+| Just VBench (11 of 16 by default; +4 with detectron2) | `uv pip install -e .[eval-vbench]` |
 | Just Physics-IQ (covered by `[eval]`) | `uv pip install -e .[eval-physics-iq]` |
-| Plus `vbench.scene` (AVoCaDO) | `uv pip install -e .[eval-full]` |
+| Audio metrics (CLAP, FAD, KL, WER, AudioBox, DeSync, ImageBind) | `uv pip install -e .[eval-audio]` |
+| Everything: `[eval]` + `[eval-audio]` + `vbench.scene` (AVoCaDO) | `uv pip install -e .[eval-full]` |
+
+`[eval-audio]` covers every `audio.*` metric. ImageBind
+(`facebookresearch/ImageBind`, CC BY-NC-SA 4.0) is git-sourced via
+`[tool.uv.sources]` rather than vendored. `torchaudio` at the cu128
+wheel is pulled transitively by `audiobox_aesthetics`; on cu128 hosts
+using raw `pip`, install `torchaudio` from
+`https://download.pytorch.org/whl/cu128` first.
+
+`audio.desync` and `audio.wer (glm_asr)` import vendored upstream from
+`fastvideo/third_party/eval/synchformer/` (MIT) and
+`fastvideo/third_party/eval/glmasr/` (Apache-2.0). Both trees keep
+their upstream `LICENSE` files alongside.
+
+### `audio.*` metric input contracts
+
+Every audio metric reads from these sample-dict keys (extra keys are
+ignored):
+
+| Metric | Per-sample? | Required keys |
+|---|---|---|
+| `audio.clap_score` | yes | `audio` (path), `text_prompt` (str) |
+| `audio.audiobox_aesthetics` | yes | `audio` (path) |
+| `audio.kl_divergence` | yes | `audio` (path), `reference_audio` (path) |
+| `audio.frechet_distance` | **set-vs-set** | `audio` (path), `reference_audio` (path) — accumulated across ≥2 samples; `corpus["audio.frechet_distance"]` carries the score |
+| `audio.wer` | yes | `audio` (path), `reference_text` (str) or `text_prompt` (str) |
+| `audio.desync` | yes | `video` (decoded tensor or path), `audio` (path) |
+| `audio.imagebind_score` | yes | `video_path` (str) **and** `audio` (path) — needs the path, not the pool-decoded tensor, because ImageBind's preprocessing decodes its own clips |
+
+`audio.frechet_distance` is the only set-vs-set metric. The kwargs
+form (`ev.evaluate(audio=...)`) raises with a clear message because a
+single sample cannot produce a corpus result; use
+`ev.evaluate(samples=[...])`.
+
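The required-key table above can be checked mechanically before a batch is handed to the evaluator. The sketch below is illustrative only: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, not part of the FastVideo API, and the key names are copied from the table. Each requirement is a tuple of acceptable alternatives, which is how the `reference_text` or `text_prompt` choice for `audio.wer` is modelled.

```python
# Hypothetical pre-flight check; not part of fastvideo. Key names are copied
# from the "Required keys" table. Each tuple lists acceptable alternatives.
REQUIRED_KEYS: dict[str, list[tuple[str, ...]]] = {
    "audio.clap_score": [("audio",), ("text_prompt",)],
    "audio.audiobox_aesthetics": [("audio",)],
    "audio.kl_divergence": [("audio",), ("reference_audio",)],
    "audio.frechet_distance": [("audio",), ("reference_audio",)],
    "audio.wer": [("audio",), ("reference_text", "text_prompt")],
    "audio.desync": [("video",), ("audio",)],
    "audio.imagebind_score": [("video_path",), ("audio",)],
}


def missing_keys(metric: str, sample: dict) -> list[str]:
    """Describe each unmet requirement for `metric`; extra keys are ignored."""
    missing = []
    for alternatives in REQUIRED_KEYS[metric]:
        if not any(key in sample for key in alternatives):
            missing.append(" or ".join(alternatives))
    return missing


sample = {"audio": "gen.wav", "text_prompt": "a dog barking", "seed": 0}
print(missing_keys("audio.clap_score", sample))       # []
print(missing_keys("audio.kl_divergence", sample))    # ['reference_audio']
print(missing_keys("audio.wer", sample))              # []
print(missing_keys("audio.imagebind_score", sample))  # ['video_path']
```

A check like this only covers key presence. It does not verify that paths exist or that `audio.frechet_distance` receives at least two samples.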
+### Reference repos for audio
+
+The audio set ports its math 1:1 from `hkchengrex/av-benchmark` (the
+V2A literature's de-facto eval harness — used by MMAudio, FoleyCrafter,
+V2A-Mapper). Per-metric upstream:
+
+| Metric | Upstream |
+|---|---|
+| `audio.frechet_distance` (PaSST-FAD) | `av_bench/metrics/fad.py::compute_fd` over `hear21passt` 768-d embeds |
+| `audio.kl_divergence` | `av_bench/metrics/kl.py::compute_kl` over PaSST 527-d logits |
+| `audio.clap_score` | HF `transformers.ClapModel` (`laion/clap-htsat-fused` — closest HF mirror of `630k-audioset-fusion-best`) |
+| `audio.audiobox_aesthetics` | `facebookresearch/audiobox-aesthetics` (PQ as primary score, CE/CU/PC in details) |
+| `audio.wer` | MagiHuman-style: NFKC + CJK char-level via `jiwer`, GLM-ASR or Whisper backbone |
+| `audio.desync` | `av_bench/synchformer/` (vendored under `third_party/eval/synchformer/`); checkpoint from `hkchengrex/MMAudio/releases/v0.1/synchformer_state_dict.pth` |
+| `audio.imagebind_score` | `facebookresearch/ImageBind` (`imagebind_huge` pretrained) |
 | Plus `vbench.{color, multiple_objects, object_class, spatial_relationship}` (GRiT) | `uv pip install -e .[eval-vbench]` then `uv pip install --no-build-isolation 'git+https://github.com/facebookresearch/detectron2.git'` |

 To use VBench, also pull the upstream submodule:
@@ -80,14 +129,18 @@ fastvideo/
 │       ├── base.py       # BaseMetric + @register contract
 │       ├── common/       # SSIM, PSNR, LPIPS
 │       ├── optical_flow/ # gt_optical_flow, synthetic_optical_flow
+│       ├── audio/        # clap_score, audiobox_aesthetics, kl_divergence,
+│       │                 # frechet_distance, wer, desync, imagebind_score
 │       ├── videoscore2/  # VideoScore-2 (Qwen2.5-VL)
 │       ├── physics_iq/   # PhysicsIQ + sub-metrics
 │       └── vbench/       # adapter: sys.path bootstrap + shims
 │           ├── __init__.py
 │           └── <16 sub-metric pkgs>
 └── third_party/
     └── eval/
-        └── vbench/       # git submodule (Vchitect/VBench)
+        ├── vbench/       # git submodule (Vchitect/VBench)
+        ├── synchformer/  # vendored (MIT), used by audio.desync
+        └── glmasr/       # vendored (Apache-2.0), used by audio.wer (glm_asr)
 ```

 ### Prompt datasets
@@ -134,31 +187,29 @@ class YourMetric(BaseMetric):
     needs_gpu = False
     dependencies: list[str] = []  # e.g. ["pyiqa"] if relevant

-    def compute(self, sample) -> list[MetricResult]:
+    def compute(self, sample) -> MetricResult:
         ...
 ```

 The metric is auto-discovered by `fastvideo/eval/metrics/__init__.py`,
 which walks all non-underscore subdirectories and imports their
 `metric` module.

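The auto-discovery step described above can be sketched with the standard library alone. This is not the code in `fastvideo/eval/metrics/__init__.py`; `discover_metrics` and the `demo_metrics` package are hypothetical. Skipping subdirectories that lack a `metric.py` is an assumption made here so the demo does not fail on unrelated folders.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def discover_metrics(package_name: str, package_dir: Path) -> list[str]:
    """Import `<package>.<subdir>.metric` for each non-underscore subdirectory."""
    imported = []
    for child in sorted(package_dir.iterdir()):
        if not child.is_dir() or child.name.startswith("_"):
            continue  # skips files, __pycache__, and private packages
        if not (child / "metric.py").is_file():
            continue  # assumption: folders without metric.py are ignored
        module_name = f"{package_name}.{child.name}.metric"
        importlib.import_module(module_name)  # import side effects register the metric
        imported.append(module_name)
    return imported


# Demo on a throwaway package (hypothetical layout, created in a temp dir).
root = Path(tempfile.mkdtemp())
pkg = root / "demo_metrics"
for sub in ("ssim", "_private"):
    (pkg / sub).mkdir(parents=True)
    (pkg / sub / "__init__.py").write_text("")
    (pkg / sub / "metric.py").write_text("NAME = 'demo'\n")
(pkg / "__init__.py").write_text("")

sys.path.insert(0, str(root))
importlib.invalidate_caches()  # make the freshly written package importable
found = discover_metrics("demo_metrics", pkg)
print(found)  # ['demo_metrics.ssim.metric']
```

In a design like this, registration happens as a side effect of importing each `metric` module, so a decorator such as `@register` only needs to run at import time.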
-### Wrapping upstream code via a submodule
-
-See `fastvideo/eval/metrics/vbench/` for a worked example. The
-contract is:
+### Wrapping upstream code

-1. Upstream lives as a git submodule under
-   `fastvideo/third_party/eval/<bench>/`, pinned to a SHA in repo-root
-   `.gitmodules`.
-2. The metric package's `__init__.py`
-   (`fastvideo/eval/metrics/<bench>/__init__.py`) inserts that
-   submodule path on `sys.path` and installs any compat shims for
-   modern torch/transformers/numpy. Do not modify upstream files on
-   disk.
-3. Per-sub-metric `metric.py` files use `@register("<bench>.<name>")`.
+Three patterns coexist depending on how the upstream ships and what
+licence it's under. All three keep upstream files on disk unmodified;
+behavioural patches live as runtime shims in the consuming code.

-Patches live as Python in the metric's `__init__.py` so they are
-grep-able and reviewable.
+1. **Git submodule** — large research packages pinned to a SHA, accessed
+   via `sys.path` bootstrap. See `fastvideo/eval/metrics/vbench/` (with
+   `fastvideo/third_party/eval/vbench/`).
+2. **Vendored under `third_party/eval/<name>/`** — small/surgical upstream
+   trees with permissive licences (MIT, Apache-2.0). See
+   `fastvideo/third_party/eval/synchformer/` and `.../glmasr/`.
+3. **Git-source via `[tool.uv.sources]`** — license-restricted upstream
+   that cannot be redistributed in the FastVideo source tree. See
+   ImageBind (CC BY-NC-SA 4.0) in `pyproject.toml`.

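For the git-submodule pattern, the `sys.path` bootstrap could look roughly like the sketch below. It is an illustration, not the adapter shipped in `fastvideo/eval/metrics/vbench/__init__.py`; `bootstrap_submodule` is a hypothetical helper, and prepending the path rather than appending it is an assumption.

```python
import sys
import tempfile
from pathlib import Path


def bootstrap_submodule(submodule_root: Path) -> bool:
    """Put a checked-out submodule on sys.path once.

    Returns True if the path was added and False if it was already present.
    Upstream files are only made importable; nothing on disk is modified.
    """
    if not submodule_root.is_dir():
        raise FileNotFoundError(
            f"{submodule_root} is missing; was the git submodule initialised?"
        )
    entry = str(submodule_root.resolve())
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)  # assumption: prepend so upstream modules win lookups
    return True


# Demo with a temporary directory standing in for the submodule checkout.
fake_checkout = Path(tempfile.mkdtemp())
first = bootstrap_submodule(fake_checkout)
second = bootstrap_submodule(fake_checkout)
print(first, second)  # True False
```

Keeping the bootstrap idempotent means the metric package can be imported repeatedly without growing `sys.path`. Any compat shims would be installed after this call, in the same `__init__.py`.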
 ## Caches

@@ -167,7 +218,7 @@ Eval cache root: `${FASTVIDEO_CACHE_ROOT}/eval/`, default

 ```
 ${FASTVIDEO_CACHE_ROOT}/eval/
-├── models/    # URL-fetched checkpoints (LAION head, AMT, GRiT)
+├── models/    # URL-fetched checkpoints (LAION head, GRiT, Synchformer, ImageBind)
 ├── torch/     # redirected TORCH_HOME (DINO via torch.hub, lpips)
 ├── clip/      # passed as download_root= to clip.load callsites
 └── datasets/  # auto-fetched dataset assets, one subdir per benchmark