
Commit f4208d2

QVAC-20984 feat: add analytic gradchecked backward pass for the CAMPPlus speaker encoder (#61)
* feat: add analytic gradchecked backward pass for CAMPPlus speaker encoder

  Make CAMPPlus differentiable for the voice-clone enrollment loop: an analytic C++ backward returning d(loss)/d(fbank) with frozen weights (target-WAV embedding stays forward-only). Mirrors campplus_embed_cpu in channel-major layout. Covers FCM (Conv2d + residual blocks), TDNN, CAMDenseTDNN blocks (context-attention gate + dense concat), stats pooling and the dense head.

  Tests (always-on unit tier, model-free):
  - test-campplus-backward: gradcheck every primitive + full chain vs central finite differences (Task 2 harness).
  - test-campplus-backward-parity: analytic double forward vs production campplus_embed_cpu on synthetic weights.

  QVAC-20984

* test: anchor CAMPPlus backward parity to multi-layer CAM blocks

  Address PR #61 review notes (non-blocking):
  - Parity test now builds CAM blocks with num_layers 2/3/2 (was 1/1/1) so the dense-concat accumulation (layer i enters with C_in + i*growth) is anchored to the production forward, not only to the self-referential full-chain gradcheck. Parity stays green (max_abs ~4.6e-08, max_rel ~8.9e-08).
  - Document the trust chain in the parity test header and the gap-matrix doc: every campplus_embed caller in the repo (main.cpp, test-campplus, test-voice-embedding) uses the scalar CPU forward, which is validated against the Python reference; campplus_embed_ggml is not wired to any caller yet.
1 parent a85a444 commit f4208d2

6 files changed

Lines changed: 1975 additions & 0 deletions


tts-cpp/CMakeLists.txt

Lines changed: 29 additions & 0 deletions
@@ -857,6 +857,35 @@ if (TTS_CPP_BUILD_TESTS)
     tts_cpp_apply_ccache(test-supertonic-vector-estimator-backward)
     tts_cpp_register_test(test-supertonic-vector-estimator-backward LABEL "unit")

+    # QVAC-20984: analytic backward of the CAMPPlus speaker encoder (FCM Conv2d
+    # head + residual blocks, TDNN, CAM dense-TDNN blocks with context-attention
+    # gating and dense concat, statistics pooling, dense head). Model-free: every
+    # analytic input-gradient is gradchecked against finite differences via the
+    # Task 2 harness, so it ALWAYS runs on a fresh checkout (no-skip policy, no
+    # model/fixtures needed).
+    add_executable(test-campplus-backward
+        test/test_campplus_backward.cpp
+        src/campplus_backward.cpp
+        src/voiceclone_gradcheck.cpp)
+    target_include_directories(test-campplus-backward PRIVATE src)
+    tts_cpp_apply_ccache(test-campplus-backward)
+    tts_cpp_register_test(test-campplus-backward LABEL "unit")
+
+    # Forward-parity: the analytic double forward must match the production scalar
+    # CAMPPlus forward (campplus_embed_cpu) on synthetic weights, anchoring the
+    # gradcheck to the real model. Links campplus.cpp -> ggml.
+    add_executable(test-campplus-backward-parity
+        test/test_campplus_backward_parity.cpp
+        src/campplus_backward.cpp
+        src/campplus.cpp)
+    target_link_libraries(test-campplus-backward-parity PRIVATE ggml)
+    target_include_directories(test-campplus-backward-parity PRIVATE ggml/include src)
+    if (OpenMP_CXX_FOUND)
+        target_link_libraries(test-campplus-backward-parity PRIVATE OpenMP::OpenMP_CXX)
+    endif()
+    tts_cpp_apply_ccache(test-campplus-backward-parity)
+    tts_cpp_register_test(test-campplus-backward-parity LABEL "unit")
+
     # Engine-level streaming-callback contract test for the per-sentence
     # segmentation path (Fix #2): monotonic global chunk_index, single final
     # is_last, result.pcm == concat(callbacks), accumulated stats. Gated on
Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
# Voice-clone backward: CAMPPlus speaker encoder (op × backend gap matrix)

Scope for ticket *"GGML backward pass: CAMPPlus speaker encoder"* (QVAC-20984).
This doc scopes the work to make the CAMPPlus speaker encoder **differentiable in
GGML** on the CPU path used for enrollment, and records which backward ops are
still missing in the vendored `ggml`.

It is committed alongside the interim deliverable of this PR: an analytic,
gradchecked C++ backward of the whole CAMPPlus chain
(`src/campplus_backward.{h,cpp}`). See
[Interim vs Phase-2](#interim-solution-shipped-in-this-pr) for how the two
relate.

## Why the gap exists

In the enrollment loop CAMPPlus provides the **speaker-similarity loss** between
the target-WAV embedding (constant, forward-only) and the generated-audio
embedding. Only the generated-audio path needs gradients, so the gradient we
need is `d(loss)/d(fbank)`: the input gradient with the model weights frozen.
The fbank is differentiated further back to the waveform by a separate stage;
this module stops at the CAMPPlus input.

A fully GGML-native backward (the Phase-2 goal, needed by the on-device
enrollment loop) requires every op on the forward graph to have a backward in
`ggml_compute_backward` (`ggml/src/ggml.c`) **and** a CPU kernel for the ops the
backward expands into. Several are missing today.

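To make the "target constant, generated path differentiable" split concrete, here is a minimal sketch of the seed gradient that would be fed into the CAMPPlus backward. The document does not say which speaker-similarity loss is used; the cosine distance below, and the names `cosine_loss` and `cosine_loss_grad_gen`, are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// ASSUMPTION: loss = 1 - cos(gen, target). The real loss is not specified here.
double cosine_loss(const std::vector<double>& gen, const std::vector<double>& target) {
    assert(gen.size() == target.size());
    double dot = 0.0, ng = 0.0, nt = 0.0;
    for (std::size_t i = 0; i < gen.size(); ++i) {
        dot += gen[i] * target[i];
        ng += gen[i] * gen[i];
        nt += target[i] * target[i];
    }
    return 1.0 - dot / (std::sqrt(ng) * std::sqrt(nt));
}

// Gradient with respect to the generated-audio embedding only; the target
// embedding is treated as a constant, so it receives no gradient.
//   d(cos)/d(gen_i) = target_i / (|gen||target|) - cos * gen_i / |gen|^2
std::vector<double> cosine_loss_grad_gen(const std::vector<double>& gen,
                                         const std::vector<double>& target) {
    assert(gen.size() == target.size());
    double dot = 0.0, ng = 0.0, nt = 0.0;
    for (std::size_t i = 0; i < gen.size(); ++i) {
        dot += gen[i] * target[i];
        ng += gen[i] * gen[i];
        nt += target[i] * target[i];
    }
    const double norms = std::sqrt(ng) * std::sqrt(nt);
    const double cosv = dot / norms;
    std::vector<double> grad(gen.size());
    for (std::size_t i = 0; i < gen.size(); ++i)
        grad[i] = -(target[i] / norms - cosv * gen[i] / ng);
    return grad;
}
```

Under that assumption, the vector returned by `cosine_loss_grad_gen` is the `d(loss)/d(embedding)` that a backward pass through the frozen encoder would turn into `d(loss)/d(fbank)`.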
## Forward ops on the CAMPPlus path

Source: `src/campplus_forward.inc` (the GGML graph) and `src/campplus.cpp` (the
scalar CPU reference `campplus_embed_cpu`).

| Forward op | Where (forward) |
| --- | --- |
| `ggml_conv_2d` / `ggml_im2col` + `ggml_mul_mat` | FCM Conv2d head + residual blocks |
| `conv1d_f32` (`ggml_im2col` + `ggml_mul_mat`) | TDNN, linear1, linear_local, cam linear1/2, transits, dense |
| `ggml_mul` / `ggml_add` (broadcast) | pre-fused BN (scale/shift), bias adds, residuals |
| `ggml_relu` | every nonlinear1/2, transit, out_nonlinear, FCM |
| `ggml_sigmoid` | CAMLayer context gate |
| `ggml_mean` | CAMLayer global context, stats-pool mean + variance |
| `ggml_sum_rows` | CAMLayer seg-pool reduction |
| `ggml_pad` / `ggml_repeat` | CAMLayer seg-pool reshape + broadcast |
| `ggml_sqrt` | stats-pool std |
| `ggml_concat` | dense concat (CAMDenseTDNN), stats-pool mean‖std |
| `ggml_cont`/`reshape`/`view` | layout shuffles, FCM (32,10,T)→(320,T) flatten |

## Gap matrix

Legend: **OK** = implemented; **MISSING** = aborts / not implemented; **n/a** =
not on the enrollment path.

"Graph backward" = a case in `ggml_compute_backward` (`ggml/src/ggml.c`). It is
backend-agnostic: if it aborts, no backend can differentiate the op. "CPU bwd
kernel" = the kernels the backward expands into exist for the CPU backend
(`ggml-cpu`), the only backend enrollment needs in Phase 2. GPU columns are out
of scope for Phase 2 (enrollment runs on CPU) and tracked only for visibility.

| Op | Graph backward (ggml.c) | CPU bwd kernel | CUDA / Metal / Vulkan / OpenCL |
| --- | --- | --- | --- |
| `MUL_MAT` | OK | OK (`out_prod`/`mul_mat`) | out of scope |
| `ADD` / `MUL` | OK | OK | out of scope |
| `CONT`/`RESHAPE`/`VIEW`/`PERMUTE` | OK | OK | out of scope |
| `IM2COL` | OK (`im2col_back`) | OK | out of scope |
| `RELU` (unary) | OK | OK | out of scope |
| `SIGMOID` (unary) | **MISSING** | | |
| `MEAN` | **MISSING** | | |
| `SUM_ROWS` | **MISSING** | | |
| `SQRT` (unary) | **MISSING** | | |
| `PAD` | **MISSING** | | |
| `REPEAT` | **MISSING** | | |
| `CONCAT` | **MISSING** | | |

Confirmed against the `ggml_compute_backward` switch: handled ops include `ADD`,
`MUL`, `SCALE`, `CPY`, `CONT`, `RESHAPE`, `PERMUTE`, `TRANSPOSE`, `GET_ROWS`,
`DIAG_MASK_INF`, `RMS_NORM`, `MUL_MAT`, `SOFT_MAX`, `IM2COL`, and a subset of
`UNARY` (`ABS`, `SGN`, `NEG`, `STEP`, `RELU`, `SILU`, `EXP`, `EXPM1`,
`SOFTPLUS`). `SIGMOID`, `SQRT`, `MEAN`, `SUM_ROWS`, `PAD`, `REPEAT`, and `CONCAT`
fall through to `GGML_ABORT`.

## Remaining Phase-2 work items

To reach a fully GGML-native, on-device backward of CAMPPlus:

1. **`SIGMOID` backward**: add `s*(1-s)` to the `UNARY` switch + CPU kernel
   (needed by the CAMLayer gate).
2. **`SQRT` backward**: add `1/(2*sqrt(x))` to the `UNARY` switch + CPU kernel
   (stats-pool std).
3. **`MEAN` / `SUM_ROWS` backward**: broadcast the upstream grad back over the
   reduced axis (`1/N` for mean) + CPU kernels.
4. **`PAD` / `REPEAT` backward**: slice off the padding / sum over the repeated
   axis (`ggml_repeat_back` already exists; wire it into `ggml_compute_backward`).
5. **`CONCAT` backward**: slice-and-route the grad to each input (dense concat
   and stats-pool concat).
6. **Per-stage gradcheck**: wire each lowered stage into the Task 2 harness;
   the analytic backward from this PR is the reference oracle.

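The local derivative rules in items 1 to 3 can be written down in a few lines. The sketch below is plain scalar C++ with hypothetical helper names (`sigmoid_grad`, `sqrt_grad`, `mean_grad`, `central_diff`); it is not the ggml API and not the code in this PR, only the math that a `UNARY` case or reduction kernel would need to reproduce.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration; none of these are ggml functions.

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// y = sigmoid(x): given s = sigmoid(x) and upstream grad gy, dx = gy * s * (1 - s).
double sigmoid_grad(double s, double gy) { return gy * s * (1.0 - s); }

// y = sqrt(x), x > 0: dx = gy / (2 * sqrt(x)).
double sqrt_grad(double x, double gy) { return gy / (2.0 * std::sqrt(x)); }

// y = mean(x[0..n)): every input element receives gy / n.
// (For a plain row sum the factor 1/n is dropped and each element receives gy.)
std::vector<double> mean_grad(std::size_t n, double gy) {
    assert(n > 0);
    return std::vector<double>(n, gy / static_cast<double>(n));
}

// Central finite difference of a scalar function, used to sanity-check a rule.
template <class F>
double central_diff(F f, double x, double h = 1e-6) {
    return (f(x + h) - f(x - h)) / (2.0 * h);
}
```

Comparing `sigmoid_grad(sigmoid(x), 1.0)` against `central_diff` of `sigmoid` at the same `x` is the scalar version of the per-stage gradcheck in item 6.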
Alternatively, the seg-pool / stats-pool subgraphs can be lowered to
`mul_mat`-based reductions (which already have backward), avoiding new kernels for
`MEAN`/`SUM_ROWS`/`REPEAT`.

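A minimal sketch of that lowering, assuming a channel-major `(C, T)` buffer and hypothetical helper names: the mean over time becomes a matrix-vector product with a constant vector of `1/T`, and its input gradient becomes an outer product, with no dedicated mean or repeat step.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers; x is stored channel-major as x[c * T + t].

// y = X * w, with X of shape (C, T) and w of length T.
std::vector<double> matvec(const std::vector<double>& x, const std::vector<double>& w,
                           std::size_t C, std::size_t T) {
    assert(x.size() == C * T && w.size() == T);
    std::vector<double> y(C, 0.0);
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t t = 0; t < T; ++t)
            y[c] += x[c * T + t] * w[t];
    return y;
}

// Mean over time written as a product with w = (1/T, ..., 1/T).
std::vector<double> mean_T_via_matvec(const std::vector<double>& x,
                                      std::size_t C, std::size_t T) {
    const std::vector<double> w(T, 1.0 / static_cast<double>(T));
    return matvec(x, w, C, T);
}

// Backward of y = X * w with respect to X: dX[c, t] = gy[c] * w[t].
// This outer product also performs the broadcast over T, so no separate
// repeat step is needed.
std::vector<double> matvec_backward_x(const std::vector<double>& gy,
                                      const std::vector<double>& w) {
    const std::size_t C = gy.size(), T = w.size();
    std::vector<double> gx(C * T);
    for (std::size_t c = 0; c < C; ++c)
        for (std::size_t t = 0; t < T; ++t)
            gx[c * T + t] = gy[c] * w[t];
    return gx;
}
```

The trade-off is an extra constant tensor and a few redundant multiplies in exchange for staying on ops whose backward the gap matrix already lists as OK.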
## Interim solution shipped in this PR

Because the gaps above block a GGML-native backward today, this PR ships an
**analytic C++ backward** of the whole CAMPPlus chain, validated component-wise
against finite differences via the Task 2 gradcheck harness
(`src/voiceclone_gradcheck.{h,cpp}`):

- `conv1d_backward_input` / `conv2d_backward_input`: transpose-conv input grad
  (stride / pad / dilation aware)
- `bn_backward_input`: pre-fused affine BN (per-channel scale)
- `relu_backward` / `sigmoid_backward`: pointwise nonlinearities
- `mean_T_backward` / `seg_pool_backward`: CAMLayer context reductions
- `stats_pool_backward_input`: mean + unbiased std pooling
- `fcm_resblock_backward`: Conv2d residual block (with optional shortcut)
- `cam_layer_backward`: CAMDenseTDNN layer (gate + dense-concat split)
- `CampplusBackward::backward`: full chain → `d(loss)/d(fbank)`

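To illustrate what a "transpose-conv input grad" looks like with stride, padding and dilation, here is a self-contained sketch. The struct, the function names ending in `_sketch`, the tensor layouts and the absence of a bias term are all assumptions made for the example; the signatures in `src/campplus_backward.{h,cpp}` may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical shape struct and names. Assumed layouts (row-major):
// x is (c_in, t_in), w is (c_out, c_in, k), y is (c_out, t_out).
struct Conv1dShape { int c_in, c_out, k, stride, pad, dil; };

int conv1d_out_len(const Conv1dShape& s, int t_in) {
    return (t_in + 2 * s.pad - s.dil * (s.k - 1) - 1) / s.stride + 1;
}

// Visits every (output, input, weight) index triple the convolution touches.
// Forward and backward share this loop, so their index math cannot drift apart.
template <class Visit>
void conv1d_for_each_tap(const Conv1dShape& s, int t_in, Visit visit) {
    const int t_out = conv1d_out_len(s, t_in);
    for (int co = 0; co < s.c_out; ++co)
        for (int to = 0; to < t_out; ++to)
            for (int ci = 0; ci < s.c_in; ++ci)
                for (int kk = 0; kk < s.k; ++kk) {
                    const int ti = to * s.stride - s.pad + kk * s.dil;
                    if (ti < 0 || ti >= t_in) continue;  // tap lands in zero padding
                    visit(co * t_out + to, ci * t_in + ti, (co * s.c_in + ci) * s.k + kk);
                }
}

std::vector<double> conv1d_forward_sketch(const Conv1dShape& s, const std::vector<double>& x,
                                          const std::vector<double>& w, int t_in) {
    std::vector<double> y(static_cast<std::size_t>(s.c_out) * conv1d_out_len(s, t_in), 0.0);
    conv1d_for_each_tap(s, t_in, [&](int iy, int ix, int iw) { y[iy] += w[iw] * x[ix]; });
    return y;
}

// Input gradient with frozen weights: each tap scatters gy[iy] * w[iw] back
// to the input element it read in the forward pass.
std::vector<double> conv1d_backward_input_sketch(const Conv1dShape& s,
                                                 const std::vector<double>& gy,
                                                 const std::vector<double>& w, int t_in) {
    std::vector<double> gx(static_cast<std::size_t>(s.c_in) * t_in, 0.0);
    conv1d_for_each_tap(s, t_in, [&](int iy, int ix, int iw) { gx[ix] += w[iw] * gy[iy]; });
    return gx;
}
```

Because the convolution is linear in its input, a central finite difference of `sum(gy * y)` with respect to each input element should agree with `gx` up to rounding, which makes this primitive an easy first target for a gradcheck.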
It mirrors the layout and conventions of `campplus_embed_cpu` exactly. Two tests
guard it (both in the always-on `unit` ctest tier, model-free):

- `test-campplus-backward`: gradchecks every primitive and the full chain
  against central finite differences.
- `test-campplus-backward-parity`: asserts the analytic double forward matches
  the production scalar forward (`campplus_embed_cpu`) on synthetic weights
  (multi-layer CAM blocks, 2/3/2, so the dense-concat accumulation is exercised),
  anchoring the gradcheck's relevance to the real model.

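As an illustration of what gradchecking one primitive against central finite differences involves, the sketch below does it for a single channel of statistics pooling (mean plus unbiased standard deviation). The names are hypothetical, and the absence of an epsilon inside the square root is an assumption; this is not the Task 2 harness itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical names for illustration; not the functions shipped in this PR.
struct MeanStd { double mean; double stddev; };

// Statistics pooling for one channel: mean and unbiased (T - 1) standard
// deviation over time. ASSUMES T >= 2, a non-constant input, and no epsilon.
MeanStd stats_pool_1ch(const std::vector<double>& x) {
    assert(x.size() >= 2);
    const double T = static_cast<double>(x.size());
    double mu = 0.0;
    for (double v : x) mu += v;
    mu /= T;
    double ss = 0.0;
    for (double v : x) ss += (v - mu) * (v - mu);
    return {mu, std::sqrt(ss / (T - 1.0))};
}

// Input gradient for loss = g_mean * mean + g_std * stddev:
//   d(mean)/dx_t   = 1 / T
//   d(stddev)/dx_t = (x_t - mean) / ((T - 1) * stddev)
std::vector<double> stats_pool_1ch_backward(const std::vector<double>& x,
                                            double g_mean, double g_std) {
    const MeanStd s = stats_pool_1ch(x);
    const double T = static_cast<double>(x.size());
    std::vector<double> gx(x.size());
    for (std::size_t t = 0; t < x.size(); ++t)
        gx[t] = g_mean / T + g_std * (x[t] - s.mean) / ((T - 1.0) * s.stddev);
    return gx;
}

// Largest |analytic - central finite difference| over all input elements.
double stats_pool_gradcheck(std::vector<double> x, double g_mean, double g_std,
                            double h = 1e-5) {
    const std::vector<double> gx = stats_pool_1ch_backward(x, g_mean, g_std);
    double worst = 0.0;
    for (std::size_t t = 0; t < x.size(); ++t) {
        const double x0 = x[t];
        x[t] = x0 + h;
        const MeanStd p = stats_pool_1ch(x);
        x[t] = x0 - h;
        const MeanStd m = stats_pool_1ch(x);
        x[t] = x0;  // restore before perturbing the next element
        const double fd =
            (g_mean * (p.mean - m.mean) + g_std * (p.stddev - m.stddev)) / (2.0 * h);
        worst = std::max(worst, std::fabs(fd - gx[t]));
    }
    return worst;
}
```

A test would then assert that `stats_pool_gradcheck` stays below a small tolerance for a few random inputs; running the check in double precision keeps finite-difference noise well below typical gradient magnitudes.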
The scalar CPU forward is the path every `campplus_embed` caller uses today
(production `main.cpp`, `test-campplus`, `test-voice-embedding` all pass
`backend==nullptr`), and `test-campplus` / `test-voice-embedding` validate it
against the Python reference embedding. So the trust chain is complete:
Python → `campplus_embed_cpu` → analytic forward → gradchecked backward. The
`campplus_embed_ggml` graph path is not wired to any caller yet; when it is, it
will get its own fixture parity test against the CPU/Python path.

The analytic backward is mathematically exact, runs on CPU (the enrollment
target), and serves as the **reference oracle** for the per-stage gradcheck once
the GGML-native ops in the work items above are implemented.

> Note: `campplus_embed_cpu`'s `fcm_forward` hardcodes the input feature
> dimension to 80 (the production fbank width), so the production scalar path is
> only self-consistent at `feat_dim=80`; the parity test uses that. The analytic
> backward derives every dimension from `feat_dim`, so it is geometry-agnostic.
