|
| 1 | +# Voice-clone backward — CAMPPlus speaker encoder (op × backend gap matrix) |
| 2 | + |
| 3 | +Scope for ticket *"GGML backward pass: CAMPPlus speaker encoder"* (QVAC-20984). |
| 4 | +This doc scopes the work to make the CAMPPlus speaker encoder **differentiable in |
| 5 | +GGML** on the CPU path used for enrollment, and records which backward ops are |
| 6 | +still missing in the vendored `ggml`. |
| 7 | + |
| 8 | +It is committed alongside the interim deliverable of this PR: an analytic, |
| 9 | +gradchecked C++ backward of the whole CAMPPlus chain |
| 10 | +(`src/campplus_backward.{h,cpp}`). See |
| 11 | +[Interim vs Phase-2](#interim-solution-shipped-in-this-pr) for how the two |
| 12 | +relate. |
| 13 | + |
| 14 | +## Why the gap exists |
| 15 | + |
| 16 | +In the enrollment loop CAMPPlus provides the **speaker-similarity loss** between |
| 17 | +the target-WAV embedding (constant, forward-only) and the generated-audio |
| 18 | +embedding. Only the generated-audio path needs gradients, so the gradient we |
| 19 | +need is `d(loss)/d(fbank)` — the input gradient with the model weights frozen. |
| 20 | +The fbank is differentiated further back to the waveform by a separate stage; |
| 21 | +this module stops at the CAMPPlus input. |
| 22 | + |
| 23 | +A fully GGML-native backward (the Phase-2 goal, needed by the on-device |
| 24 | +enrollment loop) requires every op on the forward graph to have a backward in |
| 25 | +`ggml_compute_backward` (`ggml/src/ggml.c`) **and** a CPU kernel for the ops the |
| 26 | +backward expands into. Several are missing today. |
| 27 | + |
| 28 | +## Forward ops on the CAMPPlus path |
| 29 | + |
| 30 | +Source: `src/campplus_forward.inc` (the GGML graph) and `src/campplus.cpp` (the |
| 31 | +scalar CPU reference `campplus_embed_cpu`). |
| 32 | + |
| 33 | +| Forward op | Where (forward) | |
| 34 | +| --- | --- | |
| 35 | +| `ggml_conv_2d` / `ggml_im2col` + `ggml_mul_mat` | FCM Conv2d head + residual blocks | |
| 36 | +| `conv1d_f32` (`ggml_im2col` + `ggml_mul_mat`) | TDNN, linear1, linear_local, cam linear1/2, transits, dense | |
| 37 | +| `ggml_mul` / `ggml_add` (broadcast) | pre-fused BN (scale/shift), bias adds, residuals | |
| 38 | +| `ggml_relu` | every nonlinear1/2, transit, out_nonlinear, FCM | |
| 39 | +| `ggml_sigmoid` | CAMLayer context gate | |
| 40 | +| `ggml_mean` | CAMLayer global context, stats-pool mean + variance | |
| 41 | +| `ggml_sum_rows` | CAMLayer seg-pool reduction | |
| 42 | +| `ggml_pad` / `ggml_repeat` | CAMLayer seg-pool reshape + broadcast | |
| 43 | +| `ggml_sqrt` | stats-pool std | |
| 44 | +| `ggml_concat` | dense concat (CAMDenseTDNN), stats-pool mean‖std | |
| 45 | +| `ggml_cont`/`reshape`/`view` | layout shuffles, FCM (32,10,T)→(320,T) flatten | |
| 46 | + |
| 47 | +## Gap matrix |
| 48 | + |
| 49 | +Legend: **OK** = implemented; **MISSING** = aborts / not implemented; a dash = |
| 50 | +not evaluated, because the graph backward is missing; **out of scope** = not on the enrollment path. |
| 51 | + |
| 52 | +"Graph backward" = a case in `ggml_compute_backward` (`ggml/src/ggml.c`). It is |
| 53 | +backend-agnostic: if it aborts, no backend can differentiate the op. "CPU bwd |
| 54 | +kernel" = the kernels the backward expands into exist for the CPU backend |
| 55 | +(`ggml-cpu`), the only backend enrollment needs in Phase 2. The GPU column is out |
| 56 | +of scope for Phase 2 (enrollment runs on CPU) and is listed only for visibility. |
| 57 | + |
| 58 | +| Op | Graph backward (ggml.c) | CPU bwd kernel | CUDA / Metal / Vulkan / OpenCL | |
| 59 | +| --- | --- | --- | --- | |
| 60 | +| `MUL_MAT` | OK | OK (`out_prod`/`mul_mat`) | out of scope | |
| 61 | +| `ADD` / `MUL` | OK | OK | out of scope | |
| 62 | +| `CONT`/`RESHAPE`/`VIEW`/`PERMUTE` | OK | OK | out of scope | |
| 63 | +| `IM2COL` | OK (`im2col_back`) | OK | out of scope | |
| 64 | +| `RELU` (unary) | OK | OK | out of scope | |
| 65 | +| `SIGMOID` (unary) | **MISSING** | — | — | |
| 66 | +| `MEAN` | **MISSING** | — | — | |
| 67 | +| `SUM_ROWS` | **MISSING** | — | — | |
| 68 | +| `SQRT` (unary) | **MISSING** | — | — | |
| 69 | +| `PAD` | **MISSING** | — | — | |
| 70 | +| `REPEAT` | **MISSING** | — | — | |
| 71 | +| `CONCAT` | **MISSING** | — | — | |
| 72 | + |
| 73 | +Confirmed against the `ggml_compute_backward` switch: handled ops include `ADD`, |
| 74 | +`MUL`, `SCALE`, `CPY`, `CONT`, `RESHAPE`, `PERMUTE`, `TRANSPOSE`, `GET_ROWS`, |
| 75 | +`DIAG_MASK_INF`, `RMS_NORM`, `MUL_MAT`, `SOFT_MAX`, `IM2COL`, and a subset of |
| 76 | +`UNARY` (`ABS`, `SGN`, `NEG`, `STEP`, `RELU`, `SILU`, `EXP`, `EXPM1`, |
| 77 | +`SOFTPLUS`). `SIGMOID`, `SQRT`, `MEAN`, `SUM_ROWS`, `PAD`, `REPEAT`, and `CONCAT` |
| 78 | +fall through to `GGML_ABORT`. |
| 79 | + |
| 80 | +## Remaining Phase-2 work items |
| 81 | + |
| 82 | +To reach a fully GGML-native, on-device backward of CAMPPlus: |
| 83 | + |
| 84 | +1. **`SIGMOID` backward** — add a `UNARY` switch case that multiplies the upstream |
| 85 | + grad by `s*(1-s)`, with `s = sigmoid(x)`, plus a CPU kernel (needed by the CAMLayer gate). |
| 86 | +2. **`SQRT` backward** — add a `UNARY` switch case that multiplies the upstream |
| 87 | + grad by `1/(2*sqrt(x))`, plus a CPU kernel (stats-pool std). |
| 88 | +3. **`MEAN` / `SUM_ROWS` backward** — broadcast the upstream grad back over the |
| 89 | + reduced axis (`1/N` for mean) + CPU kernels. |
| 90 | +4. **`PAD` / `REPEAT` backward** — slice off the padding / sum over the repeated |
| 91 | + axis (`ggml_repeat_back` already exists; wire it into `ggml_compute_backward`). |
| 92 | +5. **`CONCAT` backward** — slice-and-route the grad to each input (dense concat |
| 93 | + and stats-pool concat). |
| 94 | +6. **Per-stage gradcheck** — wire each lowered stage into the Task 2 harness; |
| 95 | + the analytic backward from this PR is the reference oracle. |
| 96 | + |
| 97 | +Alternatively, the seg-pool / stats-pool subgraphs can be lowered to |
| 98 | +`mul_mat`-based reductions (which already have backward), avoiding new kernels for |
| 99 | +`MEAN`/`SUM_ROWS`/`REPEAT`. |
| 100 | + |
| 101 | +## Interim solution shipped in this PR |
| 102 | + |
| 103 | +Because the gaps above block a GGML-native backward today, this PR ships an |
| 104 | +**analytic C++ backward** of the whole CAMPPlus chain, validated component-wise |
| 105 | +against finite differences via the Task 2 gradcheck harness |
| 106 | +(`src/voiceclone_gradcheck.{h,cpp}`): |
| 107 | + |
| 108 | +- `conv1d_backward_input` / `conv2d_backward_input` — transpose-conv input grad |
| 109 | + (stride / pad / dilation aware) |
| 110 | +- `bn_backward_input` — pre-fused affine BN (per-channel scale) |
| 111 | +- `relu_backward` / `sigmoid_backward` — pointwise nonlinearities |
| 112 | +- `mean_T_backward` / `seg_pool_backward` — CAMLayer context reductions |
| 113 | +- `stats_pool_backward_input` — mean + unbiased std pooling |
| 114 | +- `fcm_resblock_backward` — Conv2d residual block (with optional shortcut) |
| 115 | +- `cam_layer_backward` — CAMDenseTDNN layer (gate + dense-concat split) |
| 116 | +- `CampplusBackward::backward` — full chain → `d(loss)/d(fbank)` |
| 117 | + |
| 118 | +The analytic backward mirrors the layout and conventions of `campplus_embed_cpu` exactly. Two tests |
| 119 | +guard it (both in the always-on `unit` ctest tier, model-free): |
| 120 | + |
| 121 | +- `test-campplus-backward` — gradchecks every primitive and the full chain |
| 122 | + against central finite differences. |
| 123 | +- `test-campplus-backward-parity` — asserts the analytic double-precision forward matches |
| 124 | + the production scalar forward (`campplus_embed_cpu`) on synthetic weights |
| 125 | + (multi-layer CAM blocks, 2/3/2, so the dense-concat accumulation is exercised), |
| 126 | + anchoring the gradcheck's relevance to the real model. |
| 127 | + |
| 128 | +The scalar CPU forward is the path every `campplus_embed` caller uses today |
| 129 | +(production `main.cpp`, `test-campplus`, `test-voice-embedding` all pass |
| 130 | +`backend==nullptr`), and `test-campplus` / `test-voice-embedding` validate it |
| 131 | +against the Python reference embedding. So the trust chain is complete: |
| 132 | +Python → `campplus_embed_cpu` → analytic forward → gradchecked backward. The |
| 133 | +`campplus_embed_ggml` graph path is not wired to any caller yet; when it is, it |
| 134 | +gets its own fixture parity against the CPU/Python path. |
| 135 | + |
| 136 | +The analytic backward is mathematically exact, runs on CPU (the enrollment |
| 137 | +target), and serves as the **reference oracle** for the per-stage gradcheck once |
| 138 | +the GGML-native ops in the work items above are implemented. |
| 139 | + |
| 140 | +> Note: `campplus_embed_cpu`'s `fcm_forward` hardcodes the input feature |
| 141 | +> dimension to 80 (the production fbank width), so the production scalar path is |
| 142 | +> only self-consistent at `feat_dim=80`; the parity test uses that. The analytic |
| 143 | +> backward derives every dimension from `feat_dim`, so it is geometry-agnostic. |
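
The backward rules named in work items 1 to 3 (`SIGMOID`, `SQRT`, `MEAN`) and the central finite-difference check that the gradcheck tests rely on can be sketched as follows. This is a minimal illustration, not the code in `src/campplus_backward.{h,cpp}` or `src/voiceclone_gradcheck.{h,cpp}`: every function name is hypothetical, tensors are flattened to 1-D `std::vector<double>`, and the step size `eps = 1e-6` is an assumed default.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// All names below are hypothetical; none exist in ggml or in this PR.
using Vec = std::vector<double>;

// y = sigmoid(x)  =>  dx = dy * s * (1 - s), with s = sigmoid(x).
Vec sigmoid_bwd_sketch(const Vec& x, const Vec& dy) {
    assert(x.size() == dy.size());
    Vec dx(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double s = 1.0 / (1.0 + std::exp(-x[i]));
        dx[i] = dy[i] * s * (1.0 - s);
    }
    return dx;
}

// y = sqrt(x), x > 0  =>  dx = dy / (2 * sqrt(x)).
Vec sqrt_bwd_sketch(const Vec& x, const Vec& dy) {
    assert(x.size() == dy.size());
    Vec dx(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        dx[i] = dy[i] / (2.0 * std::sqrt(x[i]));
    }
    return dx;
}

// y = mean(x) over n elements  =>  every dx[i] = dy / n.
Vec mean_bwd_sketch(std::size_t n, double dy) {
    assert(n > 0);
    return Vec(n, dy / static_cast<double>(n));
}

// Central finite differences of a scalar loss f at x.
Vec numeric_grad_sketch(const std::function<double(const Vec&)>& f, Vec x,
                        double eps = 1e-6) {
    Vec g(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double x0 = x[i];
        x[i] = x0 + eps;
        const double fp = f(x);
        x[i] = x0 - eps;
        const double fm = f(x);
        x[i] = x0;  // restore before perturbing the next coordinate
        g[i] = (fp - fm) / (2.0 * eps);
    }
    return g;
}

// Largest absolute element-wise difference between two gradients.
double max_abs_diff_sketch(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double m = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        m = std::max(m, std::abs(a[i] - b[i]));
    }
    return m;
}
```

A per-op check then compares the analytic gradient against `numeric_grad_sketch` for a scalar loss built from the op's output, and passes when `max_abs_diff_sketch` stays below a chosen tolerance.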
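
Work items 4 and 5 describe the `PAD`, `REPEAT`, and `CONCAT` backward passes as slicing and routing of the upstream gradient. A minimal 1-D sketch of those three rules is shown below. It assumes zero padding appended at the tail, repetition by whole-vector tiling, and concatenation of two inputs along a single axis; the helper names are hypothetical and are neither ggml API nor part of this PR. Real ggml tensors have up to four dimensions, so a native implementation would apply the same idea per axis.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical 1-D helpers; not ggml API and not part of this PR.
using Vec = std::vector<double>;

// y = concat(a, b): the first na grads belong to a, the remainder to b.
std::pair<Vec, Vec> concat_bwd_sketch(const Vec& dy, std::size_t na) {
    assert(na <= dy.size());
    const auto split = dy.begin() + static_cast<std::ptrdiff_t>(na);
    return {Vec(dy.begin(), split), Vec(split, dy.end())};
}

// y = x followed by zero padding: padded slots do not depend on x,
// so their grads are dropped and only the first n are kept.
Vec pad_bwd_sketch(const Vec& dy, std::size_t n) {
    assert(n <= dy.size());
    return Vec(dy.begin(), dy.begin() + static_cast<std::ptrdiff_t>(n));
}

// y = x (length n) tiled to fill dy.size(): each x[i] feeds several
// outputs, so its grad is the sum of the grads of all its copies.
Vec repeat_bwd_sketch(const Vec& dy, std::size_t n) {
    assert(n > 0 && dy.size() % n == 0);
    Vec dx(n, 0.0);
    for (std::size_t k = 0; k < dy.size(); ++k) {
        dx[k % n] += dy[k];
    }
    return dx;
}
```

For an upstream gradient `{1, 2, 3, 4, 5, 6}`, `repeat_bwd_sketch(dy, 3)` sums the two tiles into `{5, 7, 9}`, while `concat_bwd_sketch(dy, 2)` returns `{1, 2}` and `{3, 4, 5, 6}` unchanged.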