# Stable Timestamps - How whisper.cpp Works (Relevant Internals)

## Codebase Structure

- `include/whisper.h` (741 lines) -- Public C API
- `src/whisper.cpp` (9016 lines) -- Entire implementation in one file
- `src/whisper-arch.h` -- Tensor name maps (encoder/decoder/VAD)
- `ggml/` -- Tensor library backend
- `examples/cli/cli.cpp` -- Main CLI

## Key Data Structures (all in `src/whisper.cpp`)

### Token Data (`whisper_token_data`, whisper.h:131)
```c
whisper_token id;   // token id
whisper_token tid;  // forced timestamp token id
float   p;          // probability of the token
float   plog;       // log probability of the token
float   pt;         // probability of the timestamp token
float   ptsum;      // sum of probabilities of all timestamp tokens
int64_t t0, t1;     // start/end time of the token (centiseconds)
int64_t t_dtw;      // DTW-based timestamp
float   vlen;       // voice length of the token
```

### Segment (`whisper_segment`, line 460)
```c
t0, t1, text, no_speech_prob, tokens (vector<whisper_token_data>)
```

### State (`whisper_state`, line 834)
Holds: `mel`, `kv_self`/`kv_cross`, `decoders[8]`, `result_all` (segments), `energy` (PCM signal energy), `aheads_masks`, `aheads_cross_QKs`, `vad_context`/`vad_segments`/`vad_mapping`

## Decoding Pipeline

Entry: **`whisper_full_with_state()`** at line 6805

1. **PCM -> mel** (line 6818): `whisper_pcm_to_mel_with_state()` -- FFT + mel filterbank, 80 bands, hop = 160 samples (10 ms/frame)
2. **Signal energy** (line 6847): `get_signal_energy(samples, n_samples, 32)` -- smoothed absolute amplitude, used for token timestamps
3. **Main loop** (line 7012): `while (true)` over 30 s chunks, advancing `seek`
4. **Encoder** (line 7033): `whisper_encode_internal()` -- conv + encoder + cross-attention KV cache
5. **Prompt setup** (lines 7098-7157): `[<prev>] + past + [<sot>] + [<lang>] + [<transcribe>]`
6. **Token-by-token decoding** (line 7197): `for (i = 0; i < n_max; ++i)` where `n_max = n_text_ctx/2 - 4`
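
The signal energy in step 2 is just a smoothed absolute amplitude. A minimal standalone sketch of the idea (hypothetical reimplementation, not the exact code from `get_signal_energy()`; the half-window of 32 samples matches the call at line 6847):

```c
#include <math.h>
#include <stdlib.h>

// For each sample, average |x| over a centered window of hw samples on each
// side. Edges use a shorter window, mirroring the smoothing idea used for
// token timestamps.
static float *signal_energy(const float *samples, int n, int hw) {
    float *energy = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        int cnt = 0;
        for (int j = -hw; j <= hw; j++) {
            if (i + j >= 0 && i + j < n) {
                sum += fabsf(samples[i + j]);
                cnt++;
            }
        }
        energy[i] = sum / cnt;
    }
    return energy;
}
```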
| 37 | + |
| 38 | +### Logit Processing -- `whisper_process_logits()` at line 6155 |
| 39 | + |
| 40 | +This is WHERE ALL LOGIT FILTERING HAPPENS: |
| 41 | + |
| 42 | +- **Line 6232**: `logits_filter_callback` -- user-supplied callback (external injection point) |
| 43 | +- **Line 6268-6308**: Timestamp pairing constraints (must come in pairs, must increase) |
| 44 | +- **Line 6291-6298**: `max_initial_ts` constraint -- limits first timestamp to <= 1.0s |
| 45 | + - **stable-ts removes this** by setting it to `None` |
| 46 | + - whisper.cpp param: `params.max_initial_ts` (default 1.0f, line 5950) |
| 47 | +- **Line 6300-6308**: Increasing timestamp enforcement via `decoder.seek_delta/2` |
| 48 | +- **Line 6314-6365**: Force timestamp when `sum(ts_probs) > max(text_probs)` |
| 49 | + |
| 50 | +**INJECTION POINT for constrained decoding:** Between lines 6300-6308 (after increasing-ts check), add `logits[token_beg + t] = -INFINITY` for silent positions. Or use the existing `logits_filter_callback` externally. |
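
A sketch of what that injection would look like as a standalone helper (hypothetical function and parameter names; assumes timestamp tokens are contiguous starting at `token_beg`, one per 20 ms position):

```c
#include <math.h>

// Suppress timestamp tokens that fall inside silence. `silence` holds one
// entry per timestamp position (1 = silent); n_ts is the number of
// timestamp tokens. Not an existing whisper.cpp function -- this is the
// proposed addition.
static void mask_silent_timestamps(float *logits, int token_beg,
                                   const int *silence, int n_ts) {
    for (int t = 0; t < n_ts; t++) {
        if (silence[t]) {
            logits[token_beg + t] = -INFINITY;
        }
    }
}
```

The same loop could live in a `logits_filter_callback` instead, which avoids patching whisper.cpp at all.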
| 51 | + |
| 52 | +### Sampling -- `whisper_sample_token()` at line 6438 |
| 53 | +Greedy: argmax. Also computes `tid` (best timestamp), `pt` (timestamp prob), `ptsum` (sum timestamp probs). |
| 54 | + |
| 55 | +## Word-Level Timestamps |
| 56 | + |
| 57 | +### Method 1: Non-DTW (simpler, existing) |
| 58 | + |
| 59 | +**`whisper_exp_compute_token_level_timestamps()`** at line 8433 |
| 60 | + |
| 61 | +1. Uses `state.energy` (smoothed PCM amplitude) |
| 62 | +2. Confident timestamps from `token.tid` when `pt > thold_pt && ptsum > thold_ptsum` |
| 63 | +3. Fills gaps by proportional splitting based on `vlen` |
| 64 | +4. **Energy-based refinement** (lines 8563-8631): Expands/contracts token boundaries using signal energy. This is a PRIMITIVE form of silence snapping already present -- but crude. |
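
Step 3's proportional split can be sketched as follows (hypothetical standalone helper; times are in centiseconds as elsewhere in the API):

```c
#include <stdint.h>

// Distribute an anchored span [t0, t1] across n tokens in proportion to
// each token's voice length vlen[i]. The last token absorbs any rounding
// remainder so the spans exactly tile [t0, t1].
static void fill_gap(int64_t t0, int64_t t1, const float *vlen,
                     int n, int64_t *tok_t0, int64_t *tok_t1) {
    float total = 0.0f;
    for (int i = 0; i < n; i++) total += vlen[i];
    int64_t t = t0;
    for (int i = 0; i < n; i++) {
        int64_t dt = (int64_t)((t1 - t0) * (vlen[i] / total));
        tok_t0[i] = t;
        tok_t1[i] = (i == n - 1) ? t1 : t + dt;
        t = tok_t1[i];
    }
}
```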
| 65 | + |
| 66 | +### Method 2: DTW (experimental, more accurate) |
| 67 | + |
| 68 | +**`whisper_exp_compute_token_level_timestamps_dtw()`** at line 8815 |
| 69 | + |
| 70 | +1. Build token sequence: `[sot] + [lang] + [no_timestamps] + all_text_tokens + [eot]` |
| 71 | +2. Full decoder pass with `save_alignment_heads_QKs=true` |
| 72 | +3. Copy cross-attention QKs to CPU: shape `[n_tokens, n_audio_tokens, n_heads]` |
| 73 | +4. Normalize (line 8907) |
| 74 | +5. Median filter width 7 over audio dimension (line 8914) |
| 75 | +6. **Mean across heads** (line 8919) -- all selected heads weighted equally |
| 76 | +7. Scale by -1 (line 8920) |
| 77 | +8. Standard DTW + backtrace via `dtw_and_backtrace()` (line 8690) |
| 78 | +9. Assign timestamps from DTW path (lines 8940-8963) |
| 79 | + |
| 80 | +**IMPORTANT:** DTW does NOT work with `flash_attn=true` (line 3708-3710) because flash attention doesn't expose intermediate attention weights. |
| 81 | + |
| 82 | +Called at lines 7725-7728 after all segments created for a 30s window. |
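
Step 8 is textbook DTW. A simplified standalone version of `dtw_and_backtrace()` (sketch, not the exact whisper.cpp code) over a negated-similarity cost matrix:

```c
#include <float.h>
#include <stdlib.h>

// Dynamic programming over cost matrix x[N][M] (row-major), then walk the
// cheapest path back from (N-1, M-1). Allowed moves: diagonal, up, left.
// path_i/path_j must hold up to N+M entries; returns the path length.
// The path comes out in reverse order (end -> start).
static int dtw(const float *x, int N, int M, int *path_i, int *path_j) {
    float *D = malloc((N + 1) * (M + 1) * sizeof(float));
    for (int i = 0; i <= N; i++)
        for (int j = 0; j <= M; j++)
            D[i * (M + 1) + j] = (i == 0 && j == 0) ? 0.0f : FLT_MAX;
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= M; j++) {
            float best = D[(i - 1) * (M + 1) + (j - 1)];
            if (D[(i - 1) * (M + 1) + j] < best) best = D[(i - 1) * (M + 1) + j];
            if (D[i * (M + 1) + (j - 1)] < best) best = D[i * (M + 1) + (j - 1)];
            D[i * (M + 1) + j] = x[(i - 1) * M + (j - 1)] + best;
        }
    }
    int i = N, j = M, len = 0;
    while (i > 0 && j > 0) {
        path_i[len] = i - 1; path_j[len] = j - 1; len++;
        float d = D[(i - 1) * (M + 1) + (j - 1)];
        float u = D[(i - 1) * (M + 1) + j];
        float l = D[i * (M + 1) + (j - 1)];
        if (d <= u && d <= l) { i--; j--; }
        else if (u <= l)      { i--; }
        else                  { j--; }
    }
    free(D);
    return len;
}
```

In the real pipeline, N is the token count, M is the audio-token count, and x is the negated, median-filtered, head-averaged attention matrix from steps 4-7.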
| 83 | + |
| 84 | +### Alignment Heads -- Hardcoded (lines 384-409) |
| 85 | + |
| 86 | +```c |
| 87 | +static const whisper_ahead g_aheads_large_v3[] = { |
| 88 | + {7,0}, {10,17}, {12,18}, {13,12}, {16,1}, {17,14}, {19,11}, {21,4}, {24,1}, {25,6} |
| 89 | +}; |
| 90 | +static const whisper_ahead g_aheads_large_v3_turbo[] = { |
| 91 | + {2,4}, {2,11}, {3,3}, {3,6}, {3,11}, {3,14} |
| 92 | +}; |
| 93 | +``` |
| 94 | + |
| 95 | +Selected via `get_alignment_heads_by_layer()` (line 8666). Modes: preset-specific, N-top-most layers, or custom user-provided heads. |
| 96 | + |
| 97 | +Masks built in `aheads_masks_init()` (line 1160), used during decoder graph construction at lines 2720-2734 in the cross-attention block. |
| 98 | + |
| 99 | +### WHERE TO ADD IMPROVEMENTS: |
| 100 | + |
| 101 | +**Gap padding:** In DTW function at line 8843-8860 when building token sequence. Insert `" ..."` tokens after `no_timestamps` but before text tokens. Adjust `sot_sequence_length`. |
| 102 | + |
| 103 | +**Dynamic head selection:** At line 8919 (currently takes mean). Instead: score each head for monotonicity, select top-k, then average only those. Would need to expose all heads first (currently only preset heads captured). |
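
One plausible monotonicity score (a sketch of the proposed idea, not anything existing in whisper.cpp): the fraction of tokens whose attention argmax over the audio axis does not move backwards.

```c
// qk is an n_tokens x n_audio attention matrix (row-major) for one head.
// Returns the fraction of tokens whose argmax audio position is >= the
// previous token's, i.e. 1.0 for a perfectly monotonic head.
static float monotonicity(const float *qk, int n_tokens, int n_audio) {
    int prev = -1, good = 0;
    for (int t = 0; t < n_tokens; t++) {
        int arg = 0;
        for (int a = 1; a < n_audio; a++)
            if (qk[t * n_audio + a] > qk[t * n_audio + arg]) arg = a;
        if (arg >= prev) good++;
        prev = arg;
    }
    return (float)good / n_tokens;
}
```

Heads scoring above some cutoff (or the top-k) would then replace the unconditional mean at line 8919.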
| 104 | + |
| 105 | +## VAD Support (Already Exists!) |
| 106 | + |
| 107 | +whisper.cpp has full Silero-style neural VAD: |
| 108 | + |
| 109 | +- **`whisper_vad()`** at line 6621 -- called from `whisper_full()` when `params.vad == true` |
| 110 | +- Strips silence, concatenates speech segments with overlap |
| 111 | +- Builds `vad_mapping_table` to remap timestamps back to original audio |
| 112 | +- **Per-frame speech probabilities** available via `whisper_vad_probs()` API |
| 113 | +- Params: `threshold`, `min_speech_duration_ms`, `min_silence_duration_ms`, etc. |
| 114 | + |
| 115 | +This is relevant because: we could use the existing VAD probabilities as input for the silence mask instead of building our own loudness-based detector (or offer both options like stable-ts). |
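
Turning per-frame speech probabilities into the silence mask could look like this (hypothetical helper; assumes one probability per 10 ms frame, and mimics the `min_silence_duration_ms`-style hysteresis so short dips inside words are not marked silent):

```c
// Threshold per-frame speech probabilities into a binary silence mask,
// then erase silence runs shorter than min_silence_frames.
static void silence_mask(const float *probs, int n, float threshold,
                         int min_silence_frames, int *silent) {
    for (int i = 0; i < n; i++) silent[i] = probs[i] < threshold;
    int run = 0;
    for (int i = 0; i <= n; i++) {
        if (i < n && silent[i]) { run++; continue; }
        if (run > 0 && run < min_silence_frames)
            for (int j = i - run; j < i; j++) silent[j] = 0; // too short: keep as speech
        run = 0;
    }
}
```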
| 116 | + |
| 117 | +## Segment Creation & Output |
| 118 | + |
| 119 | +### How Segments Are Created (lines 7616-7718) |
| 120 | +1. Scan tokens for timestamp tokens (`id > whisper_token_beg()`) |
| 121 | +2. Text between timestamps -> segment with `t0`, `t1`, text, tokens |
| 122 | +3. Pushed to `result_all` |
| 123 | +4. If `token_timestamps == true`: per-segment token timestamps computed |
| 124 | +5. If DTW enabled: DTW timestamps computed per-window after all segments |
| 125 | + |
| 126 | +### WHERE TO HOOK POST-HOC SNAPPING: |
| 127 | + |
| 128 | +**Option A -- Internal:** After DTW (line 7735) or after non-DTW token timestamps (lines 7663/7708), iterate all segments and snap word boundaries to speech edges using silence mask. |
| 129 | + |
| 130 | +**Option B -- End of pipeline:** Before `whisper_full_with_state()` returns (line 7753), as a final pass over all `result_all`. |
| 131 | + |
| 132 | +**Option C -- New public API:** `whisper_snap_timestamps(ctx, state)` that callers invoke after `whisper_full()`. Cleanest, non-invasive. |
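
The core of any of the three options is the same per-word operation. A minimal sketch (hypothetical helper, not an existing API), exploiting the fact that 1 centisecond of timestamp equals one 10 ms mask frame:

```c
#include <stdint.h>

// Snap a word's [t0, t1] (centiseconds) inward to the nearest speech
// frames, given a silence mask with one entry per 10 ms frame. If the whole
// span is silent, the original boundaries are kept.
static void snap_to_speech(int64_t *t0, int64_t *t1,
                           const int *silent, int n_frames) {
    int f0 = (int)(*t0), f1 = (int)(*t1);   // 1 cs == 1 frame (10 ms)
    if (f1 > n_frames) f1 = n_frames;
    while (f0 < f1 && silent[f0])     f0++;  // trim leading silence
    while (f1 > f0 && silent[f1 - 1]) f1--;  // trim trailing silence
    if (f0 < f1) { *t0 = f0; *t1 = f1; }
}
```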
| 133 | + |
| 134 | +## Existing Energy-Based "Snapping" (Primitive) |
| 135 | + |
| 136 | +Lines 8563-8631 in `whisper_exp_compute_token_level_timestamps()`: |
| 137 | +- Computes energy sum in token's time range |
| 138 | +- Expands/contracts boundaries based on energy threshold |
| 139 | +- Already exists but is crude compared to stable-ts |
| 140 | + |
| 141 | +## Key Constants |
| 142 | + |
| 143 | +| Constant | Value | Meaning | |
| 144 | +|----------|-------|---------| |
| 145 | +| `WHISPER_SAMPLE_RATE` | 16000 | Hz | |
| 146 | +| `WHISPER_HOP_LENGTH` | 160 | samples per mel frame = 10ms | |
| 147 | +| `WHISPER_CHUNK_SIZE` | 30 | seconds per chunk | |
| 148 | +| `WHISPER_N_FFT` | 400 | FFT window size | |
| 149 | +| Audio token resolution | 320 samples = 20ms | Each audio ctx position | |
| 150 | +| Timestamp token resolution | 20ms | Each increment of timestamp token | |
| 151 | +| `n_audio_ctx` | 1500 | Audio tokens per 30s chunk | |
| 152 | +| `n_text_ctx` | 448 | Max text tokens | |
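
The unit conversions implied by the table, as helpers (hypothetical names, just encoding the arithmetic above): API timestamps are centiseconds (1 cs = 10 ms = one mel frame = 160 samples at 16 kHz), while audio-context positions and timestamp tokens are 20 ms each.

```c
#include <stdint.h>

static int64_t cs_to_samples(int64_t cs)   { return cs * 160; } // 10 ms @ 16 kHz
static int64_t cs_to_mel_frame(int64_t cs) { return cs;       } // both are 10 ms
static int64_t cs_to_audio_tok(int64_t cs) { return cs / 2;   } // 20 ms each
static int64_t ts_token_to_cs(int64_t t)   { return t * 2;    } // 20 ms each
```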
| 153 | + |
| 154 | +## Public API Surface (relevant) |
| 155 | + |
| 156 | +```c |
| 157 | +// After transcription: |
| 158 | +whisper_full_n_segments(ctx) |
| 159 | +whisper_full_get_segment_t0/t1(ctx, i) // centiseconds (1 = 10ms) |
| 160 | +whisper_full_get_segment_text(ctx, i) |
| 161 | +whisper_full_n_tokens(ctx, i) |
| 162 | +whisper_full_get_token_data(ctx, i, j) // -> whisper_token_data |
| 163 | +whisper_full_get_segment_no_speech_prob(ctx, i) |
| 164 | + |
| 165 | +// Params: |
| 166 | +params.token_timestamps // enable non-DTW word timestamps |
| 167 | +params.max_initial_ts // default 1.0s (stable-ts sets to 0) |
| 168 | +params.logits_filter_callback // can inject custom logit filters externally |
| 169 | +ctx_params.dtw_token_timestamps // enable DTW mode |
| 170 | +ctx_params.dtw_aheads_preset // which alignment heads |
| 171 | +params.vad // enable built-in VAD |
| 172 | +``` |