Chronological record of each bring-up and numerical-parity milestone:
every stage lands with a per-stage .npy reference dumped from NeMo
PyTorch and a C++ harness asserting rel error below a documented
threshold.
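The log never spells out how `rel` is computed. A minimal sketch of such a gate, assuming `rel` is the L2 norm of the difference divided by the L2 norm of the reference (the helper names `rel_error`, `max_abs_error` and `check_stage` are hypothetical, not the harness's actual API):

```python
import math

def rel_error(actual, reference):
    """Assumed metric: ||actual - reference||_2 / ||reference||_2."""
    if len(actual) != len(reference):
        raise ValueError("shape mismatch")
    num = math.sqrt(sum((a - r) ** 2 for a, r in zip(actual, reference)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den if den > 0.0 else num

def max_abs_error(actual, reference):
    """Largest element-wise absolute deviation."""
    return max(abs(a - r) for a, r in zip(actual, reference))

def check_stage(name, actual, reference, threshold):
    """Print one gate line and return True when rel is under the threshold."""
    rel = rel_error(actual, reference)
    ok = rel < threshold
    status = "ok" if ok else "FAIL"
    print(f"[{name}] rel={rel:.3e} max_abs={max_abs_error(actual, reference):.3e} {status}")
    return ok

# Toy data standing in for a flattened .npy reference and a C++ output.
reference = [1.0, -2.0, 3.0, -4.0]
actual = [1.001, -2.0, 3.0, -4.0]
stage_ok = check_stage("demo", actual, reference, threshold=5e-3)
```

A norm-based ratio like this is dominated by the largest activations, which would explain how a stage can show a large `max_abs` and still pass a small `rel` threshold.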
- Added CMakeLists.txt (PARAKEET_* options, PARAKEET_USE_SYSTEM_GGML escape hatch, install rules producing parakeet::parakeet so the eventual vcpkg port is a drop-in).
- Added scripts/setup-ggml.sh pinned to the upstream ggml commit (58c38058).
- Vendored dr_wav.h for wav I/O.
- Public headers under include/parakeet/ expose parakeet_cli_main, parakeet::ctc::Engine, and the one-shot transcribe_wav API. (Post-v0.1.0-pre audit, the public namespace is the flat parakeet; parakeet::ctc:: is a backward-compat alias.)
- CLI + library + test harnesses build green on macOS (arm64).
- scripts/convert-nemo-to-gguf.py extracts model_config.yaml + model_weights.ckpt + tokenizer.model from the HF .nemo tarball and writes a single GGUF.
- Tensor naming is a flat namespace built for the C++ side:
  - preproc.mel_filterbank (80, 257): NeMo's featurizer.fb
  - preproc.window (400,): NeMo's Hann symmetric window
  - encoder.subsampling.{conv0,conv1_dw,conv1_pw,conv2_dw,conv2_pw,out}.{weight,bias}
  - encoder.blk.{i}.{norm_ff1,ff1.linear1,ff1.linear2,norm_attn,attn.{q,k,v,out,pos},attn.pos_bias_{u,v},norm_conv,conv.{pw1,dw,bn,pw2},norm_ff2,ff2.linear1,ff2.linear2,norm_out}.{weight,bias}
  - ctc.decoder.{weight,bias}: final Conv1d kernel_size=1, flattened to (vocab+1, d_model)
- Conformer conv-module BatchNorm is fused at convert time into (scale, shift) vectors so the C++ graph is BN-free.
- f16 default for 2-D projections / convs; f32 for biases / norms / BN-fused scale+shift / preprocessor buffers.
- Output: models/parakeet-ctc-0.6b.gguf (1.16 GiB f16).
- C++ load_from_gguf in src/parakeet_ctc.cpp loads every expected tensor, fills the typed SubsamplingWeights / BlockWeights / CtcHeadWeights structs, and rejects any missing tensor with a clear error.
- parakeet --verbose prints the full hyperparameter + tensor summary (verified against model_config.yaml).
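The BatchNorm fusion mentioned in the list above can be sketched as follows. This is the standard inference-time folding, not a copy of the converter: `fuse_batchnorm` is a hypothetical name, and `eps=1e-5` is assumed to be the PyTorch default rather than read from the NeMo config.

```python
import math

def fuse_batchnorm(gamma, beta, running_mean, running_var, eps=1e-5):
    """Fold eval-mode BatchNorm into per-channel (scale, shift) vectors."""
    scale, shift = [], []
    for g, b, m, v in zip(gamma, beta, running_mean, running_var):
        s = g / math.sqrt(v + eps)
        scale.append(s)
        shift.append(b - m * s)
    return scale, shift

def batchnorm_eval(x, gamma, beta, running_mean, running_var, eps=1e-5):
    """Unfused reference: gamma * (x - mean) / sqrt(var + eps) + beta."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, running_mean, running_var)]

# Two channels with made-up statistics.
gamma = [1.5, 0.5]
beta = [0.1, -0.2]
mean = [0.4, -1.0]
var = [2.0, 0.25]
x = [0.3, -1.2]

scale, shift = fuse_batchnorm(gamma, beta, mean, var)
fused = [xi * s + t for xi, s, t in zip(x, scale, shift)]
reference = batchnorm_eval(x, gamma, beta, mean, var)
```

At run time the graph then needs only one multiply and one add per channel, which is what makes the C++ side BN-free.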
- scripts/dump-ctc-reference.py drives NeMo PyTorch on a wav, emits mel.npy, subsampling_out.npy, block_0_out.npy, block_last_out.npy, encoder_out.npy, logits.npy, greedy_ids.npy, plus the text transcript. For test/samples/jfk.wav (11 s), NeMo prints: "and so my fellow americans ask not what your country can do for you ask what you can do for your country".
- C++ compute_log_mel:
  - preemph y[t] = x[t] − 0.97·x[t−1] (x[0] pass-through, in-place reverse loop),
  - reflect-pad by n_fft/2 = 256 (torch.stft center=True, pad_mode='reflect' convention),
  - 512-point radix-2 Cooley–Tukey complex FFT per frame, window placed symmetrically (zero-padded 56 on each side of the 400-sample Hann),
  - magnitude² → matmul against the GGUF filterbank,
  - log(x + 2**−24),
  - per-feature (per-mel-bin) CMVN over seq_len = ⌈n_samples/hop⌉ (sample std, + 1e-5), with tail frames zeroed; matches NeMo's normalize_batch('per_feature').
- test-mel on jfk.wav:

      c++ mel: (80, 1101)  ref mel: (80, 1101)
      rel = 1.656e-03  max_abs = 3.385e-01  (target: rel < 5e-3)
      inner (excluding last 2 frames): rel = 1.116e-04  max_abs = 3.211e-03

  Inner-frame rel of 1.1e-4 is f32 FFT rounding noise (verified: error plateaus as soon as boundary frames are excluded). Good enough: the encoder's first stage (subsampling + ReLU) is tolerant to this level of per-bin fluctuation.
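The first three compute_log_mel steps can be sketched in a few lines. This is an illustration of the stated conventions, not the C++ source; the function names are hypothetical, and the Hann formula assumes the symmetric (non-periodic) definition.

```python
import math

def preemphasis_inplace(x, coeff=0.97):
    """y[t] = x[t] - coeff * x[t-1], computed in place."""
    # Iterate from the end so x[t - 1] still holds the original sample
    # when x[t] is overwritten; x[0] passes through unchanged.
    for t in range(len(x) - 1, 0, -1):
        x[t] -= coeff * x[t - 1]
    return x

def reflect_pad(x, pad):
    """Mirror around the edge samples without repeating them."""
    if pad >= len(x):
        raise ValueError("reflect padding needs pad < len(x)")
    left = x[pad:0:-1]
    right = x[-2:-pad - 2:-1]
    return left + x + right

def centered_window(win_length=400, n_fft=512):
    """Symmetric Hann window zero-padded equally on both sides."""
    hann = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / (win_length - 1))
            for n in range(win_length)]
    side = (n_fft - win_length) // 2  # 56 zeros per side for 400 in 512
    return [0.0] * side + hann + [0.0] * side

signal = preemphasis_inplace([1.0, 1.0, 1.0, 1.0])
padded = reflect_pad([0, 1, 2, 3, 4], 2)
window = centered_window()
```

Running the pre-emphasis loop forwards would read an already-updated x[t−1], which is why the in-place version has to walk backwards.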
Before writing ~1500 LoC of ggml graph code, landed
scripts/ref-encoder-from-gguf.py: a pure-PyTorch FastConformer forward
that reads weights from our GGUF (via gguf.GGUFReader, not from the
NeMo state_dict). Validates two things at once:
- GGUF tensor layout semantics (shapes, transposes, BN fuse, f16 round-trip) match what the C++ side will read.
- Our understanding of NeMo's FastConformer-CTC forward is correct.
End-to-end on test/samples/jfk.wav:
[shadow] tensors=904 layers=24 d_model=1024 heads=8
[parity] subsampling_out rel = 5.8e-04
[parity] block_0_out rel = 5.0e-04
[parity] block_last_out rel = 6.7e-04
[parity] encoder_out rel = 6.7e-04
[parity] logits rel = 2.1e-04
[shadow] transcript: and so my fellow americans ask not what your country can do for you ask what you can do for your country
[shadow] reference : and so my fellow americans ask not what your country can do for you ask what you can do for your country
[shadow] match : True
All five stages at the f16 quantization floor. Transcript is bit-equal to NeMo. The shadow is now the authoritative spec for the C++ port.
Key debugging win along the way. Initial shadow reported block_0 rel
~33% vs the stored block_0_out.npy. Root cause: the original
dump-ctc-reference.py ran model.transcribe() before the hook-driven
forward, and transcribe() mutates the MHA module in place (in NeMo 2.7.2
it flips use_pytorch_sdpa = True), so the saved intermediate .npys
reflected a post-transcribe state that differed from a cold forward by
~33% numerically — but was mathematically equivalent on greedy argmax,
hence produced the same transcript. The saved refs are now captured
cold, with per-block outputs (block_{0..23}_out.npy) for finer C++
gates.
Ported the shadow line-by-line to ggml. Full per-sub-stage parity on
test/samples/jfk.wav:
[test-encoder] stage B subsampling_out rel=1.156e-03 max_abs=3.661e+00 ok
[test-encoder] stage C0 post_ff1 (b0) rel=9.970e-04 max_abs=1.074e+02 ok
[test-encoder] stage C1 post_attn (b0) rel=9.984e-04 max_abs=1.073e+02 ok
[test-encoder] stage C2 post_conv (b0) rel=9.987e-04 max_abs=1.073e+02 ok
[test-encoder] stage C3 post_ff2 (b0) rel=1.000e-03 max_abs=1.072e+02 ok
[test-encoder] stage C block_0_out rel=1.060e-03 max_abs=8.134e-02 ok
[test-encoder] stage D block_last_out rel=1.602e-03 max_abs=2.481e-02 ok
[test-encoder] stage E encoder_out rel=1.602e-03 max_abs=2.481e-02 ok
[test-encoder] stage F logits (log_softmax) rel=1.359e-03 max_abs=1.933e-01 ok
Every stage at the f16 quantization floor.
Implementation (src/parakeet_ctc.cpp):
- subsampling_graph: 5 convs (1 full + 2 dw/pw pairs) with the MaskedConvSequential time-mask propagation matching NeMo (mask applied before each conv + after each stride drop, lengths tracked via calc_length).
- compute_rel_pos_encoding: host-side sinusoidal table of shape (2T-1, d_model), positions from T-1 down to -(T-1); fed as a graph input tensor.
- conformer_block_graph:
  - Macaron FF (LayerNorm + linear + SiLU + linear + 0.5 residual).
  - Rel-pos MHA: q/k/v/pos linears → reshape to (HD, T, H) / (HD, 2T-1, H) → two matmuls for AC/BD terms → Transformer-XL rel_shift via concat-zero-pad + reshape trick → softmax → matmul with V → output linear.
  - Conv module: pointwise (d → 2d) → GLU split + sigmoid(half2) × half1 → depthwise k=9 → pre-fused BN → SiLU → pointwise d → d. Pre-fused BN saves one op per block across 24 blocks.
  - Second Macaron FF.
  - Final LayerNorm out.
- run_encoder: builds a single 24-block graph, allocates it with ggml_gallocr, marks per-stage capture tensors with ggml_set_output so gallocr doesn't reuse their buffers, uploads mel + 4 masks + pos_emb via ggml_backend_tensor_set, runs, extracts all captures.
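The rel_shift "concat-zero-pad + reshape trick" named in the list above is easiest to see on a single head. A minimal sketch, assuming each query row holds 2T-1 scores ordered from relative position T-1 down to -(T-1), as in the positional table; `rel_shift` here is an illustrative pure-Python stand-in for the graph ops, not the ggml code.

```python
def rel_shift(scores, T):
    """Align BD scores so that out[i][j] is the score for relative position i - j."""
    P = 2 * T - 1
    # 1. Prepend one zero column to every row: (T, 2T-1) -> (T, 2T), flattened.
    padded = []
    for row in scores:
        padded.append(0.0)
        padded.extend(row)
    # 2. Reinterpret the flat buffer as (2T, T) and drop its first row,
    #    i.e. skip the first T elements.
    flat = padded[T:]
    # 3. View the remainder as (T, 2T-1) again and keep the first T columns.
    shifted = [flat[i * P:(i + 1) * P] for i in range(T)]
    return [row[:T] for row in shifted]

T = 3
# Tag each entry as 10 * query_index + relative_position so the result is readable.
scores = [[10 * i + (T - 1 - k) for k in range(2 * T - 1)] for i in range(T)]
aligned = rel_shift(scores, T)
expected = [[10 * i + (i - j) for j in range(T)] for i in range(T)]
```

No element is moved individually: the one-column pad shifts each successive row by one extra position, which is exactly the per-query offset the relative-position term needs.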
Key debugging wins (caught in minutes thanks to the shadow):
- Swapped ggml_mul_mat arg order in conv1d_via_matmul to avoid an F32 × F16 assertion; kernels pre-cast to F32 when they're stored as F16 in the GGUF.
- ggml_set_output on every capture tensor to survive graph compaction (before this, outputs past the first sub-stage were silently overwritten by downstream ops).
- Used ggml_sigmoid (not ggml_silu) inside the conv module's GLU. This was the one-line bug driving block_0 rel from ~1e-3 to ~5e-2; isolating via per-sub-stage captures (block_0_post_ff1, block_0_post_attn, block_0_post_conv, block_0_post_ff2) and comparing against shadow dumps pinned it on the first try.
CTC linear is part of the encoder graph (final ggml_mul_mat + bias
on encoder_out). log_softmax is computed host-side for
numerical stability (ggml lacks a stable log_softmax op, and
argmax doesn't need it). Greedy decode + collapse-repeats +
strip-blank is a trivial CPU loop. SentencePiece detokenize works
off the tokenizer.ggml.tokens string array (+ scores and piece
types) which the converter now emits alongside the raw proto bytes.
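The greedy decode described above (argmax, collapse repeats, strip blank) fits in a few lines. A sketch under two assumptions: the blank id is the last class index (suggested by the vocab+1 head shape), and detokenization is reduced to the basic SentencePiece rule of turning the "▁" marker into a space. The toy vocabulary is made up.

```python
def ctc_greedy_decode(logits, blank_id):
    """Per-frame argmax, then collapse repeats and drop blanks."""
    ids = []
    prev = blank_id
    for frame in logits:
        best = max(range(len(frame)), key=lambda i: frame[i])
        if best != blank_id and best != prev:
            ids.append(best)
        prev = best
    return ids

def detokenize(ids, pieces):
    """Join pieces and turn the SentencePiece word marker into spaces."""
    text = "".join(pieces[i] for i in ids)
    return text.replace("\u2581", " ").strip()

pieces = ["\u2581ask", "\u2581not", "s"]  # toy vocabulary (hypothetical)
blank = len(pieces)                        # assumed: blank is the last class

def one_hot(index, size=4):
    return [1.0 if i == index else 0.0 for i in range(size)]

# Argmax path: blank, ask, ask, blank, not, not, blank, ask, s
frames = [one_hot(i) for i in [3, 0, 0, 3, 1, 1, 3, 0, 2]]
ids = ctc_greedy_decode(frames, blank)
text = detokenize(ids, pieces)
```

Because argmax is invariant under log_softmax, the decode can run on raw logits, which is consistent with computing log_softmax only for the parity gate.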
End-to-end on test/samples/jfk.wav:
$ ./build/parakeet --model models/parakeet-ctc-0.6b.gguf \
--wav test/samples/jfk.wav --verbose
[BENCH] load=126.9ms mel=12.9ms enc=913.4ms dec=0.2ms total=1053.6ms tokens=26
and so my fellow americans ask not what your country can do for you ask what you can do for your country
Bit-equal to the NeMo reference transcript. RTF ≈ 0.10 on Apple Silicon CPU (11 s of audio transcribed in 1.05 s, ~10× faster than real-time) on a single-core unoptimized build.
Added a --bench mode to the CLI so we can compare optimizations
accurately and reproducibly (same warm state, same repeat count, same
stats) without shelling out to time.
- --bench: enable benchmark mode
- --bench-runs N: timed runs (default 3)
- --bench-warmup N: warmup runs, excluded from stats (default 2, absorbs the cold-cache + first-graph-allocator outlier)
- --bench-json PATH: dump structured JSON for comparing across runs or backends (ggml-cpu, ggml-metal, and onnxruntime are all in scope)
Per-stage stats include mean / median / min / max / stdev for mel,
encoder, decode, and total inference; the summary line highlights
median and best RTF (mean is reported too but gets noisy when a
warm run gets preempted by the OS). Std > 20% of mean triggers a
visible warning so we don't silently chase variance.
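The per-stage statistics and the variance warning can be sketched as below. The function names and the sample timings are hypothetical, and the sketch assumes the sample standard deviation; the log does not say whether the CLI uses the sample or population form.

```python
import statistics

def summarize(samples_ms):
    """mean / median / min / max / stdev over the timed (post-warmup) runs."""
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "std": statistics.stdev(samples_ms),  # assumed: sample stdev
    }

def rtf(inference_ms, audio_seconds):
    """Real-time factor: processing time divided by audio duration."""
    return inference_ms / (audio_seconds * 1000.0)

def is_noisy(stats, limit=0.20):
    """True when stdev exceeds the given fraction of the mean."""
    return stats["std"] > limit * stats["mean"]

timed_runs_ms = [1050.0, 1046.0, 1069.0, 1060.0, 1058.0]  # made-up timings
stats = summarize(timed_runs_ms)
median_rtf = rtf(stats["median"], audio_seconds=11.0)
if is_noisy(stats):
    print("warning: stdev > 20% of mean, results are unreliable")
```

Reporting the median alongside the best run keeps one preempted run from skewing the headline number, which is the failure mode the mean is exposed to.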
Machine: Apple M4 Air, macOS, single-core unoptimized Release build.
Model: parakeet-ctc-0.6b.gguf at f16 (1.16 GiB). Threads: default
(std::thread::hardware_concurrency() via ggml-cpu). Audio:
test/samples/jfk.wav — 11.00 s, 176 000 samples @ 16 kHz.
--bench-warmup 2 --bench-runs 5:
mean med min max std
mel ms 14.63 14.65 14.11 15.10 0.42
encoder ms 1041.96 1046.23 1031.53 1054.12 10.00
decode ms 0.17 0.17 0.17 0.18 0.01
inference ms 1056.77 1060.51 1046.02 1069.41 10.21
RTF (median/best) = 0.096 / 0.095 (realtime multiple = 10.4x / 10.5x)
model load = 449 ms (one-time, excluded from RTF)
- Encoder dominates inference (98.6% of wall time).
- Mel preprocessor is a ~1.4% slice (13–15 ms for 11 s of audio).
- Greedy decode + SentencePiece detokenize is effectively free (~0.17 ms).
- Std of 1% on inference across 5 warm runs → measurements are tight enough to catch ≥ 2% improvements without heroics.
JSON reference snapshot archived at
artifacts/bench/ggml-cpu-baseline-m4air.json.
Three non-timing-sensitive wins landed together:
- CLI default thread count = std::thread::hardware_concurrency() (was 4 via ggml-cpu's internal default). --threads N still overrides. On a 10-core M4 Air that's 10 threads by default. Worth ~10-12% on the encoder path in isolated measurements.
- -O3 -ffast-math -funroll-loops on libparakeet in Release builds (via CMakeLists.txt generator expressions; Debug/RelWithDebInfo unaffected). Our pure-C++ FFT / filterbank-matmul / CMVN drops from ~14 ms to ~6 ms (2.3×). Doesn't touch ggml; it only affects our own DSP code, where -ffast-math's associativity relaxation is safe (post-log-mel values are far from denormal / inf-adjacent regions).
- Encoder graph allocator cached across calls (ParakeetCtcModel::Impl::encoder_alloc). Previously every run_encoder() built a fresh ggml_gallocr and re-walked the 24-block graph; the fresh allocator + re-reserve cost ~5-10 ms per call and added noise to --bench. Now allocated on the first call and reused as long as n_mel_frames is stable (re-created on shape change).
Post-opt numbers on an otherwise-quiet M4 Air (jfk.wav, 11 s audio,
--bench-warmup 2 --bench-runs 5):
mean med min max std
mel ms 5.72 5.80 5.48 5.96 0.22 (was 14.63)
encoder ms 940.13 943.40 856.94 1056.41 83.04 (was 1041.96)
decode ms 0.08 0.08 0.08 0.09 0.01
inference ms 945.94 948.97 862.88 1062.29 82.94 (was 1056.77)
RTF (median/best) = 0.086 / 0.078 (was 0.096 / 0.095)
artifacts/bench/ggml-cpu-round1-m4air.json snapshot archived.
Mel's 2.3× speedup is clean and reproducible. Encoder variance is
higher than the baseline (std 83 ms vs 10 ms) — that's a
benchmark-noise effect from system contention, not a regression; in
isolation the median is within the previous std band.
Two changes shipped together:
- OpenMP on ggml-cpu. brew install libomp (one-time) then -DGGML_OPENMP=ON at configure time. CMake auto-links it via the existing find_package(OpenMP) block. On a quiet M4 Air, with CPU-only backend, measured ~4% encoder speedup (median 803 ms → 768 ms) and 42% tighter stdev (88 ms → 50 ms). Worth taking for the variance reduction alone.
- Weight loading reworked to use a backend-owned buffer. gguf_init_from_file is now called with no_alloc=true, the ggml context is then populated via ggml_backend_alloc_ctx_tensors(ctx, backend_cpu), and each tensor's data is streamed from the file into the backend buffer via ggml_backend_tensor_set. The buffer is tagged GGML_BACKEND_BUFFER_USAGE_WEIGHTS so future sched-based optimizations can reach it. No direct perf impact (identical in-memory layout), but unblocks multi-backend scheduling.
Tried co-initialising the ggml-blas backend with ggml_backend_sched
configured as [blas, cpu] + op_offload=true. Result on this model
+ machine: no speedup, sometimes slower.
Root causes:
- ggml-cpu's multi-threaded f16×f32 SIMD matmul beats single-threaded Accelerate cblas_sgemm for our matmul sizes (d_model=1024, T_enc=138, FFN=4096). On Apple Silicon Accelerate routes SGEMM to the single-CPU AMX coprocessor; for these "medium" matmuls, 10 parallel SIMD threads win.
- Our weights are f16; BLAS needs f32 inputs, forcing on-the-fly dequantization that eats the BLAS kernel's advantage.
- Sched splits the graph per op, adding per-op dispatch overhead.
Reverted to plain CPU backend. BLAS backend init code is kept in
load_from_gguf (dormant, will be used when we plumb a real
sched-based multi-backend path for GPU offload). BLAS attempt with
an f32 GGUF hit a cur_backend_id != -1 sched assertion, not
pursued further.
The encoder ggml graph (~600 nodes: 24 Conformer blocks with FF /
rel-pos MHA / conv + subsampling + CTC head) was rebuilt from scratch
on every run_encoder call — fresh ggml_context, fresh cgraph,
fresh ggml_gallocr_new, re-reserve. That's pure per-call overhead
that doesn't scale with audio length.
Refactored run_encoder into two phases:
- build_encoder_graph_cached(model, graph, n_mel_frames, ...): constructs the graph, pre-computes the sinusoidal rel-pos encoding (only shape-dependent), reserves the allocator. Named input tensors (mel_in, mask_t{0..3}, pe_in) and output tensors are stashed on Impl::encoder_graph.
- The hot path in run_encoder just computes per-call masks from mel_valid, calls ggml_backend_tensor_set on the cached input tensor pointers, then ggml_backend_graph_compute, and ggml_backend_tensor_get on the cached output tensors.
Graph rebuild is triggered only when n_mel_frames changes
(different input length). For bench mode running the same wav N
times, the graph is built once and reused.
| run | mel ms (median) | encoder ms (median) | encoder ms (best) | RTF median | RTF best | backend |
|---|---|---|---|---|---|---|
| pre-round-1 | 14.65 | 1046.23 | 1031.53 | 0.096 | 0.095 | ggml-cpu (4 thr) |
| round 1 | 5.80 | 786 (quiet) | 733 | 0.073 | 0.067 | ggml-cpu (10 thr, O3/ffast-math) |
| round 2 | ~9 | ~850 (median), 770 (best) | 710 | 0.077–0.091 | 0.065–0.070 | ggml-cpu + OpenMP + weight buffer |
| round 3 | 8.5–9.1 | 761–862 (median) | 706 | 0.070–0.079 | 0.065–0.066 | + cached encoder graph |
Note on variance. Round 2 numbers have wider spread than round 1
(stdev 75–140 ms on encoder) despite being measured on the same
machine. Cause: macOS background activity (Spotlight, Time Machine,
etc.) preempting our encoder threads; mel and decode std grow too
when the system is busy. The best encoder time is the cleanest
signal for "what the code achieves when nothing else is running";
median is what a user typically observes. --bench output
reports both and warns when stdev > 20% of mean.
Snapshots: artifacts/bench/ggml-cpu-baseline-m4air.json,
ggml-cpu-round1-m4air.json, ggml-cpu-round2-m4air.json.
Added --profile mode to the CLI. Drives two complementary sweeps
off the same model load:
- Layer-depth sweep: runs the encoder with n_run_layers = {0, 1, 12, 24} (wired through a new max_layers param on run_encoder; the graph cache keys on it so each config gets a fresh graph), times each. Linear decomposition gives:
  - subsampling + CTC head = time@0
  - per-block avg = (time@24 - time@1) / 23
  - block-0 extra = time@1 - time@0 - per-block-avg
- Within-block sub-stage sweep: profile_block_substages in src/parakeet_ctc.cpp builds five tiny graphs (FF1 only, attention only, conv only, FF2 only, norm_out only) on a fixed-shape random input at T_enc and times each. Also times the full block for consistency check.
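The linear decomposition of the layer-depth sweep is plain arithmetic on three of the timings. A sketch with made-up numbers (the `decompose` helper and the timings are hypothetical; only the three formulas come from the log):

```python
def decompose(time_at, n_layers=24):
    """Split encoder time into fixed cost, per-block cost and block-0 overhead.

    time_at maps n_run_layers -> measured encoder time in ms.
    """
    fixed = time_at[0]  # subsampling + CTC head
    per_block = (time_at[n_layers] - time_at[1]) / (n_layers - 1)
    block0_extra = time_at[1] - time_at[0] - per_block
    return {
        "subsampling_plus_ctc_ms": fixed,
        "per_block_avg_ms": per_block,
        "block0_extra_ms": block0_extra,
    }

# Hypothetical sweep results, not taken from the profile output.
sweep_ms = {0: 70.0, 1: 110.0, 12: 440.0, 24: 800.0}
parts = decompose(sweep_ms)
```

The per-block average is taken from the 1-to-24 slope rather than time@24 / 24 so that one-off costs paid by the first block do not inflate it; the 12-layer point can serve as a linearity check.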
Output on jfk.wav (11 s, M4 Air, 5 timed + 2 warmup):
[profile] mel preprocess 4.83 ms ( 0.6% of total)
[profile] subsampling + CTC head (nl=0) 71.36 ms ( 8.3% of total)
[profile] per-block avg (nl=1..24) 32.64 ms (x 24 = 783 ms, 91.3%)
[profile] full encoder (nl=24) 853.35 ms RTF = 0.0780
[profile] per-block sub-stages (T_enc=137):
Conv module 10.92 ms (31% of block) ~275 ms encoder-wide (32%)
Attention 7.41 ms (21%) ~186 ms (22%)
FF2 6.70 ms (19%) ~169 ms (20%)
FF1 6.07 ms (17%) ~153 ms (18%)
norm_out 0.05 ms ( 0%) ~ 1 ms ( 0%)
Key finding. Conv module is the single biggest slice (32% encoder-wide), not FFN. By FLOP count the conv module is ~5x cheaper than FFN (~435 MFLOPs vs ~2.3 GFLOPs per block), so this is a memory-bandwidth / implementation efficiency problem, not a compute problem. Next targets:
- conv1d_via_matmul casts f16 kernels to f32 via ggml_cast every forward pass; could keep f16 native in mul_mat by flipping argument order.
- The ggml_permute(x, 1, 0, 2, 3) + ggml_cont wrappers around the module materialise a (d_model × T) buffer twice per block (enter + exit). Re-shaping the internal ops to work on (d_model, T) layout natively would save ~24 * 2 * (d_model * T * sizeof(f32)) = 24 * 2 * 1024 * 137 * 4 bytes ≈ 27 MB of redundant copies per utterance.
- ggml_conv_2d_dw_direct may be faster than the ggml_conv_1d_dw (im2col + mul_mat) we use today; the header even calls it out.
Two structural changes to the conv module, driven by the 5.7 profile that flagged it as the single biggest slice at 32% of encoder time:
- Drop ggml_cont around GLU halves. ggml_mul and ggml_sigmoid accept strided views natively; the two cont calls were copying 2×(T×d_model×4) = ~1.1 MB per block, ~27 MB per forward, for no reason. Per-block conv time: 10.92 → 8.10 ms (-26%).
- Replace conv1d_via_matmul with direct ggml_mul_mat for pw1/pw2 (k=1 convs). A k=1 Conv1d is literally a matmul; doing it as such lets us:
  - skip the im2col (trivial but still a memcpy),
  - skip the ggml_cast(kernel, F32) that was in there to work around the mul_mat(src0=f32, src1=f16) ordering restriction,
  - stay in the natural (d_model, T) layout so the ggml_permute + ggml_cont enter/exit transposes (another ~1.1 MB per block) are gone. Depthwise conv still needs (T, d_model) layout so we transpose just around dw + BN + SiLU.

  Per-block conv time: 8.10 → 6.06 ms (a further -25%, total -45%).
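The claim that a k=1 Conv1d is a matmul can be checked directly. A small pure-Python sketch (toy shapes and values; `conv1d` and `matmul_bias` are illustrative helpers, not the ggml ops) that runs a generic convolution with kernel size 1 and the equivalent matrix product on the squeezed weight:

```python
def conv1d(x, weight, bias):
    """Plain 'valid' Conv1d. x: [C_in][T], weight: [C_out][C_in][K], bias: [C_out]."""
    k = len(weight[0][0])
    t_out = len(x[0]) - k + 1
    out = []
    for o, w_o in enumerate(weight):
        row = []
        for t in range(t_out):
            acc = bias[o]
            for c, w_oc in enumerate(w_o):
                for j in range(k):
                    acc += w_oc[j] * x[c][t + j]
            row.append(acc)
        out.append(row)
    return out

def matmul_bias(w2d, x, bias):
    """y = W @ x + b with W: [C_out][C_in] and x: [C_in][T]."""
    return [[bias[o] + sum(w_o[c] * x[c][t] for c in range(len(x)))
             for t in range(len(x[0]))]
            for o, w_o in enumerate(w2d)]

# Values are exactly representable so both paths agree bit for bit.
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]                 # C_in=2, T=3
weight = [[[2.0], [1.0]], [[-1.0], [0.5]], [[0.0], [3.0]]]  # C_out=3, K=1
bias = [0.5, 0.0, -1.0]

# Squeezing the trailing kernel axis gives the 2-D matmul weight.
w2d = [[w_oc[0] for w_oc in w_o] for w_o in weight]
conv_out = conv1d(x, weight, bias)
mm_out = matmul_bias(w2d, x, bias)
```

With real f16 or quantized weights the two paths can differ in accumulation order, which is the kind of small rel drift the parity gates are there to bound.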
Output rel on block_last moved from 1.60e-3 → 1.88e-3 — within the
f16 quantization floor, from different accumulation order in the
mul_mat kernel vs the im2col+matmul path. All 9 test-encoder
parity gates still pass.
Sub-stage profile after round 4:
FF1 (macaron) 6.13 ms (23% of block) ~186 ms encoder-wide
Attention 7.64 ms (29%) ~232 ms ← now biggest
Conv module 6.06 ms (23%) ~184 ms
FF2 (macaron) 6.39 ms (24%) ~194 ms
Attention is now the single biggest slice (26.2% of encoder) at ~232 ms. FF1, FF2, and the conv module are in a close three-way tie at around 20% each.
| run | encoder median ms | encoder best ms | RTF median | RTF best | note |
|---|---|---|---|---|---|
| baseline | 1046 | 1032 | 0.096 | 0.095 | ggml-cpu 4 thr |
| round 1 | 786 (quiet) | 733 | 0.073 | 0.067 | HC thr + O3/ffast-math |
| round 2 | ~850 | 770 | 0.077 | 0.070 | +OpenMP + weight buffer |
| round 3 | 761–862 | 706 | 0.070–0.079 | 0.065 | +cached graph |
| round 4 | 745–809 | 627 | 0.069–0.074 | 0.058 | +conv rewrite |
Cumulative: ~39% reduction in encoder best-case (1032 → 627 ms). RTF best 0.058 ≈ 17× real-time on CPU alone.
(§5.10 was an internal exploration that did not produce a shipping change; numbering jumps from 5.9 to 5.11 deliberately.)
Two attention-path experiments, both motivated by PROGRESS 5.8's attention-as-biggest-slice finding (26 % of encoder wall time after the Round 4 conv rewrite).
- Packed QKV matmul. Converter now emits encoder.blk.{i}.attn.qkv.{weight,bias} in addition to the three separate q/k/v.{weight,bias} tensors. BlockWeights has attn_qkv_w/b fields; load_from_gguf picks them up optionally. The graph branches on W.attn_qkv_w != nullptr: the packed path does one ggml_mul_mat + bias + reshape_4d(HD, H, 3, T) + three ggml_view_3d slices to extract Q/K/V.
- ggml_cont pruning around q/k/v/p_perm permutes. mul_mat and ggml_add accept non-contiguous src as long as nb00 == type_size, so the cont could in principle be dropped for k_perm and p_perm (used directly as mul_mat src0) and for q_perm (materialised by the downstream add with pos_bias_u/v).
Result on M4 Air, CPU-only. Neither change produced a reliable win above the ~15% bench-to-bench stdev, and some configurations regressed.
Root causes (measured):
- The packed output lays out Q/K/V in a single 3×d_model row, so the per-slice T stride is 3 * HD * H * 4 = 12 KB vs the natural 4 KB for separate matmuls. The subsequent cont(permute) does a strided copy that's roughly 3× more cache-unfriendly, net slower than the three smaller matmuls ggml-cpu already runs in parallel.
- Dropping cont on k_perm/p_perm pushes the strided reads into the mul_mat kernel itself, which on ggml-cpu's f16×f32 SIMD path is a slower code path than contiguous src0. The cont copy was effectively buying a faster subsequent mul_mat.
- Fresh per-block substage profile (after all Round 4 changes, packed QKV kept dormant in graph):
FF1 (macaron) 6.07 ms (21% of block)
Attention 5.72 ms (20%) ← no longer biggest
Conv module 7.90 ms (28%) ← biggest on this machine
FF2 (macaron) 6.40 ms (23%)
norm_out 0.04 ms ( 0%)
Attention is no longer dominant on M4 Air — the conv module's
ggml_conv_1d_dw (im2col+matmul) path and pw1/pw2 matmuls are now
the single biggest slice. FFN remains the largest aggregate (43%)
and is the right target for Round 6 (block quantization).
Shipped: packed-QKV tensor emission in the converter,
BlockWeights::attn_qkv_{w,b}, and the optional load path. Graph
still uses the 3-matmul path. Infrastructure is dormant but kept
because Round 7's ggml_flash_attn_ext experiment will want the
packed Q/K/V regardless.
Not shipped: any graph-level change. The baseline (reverted to pre-Round-5 attention) is the current code.
Bench snapshot on sample-16k.wav (20.1 s, --bench-warmup 3 --bench-runs 10, OpenMP, 10 threads):
mean med min max std
encoder ms 1316.70 1245.81 1193.73 1559.76 140.71
RTF (median/best) = 0.063 / 0.060
Snapshot: artifacts/bench/ggml-cpu-round5-m4air.json.
Quantize the ~150 largest 2D weight matrices per block (FFN, attention q/k/v/qkv/out/pos, conv pointwise, subsampling out, CTC head) using ggml-cpu's hand-tuned Q8_0 / Q5_0 / Q4_0 kernels. Small tensors (biases, norms, fused BN, mel filterbank, depthwise kernels, tiny 2D subsampling convs) stay at f32 / f16 because their innermost dim is not a multiple of the 32-element block size.
Converter side (scripts/convert-nemo-to-gguf.py):
- New --quant {f32, f16, q8_0, q5_0, q4_0}.
- Single add_2d helper routes each 2D weight through gguf.quants.quantize(arr, qtype) when the inner dim % 32 == 0, with an f16 fallback otherwise. Squeezes the trailing 1 on conv.pw{1,2}.weight so they can be quantized.
- File-type header updated to match the selected quant (LlamaFileType.MOSTLY_Q8_0 etc.).
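To make the block-size constraint concrete, here is an illustrative re-implementation of the Q8_0 scheme (32 values per block, one f16 scale, 32 int8 quants) together with the routing rule from the list above. It is a sketch, not the gguf-py or ggml code: `pick_storage`, `quantize_q8_0` and `dequantize_q8_0` are hypothetical names, and Python's `round` uses banker's rounding where the C reference rounds half away from zero.

```python
import math
import struct

BLOCK = 32  # elements per quantization block

def to_f16(value):
    """Round-trip through IEEE half precision, as the stored scale would be."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def pick_storage(inner_dim, requested):
    """Quantize only when the inner dim is a multiple of the block size."""
    return requested if inner_dim % BLOCK == 0 else "f16"

def quantize_q8_0(row):
    """Return a list of (scale, [32 ints in -127..127]) blocks."""
    if len(row) % BLOCK != 0:
        raise ValueError("row length must be a multiple of 32")
    blocks = []
    for start in range(0, len(row), BLOCK):
        chunk = row[start:start + BLOCK]
        amax = max(abs(v) for v in chunk)
        d = amax / 127.0
        inv = 1.0 / d if d else 0.0
        quants = [round(v * inv) for v in chunk]
        blocks.append((to_f16(d), quants))
    return blocks

def dequantize_q8_0(blocks):
    return [d * q for d, quants in blocks for q in quants]

row = [math.sin(0.37 * i) for i in range(64)]
blocks = quantize_q8_0(row)
restored = dequantize_q8_0(blocks)
max_err = max(abs(a - b) for a, b in zip(row, restored))
bits_per_weight = (2 + BLOCK) * 8 / BLOCK  # 2-byte scale + 32 one-byte quants
```

At 8.5 bits per weight instead of 16, a mostly-Q8_0 file should land at a bit over half the f16 size, in line with the 697 MiB vs 1.3 GiB figures reported for this round.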
C++ side (src/parakeet_ctc.cpp):
- No graph changes needed. ggml_mul_mat dispatches to the Q8_0 / Q5_0 / Q4_0 kernel automatically based on src0's stored type.
- load_from_gguf already used ggml_nbytes(t) to size the read, which correctly accounts for block-aligned storage.
- conformer_conv_graph's ggml_reshape_2d(W.conv_pw1_w, d_model, 2*d_model) becomes a metadata-only identity after the converter squeeze (pw1 already stored as 2D (1024, 2048)); reshape_2d still accepts the shape and works on quantized src.
Parity (tested on jfk.wav + sample-16k.wav): transcript is
bit-equal to NeMo PyTorch at every quantization level, including
Q4_0. Per-stage rel error grows as expected: f16 ~1.6e-3 → Q8_0
~5.5e-3 → Q4_0 ~3.3e-2. Rel drift does NOT translate into token
drift on clean speech in these tests.
Bench results on M4 Air, 10 ggml-cpu threads, --bench-warmup 3 --bench-runs 10:
| variant | file | enc best (20 s) | enc median (20 s) | enc best (11 s) | enc median (11 s) |
|---|---|---|---|---|---|
| f16 | 1.3 GiB | 1194 | 1246 | 683 | 796 |
| Q8_0 | 697 MiB | 999 | 1209 | 600 | 655 |
| Q5_0 | 453 MiB | 1475 | 1614 | ~650 | — |
| Q4_0 | 372 MiB | 1080 | 1286 | 595 | 637 |
Key findings:
- Q8_0 is the speed + parity sweet spot. Best-case encoder time drops from 1194 → 999 ms on the 20 s clip (-16 %), and from 683 → 600 ms on the 11 s clip (-12 %). RTF best 0.050 on 20 s (20x real-time on CPU alone).
- Q4_0 is a valid size tier. ~10 % slower than Q8_0 on average but model shrinks to 372 MiB (3.5x smaller than f16), with the same bit-equal transcript.
- Q5_0 is a trap on this machine. File size drops to 453 MiB (smaller than Q8_0) but the ggml-cpu Q5_0 mul_mat kernel is noticeably slower than either Q8_0 or Q4_0 on Apple Silicon. Shipped anyway for the size tier, not recommended for speed.
- Model load time improves too (bandwidth-bound): f16 312 ms → Q8_0 166 ms → Q4_0 96 ms on 20 s benches.
Remaining gap vs ONNX (20 s clip): Q8_0 999 ms vs ONNX 944 ms — from 317 ms gap to ~55 ms (83 % of the remaining gap closed with Round 6 alone).
Snapshots:
- artifacts/bench/ggml-cpu-round6-q8_0-m4air.json
- artifacts/bench/ggml-cpu-round6-q5_0-m4air.json
- artifacts/bench/ggml-cpu-round6-q4_0-m4air.json
ggml_flash_attn_ext(q, k, v, mask, scale, max_bias, logit_softcap)
fuses softmax(q @ k^T * scale + mask) @ v into a single op.
Prototyped it behind #ifdef PARAKEET_EXPERIMENTAL_FLASH_ATTN in
rel_pos_mha_graph:
- Compute the Transformer-XL rel-pos BD branch exactly as before (bd_final of shape (T, T, H)).
- Pre-scale BD by 1/sqrt(HD) (flash_attn_ext applies the scale argument only to q@k^T, the mask is added as-is).
- Cast BD to f16 (CPU backend requires f16 mask: ggml.c line 5320).
- Call ggml_flash_attn_ext(q_u, k_perm, v_perm, bd_mask, scale, 0.0f, 0.0f): skips the explicit ac = mul_mat(k, q_u), the ac + bd_final add, the soft_max, the second mul_mat on V, and the v_for_mm = cont(permute(v_perm, 1, 0, 2, 3)) copy.
- Output layout (HD, H, T) feeds directly into reshape_2d(HD*H, T) without the extra permute+cont tail of the non-flash path.
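Why BD has to be pre-scaled follows from where the fused op applies `scale`. A small numeric sketch for one attention row, with made-up scores and HD = 128 (d_model 1024 over 8 heads); it models only the softmax argument, not the ggml kernel:

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

HD = 128
scale = 1.0 / math.sqrt(HD)

ac = [4.0, -2.0, 1.0]   # content term q_u @ k^T (toy values)
bd = [0.5, 3.0, -1.5]   # shifted positional term (toy values)

# Non-flash path: softmax((AC + BD) * scale).
reference = softmax([(a + b) * scale for a, b in zip(ac, bd)])

# Fused form: scale multiplies q@k^T only and the mask is added as-is,
# so the mask must already carry the scale.
fused = softmax([a * scale + b * scale for a, b in zip(ac, bd)])

# Passing BD unscaled as the mask changes the distribution.
unscaled_mask = softmax([a * scale + b for a, b in zip(ac, bd)])
```

The pre-scale is one extra element-wise op over a (T, T, H) tensor per block, which is part of the overhead this experiment has to win back.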
Parity: all 9 test-encoder gates pass. block_last rel drifts from
1.9e-3 → 4.2e-3 (f16 mask cast adds one quantization step), still
under the 5e-3 threshold.
Bench result on M4 Air, ggml-cpu Q8_0, 3x(warmup 3 + runs 10):
| clip | non-flash best | flash best | non-flash median | flash median |
|---|---|---|---|---|
| jfk.wav (T=138) | 529 | 559 | 561 | 606 |
| sample-16k.wav (T=251) | 1037 | 1087 | 1168 | 1157 |
Flash_attn_ext is neutral-to-slower on CPU at these sequence lengths.
The overhead of the f16 BD mask cast and the extra BD pre-scale offset
the savings from fusing the four attention ops, and ggml-cpu's
q_u @ k_perm^T matmul is already well-tuned for T ~ 140–250.
Gate (per plan: ship if encoder median drops >=30 ms): FAILED.
Shipped: code is preserved behind #ifdef PARAKEET_EXPERIMENTAL_FLASH_ATTN (default off). The Metal backend
phase will want to revisit this — flash-attn typically wins big on
GPU where softmax + V-multiply fuse into one kernel pass.
Swapped ggml_conv_1d_dw (im2col + mul_mat path) for
ggml_conv_2d_dw_direct on the Conformer depthwise kernel in
conformer_conv_graph.
Implementation:
- The existing conv.dw.weight has stored shape (d_model, 1, 9); ggml_reshape_4d(W.conv_dw_w, conv_kernel, 1, 1, d_model) gives the (KW=9, KH=1, 1, C=d_model) layout that ggml_conv_2d_dw_direct requires.
- Wrap yt from (T, d_model, 1, 1) into (W=T, H=1, C=d_model, N=1) via ggml_reshape_4d, run the op, unwrap back to (T, d_model, 1) via ggml_reshape_3d.
- The CPU backend's depthwise kernel accesses the filter as const float *, so we ggml_cast(W.conv_dw_w, GGML_TYPE_F32) once (graph-build time, small cost: 9*d_model elements) when the stored type is f16. Alternative would be storing as f32 at convert time; the cast is simpler and works on all existing GGUFs.
Parity: all 9 test-encoder stages pass. block_last rel is
essentially unchanged (1.73e-3 vs 1.60e-3 previously).
Bench on M4 Air, Q8_0, 15 timed runs, 5 warmup:
| clip | enc best before | enc best after | delta | enc median before | enc median after |
|---|---|---|---|---|---|
| jfk.wav (11 s) | 529 | 460 | -13% | 561 | 481 |
| sample-16k.wav (20.1 s) | 1000 | 839 | -16% | 1208 | 882 |
This single op swap is ~100–200 ms cheaper than the im2col+mul_mat path across 24 blocks. The previous profiler breakdown attributed 28 % of encoder time to the conv module; after this change it drops meaningfully, and the remaining sub-stages are roughly a three-way tie between FF1, FF2, and attention.
Measured vs ONNX Runtime (20 s clip): Q8_0 + conv_2d_dw_direct best 839 ms vs ONNX 944 ms, so ggml-cpu is now 11 % faster than ONNX on best-case encoder. Round 4's 317 ms gap is entirely closed.
Snapshots: artifacts/bench/ggml-cpu-round8a-q8_0-m4air.json.
Added an all_valid flag threaded through build_encoder_graph_cached
and subsampling_graph. When the caller's mel has no trailing
silence (mel_valid == n_mel_frames, the common case for a single
utterance), the 8 apply_time_mask ggml_mul calls in
subsampling_graph are skipped — the graph is built without those
ops at all. EncoderGraph caches the all_valid value so the graph
is rebuilt when it flips.
Parity: all 9 test-encoder gates still pass (the test sends a
padded mel, so all_valid=false and the masked path runs).
Bench impact: within noise (~0-10 ms), because the mask ops were
already small element-wise muls and ggml-cpu runs them cheaply in the
OpenMP pool. Shipped anyway for correctness hygiene — running a
no-op mul_by_ones is silly — and because the infrastructure enables
the Round 8c LRU cache to cleanly key on all_valid.
Replaced the single-shape Impl::encoder_graph with a small LRU
std::vector<std::unique_ptr<EncoderGraph>> of up to 3 entries. The
cache key is (n_mel_frames, n_run_layers, all_valid).
Behaviour:
- On run_encoder, scan the cache for a matching entry. If found, reuse it and move it to the back (most-recently-used).
- If no match, evict the oldest entry (if cache is full) and build a new graph for the current shape.
- A rebuild now happens only for a shape that is not already cached; previously any shape change freed the single cached graph and rebuilt it.
This is a latent optimisation: the benchmark mode reuses one shape and shows no change. The win shows up in production callers that alternate between a few utterance lengths (streaming, short-burst input, etc.) — those paths avoid the ~20-50 ms graph rebuild cost on every length change.
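The cache policy described above can be sketched as a small list-backed LRU. This models the behaviour only: `GraphCache` and the `build` callback are hypothetical, and the real cache holds ggml graphs rather than strings.

```python
class GraphCache:
    """LRU of built graphs keyed by (n_mel_frames, n_run_layers, all_valid)."""

    def __init__(self, build, capacity=3):
        self._build = build        # callable that constructs a graph for a key
        self._capacity = capacity
        self._entries = []         # (key, graph) pairs, oldest first

    def get(self, n_mel_frames, n_run_layers, all_valid):
        key = (n_mel_frames, n_run_layers, all_valid)
        for index, (cached_key, graph) in enumerate(self._entries):
            if cached_key == key:
                # Hit: move the entry to the back (most recently used).
                self._entries.append(self._entries.pop(index))
                return graph
        if len(self._entries) >= self._capacity:
            self._entries.pop(0)   # evict the least recently used entry
        graph = self._build(*key)
        self._entries.append((key, graph))
        return graph

builds = []

def fake_build(n_mel_frames, n_run_layers, all_valid):
    builds.append(n_mel_frames)
    return f"graph[{n_mel_frames},{n_run_layers},{all_valid}]"

cache = GraphCache(fake_build)
for frames in (1101, 2010, 500, 1101, 800, 2010):
    cache.get(frames, 24, True)
```

With only three entries a linear scan is cheaper and simpler than a hash map plus linked list, which is presumably why a plain vector suffices.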
Parity: unchanged. Transcripts bit-equal on both test clips.
| round | code | jfk best | 20s best | vs ONNX f16 best (944) |
|---|---|---|---|---|
| pre-Round-5 | f16 | 617 | 1197 | -27 % |
| Round 5 | f16 | 683 | 1193 | -26 % |
| Round 6 | Q8_0 | 600 | 999 | -6 % |
| Round 7 | Q8_0 + flash_attn | 559 | 1087 | -15 % |
| Round 8 (8a+8b+8c) | Q8_0 | 460 | 839 | +11 % |
Round 8 vs ONNX f16: 11 % faster on best-case encoder on a 20 s clip.
Fair f16 vs f16 (same precision, different runtimes — 5 warmup + 15 timed runs):
| metric | onnxruntime-f16 | ggml-cpu-f16 | note |
|---|---|---|---|
| model size | 2.3 GiB | 1.3 GiB | |
| load ms | 16 736 | 642 | 26x faster cold start |
| inf best ms | 948 | 1117 | 15 % slower |
| inf median ms | 1 007 | 1132 | 12 % slower |
| inf stdev ms | 52 | 18 | 3x tighter |
| RTF best | 0.047 | 0.055 | |
| RTF median | 0.050 | 0.056 | |
| Transcripts | match | match | |
Fair int8 vs int8 (generated via ORT dynamic quantization from the same weights, 5 warmup + 15 timed runs):
| metric | onnxruntime-int8 | ggml-cpu-Q8_0 | note |
|---|---|---|---|
| model size | 583.9 MiB | 697 MiB | |
| load ms | 2 054 | 179 | 11x faster cold start |
| inf best ms | 677 | 898 | 25 % slower |
| inf median ms | 721 | 928 | 22 % slower |
| inf stdev ms | 55 | 25 | 2x tighter |
| RTF best | 0.034 | 0.045 | |
| RTF median | 0.036 | 0.046 | |
| Transcripts | match | match | |
Interpretation:
- ggml is 12–25 % slower than onnxruntime at the same precision tier. onnxruntime's kernels on Apple Silicon route through AMX coprocessor instructions (hand-tuned for both f16 and int8) that ggml-cpu's OpenMP SIMD threads can't match on multiply-accumulate throughput.
- ggml stdev is 2–3× tighter at both tiers (18 vs 52 ms at f16; 25 vs 55 ms at int8), meaning per-utterance latency is more predictable under background OS load.
- ggml model load is 11–26× faster — critical for cold-start / short-session workloads.
- The Metal backend (planned Phase 6) will target GPU compute, where AMX doesn't apply and ggml's flash-attention kernel (already prototyped in Round 7) can be used.
RTF best on 20 s clip (Q8_0): 0.045 → 22x real-time on CPU alone. Model load: 179 ms vs ONNX int8's 2054 ms.
Snapshots:
- `artifacts/bench/ggml-cpu-round5-m4air.json`
- `artifacts/bench/ggml-cpu-round6-{q8_0,q5_0,q4_0}-m4air.json`
- `artifacts/bench/ggml-cpu-round8a-q8_0-m4air.json`
- `artifacts/bench/ggml-cpu-round8-q8_0-m4air.json`
- `artifacts/bench/ggml-cpu-round8-q8_0-jfk-m4air.json`
- `artifacts/bench/ggml-cpu-round8-f16-m4air.json`
Bring-up of the ggml_backend_metal path for GPU offload on Apple
Silicon. End-to-end on the M4 Air GPU:
- `init_gpu_backend(n_gpu_layers, verbose)` helper drives `ggml_backend_load_all()` once, then walks the registry in registration order (CUDA → Metal → Vulkan → OpenCL → ...) and picks the first GPU/IGPU device via `ggml_backend_dev_init`, returning `nullptr` when `n_gpu_layers <= 0` or the registry has no usable GPU device. Same shape under both `GGML_BACKEND_DL=ON` (the dynamic-loader mode embedded host applications use; backends are dlopened at runtime) and `GGML_BACKEND_DL=OFF` (statically linked; `load_all` is a no-op). Matches the registry-walk convention used by `llama.cpp` and `whisper.cpp`.
- `Impl::backend_active` pointer (one of CPU or GPU) drives `ggml_backend_alloc_ctx_tensors`, `ggml_backend_graph_compute`, and the per-call `safe_set` tensor uploads. All weights live on the GPU backend (unified memory on Apple Silicon), and the graph runs entirely on GPU.
- Standard CLI flag: `--n-gpu-layers N` (same spelling as llama.cpp / whisper.cpp). Any value > 0 moves the whole encoder to GPU; this model has one encoder, so per-layer granularity is not needed.
- Compile via `cmake -DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON` (or `-DGGML_CUDA=ON`, `-DGGML_VULKAN=ON`).
- `ggml_conv_2d_dw_direct` (Round 8a) is not yet implemented on Metal (`ggml_metal_op_encode_impl: error: unsupported op 'CONV_2D_DW'`). `conformer_conv_graph` takes a `use_conv2d_dw` bool, chosen at graph-build time via `ggml_backend_is_cpu(backend)`: the CPU path uses the fast direct kernel, GPU paths revert to `ggml_conv_1d_dw` (im2col + mul_mat, Metal/CUDA/Vulkan supported).
- `flash_attn_ext` left behind `#ifdef PARAKEET_EXPERIMENTAL_FLASH_ATTN` from Round 7; should be tested on Metal as a separate follow-up.
Metal test-encoder on jfk.wav + artifacts/ctc-ref:
stage B subsampling_out rel=7.641e-04 (CPU: 1.156e-03)
stage C0 post_ff1 (b0) rel=4.859e-04 (CPU: 9.970e-04)
stage C1 post_attn (b0) rel=4.866e-04 (CPU: 9.984e-04)
stage C2 post_conv (b0) rel=4.870e-04 (CPU: 9.987e-04)
stage C3 post_ff2 (b0) rel=4.880e-04 (CPU: 1.000e-03)
stage C block_0_out rel=6.756e-04 (CPU: 1.060e-03)
stage D block_last_out rel=1.698e-03 (CPU: 1.730e-03)
stage E encoder_out rel=1.698e-03 (CPU: 1.730e-03)
stage F logits (log_softmax) rel=3.871e-04 (CPU: 1.362e-03)
All 9 gates pass, and Metal per-stage rel is tighter than CPU (the Metal f16 mul_mat kernels use f32 accumulators throughout, which happens to track NeMo PyTorch's f32 reference more closely than the CPU path's mixed-precision accumulation).
sample-16k.wav (20 s), --bench-warmup 5 --bench-runs 15:
| variant | enc best | enc median | stdev | RTF best | real-time multiple |
|---|---|---|---|---|---|
| CPU f16 (Round 8) | 1 117 | 1 132 | 18 | 0.055 | 18x |
| CPU Q8_0 (Round 8) | 898 | 928 | 25 | 0.045 | 22x |
| CPU Q4_0 (Round 8) | 1 080 | 1 286 | 138 | 0.054 | 19x |
| Metal f16 | 266 | 268 | 1.1 | 0.013 | 75x |
| Metal Q8_0 | 272 | 274 | 1.5 | 0.014 | 73x |
| Metal Q4_0 | 271 | 272 | 0.5 | 0.014 | 74x |
On jfk.wav (11 s): Metal f16 encoder best 152 ms, median 154 ms.
sample-16k.wav, 5 warmup + 15 timed runs, ggml run with
--n-gpu-layers 1:
| metric | onnxruntime-int8 | ggml-metal-Q8_0 | note |
|---|---|---|---|
| model size | 583.9 MiB | 697 MiB | |
| load ms | 2 295 | 420 | 5.5x faster cold start |
| inf best ms | 682 | 282 | 2.4x faster |
| inf median ms | 712 | 283 | 2.5x faster |
| inf stdev ms | 18 | 0.83 | 21x tighter |
| RTF best | 0.034 | 0.014 | |
| RTF median | 0.035 | 0.014 | |
| Transcripts | match | match | |
Metal ggml is 2.4x–2.5x faster than onnxruntime's AMX-accelerated int8 path, with 21x tighter variance (0.83 ms vs 18 ms stdev). Metal is compute-bound on GPU shader units, so quantization does not help (f16 / Q8_0 / Q4_0 all cluster around 272 ms) — but it does shrink the model file and the unified-memory footprint.
- Implement `CONV_2D_DW` on the Metal backend (upstream contribution to ggml) so the CPU and Metal paths share `conformer_conv_graph`. Would buy a few ms more on Metal since the direct path is asymptotically cheaper than im2col.
- Test `ggml_flash_attn_ext` on Metal: likely a meaningful win given the fused softmax + V-multiply kernel, plus the dormant infra from Round 7 is already in place. Done, see §15.8 below. Shipped `PARAKEET_FLASH_ATTN=ON` as the Metal default; encoder 67.35 → 67.00 ms (−0.5 %) and inference 119.24 → 118.66 ms (−0.5 %) on M3 Ultra at byte-exact parity. CPU + CUDA + Vulkan + OpenCL keep the default OFF until each is A/B'd.
- Hybrid `ggml_backend_sched` with Metal for the encoder + CPU for the mel preprocessor, so the CPU mel path doesn't block the GPU encoder. Today the mel runs inline on host before the encoder starts; with a sched we could overlap them.
Phase 5 (CPU optimization) is closed: Round 8 Q8_0 is 11 % faster than
onnxruntime f16 (best-case encoder time) on the 20 s clip and ships as the default. The items
below are remaining CPU-side ideas that did not make it into Phase 5
itself; they're tracked here so a future "Phase 5.x CPU follow-up"
sweep has a starting list. (Phase 6 ships the Metal backend +
ggml_backend_sched work referenced in the first bullet, so that bullet
is historical context.)
- Metal backend + `ggml_backend_sched` for GPU offload. The backend-buffer rework from Round 2 and the cached encoder graph from Round 3 are what the sched plumbed through. (Shipped in Phase 6; left here for cross-reference.)
- `flash_attn_ext` (dormant behind `PARAKEET_EXPERIMENTAL_FLASH_ATTN` from Round 7) is almost certainly a win on GPU where the softmax + V-multiply fuse into one kernel pass.
- K-quant tiers (Q4_K_M, Q5_K_M, Q6_K). ggml-cpu has k-quant kernels too; these might extend the quality-vs-size curve beyond the block-quant tiers shipped in Round 6. Would need a sweep against parity.
- Bucketed encoder graph cache. Round 8c landed an exact-shape LRU cache (up to 3 entries) which proved sufficient for the Phase 8 cache-aware streaming workload (every chunk is a fresh encoder call, but the shape set is small enough that the LRU rarely misses). A bucketed variant (round up to the next multiple of 64 or 128 mel frames) would avoid rebuilds for variable-length production streams where chunk shapes vary chunk-to-chunk, at the cost of padding the mel input and masking out the tail via the `all_valid=false` path.
Scope: ship a platform-agnostic streaming API surface on Engine
shaped around three transcription modes (one-shot, streamed-output,
duplex). Mode 2 (streamed-output) is implemented on top of today's
offline encoder; Mode 3 (duplex) has its header + ABI surface frozen
but errors at runtime until Phase 8 delivers a cache-aware streaming
GGUF.
Design rationale is in the plan's scope discussion: chunked-batch on the offline encoder was explicitly rejected because it costs 1-3 % WER at 2 s chunks with no throughput win on 20 s clips, and Mode 2's "offline encoder + CTC-timestamp streaming" gets the same UX at zero accuracy cost.
include/parakeet/ctc/engine.h was the declared-but-unimplemented
surface. Phase 7 lands the definition in src/parakeet_engine.cpp:
- `Engine(const EngineOptions &)` loads the GGUF once via `load_from_gguf`, stores `ParakeetCtcModel` + cancel flag in `Impl`.
- `Engine::transcribe(wav_path)` / `transcribe_samples(samples, n, sr)` drive the existing `compute_log_mel` + `run_encoder` + `ctc_greedy_decode` + `detokenize` pipeline and return an `EngineResult` with per-stage timing.
- `Engine::cancel()` sets an atomic flag; the streaming loop polls it between chunks (cooperative cancellation only: the encoder graph run itself is not interruptible).
Added ctc_greedy_decode_window(logits, start, end, vocab, blank, inout_prev_token, out_tokens, out_first_frame=nullptr) in
src/parakeet_ctc.{h,cpp}. The existing one-shot
ctc_greedy_decode() now delegates to it with prev_token = -1.
The stateful variant preserves collapse-repeats across window
boundaries via a caller-managed inout_prev_token, so a token whose
first argmax lands in window K and repeats in window K+1 isn't emitted
twice. This is the core invariant that makes Mode 2 byte-equal to the
offline path.
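The collapse-repeats invariant can be illustrated with a small sketch. It assumes row-major `(n_frames, vocab)` logits in a `std::vector<float>` and drops the optional `out_first_frame` output; the `_sketch` suffix marks it as an illustration, not the function in `src/parakeet_ctc.cpp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stateful CTC greedy decode over frames [start, end).
// Assumption: `logits` is row-major with shape (n_frames, vocab).
void ctc_greedy_decode_window_sketch(const std::vector<float> & logits,
                                     int start, int end, int vocab, int blank,
                                     int32_t & inout_prev_token,
                                     std::vector<int32_t> & out_tokens) {
    for (int t = start; t < end; ++t) {
        const float * row = logits.data() + static_cast<std::size_t>(t) * vocab;
        int best = 0;
        for (int v = 1; v < vocab; ++v) {
            if (row[v] > row[best]) best = v;
        }
        // Emit only on a change of argmax to a non-blank token. Because
        // inout_prev_token survives across calls, a token that repeats across
        // a window boundary is not emitted a second time.
        if (best != blank && best != inout_prev_token) {
            out_tokens.push_back(best);
        }
        inout_prev_token = best;
    }
}
```

Calling it once over `[0, T)` with `inout_prev_token = -1` gives the one-shot result; calling it over consecutive windows with the same carried `inout_prev_token` yields the same token sequence.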
`Engine::transcribe_samples_stream(samples, n, sr, opts, on_segment)`:
- Runs the existing offline mel + encoder path once.
- Computes `frames_per_window = max(1, opts.chunk_ms / 80 ms)` (the 80 ms comes from 10 ms mel hop × 8x subsampling).
- Walks `[0, T_enc)` in contiguous windows, calling `ctc_greedy_decode_window` with a persistent `prev_token`.
- After each window's decode, detokenizes the cumulative token list and emits the delta slice as the segment text. Detokenizing cumulatively rather than per-window is required because `sentencepiece_bpe::detokenize` strips a leading ASCII space: if we detokenized per window, the leading space of every segment except the first would be silently stripped and `"hello world"` would come out as `"helloworld"`. Caught by the first run of the new test harness on `jfk.wav` and fixed before landing.
- Emits `StreamingSegment{text, token_ids, start_s, end_s, chunk_index, is_final=true, encoder_ms (first segment only), decode_ms}` via the caller's callback, and accumulates into the returned `EngineResult`.
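The cumulative-detokenize-then-slice step is easier to see in isolation. The sketch below uses a toy detokenizer (`toy_detokenize`, a hypothetical stand-in that concatenates pieces and strips one leading space) rather than the real `sentencepiece_bpe::detokenize`, and assumes that appending tokens only ever appends bytes to the cumulative text.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for the real detokenizer (assumption): pieces already carry
// their leading space, and one leading space is stripped from the result.
std::string toy_detokenize(const std::vector<std::string> & pieces) {
    std::string out;
    for (const std::string & p : pieces) out += p;
    if (!out.empty() && out.front() == ' ') out.erase(0, 1);
    return out;
}

// Window length in encoder frames: 80 ms per frame (10 ms hop x 8x subsampling).
int frames_per_window(int chunk_ms) {
    return std::max(1, chunk_ms / 80);
}

// Segment texts obtained by detokenizing the cumulative token list after each
// window and emitting only the bytes that were not emitted before.
std::vector<std::string> segment_texts(
        const std::vector<std::vector<std::string>> & window_pieces) {
    std::vector<std::string> segments;
    std::vector<std::string> cumulative;
    std::size_t emitted_bytes = 0;
    for (const auto & window : window_pieces) {
        cumulative.insert(cumulative.end(), window.begin(), window.end());
        const std::string text = toy_detokenize(cumulative);
        segments.push_back(text.substr(emitted_bytes));
        emitted_bytes = text.size();
    }
    return segments;
}
```

With windows `{" hello"}` and `{" world"}`, the segments come out as `"hello"` and `" world"`, which concatenate to the offline text; detokenizing each window on its own would drop the second leading space.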
transcribe_stream(wav_path, opts, cb) is a thin wrapper that loads
the WAV and forwards to transcribe_samples_stream. Both return
the full concatenated EngineResult so callers that want both the
streaming callback and a final aggregate don't have to rebuild it
themselves.
StreamSession declares feed_pcm_f32(const float*, int),
feed_pcm_i16(const int16_t*, int), finalize(), cancel(),
options(), destructor, and move ctor/assign. Implementation is in
src/parakeet_engine.cpp.
Engine::stream_start(opts, cb) probes
pimpl_->model.supports_streaming (fed from the new
parakeet.encoder.streaming.enabled GGUF key, added to
load_from_gguf). Today's GGUFs don't set the flag, so the call
throws std::runtime_error with a message pointing at Phase 8 and
suggesting transcribe_stream() for full-audio cases. Consumers
can target the final StreamSession shape immediately; when Phase
8 lands the error branch is swapped for the real state machine
without touching the public header.
Update: §8.2 removed this gate entirely. stream_start() now
runs cache-aware streaming inference directly on the existing offline
GGUF and never throws on supports_streaming. The (errored) marker
above is historical context for how the API was first frozen, not
current behaviour.
src/main.cpp:
- `--pcm-in PATH` + `--pcm-format {s16le,f32le}`: load raw PCM directly (used to validate end-to-end against `LastQuestion_long_EN.raw` without adding an ffmpeg dependency to the test loop). Mutually exclusive with `--wav PATH`.
- `--stream`: route through `Engine::transcribe_samples_stream` instead of the existing `run_once` path. Incompatible with `--bench` / `--profile` (those continue to exercise the offline path).
- `--stream-chunk-ms N`: segment stride (default 1000).
- `--emit {text,jsonl}`: `text` prints `[start-end] segment`, one line per callback; `jsonl` prints a single JSON object per line, with proper escaping of `"`, `\`, newlines, and control chars. Flushes stdout after each segment so downstream players/consumers see output immediately.
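For reference, the escaping that `--emit jsonl` needs can be sketched as below. The function names and the JSON field names (`chunk`, `start`, `end`, `text`) are assumptions made for the example; the actual CLI output schema is not specified here.

```cpp
#include <cstdio>
#include <string>

// Escape a UTF-8 string for use inside a JSON string literal.
// Bytes >= 0x80 are passed through unchanged (assumes valid UTF-8 input).
std::string json_escape(const std::string & s) {
    std::string out;
    out.reserve(s.size() + 2);
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters use the \u00XX form.
                    char buf[8];
                    std::snprintf(buf, sizeof(buf), "\\u%04x", static_cast<unsigned>(c));
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// One JSON object per segment; field names here are illustrative only.
std::string segment_to_jsonl(const std::string & text, double start_s,
                             double end_s, int chunk_index) {
    char head[96];
    std::snprintf(head, sizeof(head),
                  "{\"chunk\":%d,\"start\":%.3f,\"end\":%.3f,\"text\":\"",
                  chunk_index, start_s, end_s);
    return std::string(head) + json_escape(text) + "\"}";
}
```

A caller would print the returned line followed by a newline and flush stdout, so that each segment reaches a downstream consumer as soon as it is emitted.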
test/test_streaming.cpp runs on a loaded Engine and asserts:
- Mode 1 reference: `transcribe()` produces the baseline text.
- Mode 2 byte-equality: for `chunk_ms ∈ {250, 500, 1000, 2000, 4000}` plus `audio_duration_ms` (single-segment edge case), `transcribe_stream()` segments concatenate byte-equal to Mode 1.
- Mode 2 timestamp continuity: every segment's `start_s` matches the previous `end_s` (within 1 ms rounding); `end_s > start_s`; `is_final=true` for every segment in Phase 1.
- Mode 3 error path: `stream_start()` on a non-streaming GGUF throws an exception whose message mentions "streaming" or "Phase 8".
Caught the cumulative-vs-per-window detokenize bug on first run; once
fixed, all checks pass on jfk.wav and on the long speech clip via
CLI byte-equality.
LastQuestion_long_EN.raw (5.46 min, 16 kHz s16le mono;
external long-form fixture, not tracked in this repo):
- offline `--model ... --pcm-in ...` transcript: 5169 bytes, 1710 tokens, 4099 encoder frames.
- `--stream --stream-chunk-ms 2000` segments concatenated: 5169 bytes, byte-equal to offline (`diff` produces no output).
- Metal Q8_0 timing: `mel=152ms enc=14941ms dec=5ms total=15099ms RTF=0.046` (22x real-time). Mode 2's overhead over the offline path is sub-ms (the stream variant's total wall is the encoder pass + 1710-token cumulative detokenize + callback dispatch).
- `--emit jsonl` emits one correctly-escaped JSON object per segment; first chunk lands at `start=0.000 end=0.480` with the first word `"but"`.
This section captures the original Phase 8 scoping notes from before Mode 3 shipped. Phase 8 (the next section) records the actual implementation; the rolling-encoder design that landed differs from the cache-aware streaming-checkpoint plan sketched here. Kept for the round-by-round journal trail.
Prerequisites and scope tracked for Phase 8 (as planned at the time):
- Checkpoint selection (go/no-go gate). Evaluate candidate NeMo cache-aware streaming checkpoints against Parakeet-CTC-0.6B on the repo's reference clips. Primary candidate: `stt_en_fastconformer_hybrid_large_streaming_multi`. Accept only if WER on reference set is within ±0.5 % of current offline.
- New converter scoped similarly to `convert-nemo-to-gguf.py`, to set the `parakeet.encoder.streaming.enabled = true` metadata flag that `stream_start()` already probes.
- Streaming encoder graph: per-layer attention KV cache tensors (left-context), depthwise-conv left-state tensors, chunked + left-context attention mask, streaming mel state (reflect-pad only on true first/last chunk, per-chunk CMVN or running-mean CMVN depending on what the chosen checkpoint was trained with).
- Bucketed graph cache: round chunk mel length up to a fixed bucket so the 3-entry LRU graph cache reuses the compiled graph across chunks. Already tracked in §5.18.
- `StreamingOptions` gains `left_context_ms`, `right_lookahead_ms`, `emit_partials` (activated in Phase 8).
- Per-stage numerical parity harness vs the NeMo streaming reference, following the Round 5-8 methodology.
- Mode 3 bring-up + `StreamSession` state machine (sample ring, bucket-rounded encoder call, KV cache slide, cancellation).
- Wire the CLI `--pcm-in` path (plus future `--pcm-in -` for stdin) through `stream_start()` for manual live-streaming testing.
- Expected performance (extrapolated from §6.x Metal numbers): 20 s clip, `chunk_ms=500`: ~350-450 ms total, first segment ~0.55 s; 60 s clip, `chunk_ms=2000`: ~700 ms total (vs offline ~850 ms; linear-in-T attention wins on long-form).
The NeMo registry has only one cache-aware streaming Conformer family:
stt_en_fastconformer_hybrid_large_streaming_{multi,80ms,480ms,1040ms}.
It's a 115M-parameter, RNN-T+aux_CTC hybrid trained with chunked-limited
attention. Real-world quality on this family is not great (~2x WER vs
Parakeet-CTC-0.6B offline).
So Phase 8 is not going to ship a port of streaming_multi. Instead, the chosen approach is cache-aware inference on the existing offline-trained Parakeet-CTC-0.6B weights: same 600M model, same quality ceiling, just driven through a streaming forward pass. The cost is some accuracy degradation because the model wasn't trained with chunked attention masks.
scripts/streaming-reference.py implements the chunking-with-context
strategy in Python on top of the NeMo offline model: each chunk feeds
[left_context + chunk + right_lookahead] into the offline encoder,
slices out the center frames, runs CTC greedy with a stateful
prev_token carried across chunks. This mirrors what the eventual C++
streaming path does (modulo the indefinitely-deferred Phase 8.5
KV-cache-on-offline-weights optimisation; see §8.5 for why this is
distinct from chunked-limited streaming inference and why the latter
is rejected).
Sweep results on test/samples/jfk.wav (11 s clean speech) and
LastQuestion_long_EN.raw (5.5 min sci-fi narration with proper nouns,
representing the "harder" production case):
| chunk_ms | left_ctx_ms | right_lookahead_ms | jfk WER | long-clip WER | first-seg latency |
|---|---|---|---|---|---|
| 1000 | 0 | 0 | 40.91% | n/a | 1.0 s |
| 1000 | 2000 | 500 | 0.00% | 14.86% | 1.5 s |
| 2000 | 2000 | 1000 | 0.00% | 7.64% | 3.0 s |
| 2000 | 5000 | 1000 | 0.00% | n/a | 3.0 s |
| 2000 | 5000 | 2000 | n/a | 4.02% | 4.0 s |
| 2000 | 10000 | 1000 | n/a | 7.53% | 3.0 s |
| 2000 | 10000 | 2000 | n/a | 3.82% | 4.0 s |
| 4000 | 5000 | 1000 | 0.00% | 4.75% | 5.0 s |
Key observations:
- Right lookahead is the single most impactful knob. Going from 1000 → 2000 ms right-lookahead drops long-clip WER from 7.53% to 3.82% at chunk=2000, left=10000 (the left=2000/right=1000 and left=5000/right=2000 rows show the same pattern: 7.64% vs 4.02%). The conv module uses symmetric `kernel=9` padding (designed at training to see future context), so denying it future frames at chunk boundaries hurts more than denying past frames.
- Short audio is forgiving, long audio compounds errors. jfk is at 0% with modest context; the 5.5 min clip at the same config drifts to ~7-8% because (a) per-window CMVN drifts vs the offline single-statistic pass and (b) there are more boundary opportunities for misreads. Per-window CMVN is the suspect; running CMVN may close some of the gap and is noted as a Phase 8.5 follow-up.
- `right=0` (pure causal) is uniformly bad unless chunks are large enough to hide the boundary (chunk=2000 + right=0 → 4.55% on jfk; chunk=500 + right=0 → 40.9%). Pure-causal mode is supportable but not the recommended default.
- Left context past 5 s gives diminishing returns on this model.
Recommended C++ defaults (subject to revision once C++ measurements are in):
- `StreamingOptions{ chunk_ms = 2000, left_context_ms = 10000, right_lookahead_ms = 2000 }`: sweet-spot accuracy (~4 % WER on long-form, 0 % on short clean speech), ~4 s first-segment latency. Suits production live-captioning.
- For lower latency, callers can pick e.g. `chunk_ms=1000, left=2000, right=500` and accept ~10-15 % WER on long-form.
Next milestones in order:
- Extend `StreamingOptions` with `left_context_ms` + `right_lookahead_ms`; remove the runtime gate in `stream_start()` (any Parakeet-CTC GGUF works in streaming mode now, no metadata probe needed).
- Implement `StreamSession` state machine (sample ring, chunk dispatch, per-window mel + encoder call, logits center-slicing, CTC stateful decode, segment emission with absolute timestamps).
- Wire the CLI `--stream` path to drive `stream_start()` when `--pcm-in` is used (or always; same flag for Mode 2 vs Mode 3 is surprising, revisit in §8.3).
- Per-chunk numerical parity vs the Python reference at the same `(chunk_ms, left_ctx_ms, right_lookahead_ms)` config to make sure the C++ port lands on the same logits, not approximately.
- Test harness extension covering Mode 3 with random burst feeds.
Phase 8.5 (perf follow-up): replace chunking-with-context with true KV cache + conv state tensors, ~6× compute reduction on long-form audio without changing accuracy.
Landed the Mode 3 state machine in src/parakeet_engine.cpp
(StreamSession::Impl), backed by the existing Engine::Impl::model
through a borrowed pointer. Key pieces:
- `feed_pcm_f32` / `feed_pcm_i16` append samples to a `pending` buffer and trigger `try_emit_chunks()`.
- `try_emit_chunks()` consumes one chunk at a time while `pending.size() >= chunk_samples + right_lookahead_samples`. Per-chunk window = `left_history + chunk + right_lookahead`. After the encoder run, the consumed chunk is appended to `left_history` (rolling cap at `left_context_samples`).
- `flush_remainder()` (called from `finalize()`) processes the tail with whatever lookahead is left. No right-lookahead on the final chunk, so the last ~`right_lookahead_ms` of audio sees less conv context; acceptable for a one-shot end-of-audio case.
- Per-chunk numerical path reuses `compute_log_mel` + `run_encoder` + `ctc_greedy_decode_window` + `detokenize` from the offline stack unchanged. Cumulative detokenize with suffix slicing preserves the leading-space invariant that bit Mode 2 in §7.3.
- Segment timestamps: `start_s = emitted_samples / sr`, `end_s = (emitted_samples + consumed_chunk_samples) / sr`, absolute from the start of the session. Same shape any external streaming consumer would expect (`{ start, end, text/toAppend }`).
- `StreamingOptions::left_context_ms` and `right_lookahead_ms` landed on the public API (defaults 10000 / 2000 respectively, the winners from §8.1's sweep).
- `Engine::stream_start()` no longer gates on the GGUF metadata `parakeet.encoder.streaming.enabled`; any Parakeet-CTC GGUF works. The metadata key survives in the loader for Phase 8.5 / 9 use.
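The buffer bookkeeping described above can be sketched without the encoder. In this illustration `RollingWindowSketch` and `ChunkCall` are hypothetical names: instead of running mel + encoder, each dispatch just records how many samples of left history, chunk, and lookahead the call would see. The tail handling in `finalize()` (emit chunk-sized pieces with whatever lookahead remains) is one plausible reading of `flush_remainder()`, not a confirmed detail.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct ChunkCall {          // what one encoder call would see (illustrative)
    std::size_t left;       // samples of left history in the window
    std::size_t chunk;      // samples of the chunk being emitted
    std::size_t right;      // samples of right lookahead
    double      start_s;    // absolute segment start
    double      end_s;      // absolute segment end
};

class RollingWindowSketch {
public:
    RollingWindowSketch(int sr, int chunk_ms, int left_ms, int right_ms)
        : sr_(sr),
          chunk_(std::max<std::size_t>(1, ms_to_samples(sr, chunk_ms))),
          left_cap_(ms_to_samples(sr, left_ms)),
          right_(ms_to_samples(sr, right_ms)) {}

    void feed(const float * pcm, std::size_t n) {
        pending_.insert(pending_.end(), pcm, pcm + n);
        // Dispatch only once a full chunk plus its lookahead is buffered.
        while (pending_.size() >= chunk_ + right_) emit(chunk_, right_);
    }

    void finalize() {
        // Tail: shrink the lookahead (and the last chunk) to what is left.
        while (!pending_.empty()) {
            const std::size_t c = std::min(chunk_, pending_.size());
            const std::size_t r = std::min(right_, pending_.size() - c);
            emit(c, r);
        }
    }

    const std::vector<ChunkCall> & calls() const { return calls_; }

private:
    static std::size_t ms_to_samples(int sr, int ms) {
        return static_cast<std::size_t>(sr) * static_cast<std::size_t>(ms) / 1000;
    }

    void emit(std::size_t c, std::size_t r) {
        // The real path would run mel + encoder on left_history + chunk +
        // lookahead here and keep only the frames belonging to the chunk.
        calls_.push_back(ChunkCall{left_history_.size(), c, r,
                                   static_cast<double>(emitted_) / sr_,
                                   static_cast<double>(emitted_ + c) / sr_});
        const auto chunk_end = pending_.begin() + static_cast<std::ptrdiff_t>(c);
        left_history_.insert(left_history_.end(), pending_.begin(), chunk_end);
        if (left_history_.size() > left_cap_) {
            left_history_.erase(left_history_.begin(),
                                left_history_.end() - static_cast<std::ptrdiff_t>(left_cap_));
        }
        pending_.erase(pending_.begin(), chunk_end);  // lookahead stays pending
        emitted_ += c;
    }

    int                    sr_;
    std::size_t            chunk_, left_cap_, right_;
    std::size_t            emitted_ = 0;
    std::vector<float>     pending_, left_history_;
    std::vector<ChunkCall> calls_;
};
```

Note that the lookahead samples are never consumed by the call that reads them: they stay in `pending` and become (part of) the next chunk, which is why segment timestamps stay contiguous.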
CLI:
- `--stream --stream-duplex` routes through `stream_start()` + `feed_pcm_f32` with 4 kB default block size (configurable via `--stream-feed-bytes`, useful to stress the session state machine).
- `--stream-left-context-ms` + `--stream-right-lookahead-ms` override the `StreamingOptions` defaults.
- `--emit text|jsonl` reused unchanged.
Test harness (test/test_streaming.cpp) adds three Mode 3 configs on
jfk.wav (chunk_ms × left_ms × right_ms ∈ {1000,2000,500}, {2000,2000,1000}, {2000,5000,2000}) plus a cancel-path assertion.
PCM is fed in random-size bursts (512-4000 samples) via
feed_pcm_f32 to exercise the ring / chunk-dispatch paths; each
config asserts WER ≤ 5 % vs the Mode 1 reference (all hit 0 %).
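For readers unfamiliar with the metric, a WER check of this kind can be sketched as a word-level edit distance divided by the reference word count. This is a generic sketch: the function names are invented here, and any text normalisation the real harness applies (case, punctuation) is not specified in this log and is not modelled.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split on whitespace; no case folding or punctuation stripping (assumption).
std::vector<std::string> split_words(const std::string & s) {
    std::istringstream in(s);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// WER = (substitutions + deletions + insertions) / reference word count,
// computed with a two-row Levenshtein table over words.
double word_error_rate(const std::string & ref, const std::string & hyp) {
    const std::vector<std::string> r = split_words(ref);
    const std::vector<std::string> h = split_words(hyp);
    if (r.empty()) return h.empty() ? 0.0 : 1.0;

    std::vector<std::size_t> prev(h.size() + 1), cur(h.size() + 1);
    for (std::size_t j = 0; j <= h.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= r.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= h.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (r[i - 1] != h[j - 1] ? 1 : 0);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[h.size()]) / static_cast<double>(r.size());
}
```

A harness assertion would then read along the lines of `word_error_rate(mode1_text, mode3_text) <= 0.05`.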
LastQuestion_long_EN.raw (5.46 min, 16 kHz s16le, Apple M4 Air,
Metal Q8_0), default config chunk_ms=2000, left=10000, right=2000:
- C++ Mode 3 transcript: 972 words, 4.13 % WER vs offline.
- Python f32 reference at the same config: 3.82 % WER. The 0.3 % delta is Q8_0 quantisation noise and matches §6.x's measurements.
- Wall time: 35.3 s, RTF 0.108 (9× real-time). ~2.4× slower than the Mode 2 offline encoder (RTF 0.046) because each chunk re-runs the encoder on a 14 s window (`left + chunk + right`) instead of carrying incremental state forward. This is the chunking-with-context tax; Phase 8.5 is scoped to close it.
- First-segment latency: `chunk_ms + right_lookahead_ms` ≈ 4 s wall (matches Python reference).
Phase 8.5 — KV cache / conv state (deferred indefinitely; not the same as chunked-limited streaming inference)
Important distinction — read this before touching streaming internals. Two superficially-similar designs have been proposed (and one of them has been attempted twice) on this project, and they have very different quality implications. Conflating them is what makes this corner trip-hazardous.
What NeMo's cache_aware_stream_step actually does. Each query
attends only to a fixed lookback window; per-chunk encoder cost
drops to O(chunk); per-layer (lookback, d_model) K/V cache plus
(d_model, kernel-1) depthwise-conv state slide forward each call.
Looks like an attractive perf win on paper.
This shape has been evaluated twice on this project and rejected both times on quality grounds.
- Round 1: Phase 8.0 evaluated `stt_en_fastconformer_hybrid_large_streaming_multi`, the only NeMo cache-aware streaming Conformer family available at the time. Real-world quality landed at ~2× WER vs `parakeet-ctc-0.6b` offline. Phase 8 therefore chose the rolling-encoder Mode 3 design instead.
- Round 2: Phase 12.x exploration ported `nvidia/parakeet_realtime_eou_120m-v1` (same model family, newer 120 M variant) as the EOU engine in Phase 12.5 on the rolling-encoder Mode 3, and scoped a true cache-aware fast path as the follow-up. A bit-equal C++ port of NeMo's `cache_aware_stream_step` was prototyped on a working branch: per-layer K/V cache, depthwise-conv state, chunked-limited streaming attention mask, generalised Transformer-XL `rel_shift` for `T_q != T_kv`. Numerical parity vs NeMo was clean (worst rel `1.85e-3` over 44 chunks of `jfk.wav`), but decoded end-to-end through `eou_decode_window`, the result reproduced exactly NeMo's streaming transcript, which is not the offline transcript:
  - Mode 2 / offline: `"and so my fellow americans ask not what your country can do for you ask what you can do for your country<EOU>"`
  - cache-aware (NeMo + ours): `"that's all i've held america ask not what your country can do for you ask what you can do for your country"`

  Same quality cliff Phase 8.0 had already documented two years earlier on the same model family. NeMo's own cache-aware streaming RNN-T over the same 88 encoder frames also fails to emit any `<EOU>` token on `jfk.wav`, so the cache-aware path doesn't even win on `<EOU>` boundary detection vs the rolling encoder. The branch was reverted before any of it landed on `main`; this section exists so a third iteration of the project doesn't redo the same loop.
Bottom line for (A): cache-aware streaming inference on a chunked-limited-trained ASR checkpoint is a quality regression in this project's context (clean speech, offline-quality transcripts as the bar). It will not be implemented. If a future requirement explicitly trades early-utterance accuracy for bounded compute (low-power voice agent, very long-form streaming), revisit this decision with that requirement on the table — but assume by default that re-running this exploration will produce the same numbers.
The original scope of "Phase 8.5", and a different design from
(A) despite the surface-level similarity. Same offline-trained
weights, same full attention pattern as training, just amortised
across chunks: keep per-layer K, V, and depthwise-conv
left-state tensors as backend buffers, slid forward each chunk;
each encoder call computes only over new-chunk + right-lookahead
frames instead of the full (left + chunk + right) window. Pure
compute-layout refactor — accuracy unchanged. Projected wins
on the original §8.1 Python reference:
- Per-chunk compute: down from `O(left + chunk + right)` to `O(chunk + right)`, i.e. ~3.5× fewer frames per encoder call on the default config (2 + 2 = 4 s vs 10 + 2 + 2 = 14 s).
- Total wall on the 5.5 min clip: down from 35 s to ~10-15 s (close to the offline 15 s baseline).
- Accuracy unchanged.
Crucially, the streaming graph for (B) is not the same shape
as the chunked-limited graph from (A). Different attention mask
(no chunked-limit), different cache-size policy (sliding window
without a quality-coupled lookback), different validation fixtures
(parity vs offline forward, not vs cache_aware_stream_step). Any
future attempt at (B) should treat it as a fresh design exercise,
not as a retrofit of any (A) prototype recovered from git history.
Requires graph changes (persistent cache tensors for attention + conv module), per-stage parity harness vs the §8.1 Python reference, and a sliding-window cache-eviction policy.
Status: deferred indefinitely. No current owner. Not on the critical path of any shipping feature. Pick up only when a concrete consumer needs the per-chunk compute reduction and is willing to pay the engineering cost. The §8.1 Python reference and the rolling-encoder Mode 3 implementation in §8.4-8.7 remain the source of truth for streaming-quality expectations on CTC / TDT.
Phase 9 extends the Parakeet-CTC pipeline beyond the initial 0.6B checkpoint. Target is drop-in support for other NeMo Parakeet-CTC checkpoints that share the FastConformer architecture, without branching the converter or the C++ encoder graph.
HF repo: nvidia/parakeet-ctc-1.1b. NeMo encoder config:
d_model: 1024 (same as 0.6B)
n_layers: 42 (was 24)
n_heads: 8 (same)
ff_expansion_factor: 4 (same; ff_dim=4096)
conv_kernel_size: 9 (same)
subsampling_factor: 8 (same)
subsampling_conv_channels: 256 (same)
self_attention_model: rel_pos (same)
conv_norm_type: batch_norm (same, fused at convert)
att_context_size: [-1, -1] (same, offline)
vocab_size: 1025 (1024 BPE + CTC blank)
Only n_layers differs from 0.6B. The converter already reads
n_layers from the NeMo YAML and iterates block-by-block, and the C++
loader reads parakeet.encoder.n_layers from GGUF metadata. Zero
code changes needed to produce a working 1.1B GGUF and transcribe with
it. The one stale hardcode was in the --profile CLI path
(layer_points = {0, 1, 12, 24}); now scales with n_layers via
{0, 1, n_layers/2, n_layers}.
End-to-end results on Apple M4 Air, Metal, Q8_0:
| Clip | Model | Wall time | RTF | Encoder ms |
|---|---|---|---|---|
| jfk.wav (11 s) | ctc-0.6b | 170 ms | 0.015 | ~155 ms |
| jfk.wav (11 s) | ctc-1.1b | 281 ms | 0.026 | 276 ms |
| LastQuestion_long_EN.raw (5.5 min) | ctc-0.6b | 15.1 s | 0.046 | 14.9 s |
| LastQuestion_long_EN.raw (5.5 min) | ctc-1.1b | 24.4 s | 0.074 | 24.2 s |
1.1B is 1.6-1.8× slower than 0.6B, roughly matching the layer-count ratio (42/24 ≈ 1.75). Metal still hits ~13× real-time on the long clip with 1.1B, comfortably faster than real-time. Transcripts differ on 2.3 % of words on the long clip (22 / 969) — typical quality difference between the two models; both ship transcripts that track the same content.
Streaming (Mode 2 and Mode 3) works out of the box with 1.1B:
test-streaming passes all chunk-size byte-equality checks (Mode 2)
and all three (chunk × left × right) configs at WER ≤ 5% (Mode 3).
Natural follow-ups that reuse the same converter + encoder graph:
- `nvidia/parakeet-tdt_ctc-110m`: 110 M hybrid TDT+CTC. 512 × 17 FastConformer. CTC head alone can be decoded by the existing `ctc_greedy_decode_window`; TDT primary decoder is a separate port (prediction net + joint net + transducer greedy).
- `nvidia/parakeet-tdt-0.6b-v3`: bigger TDT model family; same transducer-decoder port, larger encoder.
- `nvidia/parakeet-ctc-110m`: smaller CTC-only variant if it exists; would land as a converter-flag change only.
Phase 10 — TDT (Token-and-Duration Transducer) support (done; covers parakeet-tdt-0.6b-v3 + parakeet-tdt-1.1b, Mode 1/2/3)
Phase 10 ports nvidia/parakeet-tdt-0.6b-v3, the multilingual (~25
languages) TDT ASR model with punctuation-and-capitalization. Shares
the FastConformer encoder backbone with the CTC checkpoints but needs
its own decoder: 2-layer LSTM prediction network, joint MLP, and a
transducer greedy loop that interleaves token + duration predictions.
Auto-detects model flavour from the NeMo target field
(EncDecCTCModelBPE vs EncDecRNNTBPEModel). Writes
parakeet.model.type + parakeet.tdt.* metadata and tensors:
- `tdt.predict.embed.weight` (V+1, 640)
- `tdt.predict.lstm.{0,1}.{w_ih,w_hh,b_ih,b_hh}`
- `tdt.joint.{enc,pred,out}.{weight,bias}`
Handles the architectural differences from CTC:
- `use_bias=False`: all encoder linear biases are optional; the loader reads them via `maybe_tensor()` and the graph skips the bias add after each `mul_mat` via a new `maybe_add_bias()` helper.
- `xscaling=False`: gated the `ggml_scale(x, sqrt(d_model))` entry.
- 128 mel bins (vs 80): the existing `n_mels` plumbing handles it, and `subsampling_freq_bins` derives correctly (128/8 = 16 freq bins after subsampling, so with 256 conv channels `pre_encode.out` = (1024, 4096)).
- 8192-vocab SentencePiece (vs 1024): `blank_as_pad` semantics; the joint output is 8192 labels + 1 blank + 5 durations = 8198.
CTC regression: re-converting parakeet-ctc-0.6b produces an identical
tensor set to the shipping GGUF (zero additions/removals), plus two
new metadata keys (parakeet.model.type, parakeet.encoder.use_bias)
that the loader reads with safe fallbacks.
scripts/dump-tdt-reference.py dumps NeMo per-stage tensors (log-mel,
encoder_out, LSTM init state, transcribe() text). New
test/test_tdt_encoder_parity.cpp harness loads a TDT GGUF + wav +
reference dir, runs the C++ encoder, compares.
Results on jfk.wav:
| dtype | mel max_abs / rel | enc_out max_abs / rel | verdict |
|---|---|---|---|
| n/a | 7.29e-1 / 2.77e-3 | — | — (mel same as CTC) |
| f16 | (see above) | 1.11e-3 / 2.15e-3 | PASS (< 5e-3 f16 floor) |
| q8_0 | (see above) | 1.43e-2 / 1.97e-2 | q8_0 accumulation over 24 layers without biases; PASS functionally (downstream transcripts stable) |
Encoder graph works correctly — any numerical differences are pure quantization accumulation, matching the CTC precedent.
src/parakeet_tdt.{h,cpp} implements LSTM + joint + transducer greedy
on CPU in pure f32. Weights are dequantized from the loaded GGUF once
at Engine construction via ggml_get_type_traits(type)->to_float,
which handles f32 / f16 / q8_0 / q5_0 / q4_0 uniformly. Post-dequant
footprint: ~70 MiB f32 for the decoder (embedding + 2-layer LSTM +
joint MLP).
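The ~70 MiB figure can be sanity-checked from the tensor shapes. The sketch below is an estimate under stated assumptions, not the loader's actual accounting: it assumes V = 8192, a 640-wide embedding / LSTM / joint, a 1024-wide encoder output feeding tdt.joint.enc, standard 4-gate LSTM weights, and a bias on each joint linear. The function name tdt_decoder_f32_mib is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: estimate the f32 footprint (MiB) of the TDT decoder
// from tensor shapes. All shapes are assumptions based on the dimensions
// quoted in this log; nothing here is read from a GGUF.
inline double tdt_decoder_f32_mib(std::size_t vocab,        // e.g. 8192
                                  std::size_t hidden,       // e.g. 640
                                  std::size_t enc_dim,      // e.g. 1024 (assumed encoder width)
                                  std::size_t n_durations,  // e.g. 5
                                  std::size_t lstm_layers)  // e.g. 2
{
    // tdt.predict.embed.weight: (V+1, hidden)
    const std::size_t embed = (vocab + 1) * hidden;
    // Per LSTM layer: w_ih and w_hh are (4*hidden, hidden); b_ih and b_hh are (4*hidden).
    const std::size_t lstm = lstm_layers * (2 * 4 * hidden * hidden + 2 * 4 * hidden);
    // Joint projections, each with an assumed bias vector.
    const std::size_t j_enc  = hidden * enc_dim + hidden;
    const std::size_t j_pred = hidden * hidden + hidden;
    const std::size_t out_dim = vocab + 1 + n_durations;  // 8198 for the v3 checkpoint
    const std::size_t j_out  = out_dim * hidden + out_dim;

    const std::size_t params = embed + lstm + j_enc + j_pred + j_out;
    return static_cast<double>(params) * sizeof(float) / (1024.0 * 1024.0);
}
```

With the values above this estimate lands at roughly 69 MiB, in line with the ~70 MiB quoted for the dequantized decoder.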
Decode loop:
- Initialize LSTM h/c to zeros, feed blank through the prediction net to produce the initial g output.
- Per encoder frame:
  - Compute joint logits (8198,) = ReLU(enc_proj + pred_proj) @ W_out.
  - argmax over first V+1=8193 -> token; argmax over last 5 -> duration.
  - If token == blank: advance t by max(1, dur), reset sym counter.
  - Else: emit token, step LSTM on token embedding, update g; advance t only when dur > 0 or sym_count >= max_symbols (max_symbols=10 matches NeMo's greedy config).
- Detokenize through the existing sentencepiece_bpe helper (shared with CTC). NeMo's TDT v3 tokenizer has 8192 SBPE pieces covering ~25 languages + multilingual PnC.
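The control flow of the decode loop above can be sketched as follows. This is an illustration, not the code in src/parakeet_tdt.cpp: the joint network and LSTM are replaced by a caller-supplied callback (JointFn, hypothetical), blank is assumed to be the last label id, and the duration index is assumed to equal the duration value in frames.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical joint callback: returns (vocab_plus_blank + n_durations) logits
// for encoder frame t, given the last emitted token (-1 before any emission).
// It stands in for the real LSTM + joint MLP.
using JointFn = std::function<std::vector<float>(int t, int last_token)>;

// Argmax over v[begin, end), returned relative to `begin`.
inline int argmax_range(const std::vector<float>& v, int begin, int end) {
    return static_cast<int>(
        std::max_element(v.begin() + begin, v.begin() + end) - (v.begin() + begin));
}

inline std::vector<int> tdt_greedy_sketch(int n_frames, int vocab_plus_blank,
                                          int n_durations, int max_symbols,
                                          const JointFn& joint) {
    const int blank = vocab_plus_blank - 1;  // assumption: blank is the last label id
    std::vector<int> tokens;
    int t = 0, last_token = -1, sym_count = 0;
    while (t < n_frames) {
        const std::vector<float> logits = joint(t, last_token);
        assert(static_cast<int>(logits.size()) == vocab_plus_blank + n_durations);
        const int token = argmax_range(logits, 0, vocab_plus_blank);
        const int dur = argmax_range(logits, vocab_plus_blank,
                                     vocab_plus_blank + n_durations);
        if (token == blank) {
            t += std::max(1, dur);  // blank always moves forward at least one frame
            sym_count = 0;
        } else {
            tokens.push_back(token);
            last_token = token;     // stands in for stepping the LSTM
            ++sym_count;
            if (dur > 0) {
                t += dur;
                sym_count = 0;
            } else if (sym_count >= max_symbols) {
                t += 1;             // max_symbols guard: force progress
                sym_count = 0;
            }
        }
    }
    return tokens;
}
```

The max_symbols branch is what guarantees termination when the joint keeps predicting a non-blank token with duration 0 on the same frame.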
Engine wiring:
- Engine::Impl owns the TdtRuntimeWeights populated at construction for TDT GGUFs; the CTC path ignores it.
- Engine::transcribe() / transcribe_samples() branch on model type. Streaming entry points still reject TDT via ensure_ctc_only(); transducer streaming is a Phase 10.5 item. (Superseded by §10.5: ensure_ctc_only() was removed; both Mode 2 and Mode 3 streaming run on TDT GGUFs today.)
- CLI run_once lambda has the same branch; TDT GGUFs now transcribe end-to-end via --wav / --pcm-in.
All measured on Apple M4 Metal, f16 GGUF (1.34 GiB), unless noted.
jfk.wav (11 s): C++ transcript is byte-identical to NeMo:
"And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."
load=490 ms, mel=5.5 ms, enc=211 ms, dec=48 ms, total=264 ms, RTF=0.024 (42× real-time).
LastQuestion_long_EN.raw (5.5 min): clean long-form transcription
with proper nouns (Multivac / Adele / Lupov / Pluto), commas,
periods, dialog structure. RTF=0.050 (20× real-time), 1825 tokens,
1472 ms pure-CPU decode.
Multilingual sanity (external sample_*.raw clips, not tracked
in this repo):
- Spanish: "Se recomienda enfáticamente a los viajeros..."
- French: "L'accident a eu lieu en terrain montagneux..."
- German: "Für die besten Aussichten auf Hongkong..."
- Italian, Portuguese, Russian: native-script transcripts with punctuation.
- Japanese: produces garbled output (same as NeMo reference on the same sample; confirmed model limitation, not a port bug).
Refactored the TDT decoder around a stateful window primitive so all three entry points work uniformly on both CTC and TDT GGUFs.
TdtDecodeState (new, in src/parakeet_tdt.h) holds the LSTM
hidden + cell tensors per layer, the last-layer pred_out vector that
feeds the joint, the symbols_this_step counter for the max_symbols
guard, an initialized flag, and a carry_frames counter for cases
where a large-duration advance spills past the end of the current
window.
tdt_decode_window(encoder_out_ptr, n_frames, opts, &state, &out_tokens, &out_steps) is the new primitive. It:
- Lazy-inits the state on first call by feeding the blank token through the LSTM.
- Honours state.carry_frames to skip frames consumed by a duration-advance from the previous window.
- Walks encoder frames, argmax'ing both token and duration logits, emitting non-blank tokens, stepping the LSTM on every emission.
- Stops when the cursor hits n_frames; parks any leftover advance in state.carry_frames for the next call.
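The carry_frames bookkeeping is the part that makes windowed decoding equivalent to one long window. A minimal sketch of just that bookkeeping follows, assuming every decode step advances by at least one frame (the real loop can also stay on a frame after a zero-duration emission). CarryState and walk_window_sketch are hypothetical names, not the fields of TdtDecodeState.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the carry part of the decode state.
struct CarryState { int carry_frames = 0; };

// Walks one window of n_frames encoder frames. advance_at(t) stands in for the
// per-frame decode step and returns how many frames to advance (>= 1).
// Returns the window-local frame indices that were actually visited.
template <typename AdvanceFn>
std::vector<int> walk_window_sketch(int n_frames, CarryState& state, AdvanceFn advance_at) {
    std::vector<int> visited;
    // Skip frames already consumed by an advance from the previous window.
    int t = std::min(state.carry_frames, n_frames);
    state.carry_frames -= t;
    while (t < n_frames) {
        visited.push_back(t);
        const int adv = advance_at(t);
        assert(adv >= 1);
        t += adv;
    }
    // Park any overshoot past the end of this window for the next call.
    state.carry_frames += t - n_frames;
    return visited;
}
```

If the carry is larger than the whole window, the window is skipped entirely and the remainder stays parked in the state.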
tdt_greedy_decode() is now a thin wrapper: fresh state, one window
spanning all encoder frames.
Engine wiring (src/parakeet_engine.cpp):
- Engine::Impl already had TdtRuntimeWeights from §10.3; no changes.
- transcribe_samples_stream() (Mode 2): branches on model_type. TDT path inits a fresh TdtDecodeState, then calls tdt_decode_window() once per frames_per_window range. CTC path unchanged.
- StreamSession::Impl gains a TdtDecodeState tdt_state that is primed in stream_start() for TDT GGUFs. Each live chunk's center-frame range (after left_drop_frames + before right_drop_frames) flows through tdt_decode_window() with the session's carried state.
- ensure_ctc_only() helper is gone; the CLI gate that short-circuited --stream for TDT is gone too.
Public API addition: Engine::model_type() -> "ctc" | "tdt" (Phase 11
extends this to also return "sortformer"), so downstream callers
(and the test harness) can pick per-model knobs
without reaching through internal headers.
Test harness (test_streaming.cpp):
- Mode-3 configs carry per-model WER tolerances.
- CTC tolerates 5 % at all three configs.
- TDT tolerates 5 % at chunk=2000 configs, 40 % at the aggressive chunk=1000 left=2000 right=500 slot (observed 36 % on jfk — still passes offline parity at the larger context, but the transducer greedy is more sensitive to short chunks + small right-lookahead than CTC's greedy collapse).
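For reference, the WER tolerances above compare streamed text against the one-shot transcript. A minimal sketch of a word-level WER follows, assuming whitespace tokenisation and no case or punctuation normalisation; the metric in test_streaming.cpp may normalise differently. split_words and wer_sketch are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Split on whitespace; no case folding or punctuation stripping (assumption).
inline std::vector<std::string> split_words(const std::string& s) {
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string tok; in >> tok;) words.push_back(tok);
    return words;
}

// WER = word-level Levenshtein distance / number of reference words.
inline double wer_sketch(const std::string& ref, const std::string& hyp) {
    const std::vector<std::string> r = split_words(ref);
    const std::vector<std::string> h = split_words(hyp);
    if (r.empty()) return h.empty() ? 0.0 : 1.0;

    // Two-row dynamic programme over (reference prefix, hypothesis prefix).
    std::vector<std::size_t> prev(h.size() + 1), cur(h.size() + 1);
    for (std::size_t j = 0; j <= h.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= r.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= h.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return static_cast<double>(prev[h.size()]) / static_cast<double>(r.size());
}
```

A tolerance such as 5 % then corresponds to checking that wer_sketch(offline, streamed) is at most 0.05.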
End-to-end numbers on parakeet-tdt-0.6b-v3.f16.gguf, M4 Metal:
- Mode 2 (offline encoder + streamed segments):
- jfk.wav, chunk_ms=2000: 6 segments, concatenated byte-identical to one-shot transcribe().
- LastQuestion_long_EN.raw, chunk_ms=2000: 164 segments; correctly cased/punctuated; cumulative text matches offline.
- Mode 3 (live duplex):
- jfk.wav, chunk=2000 / left=5000 / right=2000: 5 segments, 0.00 % WER vs one-shot.
- LastQuestion_long_EN.raw at the default preset (chunk=2000 / left=10000 / right=2000): 9.84 % WER vs offline TDT. Higher than CTC's 4.13 % on the same clip; the TDT transducer decode is more sensitive to missing future context at chunk boundaries. Still usable for live captioning; the gap narrows with bigger chunks / right-lookahead if latency allows.
live-mic now works with TDT GGUFs unchanged, so native-microphone
captions stream out properly-cased + punctuated text end-to-end.
nvidia/parakeet-tdt-1.1b is the deeper English-only TDT sibling of
parakeet-tdt-0.6b-v3:
| Config | tdt-0.6b-v3 | tdt-1.1b |
|---|---|---|
| Encoder layers | 24 | 42 |
| Mel bins | 128 | 80 |
| Vocab | 8192 (multilingual + PnC) | 1024 (English only, lowercase no PnC) |
| use_bias | False | True (default) |
| Decoder / joint | 2-layer LSTM 640 / joint 640 + 5 durations | same |
Every one of those dimensions (n_layers, n_mels, vocab_size,
use_bias) is already read from GGUF metadata, so the converter and
C++ loader/decoder support this checkpoint with no code changes.
Measured on Apple M4 Metal, q8_0 GGUF (1.22 GiB):
- Byte-identical to NeMo on jfk.wav: "and so my fellow americans ask not what your country can do for you ask what you can do for your country" (lowercase/no-PnC; matches NeMo parakeet-tdt-1.1b.transcribe() exactly).
- jfk.wav (11 s): load=419 ms, mel=5 ms, enc=277 ms, dec=20 ms, total=301 ms, RTF=0.027 (37× real-time). Encoder scales 1.75× vs tdt-0.6b-v3 as expected (42/24 layers). Decode is faster than tdt-0.6b-v3 (20 ms vs 48 ms) because the 1024-class output layer is 8× smaller than the 8192-class multilingual one.
- LastQuestion_long_EN.raw (5.5 min): RTF=0.079 (13× real-time), 1716 tokens, high-quality transcript of all dialog content.
- test-streaming on tdt-1.1b: 10/10 PASS, including 0 % WER on all three Mode 3 configs (even the aggressive chunk=1000/left=2000/right=500 slot where tdt-0.6b-v3 had 36 %; the deeper encoder + English-only training gives better streaming quality too).
- Mode 2 + Mode 3 streaming work end-to-end with no tuning; live-mic just works.
scripts/download-all-models.sh gains parakeet-tdt-1.1b so offline
bootstrapping picks it up automatically.
- BLAS / Accelerate for the LSTM + joint gemvs. Current decode uses pure scalar loops: 20-48 ms / 11 s on f16 one-shot, 890 ms - 1.5 s / 5.5 min. Not a bottleneck today; easy win at high throughput.
- Quantized (q8_0 / q4_0) TDT GGUFs sweep. Converter and loader already handle these storage types via the universal dequant path, but haven't been sweep-tested for WER drift.
- parakeet-tdt_ctc-110m support. Same TDT decoder, smaller 512 × 17 FastConformer encoder; .nemo already cached locally.
Phase 11 — Sortformer (4-speaker diarization) (done through §11.11.1; spkcache streaming tracked as Phase 11.11.2)
Phase 11 ports nvidia/diar_sortformer_4spk-v1, a speaker-diarization
model that shares the FastConformer encoder backbone with our Parakeet
ports but adds a Sortformer-specific head: encoder projection, an
18-layer post-LN Transformer encoder, and a small MLP that produces
per-frame, per-speaker probabilities for up to 4 speakers (multi-label
sigmoid output handles overlapping speech).
scripts/convert-nemo-to-gguf.py auto-detects Sortformer checkpoints (target == SortformerEncLabelModel) and:
- Skips tokenizer extraction (Sortformer has no SentencePiece).
- Writes parakeet.sortformer.* metadata: num_spks, fc_d_model, tf_d_model, tf_n_layers, tf_n_heads, tf_inner_size, tf_pre_ln, tf_hidden_act.
- Writes new tensors: encoder_proj (512 -> 192), 18 transformer blocks (attn q/k/v/out + ln1 + ffn in/out + ln2), and the head (first_hidden_to_hidden + single_hidden_to_spks). The 384-wide hidden_to_spks (used only in v2 streaming) is intentionally skipped.
C++ loader (src/parakeet_ctc.{h,cpp}):
- ParakeetModelType::SORTFORMER added; EncoderConfig grows sortformer_{num_spks,fc_d_model,tf_d_model,tf_n_layers,tf_n_heads, tf_inner_size,tf_pre_ln} fields.
- New SortformerWeights / SortformerTransformerBlock structs hold the 18 blocks + head linears.
- Engine::transcribe / streaming entry points reject Sortformer GGUFs with a clear message pointing at PROGRESS.md.
scripts/dump-sortformer-reference.py replicates NeMo's process_signal -> frontend_encoder -> transformer_encoder -> forward_speaker_sigmoids chain on a wav and dumps per-stage references (mel, encoder_out, post_proj, post_transformer, speaker_probs).
src/parakeet_sortformer.{h,cpp} is the CPU forward:
- SortformerRuntimeWeights holds f32-dequantised tensors. Same to_float trait pattern as TDT, so f32/f16/q8_0/q4_0 all just work.
- sortformer_diarize() pipeline:
  - linear_batch(encoder_proj) -> (T, 192)
  - for each transformer block (post-LN, pre_ln=False): attn(Q,K,V) + residual + layer_norm_1 -> ffn(ReLU) + residual + layer_norm_2
  - ReLU -> first_hidden_to_hidden -> ReLU -> single_hidden_to_spks -> sigmoid -> (T, num_spks)
  - Threshold-based per-speaker segment formation, sorted by start time then speaker_id.
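The last pipeline step, turning per-frame probabilities into segments, can be sketched as below. This is an illustration under assumptions, not the code in src/parakeet_sortformer.cpp: probabilities are assumed row-major (n_frames, num_spks), a frame counts as active when its probability is >= threshold, and Segment / segments_from_probs are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical mirror of a diarization segment.
struct Segment { int speaker_id; double start_s; double end_s; };

// probs is row-major (n_frames, num_spks); frame_stride_s is seconds per frame.
inline std::vector<Segment> segments_from_probs(const std::vector<float>& probs,
                                                int n_frames, int num_spks,
                                                double frame_stride_s,
                                                float threshold, int min_segment_ms) {
    std::vector<Segment> out;
    for (int s = 0; s < num_spks; ++s) {
        int run_start = -1;  // first frame of the current active run, or -1
        // Iterate one past the end so a run still open at the last frame is closed.
        for (int t = 0; t <= n_frames; ++t) {
            const bool active = t < n_frames &&
                probs[static_cast<std::size_t>(t) * num_spks + s] >= threshold;
            if (active && run_start < 0) run_start = t;
            if (!active && run_start >= 0) {
                const double start = run_start * frame_stride_s;
                const double end = t * frame_stride_s;
                if ((end - start) * 1000.0 >= min_segment_ms) {
                    out.push_back(Segment{s, start, end});
                }
                run_start = -1;
            }
        }
    }
    // Sort by start time, then speaker id.
    std::sort(out.begin(), out.end(), [](const Segment& a, const Segment& b) {
        return std::tie(a.start_s, a.speaker_id) < std::tie(b.start_s, b.speaker_id);
    });
    return out;
}
```

Because each speaker is thresholded independently, overlapping speech simply yields overlapping segments with different speaker ids.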
Numerical parity on jfk.wav vs NeMo (Apple M4 Metal, f16 GGUF):
mel : max_abs=3.36e-1 rel=1.65e-3 PASS (peak-norm matches)
enc : max_abs=3.50e-3 rel=1.62e-3 PASS (FastConformer)
probs : max_abs=8.68e-4 rel=2.03e-4 PASS (encoder_proj + 18-layer transformer + head + sigmoid)
speaker activity matches NeMo exactly: 118/138 frames active, only speaker 0, max 1 simultaneous speaker.
Public engine.h additions:
struct DiarizationOptions { float threshold = 0.5f; int min_segment_ms = 0; };
struct DiarizationSegment { int speaker_id; double start_s, end_s; };
struct DiarizationResult { segments + per-frame speaker_probs + n_frames + num_spks + frame_stride_s + per-stage timings };
Engine::diarize(wav_path, opts) and diarize_samples(samples, n, sr, opts).
Engine::Impl primes a SortformerRuntimeWeights at construction so the dequant cost is paid once.
CLI: when a Sortformer GGUF is loaded, the existing --wav / --pcm-in pipeline routes through diarize_samples and emits one line per segment:
text: [start-end] speaker_N
jsonl: {"speaker":N,"start":S,"end":E}
Measured on Apple M4 Metal, sortformer-4spk-v1.f16.gguf:
| Clip | encoder | decode (CPU) | total | RTF |
|---|---|---|---|---|
| jfk.wav (11 s) | 96 ms | 83 ms | 187 ms | 0.017 (58x) |
| LastQuestion_long_EN (5.5 m) | 9.2 s | 22.5 s | 31.9 s | 0.097 (10x) |
Decoder cost grows fast on long-form because the post-LN Transformer has O(T^2) attention with no chunking. At T=4099 (5.5 min) that's ~16.8 M attention pairs per layer x 18 layers — Phase 11.11 (streaming v2) brings chunked attention.
Combines Sortformer (Phase 11) with a Parakeet ASR Engine to produce "who said what" output natively in C++ in one binary and one CLI call.
Public API (include/parakeet/ctc/engine.h):
struct AttributedSegment { speaker_id; text; start_s; end_s; };
struct AttributedTranscriptionOptions { diarization;
merge_same_speaker = true;
min_segment_ms = 200;
pad_segment_ms = 0; };
struct AttributedTranscriptionResult { segments; diarization;
asr_calls; total_ms;
audio_samples; sample_rate; };
transcribe_with_speakers(sf_engine, asr_engine, wav_path, opts);
transcribe_samples_with_speakers(sf_engine, asr_engine,
samples, n, sr, opts);
Plus tiny Engine::is_diarization_model() /
is_transcription_model() helpers so downstream callers (CLI, unit
tests, future bindings) can route based on what each loaded GGUF is.
Pipeline (in src/parakeet_engine.cpp):
- sortformer_engine.diarize_samples() -> per-frame speaker probs + segments via threshold + per-speaker grouping.
- For each diarization segment, slice samples[start:end] (with optional pad_segment_ms padding on each side, skipping segments shorter than min_segment_ms) and feed the slice through asr_engine.transcribe_samples().
- If merge_same_speaker (default), collapse consecutive same-speaker entries by appending text and extending end_s. This turns the ~10 micro-segments Sortformer emits per speaker turn into the handful of natural turn boundaries a downstream UI cares about.
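The merge step in the last bullet can be sketched as follows, assuming segments arrive sorted by start time and that merged text is joined with a single space. AttributedSeg and merge_same_speaker_sketch are hypothetical names that only mirror the shape of the public AttributedSegment.

```cpp
#include <string>
#include <vector>

// Hypothetical mirror of AttributedSegment.
struct AttributedSeg {
    int speaker_id;
    std::string text;
    double start_s;
    double end_s;
};

// Collapse consecutive same-speaker entries: append text, extend end_s.
inline std::vector<AttributedSeg> merge_same_speaker_sketch(
        const std::vector<AttributedSeg>& in) {
    std::vector<AttributedSeg> out;
    for (const AttributedSeg& seg : in) {
        if (!out.empty() && out.back().speaker_id == seg.speaker_id) {
            AttributedSeg& last = out.back();
            if (!seg.text.empty()) {
                if (!last.text.empty()) last.text += ' ';
                last.text += seg.text;
            }
            last.end_s = seg.end_s;  // the merged turn ends where the latest piece ends
        } else {
            out.push_back(seg);      // speaker changed: start a new turn
        }
    }
    return out;
}
```

Only adjacent entries are merged, so a speaker who returns after someone else has spoken starts a new turn.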
CLI:
./parakeet --model <asr.gguf> --diarization-model <sf.gguf> \
--wav <multi-speaker.wav>
Output formats:
text : [start-end] speaker_N: ...
jsonl : {"speaker":N,"start":S,"end":E,"text":"..."}
End-to-end on diarization-sample-16k.wav (27.3 s, 2 speakers, Apple M4 Metal):
TDT-0.6b-v3.f16 + Sortformer-v1.f16:
diar.segments=11 -> merged=4, asr_calls=11 total=1514ms RTF=0.055 (18x real-time)
[0.40-4.24] speaker_0: So Aaron, in your email you said you wanted to talk about the exam.
[4.96-15.60] speaker_1: Yeah, um I've just never taken a class with so many different readings...
[16.24-18.16] speaker_0: Yeah.
[18.48-27.36] speaker_1: Yeah. There's usually just one book to review, not two. Three different books, plus all those other text excerpts and videos.
CTC-0.6b.q8_0 + Sortformer-v1.f16: Same speaker boundaries, faster decode, English lowercase no-PnC.
CTC + TDT + sortformer-only paths all unchanged (regressions verified).
nvidia/diar_streaming_sortformer_4spk-v2 is the streaming-trained
sibling of v1. Architecture diff for offline-mode usage:
encoder : 512 d_model x 17 layers (was 18) + 128 mel bins (was 80)
transformer: 18 layers x 192 d_model (same)
head : encoder_proj + first_hidden_to_hidden + single_hidden_to_spks + extra hidden_to_spks (4, 384) for streaming-only path
All four read-paths flow from existing GGUF metadata (n_layers / feat_in / tf_n_layers / use_bias / xscaling), so converting v2 and running it through our offline diarize() pipeline works with no code changes:
python scripts/convert-nemo-to-gguf.py \
--ckpt models/diar_streaming_sortformer_4spk-v2.nemo \
--out models/sortformer-streaming-4spk-v2.f16.gguf --quant f16
-> 250.9 MiB f16
The 384-wide hidden_to_spks tensor is the streaming-mode-only output head (concat of spkcache + chunk hidden states); the converter currently skips it since v1's 192-wide head reproduces NeMo's forward_speaker_sigmoids() bit-for-bit in offline mode. v2 GGUFs run through the same single_hidden_to_spks (192-wide) path as v1.
Verified on diarization-sample-16k.wav (2 speakers): v2 produces 9 segments vs v1's 11 — same conversation structure, slightly different boundary placement (v2 was trained with chunked-attention masking which subtly affects even offline forward passes).
Converter helper _get_member handles both ./model_config.yaml and
model_config.yaml tarball layouts (v1 has the prefix, v2 doesn't).
Phase 11.11 — Live streaming diarization (overview) (11.11.1 shipped; 11.11.2 NeMo-style spkcache pending)
Real live diarization needs the v2 spkcache + FIFO state machine.
NeMo's forward_streaming_step per chunk is:
- encoder.pre_encode(chunk) -> chunk_pre_encode_embs (only the dw_striding subsampling stage; output is in 512-dim post-subsampling space)
- concat([spkcache, fifo, chunk_pre_encode_embs]) -> concat_embs (spkcache_len + fifo_len + chunk_subs frames)
- frontend_encoder(concat_embs, bypass_pre_encode=True) -> full FastConformer encoder + encoder_proj on the concatenated buffer -> (T_total, 192)
- forward_infer(...) -> 18-layer transformer + speaker head -> (T_total, 4) speaker probabilities
- streaming_update(state, chunk_pre_encode_embs, all_preds, lc, rc):
- Extract chunk_preds = preds[spkcache_len + fifo_len + lc : spkcache_len + fifo_len + chunk_len + lc]
- Append (chunk, chunk_preds) to FIFO
- If FIFO overflows, pop front frames into spkcache and update the silence profile (mean_sil_emb + n_sil_frames) so the next compress_spkcache call can identify silence frames to keep.
- If spkcache overflows, _compress_spkcache reduces it back to spkcache_len via per-speaker top-k frame selection (using the speaker probabilities) plus silence-anchor frames. This step is what keeps speaker IDs consistent across chunks: it constructs a persistent per-speaker memory and a permutation.
Default v2 hyperparameters from the .nemo config:
spkcache_len: 188
fifo_len: 188
chunk_len: 188
chunk_left_context: 1
chunk_right_context: 1
subsampling_factor: 8
spkcache_update_period: 188
spkcache_sil_frames_per_spk: 3
causal_attn_rate: 0.5 (training-only)
Implementation plan when this lands:
- Expose pre_encode_only from run_encoder (or a sibling): given mel, run only the dw_striding subsampling + out projection so we get the 512-dim post-subsampling tensor without running the conformer blocks.
- New SortformerStreamingState struct (mirrors NeMo's StreamingSortformerState): spkcache (T_max, 512), spkcache_preds (T_max, 4), fifo (T_max, 512), fifo_preds (T_max, 4), mean_sil_emb (512), n_sil_frames (int), spk_perm (4).
- SortformerStreamSession (mirrors StreamSession in shape but with feed_pcm_*() -> chunk_callback semantics emitting per-chunk speaker probabilities + segment events).
- Per-chunk forward: pre_encode the new audio chunk only, concat with spkcache + fifo, run full encoder + transformer + head on the concatenation, slice out chunk_preds, run streaming_update which updates state including the silence profile + (eventually) compresses spkcache via per-speaker top-k.
- Wire CLI --stream-duplex to route through SortformerStreamSession when the model is Sortformer.
- Per-stage parity vs NeMo's forward_streaming (dump spkcache/fifo/chunk_preds at each chunk boundary, compare).
Open design questions (need to resolve when work starts):
- _compress_spkcache is the most complex component (~150 lines of PyTorch). Probably re-implementable in pure C++ since it's mostly index gather + softmax + top-k.
- What's the right surface for chunk_callback? Per-chunk probability matrix? Per-chunk newly-formed segments? Both?
- pad_segment_ms from §11.10 should also work in streaming mode.
- BLAS/Accelerate for the transformer attention will be needed to keep per-chunk RTF reasonable (today's offline scalar implementation spends 22 s on a 5.5 min clip; per chunk that's ~0.7 s per 100 ms chunk, way over real-time).
Estimated effort: 1-2 weeks of focused work + parity validation. Tracked as Phase 11.11.2; the §11.11.1 sliding-history implementation below shipped first as the pragmatic v1.
Phase 11.11.2 (planned) is a multi-week effort to land the full NeMo
forward_streaming algorithm (spkcache + fifo + _compress_spkcache
+ encoder graph split). To unblock product integration now, Phase
11.11.1 ships a pragmatic streaming layer that reuses the existing
offline Engine::diarize() path under a sliding-history window.
API (in include/parakeet/ctc/engine.h):
struct SortformerStreamingOptions {
int sample_rate = 16000;
int chunk_ms = 2000; // emit cadence
int history_ms = 30000; // sliding context window
float threshold = 0.5f;
int min_segment_ms = 200;
bool emit_partials = true;
};
struct StreamingDiarizationSegment {
int speaker_id;
double start_s, end_s;
int chunk_index;
bool is_final;
};
using SortformerSegmentCallback =
std::function<void(const StreamingDiarizationSegment &)>;
class SortformerStreamSession {
public:
void feed_pcm_f32(const float *, int n);
void feed_pcm_i16(const int16_t *, int n);
void finalize();
void cancel();
const SortformerStreamingOptions & options() const;
};
std::unique_ptr<SortformerStreamSession>
Engine::diarize_start(const SortformerStreamingOptions &,
SortformerSegmentCallback);

Algorithm (per chunk):
- feed_pcm_*() appends samples to a std::vector<float> ring.
- Once chunk_samples of new audio is available beyond emitted_samples, take a window [max(ring_origin, emit_end - history_samples), emit_end] and run the full offline engine_impl_diarize_helper(...) on it (mel + encoder + sortformer head + threshold-segmentation).
- For every returned segment whose absolute time range overlaps the new chunk's [emitted_samples, emit_end], emit a StreamingDiarizationSegment clipped to the chunk.
- Advance emitted_samples = emit_end. Trim ring to keep only history_samples of audio behind us (so the buffer stays bounded for arbitrarily long sessions).
- finalize() semantics:
  - if >= 1 sample of new audio sits past the last emitted chunk, run one final process_chunk over [max(ring_origin, end - history_samples), end], emit each overlapping segment with is_final = true. Consumers see real segments tagged final.
  - if the audio ended exactly on a chunk boundary (no tail), emit a single synthetic terminator with speaker_id = -1, start_s == end_s == emitted_samples / sample_rate, is_final = true. Consumers should treat negative speaker IDs as "session done, no new segment". This avoids the round-1 bug where the last chunk's segments were re-emitted as duplicates with is_final = true flipped on.
- cancel() short-circuits; subsequent feed_* calls are no-ops.
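The two pieces of index arithmetic in the per-chunk algorithm, picking the history window and clipping a returned segment to the new chunk, can be sketched as below. This assumes half-open [begin, end) ranges in absolute sample indices; SampleRange, history_window, and clip_to_chunk are hypothetical names, not the session's internals.

```cpp
#include <algorithm>
#include <cstdint>

// Half-open range [begin, end) in absolute sample indices (hypothetical type).
struct SampleRange { std::int64_t begin; std::int64_t end; };

// Window fed to the offline diarizer: at most history_samples behind emit_end,
// and never earlier than what the ring buffer still holds.
inline SampleRange history_window(std::int64_t ring_origin, std::int64_t emit_end,
                                  std::int64_t history_samples) {
    const std::int64_t begin = std::max(ring_origin, emit_end - history_samples);
    return SampleRange{begin, emit_end};
}

// Clip a segment to the newly emitted chunk. Returns false when they do not overlap.
inline bool clip_to_chunk(const SampleRange& seg, const SampleRange& chunk,
                          SampleRange& clipped) {
    const std::int64_t b = std::max(seg.begin, chunk.begin);
    const std::int64_t e = std::min(seg.end, chunk.end);
    if (b >= e) return false;
    clipped = SampleRange{b, e};
    return true;
}
```

At 16 kHz, a segment spanning 1.84 s to 10.00 s clipped to the chunk covering 2.00 s to 4.00 s yields exactly that chunk, which is why streamed output arrives as chunk-sized pieces of a longer offline segment.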
Trade-offs (vs the planned full Phase 11.11.2 NeMo-style streaming):
- Pro: ~150 lines of code; zero changes to the encoder graph; works with both v1 and v2 Sortformer GGUFs; reuses the parity-tested offline diarize() path.
- Pro: speaker IDs stabilise within a few chunks once the history window contains both speakers' audio; matches offline IDs exactly once the history covers the full session.
- Con: each chunk re-runs the full encoder over the trailing history_ms of audio. Measured RTF ~0.25 on M4 Air CPU at chunk_ms=2000 history_ms=30000 for the 22 s diarization-sample-16k.wav sample (5.5 s wall for 22 s of audio). Phase 11.11.2's spkcache approach will fix this.
- Con: speaker IDs in the very first chunks may be arbitrary before the history window contains both speakers. Verified on diarization-sample-16k.wav: chunk 1 mislabels speaker_0 as speaker_1 at [2.00-4.00]; chunks 2-10 align with the offline reference (speaker_0 for [1.84-10.00], speaker_1 for [13.36-21.04]).
CLI:
./build/parakeet \
--model models/sortformer-4spk-v1.f16.gguf \
--pcm-in recording.raw --pcm-format s16le \
--stream \
--stream-chunk-ms 2000 --stream-history-ms 30000 \
--emit text # or jsonl

The CLI auto-routes Sortformer + --stream through the streaming path
(no separate flag). --emit jsonl produces
{"speaker", "start", "end", "chunk", "is_final"} per line.
Live mic auto-detects diarization mode when --model resolves to a
Sortformer GGUF — examples/live-mic.cpp swaps in a
SortformerStreamSession instead of StreamSession and prints
[start-end] speaker_N per chunk:
./build/live-mic --model models/sortformer-4spk-v1.f16.gguf \
--chunk-ms 2000 --history-ms 30000

For combined live transcription + speaker labels in a single binary,
examples/live-mic-attributed.cpp loads two engines (a CTC/TDT ASR
engine and a Sortformer engine), forwards each captured audio batch
to both StreamSession and SortformerStreamSession, and tags each
transcript segment with the speaker whose live diarization range
overlaps it the most. --accumulate accumulates text on a single
line per speaker and emits a newline on speaker change or
--silence-flush-ms of silence:
./build/live-mic-attributed \
--asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--diar-model models/sortformer-4spk-v1.f16.gguf \
--accumulate

Independent --asr-n-gpu-layers / --diar-n-gpu-layers allow
splitting the two engines across CPU and GPU on machines where
running both on the GPU would compete for resources.
Testing: test/test_sortformer_streaming.cpp (built as
test-sortformer-streaming when PARAKEET_BUILD_TESTS=ON) feeds
the multi-speaker sample in random burst sizes (1-5000 samples per
feed_pcm_f32() call) and asserts:
- >= 1 real segment callback received (speaker_id >= 0),
- exactly one is_final = true callback received after finalize() (real segment for the tail case, synthetic terminator with speaker_id = -1 for the chunk-aligned case),
- max_end is within the audio duration,
- no two consecutive callbacks duplicate each other's (speaker_id, start_s, end_s),
- cancel() on a half-fed session is idempotent.
Verified end-to-end on diarization-sample-16k.wav:
offline: [1.84-10.00] speaker_0 [13.36-21.04] speaker_1
streaming (chunk=2000, history=30000):
[2.00-4.00] speaker_1 (chunk 1) # cold-start mislabel
[4.00-6.00] speaker_0 (chunk 2)
[6.00-8.00] speaker_0 (chunk 3)
[8.00-10.00] speaker_0 (chunk 4)
[13.36-14.00] speaker_1 (chunk 6)
[14.00-16.00] speaker_1 (chunk 7)
[16.00-18.00] speaker_1 (chunk 8)
[18.00-20.00] speaker_1 (chunk 9)
[20.00-21.04] speaker_1 (chunk 10)
[20.00-21.04] speaker_1 (chunk 10, final)
Phase 11.11.2 (true NeMo streaming with spkcache compression) remains the eventual destination; 11.11.1 is what ships today.
- BLAS / Accelerate for transformer attention. Same opportunity as TDT's LSTM + joint gemvs; current scalar attention is the long-form bottleneck on Sortformer's 18-layer TF (T^2 cost dominates). See §5.4 for the prior Accelerate sched-assertion investigation on the f32 GGUF -- worth re-checking with the q8_0 path.
Both Sortformer checkpoints (diar_sortformer_4spk-v1 offline and
diar_streaming_sortformer_4spk-v2 streaming-trained) now ship at
q8_0 and q4_0 via the universal add_2d quantisation path in
scripts/convert-nemo-to-gguf.py. No converter changes needed --
Sortformer's encoder shares the FastConformer graph with CTC/EOU,
and the transformer encoder + diarization head are 2D linear layers
that already flow through add_2d.
Sizes:
| GGUF | f16 | q8_0 | q4_0 |
|---|---|---|---|
| sortformer-4spk-v1 | 263 MiB | 141 MiB | 75 MiB |
| sortformer-streaming-4spk-v2 | 251 MiB | 134 MiB | 72 MiB |
scripts/verify-gguf-roundtrip.py gained build_expected_sortformer
covering the encoder + sortformer.encoder_proj + 18 transformer
blocks (attn.{q,k,v,out}, ln{1,2}, ffn.{in,out}) + the
two-layer diarization head. All 6 GGUFs (2 models × 3 tiers) PASS
the roundtrip gate (worst rel 1.15e-1 on parakeet-ctc-0.6b.q4_0-
class q4 weights, well within the 2^-3 = 0.125 quant gate).
test-sortformer-parity was extended with --enc-rel-tol and
--probs-abs-tol flags so each quant tier can pass at appropriate
gates (defaults still f16 = 5e-3 / 5e-2). Per-tier numbers on
jfk.wav (single-speaker, 11 s):
| GGUF | enc rel | probs max_abs |
|---|---|---|
| sortformer-4spk-v1.f16 | 1.6e-3 | 8.7e-4 |
| sortformer-4spk-v1.q8_0 | 2.7e-2 | 2.7e-2 |
| sortformer-4spk-v1.q4_0 | 3.2e-1 | 1.3e-1 |
| sortformer-streaming-4spk-v2.f16 | 5.0e-2 | 5.1e-2 |
| sortformer-streaming-4spk-v2.q8_0 | 5.2e-2 | 5.4e-2 |
| sortformer-streaming-4spk-v2.q4_0 | 2.2e-1 | 2.0e-1 |
(v2's f16 baseline is already worse than v1's because the streaming-trained encoder's offline forward in our C++ graph diverges from NeMo's offline forward -- this is a structural property of the streaming-trained checkpoint when run offline, not a quantisation regression. v2 q8/q4 inflate within the same factor band as v1.)
User-facing diarization output is identical across all three tiers
of v2 on jfk.wav ([0.24-2.40] [3.36-4.56] [5.44-11.04],
all speaker_0). v1's three tiers also produce the same three
segments, with q4 boundaries shifted by at most ~80 ms (one encoder
frame) vs f16 -- well within the post-processing min_segment_ms
band.
Recommendation: prefer q8_0 for general use (1.9× smaller than f16 with negligible quality impact); use q4_0 when memory is tight (3.5× smaller than f16, marginally noisier individual speaker probabilities but identical thresholded segments on shipping fixtures).
This repo already provides ggml backends for tdt / ctc /
sortformer. Phase 12 closes the loop on eou so the four
families that ship under one Engine umbrella all run on a
single pure-ggml dependency.
Checkpoint. NVIDIA's official
nvidia/parakeet_realtime_eou_120m-v1 NeMo .nemo archive
(NVIDIA Open Model License). Sourcing the .nemo directly lets us
reuse the exact pattern the CTC / TDT / Sortformer ports already
followed: .nemo -> GGUF via convert-nemo-to-gguf.py, NeMo
PyTorch as the parity oracle.
Architecture summary (from model_config.yaml + state-dict probe):
| Stage | Spec |
|---|---|
| Mel | AudioToMelSpectrogramPreprocessor, 128 bins, n_fft=512, win=400, hop=160, normalize=NA, dither=1e-5, pad_to=0 |
| Encoder | FastConformer, 17 layers, d_model=512, n_heads=8, ff_expansion=4, conv_kernel=9, dw_striding subsample 8x, use_bias=False, xscaling=False, conv_norm_type=layer_norm (gamma/beta still stored under conv.batch_norm.{weight,bias}; no running stats), att_context_size=[70,1] + att_context_style=chunked_limited, causal_downsampling=true, conv_context_size=causal |
| Decoder | RNNT-Decoder, 1 LSTM layer x 640 hidden, embedding [1027, 640] (vocab + 1 for blank_as_pad=true) |
| Joint | RNNT-Joint, encoder_hidden=512 -> 640, pred_hidden=640 -> 640, ReLU, output dim 1027 = 1024 BPE + <EOU> (id 1024) + <EOB> (id 1025) + blank (id 1026) |
| Latency | NVIDIA card cites 80 ms (p50) / 280 ms (p90) / 320 ms (p95) end-of-turn detection on TTS-augmented DialogStudio |
So EOU is TDT minus durations + LayerNorm in the conv module +
two attention/conv shape switches; encoder graph is ~95 % shared
with the existing CTC/TDT encoder, with three shipping deltas
(LN-vs-fused-BN in conv module, chunked-limited attention mask
applied as a static offline mask via ggml_soft_max_ext, asymmetric
(L=k-1, R=s-1) causal padding in the dw_striding subsampler).
NeMo's own streaming forward additionally maintains per-chunk KV
cache state and depthwise-conv left state for cache_aware_stream_step;
this project deliberately does not ship that path -- see §8.5
case (A) for why driving streaming-trained Parakeet checkpoints
through chunked-limited streaming inference is a quality regression
on the targets this repo cares about. The decoder + joint mirror
TDT minus the duration head.
API target. The reference EOU surface emits TranscriptionSegment
records with only text populated (no per-segment timestamps, no
utterance-boundary event), so the C++ Engine deliberately preserves
that shape rather than synthesising fields from intermediate state.
C++ pipeline this maps to (matching the upstream NeMo processEOU
reference):
- mel(128) over the full input audio (offline; the reference implementation accumulates an append-queue until end-of-job, then mels the whole buffer).
- Walk mel in fixed 25-frame slices (encoder_chunk_mel_frames=25); skip the trailing slice if it has < 10 frames and is not the first.
- Per slice: cache-aware encoder forward with running cache_last_channel (17, 1, 70, 512), cache_last_time (17, 1, 512, 8), cache_last_channel_len (1).
- Per encoder frame: RNN-T greedy with up to 5 symbols/step; <blank> ends the per-frame loop, <EOU> flushes the current segment with a \n separator + zeroes h/c, otherwise append the piece to the running segment.
- Concatenate segments with single space; trim; empty -> "no speech" sentinel; total word count -> stats.
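The slice walk in the pipeline above can be sketched as follows. This is a minimal illustration, not the project's code: `MelSlice` and `plan_mel_slices` are hypothetical names, and the only rules encoded are the 25-frame slice size and the "skip a trailing slice shorter than 10 frames unless it is the first" rule stated in the list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of the project's API).
struct MelSlice {
    std::size_t start;   // first mel frame of the slice
    std::size_t length;  // number of mel frames in the slice
};

// Walks n_frames in chunk-sized steps. A short trailing slice is dropped
// when it has fewer than min_tail frames, unless it is the very first slice.
std::vector<MelSlice> plan_mel_slices(std::size_t n_frames,
                                      std::size_t chunk = 25,
                                      std::size_t min_tail = 10) {
    assert(chunk > 0);
    std::vector<MelSlice> slices;
    for (std::size_t start = 0; start < n_frames; start += chunk) {
        const std::size_t remaining = n_frames - start;
        const std::size_t len = remaining < chunk ? remaining : chunk;
        const bool is_first = (start == 0);
        if (len < min_tail && !is_first) {
            break;  // trailing remainder is too short to be worth decoding
        }
        slices.push_back({start, len});
    }
    return slices;
}
```

For a 1101-frame mel (the jfk.wav length quoted in this log) the sketch plans 44 slices and drops the 1-frame remainder, which lines up with the 44 streaming chunks reported for that file.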
Cross-engine VAD/EndOfTurn events are not part of Phase 12; they
will be a Phase 13 cross-cutting concern wiring <EOU> and Sortformer
per-frame any-speaker probabilities into a shared StreamEvent
umbrella across parakeet.cpp + whisper.cpp. Phase 12 just needs
to land the EouStreamSession callback signature with
is_eou_boundary from day 1 so Phase 13 plugs in without churn.
Phase 12 outline (and current shipping status):
- 12.0 plan + scope (this section). (done)
- 12.1 converter + Python reference + GGUF roundtrip. (done; see §12.1 below)
- 12.2 EOU GGUF loader + Engine routing. (done)
- 12.3 cache-aware FastConformer encoder graph (LN-in-conv, chunked-limited attention mask, KV + conv state). (done; KV + conv state path was prototyped and rejected on quality grounds, see §8.5 case (A); LN-in-conv + chunked-limited mask shipped.)
- 12.4 RNN-T decoder (1-layer LSTM 640 + joint MLP) with <EOU> reset semantics. (done)
- 12.5 streaming push API (Modes 2 + 3) with callback shape ready for Phase 13 events. (done; the callback hangs off the existing StreamSession rather than a new EouStreamSession.)
- 12.6 CLI auto-routing + live-mic auto-detection. (done)
- 12.7 end-to-end parity harness (test-eou-streaming on jfk.wav, driven by dump-eou-reference.py). (done)
scripts/convert-nemo-to-gguf.py learned an EOU branch:
- Detection -- detect_model_type() distinguishes EOU from TDT by the absence of model_defaults.tdt_durations plus the presence of <EOU> in cfg.labels. Sortformer / CTC paths unchanged.
- Conv-norm switch -- when cfg.encoder.conv_norm_type == "layer_norm", the per-block emitter writes encoder.blk.{i}.conv.norm.{weight,bias} straight from the conv.batch_norm.{weight,bias} tensors (gamma/beta) and skips the BN running-stats fusion that the BatchNorm path requires. The metadata key parakeet.encoder.conv_norm_type advertises which path each GGUF expects.
- Streaming hyperparameters in metadata -- parakeet.encoder.{conv_norm_type, conv_context_size, causal_downsampling, att_context_style, att_context_size_left, att_context_size_right} so the C++ encoder can build the right chunked-limited attention mask + KV-cache shapes without re-parsing YAML.
- EOU metadata block under parakeet.eou.*: {vocab_size, blank_id, eou_id, eob_id, pred_hidden, pred_rnn_layers, joint_hidden, encoder_chunk_mel_frames, cache_lookback_frames, cache_time_steps, max_symbols_per_step}. cache_lookback_frames defaults from att_context_size_left (70) and cache_time_steps from conv_kernel - 1 (8).
- EOU tensors under eou.*: eou.predict.embed.weight, eou.predict.lstm.0.{w_ih,w_hh,b_ih,b_hh}, eou.joint.enc.*, eou.joint.pred.*, eou.joint.out.*. SentencePiece tokenizer bytes embedded same as CTC/TDT.
Output sizes on nvidia/parakeet_realtime_eou_120m-v1.nemo:
| Quant | File size | Notes |
|---|---|---|
| f16 | 246.0 MiB | 251 f32 + 233 f16 tensors |
| q8_0 | 131.7 MiB | same f32 set + 233 q8_0 tensors |
scripts/dump-eou-reference.py mirrors dump-tdt-reference.py plus a
streaming-mode pass:
- Offline: 128-bin mel, full-context encoder output (T_enc, 512), LSTM init state (1, 1, 640), prediction-net output for the blank/SOS token (640,), and the NeMo transcribe() greedy reference text.
- Streaming: model.encoder.cache_aware_stream_step(...) driven in 25-mel-frame chunks with explicit running caches; per-chunk encoder outputs are concatenated and saved alongside per-chunk frame counts. With att_context_size=[70,1] each 25-mel-frame chunk emits 2 encoder frames (the right-context-1 frame is held back), so on jfk.wav (11 s, 1101 mel frames) the streaming pass produces 88 encoder frames vs the offline pass's 139. That's intentional and is what the C++ streaming graph will need to reproduce in Phase 12.3.
NeMo offline transcript on test/samples/jfk.wav:
and so my fellow americans ask not what your country can do for you ask
what you can do for your country<EOU>
The trailing literal <EOU> is the joint network emitting the EOU
token at end-of-utterance and is exactly the signal the C++ decoder
will key on for \n segment-flush + LSTM state reset (per the
upstream NeMo eouDecodeChunk reference).
scripts/verify-gguf-roundtrip.py learned to dispatch on
parakeet.model.type: build_expected_eou() recreates the EOU tensor
map (LayerNorm in conv, no use_bias on inner blocks, EOU/joint
weights), build_expected_ctc() keeps the existing CTC path. The
verifier also gained a Q8_0 / Q5_0 / Q4_0 dequant comparison branch
with per-format rel gates, so the same script validates every quant
tier we ship.
Both tiers pass round-trip on the EOU GGUFs (worst rel 4.78e-4 at f16 -- under the 2^-10 gate -- and 4.0e-3 at q8_0 -- under the 2^-7 gate); CTC GGUF baseline still passes after the verifier was generalised to handle trailing-1-axis squeezing in older artefacts.
Touched parakeet_ctc.h / parakeet_ctc.cpp / parakeet_engine.cpp.
Additions:
- enum class ParakeetModelType gains an EOU variant; the loader routes on parakeet.model.type == "eou" and populates an EouWeights struct alongside the existing CTC / TDT / Sortformer weight blobs.
- EncoderConfig gains a ConvNormType conv_norm_type enum + causal_downsampling, conv_causal, att_chunked_limited, att_context_left, att_context_right fields, all read from the GGUF metadata block written by §12.1's converter changes. CTC / TDT / Sortformer GGUFs leave these at their offline defaults so the existing engines are bit-for-bit unchanged.
- BlockWeights gains optional conv_norm_w / conv_norm_b (used when conv_norm_type == LayerNorm) alongside the existing fused-BN conv_bn_scale / conv_bn_shift (used when BatchNorm). The loader's per-block tensor pull picks one or the other based on the metadata.
- EouWeights mirrors TdtWeights minus the duration head: predict.embed, one-layer LSTM (w_ih, w_hh, b_ih, b_hh), and joint.{enc,pred,out}.{weight,bias}.
- Engine::Impl gains an EouRuntimeWeights eou_rt slot and runs eou_prepare_runtime() at construction when the GGUF is EOU (dequantises the predict + joint to f32 once, same shape the TDT runtime uses).
- Engine::transcribe_samples() dispatches to a new EOU branch that calls eou_greedy_decode().
- Engine::is_transcription_model() returns true for EOU; Engine::model_type() returns "eou".
CLI side (src/main.cpp): the manual decode dispatch in the closure
gained an EOU branch (the bug that surfaced as a segfault during
bring-up was that the manual closure had a TDT branch but no EOU
branch, so EOU GGUFs fell through to ctc_greedy_decode on a NULL
logits buffer). transcribe_wav() now lists EOU alongside TDT /
Sortformer in its "use Engine instead" rejection message.
Four structural changes to subsampling_graph() /
conformer_conv_graph() / rel_pos_mha_graph() /
build_encoder_graph_cached(), each gated on EncoderConfig
metadata so CTC / TDT / Sortformer GGUFs take the original code
path:
- LayerNorm in conv module. When conv_norm_type == LayerNorm, the conv graph permutes from (T, d_model) to (d_model, T), runs layer_norm_affine(x, conv.norm.weight, conv.norm.bias, eps), applies SiLU, and falls into the existing pw2 / matmul path. Saves one permute vs the BN path. Existing CTC / TDT / offline Sortformer keep the fused bn_scale * x + bn_shift -> silu -> permute -> pw2 flow.
- Causal subsampler (causal_downsampling=true). NeMo's CausalConv2D pre-pads each stride-2 dw_striding conv with (L=k-1=2, R=s-1=1) zeros on both the freq and time axes, then convolves with padding=0. New zero_pad_dim1 helper (analogous to the existing zero_pad_dim0) implements the freq-axis half. subsampling_graph gains a causal_downsampling flag; when set it pre-pads and switches the conv padding to zero. Output sizing changes from (N+2-k)/s+1 = (N-1)/2+1 (symmetric) to (N+(L+R)-k)/s+1 = N/2+1 (causal), where N is the input length and the division is integer division, so freq goes 128 -> 65 -> 33 -> 17 instead of 128 -> 64 -> 32 -> 16, matching the trained encoder.subsampling.out.weight shape [512, 4352=17*256]. run_encoder()'s mask-sizing math was also gated on the same flag (a bug surfaced where the cached graph used the new sizes but the per-call mask uploads still used the symmetric formula, producing 138-frame outputs instead of 139).
- Causal depthwise conv module (conv_context_size: causal). The conv module's k=9 depthwise stride=1 conv now uses (L=8, R=0) zero-pad instead of the symmetric (L=4, R=4).
- Chunked-limited attention mask (att_context_style: chunked_limited). EncoderGraph gains an att_mask graph input + att_mask_host buffer. Built host-side once per graph (cached across calls): for query frame i in chunk c = i / (right + 1), the mask is 0.0f on the visible [c*chunk_size - left, (c+1)*chunk_size - 1] range (clamped to [0, T-1]) and -INFINITY everywhere else. Wired into rel_pos_mha_graph via ggml_soft_max_ext(scores, mask, scale, 0.0f); callers that pass att_mask=nullptr fall back to the prior ggml_scale + ggml_soft_max path, so CTC / TDT behaviour is unchanged. The formula matches NeMo's _create_masks exactly: chunk_idx[i] - chunk_idx[j] in [0, left // chunk_size] -- for att_context_size=[70, 1], chunk_size = right + 1 = 2, left_chunks_num = 70 // 2 = 35 and queries see their own chunk plus 35 chunks of past context, exactly 72 keys per query (in the steady state).
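The subsampler sizing arithmetic above is easy to get off by one, so here is a small sketch that reproduces the quoted numbers. `conv_out_len` and `subsample3` are hypothetical helper names; the sketch assumes three stride-2, k=3 stages (as implied by the 128 -> 65 -> 33 -> 17 chain) and integer division.

```cpp
#include <cassert>

// Output length of a strided conv with explicit left/right zero padding.
// Integer division floors for the non-negative values used here.
int conv_out_len(int len, int kernel, int stride, int pad_left, int pad_right) {
    assert(stride > 0 && len + pad_left + pad_right >= kernel);
    return (len + pad_left + pad_right - kernel) / stride + 1;
}

// Applies three stride-2, k=3 subsampling stages in sequence.
// causal = true uses (L = k-1, R = s-1); otherwise symmetric padding of 1.
int subsample3(int len, bool causal) {
    const int k = 3;
    const int s = 2;
    const int pad_left = causal ? k - 1 : 1;
    const int pad_right = causal ? s - 1 : 1;
    for (int stage = 0; stage < 3; ++stage) {
        len = conv_out_len(len, k, s, pad_left, pad_right);
    }
    return len;
}
```

With these assumptions, 128 mel bins map to 17 (causal) or 16 (symmetric), and 1101 mel frames map to 139 (causal) or 138 (symmetric), which is the same 139-vs-138 discrepancy described for the mask-sizing bug.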
The conv graph branches and the mask wiring also flow through
profile_block_substages (CTC profiling helper). Tested across
CTC / TDT / Sortformer with no regression.
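As an illustration of the chunked-limited mask rule described above, the following sketch builds the additive mask on the host using the chunk-index form (`chunk_idx[i] - chunk_idx[j] in [0, left // chunk_size]`). The function names and the row-major (query, key) layout are assumptions made for the example; the layout the real graph uploads to ggml may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Builds a row-major T x T additive mask: 0.0f where the key is visible to
// the query, -INFINITY elsewhere. Row index = query frame, column = key frame.
std::vector<float> build_chunked_limited_mask(int T, int left, int right) {
    assert(T >= 0 && left >= 0 && right >= 0);
    const int chunk_size = right + 1;
    const int left_chunks = left / chunk_size;
    std::vector<float> mask(static_cast<std::size_t>(T) * T, -INFINITY);
    for (int i = 0; i < T; ++i) {
        const int query_chunk = i / chunk_size;
        for (int j = 0; j < T; ++j) {
            const int diff = query_chunk - j / chunk_size;
            if (diff >= 0 && diff <= left_chunks) {
                mask[static_cast<std::size_t>(i) * T + j] = 0.0f;
            }
        }
    }
    return mask;
}

// Counts the keys a given query frame is allowed to attend to.
int count_visible_keys(const std::vector<float>& mask, int T, int query) {
    int visible = 0;
    for (int j = 0; j < T; ++j) {
        if (mask[static_cast<std::size_t>(query) * T + j] == 0.0f) {
            ++visible;
        }
    }
    return visible;
}
```

For `left = 70, right = 1` a steady-state query sees 72 keys (its own 2-frame chunk plus 35 past chunks), while the first frames of an utterance see fewer because the window is clamped at frame 0.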
Mel preprocessor (mel_preprocess.{h,cpp}) also got a
MelNormalize enum + MelConfig::normalize field. EOU's NeMo
config sets normalize: NA (no per-feature CMVN); the loader reads
the converter-emitted parakeet.preproc.normalize string and gates
the existing apply_per_feature_cmvn() call on it. CTC / TDT /
Sortformer all leave normalize=per_feature so their CMVN keeps
running. This was the dominant accuracy gap during bring-up:
CTC/TDT-style CMVN on EOU's preprocessor mean-centres each mel bin
across the whole utterance, but the EOU encoder was trained against
raw log-mel values that floor at the log-zero guard during
silence frames. With CMVN disabled, the subsampler cosine jumped from
0.108 (broken) to 0.999688 (matched) and the encoder cosine
landed at 0.999997, the f16 quantisation floor.
Per-stage parity on test/samples/jfk.wav (NeMo PyTorch reference
via dump-eou-reference.py → C++ via PARAKEET_DUMP_* env vars):
| Stage | max_abs | rel_max | cosine |
|---|---|---|---|
| log-mel | 1.36e+1 (tail-frame artifacts) | 8.17e-1 | 0.999644 |
| post-subsampler | 3.30e+2 | 1.00e-1 | 0.999688 |
| encoder out | 7.64e-2 | 7.70e-3 | 0.999997 |
Transcript on jfk.wav (both parakeet-eou-120m-v1.gguf f16 and
parakeet-eou-120m-v1.q8_0.gguf):
and so my fellow americans ask not what your country can do for you ask
what you can do for your country
Bit-equal to NeMo's offline reference (modulo the trailing literal
<EOU> token which the C++ decoder strips after using it for the
segment-flush + LSTM-state-reset side effect). The 20 s
sample-16k.wav Alice-in-Wonderland clip transcribes with zero
errors on q8_0:
alice was beginning to get very tired of sitting by her sister on the
bank and of having nothing to do once or twice she had peeped into the
book her sister was reading but it had no pictures or conversations in
it and what is the use of a book thought alice without pictures or
conversations
New parakeet_eou.{h,cpp} (~360 lines) modelled on
parakeet_tdt.{h,cpp}. EouRuntimeWeights dequantises the predict
(1-layer LSTM, 640 hidden) + joint (enc 512->640, pred 640->640,
out 640->1027) to f32 once at Engine construction. EouDecodeState
holds h_state, c_state, pred_out, last_token,
symbols_this_step, has_emitted_token_since_last_eou -- everything
needed to carry decoder state across chunked calls, including the
empty-segment guard used by eou_decode_window to suppress phantom
<EOU> boundaries.
eou_decode_window() runs greedy RNN-T over a span of encoder
frames with up to max_symbols_per_step=5 symbols per encoder step
(matches the upstream NeMo EOU_MAX_SYMBOLS_PER_STEP constant). Per
emitted token:
- <blank> (id 1026) -> break out of inner loop, advance encoder.
- <EOB> (id 1025) -> training-time block boundary marker; treated as a no-op skip, same policy as the NeMo reference.
- <EOU> (id 1024) -> flush the in-progress segment to out_segments, zero h/c state, set last_token = blank, re-prime the predictor with the blank embedding, break out of inner loop. The state reset is the NeMo eouDecodeChunk reset semantics carried through verbatim.
- Any other special token (vocabulary entry of the form <...>) -> defensive break (matches the NeMo reference's isSpecialToken skip).
- Otherwise: append to out_tokens, feed back into the LSTM, update pred_out for the next joint call.
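The per-token policy above amounts to a small dispatch table. The sketch below shows only the classification step, in the same priority order as the list; `EouAction`, `EouIds`, `looks_special` and `classify_token` are hypothetical names, and the LSTM / joint calls that surround the dispatch in the real decoder are left out.

```cpp
#include <cassert>
#include <string>

// What the decode loop should do with one emitted token (illustrative only).
enum class EouAction { AdvanceFrame, SkipNoOp, FlushSegment, DefensiveBreak, AppendToken };

// Token ids as documented for the EOU 120m-v1 vocabulary.
struct EouIds {
    int eou = 1024;
    int eob = 1025;
    int blank = 1026;
};

// A vocabulary piece of the form <...> is treated as a special token.
bool looks_special(const std::string& piece) {
    return piece.size() >= 2 && piece.front() == '<' && piece.back() == '>';
}

// Mirrors the order of the list above: blank, EOB, EOU, other specials, text.
EouAction classify_token(int id, const std::string& piece, const EouIds& ids = EouIds{}) {
    if (id == ids.blank) return EouAction::AdvanceFrame;
    if (id == ids.eob)   return EouAction::SkipNoOp;
    if (id == ids.eou)   return EouAction::FlushSegment;
    if (looks_special(piece)) return EouAction::DefensiveBreak;
    return EouAction::AppendToken;
}
```

Checking ids before the `<...>` pattern matters: `<EOU>` and `<EOB>` also match the pattern, so the id comparison has to win for the flush and no-op cases to be reachable.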
eou_greedy_decode() is the one-shot wrapper used by
Engine::transcribe(): detokenises segment-by-segment using the
boundaries eou_decode_window recorded, joins with \n, returns
the result as EouDecodeResult.text. eou_count is exposed for
later wiring into the planned cross-engine OnEndOfTurn event.
Phase 12.5 — streaming push API (Modes 2 + 3) (done; rolling-encoder Mode 3 is the chosen design -- chunked-limited streaming inference rejected, see §8.5)
Public API additions in include/parakeet/ctc/engine.h:
struct StreamingSegment {
// ... existing fields ...
bool is_eou_boundary = false; // EOU only: <EOU> token fired in this chunk
float eot_confidence = 0.0f; // reserved for Phase 13's OnEndOfTurn event
};

Existing StreamSession (StreamSession::Impl) gained an
EouDecodeState eou_state slot and an EOU branch in
process_window(). On Engine::stream_start() for an EOU GGUF the
session initialises the EOU state via eou_init_state(eou_rt, eou_state) (priming the LSTM with the blank embedding, matching
NeMo's decoder.initialize_state). The existing
Engine::transcribe_samples_stream() (Mode 2) gained the same EOU
branch. Both paths set seg.is_eou_boundary = (win_segments.size() > 0) per emitted chunk, so the <EOU> token's emission shows up on
the cadence the consumer is already iterating.
CLI wiring (src/main.cpp):
- --stream (Mode 2) on EOU GGUFs runs the offline encoder once then walks chunks emitting StreamingSegments, exactly like CTC / TDT.
- --stream --stream-duplex (Mode 3) on EOU GGUFs goes through Engine::stream_start() -> StreamSession. Each chunk's encoder runs over [left + chunk + right_lookahead] audio with the chunked-limited mask applied; the EOU decoder state carries across chunks. Mode 3 produces transcript output that's byte-equal to Mode 1 on jfk.wav (104 B vs 104 B).
- --emit jsonl includes "is_eou_boundary" per segment line.
live-mic already routes anything that isn't a Sortformer GGUF
through StreamSession, so live-mic --model models/parakeet-eou-120m-v1.q8_0.gguf works out of the box -- no
new auto-detection logic was required.
test-eou-streaming (new, test/test_eou_streaming.cpp) asserts:
- Mode 2 concatenated text byte-equal to the offline Engine::transcribe() reference;
- Mode 2 is_eou_boundary fires on at least one segment (the trailing <EOU> on jfk.wav);
- Mode 3 transcript size matches the reference within a 20 % tail jitter band. Chasing byte-equality on Mode 3 via cache-aware streaming inference was explored and rejected -- see §8.5 case (A) for the full rationale -- so the rolling-encoder tail-jitter band is the assertion the test will keep.
Passes on both parakeet-eou-120m-v1.gguf (f16) and
parakeet-eou-120m-v1.q8_0.gguf. Existing test-streaming (CTC /
TDT byte-equality + WER tolerance) and test-sortformer-streaming
both still pass after the StreamSession::Impl plumbing changes.
Mode 3 today re-runs the offline encoder per chunk over a
sliding [left + chunk + right_lookahead] window without persistent
KV / conv-state cache across chunks. The transcript matches Mode 2
byte-equally on jfk.wav, but <EOU> boundary detection is
approximate: the trailing chunk doesn't carry the long-context
encoder state the EOU head needs to confidently fire <EOU> on
end-of-utterance.
This is the chosen design, not a deferred workaround. The
obvious alternative -- driving the streaming-trained EOU 120m-v1
weights through NeMo's cache_aware_stream_step to recover
"per-chunk O(chunk) compute and persistent encoder state" --
was prototyped during the Phase 12.x exploration and rejected; see
§8.5 case (A) for the full rationale. Short version: same model
family Phase 8.0 already evaluated, same ~2× early-utterance WER
cliff, same <EOU> token disappearing entirely in the cache-aware
output (NeMo's own cache_aware_stream_step over jfk.wav
produces 0 <EOU> tokens; we reproduced that bit-for-bit). Per-chunk
encoder cost on Mode 3 today is O(left + chunk + right_lookahead); trading that off for the chunked-limited
streaming-inference path is a quality regression and is not on the
roadmap.
scripts/download-all-models.sh swapped the previously-cached
forward-looking stt_en_fastconformer_hybrid_large_streaming_multi.nemo
for the actual parakeet_realtime_eou_120m-v1.nemo. The verifier
(scripts/verify-gguf-roundtrip.py) dispatches on parakeet.model.type
and ships a build_expected_eou() map that asserts every GGUF tensor
matches the source NeMo state-dict at f32 bit-exactness for the f32
slots and within a per-tier rel gate for the quant slots
(2^-10 for f16, 2^-7 for q8_0, 2^-4 for q5_0, 2^-3 for q4_0).
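The verifier itself is a Python script; the C++ sketch below only restates the per-tier gate arithmetic so the thresholds can be checked against the reported worst-case numbers. `QuantTier`, `rel_gate` and `passes_gate` are hypothetical names, and how the relative error is computed is left to the caller.

```cpp
#include <cassert>
#include <cmath>

enum class QuantTier { F16, Q8_0, Q5_0, Q4_0 };

// Per-tier relative-error gate: 2^-10, 2^-7, 2^-4, 2^-3.
double rel_gate(QuantTier tier) {
    switch (tier) {
        case QuantTier::F16:  return std::ldexp(1.0, -10);
        case QuantTier::Q8_0: return std::ldexp(1.0, -7);
        case QuantTier::Q5_0: return std::ldexp(1.0, -4);
        case QuantTier::Q4_0: return std::ldexp(1.0, -3);
    }
    return 0.0;  // not reached for valid tiers; a zero gate rejects everything
}

// True when the worst observed relative error is strictly under the gate.
bool passes_gate(double worst_rel, QuantTier tier) {
    return worst_rel < rel_gate(tier);
}
```

In decimal the gates are roughly 9.8e-4, 7.8e-3, 6.3e-2 and 1.25e-1, so the reported 4.78e-4 (f16) and 4.0e-3 (q8_0) worst-case errors each clear their tier with about a factor of two to spare.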
- Cache-aware streaming encoder graph for EOU 120m-v1 (and any other chunked-limited-trained Parakeet checkpoint). Prototyped on a working branch during the Phase 12.x exploration: the encoder output matched NeMo's cache_aware_stream_step (worst rel 1.85e-3 over 44 chunks of jfk.wav), and the end-to-end transcript was identical to NeMo's streaming output -- which is structurally distinct from, and meaningfully worse than, NeMo's offline output (~2× early-utterance WER, no <EOU> token emitted). Reverted before landing. Same quality cliff Phase 8.0 documented two years earlier on streaming_multi; the EOU 120m-v1 family is the same cache-aware streaming Conformer family with a slightly newer 120 M variant. Will not be implemented. See §8.5 case (A) for the full rationale and numbers.
(All Phase 12.x follow-ups have either shipped or been formally rejected; see Phase 13 below for the cross-engine event API.)
Voice-agent UX (turn detection, barge-in, hold-the-mic-open) needs
two signals we already had hooks for but no API on top of: VAD
state transitions and end-of-turn boundaries. Phase 13 lands a
small public StreamEvent surface that streaming sessions can
emit alongside the existing per-segment callbacks. The shape is
explicitly designed to be the same as what whisper.cpp's
streaming API will eventually emit, so consumers can write
engine-agnostic event handling once.
enum class VadState { Unknown, Speaking, Silent };
enum class StreamEventType { VadStateChanged, EndOfTurn };
struct StreamEvent {
StreamEventType type;
double timestamp_s;
int chunk_index;
// VadStateChanged
VadState vad_state;
int speaker_id; // argmax on entering Speaking; -1 otherwise
float vad_score; // 0..1; provenance-specific
// EndOfTurn
float eot_confidence;
};
using StreamEventCallback = std::function<void(const StreamEvent&)>;

StreamingOptions::on_event and SortformerStreamingOptions::on_event
default to nullptr (back-compat: existing consumers unaffected).
StreamingOptions also gains enable_energy_vad (default off) plus
energy_vad_threshold_db = -35.0f, energy_vad_window_ms = 30,
energy_vad_hangover_ms = 200 knobs for the CTC/TDT fallback.
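A consumer-side handler for this surface might look like the sketch below. The event types are copied from the listing above so the example compiles on its own; `TurnStats` and `make_counting_handler` are hypothetical names, and how the callback is attached to a session (via the on_event slot) is not shown.

```cpp
#include <cassert>
#include <functional>

// Types copied from the public listing above.
enum class VadState { Unknown, Speaking, Silent };
enum class StreamEventType { VadStateChanged, EndOfTurn };
struct StreamEvent {
    StreamEventType type;
    double timestamp_s;
    int chunk_index;
    VadState vad_state;    // VadStateChanged only
    int speaker_id;        // VadStateChanged only
    float vad_score;       // VadStateChanged only
    float eot_confidence;  // EndOfTurn only
};
using StreamEventCallback = std::function<void(const StreamEvent&)>;

// Hypothetical consumer-side tally of the two signals a voice agent needs.
struct TurnStats {
    int speaking_starts = 0;
    int end_of_turns = 0;
};

// Returns a callback that updates stats; stats must outlive the callback.
StreamEventCallback make_counting_handler(TurnStats& stats) {
    return [&stats](const StreamEvent& ev) {
        switch (ev.type) {
            case StreamEventType::VadStateChanged:
                if (ev.vad_state == VadState::Speaking) ++stats.speaking_starts;
                break;
            case StreamEventType::EndOfTurn:
                ++stats.end_of_turns;
                break;
        }
    };
}
```

Because every engine reports through the same struct, the handler does not need to know whether the event came from an EOU, Sortformer, or energy-VAD source; it only reads the fields that belong to the event type it switched on.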
| Engine | Event | Trigger |
|---|---|---|
| EOU | EndOfTurn | <EOU> token decoded in this chunk; eot_confidence = 1.0. Mode 2 + Mode 3. |
| Sortformer | VadStateChanged | Per-chunk max(speaker_probs) > threshold (the same threshold the diarization head uses), with hysteresis (state retained across chunks). speaker_id = argmax mean(speaker_probs) on entering Speaking. |
| CTC / TDT | VadStateChanged | Energy-VAD on raw PCM (sliding RMS window, dBFS threshold + hangover). Only fires when consumer opts in via enable_energy_vad. |
EOU's EndOfTurn is fired from both Engine::transcribe_stream
(Mode 2) and StreamSession::process_window (Mode 3) so the event
shape is identical regardless of which streaming entry point the
consumer drives.
- include/parakeet/ctc/engine.h -- new public types + on_event slots on both options structs + the enable_energy_vad knobs. Adding fields with defaults to a struct is forward-compatible for current consumers.
- src/energy_vad.{h,cpp} -- internal helper. Sliding RMS over a configurable ms window of mono f32 PCM, with hysteresis: enter Speaking immediately on threshold-crossing; fall back to Silent only after hangover_ms of below-threshold audio. Default -35 dBFS / 30 ms / 200 ms is tuned for clean 16 kHz mono speech. Not exposed in the public headers (would force consumers to pin to this implementation; shape may evolve).
- src/parakeet_engine.cpp:
  - StreamSession::Impl gains a unique_ptr<EnergyVad> member that is constructed only when opts.enable_energy_vad and the underlying engine has no native VAD source (constructed for CTC/TDT, skipped for EOU). The VAD is driven from a small stream_drive_energy_vad() helper invoked from both feed_pcm_f32 and feed_pcm_i16.
  - SortformerStreamSession::Impl gains a vad_state field (initial Unknown, transitions on each chunk's emit-range speaker probabilities). Fires VadStateChanged on transitions only -- no per-chunk repeat events.
  - Mode-2 and Mode-3 EOU paths each fire EndOfTurn events on chunks where eou_boundaries_in_chunk > 0.
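The energy-VAD behaviour described above (RMS over a short window, dBFS threshold, immediate entry into Speaking, delayed fall back to Silent) can be sketched as below. This is not the project's src/energy_vad implementation: `EnergyVadSketch` is a hypothetical class, it uses non-overlapping windows rather than a true sliding window, and it assumes an Unknown state also becomes Silent once the hangover has elapsed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class VadState { Unknown, Speaking, Silent };

class EnergyVadSketch {
public:
    explicit EnergyVadSketch(int sample_rate = 16000, float threshold_db = -35.0f,
                             int window_ms = 30, int hangover_ms = 200)
        : threshold_db_(threshold_db),
          window_ms_(window_ms),
          hangover_ms_(hangover_ms),
          window_samples_(static_cast<std::size_t>(sample_rate) * window_ms / 1000) {
        assert(window_samples_ > 0);
    }

    // Feeds mono f32 PCM; returns the state after all complete windows.
    VadState feed(const std::vector<float>& pcm) {
        for (float s : pcm) {
            sum_sq_ += static_cast<double>(s) * s;
            if (++filled_ == window_samples_) step_window();
        }
        return state_;
    }

    VadState state() const { return state_; }

private:
    void step_window() {
        const double rms = std::sqrt(sum_sq_ / static_cast<double>(window_samples_));
        const double db = 20.0 * std::log10(rms + 1e-12);  // dBFS with a log(0) guard
        sum_sq_ = 0.0;
        filled_ = 0;
        if (db >= threshold_db_) {
            state_ = VadState::Speaking;  // enter Speaking immediately
            silent_ms_ = 0;
        } else {
            silent_ms_ += window_ms_;
            if (silent_ms_ >= hangover_ms_) state_ = VadState::Silent;
        }
    }

    float threshold_db_;
    int window_ms_;
    int hangover_ms_;
    std::size_t window_samples_;
    double sum_sq_ = 0.0;
    std::size_t filled_ = 0;
    int silent_ms_ = 0;
    VadState state_ = VadState::Unknown;
};
```

With the default knobs, one 30 ms window at amplitude 0.5 (about -6 dBFS) switches the state to Speaking, six silent windows (180 ms) leave it there, and the seventh (210 ms) crosses the 200 ms hangover and drops it to Silent.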
- test-streaming (CTC + TDT) gained an opt-in energy-VAD invocation that asserts at least one Speaking transition fires on jfk.wav. Default-off path (sweep above) keeps emitting zero events, confirming back-compat.
- test-eou-streaming Mode-2 path now asserts that is_eou_boundary and EndOfTurn event count are consistent (boundary fires => at least one event fires). On jfk.wav at chunk size 1500 ms: 1 EndOfTurn event, matching the single trailing <EOU> boundary.
- test-sortformer-streaming asserts at least one VadStateChanged event on a wav with audible speech and at least one Speaking transition. Default fixture (diarization-sample-16k.wav) skips when missing; on jfk.wav (single speaker, 11 s) the test fires one Speaking transition on chunk 0 with speaker_id = 0, which is the expected shape.
Numbers on jfk.wav (sanity check):
| Test | Events fired |
|---|---|
| test-streaming + energy-VAD (CTC) | 9 VadStateChanged (6 Speaking transitions) |
| test-streaming + energy-VAD (TDT) | 9 VadStateChanged (6 Speaking transitions) |
| test-eou-streaming Mode 2 | 1 EndOfTurn at chunk 7 (the trailing <EOU>) |
| test-sortformer-streaming v1.f16 | 1 VadStateChanged @ 0.00 s -> Speaking, speaker 0 |
| test-sortformer-streaming v1.q8 | identical to f16 |
| test-sortformer-streaming v2.q4 | 1 VadStateChanged @ 0.00 s -> Speaking, speaker 0 |
- Single struct + enum, not separate event types. Keeps the callback signature trivial (void(const StreamEvent&)) which maps cleanly through any C/C++/FFI ABI without per-type wrappers. Costs a few unused fields per event; cheap.
- Engines fire what they natively know. EOU has the <EOU> token and fires only EndOfTurn; Sortformer has speaker probs and fires only VadStateChanged; CTC/TDT have neither so they fire VadStateChanged from energy-VAD when explicitly enabled. No engine pretends to fire events it doesn't have a real signal for, and no Silero / external VAD dependency is added.
- Default off. Both on_event = nullptr and enable_energy_vad = false are the defaults. No behavioural change for existing consumers; opt-in only.
Phase 10 brought up TDT (Token-and-Duration Transducer) end-to-end on CPU with the encoder also offloadable to Metal, but the decoder itself bypassed ggml entirely: at load time the LSTM prediction net + joint MLP were dequantised to host std::vector<float> and the greedy emission loop ran scalar gemv_f32 per emission step. Even with a Metal-accelerated encoder, the decoder owned ~48 % of total inference time on the M4 Air (76 ms of 159 ms on a 20 s clip). Phase 14 ports the decoder to ggml graphs on backend_active so it runs end-to-end on Metal alongside the encoder.
Two fixed-shape per-step graphs plus one window-shape graph, all
allocated against model.backend_active() (Metal / CUDA / Vulkan
when compiled and --n-gpu-layers > 0, else CPU):
- g_lstm_step -- embedding lookup (ggml_get_rows against the native quantised predict_embed tensor) + L-layer LSTM unroll expressed as mul_mat + add + sigmoid / tanh + element-wise products. Inputs token_in[1, i32], h_in[H, L], c_in[H, L]; outputs h_out[H, L], c_out[H, L], pred_out[H] (alias for last-layer h_new). Built once, reused via ggml_gallocr_alloc_graph per emission step.
- g_joint_step -- pred_proj = joint_pred @ pred + b, hidden = relu(pred_proj + enc_proj_row), logits = joint_out @ hidden + b. Inputs pred_out[H_pred] and enc_proj_row[H_joint]; output logits[V_out].
- g_enc_proj -- full-window enc_proj = joint_enc @ enc + b matmul (size [T_enc, D_enc] -> [T_enc, H_joint]). One per distinct T_enc seen (LRU-cached in enc_proj_cache). Hoisting this matmul out of the per-step joint graph cuts ~250 small gemv(640, 1024) calls per window down to one large gemm: cheap on Metal where matmul kernels are compute-bound, expensive on CPU where it loses cache locality (see §14.2 fallback).
- All three graphs use ggml_set_input / ggml_set_output and upload host inputs each step via ggml_backend_tensor_set, pulling outputs back via ggml_backend_tensor_get. The argmax over token + duration logits stays on host (~32 KB tensor_get per step is cheap on unified memory; see §14.5 Phase 4 gate decision).
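The joint-step math quoted for g_joint_step can be written out on the host as a scalar reference, in the spirit of the CPU fallback. This is a sketch rather than the project's host_joint_step: `gemv_bias`, `JointWeights` and `joint_step` are hypothetical names, and the weights are assumed to be row-major f32 with the encoder projection already applied to `enc_proj_row`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y = W x + b, with W stored row-major as [rows, cols].
std::vector<float> gemv_bias(const std::vector<float>& W, const std::vector<float>& b,
                             const std::vector<float>& x) {
    const std::size_t rows = b.size();
    const std::size_t cols = x.size();
    assert(W.size() == rows * cols);
    std::vector<float> y(b);  // start from the bias
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            y[r] += W[r * cols + c] * x[c];
        }
    }
    return y;
}

struct JointWeights {
    std::vector<float> pred_w;  // [H_joint, H_pred]
    std::vector<float> pred_b;  // [H_joint]
    std::vector<float> out_w;   // [V_out, H_joint]
    std::vector<float> out_b;   // [V_out]
};

// logits = joint_out @ relu(joint_pred @ pred + b + enc_proj_row) + b
std::vector<float> joint_step(const JointWeights& w, const std::vector<float>& pred_out,
                              const std::vector<float>& enc_proj_row) {
    std::vector<float> hidden = gemv_bias(w.pred_w, w.pred_b, pred_out);
    assert(hidden.size() == enc_proj_row.size());
    for (std::size_t i = 0; i < hidden.size(); ++i) {
        const float v = hidden[i] + enc_proj_row[i];
        hidden[i] = v > 0.0f ? v : 0.0f;  // ReLU
    }
    return gemv_bias(w.out_w, w.out_b, hidden);
}
```

Passing `enc_proj_row` in as an input is what lets the encoder projection be computed once per window on the graph path; a per-step variant would compute that row with one more `gemv_bias` call inside the function.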
TdtRuntimeWeights carries both the GPU-graph scaffolding
(ggml_context * gctx, ggml_cgraph * g_lstm / g_joint, gallocrs,
ggml_tensor * inputs/outputs, and an enc_proj_cache LRU) and a
parallel set of host f32 vectors (embed, host_lstm[L],
host_joint_*) for the CPU fallback. Move semantics + a destructor
free the gallocrs, contexts, and any cached enc_proj graphs on
runtime teardown; the backend pointer itself is owned by
ParakeetCtcModel::Impl.
The straightforward "all paths through ggml" design regressed CPU
decode by ~6x (76 ms -> 480 ms median) because per-step graph
dispatch on the synchronous CPU backend pays thread-pool wakeup
latency on every one of ~250 emission steps. The fix is a runtime
branch: tdt_prepare_runtime checks ggml_backend_is_cpu(backend)
and either builds the graphs (GPU) or dequantises weights to host
f32 (CPU). The decode loop then routes every per-step op
(tdt_init_state, host_lstm_step, host_joint_step) through the
proven scalar implementation when !use_graphs.
The CPU path also keeps the original per-step joint_enc gemv
inside host_joint_step rather than the full-window precompute used
on GPU: profiling showed the precompute regresses CPU by ~8 % for
20 s windows because it streams ~1 MB through L1 once per window
without reuse, while the per-step gemv keeps the encoder-frame
slice in cache through both joint_enc and the surrounding
joint_pred / joint_out calls.
test-tdt-decoder-parity (test/test_tdt_decoder_parity.cpp,
linked under PARAKEET_BUILD_TESTS) runs the same WAV through
tdt_greedy_decode twice — once with n_gpu_layers=0 (scalar CPU
fallback) and once with n_gpu_layers=1 (ggml graph path on the
compiled backend). Greedy TDT is fully deterministic, so the
invariant is exact integer equality of the token-ID stream (and
hence byte-equal transcript text). On sample-16k.wav (20.13 s,
M4 Air, Metal build):
[tdt-decode-parity] CPU: tokens=95 text=Alice was beginning to get very tired of sitting by her sister...
[tdt-decode-parity] GPU: tokens=95 text=Alice was beginning to get very tired of sitting by her sister...
[tdt-decode-parity] PASS: CPU vs graph token IDs match (95 tokens)
A <ref-dir> argument optionally also compares against the NeMo
reference token-ID stream from scripts/dump-tdt-reference.py
(extended in this phase to write token_ids.npy alongside the
existing transcript.txt).
sample-16k.wav (20.13 s of audio), --bench-warmup 5 --bench-runs 15, M4 Air, q8_0:
| backend | enc median | dec best | dec median | inf median | RTF best | RTF median | real-time multiple |
|---|---|---|---|---|---|---|---|
| CPU baseline | 911 | 72.8 | 76.6 | 1003 | 0.041 | 0.050 | 24x |
| CPU after | 1102 * | 72.3 | 76.95 | 1190 * | 0.043 | 0.059 | 23x |
| Metal baseline | 68.5 | 72.7 | 76.4 | 159.5 | 0.008 | 0.008 | 132x |
| Metal after | 68.9 | 58.5 | 59.32 | 143.2 | 0.007 | 0.007 | 142x |
* CPU "after" inf median includes encoder-side wall-time noise
(thermal throttling on a passive-cooled Air during the 15-run sweep
shows up in the encoder, not the decoder); the decoder numbers
are within 0.5 ms of baseline on CPU.
Net Metal effect on the 20 s clip:
- decoder: 76.4 ms -> 59.3 ms median (-22.4 %)
- inference total: 159.5 ms -> 143.2 ms median (-10.2 %)
- real-time multiple: 132x -> 142x
CPU stays neutral by design (the fallback path is the same scalar
implementation that shipped in Phase 10); the new graph path is
only exercised on Metal / CUDA / Vulkan builds where
backend_active is non-CPU.
The original plan considered keeping encoder_out resident on the
backend so the TDT decoder could run directly off the GPU tensor
instead of going through the existing host std::vector<float> in
EncoderOutputs::encoder_out. Empirical profiling on the M4 Air:
[probe] encoder_out tensor_get: 87 us (258048 floats)
[probe] enc_proj tensor_set: 17 us (258048 floats)
Total host roundtrip for the encoder→decoder boundary is ~104 us
per call, i.e. 0.07 % of total inference time on a 20 s clip.
The gate for this work was a >5 % RTF improvement; the data
disqualifies it (Apple Silicon's unified-memory tensor_get/set is
essentially memcpy at ~50 GB/s and cannot deliver the threshold).
Skipped, with the engine-side API kept simple — EncoderOutputs
stays host-side, matching CTC + EOU + Sortformer.
Pre-existing bug surfaced by this phase: main.cpp's
--bench-json writer hardcoded "backend": "ggml-cpu" regardless
of the active backend, which silently mis-tagged every Metal /
CUDA / Vulkan bench captured into artifacts/bench/. Fixed to
derive from GGML_USE_METAL / GGML_USE_CUDA / GGML_USE_VULKAN
plus the runtime n_gpu_layers flag, and an n_gpu_layers field
was added to the JSON so post-hoc sweeps can disambiguate
same-binary CPU vs GPU runs.
- CUDA / Vulkan validation. The graph code path is generic over backend_active; both backends should "just work" because every op used (get_rows, mul_mat, add, sigmoid, tanh, mul, concat, cont) is supported on CUDA and Vulkan in the pinned ggml. Not validated on hardware in Phase 14; needs a follow-up bench run.
- Mode 3 streaming bench. Phase 14 measured Mode 1 (one-shot tdt_greedy_decode over the full window) only. The streaming StreamSession::process_window calls into the same tdt_decode_window so the per-step Metal speed-up should carry over, but the per-chunk cost mix is different (smaller T_enc per call -> the g_enc_proj cache will see more distinct shapes; the LRU is currently unbounded). Tracked as a follow-up: cap the cache or switch to a bucketed shape.
- TDT 1.1B sweep. Numbers above are for parakeet-tdt-0.6b-v3 only; rerun on parakeet-tdt-1.1b to populate the "RTF (Metal)" column for that row in the README's Supported checkpoints table.
Phase 14 ported the TDT decoder to ggml graphs and shipped on Metal
with two compute_graph dispatches per non-blank emission step
(joint, then LSTM). Profiling on M3 Ultra showed the dominant cost
per step is the Metal command-buffer commit + wait latency, not
the readback or the kernel work itself:
[probe] phase 13 decoder = 57.6 ms / 247 dispatches = ~233 us/dispatch
~ commit ~150 us + GPU ~25 us + bookkeeping
Phase 15 collapses the per-non-blank dispatch pair into a single
fused graph. The LSTM update writes h / c / pred in place into a
persistent backend buffer via ggml_cpy; the joint mat-muls take
the pred_cpy node as their input so gallocr orders the LSTM
update strictly before the joint reads inside one Metal command
buffer.
TdtRuntimeWeights gains a dedicated persist_buffer allocated
via ggml_backend_alloc_ctx_tensors that holds:
- h_persist: f32[H_pred, L] (LSTM hidden, layer-major)
- c_persist: f32[H_pred, L] (LSTM cell)
- pred_persist: f32[H_pred] (last-layer h, fed into joint)
- enc_proj_persist: f32[H_joint, T_max] (T_max = 4096 frames)
All four stay resident on the backend across the entire decode
loop. Per-step host upload shrinks from ~5 KB (token + h + c +
enc_proj_row) to 4 B (just the frame index or token id);
enc_proj is no longer downloaded after the full-window
projection — it's ggml_cpy'd straight into the persistent slab
and the joint network reads rows via ggml_get_rows on a
host-supplied frame index.
The decoder now builds three graphs. A build_lstm_body helper is
shared between two of them (g_lstm and g_lstm_joint) so the LSTM
math stays numerically identical across init and the fused hot path:
- g_lstm: init-only. Used once per call (tdt_init_state) to seed pred_persist after a blank LSTM step.
- g_joint: used after blank emissions (pred unchanged). Reads pred_persist, slices enc_proj_persist via ggml_get_rows(frame_idx), writes logits to host.
- g_lstm_joint: used after non-blank emissions. Fused: LSTM body writes the new pred via ggml_cpy, then the joint mat-muls take that cpy node as their pred input. One commit instead of two.
The decoder loop tracks pending_lstm_token: blank emissions
clear it and the next iteration uses g_joint, non-blank
emissions defer the LSTM update so the next iteration fuses it
with the next frame's joint forward via g_lstm_joint.
Streaming windows flush any deferred update at end-of-window.
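The deferral rule above can be illustrated with a small host-side model. This is a minimal sketch under stated assumptions, not the decoder's code: the `Graph` enum, `kBlank`, `schedule_dispatches`, and the `LstmFlush` label for the end-of-window flush are hypothetical names, and the function only records which graph would be dispatched at each step for a given emission sequence.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical labels for the dispatched graphs; LstmFlush stands in for
// whatever graph the end-of-window flush actually uses.
enum class Graph { Joint, LstmJoint, LstmFlush };

constexpr int kBlank = -1;  // stand-in for the blank token id (assumption)

// `emissions` holds the token emitted at each decode step (kBlank for blank).
// Returns the graph dispatched per step, plus one flush if a non-blank
// update is still pending when the window ends.
std::vector<Graph> schedule_dispatches(const std::vector<int>& emissions) {
    std::vector<Graph> dispatched;
    std::optional<int> pending_lstm_token;  // deferred LSTM input, if any

    for (int token : emissions) {
        // A pending token means the previous step was non-blank, so its LSTM
        // update is fused with this step's joint forward.
        dispatched.push_back(pending_lstm_token ? Graph::LstmJoint : Graph::Joint);
        pending_lstm_token.reset();

        if (token != kBlank) {
            pending_lstm_token = token;  // defer instead of dispatching now
        }
    }
    if (pending_lstm_token) {
        dispatched.push_back(Graph::LstmFlush);  // end-of-window flush
    }
    return dispatched;
}
```

For the emission sequence {5, blank, 7, 9} this yields Joint, LstmJoint, Joint, LstmJoint, LstmFlush: one dispatch per step plus a single trailing flush, instead of an extra LSTM dispatch after every non-blank.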
3-warmup + 10-timed runs, averaged across 3 invocations:
| Stage | Phase 14 base | Phase 15 fused | Δ |
|---|---|---|---|
| mel ms | 14.4 | 14.6 | noise |
| encoder ms | 68.5 | 68.6 | noise |
| decode ms | 57.6 | 43.0 | −25% |
| inference | 141 | 126 | −10% |
| RTF | 0.007 | 0.006 | |
| realtime mult | 146× | 160× | +14× |
Parity gate: test-tdt-decoder-parity PASSes — CPU and graph
paths emit byte-identical 95-token streams. The fused graph is
numerically equivalent to the sequential path because:
- ggml_cpy(h_new, h_persist) writes h_persist's memory in place; subsequent readers of h_persist see the new value.
- The joint body uses the pred_cpy result tensor (not pred_persist directly), so its mat_muls dataflow-depend on the cpy and gallocr emits the LSTM update's barriers first.
- h_persist and c_persist live in persist_buffer, which is a separate backend buffer from gallocr's compute buffer, so gallocr cannot alias them with intermediate h_new / c_new and there are no read-before-write hazards.
Batched-joint over K consecutive frames (prototyped, reverted)
The arithmetic looked promising: 152 single-frame blank-path joints could collapse to ~96 K-frame batches (one per non-blank cycle, since avg blank-run length ≈ 152 / 95 = 1.6 frames). Tested K ∈ {4, 8} on the same sample; both came in ~0–1 ms slower than the fused Phase 15 path:
| Variant | decode ms (3-run mean) |
|---|---|
| Phase 15 | 43.0 |
| K = 4 | 43.8 |
| K = 8 | 43.2 |
Empirical conclusion: Apple Silicon Metal command-buffer commit latency is much lower than the ~150 us I assumed from back-of-envelope, probably ~30–50 us in practice. The 56-commit saving from K = 8 (predicted ~8 ms) gets eaten by the larger per-batch GPU work (each batch computes joint over K frames even though only ~1.6 are consumed before a non-blank). Reverted the prototype rather than ship neutral code; Phase 15's fused LSTM+joint is the local optimum on this hardware.
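The back-of-envelope prediction can be reproduced with two one-line helpers. This is a sketch of the arithmetic only; `commits_saved` and `predicted_saving_ms` are hypothetical names, and the model deliberately ignores the extra per-batch GPU work that the paragraph above identifies as the reason the saving did not materialise.

```cpp
#include <cassert>

// Commits saved by batching: single-frame joints minus K-frame batches.
constexpr int commits_saved(int single_frame_joints, int batches) {
    return single_frame_joints - batches;
}

// Predicted wall-time saving in milliseconds for a given per-commit latency
// in microseconds. Ignores any extra GPU work per batch.
constexpr double predicted_saving_ms(int saved, double commit_latency_us) {
    return saved * commit_latency_us / 1000.0;
}
```

With the figures from the text, `commits_saved(152, 96)` is 56. At the assumed 150 us per commit that predicts 8.4 ms, while at 30 to 50 us it predicts only about 1.7 to 2.8 ms, a margin small enough to be consumed by computing K frames per batch when only ~1.6 are used.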
- CUDA / Vulkan validation. Same plumbing as Phase 14: g_lstm_joint and the persistent-state buffer should "just work" on any backend that already supports ggml_cpy, ggml_get_rows, ggml_backend_alloc_ctx_tensors. Worth benchmarking: backends with higher dispatch overhead (CUDA) could see proportionally larger Phase 15 wins.
- TDT 1.1B sweep. Same caveat as Phase 14; the relative win should hold but absolute numbers shift.
Closes the lone bench-validation hole in PROGRESS §6.5 ("Test
ggml_flash_attn_ext on Metal — likely a meaningful win"). The
infra has been dormant since the Round 7 audit (§5.13), gated
behind #ifdef PARAKEET_EXPERIMENTAL_FLASH_ATTN in
rel_pos_mha_graph() and surfaced as the PARAKEET_FLASH_ATTN
CMake option (off by default everywhere). Round 7 only A/B'd it on
CPU, where it regressed encoder by +3.1 % because the cast-to-f16
of the relative-position bias bd_final mask before softmax
shifted the BD computation order; Metal was never tested.
What changed: CMakeLists.txt now derives a per-backend
default. When GGML_METAL=ON the option defaults to ON; CPU /
CUDA / Vulkan / OpenCL keep their existing OFF default until each
ships its own A/B (CUDA can be exercised with a local -DGGML_CUDA=ON build
and parakeet --bench when a discrete-GPU host is available). No source files in src/ changed — the
#ifdef PARAKEET_EXPERIMENTAL_FLASH_ATTN branch in
parakeet_ctc.cpp::rel_pos_mha_graph is what gets compiled in.
Kernel actually loaded: from ggml_metal_library_compile_pipeline
log on first invocation —
- kernel_flash_attn_ext_pad_mask=1_ncpsg=64
- kernel_flash_attn_ext_blk_nqptg=8_ncpsg=64
- kernel_flash_attn_ext_f32_dk128_dv128_mask=1_sinks=0_bias=0_scap=0_kvpad=1_bcm=1_ns10=128_ns20=128_nsg=4
Head_dim 128 covers parakeet-ctc-0.6b, parakeet-tdt-0.6b-v3,
parakeet_realtime_eou_120m-v1 (all d_model=1024 / n_heads=8)
and parakeet-tdt-1.1b (d_model=2048 / n_heads=16 → also 128).
Sortformer's transformer is tf_d_model=192 / n_heads=8 → head_dim=24,
which falls below flash_attn_ext's supported set
{40,64,80,96,112,128,192,256}. parakeet_sortformer.cpp does not
share rel_pos_mha_graph with the conformer encoder, so the
Sortformer transformer block is structurally untouched by this
flag.
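The head-size reasoning above reduces to a divide and a set lookup. The sketch below is illustrative: `kFlashAttnHeadDims` simply copies the supported set quoted in this section rather than querying ggml, and `head_dim` / `flash_attn_head_dim_ok` are hypothetical helper names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Supported head sizes as listed in the text above (copied, not queried).
constexpr std::array<int, 8> kFlashAttnHeadDims = {40, 64, 80, 96, 112, 128, 192, 256};

constexpr int head_dim(int d_model, int n_heads) { return d_model / n_heads; }

// True when d_model splits evenly across heads and the resulting head size
// is in the supported set.
inline bool flash_attn_head_dim_ok(int d_model, int n_heads) {
    if (n_heads <= 0 || d_model % n_heads != 0) return false;
    const int hd = head_dim(d_model, n_heads);
    return std::find(kFlashAttnHeadDims.begin(), kFlashAttnHeadDims.end(), hd)
           != kFlashAttnHeadDims.end();
}
```

Under these assumptions (1024, 8) and (2048, 16) both give 128 and pass, while Sortformer's (192, 8) gives 24 and is rejected.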
Bench (M3 Ultra Metal, q8_0, sample-16k.wav 20.13 s, 95 tokens, 3 warmup + 15 timed runs averaged across 5 invocations):
| Stage | FA OFF (HEAD) | FA ON (this change) | delta |
|---|---|---|---|
| mel ms | 7.72 ± 0.13 | 7.65 ± 0.02 | noise |
| enc ms | 67.35 ± 0.03 | 67.00 ± 0.02 | −0.35 ms / −0.5 % |
| dec ms | 44.16 ± 0.27 | 44.02 ± 0.51 | noise |
| infer ms | 119.24 ± 0.32 | 118.66 ± 0.51 | −0.58 ms / −0.5 % |
| RT mult | 168× | 170× | +2× |
Stdev figures are between-invocation; per-invocation stdev is ≤ 0.21 ms on encoder. The 0.35 ms encoder saving is ~5–6× the between-invocation stdev, so reproducible.
Why so modest on M3 Ultra: the conformer attention shape is
T = 252, H = 8, HD = 128, which puts the QK^T scores tensor at
~2 MB (252 × 252 × 8 × 4 B). That's well-cached on the M3 Ultra's
60-core Metal GPU, so the standard mul_mat → soft_max_ext → permute → mul_mat path was already not memory-bound. The
remaining win is purely from collapsing four kernel dispatches
per attention block into one (24 layers × 4 = 96 dispatch saves
× ~30 µs/dispatch ≈ 2.9 ms theoretical; we measured 0.35 ms,
suggesting the dispatch saving is partially absorbed by ggml's
graph-machinery overhead and the f32→f16 BD-mask cast). PCIe-based
discrete GPUs typically have higher per-dispatch overhead and
proportionally less L2 per SM, so the expected win on CUDA / Vulkan
is meaningfully larger; it is still only a prediction, which is why
those defaults stay OFF until measured.
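The two estimates in this paragraph are plain products, sketched below so they can be rerun for other shapes. `scores_bytes` and `dispatch_saving_us` are hypothetical helpers; the second one takes the per-layer dispatch count as a parameter because the text counts four per layer, and a reader who prefers to count only the net reduction can pass three.

```cpp
#include <cassert>
#include <cstdint>

// Bytes for an f32 QK^T scores tensor of shape [T, T, H].
constexpr std::int64_t scores_bytes(std::int64_t T, std::int64_t H) {
    return T * T * H * 4;
}

// Theoretical dispatch saving in microseconds: layers x dispatches counted
// per layer x assumed per-dispatch latency.
constexpr double dispatch_saving_us(int layers, int dispatches_per_layer,
                                    double us_per_dispatch) {
    return layers * dispatches_per_layer * us_per_dispatch;
}
```

`scores_bytes(252, 8)` is 2 032 128 bytes, matching the ~2 MB figure, and `dispatch_saving_us(24, 4, 30.0)` is 2880 us, i.e. the ~2.9 ms theoretical ceiling against which the measured 0.35 ms is compared.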
Parity: all gates pass byte-exact under the new default —
- test-tdt-decoder-parity PASS, 95 tokens, "Alice was beginning…", CPU-fallback vs Metal-graph token IDs identical
- test-mel-fft-parity PASS, rfft vs textbook FFT rel error 6.89e-08, stateful overload bit-equal stateless on 101 frames × 80 mels, 7 sequential calls bit-equal
- test-encoder-capture-parity (CTC) PASS, encoder_out (258 048 floats) and logits (258 300 floats) bit-equal across capture=true/false
- test-perf-regression (TDT q8_0, n_gpu_layers=1) PASS, transcript byte-equal to expected, mel and encoder summary inside the configured budgets
The cast-to-f16 of the BD mask that broke CPU parity in Round 7
does not break Metal parity here because: (a) the no-streaming
case (full sample-16k.wav window) has att_mask = nullptr so the
mask passed to ggml_flash_attn_ext is purely bd_scaled = scale * bd_final with no additional masking term to merge; (b)
the f16 cast happens on a tensor whose pre-softmax magnitudes are
in the ±5 to ±10 range (relative-position embeddings after the
matmul and the pos_bias_v add), where f16's 11-bit significand (about 3 decimal digits) still resolves steps of roughly 0.004 to 0.008; (c) the downstream argmax over the joint logits is invariant to sub-bit-15 precision drift in attention scores.
What this does not address:
- att_mask != nullptr (Mode 2/3 streaming windows). The experimental code path passes only bd_scaled as the mask; the additive att_mask is dropped on the floor. Mode 1 is fine, but for Mode 2/3 production streaming the mask should be folded in via ggml_add(bd_scaled, att_mask) before the f16 cast. Tracked as a precondition for the streaming bench sweep.
- Sortformer transformer head_dim=24 is unsupported by flash_attn_ext. Not a regression, just unaffected.
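The proposed fold-in for the streaming case can be shown with a host-side model of one attention row. This is a minimal sketch, not the graph code: `fold_mask` is an element-wise stand-in for ggml_add(bd_scaled, att_mask), `softmax_row` is a plain softmax, and the convention that att_mask holds 0 for visible and -inf for masked positions is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Element-wise stand-in for ggml_add(scale * bd_final, att_mask).
// Assumes att_mask is 0 for visible positions and -inf for masked ones.
std::vector<float> fold_mask(const std::vector<float>& bd_final,
                             const std::vector<float>& att_mask,
                             float scale) {
    assert(bd_final.size() == att_mask.size());
    std::vector<float> folded(bd_final.size());
    for (std::size_t i = 0; i < bd_final.size(); ++i) {
        folded[i] = scale * bd_final[i] + att_mask[i];
    }
    return folded;
}

// Softmax over one row of (content scores + additive mask). Requires at
// least one visible position so the row maximum is finite.
std::vector<float> softmax_row(const std::vector<float>& content_scores,
                               const std::vector<float>& mask) {
    assert(content_scores.size() == mask.size());
    float max_v = -std::numeric_limits<float>::infinity();
    for (std::size_t i = 0; i < mask.size(); ++i) {
        max_v = std::max(max_v, content_scores[i] + mask[i]);
    }
    std::vector<float> w(mask.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < mask.size(); ++i) {
        w[i] = std::exp(content_scores[i] + mask[i] - max_v);
        sum += w[i];
    }
    for (float& x : w) x /= sum;
    return w;
}
```

Because the fold is additive, a masked position stays at -inf regardless of its relative-position bias and receives exactly zero attention weight, which is the behaviour lost when only bd_scaled is passed as the mask.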
Remaining stack-rank for next encoder optimization:
- Conv2d-DW Metal kernel (PROGRESS §6.5 first bullet): promotes 24 conformer-conv blocks off the im2col fallback. Bigger expected gain than this flash-attn flip because the im2col path adds a separate copy kernel before the matmul.
- Hybrid ggml_backend_sched: overlap mel preprocessing (7.65 ms host CPU) with encoder dispatch (67 ms Metal). Up to 7.65 ms inference wall-time saving if we can hide mel under encoder.
- att_mask fold-in for streaming (covered above).
Vulkan was listed as a supported backend since Phase 6 but had never been validated end-to-end on the CTC encoder. This phase brings it to correctness on Windows with an NVIDIA RTX 5060 (should apply to any Vulkan-capable GPU).
(Originally landed as "Phase 15" on main while this branch was
mid-flight on its own §14 / §15 TDT-decoder + fused-LSTM+joint
sequence; renumbered to §16 here so the two streams don't collide.)
Two issues prevented the Vulkan backend from producing correct output:
- memcpy from GPU pointer in read_filterbank_to_vector(). The filterbank tensor is allocated on the GPU backend when n_gpu_layers > 0. The original code did std::memcpy(out.data(), t->data, ...), which dereferences a device pointer on the host and segfaults (0xC0000005). Fixed by switching to ggml_backend_tensor_get(t, out.data(), 0, n), which handles the device-to-host copy transparently. This fix is backend-agnostic: the old memcpy only worked on Metal/CUDA because those backends happened to map t->data to host-visible memory, whereas ggml_backend_tensor_get is the correct API for all backends.
- Strided ggml_view_3d passed to unary ops in the GLU. The Conformer conv module's Gated Linear Unit splits the pointwise-conv output in half along the channel dimension using ggml_view_3d. The resulting tensors are non-contiguous (strided). ggml-vulkan's unary operations (ggml_sigmoid) use push constants that do not carry stride information; they assume contiguous memory. The sigmoid output was therefore garbage (rel=0.82 at the GLU stage), which cascaded into zero-token transcriptions. Fixed by wrapping both ggml_view_3d halves with ggml_cont() before feeding them into ggml_sigmoid and ggml_mul. This fix is also backend-agnostic: any backend that doesn't handle strided unary inputs benefits.
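The second issue can be reproduced without a GPU. The sketch below is a plain C++ model, not ggml code: `View2D`, `make_contiguous` (playing the role of ggml_cont) and `sigmoid_assuming_contiguous` (playing the role of a unary kernel that receives no stride information) are hypothetical names, and the layout is simplified to two dimensions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// A 2-D view into a larger buffer: `rows` rows of `cols` floats, with
// `row_stride` floats between consecutive row starts.
struct View2D {
    const float* base;
    std::size_t rows, cols, row_stride;
};

// Plays the role of ggml_cont(): copy a strided view into packed storage.
std::vector<float> make_contiguous(const View2D& v) {
    std::vector<float> out;
    out.reserve(v.rows * v.cols);
    for (std::size_t r = 0; r < v.rows; ++r) {
        for (std::size_t c = 0; c < v.cols; ++c) {
            out.push_back(v.base[r * v.row_stride + c]);
        }
    }
    return out;
}

// A unary kernel that ignores strides: it walks rows*cols floats linearly
// from `base`, which is only correct when row_stride == cols.
std::vector<float> sigmoid_assuming_contiguous(const View2D& v) {
    std::vector<float> out(v.rows * v.cols);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = 1.0f / (1.0f + std::exp(-v.base[i]));
    }
    return out;
}
```

With a buffer of two rows laid out as [a0 a1 g0 g1], a gate view starting at g0 with row_stride 4 is read correctly for the first row only; from the second row on, the stride-blind kernel picks up the `a` half instead. Packing the view first and passing a view with row_stride equal to cols gives the intended values.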
The bisection used the test-vk-vs-cpu harness with per-sub-stage
taps injected into the first Conformer block's convolution module.
Each intermediate tensor (pre, post_pw1, post_glu, post_dw,
post_pw2, post_bn) was compared CPU vs Vulkan. The divergence
was pinpointed to post_glu (rel jumped from ~2e-3 to 0.82),
confirming the ggml_view_3d + unary op interaction as root cause.
The ggml-vulkan.cpp source was then inspected to confirm that
unary push constants lack stride fields, validating the hypothesis.
PASS stage subsampling_out n=141312 max_abs=1.239e+01 rel=2.032e-03
PASS stage block0_post_ff1 n=141312 max_abs=3.966e+02 rel=2.030e-03
PASS stage block0_post_attn n=141312 max_abs=3.965e+02 rel=2.033e-03
PASS stage block0_post_conv n=141312 max_abs=3.901e+02 rel=2.032e-03
PASS stage block0_post_ff2 n=141312 max_abs=3.897e+02 rel=2.035e-03
PASS stage block0_out n=141312 max_abs=3.813e-01 rel=1.958e-03
PASS stage block_last_out n=141312 max_abs=1.260e-01 rel=6.897e-03
PASS stage encoder_out n=141312 max_abs=1.260e-01 rel=6.897e-03
PASS stage logits n=141450 max_abs=1.047e+00 rel=1.454e-03
all stages passed
- CMakeLists.txt: centralised GGML_USE_* defines into an INTERFACE library parakeet-backend-defs (CUDA, Metal, Vulkan, BLAS, OpenCL). All test targets link this library so GPU code paths are compiled consistently.
- test-vk-vs-cpu target gated behind if (GGML_VULKAN).
- Test sources live under test/test_*.cpp for cleaner repo organisation.
test/test_vk_vs_cpu.cpp loads the same GGUF twice (CPU and
Vulkan), runs both encoders on the same mel input, and compares
9 intermediate stages. Each stage asserts rel < 5e-2 and no
NaN/Inf values. Exit code 1 on any failure.
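A per-stage check of this kind might look like the sketch below. It is an illustration under assumptions rather than the harness itself: `StageResult` and `compare_stage` are hypothetical names, and the relative error is defined here as the L2 norm of the difference over the L2 norm of the reference, which may differ from the metric the harness prints as rel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct StageResult {
    bool finite;  // no NaN/Inf seen in either tensor
    double rel;   // ||test - ref||_2 / ||ref||_2 (assumed definition)
    bool pass;    // finite && rel < threshold
};

StageResult compare_stage(const std::vector<float>& ref,
                          const std::vector<float>& test,
                          double rel_threshold = 5e-2) {
    StageResult r{true, 0.0, false};
    if (ref.size() != test.size() || ref.empty()) return r;  // shape mismatch fails

    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (!std::isfinite(ref[i]) || !std::isfinite(test[i])) {
            r.finite = false;
            return r;  // NaN/Inf fails the stage outright
        }
        const double d = static_cast<double>(test[i]) - static_cast<double>(ref[i]);
        num += d * d;
        den += static_cast<double>(ref[i]) * static_cast<double>(ref[i]);
    }
    if (den > 0.0) {
        r.rel = std::sqrt(num / den);
    } else {
        r.rel = num > 0.0 ? std::numeric_limits<double>::infinity() : 0.0;
    }
    r.pass = r.rel < rel_threshold;
    return r;
}
```

A driver would call this once per captured stage and return exit code 1 as soon as any `pass` is false.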
- Vulkan performance optimisation (RTF benchmarking, pipeline cache).
- Validate on AMD and Intel GPUs.
- Upstream the ggml_cont fix as a ggml-vulkan unary stride patch.
Phase 11 landed v1 (offline + sliding-history streaming) and §11.11.2
reserved a slot for NeMo's streaming-Sortformer spkcache architecture
shipped with diar_streaming_sortformer_4spk-v2.x. This phase fills
that slot: a faithful C++ port of NeMo's AOSC algorithm so v2.1
correctly tracks speakers across long re-entry gaps (which v1 and v2.1
without a cache cannot do — they collapse returning speakers into
whichever hyp slot is closest to the current talker).
Ported from sortformer_modules.py + sortformer_diar_models.py in
NeMo. Each C++ helper carries an // matches NeMo <fn> at sortformer_modules.py:<line> comment pointing at its source.
- _compress_spkcache: composite-score top-K retention per speaker, silence anchoring via mean_sil_emb, dedupe by absolute frame index, chronological output (the v2.1 model was trained with Sort Loss so output order matters).
- _get_silence_profile: runtime EMA of silence-frame embeddings.
- _disable_low_scores / _boost_topk_scores: threshold gating + newest-frame boost on the per-chunk score matrix.
- streaming_update: FIFO + pop + compress orchestration.
- forward_streaming_step (sortformer_aosc_step in C++): per-chunk cache + FIFO + chunk concat in the post-subsampling embedding space, FastConformer over the concatenation, head, slice, threshold.
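The retention pattern named for _compress_spkcache (top-K per speaker, dedupe by frame index, chronological output) can be sketched in a heavily simplified form. This is not NeMo's algorithm or the C++ port: the composite scoring, silence anchoring and boosts are omitted, a plain per-frame score is assumed, and `topk_chronological` and `merge_dedupe` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Keep the k highest-scoring frames for one speaker and return their
// absolute frame indices in chronological (ascending) order.
std::vector<std::size_t> topk_chronological(const std::vector<float>& scores,
                                            std::size_t k) {
    std::vector<std::size_t> idx(scores.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&scores](std::size_t a, std::size_t b) {
                          if (scores[a] != scores[b]) return scores[a] > scores[b];
                          return a < b;  // tie-break: earlier frame first
                      });
    idx.resize(k);
    std::sort(idx.begin(), idx.end());  // restore chronological order
    return idx;
}

// Merge the per-speaker selections, dropping duplicate frame indices, and
// return the union in chronological order.
std::vector<std::size_t> merge_dedupe(
        const std::vector<std::vector<std::size_t>>& per_speaker) {
    std::vector<std::size_t> all;
    for (const auto& v : per_speaker) all.insert(all.end(), v.begin(), v.end());
    std::sort(all.begin(), all.end());
    all.erase(std::unique(all.begin(), all.end()), all.end());
    return all;
}
```

The final sort is the part that matters for a Sort-Loss-trained model: whatever order the scores select frames in, the cache is handed back in the order the frames occurred.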
SortformerStreamSession::try_emit_chunks waits for
chunk_right_context_ms of lookahead audio before emitting; tail
chunks fall back to a left-context-only finalize path. New public
fields on SortformerStreamingOptions:
chunk_left_context_ms = 80, chunk_right_context_ms = 560,
spkcache_update_period = 144, fifo_len = 188. Defaults match
NeMo's e2e_diarize_speech.py inference YAML.
run_encoder_bypass_pre_encode (in parakeet_ctc.cpp) skips the
subsampling block and feeds pre-subsampled embeddings straight into
the conformer stack. Required for splicing the speaker cache + FIFO +
new chunk in the post-subsampling space the way NeMo trained v2.1
with. Activated only when the cached EncoderGraph carries
bypass_pre_encode = true; v1 continues through the regular encoder
forward path.
cache_active = false for v1 GGUFs (detected via encoder shape:
18 conformer layers / 80 mel bins, vs v2.x's 17 / 128). v1 streaming
still uses the prior sliding-history + overlap-remap logic and stays
bit-identical to its previous output.
Synthetic English-only fixtures generated via ElevenLabs TTS with LIFO re-entry patterns. Lengths chosen so the re-entry gap exceeds the FIFO span:
- test/samples/abcba.wav (160.6 s, 3 distinct speakers, pattern A→B→C→B→A): A returns after a 97 s gap.
- test/samples/abcdba.wav (191.2 s, 4 distinct speakers, pattern A→B→C→D→B→A): A returns after a 128 s gap, B returns after a 66 s gap.
Each fixture ships with a hand-built ground-truth .rttm.
test/test_sortformer_aosc_speakers.cpp (new) checks three invariants
against the RTTM: (a) every reference speaker has at least one
emitted hyp frame, (b) every speaker that re-enters lands in the
same hyp_<id> it was first assigned to (the AOSC contract), and
(c) frame-level DER under the optimal hyp→ref permutation is below
30 %. Both fixtures register as ctest entries
test-sortformer-aosc-speakers-{abcba,abcdba}.
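Invariant (c) can be illustrated with a brute-force search over hyp→ref permutations, which is cheap for at most four slots. The sketch below is a simplification and not the test's implementation: it assumes one label per frame (-1 for silence, so no overlapped speech), counts every frame whose mapped hyp label differs from the reference as an error, and divides by the number of reference speech frames. `best_permutation_der` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

// Frame-level DER for single-label frames, minimised over all hyp->ref
// permutations of `n_slots` slots. Labels are in [0, n_slots) or -1 for
// silence / no speaker.
double best_permutation_der(const std::vector<int>& ref,
                            const std::vector<int>& hyp,
                            int n_slots) {
    assert(ref.size() == hyp.size());
    std::size_t ref_speech = 0;
    for (int r : ref) {
        if (r >= 0) ++ref_speech;
    }
    assert(ref_speech > 0);

    std::vector<int> perm(static_cast<std::size_t>(n_slots));
    std::iota(perm.begin(), perm.end(), 0);  // sorted start visits every permutation

    double best = std::numeric_limits<double>::infinity();
    do {
        std::size_t errors = 0;
        for (std::size_t i = 0; i < ref.size(); ++i) {
            const int mapped =
                hyp[i] >= 0 ? perm[static_cast<std::size_t>(hyp[i])] : -1;
            if (mapped != ref[i]) ++errors;  // miss, false alarm, or confusion
        }
        best = std::min(best, static_cast<double>(errors) /
                                  static_cast<double>(ref_speech));
    } while (std::next_permutation(perm.begin(), perm.end()));
    return best;
}
```

A gate in the spirit of the test would then assert that the returned value is below 0.30. Invariants (a) and (b) need per-speaker bookkeeping and are not covered by this sketch.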
Measured on q8_0 v2.1 GGUF, Apple M-series, CPU backend:
| fixture | mode | speakers tracked | DER | A re-binds | B re-binds |
|---|---|---|---|---|---|
| abcba | v1 streaming | 2 (A,B; no C) | 24.31 % | yes (single hyp_0 across both) | yes (single hyp_1 across both) |
| abcba | v2.1 + AOSC | 3 (A,B,C) | 27.29 % | yes (gap 97 s) | yes (gap 35 s) |
| abcba | v2.1 no-cache | 2 (A,B; no C) | 23.74 % | n/a | n/a |
| abcdba | v1 streaming | 2 (collapsed) | 66.28 % | no — rebinds to hyp_1 | no — rebinds to hyp_0 |
| abcdba | v2.1 + AOSC | 4 (A,B,C,D) | 22.22 % | yes (gap 128 s) | yes (gap 66 s) |
| abcdba | v2.1 no-cache | 2 (collapsed) | 65.72 % | n/a | n/a |
The 4-speaker case is the discriminating one: v2.1+AOSC drops DER from 66 % to 22 %, and is the only mode that holds slot continuity for the returning speakers. Residual confusion in the 3-speaker case (C/Alice gets bound to A/Sarah's slot once) is encoder-side acoustic similarity between two female voices — independent of the cache. The regression test gates on the AOSC contract (slot continuity + DER ceiling), not on per-frame identity, so this real-world ambiguity doesn't flake the test.
- include/parakeet/diarization.h: new SortformerStreamingOptions fields; spkcache_enable default flipped to true.
- src/parakeet_sortformer.{h,cpp}: AOSC helpers + state extension (mean_sil_emb, spkcache_preds, fifo_preds, n_sil_frames).
- src/parakeet_ctc.{h,cpp}: run_encoder_bypass_pre_encode; EncoderGraph gains bypass_pre_encode / T_enc / pre_encode_in fields.
- src/parakeet_engine.cpp: streaming session uses the subsampling+AOSC pipeline on v2.x; try_emit_chunks waits for right-context; diarize_start populates new config fields.
- test/test_sortformer_streaming.cpp: reads defaults from SortformerStreamingOptions so the existing binary reflects the new AOSC config out of the box.
- test/test_sortformer_aosc_speakers.cpp (new): regression test described in §17.5.
- test/samples/abcba.{wav,rttm}, test/samples/abcdba.{wav,rttm}: new ElevenLabs fixtures.
- CMakeLists.txt: path vars + add_executable + parakeet_register_test entries for the two new ctest cases.
- The existing test-sortformer-streaming assertion n_finals == 1 trips non-deterministically on long inputs under AOSC (session emits 0 is_final markers instead of 1). The hyp RTTM is still valid; only the session-end signalling needs to emit exactly one final marker. Separate, narrowly-scoped fix.
- AOSC streaming is correct through the parakeet-cpp C++ test binary. Surfacing it through downstream addon wrappers (e.g. transcription-parakeet's runStreaming() JS API) requires separate plumbing work on those wrappers, which is not in this phase.