Steady-state RSS / VRAM behaviour for the two persistent engines and
the per-process caches they share. Numbers below were measured on
Apple M2 (macOS 15.7, arm64) with the supertonic CPU + Metal paths;
RSS is mach_task_basic_info::phys_footprint. The measurement
harness is test/test_supertonic_engine_cycle.cpp
(build/test-supertonic-engine-cycle).
Both engines must satisfy:
- No monotonic RSS growth across 100+ `synthesize → destroy` cycles (drift < 5 MB). Enforced for supertonic by `test-supertonic-engine-cycle`. Chatterbox does not currently have a fixture-free cycle harness in tree; the lifecycle invariants it relies on are documented below and the existing tests (`test-t3-caches`, `test-cpu-caches`) cover the cache contracts.
- Backend-buffer release ordering: per-stage caches that hold `ggml_gallocr_t` handles must be torn down BEFORE `ggml_backend_free`. Reversing that order asserts inside the GPU-backend dylib finaliser (ggml-metal's `[rsets->data count] == 0` check is the canonical example).
- Cancel-mid-flight cleanup: `cancel()` throws out of `synthesize()` at the next stage boundary; local scratch buffers unwind via RAII and the populated caches stay valid (they're keyed on `(model.generation_id, shape)` so a subsequent `synthesize()` reuses them without rebuilds).
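
The cancel-mid-flight invariant can be illustrated with a minimal sketch. Everything here (`EngineSketch`, `StageCache`, `Cancelled`, `stage_boundary`) is a hypothetical stand-in rather than the real engine API; the sketch only assumes what the text states: cancellation is observed at stage boundaries, scratch is local, and caches are keyed on `(generation_id, shape)`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not the tts-cpp API.
struct Cancelled : std::runtime_error {
    Cancelled() : std::runtime_error("synthesis cancelled") {}
};

struct StageCache {
    std::vector<float> graph_meta;  // stands in for a cached graph + gallocr
};

// Caches are keyed on (generation_id, shape).
using CacheKey = std::pair<std::uint64_t, int>;

struct EngineSketch {
    std::uint64_t generation_id = 1;
    std::atomic<bool> cancel_requested{false};
    std::map<CacheKey, StageCache> caches;
    int cache_builds = 0;

    void cancel() { cancel_requested.store(true); }

    // Called between stages: the only place where cancellation is observed.
    void stage_boundary() {
        if (cancel_requested.exchange(false)) throw Cancelled{};
    }

    StageCache& cache_for(int shape) {
        CacheKey key{generation_id, shape};
        auto it = caches.find(key);
        if (it == caches.end()) {
            ++cache_builds;  // cache miss: build once, reuse afterwards
            it = caches.emplace(key, StageCache{std::vector<float>(shape)}).first;
        }
        return it->second;
    }

    // Runs two stages; cancel_after_stage1 simulates cancel() arriving mid-flight.
    void synthesize(int shape, bool cancel_after_stage1 = false) {
        std::vector<float> scratch(1024);  // local scratch: released by RAII on unwind
        cache_for(shape);                  // stage 1 populates its cache
        if (cancel_after_stage1) cancel();
        stage_boundary();                  // throws here if cancelled
        cache_for(shape + 1);              // stage 2
        (void)scratch;
    }
};
```

In this sketch a cancelled `synthesize(8, true)` leaves the stage-1 cache populated, so the next `synthesize(8)` only builds the stage-2 cache.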
| Phase | RSS (CPU) | RSS (Metal) | Notes |
|---|---|---|---|
| Process baseline (no engine) | 2.5 MB | 2.5 MB | ggml/tts-cpp dylibs not yet resolved |
| After first synthesize() | 311 MB | 207 MB | model + caches + backend init |
| After Engine destruction | 293 MB | 202 MB | one-shot inits stay; per-engine state freed |
| Steady-state cycle drift (n=100) | ±5 MB | ±5 MB | within allocator + page-pool noise |
Where the ~290 MB residual goes after the Engine is destroyed:
- ggml-metal library + device state (~100 MB on Metal, ~0 on CPU). The metal kernel library is compiled+loaded ONCE per process; the resulting `MTLLibrary` and pipeline cache live on a static device list inside `libqvac-speech-ggml-metal.dylib`.
- MoltenVK / ggml-vulkan ICD (~30 MB; non-zero even when n_gpu_layers=0 because the ggml backend registry walks every registered backend at first `Engine` ctor).
- ggml backend registry, OpenMP pool, libstdc++ allocator pools (~60 MB combined).
- Supertonic GGUF residuals: model weights are freed but the underlying mmap'd pages may stay resident under macOS memory pressure semantics (the allocator returns the region but the OS may keep pages mapped until pressure rises). Counts toward RSS, not "leaked"; releases on `madvise` / OS-level reclaim.
- Per-stage thread_local cache CPU buffers (~2.5 MB per worker thread that has ever run a Supertonic synth). These hold ggml init bufs + graph metadata; intentionally retained across Engine cycles for fast graph rebuild on the next synth. Released by the `release_*_thread_local_caches` machinery at engine destruction on the synth-owning thread.
free_supertonic_model (called from Engine::~Impl) executes in this
order, which is the documented invariant downstream callers should
not break:
1. `release_vector_estimator_thread_local_caches()` / `release_text_encoder_thread_local_caches()` / `release_vocoder_thread_local_caches()` / `release_duration_thread_local_caches()`: drives every per-stage thread_local cache populated on the calling thread through its normal `free_*_cache` path WHILE the backend is still alive, so each gallocr inside gets a complete `ggml_gallocr_free` (no skip).
2. `unregister_supertonic_alive(model.generation_id)`: any thread_local cache on OTHER threads (cross-thread Engine destruction, not supported per the contract) now sees its generation as no-longer-alive and the skip-and-leak path in `supertonic_safe_gallocr_free` kicks in to keep us from crashing.
3. `ggml_backend_sched_free(model.sched)`: the QVAC-19254 scheduler holds non-owning pointers to the backends, so it must die first.
4. `ggml_backend_buffer_free(model.buffer_w_extra)` / `model.buffer_w`: weight buffers, allocated against the primary backend.
5. `ggml_backend_free(model.backend)`: primary compute backend.
6. `ggml_backend_free(model.cpu_backend)`: CPU fallback backend (allocated only when primary is non-CPU).
7. `ggml_free(model.ctx_w_extra)` / `model.ctx_w`: tensor metadata contexts (CPU memory only).
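
Steps 1 and 2 above hinge on an "alive" registry plus a guarded free. The following is a rough sketch of that pattern only; `AliveRegistry`, `FakeGallocr`, `StageCacheSketch` and `safe_gallocr_free_sketch` are illustrative names, and the real `supertonic_safe_gallocr_free` may differ in signature and detail.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_set>

// Hypothetical registry of model generations whose backend is still alive.
class AliveRegistry {
public:
    void add(std::uint64_t gen) {
        std::lock_guard<std::mutex> lock(mu_);
        alive_.insert(gen);
    }
    void remove(std::uint64_t gen) {
        std::lock_guard<std::mutex> lock(mu_);
        alive_.erase(gen);
    }
    bool is_alive(std::uint64_t gen) const {
        std::lock_guard<std::mutex> lock(mu_);
        return alive_.count(gen) != 0;
    }
private:
    mutable std::mutex mu_;
    std::unordered_set<std::uint64_t> alive_;
};

// Stand-in for a ggml_gallocr_t handle; `freed` records a real free.
struct FakeGallocr { bool freed = false; };

struct StageCacheSketch {
    std::uint64_t generation_id = 0;
    FakeGallocr* allocr = nullptr;
};

// Frees the allocator only while its generation is alive. Otherwise the
// handle is dropped without freeing (skip-and-leak), because freeing
// against a dead backend would assert.
// Returns true if a real free happened.
bool safe_gallocr_free_sketch(const AliveRegistry& reg, StageCacheSketch& cache) {
    if (cache.allocr == nullptr) return false;
    const bool alive = reg.is_alive(cache.generation_id);
    if (alive) cache.allocr->freed = true;  // stands in for ggml_gallocr_free
    cache.allocr = nullptr;                 // forget the handle either way
    return alive;
}
```

The ordering matters in this sketch for the same reason as in the list: a cache released before the generation is unregistered gets a complete free, while one released afterwards is skipped.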
Per worker thread that has ever run a Supertonic synth, the following caches' CPU-side metadata persists from first call until thread death:
| Cache TU | Per-thread size (typical) |
|---|---|
| supertonic_vector_estimator.cpp | ~2.0 MB (23 caches × ~85 KB ctx+buf each, after engine cycle frees the gallocators) |
| supertonic_text_encoder.cpp | ~0.4 MB (6 caches; 5 are arrays of 2–4 instances) |
| supertonic_vocoder.cpp | ~64 KB (1 cache) |
| supertonic_duration.cpp | ~32 KB (1 cache) |
These are the buf storage vector + ggml_context headers that the
caches reuse across synths for fast graph rebuild. The ggml_gallocr_t
inside each cache IS freed at engine destruction (post-fix) — only
the cache wrapper's CPU memory stays around so the next engine cycle
hits the cache.generation_id != model.generation_id rebuild path
with all metadata already allocated.
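
As a rough illustration of that rebuild path, the sketch below keeps the metadata storage across generations and only re-stamps the cache when the model generation changes. `CacheWrapperSketch` and `ensure_cache` are hypothetical names, and the real caches hold ggml contexts and graphs rather than a plain byte vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache wrapper: the byte vector stands in for the ggml
// graph-metadata storage that survives engine cycles.
struct CacheWrapperSketch {
    std::uint64_t generation_id = 0;  // 0 means "never built"
    std::vector<std::uint8_t> buf;    // retained CPU-side metadata storage
    int rebuilds = 0;
};

// Returns true when the graph had to be rebuilt for this model generation.
bool ensure_cache(CacheWrapperSketch& cache,
                  std::uint64_t model_generation,
                  std::size_t meta_bytes) {
    if (cache.generation_id == model_generation && cache.buf.size() == meta_bytes) {
        return false;  // cache hit: same engine generation, same shape
    }
    cache.buf.resize(meta_bytes);  // no reallocation when the size is unchanged
    cache.generation_id = model_generation;
    ++cache.rebuilds;              // the real code would rebuild the graph here
    return true;
}
```

A new engine generation triggers a rebuild in this sketch, but the heap storage behind `buf` is the same block as before, which is the property the retained wrapper is meant to provide.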
If you need a thread-pool host to fully reclaim this after engines
go idle, the cheapest path is to terminate the worker (thread_local
destructors run at thread death; cache.buf.~vector() releases the
heap storage). There is currently no public API to release the
retained CPU metadata without thread termination.
We do not have local chatterbox GGUFs to run a cycle bench, but the
teardown code in src/chatterbox_engine.cpp + src/chatterbox_tts.cpp +
src/t3_mtl.cpp follows these documented invariants:
1. `wait_for_preload(s3gen_preload_thread)`: block until the background S3Gen GGUF load thread completes; otherwise the destructor races the preload thread inside ggml-metal's `ggml_metal_buffer_type_shared_alloc_buffer`.
2. `s3gen_unload()`: refcount-protected. When the count reaches zero, it calls `s3gen_model_cache_release()`, which runs `s3gen_release_synth_caches()` first (cfm_estimator, encoder, hift, f0, stft, time-mlp results, weight CPU mirrors, pos_emb / inv_alpha scaffolding, hann_window / istft_kernel / window_sum / stft_kernel) BEFORE freeing the s3gen `model_ctx`'s scheduler + backend.
3. `ggml_gallocr_free(allocr)`: the T3 prompt+step gallocr.
4. `free_model()`: calls `t3_stack_unregister` (pulls the `(buffer_stack, ctx_stack)` pair out of the process-wide `t3_stack_registry` BEFORE local free), then `t3_release_caches()` (drains the T3 step-graph cache, a process-global, mutex-protected LRU bounded at 256 entries when `CHATTERBOX_T3_STEP_CACHE=1`), then frees T3 buffers + backend + contexts in that order.
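
The refcounted unload in step 2 can be sketched as follows. `SharedModelSketch` and its log strings are hypothetical; the real code presumably also serialises access across threads, which is omitted here to keep the release ordering visible.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical shared model entry, refcounted across engines.
struct SharedModelSketch {
    int refcount = 0;
    std::vector<std::string> log;  // records the order of teardown actions

    void load() {
        if (refcount++ == 0) log.push_back("load");  // first user loads the model
    }

    void unload() {
        assert(refcount > 0);
        if (--refcount > 0) return;  // other engines still share the model
        // Last user: synth caches go first, while the backend is still alive.
        log.push_back("release_synth_caches");
        log.push_back("free_scheduler");
        log.push_back("free_backend");
    }
};
```

With two engines sharing one model in this sketch, the first `unload()` is a no-op apart from the decrement, and only the second one runs the teardown, caches first.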

Process-wide Chatterbox caches:
- S3Gen model cache (`g_s3gen_cache_entry` in chatterbox_tts.cpp). One `model_ctx` per (path, n_gpu_layers) pair, refcounted across multiple Engine instances sharing the same S3Gen GGUF. Released by the last `s3gen_unload()` or by the `atexit` hook (`s3gen_model_cache_release`) on process exit.
- T3 step-graph cache (LRU, bounded at 256 entries, opt-in via `CHATTERBOX_T3_STEP_CACHE=1`). Each entry holds a `ggml_context *` (one per `(n_past, is_uncond)` cache key) + the graph metadata `buf` (~MB-scale on the multilingual model). With the cap at 256 entries, the upper bound is roughly 256 × graph-metadata bytes; measured ~270 MB peak on the multilingual model when fully saturated. Cleared by `detail::t3_release_caches()` (called from Engine destructor) or by the `atexit` hook (`t3_step_cache_release_atexit`) at process exit.
- T3 stack registry (`t3_stack_registry` in t3_mtl.cpp). Holds `(ggml_backend_buffer_t, ggml_context *)` pairs registered by load-time MTL build paths so the `atexit` hook can free them before the GPU dylib's static finaliser runs. Each Engine's destructor pulls its registered entries before its own backend free.

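
To make the memory bound of the step-graph cache concrete, here is a minimal bounded-LRU sketch keyed on `(n_past, is_uncond)`. `StepGraphLru` and `StepEntry` are hypothetical, the byte vector stands in for the per-entry context and metadata, and the mutex that the real process-global cache uses is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <utility>
#include <vector>

using StepKey = std::pair<int, bool>;  // (n_past, is_uncond)

struct StepEntry {
    StepKey key;
    std::vector<unsigned char> graph_meta;  // stands in for ggml_context + buf
};

// Hypothetical bounded LRU: at most `cap` entries are ever resident, so
// memory is bounded by cap × per-entry metadata size.
class StepGraphLru {
public:
    explicit StepGraphLru(std::size_t cap) : cap_(cap) { assert(cap_ > 0); }

    // Returns the entry for `key`, building it on a miss and evicting the
    // least recently used entry when the cache is full.
    StepEntry& get_or_build(StepKey key, std::size_t meta_bytes) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            order_.splice(order_.begin(), order_, it->second);  // mark most recent
            return *it->second;
        }
        if (order_.size() == cap_) {
            index_.erase(order_.back().key);
            order_.pop_back();
        }
        order_.push_front(StepEntry{key, std::vector<unsigned char>(meta_bytes)});
        index_[key] = order_.begin();
        return order_.front();
    }

    bool contains(StepKey key) const { return index_.count(key) != 0; }
    std::size_t size() const { return order_.size(); }

private:
    std::size_t cap_;
    std::list<StepEntry> order_;  // front = most recently used
    std::map<StepKey, std::list<StepEntry>::iterator> index_;
};
```

With a cap of 256, inserting 300 distinct keys leaves exactly 256 entries resident in this sketch, which is why the peak footprint scales with the cap rather than with the number of decode steps.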
src/chatterbox_tts.cpp::time_mlp_cache (line 1270, inside
compute_time_mlp) is a thread_local cache whose destructor calls
ggml_gallocr_free(allocr) directly (no safe-skip helper). If a
worker thread outlives the Engine that populated it and the backend
has been freed, the thread_local destructor at thread death will
attempt ggml_gallocr_free against a dead backend and assert inside
the dylib finaliser.
Empirically this doesn't trigger on chatterbox CLI runs (single
thread, backend lives until process exit) but it's a latent risk
for thread-pool hosts (SDK / Bare addon). A follow-up fix would
mirror the supertonic registry pattern for chatterbox or move the
time_mlp_cache into s3gen_release_synth_caches()'s sweep.
| State | Scope | Released by |
|---|---|---|
| Supertonic model.backend + buffers | Engine | free_supertonic_model |
| Supertonic per-stage thread_local caches | Per-thread, per-Engine generation | release_*_thread_local_caches (synth thread) on dtor, OR thread death |
| Supertonic alive registry | Process | atomic — Engine ctor adds, dtor drops |
| Chatterbox T3 model.backend + buffers | Engine | Engine::~Impl::free_model |
| Chatterbox S3Gen model_ctx | Process (refcount) | last s3gen_unload() / process atexit |
| Chatterbox T3 step-graph cache | Process | t3_release_caches() / process atexit |
| Chatterbox T3 stack registry | Process | t3_stack_unregister / process atexit |
| ggml backend registry (Metal, Vulkan, …) | Process | Process exit (dylib finaliser) |
| OpenMP / MoltenVK / Metal compiler pools | Process | Process exit |
| GGUF mmap pages | Process (until OS pressure) | madvise on free + OS-level reclaim |
build/test-supertonic-engine-cycle <supertonic.gguf> [REF_DIR_ignored] [n_cycles=20] [n_gpu_layers=0]
- `n_cycles >= 2` required. First cycle's RSS captures one-time process inits; cycles 2..N are compared against it with a 5 MB tolerance.
- `n_gpu_layers=99` exercises the Metal / Vulkan / OpenCL path (whichever the backend cascade resolves to first).
- Exits non-zero if max RSS across cycles 2..N exceeds first-cycle RSS + 5 MB.
- The synthesis is a fixed sentence ("The quick brown fox …") so this is a memory-only assertion; numerical parity is covered by the `test-supertonic-{pipeline,vector,text-encoder,vocoder,duration}` harnesses.
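
The pass/fail rule described above reduces to a small comparison. The sketch below is not the harness code; `rss_drift_ok` is a hypothetical helper, and it assumes that "5 MB" means 5 × 1024 × 1024 bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// rss_bytes[i] is the RSS sampled after cycle i + 1.
// Passes when no later cycle exceeds the first cycle's RSS by more than
// the tolerance. Assumption: 1 MB = 1024 * 1024 bytes.
bool rss_drift_ok(const std::vector<std::size_t>& rss_bytes,
                  std::size_t tolerance_bytes = 5u * 1024u * 1024u) {
    if (rss_bytes.size() < 2) return false;  // mirrors the n_cycles >= 2 requirement
    const std::size_t first = rss_bytes.front();
    const std::size_t max_later =
        *std::max_element(rss_bytes.begin() + 1, rss_bytes.end());
    return max_later <= first + tolerance_bytes;
}
```

Because only the maximum of cycles 2..N is compared against the first cycle, a run whose RSS drops after the first cycle still passes in this sketch; the check targets growth, not fluctuation.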