Support multiple concurrent TCP transcriptions sharing a single loaded model
TCP server now spawns a thread per connection, each getting an independent
pipeline (~7.3MB) while sharing the model weights in VRAM/RAM. Local audio
capture (--trigger) runs alongside TCP in a background thread. Output
simplified to raw text stream (no timestamp framing). CoreML encoder caches
moved from shared model struct to per-pipeline for thread safety. --input
flag deprecated in favour of --trigger for enabling local mode.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
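
The threading model described in the commit message (one thread per TCP connection, each with its own pipeline, all sharing one loaded model) can be sketched roughly as follows. This is a minimal illustration and not the project's code: the language (Rust) and the names `Model`, `Pipeline`, `feed`, `handle_client`, and `serve` are all hypothetical, and `feed` only counts bytes where a real pipeline would run incremental transcription. The one idea it is meant to show is that read-only weights sit behind a shared reference while all mutable state (such as encoder caches) lives in the per-connection pipeline.

```rust
use std::io::{Read, Write};
use std::net::{TcpListener, TcpStream};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the loaded model weights.
/// Read-only after loading, so it can be shared across threads.
struct Model {
    name: String,
}

/// Hypothetical per-connection state (decoder state, encoder caches, buffers).
/// Each connection owns one, so no locking is needed.
struct Pipeline {
    model: Arc<Model>,
    bytes_seen: usize,
}

impl Pipeline {
    fn new(model: Arc<Model>) -> Self {
        Pipeline { model, bytes_seen: 0 }
    }

    /// Placeholder for incremental transcription: reports how many
    /// audio bytes this pipeline has consumed so far.
    fn feed(&mut self, chunk: &[u8]) -> String {
        self.bytes_seen += chunk.len();
        format!("[{}] {} bytes\n", self.model.name, self.bytes_seen)
    }
}

/// Serve one client: read audio bytes, write back a raw text stream.
fn handle_client(mut stream: TcpStream, model: Arc<Model>) -> std::io::Result<()> {
    let mut pipeline = Pipeline::new(model);
    let mut buf = [0u8; 4096];
    loop {
        let n = stream.read(&mut buf)?;
        if n == 0 {
            return Ok(()); // client closed the connection
        }
        let text = pipeline.feed(&buf[..n]);
        stream.write_all(text.as_bytes())?;
    }
}

/// Accept loop: one thread per connection, all sharing the same model.
fn serve(listener: TcpListener, model: Arc<Model>) {
    for conn in listener.incoming() {
        match conn {
            Ok(stream) => {
                let model = Arc::clone(&model);
                thread::spawn(move || {
                    // A failing client only ends its own thread.
                    let _ = handle_client(stream, model);
                });
            }
            Err(err) => eprintln!("accept failed: {err}"),
        }
    }
}
```

A caller would bind a `TcpListener`, wrap the loaded model in an `Arc`, and pass both to `serve`. Under this design, moving mutable caches out of the shared struct and into `Pipeline` is what lets the model be shared without a mutex.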
@@ -55,7 +58,7 @@ On macOS, `./run.ts build` produces one binary. On Linux, it builds both `capspe
## Architecture
-Push-to-talk voice dictation for Linux and macOS. Self-contained binary per platform. On Linux: grabs keyboards via evdev, intercepts CapsLock, captures audio via PipeWire, transcribes with Nemotron RNNT (via onnxruntime), injects text via uinput. On macOS: CGEventTap input, CoreAudio capture, CoreML inference (93% ANE), CGEventPost injection.
+Push-to-talk voice dictation for Linux and macOS. Self-contained binary per platform. Supports multiple concurrent transcriptions — TCP server accepts multiple clients simultaneously, each getting an independent pipeline while sharing the single loaded model. Use `--trigger` to also enable local audio capture with push-to-talk alongside TCP. On Linux: grabs keyboards via evdev, intercepts CapsLock, captures audio via PipeWire, transcribes with Nemotron RNNT (via onnxruntime), injects text via uinput. On macOS: CGEventTap input, CoreAudio capture, CoreML inference (93% ANE), CGEventPost injection.
> **History:** Capsper originally used whisper.cpp for ASR with Silero/TEN-VAD for voice activity detection. It now uses NVIDIA's Nemotron Speech 600M model (FastConformer RNNT) which is incremental and doesn't need a separate VAD — PTT (push-to-talk) is the sole gate. The name "Capsper" is a nod to Casper the friendly ghost — ghostwriting via CapsLock.
@@ -172,7 +175,7 @@ Do NOT manually download CI artifacts or stage releases by hand — the update s