CrispASR

One C++ binary, twenty-seven ASR backends + eight TTS engines + multilingual text translation, zero Python dependencies.

CrispASR started as a fork of whisper.cpp and extends that base into a unified speech engine called crispasr, backed by full ggml C++ runtimes for major open-weights ASR and TTS architectures. One build, one binary, one consistent CLI — pick the backend at the command line or let CrispASR auto-detect it from your GGUF file. See Text-to-Speech for the TTS side.

$ crispasr -m ggml-base.en.bin          -f samples/jfk.wav                    # OpenAI Whisper
$ crispasr -m parakeet-tdt-0.6b.gguf    -f samples/jfk.wav                    # NVIDIA Parakeet
$ crispasr -m canary-1b-v2.gguf         -f samples/jfk.wav                    # NVIDIA Canary
$ crispasr -m voxtral-mini-3b-2507.gguf -f samples/jfk.wav                    # Mistral Voxtral
$ crispasr --backend qwen3 -m auto      -f samples/jfk.wav                    # -m auto downloads
$ crispasr --backend kokoro -m auto --tts "Hello world" --tts-output out.wav  # TTS

No Python. No PyTorch. No separate per-model binary. No pip install. Just one C++ binary and a GGUF file.

Ecosystem

Project	What it does
CrispASR	This repo — C++ speech recognition engine. 24 ASR backends + 8 TTS backends, CLI + HTTP server + C-ABI + Python/Rust/Dart bindings.
CrisperWeaver	Cross-platform Flutter transcription app built on CrispASR. Desktop + mobile, all 10 backends, model browser with download queue, mic capture, SRT/VTT/JSON export, diarization, batch processing. Fully offline.
CrispEmbed	Text embedding engine via ggml — same philosophy as CrispASR but for retrieval. 10 architectures (XLM-R, Qwen3-Embed, Gemma3, ModernBERT, ...), dense + sparse + ColBERT + reranking. 9.5x faster than ONNX on CPU, GPU via CUDA/Metal/Vulkan. Python/Rust/Dart bindings.
Susurrus	Python ASR GUI with 9 backends (faster-whisper, mlx-whisper, voxtral, insanely-fast-whisper, ...). The Python counterpart to CrispASR's C++ approach.

Supported backends — ASR + TTS + translation + post-processing
Feature matrix
Install & build — quick install (full guide in docs/install.md)
Quick start — ASR
Text-to-Speech (TTS) — Kokoro, Qwen3-TTS, VibeVoice, Orpheus, Chatterbox, IndexTTS, VoxCPM2 (beta)
Streaming & live transcription
Server mode (HTTP API)
CLI reference — flags, VAD, CTC alignment, output formats, auto-download, audio formats
Language bindings — Python / Rust / Dart / Go / Java / Ruby / mobile
Architecture — layered layout, src/core/ primitives, regression discipline
Contributing — adding a new backend — 5-file recipe, ground-truth diff workflow
Regression matrix — tools/test-all-backends.py capability tiers
Quantize models — crispasr-quantize for all backends
GPU backend selection
Debugging & profiling
Credits

Supported backends

CrispASR ships 24 ASR backends for transcription/translation and eight TTS engines for synthesis. Pick at the CLI with --backend NAME, or omit it to let the binary auto-detect from the GGUF metadata. Jump to the TTS table for the synthesis side.

ASR backends

Backend	Model	Architecture	Languages	License
whisper	`ggml-base.en.bin` and all OpenAI Whisper variants	Encoder-decoder transformer	99	MIT
whisper	`distil-whisper/distil-large-v3`	Distilled Whisper: 32L encoder + 2L decoder (6.3x faster)	English	MIT
parakeet	`nvidia/parakeet-tdt-0.6b-v3`	FastConformer + TDT	25 EU (auto-detect)	CC-BY-4.0
parakeet	`nvidia/parakeet-tdt_ctc-0.6b-ja`	FastConformer-TDT-CTC, xscaling, 80 mels	Japanese	CC-BY-4.0
fastconformer-ctc	`nvidia/parakeet-ctc-0.6b`	24L FastConformer + CTC, 80 mels (same arch as fc-ctc-xlarge)	en	CC-BY-4.0
fastconformer-ctc	`nvidia/parakeet-ctc-1.1b`	42L FastConformer + CTC, 80 mels	en	CC-BY-4.0
canary	`nvidia/canary-1b-v2`	FastConformer + Transformer decoder	25 EU (explicit `-sl/-tl`)	CC-BY-4.0
cohere	`CohereLabs/cohere-transcribe-03-2026`	Conformer + Transformer	13	Apache-2.0
granite	`ibm-granite/granite-speech-{3.2-8b,3.3-2b,3.3-8b}`, `granite-4.0-1b-speech`	Conformer + Q-Former + Granite LLM (μP) (more)	en fr de es pt ja	Apache-2.0
granite-4.1	`ibm-granite/granite-speech-4.1-2b`	16L Conformer + Q-Former + Granite LLM; single ggml graph (more)	en fr de es pt ja	Apache-2.0
granite-4.1-plus	`ibm-granite/granite-speech-4.1-2b-plus`	4.1 + hidden-state concat; punctuated output (more)	en fr de es pt	Apache-2.0
granite-4.1-nar	`ibm-granite/granite-speech-4.1-2b-nar`	Non-autoregressive: single LLM forward + slot argmax (more)	en fr de es pt	Apache-2.0
fastconformer-ctc	`nvidia/stt_en_fastconformer_ctc_large`	FastConformer + CTC (NeMo family, all sizes)	en	CC-BY-4.0
voxtral	`mistralai/Voxtral-Mini-3B-2507`	Whisper encoder + Mistral 3B LLM	8	Apache-2.0
voxtral4b	`mistralai/Voxtral-Mini-4B-Realtime-2602`	Causal encoder + 3.4B LLM, sliding window	13, realtime streaming	Apache-2.0
qwen3	`Qwen/Qwen3-ASR-0.6B`	Whisper-style audio encoder + Qwen3 0.6B LLM	30 + 22 Chinese dialects	Apache-2.0
wav2vec2	`jonatasgrosman/wav2vec2-large-xlsr-53-english`	CNN + 24L transformer + CTC head (any Wav2Vec2ForCTC)	per-model	Apache-2.0
wav2vec2	`facebook/data2vec-audio-base-960h`	Data2Vec Audio (79 MB Q4_K)	English	Apache-2.0
wav2vec2	`facebook/hubert-large-ls960-ft`	HuBERT Large (212 MB Q4_K)	English	Apache-2.0
glm-asr	`zai-org/GLM-ASR-Nano-2512`	Whisper encoder + 4-frame projector + Llama 1.5B (GQA)	17 (Mandarin, English, Cantonese, ...)	MIT
kyutai-stt	`kyutai/stt-1b-en_fr`	Mimi codec (SEANet + RVQ) + 16L causal LM	en, fr	MIT
firered-asr	`FireRedTeam/FireRedASR2-AED`	Conformer + CTC + beam search; also LID (120 langs)	Mandarin, English, 20+ Chinese dialects	Apache-2.0
moonshine	`UsefulSensors/moonshine-{tiny,base}`	Conv + 6L enc + 6L dec; multilingual variants	English + 6 langs	MIT
moonshine‑de	`fidoriel/moonshine-base-de`	German fine-tune of moonshine-base (6.9% WER CV22)	German	CC‑BY‑NC‑SA‑4.0
moonshine‑tiny‑de	`fidoriel/moonshine-tiny-de`	German fine-tune of moonshine-tiny (11.4% WER CV22)	German	CC‑BY‑NC‑SA‑4.0
moonshine-streaming	`UsefulSensors/moonshine-streaming-{tiny,small,medium}`	Streaming: sliding-window encoder + AR decoder (34–245M)	English	MIT
gemma4-e2b	`google/gemma-4-E2B-it`	USM Conformer 12L + Gemma4 LLM 35L (GQA, PLE)	140+ langs	Apache-2.0
omniasr	`facebook/omniASR-CTC-{300M,1B}`	wav2vec2 CNN + 24–48L transformer + CTC (more)	1600+	Apache-2.0
omniasr-llm	`omniASR-LLM-300M-v2`	Same encoder + 12L LLaMA decoder (more)	1600+	Apache-2.0
omniasr-llm	`omniASR-LLM-Unlimited-300M-v2`	Streaming: 15s segment protocol, unlimited audio (more)	1600+	Apache-2.0
vibevoice	`microsoft/VibeVoice-ASR`	σ-VAE ConvNeXt + Qwen2.5-7B (more)	50+	MIT
mimo-asr	`XiaomiMiMo/MiMo-V2.5-ASR`	6L transformer + 36L Qwen2 LM + RVQ codec (more)	Mandarin + dialects + English	MIT
funasr	`FunAudioLLM/Fun-ASR-Nano-2512`	70-block SANM encoder + 2-block Transformer adaptor + Qwen3-0.6B LLM	zh, yue, en, ja, ko	FunASR Model License v1.1 (commercial OK w/ attribution)
fun-asr-mlt-nano	`FunAudioLLM/Fun-ASR-MLT-Nano-2512`	Same architecture, multilingual decoder	31 langs incl. de, fr, es, pt, ru, ar, hi, vi, th, ko	FunASR Model License v1.1
sensevoice	`FunAudioLLM/SenseVoiceSmall`	70-block SANM encoder + CTC head; emits transcript + language ID + emotion + audio-event in one forward pass (non-AR, 15× faster than Whisper-Large); structured C ABI + `-oj` JSON expose the four tags as separate fields	50+ langs; native LID + emotion + audio-event tags	FunASR Model License v1.1

Text-to-Speech models

Synthesis backends, driven by the --tts flag and a --tts-output PATH.wav. See the dedicated Text-to-Speech section below for quick-start commands and engine selection guidance.

Backend	Models	Architecture	Languages	License
vibevoice-tts	`VibeVoice-Realtime-0.5B`, `VibeVoice-1.5B`	DPM-Solver++ + σ-VAE decoder; voice presets or cloning	en, zh	MIT
qwen3-tts	`Qwen3-TTS-12Hz-0.6B-Base`, `1.7B-Base`, `1.7B-VoiceDesign`	Qwen3 talker LM + 12 Hz RVQ (more)	multilingual	Apache-2.0
qwen3-tts-customvoice	`1.7B-CustomVoice`	Same talker + 9 premium built-in speakers (`--voice <name>`); optional style via `--instruct` (e.g. "spoke very slowly") (more)	multilingual	Apache-2.0
kokoro	`hexgrad/Kokoro-82M` + German backbones	StyleTTS2 / iSTFTNet (82M); per-voice GGUF (more)	en, es, fr, hi, it, ja, pt, zh, de	Apache-2.0
orpheus	`Orpheus-3B-FT` + `SNAC 24 kHz`	Llama-3.2-3B + SNAC RVQ codec; 8 speakers (more)	en, de	Llama / MIT
chatterbox	`cstr/chatterbox-GGUF` + turbo/kartoffelbox/lahgtna variants	T3 AR + S3Gen flow-matching (more)	en, de, ar	MIT
indextts	`cstr/indextts-1.5-GGUF`	GPT-2 AR (24L/1280d) + Conformer conditioning + BigVGAN vocoder; voice cloning via reference audio	zh, en	Apache-2.0
voxcpm2-tts	`cstr/voxcpm2-GGUF`	Tokenizer-free CFM diffusion AR (TSLM + RALM + LocDiT) at 48 kHz native; zero-shot + voice cloning via `--voice <wav>`	30 languages	Apache-2.0

Status: end-to-end runnable on both Q4_K and F16; zero-shot synth and voice cloning (--voice <wav>) both work and ASR-roundtrip correctly. The inference path matches the upstream PyTorch reference within precision floor (14 pass / 0 fail / 3 skip in crispasr-diff voxcpm2-tts, cfm_step0_result cos_mean ≥ 0.98). Set VOXCPM2_USE_GRAPH=1 for the fast path: weights load onto the best available backend (Metal on Apple Silicon), AR-step LocDiT + TSLM run as cached per-step cgraphs (qwen3-style multi-bucket Lk), and the VAE decode hot loops are OMP-parallelised with a SIMD-friendly contiguous-ic inner layout. On M1 (OMP=8), "Hello world" zero- shot Q4_K takes ~7 s total wall (down from ~49 s before the perf work).

Translation

Text-to-text translation, distinct from the audio-side --translate flag (which routes audio → English text on whisper / canary / etc.). Driven by --text "..." -sl <src> -tl <tgt>.

Backend	Models	Architecture	Languages	License
m2m100	`facebook/m2m100_418M`	12L enc + 12L dec transformer, SentencePiece 128K (more)	100 langs, any-to-any	MIT
m2m100-wmt21	`facebook/wmt21-dense-24-wide-en-x` + `facebook/wmt21-dense-24-wide-x-en`	Same as m2m100, scaled to 4.7B (24L enc) (more)	English ↔ 7 langs (separate `en-x` / `x-en` checkpoints)	MIT
madlad	`google/madlad400-3b-mt`	T5 enc-dec (12L+12L, d=2048, gated-GELU, RMSNorm) (more)	419 languages	Apache-2.0

# m2m100 base (production-ready)
./build/bin/crispasr --backend m2m100 -m auto \
    --text "Hello world, how are you today?" \
    -sl en -tl de
# → Hallo Welt, wie bist du heute?

# WMT21 dense (English ↔ X, 4.7B — auto-downloads ~2.5 GB).
# Two separate checkpoints: en-x for English-source, x-en for
# English-target. Pick the one matching your `-sl`/`-tl` direction
# (or pass an explicit `-m <path>` to load the other manually).
./build/bin/crispasr --backend m2m100-wmt21 -m auto \
    --text "The president said he would not attend." \
    -sl en -tl de   # uses wmt21-dense-24-wide-en-x

./build/bin/crispasr --backend m2m100-wmt21 \
    -m models/wmt21-dense-24-wide-x-en-q4_k.gguf \
    --text "Le président a dit qu'il ne serait pas présent." \
    -sl fr -tl en   # uses wmt21-dense-24-wide-x-en

# MADLAD-400 3B (419 languages, bit-token-identical to Python SP)
./build/bin/crispasr --backend madlad -m auto \
    --text "Hello world." \
    -sl en -tl ta

For 2-stage pipelines (e.g., ASR → m2m100), use the dedicated --tr-sl / --tr-tl flags; they fall back to -sl / -tl when unset, so single-stage standalone usage is just -sl/-tl.

Post-processing models

Work with all backends.

Model	Task	Architecture	Languages	License	HuggingFace
FireRedPunc	Punctuation restoration	BERT-base (12L, d=768), 5 classes	Chinese + English	Apache-2.0	`cstr/fireredpunc-GGUF`
fullstop-punc	Punctuation restoration	XLM-RoBERTa-large (24L, d=1024), 6 classes	EN, DE, FR, IT	MIT	`cstr/fullstop-punc-multilang-GGUF`
punctuate-all	Punctuation restoration	XLM-RoBERTa-base (12L, d=768), 6 classes	12 languages	MIT	`cstr/punctuate-all-GGUF`
PCS	Punc + truecase + SBD	XLM-RoBERTa-base (12L), 4 heads	47 languages	Apache-2.0	`--punc-model pcs`
truecaser‑lstm	German truecasing (best)	BiLSTM char-level (2×150, 3.2 MB, 97.9% F1)	German	Apache-2.0	`--truecase-model lstm`
truecaser‑crf	German truecasing	CRF + context features (8.5 MB)	German	MIT	`--truecase-model crf`
truecaser‑de	German truecasing (simple)	Statistical word-frequency (71K entries, 1.7 MB)	German	MIT	`--truecase-model auto`
CLD3	Text language ID	Embedding-bag → FC + ReLU → softmax (~1.5 MB F32)	109 ISO 639-1	Apache-2.0	`cstr/cld3-GGUF`
GlotLID-V3	Text language ID	fastText supervised, flat softmax	2102 ISO 639-3 + script	Apache-2.0	`cstr/glotlid-GGUF`
LID-176	Text language ID	fastText supervised, hierarchical softmax	176 ISO 639-1	CC-BY-SA-3.0	`cstr/fasttext-lid176-GGUF`

All runtimes share ggml-based inference. The speech-LLM backends (qwen3, voxtral, voxtral4b, granite, glm-asr, kyutai-stt) inject audio encoder frames directly into an autoregressive language model's input embeddings, instead of using a dedicated CTC/transducer/seq2seq decoder. The fastconformer-ctc backend hosts the NeMo FastConformer-CTC standalone ASR family — stt_en_fastconformer_ctc_{large,xlarge,xxlarge} and the architecturally-identical parakeet-ctc-{0.6b,1.1b} (different training data + tokenizer, same encoder + head shape) — with greedy CTC decoding. Same C++ runtime as the canary-ctc aligner.

Feature matrix

Run crispasr --list-backends to see it live. Each backend declares capabilities at runtime; if you ask for a feature the selected backend does not support, CrispASR prints a warning and silently ignores the flag.

Sortable / filterable view: docs/feature-matrix.html — click any column header to sort, type to filter rows, click cap pills to require a capability. Generated from crispasr --list-backends-json (single source of truth — drift impossible). Regenerate via python tools/gen-feature-matrix.py. A Markdown twin lives at docs/feature-matrix.md.

The static table below is a curated subset focusing on the ASR backends and the cross-cutting features that matter for ASR pipelines. The full 39-backend × 18-cap surface is in the generated views.

Feature	whisper	parakeet	canary	cohere	granite	granite‑4.1	voxtral	voxtral4b	qwen3	fc‑ctc	wav2vec2	glm‑asr	kyutai‑stt	firered	moonshine	moon‑stream	omniasr	omniasr‑llm	vibevoice	gemma4‑e2b	mimo‑asr	funasr	sensevoice
Native timestamps	✔	✔	✔	✔									✔
CTC timestamps			✔		✔	✔	✔	✔	✔	✔	✔	✔		✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
Word-level timing	✔	✔	✔	✔	`-am`	`-am`	`-am`	`-am`	`-am`	`-am`	`-am`	`-am`	✔	`-am`	`-am`	`-am`	`-am`	`-am`		`-am`	`-am`	`-am`	`-am`
Per-token confidence	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔		✔	✔			✔
Language auto-detect	✔	✔	LID	LID	LID	LID	LID	LID	✔	LID	LID	✔	LID	LID	LID	LID	LID	LID	LID	✔	LID	LID	✔
Speech translation	✔		✔		✔	✔	✔		✔
Speaker diarization	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
Grammar (GBNF)	✔
Temperature sampling	✔	✔	✔	✔	✔	✔	✔	✔	✔			✔	✔		✔		✔	✔	✔	✔	✔
Beam search	✔				✔	✔	✔		✔			✔	✔	✔	✔		✔	✔
Flash attention	✔	✔	✔	✔	✔	✔	✔	✔	✔			✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
Punctuation toggle		✔	✔	✔	✔	✔	✔	✔	✔			✔	✔		✔		✔	✔				✔	✔
Punc restoration	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp	pp
Source / target language			✔		✔	✔	✔		✔
Audio Q&A (`--ask`)					*	*	✔		*
Streaming	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
Auto-download (`-m auto`)	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
KV quant (`CRISPASR_KV_QUANT`, plus per-half `_K` / `_V`)					✔	✔	✔	✔	✔			✔						✔		✔	✔	✔
mmap weights (`CRISPASR_GGUF_MMAP`)		✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔	✔
TTS																			✔

The matrix above covers ASR backends. TTS-only backends (kokoro, qwen3-tts + variants, vibevoice-tts, orpheus + DE variants, chatterbox / chatterbox-turbo / kartoffelbox-turbo / lahgtna-chatterbox) all carry the TTS, AUTO_DOWNLOAD, TEMPERATURE, and FLASH_ATTN caps; per-backend cloning + voice-pack support is documented in the Text-to-Speech models table above and docs/tts.md. The vibevoice column marks the dual-mode (ASR + TTS) backend.

Key: ✔ = native/built-in, -am = via CTC forced aligner (-am canary-ctc-aligner.gguf or -am qwen3-forced-aligner.gguf), LID = via external language identification pre-step (-l auto), pp = via --punc-model post-processor (FireRedPunc or fullstop-punc), * = experimental or partial support. granite-4.1 covers both the regular and -plus variants; granite-4.1-nar is a non-autoregressive variant with encoder+projector only (no LLM decode features). The KV quant row marks backends that honor CRISPASR_KV_QUANT={f16,q8_0,q4_0} — CTC-style backends without a KV cache (parakeet, fc-ctc, wav2vec2, kyutai-stt, firered, moonshine variants, omniasr-CTC) don't apply. The same backends also honor the per-half CRISPASR_KV_QUANT_K / CRISPASR_KV_QUANT_V overrides (llama.cpp --cache-type-k / --cache-type-v parity) for asymmetric K-vs-V precision; common recipe K=q8_0 V=q4_0 saves ~40 % more KV memory than symmetric Q8_0. The mmap weights row marks backends consuming core_gguf::load_weights() and therefore honoring CRISPASR_GGUF_MMAP=1; whisper itself uses upstream's loader and is unaffected. See docs/cli.md Memory footprint for usage + recommended combos.

Speaker diarization as a post-processing step via --diarize:

energy / xcorr — stereo-only, no extra deps
pyannote — native GGUF (no Python, no sherpa-onnx); add --diarize-embedder auto (TitaNet) or --diarize-embedder indextts (ECAPA-TDNN) for globally stable speaker IDs across long files
sherpa / ecapa — external sherpa-onnx subprocess
vad-turns — mono-friendly gap-based proxy

Full reference + tuning knobs (cluster threshold, max speakers, pluggable embedder adapters): see docs/cli.md#diarization.

Language identification for backends without native LID: --lid-backend whisper (default, 75 MB ggml-tiny.bin), --lid-backend silero (native GGUF, 16 MB, 95 languages), or --lid-backend firered (FireRedLID, 1.7 GB, 120 languages — Conformer encoder + Transformer decoder).

Voice activity detection: --vad uses the default Silero VAD (~885 KB, auto-downloaded). Each VAD segment is transcribed independently, producing separate SRT/VTT entries with correct timestamps. Use --vad --split-on-punct for best subtitle output. Four VAD backends: Silero (default), FireRedVAD (-vm firered, recommended), MarbleNet (-vm marblenet, 439 KB, 6 languages), Whisper-VAD-EncDec (-vm whisper-vad, experimental).

Punctuation restoration (--punc-model): CTC-based backends output lowercase without punctuation. Named shortcuts: auto/firered (Chinese+English), fullstop (EN/DE/FR/IT, XLM-R-large), punctuate-all (12 languages, XLM-R-base), pcs (47 languages, punc + truecasing + sentence boundary detection in one model). Or pass a GGUF path directly. Also available via Python/Rust/Dart wrappers (crispasr.PuncModel).

Truecasing (--truecase-model): Restore German noun/name capitalization in lowercase ASR output. Three options in ascending quality: auto (statistical, 9 MB), crf (CRF with context, 24 MB), lstm (BiLSTM char-level, 3.2 MB, recommended — 97.9% F1, handles adjective/noun distinction and formal "Ihnen"). All auto-download from cstr/truecaser-de. Or use --punc-model pcs for neural punc + truecasing in one pass (47 languages).

Which backends produce punctuation natively?

Backend	Punctuation	Capitalization	Notes
whisper	✔	✔	Full punctuation and casing
parakeet	✔	✔
canary	✔	✔
cohere	✔	✔	Toggleable via `--no-punctuation`
granite	✔	✔	LLM output
voxtral	✔	✔	LLM output
voxtral4b	✔	✔	LLM output
qwen3	✔	✔	LLM output
funasr	✔	✔	LLM output (Qwen3-0.6B decoder). Chinese chars carry full-width period; mlt-nano variant adds Latin-script casing + punctuation.
sensevoice	✔	✔	CTC output with native ITN — toggle via `--punctuation` / `--no-punctuation`, controls Arabic-digit vs spelled-out numerals + comma/period emission.
glm-asr	✔	✔	LLM output
kyutai-stt	✔	✔	LLM output
moonshine	✔	✔	Encoder-decoder output
fastconformer-ctc	no	no	CTC — add `--punc-model`
wav2vec2	no	no	CTC — add `--punc-model`
firered-asr	no	no	CTC — add `--punc-model`
omniasr (CTC)	no	no	CTC — add `--punc-model`
omniasr (LLM)	✔	✔	Autoregressive decoder

Other freely-licensed alternatives that could be added: felflare/bert-restore-punctuation (MIT, English, includes truecasing), xashru/punctuation-restoration (Apache-2.0, 40+ languages, BiLSTM-CRF).

Progressive subtitle output (--flush-after): By default, non-whisper backends buffer all segments and print output at the end. For real-time subtitle consumption (PotPlayer, custom media players), use --flush-after 1 to print each SRT entry to stdout immediately after its VAD segment is transcribed:

crispasr --backend parakeet -m parakeet.gguf --vad --flush-after 1 -osrt -f long_audio.wav
# SRT entries appear progressively as each segment finishes

JSON output with language detection: When using -l auto -oj, the JSON output includes detected language info:

{
  "crispasr": {
    "backend": "cohere",
    "language": "en",
    "language_detected": "en",
    "language_confidence": 0.977,
    "language_source": "ecapa"
  },
  "transcription": [...]
}

Which backend should I pick?

Need	Pick
Battle-tested, all features exposed	whisper
Lowest English WER	cohere
Fastest (16x realtime on CPU)	moonshine (tiny), fc-ctc (10x)
Multilingual + word timestamps + fast	parakeet (2.9x RT)
Multilingual with explicit language control	canary
Speech translation (X→en or en→X)	canary, voxtral, qwen3
30 languages + Chinese dialects	qwen3
1600+ languages	omniasr (CTC or LLM)
Realtime streaming ASR (native incremental encoder, ~2× RT feed; sub-second-token target deferred to phase 2)	voxtral4b
Highest-quality offline speech-LLM	voxtral
Apache-licensed speech-LLM	granite, voxtral, qwen3, omniasr-llm
Lightweight CTC-only (fast, no decoder)	wav2vec2, fc-ctc, data2vec, omniasr
Mandarin + Chinese dialects	firered-asr, qwen3, glm-asr, funasr, sensevoice
Multilingual (31 langs) speech-LLM	fun-asr-mlt-nano, qwen3, omniasr-llm, gemma4-e2b
Multilingual (50+ langs) + LID + emotion + audio-event in one pass	sensevoice (encoder-only CTC, non-AR, 15× faster than Whisper-Large)

Language detection for backends that don't do it natively

Cohere, canary, granite, voxtral and voxtral4b need an explicit language code up front. If you don't know the language, pass -l auto and crispasr runs an optional LID pre-step before the main transcribe() call:

# Downloads ggml-tiny.bin (75 MB, 99 languages) on first use
crispasr --backend cohere -m $TC/cohere-transcribe-q5_0.gguf \
         -f unknown.wav -l auto
# crispasr[lid]: detected 'en' (p=0.977) via whisper-tiny
# crispasr: LID -> language = 'en' (whisper, p=0.977)

These LID providers are available:

--lid-backend whisper (default) — uses a small multilingual ggml-*.bin model via the crispasr C API. Auto-downloads ~75 MB on first use. 99 languages.
--lid-backend silero — native GGUF port of Silero's 95-language classifier. 16 MB F32, pure C++. Faster and smaller than whisper-tiny but slightly less accurate on long audio (>20s).
--lid-backend ecapa — recommended: ECAPA-TDNN (Apache-2.0). Purpose-built for language ID. Very high accuracy on TTS benchmark. Two variants via --lid-model:
- cstr/ecapa-lid-107-GGUF — VoxLingua107, 43 MB F16, 107 languages, ISO codes (en, de, ...). Default.
- cstr/ecapa-lid-commonlanguage-GGUF — CommonLanguage, 40 MB F16, 45 languages, full names (English, German, ...).
--lid-backend firered — FireRedLID (Conformer encoder + Transformer decoder). Q4_K (544 MB), 120 languages including Chinese dialects. Slower but covers more languages.

These VAD providers are available:

Silero VAD (default) — ~885 KB, auto-downloaded via --vad. Industry-standard, well-tested.
FireRedVAD — DFSMN-based, 2.4 MB, F1=97.57%. Pass --vad -vm firered to auto-download. Recommended.
MarbleNet — NVIDIA 1D separable CNN, 439 KB, 6 languages (EN/DE/FR/ES/RU/ZH). Pass --vad -vm marblenet to auto-download. Smallest model. (cstr/marblenet-vad-GGUF)
Whisper-VAD-EncDec (experimental) — Whisper-base encoder + TransformerDecoder head, 22 MB Q4_K. Trained on Japanese ASMR; may not generalise well to all domains. Pass --vad -vm whisper-vad. Slower than others (~1s vs ~50ms). (cstr/whisper-vad-encdec-asmr-GGUF)

Pass --lid-backend off to skip LID entirely.

Text language identification (post-ASR / standalone)

Audio LID (above) tags what was spoken; text LID tags what was written. Text LID runs on a transcript or any UTF-8 string and is useful for routing post-ASR pipelines (translation, punctuation, sub selection) without re-running an audio model. Three GGUF families, one binary — the dispatcher picks by general.architecture:

Backend	Labels	Size (F16)	License	HF repo
CLD3 (Google compact language detector v3)	109 ISO 639-1	440 KB	Apache-2.0	`cstr/cld3-GGUF`
GlotLID-V3 (cis-lmu fastText)	2102 ISO 639-3 + script	250 MB	Apache-2.0	`cstr/glotlid-GGUF`
LID-176 (Facebook fastText)	176 ISO 639-1	63 MB	CC-BY-SA-3.0¹	`cstr/fasttext-lid176-GGUF`

¹ LID-176 is CC-BY-SA-3.0 (viral) — redistributors of the GGUF inherit ShareAlike. CLD3 + GlotLID-V3 are Apache-2.0 with no such constraint. Pick CLD3 for the smallest, fastest path; GlotLID for maximum coverage (low-resource languages); LID-176 only if you need its specific 176-label space and accept the SA obligation.

Standalone CLI — auto-routes by GGUF arch, with auto-download:

crispasr-lid -m auto --text "Bonjour le monde"        # → cstr/cld3-GGUF (default, ~440 KB)
crispasr-lid -m auto:glotlid --text "Bonjour le monde" -k 5
crispasr-lid -m auto:lid-fasttext176 --text "Hallo Welt"
# Or pass an explicit path / canonical filename (looked up in the registry):
crispasr-lid -m cld3-f16.gguf --text "你好世界"
# zh	0.997816
echo "Привет мир" | crispasr-lid -m auto --quiet
# ru	0.907322

Post-ASR pipeline — --lid-on-transcript runs the same dispatcher on the assembled transcript (also accepts auto[:variant]):

crispasr -m ggml-tiny.bin -f speech.wav --lid-on-transcript auto
# (transcript on stdout)
# lang=de	conf=0.997123	backend=lid-cld3

The dispatcher (src/text_lid_dispatch.{h,cpp}) is a thin C ABI façade — one integer compare per call; per-stage diff harness is green at cos≥0.999 across 8 multilingual smoke samples.

Install & build

git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

Produces build/bin/crispasr (main CLI), build/bin/crispasr-quantize, and build/bin/crispasr-diff. No Python, PyTorch, or pip required at runtime — just a C++17 compiler and CMake 3.14+.

For GPU acceleration, add the matching ggml flag at configure time:

cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON     # NVIDIA
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL=ON    # Apple Silicon
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_VULKAN=ON   # cross-vendor

See docs/install.md for the full guide: all GPU backends (CUDA / Metal / Vulkan / MUSA / SYCL), Windows convenience scripts, ffmpeg ingestion, optional BLAS, glibc notes, and the scripts/dev-build.sh wrapper.

Quick start

ASR examples below; for TTS see the Text-to-Speech section.

Whisper (historical path, byte-identical to upstream whisper.cpp)

# Download a whisper model (same as upstream whisper.cpp)
./models/download-ggml-model.sh base.en

./build/bin/crispasr -m models/ggml-base.en.bin -f samples/jfk.wav
# [00:00:00.000 --> 00:00:07.940]   And so my fellow Americans ask not what your country can do for you
# [00:00:07.940 --> 00:00:10.760]   ask what you can do for your country.

Parakeet (multilingual, free word timestamps, fastest)

# Grab the quantized model (~467 MB)
curl -L -o parakeet.gguf \
    https://huggingface.co/cstr/parakeet-tdt-0.6b-v3-GGUF/resolve/main/parakeet-tdt-0.6b-v3-q4_k.gguf

./build/bin/crispasr -m parakeet.gguf -f samples/jfk.wav
# Auto-detected backend 'parakeet' from GGUF metadata.
# And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country.

# Word-level timestamps (one line per word)
./build/bin/crispasr -m parakeet.gguf -f samples/jfk.wav -ml 1

Canary (explicit language, speech translation)

# Transcription (source == target)
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -sl de -tl de

# Translation (German speech → English text)
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -sl de -tl en

# ...or use the familiar crispasr flag:
./build/bin/crispasr --backend canary -m canary-1b-v2-q5_0.gguf -f audio.de.wav -l de --translate

Voxtral (speech-LLM with auto-download)

# First run downloads ~2.5 GB to ~/.cache/crispasr/ via curl, then runs
./build/bin/crispasr --backend voxtral -m auto -f samples/jfk.wav

# Subsequent runs use the cached file
./build/bin/crispasr --backend voxtral -m auto -f samples/jfk.wav -l en

Qwen3-ASR (30 languages + Chinese dialects)

./build/bin/crispasr --backend qwen3 -m auto -f audio.zh.wav

MiMo-V2.5-ASR (Mandarin + dialects + English, 7.5B Qwen2 LM)

# Download the LM + audio tokenizer (the tokenizer is a separate model)
huggingface-cli download cstr/mimo-asr-GGUF mimo-asr-q4_k.gguf \
    --local-dir ~/.cache/crispasr
huggingface-cli download cstr/mimo-tokenizer-GGUF mimo-tokenizer-q4_k.gguf \
    --local-dir ~/.cache/crispasr

# Transcribe (auto-discovers tokenizer if it sits next to the LM)
./build/bin/crispasr \
    --backend mimo-asr \
    -m ~/.cache/crispasr/mimo-asr-q4_k.gguf \
    --codec-model ~/.cache/crispasr/mimo-tokenizer-q4_k.gguf \
    -f samples/jfk.wav
# Output: And so, my fellow Americans, ask not what your country can do
# for you. Ask what you can do for your country.

The 4.5 GB Q4_K is the recommended quant; F16 (14.9 GB) needs ~16 GB RAM during inference. JFK matches the upstream Python MimoAudio.asr_sft reference verbatim; performance on M1+Metal is ~0.3× realtime (Q4_K dequant per step is the bottleneck — F16 + KV-reuse follow-ups are queued under PLAN #51a/b/c).

Wav2Vec2 (lightweight CTC, any HF Wav2Vec2ForCTC model)

# English (Q4_K quantized, 212 MB — 6x smaller than F16)
curl -L -o wav2vec2-en-q4k.gguf \
    https://huggingface.co/cstr/wav2vec2-large-xlsr-53-english-GGUF/resolve/main/wav2vec2-xlsr-en-q4_k.gguf

./build/bin/crispasr -m wav2vec2-en-q4k.gguf -f samples/jfk.wav
# and so my fellow americans ask not what your country can do for you ask what you can do for your country

# German
curl -L -o wav2vec2-de-q4k.gguf \
    https://huggingface.co/cstr/wav2vec2-large-xlsr-53-german-GGUF/resolve/main/wav2vec2-xlsr-de-q4_k.gguf

./build/bin/crispasr -m wav2vec2-de-q4k.gguf -f audio.de.wav

# Convert any HuggingFace Wav2Vec2ForCTC model:
python models/convert-wav2vec2-to-gguf.py \
    --model-dir jonatasgrosman/wav2vec2-large-xlsr-53-german \
    --output wav2vec2-de.gguf --dtype f32
# Then optionally quantize:
./build/bin/crispasr-quantize wav2vec2-de.gguf wav2vec2-de-q4k.gguf q4_k

Streaming, TTS, and HTTP server

CrispASR has three feature areas that warrant their own docs pages:

Streaming & live transcription — --stream, --mic, --live, sliding-window chunking, per-token confidence.
Text-to-Speech (TTS) — Kokoro (multilingual, smallest), Qwen3-TTS (highest fidelity, voice cloning), VibeVoice (lowest-latency streaming), Orpheus (3 B Llama + SNAC), Chatterbox (flow-matching + HiFT vocoder, German via Kartoffelbox). Voice packs, language routing, and qwen3-tts environment switches.
Server mode (HTTP API) — persistent model, OpenAI-compatible /v1/audio/transcriptions (ASR) and /v1/audio/speech + /v1/voices (TTS, automatic on any loaded CAP_TTS backend), per-request voice + speed + instructions, CORS, long-form sentence chunking, API keys, Docker Compose, prebuilt CUDA images.

Quickest taste of each:

# Streaming from microphone
crispasr --mic -m model.gguf

# TTS via auto-downloaded VibeVoice (~636 MB on first run)
crispasr --backend vibevoice-tts -m auto --tts "Hello world" --tts-output hello.wav

# Persistent HTTP server, OpenAI-compatible
crispasr --server -m model.gguf --port 8080
curl -F "file=@audio.wav" http://localhost:8080/v1/audio/transcriptions

# TTS over HTTP — load a TTS backend, hit /v1/audio/speech
crispasr --server --backend qwen3-tts-customvoice -m auto --voice-dir ./voices --port 8080
curl http://localhost:8080/v1/audio/speech \
  -H 'Content-Type: application/json' \
  -d '{"input":"Hello world","voice":"vivian"}' -o out.wav

CLI reference

Common flags:

crispasr -m auto --backend parakeet -f audio.wav --vad -osrt --split-on-punct

Flag	Meaning
`-m FNAME` / `--backend NAME`	Model path (or `auto`) and forced backend
`-f FNAME`	Input audio (repeatable; positional accepted)
`--vad`	Silero VAD chunking — strongly recommended for multi-minute audio
`-osrt` / `-ovtt` / `-otxt` / `-oj` / `-ojf`	Output formats (also `-ocsv`, `-olrc`)
`-am FNAME`	CTC aligner GGUF for word-level timestamps on LLM backends
`-tp F` / `-bs N`	Sampling temperature / beam search width
`-l auto` / `--detect-language`	LID pre-step for backends without native lang detect
`-ck N`	Fallback chunk size when VAD is off (default 30 s)
`--list-backends`	Print the capability matrix and exit

See docs/cli.md for the full reference: every flag, VAD details, CTC alignment workflow, output JSON layout, the auto-download registry, and supported audio formats. See docs/bindings.md for Python / Rust / Dart / Go / Java / Ruby / mobile.

Architecture, contributing, regression matrix

CrispASR is structured as a stable C-ABI in src/ (every algorithm: VAD, diarize, LID, alignment, cache, registry) consumed by all language wrappers, with thin presentation layers in examples/cli/. Per-model runtimes live in src/{whisper,parakeet,canary,...}.cpp, sharing primitives from src/core/ (mel, ffn, attention, GGUF loader, FastConformer / Conformer / Granite-LLM blocks, etc.).

docs/architecture.md — full layered layout, file-by-file tour of src/ and examples/cli/, per-backend internals table, regression discipline.
docs/contributing.md — adding a new backend in five files, clang-format-18 setup, the crispasr-diff PyTorch-ground-truth workflow, and the TTS audio-cosine-vs-reference regression target.
docs/regression-matrix.md — tools/test-all-backends.py capability tiers, cache modes (keep / ephemeral), --skip-missing for CI.

For benchmarks see PERFORMANCE.md; for the session-by-session port log and the bug-class lessons, see LEARNINGS.md.

Quantize models

build/bin/crispasr-quantize is a single, model-agnostic GGUF re-quantization tool that works across all supported model families (Whisper, Parakeet, Canary, Cohere, Voxtral, Qwen3, Granite, Wav2Vec2, MiMo-ASR, GLM-ASR, Moonshine, VibeVoice, Kokoro, Qwen3-TTS, …):

./build/bin/crispasr-quantize input.gguf output.gguf q4_k

See docs/quantize.md for the full guide: supported quant types, K-quant alignment fallback, recommended quant per backend, and worked examples for each architecture.

GPU backend selection

All backends use ggml_backend_init_best() which automatically picks the highest-priority compiled backend: CUDA > Metal > Vulkan > CPU. To force a specific backend:

# Force Vulkan even when CUDA is available
crispasr --gpu-backend vulkan -m model.gguf -f audio.wav

# Pin a specific GPU (useful on Vulkan systems with iGPU + dGPU)
crispasr --gpu-backend vulkan -dev 1 -m model.gguf -f audio.wav

# Force CPU (useful for benchmarking)
crispasr -ng -m model.gguf -f audio.wav

# CUDA unified memory (swap to RAM when VRAM exhausted)
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 crispasr -m model.gguf -f audio.wav

Build flags: -DGGML_CUDA=ON, -DGGML_METAL=ON, -DGGML_VULKAN=ON.

Notes:

--gpu-backend vulkan selects the Vulkan backend, but it does not choose which physical GPU to use. Use -dev N to select the Vulkan device index.
On some Windows laptops, Vulkan device 0 is the Intel iGPU and the NVIDIA GPU is 1. If Vulkan looks unexpectedly slow, rerun with -dev 1.
The Windows convenience script build-vulkan.bat creates a separate Vulkan-capable binary at build-vulkan\bin\crispasr.exe.

Debugging & profiling

For most backends, -v / --verbose surfaces per-stage timings and device picks. For headless / library use (where the CLI flag isn't plumbed through), set CRISPASR_VERBOSE=1 instead.

# Per-stage timing breakdown (mel / encoder / prefill / decode):
crispasr -v --backend gemma4-e2b -m model.gguf -f audio.wav
# gemma4_e2b: mel 128x1099 (17.2 ms)
# gemma4_e2b: encoder done: 1536x275 (719.0 ms)
# gemma4_e2b: prefill done, first_token=3133 (1464.0 ms)
# gemma4_e2b: decoded 25 tokens (7748.3 ms total)
# crispasr: transcribed 11.0s audio in 7.75s (1.4x realtime)

# Hugging Face access for gated models (Voxtral, Gemma4-E2B, …):
HF_TOKEN=hf_xxx crispasr -m auto --backend gemma4-e2b -f audio.wav

The server has its own auth env: CRISPASR_API_KEYS (see Server mode).

Per-backend debug / bench / dump-dir env vars (developer)

These are useful when porting a new backend or chasing a regression. The *_BENCH=1 toggles emit per-stage timings even without -v; the *_DEBUG=1 toggles emit per-step diagnostic prints; the *_DUMP_DIR= paths write per-stage F32 tensors for diff-testing against a PyTorch reference (see Debug a new backend against PyTorch ground truth).

Env var	Purpose
`CRISPASR_VERBOSE=1`	Forces verbose mode for any backend (parallel to the `-v` flag).
`CRISPASR_DUMP_DIR=path/`	Generic per-stage F32 tensor dump for the `crispasr-diff` harness.
`GEMMA4_E2B_BENCH=1`	Per-stage timings for the Gemma-4-E2B backend.
`COHERE_BENCH=1` / `COHERE_DEBUG=1`	Cohere transcribe per-stage timings / per-step diagnostics.
`COHERE_PROF=1`	Cohere graph-level profiling (per-op timings).
`COHERE_THREADS=N`	Override thread count for the Cohere backend.
`COHERE_DEVICE=cpu\|cuda\|metal\|vulkan`	Force the Cohere backend onto a specific device.
`COHERE_DUMP_ATTN=path/`	Dump attention activations for Cohere (used by the diff harness).
`FIRERED_BENCH=1`	Per-stage timings for the FireRedASR backend.
`FIREREDPUNC_DEBUG=1`	Per-step diagnostics for the FireRed punctuation post-step.
`MOONSHINE_STREAMING_BENCH=1`	Per-stage timings for moonshine-streaming.
`OMNIASR_BENCH=1` / `OMNIASR_DEBUG=1` / `OMNIASR_DUMP_DIR=`	OmniASR per-stage timings, diagnostics, and stage dumps.
`PARAKEET_DEBUG=1`	Parakeet TDT per-step diagnostics (joint network, blank-id sanity).
`QWEN3_TTS_BENCH=1` / `QWEN3_TTS_DEBUG=1` / `QWEN3_TTS_DUMP_DIR=`	Qwen3-TTS per-stage timings, diagnostics, and stage dumps.
`VIBEVOICE_BENCH=1` / `VIBEVOICE_DEBUG=1` / `VIBEVOICE_DUMP_DIR=`	VibeVoice ASR per-stage timings, diagnostics, and stage dumps.
`VIBEVOICE_REF_FEATURES=path`	Replace the live encoder with a saved feature tensor (regression harness).
`VIBEVOICE_TTS_DUMP=path/`	VibeVoice TTS per-stage dumps (token IDs, base/TTS hidden, neg condition, frame-0 noise/v_cfg/latent/acoustic_embed) for the diff harness.
`VIBEVOICE_TTS_DUMP_PERFRAME=1`	Per-frame VibeVoice TTS dumps written as `perframe_<stage>_f<NNN>.bin`. Pair with `VIBEVOICE_TTS_DUMP=path/` and `VIBEVOICE_TTS_NOISE=path` for stage-by-stage AR diff against `tools/run_official_vibevoice.py`.
`VIBEVOICE_TTS_TRACE=1`	Extra one-line traces (negative-condition prefill rms, scaling/bias factors loaded). Same effect as `-vv`.
`VIBEVOICE_VOICE_AUDIO=path.wav`	Reference voice WAV for 1.5B-base TTS without a `.gguf` voice cache.
`VIBEVOICE_TTS_NOISE=path`	Override the per-frame Gaussian init noise. Flat little-endian float32 `[N_frames, vae_dim]` — typically the `noise.bin` written by `tools/run_official_vibevoice.py`.
`VIBEVOICE_VAE_BACKEND=cpu\|metal\|cuda\|vulkan`	Pin the VAE decoder onto a specific backend.
`WAV2VEC2_BENCH=1` / `WAV2VEC2_VERBOSE=1` / `WAV2VEC2_DUMP_DIR=`	wav2vec2 per-stage timings, verbose graph traces, and stage dumps.
`CRISPASR_VOXTRAL4B_STREAM_TIMING=1`	Per-stage timings for the voxtral4b streaming path (encoder drain / prefill / first-text-token / decode-step p50/p95).
`CRISPASR_VOXTRAL4B_STREAM_CHUNK_MS=N`	Override the internal encoder chunk size (default 240 ms). Must be a multiple of 80 ms. Larger = faster feed (kernel-launch amortisation), longer live-caption latency floor.
`CRISPASR_VOXTRAL4B_STREAM_BATCH_ENCODER=1`	Regression-debug: ignore the streaming encoder's audio_embeds and re-run the whole batch encoder at flush.
`CRISPASR_VOXTRAL4B_STREAM_DEBUG=1` / `CRISPASR_VOXTRAL4B_STREAM_DIFF=1`	Per-step decode prints / side-by-side encoder cosine vs the batch encoder.
`CRISPASR_VOXTRAL4B_STREAM_LIVE=1`	Live-captions decode-during-feed (PLAN #7 phase 3). `get_text()` polled during feed returns progressive transcript. Default OFF (PTT semantics). Wrappers: Python `Session.stream_open(live=True)`, Rust `stream_open_ex(.., live: true)`.
`CRISPASR_VOXTRAL4B_STREAM_DECODER_THREAD=1`	Decoder worker thread (PLAN #7 phase 4, implies live mode). Lets `feed()` return between encoder chunks without waiting for the decode loop — useful for mic-driven workloads. On M1 the Metal queue serializes encoder and decoder so total wall-clock is unchanged; faster GPUs with kernel-level parallelism see real overlap.
`CRISPASR_VOXTRAL4B_FUSED_QKV=0`	Opt out of the runtime fused-QKV LLM path (default-on, ~7-8 % decode speedup on M1 Q4_K, ~500 MB extra memory).
`CRISPASR_QWEN3_ASR_FUSED_QKV=0`	Opt out of the runtime fused-QKV LLM path for qwen3-asr (default-on; works on F16/F32/Q4_K/Q8_0/...).
`CRISPASR_VOXTRAL_FUSED_QKV=1`	Opt in to the runtime fused-QKV LLM path for voxtral 3B. Off by default (no measurable speedup on JFK-shape decodes; useful for long-form workloads where decode dominates).
`QWEN3_TTS_FUSED_QKV=1`	Opt in to the runtime fused-QKV talker path.
`GRANITE_DISABLE_ENCODER_GRAPH=1`	Force the granite-speech / -plus / -nar encoder back to the per-layer CPU loop (slower but kept around for debugging). The single ggml-graph encoder with per-layer Shaw RPE is the default and is bit-near-identical to the CPU loop while being ~2× faster end-to-end across all three variants.
`CRISPASR_NO_REL_POS=1`	Ablate the relative-position bias in the Gemma-4 audio encoder (development only).
`ECAPA_REF_FBANK=path`	Reference filterbank tensor for the ECAPA-TDNN LID model (regression harness).
`CRISPASR_SHERPA_LID_BIN=path`	Override the auto-detected sherpa-onnx LID binary.
`CRISPASR_ARG_DEVICE=N`	Default GPU device index when `-dev` isn't passed.
`GGML_CUDA_ENABLE_UNIFIED_MEMORY=1`	Let CUDA swap to RAM when VRAM is exhausted.
`GGML_VK_VISIBLE_DEVICES` / `CUDA_VISIBLE_DEVICES`	Standard ggml/CUDA device-visibility filters.

HF_TOKEN and HUGGING_FACE_HUB_TOKEN are both honoured for gated-model downloads (in that order).

Credits

whisper.cpp — the original ggml inference engine and Whisper runtime this fork is built on
ggml — the tensor library everything runs on
NVIDIA NeMo — parakeet-tdt, parakeet-ctc-{0.6b,1.1b}, canary-1b-v2, canary-ctc aligner, and the FastConformer-CTC family (stt_en_fastconformer_ctc_{large,xlarge,xxlarge})
Cohere — cohere-transcribe-03-2026
Qwen team (Alibaba) — Qwen3-ASR-0.6B, Qwen3-ASR-1.7B, Qwen3-ForcedAligner-0.6B
Mistral AI — Voxtral Mini 3B and 4B Realtime
IBM Granite team — Granite Speech 3.2-8b, 3.3-2b, 3.3-8b, 4.0-1b
Meta / wav2vec2 — wav2vec2 CTC models (XLSR-53 English, German, multilingual via any Wav2Vec2ForCTC checkpoint)
sherpa-onnx — optional diarization via subprocess (ONNX models)
Silero — VAD (native GGUF) and language identification (native GGUF, 95 languages)
pyannote — speaker diarization segmentation (native GGUF port)
miniaudio and stb_vorbis — embedded audio decoders
Claude Code (Anthropic) — significant portions of the crispasr integration layer, all model converters, and the FastConformer/attention/mel/FFN/BPE core helpers were co-authored with Claude

License

Same as upstream whisper.cpp: MIT.

Per-model weights are covered by their respective HuggingFace model licenses (see Supported backends). The crispasr binary itself links model runtimes that are mostly permissively licensed (MIT / Apache-2.0 / CC-BY-4.0 for weights).

Name		Name	Last commit message	Last commit date
Latest commit History 1,907 Commits
.devops		.devops
.github/workflows		.github/workflows
Testing/Temporary		Testing/Temporary
bindings		bindings
ci		ci
cmake		cmake
crisp_audio		crisp_audio
crispasr-sys		crispasr-sys
crispasr		crispasr
docs		docs
examples		examples
flutter/crispasr		flutter/crispasr
ggml		ggml
grammars		grammars
hf-space		hf-space
hf_readmes		hf_readmes
include		include
models		models
python		python
ref		ref
samples		samples
scripts		scripts
src		src
tests		tests
tools		tools
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
ARCHITECTURE.md		ARCHITECTURE.md
AUDIT.md		AUDIT.md
AUTHORS		AUTHORS
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
COMPARISON.md		COMPARISON.md
HANDOVER_INDEXTTS.md		HANDOVER_INDEXTTS.md
HISTORY.md		HISTORY.md
LEARNINGS.md		LEARNINGS.md
LICENSE		LICENSE
Makefile		Makefile
PERFORMANCE.md		PERFORMANCE.md
PLAN.md		PLAN.md
README.md		README.md
README_sycl.md		README_sycl.md
TODO.md		TODO.md
UPSTREAM.md		UPSTREAM.md
VERSION		VERSION
build-android.sh		build-android.sh
build-ios.sh		build-ios.sh
build-vulkan.bat		build-vulkan.bat
build-windows.bat		build-windows.bat
build-xcframework.sh		build-xcframework.sh
docker-compose.cuda.yml		docker-compose.cuda.yml
docker-compose.yml		docker-compose.yml
pytest.ini		pytest.ini

Folders and files

Latest commit

History

Repository files navigation

CrispASR

Ecosystem

Table of contents

Supported backends

ASR backends

Text-to-Speech models

Translation

Post-processing models

Feature matrix

Which backend should I pick?

Language detection for backends that don't do it natively

Text language identification (post-ASR / standalone)

Install & build

Quick start

Whisper (historical path, byte-identical to upstream whisper.cpp)

Parakeet (multilingual, free word timestamps, fastest)

Canary (explicit language, speech translation)

Voxtral (speech-LLM with auto-download)

Qwen3-ASR (30 languages + Chinese dialects)

MiMo-V2.5-ASR (Mandarin + dialects + English, 7.5B Qwen2 LM)

Wav2Vec2 (lightweight CTC, any HF Wav2Vec2ForCTC model)

Streaming, TTS, and HTTP server

CLI reference

Architecture, contributing, regression matrix

Quantize models

GPU backend selection

Debugging & profiling

Credits

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 42

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages