Summary
Haystack currently supports Whisper (local + remote) for audio transcription. FunASR (16.5K stars, MIT) would be a valuable addition as an alternative audio component — significantly faster, with built-in speaker diarization and emotion detection.
Why FunASR for Haystack pipelines
| | WhisperLocal | FunASR |
|---|---|---|
| GPU speed | 13x realtime | 170x realtime |
| CPU speed | ❌ Too slow | 17x realtime |
| Speaker diarization | ❌ | ✅ built-in |
| Emotion detection | ❌ | ✅ |
| Timestamps | ✅ | ✅ word-level |
| Languages | 57 | 50+ |
| API mode | OpenAI-compatible | ✅ OpenAI-compatible |
For RAG pipelines that index audio/video content, the speed difference is critical: going by the GPU figures above (170x vs. 13x realtime), processing a corpus of meeting recordings or podcast episodes is roughly 13x faster.
Speaker diarization is also valuable for RAG: knowing who said something enables filtering by speaker in retrieval.
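To make the speaker-filtering idea concrete, here is a minimal sketch in plain Python. The `Segment` class is only a stand-in for a Haystack `Document`, and the `speaker` / `start` metadata keys and the `spk_0`-style labels are assumptions for illustration, not something FunASR or Haystack prescribes.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Stand-in for a Haystack Document: text plus metadata."""
    content: str
    meta: dict = field(default_factory=dict)


def segments_by_speaker(segments, speaker):
    """Keep only the segments attributed to the given speaker label."""
    return [seg for seg in segments if seg.meta.get("speaker") == speaker]


# Hypothetical diarized transcript of one meeting recording.
transcript = [
    Segment("Let's review the Q3 roadmap.", {"speaker": "spk_0", "start": 0.0}),
    Segment("The search feature slipped by two weeks.", {"speaker": "spk_1", "start": 4.2}),
    Segment("Then we should move the launch date.", {"speaker": "spk_0", "start": 9.8}),
]

from_spk_0 = segments_by_speaker(transcript, "spk_0")
```

In a real pipeline the same selection would presumably be expressed as a metadata filter on the retriever, e.g. `{"field": "meta.speaker", "operator": "==", "value": "spk_0"}` in Haystack 2.x filter syntax, rather than as a Python list comprehension.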
Integration approach
FunASR already exposes an OpenAI-compatible API:
funasr-server --device cuda
# POST /v1/audio/transcriptions
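As a rough sketch of what calling that endpoint could look like from Python, using only the standard library: the base URL and port, the `model` form field, and the `text` key in the JSON response are all assumptions borrowed from the OpenAI transcription API shape, so they may need adjusting against the actual FunASR server. The function names are hypothetical.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request, urlopen

# Assumptions: host/port and model name depend on how the server was started.
DEFAULT_BASE_URL = "http://localhost:8000"
DEFAULT_MODEL = "iic/SenseVoiceSmall"


def build_transcription_request(audio_path, base_url=DEFAULT_BASE_URL, model=DEFAULT_MODEL):
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    path = Path(audio_path)
    boundary = uuid.uuid4().hex

    # Form field carrying the model name.
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode("utf-8")

    # Form field carrying the raw audio bytes.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    closing = f"\r\n--{boundary}--\r\n".encode("utf-8")

    return Request(
        url=f"{base_url.rstrip('/')}/v1/audio/transcriptions",
        data=model_part + file_header + path.read_bytes() + closing,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(audio_path, **kwargs):
    """Send the request and return the transcript (assumes a JSON body with a 'text' key)."""
    with urlopen(build_transcription_request(audio_path, **kwargs)) as response:
        return json.load(response)["text"]
```

Keeping request construction separate from the network call means the request shape can be checked without a running server; `transcribe("meeting.wav")` would then be the only part that needs the endpoint to be up.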
So a FunASRTranscriber component could follow the same pattern as RemoteWhisperTranscriber, pointing to the local FunASR endpoint. Or directly via Python:
from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    spk_model="cam++",
    device="cuda",
)
result = model.generate(input="meeting.wav")
# Returns: text with speaker labels, word timestamps, emotion tags
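For indexing, the emotion tags would need to end up in document metadata rather than in the text itself. Below is a minimal sketch of that step, assuming the raw transcript carries inline tags of the form `<|en|><|HAPPY|><|Speech|>` in front of the text. That tag format, the emotion label set, and the `split_tags` helper are assumptions for illustration, so check them against the real model output before relying on this.

```python
import re

# Assumption: rich tags appear inline as <|TAG|> tokens.
TAG_PATTERN = re.compile(r"<\|([^|]+)\|>")

# Assumption: the emotion labels the model may emit.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}


def split_tags(raw_text):
    """Separate inline tags from the transcript and expose the emotion as metadata."""
    tags = TAG_PATTERN.findall(raw_text)
    clean_text = TAG_PATTERN.sub("", raw_text).strip()
    # First tag that is a known emotion label, or None if there is none.
    emotion = next((tag for tag in tags if tag in EMOTIONS), None)
    return {"content": clean_text, "meta": {"emotion": emotion, "tags": tags}}


parsed = split_tags("<|en|><|HAPPY|><|Speech|>Great job, everyone.")
```

The returned dict mirrors the `content` / `meta` split of a Haystack `Document`, so a transcriber component could build documents from it directly and make emotion available for metadata filtering.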
Use cases in Haystack pipelines
- Audio RAG: Index meeting recordings → retrieve by content + speaker
- Podcast search: Transcribe + index episodes for semantic search
- Video understanding: Extract text from video files for multimodal RAG
- Call center analytics: Transcribe + detect emotion for quality monitoring
Happy to contribute a FunASRTranscriber component if there's interest.