Feature: Add FunASR as audio transcription component (13x faster, speaker diarization) #3373

## Summary

Haystack currently supports Whisper for audio transcription, both locally (`LocalWhisperTranscriber`) and through the OpenAI API (`RemoteWhisperTranscriber`). [FunASR](https://github.com/modelscope/FunASR) (16.5K stars, MIT license) would be a valuable alternative audio component: it is significantly faster and has built-in speaker diarization and emotion detection.

## Why FunASR for Haystack pipelines

| | Whisper (local) | **FunASR** |
|---|---|---|
| GPU speed | 13x realtime | **170x realtime** |
| CPU speed | ❌ Too slow | **17x realtime** |
| Speaker diarization | ❌ | ✅ built-in |
| Emotion detection | ❌ | ✅ |
| Timestamps | ✅ | ✅ word-level |
| Languages | 57 | 50+ |
| API mode | OpenAI API (via `RemoteWhisperTranscriber`) | ✅ OpenAI-compatible |

For RAG pipelines that index audio or video content, the speed difference is critical: on GPU, processing a corpus of meeting recordings or podcast episodes is roughly 13x faster (170x vs. 13x realtime).
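The speed figures in the table translate directly into wall-clock time: at N x realtime, D hours of audio take D / N hours. A quick back-of-the-envelope check, using the table's reported numbers and a hypothetical 100-hour corpus:

```python
def processing_hours(audio_hours: float, times_realtime: float) -> float:
    """Wall-clock hours needed to transcribe `audio_hours` of audio at N x realtime."""
    return audio_hours / times_realtime


corpus_hours = 100.0  # hypothetical corpus size
whisper_gpu = processing_hours(corpus_hours, 13)   # about 7.7 h
funasr_gpu = processing_hours(corpus_hours, 170)   # about 0.59 h (roughly 35 min)
funasr_cpu = processing_hours(corpus_hours, 17)    # about 5.9 h
speedup = whisper_gpu / funasr_gpu                 # 170 / 13, about 13.1x
```

On these reported numbers, FunASR on CPU would still finish before Whisper on GPU.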

Speaker diarization is also valuable for RAG: knowing *who* said something lets a pipeline store the speaker as document metadata and filter by speaker at retrieval time.
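A minimal sketch of speaker-scoped retrieval, assuming each indexed chunk carries a hypothetical `meta["speaker"]` label produced by diarization. The keyword-overlap score is only a stand-in for a real retriever; in Haystack 2.x the speaker restriction would instead be a metadata filter such as `{"field": "meta.speaker", "operator": "==", "value": "A"}` passed to the retriever.

```python
from typing import Any, Dict, List


def retrieve_by_speaker(
    docs: List[Dict[str, Any]], query: str, speaker: str, top_k: int = 3
) -> List[Dict[str, Any]]:
    """Return up to top_k chunks spoken by `speaker` that share words with `query`."""
    terms = set(query.lower().split())
    # Step 1: metadata filter, keep only this speaker's chunks.
    candidates = [d for d in docs if d["meta"].get("speaker") == speaker]
    # Step 2: naive relevance score (shared lowercase words), a stand-in for BM25/embeddings.
    scored = [(len(terms & set(d["content"].lower().split())), d) for d in candidates]
    scored = [(score, d) for score, d in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:top_k]]


# Illustrative chunks with hypothetical speaker labels.
chunks = [
    {"content": "we should cut the budget", "meta": {"speaker": "A"}},
    {"content": "the budget looks fine to me", "meta": {"speaker": "B"}},
    {"content": "lunch is at noon", "meta": {"speaker": "A"}},
]
hits = retrieve_by_speaker(chunks, "budget", speaker="A")  # only speaker A's budget remark
```

Filtering before scoring keeps the ranking step small and guarantees that no other speaker's chunk can outrank the requested one.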

## Integration approach

FunASR already exposes an OpenAI-compatible API:

```bash
funasr-server --device cuda
# POST /v1/audio/transcriptions
```
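Because the endpoint follows the shape of OpenAI's transcription API, a request is an ordinary `multipart/form-data` POST with a `file` part and a `model` part. The stdlib-only sketch below builds such a request; the base URL and model name are hypothetical placeholders, and it assumes the FunASR server accepts the same form fields as OpenAI's API.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str, audio: bytes, filename: str, model: str
) -> urllib.request.Request:
    """Build a multipart/form-data POST for an OpenAI-style /v1/audio/transcriptions route."""
    boundary = uuid.uuid4().hex
    # Text part: the model name (which values a FunASR server accepts is an assumption).
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    # File part: the raw audio bytes under the field name "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + b"\r\n" + closing
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

With real audio, pass the bytes of a file such as `meeting.wav` and send the request with `urllib.request.urlopen(req)`; an OpenAI-style server answers with JSON whose `text` field holds the transcript.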

So a `FunASRTranscriber` component could follow the same pattern as `RemoteWhisperTranscriber`, pointing to the local FunASR endpoint. Alternatively, it could call FunASR directly from Python:

```python
from funasr import AutoModel

model = AutoModel(
    model="iic/SenseVoiceSmall",
    vad_model="fsmn-vad",
    spk_model="cam++",
    device="cuda"
)
result = model.generate(input="meeting.wav")
# Returns a list of dicts, one per input; depending on the loaded models,
# entries can carry speaker labels, timestamps, and emotion tags
```
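Whichever route the component takes, its `run()` method would end by turning FunASR output into documents. A minimal sketch of that step, assuming the result layout FunASR uses when a speaker model is loaded: a list of dicts, each with `text` and an optional `sentence_info` list whose items carry `text`, `start`/`end` in milliseconds, and `spk`. The function name and the metadata keys (`speaker`, `start_s`, `end_s`) are hypothetical. It returns plain `{"content", "meta"}` dicts; inside a Haystack component each would become `Document(content=..., meta=...)`.

```python
from typing import Any, Dict, List


def funasr_result_to_documents(
    result: List[Dict[str, Any]], source: str
) -> List[Dict[str, Any]]:
    """Convert FunASR generate() output into {"content", "meta"} dicts, one per sentence.

    Assumed layout: each entry has "text"; when a speaker model is loaded it may also
    have "sentence_info", a list of {"text", "start", "end", "spk"} with times in ms.
    """
    docs: List[Dict[str, Any]] = []
    for entry in result:
        sentences = entry.get("sentence_info")
        if not sentences:
            # No per-sentence info (e.g. no speaker model): one document per input.
            docs.append({"content": entry["text"], "meta": {"source": source}})
            continue
        for sent in sentences:
            docs.append(
                {
                    "content": sent["text"],
                    "meta": {
                        "source": source,
                        "speaker": sent.get("spk"),
                        "start_s": sent["start"] / 1000,  # ms -> seconds
                        "end_s": sent["end"] / 1000,
                    },
                }
            )
    return docs
```

Emitting one document per sentence keeps speaker and timestamp metadata precise, at the cost of small chunks; a real component might merge consecutive sentences from the same speaker before indexing.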

## Use cases in Haystack pipelines

1. **Audio RAG**: Index meeting recordings → retrieve by content + speaker
2. **Podcast search**: Transcribe + index episodes for semantic search
3. **Video understanding**: Extract text from video files for multimodal RAG
4. **Call center analytics**: Transcribe + detect emotion for quality monitoring
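For the call-center case, SenseVoice-style output prefixes each transcript with special tags (language, emotion, audio event, text-normalization flag), for example `<|en|><|ANGRY|><|Speech|><|woitn|>`. The sketch below assumes that tag format and the emotion label set shown in the code; both should be checked against the model's actual output. It strips the tags and flags segments whose emotion is in a watch list.

```python
import re
from typing import List, Optional, Tuple

# Assumption: SenseVoice-style raw text, e.g. "<|en|><|ANGRY|><|Speech|><|woitn|>text".
TAG = re.compile(r"<\|([^|]*)\|>")
# Assumed emotion labels; verify against the model's actual output.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}


def split_tags(raw: str) -> Tuple[List[str], str]:
    """Separate the <|...|> tags from the transcript text."""
    return TAG.findall(raw), TAG.sub("", raw).strip()


def emotion_of(raw: str) -> Optional[str]:
    """Return the first tag that is a known emotion label, if any."""
    tags, _ = split_tags(raw)
    return next((t for t in tags if t in EMOTIONS), None)


def flag_segments(
    raw_segments: List[str], watch: frozenset = frozenset({"ANGRY"})
) -> List[Tuple[str, str]]:
    """Return (emotion, text) for segments whose emotion is in `watch`."""
    flagged = []
    for raw in raw_segments:
        emotion = emotion_of(raw)
        if emotion in watch:
            flagged.append((emotion, split_tags(raw)[1]))
    return flagged
```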

Happy to contribute a `FunASRTranscriber` component if there's interest.
