@@ -44,11 +44,28 @@ Created and maintained by BizzAppDev Systems Pvt. Ltd.
4444
4545## How It Works
4646
47- 1 . ** Recording** : Browser-based microphone recording
48- 2 . ** Transcription** : Audio to text using Whisper-compatible services
49- 3 . ** Translation** : Text translation using OpenAI-compatible APIs
50- 4 . ** Synthesis** : Text-to-speech conversion
51- 5 . ** Playback** : Browser-based audio playback
47+ 1 . ** Capture** : PolyTalk receives live audio from a microphone, tab, or other browser-supported source.
48+ 2 . ** Listen** : The audio stream is prepared for real-time processing.
49+ 3 . ** Understand** : Speech becomes readable text.
50+ 4 . ** Translate** : The text is converted into the target language.
51+ 5 . ** Respond** : Users receive translated text and translated speech.
52+
53+ ``` text
54+ Browser audio source -> PolyTalk live pipeline
55+ |
56+ +-- 1. Listen: live audio stream
57+ |
58+ +-- 2. Understand: speech -> text
59+ | faster-whisper / Whisper-compatible
60+ |
61+ +-- 3. Translate: text -> target language
62+ | OpenAI-compatible / Ollama / vLLM / Anthropic / Gemini
63+ |
64+ +-- 4. Respond
65+ +-- translated text -> Browser UI
66+ +-- translated speech -> Browser playback
67+ Piper / compatible TTS
68+ ```
5269
5370## Features
5471
@@ -59,6 +76,49 @@ Created and maintained by BizzAppDev Systems Pvt. Ltd.
5976- Simple vanilla JavaScript frontend
6077- Easy to extend for additional providers and workflows
6178
79+ ## Use Cases
80+
81+ - Live multilingual meetings, calls, and demos
82+ - Customer support conversations across languages
83+ - Field-team, clinic, and service-desk communication where privacy matters
84+ - Classroom, training, and onboarding translation
85+ - Private or offline speech workflows on controlled infrastructure
86+ - Self-hosted AI prototypes that need a complete speech-to-speech pipeline
87+
88+ ## Why Self-Host PolyTalk?
89+
90+ - Keep audio, transcripts, translations, and generated speech on infrastructure you control.
91+ - Avoid hard dependency on a single hosted speech or translation vendor.
92+ - Tune latency, batching, VAD, model size, workers, and translation context for real deployments.
93+ - Run with CPU, GPU, local open-weight models, private APIs, or hosted providers.
94+ - Use mock mode for safe local demos and CI-style checks without API keys.
95+
96+ ## Provider Compatibility
97+
98+ PolyTalk is configuration-first and provider-flexible. The default Docker Compose
99+ stack includes local STT and TTS services, while translation can point to hosted
100+ or self-hosted model APIs.
101+
102+ | Pipeline stage | Built-in/default path | Compatible options |
103+ | ----------------| -----------------------| --------------------|
104+ | STT | faster-whisper service over WebSocket | Whisper-compatible WebSocket services that accept 16 kHz mono int16 PCM |
105+ | Translation | OpenAI-compatible chat completions | Ollama, vLLM, LM Studio, LiteLLM, OpenAI-compatible Responses, Anthropic Messages-style, Gemini Generate Content-style |
106+ | TTS | Local Piper HTTP service | Piper-compatible HTTP services, configured OpenAI-style TTS fallback |
107+
108+ See [ Provider Extension] ( docs/provider-extension.md ) for service contracts,
109+ wire formats, and guidance for adding custom providers.
110+
111+ ## Self-Hosted vs Hosted-Only APIs
112+
113+ PolyTalk is designed for teams that want live speech-to-speech translation
114+ without giving up deployment control. Hosted-only translation APIs can be useful
115+ when you want a managed service, but PolyTalk gives you an open-source pipeline
116+ that can run in your own environment, mix local and remote providers, and keep
117+ the browser, WebSocket pipeline, STT, translation, and TTS layers configurable.
118+
119+ The Community Edition is AGPL-3.0 licensed for open-source use, with commercial
120+ licensing available for proprietary deployments.
121+
62122## Quick Start
63123
64124### Prerequisites
@@ -283,6 +343,21 @@ queue wait, STT inference time, emit delay, ASR-to-translation queue wait,
283343translation request time, and TTS queue/duration. When `LOG_LEVEL` is unset,
284344PolyTalk defaults to `INFO`.
285345
346+ # ## Benchmark Preview
347+
348+ PolyTalk includes small benchmark scripts for measuring each stage of the live
349+ translation path before you tune a deployment.
350+
351+ | Benchmark | What it measures | Script |
352+ |-----------|------------------|--------|
353+ | STT | First transcript timing and transcription service behavior | `tools/benchmarks/benchmark_stt.py` |
354+ | Translation | Translation provider latency for repeated text chunks | `tools/benchmarks/benchmark_translation.py` |
355+ | TTS | Speech synthesis latency and generated audio size | `tools/benchmarks/benchmark_tts.py` |
356+ | Full pipeline | First transcription, first translation, first TTS, event counts, and p50/p95 event arrival times | `tools/benchmarks/benchmark_pipeline.py` |
357+
358+ See [Benchmarking](docs/benchmarking.md) for sample commands, fixture audio,
359+ and guidance on reading results.
360+
286361# # API Endpoints
287362
288363# ## `GET /api/health`
0 commit comments