Commit bf9a1c6

docs: expand README explainer sections
1 parent 8d46b66 commit bf9a1c6

1 file changed

Lines changed: 80 additions & 5 deletions

README.md

@@ -44,11 +44,28 @@ Created and maintained by BizzAppDev Systems Pvt. Ltd.
 
 ## How It Works
 
-1. **Recording**: Browser-based microphone recording
-2. **Transcription**: Audio to text using Whisper-compatible services
-3. **Translation**: Text translation using OpenAI-compatible APIs
-4. **Synthesis**: Text-to-speech conversion
-5. **Playback**: Browser-based audio playback
+1. **Capture**: PolyTalk receives live audio from a microphone, tab, or other browser-supported source.
+2. **Listen**: The audio stream is prepared for real-time processing.
+3. **Understand**: Speech becomes readable text.
+4. **Translate**: The text is converted into the target language.
+5. **Respond**: Users receive translated text and translated speech.
+
+```text
+Browser audio source -> PolyTalk live pipeline
+  |
+  +-- 1. Listen: live audio stream
+  |
+  +-- 2. Understand: speech -> text
+  |      faster-whisper / Whisper-compatible
+  |
+  +-- 3. Translate: text -> target language
+  |      OpenAI-compatible / Ollama / vLLM / Anthropic / Gemini
+  |
+  +-- 4. Respond
+        +-- translated text -> Browser UI
+        +-- translated speech -> Browser playback
+              Piper / compatible TTS
+```
 
 ## Features

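
As a reading aid for the stage order described in the hunk above (understand, translate, respond), here is a minimal Python sketch. Every name in it (`Pipeline`, `mock_stt`, `mock_translate`, `mock_tts`) is hypothetical and not taken from the PolyTalk codebase. The stubs only stand in for the STT, translation, and TTS providers, so this shows the ordering of the stages rather than how PolyTalk implements them.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical stage signatures; PolyTalk's real interfaces may differ.
SttFn = Callable[[bytes], str]
TranslateFn = Callable[[str, str], str]
TtsFn = Callable[[str], bytes]


@dataclass
class Pipeline:
    """Chains the three processing stages in the order the README lists them."""
    stt: SttFn
    translate: TranslateFn
    tts: TtsFn

    def process(self, audio_chunk: bytes, target_lang: str) -> Tuple[str, bytes]:
        text = self.stt(audio_chunk)                    # Understand: speech -> text
        translated = self.translate(text, target_lang)  # Translate: text -> target language
        speech = self.tts(translated)                   # Respond: translated speech
        return translated, speech


# Stub providers (assumptions for illustration only, no network or models).
def mock_stt(audio: bytes) -> str:
    return "hello"


def mock_translate(text: str, target_lang: str) -> str:
    return f"[{target_lang}] {text}"


def mock_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Passing the stages in as callables is one way to keep providers swappable, which matches the "provider-flexible" goal stated in the README, though the project may structure this differently.
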
@@ -59,6 +76,49 @@ Created and maintained by BizzAppDev Systems Pvt. Ltd.
 - Simple vanilla JavaScript frontend
 - Easy to extend for additional providers and workflows
 
+## Use Cases
+
+- Live multilingual meetings, calls, and demos
+- Customer support conversations across languages
+- Field-team, clinic, and service-desk communication where privacy matters
+- Classroom, training, and onboarding translation
+- Private or offline speech workflows on controlled infrastructure
+- Self-hosted AI prototypes that need a complete speech-to-speech pipeline
+
+## Why Self-Host PolyTalk?
+
+- Keep audio, transcripts, translations, and generated speech on infrastructure you control.
+- Avoid hard dependency on a single hosted speech or translation vendor.
+- Tune latency, batching, VAD, model size, workers, and translation context for real deployments.
+- Run with CPU, GPU, local open-weight models, private APIs, or hosted providers.
+- Use mock mode for safe local demos and CI-style checks without API keys.
+
+## Provider Compatibility
+
+PolyTalk is configuration-first and provider-flexible. The default Docker Compose
+stack includes local STT and TTS services, while translation can point to hosted
+or self-hosted model APIs.
+
+| Pipeline stage | Built-in/default path | Compatible options |
+|----------------|-----------------------|--------------------|
+| STT | faster-whisper service over WebSocket | Whisper-compatible WebSocket services that accept 16 kHz mono int16 PCM |
+| Translation | OpenAI-compatible chat completions | Ollama, vLLM, LM Studio, LiteLLM, OpenAI-compatible Responses, Anthropic Messages-style, Gemini Generate Content-style |
+| TTS | Local Piper HTTP service | Piper-compatible HTTP services, configured OpenAI-style TTS fallback |
+
+See [Provider Extension](docs/provider-extension.md) for service contracts,
+wire formats, and guidance for adding custom providers.
+
+## Self-Hosted vs Hosted-Only APIs
+
+PolyTalk is designed for teams that want live speech-to-speech translation
+without giving up deployment control. Hosted-only translation APIs can be useful
+when you want a managed service, but PolyTalk gives you an open-source pipeline
+that can run in your own environment, mix local and remote providers, and keep
+the browser, WebSocket pipeline, STT, translation, and TTS layers configurable.
+
+The Community Edition is AGPL-3.0 licensed for open-source use, with commercial
+licensing available for proprietary deployments.
+
 ## Quick Start
 
 ### Prerequisites
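
The STT row in the Provider Compatibility table above states that compatible services accept 16 kHz mono int16 PCM. The sketch below shows what producing that sample format could look like in Python using only the standard library. The function names are hypothetical, and little-endian byte order is an assumption; the diff does not state the byte order, so check `docs/provider-extension.md` for the actual wire format.

```python
import struct
from typing import Iterable

SAMPLE_RATE_HZ = 16_000   # from the STT row: 16 kHz mono
BYTES_PER_SAMPLE = 2      # int16


def float_to_int16_pcm(samples: Iterable[float]) -> bytes:
    """Convert mono float samples in [-1.0, 1.0] to int16 PCM bytes.

    Out-of-range samples are clipped. Little-endian output is an assumption.
    """
    out = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))
        out += struct.pack("<h", int(round(clipped * 32767)))
    return bytes(out)


def duration_ms(pcm: bytes) -> float:
    """Playback duration of a mono int16 buffer at 16 kHz, in milliseconds."""
    return len(pcm) / BYTES_PER_SAMPLE / SAMPLE_RATE_HZ * 1000.0
```

For example, a 640-byte buffer holds 320 samples, which is 20 ms of audio at this sample rate.
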
@@ -283,6 +343,21 @@ queue wait, STT inference time, emit delay, ASR-to-translation queue wait,
 translation request time, and TTS queue/duration. When `LOG_LEVEL` is unset,
 PolyTalk defaults to `INFO`.
 
+### Benchmark Preview
+
+PolyTalk includes small benchmark scripts for measuring each stage of the live
+translation path before you tune a deployment.
+
+| Benchmark | What it measures | Script |
+|-----------|------------------|--------|
+| STT | First transcript timing and transcription service behavior | `tools/benchmarks/benchmark_stt.py` |
+| Translation | Translation provider latency for repeated text chunks | `tools/benchmarks/benchmark_translation.py` |
+| TTS | Speech synthesis latency and generated audio size | `tools/benchmarks/benchmark_tts.py` |
+| Full pipeline | First transcription, first translation, first TTS, event counts, and p50/p95 event arrival times | `tools/benchmarks/benchmark_pipeline.py` |
+
+See [Benchmarking](docs/benchmarking.md) for sample commands, fixture audio,
+and guidance on reading results.
+
 ## API Endpoints
 
 ### `GET /api/health`
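
The Benchmark Preview table above mentions p50/p95 event arrival times for the full-pipeline benchmark. For readers unfamiliar with those figures, here is a minimal sketch of one common way to compute them, the nearest-rank percentile. This is not taken from `tools/benchmarks/benchmark_pipeline.py`, which may use a different (for example, interpolating) definition; the function name and sample values are illustrative only.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of values, for pct in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Illustrative arrival times in milliseconds (made-up numbers, not measurements).
arrival_ms = [120, 80, 200, 95, 150, 110, 300, 90, 105, 130]
p50 = percentile_nearest_rank(arrival_ms, 50)
p95 = percentile_nearest_rank(arrival_ms, 95)
```

With only ten samples, p95 under this definition is simply the slowest event, which is one reason to collect enough events before reading tail latency from a benchmark run.
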
