Skip to content

Latest commit

 

History

History
391 lines (302 loc) · 11.2 KB

File metadata and controls

391 lines (302 loc) · 11.2 KB

Production Deployment

This guide covers a practical production-style PolyTalk deployment using Docker Compose, GPU-backed STT, local Piper TTS, an OpenAI-compatible translation service, and nginx/Let's Encrypt in front of the application.

Use this as an operator runbook for open-source deployments. It intentionally keeps provider-specific model serving outside this repository so API keys, private model paths, and GPU-specific launch flags do not leak into source control.

Architecture

Browser
  |
  | HTTPS + WebSocket
  v
nginx
  |
  | http://127.0.0.1:9000
  v
polytalk
  |-- ws://stt:8000/v1/stream/transcriptions
  |-- http://tts:5000
  `-- http(s)://translation-provider/v1/chat/completions

The translation provider can be OpenAI, vLLM, Ollama with an OpenAI-compatible proxy, or any service that implements /v1/chat/completions.

Production Checklist

  • Run behind HTTPS. Browsers require a secure context for microphone access on non-localhost domains.
  • Set APP_DEBUG=false and LOG_LEVEL=INFO for normal operation.
  • Keep .env, config/config.yaml, model tokens, and provider credentials out of git.
  • Use config/config.yaml.example as a template, then set mock_mode: false for whisper, translation, and tts.
  • Restrict ALLOWED_ORIGINS to the real HTTPS origin.
  • Put nginx or another reverse proxy in front of the app with WebSocket upgrade headers and long read/send timeouts.
  • Persist or rotate media/output; generated TTS audio is user data.
  • Run benchmarks from staging or during an announced maintenance window, not during normal demo traffic.

Server Prerequisites

  • Linux server with Docker Engine and Docker Compose plugin.
  • A domain pointing to the server.
  • Ports 80 and 443 open for nginx and Let's Encrypt.
  • NVIDIA driver and NVIDIA Container Toolkit if using GPU STT.
  • Enough disk space for STT models, TTS voices, and generated media.

For GPU STT, verify Docker can see the GPU:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

Prepare Configuration

cp .env.example .env
cp config/config.yaml.example config/config.yaml

For production, set all service sections in config/config.yaml to real mode:

whisper:
  mock_mode: false

translation:
  mock_mode: false

tts:
  mock_mode: false

Use .env for deployment-specific values:

APP_DEBUG=false
LOG_LEVEL=INFO
ALLOWED_ORIGINS=https://polytalk.example.com

WHISPER_BASE_URL=http://stt:8000
WHISPER_WS_ENDPOINT=/v1/stream/transcriptions

TTS_BASE_URL=http://tts:5000
TTS_MODEL=en_GB-jenny_dioco-medium

TRANSLATION_BASE_URL=http://your-translation-service:8000
TRANSLATION_MODEL=polytalk-translation
TRANSLATION_API_KEY=
TRANSLATION_MAX_TOKENS=240

For GPU STT, start with:

STT_MODEL=large-v3
STT_DEVICE=cuda
STT_COMPUTE_TYPE=float16
STT_WORKERS=1
STT_PRELOAD_MODEL=true
STT_TRANSCRIBE_WORKERS=2
STT_MODEL_WORKERS=2

STT_STREAM_CHUNK_SECONDS=1.2
STT_CHUNK_OVERLAP_SECONDS=0.25
STT_SILENCE_RMS_THRESHOLD=0.003
STT_NO_SPEECH_PROB_THRESHOLD=0.50
STT_LOG_PROB_THRESHOLD=-1.0
STT_MAX_CROSS_DELTA_WORD_REPEATS=6
STT_VAD_FILTER=true
STT_VAD_MIN_SILENCE_MS=500
STT_VAD_SPEECH_PAD_MS=200
STT_CONDITION_ON_PREVIOUS_TEXT=false
STT_TEMPERATURE=0.0

Increase worker counts only after checking GPU memory and latency under load. Each STT web worker loads its own Whisper model.

The STT values above are balanced microphone defaults. If quiet speakers are missed, lower STT_SILENCE_RMS_THRESHOLD slightly. If silence produces hallucinated phrases, raise it slightly or make STT_NO_SPEECH_PROB_THRESHOLD more strict.

If you deploy with Docker Compose, pass ALLOWED_ORIGINS to the polytalk service environment or add it to your own production override file. The default application fallback only allows localhost development origins.

TTS Voices

The TTS image expects voice files in tts/voices. The directory is tracked with .gitkeep, but voice model files are intentionally ignored.

Build the TTS image and download default voices:

docker compose build tts
./tts/setup-voices.sh

The script uses polytalk-tts:latest by default. Override it if needed:

TTS_IMAGE=custom-polytalk-tts:latest ./tts/setup-voices.sh

Start PolyTalk

CPU/default:

docker compose up -d --build

GPU STT:

docker compose -f docker-compose.yml -f docker-compose.gpu.yml up -d --build

Check containers:

docker compose ps
docker compose logs -f polytalk

Check app health:

curl http://127.0.0.1:9000/api/health

The base docker-compose.yml includes source-code bind mounts to keep local development convenient. For immutable production images, use your own Compose override or deployment manifest that removes the ./app, ./stt/app, ./tts/wsgi.py, and ./config bind mounts after the image and configuration are baked or mounted from a controlled release location.

Translation Provider

PolyTalk talks to translation through an OpenAI-compatible chat completions API. Keep the translation service outside this repo if it has model-specific GPU, token, or network requirements.

Example vLLM-style values:

TRANSLATION_BASE_URL=http://polytalk-vllm:8000
TRANSLATION_MODEL=polytalk-translation
TRANSLATION_MAX_TOKENS=240

Model notes:

  • Keep TRANSLATION_MAX_TOKENS bounded for live streaming. 240 gives Indic-script targets more room while still limiting runaway generations.
  • Qwen3 models may need Qwen-specific thinking/reasoning flags on the vLLM side.
  • TranslateGemma/Gemma 3 models require bfloat16 and must not use Qwen reasoning parser flags.
  • Keep model-specific vLLM/Hugging Face tokens out of this repository.

Nginx Reverse Proxy

Install nginx and certbot using your distribution packages, then create a site similar to this:

map $http_upgrade $connection_upgrade {
    default upgrade;
    '' close;
}

server {
    listen 80;
    server_name polytalk.example.com;

    location /.well-known/acme-challenge/ {
        root /var/www/html;
    }

    location / {
        return 301 https://$host$request_uri;
    }
}

server {
    listen 443 ssl http2;
    server_name polytalk.example.com;

    # Certbot can manage these paths automatically after issuance.
    ssl_certificate /etc/letsencrypt/live/polytalk.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/polytalk.example.com/privkey.pem;

    client_max_body_size 250m;

    location ~ /\.(?!well-known(?:/|$)) {
        deny all;
        access_log off;
        log_not_found off;
    }

    location ~* ^/(?:phpinfo|info|php|i|pinfo|test|pi|p|debug|server-status|server-info)\.php(?:$|\?) {
        deny all;
        access_log off;
        log_not_found off;
    }

    location = /api/ws/translate {
        proxy_pass http://127.0.0.1:9000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Live translation sessions can run for a long time.
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        proxy_connect_timeout 60s;
        proxy_buffering off;
    }

    location / {
        proxy_pass http://127.0.0.1:9000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
        proxy_connect_timeout 60s;
        proxy_buffering off;
    }
}

Validate and reload:

sudo nginx -t
sudo systemctl reload nginx

Issue a certificate:

sudo certbot --nginx -d polytalk.example.com

Verify:

curl https://polytalk.example.com/api/health
curl -o /dev/null -s -w "%{http_code}\n" https://polytalk.example.com/.env

The .env check should return 403.

Also verify WebSocket traffic from the browser dev tools while starting a live translation. A common reverse-proxy failure mode is a healthy /api/health response but a dropped /api/ws/translate connection because upgrade headers or timeouts are missing.

Operational Checks

Use these commands after deployment or a restart:

docker compose ps
docker compose logs --tail 100 polytalk
docker compose logs --tail 100 stt
docker compose logs --tail 100 tts
curl http://127.0.0.1:9000/api/health

For GPU deployments:

nvidia-smi
docker compose -f docker-compose.yml -f docker-compose.gpu.yml exec stt nvidia-smi

Set LOG_LEVEL=DEBUG temporarily when diagnosing latency. Debug logs include STT queue wait, STT inference time, emit delay, ASR-to-translation queue wait, translation request time, and TTS queue/duration.

Return to LOG_LEVEL=INFO after diagnosis. Debug logs can contain timing, language, and transcript-adjacent metadata that is too noisy for long-term production retention.

Benchmarking and Demo Mode

Use Benchmarking to run STT, translation, TTS, and pipeline checks. Benchmarking can consume GPU and translation-provider capacity, so run it against staging where possible.

If you must benchmark a shared demo system:

  • Announce the test window to users.
  • Watch docker compose logs -f stt polytalk for STT backpressure and provider latency.
  • Compare results with a known-good JSON output from the same audio fixture.
  • Stop tests if live users report degraded microphone or tab-audio behavior.

Backups and Data Handling

  • .env contains secrets and must not be committed.
  • media/output contains generated speech audio. Treat it as user data.
  • Transcripts and translations can be sensitive. Control log retention.
  • TTS voice files can be re-downloaded, but deployments should still preserve tts/voices or run tts/setup-voices.sh during provisioning.
  • Consider a scheduled cleanup job for media/output if generated speech does not need long-term retention.

Example cleanup policy for files older than seven days:

find /path/to/polytalk/media/output -type f -mtime +7 -delete

Troubleshooting

  • Browser microphone is unavailable: use HTTPS and confirm permission prompts.
  • WebSocket disconnects: confirm nginx upgrade headers and long proxy timeouts.
  • STT is slow: use CUDA, a smaller model, or lower STT_STREAM_CHUNK_SECONDS only after checking accuracy.
  • STT CUDA fails: verify NVIDIA Container Toolkit and the GPU override compose file.
  • Translation times out: check provider logs, keep TRANSLATION_MAX_TOKENS low, and verify model-specific serving flags.
  • Missing TTS voice: confirm the voice .onnx and .onnx.json files exist under tts/voices.
  • CORS errors in the browser: set ALLOWED_ORIGINS to the exact HTTPS origin and ensure that variable is passed to the running polytalk container.
  • Microphone quality is worse than tab audio: tune the STT silence and no-speech thresholds first. Tab audio is clean digital input; microphones include room noise, gain changes, and echo cancellation artifacts.