chore: tune gunicorn worker settings (#1)

bizzappdev · web-flow · commit 00360eeca41c · 2026-05-29T12:01:25.000+05:30
diff --git a/Dockerfile b/Dockerfile
@@ -29,4 +29,4 @@ HEALTHCHECK --interval=30s --timeout=10s --start-period=10s --retries=3 \
     CMD curl -f http://localhost:8000/api/health || exit 1
 
 # Run the application
-CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
+CMD ["sh", "-c", "exec gunicorn app.main:app -w ${APP_WORKERS:-3} -k uvicorn.workers.UvicornWorker --bind 0.0.0.0:8000 --timeout 120 --graceful-timeout 30"]
diff --git a/README.md b/README.md
@@ -229,6 +229,7 @@ The main latency knobs are:
 | `STT_INITIAL_PROMPT` | empty | Optional Whisper prompt for domain terms, names, and expected vocabulary. |
 | `whisper.ping_interval_seconds` | `null` | App-to-STT WebSocket ping interval. `null` disables client pings, which avoids timeouts during long local model inference. |
 | `whisper.ping_timeout_seconds` | `null` | App-to-STT WebSocket ping timeout. |
+| `APP_WORKERS` | `3` | Number of PolyTalk app Gunicorn workers. Increase for more concurrent sessions after checking CPU and memory headroom. |
 | `STT_WORKERS` | `1` | Number of STT web workers. Each worker loads its own Whisper model. |
 | `STT_PRELOAD_MODEL` | `true` | Load the Whisper model during STT startup instead of delaying the first stream. |
 | `STT_CHUNK_OVERLAP_SECONDS` | `0.25` | Audio overlap between STT windows. Helps avoid missing words at chunk boundaries. |
@@ -241,6 +242,7 @@ The main latency knobs are:
 | `translation.model` | `qwen3-8b` | Use a model supported by your provider or self-hosted server, such as qwen3-8b, TranslateGama, or another open-source/open-weight model. |
 | `translation.max_tokens` | `240` | Maximum translation output tokens. Keep bounded for live streaming, but allow enough room for Indic-script targets and longer sentence buffers. |
 | `tts.timeout_seconds` | `10` | Maximum wait for TTS generation. |
+| `TTS_WORKERS` | `4` | Number of Piper Gunicorn workers. Keep `2-4` on small hosts; raise toward `min(8, CPU cores)` only after CPU and memory headroom are confirmed. |
 
 For larger continuous-speech translation chunks, start with:
 
diff --git a/docker-compose.yml b/docker-compose.yml
@@ -65,6 +65,7 @@ services:
     environment:
       - PIPER_MODEL=${TTS_MODEL:-en_GB-jenny_dioco-medium}
       - PIPER_DATA_DIR=/data
+      - TTS_WORKERS=${TTS_WORKERS:-4}
     volumes:
       - ./tts/wsgi.py:/app/wsgi.py:ro
       - ./tts/voices:/data:ro
@@ -93,6 +94,7 @@ services:
       # Application
       - APP_HOST=0.0.0.0
       - APP_PORT=8000
+      - APP_WORKERS=${APP_WORKERS:-3}
       - APP_DEBUG=${APP_DEBUG:-false}
       - ALLOWED_ORIGINS=${ALLOWED_ORIGINS:-http://localhost:9000,http://127.0.0.1:9000}
       - LOG_LEVEL=${LOG_LEVEL:-INFO}
diff --git a/requirements.txt b/requirements.txt
@@ -4,6 +4,7 @@
 # Web framework
 fastapi>=0.109.0
 uvicorn[standard]>=0.27.0
+gunicorn>=22.0.0
 python-multipart>=0.0.6
 
 # Templates
diff --git a/tts/Dockerfile b/tts/Dockerfile
@@ -23,4 +23,4 @@ COPY voices/ /data/
 
 EXPOSE 5000
 
-CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "4", "wsgi:app"]
+CMD ["sh", "-c", "exec gunicorn --bind 0.0.0.0:5000 --workers ${TTS_WORKERS:-4} wsgi:app"]

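Note on the worker defaults: both `CMD` lines rely on the shell expansion `${VAR:-default}`, which falls back to the default when the variable is unset or empty, and the README suggests raising `TTS_WORKERS` toward `min(8, CPU cores)`. The sketch below mirrors those two rules in Python so a value can be checked before deploying. It is not part of this commit; the helper names `resolve_workers` and `tts_worker_ceiling` are hypothetical, and it assumes the variable holds a plain integer string.

```python
import os


def resolve_workers(name, default, env=None):
    """Mimic the shell expansion ${NAME:-default}.

    The default is used when the variable is unset or set to an empty
    string. `env` is an optional mapping for testing; os.environ otherwise.
    """
    env = os.environ if env is None else env
    raw = env.get(name, "")
    return int(raw) if raw else default


def tts_worker_ceiling(cpu_cores):
    """Upper bound from the README guidance: min(8, CPU cores)."""
    return min(8, cpu_cores)


if __name__ == "__main__":
    cores = os.cpu_count() or 1  # cpu_count() may return None
    app_workers = resolve_workers("APP_WORKERS", 3)
    tts_workers = resolve_workers("TTS_WORKERS", 4)
    print(f"APP_WORKERS={app_workers}")
    print(f"TTS_WORKERS={tts_workers} (suggested ceiling: {tts_worker_ceiling(cores)})")
```

The ceiling is only the README's upper bound. CPU and memory headroom still need to be confirmed on the host before raising either value.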