Add Docker + Kubernetes deployment stack with autoscaling and worker isolation #78
Conversation
Review skipped: auto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI.
Merged commit 9a5e283 into codex/fix-remaining-issues-and-raise-pr
```python
while RUNNING:
    # Placeholder for queue-based execution workers.
    # This keeps the worker pool isolated from API pods.
    print("worker-heartbeat")
    time.sleep(15)
```
🟡 Graceful shutdown delayed up to 15 seconds because time.sleep() auto-retries after signal (PEP 475)
The worker's graceful shutdown mechanism doesn't work promptly. When SIGTERM/SIGINT is received during time.sleep(15), the signal handler sets RUNNING = False, but due to PEP 475 (Python 3.5+), time.sleep() automatically retries for the remaining duration after the signal handler returns. The while RUNNING condition is not re-checked until the full 15-second sleep completes.
Root Cause and Verification
PEP 475 modified the standard library to automatically retry system calls that are interrupted by signals (EINTR). This means time.sleep(15) will resume sleeping for the remaining time after the _shutdown_handler sets RUNNING = False.
Verified empirically: a time.sleep(5) interrupted by a signal after 1 second still sleeps the full 5 seconds, even though the signal handler ran at the 1-second mark.
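A minimal standalone repro of that verification (assumed, not part of the PR; signal.alarm is POSIX-only):

```python
# SIGALRM fires after 1 second and the handler runs, yet time.sleep(5)
# still blocks for the full 5 seconds: PEP 475 retries the interrupted
# sleep for the remaining duration once the handler returns.
import signal
import time

start = time.monotonic()

def handler(signum, frame):
    print(f"handler ran at t={time.monotonic() - start:.1f}s")   # ~1.0s

signal.signal(signal.SIGALRM, handler)
signal.alarm(1)  # deliver SIGALRM one second from now

time.sleep(5)    # resumes after the handler returns
print(f"sleep returned at t={time.monotonic() - start:.1f}s")   # ~5.0s
```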
Actual behavior: Worker takes up to 15 seconds to shut down after receiving SIGTERM, because time.sleep(15) at backend/worker.py:25 resumes after the signal handler completes.
Expected behavior: Worker should exit promptly (within milliseconds) after receiving SIGTERM.
Impact: In Kubernetes, this means pod termination is delayed by up to 15 seconds on every rolling update or scale-down. While this is within the default 30-second terminationGracePeriodSeconds, it unnecessarily slows deployments and wastes resources. If the sleep interval were increased (e.g., to 60 seconds), it could exceed the grace period and cause forced kills (SIGKILL).
Fix: Use threading.Event.wait() instead of time.sleep(); Event.wait() returns immediately once the event is set:
```python
import threading

_stop_event = threading.Event()

def _shutdown_handler(signum, frame):
    _stop_event.set()

while not _stop_event.is_set():
    print("worker-heartbeat")
    _stop_event.wait(15)
```
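For context, a fuller sketch of how this would wire together in backend/worker.py; the handler name and the 15-second heartbeat interval come from the diff above, while the signal registration lines are assumptions about the surrounding file:

```python
import signal
import threading

_stop_event = threading.Event()

def _shutdown_handler(signum, frame):
    # Setting the event wakes any pending _stop_event.wait() immediately.
    _stop_event.set()

# Assumed registration; the PR's worker.py presumably does something similar.
signal.signal(signal.SIGTERM, _shutdown_handler)
signal.signal(signal.SIGINT, _shutdown_handler)

while not _stop_event.is_set():
    print("worker-heartbeat")
    # Unlike time.sleep(15), wait() returns as soon as the event is set.
    _stop_event.wait(15)
```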
Prompt for agents

In backend/worker.py, replace the time.sleep-based loop with a threading.Event-based approach for prompt graceful shutdown. Specifically:

1. At the top of the file (around lines 7-8), replace `RUNNING = True` with:

```python
import threading

_stop_event = threading.Event()
```

2. Change the _shutdown_handler function (lines 11-13) to:

```python
def _shutdown_handler(signum, frame):
    _stop_event.set()
```

3. Change the main loop (lines 21-25) from:

```python
while RUNNING:
    print("worker-heartbeat")
    time.sleep(15)
```

to:

```python
while not _stop_event.is_set():
    print("worker-heartbeat")
    _stop_event.wait(15)
```

4. Remove the `import time` if it is no longer needed, and remove the `RUNNING` global variable.

The threading.Event.wait() method returns immediately when the event is set, unlike time.sleep(), which auto-retries after signal interruption due to PEP 475.
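A counterpart to the earlier repro (again assumed and standalone, POSIX-only): the same 1-second SIGALRM now interrupts the wait immediately, because the handler sets the event and wait() returns as soon as the event becomes set.

```python
import signal
import threading
import time

stop = threading.Event()
signal.signal(signal.SIGALRM, lambda signum, frame: stop.set())
signal.alarm(1)  # deliver SIGALRM one second from now

start = time.monotonic()
stop.wait(5)     # returns as soon as the handler sets the event
print(f"wait returned at t={time.monotonic() - start:.1f}s")  # ~1.0s
```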
Motivation
Description
- Added Dockerfile.backend, Dockerfile.frontend, and .dockerignore to build the backend and frontend images and reduce build contexts.
- Added docker-compose.yml, which runs backend, frontend, and an isolated worker service using the backend.worker entrypoint.
- Added backend/worker.py with graceful shutdown handling intended for Kubernetes worker pools.
- Added manifests under deploy/k8s (namespace.yaml, backend.yaml, frontend.yaml, worker.yaml, autoscaling.yaml, kustomization.yaml) that include a rolling update strategy, readiness/liveness probes, resource requests/limits, nodeSelector/tolerations for worker isolation, and HPAs for autoscaling.
- Added deploy/README.md with usage instructions and managed-cluster guidance for AWS EKS, Google GKE, and Azure AKS, and linked a short deployment note into the repo README.md.

Testing
- Ran python -m py_compile backend/worker.py and it completed successfully.
- Parsed the deploy/k8s manifests with yaml.safe_load_all (script run via python - <<'PY' ... PY) and confirmed no YAML parsing errors; a sketch of such a script follows.
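A minimal sketch of that kind of validation script, assuming the manifests live under deploy/k8s as described above (the exact script run during testing was not included in the PR):

```python
# Hypothetical reconstruction of the YAML validation step: parse every
# manifest under deploy/k8s with yaml.safe_load_all and report any errors.
import pathlib
import sys

import yaml  # PyYAML

failed = False
for manifest in sorted(pathlib.Path("deploy/k8s").glob("*.yaml")):
    try:
        docs = list(yaml.safe_load_all(manifest.read_text()))
        print(f"{manifest}: {len(docs)} document(s) parsed OK")
    except yaml.YAMLError as exc:
        print(f"{manifest}: parse error: {exc}", file=sys.stderr)
        failed = True

sys.exit(1 if failed else 0)
```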