Add Docker + Kubernetes deployment stack with autoscaling and worker isolation #78

Merged

fuzziecoder merged 1 commit into codex/fix-remaining-issues-and-raise-pr from codex/implement-containerization-and-orchestration on Feb 24, 2026

Conversation

fuzziecoder (Owner) commented Feb 24, 2026

Motivation

  • Provide a first-class containerized setup so the backend and frontend can be built and run in Docker for local and CI workflows.
  • Enable production-ready orchestration features (rolling updates, readiness/liveness probes, auto-scaling) for cloud Kubernetes deployments.
  • Separate execution workers from API pods to allow worker isolation, node scheduling, and independent scaling for pipeline execution workloads.

Description

  • Added container assets: Dockerfile.backend, Dockerfile.frontend, and .dockerignore to build backend and frontend images and reduce build contexts.
  • Added local orchestration via docker-compose.yml that runs backend, frontend, and an isolated worker service using the backend.worker entrypoint.
  • Implemented a simple isolated worker runtime at backend/worker.py with graceful-shutdown handling, intended for Kubernetes worker pools (a reconstructed sketch follows this list).
  • Added Kubernetes manifests under deploy/k8s (namespace.yaml, backend.yaml, frontend.yaml, worker.yaml, autoscaling.yaml, kustomization.yaml) that include rolling update strategy, readiness/liveness probes, resource requests/limits, nodeSelector/tolerations for worker isolation, and HPAs for autoscaling.
  • Added deploy/README.md with usage instructions and managed-cluster guidance for AWS EKS, Google GKE, and Azure AKS, and linked a short deployment note into the repo README.md.
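
For reference, a sketch of the worker runtime, reconstructed from the excerpts quoted in the Devin review thread below. The signal registration and entrypoint guard are assumptions, since only the handler and the main loop are quoted; the review also flags a shutdown-latency caveat in the time.sleep call.

import signal
import time

RUNNING = True  # flipped to False by the signal handler

def _shutdown_handler(signum, frame):
    # Mark the worker for shutdown on SIGTERM/SIGINT.
    global RUNNING
    RUNNING = False

def main():
    # Assumed registration; the handler must be installed for
    # Kubernetes pod termination (SIGTERM) to be handled at all.
    signal.signal(signal.SIGTERM, _shutdown_handler)
    signal.signal(signal.SIGINT, _shutdown_handler)

    while RUNNING:
        # Placeholder for queue-based execution workers.
        # This keeps the worker pool isolated from API pods.
        print("worker-heartbeat")
        time.sleep(15)

if __name__ == "__main__":
    main()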

Testing

  • Ran a Python compile check with python -m py_compile backend/worker.py; it completed successfully.
  • Parsed all YAML manifests and the Docker Compose file using yaml.safe_load_all (run as an inline script via python - <<'PY' ... PY) and confirmed there were no YAML parsing errors; a sketch of such a script follows this list.
  • Confirmed the added Dockerfiles and the Compose config are present and well-formed via the local file checks used during the rollout.
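
A minimal sketch of the validation pass described above; the exact script from the PR is not shown, so the file layout here is inferred from the manifests named in the description.

import pathlib
import yaml

files = ["docker-compose.yml", *sorted(pathlib.Path("deploy/k8s").glob("*.yaml"))]

for path in files:
    with open(path) as fh:
        # safe_load_all handles multi-document manifests separated by ---.
        docs = list(yaml.safe_load_all(fh))
    print(f"{path}: OK ({len(docs)} document(s))")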


vercel Bot commented Feb 24, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| flexi-roaster | Ready | Preview, Comment | Feb 24, 2026 2:45pm |


coderabbitai Bot commented Feb 24, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Comment @coderabbitai help to get the list of available commands and usage tips.

@fuzziecoder fuzziecoder self-assigned this Feb 24, 2026
@fuzziecoder fuzziecoder merged commit 9a5e283 into codex/fix-remaining-issues-and-raise-pr Feb 24, 2026
7 checks passed

devin-ai-integration Bot left a comment


Devin Review found 1 potential issue.

View 5 additional findings in Devin Review.


Comment thread on backend/worker.py
Comment on lines +21 to +25

while RUNNING:
    # Placeholder for queue-based execution workers.
    # This keeps the worker pool isolated from API pods.
    print("worker-heartbeat")
    time.sleep(15)

🟡 Graceful shutdown delayed up to 15 seconds because time.sleep() auto-retries after signal (PEP 475)

The worker's graceful shutdown mechanism doesn't work promptly. When SIGTERM/SIGINT is received during time.sleep(15), the signal handler sets RUNNING = False, but due to PEP 475 (Python 3.5+), time.sleep() automatically retries for the remaining duration after the signal handler returns. The while RUNNING condition is not re-checked until the full 15-second sleep completes.

Root Cause and Verification

PEP 475 modified the standard library to automatically retry system calls that are interrupted by signals (EINTR). This means time.sleep(15) will resume sleeping for the remaining time after the _shutdown_handler sets RUNNING = False.

Verified empirically: a time.sleep(5) interrupted by a signal after 1 second still sleeps the full 5 seconds, even though the signal handler ran at the 1-second mark.
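
A minimal repro of that behavior (not from the PR; it uses SIGALRM on a Unix-like system as a stand-in for SIGTERM):

import signal
import time

def handler(signum, frame):
    print(f"handler ran at t={time.monotonic() - t0:.1f}s")

signal.signal(signal.SIGALRM, handler)

t0 = time.monotonic()
signal.alarm(1)  # deliver SIGALRM after ~1 second
time.sleep(5)    # PEP 475: the sleep resumes after the handler returns
print(f"sleep returned at t={time.monotonic() - t0:.1f}s")  # prints ~5.0, not ~1.0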

Actual behavior: Worker takes up to 15 seconds to shut down after receiving SIGTERM, because time.sleep(15) at backend/worker.py:25 resumes after the signal handler completes.

Expected behavior: Worker should exit promptly (within milliseconds) after receiving SIGTERM.

Impact: In Kubernetes, this means pod termination is delayed by up to 15 seconds on every rolling update or scale-down. While this is within the default 30-second terminationGracePeriodSeconds, it unnecessarily slows deployments and wastes resources. If the sleep interval were increased (e.g., to 60 seconds), it could exceed the grace period and cause forced kills (SIGKILL).

Fix: use threading.Event.wait() instead of time.sleep(); the wait returns immediately once the event is set:

import signal
import threading

_stop_event = threading.Event()

def _shutdown_handler(signum, frame):
    _stop_event.set()

# Signal registration as in the existing worker.
signal.signal(signal.SIGTERM, _shutdown_handler)
signal.signal(signal.SIGINT, _shutdown_handler)

while not _stop_event.is_set():
    print("worker-heartbeat")
    _stop_event.wait(15)  # wakes immediately when _stop_event is set
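
A quick way to exercise the fix (a hypothetical smoke test, not part of the PR): start the worker, send SIGTERM after about a second, and confirm it exits well before the 15-second heartbeat interval.

import signal
import subprocess
import time

proc = subprocess.Popen(["python", "backend/worker.py"])
time.sleep(1)  # let the worker start and enter its loop
t0 = time.monotonic()
proc.send_signal(signal.SIGTERM)
proc.wait(timeout=5)  # with the Event-based loop this returns almost immediately
print(f"worker exited in {time.monotonic() - t0:.2f}s")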
Prompt for agents
In backend/worker.py, replace the time.sleep-based loop with a threading.Event-based approach for prompt graceful shutdown. Specifically:

1. At the top of the file (around line 7-8), replace `RUNNING = True` with:
   import threading
   _stop_event = threading.Event()

2. Change the _shutdown_handler function (lines 11-13) to:
   def _shutdown_handler(signum, frame):
       _stop_event.set()

3. Change the main loop (lines 21-25) from:
   while RUNNING:
       print("worker-heartbeat")
       time.sleep(15)
   to:
   while not _stop_event.is_set():
       print("worker-heartbeat")
       _stop_event.wait(15)

4. Remove the `import time` if no longer needed, and remove the `RUNNING` global variable.

The threading.Event.wait() method returns immediately when the event is set, unlike time.sleep() which auto-retries after signal interruption due to PEP 475.
