10 changes: 10 additions & 0 deletions .dockerignore
@@ -0,0 +1,10 @@
.git
.gitignore
node_modules
dist
__pycache__
*.pyc
.env
.env.*
*.db
pipeline/airflow/logs
17 changes: 17 additions & 0 deletions Dockerfile.backend
@@ -0,0 +1,17 @@
FROM python:3.11-slim

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1

WORKDIR /app

RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*

COPY backend/requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

COPY backend /app/backend

EXPOSE 8000

CMD ["uvicorn", "backend.main:app", "--host", "0.0.0.0", "--port", "8000"]
13 changes: 13 additions & 0 deletions Dockerfile.frontend
@@ -0,0 +1,13 @@
FROM node:20-alpine AS build
WORKDIR /app

COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

FROM nginx:1.27-alpine
COPY --from=build /app/dist /usr/share/nginx/html
EXPOSE 80
CMD ["nginx", "-g", "daemon off;"]
31 changes: 31 additions & 0 deletions backend/worker.py
@@ -0,0 +1,31 @@
"""Simple isolated worker process for Kubernetes worker pool deployments."""

from __future__ import annotations

import signal
import time

RUNNING = True


def _shutdown_handler(signum, frame):
    global RUNNING
    RUNNING = False


def main() -> None:
    signal.signal(signal.SIGTERM, _shutdown_handler)
    signal.signal(signal.SIGINT, _shutdown_handler)

    print("FlexiRoaster worker started. Waiting for tasks...")
    while RUNNING:
        # Placeholder for queue-based execution workers.
        # This keeps the worker pool isolated from API pods.
        print("worker-heartbeat")
        time.sleep(15)
Comment on lines +21 to +25
🟡 Graceful shutdown delayed up to 15 seconds because time.sleep() auto-retries after signal (PEP 475)

The worker's graceful shutdown mechanism doesn't work promptly. When SIGTERM/SIGINT is received during time.sleep(15), the signal handler sets RUNNING = False, but due to PEP 475 (Python 3.5+), time.sleep() automatically retries for the remaining duration after the signal handler returns. The while RUNNING condition is not re-checked until the full 15-second sleep completes.

Root Cause and Verification

PEP 475 modified the standard library to automatically retry system calls that are interrupted by signals (EINTR). This means time.sleep(15) will resume sleeping for the remaining time after the _shutdown_handler sets RUNNING = False.

Verified empirically: a time.sleep(5) interrupted by a signal after 1 second still sleeps the full 5 seconds, even though the signal handler ran at the 1-second mark.

Actual behavior: Worker takes up to 15 seconds to shut down after receiving SIGTERM, because time.sleep(15) at backend/worker.py:25 resumes after the signal handler completes.

Expected behavior: Worker should exit promptly (within milliseconds) after receiving SIGTERM.

Impact: In Kubernetes, this means pod termination is delayed by up to 15 seconds on every rolling update or scale-down. While this is within the default 30-second terminationGracePeriodSeconds, it unnecessarily slows deployments and wastes resources. If the sleep interval were increased (e.g., to 60 seconds), it could exceed the grace period and cause forced kills (SIGKILL).

Fix: Use threading.Event.wait() instead of time.sleep(), which can be interrupted immediately:

import threading
_stop_event = threading.Event()

def _shutdown_handler(signum, frame):
    _stop_event.set()

while not _stop_event.is_set():
    print("worker-heartbeat")
    _stop_event.wait(15)
Prompt for agents
In backend/worker.py, replace the time.sleep-based loop with a threading.Event-based approach for prompt graceful shutdown. Specifically:

1. At the top of the file (around line 7-8), replace `RUNNING = True` with:
   import threading
   _stop_event = threading.Event()

2. Change the _shutdown_handler function (lines 11-13) to:
   def _shutdown_handler(signum, frame):
       _stop_event.set()

3. Change the main loop (lines 21-25) from:
   while RUNNING:
       print("worker-heartbeat")
       time.sleep(15)
   to:
   while not _stop_event.is_set():
       print("worker-heartbeat")
       _stop_event.wait(15)

4. Remove the `import time` if no longer needed, and remove the `RUNNING` global variable.

The threading.Event.wait() method returns immediately when the event is set, unlike time.sleep() which auto-retries after signal interruption due to PEP 475.

print("FlexiRoaster worker shutting down gracefully.")


if __name__ == "__main__":
    main()
69 changes: 69 additions & 0 deletions deploy/README.md
@@ -0,0 +1,69 @@
# Containerization & Orchestration

This repository now includes container and Kubernetes deployment assets that support:

- Docker-based packaging
- Kubernetes orchestration
- Horizontal auto-scaling (HPA)
- Rolling updates
- Self-healing pods (liveness/readiness probes)
- Worker isolation (dedicated worker deployment + node scheduling hints)

## Docker

### Build images

```bash
docker build -f Dockerfile.backend -t flexiroaster-backend:local .
docker build -f Dockerfile.frontend -t flexiroaster-frontend:local .
```
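
If you plan to deploy to Kubernetes, the local images can then be tagged and pushed to a registry. The `ghcr.io/your-org/...` names below are the placeholders used in the manifests; substitute your own registry path:

```bash
docker tag flexiroaster-backend:local ghcr.io/your-org/flexiroaster-backend:latest
docker tag flexiroaster-frontend:local ghcr.io/your-org/flexiroaster-frontend:latest
docker push ghcr.io/your-org/flexiroaster-backend:latest
docker push ghcr.io/your-org/flexiroaster-frontend:latest
```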

### Run locally with Docker Compose

```bash
docker compose up --build
```

Services:
- Frontend: `http://localhost:8080`
- Backend: `http://localhost:8000`
- Worker: isolated background worker process
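
To sanity-check the running stack, the backend's `/health` endpoint (the same path used by the compose healthcheck and the Kubernetes probes) and the frontend can be queried directly. A minimal check, assuming the default ports from `docker-compose.yml`:

```bash
# API health endpoint (also used by the healthcheck/probes)
curl -f http://localhost:8000/health

# Frontend served by nginx
curl -I http://localhost:8080/
```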

## Kubernetes

Kubernetes manifests are under `deploy/k8s` and can be applied with Kustomize:

```bash
kubectl apply -k deploy/k8s
```
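
After applying, rollout and autoscaler status can be checked with standard `kubectl` commands (resource names come from the manifests in `deploy/k8s`):

```bash
kubectl get pods,svc,hpa -n flexiroaster
kubectl rollout status deployment/flexiroaster-backend -n flexiroaster
kubectl rollout status deployment/flexiroaster-frontend -n flexiroaster
kubectl rollout status deployment/flexiroaster-worker -n flexiroaster
```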

### What is included

- `backend.yaml`: API deployment/service with rolling update strategy and probes.
- `frontend.yaml`: web deployment/service with rolling update strategy and probes.
- `worker.yaml`: isolated worker deployment with node selector/tolerations.
- `autoscaling.yaml`: HPAs for backend and worker.
- `namespace.yaml`: dedicated namespace.

## Managed Kubernetes options

These manifests are cloud-agnostic and can be deployed to:

- **AWS EKS**
- **Google GKE**
- **Azure AKS**

### Recommended managed-cluster setup

1. Create separate node pools for API/web and workers.
2. Label/taint worker nodes to enforce isolation (example commands after this list):
   - Label: `workload=worker`
   - Taint: `dedicated=worker:NoSchedule`
3. Install Metrics Server (or provider equivalent) for HPA.
4. Use a cloud load balancer + Ingress controller for public access.
5. Push images to a cloud registry (ECR/GAR/ACR) and update image references.
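
For steps 2 and 3, a sketch of the corresponding commands; the node name `worker-node-1` is a placeholder for one of your worker-pool nodes:

```bash
# Label and taint a dedicated worker node
kubectl label nodes worker-node-1 workload=worker
kubectl taint nodes worker-node-1 dedicated=worker:NoSchedule

# Install Metrics Server so the HPAs can read CPU utilization
# (managed clusters often provide an equivalent add-on instead)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```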

## Notes

- Replace placeholder image names (`ghcr.io/your-org/...`) before deployment.
- Consider adding PodDisruptionBudgets, NetworkPolicies, and secrets management for production hardening; a minimal starting point is sketched below.
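
As a starting point for the hardening above, a PodDisruptionBudget for the API can be created imperatively; the name and `--min-available` value here are illustrative, not tuned:

```bash
# Keep at least one backend pod available during voluntary disruptions
kubectl create poddisruptionbudget flexiroaster-backend-pdb \
  --namespace=flexiroaster \
  --selector=app=flexiroaster-backend \
  --min-available=1
```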
39 changes: 39 additions & 0 deletions deploy/k8s/autoscaling.yaml
@@ -0,0 +1,39 @@
# Both HPAs scale on CPU utilization, so the cluster needs Metrics Server
# (or a provider-managed equivalent) installed.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flexiroaster-backend-hpa
  namespace: flexiroaster
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flexiroaster-backend
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flexiroaster-worker-hpa
  namespace: flexiroaster
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flexiroaster-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
58 changes: 58 additions & 0 deletions deploy/k8s/backend.yaml
@@ -0,0 +1,58 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexiroaster-backend
  namespace: flexiroaster
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: flexiroaster-backend
  template:
    metadata:
      labels:
        app: flexiroaster-backend
    spec:
      containers:
        - name: backend
          image: ghcr.io/your-org/flexiroaster-backend:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8000
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 20
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: flexiroaster-backend
  namespace: flexiroaster
spec:
  selector:
    app: flexiroaster-backend
  ports:
    - name: http
      port: 8000
      targetPort: 8000
58 changes: 58 additions & 0 deletions deploy/k8s/frontend.yaml
@@ -0,0 +1,58 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexiroaster-frontend
  namespace: flexiroaster
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: flexiroaster-frontend
  template:
    metadata:
      labels:
        app: flexiroaster-frontend
    spec:
      containers:
        - name: frontend
          image: ghcr.io/your-org/flexiroaster-frontend:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 80
          readinessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /
              port: 80
            initialDelaySeconds: 15
            periodSeconds: 20
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: flexiroaster-frontend
  namespace: flexiroaster
spec:
  selector:
    app: flexiroaster-frontend
  ports:
    - name: http
      port: 80
      targetPort: 80
8 changes: 8 additions & 0 deletions deploy/k8s/kustomization.yaml
@@ -0,0 +1,8 @@
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- backend.yaml
- frontend.yaml
- worker.yaml
- autoscaling.yaml
4 changes: 4 additions & 0 deletions deploy/k8s/namespace.yaml
@@ -0,0 +1,4 @@
apiVersion: v1
kind: Namespace
metadata:
  name: flexiroaster
39 changes: 39 additions & 0 deletions deploy/k8s/worker.yaml
@@ -0,0 +1,39 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flexiroaster-worker
  namespace: flexiroaster
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: flexiroaster-worker
  template:
    metadata:
      labels:
        app: flexiroaster-worker
    spec:
      # Worker isolation: run only on nodes labeled workload=worker and
      # tolerate the dedicated=worker:NoSchedule taint (see deploy/README.md).
      nodeSelector:
        workload: worker
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "worker"
          effect: "NoSchedule"
      containers:
        - name: worker
          image: ghcr.io/your-org/flexiroaster-backend:latest
          imagePullPolicy: IfNotPresent
          command: ["python", "-m", "backend.worker"]
          resources:
            requests:
              cpu: "200m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
37 changes: 37 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,37 @@
version: "3.9"

services:
  backend:
    build:
      context: .
      dockerfile: Dockerfile.backend
    ports:
      - "8000:8000"
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 20s

  frontend:
    build:
      context: .
      dockerfile: Dockerfile.frontend
    ports:
      - "8080:80"
    depends_on:
      backend:
        condition: service_healthy
    restart: unless-stopped

  worker:
    build:
      context: .
      dockerfile: Dockerfile.backend
    command: ["python", "-m", "backend.worker"]
    depends_on:
      backend:
        condition: service_healthy
    restart: unless-stopped