Production-grade DevOps reference architecture built around a Flask + PostgreSQL REST API. Covers the full lifecycle from local development to production-style Kubernetes orchestration with GitOps and observability.
Audience: This documentation is structured for engineers preparing for 3-5 year DevOps / SRE interviews. Each module covers the implementation, the deep concepts behind it, troubleshooting from real issues we hit, interview Q&A, STAR stories, and how it maps to the cloud.
┌──────────────────────┐
│ Developer pushes │
│ code to GitHub │
└──────────┬───────────┘
│
▼
┌────────────────────── CI Pipeline (GitHub Actions) ────────────────────────┐
│ │
│ build job: │
│ • Run unit tests (pytest) │
│ • Build Docker image │
│ • Push to DockerHub (tagged with commit SHA) │
│ │
│ update-helm job: │
│ • sed updates helm/application/values.yaml with new image tag │
│ • Commits & pushes to main branch │
│ │
└────────────────────────────────────┬───────────────────────────────────────┘
│ git push main
▼
┌─────────────────── ArgoCD (GitOps Controller in K8s) ──────────────────────┐
│ │
│ Detects values.yaml diff → renders Helm chart → applies new manifests │
│ → Kubernetes does rolling update │
│ │
└────────────────────────────────────┬───────────────────────────────────────┘
│
▼
┌──────────────────── 3-Node Minikube Cluster (Production-like) ───────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────────────┐ │
│ │ App Tier │ │ Database Tier │ │ Dependent Services Tier │ │
│ │ (minikube) │ │ (minikube-m02) │ │ (minikube-m03) │ │
│ │ │ │ │ │ │ │
│ │ • Flask API ×3 │ │ • Postgres │ │ • Vault │ │
│ │ │ │ │ │ • External Secrets Op │ │
│ │ │ │ │ │ • Prometheus + AM │ │
│ │ │ │ │ │ • Grafana │ │
│ │ │ │ │ │ • Loki │ │
│ │ │ │ │ │ • Promtail (DS) │ │
│ │ │ │ │ │ • Postgres exporter │ │
│ │ │ │ │ │ • Blackbox exporter │ │
│ └─────────────────┘ └─────────────────┘ └──────────────────────────┘ │
│ │
└────────────────────────────────────┬─────────────────────────────────────┘
│ Slack alerts
▼
┌───────────────────┐
│ #alerts channel │
└───────────────────┘
| Layer | Tools |
|---|---|
| Application | Flask 3 + SQLAlchemy + Flask-Migrate + Gunicorn + PostgreSQL 15 |
| Testing | pytest (unit) + Locust (load) |
| Containerization | Docker (multi-stage build) + Docker Compose (local stack) + nginx (reverse proxy) |
| CI | GitHub Actions on a self-hosted runner; SHA-based image tagging; auto-update Helm values |
| IaC | Terraform (AWS VPC, EC2, ALB) + Ansible (system bootstrapping) — written, not deployed |
| Orchestration | Kubernetes via Minikube (3-node cluster mimicking multi-AZ) |
| Secrets | HashiCorp Vault + External Secrets Operator (ESO) |
| Packaging | Helm 3 charts for every component |
| GitOps | ArgoCD with App-of-Apps pattern + multi-source pattern for upstream charts |
| Observability | Prometheus + Grafana + Loki + Promtail + Alertmanager + exporters |
| Alerting | Alertmanager → Slack via Incoming Webhooks |
The documentation is structured as a curriculum. Read in order for the full picture, or jump to the topic you need.
Goal: Get the Flask API running locally with venv + Postgres + migrations + seed data.
- Tech stack & architecture
- Step-by-step walkthrough with the why for each step
- 12 interview Q&A on Python venvs, WSGI, migrations, secrets, connection pooling
- 2 STAR stories — moving the project broke the venv, AirPlay port conflict
- Production hardening + AWS mapping
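
The .env-driven configuration from this module can be sketched as follows. This is a minimal illustration, not the project's actual loader: the variable names (`DB_USER`, `DB_PASSWORD`, `DB_HOST`, `DB_PORT`, `DB_NAME`) are assumptions and may not match the real .env keys.

```python
import os
from urllib.parse import quote_plus


def build_database_url(env=None):
    """Assemble a PostgreSQL URL from environment-style settings.

    The variable names are hypothetical; the project's real .env keys may differ.
    """
    env = os.environ if env is None else env
    required = ["DB_USER", "DB_PASSWORD", "DB_HOST", "DB_NAME"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        # Fail fast at startup instead of at the first query.
        raise RuntimeError("missing settings: " + ", ".join(missing))
    port = env.get("DB_PORT", "5432")
    # quote_plus keeps characters such as '@' in the password from breaking the URL.
    password = quote_plus(env["DB_PASSWORD"])
    return (
        f"postgresql://{env['DB_USER']}:{password}"
        f"@{env['DB_HOST']}:{port}/{env['DB_NAME']}"
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to unit test without touching the real environment.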
Goal: Unit tests with pytest + in-memory SQLite, load tests with Locust.
- The test pyramid + why in-memory SQLite for unit tests
- pytest fixture pattern + setup/teardown
- Locust scenarios + headless CI mode
- 14 interview Q&A on test pyramid, RED method, contract testing, load test interpretation
- 2 STAR stories — duplicated Prometheus registry breaking tests, finding the throughput limit
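
The reason in-memory SQLite suits unit tests is isolation: every test gets an empty database that disappears on teardown. The sketch below shows that setup / yield / teardown shape with the standard library only, written as a context manager so it runs without pytest installed. The `students` schema is a hypothetical stand-in for the real models.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def in_memory_db():
    """Yield a fresh in-memory SQLite connection, then close it.

    Mirrors the setup / yield / teardown shape of a pytest fixture.
    The students table is a hypothetical schema for illustration.
    """
    conn = sqlite3.connect(":memory:")  # setup: new empty database
    conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    try:
        yield conn  # the test body runs here
    finally:
        conn.close()  # teardown: the database disappears with the connection


def count_students(conn):
    return conn.execute("SELECT COUNT(*) FROM students").fetchone()[0]


def run_isolation_demo():
    with in_memory_db() as conn:
        conn.execute("INSERT INTO students (name) VALUES (?)", ("Ada",))
        first = count_students(conn)
    with in_memory_db() as conn:
        second = count_students(conn)  # no rows leak from the previous "test"
    return first, second
```

In a pytest suite the same body would typically sit under a `@pytest.fixture` decorator instead of `@contextmanager`.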
Goal: Package the app as a Docker image; orchestrate the multi-service stack with Compose.
- Multi-stage Dockerfile (build vs main; image size 80 MB vs 400 MB)
- Layer caching, EXPOSE vs port mapping, CMD vs ENTRYPOINT
- Compose deep dive: networking, healthchecks, depends_on, volumes
- 14 troubleshooting issues, including the famous 127.0.0.1 Gunicorn binding bug
- 20 interview Q&A: containers vs VMs, layers, distroless, signal handling
- 3 STAR stories — debugging container networking, image optimization, port conflicts
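
The 127.0.0.1 binding bug comes down to one rule: a server bound to a loopback address inside a container only accepts connections that originate inside that container, so traffic arriving through a published port never reaches it. A small sketch of that rule (the helper name is made up for illustration):

```python
import ipaddress


def reachable_through_published_port(bind_host):
    """Return True if a server bound to bind_host can receive traffic that
    arrives on the container's network interface (e.g. via Docker port mapping).
    """
    if bind_host in ("", "0.0.0.0", "::"):
        return True  # wildcard: listens on every interface
    if bind_host == "localhost":
        return False  # resolves to a loopback address
    # Any other literal IP is reachable unless it is a loopback address.
    return not ipaddress.ip_address(bind_host).is_loopback
```

This is why Gunicorn inside a container is normally started with a `0.0.0.0` bind address rather than `127.0.0.1`.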
Goal: On every push, run tests → build image → push to DockerHub → update Helm values in main.
- Self-hosted vs GitHub-hosted runners (when to use which)
- Pipeline walkthrough: build and update-helm jobs
- Cross-platform sed, GH_PAT scopes, secret management
- The CI → GitOps handoff
- 14 troubleshooting issues: setup-python permission errors, push protection, branch confusion
- 20 interview Q&A: CI vs CD, OIDC, matrix builds, blue/green
- 3 STAR stories: setup-python issue on macOS, Slack webhook leak, CI pushing to the wrong branch
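
The update-helm job does its tag rewrite with sed. The same operation is sketched here in Python to make the logic explicit. It assumes values.yaml contains a line shaped like `tag: <value>`; the real chart layout may differ, and the repository name in the usage example is a placeholder.

```python
import re


def set_image_tag(values_text, new_tag):
    """Replace the value of the first `tag:` key, keeping its indentation.

    Assumes a line shaped like `  tag: <something>` exists; this is an
    illustration of the sed step, not the pipeline's actual command.
    """
    pattern = re.compile(r"^([ \t]*tag:[ \t]*).*$", re.MULTILINE)
    # A function replacement avoids backslash-escape surprises in new_tag.
    updated, count = pattern.subn(
        lambda match: f'{match.group(1)}"{new_tag}"', values_text, count=1
    )
    if count == 0:
        raise ValueError("no tag: line found")
    return updated
```

For example, `set_image_tag('image:\n  repository: example/flask-api\n  tag: "abc1234"\n', "def5678")` changes only the tag line and leaves the rest of the file untouched, which keeps the resulting Git diff to a single line.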
Goal: Provision AWS infra (VPC, subnets, NAT, ALB, EC2) with Terraform; configure machines with Ansible.
- for_each vs count (with the index-shifting trap)
- State management: local vs S3 + DynamoDB locking
- Drift detection (apply -refresh-only vs apply)
- Modules, workspaces, backends
- 24 deep Terraform troubleshooting scenarios — state lock recovery, drift, RDS replacement traps, EIP costs, rate limits
- 4 production scenario deep-dives — manually deleted IAM role, CloudFormation migration, leaked tfstate, concurrent applies
- 32 interview Q&A across Terraform + Ansible
- 3 STAR stories — state recovery via S3 versioning, RDS rename trap, $4K/mo cost cleanup
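
The index-shifting trap is easier to see with a toy model. The sketch below is plain Python, not Terraform: it models resource addresses as dictionary keys (the `subnet` resource name is hypothetical) and reports which addresses a plan would touch when the first item is removed from the input list.

```python
def addresses_with_count(names):
    """Model `count`: each instance is addressed by its list position."""
    return {f"subnet[{i}]": name for i, name in enumerate(names)}


def addresses_with_for_each(names):
    """Model `for_each`: each instance is addressed by its own key."""
    return {f'subnet["{name}"]': name for name in names}


def changed_addresses(before, after):
    """Addresses whose planned value differs, or that appear or disappear."""
    keys = set(before) | set(after)
    return sorted(key for key in keys if before.get(key) != after.get(key))


old, new = ["a", "b", "c"], ["b", "c"]  # "a" is removed from the front
count_changes = changed_addresses(addresses_with_count(old), addresses_with_count(new))
for_each_changes = changed_addresses(
    addresses_with_for_each(old), addresses_with_for_each(new)
)
```

With positional addressing every address is affected because the remaining items shift down by one; with keyed addressing only `subnet["a"]` changes.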
Goal: Deploy Vault, ESO, Postgres, Flask onto a 3-node minikube cluster.
- 3-node architecture with workload-to-node placement (type=application/database/dependent_services)
- Vault deployment, init/unseal flow, KV-v2
- ESO architecture + setup + force-sync pattern
- Deep concepts (the bulk of the doc):
  - Networking & CoreDNS: full query flow, ndots:5, Service types, kube-proxy modes
  - Storage: PV/PVC/StorageClass, access modes, reclaim policies
  - Workloads: Deployment vs StatefulSet vs DaemonSet
  - Probes: liveness vs readiness vs startup
  - Rollouts & rollbacks: RollingUpdate vs Recreate, maxSurge math
  - Autoscaling: HPA + VPA + Cluster Autoscaler + KEDA with full YAMLs
  - NetworkPolicies (with DNS gotcha)
  - RBAC: Role vs ClusterRole
  - Operators & CRDs: ESO walkthrough as the canonical example
- ~50 interview Q&A across architecture / networking / storage / workloads / probes / autoscaling / secrets / operators / scenarios
- 4 STAR stories — pod-to-pod debug, stuck namespace, PVC permissions, HPA implementation
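
The maxSurge math mentioned above can be worked through in a few lines. For percentage values, Kubernetes rounds maxSurge up and maxUnavailable down; both default to 25%. The function names below are made up for this sketch.

```python
def resolve(value, replicas, round_up):
    """Turn an int or a percentage string such as "25%" into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating-point rounding surprises.
        return (scaled + 99) // 100 if round_up else scaled // 100
    return int(value)


def rolling_update_bounds(replicas, max_surge="25%", max_unavailable="25%"):
    """Pod-count limits a RollingUpdate must respect while it is in progress."""
    surge = resolve(max_surge, replicas, round_up=True)  # maxSurge rounds up
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

For the three Flask replicas in this project, the defaults work out to at most 4 pods in total and at least 3 available, so the rollout adds one new pod before removing an old one.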
Goal: Package K8s manifests as Helm charts; deploy via ArgoCD using the App-of-Apps pattern.
- Why GitOps (push vs pull)
- Helm deep dive — Chart.yaml, templates, hooks, helpers, sub-charts
- ArgoCD deep dive — Application CRD, sync policies, App-of-Apps, multi-source pattern, sync waves
- The full CI → GitOps → Deploy loop
- 14 troubleshooting issues — CRD version mismatches, ConfigMap-doesn't-restart, sync errors
- 35 interview Q&A — GitOps principles, Helm internals, ArgoCD architecture, AppProjects, ApplicationSet
- 4 STAR stories — adopting GitOps, ConfigMap checksum trick, CRD version mismatch, selfHeal saving the day
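
The ConfigMap checksum trick works because a Deployment only rolls when its pod template changes. Hashing the config into a pod annotation makes a config edit count as a template change. Helm charts usually do this with the `sha256sum` template function; the sketch below shows the same idea in Python, with `checksum/config` as a conventional (not mandatory) annotation key.

```python
import hashlib


def config_checksum(config_data):
    """Hash ConfigMap-style key/value data in a stable key order."""
    canonical = "\n".join(f"{key}={config_data[key]}" for key in sorted(config_data))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def pod_template_annotations(config_data):
    # "checksum/config" is a common convention for the key, not a fixed name.
    return {"checksum/config": config_checksum(config_data)}
```

Changing any value produces a different annotation, so the next sync sees a modified pod template and Kubernetes starts a rolling update; reordering keys does not.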
Goal: Build a full observability layer with metrics, logs, dashboards, and Slack alerts.
- Three pillars (metrics, logs, traces); USE & RED methods
- Component-by-component setup
- Application instrumentation (prometheus-flask-exporter)
- 10 alert rules with severity + USE/RED classification
- Pre-loaded Grafana dashboards (5 community dashboards)
- Slack integration: using slack_api_url_file to keep the webhook out of Git
- 18 troubleshooting issues: PVC permission fixes, schema mismatches, cardinality issues
- 30+ interview Q&A across SLI/SLO/SLA, USE/RED, Prometheus internals, Loki vs ELK, Grafana, real scenarios
- 4 STAR stories — Prometheus permission debug, Slack webhook leak, true-positive alert, observability from scratch
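
The RED method reduces a service to three numbers: request Rate, Error ratio, and Duration. The sketch below computes them from raw samples to show what the alert rules measure. It is a simplification: Prometheus derives these from counters and histogram buckets rather than from individual requests, and the nearest-rank p95 here is an assumption of this example.

```python
def red_summary(requests, window_seconds):
    """Summarise (status_code, duration_seconds) samples as Rate, Errors, Duration."""
    total = len(requests)
    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)
    if durations:
        # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
        rank = max(1, -(-95 * total // 100))  # ceil(0.95 * total) in integer math
        p95 = durations[rank - 1]
    else:
        p95 = None
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_seconds": p95,
    }
```

An alert on error ratio or p95 latency is a RED alert; alerts on CPU, memory, or disk saturation fall under USE.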
1. Local setup → understand the app
↓
2. Containerize → make it portable
↓
3. CI pipeline → automate build + test + push
↓
4. IaC → provision infra reproducibly
↓
5. Kubernetes → run it at scale
↓
6. GitOps → declarative, audited deployments
↓
7. Observability → see what's happening in production
Each layer depends on the previous. The CI pipeline (3) makes sense because we can build a container (2) of the app (1). Kubernetes (5) is meaningful because we have a CI artifact (3). GitOps (6) governs Kubernetes (5). Observability (7) closes the loop — you can finally see what your fully-automated, fully-orchestrated system is doing in real time.
Prerequisites: macOS / Linux, Python 3.10+, Docker Desktop, kubectl, helm, minikube, brew (for installs).
git clone https://github.com/akhil27051999/Flask-REST-API.git
cd Flask-REST-API
python3 -m venv venv && source venv/bin/activate
pip install -r app/requirements.txt
# Configure .env (see Module 1) and run:
flask db upgrade
python app/seed.py
flask run

export ENV_FILE=.env
docker compose up -d --build
docker exec flask-app-container flask db upgrade --directory app/migrations
docker exec -e PYTHONPATH=/api flask-app-container python /api/app/seed.py
curl http://localhost/students/3

# Cluster
minikube start --nodes=3 --driver=docker --cpus=2 --memory=2048
kubectl label node minikube type=application --overwrite
kubectl label node minikube-m02 type=database --overwrite
kubectl label node minikube-m03 type=dependent_services --overwrite
# Install ArgoCD
helm repo add argo https://argoproj.github.io/argo-helm
helm install argocd argo/argo-cd -n argocd --create-namespace
# Bootstrap everything via App-of-Apps
kubectl apply -f argocd/root-app.yaml
# Manual bootstrap steps (Vault unseal, vault-token secret): see Module 5

Edit helm/application/values.yaml (e.g., bump replicas), commit, push:
git add helm/application/values.yaml
git commit -m "scale flask-api to 3"
git push origin main
# ArgoCD picks it up within 3 min — or trigger immediate sync:
kubectl patch application flask-api -n argocd --type merge \
  -p '{"operation":{"sync":{"revision":"main"}}}'

Flask-REST-API/
├── app/ # Flask source code + Dockerfile + requirements.txt + migrations
├── tests/ # pytest unit tests + Locust load tests
├── nginx/ # nginx reverse proxy config + Dockerfile
├── docker-compose.yaml # Local multi-service stack
├── .github/workflows/ # CI pipeline
├── terraform/ # AWS infrastructure (VPC, EC2, ALB, etc.)
├── ansible/ # Configuration management for VMs
├── k8s/ # Raw K8s manifests (legacy/reference; see helm/ for current)
├── helm/ # Helm charts for every component
│ ├── application/ # Flask app
│ ├── vault/ # HashiCorp Vault
│ ├── external-secrets/ # ESO + custom resources
│ ├── database/ # PostgreSQL
│ ├── prometheus/ # Prometheus + Alertmanager
│ ├── grafana/ # Grafana
│ ├── loki/ # Loki
│ ├── promtail/ # Promtail
│ ├── postgres-exporter/ # Postgres metrics exporter
│ └── blackbox-exporter/ # HTTP probe exporter
├── argocd/ # ArgoCD Applications
│ ├── root-app.yaml # The App-of-Apps that manages everything
│ ├── vault.yaml
│ ├── external-secrets.yaml
│ ├── database.yaml
│ ├── application.yaml
│ └── observability-*.yaml
└── docs/ # This documentation (modules + images)
Building this project end-to-end exercises the core tools a DevOps/SRE role at the 3-5 year level typically expects:
- ✅ Python web app with proper structure, migrations, testing
- ✅ Multi-stage Dockerfile with layer caching, alpine base, non-root patterns
- ✅ Docker Compose for local multi-service development
- ✅ CI/CD with GitHub Actions (self-hosted runner) → DockerHub → automatic Helm updates
- ✅ Terraform for AWS provisioning (VPC, subnets, NAT, SGs, ALB, EC2) with state, modules, lifecycle
- ✅ Ansible for configuration management (idempotent, role-based pattern)
- ✅ Kubernetes — multi-node cluster, node labels, all major workload types, networking, storage, RBAC, autoscaling
- ✅ HashiCorp Vault — initialization, unsealing, KV secrets engine
- ✅ External Secrets Operator — bridging Vault and K8s native Secrets
- ✅ Helm — chart structure, templating, hooks, releases
- ✅ ArgoCD — Applications, App-of-Apps, multi-source, sync policies, selfHeal
- ✅ Observability stack — Prometheus, Grafana, Loki, Promtail, Alertmanager, exporters
- ✅ PromQL + LogQL for queries
- ✅ Slack alerting with proper secret handling
- ✅ GitOps workflows — pull-based deploys, drift detection, rollbacks via git revert
- ✅ Real production troubleshooting — Gunicorn binding, fsGroup permissions, push protection, sync errors, state locking
For each module, you should be able to:
- Explain the architecture — what it does and why it's structured that way
- Walk through one debugging story (use the STAR stories as templates)
- Answer 5+ deep questions on the topic from memory
- Sketch the data flow on a whiteboard
- Discuss production hardening — what would change at scale
- Map to cloud equivalents (AWS / GCP / Azure)
This project is a learning + interview prep artifact. Suggested extensions to deepen further:
- Add distributed tracing (Jaeger / Tempo + OpenTelemetry) for the third pillar
- Add service mesh (Istio / Linkerd) for mTLS + traffic policies
- Add chaos engineering (Litmus / Chaos Mesh) — kill pods during load tests
- Migrate Postgres from Deployment → StatefulSet with HA replication
- Add policy as code with OPA Gatekeeper / Kyverno
- Add cert-manager + Ingress with TLS
- Implement canary deployments with Argo Rollouts
MIT (or your license of choice).
Akhil Thyadi — built as a hands-on portfolio project for DevOps / SRE roles.