A Kubernetes operator built in Python that automates deployment lifecycle management for platform applications on AWS EKS. It implements the operator pattern to continuously reconcile desired state with actual cluster state, enabling reliable, high-availability platform operations.
This operator solves the challenge of managing complex deployment lifecycles at scale by automating the creation, updating, and health monitoring of platform applications without manual intervention. It integrates with Terraform for infrastructure provisioning, Helm for packaging, Prometheus for monitoring, and GitHub Actions for CI/CD.
- Operator Pattern - Continuous reconciliation loop ensures actual cluster state matches desired state
- Deployment Lifecycle Management - Automated create, update, and delete operations for platform applications
- High Availability - Liveness and readiness probes with configurable replica management
- Helm Packaging - Production-ready Helm chart with RBAC, resource limits, and monitoring annotations
- Terraform IaC - AWS EKS cluster provisioning with IAM roles and node group management
- Prometheus Monitoring - Custom alerting rules for degraded apps, downtime, and reconcile errors
- GitHub Actions CI/CD - Automated testing, linting, Helm validation, and Docker build pipeline
- Postmortem-Driven Reliability - Structured error handling with status tracking for incident debugging
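
The create, update, and delete decisions listed above can be illustrated with a small, cluster-free sketch. This is not the operator's actual code: `AppSpec` and `plan_actions` are hypothetical names, and the comparison assumes an application is fully described by its image and replica count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AppSpec:
    """Hypothetical description of one platform application (assumed fields)."""
    image: str
    replicas: int


def plan_actions(desired: dict[str, AppSpec], actual: dict[str, AppSpec]) -> list[tuple[str, str]]:
    """Return (action, app_name) pairs that would move `actual` toward `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))   # wanted but missing
        elif actual[name] != spec:
            actions.append(("update", name))   # present but drifted
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))       # present but no longer wanted
    return actions


desired = {"api": AppSpec("api:2", 3), "web": AppSpec("web:1", 2)}
actual = {"api": AppSpec("api:1", 3), "worker": AppSpec("worker:1", 1)}
print(plan_actions(desired, actual))
# → [('update', 'api'), ('create', 'web'), ('delete', 'worker')]
```

Keeping the planning step a pure function makes it straightforward to cover with PyTest without a cluster or kubeconfig.
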
| Layer | Technology |
|---|---|
| Operator | Python, kubernetes-client |
| Packaging | Helm |
| Infrastructure | Terraform, AWS EKS |
| Monitoring | Prometheus, Grafana |
| CI/CD | GitHub Actions |
| Testing | PyTest |
# Install dependencies
pip install -r requirements.txt
# Run tests
pytest tests/ -v
# Run operator locally (requires kubeconfig)
cd operator
python main.py
# Deploy with Helm
helm install platform-operator helm/platform-operator/
# Provision EKS with Terraform
cd terraform
terraform init
terraform plan
terraform apply

┌─────────────────────────────────────────────┐
│ GitHub Actions CI/CD │
│ test → lint → helm-lint → build → deploy │
├─────────────────────────────────────────────┤
│ Platform Operator (Python) │
│ ┌──────────────────────────────────────┐ │
│ │ Reconciliation Loop (30s) │ │
│ │ Desired State ──▶ Actual State │ │
│ │ Create / Update / Delete / Health │ │
│ └──────────────────────────────────────┘ │
├─────────────────────────────────────────────┤
│ AWS EKS Cluster (Terraform) │
│ Node Group │ RBAC │ IAM Roles │
├─────────────────────────────────────────────┤
│ Prometheus Monitoring │
│ Degraded │ Down │ Reconcile Error Alerts │
└─────────────────────────────────────────────┘
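
The 30-second reconciliation loop in the diagram might be driven by something like the following. This is a sketch under assumptions, not the code in `main.py`: `run_reconcile_loop`, `reconcile_once`, and the injectable `sleep` parameter are hypothetical, and the sketch assumes a failed pass should be logged and retried on the next tick rather than crash the operator.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("platform-operator")

RECONCILE_INTERVAL_SECONDS = 30  # interval taken from the diagram above


def run_reconcile_loop(
    reconcile_once: Callable[[], None],
    interval: float = RECONCILE_INTERVAL_SECONDS,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `reconcile_once` repeatedly and return the number of failed passes.

    A failure in one pass is logged and does not stop the loop, so the next
    pass can retry. `max_iterations=None` runs until the process is stopped.
    """
    failures = 0
    iteration = 0
    while max_iterations is None or iteration < max_iterations:
        try:
            reconcile_once()
        except Exception:
            failures += 1
            logger.exception("reconcile pass %d failed", iteration)
        iteration += 1
        sleep(interval)
    return failures


if __name__ == "__main__":
    # Two no-op passes with sleeping disabled, just to show the call shape.
    run_reconcile_loop(lambda: None, max_iterations=2, sleep=lambda _seconds: None)
```

Passing `sleep` and `max_iterations` as parameters is a testing convenience: a test can run a few passes instantly instead of waiting for real 30-second ticks.
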
When incidents occur:
- The operator logs a structured error with the phase and error message
- A Prometheus alert fires within 2-5 minutes
- The on-call engineer reviews the status conditions
- Mitigation is applied via the reconcile loop or a manual patch
- A postmortem documents the root cause and prevention
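
The first step of this flow, a structured error carrying the phase and error message, could look roughly like this. The record layout is an assumption: `build_error_status`, `log_reconcile_error`, and the `Reconciled` condition type are hypothetical and only loosely follow the common Kubernetes status-condition shape.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("platform-operator")


def build_error_status(app: str, phase: str, error: Exception) -> dict:
    """Build a status record for a failed reconcile step (hypothetical schema)."""
    return {
        "app": app,
        "phase": phase,  # lifecycle phase in which the failure happened
        "conditions": [
            {
                "type": "Reconciled",
                "status": "False",
                "reason": type(error).__name__,
                "message": str(error),
                "lastTransitionTime": datetime.now(timezone.utc).isoformat(),
            }
        ],
    }


def log_reconcile_error(app: str, phase: str, error: Exception) -> dict:
    """Log the status record as one JSON line and return it for status updates."""
    status = build_error_status(app, phase, error)
    logger.error(json.dumps(status, sort_keys=True))
    return status


status = log_reconcile_error("api", "Update", TimeoutError("rollout timed out"))
```

Emitting one JSON object per error keeps the log line machine-parseable, and returning the same record lets the caller write it to the resource status that the on-call engineer reviews.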