A curated collection of production-tested DevOps templates, configurations, and patterns. Everything here has been used in real production environments. Copy, adapt, ship.
No toy examples. No "hello world" pipelines. Every template is annotated with the reasoning behind each decision.
- GitHub Actions Workflows
- Terraform Modules
- Kubernetes Manifests
- Docker Configurations
- Monitoring and Observability
- Security Templates
- Useful References
- Contributing
- License
A production-grade CI pipeline that handles linting, testing, and artifact building. The example targets Node.js; adapting it to Python or Go means swapping the setup action and the install, lint, test, and build commands.
# .github/workflows/ci.yml
name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
# Cancel in-progress runs for the same branch
concurrency:
group: ci-${{ github.ref }}
cancel-in-progress: true
permissions:
contents: read
checks: write
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: 'npm'
- run: npm ci
- run: npm run lint
test:
runs-on: ubuntu-latest
needs: lint
strategy:
matrix:
node-version: [20, 22]
services:
postgres:
image: postgres:16
env:
POSTGRES_USER: test
POSTGRES_PASSWORD: test
POSTGRES_DB: testdb
ports:
- 5432:5432
options: >-
--health-cmd="pg_isready"
--health-interval=10s
--health-timeout=5s
--health-retries=5
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: ${{ matrix.node-version }}
cache: 'npm'
- run: npm ci
- run: npm test -- --coverage
env:
DATABASE_URL: postgresql://test:test@localhost:5432/testdb
- uses: actions/upload-artifact@v4
if: always()
with:
name: coverage-${{ matrix.node-version }}
path: coverage/
build:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: '22'
cache: 'npm'
- run: npm ci
- run: npm run build
- uses: actions/upload-artifact@v4
with:
name: build-output
path: dist/
retention-days: 7
Why this works:
- concurrency prevents wasted runner time on superseded pushes
- Service containers give you a real database for integration tests without external dependencies
- Matrix builds catch compatibility issues across Node versions
- Artifacts with retention limits prevent storage bloat
Build a Docker image and push to Amazon ECR or GitHub Container Registry.
# .github/workflows/docker-publish.yml
name: Docker Build and Push
on:
push:
tags: ['v*']
permissions:
contents: read
packages: write
id-token: write # Required for OIDC auth to AWS
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
build-and-push:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/metadata-action@v5
id: meta
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=semver,pattern={{version}}
type=semver,pattern={{major}}.{{minor}}
type=sha,prefix=
- uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha
cache-to: type=gha,mode=max
platforms: linux/amd64,linux/arm64
Why this works:
- Triggers only on semantic version tags, not every push
- Multi-platform builds for both AMD64 and ARM64
- GitHub Actions cache for Docker layers reduces build time by 50-80%
- Metadata action generates proper semver tags automatically
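To make the tag patterns concrete, here is a minimal Python sketch of how the three `tags` entries in the workflow expand for one release. It approximates the expansion and is not the metadata action's implementation: the function name `expand_tags` is hypothetical, it assumes a plain `vMAJOR.MINOR.PATCH` tag with no pre-release suffix, and it assumes a 7-character short SHA. Depending on its flavor settings, the real action may also emit a `latest` tag, which this sketch ignores.

```python
import re


def expand_tags(git_tag: str, commit_sha: str) -> list[str]:
    """Approximate the image tags produced for a vMAJOR.MINOR.PATCH git tag."""
    match = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", git_tag)
    if match is None:
        raise ValueError(f"expected a tag like v1.2.3, got {git_tag!r}")
    major, minor, patch = match.groups()
    return [
        f"{major}.{minor}.{patch}",  # type=semver,pattern={{version}}
        f"{major}.{minor}",          # type=semver,pattern={{major}}.{{minor}}
        commit_sha[:7],              # type=sha,prefix= (short SHA, no prefix)
    ]


tags = expand_tags("v1.4.2", "0123456789abcdef0123456789abcdef01234567")
print(tags)
```

The floating `1.4` tag lets consumers track patch releases, while the full version and SHA tags identify one specific build.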
Safe Terraform workflow with plan review on pull requests and apply on merge.
# .github/workflows/terraform.yml
name: Terraform
on:
push:
branches: [main]
paths: ['infra/**']
pull_request:
branches: [main]
paths: ['infra/**']
permissions:
contents: read
pull-requests: write
id-token: write
env:
TF_VERSION: '1.8'
WORKING_DIR: './infra'
jobs:
plan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Terraform Init
working-directory: ${{ env.WORKING_DIR }}
run: terraform init -backend-config=backend.hcl
- name: Terraform Validate
working-directory: ${{ env.WORKING_DIR }}
run: terraform validate
- name: Terraform Plan
id: plan
working-directory: ${{ env.WORKING_DIR }}
run: terraform plan -no-color -out=tfplan
continue-on-error: true
- name: Comment Plan on PR
if: github.event_name == 'pull_request'
uses: actions/github-script@v7
with:
script: |
const output = `#### Terraform Plan
\`\`\`
${{ steps.plan.outputs.stdout }}
\`\`\`
*Pushed by: @${{ github.actor }}, Action: \`${{ github.event_name }}\`*`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: output
});
# continue-on-error lets the PR comment post even when the plan fails;
# this step then fails the job so a broken plan never reaches apply
- name: Fail if plan failed
if: steps.plan.outcome == 'failure'
run: exit 1
apply:
runs-on: ubuntu-latest
needs: plan
if: github.ref == 'refs/heads/main' && github.event_name == 'push'
environment: production
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
aws-region: us-east-1
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: ${{ env.TF_VERSION }}
- name: Terraform Init
working-directory: ${{ env.WORKING_DIR }}
run: terraform init -backend-config=backend.hcl
- name: Terraform Apply
working-directory: ${{ env.WORKING_DIR }}
run: terraform apply -auto-approve
Why this works:
- OIDC authentication (no long-lived AWS keys in GitHub secrets)
- Plan output posted as a PR comment for review
- Apply only runs on main branch push, gated by a GitHub environment with approval rules
- Path filtering prevents unnecessary runs on non-infra changes
Automated semantic releases with changelog generation.
# .github/workflows/release.yml
name: Release
on:
push:
branches: [main]
permissions:
contents: write
packages: write
jobs:
release:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: actions/setup-node@v4
with:
node-version: '22'
- name: Create Release
uses: google-github-actions/release-please-action@v3 # changelog-types is a v3 input; v4 reads this from release-please-config.json
with:
release-type: node
changelog-types: |
[
{"type":"feat","section":"Features","hidden":false},
{"type":"fix","section":"Bug Fixes","hidden":false},
{"type":"perf","section":"Performance","hidden":false},
{"type":"refactor","section":"Refactoring","hidden":true},
{"type":"chore","section":"Miscellaneous","hidden":true}
]
A production VPC with public/private subnets, NAT gateways, and flow logs.
# modules/vpc/main.tf
variable "name" {
type = string
description = "Name prefix for all VPC resources"
}
variable "cidr" {
type = string
default = "10.0.0.0/16"
description = "VPC CIDR block"
}
variable "azs" {
type = list(string)
default = ["us-east-1a", "us-east-1b", "us-east-1c"]
description = "Availability zones"
}
variable "enable_nat_gateway" {
type = bool
default = true
}
variable "single_nat_gateway" {
type = bool
default = false
description = "Use one NAT gateway instead of one per AZ (saves cost, reduces HA)"
}
locals {
public_subnets = [for i, az in var.azs : cidrsubnet(var.cidr, 8, i)]
private_subnets = [for i, az in var.azs : cidrsubnet(var.cidr, 8, i + 100)]
}
resource "aws_vpc" "main" {
cidr_block = var.cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = {
Name = "${var.name}-vpc"
}
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = { Name = "${var.name}-igw" }
}
resource "aws_subnet" "public" {
count = length(var.azs)
vpc_id = aws_vpc.main.id
cidr_block = local.public_subnets[count.index]
availability_zone = var.azs[count.index]
map_public_ip_on_launch = true
tags = {
Name = "${var.name}-public-${var.azs[count.index]}"
Tier = "public"
}
}
resource "aws_subnet" "private" {
count = length(var.azs)
vpc_id = aws_vpc.main.id
cidr_block = local.private_subnets[count.index]
availability_zone = var.azs[count.index]
tags = {
Name = "${var.name}-private-${var.azs[count.index]}"
Tier = "private"
}
}
resource "aws_eip" "nat" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.azs)) : 0
domain = "vpc"
tags = { Name = "${var.name}-nat-eip-${count.index}" }
}
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.azs)) : 0
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = { Name = "${var.name}-nat-${count.index}" }
}
output "vpc_id" { value = aws_vpc.main.id }
output "public_subnet_ids" { value = aws_subnet.public[*].id }
output "private_subnet_ids" { value = aws_subnet.private[*].id }
Design decisions:
- single_nat_gateway option for dev/staging environments (saves ~$32/month per NAT gateway removed)
- Private subnets start at network number 100 (10.0.100.0/24 with the default CIDR), which leaves room for additional subnet tiers between the public and private ranges
- DNS hostnames enabled for private DNS resolution (required for VPC endpoints)
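The subnet layout that results from the `cidrsubnet(var.cidr, 8, i)` and `cidrsubnet(var.cidr, 8, i + 100)` expressions can be checked with a short Python sketch. The `cidrsubnet` function below is a hypothetical re-implementation written for illustration, assuming the module's default CIDR and availability zones; it is not Terraform's own code.

```python
import ipaddress


def cidrsubnet(prefix: str, newbits: int, netnum: int) -> str:
    """Approximate Terraform's cidrsubnet(): carve subnet `netnum` out of `prefix`."""
    base = ipaddress.ip_network(prefix)
    new_prefix = base.prefixlen + newbits
    if new_prefix > base.max_prefixlen:
        raise ValueError("not enough address bits for newbits")
    if not 0 <= netnum < 2 ** newbits:
        raise ValueError("netnum does not fit in newbits")
    # Shift the subnet number into the bits between the old and new prefix.
    offset = netnum << (base.max_prefixlen - new_prefix)
    return f"{base.network_address + offset}/{new_prefix}"


cidr = "10.0.0.0/16"  # module default
azs = ["us-east-1a", "us-east-1b", "us-east-1c"]  # module default

public_subnets = [cidrsubnet(cidr, 8, i) for i in range(len(azs))]
private_subnets = [cidrsubnet(cidr, 8, i + 100) for i in range(len(azs))]

print(public_subnets)   # ['10.0.0.0/24', '10.0.1.0/24', '10.0.2.0/24']
print(private_subnets)  # ['10.0.100.0/24', '10.0.101.0/24', '10.0.102.0/24']
```

With 8 new bits on a /16, every subnet is a /24, and network numbers 3 through 99 stay free for future tiers such as database or cache subnets.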
# modules/ecs-service/main.tf
variable "name" { type = string }
variable "cluster_id" { type = string }
variable "image" { type = string }
# HCL does not allow semicolons, and a single-line block may hold only one argument
variable "cpu" {
type = number
default = 256
}
variable "memory" {
type = number
default = 512
}
variable "desired_count" {
type = number
default = 2
}
variable "container_port" {
type = number
default = 8080
}
variable "subnet_ids" { type = list(string) }
variable "security_group_ids" { type = list(string) }
variable "target_group_arn" { type = string }
variable "environment" {
type = map(string)
default = {}
}
resource "aws_ecs_task_definition" "main" {
family = var.name
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = var.cpu
memory = var.memory
execution_role_arn = aws_iam_role.execution.arn
task_role_arn = aws_iam_role.task.arn
container_definitions = jsonencode([{
name = var.name
image = var.image
essential = true
portMappings = [{
containerPort = var.container_port
protocol = "tcp"
}]
environment = [for k, v in var.environment : { name = k, value = v }]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.main.name
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = var.name
}
}
healthCheck = {
command = ["CMD-SHELL", "curl -f http://localhost:${var.container_port}/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 60
}
}])
}
resource "aws_ecs_service" "main" {
name = var.name
cluster = var.cluster_id
task_definition = aws_ecs_task_definition.main.arn
desired_count = var.desired_count
launch_type = "FARGATE"
network_configuration {
subnets = var.subnet_ids
security_groups = var.security_group_ids
}
load_balancer {
target_group_arn = var.target_group_arn
container_name = var.name
container_port = var.container_port
}
deployment_circuit_breaker {
enable = true
rollback = true
}
lifecycle {
ignore_changes = [desired_count] # Let autoscaling manage this
}
}
resource "aws_cloudwatch_log_group" "main" {
name = "/ecs/${var.name}"
retention_in_days = 30
}
data "aws_region" "current" {}
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
labels:
app: api-server
version: v1
spec:
replicas: 3
selector:
matchLabels:
app: api-server
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1
maxUnavailable: 0 # Zero downtime deployments
template:
metadata:
labels:
app: api-server
version: v1
spec:
terminationGracePeriodSeconds: 60
serviceAccountName: api-server
securityContext:
runAsNonRoot: true
runAsUser: 1000
fsGroup: 1000
containers:
- name: api-server
image: ghcr.io/org/api-server:v1.0.0
ports:
- containerPort: 8080
name: http
env:
- name: NODE_ENV
value: production
- name: PORT
value: "8080"
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: api-secrets
key: db-password
resources:
requests:
cpu: 250m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
livenessProbe:
httpGet:
path: /healthz
port: http
initialDelaySeconds: 15
periodSeconds: 20
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: http
initialDelaySeconds: 5
periodSeconds: 10
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 10"]
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: api-server
Key decisions:
- maxUnavailable: 0 guarantees zero-downtime rolling updates
- preStop hook with sleep ensures load balancer drains connections before pod termination
- topologySpreadConstraints distributes pods across availability zones
- runAsNonRoot enforces non-root containers (security best practice)
- Resource requests and limits prevent noisy-neighbor issues
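The effect of `maxSurge: 1` with `maxUnavailable: 0` can be illustrated with a simplified Python simulation. This is a sketch under stated assumptions, not the Deployment controller's algorithm: the function `simulate_rollout` is hypothetical, and it assumes every new pod becomes ready immediately, so it only shows the pod-count bounds the two settings enforce.

```python
def simulate_rollout(replicas: int, max_surge: int, max_unavailable: int):
    """Return (old_pods, new_pods) after each step of a simplified rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("maxSurge and maxUnavailable cannot both be 0")
    old, new = replicas, 0
    history = [(old, new)]
    while new < replicas:
        # Start new pods, limited by the surge budget above the desired count.
        can_add = min(replicas + max_surge - (old + new), replicas - new)
        new += can_add
        history.append((old, new))
        # Retire old pods, never dropping below the minimum available count.
        can_remove = min(old, (old + new) - (replicas - max_unavailable))
        old -= can_remove
        history.append((old, new))
    return history


history = simulate_rollout(replicas=3, max_surge=1, max_unavailable=0)
totals = [old + new for old, new in history]
print(history)
print("min pods:", min(totals), "max pods:", max(totals))
```

With the values from the manifest, the pod count stays between 3 and 4 for the whole rollout: capacity never drops below the desired replica count, at the cost of briefly running one extra pod.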
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 4
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 120
Why asymmetric scaling: Scale up fast (4 pods per minute) to handle traffic spikes. Scale down slowly (1 pod every 2 minutes with 5-minute stabilization) to prevent flapping during variable load.
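A rough Python sketch shows what the asymmetry means in wall-clock time. It uses the replica formula from the Kubernetes HPA documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, and treats each policy as a fixed pods-per-period budget. The helper names are hypothetical, and the model ignores stabilization windows and metric lag, so the numbers are lower bounds rather than predictions.

```python
import math


def desired_replicas(current: int, metric: float, target: float) -> int:
    """Replica count the HPA aims for, before behavior policies are applied."""
    return math.ceil(current * metric / target)


def seconds_to_scale(start: int, goal: int, pods_per_period: int, period_seconds: int) -> int:
    """Minimum time to move from start to goal under a Pods-type policy."""
    periods = math.ceil(abs(goal - start) / pods_per_period)
    return periods * period_seconds


# 3 replicas at 140% CPU against the 70% target: the HPA wants 6 replicas.
print(desired_replicas(3, 140, 70))

# Scale up from minReplicas to maxReplicas at 4 pods per 60 seconds.
print(seconds_to_scale(3, 20, pods_per_period=4, period_seconds=60))

# Scale down from maxReplicas to minReplicas at 1 pod per 120 seconds.
print(seconds_to_scale(20, 3, pods_per_period=1, period_seconds=120))
```

Under these assumptions the full scale-up takes about 5 minutes while the full scale-down takes at least 34 minutes, before adding the 300-second scale-down stabilization window.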
# k8s/ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-server
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/limit-rpm: "100" # ingress-nginx rate limit: requests per minute per client IP
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
ingressClassName: nginx
tls:
- hosts:
- api.example.com
secretName: api-tls
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-server
port:
number: 80
# k8s/cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: db-backup
spec:
schedule: "0 2 * * *" # Daily at 2 AM UTC
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5
jobTemplate:
spec:
backoffLimit: 3
activeDeadlineSeconds: 3600 # Kill if running > 1 hour
template:
spec:
restartPolicy: OnFailure
containers:
- name: backup
image: ghcr.io/org/db-backup:v1.0.0
env:
- name: DB_URL
valueFrom:
secretKeyRef:
name: db-secrets
key: connection-string
- name: S3_BUCKET
value: my-backups-bucket
resources:
requests:
cpu: 500m
memory: 1Gi
limits:
cpu: 1000m
memory: 2Gi
# Dockerfile
FROM node:22-alpine AS deps
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci --include=dev
FROM node:22-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
RUN npm prune --omit=dev
FROM node:22-alpine AS runner
WORKDIR /app
RUN addgroup -g 1001 -S appgroup && \
adduser -S appuser -u 1001 -G appgroup
COPY --from=builder --chown=appuser:appgroup /app/dist ./dist
COPY --from=builder --chown=appuser:appgroup /app/node_modules ./node_modules
COPY --from=builder --chown=appuser:appgroup /app/package.json ./
USER appuser
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD wget --no-verbose --tries=1 --spider http://localhost:8080/health || exit 1
CMD ["node", "dist/main.js"]
Why multi-stage: Final image contains only production dependencies and compiled code. Typical size reduction: 1.2GB to 180MB.
# Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
RUN pip install --no-cache-dir uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
FROM python:3.12-slim AS runner
WORKDIR /app
RUN groupadd -r appgroup && useradd -r -g appgroup appuser
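COPY --from=builder /app /app
# uv sync installs into /app/.venv; put it on PATH so python resolves uvicorn
ENV PATH="/app/.venv/bin:$PATH"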
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml
services:
app:
build:
context: .
target: deps # Use deps stage for hot reload
volumes:
- .:/app
- /app/node_modules
ports:
- "3000:3000"
environment:
- DATABASE_URL=postgresql://postgres:postgres@db:5432/devdb
- REDIS_URL=redis://redis:6379
depends_on:
db:
condition: service_healthy
redis:
condition: service_healthy
db:
image: postgres:16-alpine
environment:
POSTGRES_USER: postgres
POSTGRES_PASSWORD: postgres
POSTGRES_DB: devdb
ports:
- "5432:5432"
volumes:
- pgdata:/var/lib/postgresql/data
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
timeout: 5s
retries: 5
redis:
image: redis:7-alpine
ports:
- "6379:6379"
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 5s
timeout: 5s
retries: 5
volumes:
pgdata:
# prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alert_rules.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# prometheus/alert_rules.yml
groups:
- name: application
rules:
- alert: HighErrorRate
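# Aggregate both sides by instance; without sum, the division matches series label-for-label
# (including status), so each 5xx series is divided by itself and the ratio is always 1
expr: sum by (instance) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total[5m])) > 0.05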
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.instance }}"
description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
- alert: HighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
for: 10m
labels:
severity: warning
annotations:
summary: "P95 latency above 1s on {{ $labels.instance }}"
- alert: PodCrashLooping
expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
# .github/workflows/security-scan.yml
name: Security Scan
on:
push:
branches: [main]
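pull_request:
permissions:
contents: read
security-events: write # Required by upload-sarif to publish results to code scanning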
jobs:
trivy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build image
run: docker build -t app:scan .
- name: Trivy vulnerability scan
uses: aquasecurity/trivy-action@master # Pin to a release tag or commit SHA in production
with:
image-ref: 'app:scan'
format: 'sarif'
output: 'trivy-results.sarif'
severity: 'CRITICAL,HIGH'
exit-code: '1' # Fail the build on critical/high vulnerabilities
- name: Upload scan results
uses: github/codeql-action/upload-sarif@v3
if: always()
with:
sarif_file: 'trivy-results.sarif'
# .github/workflows/secret-scan.yml
name: Secret Scan
on: [push, pull_request]
jobs:
gitleaks:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- uses: gitleaks/gitleaks-action@v2
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
- GitHub Actions Documentation
- Terraform AWS Provider
- Kubernetes API Reference
- Docker Best Practices
- Prometheus Querying
- Citadel DevOps Pipelines Collection -- Production-ready pipeline templates and infrastructure patterns with walkthrough guides
- Awesome Docker Compose -- Official Docker Compose examples
- Terraform AWS Modules -- Community Terraform modules for AWS
- Kubernetes Examples -- Official Kubernetes example applications
- The Twelve-Factor App -- The foundation of modern app deployment
- Google SRE Book -- Free online, covers monitoring, alerting, incident response
- AWS Well-Architected Framework -- Free, mandatory reading
This repository welcomes contributions. If you have a production-tested template that solves a real problem, open a PR.
- Fork this repository
- Add your template in the appropriate section
- Include a brief explanation of why the template makes the decisions it does
- Submit a pull request
- Templates must be production-tested -- no theoretical configurations
- Include comments explaining non-obvious decisions
- Security-sensitive defaults (non-root containers, least-privilege IAM, encrypted storage)
- Keep templates vendor-neutral where possible, or clearly label vendor-specific sections
- No proprietary tool configurations -- open source and free-tier tools only
- GitLab CI/CD equivalents of the GitHub Actions workflows
- Azure DevOps pipeline templates
- ArgoCD application manifests
- Helm chart templates (production-grade)
- Pulumi equivalents of the Terraform modules
- OpenTelemetry collector configuration
- Linkerd/Istio service mesh configurations
This repository is released under the MIT License.
Use these templates in your projects -- commercial or otherwise. Attribution is appreciated but not required.
Stars help others find these templates. If they saved you time, star the repo.
Last updated: May 2026