Validate two production-critical paths:
- Distributed load stability (multiple app instances + queue backend + cache)
- Alerting loop closure (Prometheus + Alertmanager + Loki + Tempo)
- App: 2~3 instances (`docker compose --scale app=3`)
- Queue: Redis Streams or RabbitMQ
- Storage: MySQL + pgvector (optional)
- Observability: Prometheus + Alertmanager + Loki + Tempo + Promtail
- Start stack: `docker compose up -d`
- Start observability stack: `docker compose -f docker-compose.observability.yml up -d`
- Run distributed load: `k6 run performance/k6/distributed_chat_ingestion.js -e BASE_URL=http://localhost:8080 -e BEARER_TOKEN=xxx`
  - Generate report: `python3 performance/k6/generate_report.py --summary reports/performance/distributed-k6-summary.json`
- Inject failure drill (optional):
  - stop one app instance
  - stop the queue consumer process once
- Verify metrics:
  - p95 latency
  - error rate
  - queue lag
  - retry success rate
- Verify traces/logs:
  - each trace carries `request_id`/`trace_id`/`session_id`/`job_id`
  - logs can be filtered by `trace_id`
- Verify alert closure:
  - trigger: high latency / error spike
  - Alertmanager receives the alert
  - a resolved notification appears after the load stops
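
The metric checks above can be partly scripted. The following is a minimal sketch, not the repository's own tooling: it reads p95 latency and error rate from a k6 summary and compares them with limits. The limit values and the function name `check_summary` are hypothetical, since the source states no pass/fail thresholds. It assumes the summary uses k6's standard `http_req_duration` and `http_req_failed` metrics, and it accepts both the flat layout and the layout nested under `values`. Queue lag and retry success rate are not standard k6 metrics, so they are not covered here.

```python
import json
from pathlib import Path

# Hypothetical limits: the runbook does not define pass/fail thresholds.
P95_LIMIT_MS = 800.0
ERROR_RATE_LIMIT = 0.01


def _values(metric: dict) -> dict:
    # Some k6 summary layouts nest numbers under "values"; others are flat.
    return metric.get("values", metric)


def check_summary(summary: dict,
                  p95_limit_ms: float = P95_LIMIT_MS,
                  error_rate_limit: float = ERROR_RATE_LIMIT) -> list:
    """Return a list of failure messages; an empty list means both checks passed."""
    metrics = summary.get("metrics", {})
    failures = []

    p95 = _values(metrics.get("http_req_duration", {})).get("p(95)")
    if p95 is None:
        failures.append("p95 latency missing from summary")
    elif p95 > p95_limit_ms:
        failures.append(f"p95 latency {p95:.1f} ms exceeds {p95_limit_ms:.1f} ms")

    failed = _values(metrics.get("http_req_failed", {}))
    # The rate is stored as "rate" in one layout and "value" in the other.
    rate = failed.get("rate", failed.get("value"))
    if rate is None:
        failures.append("error rate missing from summary")
    elif rate > error_rate_limit:
        failures.append(f"error rate {rate:.4f} exceeds {error_rate_limit:.4f}")

    return failures


def check_summary_file(path: str) -> list:
    """Load a k6 summary JSON file and run the same checks on it."""
    return check_summary(json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling `check_summary_file("reports/performance/distributed-k6-summary.json")` after the load run returns an empty list when both values are within the chosen limits.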
- `reports/performance/distributed-k6-summary.json`
- `reports/performance/k6-report.md`
- screenshot of Grafana dashboards
- screenshot of firing + resolved alerts
- one-page postmortem note
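
The firing and resolved alert evidence can be backed by a scriptable check alongside the screenshots. The sketch below is an illustration under assumptions: it queries Alertmanager's `/api/v2/alerts` endpoint once during the load and once after it stops, then compares the two snapshots. The base URL `http://localhost:9093` (Alertmanager's default port) and the alert name `HighLatency` are assumptions, as are the helper names; substitute whatever the alert rules in this stack actually define.

```python
import json
from urllib.request import urlopen


def fetch_alerts(base_url: str = "http://localhost:9093") -> list:
    """Fetch the current alert list from Alertmanager (base URL is an assumption)."""
    with urlopen(f"{base_url}/api/v2/alerts", timeout=5) as resp:
        return json.load(resp)


def active_alert_names(alerts: list) -> set:
    """Names of alerts whose status.state is "active" in an API v2 response."""
    return {
        alert.get("labels", {}).get("alertname", "")
        for alert in alerts
        if alert.get("status", {}).get("state") == "active"
    }


def closure_state(during_load: list, after_stop: list, alertname: str) -> str:
    """Classify one alert from two snapshots: taken during load and after it stops."""
    fired = alertname in active_alert_names(during_load)
    still_firing = alertname in active_alert_names(after_stop)
    if not fired:
        return "never fired"
    return "still firing" if still_firing else "resolved"
```

A drill would call `fetch_alerts()` while k6 is running, call it again once the alert's resolve window has passed, and expect `closure_state(..., "HighLatency")` to return `"resolved"`.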