Distributed Load & Observability Drill

Goal

Validate two production-critical links:

  1. Distributed load stability (multi app instances + queue backend + cache)
  2. Alerting loop closure (Prometheus + Alertmanager + Loki + Tempo)

Topology (simulation)

  • App: 2 to 3 instances (docker compose up -d --scale app=3)
  • Queue: Redis Streams or RabbitMQ
  • Storage: MySQL + pgvector (optional)
  • Observability: Prometheus + Alertmanager + Loki + Tempo + Promtail

Runbook

  1. Start stack: docker compose up -d

  2. Start observability stack: docker compose -f docker-compose.observability.yml up -d

  3. Run distributed load: k6 run performance/k6/distributed_chat_ingestion.js -e BASE_URL=http://localhost:8080 -e BEARER_TOKEN=xxx

3.1 Generate report: python3 performance/k6/generate_report.py --summary reports/performance/distributed-k6-summary.json

  4. Inject failure drill (optional):
     • stop one app instance
     • stop the queue consumer process once
  5. Verify metrics:
     • p95 latency
     • error rate
     • queue lag
     • retry success rate
  6. Verify traces/logs:
     • each trace carries request_id, trace_id, session_id, and job_id
     • logs can be filtered by trace_id
  7. Verify alert closure:
     • trigger: high latency or an error spike
     • Alertmanager receives the alert
     • a resolved notification appears after the load stops
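
The trace/log verification step can be partly scripted. The sketch below is one possible approach, and it assumes the app emits one JSON object per log line, which this drill does not state. The helper names and the sample records are hypothetical; only the four field names come from the runbook.

```python
import json

# Correlation fields named in the "Verify traces/logs" step.
REQUIRED_FIELDS = ("request_id", "trace_id", "session_id", "job_id")


def missing_fields(log_line: str) -> list[str]:
    """Return the required correlation fields that are absent or empty in one JSON log line."""
    record = json.loads(log_line)
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def filter_by_trace(log_lines: list[str], trace_id: str) -> list[dict]:
    """Return the parsed records whose trace_id matches, mimicking a log filter."""
    records = (json.loads(line) for line in log_lines)
    return [record for record in records if record.get("trace_id") == trace_id]


if __name__ == "__main__":
    # Hypothetical sample lines; real lines would come from the app or Loki export.
    sample = [
        json.dumps({"request_id": "r1", "trace_id": "t1", "session_id": "s1", "job_id": "j1"}),
        json.dumps({"request_id": "r2", "trace_id": "t2", "session_id": "s2"}),
    ]
    for line in sample:
        print(missing_fields(line))
    print(len(filter_by_trace(sample, "t1")))
```

Treating an empty value as missing is a deliberate choice here: a field that is present but blank cannot be used to correlate a log line with its trace.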

Deliverables

  • reports/performance/distributed-k6-summary.json
  • reports/performance/k6-report.md
  • screenshot of Grafana dashboards
  • screenshot of firing + resolved alerts
  • one-page postmortem note
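
The p95 latency and error rate from the runbook can be checked directly against the summary JSON. The following is a minimal sketch, assuming the file uses the flat layout written by k6's --summary-export option (metrics.http_req_duration["p(95)"] and metrics.http_req_failed.value); if the repository produces the file another way, the keys will differ. The threshold values and the function name are hypothetical, since the drill does not state targets.

```python
import json
from pathlib import Path

# Hypothetical thresholds; the drill does not define target values.
P95_LIMIT_MS = 500.0
ERROR_RATE_LIMIT = 0.01


def evaluate_summary(summary: dict,
                     p95_limit_ms: float = P95_LIMIT_MS,
                     error_rate_limit: float = ERROR_RATE_LIMIT) -> dict:
    """Compare p95 latency and error rate from a k6 summary against limits.

    Assumes the flat --summary-export layout; adjust the key lookups otherwise.
    """
    metrics = summary["metrics"]
    p95_ms = metrics["http_req_duration"]["p(95)"]
    error_rate = metrics["http_req_failed"]["value"]
    return {
        "p95_ms": p95_ms,
        "error_rate": error_rate,
        "p95_ok": p95_ms <= p95_limit_ms,
        "error_rate_ok": error_rate <= error_rate_limit,
    }


if __name__ == "__main__":
    path = Path("reports/performance/distributed-k6-summary.json")
    result = evaluate_summary(json.loads(path.read_text()))
    print(result)
```

Queue lag and retry success rate are not covered by this sketch, because they are not part of k6's built-in HTTP metrics and would have to be read from wherever the stack exposes them.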