6060│ [■] Security Hardened — OWASP Top 10, Brakeman, ZAP tested │
6161│ [■] High Performance — p95: ~500ms · cached: ~50ms │
6262│ [■] Modular Monolith — Scalable modular architecture │
63+ │ [■] Observability — /health/live + /health/ready + Sidekiq mon. │
64+ │ [■] 401 Rate Spike Detection — Sliding-window middleware, alerts at >5% │
65+ │ [■] Job Heartbeat Tracking — Stale scheduled job detection via Redis │
6366└─────────────────────────────────────────────────────────────────────────────┘
6467```
6568
8083│ 07 · Testing │
8184│ 08 · Performance & Load Testing │
8285│ 09 · Security │
83- │ 10 · Deployment │
84- │ 11 · CI/CD │
85- │ 12 · Contributing │
86- │ 13 · License │
86+ │ 10 · Observability & Monitoring │
87+ │ 11 · Deployment │
88+ │ 12 · CI/CD │
89+ │ 13 · Contributing │
90+ │ 14 · License │
8791└──────────────────────────────────────────────────────┘
8892```
8993
@@ -823,6 +827,30 @@ curl -X POST http://localhost:3333/api/v1/auth/refresh \
823827- `GET /notifications/unread-count` — Get unread count
824828- `DELETE /notifications/:id` — Delete notification
825829
830+ #### Health & Observability
831+
832+ ```
833+ GET /health/live — Liveness probe: is Puma alive? Never checks deps.
834+ Always returns 200 while the process responds.
835+ Use for container restart policies (Coolify/K8s).
836+
837+ GET /health/ready — Readiness probe: checks PostgreSQL + Redis + Meilisearch.
838+ Returns 200 (ok/disabled) or 503 (any dep unreachable).
839+ Use for load balancer traffic routing.
840+
841+ GET /api/v1/monitoring/sidekiq — Admin only. Full Sidekiq snapshot:
842+ queue depths, worker count, dead queue, retry queue,
843+ scheduled job heartbeats (stale detection), alert flags.
844+ Returns 503 if Redis unavailable.
845+ ```
846+
847+ > **Monitoring endpoint response includes:**
848+ > - `scheduled_jobs` — last run timestamp + `stale: true/false` per cron job
849+ > - `alerts.stale_jobs` — true if any scheduled job exceeded its alert window
850+ > - `alerts.no_workers` — true if no Sidekiq workers running
851+ > - `alerts.dead_queue_exceeded` — true if dead queue > 10 jobs
852+ > - `alerts.queue_depth_exceeded` — true if total queue depth > 100 jobs
853+
826854#### Team Members (chat)
827855- `GET /team-members` — List organization members (staff only — rejects player tokens)
828856
@@ -965,7 +993,88 @@ open coverage/index.html
965993
966994---
967995
968- ## 10 · Deployment
996+ ## 10 · Observability & Monitoring
997+
998+ ### Health Probes
999+
1000+ | Endpoint | Purpose | Returns |
1001+ | --- | --- | --- |
1002+ | `GET /health/live` | Liveness: is Puma responding? Never checks dependencies. | Always 200 |
1003+ | `GET /health/ready` | Readiness: are PostgreSQL, Redis, and Meilisearch reachable? | 200 (ok/disabled) or 503 (any dependency unreachable) |
1004+ | `GET /up` | Legacy backward-compatible alias | 200 |
1005+
1006+ > **Rule**: never point the liveness probe at an endpoint that checks Redis or the database.
1007+ > Otherwise a Redis crash fails liveness, the container restarts, and the resulting reconnect storm makes the incident worse.
1008+
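The split between the two probes can be sketched as below. This is a minimal illustration, not the project's actual controller code: the method names and the `:ok` / `:disabled` / `:unreachable` symbols are assumptions chosen to mirror the "200 (ok/disabled) or 503" behavior described above.

```ruby
# Hypothetical helper: maps per-dependency check results to an HTTP status.
# Each value is assumed to be :ok, :disabled, or :unreachable.
def readiness_http_status(checks)
  healthy = checks.values.all? { |state| %i[ok disabled].include?(state) }
  healthy ? 200 : 503
end

# Liveness deliberately ignores dependencies: if this code runs at all,
# the process is alive, so it always answers 200.
def liveness_http_status
  200
end
```

With this shape, a Redis outage flips readiness to 503 (traffic is routed away) while liveness keeps returning 200 (the container is not restarted).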
1009+ ### Sidekiq Monitoring
1010+
1011+ ``` bash
1012+ # Requires admin Bearer token
1013+ curl -H "Authorization: Bearer $TOKEN" https://api.prostaff.gg/api/v1/monitoring/sidekiq
1014+ ```
1015+
1016+ Response shape:
1017+ ``` json
1018+ {
1019+   "status": "ok | degraded | critical",
1020+   "processes": { "count": 1, "workers": [...] },
1021+   "queues": { "default": 0, "high": 0 },
1022+   "stats": { "enqueued": 0, "dead": 0, "retry": 0 },
1023+   "scheduled_jobs": {
1024+     "RefreshMetadataViewsJob": { "last_run_at": "...", "stale": false },
1025+     "CleanupExpiredTokensJob": { "last_run_at": "...", "stale": false }
1026+   },
1027+   "alerts": {
1028+     "no_workers": false,
1029+     "queue_depth_exceeded": false,
1030+     "dead_queue_exceeded": false,
1031+     "stale_jobs": false
1032+   }
1033+ }
1034+ ```
1035+
1036+ **Status rules:**
1037+
1038+ | status | condition |
1039+ | --- | --- |
1040+ | `ok` | all thresholds within bounds |
1041+ | `degraded` | total queue depth > 100, dead queue > 10, or any scheduled job stale |
1042+ | `critical` | no Sidekiq workers running |
1043+
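The status rules above can be read as a small decision function. The sketch below is an assumption-laden illustration: the method names and keyword arguments are hypothetical, the defaults mirror the documented thresholds (100 and 10), and giving `critical` precedence over `degraded` is an assumption the table does not state explicitly.

```ruby
# Hypothetical sketch: build the alert flags from a Sidekiq snapshot.
# Threshold defaults mirror the documented values.
def sidekiq_alerts(worker_count:, queue_depth:, dead_count:, any_stale:,
                   queue_threshold: 100, dead_threshold: 10)
  {
    no_workers: worker_count.zero?,
    queue_depth_exceeded: queue_depth > queue_threshold,
    dead_queue_exceeded: dead_count > dead_threshold,
    stale_jobs: any_stale
  }
end

# Derive the overall status. "critical" winning over "degraded" is assumed.
def sidekiq_status(alerts)
  return "critical" if alerts[:no_workers]

  degraded = alerts.values_at(:queue_depth_exceeded, :dead_queue_exceeded, :stale_jobs).any?
  degraded ? "degraded" : "ok"
end
```

For example, one worker with 150 enqueued jobs would yield `degraded`, while zero workers would yield `critical` regardless of queue depth.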
1044+ ### 401 Rate Spike Detection
1045+
1046+ `Middleware::AuthFailureTracker` counts 401s vs total requests using Redis
1047+ sliding-window counters (5-minute window). Emits a structured log alert when
1048+ the ratio exceeds 5%:
1049+
1050+ ``` json
1051+ {
1052+   "event": "auth_spike_detected",
1053+   "level": "CRITICAL",
1054+   "rate_pct": 8.3,
1055+   "threshold_pct": 5.0,
1056+   "total_requests": 240,
1057+   "total_401s": 20
1058+ }
1059+ ```
1060+
1061+ Threshold and window are configurable via env:
1062+
1063+ ``` bash
1064+ AUTH_TRACKER_THRESHOLD=0.05 # default: 5%
1065+ AUTH_TRACKER_WINDOW=5 # default: 5 minutes
1066+ ```
1067+
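The threshold check itself is simple arithmetic: 20 failures out of 240 requests is 8.3%, which exceeds the 5% default. The sketch below shows only that comparison and the shape of the resulting log payload; the method name is hypothetical, and the Redis-backed sliding-window counting done by the real middleware is not reproduced here.

```ruby
# Hypothetical sketch of the ratio check only. The counts are assumed to
# come from the sliding-window counters; returns nil when no alert is due.
def auth_spike_alert(total_requests:, total_401s:, threshold: 0.05)
  return nil if total_requests.zero?

  rate = total_401s.to_f / total_requests
  return nil unless rate > threshold

  {
    event: "auth_spike_detected",
    level: "CRITICAL",
    rate_pct: (rate * 100).round(1),
    threshold_pct: (threshold * 100).round(1),
    total_requests: total_requests,
    total_401s: total_401s
  }
end
```

Guarding against zero total requests avoids a division by zero in quiet windows.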
1068+ ### Configurable Alert Thresholds
1069+
1070+ ``` bash
1071+ SIDEKIQ_QUEUE_ALERT_THRESHOLD=100 # queue depth that triggers degraded
1072+ SIDEKIQ_DEAD_ALERT_THRESHOLD=10 # dead queue size that triggers degraded
1073+ ```
1074+
1075+ ---
1076+
1077+ ## 11 · Deployment
9691078
9701079### Environment Variables
9711080
@@ -989,6 +1098,12 @@ FRONTEND_URL=https://your-frontend-domain.com
9891098# HashID Configuration (for URL obfuscation)
9901099HASHID_SALT=your-secret-salt
9911100HASHID_MIN_LENGTH=6
1101+
1102+ # Observability thresholds (optional, defaults shown)
1103+ SIDEKIQ_QUEUE_ALERT_THRESHOLD=100 # queue depth → degraded
1104+ SIDEKIQ_DEAD_ALERT_THRESHOLD=10 # dead queue → degraded
1105+ AUTH_TRACKER_THRESHOLD=0.05 # 401 rate spike threshold (5%)
1106+ AUTH_TRACKER_WINDOW=5 # sliding window in minutes
9921107```
9931108
9941109### Docker
@@ -1000,7 +1115,7 @@ docker run -p 3333:3000 prostaff-api
10001115
10011116---
10021117
1003- ## 11 · CI/CD
1118+ ## 12 · CI/CD
10041119
10051120### Architecture Diagram Auto-Update
10061121
@@ -1031,12 +1146,13 @@ Automated testing on every push:
10311146- **Security Scan**: Brakeman + dependency check
10321147- **Load Test**: Nightly smoke tests
10331148- **Nightly Audit**: Complete security scan
1149+ - **CORS Smoke Test**: Runs after every production deploy; sends a preflight request from each allowed origin and fails the pipeline if CORS is misconfigured
10341150
10351151See ` .github/workflows/ ` for details.
10361152
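As an illustration of what the CORS smoke test might assert, the sketch below checks the headers of a preflight (`OPTIONS`) response for one origin. It is not the workflow's actual script: the method name is hypothetical, and it assumes the response headers arrive as a hash with lowercased names.

```ruby
# Hypothetical check on a preflight response. A response passes when it
# echoes the requesting origin (or "*") and lists at least one allowed method.
def cors_preflight_ok?(response_headers, origin)
  allowed_origin  = response_headers["access-control-allow-origin"]
  allowed_methods = response_headers["access-control-allow-methods"].to_s

  (allowed_origin == origin || allowed_origin == "*") && !allowed_methods.empty?
end
```

A pipeline step could run this once per allowed origin and exit non-zero on the first failure.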
10371153---
10381154
1039- ## 12 · Contributing
1155+ ## 13 · Contributing
10401156
104111571. Create a feature branch
104211582. Make your changes
@@ -1049,7 +1165,7 @@ See `.github/workflows/` for details.
10491165
10501166---
10511167
1052- ## 13 · License
1168+ ## 14 · License
10531169
10541170```
10551171╔══════════════════════════════════════════════════════════════════════════════╗