Commit f805f5e

docs: update readme and add features
1 parent 99a3504 commit f805f5e

1 file changed

Lines changed: 124 additions & 8 deletions

README.md

@@ -60,6 +60,9 @@
 │ [■] Security Hardened — OWASP Top 10, Brakeman, ZAP tested │
 │ [■] High Performance — p95: ~500ms · cached: ~50ms │
 │ [■] Modular Monolith — Scalable modular architecture │
+│ [■] Observability — /health/live + /health/ready + Sidekiq mon. │
+│ [■] 401 Rate Spike Detection — Sliding-window middleware, alerts at >5% │
+│ [■] Job Heartbeat Tracking — Stale scheduled job detection via Redis │
 └─────────────────────────────────────────────────────────────────────────────┘
 ```

@@ -80,10 +83,11 @@
 │ 07 · Testing │
 │ 08 · Performance & Load Testing │
 │ 09 · Security │
-│ 10 · Deployment │
-│ 11 · CI/CD │
-│ 12 · Contributing │
-│ 13 · License │
+│ 10 · Observability & Monitoring │
+│ 11 · Deployment │
+│ 12 · CI/CD │
+│ 13 · Contributing │
+│ 14 · License │
 └──────────────────────────────────────────────────────┘
 ```

@@ -823,6 +827,30 @@ curl -X POST http://localhost:3333/api/v1/auth/refresh \
 - `GET /notifications/unread-count` — Get unread count
 - `DELETE /notifications/:id` — Delete notification

+#### Health & Observability
+
+```
+GET /health/live — Liveness probe: is Puma alive? Never checks deps.
+Always returns 200 while the process responds.
+Use for container restart policies (Coolify/K8s).
+
+GET /health/ready — Readiness probe: checks PostgreSQL + Redis + Meilisearch.
+Returns 200 (ok/disabled) or 503 (any dep unreachable).
+Use for load balancer traffic routing.
+
+GET /api/v1/monitoring/sidekiq — Admin only. Full Sidekiq snapshot:
+queue depths, worker count, dead queue, retry queue,
+scheduled job heartbeats (stale detection), alert flags.
+Returns 503 if Redis unavailable.
+```
+
+> **Monitoring endpoint response includes:**
+> - `scheduled_jobs` — last run timestamp + `stale: true/false` per cron job
+> - `alerts.stale_jobs` — true if any scheduled job exceeded its alert window
+> - `alerts.no_workers` — true if no Sidekiq workers running
+> - `alerts.dead_queue_exceeded` — true if dead queue > 10 jobs
+> - `alerts.queue_depth_exceeded` — true if total queue depth > 100 jobs
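
The `stale` flag can be read as a comparison between a job's last heartbeat and its alert window. The sketch below illustrates that idea in plain Ruby and is not the project's implementation: the `stale?` and `scheduled_jobs_snapshot` helpers, the plain Hash standing in for the Redis heartbeat store, and the per-job window values are all assumptions made for the example.

```ruby
require "time"

# Hypothetical alert windows in seconds. The real per-job values are not
# stated in the README.
ALERT_WINDOWS = {
  "RefreshMetadataViewsJob" => 2 * 60 * 60,
  "CleanupExpiredTokensJob" => 25 * 60 * 60
}.freeze

# A job is stale when it has never reported, or when its last heartbeat is
# older than its alert window.
def stale?(last_run_at, window_seconds, now: Time.now)
  return true if last_run_at.nil?

  (now - last_run_at) > window_seconds
end

# Builds a structure shaped like the `scheduled_jobs` and `alerts.stale_jobs`
# fields. `heartbeats` is a plain Hash standing in for values read from Redis.
def scheduled_jobs_snapshot(heartbeats, now: Time.now)
  jobs = ALERT_WINDOWS.to_h do |name, window|
    last_run = heartbeats[name]
    entry = {
      "last_run_at" => last_run&.utc&.iso8601,
      "stale" => stale?(last_run, window, now: now)
    }
    [name, entry]
  end

  {
    "scheduled_jobs" => jobs,
    "alerts" => { "stale_jobs" => jobs.values.any? { |job| job["stale"] } }
  }
end
```

Treating a missing heartbeat as stale is a deliberate choice in this sketch: a job that has never run is reported instead of silently passing.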
+
 #### Team Members (chat)
 - `GET /team-members` — List organization members (staff only — rejects player tokens)

@@ -965,7 +993,88 @@ open coverage/index.html

 ---

-## 10 · Deployment
+## 10 · Observability & Monitoring
+
+### Health Probes
+
+| Endpoint | Purpose | Returns |
+|---|---|---|
+| `GET /health/live` | Liveness — is Puma responding? | Always 200 |
+| `GET /health/ready` | Readiness — all deps reachable? | 200 / 503 |
+| `GET /up` | Legacy backward-compatible alias | 200 |
+
+> **Rule**: never point the liveness probe at an endpoint that checks Redis or DB.
+> A Redis crash → liveness fail → container restart → reconnect storm → worse incident.
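
The split between the two probes can be sketched in a few lines of plain Ruby. This is an illustration of the rule above, not the application's controller code: `liveness`, `readiness`, and the lambda-based dependency checks are hypothetical names, and the `:ok` / `:disabled` result symbols are assumed from the "200 (ok/disabled)" wording.

```ruby
# Liveness: no dependency checks, so a Redis or database outage can never
# cause the container to be restarted.
def liveness
  [200, { "status" => "ok" }]
end

# Readiness: each check is a callable that returns :ok or :disabled when the
# dependency is healthy. Any other result, or a raised error, marks that
# dependency as unreachable and turns the response into a 503.
def readiness(checks)
  results = checks.transform_values do |check|
    begin
      check.call
    rescue StandardError
      :unreachable
    end
  end

  healthy = results.values.all? { |result| %i[ok disabled].include?(result) }
  [healthy ? 200 : 503, results]
end

checks = {
  postgresql: -> { :ok },
  redis: -> { raise IOError, "connection refused" },
  meilisearch: -> { :disabled }
}

code, details = readiness(checks)
puts code            # 503, because the Redis check raised
puts details[:redis] # unreachable
puts liveness.first  # 200, unaffected by the Redis failure
```

Because `liveness` never touches `checks`, the failure that makes the readiness probe return 503 cannot also trigger a restart.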
+
+### Sidekiq Monitoring
+
+```bash
+# Requires admin Bearer token
+curl -H "Authorization: Bearer $TOKEN" https://api.prostaff.gg/api/v1/monitoring/sidekiq
+```
+
+Response shape:
+```json
+{
+  "status": "ok | degraded | critical",
+  "processes": { "count": 1, "workers": [...] },
+  "queues": { "default": 0, "high": 0 },
+  "stats": { "enqueued": 0, "dead": 0, "retry": 0 },
+  "scheduled_jobs": {
+    "RefreshMetadataViewsJob": { "last_run_at": "...", "stale": false },
+    "CleanupExpiredTokensJob": { "last_run_at": "...", "stale": false }
+  },
+  "alerts": {
+    "no_workers": false,
+    "queue_depth_exceeded": false,
+    "dead_queue_exceeded": false,
+    "stale_jobs": false
+  }
+}
+```
+
+**Status rules:**
+
+| status | condition |
+|------------------------|----------------------------------------------------|
+| `ok` | all thresholds within bounds |
+| `degraded` | queue > 100, dead > 10, or any scheduled job stale |
+| `critical` | no Sidekiq workers running |
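
The status rules can be expressed as a small pure function. The sketch below is an assumption-laden illustration: `sidekiq_status` and its keyword arguments are hypothetical, and giving `critical` precedence over `degraded` when both conditions hold is an assumption the table does not state explicitly.

```ruby
# Derives the overall status and the alert flags from raw counts.
# Default thresholds match the documented values (queue > 100, dead > 10).
def sidekiq_status(workers:, queue_depth:, dead:, stale_jobs:,
                   queue_threshold: 100, dead_threshold: 10)
  alerts = {
    "no_workers" => workers.zero?,
    "queue_depth_exceeded" => queue_depth > queue_threshold,
    "dead_queue_exceeded" => dead > dead_threshold,
    "stale_jobs" => stale_jobs
  }

  status =
    if alerts["no_workers"]
      "critical" # assumed to win over any degraded condition
    elsif alerts.values.any?
      "degraded"
    else
      "ok"
    end

  { "status" => status, "alerts" => alerts }
end
```

Both comparisons are strict, so a queue depth of exactly 100 or a dead queue of exactly 10 still reports `ok`.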
+
+### 401 Rate Spike Detection
+
+`Middleware::AuthFailureTracker` counts 401s vs total requests using Redis
+sliding-window counters (5-minute window). Emits a structured log alert when
+the ratio exceeds 5%:
+
+```json
+{
+  "event": "auth_spike_detected",
+  "level": "CRITICAL",
+  "rate_pct": 8.3,
+  "threshold_pct": 5.0,
+  "total_requests": 240,
+  "total_401s": 20
+}
+```
+
+Threshold and window are configurable via env:
+
+```bash
+AUTH_TRACKER_THRESHOLD=0.05 # default: 5%
+AUTH_TRACKER_WINDOW=5 # default: 5 minutes
+```
+
+### Configurable Alert Thresholds
+
+```bash
+SIDEKIQ_QUEUE_ALERT_THRESHOLD=100 # queue depth that triggers degraded
+SIDEKIQ_DEAD_ALERT_THRESHOLD=10 # dead queue size that triggers degraded
+```
+
+---
+
+## 11 · Deployment

 ### Environment Variables

@@ -989,6 +1098,12 @@ FRONTEND_URL=https://your-frontend-domain.com
 # HashID Configuration (for URL obfuscation)
 HASHID_SALT=your-secret-salt
 HASHID_MIN_LENGTH=6
+
+# Observability thresholds (optional, defaults shown)
+SIDEKIQ_QUEUE_ALERT_THRESHOLD=100 # queue depth → degraded
+SIDEKIQ_DEAD_ALERT_THRESHOLD=10 # dead queue → degraded
+AUTH_TRACKER_THRESHOLD=0.05 # 401 rate spike threshold (5%)
+AUTH_TRACKER_WINDOW=5 # sliding window in minutes
 ```

 ### Docker
@@ -1000,7 +1115,7 @@ docker run -p 3333:3000 prostaff-api

 ---

-## 11 · CI/CD
+## 12 · CI/CD

 ### Architecture Diagram Auto-Update

@@ -1031,12 +1146,13 @@ Automated testing on every push:
 - **Security Scan**: Brakeman + dependency check
 - **Load Test**: Nightly smoke tests
 - **Nightly Audit**: Complete security scan
+- **CORS Smoke Test**: Runs after every production deploy — sends a preflight request from each allowed origin and fails the pipeline if CORS is misconfigured
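
The pass/fail decision of a CORS smoke test like this can be sketched as a check on the preflight response headers. The code below is an assumption, not the workflow's actual script: `preflight_allows?` and `misconfigured_origins` are hypothetical helpers, the example origins are placeholders, and accepting a wildcard `*` as a match is a simplification that a credentialed API would likely tighten.

```ruby
# Returns true when a preflight response permits `origin`.
# `headers` is a Hash of response headers; lookups are case-insensitive.
def preflight_allows?(headers, origin)
  normalized = headers.transform_keys { |key| key.to_s.downcase }
  allowed = normalized["access-control-allow-origin"]
  allowed == origin || allowed == "*"
end

# Yields each origin to a block that performs the preflight and returns the
# response headers, then collects the origins that were not allowed.
def misconfigured_origins(origins)
  origins.reject { |origin| preflight_allows?(yield(origin), origin) }
end

# Canned responses stand in for real OPTIONS requests in this example.
responses = {
  "https://app.example.com" => { "Access-Control-Allow-Origin" => "https://app.example.com" },
  "https://admin.example.com" => {}
}

failed = misconfigured_origins(responses.keys) { |origin| responses[origin] }
puts failed.inspect # ["https://admin.example.com"]
exit_code = failed.empty? ? 0 : 1 # a pipeline step would exit with this code
```

In a real pipeline the block would send an `OPTIONS` request carrying `Origin` and `Access-Control-Request-Method` headers and return the response headers.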

 See `.github/workflows/` for details.

 ---

-## 12 · Contributing
+## 13 · Contributing

 1. Create a feature branch
 2. Make your changes
@@ -1049,7 +1165,7 @@ See `.github/workflows/` for details.

 ---

-## 13 · License
+## 14 · License

 ```
 ╔══════════════════════════════════════════════════════════════════════════════╗
