Reliability expectations and practices for this project.
- `GET /health` verifies B2 connectivity and returns `healthy` or `degraded`
- Health endpoint is always available, even when B2 is down
- HTTP handlers return structured error responses with appropriate status codes
- External service failures (B2) are caught and surfaced as 500/503 responses
- No unhandled exceptions leak stack traces to clients
- Structured JSON logging via Python stdlib
- Every request gets a `request_id` for tracing
- Log levels: ERROR for failures, WARNING for degraded state, INFO for requests
- Request timing middleware logs duration for every request
- `/metrics` endpoint exposes basic Prometheus-format counters
- Upload success/failure counts tracked
- File listing returns empty list (not error) when B2 has no objects
- Metadata extraction failures don't block upload (return partial metadata)
- Frontend shows skeleton states while loading, error states on failure
- Railway health checks on `/health`
- Zero-downtime deploys via rolling updates
- Environment-specific configuration via env vars (no config files in prod)
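
As a rough illustration of two of the practices above (structured JSON logs carrying a `request_id`, and a health check that reports `degraded` instead of failing when B2 is unreachable), here is a minimal stdlib-only sketch. It is not the project's actual code: `JsonFormatter`, `new_request_id`, `health_status`, and the `check_b2` callable are hypothetical names chosen for the example, and the real B2 probe and web framework wiring are left out.

```python
import json
import logging
import uuid
from typing import Callable


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (illustrative fields only)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id is attached through `extra=`; None if the caller omitted it
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


def new_request_id() -> str:
    """Hypothetical helper: generate one opaque ID per incoming request."""
    return uuid.uuid4().hex


def health_status(
    check_b2: Callable[[], None],
    logger: logging.Logger,
    request_id: str,
) -> dict:
    """Return a health payload without ever raising.

    `check_b2` stands in for whatever call probes B2; it is assumed
    to raise an exception when B2 is unreachable.
    """
    try:
        check_b2()
    except Exception as exc:  # any B2 failure means "degraded", not a crash
        logger.warning("B2 check failed: %s", exc, extra={"request_id": request_id})
        return {"status": "degraded"}
    logger.info("health check ok", extra={"request_id": request_id})
    return {"status": "healthy"}
```

A handler would attach `JsonFormatter()` to a `logging.StreamHandler`, call `new_request_id()` once per request, and pass that ID to every log call via `extra=`. Catching the broad `Exception` in `health_status` is deliberate here, since the stated goal is that the health endpoint stays available even when B2 is down.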