How SimSteward logging scales in Grafana Loki: many drivers per session, many plugin instances, and collecting logs from many users without a heavy stack on each PC. For label rules and volume, see docs/GRAFANA-LOGGING.md. For what goes to Loki versus OTel/metrics, and rough sizing for ~1k users, see docs/DATA-ROUTING-OBSERVABILITY.md.
| Dimension | Meaning | How it’s supported |
|---|---|---|
| Many drivers per session | 100–200+ cars in session-end results (not the 64-car live telemetry cap). | Chunked session_end_datapoints_results (35 drivers per line). See docs/GRAFANA-LOGGING.md (chunked session results). |
| Many SimSteward users | 100–200+ instances → one central Loki. | Each instance ships to one central Loki: a file-tail agent today, an in-process plugin push as the intended design (see Part B below). |
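The chunking in the first row can be sketched as below. This is a minimal illustration, not the plugin's implementation: the `results` and `chunk_count` field names and the `chunk_session_results` helper are assumptions introduced here. Only the event name, `session_id`, `chunk_index`, and the 35-driver chunk size come from this document.

```python
import json
from typing import Iterator

CHUNK_SIZE = 35  # drivers per log line, as stated in the table above


def chunk_session_results(session_id: str, drivers: list[dict]) -> Iterator[str]:
    """Yield one JSON log line per chunk of at most CHUNK_SIZE drivers."""
    chunk_count = (len(drivers) + CHUNK_SIZE - 1) // CHUNK_SIZE  # ceiling division
    for chunk_index in range(chunk_count):
        start = chunk_index * CHUNK_SIZE
        yield json.dumps({
            "event": "session_end_datapoints_results",
            "session_id": session_id,
            "chunk_index": chunk_index,
            "chunk_count": chunk_count,  # hypothetical field, lets a reader detect gaps
            "results": drivers[start:start + CHUNK_SIZE],  # hypothetical field name
        })


# Example: 200 drivers produce 6 lines (5 full chunks of 35 plus one of 25).
example_lines = list(chunk_session_results("s1", [{"car_idx": i} for i in range(200)]))
```

Every line carries the same label set, so a larger field adds log lines but no new streams.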
- High-cardinality values (`session_id`, `incident_id`, `car_idx`) belong in the JSON body or structured metadata, not labels.
- Current design: four labels only (`app`, `env`, `component`, `level`); see docs/GRAFANA-LOGGING.md.
- Grafana Cloud free tier: 5,000 streams, 5 MB/s, 14-day retention, ~50 GB/month.
- ~120 users × <32 streams ≈ under 5,000. Session-end with 200 drivers ≈ 6 chunk lines; no new streams.
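The stream estimate above is simple multiplication. A small sketch makes the headroom explicit. The constants restate figures from this list, and the function name is illustrative only.

```python
STREAM_LIMIT = 5_000       # Grafana Cloud free-tier stream limit cited above
USERS = 120                # approximate user count from the estimate above
MAX_STREAMS_PER_USER = 32  # upper bound; the estimate says fewer than 32


def worst_case_streams(users: int, streams_per_user: int) -> int:
    """Upper bound on active streams if every user hits the per-user maximum."""
    return users * streams_per_user


total = worst_case_streams(USERS, MAX_STREAMS_PER_USER)  # 3840
headroom = STREAM_LIMIT - total                          # 1160
```

The bound only holds while labels stay low-cardinality. Promoting a field such as `session_id` to a label would multiply the stream count by the number of distinct sessions.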
Per-driver per-tick telemetry is time-series data; use metrics (OTel), not Loki. Loki = events and throttled snapshots — docs/GRAFANA-LOGGING.md § Phase 2.
- Time range → labels → `| json` → filter body fields.
- Optional bounded `instance_id` label for multi-tenant (<500 values).
- Chunked results: `{app="sim-steward", component="simhub-plugin"} | json | event = "session_end_datapoints_results" | session_id = "<id>"`; merge by `chunk_index`.
- Trace-style: `| json | session_id = "..."` or `incident_id = "..."`.
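Once the chunked-results query has returned its lines, the merge by `chunk_index` happens on the client side. The sketch below assumes each line is a JSON object whose per-driver rows live under a `results` key. That key and the `merge_chunks` helper are hypothetical; the event name, `session_id`, and `chunk_index` come from the query above.

```python
import json


def merge_chunks(log_lines: list[str], session_id: str) -> list[dict]:
    """Reassemble per-driver rows from chunked session-end lines, in chunk order."""
    chunks = []
    for line in log_lines:
        record = json.loads(line)
        # Keep only the chunk lines that belong to the requested session.
        if (record.get("event") == "session_end_datapoints_results"
                and record.get("session_id") == session_id):
            chunks.append(record)

    # Loki may return lines in any order relative to chunk_index, so sort first.
    chunks.sort(key=lambda r: r["chunk_index"])

    merged: list[dict] = []
    for chunk in chunks:
        merged.extend(chunk["results"])  # hypothetical field name
    return merged
```

Sorting before concatenation keeps the result independent of the query direction (forward or backward) used to fetch the lines.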
Local Docker + Loki per developer does not scale to ~120 users each running the full stack.
Plugin → plugin-structured.jsonl (durability) + WebSocket to dashboard. No in-process Loki POST in SimSteward.Plugin today. Optional: deploy.ps1 → send-deploy-loki-marker.ps1 POSTs one deploy_marker line when SIMSTEWARD_LOKI_URL is set.
Always write plugin-structured.jsonl. Intended: batch HTTPS POST of NDJSON to SIMSTEWARD_LOKI_URL from inside the plugin (or an approved sidecar) — one Loki HTTP endpoint (central or Grafana Cloud).
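As a rough sketch of the intended batch POST, the snippet below wraps a batch of NDJSON lines in the JSON body that Loki's push endpoint (`/loki/api/v1/push`) accepts. Several things here are assumptions rather than project facts: that `SIMSTEWARD_LOKI_URL` holds a base URL without the push path, the `env` and `level` label values, and the `build_push_request` helper itself. The real plugin is not written in Python, so this only illustrates the payload shape.

```python
import json
import time
import urllib.request

# Label values: app and component appear in this document's LogQL example;
# the env and level values are placeholders chosen for illustration.
LABELS = {"app": "sim-steward", "env": "production",
          "component": "simhub-plugin", "level": "info"}


def build_push_request(base_url: str, lines: list[str],
                       labels: dict[str, str]) -> urllib.request.Request:
    """Wrap a batch of NDJSON lines in one Loki push request (not yet sent)."""
    now_ns = time.time_ns()
    # Loki expects [timestamp-in-nanoseconds-as-string, line] pairs.
    # Send time is used here; a real sender would prefer each event's own timestamp.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    body = json.dumps({"streams": [{"stream": labels, "values": values}]})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/loki/api/v1/push",  # assumes a base URL
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then `urllib.request.urlopen(request, timeout=10)`. A hosted endpoint will typically also need an authentication header. Because `plugin-structured.jsonl` is always written, a failed POST can be retried or backfilled from the file.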
- Today: Run a file tail → Loki agent for `plugin-structured.jsonl`, or wait for in-process batch POST.
- Many users: Same pattern: many instances → one central Loki; scale ingestion/retention to `users × volume per session`.
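The core of what a file-tail agent does with `plugin-structured.jsonl` can be illustrated in a few lines: remember a byte offset, read what was appended, and forward only complete lines. This is a conceptual sketch with a hypothetical `read_new_lines` helper. In practice an existing log-shipping agent would handle this, along with rotation, retries, and persisting the offset.

```python
def read_new_lines(path: str, offset: int) -> tuple[list[str], int]:
    """Return complete lines appended since `offset`, plus the new offset."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only up to the last newline so a half-written line
    # is left in place and picked up on the next poll.
    end = data.rfind(b"\n")
    if end == -1:
        return [], offset

    complete = data[:end + 1]
    lines = [raw.decode("utf-8") for raw in complete.splitlines() if raw.strip()]
    return lines, offset + len(complete)
```

A polling loop would call this every few seconds, push the returned lines as one batch, and store the new offset only after the push succeeds.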
One tenant or self-hosted Loki; many senders. Scale ingestion/retention to users × volume per session.
Same as Part A: chunked session_end_datapoints_results lines; docs/GRAFANA-LOGGING.md for LogQL merge patterns.
- docs/DATA-ROUTING-OBSERVABILITY.md — Routing decisions (events → Loki; high-rate telemetry → OTel → Prometheus/Mimir), sizing, car telemetry taxonomy.
- docs/GRAFANA-LOGGING.md — Schema, volume budget, LogQL, housekeeping.
- docs/observability-local.md — Local stack quick start.
- Grafana: Label best practices, Query best practices.
| Spec | Doc ID |
|---|---|
| Sim Steward — Data Routing (OTel / Loki / Prometheus) | cbae1c33-c778-4e9a-9a8d-6b3e3c8c368b |
| Grafana Loki (summary) | 58a20aaf-bdde-4318-88f7-1ec8ec44377b |
| Observability — Local Stack | 25ed8579-c142-4040-b9a2-87b14523475f |
| Troubleshooting | 88274879-cd2d-4d86-9766-c86b88f95cfe |