Merged
9 changes: 5 additions & 4 deletions .github/workflows/secrets-scan.yml
@@ -1,4 +1,4 @@
# Reproduce locally: npm run secrets:lint ; npm run secrets:gitleaks (Docker required for Gitleaks script).
# Reproduce locally: pnpm run secrets:lint ; pnpm run secrets:gitleaks (Docker required for Gitleaks script).
name: Secrets scan

on:
@@ -11,12 +11,13 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: pnpm/action-setup@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
cache: npm
- run: npm ci
- run: npm run secrets:lint
cache: pnpm
- run: pnpm install --frozen-lockfile
- run: pnpm run secrets:lint

gitleaks:
runs-on: ubuntu-latest
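The two `run` steps in the updated lint job can be mirrored locally. The sketch below is illustrative only: `run_step`, `secrets_lint_local`, and the `DRY_RUN` variable are hypothetical names, not part of the repository, and it assumes `pnpm` is already on `PATH`.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run_step() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Mirrors the lint job's run steps from the workflow hunk above.
secrets_lint_local() {
  run_step pnpm install --frozen-lockfile &&
  run_step pnpm run secrets:lint
}
```

Setting `DRY_RUN=1` before calling `secrets_lint_local` prints the two commands without executing them, which is a cheap way to confirm the local sequence matches the workflow.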
2 changes: 1 addition & 1 deletion deploy.ps1
@@ -28,7 +28,7 @@ $deployMarkerLocal = $false
if ([string]::IsNullOrWhiteSpace($env:SIMSTEWARD_LOKI_URL)) {
$env:SIMSTEWARD_LOKI_URL = $localLoki
$deployMarkerLocal = $true
Write-Host "Loki deploy log: SIMSTEWARD_LOKI_URL was unset - using $localLoki (start stack: npm run obs:up)."
Write-Host "Loki deploy log: SIMSTEWARD_LOKI_URL was unset - using $localLoki (start stack: pnpm run obs:up)."
} elseif ($env:SIMSTEWARD_LOKI_URL -match 'grafana\.net') {
$hasCloudBasic = -not [string]::IsNullOrWhiteSpace($env:SIMSTEWARD_LOKI_USER) -and -not [string]::IsNullOrWhiteSpace($env:SIMSTEWARD_LOKI_TOKEN)
if (-not $hasCloudBasic) {
2 changes: 1 addition & 1 deletion docs/DATA-API-DEPLOY.md
@@ -23,7 +23,7 @@ Same contract both sides: `POST /session-complete` with `SessionSummary` JSON. D

### Deploy steps

1. **Prereqs:** Node.js, [Wrangler](https://developers.cloudflare.com/workers/wrangler/install/) (`npm i -g wrangler` or `npx wrangler`), Cloudflare account.
1. **Prereqs:** Node.js, [Wrangler](https://developers.cloudflare.com/workers/wrangler/install/) (`pnpm add -g wrangler` or `pnpx wrangler`), Cloudflare account.

2. **Create the D1 database**
```bash
18 changes: 9 additions & 9 deletions docs/GRAFANA-LOGGING.md
@@ -331,22 +331,22 @@ With `.env` containing `SIMSTEWARD_LOKI_URL`, `SIMSTEWARD_LOKI_USER`, and `SIMST

| Command | Purpose |
|---------|---------|
| `npm run loki:query` | One-off `GET .../loki/api/v1/query_range` via `scripts/query-loki-once.mjs`. Flags: `--query` (LogQL), `--limit`, `--lookback` (seconds). |
| `npm run env:run -- <command>` | Load `.env` into the child process (e.g. `npm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1`). |
| `npm run obs:poll` | Tail-style poll, direct Loki (default). |
| `npm run obs:poll:grafana` | Same, but `-ViaGrafana` (Bearer → Grafana proxy → Loki). |
| `npm run obs:poll:grafana:env` | Same as `obs:poll:grafana` but injects `.env` with `dotenv-cli` first (secrets only in the child process). |
| `pnpm run loki:query` | One-off `GET .../loki/api/v1/query_range` via `scripts/query-loki-once.mjs`. Flags: `--query` (LogQL), `--limit`, `--lookback` (seconds). |
| `pnpm run env:run -- <command>` | Load `.env` into the child process (e.g. `pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1`). |
| `pnpm run obs:poll` | Tail-style poll, direct Loki (default). |
| `pnpm run obs:poll:grafana` | Same, but `-ViaGrafana` (Bearer → Grafana proxy → Loki). |
| `pnpm run obs:poll:grafana:env` | Same as `obs:poll:grafana` but injects `.env` with `dotenv-cli` first (secrets only in the child process). |

**Path A (direct Loki):** `SIMSTEWARD_LOKI_*` + `npm run loki:query` or `npm run obs:poll`.
**Path A (direct Loki):** `SIMSTEWARD_LOKI_*` + `pnpm run loki:query` or `pnpm run obs:poll`.

**Path B (Grafana Cloud, elevated `glsa_*` Bearer):** Set **`GRAFANA_URL`** to your stack (`https://<slug>.grafana.net` — **not** `logs-prod-*.grafana.net`). Set **`GRAFANA_LOKI_DATASOURCE_UID`** to the Loki datasource UID in that stack (Connections → Data sources). Set **`GRAFANA_API_TOKEN`** *or* **`CURSOR_ELEVATED_GRAFANA_TOKEN`** (service account token with permission to query the Loki datasource via the proxy). Then:

```powershell
npm run obs:poll:grafana:env
# or: npm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1 -ViaGrafana
pnpm run obs:poll:grafana:env
# or: pnpm run env:run -- pwsh -NoProfile -File scripts/poll-loki.ps1 -ViaGrafana
```

`poll-loki.ps1` reads `.env` from disk; `*:env` npm scripts add `dotenv -e .env` so variables are also loaded for the child process without exporting them in the shell.
`poll-loki.ps1` reads `.env` from disk; `*:env` pnpm scripts add `dotenv -e .env` so variables are also loaded for the child process without exporting them in the shell.

**401/403:** On Path A, the `glc_*` policy may lack Loki **read**. On Path B, ensure the Bearer token can query datasources; check datasource UID and stack URL.

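The `loki:query` row in the table above describes a one-off `GET .../loki/api/v1/query_range` with a LogQL query, a limit, and a lookback in seconds. A rough shell equivalent might look like the sketch below. It is not the repo's `scripts/query-loki-once.mjs` (which is not shown in this diff). The function names and the default lookback (300 s) and limit (100) are assumptions, and it presumes an unauthenticated Loki such as the local stack on port 3100.

```shell
# Hypothetical helper: print "<start_ns> <end_ns>" for a lookback window.
# Loki's query_range accepts start/end as Unix epoch nanoseconds.
loki_window_ns() {
  now_s=$1
  lookback_s=$2
  echo "$(( (now_s - lookback_s) * 1000000000 )) $(( now_s * 1000000000 ))"
}

# Hypothetical helper: one query_range request against SIMSTEWARD_LOKI_URL.
# Defaults (300 s lookback, limit 100) are illustrative assumptions.
loki_query_once() {
  query=$1
  lookback_s=${2:-300}
  limit=${3:-100}
  # Reuse the positional parameters for the computed window.
  set -- $(loki_window_ns "$(date +%s)" "$lookback_s")
  curl -sS -G "${SIMSTEWARD_LOKI_URL%/}/loki/api/v1/query_range" \
    --data-urlencode "query=${query}" \
    --data-urlencode "start=$1" \
    --data-urlencode "end=$2" \
    --data-urlencode "limit=${limit}"
}
```

With `SIMSTEWARD_LOKI_URL=http://localhost:3100`, a call such as `loki_query_once '{app="sim-steward"}' 600 50` would request the last ten minutes. Letting `curl` URL-encode the LogQL avoids hand-escaping braces and quotes.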
4 changes: 2 additions & 2 deletions docs/README.md
@@ -27,7 +27,7 @@ Editing files outside the **SimHub rule doc allowlist** does not attach the full

- **Workspace:** Open this repo as a **single-folder** Cursor workspace rooted at `simhub-plugin` so search and tooling are not mixed with unrelated paths (other clones, AppData, etc.).
- **ContextStream project:** Keep the ContextStream **project path** aligned with that same folder so `ingest_local` / MCP index the intended tree.
- **Corpus hygiene:** [`.cursorignore`](../.cursorignore) trims noise for Cursor; after changing ignore rules or large doc/code moves, run a **forced** ContextStream ingest (`npm run contextstream:ingest:force` — see [.cursor/skills/contextstream/SKILL.md](../.cursor/skills/contextstream/SKILL.md)).
- **Corpus hygiene:** [`.cursorignore`](../.cursorignore) trims noise for Cursor; after changing ignore rules or large doc/code moves, run a **forced** ContextStream ingest (`pnpm run contextstream:ingest:force` — see [.cursor/skills/contextstream/SKILL.md](../.cursor/skills/contextstream/SKILL.md)).
- **Structural graph:** ContextStream **code graph** may not expose C# module edges; use keyword/semantic `search` plus the **Code map** in [ARCHITECTURE.md](ARCHITECTURE.md) for navigation.

---
@@ -38,7 +38,7 @@ Editing files outside the **SimHub rule doc allowlist** does not attach the full
|-----|----------|
| [USER-FEATURES-PM.md](USER-FEATURES-PM.md) | PM-style user features (12 flows), connections, vision vs shipped vs [PRODUCT-FLOW.md](PRODUCT-FLOW.md) |
| [USER-FLOWS.md](USER-FLOWS.md) | Step-by-step user journeys through today's UI (mermaid diagrams); PM issues and flow gaps |
| [observability-local.md](observability-local.md) | Local Grafana/Loki stack, npm scripts, loki-gateway |
| [observability-local.md](observability-local.md) | Local Grafana/Loki stack, pnpm scripts, loki-gateway |
| [observability-scaling.md](observability-scaling.md) | Many users, large grids, Loki cardinality |
| [DATA-ROUTING-OBSERVABILITY.md](DATA-ROUTING-OBSERVABILITY.md) | OTel vs Loki vs Prometheus, ~1k-user sizing, car telemetry taxonomy |
| [observability-testing.md](observability-testing.md) | Harness, AssertLokiQueries, Explore validation |
4 changes: 2 additions & 2 deletions docs/TROUBLESHOOTING.md
@@ -142,7 +142,7 @@ If you expect SimSteward logs in Grafana (Cloud or local) but see none:

1. **Plugin output** — The plugin writes **plugin-structured.jsonl** only (plus WebSocket to the dashboard). It does **not** batch-POST those lines to Loki in-process yet. **`deploy.ps1`** can POST a **`deploy_marker`** when **`SIMSTEWARD_LOKI_URL`** is set (see **`send-deploy-loki-marker.ps1`**). For full logs in Loki, use an external shipper to tail **plugin-structured.jsonl**.
2. **Env metadata** — Set `SIMSTEWARD_LOKI_URL` and `SIMSTEWARD_LOG_ENV` before SimHub starts (e.g. `.env` loaded by **`deploy.ps1`** / **`run-simhub-local-observability.ps1`**) so JSON includes `loki_push_target` / `log_env`.
3. **Local stack** — Start observability from `observability/local/` (`npm run obs:up`) so Loki (3100) and Grafana (3000) run; compose does **not** ingest **plugin-structured.jsonl** automatically.
3. **Local stack** — Start observability from `observability/local/` (`pnpm run obs:up`) so Loki (3100) and Grafana (3000) run; compose does **not** ingest **plugin-structured.jsonl** automatically.
4. **Auth (Grafana Cloud / gateway)** — For **deploy markers**: Grafana Cloud uses **Basic** (`SIMSTEWARD_LOKI_USER` + **`SIMSTEWARD_LOKI_TOKEN`**); local **loki-gateway** uses **Bearer `LOKI_PUSH_TOKEN`**. Push failures print in the deploy script output.
5. **Data source in Grafana** — Point the Loki data source at your Loki URL (e.g. `http://localhost:3100` for local). Explore: `{app="sim-steward"}`.
6. **Debug vs production** — With `SIMSTEWARD_LOG_DEBUG=1`, many more lines (e.g. `tick_stats`, `yaml_update`) are sent. For AI or production dashboards, filter with `| level != "DEBUG"` to avoid noise.
@@ -155,7 +155,7 @@ See **docs/GRAFANA-LOGGING.md** for label schema, event taxonomy, and LogQL exam

For the full pipeline (collector, ports, Grafana datasource URL), see **docs/observability-local.md** § Canonical path and § Metrics / OTLP troubleshooting.

1. **Nothing in Explore (Prometheus Local)** — Confirm **`npm run obs:up`** is running and **`http://localhost:9090/-/healthy`** returns OK. Smoke: **`npm run obs:poll:prometheus`**.
1. **Nothing in Explore (Prometheus Local)** — Confirm **`pnpm run obs:up`** is running and **`http://localhost:9090/-/healthy`** returns OK. Smoke: **`pnpm run obs:poll:prometheus`**.
2. **No `simsteward_*` metrics** — OTLP is disabled unless **`OTEL_EXPORTER_OTLP_ENDPOINT`** or **`SIMSTEWARD_OTLP_ENDPOINT`** is set **before** SimHub starts (SimHub does not load `.env` automatically). Use **`scripts/run-simhub-local-observability.ps1`** or set env in the user/session environment.
3. **`connection refused` to port 4317** — OpenTelemetry Collector is not up or ports are not mapped; restart compose from the repo root.
4. **Wrong protocol** — gRPC defaults for **`http://127.0.0.1:4317`**. For HTTP/protobuf on **4318**, set **`OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf`** and point the endpoint at **4318**.
14 changes: 7 additions & 7 deletions docs/observability-local.md
@@ -21,30 +21,30 @@ Quick start for plugin logs in local Grafana/Loki, **optional OTLP metrics** (Op
2. **Start the stack** (repo root):

```powershell
npm run obs:up
pnpm run obs:up
```

Or copy `observability/local/.env.observability.example` → `.env.observability.local`, set passwords/tokens, then `npm run obs:up:env`. Check: `npm run obs:ps`.
Or copy `observability/local/.env.observability.example` → `.env.observability.local`, set passwords/tokens, then `pnpm run obs:up:env`. Check: `pnpm run obs:ps`.

3. **Configure the plugin** — SimHub does not load `.env` by default. Recommended: `.\scripts\run-simhub-local-observability.ps1` (sets `SIMSTEWARD_LOKI_URL=http://localhost:3100`, `SIMSTEWARD_LOG_ENV=local`, and OTLP for metrics — see script). Or set those in Windows user env and restart SimHub. See `.env.example` “Local Loki” and “OTLP / Prometheus (local metrics)” blocks.

4. **Grafana** — http://localhost:3000 → Explore → Loki → `{app="sim-steward", env="local"}`. Provisioned dashboard **Sim Steward — Deploy health** (`simsteward-deploy-health`) correlates `deploy.ps1` markers (`event=deploy_marker`) with plugin bring-up and errors. Put `SIMSTEWARD_LOKI_URL` (and `LOKI_PUSH_TOKEN` if using loki-gateway) in repo **`.env`** — `deploy.ps1` loads it automatically via `scripts/load-dotenv.ps1` (optional merge: `observability/local/.env.observability.local`).

5. **Metrics (optional)** — With the stack up, set **`OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4317`** (or use `SIMSTEWARD_OTLP_ENDPOINT`) before starting SimHub. After the plugin loads, Explore → **Prometheus Local** → e.g. `simsteward_process_cpu_percent` or `up{job="otel-collector"}`. Smoke: `npm run obs:poll:prometheus` or `.\scripts\poll-prometheus.ps1`.
5. **Metrics (optional)** — With the stack up, set **`OTEL_EXPORTER_OTLP_ENDPOINT=http://127.0.0.1:4317`** (or use `SIMSTEWARD_OTLP_ENDPOINT`) before starting SimHub. After the plugin loads, Explore → **Prometheus Local** → e.g. `simsteward_process_cpu_percent` or `up{job="otel-collector"}`. Smoke: `pnpm run obs:poll:prometheus` or `.\scripts\poll-prometheus.ps1`.

6. **Generate traffic** — Use SimHub + web dashboard; confirm logs in **Explore** with `{app="sim-steward", env="local"}` (no repo-provisioned Grafana dashboards until you add JSON under `observability/local/grafana/provisioning/dashboards/`).

**Storage override:** Set `GRAFANA_STORAGE_PATH` in `.env.observability.local`; compose uses `${GRAFANA_STORAGE_PATH:-S:/sim-steward-grafana-storage}`.

**Terminal tail:** `npm run obs:poll` (direct Loki :3100) or `npm run obs:poll:grafana` / `.\scripts\poll-loki.ps1 -ViaGrafana` using **GRAFANA_API_TOKEN** (or admin user/password) in repo `.env` — same path Grafana Explore uses (`loki_local` datasource). **Prometheus:** `npm run obs:poll:prometheus` / `.\scripts\poll-prometheus.ps1`.
**Terminal tail:** `pnpm run obs:poll` (direct Loki :3100) or `pnpm run obs:poll:grafana` / `.\scripts\poll-loki.ps1 -ViaGrafana` using **GRAFANA_API_TOKEN** (or admin user/password) in repo `.env` — same path Grafana Explore uses (`loki_local` datasource). **Prometheus:** `pnpm run obs:poll:prometheus` / `.\scripts\poll-prometheus.ps1`.

---

## Housekeeping: wipe dashboards’ data (local)

To **clear Loki chunks/WAL**, optional **Prometheus TSDB**, and optional Grafana bind-mount state **without** changing compose, `loki-config.yml`, datasource provisioning, `LOKI_PUSH_TOKEN`, or `SIMSTEWARD_LOKI_*`:

1. From repo root, run **`npm run obs:wipe -- -Force`** (clears the `loki` and **`prometheus`** subdirectories under `GRAFANA_STORAGE_PATH`).
1. From repo root, run **`pnpm run obs:wipe -- -Force`** (clears the `loki` and **`prometheus`** subdirectories under `GRAFANA_STORAGE_PATH`).
2. Optional flags: **`-Grafana`** (wipes `grafana.db`; re-run `scripts/grafana-bootstrap.ps1` if you use `GRAFANA_API_TOKEN`), **`-SampleLogs`** (clears `observability/local/sample-logs/*` files), or **`-All`** for both.

Equivalent: `.\scripts\obs-wipe-local-data.ps1 -Force` (same switches).
@@ -101,9 +101,9 @@ The stack publishes these **host** ports together; any other process (or second

### Metrics / OTLP troubleshooting

- **`up{job="otel-collector"} == 0`** — Prometheus cannot reach the collector on `otel-collector:8889` (compose network). Confirm `otel-collector` is running: `npm run obs:ps`.
- **`up{job="otel-collector"} == 0`** — Prometheus cannot reach the collector on `otel-collector:8889` (compose network). Confirm `otel-collector` is running: `pnpm run obs:ps`.
- **No `simsteward_*` series** — OTLP is off until **`OTEL_EXPORTER_OTLP_ENDPOINT`** or **`SIMSTEWARD_OTLP_ENDPOINT`** is set **before** SimHub starts. Use **`http://127.0.0.1:4317`** for gRPC; for port **4318** set **`OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf`**.
- **Connection refused on 4317** — Collector not started or ports not published; run `npm run obs:up` from repo root.
- **Connection refused on 4317** — Collector not started or ports not published; run `pnpm run obs:up` from repo root.
- **Grafana Prometheus query errors** — Datasource must be **`http://prometheus:9090`** (container DNS), not `localhost:9090`.
- **Loki remains authoritative** for `host_resource_sample` until you rely on Prom-only SLOs; metrics duplicate CPU/working set at OTLP export cadence.

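The OTLP troubleshooting bullets above pair each protocol with a port: gRPC on 4317, HTTP/protobuf on 4318. A small sketch of that mapping, plus the Prometheus health probe mentioned in the troubleshooting docs, could look like this. Both function names are hypothetical, and the endpoints are the local defaults quoted in the diff.

```shell
# Hypothetical helper: map an OTLP protocol name to the local collector endpoint.
otlp_endpoint_for() {
  case "$1" in
    grpc)          echo "http://127.0.0.1:4317" ;;
    http/protobuf) echo "http://127.0.0.1:4318" ;;
    *)             echo "unknown protocol: $1" >&2; return 1 ;;
  esac
}

# Hypothetical check: succeeds when the local Prometheus reports healthy.
# Performs a network call; run only with the stack up (pnpm run obs:up).
prometheus_healthy() {
  curl -fsS "http://localhost:9090/-/healthy" >/dev/null
}
```

The value printed by `otlp_endpoint_for http/protobuf` is what `OTEL_EXPORTER_OTLP_ENDPOINT` would be set to, alongside `OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf`, before SimHub starts.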