diff --git a/.gitignore b/.gitignore index 3ce88ea3..ef301187 100644 --- a/.gitignore +++ b/.gitignore @@ -771,35 +771,15 @@ scap_json_to_openwatch_converter_enhanced.py backend/debug_*.py backend/scripts/ -# Documentation directories (user-generated, do not commit) +# docs/: only operator-facing material is tracked — the guides, the runbooks, +# their images, and the docs index. Engineering/planning/architecture/vision +# docs (including docs/engineering/STATUS.md) are local-only. docs/* !docs/.gitkeep !docs/README.md -!docs/INTRODUCTION.md !docs/guides/ -!docs/architecture/ -!docs/api/ -!docs/decisions/ !docs/runbooks/ -!docs/engineering/ !docs/images/ -docs/images/* -!docs/images/*.png -!docs/images/quickstart/ -!docs/images/scanning/ -!docs/images/hosts/ -# Security reviews and assessments (tracked in git for audit trail) -!docs/OW_SECURITY_ASSESSMENT.md -!docs/*_SECURITY_REVIEW_*.md -# Planning documents (tracked in git so cross-team coordination docs can -# reference them and so the Kensa↔OpenWatch convergence schedule is -# reviewable) -!docs/OPENWATCH_Q1_PLAN.md -!docs/OPENWATCH_Q2_PLAN.md -!docs/OPENWATCH_Q1_Q3_PLAN.md -!docs/OPENWATCH_VISION.md -!docs/OPENWATCH_VISION_STATUS.md -!docs/KENSA_OPENWATCH_*.md PRD/ backend/docs/ frontend/docs/ diff --git a/STATUS.md b/STATUS.md deleted file mode 100644 index 79825fce..00000000 --- a/STATUS.md +++ /dev/null @@ -1,68 +0,0 @@ -# OpenWatch — Project Status - -**Last Updated:** 2026-06-25 -**Latest release:** `v0.2.0-rc.14` (Eyrie) — signed RPM/DEB (amd64 + arm64) + SBOMs, GitHub pre-release -**Stack:** single Go binary (`openwatch`, go 1.26), PostgreSQL-only, Kensa v0.6.0 (538 rules), React 19 + TanStack frontend (embedded) - -> One-page snapshot of where the project is. 
For the work queue see -> [BACKLOG.md](BACKLOG.md); for history see [CHANGELOG.md](CHANGELOG.md) and -> [SESSION_LOG.md](SESSION_LOG.md); for the deployment roadmap see -> [docs/engineering/openwatch_roadmap.md](docs/engineering/openwatch_roadmap.md). - ---- - -## Shipped (on `main` / in the last release) - -- **Compliance scanning** end to end: Kensa SSH-based scans, OS-aware lens - views, adaptive scheduler (4h–48h bands), durable per-scan evidence + OSCAL - export, Kensa rule-library browser. -- **Fleet management**: hosts list/detail, multilayer liveness (ping/SSH/ - privilege), server-intelligence collection, groups, posture trends. -- **Remediation** (free-core, single-rule): apply + rollback from the host tab, - serialized per host, live status over SSE. -- **Exception governance**: request/approve/revoke/expire with separation of - duties. -- **Reports**: scoped, coverage-honest, signed snapshots with multiple faces - (OSCAL SAR, CSV, PDF, JSON), scheduled + emailed. -- **Settings** activated: Audit, License status, Users (invite/add + manage), - Notifications channels (Slack/webhook/email), Security (auth policy, SSO/OIDC, - API tokens). -- **Security controls**: CSRF double-submit, per-IP auth rate-limit, security - headers, durable TOFU known-hosts, breach-corpus password screening, Argon2id, - RS256 JWT, AES-256-GCM credential encryption. -- **Packaging**: native RPM (CentOS Stream 9) + DEB (Ubuntu 24.04), tag-driven - signed release pipeline, FIPS via OpenSSL 3.x provider. 
- -## In flight (open PRs, not yet merged — 2026-06-25) - -| PR | Area | Status | -|----|------|--------| -| #673 | **PKG-3**: remediation store path under hardened unit | green; **production-breaking fix**, land first | -| #675 | **AUTH-1 slice 1**: client idle-session timeout | green; land before #678 | -| #678 | **AUTH-1 b+c**: absolute-timeout ceiling + slide-on-activity | green; (c) client side inert until #675 ships | -| #679 | **Notifications Slice 1**: durable change-driven bell | green | -| #676 | **Avg-compliance parity** (/hosts ↔ /dashboard) | green | -| #677 | Notifications design doc | docs | -| #674 | Backlog (PKG-3 + AUTH-1) | docs | - -Recommended merge order: **#673 → #675 → #678**, then #676 / #679 / docs. - -## Next - -- Cut **`v0.2.0-rc.15`** once #673 (and the auth fixes) land — remediation is - broken on hardened installs until then. -- **Notifications Slice 2**: transaction-log rule-regression projector (critical - `pass→fail`, grouped per host/scan) + per-host RBAC recipient scoping. -- **License enforcement coverage**: the tier/key/402 machinery exists but only - the demo endpoint is gated; wire `x-required-feature` to actually gate - declared paid routes. -- GA gates: Stage 3 fleet-verification per `docs/runbooks/RELEASING.md`. - -## Known issues / caveats - -- Remediation is **non-functional on hardened packaged installs** until #673 - lands (operator workaround: set `OPENWATCH_KENSA_STORE_PATH`). -- Landing/login animation can appear frozen under OS **reduced-motion** (by - design — the radar/pulse honor `prefers-reduced-motion`). -- License **enforcement coverage** is partial (machinery present, most paid - routes not yet gated). diff --git a/docs/INTRODUCTION.md b/docs/INTRODUCTION.md deleted file mode 100644 index 2e88b0c8..00000000 --- a/docs/INTRODUCTION.md +++ /dev/null @@ -1,146 +0,0 @@ -# Introduction to OpenWatch - -## What is OpenWatch - -OpenWatch is a continuous compliance platform for Linux infrastructure. 
It connects to servers over SSH, runs compliance checks via the Kensa engine, and provides visibility into compliance posture over time. OpenWatch answers not just what is passing now, but what was passing last week, what drifted since the last scan, and what needs immediate attention. All findings include machine-generated evidence, framework mappings, and timestamps suitable for audit review. - ---- - -## The Problem - -Compliance in most environments is manual, fragmented, and reactive. - -- **Point-in-time scans decay immediately.** A passing result from last Tuesday says nothing about today. Without continuous scanning, compliance status is unknown between assessments. -- **Historical questions are unanswerable.** When an auditor asks "were you compliant on January 15th?", the answer requires re-scanning infrastructure that may have changed. If the environment was rebuilt, the answer is lost entirely. -- **Exceptions live in spreadsheets.** Waiver approvals, risk acceptances, and compensating controls are tracked outside the scanning tool. There is no link between the exception and the finding it covers. -- **Drift is invisible until the next audit.** A configuration change at 2 AM on a Saturday will not surface until someone manually re-scans the host or an auditor flags it weeks later. -- **Evidence is assembled after the fact.** Instead of generating evidence during checks, teams spend days before audits collecting screenshots, command outputs, and configuration files to prove compliance. - -These gaps create risk. They also create unnecessary work for teams that are already stretched thin. The typical result is that compliance becomes an audit preparation exercise -- a burst of activity before an assessment, followed by months of unknown state. - -OpenWatch eliminates this cycle. - ---- - -## The Solution: See, Scan, Secure - -OpenWatch addresses each of these problems through three capabilities. 
- -### See - -The dashboard provides real-time compliance posture across all managed hosts. Historical trend data shows how posture has changed over days, weeks, and months. Drift alerts notify operators when a previously-passing check begins failing. Point-in-time queries answer "what was the state on this date?" without re-scanning. - -### Scan - -The Kensa compliance engine runs 338 YAML-based rules over SSH connections. Each rule maps to one or more compliance frameworks simultaneously. A single scan produces results for CIS, STIG, NIST 800-53, PCI-DSS, and FedRAMP without running separate tools for each framework. Rules detect OS capabilities at runtime rather than requiring per-distribution configuration. - -### Secure - -When findings are identified, OpenWatch provides remediation workflows. Automated fixes include rollback capability in case a remediation introduces unintended side effects. Exception governance tracks waivers through an approval workflow with expiration dates. All scan results, remediations, and exceptions produce audit-ready evidence packages. - ---- - -## Core Values - -1. **Security-First** -- Every feature is designed with security as the primary requirement. Authentication uses Argon2id password hashing. API tokens use RS256 JWT. All SSH credentials are encrypted at rest with AES-256-GCM. Audit logging covers every authentication and authorization event. - -2. **Transparency** -- Compliance status is visible at all times. There is no hidden state. Dashboard views, API endpoints, and audit exports all reflect the same underlying data. When a check fails, the evidence explains why. - -3. **Automation** -- Manual effort is reduced through intelligent scanning schedules and automated remediation. Operators configure policies once. The platform enforces them continuously. - -4. **Rule-Based Compliance** -- One rule set covers many frameworks. Kensa rules declare what to check and how to evaluate the result. 
Framework mappings are maintained separately, so adding a new framework does not require writing new rules. Capabilities are detected at runtime, not hardcoded per operating system. - ---- - -## Operating Principle - -Compliance should be a seamless part of operations, not a periodic burden. - -Kensa scans run on adaptive schedules based on host compliance state: - -| Host State | Scan Interval | Rationale | -|------------|---------------|-----------| -| Healthy | Every 24 hours | Baseline monitoring, low overhead | -| Degraded | Every 6 hours | Track remediation progress | -| Critical | Every 1 hour | Rapid feedback on urgent fixes | - -These intervals are configurable per policy. No manual scanning is required for day-to-day operations. - -When scan results change, the platform generates alerts based on configurable thresholds. Operators respond to alerts rather than polling dashboards. This shifts compliance from a reactive audit preparation exercise to a continuous operational practice. - ---- - -## Architecture at a Glance - -OpenWatch ships as a single Go binary backed by PostgreSQL. - -``` -+-----------------------------------------------------------+ -| openwatch (single Go binary) :8443 | -| - REST API (net/http) | -| - embedded React 19 UI (go:embed) | -| - background worker (PostgreSQL SKIP LOCKED queue) | -| - Kensa compliance engine (Go, SSH-based) | -+-----------------------------------------------------------+ - | -+-----------------------------------------------------------+ -| PostgreSQL | -+-----------------------------------------------------------+ -``` - -**Binary** serves both the REST API and the embedded React single-page application over HTTPS on port 8443. The SPA is compiled into the binary with `go:embed`, so there is no separate web tier or reverse proxy to run. The binary handles authentication, authorization, scan management, compliance queries, and framework mappings. 
- -**Worker** runs as `openwatch worker` and processes asynchronous tasks including scan execution, result parsing, alert evaluation, and remediation jobs. It connects to target hosts over SSH using credentials encrypted in the database. - -**Job queue** is PostgreSQL-native, using the `SKIP LOCKED` pattern to dispatch and lock jobs. There is no Redis and no Celery. Scheduled scans are enqueued from adaptive compliance policies. - -**PostgreSQL** stores all persistent data: hosts, scans, findings, users, exceptions, alerts, framework mappings, and audit logs. All primary keys are UUIDs. Schema changes ship as migrations in `internal/db/migrations/` and apply via `openwatch migrate`. - -**Kensa** is the compliance engine, integrated as a Go dependency. Kensa connects to target hosts over SSH, executes rule checks, and returns structured results with evidence. It does not store results, manage exceptions, or provide a UI -- those responsibilities belong to OpenWatch. - -The only externally exposed port is 8443 (API and UI over HTTPS). Target hosts are reached over SSH from the binary. Lifecycle is managed by systemd (`openwatch.service`); no Docker or Podman runtime is required to run OpenWatch. - ---- - -## Who OpenWatch Is For - -**System Administrators** managing compliance across a Linux fleet. OpenWatch connects to hosts they already manage via SSH, runs checks on their schedule, and surfaces findings that need attention. - -**Security Engineers** building and enforcing security baselines. The rule reference interface shows every check Kensa performs, which frameworks it satisfies, and what evidence it collects. - -**Security Analysts** investigating compliance drift and remediation effectiveness. Point-in-time queries and trend data support root cause analysis when compliance degrades. - -**Compliance Officers** preparing for audits and generating evidence packages. Temporal compliance queries produce the exact posture at any historical date. 
Exception governance provides auditable waiver records. - -**Auditors** reviewing compliance posture and exceptions. Audit export endpoints produce structured data covering findings, evidence, exceptions, and remediation history. - -These roles are not mutually exclusive. OpenWatch provides role-based access control so each user sees the views and actions relevant to their responsibilities. - ---- - -## Supported Frameworks - -Kensa rules map to the following compliance frameworks. - -| Framework | Mapping ID | Rules | -|-----------|------------|-------| -| CIS RHEL 9 v2.0.0 | cis-rhel9-v2.0.0 | 271 | -| STIG RHEL 9 V2R7 | stig-rhel9-v2r7 | 338 | -| NIST 800-53 R5 | nist-800-53-r5 | 87 | -| PCI-DSS v4.0 | pci-dss-v4.0 | 45 | -| FedRAMP Moderate | fedramp-moderate | 87 | - -A single scan evaluates all applicable rules. Framework filtering is applied at query time, not scan time. Adding support for a new framework requires only a mapping file -- no new rules or scanner changes. - -Rule counts reflect current Kensa mapping files. As Kensa releases new rule versions, these counts will change. The Rule Reference interface in the OpenWatch UI shows the current rule inventory, organized by framework, severity, and category. 
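The query-time filtering described above can be illustrated with a minimal sketch. It assumes scan results and framework mappings are already loaded into memory. The `RuleResult` and `FrameworkMapping` types, the `FilterByFramework` function, and the rule IDs are hypothetical names chosen for illustration, not OpenWatch's or Kensa's actual API; only the mapping IDs come from the table above.

```go
package main

import (
	"fmt"
	"sort"
)

// RuleResult is a hypothetical per-rule outcome stored from one scan.
type RuleResult struct {
	RuleID string
	Passed bool
}

// FrameworkMapping is a hypothetical in-memory form of the mapping files:
// mapping ID -> set of rule IDs that the framework references.
type FrameworkMapping map[string]map[string]bool

// FilterByFramework selects, at query time, the stored results referenced by
// the given mapping. The scan is never re-run; an unknown mapping ID simply
// yields an empty slice.
func FilterByFramework(results []RuleResult, m FrameworkMapping, mappingID string) []RuleResult {
	rules := m[mappingID]
	out := make([]RuleResult, 0, len(results))
	for _, r := range results {
		if rules[r.RuleID] {
			out = append(out, r)
		}
	}
	// Sort by rule ID so the output order is stable.
	sort.Slice(out, func(i, j int) bool { return out[i].RuleID < out[j].RuleID })
	return out
}

func main() {
	// One scan produced these results (rule IDs are invented for the example).
	results := []RuleResult{
		{RuleID: "ssh-root-login", Passed: true},
		{RuleID: "audit-log-retention", Passed: false},
		{RuleID: "password-min-length", Passed: true},
	}
	mappings := FrameworkMapping{
		"cis-rhel9-v2.0.0": {"ssh-root-login": true, "password-min-length": true},
		"pci-dss-v4.0":     {"audit-log-retention": true, "password-min-length": true},
	}
	for _, id := range []string{"cis-rhel9-v2.0.0", "pci-dss-v4.0"} {
		fmt.Println(id, FilterByFramework(results, mappings, id))
	}
}
```

Under this assumed shape, supporting another framework means adding one more entry to the mapping, which matches the claim that no new rules or scanner changes are needed.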
- ---- - -## What's Next - -- [Quickstart Guide](guides/QUICKSTART.md) -- First 15 minutes with OpenWatch -- [Installation Guide](guides/INSTALLATION.md) -- Deployment options and configuration -- [User Roles](guides/USER_ROLES.md) -- Permissions and workflows -- [Scanning and Compliance](guides/SCANNING_AND_COMPLIANCE.md) -- Scan lifecycle, frameworks, and posture queries -- [Hosts and Remediation](guides/HOSTS_AND_REMEDIATION.md) -- Host management, remediation, and exception workflows -- [API Guide](guides/API_GUIDE.md) -- Automation and integration reference diff --git a/docs/KENSA_OPENWATCH_BOUNDARY.md b/docs/KENSA_OPENWATCH_BOUNDARY.md deleted file mode 100644 index 037207d4..00000000 --- a/docs/KENSA_OPENWATCH_BOUNDARY.md +++ /dev/null @@ -1,243 +0,0 @@ -# Kensa / OpenWatch responsibility boundary - -> **Status:** Ratified 2026-05-25. Authoritative. -> **Supersedes:** [`KENSA_OPENWATCH_COORDINATION_2026-04-14.md`](./KENSA_OPENWATCH_COORDINATION_2026-04-14.md) §3.4 ("Event subscription for Heartbeat"). The rest of the 2026-04-14 memo remains accurate; only the event-subscription plan was overtaken by this decision. -> **Audience:** OpenWatch engineers, Kensa engineers, anyone scoping work that crosses the boundary. -> **One-line summary:** Kensa is the per-host measurement engine; OpenWatch is the fleet orchestration and monitoring platform. - ---- - -## 1. Why this document exists - -The April 14 coordination memo's §3.4 planned for OpenWatch to subscribe to a Kensa event stream for liveness pulses, drift signals, and fleet-monitoring events. That plan does not match what the Kensa API can actually support, for reasons that are structural rather than implementational: - -- Kensa is one-shot: every entry point (`detect`, `check`, `remediate`, the library `Service`) runs, does work, exits. The `InMemoryEventBus` is per-process and dies with the process. -- Kensa is single-host: each invocation is told one host. There is no inventory it could iterate. 
-- Periodic, fleet-wide pulses inherently require a long-running process with an inventory and a scheduler. - -Adding `HeartbeatPulse` emission to Kensa would require turning Kensa into a stateful long-running fleet daemon with a host registry and a scheduler. That is OpenWatch. The shape of the work, not its complexity, dictates which side owns it. - -This document records the boundary that both teams ratified on 2026-05-25, so future scoping conversations have one authoritative reference instead of re-deriving it. - ---- - -## 2. The boundary, stated cleanly - -| | Kensa | OpenWatch | -|---|---|---| -| Scope per invocation | One host | The fleet | -| Lifetime | One-shot (run, exit) | Long-lived service | -| State | Stateless between invocations | Stateful (inventory, history, schedules) | -| Surface | Library API consumed by callers | HTTP API + scheduler + bus, consumed by humans, integrations, and CI | -| Identity | The "git" of the system | The "GitHub" | - -OpenWatch invokes Kensa per measurement; Kensa returns a structured result. OpenWatch persists that result, decides when to measure next, monitors whether hosts respond at all, compares results across time to detect drift, aggregates across the fleet, and routes alerts. - ---- - -## 3. Responsibility split - -### 3.1 OpenWatch owns - -Everything in this column is built in the OpenWatch repo. Kensa has zero runtime responsibility for any of it. - -| Responsibility | Notes | -|---|---| -| **Scheduler** | Decides which hosts run which frameworks at which cadence. Replaces what Kensa would never have done. | -| **Host inventory** | Already shipped in OpenWatch Slice A (`internal/host/`). | -| **Credential store + resolver** | Already shipped in Slice A (`internal/credential/`). Kensa is handed credentials per invocation; OpenWatch never relies on Kensa to look them up. | -| **Liveness pulse loop** | Per-host periodic reachability probe. 
Calls Kensa's `Reachable()` primitive when available; until then, OpenWatch SSH-dials directly via its existing `internal/ssh` package. | -| **Drift detection** | Reads OpenWatch's transaction history, compares latest result to previous per (host, rule), emits drift signals. | -| **Fleet rollup** | Cross-host aggregation: posture by framework, posture by host, drift trend over time. Backed by Kensa's per-host signed records but composed in OpenWatch. | -| **Event bus** | Long-lived in-process pubsub. OpenWatch publishes its own monitoring events here; alert routing subscribes. | -| **Alert routing + channel dispatch** | Slack, email, webhook, future Jira. Includes dedup, rate limit, severity routing. | -| **Coalescing, back-pressure, drop counters** | The "always see at least one pulse per host per interval" guarantee is OpenWatch's responsibility, on OpenWatch's bus. | -| **`HeartbeatPulse` event emission** | OpenWatch publishes pulses on its own bus. See §4 for the type-definition decision. | -| **`DriftDetected` event emission** | Same as above. | -| **Subscription to Kensa transaction-progress events** | OpenWatch consumes Kensa's events for in-flight scan/remediation visibility. See §4 for the consumer-side contract. | - -### 3.2 Kensa owns - -Everything in this column lives in the Kensa repo. OpenWatch is a consumer. - -| Responsibility | Notes | -|---|---| -| **Per-host compliance evaluation** | The `Plan` / `Execute` / single-rule check primitives. Existing surface. | -| **Per-host signed transaction record** | The SQLite-backed signed log Kensa writes per host. OpenWatch's Eye reads these via `LogQuery`. | -| **Transaction-progress events** | `TransactionStarted`, `PhaseCompleted`, `Committed`, `RolledBack`. Emitted during a Kensa transaction; OpenWatch subscribes for progress display. | -| **Deadman events** | `DeadmanTimerArmed`, `DeadmanTimerFired`. Transaction-scoped, engine-internal — same lifecycle bucket as the other transaction-progress events. 
Kensa emits, OpenWatch consumes. Wiring currently absent (see §6.2). | -| **`Reachable(ctx, host)` primitive** | Cheap single-host reachability probe reusing Kensa's existing SSH ControlMaster transport. Doesn't exist yet — planned mid-Slice B (see §6.3). | -| **The shared event envelope** | `api.Event` and `api.EventKind` type definitions stay in Kensa as one shared wire vocabulary. See §4. | - ---- - -## 4. Event taxonomy — three buckets - -Every event in the system falls into exactly one bucket. The bucket determines who emits it and how its constant is declared. - -### Bucket A — Kensa-emitted, OpenWatch-consumed - -| Event | Status | -|---|---| -| `TransactionStarted` | Emitted by Kensa engine, published to internal bus | -| `PhaseCompleted` | Emitted | -| `Committed` | Emitted | -| `RolledBack` | Emitted | -| `DeadmanTimerArmed` | **Wiring gap** — deadman subsystem arms for real but does not publish (see §6.2) | -| `DeadmanTimerFired` | **Wiring gap** — same | - -Consumer-side contract: OpenWatch subscribes via `Kensa.Subscribe(EventFilter{...})`. `Kensa.Subscribe` is currently stubbed (returns `ErrNotYetImplemented`); the underlying `pkg/kensa.Service.Subscribe` works. See §6.4. - -### Bucket B — OpenWatch-owned (emitted by OpenWatch on its own bus) - -| Event | Status | -|---|---| -| `HeartbeatPulse` | OpenWatch publishes per pulse-loop iteration | -| `DriftDetected` | OpenWatch publishes when a (host, rule) state transitions | - -These never originate in Kensa. The Kensa code that today declares them as `EventKind` constants is dead vocabulary and will be removed (see §6.1). - -`EventFilter.HeartbeatInterval` and `EventFilter.FleetIDs` follow the same path — they are filter fields for an OpenWatch-owned subscription model, not Kensa's. They move out of Kensa's `api/` package along with the event constants. 
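As a rough illustration of the `DriftDetected` trigger in Bucket B, the following sketch compares the previous and latest result per (host, rule) and reports state transitions. The `Key` and `Transition` types and the `DetectDrift` function are hypothetical; the real detector would read OpenWatch's transaction history rather than in-memory maps.

```go
package main

import (
	"fmt"
	"sort"
)

// Key identifies one (host, rule) pair. Hypothetical type for this sketch.
type Key struct{ Host, Rule string }

// Transition describes one (host, rule) state change between two scans.
// From and To are pass/fail flags (true = pass).
type Transition struct {
	Host, Rule string
	From, To   bool
}

// DetectDrift returns a transition for every pair present in both snapshots
// whose pass/fail state differs. Pairs seen only in the latest snapshot are
// treated as new measurements, not drift.
func DetectDrift(prev, latest map[Key]bool) []Transition {
	var out []Transition
	for k, now := range latest {
		before, seen := prev[k]
		if seen && before != now {
			out = append(out, Transition{Host: k.Host, Rule: k.Rule, From: before, To: now})
		}
	}
	// Map iteration order is random, so sort for deterministic output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].Host != out[j].Host {
			return out[i].Host < out[j].Host
		}
		return out[i].Rule < out[j].Rule
	})
	return out
}

func main() {
	prev := map[Key]bool{
		{Host: "web-01", Rule: "example-rule-a"}: true,
		{Host: "web-01", Rule: "example-rule-b"}: true,
	}
	latest := map[Key]bool{
		{Host: "web-01", Rule: "example-rule-a"}: false, // regressed since the last scan
		{Host: "web-01", Rule: "example-rule-b"}: true,
		{Host: "web-01", Rule: "example-rule-c"}: false, // first measurement, not drift
	}
	for _, t := range DetectDrift(prev, latest) {
		fmt.Printf("drift on %s/%s: pass=%v -> pass=%v\n", t.Host, t.Rule, t.From, t.To)
	}
}
```

Each returned transition would correspond to one `DriftDetected` event published on OpenWatch's own bus; nothing in this path calls into Kensa.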
- -### Bucket C — The shared envelope - -Kensa retains `api.Event` (the struct shape) and `api.EventKind` (the type, with its currently-emitted constants from Bucket A only). OpenWatch declares its Bucket B constants against the same `api.EventKind` type — one wire vocabulary, zero dead constants in Kensa. - -```go -// In Kensa (api/events.go): -type EventKind string -const ( - TransactionStarted EventKind = "transaction.started" - PhaseCompleted EventKind = "transaction.phase_completed" - Committed EventKind = "transaction.committed" - RolledBack EventKind = "transaction.rolled_back" - DeadmanTimerArmed EventKind = "deadman.armed" - DeadmanTimerFired EventKind = "deadman.fired" -) - -// In OpenWatch (e.g. internal/events/kinds.go): -import "github.com/Hanalyx/kensa/api" - -const ( - HeartbeatPulse api.EventKind = "openwatch.heartbeat.pulse" - DriftDetected api.EventKind = "openwatch.drift.detected" -) -``` - -Both sides emit `api.Event` values; subscribers downstream of either bus see one type. No duplication of envelope definitions; no Kensa-side constants for events Kensa never produces. - ---- - -## 5. What this means for OpenWatch planning - -### 5.1 Slice A (shipped at `v0.2.0-rc.3`) - -Unaffected. The auth + user + host + credential admin surface stands. Host inventory, credential resolver, SSH dial layer all match the boundary as ratified. - -### 5.2 Slice B (next, not yet scoped to specs) - -Slice B is meaningfully larger than the "just call Kensa to run a scan" framing the April 14 memo implied. The Slice B specs should cover: - -1. **Scheduler** — when to run which framework against which host. -2. **Kensa executor wrapper** — invokes Kensa per scheduled run, persists the structured result. -3. **Transaction log writer** — OpenWatch's persistent record. Reads Kensa's signed records on each run; the log is OpenWatch's primary read surface for Eye / posture queries. -4. **Liveness loop** — periodic per-host reachability probe. 
Interim implementation SSH-dials directly via `internal/ssh`; switches to `Kensa.Reachable()` when that primitive lands. -5. **Drift detector** — compares latest transaction to previous per (host, rule); emits `DriftDetected` to OpenWatch's bus. -6. **Fleet rollup queries** — aggregate per-host state into fleet-level views (posture by framework, posture by host, drift trend). -7. **OpenWatch event bus** — in-process pubsub. Publishes Bucket B events; downstream subscribers attach. -8. **Alert router** — subscribes to the bus; routes by severity + tag + channel config. - -Realistic estimate: 10–12 weeks. The work is genuinely larger than Slice A; the boundary ratification is what makes the scoping possible. - -### 5.3 Slices beyond B - -`Subscribe` to Kensa transaction-progress events (Bucket A) lands when OpenWatch needs in-flight scan/remediation visibility. That's likely Slice C (proactive remediation workflow) or whenever a UI surface needs "show me what's happening on host X right now." Slice B does not require it because the scheduler invokes Kensa synchronously and persists the final result; in-flight visibility is a nice-to-have on top. - ---- - -## 6. Open Kensa-side action items - -These are planned against this boundary doc. None block OpenWatch Slice A; one (6.3) is needed mid-Slice B. - -### 6.1 Remove `HeartbeatPulse` and `DriftDetected` constants from Kensa's `api/` - -- **Action:** Move the two constants (plus `EventFilter.HeartbeatInterval` and `EventFilter.FleetIDs`) out of Kensa's frozen `api/` package. OpenWatch redeclares them as `api.EventKind`-typed constants in its own package. -- **Owner:** Kensa team, with founder sign-off (Kensa `api/` is semver-frozen — this is a one-way door before v1.0.0; would require a v2 major bump if missed). -- **Timing:** Before Kensa v1.0.0 (M7 in progress). Time-sensitive — should not drift. -- **Done by this boundary doc:** No, separately tracked. 
- -### 6.2 Wire deadman subsystem to publish to event bus - -- **Action:** `DeadmanTimerArmed` / `DeadmanTimerFired` are declared and the deadman subsystem fires for real (`internal/engine/deadman`, `internal/agent/deadman`), but the events are never published to the bus. Add the publish call inside the existing deadman fire path. -- **Owner:** Kensa team. -- **Timing:** Whenever Kensa is doing engine work near deadman. Not OpenWatch-blocking. - -### 6.3 Build `Reachable(ctx, host)` primitive - -- **Action:** Add a cheap single-host reachability probe that reuses Kensa's ControlMaster SSH transport. Return shape MUST distinguish "host down" (the expected `Reached: false` answer) from "probe couldn't run" (a config / transport error). Recommended shape: - ```go - type Reachability struct { - Reached bool - Latency time.Duration - } - // err is reserved for probe-execution failures, NOT host-down conditions. - func (s *Service) Reachable(ctx context.Context, host Host) (Reachability, error) - ``` -- **Owner:** Kensa team. -- **Timing:** Mid-Slice B. OpenWatch's interim liveness loop SSH-dials directly until this lands; the switch-over is mechanical when it does. - -### 6.4 Wire `Kensa.Subscribe` to `Service.Subscribe` - -- **Action:** `Kensa.Subscribe` is stubbed (returns `ErrNotYetImplemented`); only `pkg/kensa.Service.Subscribe` reaches the bus. The top-level wrapper needs to delegate. -- **Owner:** Kensa team. -- **Timing:** Before OpenWatch's first consumer of Bucket A events lands. Probably Slice C. - -### 6.5 Doc fixes (landed 2026-05-25, recorded for traceability) - -- `api/events.go` godoc: corrected the false "every event type the engine emits" claim and clarified `EventSubscriber` is for transaction progress, not heartbeat/drift. -- `KENSA_API_DOC.md` §8: rewritten to the three-bucket model with the open-item-2 resolution recorded inline. - ---- - -## 7. 
What this document does NOT change - -- The Kensa `Plan` / `Execute` / `LogQuery` / `EnvelopeVerifier` contract. Unaffected. -- The April 14 memo §1 (vision split), §2 (top-level identity mapping), §3.1–§3.3 (other duplication resolutions), §3.5–§5 (per-API discussions other than §3.4) remain authoritative. -- OpenWatch's Eye and Control Plane wiring against Kensa's `LogQuery` / `Planner` / `Executor`. Unaffected. -- Anything about who signs what (Kensa signs per-transaction; OpenWatch signs aggregates). Unaffected. - ---- - -## 8. FAQ - -**Q: If `HeartbeatPulse` is OpenWatch-emitted, why does it use Kensa's `api.EventKind` type at all?** - -So the wire envelope stays one type across the whole platform. A downstream consumer subscribing to either bus sees `api.Event` values without conditional decoding. The decision is to share the envelope, not the constants — Kensa declares what Kensa emits; OpenWatch declares what OpenWatch emits; the carrier type is one. - -**Q: Why doesn't Kensa expose pulses by gaining a daemon mode?** - -Three reasons, in order of weight: -1. **It's the wrong product shape.** Kensa is "git." A daemon with an inventory and a scheduler is "GitHub." Putting both in one tool means losing the property that any caller (CLI, OpenWatch, third-party automation) can use Kensa as a stateless library. -2. **The boundary disappears.** If Kensa owns an inventory, OpenWatch and Kensa are competing for the same data model, and every future "where does X live?" question gets harder. -3. **Cost vs benefit.** A pulse loop is ~200 lines of Go. The benefit of moving it to Kensa is zero (no other Kensa caller needs it). The cost is permanent — Kensa carries an inventory + scheduler forever. - -**Q: What happens if Kensa later wants to publish a HeartbeatPulse for some new reason?** - -It can — Kensa would declare a Kensa-emitted constant (e.g. `KensaInternalHeartbeat` for a self-health pulse Kensa emits). 
The Bucket B decision is specifically about the existing `HeartbeatPulse` constant whose semantics are fleet-level monitoring. Kensa-internal heartbeats are a different concept and would get a different constant. - -**Q: We're an outside team writing a Kensa consumer. Which boundary do we follow?** - -This one. The 2026-04-14 memo's §3.4 is obsolete on heartbeat/drift. The other sections still apply. - -**Q: How is this kept in sync if Kensa or OpenWatch evolves?** - -Each side cites this document in its own internal docs (Kensa's `KENSA_API_DOC.md` §8; OpenWatch's Slice B specs and any future scope docs). Changes to the boundary require coordinated edits to this file plus the citing docs; the file is versioned (date in the Status line at top). If a future change overtakes a section, that section gets struck through and a new doc supersedes it (same pattern as this doc superseding the April 14 memo's §3.4). - ---- - -## 9. Reference - -- April 14 memo (still authoritative outside §3.4): [`KENSA_OPENWATCH_COORDINATION_2026-04-14.md`](./KENSA_OPENWATCH_COORDINATION_2026-04-14.md) -- OpenWatch quarterly plans: [`OPENWATCH_Q2_PLAN.md`](./OPENWATCH_Q2_PLAN.md), [`OPENWATCH_Q1_Q3_PLAN.md`](./OPENWATCH_Q1_Q3_PLAN.md) -- OpenWatch vision: [`OPENWATCH_VISION.md`](./OPENWATCH_VISION.md) -- Stage 2 Slice A plan (shipped): [`engineering/stage_2_slice_a.md`](./engineering/stage_2_slice_a.md) diff --git a/docs/KENSA_OPENWATCH_COORDINATION_2026-04-14.md b/docs/KENSA_OPENWATCH_COORDINATION_2026-04-14.md deleted file mode 100644 index 87d43171..00000000 --- a/docs/KENSA_OPENWATCH_COORDINATION_2026-04-14.md +++ /dev/null @@ -1,199 +0,0 @@ -# Coordination Memo: OpenWatch ↔ Kensa Go Day-1 - -**From:** OpenWatch team -**To:** Kensa team -**Date:** 2026-04-14 -**Subject:** Duplication review, integration commitments, and interface-freeze asks against `KENSA_GO_DAY1_PLAN.md` -**Status:** Draft for review - ---- - -## 1. 
What triggered this memo - -OpenWatch reviewed `kensa/docs/KENSA_GO_DAY1_PLAN.md` (the Go Day-1 build plan) on 2026-04-14 after recent OpenWatch Q3 work started diverging from the interfaces you've defined in §3.5 and §9. The review found four confirmed overlaps, one architectural misalignment, and two deferred OpenWatch phases that would build throwaway code if they proceed on current assumptions. - -We want to resolve all of this **before** your `api/` surface freezes at Week 1 and before OpenWatch's Phase 6.2 implementation starts. - -## 2. The posture OpenWatch is adopting - -Per `OPENWATCH_VISION.md`'s framing (git : GitHub :: Kensa : OpenWatch), OpenWatch commits to the following rules: - -| Rule | Consequence | -|------|-------------| -| Source of truth for per-transaction data lives in **Kensa's SQLite store**. | OpenWatch's PostgreSQL `transactions` table is demoted to a **derived cache/index**, not a parallel source of truth. | -| Per-transaction cryptographic attestations are **signed by Kensa**. | OpenWatch's per-transaction signing path is deleted. OpenWatch keeps signing only for aggregate artifacts that OpenWatch itself originates (cross-host audit exports, quarterly posture reports, State-of-Production releases). | -| Single-host execution semantics (Plan, Execute, Rollback, atomicity, capture) live in **Kensa**. | OpenWatch's Phase 6.2 "proactive remediation" rewrites from "OpenWatch generates a plan" to "OpenWatch wraps `Kensa.Plan` / `Kensa.Execute` with an approval-workflow UI." | -| Event streams originate in **Kensa**. | OpenWatch's Heartbeat service subscribes to `Kensa.Subscribe(filter)` instead of polling PostgreSQL. | -| OpenWatch codes against **Kensa's `api/` signatures from commit 1**. | `ErrNotYetImplemented` during the stub period is acceptable. Parallel implementations with the intent to "swap later" are not. | - -The short form: **OpenWatch is GitHub over Kensa's git.** We present, aggregate, orchestrate, collaborate. 
We do not re-implement what Kensa already does for a single host. - -## 3. Confirmed duplication and OpenWatch's resolution - -### 3.1 Transaction log query (Kensa §3.5.1 `LogQuery`) - -**Duplication:** OpenWatch merged PR #398 today adding `POST /api/transactions/query` with a DSL whose filter fields mirror your `LogFilter` struct (HostIDs, FleetIDs, RuleIDs, FrameworkRefs, Statuses, Since, Until). Our schema, pagination, and projection shapes were derived independently but the surface is effectively the same read-side contract. - -**OpenWatch resolution:** -- Keep the HTTP endpoint URL and schema stable — it's what OpenWatch UI and any third-party customers will call -- Refactor the implementation to delegate to `kensa.TransactionLog().Query()` once your Week 22 milestone lands -- Interim (pre-Week 22): the endpoint queries the PostgreSQL cache (which the Python Kensa presently writes) -- Spec and route file annotated with this "interim implementation" framing in a follow-up PR - -**Ask for Kensa:** see §5 interface questions. - -### 3.2 Per-transaction Ed25519 signing (Kensa §8.2) - -**Duplication:** OpenWatch merged PR #397 earlier today with `backend/app/services/signing/signing_service.py` + a `deployment_signing_keys` table + `POST /api/transactions/{id}/sign`. Your Go plan places Ed25519 signing at the point of evidence capture, which is the correct trust layer — the auditor needs Kensa's attestation ("this execution happened on this host"), not OpenWatch's ("OpenWatch stored this later"). 
- -**OpenWatch resolution:** -- Delete `POST /api/transactions/{id}/sign` — per-transaction signing becomes Kensa-only -- Keep the `SigningService` class **but only for aggregate artifacts OpenWatch originates** — cross-host audit export bundles, quarterly posture snapshots, future State-of-Production report -- Update `docs/SIGNING_SECURITY_REVIEW_2026-04-14.md` with an explicit trust-layer diagram -- Bump `specs/services/signing/evidence-signing.spec.yaml` to version 2.0 with the narrowed scope - -**Ask for Kensa:** confirm that the signed envelope structure in §8.2 is exposed via the Go `api/` (we'll need to display the envelope + signature in OpenWatch's audit UI and verify it via `Kensa` on client request). - -### 3.3 Plan / Execute for remediation (Kensa §3.5.3 `Planner`, `Executor`) - -**Duplication (planned, not yet built):** OpenWatch's Q1-Q3 plan §6.2 "Proactive Remediation Workflow" specified *"Draft job is a remediation_jobs row with status=draft + the full proposed transaction plan (capture / apply / validate / rollback)"* — re-implementing your `Plan` type and Execute semantics. - -**OpenWatch resolution:** -- Rewrite §6.2 before implementation starts. Revised architecture: - 1. Drift event → OpenWatch calls `Kensa.Plan(host, rule)` → receives an opaque `Plan` blob - 2. OpenWatch stores the blob in `remediation_jobs.kensa_plan` (JSONB) without interpreting it - 3. ApprovalQueue UI renders the plan via a Kensa-provided preview formatter (not OpenWatch's own render) - 4. On N-of-M approval (OpenWatch's approval-chain layer, §6.3), OpenWatch calls `Kensa.Execute(host, plan)` - 5. `PlanStaleError` from Kensa surfaces as "re-plan required" in the UI -- **Do not start 6.2 implementation** until your Week 24 milestone - -**Ask for Kensa:** does the `Plan` struct include a human-readable preview string or should OpenWatch render from the `ApplyStep` / `RollbackStep` structures directly? 
We'd prefer a Kensa-owned formatter (`Plan.Preview()` method or an `api` helper) so the display stays consistent with the CLI's preview. - -### 3.4 Event subscription for Heartbeat (Kensa §3.5.2 `EventSubscriber`) - -**Duplication (planned, not yet built):** OpenWatch's Phase 3 Heartbeat design called for a PostgreSQL-backed event stream generated by the OpenWatch scheduler/worker. - -**OpenWatch resolution:** -- Rewrite Phase 3 before implementation starts. OpenWatch runs a long-lived consumer over `Kensa.Subscribe(EventFilter{...})` -- OpenWatch owns: fleet-level aggregation, alert-routing policy, channel dispatch (Slack/email/webhook/Jira), deduplication, notification-rate-limiting -- Kensa owns: the event stream itself - -**Ask for Kensa:** see §5. - -### 3.5 Transactions table as "canonical" - -**Architectural misalignment, not strict duplication:** OpenWatch's Q1 Phase 1 shipped a `transactions` + `host_rule_state` schema in PostgreSQL. With Kensa's SQLite store becoming the per-deployment source of truth, OpenWatch's PostgreSQL layer needs to be explicitly reframed. - -**OpenWatch resolution:** -- Treat the PostgreSQL `transactions` table as a **multi-host aggregation cache** (not a source of truth). It survives because cross-fleet queries against N independent Kensa SQLite stores are too slow for UI response times -- Add prominent comments to the ORM model and to `backend/app/tasks/kensa_scan_tasks.py` making this explicit -- `transaction-log.spec.yaml` updated to bump version and reflect the cache-over-Kensa posture -- Any conflict between PostgreSQL row and Kensa SQLite row: **Kensa wins** (cache invalidation path via `Subscribe` events) - -**Ask for Kensa:** confirm that `LogQuery.Query` + `LogQuery.Aggregate` can serve OpenWatch's multi-host aggregate needs at acceptable latency (<500ms p95 for historical posture queries on fleets of ~1000 hosts), or whether OpenWatch should maintain its own aggregation cache. 
If the former, OpenWatch drops the PostgreSQL `transactions` table entirely in a later phase. - -## 4. Work OpenWatch keeps as pure OpenWatch-layer (NON-duplicative) - -These are fleet/multi-user/multi-tenant concerns that have no analog in single-host Kensa. OpenWatch continues building them independently: - -| OpenWatch feature | Justification | -|---|---| -| Multi-approval chains + approval policies (Phase 6.3) | Orchestrating N approvers is orthogonal to `Kensa.Execute`. Same relationship as GitHub branch-protection rules to `git merge`. | -| Fleet grouping + per-group policies (Phase 6.4) | Kensa has no concept of "a fleet". OpenWatch owns group membership, group-specific scan cadences, group approval policies. | -| Public State-of-Production Rollback report (Phase 6.5) | Aggregated statistics across opt-in customers. Cross-tenant by definition. | -| SSO federation (OIDC + SAML) | User authentication for OpenWatch; not a per-host concern. | -| Notification channels (Slack, email, webhook, Jira) | Fan-out for Kensa events into organization-specific tooling. | -| RBAC, audit logging of OpenWatch user actions, multi-tenant isolation | OpenWatch-specific. | -| Adaptive scan scheduling across a fleet | OpenWatch decides *when* to call `Kensa.Scan` for each host. Kensa scans one host on demand. | -| Audit export (aggregate CSV/JSON/PDF across hosts) + its Ed25519 signing | OpenWatch-originated artifact. | - -## 5. Interface review requests (before Week-1 freeze) - -We would value a review of the following interface shapes **before `api/` freezes**, because once semver locks you can't adjust without a major-version bump: - -### 5.1 `LogFilter` (§3.5.1) - -- Add `Phase []Phase` field? OpenWatch UI filters by phase (capture/apply/validate/commit/rollback). -- Add `Severity []string` field? OpenWatch views filter by severity (critical/high/medium/low). Today inferred from `rule_id` — but that's expensive at query time. 
-- Clarify `FrameworkRef` semantics: is it `(framework_id, control_id)` or an opaque string? OpenWatch filters by control path (`cis_rhel9_v2:5.2.3`). - -### 5.2 `AggregateKey` (§3.5.1) - -Please support at minimum: -- `by_host` -- `by_rule` -- `by_framework_control` -- `by_host_then_framework_control` (compliance-officer view: which control is failing on which host?) -- `by_rule_then_status_over_time` (drift view: rule X's pass/fail ratio over week buckets) - -### 5.3 `EventFilter` (§3.5.2) - -- Can OpenWatch subscribe to `DeadmanTimerFired` **alone**? Our alert-routing needs to treat this as a critical-severity event regardless of other subscriptions. -- Is `HeartbeatPulse` rate-limitable in the filter, or does the subscriber drop? - -### 5.4 `Plan` (§3.5.3) - -- Does `Plan` include a `Preview() string` or `Render() *PreviewDoc` method Kensa owns? OpenWatch would rather display Kensa's rendering than build a second renderer that drifts. -- `PlanStaleError`: what granularity? (Same-host-any-change, or per-file drift?) OpenWatch's UX needs to say "re-plan because X changed," not just "re-plan." - -### 5.5 `TransactionRecord` (§3.5.1 `Get`) - -- Does it include the full evidence envelope or only its hash? OpenWatch's audit-export path embeds the envelope directly, so we'd need the full payload. - -### 5.6 Concurrency / rate limiting - -- Is `Kensa.Scan` / `Kensa.Transact` safe to call concurrently against the same host from different OpenWatch workers? OpenWatch's job queue may fan out. -- Any per-host serialization you enforce, or is the caller responsible? - -## 6. Timing + coordination - -From your build sequence (§11): - -| Kensa milestone | OpenWatch action | -|---|---| -| **Week 1** — `api/` surface frozen with stubs | OpenWatch starts coding against signatures immediately. PR #398 spec annotated; signing narrowed; Q1-Q3 plan §6.2 and Phase 3 rewritten to target `api/`. 
| -| **Week 22** — `LogQuery` real | OpenWatch swaps `POST /api/transactions/query` implementation from PostgreSQL to `Kensa.TransactionLog()`. | -| **Week 24** — `Plan`/`Execute` real | OpenWatch starts §6.2 proactive-remediation implementation. | -| **Week 25** — `Subscribe` real | OpenWatch cuts Heartbeat from PostgreSQL polling to Kensa event stream. | -| **Week 26 (M5)** — all OpenWatch-facing APIs real | OpenWatch runs full integration test: Plan → Subscribe → Execute → Query. Target: parity with Python Kensa on a 50-rule corpus. | -| **Week 40 (M7)** — Kensa Go v1.0.0 | OpenWatch is a pure consumer of Go Kensa. Python Kensa archived. | - -Concrete OpenWatch deliverables **this sprint** in direct response to this memo: - -1. PR: "docs: align signing scope to OpenWatch-originated artifacts only" (narrows `backend/app/services/signing/`, deletes per-transaction signing endpoint, updates review doc + spec) -2. PR: "docs: rewrite Q1-Q3 plan §6.2 + Phase 3 against Kensa api/" (architecture-only, no code changes) -3. PR: "chore(transactions): reframe query API as interim over Kensa LogQuery" (spec annotation + TODO comment in route; no behavior change) - -**Not in this sprint:** Phase 6.2 implementation (waits for Week 24) and Phase 3 Heartbeat (waits for Week 25). Phase 6.3 (multi-approval) and Phase 6.4 (fleet groups) remain scheduled — those are OpenWatch-layer and don't wait. - -## 7. Asks summary - -In priority order, what OpenWatch needs from Kensa team: - -1. **Confirm this memo's resolutions are what you expect.** Any of §3.1–3.5 where our resolution is wrong, flag now. -2. **Review interface questions in §5** and adjust `api/` before Week-1 freeze. -3. **Confirm the Week 1 `api/` stub strategy is real and imminent.** OpenWatch's roadmap assumes we can start coding against it in ~days, not ~months. -4. **Coordinate on the evidence-envelope structure** so OpenWatch's audit UI and the CLI present the same thing. -5. 
**Shared `kensa-spec` repo for rules/mappings/specs** (your §12.1) — confirm the submodule mechanics so OpenWatch's Kensa rule-reference UI doesn't diverge. - -## 8. Open questions - -These came up during the review and we want your input, not a pre-baked answer from us: - -- **Agent API** (your §3.5 intro says "future AI agents" are a consumer). Is the intent that OpenWatch *also* exposes an HTTP version of Kensa's API to external AI agents, or do agents talk to Kensa directly? This affects whether OpenWatch stands up an `/api/v2/agent` surface or not. -- **Deadman-timer visibility.** Should OpenWatch's UI render a prominent warning when a deadman timer is armed on a host? (We think yes — operators need to know a rollback is scheduled.) What's the UX you envision? -- **Multi-fleet transaction log.** If a transaction on host H_1 in fleet F_1 and another on host H_2 in fleet F_2 need cross-querying (e.g., "show me all remediations for CIS 5.2.3 across both fleets last week"), does `LogQuery` on a single Kensa instance answer this, or does OpenWatch federate across N Kensa instances? - ---- - -**Response requested by:** Kensa team commit-1 timeline (please respond before you freeze `api/`). 
- -**Contacts:** -- OpenWatch: engineering (CLAUDE.md collaborator reviewing this memo, human review pending) -- Kensa: engineering - -**Related documents:** -- `/home/rracine/hanalyx/kensa/docs/KENSA_GO_DAY1_PLAN.md` -- `/home/rracine/hanalyx/openwatch/docs/OPENWATCH_VISION.md` -- `/home/rracine/hanalyx/openwatch/docs/OPENWATCH_Q1_Q3_PLAN.md` -- `/home/rracine/hanalyx/openwatch/docs/SIGNING_SECURITY_REVIEW_2026-04-14.md` diff --git a/docs/OPENWATCH_Q1_Q3_PLAN.md b/docs/OPENWATCH_Q1_Q3_PLAN.md deleted file mode 100644 index c02cb6c9..00000000 --- a/docs/OPENWATCH_Q1_Q3_PLAN.md +++ /dev/null @@ -1,712 +0,0 @@ -# OpenWatch Q1–Q3 Implementation Plan - -**Date:** 2026-04-11 -**Last updated:** 2026-04-14 (Kensa Convergence Addendum) -**Source:** Synthesis of codebase assessments against [OPENWATCH_VISION.md](OPENWATCH_VISION.md) Q1–Q3 milestones -**Companion:** [OPENWATCH_VISION_STATUS.md](OPENWATCH_VISION_STATUS.md) - -**Scope note on OSCAL:** Per decision on 2026-04-11, OSCAL export is deferred — the feature belongs in Kensa first, then OpenWatch calls into it. This plan includes evidence envelope structure and Ed25519 signing (which are OpenWatch concerns) but omits OSCAL serialization. - ---- - -## Kensa Convergence Addendum (2026-04-14) - -This addendum captures the coordination outcome between the OpenWatch team and the Kensa team on 2026-04-14. It supersedes sections of this plan that assumed OpenWatch would implement functionality the Kensa Go Day-1 plan (`kensa/docs/KENSA_GO_DAY1_PLAN.md`) now commits to providing through its `api/` surface. - -### The posture - -Per the `OPENWATCH_VISION.md` framing (git : GitHub :: Kensa : OpenWatch), **OpenWatch is a collaboration, aggregation, and orchestration layer over Kensa**. OpenWatch does not re-implement what Kensa already does for a single host. 
- -### What this changes in this plan - -| Plan section | Original assumption | Revised assumption | -|---|---|---| -| **§6.1 Transaction log query API** | OpenWatch builds and owns the read path against PostgreSQL `transactions` | Endpoint URL + schema owned by OpenWatch; implementation delegates to `kensa.api.Kensa.TransactionLog().Query()` at **Kensa Week 22**. Interim implementation annotated in `specs/api/transactions/transaction-query.spec.yaml` v1.1. | -| **§6.2 Proactive remediation workflow** | OpenWatch generates the plan (capture/apply/validate/rollback) | OpenWatch wraps `Kensa.Plan` / `Kensa.Execute` with an approval-workflow UI. See revised §6.2 below. **Do not implement until Kensa Week 24.** | -| **Phase 3.4 Fleet health** | Queries `transactions` table + `host_liveness` | Same queries; PostgreSQL `transactions` is now framed as a **derived multi-host aggregation cache over Kensa's SQLite store** per Kensa Day-1 plan §13A. Survives through Kensa v1.0.0. | -| **Phase 3 Heartbeat (broadly)** | OpenWatch-internal event generation | OpenWatch subscribes to `Kensa.Subscribe(EventFilter{...})` at **Kensa Week 25** for transaction lifecycle events; OpenWatch still owns its own TCP liveness ping (§3.2 — distinct from Kensa's `HeartbeatPulse` and complementary). | -| **Per-transaction Ed25519 signing** | OpenWatch signs transaction envelopes via `POST /api/transactions/{id}/sign` | **Removed from OpenWatch.** Kensa signs envelopes at capture/execute time. OpenWatch's `SigningService` narrows to aggregate artifacts it originates (audit exports, quarterly posture reports, State-of-Production release). See `specs/services/signing/evidence-signing.spec.yaml` v2.0. 
| - -### What this keeps unchanged - -These remain purely OpenWatch-layer concerns and ship on their original schedule: - -- SSO federation (Phase 3.6) — OIDC + SAML, purely user-auth concern -- Notification dispatch (Phase 3.5) — Slack/email/webhook/Jira fan-out from `AlertService` -- Adaptive scan scheduling — OpenWatch decides *when* to call `Kensa.Scan` -- Multi-approval chains and approval policies (Phase 6.3) -- Fleet grouping + per-group policies (Phase 6.4) -- State-of-Production Rollback report (Phase 6.5) — cross-tenant aggregation -- RBAC, audit logging of OpenWatch user actions, multi-tenant isolation -- Audit-export generation and its Ed25519 signing (OpenWatch-originated artifact) - -### Kensa milestones OpenWatch converges onto - -| Kensa week | OpenWatch action | -|---|---| -| **Week 1** — `api/` surface frozen with stubs | OpenWatch codes against signatures immediately; stubs return `ErrNotYetImplemented` | -| **Week 22** — `LogQuery` real | OpenWatch swaps `/api/transactions/query` from PostgreSQL to `Kensa.TransactionLog()` | -| **Week 24** — `Plan`/`Execute` real | OpenWatch starts §6.2 implementation | -| **Week 25** — `Subscribe` real | OpenWatch cuts Heartbeat (the event stream parts) from polling to subscription | -| **Week 26 (M5)** — all OpenWatch-facing APIs real | Full integration test: Plan → Subscribe → Execute → Query | -| **Week 40 (M7)** — Kensa Go v1.0.0 | OpenWatch is pure consumer; Python Kensa archived | - -### Convergence-annotation convention - -Every OpenWatch spec or interim implementation that delegates to a Kensa `api/` method post-convergence carries a frontmatter block: - -```yaml -interim_implementation: - delegates_to: kensa.api.Kensa.TransactionLog().Query - convergence_week: 22 - kensa_plan_ref: kensa/docs/KENSA_GO_DAY1_PLAN.md §3.5.1 LogQuery - notes: | - ... -``` - -This makes drift visible at review time. The pattern is established in `specs/api/transactions/transaction-query.spec.yaml` v1.1 (PR #399). 
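A review-time check for the annotation convention above could look roughly like the following. This is a minimal sketch, not part of the existing tooling: the function name `check_interim_annotation`, the `AnnotationCheck` result type, and the idea of passing the current Kensa week as an integer are all assumptions made for illustration. It takes frontmatter that has already been parsed into a dict, so it does not depend on any particular YAML loader.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Keys shown in the convention's example block; `notes` is treated as optional here.
REQUIRED_KEYS = ("delegates_to", "convergence_week", "kensa_plan_ref")


@dataclass
class AnnotationCheck:
    """Result of checking one spec's frontmatter (hypothetical helper type)."""
    ok: bool
    problems: list[str] = field(default_factory=list)


def check_interim_annotation(frontmatter: dict, current_kensa_week: int) -> AnnotationCheck:
    """Flag incomplete annotations and ones whose convergence week has arrived."""
    block = frontmatter.get("interim_implementation")
    if block is None:
        # Specs that do not delegate to Kensa carry no block; nothing to check.
        return AnnotationCheck(ok=True)

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in block]

    week = block.get("convergence_week")
    if isinstance(week, int) and current_kensa_week >= week:
        problems.append(
            f"convergence week {week} reached; implementation should now "
            f"delegate to {block.get('delegates_to')}"
        )
    return AnnotationCheck(ok=not problems, problems=problems)
```

Run against every spec during review, a check of this shape would surface an interim implementation that has outlived its convergence week, which is the drift the convention is meant to expose.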
- -### Related documents - -- `docs/KENSA_OPENWATCH_COORDINATION_2026-04-14.md` — the outbound memo from OpenWatch to Kensa -- `/home/rracine/hanalyx/kensa/docs/KENSA_OPENWATCH_RESPONSE_2026-04-14.md` — Kensa team's response with accepted resolutions + interface decisions -- `/home/rracine/hanalyx/kensa/docs/KENSA_GO_DAY1_PLAN.md` — Kensa's Day-1 build plan (updated 2026-04-14 §3.5 with interface refinements from OpenWatch's asks) - ---- - -## Executive Summary - -The assessment confirms the vision doc's diagnosis: **OpenWatch's engine layer is strong, but the product-identity layer — the transaction log, Control Plane integrations, and signed evidence — is absent.** The single highest-leverage change is the Q1 transaction log refactor, because every subsequent milestone (per-host audit timeline, Agent API, historical posture, query API, signed bundles) assumes it exists. - -**Critical finding**: Kensa already captures the `validate` phase of the four-phase model in `scan_findings.evidence` JSONB. **Pre-state and post-state are not systematically captured.** The refactor is primarily (a) schema unification, (b) adding pre/post-state capture to `kensa_scan_tasks.py`, and (c) UI reorganization — NOT greenfield data modeling. The data mostly exists; it's in the wrong shape. - -**Second critical finding**: The compliance scheduler is more mature than the vision status suggested. Adaptive intervals (1h–48h based on compliance state) are shipped, Celery Beat dispatches every 2 minutes, per-host schedules live in `host_compliance_schedule`. The Heartbeat is **~60% there**; the gaps are auto-baseline-on-first-scan, a separate liveness ping (independent of scan cadence), and notification dispatch (Slack/email/webhook — the service layer exists, the channels don't). - -**Third critical finding**: The Control Plane has the biggest absolute gap. **Zero** SAML/OIDC groundwork. **No** multi-approval infrastructure (single-approver exceptions only). 
**No** Slack/Jira integrations (webhooks exist, but generic and outbound-only). Exception workflow has a complete backend API but **no frontend UI**. Scheduled-scan management has the same shape: backend ready, UI missing. - ---- - -## Phasing - -The plan is organized into **6 phases over ~9 months**, mapping to Q1/Q2/Q3 milestones. Each phase is ~4–6 weeks. Phases 1–3 are Q1, phases 4–5 are Q2, phase 6 is Q3. Critical path is Phase 1 (transaction log schema) — everything else compounds on it. - -``` -Phase 1 (wks 1-6) : Transaction log schema + write-path refactor [Q1 — Eye] -Phase 2 (wks 4-8) : Transaction log UI + navigation rename [Q1 — Eye] -Phase 3 (wks 6-12) : Heartbeat completion + Control Plane integrations Tier 1 - (SSO, Slack/email, auto-baseline, liveness ping) [Q1 — Heartbeat + CP] -Phase 4 (wks 12-18) : Evidence envelope four-phase capture + Ed25519 signing - + per-host timeline API + exception UI + scheduler UI [Q2] -Phase 5 (wks 14-20) : Baseline auto-management + alert routing + Jira sync - + retention policies [Q2] -Phase 6 (wks 20-36) : Transaction log query API + proactive remediation workflow - + multi-approval infrastructure + fleet-group policies - + first "State of Production Rollback" report [Q3] -``` - -Phases overlap deliberately: while Phase 1's backend refactor is in flight, Phase 2's frontend work can start against the (versioned) new API; Phase 3 Control Plane work doesn't depend on transactions and starts in parallel. - ---- - -## Phase 1: Transaction Log Schema & Write Path (weeks 1–6) — Q1 - -**Why first:** Every Q2/Q3 deliverable (per-host timeline, query API, signed bundles, proactive remediation, Agent API) reads from the transaction log. Without the unified schema, later work either builds on shifting foundations or duplicates effort. 
- -**Current state (from assessment):** -- 5 separate tables: `scans`, `scan_results`, `scan_findings`, `scan_baselines`, `scan_drift_events` -- `scan_findings.evidence` (JSONB) already captures Kensa's validate-phase evidence (method, command, stdout, stderr, expected, actual, exit_code, timestamp) -- `scan_findings.framework_refs` (JSONB) already stores rule-to-control mappings with GIN indexes -- **Missing**: pre-state, post-state, four-phase-shaped envelope, initiator metadata, approval/rollback linkage -- Write surface: `backend/app/tasks/kensa_scan_tasks.py:312-341` (single INSERT point for findings) - -### 1.1 New `transactions` table (week 1) - -Create a new Alembic migration adding a `transactions` table **alongside** the existing scan tables (do not drop old tables yet). Columns: - -| Column | Type | Notes | -|---|---|---| -| `id` | UUID (PK) | | -| `host_id` | UUID (FK hosts.id) | | -| `rule_id` | VARCHAR(255) | Kensa rule id; NULL for orchestration transactions | -| `scan_id` | UUID (FK scans.id) | Legacy linkage during migration window | -| `phase` | VARCHAR(16) | `capture` / `apply` / `validate` / `commit` / `rollback` | -| `status` | VARCHAR(16) | `pass` / `fail` / `skipped` / `error` / `rolled_back` | -| `severity` | VARCHAR(16) | | -| `initiator_type` | VARCHAR(16) | `user` / `scheduler` / `drift_trigger` / `agent` | -| `initiator_id` | VARCHAR(255) | user_id or service name | -| `pre_state` | JSONB | System state before apply (nullable for read-only checks) | -| `apply_plan` | JSONB | Handler + params that Kensa executed | -| `validate_result` | JSONB | stdout, stderr, exit_code, expected, actual | -| `post_state` | JSONB | System state after commit / restored state after rollback | -| `evidence_envelope` | JSONB | Full structured envelope (see Phase 4) | -| `framework_refs` | JSONB | `{cis-rhel9-v2.0.0: "5.1.12", stig-rhel9-v2r7: "V-257778"}` | -| `baseline_id` | UUID (FK scan_baselines.id) | For drift comparison | -| `remediation_job_id` | UUID 
| Links remediation transactions back to finding transaction | -| `started_at` | TIMESTAMPTZ | | -| `completed_at` | TIMESTAMPTZ | | -| `duration_ms` | INTEGER | | -| `tenant_id` | UUID | Nullable now; foundation for Q6 multi-tenancy | - -**Indexes:** -- `(host_id, started_at DESC)` — primary per-host timeline query -- `(scan_id)` — legacy join during migration -- `(status, started_at)` — "all failures in last N hours" (alerts) -- GIN on `framework_refs` — "all transactions satisfying NIST AC-2" -- GIN on `evidence_envelope` — audit search -- `(remediation_job_id)` — link remediation chains - -**Spec:** Create a new `specs/system/transaction-log.spec.yaml` (Active, owner: backend) as the authoritative contract for the four-phase model. This becomes a hard CI gate via existing `check-spec-coverage.py`. - -### 1.2 Dual-write from `kensa_scan_tasks.py` (week 2) - -Modify `backend/app/tasks/kensa_scan_tasks.py` (lines 250–343, the existing write path) to emit both old-schema rows AND new transaction rows on the same DB transaction. This gives us a reversible migration. - -**Capturing pre/post-state (the real new work):** -- Kensa's current `Evidence.actual` field records post-validation state -- `pre_state` is **not** captured today. Two options: - 1. **Minimal**: for read-only compliance checks (the common case), pre_state == post_state (nothing changed), record once - 2. **Full**: before Kensa applies a check, run a lightweight `capture_state` call via the same SSH session. 
Reuses Kensa's `detect_capabilities` mechanism but narrowed to the rule's target -- **Recommendation**: ship Option 1 for read-only checks in Phase 1; extend to Option 2 for remediation transactions in Phase 4 (where pre/post genuinely differ) - -For **remediation** transactions (which already have richer data per migration `20260224_0100_039_add_remediation_evidence.py`), write a second transaction row with `phase=apply`/`commit`/`rollback` and link it to the original finding transaction via `remediation_job_id`. - -### 1.3 Shim read layer (weeks 2–3) - -Add a `TransactionRepository` in `backend/app/repositories/transaction_repository.py` that services will use going forward. For Phase 1, it reads from the new `transactions` table. Existing services (`DriftDetectionService`, `AlertGeneratorService`, `AuditQueryService`, `TemporalComplianceService`) stay on the old tables until Phase 2 migrates them one at a time. - -**Critical dependency map (from assessment) — these 14+ services read the old tables and must migrate:** - -- `services/compliance/temporal.py` — `get_posture()`, `detect_drift()`, `create_snapshot()` (historical queries) -- `services/compliance/alert_generator.py` — severity threshold reads -- `services/compliance/audit_query.py` — evidence search -- `services/compliance/audit_export.py` — CSV/PDF/JSON exports (**highest risk** — customer-facing contract) -- `services/compliance/exceptions.py` — finding suppression -- `services/compliance/remediation.py` — job creation -- `services/monitoring/drift.py` — drift monitoring -- `services/baseline_service.py` — baseline management -- `routes/scans/reports.py` — report generation -- `routes/scans/kensa.py` — scan execution -- `routes/compliance/drift.py` — drift API -- `routes/compliance/posture.py` — posture API -- `routes/compliance/audit.py` — audit API -- `tasks/backfill_snapshot_rule_states.py` — snapshot backfill - -### 1.4 Backfill task (week 4) - -Celery task `backfill_transactions_from_scans` that 
reads all historical `scan_findings` rows and synthesizes transaction rows. Run in chunks of 10k rows with progress tracking. Transactions generated from historical data have `phase=validate` only (we can't reconstruct pre/post-state for rows that predate the refactor — this is fine; historical rows become immutable validate-only entries). - -### 1.5 Service migration (weeks 4–6) - -Migrate services off old tables to `TransactionRepository` one at a time, in order of risk: - -1. `audit_query.py` (read-only; low risk) -2. `temporal.py` — `get_posture()` and `detect_drift()` — **most important because temporal compliance is a key differentiator**. Ensure `(host_id, started_at)` index query plans are <500ms -3. `alert_generator.py` -4. `audit_export.py` — **high risk**. Keep exports emitting the same CSV/JSON column contract; only the read source changes. Add a regression test that compares old vs new export bytes for a known fixture scan. -5. `drift.py`, `posture.py` route layers -6. `kensa.py` route layer - -**At end of Phase 1:** all services read from `transactions`, old tables still exist as write-through shadow tables (safe rollback), Phase 2 frontend work can begin. - -### 1.6 Risk mitigation - -From the assessment: -- **Foreign key cascades**: `scan_findings.scan_id → scans.id ON DELETE CASCADE` could orphan transactions during the dual-write window. Add an explicit `ON DELETE` policy on `transactions.scan_id` (SET NULL, not CASCADE — we want transactions to survive scan deletion; they're the audit trail) -- **Framework mapping consistency**: `RuleReferenceService` already syncs inline `references:` + mapping files into `framework_mappings`. 
Extend it to also sync into `transactions.framework_refs` on write (not retroactively — only on new transactions) -- **Export schema stability**: regression test on fixture scan, as above - -### 1.7 Exit criteria - -- [ ] `transactions` table in production, dual-writing -- [ ] All services migrated to `TransactionRepository` -- [ ] `audit_export` regression test passes (byte-identical fixture export) -- [ ] Temporal query benchmark: `<500ms` for "posture at date X for host Y" -- [ ] `transaction-log.spec.yaml` Active with 100% AC coverage -- [ ] Old tables still written to (rollback possible) -- [ ] No performance regression on scan execution (`kensa_scan_tasks` duration within +10%) - ---- - -## Phase 2: Transaction Log UI & Navigation (weeks 4–8) — Q1 - -**Current state (from assessment):** -- Frontend nav: Dashboard → Scans → Compliance (Drift, Exceptions, Alerts, Audit) → Reports -- `frontend/src/pages/scans/Scans.tsx` + `ScanDetail.tsx` are the primary scan entry points -- `frontend/src/services/adapters/scanAdapter.ts` is the API client -- Role-based dashboards (PR #349) shipped; widgets are swappable per role - -### 2.1 New API surface (weeks 4–5) - -Create `/api/transactions/*` endpoints in `backend/app/routes/transactions/`: - -- `GET /api/transactions` — paginated list, filter by `host_id`, `status`, `framework`, `phase`, `initiator_type`, `started_at` range -- `GET /api/transactions/{id}` — single transaction with full four-phase breakdown -- `GET /api/transactions/{id}/evidence` — evidence envelope (prep for Phase 4 signing) -- `GET /api/hosts/{host_id}/transactions` — per-host timeline (Q2 deliverable, stubbed in Phase 2, fully implemented in Phase 4) - -Old `/api/scans/*` endpoints stay live as shims (proxy to transactions repository) with `Deprecation` headers. Remove no earlier than Phase 6. 
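The list endpoint's filters can be translated into a parameterized query along these lines. This is a sketch under stated assumptions: `build_transaction_query` is a hypothetical helper, the `:name` placeholders assume a driver or toolkit that supports named binds, and matching the `framework` filter with the PostgreSQL JSONB key-exists operator (`?`) against `framework_refs` is one possible reading of that filter, not something the plan specifies.

```python
from __future__ import annotations

from datetime import datetime
from typing import Any, Optional

# Filters from the endpoint description that map directly onto columns.
EQUALITY_FILTERS = ("host_id", "status", "phase", "initiator_type")
ALLOWED_FILTERS = EQUALITY_FILTERS + ("framework",)


def build_transaction_query(
    filters: dict[str, Any],
    started_after: Optional[datetime] = None,
    started_before: Optional[datetime] = None,
    page: int = 1,
    per_page: int = 50,
) -> tuple[str, dict[str, Any]]:
    """Return (sql, params) for a paginated transactions listing."""
    unknown = set(filters) - set(ALLOWED_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")

    clauses: list[str] = []
    params: dict[str, Any] = {}

    # Column names come from the fixed tuple above, never from user input.
    for name in EQUALITY_FILTERS:
        if filters.get(name) is not None:
            clauses.append(f"{name} = :{name}")
            params[name] = filters[name]

    if filters.get("framework") is not None:
        # Assumption: a transaction matches when framework_refs has this key.
        clauses.append("framework_refs ? :framework")
        params["framework"] = filters["framework"]

    if started_after is not None:
        clauses.append("started_at >= :started_after")
        params["started_after"] = started_after
    if started_before is not None:
        clauses.append("started_at < :started_before")
        params["started_before"] = started_before

    where = " AND ".join(clauses) if clauses else "TRUE"
    params["limit"] = per_page
    params["offset"] = (page - 1) * per_page

    sql = (
        "SELECT id, host_id, rule_id, phase, status, started_at "
        f"FROM transactions WHERE {where} "
        "ORDER BY started_at DESC LIMIT :limit OFFSET :offset"
    )
    return sql, params
```

Ordering by `started_at DESC` is meant to line up with the `(host_id, started_at DESC)` index from Phase 1 when `host_id` is supplied; whether the planner actually uses it would need to be confirmed with `EXPLAIN` on real data.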
- -### 2.2 Frontend refactor (weeks 5–8) - -- Rename top-nav **Scans** → **Transactions** -- Create `frontend/src/pages/transactions/Transactions.tsx` (list) and `TransactionDetail.tsx` (detail) -- Four tabs on TransactionDetail: **Execution** (four-phase timeline), **Evidence** (raw envelope), **Controls** (framework mappings), **Related** (other transactions for same host/rule) -- Create `frontend/src/services/adapters/transactionAdapter.ts`; leave `scanAdapter.ts` as a thin re-export during deprecation -- **Findings** becomes a filtered view: `Transactions` with `status=fail`; build `Findings.tsx` as a preset filter on the list page -- **Reports** navigation unchanged; reports re-sourced from `TransactionRepository` (handled in Phase 1 service migration) - -### 2.3 Spec updates - -Update these specs (from the assessment) to reference the new `transactions` table and four-phase model: - -- `pipelines/scan-execution.spec.yaml` -- `pipelines/drift-detection.spec.yaml` -- `services/compliance/temporal-compliance.spec.yaml` -- `services/compliance/audit-query.spec.yaml` -- `services/compliance/compliance-scheduler.spec.yaml` -- `api/scans/scan-results.spec.yaml` -- `api/scans/scan-crud.spec.yaml` -- `api/scans/scan-reports.spec.yaml` -- `frontend/scan-workflow.spec.yaml` -- `frontend/scans-list.spec.yaml` - -### 2.4 Exit criteria - -- [ ] `/api/transactions/*` live and documented in Swagger -- [ ] Transactions list + detail pages shipped -- [ ] Findings as filtered transaction view -- [ ] Old `/api/scans/*` deprecation headers -- [ ] 10 specs updated, CI coverage enforced -- [ ] Manual QA: end-to-end flow (Kensa scan → transaction row → UI renders four phases) - ---- - -## Phase 3: Heartbeat Completion + Control Plane Tier 1 (weeks 6–12) — Q1 - -This phase runs in parallel with Phase 2 because it doesn't depend on the transaction log refactor. 
- -### 3.1 Heartbeat: auto-baseline on first scan (week 6) - -**Current state:** `PostureSnapshot` model exists; daily snapshots via `create_daily_posture_snapshots`. Manual snapshot creation via `TemporalComplianceService.create_snapshot()`. **No trigger on first scan.** - -Wire into `kensa_scan_tasks.py` (the same write path we're refactoring in Phase 1): after a successful scan, if `scan_baselines` has no `is_active=true` row for this host, create one via `BaselineService.establish_baseline(host_id, source_scan_id)`. Idempotent; safe to call on every scan. - -### 3.2 Heartbeat: liveness ping separate from scan cadence (weeks 6–7) - -**Current state:** "Liveness" = `last_scan_completed` timestamp. At the default 6h–24h scan cadence, liveness signal is too slow for the vision's "15-min detection" target. - -Add `host_liveness` table: `host_id, last_ping_at, last_response_ms, reachability_status (reachable/unreachable/unknown)`. New Celery Beat task `ping_managed_hosts` every 5 minutes — for each host, open a TCP connection to the SSH port and record response time. No auth, no command execution; it's a reachability check. - -Update `FleetHealthWidget.tsx` (already exists, 336 LOC) to show liveness distinct from scan recency. - -### 3.3 Heartbeat: maintenance mode UI (week 7) - -Backend exists (`compliance_scheduler.py:508-549`). Frontend needs a toggle in Host Detail + Host List pages. Small change, high user visibility. - -### 3.4 Heartbeat: fleet health "at a glance" (week 8) - -Extend Dashboard's existing fleet health section with: -- "X hosts up / Y total" -- "Z hosts with drift in last 24h" -- "N failed scans in last 24h" - -Queries go against `transactions` table (Phase 1 exit criteria) + `host_liveness`. - -> **Revised 2026-04-14** per Kensa Convergence Addendum: the `transactions` PostgreSQL table is now framed as a **derived multi-host aggregation cache over Kensa's SQLite store** (Kensa Day-1 plan §13A). 
At Kensa Week 25, the event feed for drift counts switches from polling the PostgreSQL table to consuming `Kensa.Subscribe` with `EventKind=DriftDetected`. No API surface change for the frontend; just a backend implementation swap. The PostgreSQL cache survives through Kensa v1.0.0 because multi-fleet aggregation across N Kensa SQLite stores would be too slow to do at query time without a cache. - -### 3.5 Control Plane: notification dispatch (weeks 7–9) - -**Current state:** `AlertService` + alert thresholds shipped (PR #281). `alert_generator.py` creates alert rows in DB. **No outbound dispatch.** Generic webhook surface exists (`routes/integrations/webhooks.py`) but not wired to alerts. - -Create `backend/app/services/notifications/` package with: -- `base.py` — abstract `NotificationChannel` interface -- `slack.py` — uses `slack-sdk`, POST to incoming webhook URL with Block Kit formatting -- `email.py` — SMTP via `aiosmtplib`, templated HTML -- `webhook.py` — thin wrapper over existing webhook service for alert-specific events - -Wire `AlertService.create_alert()` to enqueue a notification task per configured channel. Dedupe via the existing 60-min window logic in `alerts.py:137`. - -**Jira is deferred to Phase 5.** Jira's bidirectional sync is a larger lift than Slack/email and isn't a Q1 blocker. - -### 3.6 Control Plane: SAML/OIDC SSO (weeks 8–12) - -**Current state (from assessment):** Zero groundwork. Local users + JWT only. FIPS-compliant Argon2id and RS256 JWT (good), but no federation. - -Add `authlib` to `requirements.txt` (authlib handles OAuth2/OIDC and is actively maintained and FIPS-compatible; it does not implement SAML2, so SAML needs a separate library).
- -Create `backend/app/services/auth/sso/`: -- `provider.py` — abstract `SSOProvider` with `get_login_url`, `handle_callback`, `map_claims_to_user` -- `oidc.py` — `OIDCProvider` using authlib's OAuth2 client -- `saml.py` — `SAMLProvider` using python3-saml (FedRAMP-approved library) - -Database: -- `sso_providers` table: `id, tenant_id (nullable), provider_type, config (JSONB, encrypted), enabled` -- Extend `users` table: `sso_provider_id`, `external_id`, `last_sso_login_at` - -Routes: -- `GET /api/auth/sso/login?provider={id}` — redirect to IdP -- `GET /api/auth/sso/callback/{provider_type}` — handle ACS (SAML) or token exchange (OIDC) -- `GET /api/auth/sso/providers` — list configured providers for login screen -- `POST /api/admin/sso/providers` — admin configures new IdP - -Claim mapping: `email → users.email`, `groups → users.role` (configurable mapping in provider config). First-login creates a local user record linked to `external_id`. Subsequent logins update claims. - -Frontend: extend login page to show "Login with SSO" buttons for configured providers. Small change; authlib does the heavy lifting. - -**Load-bearing because:** federal customers cannot buy OpenWatch without SSO. This is the most commercially-urgent item in Q1 after the transaction log. 
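The claim mapping in 3.6 (`email → users.email`, `groups → users.role`) can be sketched as a pure function. This is an illustrative sketch under stated assumptions: the `MappedUser` type, the use of the `sub` claim as `external_id`, the `viewer` default role, and the "first matching group wins" rule are hypothetical and not taken from the plan.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class MappedUser:
    external_id: str
    email: str
    role: str


def map_claims_to_user(
    claims: Mapping[str, Any],
    group_role_map: Mapping[str, str],
    default_role: str = "viewer",  # hypothetical fallback role
) -> MappedUser:
    """Translate IdP claims into the fields stored on a local user row."""
    email = claims.get("email")
    external_id = claims.get("sub")  # assumption: subject claim is the stable ID
    if not email or not external_id:
        raise ValueError("IdP response is missing 'email' or 'sub'")

    groups = claims.get("groups") or []
    if isinstance(groups, str):
        # Some IdPs send a single group as a bare string.
        groups = [groups]

    # The mapping's insertion order doubles as priority: first match wins.
    role = default_role
    for group, mapped_role in group_role_map.items():
        if group in groups:
            role = mapped_role
            break

    return MappedUser(external_id=str(external_id), email=str(email), role=role)
```

Keeping the mapping free of database access means the same function can serve first login (create the user) and later logins (update claims), with the provider's `config` JSONB supplying `group_role_map`.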
- -### 3.7 Exit criteria - -- [ ] First-scan baseline auto-established -- [ ] `host_liveness` table + 5-minute ping task running -- [ ] Maintenance mode toggle in Host Detail UI -- [ ] Slack + email notifications firing on alerts -- [ ] At least one OIDC provider (e.g., Okta dev tenant) and one SAML provider (e.g., AD FS test instance) successfully authenticating users -- [ ] Deprecation headers on `/api/auth/login` for customers who need to migrate - ---- - -## Phase 4: Evidence Envelope + Signing + Per-Host Timeline + UIs (weeks 12–18) — Q2 - -### 4.1 Four-phase evidence capture (weeks 12–14) - -**Current state:** Kensa's `Evidence` dataclass captures the validate phase only (method, command, stdout, stderr, exit_code, expected, actual, timestamp). Pre/post-state missing. - -For compliance scans (read-only checks), Phase 1 established that `pre_state == post_state` is the common case. For Phase 4, add **explicit structured capture** even for read-only checks: - -```python -evidence_envelope = { - "schema_version": "1.0", - "kensa_version": "1.2.5", - "phases": { - "capture": {"state": {...}, "at": "..."}, - "apply": {"plan": {...}, "executed": False, "at": None}, # read-only - "validate": {"method": ..., "command": ..., "stdout": ..., "exit_code": ...}, - "commit": {"status": "pass", "post_state": {...}, "at": "..."}, - "rollback": None, - }, - "framework_refs": {...}, - "rule_metadata": {"id": ..., "title": ..., "severity": ...}, - "host_context": {"host_id": ..., "os": ..., "arch": ...}, -} -``` - -For **remediation** transactions, all four phases populate. Extend `backend/app/plugins/kensa/evidence.py:19-45` (current `_evidence_to_dict`) to return this envelope shape. Coordinate with the Kensa team if upstream changes are needed — per the vision, OpenWatch is the fleet runtime, but if Kensa needs to emit pre-state, that's a Kensa PR. - -**Spec:** add AC for envelope schema to `specs/system/transaction-log.spec.yaml` (created in Phase 1).
- -### 4.2 Ed25519 signing (weeks 13–15) - -**Current state (from assessment):** Greenfield. No Ed25519 code. `encryption/service.py` has AES-256-GCM; `auth.py` has RS256 JWT; no signing abstraction. - -Create `backend/app/services/signing/`: -- `service.py` — `SigningService` with `sign_envelope(envelope: dict) -> SignedBundle` and `verify(bundle: SignedBundle) -> bool` -- Uses `cryptography.hazmat.primitives.asymmetric.ed25519` (FIPS-compatible, already in deps) -- Signing key stored per-deployment in `deployment_signing_keys` table (encrypted via existing `EncryptionService`) -- Key rotation: new key becomes active, old keys remain verifiable; bundles record `key_id` - -Signed bundle format: -```json -{ - "envelope": { ... }, - "signature": "base64(ed25519-sig)", - "key_id": "uuid", - "signed_at": "ISO8601", - "signer": "openwatch@deployment-name" -} -``` - -Public verification endpoint: `GET /api/signing/public-keys` returns all active + retired public keys so auditors can verify bundles offline. - -Documentation: publish `docs/EVIDENCE_VERIFICATION.md` with a standalone Python verification script (20 lines) that auditors can use without an OpenWatch install. - -### 4.3 Per-host transaction timeline (weeks 14–16) - -**Current state (from assessment):** `TemporalComplianceService.get_posture(host_id, as_of)` exists for point-in-time queries. No "all transactions for host X" timeline. - -API: `GET /api/hosts/{host_id}/transactions` with filters `phase`, `status`, `framework`, `rule_id`, date range, full-text search on evidence (using the GIN index on `transactions.evidence_envelope`). Paginated, cursor-based. - -Frontend: new tab on `HostDetail.tsx` — **Audit Timeline**. Reverse-chronological list of transactions, click-through to `TransactionDetail`. Export button → queues an audit export job for that host + date range. - -### 4.4 Exception workflow UI (weeks 15–17) - -**Current state (from assessment):** Backend complete at `routes/compliance/exceptions.py`. 
Zero frontend. - -Create `frontend/src/pages/compliance/Exceptions.tsx`: -- List view (paginated, filter by status/rule/host) -- Request form (justification, risk assessment, expiration date) -- Approval workflow display (approver name, approved_at, justification) -- "Escalate" button — re-routes to higher-role approver (requires Phase 6 multi-approval infra, so in Phase 4 it's a single-level escalation: analyst → officer/admin) -- Button to kick off remediation from an excepted rule - -Backend change: add `approval_chain JSONB` to `ComplianceException` table for multi-approval groundwork (populated with single approver for now; Phase 6 extends to N approvers). - -### 4.5 Scheduled scan management UI (weeks 16–18) - -**Current state:** Backend complete at `routes/compliance/scheduler.py`. No frontend. - -Create `frontend/src/pages/scans/ScheduledScans.tsx`: -- Current adaptive-interval config (the 1h/6h/12h/24h/48h tiers) with sliders -- Per-host schedule table: `next_scheduled_scan`, `current_interval_minutes`, `maintenance_mode` -- Preview: histogram of next 48h scans across the fleet -- New backend endpoint `POST /api/compliance/scheduler/preview` that returns "given this config, here are the next 50 scheduled scans" - -### 4.6 Exit criteria - -- [ ] Evidence envelope schema v1.0 frozen and specced -- [ ] Ed25519 signing service with key rotation -- [ ] Per-host timeline API + Host Detail tab -- [ ] Exception workflow UI shipped -- [ ] Scheduled scan management UI shipped -- [ ] `docs/EVIDENCE_VERIFICATION.md` + standalone verification script - ---- - -## Phase 5: Baseline Auto-Mgmt, Alert Routing, Jira, Retention (weeks 14–20) — Q2 - -Parallel to Phase 4. - -### 5.1 Baseline auto-management (weeks 14–15) - -**Current state (from assessment):** Baselines exist as `scan_baselines` rows; daily snapshots via `create_daily_posture_snapshots`. Auto-create on first scan lands in Phase 3. 
**Missing:** explicit "update baseline" API + rolling baselines for moving targets. - -- `POST /api/hosts/{host_id}/baseline/reset` — establish new baseline from most recent scan -- `POST /api/hosts/{host_id}/baseline/promote` — promote current posture to baseline (after legitimate config change) -- Rolling baseline: 7-day moving average for hosts marked `baseline_type=rolling_avg` -- Frontend: button on HostDetail.tsx - -### 5.2 Alert routing rules (weeks 15–17) - -**Current state:** Alerts fire to a single default channel set. No per-severity routing. - -Add `alert_routing_rules` table: `id, severity, alert_type, channel_type, channel_config (JSONB), tenant_id`. Example rule: `CRITICAL + HOST_UNREACHABLE → pagerduty:oncall`. - -Extend `AlertService.create_alert()` dispatch loop to query routing rules and fan out to multiple channels. Add PagerDuty channel to `notifications/` package (alongside Slack/email from Phase 3). - -Frontend: `frontend/src/pages/compliance/AlertRoutingRules.tsx` — rule table, create/edit form. - -### 5.3 Jira bidirectional sync (weeks 16–19) - -Deferred from Phase 3 because bidirectional is nontrivial. - -- `backend/app/services/notifications/jira.py` — uses `jira` Python SDK -- Outbound: drift events + failed transactions create Jira issues with evidence envelope attached -- Inbound: Jira webhook → `POST /api/integrations/jira/webhook` → update OpenWatch exception or transaction state based on issue state transitions -- Field mapping configurable per Jira project; first customer gets hardcoded mapping - -### 5.4 Retention policies (weeks 18–20) - -**Current state:** `audit_export.cleanup_expired_exports()` has 7-day retention. `scan_findings` has no TTL. - -Add `retention_policies` table: `tenant_id, resource_type, retention_days`. Enforce via `cleanup_old_transactions` Celery task that deletes `transactions` older than policy (default 365 days, configurable per fleet/tenant). 
- -**Critical:** before deletion, emit an "archive" signed bundle to configurable storage (S3 or filesystem). Retention deletion should NEVER be destructive of the audit trail — it moves transactions from hot storage to cold signed archives. - -### 5.5 Exit criteria - -- [ ] Baseline reset/promote APIs + UI -- [ ] Alert routing rules + PagerDuty channel -- [ ] Jira outbound + inbound (first customer mapping) -- [ ] Retention policy CRUD + enforcement with signed archive emission - ---- - -## Phase 6: Query API, Proactive Remediation, Multi-Approval, Groups, Report (weeks 20–36) — Q3 - -### 6.1 Transaction log query API (weeks 20–23) - -> **REVISED 2026-04-14** per Kensa Convergence Addendum. The endpoint URL + schema + DSL shape are owned by OpenWatch (stable HTTP contract). The **implementation** converges onto Kensa's `api.Kensa.TransactionLog()` at Kensa Week 22. **First slice shipped 2026-04-14 as PR #398**; interim annotation added in PR #399. - -The read side we stubbed in Phase 2 becomes a first-class, documented, paginated, filterable HTTP API — which in turn is a thin wrapper over Kensa's `LogQuery` interface (Kensa Day-1 plan §3.5.1). 
- -- `POST /api/transactions/query` accepts a query DSL: filters (`host_id`, `fleet_id`, `date_range`, `status`, `phase`, `framework`, `rule_id`, `initiator_type`), sort, pagination cursor, projection (which fields to return) — **shipped in PR #398** -- Response includes `total_count`, `next_cursor`, paginated results -- Rate limits per API key — **deferred to follow-up PR** (listed in spec's `out_of_scope`) -- OpenAPI spec published, versioned `v1` -- **Target**: historical posture query (`"fleet X compliance state on 2026-03-15"`) in `<500ms` p95 (the vision's KPI) — **deferred to follow-up PR** - -At Kensa Week 22, the endpoint's implementation swaps: -- Current: reads PostgreSQL `transactions` table (fed by Python Kensa) -- Post-Week-22: delegates to `kensa.api.Kensa.TransactionLog().Query()` for single-deployment queries; PostgreSQL cache serves multi-fleet aggregate queries that span N Kensa deployments (per Kensa §13A federated-v1.0 / push-v1.1 sequencing). - -Spec: `specs/api/transactions/transaction-query.spec.yaml` v1.1 — carries the `interim_implementation:` frontmatter establishing the convergence pattern. - -### 6.2 Proactive remediation workflow (weeks 22–26) - -> **REVISED 2026-04-14** per Kensa Convergence Addendum. Original draft had OpenWatch generating the plan. Revised architecture: **OpenWatch wraps Kensa.Plan / Kensa.Execute with an approval-workflow UI.** Do not start implementation until Kensa Week 24 (when `Plan` / `Execute` land real). - -**Current state (from assessment):** `RemediationService.create_job()` exists with dry-run flag + license enforcement. **Missing:** auto-draft on drift, approval queue UI, integration with Kensa's Plan/Execute API. 
- -**Architecture (revised):** - -``` -Drift event detected (from Kensa.Subscribe event stream, Week 25) - ↓ -OpenWatch calls kensa.api.Kensa.Plan(host, rule) - ↓ -Returns an opaque Plan blob - ↓ -OpenWatch stores the blob in remediation_jobs.kensa_plan (JSONB) without interpreting it -Row starts at status=draft with approval_chain metadata - ↓ -ApprovalQueue UI renders the plan via Kensa's plan.Preview(PreviewMarkdown) -(canonical preview owned by Kensa — no OpenWatch-side plan rendering) - ↓ -Multi-approval chain (Phase 6.3) progresses draft → approved - ↓ -On full approval, OpenWatch calls kensa.api.Kensa.Execute(host, plan) - ↓ -If PlanStaleError returned: mark remediation_jobs.status=stale, - prompt for re-plan. The `StaleStepIndex` + `Field` + `Expected`/`Actual` - fields from Kensa drive the UX ("re-plan because step 2's config_set - of PermitRootLogin found value 'prohibit-password' but the plan - captured 'yes'") - ↓ -On success: update remediation_jobs.status=completed, - store Kensa's returned TransactionResult.TxnID - ↓ -Each state transition writes a transaction row to Kensa's log via Kensa's -engine (not a separate OpenWatch-generated row) -``` - -**What OpenWatch owns:** -- `remediation_jobs` table schema: `id, host_id, rule_id, kensa_plan (JSONB, opaque), approval_chain_id, status, created_at, approved_at, executed_at, kensa_txn_id (nullable, filled on success)` -- Auto-draft triggering — when `DriftDetected` event arrives from `Kensa.Subscribe` with `drift_type=major`, call `Kensa.Plan` and persist the draft -- ApprovalQueue UI (`frontend/src/pages/remediation/ApprovalQueue.tsx`) listing drafts, routing to detail view -- Approval state machine (`draft → approved → executing → completed | failed | stale`) -- Integration with the Phase 6.3 multi-approval chain -- Re-plan UX when `PlanStaleError` surfaces - -**What OpenWatch does NOT own:** -- The `Plan` struct internals — OpenWatch never looks inside the JSONB blob -- The preview rendering — calls 
`plan.Preview(PreviewMarkdown)` which Kensa owns -- The rollback plan derivation — part of Kensa's Plan -- Staleness detection — Kensa's `PlanStaleError` is the authoritative signal -- The actual execution semantics, capture logic, validation — all Kensa - -**Interim-implementation annotation** (to go on `specs/api/compliance/proactive-remediation.spec.yaml` when the spec is written): - -```yaml -interim_implementation: - delegates_to: - - kensa.api.Kensa.Plan - - kensa.api.Kensa.Execute - - kensa.api.Kensa.Subscribe (for DriftDetected event) - convergence_week: 24 - kensa_plan_ref: kensa/docs/KENSA_GO_DAY1_PLAN.md §3.5.3 Planner/Executor - notes: | - Do not implement until Kensa Week 24. Before that, OpenWatch codes - against api/ signatures returning ErrNotYetImplemented to validate - the integration shape. -``` - -**Blocking dependency:** Do not start until Kensa Week 24. - -### 6.3 Multi-approval infrastructure (weeks 24–28) - -**Current state (from assessment):** Single-approver only. No approval chains. - -- New `approval_policies` table: `resource_type, action, required_approvals, approver_roles, conditions (JSONB)` -- Example policy: `transaction, execute, 2, [SECURITY_ADMIN], {"change_type": "grub_param"}` -- `ApprovalService` evaluates policies on every state transition -- Extend `ComplianceException.approval_chain` (introduced in Phase 4) to track N approvals -- Audit each approval as a transaction row in the log (control-plane actions are themselves transactions — this is the vision's "audit log IS the transaction log" principle) - -### 6.4 Fleet grouping + per-group policies (weeks 26–30) - -**Current state:** Host groups exist as entities (`routes/host_groups/crud.py`). 
**No policies attached.** - -- New `group_compliance_policies` table: `group_id, scan_interval_override, approval_policy_id, drift_threshold_percent, auto_remediate_severities` -- Extend `compliance_scheduler.py` to prefer group policy over default intervals -- Extend `ApprovalService` to apply group-specific policies to hosts in that group -- Frontend: Group Detail page gains a Policies tab - -### 6.5 First "State of Production Rollback" report (weeks 30–34) - -**Current state:** Zero. The report is the output, not the infrastructure. - -- `generate_production_rollback_report` task aggregates anonymized transaction log statistics across lighthouse customers (opt-in telemetry) -- Metrics: rollback frequency by OS/framework, mean-time-to-remediate, drift types most commonly detected, most-failed rules -- Output: public PDF + JSON datasets -- Marketing deliverable, not a product feature — but the infrastructure (query API from 6.1, anonymized telemetry) feeds the Q5 "Agent API + aggregate dataset" milestone - -### 6.6 Exit criteria - -- [ ] `/api/transactions/query` with published OpenAPI spec -- [ ] Proactive remediation draft → approval → execute flow -- [ ] Multi-approval infrastructure with at least one 2-approval policy in production -- [ ] Group compliance policies enforced by scheduler + approval service -- [ ] First public "State of Production Rollback" report published - ---- - -## Cross-Cutting Concerns - -### Testing strategy - -Every phase adds regression tests against the existing CI gate (42% coverage floor, 100% AC coverage for Active specs). 
Specific additions: - -- Phase 1: `test_transaction_backfill.py`, `test_audit_export_parity.py` (byte-identical export across schema change), `test_temporal_query_perf.py` (p95 < 500ms) -- Phase 3: `test_sso_oidc_flow.py`, `test_sso_saml_flow.py` with mock IdP -- Phase 4: `test_ed25519_signing.py`, `test_envelope_schema_v1.py`, `test_verification_script.py` -- Phase 6: `test_transaction_query_dsl.py`, `test_approval_policy_evaluation.py` - -### Spec governance - -Every phase must land its spec updates in the same PR as the code change (existing `check-spec-changes.py` advisory becomes a hard block for new work). Phase 1 creates `specs/system/transaction-log.spec.yaml` which becomes the load-bearing contract for everything else. - -### Security review gates - -Three mandatory security reviews: -- **End of Phase 1**: schema + write path (before transactions become canonical) -- **End of Phase 3**: SSO (federation is a high-value attack surface) -- **End of Phase 4**: signing (key management + verification) - -### Commercial gates - -- **End of Phase 3**: first customer can sign up with SSO — unblocks federal sales -- **End of Phase 4**: first auditor can verify a signed bundle offline — unblocks the "signed evidence" trust moat -- **End of Phase 6**: first "State of Production Rollback" report — unblocks the "canonical upstream" trust moat - -### Team shape - -Plan assumes ~2 backend engineers + 1 frontend engineer + founding engineer oversight. If headcount is smaller, Phase 3's SSO work and Phase 5's Jira sync are the first candidates to slip, in that order. **Do not slip Phase 1** — it blocks everything. 
- ---- - -## What This Plan Does NOT Do - -Per vision doc "What OpenWatch Must Never Become" and the OSCAL deferral: - -- **No OSCAL export** — lands in Kensa first, OpenWatch calls into it later -- **No third-party scanner ingestion** — we do not ingest Tenable/Qualys/Rapid7 findings -- **No generic observability dashboards** — Heartbeat is about state, not metrics -- **No cloud-posture features** — we manage Linux hosts, not AWS/Azure/GCP configurations -- **No multi-tenancy exposure** — `tenant_id` columns land in Phase 1 but stay NULL / single-tenant until Q6 -- **No Agent API (write)** — Q5/Q6 work; Phase 6's query API is the read-only foundation - ---- - -## Risks & Open Questions - -1. **Kensa team coordination on pre-state capture**: if capturing pre-state requires Kensa changes, the Phase 4 envelope work depends on a Kensa PR. **Mitigation**: start the conversation in week 1 of Phase 1; Phase 4 doesn't start until week 12, giving 11 weeks of lead time. - -2. **Audit export customer contract**: the Phase 1 regression test locks the CSV/JSON column contract, but customers may depend on undocumented column ordering. **Mitigation**: survey existing customers on audit export usage during Phase 1 week 1. - -3. **SAML library choice**: `python3-saml` has C dependencies that complicate RPM/DEB packaging. **Mitigation**: evaluate `pysaml2` as a pure-Python alternative in Phase 3 week 1. - -4. **Retention archive storage**: Phase 5.4 requires customer-side cold storage (S3 or filesystem). **Open question**: do we ship a default filesystem archive path, or require configuration? - -5. **Proactive remediation trust**: Phase 6.2's "auto-draft → human approve → execute" depends on the remediation job's dry-run accuracy. If drafts are consistently wrong, users disable the feature. **Mitigation**: ship dry-run preview UI before auto-draft; require users to opt into auto-draft per host. - -6. 
**Phase 1 backfill on large deployments**: customers with millions of `scan_findings` rows may have multi-hour backfills. **Mitigation**: chunked task with resumability, progress UI, and the ability to run Phase 2 UI work on dual-written (forward-only) data without full backfill. - ---- - -## Next Steps - -1. **Walk this plan with founding team** — confirm phase ordering and Phase 3 parallelism assumption -2. **Create spec `specs/system/transaction-log.spec.yaml`** as the first concrete Phase 1 deliverable -3. **Open tracking epics** in PRD for each phase (E7–E12, following existing E0–E6 convention) -4. **Schedule Kensa team sync** on pre-state capture requirements for Phase 4 -5. **Survey audit export customers** to lock the Phase 1 contract diff --git a/docs/OPENWATCH_Q2_PLAN.md b/docs/OPENWATCH_Q2_PLAN.md deleted file mode 100644 index e461034e..00000000 --- a/docs/OPENWATCH_Q2_PLAN.md +++ /dev/null @@ -1,319 +0,0 @@ -# OpenWatch Q2 Implementation Plan - -**Date:** 2026-04-13 -**Window:** Months 4-6 (~12 weeks) -**Parent:** [OPENWATCH_Q1_Q3_PLAN.md](OPENWATCH_Q1_Q3_PLAN.md) -**Vision:** [OPENWATCH_VISION.md](OPENWATCH_VISION.md) Quarters 2-3 -**Predecessor:** OPENWATCH_Q1_PLAN.md (completed 2026-04-13; archived to OWAR docs-archive) - ---- - -## Q1 Completed (foundation for Q2) - -Everything Q2 builds on was shipped in Q1: -- Transaction log with write-on-change model (host_rule_state + transactions) -- PostgreSQL job queue (Celery + Redis removed, 4 containers) -- Notification channels (Slack, email, webhook) -- SSO federation (OIDC + SAML) -- Host liveness monitoring (5-min TCP ping) -- FreeBSD 15.0 Dockerfiles + packaging skeleton -- 86 specs, 762 ACs, 100% coverage - ---- - -## Q2 Goals (from vision) - -| Identity | Milestone | -|---|---| -| **Eye** | Ed25519 signed evidence bundles. Per-host audit timeline. Transaction log retention policy. | -| **Heartbeat** | Drift alerts via Slack/email/webhook. Baseline auto-management (reset/promote). 
| -| **Control Plane** | Jira bidirectional sync. Scheduled scan management UI. Exception workflow UI. | -| **Platform** | FreeBSD 15.0 container migration (test + validate). XCCDF/lxml removal. | - -**Scope note**: OSCAL export remains deferred to Kensa. Evidence signing is OpenWatch-side. - ---- - -## Workstreams - -``` -Workstream F: Evidence Signing + Audit Timeline [weeks 1-6] -Workstream G: Control Plane UIs + Jira [weeks 3-9] -Workstream H: FreeBSD Validation + XCCDF Cleanup [weeks 1-4] -Workstream I: Baseline Mgmt + Retention Policies [weeks 4-8] -``` - ---- - -## Workstream F — Evidence Signing + Per-Host Audit Timeline (weeks 1-6) - -### F1: Ed25519 signing service (weeks 1-3) - -**Deliverables:** -- [ ] `backend/app/services/signing/__init__.py` -- [ ] `backend/app/services/signing/service.py` — `SigningService`: - - `sign_envelope(envelope: dict) -> SignedBundle` - - `verify(bundle: SignedBundle) -> bool` - - Uses `cryptography.hazmat.primitives.asymmetric.ed25519` -- [ ] Alembic migration: `deployment_signing_keys` table (key_id, public_key, private_key_encrypted, active, created_at, rotated_at) -- [ ] Key rotation: new key becomes active, old keys remain for verification -- [ ] API: `GET /api/signing/public-keys` — returns all active + retired public keys -- [ ] API: `POST /api/transactions/{id}/sign` — sign a transaction's evidence envelope -- [ ] `docs/EVIDENCE_VERIFICATION.md` — standalone Python verification script (~20 lines) -- [ ] Spec: `specs/services/signing/evidence-signing.spec.yaml` - -### F2: Per-host audit timeline (weeks 3-5) - -**Deliverables:** -- [ ] `GET /api/hosts/{host_id}/transactions` — full filter surface (phase, status, framework, rule_id, date range) -- [ ] Cursor-based pagination for large timelines -- [ ] Full-text search on `evidence_envelope` via GIN index (already exists) -- [ ] Frontend: new tab on HostDetail — **Audit Timeline** - - Reverse-chronological list of transactions - - Click-through to TransactionDetail - - 
Export button → queues audit export for that host + date range -- [ ] Spec: update `api/hosts/host-crud.spec.yaml` with timeline AC - -### F3: Signed evidence export (weeks 5-6) - -**Deliverables:** -- [ ] Extend audit export (CSV/JSON/PDF) to include Ed25519 signature -- [ ] Export includes `signed_bundle` with envelope + signature + key_id -- [ ] Verification endpoint: `POST /api/signing/verify` accepts a bundle, returns valid/invalid -- [ ] Frontend: "Download Signed Evidence" button on TransactionDetail page - ---- - -## Workstream G — Control Plane UIs + Jira (weeks 3-9) - -### G1: Exception workflow UI (weeks 3-5) - -**Current state:** Backend complete at `routes/compliance/exceptions.py`. Zero frontend. - -**Deliverables:** -- [ ] `frontend/src/pages/compliance/Exceptions.tsx` — list view (paginated, filter by status/rule/host) -- [ ] Exception request form (justification, risk assessment, expiration date) -- [ ] Approval workflow display (approver name, approved_at, justification) -- [ ] Escalate button (routes to higher-role approver) -- [ ] Re-remediation button (kick off remediation for excepted rule) -- [ ] Nav item: "Exceptions" under Compliance - -### G2: Scheduled scan management UI (weeks 4-6) - -**Current state:** Backend complete at `routes/compliance/scheduler.py`. No frontend. 
- -**Deliverables:** -- [ ] `frontend/src/pages/scans/ScheduledScans.tsx` — adaptive interval config with sliders -- [ ] Per-host schedule table: next_scheduled_scan, current_interval, maintenance_mode -- [ ] Preview histogram: "next 48h scans across the fleet" -- [ ] New backend endpoint: `POST /api/compliance/scheduler/preview` - -### G3: Jira bidirectional sync (weeks 5-9) - -**Deliverables:** -- [ ] `backend/app/services/notifications/jira.py` — uses `jira` Python SDK (add to requirements.txt) -- [ ] Outbound: drift events + failed transactions create Jira issues with evidence -- [ ] Inbound: `POST /api/integrations/jira/webhook` — Jira webhook receiver - - Issue state transitions update OpenWatch exception or transaction state -- [ ] Field mapping configurable per Jira project -- [ ] Admin UI: Jira integration settings (project, field mapping, webhook URL) -- [ ] Spec: `specs/services/infrastructure/jira-sync.spec.yaml` - ---- - -## Workstream H — FreeBSD Validation + XCCDF Cleanup (weeks 1-4) - -> **STATUS UPDATE (2026-04-14):** H1 and H3 (the FreeBSD items) are -> **abandoned**. No path forward — Linux Docker hosts cannot run FreeBSD OCI -> containers, GitHub Actions has no FreeBSD runners, and the native pkg -> deliverable did not justify maintaining the container fork. All FreeBSD -> artifacts removed. **H2 (XCCDF/lxml removal) shipped as planned.** -> See `docs/OPENWATCH_VISION_STATUS.md` for the platform decision details. 
- -### H1: FreeBSD container testing (weeks 1-2) — ABANDONED - -**Deliverables:** -- [ ] Test `docker-compose.freebsd.yml` with FreeBSD 15.0 images -- [ ] Verify all Python C extensions compile: psycopg2, cryptography, argon2-cffi -- [ ] Verify job queue worker runs correctly on FreeBSD -- [ ] Verify SSH connections (Paramiko) work from FreeBSD containers -- [ ] Fix any FreeBSD-specific issues (paths, package names, signal handling) -- [ ] CI: add FreeBSD container build job to `.github/workflows/ci.yml` - -### H2: XCCDF/lxml removal (weeks 2-4) - -**From backlog (P2):** `owca/extraction/xccdf_parser.py` imports lxml at module level via `owca/__init__.py`. Legacy OpenSCAP path. - -**Deliverables:** -- [ ] Make XCCDF parser import conditional (lazy import, not at module level) -- [ ] Verify OWCA works without lxml when XCCDF parser is not called -- [ ] If XCCDF parser is never called in the Kensa-only path: remove it entirely -- [ ] Remove `lxml` from `requirements.txt` if no active code paths use it -- [ ] Audit: verify no other module imports lxml - -### H3: FreeBSD native package testing (weeks 3-4) — ABANDONED - -**Deliverables:** -- [ ] Test `packaging/freebsd/build-pkg.sh` on FreeBSD 15.0 -- [ ] Verify rc.d scripts start/stop services correctly -- [ ] Test upgrade path: install pkg, upgrade pkg -- [ ] Document any FreeBSD-specific configuration in `docs/guides/` - ---- - -## Workstream I — Baseline Management + Retention (weeks 4-8) - -### I1: Baseline auto-management (weeks 4-5) - -**Current state:** Auto-baseline on first scan shipped (Q1). Missing: explicit reset/promote API. 
- -**Deliverables:** -- [ ] `POST /api/hosts/{host_id}/baseline/reset` — establish new baseline from most recent scan -- [ ] `POST /api/hosts/{host_id}/baseline/promote` — promote current posture to baseline -- [ ] Rolling baseline: 7-day moving average for hosts marked `baseline_type=rolling_avg` -- [ ] Frontend: "Reset Baseline" / "Promote to Baseline" buttons on HostDetail - -### I2: Alert routing rules (weeks 5-7) - -**Deliverables:** -- [ ] `alert_routing_rules` table: severity, alert_type, channel_type, channel_config -- [ ] Example: `CRITICAL + HOST_UNREACHABLE → pagerduty:oncall` -- [ ] Extend `AlertService.create_alert()` to fan out per routing rule -- [ ] PagerDuty channel: `backend/app/services/notifications/pagerduty.py` -- [ ] Frontend: `frontend/src/pages/compliance/AlertRoutingRules.tsx` -- [ ] Add `python-pagerduty` to requirements.txt - -### I3: Transaction log retention policies (weeks 6-8) - -**Deliverables:** -- [ ] `retention_policies` table: tenant_id, resource_type, retention_days -- [ ] Default: 365 days for transactions, 30 days for host_rule_state check history -- [ ] `cleanup_old_transactions` job queue task (registered in recurring_jobs) -- [ ] Before deletion: emit signed archive bundle to configurable storage (filesystem) -- [ ] Admin API: `GET/PUT /api/admin/retention` — view/update retention config -- [ ] Frontend: retention settings in admin page - ---- - -## Exit Criteria (end of Q2) - -### Evidence (Workstream F) -- [ ] Ed25519 signing service with key rotation -- [ ] Per-host audit timeline with full filter/export surface -- [ ] Signed evidence exports downloadable from UI -- [ ] `docs/EVIDENCE_VERIFICATION.md` with standalone verification script - -### Control Plane (Workstream G) -- [ ] Exception workflow UI shipped -- [ ] Scheduled scan management UI shipped -- [ ] Jira bidirectional sync (outbound + inbound webhook) - -### Platform (Workstream H) -- [ ] FreeBSD 15.0 containers tested and validated -- [ ] XCCDF/lxml dependency 
removed (or made conditional) -- [ ] FreeBSD pkg package tested - -### Heartbeat (Workstream I) -- [ ] Baseline reset/promote API + UI -- [ ] Alert routing rules with PagerDuty channel -- [ ] Transaction retention policy enforced with signed archives - ---- - -## Dependencies and Risks - -1. **Kensa team coordination** (F1): If evidence signing requires Kensa to emit different data, that's an upstream PR. Current envelope shape may be sufficient. - -2. **Jira SDK packaging** (G3): `jira` Python SDK adds a dependency. Evaluate size vs value. Alternative: raw REST calls to Jira API (no SDK needed). - -3. **FreeBSD container availability** (H1): FreeBSD 15.0 OCI images are on Docker Hub but may have quirks with specific Python C extensions. Test early. - -4. **lxml removal risk** (H2): OWCA module-level import means removing lxml breaks import chain. Must be lazy-loaded first. - -5. **PagerDuty pricing** (I2): PagerDuty integration requires customers to have PagerDuty accounts. May not be relevant for all deployments. 
- ---- - -## PR Decomposition - -| PR | Contents | Workstream | Week | -|---|---|---|---| -| 1 | Signing service + migration + spec | F1 | 1-2 | -| 2 | Signing API endpoints + verification docs | F1 | 2-3 | -| 3 | Per-host audit timeline API + frontend tab | F2 | 3-5 | -| 4 | Signed evidence export + download button | F3 | 5-6 | -| 5 | Exception workflow UI | G1 | 3-5 | -| 6 | Scheduled scan management UI + preview API | G2 | 4-6 | -| 7 | Jira service + outbound + inbound webhook | G3 | 5-8 | -| 8 | Jira admin UI | G3 | 8-9 | -| 9 | FreeBSD container validation + CI | H1 | 1-2 | -| 10 | XCCDF lazy import / removal | H2 | 2-4 | -| 11 | FreeBSD pkg testing | H3 | 3-4 | -| 12 | Baseline reset/promote API + UI | I1 | 4-5 | -| 13 | Alert routing rules + PagerDuty | I2 | 5-7 | -| 14 | Retention policies + signed archives | I3 | 6-8 | - -**~14 PRs over 9 weeks.** - ---- - -## Q2 Specs Plan - -### New draft specs (8 total, created at Q2 kickoff) - -| Spec | Location | Workstream | ACs | Test Stub | -|------|----------|------------|-----|-----------| -| evidence-signing | services/signing/ | F1 | 8 | test_evidence_signing_spec.py | -| jira-sync | services/infrastructure/ | G3 | 8 | test_jira_sync_spec.py | -| baseline-management | services/compliance/ | I1 | 5 | test_baseline_management_spec.py | -| alert-routing | services/compliance/ | I2 | 6 | test_alert_routing_spec.py | -| retention-policy | services/compliance/ | I3 | 6 | test_retention_policy_spec.py | -| exception-workflow (FE) | frontend/ | G1 | 7 | exception-workflow.spec.test.ts | -| scheduled-scans (FE) | frontend/ | G2 | 5 | scheduled-scans.spec.test.ts | -| host-audit-timeline (FE) | frontend/ | F2 | 5 | host-audit-timeline.spec.test.ts | - -### Existing specs to update in Q2 - -| Spec | Change | Version Bump | -|------|--------|-------------| -| api/hosts/host-crud.spec.yaml | Add AC: per-host transaction timeline endpoint | bump | -| services/compliance/alert-thresholds.spec.yaml | Add AC: alert routing 
rules dispatch | bump | -| frontend/host-detail-behavior.spec.yaml | Add AC: audit timeline tab | bump | - -### SPEC_REGISTRY after Q2 kickoff - -- Total: 94 specs (80 Active, 14 Draft) -- System: 13 (10 Active, 3 Q1 Draft) -- Services: 29 (21 Active, 3 Q1 Draft, 5 Q2 Draft) -- Frontend: 16 (13 Active, 3 Q2 Draft) -- All others unchanged - -### Promotion schedule - -- **Q2 week 4**: Promote Q1 draft specs (6) to active once CI validates -- **Q2 week 9**: Promote Q2 draft specs as features ship - ---- - -## Carries from Q1 - -These Q1 items carry into Q2 as operational gates: - -| Item | Status | Q2 action | -|---|---|---| -| SSO security review | Checklist documented, Bandit/Semgrep clean | Complete internal review or engage external reviewer | -| Spec promotions (6 draft → active) | Code landed, tests skip-marked | Unskip tests in CI Docker environment, promote | -| Liveness ping port detection | P2 backlog | Fix: read SSH port from host credential config | -| XCCDF/lxml removal | P2 backlog | Workstream H2 | - ---- - -## Q3 Preview (from Q1-Q3 plan) - -Q3 focuses on: -- **Transaction log query API** (REST, filters, pagination) — foundation for Agent API -- **Proactive remediation workflow** (drift → draft remediation → human approve → execute) -- **Multi-approval infrastructure** (2-human approval for sensitive transactions) -- **Fleet grouping + per-group policies** (scan cadence, approval, drift thresholds) -- **Tier 3 decision gate** (Go rewrite viability + Kensa integration path) -- **First "State of Production Rollback" public report** diff --git a/docs/OPENWATCH_VISION.md b/docs/OPENWATCH_VISION.md deleted file mode 100644 index f544ca3e..00000000 --- a/docs/OPENWATCH_VISION.md +++ /dev/null @@ -1,314 +0,0 @@ -# OpenWatch Vision - -**Status:** Founding document, draft v1 -**Companion documents:** -- `KENSA_VISION.md` — the transactional primitive OpenWatch is built on -- `HANALYX_MISSION_AND_ROADMAP.md` — company mission and 18-month trust roadmap -- 
`HANALYX_18_MONTH_STRATEGY.md` — tactical strategy and 90-day plan -- `AI_DEFENSIBILITY.md` — why Hanalyx becomes more valuable as AI improves - ---- - -## What OpenWatch Is - -**OpenWatch is the fleet eye, the heartbeat, and the control plane for Kensa.** - -Kensa is a passive primitive. It acts only when invoked. It remembers nothing across runs. It knows how to capture, apply, validate, and roll back — but it does not know when to do so, which hosts to do it on, or what happened yesterday. Left alone, Kensa does nothing. - -OpenWatch is what turns Kensa from a CLI tool into continuous, proactive, observable infrastructure. It decides when Kensa runs. It remembers every transaction Kensa has ever executed. It notices when today differs from yesterday. It alerts humans to drift. It orchestrates transactions across fleets. It provides the audit trail. It is where the passive primitive becomes an active system. - -**Kensa is the transaction. OpenWatch is the fleet that runs on it.** - ---- - -## The Frame: git is to GitHub as Kensa is to OpenWatch - -The cleanest mental model for how Kensa and OpenWatch relate is git and GitHub. The pattern repeats because it is correct for a whole class of products. 
- -| git | Kensa | GitHub | OpenWatch | -|---|---|---|---| -| Open-source plumbing | Open-source plumbing | Hosted porcelain | On-prem or hosted porcelain | -| Local, stateless | Local, stateless | Stateful, persistent | Stateful, persistent | -| Powerful primitive | Powerful primitive | Multiplies git's value | Multiplies Kensa's value | -| Used by developers who know it | Used by compliance engineers who know it | Where most users actually work | Where most users actually work | -| Credibility-bearing | Credibility-bearing | Revenue-bearing | Revenue-bearing | -| Can be used alone | Can be used alone | Cannot exist without git | Cannot exist without Kensa | -| Most people don't use it alone | Most people won't use it alone | — | — | - -git without GitHub is a programmer's tool. GitHub without git is a dashboard with nothing underneath. They need each other, they reinforce each other, and they divide labor cleanly: one is the open primitive that earns trust through auditability, the other is the product people actually pay for. - -Kensa and OpenWatch should be thought of the same way. **Kensa must remain open source, visible, auditable, and community-facing — because credibility demands it and because it is the primitive that defines the category. OpenWatch must become continuous, proactive, and transactional — because it is where value is delivered to customers and where revenue lives.** - -This is the architecture of a successful open-core company. It is not an accident that GitHub, GitLab, Grafana, MongoDB, Elastic, Sentry, and HashiCorp all run versions of this pattern. It works because it resolves the central tension in selling infrastructure software: customers need the engine to be transparent, and the company needs the product to be ownable. - ---- - -## The Three Identities of OpenWatch - -OpenWatch has three architectural identities. Each one corresponds to a specific customer need and a specific part of the codebase. 
The three together define what OpenWatch is for. - -### 1. The Eye - -**OpenWatch is the continuous, comprehensive view of the transactional state of every Linux host under management.** Every change Kensa has ever captured, applied, validated, committed, or rolled back — on every host, across every fleet — is visible here. Nothing is lost. Nothing is invisible. If it happened on a managed host, OpenWatch saw it and recorded it. - -The Eye is the component that makes the product trustworthy. You cannot sell "every change is auditable" unless you have a system that actually captures and retains every change in a queryable form. The Eye is that system. - -This identity is delivered by the **transaction log** — the primary data structure of OpenWatch, the thing customers look at first, the thing auditors export from, and the thing AI agents will eventually read and write against. - -### 2. The Heartbeat - -**OpenWatch is continuously and proactively aware of every host's state — not just when a human asks.** The heartbeat runs whether a human is watching or not. It scans hosts on a schedule. It detects drift from baseline. It raises alerts when something changes. It tracks host liveness, reachability, and responsiveness. It is the difference between a tool that answers questions and a tool that tells you when you need to ask one. - -The heartbeat is how OpenWatch earns the "continuous compliance" and "continuous state assurance" claims that federal continuous monitoring requires and that commercial SREs intuitively want. It is also the component that makes the Eye's data current — without a heartbeat, the Eye is a photograph, not a live feed. - -### 3. 
The Control Plane - -**OpenWatch is where humans and (eventually) AI agents issue instructions to the fleet.** A human who wants to apply a change across 500 hosts describes it in OpenWatch, reviews the preview (what will be captured, what will be applied, what validation will run, what rollback will occur on failure), approves, and OpenWatch orchestrates the transaction across the fleet. The result flows back into the transaction log. - -This identity is what turns OpenWatch from a dashboard into infrastructure. A dashboard is something you look at. A control plane is something you operate through. The difference is the difference between Grafana and Kubernetes. - -The Control Plane is also the surface that the eventual AI-agent use case will consume. An agent in 2027 or 2028 that wants to apply a change to production does not talk to Kensa directly. It talks to OpenWatch's Control Plane API, which enforces authorization, records intent, captures the transaction, and provides the audit trail. Humans approve; agents operate; OpenWatch mediates. - ---- - -## The Core Architectural Commitment - -Every feature in OpenWatch must serve one or more of the three identities above. Features that do not serve any of them do not ship. 
- -Concretely: - -- **Scan scheduling** → Heartbeat -- **Drift detection** → Heartbeat → Eye -- **Transaction log UI** → Eye -- **Evidence export (OSCAL, signed bundles)** → Eye -- **Exception workflow** → Control Plane -- **Multi-host orchestration** → Control Plane -- **RBAC, SSO, audit log** → Control Plane -- **API for programmatic access** → Control Plane (future: agents) -- **Alerting and notification** → Heartbeat → Control Plane -- **Host health / liveness monitoring** → Heartbeat -- **Historical posture queries** → Eye -- **Baseline management** → Heartbeat → Eye - -Features that do not fit this model — third-party scanner ingestion, cloud provider integrations that aggregate foreign findings, CI/CD security scanning, generic observability dashboards — do not ship. They expand OpenWatch's scope at the cost of its identity. OpenWatch is not a compliance aggregator. It is the Eye, the Heartbeat, and the Control Plane for Kensa transactions. - ---- - -## The Transaction Log as Primary Interface - -The most important architectural decision in the next six months is to make the **transaction log** the primary interface of OpenWatch, replacing the current organization around "scans," "findings," and "reports." - -### What the transaction log contains - -Every entry is a Kensa transaction with: - -- **Timestamp and duration** -- **Host and fleet context** -- **Initiator** (human user, scheduled job, drift trigger, AI agent) -- **Pre-state capture** — the exact state of the system before the change -- **Change applied** — the specific remediation handler and parameters -- **Validation result** — did the change produce the intended effect -- **Commit or rollback decision** -- **Post-state** — the exact state of the system after commit, or restored pre-state after rollback -- **Evidence envelope** — structured, signable, exportable to OSCAL -- **Framework mappings** — which compliance controls this transaction satisfies (CIS, STIG, NIST, etc.) 
— as metadata, not as the primary organizing principle - -### Why this reframing matters - -- **One data model serves three audiences.** SREs see "what changed." Compliance officers see "what was remediated." Auditors see "the evidence trail." All three views come from the same log; only the filter and the UI differ. -- **It maps 1:1 to the Kensa vision.** Kensa's four phases (capture, apply, validate, commit-or-rollback) are exactly the fields of a transaction log entry. No impedance mismatch between the engine and the product. -- **It is the right surface for the AI-agent future.** When an agent needs to apply a change, it writes a transaction intent to the log. When it needs to understand fleet state, it reads the log. The log is the API. -- **It differentiates from every other compliance tool.** No competitor organizes around transactions. They all organize around findings (scanner mindset) or controls (GRC mindset). The transaction log is a category-defining UI, not just a rename. - -### What this replaces - -- The current "Scans" top-level navigation becomes "Transactions." -- "Findings" becomes a filtered view of the transaction log (transactions with status = fail). -- "Reports" becomes exports generated from the transaction log. -- "Compliance status" becomes aggregate queries against the transaction log over time ranges. -- The database schema is refactored to treat `scans` + `scan_results` + `scan_findings` as a single `transactions` table, with the existing fields reorganized around the four-phase model. - -This is the single highest-leverage change to OpenWatch in the next six months. It is mostly a data-model refactor and UI reorganization, not new feature work. It pays for itself by making every subsequent feature simpler to build. - ---- - -## What OpenWatch Must Never Become - -As important as naming what OpenWatch is: naming what it is not, so scope creep does not dilute the identity. 
- -- **OpenWatch is not a compliance aggregator.** It does not ingest findings from Tenable, Qualys, Rapid7, or any other scanner. It records Kensa transactions. Customers who want a compliance aggregator should buy a compliance aggregator. -- **OpenWatch is not a GRC platform.** It does not track policies, manage SOC 2 evidence collection, or produce organization-wide compliance dashboards across non-Linux systems. Drata, Vanta, and Secureframe exist for that. OpenWatch is focused on the Linux transactional layer. -- **OpenWatch is not an observability tool.** It does not replace Datadog, Grafana, Prometheus, or New Relic. It tells you what changed, not what is happening. The heartbeat is about state, not about metrics and logs. -- **OpenWatch is not a configuration management system.** Customers should still use Ansible, Chef, Puppet, or Salt for day-to-day provisioning. OpenWatch is where those changes become transactional and auditable — not where they originate. -- **OpenWatch is not a multi-cloud security posture management tool.** It does not talk to AWS Security Hub, Azure Defender, or GCP SCC. It manages Linux hosts directly. Cloud-native posture management is a different market with different competitors and Hanalyx does not play there. -- **OpenWatch is not a scanner without Kensa.** Every transaction runs through the Kensa primitive. There is no parallel scanning path. The architectural commitment is that Kensa is the only engine underneath. - -Each of these constraints is load-bearing. Violating any one of them dilutes the identity of the product and pushes it toward being a generic compliance platform — a space where we cannot compete and would not want to. - ---- - -## 12–18 Month Milestones - -These milestones are organized around the three identities. They connect directly to the trust moats in `HANALYX_MISSION_AND_ROADMAP.md` — every milestone serves at least one moat. 
- -### Quarter 1 (Months 0–3): Transaction log reframing and heartbeat foundations - -**The Eye** -- [ ] Refactor database schema: unify `scans`, `scan_results`, `scan_findings`, `scan_baselines`, `scan_drift_events` around a single `transactions` table with the four-phase model (capture, apply, validate, commit/rollback). -- [ ] Ship the transaction log as the primary top-level UI in OpenWatch. Replace "Scans" / "Findings" / "Reports" navigation with "Transactions." -- [ ] Implement per-transaction detail view: full pre-state, apply, validate, commit/rollback, post-state, evidence envelope, framework mappings as metadata. - -**The Heartbeat** -- [ ] Scheduled scans enabled by default on every onboarded host. Remove the opt-in barrier. -- [ ] Host liveness monitoring: last-seen timestamp, reachability check, response time tracking on every managed host. -- [ ] Fleet-level health view: all hosts up, last scan successful, drift events in the last 24 hours visible at a glance. - -**The Control Plane** -- [ ] Slack + Jira integration (outbound alerts and bidirectional ticket sync for drift events and failed transactions). -- [ ] SAML/OIDC SSO — required for enterprise and federal sales. - -**Moat connection:** Track Record (Eye makes the log auditable from day one), Community (clean data model is foundation for community rule contributions). - ---- - -### Quarter 2 (Months 3–6): Evidence export and auditor-grade outputs - -**The Eye** -- [ ] OSCAL export from the transaction log. Every transaction in the log can be exported as an OSCAL-formatted evidence bundle. -- [ ] Signed evidence bundles using Ed25519. Signing key managed per deployment, with published verification instructions. -- [ ] Per-host audit timeline view: every transaction that has ever touched this host, with filter, search, and export. -- [ ] Transaction log retention policy, configurable per fleet. 
- -**The Heartbeat** -- [ ] Drift detection running automatically on every scheduled scan, with no configuration required. -- [ ] First-class drift alert notifications via Slack, email, and webhook. -- [ ] Baseline auto-management: first scan establishes baseline, subsequent scans measured against it, baseline can be explicitly updated. - -**The Control Plane** -- [ ] Scheduled scan management UI: when, how often, which rules, which hosts, with a clear preview of what each scheduled scan will do. -- [ ] Exception workflow UI for the transaction log: mark a transaction as accepted (risk acknowledged), escalate, or request re-remediation. - -**Moat connection:** Auditor Relationships (OSCAL + signed bundles are the concrete artifacts we will brief auditors on), Liability (the signed evidence is what makes the production SLA defensible). - ---- - -### Quarter 3 (Months 6–9): Proactive remediation and control plane maturity - -**The Eye** -- [ ] Query API for the transaction log: REST endpoint that accepts filters (host, fleet, date range, status, mechanism, framework) and returns paginated transactions. This is the foundation of both the advanced UI and the future agent API. -- [ ] Historical posture queries: "what was fleet X's compliance state on date Y?" answered in under 500ms from the transaction log. -- [ ] First public **"State of Production Rollback"** report generated from anonymized aggregate transaction log data across lighthouse customers. - -**The Heartbeat** -- [ ] **Proactive remediation workflow:** when drift is detected, OpenWatch automatically drafts a proposed remediation transaction (capture plan, apply plan, validation plan, rollback plan) and raises it to a human for approval. One-click approve → transaction runs → result flows back into the log. -- [ ] Alert routing rules: different drift severities go to different channels (Slack, email, PagerDuty, ticketing). 
-- [ ] Heartbeat performance: every managed host scanned at least every 6 hours by default, with per-host override. - -**The Control Plane** -- [ ] RBAC with role-based approval requirements: certain transactions (e.g., grub parameter changes) require two-human approval before execution. -- [ ] Fleet grouping and per-group policy: different hosts can have different scan cadences, different approval requirements, different drift thresholds. -- [ ] First batch of user-contributed rules merged from the open-source community (tied to Kensa community work). - -**Moat connection:** Track Record (proactive remediation is where customers see the closed-loop story in action), Canonical Upstream (public report establishes Hanalyx as the authority on production rollback statistics). - ---- - -### Quarter 4 (Months 9–12): FedRAMP-ready continuous monitoring - -**The Eye** -- [ ] Continuous monitoring reporting that meets federal ConMon requirements: rolling 30-day posture, POA&M integration, continuous compliance dashboards, monthly evidence packages. -- [ ] Per-framework filtered views of the transaction log: "show me all transactions that satisfy NIST 800-53 AC-2 over the last 90 days." -- [ ] Export integration with FedRAMP continuous monitoring tooling. - -**The Heartbeat** -- [ ] SLO tracking: uptime of OpenWatch itself, time-to-detect drift, time-to-alert, time-to-remediate. Publicly visible on an internal status page first, then externally. -- [ ] Alerting integrations: PagerDuty, Opsgenie, Microsoft Teams (in addition to existing Slack/email/webhook). - -**The Control Plane** -- [ ] Audit log for every Control Plane action: who approved what, when, from where, with what justification. The audit log is itself a set of transactions in the transaction log. -- [ ] First signed **production SLA** offered to paying customers, backed by the transaction log as evidence. 
-- [ ] First federal customer successfully passing a continuous monitoring review with OpenWatch as the ConMon system. - -**Moat connection:** FedRAMP (continuous monitoring is one of the largest control families), Liability (SLA backed by transaction log), Auditor Relationships (first auditor success story). - ---- - -### Quarter 5 (Months 12–15): The agent API surface and hosted control plane - -**The Eye** -- [ ] Read-only **Agent API**: authenticated, rate-limited, OpenAPI-specified interface that lets an authorized AI agent query the transaction log, read fleet state, and subscribe to drift events. Not write-enabled yet. -- [ ] Anonymized aggregate telemetry (opt-in) from customer deployments feeding the first cross-customer benchmark dataset. - -**The Heartbeat** -- [ ] Multi-region heartbeat: OpenWatch can monitor hosts across geographic regions with appropriate latency and reliability guarantees. -- [ ] Graceful degradation: if OpenWatch loses contact with a host, the Heartbeat explicitly distinguishes "host is down" from "host is unreachable from OpenWatch" and alerts accordingly. - -**The Control Plane** -- [ ] **First hybrid deployment:** on-prem Kensa agent pushing signed transaction bundles to a Hanalyx-hosted OpenWatch control plane, as an opt-in upgrade for existing customers. Single-tenant at first. -- [ ] Formal API versioning and deprecation policy for the Control Plane API. -- [ ] First non-founder engineering hire (if hiring timing allows) focused on the Control Plane surface. - -**Moat connection:** Canonical Upstream (agent API positions OpenWatch as infrastructure, not a tool), Track Record (hybrid deployment is the beginning of the long-term SaaS option). - ---- - -### Quarter 6 (Months 15–18): Write-enabled agent API and multi-tenant readiness - -**The Eye** -- [ ] Public transaction log schema specification, versioned and stable. Third parties can build tools against it. 
-- [ ] Second **"State of Production Rollback"** report with year-over-year trends. - -**The Heartbeat** -- [ ] Heartbeat performance SLA: drift detected within 15 minutes of occurrence on any managed host by default. -- [ ] Predictive heartbeat: OpenWatch flags hosts whose behavior is diverging from the fleet norm before an explicit drift event fires. - -**The Control Plane** -- [ ] **Write-enabled Agent API:** an authorized AI agent can propose a transaction, which lands in the approval queue for human review. Approved transactions execute through Kensa and flow back into the log. This is the first version of the "AI agents + humans operating the fleet together" vision. -- [ ] **Multi-tenancy groundwork:** `tenant_id` / `org_id` columns on all relevant tables, row-level security policies, tenant-aware RBAC. Not yet exposed to customers; this is the technical foundation for the potential commercial SaaS wedge at month 18+. -- [ ] Decision point on the commercial SaaS wedge: based on federal ARR, community traction, and agent API interest, decide whether to launch a separate commercial brand on top of the multi-tenant foundation. - -**Moat connection:** Canonical Upstream (agent API makes OpenWatch the reference integration point for AI infrastructure), FedRAMP (authorization should land around this time — the multi-tenant groundwork is what makes a hosted FedRAMP offering possible). - ---- - -## KPIs - -Measured monthly, reviewed quarterly. 
- -**The Eye** -- Transactions per month (cumulative across customers) -- Percentage of transactions with complete evidence envelopes (target: 100%) -- Time to query the transaction log for a typical historical posture question (target: under 500ms) -- Evidence exports generated per month - -**The Heartbeat** -- Percentage of managed hosts scanned in the last 24 hours (target: 99%+) -- Median time from drift event to human alert (target: under 15 minutes) -- False positive rate on drift alerts (target: decreasing over time) -- Host liveness coverage: percentage of managed hosts with current liveness data - -**The Control Plane** -- Active users per customer per month -- Transactions initiated from the Control Plane (human-initiated) vs the Heartbeat (automatic) — the ratio tells us how proactive the product has become -- Approval latency: median time from proposed transaction to human approval -- API requests per month (once the Agent API is live) - ---- - -## The One-Line Version - -**OpenWatch is the Eye, the Heartbeat, and the Control Plane for Kensa. Kensa is the transaction; OpenWatch is the fleet that runs on it.** - ---- - -## The OpenWatch Landing Page Hero - -> ### OpenWatch is the fleet eye, the heartbeat, and the control plane for Kensa. -> -> Continuous visibility into every transactional change across your Linux fleet. Proactive drift detection. Auditor-grade evidence, automatically. One transaction log for humans, compliance teams, auditors, and eventually the AI agents that will operate production alongside them. -> -> *Kensa is the transaction. OpenWatch is the fleet that runs on it.* - ---- - -*End of document.* diff --git a/docs/OPENWATCH_VISION_STATUS.md b/docs/OPENWATCH_VISION_STATUS.md deleted file mode 100644 index 93ea42da..00000000 --- a/docs/OPENWATCH_VISION_STATUS.md +++ /dev/null @@ -1,36 +0,0 @@ -# OpenWatch vs. 
Vision Milestones — Status Check - -**Date:** 2026-04-13 (updated) -**Source:** Assessment against [OPENWATCH_VISION.md](OPENWATCH_VISION.md) Q1–Q3 milestones - ---- - -## Platform Decision: Linux Containers (FreeBSD evaluated, dropped 2026-04-14) - -OpenWatch ships on Linux containers with native RPM and DEB packages for -air-gapped deployment. - -- Container base: Red Hat UBI 9 (backend, worker), Alpine (db, frontend) -- FIPS: OpenSSL 3.x FIPS provider module (portable, not tied to Red Hat) -- Native packages: RPM (CentOS Stream 9) and DEB (Ubuntu 24.04) - -### Why FreeBSD was evaluated and dropped - -A FreeBSD 15.0 minimal container target was scoped in early 2026-04 as part of -the Workstream E dependency-minimization story. The Dockerfiles, compose file, -and pkg packaging skeleton were drafted and merged. Validation revealed there -is no practical path forward: - -- Standard Linux Docker hosts (including all developer machines and GitHub - Actions Linux runners) cannot execute FreeBSD OCI containers — that requires - OCI v1.3 with a FreeBSD-aware runtime, which only exists on FreeBSD hosts -- GitHub Actions does not provide FreeBSD runners; self-hosted FreeBSD runners - would need to be procured and maintained -- The native FreeBSD pkg deliverable can serve air-gapped FreeBSD operators - without requiring containerized FreeBSD at all, but H3 alone did not justify - the maintenance cost of the container fork - -All FreeBSD artifacts (Dockerfile.*.freebsd, docker-compose.freebsd.yml, -packaging/freebsd/) were removed on 2026-04-14. - ---- diff --git a/docs/PRE_RELEASE_SECURITY_REVIEW_2026-06-16.md b/docs/PRE_RELEASE_SECURITY_REVIEW_2026-06-16.md deleted file mode 100644 index 297a63bd..00000000 --- a/docs/PRE_RELEASE_SECURITY_REVIEW_2026-06-16.md +++ /dev/null @@ -1,149 +0,0 @@ -# OpenWatch — Pre-Release Security Review (2026-06-16) - -**Scope:** the full Go backend (`internal/`, `cmd/`), the React/TypeScript -frontend (`frontend/`), and packaging/CI. 
Conducted as six parallel, -read-only dimension audits; every High/Critical finding was then -re-verified by hand against the source. - -**Method:** auth/session · authZ/RBAC · cryptography/secrets · -injection/SSH/SSRF · web/HTTP/audit · supply-chain/packaging. - ---- - -## Verdict - -**Release-ready (pending CI + the medium/low backlog).** The cryptographic and data-handling core is -strong — correct AES-256-GCM at rest, sound Argon2id, no SQL injection / -command injection / path traversal / unsafe deserialization, secrets never -on argv or in logs, strong SSRF defense. The problems were **missing -perimeter controls** and **two access-control defects**. Five of those have -been fixed under SDD discipline (PR #584); three larger items remain and are -the gate for an internet-facing release. - -| State | Count | -|-------|-------| -| **Fixed** (PR #584, specced + tested) | 8 | -| **Open — release blockers** | 0 | -| Open — medium | ~7 | -| Open — low / informational | ~12 | -| Verified strong (no action) | many | - ---- - -## Fixed in PR #584 (spec → test → code) - -| # | Sev | Finding | Fix | Spec | -|---|-----|---------|-----|------| -| 1 | **High** | `GET /api/v1/audit/events` had **zero authorization** — anonymous full audit-trail disclosure (actor ids, IPs, resource ids). | Require `audit:read`; anonymous → 403, no events. | `api-audit-events-query` C-06/AC-11 | -| 2 | **High** | API-token **privilege escalation**: a `token:write` holder (e.g. `security_admin`) could `POST /tokens` with `role_id:"admin"` and obtain a full-admin bearer token. | New `auth.RoleGrantsWithin`: requested role's permissions must be ⊆ caller's; else 403. | `api-tokens` C-03/AC-05 | -| 3 | Med-High | `roles:assign` had no subset/self guard (escalation primitive; only admin holds `role:assign` today, so defense-in-depth). | Same `RoleGrantsWithin` check on assignment. 
| `api-users` C-05/AC-13 | -| 4 | **High** | **No security headers** on an origin serving SPA + API (clickjacking, SSL-strip, MIME-sniff, no XSS defense-in-depth). | `securityHeaders` middleware: HSTS, CSP (`frame-ancestors 'none'`, `default-src 'self'`), nosniff, `X-Frame-Options: DENY`, Referrer-Policy. | `system-http-server` C-12/AC-17 | -| 5 | **High** | **Breach-password check dead in production** — every `users.NewService` passed a `nil` corpus, so compromised passwords were accepted. | Always-on embedded baseline (`DefaultBreachCorpus`, 129 common passwords, airgap-safe) wired at all 3 prod sites; operator HIBP override via `OPENWATCH_BREACH_CORPUS_FILE`. | `system-auth-identity` C-15/AC-27 | -| 6 | **High** | **No CSRF enforcement** — frontend double-submit was theater; only `SameSite=Lax` protected mutations. | `csrfProtect` middleware (constant-time double-submit) + `XSRF-TOKEN` cookie at login/refresh; gated on the session cookie; Bearer/`/auth/*` exempt. | `system-http-server` C-14/AC-19 | -| 7 | **High** | **No login rate-limiting / lockout** — unlimited online guessing. | Dependency-free per-IP sliding-window limiter on `/auth/login` + `/auth/mfa:verify`; 429 + Retry-After. | `system-http-server` C-13/AC-18 | -| 8 | Med-High | **SSH host-key in-memory per-process TOFU** — MITM on first scan after every restart. | PostgreSQL-backed `KnownHostsStore` (migration 0036) wired at all 4 dial sites → durable TOFU across restarts. | `system-ssh-connectivity` C-13/AC-22 | - -> #1, #2, #4, #5 are mutually reinforcing: weak-password acceptance + no -> CSRF (below) + clickjacking + anonymous data access formed a realistic -> account-takeover / data-exposure chain. #1 (anonymous) was the most -> severe and is fixed. - ---- - -## ~~Open — release blockers~~ → ALL FIXED in PR #584 - -The three blockers below are **now closed** (see fixed rows 6-8 above). The -original analysis is retained for the record. 
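
The per-IP sliding-window limiter described in fixed row 7 can be pictured with a short sketch. The Python below is illustrative only: the shipped middleware is Go and is not shown in this review, and the class name, the `allow` signature, and the limit and window values are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Per-IP sliding-window limiter (hypothetical names, not OpenWatch code)."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for one request at time `now`."""
        hits = self._hits[client_ip]
        # Drop attempts that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            # A caller would answer 429 and put this value in Retry-After.
            return False, self.window - (now - hits[0])
        hits.append(now)
        return True, 0.0


# Assumed policy for the example: 3 attempts per 60 seconds per IP.
limiter = SlidingWindowLimiter(limit=3, window_seconds=60)
results = [limiter.allow("203.0.113.7", t) for t in (0, 1, 2, 3)]
```

Passing `now` explicitly keeps the sketch deterministic under test; a real handler would read a monotonic clock and derive the client IP from a trusted-proxy configuration, as the review notes elsewhere.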
- -### B-1 (High) — CSRF is not enforced server-side ✅ FIXED -State-changing endpoints authenticate via the `openwatch_session` cookie. -The frontend advertises a double-submit scheme (`client.ts`), but **no -server code sets an `XSRF-TOKEN` cookie or validates `X-CSRF-Token`** — the -protection reduces to `SameSite=Lax`. The client-side half is theater. -**Fix:** issue a random non-HttpOnly `XSRF-TOKEN` cookie at login/refresh + -middleware requiring `X-CSRF-Token == XSRF-TOKEN` on unsafe methods for -cookie-authenticated requests (matches what the frontend already sends). -*Evidence:* `internal/server/server.go` chain; `frontend/src/api/client.ts`. - -### B-2 (High) — No login rate-limiting or account lockout ✅ FIXED -`PostAuthLogin` / `PostAuthMFAVerify` have no throttle, no failed-attempt -counter, no lockout (confirmed: no rate-limiter anywhere in the HTTP chain). -Direct online password / OTP guessing + credential-stuffing. Flagged -independently by both the auth and web audits. -**Fix:** per-IP + per-account limiter (e.g. `httprate` or a DB-backed -counter) with stricter buckets on `/auth/*`, plus progressive backoff/lockout. -Derive client IP from a trusted-proxy config (see L-7). -*Evidence:* `internal/server/auth_handlers.go:27`. - -### B-3 (Med-High) — SSH host-key trust is in-memory, per-process TOFU ✅ FIXED -Every production dial uses `ModeTOFU` + `NewMemoryStore()` (no persistent -store, no `ModeStrict`). A network attacker can MITM the **first** scan after -every daemon restart and harvest the credentials presented to each host. The -~5-minute liveness probe is worse: `InsecureIgnoreHostKey()`. -**Fix:** a Postgres-backed `KnownHostsStore` (persist `hostname → key`), -wire it in place of `NewMemoryStore()`, default `ModeStrict` once keys are -provisioned (TOFU only for an explicit enrollment window), surface -host-key-mismatch as an operator alert, and share the store with the -liveness probe (drop `InsecureIgnoreHostKey`). 
-*Evidence:* `cmd/openwatch/worker.go:181-182`, `main.go:385/409/525-526`; -`internal/sshprivilege/privilege.go:311`. - ---- - -## Open — medium - -- **Access JWTs (30 min) are not revocable; password change revokes nothing.** `jti` is stamped but never checked; logout/reuse-cascade don't cover the bearer path. `internal/identity/jwt.go`, `internal/users/users.go UpdatePassword`. *Fix:* `jti`/token-version denylist in `VerifyJWT`; revoke sessions on password change. -- **Shared demo TLS key** baked into every package at the prod cert path, not `%config(noreplace)`/conffile → silently reverts an operator's cert on upgrade. *(2 audits)* `packaging/common/gen-demo-cert.sh`, `packaging/rpm/openwatch.spec`. *Fix:* generate per-install in `%post`; mark cert/key as config; warn/refuse on the `openwatch-demo` subject. -- **No request body-size limits** (`http.MaxBytesReader` absent) → cheap memory-exhaustion DoS on any JSON endpoint. -- **API-token create/revoke emit no audit event**; **security-failure audit events omit source IP / user-agent** (forensics gaps on a compliance product). `tokens_handlers.go`, `auth_handlers.go emitAudit`. -- **Raw `err.Error()` leaked on 500 paths** (`users_handlers.go:69,205`). -- **Release job pipes unpinned `curl | sh` syft** in the same job that holds the signing keys; **untrusted PR title interpolated into a shell `run:`** (`claude-code-alerts.yml`). 
- -## Open — low / informational - -JWT verify lacks an explicit `WithValidMethods` pin (safe today, weaker than -the OIDC/license verifiers) · TLS 1.2 default cipher set, and the "OpenSSL -FIPS provider" claim should be reconciled with the actual `crypto/tls` stack -· JWT signing key shipped `0640` group-readable (DEK is `0600`) · -Swagger/OpenAPI served unauthenticated (full API-surface recon) · -`X-Forwarded-Host/-Proto` trusted unconditionally (latent host-header -injection; becomes load-bearing once IP-based rate-limiting lands) · no -per-user SSE connection cap · default DSN ships `sslmode=disable` · -`cosign.pub` not published (breaks the documented offline cosign verify) · -`govulncheck` doesn't gate the *release* workflow (only `go-ci`) · -custom-role creation can request permissions you don't hold (latent — inert -because custom roles currently resolve to zero permissions) · alert -ack/resolve gate on the wrong permission (fail-closed correctness bug) · -login username-enumeration via Argon2id timing. 
- ---- - -## Verified strong (no action) - -Credential AES-256-GCM with per-encrypt nonces + enforced `0600` DEK · -Argon2id (64 MiB / t=3) with constant-time compare · session/refresh tokens -256-bit CSPRNG, SHA-256 at rest, no fixation, reuse-detection cascade-revoke, -idle + absolute expiry correctly clamped · MFA 160-bit secret, encrypted, -replay-prevented · **no SQLi / command injection / path traversal / unsafe -deserialization** · sudo password delivered via stdin only, never argv · -strong SSRF (post-DNS dial-IP re-check closes the rebind hole) · audit -redaction recursively scrubs secrets; parameterized audit SQL · RBAC and -license gating fail-closed · exception separation-of-duties enforced in the -service layer (not just the handler) · API tokens hashed, role re-evaluated -per request · hardened non-root systemd unit · per-install identity secrets · -signed artifacts + CycloneDX SBOMs · slowloris-resistant server timeouts · -open-redirect handled (`safeReturnTo`). - ---- - -## Recommendation - -1. **Merge PR #584** (all 8 fixes — the 5 quick wins + B-1/B-2/B-3). -2. **Smoke-test the CSP against the running SPA + /docs** before release (it - is a single tunable const). -3. Work the medium list as a fast-follow; the lows can be batched post-GA. -4. Add a CI guard that fails the release if the binary contains a font-CDN - or other external-host string (airgap regression backstop). - -*This review reflects the codebase at commit on `feat/security-hardening`. -Re-run on the release tag (including a live `make vuln`) before sign-off.* diff --git a/docs/README.md b/docs/README.md index b8483c51..16dd5bf6 100644 --- a/docs/README.md +++ b/docs/README.md @@ -2,7 +2,7 @@ Production documentation for deploying, operating, and maintaining OpenWatch. 
-Start here: [Introduction](INTRODUCTION.md) | [Quickstart](guides/QUICKSTART.md) +Start here: Introduction | [Quickstart](guides/QUICKSTART.md) > **⚠️ Migration note (2026-06-05).** OpenWatch is being rebuilt on Go (the Go tree > now lives at the repo root); the Python implementation was archived to @@ -18,7 +18,7 @@ Start here: [Introduction](INTRODUCTION.md) | [Quickstart](guides/QUICKSTART.md) | Document | Description | |----------|-------------| -| [Introduction](INTRODUCTION.md) | Platform philosophy, architecture overview, supported frameworks | +| Introduction | Platform philosophy, architecture overview, supported frameworks | | [Quickstart](guides/QUICKSTART.md) | First 15 minutes: log in, add a host, run a scan, read results | | [Installation](guides/INSTALLATION.md) | Deploy from a native RPM/DEB package or from source | @@ -62,7 +62,7 @@ Start here: [Introduction](INTRODUCTION.md) | [Quickstart](guides/QUICKSTART.md) | Document | Description | |----------|-------------| -| [Kensa Integration](architecture/KENSA_INTEGRATION.md) | Kensa compliance engine integration manual | +| Kensa Integration | Kensa compliance engine integration manual | For installing OpenWatch from the native RPM/DEB, see [guides/INSTALLATION.md](guides/INSTALLATION.md). 
(The legacy Python/Docker-Compose diff --git a/docs/SIGNING_SECURITY_REVIEW_2026-04-14.md b/docs/SIGNING_SECURITY_REVIEW_2026-04-14.md deleted file mode 100644 index ca4bbbf5..00000000 --- a/docs/SIGNING_SECURITY_REVIEW_2026-04-14.md +++ /dev/null @@ -1,162 +0,0 @@ -# Evidence Signing Security Review - -**Date**: 2026-04-14 -**Last updated**: 2026-04-14 (scope narrow per Kensa↔OpenWatch coordination) -**Scope**: `backend/app/services/signing/`, `backend/app/routes/signing/`, signing integration in `backend/app/services/compliance/audit_export.py`, schema migration `051_add_signing_keys.py` -**Reviewer**: Automated (Bandit 1.9.4, Semgrep 239 rules) + manual code review -**Spec**: `specs/services/signing/evidence-signing.spec.yaml` (9 ACs, active, **v2.0**) -**Phase**: Phase 4 mandatory security review per `docs/OPENWATCH_Q1_Q3_PLAN.md` §"Security review gates" - ---- - -## Scope narrow (2026-04-14) - -Per the Kensa↔OpenWatch coordination (`docs/KENSA_OPENWATCH_COORDINATION_2026-04-14.md` §3.2; Kensa team response §2.2), this review covered two trust layers that must not be conflated. The signing scope has been narrowed accordingly. 
- -### Trust-layer boundary - -| Layer | Who signs | What it attests | Storage | -|---|---|---|---| -| **Per-transaction evidence envelope** | **Kensa** (not OpenWatch) | "This execution happened on this host at this time" | Kensa SQLite store at capture time; envelope travels with the transaction log record | -| **Aggregate audit export / quarterly posture report / State-of-Production release** | OpenWatch | "OpenWatch aggregated this data from N hosts and produced this artifact" | OpenWatch PostgreSQL; signed at export time by `SigningService` | - -### What was removed from OpenWatch - -- `POST /api/transactions/{id}/sign` endpoint — that surface belongs to Kensa per `KENSA_GO_DAY1_PLAN.md` §8.2 -- Any future per-transaction signing code path — OpenWatch does not attempt to co-sign what Kensa already signed - -### What remains in OpenWatch (covered by this review) - -- `SigningService.sign_envelope()` used **only** by `audit_export._generate_json()` and future aggregate-report services -- `GET /api/signing/public-keys` — public key list so auditors can verify OpenWatch-signed aggregate bundles -- `POST /api/signing/verify` — verification endpoint for OpenWatch-signed aggregate bundles -- All five findings below remain valid for the narrowed scope - -### OpenWatch audit-UI verification of Kensa-signed envelopes - -At Kensa Week 22, OpenWatch audit UIs verify per-transaction envelopes via `kensa.api.Kensa.VerifyEnvelope()` (see `KENSA_GO_DAY1_PLAN.md` §3.5.4). OpenWatch does **not** maintain its own Kensa-envelope verification code path — Kensa owns that verification logic. - -## Summary - -Manual review found one HIGH (insecure private-key fallback to plain base64), two MEDIUM (race condition on key generation, silent signing failure on export), and two LOW (verify endpoint DoS surface, no key revocation flag). Automated scans clean. **HIGH and both MEDIUMs fixed in this PR**; LOWs filed as follow-up issues. 
- -## Findings - -### Resolved in this PR - -#### SEC-SIGN-01: Private key fallback to plain base64 — HIGH - -**Details:** `signing_service.py:91-94`: - -```python -if self._enc: - priv_encrypted = base64.b64encode(self._enc.encrypt(priv_bytes)).decode() -else: - priv_encrypted = base64.b64encode(priv_bytes).decode() -``` - -When `EncryptionService` is not provided, the Ed25519 private key is stored **base64-encoded only — no encryption**. The docstring at line 57 calls this "dev only" but nothing enforces that. A production misconfiguration (e.g., `app.state.encryption_service` not initialized at startup) silently produces a deployment whose audit-facing claim ("private keys encrypted at rest via EncryptionService") is false. - -This violates the spec's AC-8 ("Private keys MUST be encrypted at rest using AES-256-GCM via EncryptionService"). - -**Risk:** Anyone with database read access (operator with PostgreSQL credentials, backup restorer, breach attacker who lifts a backup) can decode the private key with one base64 round-trip and forge signed evidence bundles indistinguishable from legitimate ones. The signing trust chain collapses. - -**Fix:** Hard-fail in `generate_key()` and `sign_envelope()` if `_enc is None` unless an explicit `OPENWATCH_SIGNING_DEV_MODE=true` env var is set (test/dev only). Production deploys that rely on the silent fallback will surface the misconfiguration loudly instead of quietly accepting it. - -#### SEC-SIGN-02: Race condition on key generation — MEDIUM - -**Details:** `generate_key()` at lines 96-111 executes: -``` -UPDATE deployment_signing_keys SET active = false WHERE active = true -INSERT INTO deployment_signing_keys (..., active) VALUES (..., true) -COMMIT -``` -Two separate `db.execute()` calls with no transaction wrapping. Concurrent invocations (admin script + API user, or two simultaneous rotations) can interleave such that two rows end up with `active = true`. 
The `sign_envelope()` query uses `LIMIT 1` so it picks one — non-deterministically. - -**Risk:** Sign operations under load could pick either of the two active keys. Verification still works (looks up by `key_id`), so this doesn't break trust, but it does break the "one active key at any time" invariant the codebase assumes. - -**Fix:** Wrap UPDATE + INSERT in a single transaction with `SELECT ... FOR UPDATE` on the active row (PostgreSQL row-level lock). - -#### SEC-SIGN-03: Silent signing failure on export — MEDIUM - -**Details:** `audit_export.py:447-458` wraps the signing call in `try / except Exception`, logs a warning, and proceeds with an unsigned export. The export file is generated and downloadable with no indication to the auditor that signing failed. - -**Risk:** The compliance use case for these exports requires non-repudiation. An auditor downloads what they believe is a signed export and gets an unsigned one — and the only signal is a backend log line they don't have access to. The export's value as audit evidence is undermined. - -**Fix:** When signing fails, write `"signed_bundle": null` to the export with a `"signing_error"` field naming the cause. The export is still produced (so operators can see partial data), but the fact that it is unsigned is now machine-detectable from the export itself. - -### Deferred (follow-up issues filed) - -#### SEC-SIGN-04: Verify endpoint is unauthenticated and CPU-bound — LOW - -**Details:** `POST /api/signing/verify` is unauthenticated by design (auditors verify externally). Each request does base64 decode + canonical JSON serialization + Ed25519 verification. Large or deeply-nested envelopes amplify the cost of the JSON canonicalization step. Combined with the global rate limit (100 req/min per IP) the practical risk is bounded, but a coordinated source could still consume meaningful CPU. 
- -**Recommendation:** Add a per-endpoint request-size limit (e.g., 64KB envelope max) and a stricter rate limit (e.g., 20 req/min per IP) on this specific endpoint. Tracked as follow-up issue. - -#### SEC-SIGN-05: No key revocation flag — LOW - -**Details:** Current model: `active` (true/false) + `rotated_at` timestamp. There's no way to mark a key as "compromised — bundles signed with this key should NOT be trusted." `verify()` happily verifies any bundle whose `key_id` matches a row in `deployment_signing_keys`, regardless of whether the key was leaked. - -**Recommendation:** Add `revoked` boolean + `revoked_at` timestamp + `revocation_reason` text. `verify()` returns false (with reason in response) for bundles signed with a revoked key. Add `POST /api/signing/keys/{id}/revoke` admin endpoint. Tracked as follow-up issue. - -### Informational (no action) - -#### INFO-SIGN-01: Public public-keys endpoint exposes retired keys - -By design — auditors need retired keys to verify older bundles. Not a finding. - -#### INFO-SIGN-02: Canonical JSON for deterministic signing - -`sign_envelope()` uses `json.dumps(envelope, sort_keys=True, separators=(",", ":"))` — correct deterministic serialization. Same canonicalization in `verify()`. Confirmed correct. - -#### INFO-SIGN-03: alg=none N/A - -Ed25519 has no algorithm-confusion vector (single algorithm by definition). The OIDC `alg=none` defense applied to JWTs is not relevant here. 
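
The deterministic serialization noted under INFO-SIGN-02 can be sketched with the standard library alone. The `json.dumps` arguments are the ones quoted in the review; the helper name and the envelope fields are made up for illustration, and a SHA-256 digest stands in for the Ed25519 signature purely to show that the signed bytes do not depend on key order.

```python
import hashlib
import json


def canonical_bytes(envelope: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


# Hypothetical envelopes: same content, different key order.
a = {"host_count": 3, "artifact": "audit-export", "hosts": ["h1", "h2", "h3"]}
b = {"hosts": ["h1", "h2", "h3"], "artifact": "audit-export", "host_count": 3}

# Identical bytes mean any digest or signature over them is identical too.
same = canonical_bytes(a) == canonical_bytes(b)
digest = hashlib.sha256(canonical_bytes(a)).hexdigest()
```

Signer and verifier must apply the same canonicalization; if either side serialized with default separators, the byte strings would differ and verification would fail even for an untampered envelope.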
- -## Positive observations - -| Area | Finding | Location | -|------|---------|----------| -| Algorithm choice | Ed25519 (modern, no parameter choices, fixed output size, fast) | `signing_service.py:23, 74` | -| Key generation | `cryptography.hazmat.primitives.asymmetric.ed25519.Ed25519PrivateKey.generate()` (CSPRNG-backed) | line 74 | -| Canonical signing | `json.dumps(..., sort_keys=True, separators=(",", ":"))` — deterministic byte representation | lines 162, 201 | -| Verify failure mode | `except Exception: return False` — never leaks why verify failed | lines 204-208 | -| RBAC on sign | `@require_role([SECURITY_ADMIN, SUPER_ADMIN])` — only privileged users sign | `routes.py:123` | -| Public verify | Unauthenticated by design — auditors don't need OpenWatch credentials | `routes.py:95` | -| Schema | UUID PK, `active` flag, `rotated_at` for retirement | migration 051 | -| Rotation support | Old keys remain in DB for verification of historical bundles | `signing_service.py:96-100` | -| Encrypted at rest | `EncryptionService.encrypt()` AES-256-GCM applied to private key | `signing_service.py:91-92` (when configured) | - -## Tool results - -### Bandit 1.9.4 (high+medium severity) - -``` -Code scanned: 416 lines -Total issues: 0 -``` - -### Semgrep (p/security-audit + p/owasp-top-ten + p/python + p/secrets) - -``` -239 rules run, 4 files scanned, 0 findings -``` - -## Governance - -This automated + manual review does **not** substitute for: - -1. **Key management operational runbook** — operators are responsible for the lifecycle of `OPENWATCH_MASTER_KEY` (the EncryptionService root key); compromise of that key compromises all signing keys -2. **External auditor verification flow** — the public verification endpoint and `verify-bundle.py` companion script (if any) need a human-readable runbook published alongside signed exports -3. 
**Cryptoperiod policy** — NIST SP 800-57 §5.3.6 recommends Ed25519 cryptoperiods ≤2 years; OpenWatch should establish a rotation schedule (suggest annual) - -## References - -- NIST SP 800-57 Pt. 1 Rev. 5 §5.3.6 (asymmetric key cryptoperiods): https://csrc.nist.gov/publications/detail/sp/800-57-part-1/rev-5/final -- RFC 8032 (Ed25519): https://datatracker.ietf.org/doc/html/rfc8032 -- Spec: `specs/services/signing/evidence-signing.spec.yaml` (8 ACs, active) -- Implementing PR: #351 (squashed `3b95ef7a feat: Q2 implementation`) - ---- - -**Review status:** Phase 4 signing review complete per Q1-Q3 plan §"Security review gates". HIGH (SEC-SIGN-01) and both MEDIUMs fixed in this PR; LOWs (SEC-SIGN-04, SEC-SIGN-05) filed as follow-up issues. diff --git a/docs/SSO_SECURITY_REVIEW_2026-04-14.md b/docs/SSO_SECURITY_REVIEW_2026-04-14.md deleted file mode 100644 index cd393a53..00000000 --- a/docs/SSO_SECURITY_REVIEW_2026-04-14.md +++ /dev/null @@ -1,125 +0,0 @@ -# SSO Federation Security Review - -**Date**: 2026-04-14 -**Scope**: `backend/app/services/auth/sso/`, `backend/app/routes/auth/sso.py`, SSO dependencies -**Reviewer**: Automated (Bandit 1.9.4, Semgrep 205 rules, pip-audit 2.10.0) + manual code review -**Spec**: `specs/services/auth/sso-federation.spec.yaml` (16 ACs, active) - -## Summary - -Automated scans found one real transitive CVE chain (pyOpenSSL 22.0.0, pulled in by pysaml2) which is fixed in this PR. Manual code review found two defense-in-depth items filed as follow-up issues. No P0/P1 findings. The SSO code follows OIDC best practices for the critical validation paths (signature, `alg=none` rejection, single-use state token, PKCE S256). 
- -## Findings - -### Resolved in this PR - -#### SEC-SSO-01: Transitive pyOpenSSL CVEs (MEDIUM) - -**Details:** -- `pyOpenSSL 22.0.0` pulled transitively by `pysaml2 7.5.4` -- CVE-2026-27448 and CVE-2026-27459 — fix version 26.0.0 -- Not exploitable via OpenWatch application code (we don't call pyOpenSSL directly), but any pysaml2 code path that reaches into pyOpenSSL inherits the risk - -**Fix:** Pinned `pyOpenSSL==26.0.0` in `backend/requirements.txt` under the Authentication & Security block. - -**Verification:** -``` -$ pip-audit -r requirements.txt | grep pyopenssl -(no output — clean) -``` - -### Deferred (follow-up issues filed) - -#### SEC-SSO-02: OIDC nonce not implemented — LOW - -**Details:** The OIDC authorization URL in `oidc.py:23-46` does not include a `nonce` parameter, and `handle_callback()` at line 48 does not validate a `nonce` claim on the id_token. OpenID Connect Core 1.0 §15.5.2 strongly recommends nonce for Authorization Code Flow as defense-in-depth against id_token replay. - -**Risk:** Low — current defenses are already strong: -- 256-bit cryptographically random state, single-use, validated on callback (`provider.py:102-111`, `sso.py:274-281`) -- PKCE S256 enforced (`oidc.py:39`) -- id_token signature verified against JWKS, `alg=none` explicitly rejected (`oidc.py:89-90`) -- `iss`, `aud`, `exp`, `nbf` validated (`oidc.py:93`) - -A successful replay would require simultaneously compromising both the state token (server-side session) AND the token endpoint response — the state alone already binds the session. - -**Recommendation:** Add nonce as defense-in-depth, tracked as [follow-up issue](https://github.com/Hanalyx/OpenWatch/issues/___). - -#### SEC-SSO-03: JWKS fetched on every callback with no cache — LOW - -**Details:** `_get_jwks()` in `oidc.py:103-113` does a synchronous `httpx.get()` to the IdP's JWKS endpoint on every SSO login. No caching. 
- -**Risks:** -- Latency: adds a round-trip to every SSO login -- Availability coupling: if the IdP's JWKS endpoint is slow or down, all SSO logins stall -- Rate-limiting: frequent JWKS fetches may be rate-limited by some IdPs - -**Industry practice:** IdPs publish JWKS with Cache-Control / ETag headers; clients cache for minutes to hours. Google, Auth0, Okta all advise clients cache JWKS. - -**Recommendation:** In-process TTL cache (5-15 min), with refresh-on-miss if the id_token's `kid` isn't in the cached set. Tracked as [follow-up issue](https://github.com/Hanalyx/OpenWatch/issues/___). - -### Informational (no action) - -#### INFO-SSO-01: Bandit B105 false positive - -`routes/auth/sso.py:177` — `"token_type": "bearer"` flagged as hardcoded password. This is the OAuth2 token_type standard string; not a credential. Suppressing adds noise; leaving unsuppressed since the scan already runs at LOW severity and this is a known pattern. - -## Positive observations (confirmed by review) - -| Area | Finding | Location | -|------|---------|----------| -| State parameter | 256-bit `secrets.token_urlsafe(32)`, single-use, PostgreSQL-backed | `provider.py:102-111`, `sso_state.py` | -| PKCE | S256 enforced | `oidc.py:39` | -| id_token signature | Verified against JWKS, alg=none rejected | `oidc.py:84-90` | -| Standard claims | `iss`, `aud`, `exp`, `nbf` validated by authlib | `oidc.py:93` | -| SAML assertion signature | `want_assertions_signed: True` | `saml.py:123` | -| SAML AuthnRequest signature | `authn_requests_signed: True` | `saml.py` config | -| Audit logging | All SSO login outcomes logged via `log_login_event` | `sso.py` | -| Client IP | Uses trusted proxy validation (not raw XFF header) | `sso.py:38-42` | -| Encryption at rest | Provider configs stored encrypted via `EncryptionService` | `sso.py:50-56` | - -## Tool results - -### Bandit (backend-only, SSO scope) -``` -Test results: - Total lines of code: 650 - Total issues (by severity): - High: 0 - Medium: 0 - 
Low: 1 (B105 false positive on "bearer") -``` - -### Semgrep (p/security-audit + p/owasp-top-ten + p/jwt + p/python) -``` -205 rules run, 0 findings, 5 files scanned. -``` - -### pip-audit (SSO-relevant dependencies) -Before this PR: -``` -authlib 1.6.10 (clean after group update) -pysaml2 7.5.4 (clean) -pyOpenSSL 22.0.0 (2 CVEs) <-- FIXED in this PR -cryptography 46.0.5 (2 CVEs) <-- fixed by Dependabot PR #376 -``` - -After this PR + #376 merged: zero known CVEs in SSO-relevant dependency subgraph. - -## Governance - -This automated + manual review does **not** substitute for a human security sign-off on the following, which remain explicit operational requirements: - -1. **IdP metadata trust**: operator is responsible for validating the IdP's metadata URL / certificate fingerprint before adding a provider to OpenWatch -2. **Role mapping review**: group-to-role mappings (`claim_mappings`) must be reviewed per-IdP to prevent unintended privilege grants -3. **Session timeout**: absolute session timeout (12h) applies equally to SSO and local auth; no SSO-specific override - -## References - -- OpenID Connect Core 1.0 §15.5.2 (nonce recommendation): https://openid.net/specs/openid-connect-core-1_0.html#NonceNotes -- RFC 7636 (PKCE): https://tools.ietf.org/html/rfc7636 -- Authlib security advisories: https://github.com/lepture/authlib/security/advisories -- pysaml2 security: https://github.com/IdentityPython/pysaml2/security - ---- - -**Review status:** complete for automated tooling. Manual review items SEC-SSO-02 and SEC-SSO-03 are defense-in-depth improvements, not correctness or exploitability fixes. 
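
The SEC-SSO-03 recommendation (an in-process TTL cache with refresh-on-miss) could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `JWKSCache`, the injected `fetch` callable, and the `{kid: key}` return shape are hypothetical and do not describe the existing `_get_jwks()` code.

```python
import time
from typing import Callable, Optional


class JWKSCache:
    """TTL cache keyed by `kid`, refetching when a requested kid is missing."""

    def __init__(self, fetch: Callable[[], dict], ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # hypothetical: returns {kid: key}
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: dict = {}
        self._fetched_at: Optional[float] = None

    def get_key(self, kid: str):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired or kid not in self._keys:
            # Either the TTL elapsed or the IdP may have rotated its keys.
            self._keys = self._fetch()
            self._fetched_at = now
        return self._keys.get(kid)   # None if the IdP does not publish this kid


# Fake fetcher and clock so the behaviour can be observed without a network.
calls = []


def fake_fetch() -> dict:
    calls.append(1)
    return {"kid-1": "key-material-1"}


fake_now = [0.0]
cache = JWKSCache(fake_fetch, ttl_seconds=600, clock=lambda: fake_now[0])
first = cache.get_key("kid-1")
second = cache.get_key("kid-1")      # served from cache, no second fetch
```

As written, every lookup of an unknown `kid` triggers a fetch; a production version would likely add a minimum interval between miss-driven refreshes so forged `kid` values cannot force a request to the IdP on every login.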
diff --git a/docs/architecture/KENSA_INTEGRATION.md b/docs/architecture/KENSA_INTEGRATION.md deleted file mode 100644 index 6bebfaaa..00000000 --- a/docs/architecture/KENSA_INTEGRATION.md +++ /dev/null @@ -1,219 +0,0 @@ -# OpenWatch and Kensa integration - -This document describes how the current OpenWatch (single Go binary, PostgreSQL, -systemd) integrates the Kensa compliance engine. It covers the integration -package, the responsibility boundary, the data path, and what is and is not yet -wired in the code. - -For the full, ratified responsibility split between the two products, read -[`docs/KENSA_OPENWATCH_BOUNDARY.md`](../KENSA_OPENWATCH_BOUNDARY.md) — that is the -authoritative boundary reference, and this document does not restate it in full. -For the behavioral contract enforced by tests, read -[`specs/system/kensa-executor.spec.yaml`](../../specs/system/kensa-executor.spec.yaml) -(version 2.0.0, status `approved`). - -## What Kensa is - -Kensa is a pure, single-host measurement engine maintained at -`github.com/Hanalyx/kensa`. It connects to a host over SSH, evaluates native -YAML rules, and returns a structured result with evidence. It is stateless -between invocations and single-host per invocation. It does not store results -long-term, manage exceptions, schedule scans, or provide a UI. - -OpenWatch is the long-lived fleet platform around that engine: it owns the host -inventory, the credential store, scheduling, persistence (the transaction log), -drift detection, fleet rollups, alerting, and the API plus embedded UI. - -Kensa is integrated as a Go module dependency, not a separate process or -container. Kensa runs its own SSH-based checks against native YAML rules. 
- -## Version pin - -OpenWatch pins the Kensa module in `go.mod`: - -| Item | Value | Source | -|------|-------|--------| -| Module | `github.com/Hanalyx/kensa` | `go.mod` | -| Version | `v0.2.1` | `go.mod`, `internal/kensa/types.go` (`KensaModuleVersion`) | - -The linked Kensa version is also exposed at runtime. `GET /api/v1/health` -returns a `kensa` field populated by `version.Kensa()` in -`internal/version/version.go`, which reads the version from the binary's build -info rather than a hand-edited constant, so it always reflects the Kensa module -actually linked in. - -Note: the package comment in `internal/kensa/doc.go` still reads `v0.1.1`. That -comment is stale; `go.mod` and `internal/kensa/types.go` are the authoritative -sources and both record `v0.2.1`. - -## Integration package - -All Kensa integration lives in one package, `internal/kensa/`. Per its package -doc, this is the only package in OpenWatch that imports -`github.com/Hanalyx/kensa`. - -| File | Purpose | -|------|---------| -| `internal/kensa/doc.go` | Package contract and architectural decisions | -| `internal/kensa/types.go` | `Result`, `RuleOutcome`, sentinel errors, failure-reason enum, evidence cap | -| `internal/kensa/executor.go` | `Executor`: concurrency guard, credential bridge, audit emission | -| `internal/kensa/import.go` | Blank import that keeps the module pinned in `go.mod` | -| `internal/kensa/backoff.go` | Retry backoff helpers | - -The executor is constructed once at process start and held for the process -lifetime. `internal/worker/credential_bridge.go` adapts OpenWatch's credential -service to the executor's `CredentialBridge` interface. - -### Security properties enforced by the executor - -These are stated in `internal/kensa/doc.go` and `types.go` and verified by the -spec's acceptance criteria: - -- SSH private keys are parsed in memory via `crypto/ssh.ParsePrivateKey` and - passed to Kensa through an in-memory transport. 
The key bytes never touch - `/tmp` or any disk path (spec AC-02). Kensa's `HostConfig.KeyPath` is never - populated by this wrapper. -- Decrypted credential plaintext is zeroed via a deferred `Wipe()` on every code - path — success, error, and context cancellation (spec AC-07). -- A per-host concurrency guard (a `sync.Map` of in-flight host IDs) prevents two - concurrent scans against the same host; the second caller gets `ErrHostBusy` - immediately, without opening an SSH session (spec AC-03). Different hosts run - in parallel. -- Per-rule evidence is capped at 10 MiB (`MaxEvidenceBytes`); a larger blob - fails the whole scan with `ErrEvidenceOversize` (spec AC-14). -- No engine-abstraction interface is defined; Kensa is invoked directly. There - is no `ScanEngine interface` seam (spec AC-12). - -### Framework-agnostic scans (v2.0.0 change) - -As of the executor spec v2.0.0, a scan covers the full rule corpus applicable to -the host's detected OS capabilities. There is no per-scan framework parameter; -`Result` has no `FrameworkID` field. Per-rule framework metadata lives on each -`RuleOutcome.FrameworkRefs` (for example `"cis_rhel9_v2": "5.1.12"`). This is a -breaking change from the earlier per-framework scan model. - -## Result shape - -`Executor.Run` returns a `*kensa.Result` on success (see -`internal/kensa/types.go`): - -| Field | Type | Notes | -|-------|------|-------| -| `HostID` | `uuid.UUID` | Target host | -| `Outcomes` | `[]RuleOutcome` | One entry per evaluated rule | -| `StartedAt` / `CompletedAt` | `time.Time` | Scan window | -| `PolicyVersion` | `string` | Snapshotted from the job payload | - -Each `RuleOutcome` carries `RuleID`, `Status` (`pass`, `fail`, `skipped`, -`error`), `Severity`, raw `Evidence` bytes (capped at `MaxEvidenceBytes`), -`FrameworkRefs`, and `SkipReason` (set when skipped). 
- -## Failure classification - -Failures are returned as sentinel errors and mapped to a closed -`detail.reason` enum on the `scan.failed` audit event (spec AC-06): - -| Sentinel error | `FailureReason` | -|----------------|-----------------| -| `ErrHostKeyUnknown` | `host_key_unknown` | -| `ErrCredentialDecryption` | `credential_decryption_failed` | -| `ErrEvidenceOversize` | `evidence_oversize` | -| `ErrKensaInternal` | `kensa_error` | -| `ErrHostBusy` | `host_busy` | -| (timeout) | `timeout` | -| `ErrNoCredential` | host has no credential registered | - -## Data path - -The intended scan path, per the boundary doc (§5.2) and the worker wiring: - -1. The scheduler enqueues a scan job onto the PostgreSQL-native job queue - (`SKIP LOCKED`), with a JSONB body carrying the host ID, a policy version, - and an HMAC (`internal/worker/payload.go`). -2. The worker (`openwatch worker`) dequeues the job and resolves the host's - credential through the credential bridge. -3. The executor opens an in-memory SSH session and runs the Kensa scan against - the host. -4. The result is handed to the transaction-log writer - (`internal/transactionlog/writer.go`), which persists meaningful state - changes and emits audit events. -5. Audit events (`scan.started`, `scan.completed`, `scan.failed`) are emitted - through `audit.Emit`. - -Persistence is not the executor's responsibility; the transaction-log writer -owns it. Steps 1, 2, 4, and 5 are wired in `cmd/openwatch/worker.go`. - -## What is not yet wired - -The live Kensa scan call is not yet wired into production. In -`internal/kensa/executor.go`, `NewExecutor` binds `scanFunc` to -`unwiredScanFunc`, which returns an error reading "scan path not yet wired -(production wiring pending)". The worker constructs the executor with -`kensa.NewExecutor(...)` (`cmd/openwatch/worker.go`) but does not yet call -`WithScanFunc` to inject a closure backed by the real Kensa client. 
Until that -binding lands (tracked as spec AC-18), a dequeued scan job fails with -`ReasonKensaError` rather than performing a real scan. - -`internal/kensa/import.go` is a blank import (`_ "github.com/Hanalyx/kensa/api"`) -that keeps the module pinned in `go.mod` while no Kensa symbol is called -directly; it is removed once the executor invokes real Kensa calls. - -These items are roadmap, not present behavior: - -- Live `ScanFunc` wiring in the worker (spec AC-18). -- A Kensa `Reachable(ctx, host)` reachability primitive for the liveness loop; - until it exists, OpenWatch dials hosts directly via `internal/ssh` - (boundary doc §6.3). -- Subscription to Kensa transaction-progress events for in-flight scan - visibility; `Kensa.Subscribe` is stubbed on the Kensa side - (boundary doc §6.4). -- Remediation and rollback execution through Kensa. - -Do not document these as working features until the corresponding code lands. - -## Operating the integration - -Kensa runs inside the OpenWatch binary, so there is no separate Kensa service to -start, scan, or restart. You operate the worker that drives it. - -Check the linked Kensa version: - -``` -curl -sk https://localhost:8443/api/v1/health -``` - -The response includes a `kensa` field with the embedded engine version. - -The worker process runs the scan jobs: - -``` -systemctl status openwatch.service -journalctl -u openwatch.service -f -``` - -Inspect queued and in-flight scan jobs in PostgreSQL (use the DSN from -`/etc/openwatch/secrets.env`): - -``` -psql "$OPENWATCH_DATABASE_DSN" -c \ - "select id, status, created_at from job_queue order by created_at desc limit 10;" -``` - -For install, configuration, TLS, and database setup, follow the canonical -[`docs/guides/INSTALLATION.md`](../guides/INSTALLATION.md); this -document does not duplicate those procedures. For role and permission details -that gate the scan and compliance endpoints, see -[`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md). 
- -## References - -| Topic | Source | -|-------|--------| -| Responsibility boundary | `docs/KENSA_OPENWATCH_BOUNDARY.md` | -| Executor behavioral spec | `specs/system/kensa-executor.spec.yaml` (v2.0.0, approved) | -| Integration package | `internal/kensa/` | -| Worker wiring | `cmd/openwatch/worker.go`, `internal/worker/` | -| Transaction-log writer | `internal/transactionlog/writer.go` | -| Health endpoint and Kensa version | `api/openapi.yaml` (`GET /api/v1/health`), `internal/version/version.go` | -| Install and configuration | `docs/guides/INSTALLATION.md` | -| Roles and permissions | `docs/engineering/rbac_registry.md` | diff --git a/docs/engineering/BACKEND_FUNCTIONALITY.md b/docs/engineering/BACKEND_FUNCTIONALITY.md deleted file mode 100644 index e771376c..00000000 --- a/docs/engineering/BACKEND_FUNCTIONALITY.md +++ /dev/null @@ -1,890 +0,0 @@ -# OpenWatch Backend Functionality Inventory - -> **Status (2026-06-22):** Historical rebuild input. A point-in-time catalog of -> the **Python/FastAPI** backend as it stood on 2026-04-27, used to triage the -> Go rebuild. The Python tree was archived out of the repo on 2026-06-05; this -> does **not** describe current Go code. Current behavioral SSOT is `specs/` -> (+ `specter.yaml`) and the Go packages under `internal/`. -> **Generated:** 2026-04-27 -> **Source:** `backend/app/` — Python/FastAPI implementation -> **Purpose:** Stage 1 input for the Go rebuild — complete catalog of features that exist today, so we can triage MUST / MAYBE / NEVER for the rebuild. -> **Method:** Six parallel sub-agents inventoried routes, auth/security, compliance/Kensa, infra services, data layer, and background work. This document synthesizes their reports. - ---- - -## How to read this document - -Every section is a **factual** description of what exists today. A feature being listed here does **not** mean it must be rebuilt — that's Stage 1's triage job. 
- -- Entries with **[LEGACY]** are SCAP-era or replaced by newer subsystems; high candidates for the NEVER list. -- Entries with **[DUPLICATED]** have overlapping scope with another subsystem; one of the two should be dropped. -- Entries with **[FEATURE-GATED]** require an OpenWatch+ license today. - -The "Rebuild Attention List" at the end aggregates these flags into a single triage view. - ---- - -## 1. HTTP API surface (~350+ endpoints across 18 route modules) - -Routes are mounted under `/api`. Below: one section per route module with the route table. Auth column shows the FastAPI dependency: `JWT` = any authenticated user, `require_permission(X)` = RBAC permission required, `require_role(X)` = role-based check, `None` = unauthenticated. - -### 1.1 Auth (`/api/auth`) - -Login, refresh, MFA, API keys, SSO callbacks (OIDC + SAML). - -| Method | Path | Handler | Auth | Purpose | -|---|---|---|---|---| -| POST | /login | login | None | Authenticate with username/password; optional MFA | -| POST | /register | register | None | Register new user account | -| POST | /refresh | refresh_token | JWT | Refresh access token using refresh token | -| POST | /logout | logout | JWT | Invalidate refresh token | -| GET | /me | get_current_user | JWT | Get current user profile | -| POST | /mfa/status | get_mfa_status | JWT | Check MFA enrollment | -| POST | /mfa/enroll | enroll_mfa | JWT | Start TOTP enrollment | -| POST | /mfa/validate | validate_mfa_code | JWT | Validate TOTP code | -| POST | /mfa/enable | enable_mfa | JWT | Enable MFA after enrollment | -| POST | /mfa/regenerate-backup-codes | regenerate_backup_codes | JWT | Regenerate backup codes | -| POST | /mfa/disable | disable_mfa | JWT | Disable MFA | -| POST | /api-keys | create_api_key | JWT | Create API key | -| GET | /api-keys | list_api_keys | JWT | List user's API keys | -| DELETE | /api-keys/{id} | delete_api_key | JWT | Delete API key | -| PUT | /api-keys/{id}/permissions | update_api_key_permissions | JWT | 
Update key permissions | -| GET | /sso/providers | get_sso_providers | None | List enabled SSO providers | -| GET | /sso/login | sso_login | None | Initiate SSO login | -| GET | /sso/callback/oidc/{provider_id} | oidc_callback | None | OIDC callback | -| POST | /sso/callback/saml/{provider_id} | saml_callback | None | SAML callback | - -### 1.2 Admin (`/api/admin`) - -User/role/permission/SSO/security/retention/notification administration. ~45 endpoints. - -Highlights: `/users`, `/users/roles`, `/audit/events`, `/audit/stats`, `/authorization/permissions/host/{id}`, `/authorization/check`, `/credentials/hosts/{id}`, `/notifications/channels` (CRUD + test), `/security/mfa`, `/security/templates`, `/sso/providers` (CRUD + test), `/retention` (policies + enforce), `/transactions/backfill`. - -### 1.3 Hosts (`/api/hosts`) - -Host CRUD, OS discovery, system intelligence, baseline, connectivity. ~40 endpoints. - -Notable groupings: -- **Core CRUD:** GET/POST `/`, GET/PUT/DELETE `/{id}`, DELETE `/{id}/ssh-key` -- **OS / platform discovery:** `/{id}/discover-os`, `/{id}/os-info`, `/{id}/detect-platform`, `/{id}/system-info` -- **Discovery sub-domains:** basic, network, security, compliance — single + bulk variants for each -- **Server intelligence:** `/{id}/intelligence/{services,packages,users,audit,network,baseline}` -- **Baseline:** GET/POST/DELETE `/{id}/baseline` -- **Connectivity:** `/check`, `/status`, `/{id}/ping`, `/{id}/check-connectivity`, `/{id}/state` - -### 1.4 Bulk operations (`/api/bulk/hosts`) - -`/bulk-import`, `/import-template`, `/export-csv`, `/analyze-csv`, `/import-with-mapping`. - -### 1.5 Host groups (`/api/host-groups`) - -CRUD + member management + group-level scans + compliance reports/metrics + scheduling. ~17 endpoints. - -### 1.6 Scans (`/api/scans`) - -Largest single domain. 
~50 endpoints across: -- **Scan lifecycle:** GET `/`, GET `/{id}`, POST `/legacy` **[LEGACY]**, PATCH/DELETE `/{id}`, `/{id}/stop`, `/{id}/cancel`, `/{id}/recover`, `/{id}/apply-fix` -- **Kensa subgroup:** POST `/kensa` (execute), GET `/kensa/{frameworks,health,frameworks/db,rules/framework/{f},framework/{f}/coverage,rules/{id}/framework-refs,controls/search,controls/{f}/{c},sync-stats,compliance-state/{host_id}}`, POST `/kensa/sync` -- **Templates:** CRUD on `/templates`, plus `/templates/quick`, `/templates/host/{id}`, `/templates/{id}/{apply,clone,set-default}` -- **Execution helpers:** `/validate`, `/hosts/{id}/quick-scan`, `/verify`, `/{id}/rescan/rule`, `/{id}/remediate` -- **Capabilities/profiles:** `/capabilities`, `/summary`, `/profiles` -- **Bulk scans:** POST `/bulk-scan`, GET `/bulk-scan/{session_id}/progress`, `/bulk-scan/{session_id}/cancel`, GET `/sessions` -- **Results/reports:** `/{id}/results`, `/{id}/report/{html,json,csv}`, `/{id}/failed-rules` - -### 1.7 Compliance (`/api/compliance`) - -Alerts, audit queries, baselines, drift, exceptions, OWCA, posture, scheduler, remediation. ~70 endpoints. - -- **Alerts:** list/stats/thresholds (GET/PUT) + per-alert `/acknowledge`, `/resolve`. `/alert-routing` CRUD. 
-- **Audit queries:** `/audit/queries` (CRUD + saved query execution + ad-hoc execution + statistics) **[FEATURE-GATED]** for execute/preview -- **Audit exports:** `/audit/exports` (CRUD + download + statistics) **[FEATURE-GATED]** -- **Baselines:** list/create/get -- **Drift:** `/drift`, `/drift/summary` -- **Exceptions:** list/get/request/approve/reject/revoke/check; `/exceptions/summary` -- **OWCA:** `/owca/score`, `/owca/frameworks`, `/owca/control/{f}/{c}`, `/owca/trends` **[FEATURE-GATED]**, `/owca/forecast` **[FEATURE-GATED]**, `/owca/export` -- **Posture:** `/posture` (current), `/posture/history` **[FEATURE-GATED]**, `/posture/drift` **[FEATURE-GATED]**, `/posture/snapshot`, `/posture/drift/group` **[FEATURE-GATED]**, `/posture/drift/export` -- **Scheduler:** config (GET/PUT), `/toggle`, `/status`, `/hosts-due`, `/hosts/{id}` (schedule view), `/hosts/{id}/maintenance` (PUT), `/hosts/{id}/force-scan`, `/initialize` -- **Remediation:** `/remediation` (request/get/approve/execute/pending/rollback) **[DUPLICATED]** — also exists at `/api/remediation/` - -### 1.8 Integrations (`/api/integrations`) - -Webhooks (CRUD + deliveries + test), Jira (webhook handler + field mapping), plugins (CRUD + execute + executions), ORSA plugin discovery, integration metrics. ~28 endpoints. - -### 1.9 Rules (`/api/rules`) - -Rule reference (Kensa YAML browser): list, stats, frameworks, categories, variables, capabilities, detail, refresh. - -### 1.10 Remediation (`/api/automated-fixes` and `/api/remediation`) - -Fix evaluation/execution lifecycle: evaluate-options → request-execution → approve → execute → rollback → status; pending-approvals; secure-commands; cleanup; provider listing; Kensa remediation callback. ~17 endpoints. - -### 1.11 SSH (`/api/ssh`) - -Policy (GET/POST), known-hosts CRUD, connectivity test, debug auth/log. 8 endpoints. 
- -### 1.12 System (`/api`) - -Version, capabilities, feature flags, health (integrations, service, content, summary, refresh, history), discovery config + run, scheduler config + start/stop, system credentials CRUD, session-timeout. ~30 endpoints. - -### 1.13 Transactions (`/api/transactions`, `/api/hosts/{id}/transactions`) - -Q1 transaction-log read API: list, list by rule, get, list by host. 5 endpoints. - -### 1.14 Signing (`/api/signing`) - -Public keys (GET), verify, sign. 3 endpoints. - -### 1.15 Fleet (`/api/fleet`) - -Single endpoint: `/health`. - -### 1.16 Content (`/api/content`) - -[Empty/minimal — was for SCAP content; **[LEGACY]**, mostly removed.] - -### 1.17 Plugins (`/api/plugins`) - -[Largely consolidated into `/api/integrations/plugins/` — verify what remains here.] - ---- - -## 2. Authentication, Authorization, RBAC - -### 2.1 JWT (FIPSJWTManager) - -- **Location:** `backend/app/auth.py` -- RS256 with RSA-2048 keys -- 30-min access token, 7-day refresh token, 12-hour absolute session timeout -- JTI claim for revocation tracking; PostgreSQL-backed blacklist (`token_blacklist_pg.py`) -- Public surface: `create_access_token`, `create_refresh_token`, `verify_token`, `validate_access_token`, `validate_refresh_token` -- NIST AC-12 / AC-13 compliance - -### 2.2 API key authentication - -- Prefix `owk_`; SHA256-hashed at rest; expiry-aware; per-key permissions -- Resolved via `decode_token()` in `auth.py` - -### 2.3 Password hashing (PasswordManager) - -- Argon2id, FIPS-approved -- 64 MB memory, 3 iterations, 1 parallelism, 32-byte hash, 16-byte salt - -### 2.4 MFA - -- `backend/app/services/auth/mfa.py` -- TOTP (RFC 6238), 160-bit secret, 1-window validation with replay protection -- 10 backup codes, SHA256-hashed -- QR code generation -- FIDO2 interface scaffolded but not implemented - -### 2.5 Token blacklist (PostgreSQL-backed) - -- `backend/app/services/auth/token_blacklist_pg.py` -- Replaces former Redis-based blacklist -- `token_blacklist` 
table; atomic UPSERT; cleanup of expired entries -- Fail-open on DB error (availability preference) - -### 2.6 SSO — OIDC - -- `backend/app/services/auth/sso/oidc.py` -- `authlib`-based; PKCE; standard claims validation (iss/aud/exp/nbf); rejects `alg=none` -- IdP JWKS validation - -### 2.7 SSO — SAML 2.0 - -- `backend/app/services/auth/sso/saml.py` -- `pysaml2`-based; AuthnRequest generation; signature validation; InResponseTo + RelayState anti-CSRF -- Rejects unsigned assertions - -### 2.8 SSO state storage - -- `backend/app/services/auth/sso_state.py` -- Single-use state tokens, 5-min TTL, atomic delete-on-validate -- PostgreSQL `sso_state` table (replaces Redis) - -### 2.9 RBAC - -- `backend/app/rbac.py` -- 6 roles: SUPER_ADMIN, SECURITY_ADMIN, SECURITY_ANALYST, COMPLIANCE_OFFICER, AUDITOR, GUEST -- 33 fine-grained permissions across USER, HOST, CONTENT, SCAN, RESULTS, REPORTS, SYSTEM, AUDIT, COMPLIANCE -- Decorators: `@require_permission`, `@require_any_permission`, `@require_role`, `@require_admin`, `@require_super_admin`, `@require_analyst_or_above` - -### 2.10 Authorization middleware (Zero-Trust) - -- `backend/app/middleware/authorization_middleware.py` -- Intercepts protected endpoints (16+ patterns); extracts user/resources/action; delegates to AuthorizationService; **fails secure** on any error -- Audit logs allow/deny decisions - ---- - -## 3. 
Cryptography, Signing, Audit - -### 3.1 AES-256-GCM encryption (EncryptionService) - -- `backend/app/encryption/service.py` -- NIST SP 800-38D (GCM); PBKDF2-HMAC-SHA256 (100K iter default, min 10K per SP 800-132) -- 16-byte salt + 12-byte nonce per encryption; format: `salt + nonce + ciphertext + tag` -- Configurable: DEFAULT (100K), FAST_TEST (10K), HIGH_SECURITY (200K, SHA512) -- Used for credentials, SSH keys, MFA secrets, channel configs - -### 3.2 Ed25519 evidence signing (SigningService) - -- `backend/app/services/signing/signing_service.py` -- Signs compliance evidence envelopes; supports key rotation without breaking old verifications -- Private keys encrypted at rest (via EncryptionService) -- Dev-mode flag: `OPENWATCH_SIGNING_DEV_MODE` (hard fail in production) -- Transaction-level locking prevents concurrent key generation - -### 3.3 File-based audit (SecurityAuditLogger) - -- `backend/app/auth.py` -- Logs login attempts, API key actions, scan operations to file - -### 3.4 Database audit (`audit_db.py`) - -- Writes to PostgreSQL `audit_logs` table -- Helpers: `log_audit_event`, `log_login_event`, `log_scan_event`, `log_host_event`, `log_user_event`, `log_security_event`, `log_admin_event` -- Defensive SSH conflict handling suggests in-progress migration - -**[DUPLICATED]** File-based and DB-based audit both exist; consolidate to DB-only in rebuild. - ---- - -## 4. 
Middleware - -| Middleware | File | Purpose | -|---|---|---| -| Authorization (Zero-Trust) | `authorization_middleware.py` | RBAC enforcement on protected endpoints | -| Rate limiting | `rate_limiting.py` | Token-bucket per client/endpoint; suspicious-activity tracking; environment-aware (dev 10x); HMAC-SHA256 client hashing | -| Error handling | `error_handling.py` | Global exception → standardized error response with correlation IDs | -| Metrics | `metrics.py` | Latency, status code, endpoint count collection | - -Rate limit categories: anonymous (60/min), authenticated (300/min), system (600/min), auth (15/min strict), validation (60/min). - ---- - -## 5. Compliance services - -### 5.1 Temporal compliance - -- `backend/app/services/compliance/temporal.py` -- Point-in-time posture queries (NIST SP 800-137) -- Public surface: `TemporalComplianceService.{get_posture, get_posture_history, detect_drift}` -- **[FEATURE-GATED]** historical queries - -### 5.2 Exception management - -- `backend/app/services/compliance/exceptions.py` -- Approval workflow: pending → approved → expired -- Host-level and host-group waivers; risk_acceptance + compensating_controls fields -- Auto-expiry via scheduled task - -### 5.3 Alert thresholds & lifecycle - -- `backend/app/services/compliance/alerts.py`, `alert_generator.py`, `alert_routing.py` -- 15 alert types (CRITICAL_FINDING, SCORE_DROP, EXCEPTION_EXPIRING, CONFIGURATION_DRIFT, …) -- Lifecycle: Active → Acknowledged → Resolved -- Default thresholds: score-drop 20pp/24h, non-compliant <80%, mass-drift 10+ hosts - -### 5.4 Audit queries & export - -- `backend/app/services/compliance/audit_query.py`, `audit_export.py` -- Saved queries with preview; ad-hoc execution; pagination -- Exports: JSON / CSV / PDF; signed; tracked in `audit_exports`; auto-cleanup -- **[FEATURE-GATED]** - -### 5.5 Drift detection - -- `backend/app/services/monitoring/drift.py` -- Major (≥10pp), Minor (5–10pp), Improvement (≥5pp); auto-baseline on first scan -- 
Triggers `CONFIGURATION_DRIFT` alerts - -### 5.6 Baseline management - -- `backend/app/services/compliance/baseline_management.py` -- Manual reset / promote / rolling 7-day average -- One active baseline per host; baseline_type tracks origin - -### 5.7 Adaptive compliance scheduler - -- `backend/app/services/compliance/compliance_scheduler.py` -- State-based intervals: compliant 24h, mostly 12h, partial 6h, critical 1h; max 48h -- Reads `host_compliance_schedule` table -- Dispatched via job queue every 2 minutes - -### 5.8 State writer (write-on-change) - -- `backend/app/services/compliance/state_writer.py` -- Updates `host_rule_state` every scan; writes `transactions` rows only on status change -- Captures evidence, framework_refs, skip_reason, initiator_type/id - -### 5.9 Retention policy - -- `backend/app/services/compliance/retention_policy.py` -- Default 365 days; per-resource policies (transactions, audit_exports, posture_snapshots) -- Never deletes `host_rule_state` -- Signed archive bundles **partially planned** (AC-4) - -### 5.10 Remediation service - -- `backend/app/services/compliance/remediation.py` -- License-gated (OpenWatch+ for execution); rollback support; step-level tracking; dry-run preview -- Real Kensa dry-run plans -- Snapshots retained 30 days - ---- - -## 6. 
Kensa integration & ORSA plugin - -### 6.1 KensaScanner - -- `backend/app/plugins/kensa/scanner.py` -- BaseScanner implementation; delegates to Kensa runner package -- 338 canonical YAML rules; SSH-based execution - -### 6.2 KensaExecutor & credential bridge - -- `backend/app/plugins/kensa/executor.py` -- Bridges OpenWatch's encrypted credentials to Kensa's SSH session requirements -- `OpenWatchCredentialProvider`, `KensaSessionFactory` -- Writes credentials to temp files; cleaned after use - -### 6.3 KensaORSAPlugin - -- `backend/app/plugins/kensa/orsa_plugin.py` -- ORSA v2.0 implementation -- Capabilities advertised: compliance_check, remediation, rollback, dry-run -- License-gated for remediation/rollback - -### 6.4 KensaRuleSyncService - -- `backend/app/plugins/kensa/sync_service.py` -- Syncs Kensa YAML rules to `kensa_rules` table; framework mappings to `framework_mappings` -- Hash-based change detection; dual-mapping system (inline refs + mapping files) - -### 6.5 RuleReferenceService - -- `backend/app/services/rule_reference_service.py` -- UI-facing browser of Kensa YAML rules; in-process cache; search/filter/pagination - -### 6.6 FrameworkMapper (Kensa) - -- `backend/app/plugins/kensa/framework_mapper.py` -- Maps rules to CIS RHEL 8/9/10, STIG RHEL 8/9, NIST 800-53 R5, PCI-DSS v4, FedRAMP, SRG -- PostgreSQL-backed via `framework_mappings` table - -### 6.7 ComplianceFrameworkMapper **[LEGACY]** - -- `backend/app/services/framework/mapper.py` -- SCAP-era mapper; in-memory; superseded by Kensa FrameworkMapper - -### 6.8 ORSA plugin interface & registry - -- `backend/app/services/plugins/orsa/{interface,registry}.py` -- `ORSAPlugin` ABC + `ORSAPluginRegistry` singleton -- Capability enum: COMPLIANCE_CHECK, REMEDIATION, ROLLBACK, CAPABILITY_DETECTION, DRY_RUN, PARALLEL_EXECUTION, FRAMEWORK_MAPPING - -### 6.9 Plugin governance - -- `backend/app/services/plugins/governance/service.py` -- Policy-based plugin compliance; lifecycle, evaluation vs SOC2/HIPAA/ISO-27001 
standards -- Immutable audit events for evaluations - ---- - -## 7. Scan engine - -### 7.1 Executors - -- `services/engine/executors/ssh.py` — remote scan execution via SSH -- `services/engine/executors/local.py` — local self-assessment - -### 7.2 Scanners - -- `services/engine/scanners/scap.py` — **[LEGACY]** SCAP/XCCDF (replaced by Kensa) - -### 7.3 Result parsers - -- `services/engine/result_parsers/` — XCCDF + ARF parsers; `RuleResult` dataclass; JSONB evidence in `scan_findings` - -### 7.4 Dependency resolver **[LEGACY]** - -- `services/engine/dependency_resolver.py` — SCAP content dependency walker (OVAL, CPE, tailoring) - -### 7.5 Platform detector - -- `services/engine/discovery/` — JIT OS/kernel/arch detection, per-host caching - -### 7.6 Kensa mapper (engine integration) - -- `services/engine/integration/kensa_mapper.py` — XCCDF → Kensa remediation plan **[LEGACY]** (SCAP-era bridge) - -### 7.7 Scan orchestrator - -- `services/engine/orchestration/orchestrator.py` — multi-scanner coordination, parallel execution, result merging - -### 7.8 Bulk scan orchestrator - -- `services/bulk_scan_orchestrator.py` — multi-host scanning with intelligent batching, progress tracking, **per-host zero-trust authorization** - ---- - -## 8. SSH layer - -### 8.1 Connection manager - -- `services/ssh/connection_manager.py` — Paramiko-backed; `SSHConnectionContext`; integrates `PolicyFactory` for host-key verification - -### 8.2 Key validation - -- `services/ssh/key_validator.py`, `key_parser.py` — RSA / Ed25519 / ECDSA validation; security level assessment per NIST SP 800-57; SHA256 fingerprints - -### 8.3 Known-hosts manager - -- `services/ssh/known_hosts.py` — DB-backed (not filesystem); automation-friendly verification - -### 8.4 SSH config manager - -- `services/ssh/config_manager.py` — policy persistence; per-host overrides - ---- - -## 9. 
OWCA — OpenWatch Compliance Algorithm (5 layers) - -### 9.1 Score calculator (Core, Layer 1) - -- `services/owca/core/score_calculator.py` -- `get_host_compliance_score(host_id)` → `ComplianceScore` (overall, tier, severity breakdown) - -### 9.2 Severity risk calculator (Extraction, Layer 0) - -- `services/owca/extraction/severity_calculator.py` -- NIST SP 800-30 weighted formula: critical=10, high=5, medium=2, low=0.5 - -### 9.3 Framework intelligence (Layer 2) - -- `services/owca/framework/` — per-framework analyzers (NIST 800-53, CIS, STIG, PCI-DSS, FedRAMP) - -### 9.4 Fleet aggregator (Layer 3) - -- `services/owca/aggregation/fleet_aggregator.py` — fleet-wide stats, daily trend points - -### 9.5 Trends, drift, anomalies, forecast (Layer 4) - -- `services/owca/intelligence/` — `TrendAnalyzer`, `BaselineDriftDetector`, `RiskScorer`, `CompliancePredictor`, anomaly detection - -### 9.6 Result caching - -- `services/owca/cache/redis_cache.py` — **[LEGACY-REFERENCE]** Redis cache (Redis removed); falls back to in-process `TTLCache` via `cachetools` - ---- - -## 10. System info / discovery - -### 10.1 SystemInfoCollector - -- `services/system_info/collector.py` -- Collects: packages, services, users, network interfaces, audit events, firewall rules, metrics, OS/kernel/arch, SELinux, firewall status -- Stored in `host_packages`, `host_services`, `host_users` tables - -### 10.2 Discovery services - -- `services/discovery/host.py` — basic host info (OS, kernel, hostname, arch) -- `services/discovery/compliance.py` — installed compliance tools (OpenSCAP, Kensa, ansible) -- `services/discovery/network.py` — interfaces, routes, DNS, firewall -- `services/discovery/security.py` — SELinux, firewall, audit daemon, SSH config - ---- - -## 11. 
Licensing - -- `services/licensing/service.py` -- Feature gating via `LicenseService.has_feature()` and `@requires_license()` decorator -- Free: compliance_check -- OpenWatch+: remediation, temporal_queries, structured_exceptions, priority_updates - ---- - -## 12. Notifications - -| Channel | File | Notes | -|---|---|---| -| Slack | `services/notifications/slack.py` | `slack-sdk`; webhook URL config; retry w/ exponential backoff | -| Email | `services/notifications/email.py` | `aiosmtplib`; TLS/SSL; HTML + plaintext | -| Webhook | `services/notifications/webhook.py` | HMAC-SHA256 in `X-OpenWatch-Signature` | -| Jira | `services/notifications/jira.py` | API token auth; severity → priority mapping | -| PagerDuty | `services/notifications/pagerduty.py` | Severity → urgency; dedup by rule_id+host_id | - ---- - -## 13. Integrations & webhooks - -### 13.1 Webhook service - -- `services/infrastructure/webhooks.py` -- HMAC-SHA256 signing; `X-OpenWatch-Signature` + `X-OpenWatch-Timestamp` headers -- Payload templates: `create_scan_completed_payload`, `create_scan_failed_payload` - -### 13.2 Jira service - -- `services/infrastructure/jira_service.py` -- Issue creation/update/close from compliance findings -- Token auth, project/issue type config - -### 13.3 HTTP client - -- `services/infrastructure/http.py` -- Unified `httpx`-based client with circuit breaker, timeout, retry, connection pooling -- Specialized `WebhookHttpClient` with signature verification - ---- - -## 14. 
Remediation - -### 14.1 Recommendation engine - -- `services/remediation/recommendation/` -- Generates prioritized recommendations from scan results; ORSA-compatible output -- Executors: Bash, Ansible, Kensa -- Dry-run by default; auto-generates rollback scripts for reversible operations - -### 14.2 Secure automated fixes - -- `services/remediation/secure_fixes.py` -- Command validation (blocklist); rollback support; full audit trail - -### 14.3 Command sandbox - -- `services/infrastructure/sandbox.py` -- Security levels LOW/MEDIUM/HIGH; blocks `rm -rf /`, `dd`, `format`, etc. - -### 14.4 Remediation models - -- `RemediationRecommendation`, `RemediationStep`, `RemediationJob`, `RemediationCategory`, `RemediationComplexity`, `RemediationPriority`, `RemediationSystemCapability` - ---- - -## 15. Monitoring & liveness - -| Service | File | Purpose | -|---|---|---| -| Health monitoring | `services/monitoring/health.py` | DB / scheduler / cache health checks | -| Host monitor | `services/monitoring/host.py` | Connectivity + last-scan tracking | -| Drift detection | `services/monitoring/drift.py` | Per-scan compliance change detection **[DUPLICATED]** with OWCA `BaselineDriftDetector` | -| Integration metrics | `services/monitoring/metrics.py` | Prometheus metrics for API, webhook, remediation | -| Adaptive scheduler | `services/monitoring/scheduler.py` | Score-based scan interval calculation | -| State machine | `services/monitoring/state.py` | Online/degraded/offline transitions w/ hysteresis | -| Liveness | `services/monitoring/liveness.py` | PostgreSQL-backed heartbeat (replaces Redis) | - ---- - -## 16. 
Infrastructure services - -| Service | File | Notes | -|---|---|---| -| Terminal | `infrastructure/terminal.py` | Interactive SSH terminal; TTY allocation | -| Sandbox | `infrastructure/sandbox.py` | Remediation command isolation | -| Email | `infrastructure/email.py` | Alert/report dispatch | -| HTTP | `infrastructure/http.py` | Unified `httpx` client | -| Webhooks | `infrastructure/webhooks.py` | Signature gen/verify, payload construction | -| Prometheus | `infrastructure/prometheus.py` | `/metrics` endpoint | -| Jira | `infrastructure/jira_service.py` | Ticket lifecycle | -| Config | `infrastructure/config.py` | Pydantic-validated config | -| Audit | `infrastructure/audit.py` | Structured audit logging stream | - ---- - -## 17. Validation services - -| Service | File | Notes | -|---|---|---| -| Error classification | `validation/errors.py` | SSH/scan errors → categories + remediation guidance | -| Group validation | `validation/group.py` | Pre-scan host-group compatibility check | -| Error sanitization | `validation/sanitization.py` | MINIMAL/MODERATE/STRICT levels; anti-reconnaissance | -| System info sanitization | `validation/system_sanitization.py` | Filter sensitive data from exports | -| Unified validation | `validation/unified.py` | Pre-scan validation orchestration | - ---- - -## 18. 
Background work (job queue + tasks) - -### 18.1 Job queue core - -- `services/job_queue/service.py` (`JobQueueService`) -- PostgreSQL `SKIP LOCKED` (Celery + Redis fully removed) -- Exponential backoff: `2^retry_count * 60s` -- Schema: pending/running/completed/failed; JSONB args + result; 2000-char error -- Partial index on `(queue, priority DESC, scheduled_at ASC) WHERE status='pending'` - -### 18.2 Worker - -- `services/job_queue/worker.py` -- Single-threaded polling loop; round-robin across queues -- Signal-based graceful shutdown (SIGTERM/SIGINT); SIGALRM for timeout enforcement -- Concurrency setting present but unused (single-threaded) -- Poll interval: 1.0s - -### 18.3 Scheduler - -- `services/job_queue/scheduler.py` -- Polls `recurring_jobs`; cron parser supports `*`, lists, ranges, steps (`*/5`) -- 60-second dedup window; check interval 10s -- Background daemon thread - -### 18.4 Dispatch - -- `services/job_queue/dispatch.py` (`enqueue_task`) — Celery `.delay()` replacement -- Hardcoded `_TASK_QUEUES` routing table - -### 18.5 Registry - -- `services/job_queue/registry.py` — task name → handler mapping; wraps Celery `bind=True` tasks **[LEGACY]** - -### 18.6 Job types (~30 distinct) - -| Job | Trigger | Notes | -|---|---|---| -| `ping_all_managed_hosts` | Cron */5 min | Liveness check | -| `execute_kensa_scan` | API + scheduler | Kensa engine call | -| `execute_scan_celery` | API/legacy | **[LEGACY]** SCAP-era | -| `dispatch_compliance_scans` | Cron */2 min | Adaptive dispatcher | -| `run_scheduled_kensa_scan` | Enqueued by dispatcher | Per-host adaptive scan | -| `initialize_compliance_schedules` | One-shot | Bootstrap on first deploy | -| `expire_compliance_maintenance` | Cron hourly | Clear maintenance flags | -| `create_daily_posture_snapshots` | Cron 00:30 UTC | Daily aggregation | -| `cleanup_old_posture_snapshots` | Cron 03:00 UTC | Retention enforcement | -| `check_host_connectivity` | Adaptive | TCP ping | -| `dispatch_host_checks` | Cron every 
minute | Connectivity dispatcher | -| `detect_stale_scans` | Cron */10 min | SCAN_FAILED alert generator | -| `discover_all_hosts_os` | Cron 02:00 UTC | OS discovery sweep | -| `trigger_os_discovery` | Manual | Single-host discovery | -| `batch_os_discovery` | Manual | Batched (10/job) | -| `dispatch_alert_notifications` | Event | Slack/email/webhook fan-out | -| `execute_remediation` | API | Kensa remediation execution | -| `execute_rollback_job` | Manual/auto | Reverts remediation | -| `generate_audit_export` | API async | CSV/JSON/PDF | -| `cleanup_expired_audit_exports` | Cron daily | File + row deletion | -| `expire_compliance_exceptions` | Cron | Lifecycle | -| `backfill_posture_snapshots` | Manual | Reconstruct from transactions | -| `backfill_snapshot_rule_states` | Manual | Populate JSONB | -| `backfill_transactions_from_scans` | Manual | Convert findings → transactions | -| `backfill_host_rule_state` | Manual | 5000-row chunks | -| `enrich_scan_results` | Post-scan | **[LEGACY]** No-op | -| `import_scap_content_celery` | — | **[LEGACY]** Dead code | -| `deliver_webhook` | Event | HTTP POST + HMAC retry | -| `check_kensa_updates` | Cron nightly | Update polling | -| `perform_auto_update` | Cron conditional | Auto-upgrade Kensa | -| `cleanup_old_update_records` | Cron daily | Retention | -| `enforce_retention` | Cron 04:00 UTC | Transaction-log retention | - -### 18.7 Retries & dead-letter - -- Per-job `max_retries` (default 3); exponential backoff (60s, 120s, 240s, …) -- No separate dead-letter queue; failed jobs persist in `job_queue` for audit -- Manual inspection / requeue via row update - -### 18.8 Observability gaps - -- No built-in metrics exposition for queue depth, processing time, error rate -- No HTTP status query API; internal `JobQueueService.get_status()` only - -### 18.9 Liveness service - -- `services/monitoring/liveness.py` -- TCP connect to SSH port (5s timeout); no auth -- `host_liveness` table; alerts on 2 consecutive failures - ---- - 
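The retry rule stated in §18.1 and §18.7 (`2^retry_count * 60s`, default `max_retries` of 3) can be illustrated with a minimal sketch. The function names `retry_delay_seconds` and `should_retry` are hypothetical and do not come from `JobQueueService`. Starting `retry_count` at 0 is an assumption, chosen because it reproduces the 60s, 120s, 240s sequence quoted above.

```python
def retry_delay_seconds(retry_count: int, base_seconds: int = 60) -> int:
    """Delay before the next attempt: 2^retry_count * base_seconds.

    Assumption: retry_count is 0 for the first retry, which yields the
    60s, 120s, 240s progression listed in the inventory.
    """
    if retry_count < 0:
        raise ValueError("retry_count must be non-negative")
    return (2 ** retry_count) * base_seconds


def should_retry(retry_count: int, max_retries: int = 3) -> bool:
    """True while the job still has retries left (hypothetical helper)."""
    return retry_count < max_retries


if __name__ == "__main__":
    # Print the delay for each retry permitted under the default limit.
    retry = 0
    while should_retry(retry):
        print(f"retry {retry}: wait {retry_delay_seconds(retry)}s")
        retry += 1
```

Because there is no separate dead-letter queue, a job for which `should_retry` returns false would simply stay in `job_queue` with status `failed`, as §18.7 describes.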
-## 19. Data layer - -### 19.1 Tables (40+) - -Grouped by domain. PostgreSQL-only (MongoDB fully removed PR #295). - -**Identity / RBAC:** -- `users` — accounts (note: `id` is **int**, not UUID — divergence from rest of schema) -- `mfa_audit_log`, `mfa_used_codes` -- `roles`, `user_groups`, `user_group_memberships` -- `host_access`, `host_groups`, `host_group_memberships` -- `api_keys` - -**Hosts & scans:** -- `hosts` (UUID PK) — inventory, encrypted credentials -- `scap_content` — **[LEGACY]** benchmark metadata -- `scans` (UUID PK) -- `scan_results` — **[LEGACY-ish]** legacy summary metrics (`host_rule_state` is now primary) -- `scan_findings` — Kensa results, JSONB evidence + framework_refs -- `scan_baselines`, `scan_drift_events` -- `system_credentials` — **[LEGACY-ish]** mostly superseded by per-host encrypted_credentials - -**Compliance state (Q1 model):** -- `host_rule_state` — primary state source (per host × rule, current status) -- `transactions` — write-on-change event log; JSONB `pre_state`, `post_state`, `apply_plan`, `validate_result`, `evidence_envelope`, `framework_refs` -- `posture_snapshots` — daily compliance snapshots -- `compliance_exceptions` — waivers w/ approval workflow -- `host_compliance_schedule` — adaptive scan intervals - -**Kensa & frameworks:** -- `kensa_rules` (synced metadata) -- `framework_mappings` (control → rule) - -**Alerts & notifications:** -- `alert_settings` -- `alert_routing_rules` -- `notification_channels` (config_encrypted JSONB) -- `notification_deliveries` - -**Auth & SSO:** -- `token_blacklist_pg` -- `signing_keys` -- `sso_providers` (config_encrypted JSONB) - -**Audit & retention:** -- `audit_logs` (global) -- `audit_exports` -- `integration_audit_log` -- `retention_policies` - -**Job queue:** -- `job_queue` (JSONB args + result) -- `recurring_jobs` (cron schedule) - -**Liveness:** -- `host_liveness` - -**System config:** -- `system_settings` -- `webhook_endpoints`, `webhook_deliveries` - -### 19.2 Repositories - 
-**No active repository pattern.** `framework_repository.py.disabled` is a legacy artifact. OpenWatch routes use direct SQL builders (`QueryBuilder`, `InsertBuilder`, `UpdateBuilder`, `DeleteBuilder`) — 100% adoption. - -### 19.3 Pydantic schemas (by domain) - -Auth, hosts, scans (`ScanStatus`, `RuleResultStatus`, `ScanConfiguration`, `ScanResultSummary`), compliance (`ComplianceSystemInfo`, `OperationalSystemInfo`, `AdminSystemInfo`), alerts, authorization (`ResourceType`, `ActionType`, `PermissionEffect`, `PermissionPolicy`, `AuthorizationContext`, `BulkAuthorizationRequest`), plugins (`PluginType`, `PluginStatus`, `PluginManifest`, `PluginExecutor`, `PluginPackage`), remediation (`RemediationStatus`, `RemediationTarget`, `RemediationResult`), posture, audit queries, transactions, exceptions, rule reference. - -### 19.4 Recent migrations (040+) - -| ID | Description | -|---|---| -| 040 | Rename Aegis → Kensa remediation_id | -| 041 | Manual remediation status | -| 042 | Make scans.content_id nullable (Kensa: no SCAP) | -| 043 | Add `has_remediation` flag | -| 044 | **transactions** table (write-on-change) | -| 045 | **host_liveness** table (heartbeat) | -| 046 | **notification_channels + notification_deliveries** | -| 047 | **sso_providers** (OIDC + SAML) | -| 048 | **host_rule_state** (primary state table) | -| 049 | **job_queue + recurring_jobs** (Celery replacement) | -| 050 | **token_blacklist_pg** (Redis replacement) | -| 051 | **signing_keys** | -| 052 | **retention_policies** | -| 053 | **alert_routing_rules** | -| 054 | Seed default recurring_jobs | - -### 19.5 Connection / session management - -- PostgreSQL 15+; `QueuePool` (size=10, max_overflow=20, pool_recycle=3600s) -- TLS in production (sslmode/sslcert/sslkey/sslrootcert) -- 10s connection timeout, `application_name=openwatch` -- Sync SQLAlchemy 2.0 ORM via `SessionLocal()` (NOT async) -- FastAPI `depends.get_db()` yields per-request session - ---- - -## 20. 
External dependencies - -| Package | Version | Use | -|---|---|---| -| Paramiko | 3.5.0 | SSH protocol | -| Kensa | v1.2.5 | **[STALE — now Go]** Compliance scanning engine; rules path discovery | -| slack-sdk | ≥3.27.0 | Slack notifications | -| aiosmtplib | 5.1.0 | Async SMTP | -| httpx | 0.28.1 | HTTP client | -| Cryptography | 46.0.5 | AES-256-GCM, RS256 JWT | -| Pydantic | 2.12.5 | Request/response validation | -| SQLAlchemy | 2.0.46 | PostgreSQL ORM | -| aiohttp | 3.13.3 | **[NARROW]** Kensa updater plugin only | - -> **Note:** Memory says Kensa was migrated to Go before 2026-04-26; the inventory above reflects the Python integration as it currently lives in `backend/`. The Go rebuild will use Kensa Go directly. - ---- - -## 21. Rebuild attention list (triage candidates) - -Aggregated from the **[LEGACY]**, **[DUPLICATED]**, **[FEATURE-GATED]** flags above. This is the input to Stage 1 triage — every item below should be evaluated for MUST / MAYBE / NEVER. - -### 21.1 Strong NEVER candidates (legacy, replaced, dead) - -- **SCAP/XCCDF transformation chain** — `services/engine/scanners/scap.py`, `dependency_resolver.py`, `result_parsers/xccdf.py`, `kensa_mapper.py` (XCCDF → Kensa bridge). All replaced by direct Kensa execution. 
-- **`POST /api/scans/legacy`** — explicitly marked legacy -- **`enrich_scan_results` task** — DEPRECATED no-op -- **`import_scap_content_celery` task** — dead code -- **`execute_scan_celery` task** — SCAP-era; superseded by `execute_kensa_scan` -- **`ComplianceFrameworkMapper`** (`services/framework/mapper.py`) — superseded by Kensa `FrameworkMapper` -- **`framework_repository.py.disabled`** — legacy file, already disabled -- **OWCA `redis_cache.py` Redis path** — Redis removed; only `cachetools` fallback used -- **Celery references** in `tasks/__init__.py`, `registry.py` `_wrap_bound_task`, `dispatch.py` `_TASK_QUEUES` comment, `seed_schedule.py` "Translations from celery_app.py" comment — orphaned; cleanup-only -- **File-based audit (`SecurityAuditLogger`)** — DB audit (`audit_db.py`) is the canonical path; consolidate - -### 21.2 DUPLICATED — pick one - -- **Remediation endpoints** at `/api/remediation/` and `/api/compliance/remediation/` — same workflow, two surfaces -- **Scheduler config** at `/api/compliance/scheduler/` and `/api/system/scheduler/` — clarify ownership -- **Host credentials** at `/api/admin/credentials/` and `/api/system/credentials/` -- **Drift detection** — `services/monitoring/drift.py` (Drift Detection Service) and `services/owca/intelligence/baseline_drift.py` (Baseline Drift Detector) -- **Rule reference** — `RuleReferenceService` (Kensa YAML loader, authoritative) and legacy `RuleService` (cache-based, SCAP-era) - -### 21.3 FEATURE-GATED — verify customer demand before rebuilding - -- Audit query preview/execute (`/api/compliance/audit/queries/preview`, `/execute`) -- Audit exports (`/api/compliance/audit/exports`) -- OWCA trends / forecast (`/api/compliance/owca/trends`, `/forecast`) -- Posture history / drift / group-drift (`/api/compliance/posture/history`, `/drift`, `/drift/group`) -- Structured exceptions (full workflow) -- Priority Kensa updates - -### 21.4 Architectural divergences to resolve in rebuild - -- **`users.id` is 
`int`; everything else is UUID** — pick one in the rebuild (UUID consistent with rest of schema) -- **Sync SQLAlchemy in async FastAPI app** — Go rebuild uses pgx natively, eliminates this seam -- **`scan_results` summary table coexists with `host_rule_state`** — primary read path is `host_rule_state`; decide whether `scan_results` remains -- **`system_credentials` largely unused** — per-host encrypted_credentials is the active path - -### 21.5 Implementation gaps (planned but incomplete) - -- **Server Intelligence collection** — schedule and table scaffolding present, but full telemetry sweep (packages, services, users, network, audit, metrics) only partially implemented -- **Signed archive bundles** in retention policy (AC-4) — marked future enhancement -- **Baseline rolling-average auto-update** — method exists, not yet enabled -- **FIDO2 MFA** — interface scaffolded, no implementation - ---- - -## 22. Quantitative summary - -- **HTTP endpoints:** ~350 across 18 route modules -- **Database tables:** 40+ (PostgreSQL only) -- **Recent migrations:** 040–054 (15 in the Q1 wave) -- **Job types:** ~32 distinct (3 are dead code, 1 deprecated) -- **External Python packages:** 9 primary (Paramiko, Kensa, slack-sdk, aiosmtplib, httpx, Cryptography, Pydantic, SQLAlchemy, aiohttp) -- **Notification channels:** 5 (Slack, email, webhook, Jira, PagerDuty) -- **OWCA layers:** 5 (extraction, core, framework, aggregation, intelligence) -- **Compliance frameworks mapped:** 6+ (CIS, STIG, NIST 800-53, PCI-DSS, FedRAMP, SRG) -- **Roles:** 6 (SUPER_ADMIN, SECURITY_ADMIN, SECURITY_ANALYST, COMPLIANCE_OFFICER, AUDITOR, GUEST) -- **Permissions:** 33 fine-grained -- **SSO providers:** 2 (OIDC, SAML 2.0) - ---- - -## How this informs the rebuild - -This inventory is descriptive, not prescriptive. 
The Stage 1 triage step (per `app/docs/openwatch_roadmap.md`) takes this document plus telemetry from the running system plus operator interviews and produces three buckets: - -- **MUST** — rebuild in Phase 1 (high-usage or critical-infrequent) -- **MAYBE** — rebuild only if cheap (moderate usage) -- **NEVER** — explicitly drop, log in `app/docs/not_rebuilt.md` - -The "Rebuild attention list" (§21) is the pre-flagged input to that triage. Anything not flagged there still requires evidence before being considered MUST. diff --git a/docs/engineering/MAYBE_BACKEND_FUNCTIONALITY.md b/docs/engineering/MAYBE_BACKEND_FUNCTIONALITY.md deleted file mode 100644 index ebe810ee..00000000 --- a/docs/engineering/MAYBE_BACKEND_FUNCTIONALITY.md +++ /dev/null @@ -1,264 +0,0 @@ -# MAYBE — Backend Functionality (Phase 1+ Backlog) - -> **Status (2026-06-22):** Historical rebuild-triage input, derived from the -> Python/FastAPI inventory. Deferred-feature backlog for the Go rebuild, not -> Python code. Current SSOT is `specs/` + the Go packages under `internal/`. -> See [BACKEND_FUNCTIONALITY.md](BACKEND_FUNCTIONALITY.md). -> **Source:** `docs/engineering/BACKEND_FUNCTIONALITY.md`, triaged 2026-04-27 -> **Rule:** Items here are deferred from Phase 1 unless usage evidence or a specific customer requirement promotes them. They will be considered for Phase 1+ backlog after MVP ships. -> **Method:** Static analysis from inventory. Without telemetry, this list is best-guess; some items may move to MUST or NEVER once usage data arrives. - ---- - -## Triage criteria for MAYBE - -An item lands here if **any** of the following holds: - -1. **Feature-gated (OpenWatch+)** — only paid customers use it; verify subscription mix before rebuilding. -2. **Planned but incomplete** — scaffolded in current code, never finished. Decide: complete it or drop it. -3. **Moderate usage suspected** — operationally useful but not load-bearing for core compliance loop. -4. 
**Customer-dependent** — value depends on which customer profile dominates (e.g., Jira vs PagerDuty depends on what the customer runs). -5. **Advanced variant of a MUST item** — the basic version is in MUST; the elaborated version is here. - ---- - -## Each item has a "trigger" — what evidence would promote it to MUST - -If telemetry / operator feedback hits the trigger, the item moves to MUST. Otherwise it stays deferred or moves to NEVER after the rebuild ships. - ---- - -## Authentication — advanced - -| Component | Trigger to promote | Notes | -|---|---|---| -| FIDO2 / WebAuthn MFA | Customer requirement for hardware-token MFA | Currently scaffolded interface only; no implementation. Decide: ship in Phase 1+ or drop entirely. | -| MFA backup-code regeneration UI flow | Operator usage data on backup-code use | Backup codes themselves are MUST; the regeneration self-service flow can be deferred to admin-driven path | -| API key permission updates (`PUT /api-keys/{id}/permissions`) | Operator demand for fine-grained key permissions | Basic API key CRUD is MUST; granular permission editing deferred | - ---- - -## Compliance workflow — advanced - -| Component | Trigger to promote | Notes | -|---|---|---| -| Temporal compliance — historical posture queries | Active OpenWatch+ subscriptions using the feature | Feature-gated today; verify paid usage | -| Posture history endpoint | Same as above | `/api/compliance/posture/history` | -| Posture drift analysis | Same as above | `/api/compliance/posture/drift` | -| Group drift analysis | Same as above | `/api/compliance/posture/drift/group` | -| Drift export | Same as above | `/api/compliance/posture/drift/export` | -| Compliance forecast | Same as above | OWCA-backed predictions; Layer 4 | -| Compliance trends | Same as above | OWCA Layer 3/4 | -| Audit query system (saved queries CRUD) | Active customer usage of saved queries | Feature-gated | -| Audit query preview/execute | Same as above | 
`/api/compliance/audit/queries/{preview,execute}` | -| Audit ad-hoc query | Same as above | `/api/compliance/audit/queries/execute` (ad-hoc) | -| Audit exports (JSON/CSV/PDF) | Customer audit/regulatory request | Feature-gated; signed bundles incomplete | -| Audit export download | Same as above | `/api/compliance/audit/exports/{id}/download` | -| Baseline rolling-average auto-update | Operator demand for automatic baseline drift | Method exists but not enabled | -| Compliance exceptions — full approval state machine | Enterprise customer demand for multi-stage approval | Basic request/approve/reject/revoke is MUST; advanced workflow (multi-approver, delegation) deferred | -| Alert routing rules (per-severity → channel) | Operator demand for fine-grained routing | Basic dispatch (alert → all configured channels) is MUST | -| Advanced alert types (SCORE_DROP, drift severity tiers, EXCEPTION_EXPIRING, MASS_DRIFT) | Active alert configuration data | Basic 5 alert types are MUST; the other 10+ types deferred | - ---- - -## Remediation (entire subsystem — license-gated) - -| Component | Trigger to promote | Notes | -|---|---|---| -| Remediation recommendation engine | Active OpenWatch+ subscriptions using remediation | Feature-gated; depends on whether customers use it | -| Secure automated fixes | Same as above | Command sandboxing, validation, rollback support | -| Command sandbox | Same as above | Required only if remediation is rebuilt | -| Rollback support | Same as above | 30-day snapshot retention | -| Remediation API endpoints (`/automated-fixes/*`, `/remediation/*`) | Same as above | Note: also resolves the §21.2 duplication — collapse into one path | -| Kensa remediation dry-run integration | Same as above | Already exists in Kensa Go side; OpenWatch is the orchestrator | -| Remediation execution task (`execute_remediation`) | Same as above | Job-queue side | -| Rollback execution task (`execute_rollback_job`) | Same as above | Job-queue side | -| Remediation 
status tracking | Same as above | `RemediationJob`, step-level results | -| Remediation provider listing | Same as above | Multiple executors (Bash, Ansible, Kensa) | - -> **Recommendation:** Defer all remediation to Phase 1+. Compliance scanning and reporting is the primary value; remediation is the up-sell. Get core working before paying its rebuild cost. - ---- - -## Discovery — beyond basic - -| Component | Trigger to promote | Notes | -|---|---|---| -| Network discovery (interfaces, routes, DNS, firewall) | Customer evidence using network-aware compliance | `services/discovery/network.py` | -| Network topology map | Same as above | Likely low usage | -| Security posture discovery (SELinux, firewalld, audit daemon) | Compliance framework requirement (e.g., STIG audit-daemon checks) | Some Kensa rules already check these; may not need separate discovery | -| Compliance baseline discovery | Operator feedback on baseline auto-detect | Distinct from baseline management | -| Bulk variants of all discovery types | Operator feedback on fleet-scale discovery | Single-host versions are MUST | - ---- - -## Server intelligence (currently incomplete) - -| Component | Trigger to promote | Notes | -|---|---|---| -| Package inventory collection | Customer demand for package CVE matching | Schema exists; collection partial | -| Service inventory collection | Same as above | Schema exists | -| User inventory collection | Same as above | Schema exists | -| Network connection collection | Audit trail demand | Likely low value | -| Audit event collection | Compliance audit requirement | Overlaps with Kensa audit-daemon checks | -| Metrics collection | Operational monitoring demand | Likely better solved by Prometheus on the host | -| Compliance baseline collection | Distinct from `scan_baselines`? | Verify scope before promoting | - -> **Open question:** Server intelligence is part of the "Compliance OS" direction in `docs/openwatchos/`. 
It needs explicit go/no-go decision; if go, all of these become MUST. - ---- - -## OWCA — advanced layers - -> **Updated 2026-04-28 from static-analysis evidence.** Layer 2 framework -> intelligence and most of Layer 3/4 are confirmed unused by static analysis -> and have been moved to NEVER. Only the items still genuinely deferrable -> remain here. - -| Component | Trigger to promote | Notes | -|---|---|---| -| Anomaly detector | Customer evidence of demand | Statistical anomalies in compliance state | -| Risk scoring (custom NIST SP 800-30 weighted) | Customer demand for risk-weighted dashboards | If demand surfaces, build fresh — don't port the current `risk_scorer.py` (now NEVER) | -| Forecast / prediction surface | OpenWatch+ subscriber demand surfaces in telemetry | Same — fresh build if demanded | - -**Moved to NEVER (2026-04-28, evidence-backed):** -- OWCA framework intelligence (Layer 2) — `cis.py`, `stig.py`, `nist_800_53.py`, `base.py`, `models.py`. Replaced by Kensa `FrameworkMapper`. -- OWCA fleet aggregator (Layer 3). -- OWCA trend analyzer, predictor, risk scorer, baseline drift detector (Layer 4). - -> **Recommendation:** Rebuild only Layers 0–1 in Phase 1 (in MUST: `score_calculator`, `severity_calculator`). Anything more advanced is a fresh build if and when customer demand surfaces — not a port. 
- ---- - -## Notifications — beyond basic 3 - -| Component | Trigger to promote | Notes | -|---|---|---| -| Jira channel + Jira service integration | Customer using Jira as ticketing system | Includes Jira webhook receiver and field mapping | -| Jira webhook receiver | Same as above | `/integrations/jira/webhook` | -| Jira field mapping | Same as above | `/integrations/jira/field-mapping` | -| PagerDuty channel | Customer using PagerDuty for incident response | Severity → urgency mapping | -| Channel test endpoint (`/test`) | Operator workflow data | Useful but not core | - ---- - -## Plugins (custom plugin system) - -| Component | Trigger to promote | Notes | -|---|---|---| -| Plugin import / install | Customer demand for custom rules / scanners beyond Kensa | `routes/integrations/plugins/` | -| Plugin execution endpoint | Same as above | `/integrations/plugins/{id}/execute` | -| Plugin execution history | Same as above | Audit trail | -| Plugin governance service | Enterprise compliance customer demand | SOC2/HIPAA/ISO-27001 evaluation against plugins | -| Plugin statistics / overview | Operator workflow data | Likely low value | -| Plugin auto-update (Kensa) | Operator workflow data | `tasks/plugin_update_tasks.py` — Kensa Go integration may handle this differently | - -> **Open question:** Is the plugin system actually used outside Kensa? If only Kensa is plugged in, the entire system is over-engineered scaffolding and most of this moves to NEVER. - ---- - -## Bulk operations - -| Component | Trigger to promote | Notes | -|---|---|---| -| Bulk CSV analysis | Operator workflow data | Pre-import inspection | -| Bulk import with column mapping | Operator workflow data | Advanced variant of basic CSV import | -| Bulk discovery (network/security/compliance) | Operator workflow data | Bulk single-host equivalents are MUST | - -> Basic CSV import + export are MUST; the analyze + map workflow is advanced. 
- ---- - -## Scan engine — secondary paths - -| Component | Trigger to promote | Notes | -|---|---|---| -| Local executor | Self-assessment use case (container scans itself) | `services/engine/executors/local.py` | -| Scan orchestrator (multi-scanner) | Customer running both Kensa + custom plugin | Only valuable if multiple ORSA plugins active | -| Scan template clone | Operator workflow data | Useful but not core | -| Scan template default-set | Operator workflow data | Convenience | -| Scan validate endpoint (`/scans/validate`) | Operator workflow data | Pre-flight check | -| Quick scan helpers (`/templates/quick`, `/hosts/{id}/quick-scan`) | Operator workflow data | UX convenience | -| Rescan single rule | Operator workflow data | Targeted re-execution | -| Scan verify endpoint | Operator workflow data | Result verification | - ---- - -## Backfill / admin tooling - -| Component | Trigger to promote | Notes | -|---|---|---| -| Transaction backfill | Migration from current OpenWatch instance | Only needed if migrating existing data | -| Posture snapshot backfill | Same as above | Reconstruct historical snapshots | -| Snapshot rule-state backfill | Same as above | Populate JSONB | -| Host rule-state backfill | Same as above | 5000-row chunks | - -> **Recommendation:** These are migration tools, not features. If the rebuild is a clean break (no migration of existing customers), all four go to NEVER. If there's a migration path, all four become MUST during the migration window only. 
- ---- - -## Operational / debug - -| Component | Trigger to promote | Notes | -|---|---|---| -| Terminal service (interactive SSH) | Operator demand for debug access via UI | Could be a CLI tool instead | -| SSH debug endpoints (`/api/ssh/debug/*`) | Same as above | Test authentication, paramiko log | -| Discovery acknowledge-failures | Operator workflow data | OS discovery failure management | -| Manual scheduler controls (`/scheduler/start`, `/stop`, `/reset-defaults`) | Operator workflow data | Convenience | - ---- - -## Retention — advanced - -| Component | Trigger to promote | Notes | -|---|---|---| -| Signed archive bundles before deletion (AC-4) | Compliance / audit requirement | Currently incomplete; complete or drop | -| Per-resource retention policy granularity | Customer demand for differentiated retention | Basic retention is MUST | - ---- - -## Health & capabilities — advanced - -| Component | Trigger to promote | Notes | -|---|---|---| -| Health history (service / content) | Operator demand for health timeline | Current health is MUST | -| Health refresh endpoint | Operator workflow data | On-demand re-check | -| Capabilities by sub-domain (network/security/compliance/discovery capabilities) | Operator UI need | Single `/capabilities` is MUST; per-domain endpoints can collapse via API redesign | - ---- - -## Schema (MAYBE tables) - -The following tables back MAYBE features. They are kept in the schema but not actively used by Phase 1 code paths. - -- `audit_exports` (only if audit query system promoted) -- `alert_routing_rules` (only if advanced alert routing promoted) -- `posture_snapshot.rule_states` JSONB (only if temporal queries promoted — basic snapshot row stays) -- `system_credentials` (largely unused; per-host encrypted_credentials is the active path) — possible NEVER - ---- - -## What this list captures vs ignores - -**Captured:** every feature-gated, partially-implemented, or moderate-usage component from the inventory. 
The triage assumes Phase 1 ships without these and adds them when usage data justifies. - -**Not captured:** items that are clear NEVER (legacy, dead, replaced, duplicated) — those live in `NEVER_BACKEND_FUNCTIONALITY.md`. - -**Risk:** without telemetry, this list is overconservative. Real customer usage may move 30–50% of these into NEVER (genuinely unused) or into MUST (load-bearing for the customer mix). Re-triage after Stage 1 telemetry collection. - ---- - -## Action protocol when MAYBE items get promoted - -When an item moves MAYBE → MUST during Phase 1+ work: - -1. Add it to `MUST_BACKEND_FUNCTIONALITY.md` with the trigger evidence -2. Remove from this file -3. Update `app/docs/openwatch_roadmap.md` decision log -4. Estimate cost; check against the "doable in <2 weeks" bar from roadmap Stage 3 - -When an item moves MAYBE → NEVER (no demand surfaced): - -1. Add it to `NEVER_BACKEND_FUNCTIONALITY.md` with the rationale -2. Remove from this file -3. The deletion is the win — that's the rebuild's whole point diff --git a/docs/engineering/MUST_BACKEND_FUNCTIONALITY.md b/docs/engineering/MUST_BACKEND_FUNCTIONALITY.md deleted file mode 100644 index d8946473..00000000 --- a/docs/engineering/MUST_BACKEND_FUNCTIONALITY.md +++ /dev/null @@ -1,355 +0,0 @@ -# MUST — Backend Functionality (Phase 1 Rebuild) - -> **Status (2026-06-22):** Historical rebuild-triage input, derived from the -> Python/FastAPI inventory. It lists *required behavior* to carry into the Go -> rebuild, not Python code to keep. Current SSOT is `specs/` + the Go packages -> under `internal/`. See [BACKEND_FUNCTIONALITY.md](BACKEND_FUNCTIONALITY.md). -> **Source:** `docs/engineering/BACKEND_FUNCTIONALITY.md`, triaged 2026-04-27 -> **Rule:** Items here are non-negotiable for Phase 1. Rebuilding without them produces a non-viable compliance platform. -> **Method:** Static analysis from inventory + architectural reasoning. 
**No telemetry data yet.** Items marked **[VALIDATE]** should be confirmed against deployment evidence before final lock-in. - ---- - -## Triage criteria for MUST - -An item lands here if **any** of the following holds: - -1. **Core compliance loop:** authenticate → discover host → run scan → record state → query posture. Without it, OpenWatch isn't a compliance scanner. -2. **Security baseline:** auth, encryption, RBAC, audit. Compliance product without these is dead on arrival. -3. **Data-model load-bearing:** Q1 architecture (transaction-log + write-on-change + adaptive scheduling) is the proven foundation; the schema is locked for the rebuild. -4. **Operational floor:** job queue, scheduling, liveness, health, retention. The platform can't run without these. - -Anything that fails all four → MAYBE or NEVER. - ---- - -## Authentication & Authorization - -| Component | Source (current) | Why MUST | -|---|---|---| -| JWT (RS256, RSA-2048) | `auth.py` `FIPSJWTManager` | Auth foundation; FIPS-aligned | -| Password hashing (Argon2id, 64MB / 3 iter) | `auth.py` `PasswordManager` | Security baseline | -| API key auth (prefix `owk_`, SHA256-hashed) | `auth.py` | Required for agent / service-to-service auth — agent-first principle | -| MFA TOTP + backup codes | `services/auth/mfa.py` | Compliance product baseline (NIST IA-2) | -| Token blacklist (PostgreSQL) | `services/auth/token_blacklist_pg.py` | Logout / revocation; replaces Redis cleanly | -| RBAC — 6 roles, 33 permissions | `rbac.py` | Authorization foundation; map directly into Go | -| SSO — OIDC | `services/auth/sso/oidc.py` | **[VALIDATE]** Federal/enterprise customers commonly require it; lean MUST | -| SSO — SAML 2.0 | `services/auth/sso/saml.py` | **[VALIDATE]** Same reasoning as OIDC | -| SSO state storage (PostgreSQL, single-use, 5-min TTL) | `services/auth/sso_state.py` | Required by SSO providers above | - ---- - -## Cryptography & Audit - -| Component | Source | Why MUST | -|---|---|---| -| AES-256-GCM 
`EncryptionService` | `encryption/service.py` | Credentials, SSH keys, MFA secrets, channel configs all depend on it | -| Ed25519 `SigningService` (with key rotation) | `services/signing/signing_service.py` | Evidence integrity; matches roadmap §Phase 1 (`crypto/ed25519` stdlib in Go) | -| DB-based audit logging | `audit_db.py` | Audit-as-API contract (roadmap §Agent-First); the canonical audit path | -| Audit log table (`audit_logs`) | model | Structured event store for compliance | -| Integration audit log (`integration_audit_log`) | model | Cross-service audit trail | - ---- - -## Middleware - -| Component | Source | Why MUST | -|---|---|---| -| Authorization (Zero-Trust) | `middleware/authorization_middleware.py` | RBAC enforcement on every protected route | -| Rate limiting (token bucket) | `middleware/rate_limiting.py` | DoS protection; auth brute-force defense | -| Error handling (correlation IDs) | `middleware/error_handling.py` | Roadmap requires `X-Correlation-Id` end-to-end | -| Metrics | `middleware/metrics.py` | Observability floor | - ---- - -## Hosts & host management - -| Component | Source | Why MUST | -|---|---|---| -| Host CRUD (list, create, get, update, delete) | `routes/hosts/` | Core resource | -| Host group CRUD + member management | `routes/host_groups/` | Operational unit for scans | -| Bulk import (CSV) + analyze + export | `routes/hosts/bulk_operations.py` | **[VALIDATE]** common operator workflow | -| Basic host discovery (OS, kernel, hostname, arch) | `services/discovery/host.py` | Required to select scan profile | -| Compliance tools discovery | `services/discovery/compliance.py` | Required to confirm Kensa eligibility | -| Host connectivity check / ping | `routes/hosts/` | Operational requirement | -| Host state read (`/{id}/state`) | `routes/hosts/` | Compliance posture entry point | - ---- - -## SSH layer - -| Component | Source | Why MUST | -|---|---|---| -| Connection manager | `services/ssh/connection_manager.py` | Every scan 
executes over SSH | -| Key validation (RSA / Ed25519 / ECDSA, NIST SP 800-57) | `services/ssh/key_validator.py`, `key_parser.py` | Security baseline | -| Known hosts manager (DB-backed) | `services/ssh/known_hosts.py` | Host-key verification | -| SSH config / policy manager | `services/ssh/config_manager.py` | Policy enforcement (cipher allowlist, key types) | - ---- - -## Scan engine (Kensa-only path) - -| Component | Source | Why MUST | -|---|---|---| -| SSH executor | `services/engine/executors/ssh.py` | Remote scan execution | -| Platform detector | `services/engine/discovery/` | JIT OS detection per scan | -| Bulk scan orchestrator (zero-trust per-host auth) | `services/bulk_scan_orchestrator.py` | Multi-host scanning is core | -| Scan lifecycle (start, stop, cancel, recover) | `routes/scans/` | Operational | -| Scan templates (CRUD, apply, clone) | `routes/scans/` | **[VALIDATE]** Operator convenience; lean MUST | -| Scan results read API (`/scans/{id}/results`, `/failed-rules`) | `routes/scans/` | Output of every scan | -| Scan reports (HTML/JSON/CSV — content-negotiated in rebuild) | `routes/scans/` | Required artifact format | - ---- - -## Kensa integration (Go-to-Go in rebuild) - -| Component | Source | Why MUST | -|---|---|---| -| Kensa scanner adapter | `plugins/kensa/scanner.py` | The compliance engine | -| Credential bridge (encrypted creds → Kensa SSH session) | `plugins/kensa/executor.py` | Required for SSH-based scanning | -| ORSA plugin wrapper | `plugins/kensa/orsa_plugin.py` | Kensa is the reference ORSA plugin | -| Kensa rule sync service | `plugins/kensa/sync_service.py` | Keeps `kensa_rules` and `framework_mappings` current | -| Rule reference service (Kensa YAML browser) | `services/rule_reference_service.py` | Backs the rules UI | -| Framework mapper (Kensa, PG-backed) | `plugins/kensa/framework_mapper.py` | CIS/STIG/NIST/PCI-DSS/FedRAMP mapping | -| ORSA plugin interface + registry | `services/plugins/orsa/{interface,registry}.py` | 
Extensibility contract | - -> **Note:** Implementation shape changes in the Go rebuild (Go-to-Go integration with Kensa Go), but the conceptual surface (scanner, sync, rule reference, framework mapping, ORSA interface) is unchanged. - ---- - -## Compliance state (Q1 model — load-bearing) - -| Component | Source | Why MUST | -|---|---|---| -| State writer (write-on-change pattern) | `services/compliance/state_writer.py` | Core data-flow primitive | -| `host_rule_state` table | migration 048 | Primary read source for compliance state | -| `transactions` table | migration 044 | Append-only event log; powers temporal & audit queries | -| Transaction log read API | `routes/transactions/` | Required for audit/agent-readable history | -| Posture (current) | `services/compliance/temporal.py` (current-state path) | Real-time compliance view | -| `posture_snapshots` table | model | Daily snapshots; enables historical queries | -| `host_compliance_schedule` table | model | Adaptive scheduling foundation | -| Daily posture snapshot creation (cron) | `tasks/posture_tasks.py` `create_daily_posture_snapshots` | Continuous monitoring (NIST SP 800-137) | - ---- - -## Compliance workflow (essentials) - -| Component | Source | Why MUST | -|---|---|---| -| Drift detection (basic — scan-vs-baseline) | `services/monitoring/drift.py` | Core compliance signal | -| Baseline management (manual reset, promote) | `services/compliance/baseline_management.py` | Required to manage drift events | -| `scan_baselines`, `scan_drift_events` tables | model | Drift tracking storage | -| Compliance exceptions (request, approve, reject, revoke, check) | `services/compliance/exceptions.py` | Governance baseline | -| `compliance_exceptions` table | model | Exception storage with approval workflow | -| Adaptive compliance scheduler | `services/compliance/compliance_scheduler.py` | Auto-scan engine — core to Compliance OS direction | -| Adaptive scheduler dispatcher (cron */2 min) | 
`tasks/compliance_scheduler_tasks.py` | Runs the scheduler | -| Maintenance window expiry (cron hourly) | `tasks/compliance_scheduler_tasks.py` `expire_compliance_maintenance` | Cleanup | -| Exception expiry task | `tasks/exception_tasks.py` `expire_compliance_exceptions` | Lifecycle | - ---- - -## Alerts (basic lifecycle) - -| Component | Source | Why MUST | -|---|---|---| -| Alert lifecycle (create, list, acknowledge, resolve) | `services/compliance/alerts.py` | Operator notification floor | -| Basic alert types (CRITICAL_FINDING, SCORE_DROP, EXCEPTION_EXPIRING, CONFIGURATION_DRIFT, HOST_UNREACHABLE) | `services/compliance/alerts.py` | The 5 alert types operators actually use | -| Alert generator | `services/compliance/alert_generator.py` | Emits alerts on signal | -| Stale-scan detector (cron */10 min) | `tasks/stale_scan_detection.py` | Generates SCAN_FAILED alerts | -| `alert_settings` table | model | Per-user alert preferences | - -> Alert routing rules and the full 15-type alert taxonomy → MAYBE. - ---- - -## Notifications (3 channels) - -| Component | Source | Why MUST | -|---|---|---| -| Slack channel | `services/notifications/slack.py` | **[VALIDATE]** Most-used integration; lean MUST | -| Email channel (SMTP) | `services/notifications/email.py` | Fallback channel; always available | -| Webhook channel (HMAC-signed) | `services/notifications/webhook.py` | Generic integration path | -| Notification channels CRUD | `routes/admin/` notifications/channels | Channel configuration | -| Notification dispatch (alert → channels) | `tasks/notification_tasks.py` `dispatch_alert_notifications` | The fan-out mechanism | -| `notification_channels`, `notification_deliveries` tables | migration 046 | Storage | - -> Jira and PagerDuty channels → MAYBE. 
- ---- - -## OWCA (compliance scoring — minimum) - -| Component | Source | Why MUST | -|---|---|---| -| Score calculator (Layer 1) | `services/owca/core/score_calculator.py` | The compliance score is the headline metric | -| Severity calculator (Layer 0) | `services/owca/extraction/severity_calculator.py` | Underlies the score | - -> Layers 2–4 (framework intelligence, fleet aggregator, trends/predictions) → MAYBE. - ---- - -## Validation & sanitization - -| Component | Source | Why MUST | -|---|---|---| -| Error classification (SSH/scan errors → user guidance) | `services/validation/errors.py` | Operational UX | -| Group validation (pre-scan compatibility check) | `services/validation/group.py` | Prevents bad bulk scans | -| Error sanitization (anti-reconnaissance) | `services/validation/sanitization.py` | Security baseline | -| System info sanitization | `services/validation/system_sanitization.py` | Security baseline | -| Unified pre-scan validation | `services/validation/unified.py` | Orchestrates the above | - ---- - -## Job queue (custom port) - -| Component | Source | Why MUST | -|---|---|---| -| Queue core (`SKIP LOCKED`) | `services/job_queue/service.py` | Foundational; locked decision in roadmap | -| Worker | `services/job_queue/worker.py` | Executes jobs | -| Scheduler (cron parser, recurring_jobs poll) | `services/job_queue/scheduler.py` | Drives all periodic work | -| Dispatch (`enqueue_task`) | `services/job_queue/dispatch.py` | Public enqueue API | -| Registry (task name → handler) | `services/job_queue/registry.py` | Routing — but rebuild without Celery `bind=True` wrapping | -| `job_queue`, `recurring_jobs` tables | migration 049 | Storage | - ---- - -## Liveness & monitoring (essentials) - -| Component | Source | Why MUST | -|---|---|---| -| Liveness service (TCP ping) | `services/monitoring/liveness.py` | Heartbeat for fleet health | -| `host_liveness` table | migration 045 | Heartbeat storage | -| Health monitoring service | 
`services/monitoring/health.py` | `/health` endpoint backing | -| Host monitor | `services/monitoring/host.py` | Connectivity + last-scan tracking | -| Drift detection (monitoring path) | `services/monitoring/drift.py` | One drift implementation — keep this one (simpler than OWCA path) | -| Adaptive monitoring dispatcher (cron every minute) | `tasks/adaptive_monitoring_dispatcher.py` | Drives connectivity checks | -| Per-host connectivity check task | `tasks/monitoring_tasks.py` `check_host_connectivity` | The work unit | -| Ping-all task (cron */5 min) | `tasks/liveness_tasks.py` `ping_all_managed_hosts` | Fleet-wide liveness sweep | - ---- - -## Retention (basic) - -| Component | Source | Why MUST | -|---|---|---| -| Retention policy service (basic — delete by age) | `services/compliance/retention_policy.py` | Compliance + storage hygiene | -| `retention_policies` table | migration 052 | Per-resource policy storage | -| Retention enforcement task (cron 04:00 UTC) | `tasks/retention_tasks.py` `enforce_retention` | Runs the policy | - -> Signed archive bundles → MAYBE (incomplete). 
- ---- - -## Infrastructure (foundational) - -| Component | Source | Why MUST | -|---|---|---| -| HTTP client (with circuit breaker) | `services/infrastructure/http.py` | All outbound traffic | -| Webhook signing/verification | `services/infrastructure/webhooks.py` | Webhook security | -| Email service | `services/infrastructure/email.py` | Notification baseline | -| Prometheus metrics export | `services/infrastructure/prometheus.py` | `/metrics` endpoint | -| Config service (Pydantic-validated, env-driven) | `services/infrastructure/config.py` | Replaced by TOML + env in Go (per roadmap), but the concept is MUST | -| Audit logger stream | `services/infrastructure/audit.py` | Structured audit | - ---- - -## OS discovery (basic) - -| Component | Source | Why MUST | -|---|---|---| -| OS discovery sweep (cron 02:00 UTC) | `tasks/os_discovery_tasks.py` `discover_all_hosts_os` | Required for platform_identifier accuracy | -| Single-host OS discovery (manual/scheduler) | `tasks/os_discovery_tasks.py` `trigger_os_discovery` | Operator action | -| Batch OS discovery | `tasks/os_discovery_tasks.py` `batch_os_discovery` | Bulk variant | - ---- - -## Webhooks (delivery infrastructure) - -| Component | Source | Why MUST | -|---|---|---| -| Webhook delivery worker | `tasks/background_tasks.py` `deliver_webhook` | Async webhook dispatch with HMAC + retry | -| `webhook_endpoints`, `webhook_deliveries` tables | model | Storage + delivery audit | - ---- - -## Data layer (MUST tables) - -The following tables are required by the items above. The full table list with migration numbers is in `BACKEND_FUNCTIONALITY.md` §19.1. 
- -**Identity / RBAC:** `users`, `roles`, `user_groups`, `user_group_memberships`, `host_access`, `api_keys`, `mfa_audit_log`, `mfa_used_codes`, `sso_providers` - -**Hosts & scans:** `hosts`, `host_groups`, `host_group_memberships`, `scans`, `scan_findings`, `scan_baselines`, `scan_drift_events` - -**Compliance state:** `host_rule_state`, `transactions`, `posture_snapshots`, `compliance_exceptions`, `host_compliance_schedule` - -**Kensa & frameworks:** `kensa_rules` (or Go equivalent), `framework_mappings` - -**Alerts & notifications:** `alert_settings`, `notification_channels`, `notification_deliveries` - -**Auth & SSO:** `token_blacklist_pg`, `signing_keys`, `sso_providers` - -**Audit & retention:** `audit_logs`, `integration_audit_log`, `retention_policies` - -**Job queue:** `job_queue`, `recurring_jobs` - -**Liveness:** `host_liveness` - -**System config:** `system_settings`, `webhook_endpoints`, `webhook_deliveries` - -> **Schema divergence to fix in rebuild:** `users.id` is currently `int`. The Go rebuild uses UUID for everything. Plan a one-shot data migration as part of Phase 1. - ---- - -## What this list deliberately excludes - -- Anything in `MAYBE_BACKEND_FUNCTIONALITY.md` (feature-gated, planned-incomplete, or moderate-usage features). -- Anything in `NEVER_BACKEND_FUNCTIONALITY.md` (legacy, deprecated, dead code, or duplications where one path is dropped). - -When in doubt, don't add to MUST. The discipline is to keep MUST small and let MAYBE catch the borderline cases. - ---- - -## Stage 1 evidence corrections (2026-04-28) - -The static-analysis pass at `app/docs/stage_1_evidence_static.md` surfaced two corrections to this list: - -### Correction 1: Licensing is a fresh build, not a port - -`services/licensing/service.py` has three TODO stubs for license validation (`Implement license key validation`, two for `Query database for license`). 
Today's `LicenseService.has_feature()` is a **config-flag check pretending to be license validation** — it does not validate license keys, query a license DB, or enforce expiry. - -**Triage update:** Licensing stays in MUST, but the rebuild's licensing component must be a **fresh build with proper key validation, expiry enforcement, and DB-backed license records.** Do not port the current Python implementation; design from scratch. - -### Correction 2: Test debt list — Stage 2 entry criteria - -The following MUST items have **zero test coverage** in the current Python codebase. The Go rebuild must add tests for these from day one of porting — not "add tests later." - -| Module | Why this matters | -|---|---| -| `services/job_queue/dispatch.py` | Used by 14+ route handlers as the enqueue API | -| `services/job_queue/registry.py` | Task name → handler mapping | -| `services/auth/credential_handler.py` | Phase 2 host credential refactor | -| `services/auth/token_blacklist_pg.py` | JWT revocation; security-critical | -| `services/baseline_service.py` | NIST SP 800-137 drift baseline | -| `plugins/kensa/scanner.py` | Core Kensa execution adapter | -| `plugins/kensa/evidence.py` | Evidence serialization for audit | -| `plugins/kensa/sync_service.py` | Rule sync after Kensa updates | - -**Stage 2 entry criterion:** Slice A cannot ship without test coverage for the auth modules listed. Slice B cannot ship without test coverage for the Kensa modules listed. Slice C cannot ship without test coverage for the baseline service. - ---- - -## Validation TODOs (before final lock) - -Items marked **[VALIDATE]** above should be confirmed via: - -1. **Telemetry from current OpenWatch** — endpoint hit rates over 60–90 days -2. **Operator interviews** — "what would break for me if this disappeared?" 
- -The following [VALIDATE] items are most at risk of demotion to MAYBE if telemetry says otherwise: - -- SSO OIDC + SAML (might be MAYBE if no enterprise/federal customer signal) -- Bulk CSV import/export (might be MAYBE if not used) -- Scan templates (might be MAYBE if operators don't actually use them) -- Slack channel (might be MAYBE if customers prefer email/webhook) - -Other MUST items are protected by the four triage criteria and should not be questioned by usage data alone. diff --git a/docs/engineering/README.md b/docs/engineering/README.md deleted file mode 100644 index 0e014395..00000000 --- a/docs/engineering/README.md +++ /dev/null @@ -1,280 +0,0 @@ -# OpenWatch — Go Rebuild - -This directory is the working tree for the from-scratch OpenWatch rebuild in -Go. The existing Python backend lives in `../backend/` and remains in -production until the rebuild is ready to take over. - -**Status:** Stage 0 (walking skeleton) **complete**. 18/18 specs at 100% under `specter coverage --strict`. 19-step Definition of Done passes end-to-end (see `internal/server/api_signoff_test.go`). - -> **What this means:** The toolchain is proven — every Stage-0 foundation (config layering, migrations, HTTPS + cert hot-reload + correlation propagation, audit, idempotency, license validation, RBAC, policy framework, queue, in-process worker, native RPM + DEB packaging, FIPS 140-3 build) is wired end-to-end and tested. **It is not a working compliance scanner yet** — that's Stage 2 (slice A: auth + add host; slice B: Kensa scan; slice C: historical posture). - ---- - -## Design references - -All design work is in `docs/`. Read these before changing anything: - -| Topic | File | -|-------|------| -| Vision, goals, decisions | [`openwatch_roadmap.md`](openwatch_roadmap.md) | -| Stage 0 / Stage 1 plans (complete) | Delivered; the walking-skeleton plan and the Python-backend Stage-1 audits were archived out of the repo (2026-06-22) to `~/hanalyx/OWAR/openwatch-python/docs-archive/`. 
| -| API design principles | [`api_design_principles.md`](api_design_principles.md) | -| Audit event taxonomy | [`audit_event_taxonomy.md`](audit_event_taxonomy.md) | -| Licensing foundation | [`licensing_foundation.md`](licensing_foundation.md) | -| Policies-as-data | [`policies_as_data.md`](policies_as_data.md) | -| RBAC registry | [`rbac_registry.md`](rbac_registry.md) | -| Correlation ID propagation | [`correlation_id_propagation.md`](correlation_id_propagation.md) | - -Registries (source of truth for codegen): - -| Registry | File | -|----------|------| -| Audit events | [`audit/events.yaml`](audit/events.yaml) | -| Error codes | [`api/error_codes.yaml`](api/error_codes.yaml) | -| License features | [`license/features.yaml`](license/features.yaml) | -| Permissions + built-in roles | [`auth/permissions.yaml`](auth/permissions.yaml) | - -OpenAPI domain specs in [`api/`](api/) (4 full-fidelity + 11 skeletons + meta `openapi.yaml`). - ---- - -## Prerequisites - -- Go 1.25+ (auto-downloaded by toolchain if local Go is older; raised from - the originally-planned 1.22+ floor when `pressly/goose v3.27` required 1.25) -- `make`, `git` - -Optional (lands in later days): - -- `golangci-lint` (Day 1: lint target works without it, just skips) -- `oapi-codegen`, `sqlc`, `redocly` (Day 5: codegen — until then, the - audit queries are hand-written but match what sqlc would produce) -- A running PostgreSQL 15+ for `migrate` and integration tests; integration - tests in `internal/db` skip if `OPENWATCH_TEST_DSN` is unset -- `microsoft/go` (Day 12: FIPS build) - ---- - -## Quick start - -```bash -# From this directory (app/) -make help # list all targets -make version # show what version metadata will be injected -make build # produces dist/openwatch -make test # run all Go tests - -./dist/openwatch --version -./dist/openwatch check-config # uses defaults (silent if /etc/openwatch/openwatch.toml missing) -./dist/openwatch --config configs/openwatch.toml.example check-config 
-OPENWATCH_SERVER_LISTEN=0.0.0.0:9443 ./dist/openwatch check-config # env override -./dist/openwatch --listen 0.0.0.0:9000 check-config # flag override (wins over env) -``` - -Config layering (highest precedence first): - -1. CLI flags (`--listen`, `--log-level`) -2. Env vars (prefixed `OPENWATCH_`, e.g. `OPENWATCH_SERVER_LISTEN`)
 -3. TOML file (`--config`, default `/etc/openwatch/openwatch.toml`) -4. Built-in defaults - -Subcommands beyond `check-config` are stubbed until their day arrives: -`serve` (Day 4), `migrate` (Day 3). - ---- - -## Layout - -``` -app/ -├── api/ # OpenAPI specs and error_codes.yaml registry -├── audit/ # events.yaml registry -├── auth/ # permissions.yaml registry (RBAC) -├── cmd/ # binaries (entry points) -│ └── openwatch/ # the main daemon -├── dist/ # build output (gitignored) -├── docs/ # design docs (the spec) -├── internal/ # Go packages, not importable outside this module -│ ├── config/ # (placeholder, Day 2) -│ ├── server/ # (placeholder, Day 4) -│ └── version/ # build-time metadata -├── license/ # features.yaml registry -├── .golangci.yml -├── .gitignore -├── Makefile -├── README.md # this file -├── go.mod -└── go.sum # (populated by `go mod tidy`) -``` - -Foundation packages (`internal/audit/`, `internal/auth/`, `internal/correlation/`, -`internal/errors/`, `internal/license/`, `internal/log/`, `internal/policy/`, -`internal/queue/`, `internal/httpclient/`) come online as their Stage 0 days arrive. - ---- - -## Stage 0 progress - -The Stage 0 walking-skeleton plan (the 13-day plan and 19-step Definition of -Done) is complete and was archived out of the repo (2026-06-22) to -`~/hanalyx/OWAR/openwatch-python/docs-archive/`. The delivered status is below. 
- -| Day | Topic | Status | -|----:|-------|--------| -| 1 | Repository scaffold | complete | -| 2 | Config + flags + TOML | complete | -| 3 | PostgreSQL + goose migrations | complete | -| 4 | HTTP server + chi + TLS + correlation propagation | complete | -| 5a | Audit foundation (migration + codegen + emit/writer/redact) | complete | -| 5b | OpenAPI codegen + endpoints (/health, :echo, /audit/events) | complete | -| 6 | Idempotency middleware (folded into Day 5b) | complete | -| 7 | Licensing foundation (JWT EdDSA + RequireFeature + owlicgen) | complete | -| 8 | RBAC registry | complete | -| 9 | Policies-as-data + queue correlation helpers | complete | -| 10 | Specter spec + AC coverage | complete (18/18 specs at 100% under strict mode) | -| 11 | Native packaging (RPM + DEB) | complete | -| 12 | FIPS build (Go 1.25 native `GOFIPS140`) | complete | -| 13 | Documentation, demo, sign-off | complete | - ---- - -## Developer walkthrough (from a fresh clone) - -This section walks a new developer through every Stage-0 command. Run from `app/`. - -### 1. Build - -```bash -make build # produces dist/openwatch (non-FIPS) -make build-fips # produces dist/openwatch-fips (Go 1.25 native FIPS 140-3) - -./dist/openwatch --version -./dist/openwatch-fips --version # → "fips: true" -``` - -### 2. Local development against PostgreSQL - -The integration tests require a running PostgreSQL. 
Easiest setup uses -docker / podman: - -```bash -docker run -d --name openwatch-pg \ - -e POSTGRES_USER=openwatch \ - -e POSTGRES_PASSWORD=openwatch \ - -e POSTGRES_DB=openwatch \ - -p 5432:5432 \ - postgres:16-alpine - -export OPENWATCH_TEST_DSN="postgres://openwatch:openwatch@127.0.0.1:5432/openwatch?sslmode=disable" - -./dist/openwatch migrate # apply all goose migrations -./dist/openwatch check-config -``` - -To run the daemon locally with a self-signed cert: - -```bash -mkdir -p /tmp/ow-tls -bash packaging/common/gen-demo-cert.sh /tmp/ow-tls - -OPENWATCH_DATABASE_DSN="$OPENWATCH_TEST_DSN" \ -OPENWATCH_SERVER_TLS_CERT=/tmp/ow-tls/cert.pem \ -OPENWATCH_SERVER_TLS_KEY=/tmp/ow-tls/key.pem \ -./dist/openwatch --listen 127.0.0.1:8443 serve -``` - -In another terminal, exercise the surface: - -```bash -curl -k https://127.0.0.1:8443/api/v1/health -curl -k 'https://127.0.0.1:8443/api/v1/audit/events?limit=5' -curl -k https://127.0.0.1:8443/api/v1/license -curl -k https://127.0.0.1:8443/api/v1/auth/permissions:registry | jq . -``` - -### 3. Running tests - -```bash -make test # unit + integration; integration tests skip without DSN -specter sync # spec validation + coverage gate -``` - -### 3a. Quality + security gates (run before pushing) - -```bash -make check # vet → lint → vuln → test-race, chained -``` - -Or individually: - -```bash -make vet # go vet ./... -make lint # golangci-lint (staticcheck + gosec + govet + ...) -make vuln # govulncheck ./... (stdlib + deps; auto-installs) -make test-race # go test -race -p 1 ./... -``` - -The same gates run in CI via `.github/workflows/go-ci.yml` on every PR touching `app/**`. - -For strict-mode AC coverage (requires the test pipeline to ingest results): - -```bash -export OPENWATCH_TEST_DSN="postgres://openwatch:openwatch@127.0.0.1:5432/openwatch?sslmode=disable" -go test -json -p 1 ./... 
> /tmp/go-test.json -specter ingest --go-test /tmp/go-test.json -specter coverage --strict # → "18 specs: 18 passing, 0 failing" -``` - -### 4. Building packages - -```bash -make rpm # → dist/openwatch--1.x86_64.rpm (needs rpmbuild) -make deb # → dist/openwatch__amd64.deb (needs dpkg-deb) -``` - -Install on a target VM via [`docs/guides/INSTALLATION.md`](../guides/INSTALLATION.md). - -### 5. Code generation - -When you edit any registry, regenerate the typed Go output: - -```bash -make generate-audit # audit/events.yaml → internal/audit/events.gen.go -make generate-api # api/openapi.yaml → internal/server/api/server.gen.go -go run scripts/gen-rbac.go # auth/permissions.yaml → internal/auth/{permissions,roles}.gen.go -go run scripts/gen-license-features.go # license/features.yaml → internal/license/features.gen.go -``` - -CI fails if a generated file is out of sync with its source registry. - ---- - -## The 19-step Definition of Done — runnable checklist - -Each step below has an enforcing test in the spec registry. Steps that -the operator must run on a VM (file-watch cert reload, full binary -restart) are flagged. 
- -| # | Step | Spec AC / test | -|---|------|----------------| -| 1 | `git clone` + walk-through this README | covered by this section | -| 2 | `make build` produces `dist/openwatch` | `release-package-build/AC-13` | -| 3 | `make build-fips` produces `dist/openwatch-fips` with `fips: true` | `release-fips-build/AC-01`, `AC-02` | -| 4 | `make rpm` + `make deb` produce installable packages | `release-package-build/AC-01`, `AC-02` | -| 5 | Package installs to `/etc/systemd/system/openwatch.service` + friends | `release-package-build/AC-04`, `AC-06` | -| 6 | `systemctl start openwatch` + `journalctl -u openwatch` | operator (install guide) | -| 7 | GET `/api/v1/health` → 200 + canonical body | `release-stage-0-signoff/AC-01` | -| 8 | POST `:echo` with `Idempotency-Key` + `X-Correlation-Id` → 200 echoed | `release-stage-0-signoff/AC-02` | -| 9 | GET `/audit/events` includes the row from step 8 | `release-stage-0-signoff/AC-03` | -| 10 | Replay step 8 → cached response, only one audit row | `release-stage-0-signoff/AC-04` | -| 11 | `viewer` permissions include `host:read`, exclude `host:write` | `release-stage-0-signoff/AC-05` | -| 12 | `viewer` + POST `:require-host-write` → 403 + audit row | `release-stage-0-signoff/AC-06` | -| 13 | `security_admin` + no license + `:require-remediation-execute` → 402 | `release-stage-0-signoff/AC-07` | -| 14 | POST `:evaluate-alert` `{score:65}` → `outcome=high`, version `0.0.0` | `release-stage-0-signoff/AC-08` | -| 15 | Drop signed `alert_thresholds.yaml` v1.0.0; reload; reflect new thresholds | `release-stage-0-signoff/AC-09` | -| 16 | POST `:enqueue-test-job` with `X-Correlation-Id: req-end2end-001`; worker emits matching audit event | `release-stage-0-signoff/AC-10` | -| 17 | `specter sync` → 100% AC coverage on every Active spec | `release-stage-0-signoff/AC-11` | -| 18 | Edit cert on disk; new TLS handshakes pick up the new cert | `system-http-server/AC-08` + `AC-09`; full file-watch is operator | -| 19 | Stop / restart service 
— DB state survives, audit row persists | `system-db/AC-12` (pool reopen); binary restart is operator | - -Run all 19 in CI by running `go test ./...` after `make rpm && make deb && make build-fips` — the test suite asserts steps 7-17 against a live server; steps 2-5 are exercised by the packaging build itself; steps 1, 6, 18, 19 are the operator's gate. diff --git a/docs/engineering/activity_and_os_intelligence.md b/docs/engineering/activity_and_os_intelligence.md deleted file mode 100644 index 4c45acde..00000000 --- a/docs/engineering/activity_and_os_intelligence.md +++ /dev/null @@ -1,344 +0,0 @@ -# Activity feed + OS Intelligence — design context (deferred) - -> **Status**: Deferred. Current focus is frontend GUI direction (post-`feat/daemon-orchestration`). -> This document captures the full design discussion so we can resume cleanly without re-litigating decisions. -> -> **Last updated**: 2026-05-30 (just after PR #430 landed) - ---- - -## TL;DR - -A single user-facing page named **`/activity`** that holds five categories of incoming information, role-filtered, with URL-routed filter presets (`/activity/alerts`, `/activity/transactions`, `/activity/intelligence`, etc.). The page is the operator's "Eye on the fleet" — a unified surface for OpenWatch-synthesized signals and host-reported security/configuration events. - -The biggest gap is that **OS Intelligence collection does not exist in the Go rebuild yet**. The legacy Python side has it (PR #274 in legacy CLAUDE.md); the Go side has zero. Building `/activity` properly therefore lands in this order: - -1. OS Intelligence writer + storage (~1.5 days) -2. Alert persistence amendment (~1 day) -3. Unified activity query API (~half day) -4. Frontend `/activity` page (depends on stack decision) - -Roughly **4-5 days of focused backend work** before the frontend page is meaningful. 
- ---- - -## Why this is deferred (not abandoned) - -The current focus is frontend GUI foundational work: - -- Frontend architecture ADR (TS framework, state management, API client, build, embed) -- First implemented page (validates the stack) -- `frontend-findings-ui` spec implementation (`/hosts/{id}` detail page) - -Building OS Intelligence + activity feed in parallel would dilute both efforts. The frontend foundational work must settle first so the activity page has somewhere to land. - -**Resume here when:** - -- Frontend architecture ADR is on main -- At least one frontend page is implemented end-to-end (proves the stack works) -- Either customer demand surfaces fleet-wide visibility as a need, OR enough Slice B/C operator surface area accumulates that a unified view becomes necessary - ---- - -## Product vision - -A user navigates to `/activity` and sees, time-ordered and severity-colored, every signal the platform has captured about its fleet in the last N hours/days. Filterable, paginated, RBAC-gated. Linked everywhere — clicking a row opens the relevant `/hosts/{id}` or `/transactions/:id`. - -The mental model is the "Eye": complete visibility into what's happening across the infrastructure. Not just OpenWatch's own internal events, but real signals from each managed host — account changes, security events, system changes — synthesized into a coherent operator view. - -This is a meaningful product differentiator vs. traditional point-in-time compliance scanners: most compliance tools only show you scan results. OpenWatch should show you what *changed* on the host between scans. 
- ---- - -## Page model — single page, multiple URL routes as filter presets - -``` -/activity Role-default view -/activity/alerts Filter: alerts only -/activity/transactions Filter: compliance state changes -/activity/intelligence Filter: OS Intelligence events -/activity/intelligence/account Sub-filter: account events -/activity/intelligence/security Sub-filter: security events -/activity/intelligence/system Sub-filter: system changes -/activity/audit Filter: who-did-what (admin only) -/activity?host=...&severity=high Composable query params -``` - -**URL is the source of truth.** Role gates which filters are available. The page is one component; routes are filter presets baked into nav and bookmarkable. - -**Why this design** (rejected alternatives): -- *Separate top-level pages per category* — nav clutter; six items become twelve -- *`/feed` as the name* — too casual for enterprise compliance buyers -- *Stream all four sources without role-gating* — overload for non-admin users - ---- - -## Five data sources on this page - -### 1. OS Intelligence events (NEW — does not exist on main) - -Host-reported security / account / configuration events captured by the OS Intelligence collection service. 
**This is the big new piece.** - -Three subcategories: - -**Account / identity** -- User account locked out -- Password expired or expiring -- New user account created -- User added to privileged group (wheel, sudo, admin) -- SSH key added or removed for a user -- Sudo failure threshold crossed - -**Security** -- SSH login from new source IP for a known user -- Failed login attempts threshold crossed -- SELinux / AppArmor denials -- New listening port opened -- Firewall rule changed -- First-time privilege escalation by a user - -**System** -- Package installed / updated / removed -- Kernel update applied; reboot pending or completed -- Critical config file changed (`/etc/sudoers`, `/etc/passwd`, `sshd_config`, crontab) -- Service started / stopped / failed -- Disk filesystem mounted or unmounted - -### 2. Compliance state transactions (EXISTS on main) - -`transactions` table from Slice B (B.1c). Each row is a rule's state change on a host. `change_kind IN ('first_seen', 'state_changed', 'severity_changed')`. Already queryable via `GET /api/v1/fleet/recent-changes`. - -### 3. OpenWatch-synthesized alerts (PARTIAL on main) - -Slice B's alert router fires `Alert` values via the eventbus. The 5 types on main: - -- `host_unreachable` / `host_recovered` (liveness) -- `drift_major` / `drift_minor` / `drift_improvement` (drift detector) - -**Persistence gap**: alerts are fire-and-forget today; no `alerts` table. Building `/activity` to show alerts requires the alerts persistence amendment described below. - -### 4. Audit events (EXISTS on main) - -`audit_events` table. Already queryable via `GET /api/v1/audit/events`. Covers who-did-what for compliance/forensics. RBAC-gated to `audit:read`. - -### 5. Future: OS Intelligence-derived alerts - -OS Intelligence events that meet a threshold get re-promoted into the alert pipeline. 
E.g., `security.firewall.rule_changed` on a production host with severity=critical fires `firewall_rule_changed_unattended` if the change wasn't preceded by an approved change-management ticket. - -This is the OpenWatch+OS-Intelligence loop: collect → detect → alert → triage → audit. - ---- - -## OS Intelligence — the new backend piece - -### Service - -A new package: `internal/intelligence` (or `internal/osintel`). Long-lived service, started by `cmd/openwatch serve` alongside the liveness loop. - -### Collection model — pull, with same SSH session as Kensa scan - -**Decision: pull via SSH on a schedule.** Reuses the existing credential resolution + SSH dial path. Same dial budget the executor uses; can piggyback on the scan SSH session (collect right before or after the scan completes) for efficiency. - -Rejected for now: **push via host-side agent.** That would require an enroll/auth flow that doesn't exist; deferred until OpenWatch has a real agent story. - -### Granularity model — snapshot delta - -**Decision: store full collected state per host in `host_intelligence_state`; emit one event row in `host_intelligence_events` per detected change.** - -Same write-on-change discipline as `transactions` + `host_rule_state` (the 99.7% write reduction model from Q1). Avoids unbounded growth from re-emitting "nothing changed" snapshots every collection cycle. 
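The snapshot-delta decision above can be sketched in Go as a pure comparison between the previously stored state and the freshly collected one. This is a minimal illustration, not the planned implementation: the `Snapshot` and `Event` types, the flattened key format, and the `DiffSnapshots` function are all hypothetical names introduced here, and the real design stores the snapshot as JSONB with per-code typed detail.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical flattened view of one host's collected state,
// e.g. "package/openssl" -> "3.0.7". The design text stores this as JSONB.
type Snapshot map[string]string

// Event is a hypothetical stand-in for one host_intelligence_events row.
// An empty Old or New means the key was absent on that side.
type Event struct {
	Key string
	Old string
	New string
}

// DiffSnapshots returns one Event per key that was added, removed, or
// changed between prev and next. Identical snapshots yield no events,
// which is the write-on-change property described in the text.
func DiffSnapshots(prev, next Snapshot) []Event {
	var events []Event
	for k, nv := range next {
		if ov, ok := prev[k]; !ok || ov != nv {
			events = append(events, Event{Key: k, Old: ov, New: nv})
		}
	}
	for k, ov := range prev {
		if _, ok := next[k]; !ok {
			events = append(events, Event{Key: k, Old: ov})
		}
	}
	// Map iteration order is random in Go; sort for deterministic output.
	sort.Slice(events, func(i, j int) bool { return events[i].Key < events[j].Key })
	return events
}

func main() {
	prev := Snapshot{"package/openssl": "3.0.7", "user/alice": "active"}
	next := Snapshot{"package/openssl": "3.0.9", "user/alice": "active", "port/8443": "listening"}
	for _, e := range DiffSnapshots(prev, next) {
		fmt.Printf("%s: %q -> %q\n", e.Key, e.Old, e.New)
	}
}
```

In this sketch an unchanged collection cycle produces an empty slice, so nothing would be written to the events table; only the state row's `collected_at` would need updating.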
- -### Storage schema - -```sql -CREATE TABLE host_intelligence_state ( - host_id UUID PRIMARY KEY REFERENCES hosts(id) ON DELETE CASCADE, - snapshot JSONB NOT NULL, -- last full collected state by category - collected_at TIMESTAMPTZ NOT NULL, - collected_by UUID, -- scan_id when piggybacked on a scan, else NULL - updated_at TIMESTAMPTZ NOT NULL DEFAULT now() -); - -CREATE TABLE host_intelligence_events ( - id UUID PRIMARY KEY, - host_id UUID NOT NULL REFERENCES hosts(id) ON DELETE RESTRICT, - event_code TEXT NOT NULL, -- closed enum; see taxonomy - severity TEXT NOT NULL - CHECK (severity IN ('info','low','medium','high','critical')), - detail JSONB NOT NULL, -- per-code typed schema - occurred_at TIMESTAMPTZ NOT NULL, -- when the change happened on the host - detected_at TIMESTAMPTZ NOT NULL, -- when OpenWatch noticed it - correlation_id TEXT NOT NULL, -- chain back to the collection run - UNIQUE (host_id, event_code, occurred_at) -- idempotency under retries -); - -CREATE INDEX idx_intelligence_events_recent - ON host_intelligence_events (detected_at DESC); -CREATE INDEX idx_intelligence_events_host_code - ON host_intelligence_events (host_id, event_code, occurred_at DESC); -``` - -### Event taxonomy - -Closed enum stored in `internal/intelligence/taxonomy.go`, mirrored in `audit/events.yaml` for taxonomy consistency. 
Roughly 25 codes at v1.0.0: - -| Category | Code | -|----------|------| -| Account | `account.user.locked` / `unlocked` | -| Account | `account.user.created` / `deleted` | -| Account | `account.user.privileged_group_added` | -| Account | `account.password.expired` / `expiring` | -| Account | `account.ssh_key.added` / `removed` | -| Account | `account.sudo.failure_threshold` | -| Security | `security.login.new_source_ip` | -| Security | `security.login.failed_threshold` | -| Security | `security.selinux.denied` | -| Security | `security.apparmor.denied` | -| Security | `security.firewall.rule_changed` | -| Security | `security.port.opened` | -| System | `system.package.installed` / `updated` / `removed` | -| System | `system.kernel.updated` | -| System | `system.reboot.required` / `completed` | -| System | `system.config.changed` | -| System | `system.service.started` / `stopped` / `failed` | -| System | `system.filesystem.mounted` / `unmounted` | - -Each has a default severity, an actor_types list, and a detail schema (same shape as the existing `audit/events.yaml` entries). - -### Collection cadence - -Default 1 hour per host, configurable via `policy.Intelligence.IntervalSec`. Clamped to `[5min, 24h]`. Same per-host advisory-lock discipline the scheduler dispatch and the future worker will share. - ---- - -## RBAC model — row-level, not page-level - -Today, RBAC is per-endpoint. For `/activity`, the API needs row-level filtering: a `host:read`-less user shouldn't see intelligence events about hosts; an `audit:read`-less user shouldn't see audit rows. 
- -**Per-row permission map:** - -| Source | Required permission | -|--------|---------------------| -| Alerts | `alert:read` | -| Compliance transactions | `host:read` AND `compliance:read` (the rule_id is the compliance reference) | -| OS Intelligence | `host:read` AND `intelligence:read` (new permission) | -| Audit | `audit:read` | - -The unified `GET /api/v1/activity` endpoint filters rows the caller can't see, AND returns metadata: `{total_visible: N, total_hidden_by_rbac: M}`. The UI honestly tells the user "you have 47 items; 200 more are hidden by your role." - -### Default view per role - -Surfaced in the role definitions at `internal/auth/roles.gen.go`. New field on `RoleDefinition`: `DefaultActivityView` carrying a default filter URL fragment. - -| Role | Default `/activity` view | -|------|--------------------------| -| `admin` | All sources, last 24h, severity ≥ info | -| `security_admin` | Audit + alerts (admin actions, MFA changes), last 7d | -| `ops_lead` | Alerts + transactions + intelligence, severity ≥ medium, last 24h | -| `auditor` | Audit + transactions, last 30d | -| `viewer` | Alerts (high+critical only) + transactions, last 24h | - ---- - -## Pagination — UNION query with seek-cursor - -**Decision: single SQL UNION ALL across all four sources, paginated via seek-cursor on `detected_at DESC`.** - -```sql -SELECT * FROM ( - SELECT id, 'alert' AS source, severity, host_id, occurred_at AS detected_at, ... - FROM alerts WHERE state != 'dismissed' - UNION ALL - SELECT id, 'transaction' AS source, severity, host_id, occurred_at, ... - FROM transactions WHERE host_id = ANY($accessible_hosts) - UNION ALL - SELECT id, 'intelligence' AS source, severity, host_id, detected_at, ... - FROM host_intelligence_events - UNION ALL - SELECT id, 'audit' AS source, severity, NULL AS host_id, occurred_at, ... 
- FROM audit_events WHERE $can_read_audit -) AS activity -WHERE detected_at < $cursor -ORDER BY detected_at DESC -LIMIT 50; -``` - -Cursor is just the `detected_at` of the last row. Simpler than per-source cursors; trades a slight precision loss at category boundaries for implementation tractability. - -Heavier DB query than per-source pagination — UNION + sort + limit. Acceptable at fleet scale; benchmark before assuming so. - ---- - -## Spec / PR sequence — when work resumes - -| # | Spec | What lands | Effort | -|---|------|------------|--------| -| 1 | `system-os-intelligence` v1.0.0 | Writer service, collection scheduler, event taxonomy, two new tables | ~1.5 days | -| 2 | `api-os-intelligence` v1.0.0 | `GET /api/v1/intelligence/events`, `GET /api/v1/intelligence/state` | ~half day | -| 3 | `system-alert-router` v1.1.0 | Amendment: persist every routed alert to a new `alerts` table | ~1 hour spec + 2 hours code | -| 4 | `system-alerts` v1.0.0 | Lifecycle service: acknowledge, silence, resolve, dismiss | ~half day | -| 5 | `api-alerts` v1.0.0 | `GET /api/v1/alerts`, lifecycle endpoints, RBAC | ~half day | -| 6 | `system-activity` v1.0.0 | Unified UNION query, RBAC row-filter, seek-cursor | ~half day | -| 7 | `api-activity` v1.0.0 | `GET /api/v1/activity` | ~half day | -| 8 | Role defaults | `roles.gen.go` extended with `DefaultActivityView` field | ~2 hours | -| 9 | Frontend `/activity` page | Depends entirely on frontend stack decision | TBD | - -**Backend total: ~4-5 focused days** before the frontend page can start. Each row above is a single PR-sized unit; sequence is hard-ordered above the line at row 6 (1+2 parallel; 3+4+5 parallel after 1; 6 after all of 1-5; 7+8 after 6). - ---- - -## Open decisions when work resumes - -These remained open at time of writing. None are urgent now; all need an answer before implementation starts. - -1. **Snapshot detail size cap**. 
The `snapshot` JSONB on `host_intelligence_state` could blow up on hosts with many packages. Cap at 10MB? Compress? Split per category into separate columns? Recommendation: 10MB hard cap with truncation marker, same pattern as `kensa.MaxEvidenceBytes`. - -2. **OS Intelligence collection failure handling**. If collection times out on a host, do we mark `host_backoff_state.suppress_until` the same way kensa scan failures do? Risk: one bad host suppresses ALL its scan AND intelligence cycles. Probably need a per-probe-type backoff: `(host_id, probe_type)` where probe_type is `scan` or `intelligence`. The current `host_backoff_state` already has `probe_type` — extend it. - -3. **Retention policy**. `host_intelligence_events` will accumulate fast. Default retention? Per-severity (`critical` keeps longest)? Configurable via policy? - -4. **Multi-instance dedup**. The alert router's dedup gate is in-memory per-process. If two `serve` instances ever run, they each fire the same alert. The persistence amendment opens an attractive shared-state path: move dedup to a Postgres-backed `alert_dedup` table. Decide before multi-instance is a real deployment topology. - -5. **Auto-resolve hooks**. When `host_recovered` arrives, it should auto-resolve the matching open `host_unreachable` alert. Similar for `drift_improvement` → close prior `drift_major`. The pattern is: every alert has a `resolves_when` predicate; the router checks it on every event. Concrete enough to spec; deferred to alert-lifecycle spec. - -6. **`/activity` query performance**. UNION ALL across four tables sorted by timestamp with seek-cursor. Acceptable on dev fleet. Needs a benchmark on the first ~100k-row fleet before committing to it for production. - -7. **Notification fanout**. If an OS Intelligence event needs to fire an alert, does the intelligence service publish to the eventbus (same path Slice B uses) or call alertrouter directly? 
Cleaner: publish typed events to the bus, let the alert router (which already subscribes to the bus) handle routing. Needs a new EventKind on the eventbus. - ---- - -## How this composes with what's already on main - -These specs are stable on main and inform the design: - -- `system-event-bus` v1.0.0 — typed pub/sub; we add a new EventKind for intelligence events -- `system-alert-router` v1.0.0 — the persistence amendment is item 3 in the sequence above -- `system-transaction-log-writer` v1.0.0 — same write-on-change discipline we'd reuse for intelligence -- `system-liveness-loop` v1.1.0 — the cron-driven loop pattern (`Service.Run(ctx)`) is the model for the intelligence collection loop -- `system-kensa-executor` v2.0.0 — its SSH session is the piggyback target for collection -- `system-host-inventory` v1.0.0 — defines the active hosts the loop walks - -These do NOT exist on main yet and are prerequisites OR siblings: - -- `system-worker-subcommand` (drafted at `/tmp/worker-spec-polished.yaml`, not landed) — needs to land first so the scan path is complete, but technically orthogonal to OS Intelligence -- `frontend-architecture` (not drafted) — needed before any frontend page can land - ---- - -## Why this isn't a copy of `docs/openwatchos/04-SERVER-INTELLIGENCE.md` - -That document covers the **legacy Python** server-intelligence collection (PR #274). The Go rebuild has none of that code. The design in this document deliberately reuses concepts and the operational mental model from that doc but is the Go-rebuild-native version: - -- New Go package, new specs, new storage tables -- Same write-on-change discipline established by Slice B -- Same SSH session reuse as the kensa executor -- Aligned with the framework-at-query-time architecture from Slice B/C work - -When this work starts, reading the legacy doc gives operational context; the implementation is fresh. 
- ---- - -## When you (or future-me) come back here - -The first action is **not** to start implementing. The first action is to re-read this doc end-to-end, verify the open decisions section against any decisions that have settled in the meantime, and then write `system-os-intelligence` v1.0.0 spec (item 1 in the sequence). The spec drives the implementation, per SDD discipline. - -Estimated 30 minutes to re-orient. Then proceed. diff --git a/docs/engineering/activity_page_scope.md b/docs/engineering/activity_page_scope.md deleted file mode 100644 index 6fb36ea1..00000000 --- a/docs/engineering/activity_page_scope.md +++ /dev/null @@ -1,111 +0,0 @@ -# Activity Page — Backend Scope + MVP - -**Created**: 2026-06-13 -**Status**: Scoping (informs the `frontend-activity` MVP and a backend backlog) -**Prototype**: [`prototypes/openwatch-v1/Activity.html`](prototypes/openwatch-v1/Activity.html) - -> The `Activity.html` prototype is far richer than the live feed can back. -> This doc records, per prototype feature, what ships **now** (zero new backend) -> versus what is **backend-gated** (and how much backend each needs), so the MVP -> is honest about its boundaries. - ---- - -## The backend reality - -`GET /api/v1/activity` (`internal/activity/service.go`) is a **read-only UNION -projection** of five sources — `alert`, `transaction`, `intelligence`, `audit`, -`monitoring` — flattened to: - -``` -Activity { id, source, severity, host_id?, title, summary?, occurred_at } -ActivityPage { items[], hidden_count, next_cursor } -``` - -- **Filters**: `source`, `severity` (info/low/medium/high/critical), `host_id`, - `since`, `until`, `cursor`, `limit` (default 50, max 200). **No** text-search - (`q`) param. **No** aggregate/histogram endpoint. -- **RBAC**: per-source gating inside the service (alert→`alert:read`, - transaction/intelligence/monitoring→`host:read`, audit→`audit:read`). 
- `hidden_count` is the count of rows suppressed by the caller's missing - permissions (not a pagination remainder). -- **Order**: `occurred_at DESC`, cursor-seek pagination. - -**The key structural fact**: the activity row `id` is the *real* underlying id -**only for `source: "alert"`** (`service.go:159` — `SELECT id::text AS id ... -FROM alerts`). Monitoring synthesizes a fake UUID (`service.go:251`); the other -legs pass their row id but those tables have no per-id detail/mutation API. So -**`alert` is the only source an activity row can act on or fetch detail for by -id today.** - ---- - -## Per-feature scope - -| Prototype feature | Status | What it needs | -|---|---|---| -| Day-grouped event stream | **READY** | The flat feed | -| Source + severity filters | **READY** | Existing query params | -| Host filter / deep-link | **READY** | `host_id` param | -| Cursor "Load more" | **READY** | `next_cursor` | -| `hidden_count` surfaced | **READY** | Response field | -| Ack / Silence(Mute) / Resolve / Dismiss | **READY — alert rows only** | Lifecycle + endpoints + RBAC exist (`alerts/{id}:acknowledge\|silence\|resolve\|dismiss`, `main.go:543`); the activity id is the alert id. 
The other four sources are immutable logs | -| Detail drawer — basic fields | **READY (all sources)** | Render the activity item's own title/summary/source/severity/host/time; no fetch | -| Detail drawer — rich payload (alert) | **READY** | `GET /api/v1/alerts/{id}` returns tags/body/lifecycle | -| Detail drawer — rich payload (audit) | **GATED (small)** | `detail` JSONB exists on `audit_events` but only via the list API; needs `GET /api/v1/audit/events/{id}` | -| Detail drawer — rich payload (intelligence) | **GATED (small)** | `detail` JSONB on `host_intelligence_events`, list-only; needs `GET …/intelligence/events/{id}` | -| Detail drawer — rich payload (transaction) | **GATED (medium)** | `evidence` + `framework_refs` JSONB on `transactions`; needs `GET …/transactions/{id}` | -| Detail drawer — rich payload (monitoring) | **GATED (medium)** | per-layer flags + `error_*` on `host_monitoring_history`; needs a per-id GET | -| "Routed to" delivery panel | **GREENFIELD (medium–large)** | `notifications.yaml` is spec-only: **no tables, no service, no persistence**. Needs `notification_channels` + `notification_deliveries` schema, dispatch-outcome capture in `internal/alertrouter`, and a `GET …/notifications/deliveries?alert_id=` endpoint | -| Severity histogram | **GREENFIELD (small–medium)** | No aggregate endpoint. Either an `/activity/histogram` bucketed-count endpoint, or client-side over the loaded page (approximate only) | -| Text search | **GATED (small)** | No `q` param; add server-side search or client-side filter over the loaded page | -| Live tail | **PARTIAL** | SSE (`/api/v1/events`) carries monitoring/intelligence/heartbeat/drift/scan — **not** alert/audit/transaction state. 
A true activity tail needs those on the bus; a cheap version refetches on an SSE pulse | -| Category chip | **substitute** | No category column; `source` is the closest grouping | -| Group filter | **GREENFIELD** | Depends on the Groups entity (itself greenfield) | -| Dedup ×N count | **GATED** | The feed has no per-event count; needs a dedup-count column | - ---- - -## MVP (shipping now — zero new backend) - -1. `/activity` route + `ActivityPage`. -2. Day-grouped stream: time · source · severity · title · summary · host link. -3. Filters: **source** + **severity** dropdowns + **host_id** (deep-link), wired - to the real query params. -4. Cursor **Load more**; surface **`hidden_count`** ("N hidden by permissions"). -5. **Alert-source row actions** — Acknowledge / Silence / Resolve, shown only - when `source === 'alert'` and the caller has `alert:write`, calling the - existing `/alerts/{id}:action` endpoints. -6. **Detail drawer** — basic activity fields for every source, **enriched for - `alert`** via `GET /alerts/{id}` (tags, body, lifecycle, and the same - actions). - -Deferred but cheap follow-ups (small backend each), in priority order: - -1. `GET /audit/events/{id}` + `GET /intelligence/events/{id}` → rich drawer for - those two sources (the JSONB is already stored). -2. `GET /transactions/{id}` + monitoring per-id GET → rich drawer for the - remaining two. -3. Server-side text search (`q`) on the activity feed. - -Larger, genuinely new backend (own tracks, not part of Activity MVP): - -- **Notifications persistence** (channels + deliveries) → the "Routed to" panel. -- **Activity histogram** aggregate endpoint. -- **Live activity tail** (alert/audit/transaction events on the SSE bus). -- Generic **ack/mute for non-alert sources** — would need an - `activity_event_state` table. **Recommendation: do not build this.** Keep - ack/mute semantics on alerts (their natural home); the other four sources are - immutable logs by design. 
- ---- - -## Recommendation - -Ship the MVP above. It delivers the stream, real filtering, the honest -`hidden_count`, and — because the alert lifecycle is already complete and -reachable from activity rows — a real slice of the prototype's interaction model -(ack/silence/resolve + an alert detail drawer) with **no backend work**. Treat -the two small per-id detail GETs (audit, intelligence) as the first fast-follow -if the richer drawer is wanted for those sources. Keep "Routed to", the -histogram, and live-tail as separate backend tracks. diff --git a/docs/engineering/activity_readability_plan.md b/docs/engineering/activity_readability_plan.md deleted file mode 100644 index 3e89d468..00000000 --- a/docs/engineering/activity_readability_plan.md +++ /dev/null @@ -1,201 +0,0 @@ -# Activity & Audit Readability — Implementation Plan - -> **Status:** planning (approved direction, 2026-06-20). Tracks the initiative -> to make every activity/log surface human-readable, and to make the audit -> trail a first-class, exportable, compliance-grade record. -> -> **Decisions on file** (from the planning discussion): -> - The settings **Audit log stays** as the dedicated *forensic* surface -> (distinct from the operational `/activity` feed), made readable + a detail -> drawer + export. It is **not** redundant with `/activity`. -> - An immutable, exportable audit trail is a **committed compliance -> requirement** (FedRAMP / CMMC / NIST 800-53 **AU** control family). -> - Readability target depth: **plain-English sentences + clickable context + -> detail drawers + grouping/dedup** (the full tier). -> - Sequencing: **ship Phases 0-3 first** (the complete readability + exportable -> audit win), then pick up Phase 4 (grouping/dedup) and Phase 5 (compliance -> hardening) as fast-follow tracks, each with its own go-ahead. 
-> -> Related docs: [`audit_event_taxonomy.md`](audit_event_taxonomy.md) (canonical -> audit taxonomy, ~70 codes), [`activity_page_scope.md`](activity_page_scope.md) -> (the original `/activity` MVP scoping). - ---- - -## 1. Why this initiative - -The architecture is already sound — this is **not** a rebuild. There is one -unified feed, `GET /api/v1/activity`, backed by a single `UNION ALL` across the -five categories (`internal/activity/service.go`). The problem is two specific -gaps: - -1. **Three of the five feed legs emit raw codes as the row `title`** (the - backend hands the UI machine codes instead of sentences). -2. **There is no shared frontend formatter** — all six surfaces independently - render fields, so the same raw enum/UUID leaks differently in each place. - -### Current state of the five legs (the feed already gets 2/5 right) - -| Category | Source table | `title` today | `summary` today | Human-readable? | -|----------|--------------|---------------|-----------------|-----------------| -| **Alerts** | `alerts` | pre-formatted in Go (alert router) | pre-formatted body | **Yes** | -| **Monitoring** | `host_monitoring_history` | built in SQL ("Host became unreachable") | error_message / failed layer | **Yes** | -| **Compliance** | `transactions` | raw `rule_id` ("CIS.6.1.1") | `change_kind` enum | **No** | -| **Intelligence** | `host_intelligence_events` | raw `event_code` ("system.package.updated") | **empty** (detail JSONB unused) | **No** | -| **Audit** | `audit_events` | raw `action` ("auth.login.success") | bare `resource_id` UUID | **No** | - -### The six surfaces (all consume the same feed except the settings audit log) - -| Surface | Component | Endpoint | Today | -|---------|-----------|----------|-------| -| `/activity` (central) | `pages/activity/ActivityPage.tsx` | `/api/v1/activity` | partial; leaks `source` enum | -| Dashboard "Recent Activity" | `pages/dashboard/widgets.tsx` | `/api/v1/activity?limit=8` | worst: prints `source` + 
`severity` raw | -| Host-detail "Recent Activity" | `pages/HostDetailPage.tsx` | `/api/v1/activity?host_id=` | cleanest (icon + title + summary) | -| Host-detail "Activity" tab | `HostDetailPage.tsx` TabStub | — | **stub** (deferred) | -| Host-detail "Audit log" tab | `HostDetailPage.tsx` TabStub | — | **stub** (deferred) | -| Settings "Audit log" | `pages/settings/AuditPage.tsx` | `/api/v1/audit/events` | leaks raw `action`, actor/resource **UUIDs**, JSON | - ---- - -## 2. Architecture decision — where the sentence is built - -**The backend builds the sentence; the frontend owns only display chrome.** - -Rationale (evidence-first): -- Only the backend can resolve codes→sentences and IDs→names correctly: the - rule catalog, the audit taxonomy registry, the intelligence `detail` payload, - and host/user label lookups all live server-side. A frontend mapping would - hard-code ~70 audit codes + the intelligence codes and **drift** from the - server the moment a new event type is added. -- The feed already works this way for alerts + monitoring — we are *finishing* - the pattern, not inventing one. -- Every consumer (all six surfaces, the SSE stream, future exports) gets - readable text for free. - -The frontend keeps a single thin helper (`eventDisplay.ts`) for the chrome only: -source label, severity label, icon, relative time. - -### Audit vs activity — two lenses over one store - -Both read `audit_events`, but they are **semantically distinct**, not duplicates: -- **`/activity?source=audit`** is a *lossy projection* — it drops `actor`, - `outcome`, `correlation_id`, `detail`, `redactions`, `parent_event_id`. It is - the operational headline. -- **`/api/v1/audit/events`** is the *full forensic envelope* — who/what/outcome, - the causal chain, the redaction record. It is the compliance record. - -Removing the dedicated audit surface would be a compliance regression (AU -controls), not a cleanup — hence "keep + improve." - ---- - -## 3. 
Phases - -### Phase 0 — Backend: human sentences for all five feed legs *(highest leverage)* - -Finish the three weak legs so `title`/`summary` are real sentences. This alone -makes `/activity`, the dashboard widget, and host-detail Recent Activity -readable, because they already render those fields. - -- **Compliance leg** (`transactions`): resolve `rule_id` → rule title (the rule - catalog used by the host compliance lens) and `change_kind` → verb. - → *"Ensure auditd is enabled: Pass → Fail."* -- **Intelligence leg** (`host_intelligence_events`): map `event_code` → a - description, and build the currently-empty `summary` from the stored `detail` - JSONB. → *"curl updated: 7.64 → 7.81."* -- **Audit leg** (`audit_events`): map `action` → a description, and **project the - `actor_label` / `resource_label` columns the UNION query currently drops** into - the row. → *"Alice created host web-01."* -- Likely needs a small runtime description registry for audit + intelligence - codes (the audit taxonomy is compile-time only today). - -Specs: bump `system-activity` (currently v1.1.0); `api-activity` may stay (shape -unchanged — only field *content* improves), confirm during implementation. - -### Phase 1 — Frontend: one shared display helper, adopt on all surfaces - -- New `frontend/src/api/eventDisplay.ts`: `sourceLabel`, `severityLabel`, - `iconFor`, `relativeTime`. -- Refactor `ActivityPage`, the dashboard widget, host-detail Recent Activity, - and the settings Audit log onto it. -- Delete the raw `source` / `severity` / UUID renders (dashboard widget first). - -Specs: `frontend-activity` (v1.0.0) + the dashboard/host-detail specs. - -### Phase 2 — Detail drawers + finish the deferred stubs - -- Backend: `GET /api/v1/audit/events/{id}`, `GET /api/v1/intelligence/events/{id}` - (and transactions) returning the full structured payload. -- Frontend: row-expand drawer showing that payload + **clickable host/user - context** (links to the host / user pages). 
-- Host-detail **Activity** tab → render the host-scoped feed (it is the "View - all" target from the Recent Activity card). -- Host-detail **Audit log** tab → **drop it**. Audit events carry no `host_id`, - so a host-scoped audit tab is empty by design; surface host-relevant audit via - `resource = host` inside the host Activity tab instead. - -### Phase 3 — Settings Audit log → the forensic / compliance view - -- Readable rows: action description, actor/resource **names** (not UUIDs), - outcome. -- Detail drawer over `GET /audit/events/{id}` (full envelope + redactions + - correlation chain). -- **CSV / JSON export** of a filtered audit query. -- AU alignment: AU-3 (record content), AU-6 (review/analysis), AU-7 - (reduction/report generation). - -> **End of the committed body of work.** After Phase 3 every surface is -> readable, the audit trail is a complete, exportable, name-resolved record, and -> the two deferred host tabs are resolved. Phases 4-5 below are fast-follow -> tracks, each gated on a separate go-ahead. - -### Phase 4 — Grouping / dedup / noise control *(fast-follow)* - -- Collapse bursts: *"12 packages updated on web-01"* instead of 12 rows. -- Suppress monitoring flaps (e.g. the dev-restart NULL→online noise already - noted in BACKLOG), severity rollups, "N similar events." -- Design fork to settle: group at **query time** (backend, scales to large - fleets) vs **client-side** (simpler, limited to the current page). - Recommendation: backend. - -### Phase 5 — Compliance hardening *(fast-follow, committed track)* - -- Tamper-evidence: the `signature` field already reserved in the audit taxonomy - (Ed25519 per-event signing or a hash-chain over the log). -- Retention / archival policy. -- An explicit AU-control mapping doc (which capability satisfies AU-2 / AU-3 / - AU-6 / AU-7 / AU-9 / AU-12). - ---- - -## 4. 
Sequencing summary - -| Phase | Scope | Track | -|-------|-------|-------| -| 0 | Backend sentences for all 5 legs | **Committed** (do first) | -| 1 | Shared frontend formatter, adopt everywhere | **Committed** | -| 2 | Detail endpoints + drawers; finish host tabs | **Committed** | -| 3 | Settings audit log: readable + export | **Committed** | -| 4 | Grouping / dedup / noise control | Fast-follow (separate go-ahead) | -| 5 | Tamper-evidence + retention + AU mapping | Fast-follow (separate go-ahead) | - -Each phase ships incrementally (spec → tests → code, normal PR flow). Phase 0 -delivers the largest visible win on its own. - ---- - -## 5. Key files (anchors for the work) - -- Feed service / UNION: `internal/activity/service.go` -- Feed handler: `internal/server/activity_handler.go` -- Audit emission + registry: `internal/audit/` (`emit.go`, `events.gen.go`) -- Audit query handler: `internal/server/handlers.go` (`GET /audit/events`) -- Taxonomy: `docs/engineering/audit_event_taxonomy.md` -- Frontend surfaces: `pages/activity/ActivityPage.tsx`, - `pages/dashboard/widgets.tsx`, `pages/HostDetailPage.tsx`, - `pages/settings/AuditPage.tsx` -- Specs: `specs/system/activity.spec.yaml` (v1.1.0), - `specs/system/audit-emission.spec.yaml` (v1.0.0), - `specs/api/activity.spec.yaml` (v1.0.0), - `specs/api/audit-events-query.spec.yaml` (v1.1.0), - `specs/frontend/activity.spec.yaml` (v1.0.0) diff --git a/docs/engineering/api_design_principles.md b/docs/engineering/api_design_principles.md deleted file mode 100644 index df59a7b8..00000000 --- a/docs/engineering/api_design_principles.md +++ /dev/null @@ -1,726 +0,0 @@ -# OpenWatch API Design Principles - -> **Status:** Locked 2026-04-27 -> **Authority:** This document is the rulebook for `api/openapi.yaml`. If the spec violates a rule here, the spec is wrong. -> **Audience:** Anyone designing or reviewing OpenAPI 3.1 endpoints for OpenWatch. - ---- - -## Why this document exists - -Today's backend has ~350 endpoints. 
Roughly 250 of those are not separate features — they are the same features exposed via inflated surface area: bulk variants doubled, format-per-endpoint, RPC-style action verbs, fragmented health/capabilities/stats, triplicated prefixes for the same workflow. - -Proper API design absorbs the inflation. With these rules applied, the same MUST capabilities cover **~60–80 endpoints**, not 350. This is structural reduction, not feature reduction. - ---- - -## Section 1 — Resources, not actions - -### 1.1 Resources are nouns, plural, kebab-case - -``` -/hosts ✓ -/host-groups ✓ -/scan-templates ✓ -/notification-channels ✓ - -/createHost ✗ (verb, not noun) -/host_groups ✗ (snake_case) -/HostGroup ✗ (PascalCase) -/host ✗ (singular) -``` - -Every URL segment that names a resource is a noun. The HTTP method is the verb. - -### 1.2 The seven canonical operations - -For every resource: - -| Operation | Method + Path | Notes | -|---|---|---| -| List | `GET /resources` | Cursor-paginated, filterable, sortable | -| Get one | `GET /resources/{id}` | Single resource read | -| Create | `POST /resources` | Body is the resource (or array; see §3) | -| Replace | `PUT /resources/{id}` | Full replacement; rare in practice | -| Update | `PATCH /resources/{id}` | Partial update; preferred for changes | -| Delete | `DELETE /resources/{id}` | Soft-delete by default; hard-delete via flag | -| Sub-resource read | `GET /resources/{id}/sub-resource` | For natural compositions | - -If you find yourself writing an eighth operation, you probably need a **state transition** (§2) or a **specialized resource** (§4), not a new verb. - -### 1.3 Identifiers are UUIDs - -All resource IDs are UUIDs. No integer IDs in the rebuild — including `users`. This eliminates the current `users.id is int while everything else is UUID` divergence. - -URL path: `/hosts/{host_id}` where `{host_id}` is `{type: string, format: uuid}`. 
- ---- - -## Section 2 — State transitions, not RPC verbs - -### 2.1 Default: PATCH with target status - -When a resource has a status field and the operation is "change that status": - -``` -✓ PATCH /alerts/{alert_id} body: {"status": "acknowledged"} -✓ PATCH /alerts/{alert_id} body: {"status": "resolved", "resolution_note": "..."} - -✗ POST /alerts/{alert_id}/acknowledge -✗ POST /alerts/{alert_id}/resolve -``` - -The handler validates the transition is legal for the current status (state machine) and rejects illegal transitions with `error.code = "transition.invalid"`. - -### 2.2 Exception: side-effect operations use `:action` - -When the operation has side effects beyond changing the resource's own status — running a scan, sending a test notification, triggering a discovery probe, refreshing a cache — use the colon-action form: - -``` -✓ POST /scans/{scan_id}:cancel -✓ POST /notification-channels/{channel_id}:test -✓ POST /hosts/{host_id}/intelligence:refresh -✓ POST /auth/mfa:enroll -✓ POST /auth/mfa:enable -``` - -The colon makes the action explicit and visually distinct from a sub-resource. (Convention borrowed from Google AIP and gRPC-Gateway.) - -**Test for which form to use:** -- Does the operation only change this resource's status? → PATCH -- Does the operation trigger work elsewhere (a job, a network call, an external side effect)? → `:action` - -### 2.3 State machines are documented - -For every stateful resource, the OpenAPI spec includes an `x-state-machine` extension showing valid statuses and legal transitions: - -```yaml -components: - schemas: - Alert: - x-state-machine: - states: [active, acknowledged, resolved, expired] - transitions: - - {from: active, to: acknowledged} - - {from: active, to: resolved} - - {from: acknowledged, to: resolved} - - {from: active, to: expired} # automatic, not via API -``` - -Auto-only transitions are noted but not exposed as API operations.
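The transition check described in §2.1 and the `Alert` state machine in §2.3 could be enforced with a small lookup table. The following is a sketch only: `apiTransitions`, `validateTransition`, and `ErrTransitionInvalid` are hypothetical names, and the table covers just the transitions listed for `Alert`, with the automatic `active` to `expired` transition deliberately left out because it is not exposed via the API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTransitionInvalid mirrors the error code "transition.invalid" from §2.1.
var ErrTransitionInvalid = errors.New("transition.invalid")

// apiTransitions lists the status changes a client may request via PATCH.
// active -> expired is automatic and therefore absent.
var apiTransitions = map[string][]string{
	"active":       {"acknowledged", "resolved"},
	"acknowledged": {"resolved"},
}

// validateTransition returns nil when a PATCH may move an alert from one
// status to another, and a wrapped ErrTransitionInvalid otherwise.
func validateTransition(from, to string) error {
	for _, allowed := range apiTransitions[from] {
		if allowed == to {
			return nil
		}
	}
	return fmt.Errorf("%w: %s -> %s", ErrTransitionInvalid, from, to)
}

func main() {
	fmt.Println(validateTransition("active", "acknowledged"))
	fmt.Println(validateTransition("resolved", "active"))
	fmt.Println(validateTransition("active", "expired"))
}
```

A handler would presumably call `validateTransition` with the stored status and the requested one before applying the PATCH, and map the wrapped error to the `transition.invalid` response code.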
- ---- - -## Section 3 — Bulk operations as collection-level POST - -### 3.1 The default rule: one endpoint, one or many - -``` -✓ POST /hosts body: {host} → returns 201 + {host} -✓ POST /hosts body: {hosts: [{...}]} → returns 207 multi-status - -✗ POST /hosts body: {host} -✗ POST /hosts/bulk body: {hosts: [...]} -``` - -The request body shape signals single vs many. Response status is `201 Created` for single, `207 Multi-Status` for many (with per-item results). - -### 3.2 Three legitimate exceptions - -Keep separate endpoints when **any** of: - -1. **Different RBAC permissions** — `read` vs `bulk-read` may have different permission requirements -2. **Different rate limits** — bulk operations get a separate rate-limit bucket that single ops shouldn't share -3. **Different async semantics** — single returns synchronously; bulk returns a job ID for polling - -When this applies, name explicitly: `POST /hosts:bulk-import` (action form, not `/hosts/bulk`). - -### 3.3 Async bulk pattern - -For bulk operations that fan out to background jobs: - -``` -POST /hosts:bulk-import body: {csv: "..."} -→ 202 Accepted -→ body: {job_id: "uuid", status: "queued", _links: {self: "/jobs/{id}"}} - -GET /jobs/{job_id} -→ 200 OK -→ body: {id, status, progress: {total, completed, failed}, result: {...}} -``` - -Polling is via `/jobs/{id}`, never via the original endpoint. - ---- - -## Section 4 — Specialized resources over specialized endpoints - -When you find yourself with many endpoints under one resource doing different things, consider whether the "things" are themselves resources. - -### 4.1 Example: Scans - -Today: `/scans/kensa`, `/scans/kensa/frameworks`, `/scans/kensa/health`, `/scans/kensa/rules/...`, `/scans/kensa/controls/...`, `/scans/kensa/sync-stats`, `/scans/kensa/sync` (12 endpoints). 
- -Properly modeled: -- `POST /scans` — execute a scan (Kensa is the default and only engine; no `kensa/` prefix needed) -- `GET /rules` + `GET /rules/{rule_id}` — rule reference (already exists at `/api/rules/reference/`; consolidate) -- `GET /frameworks` + `GET /frameworks/{framework_id}` — framework metadata -- `GET /frameworks/{framework_id}/coverage` — coverage stats (sub-resource) -- `GET /frameworks/{framework_id}/rules` — rules in framework (sub-resource) -- `GET /controls/{framework_id}/{control_id}` — control resource -- `POST /admin/operations:sync-rules` — admin operation (different resource entirely) - -12 endpoints → 7 endpoints across 4 well-defined resources. - -### 4.2 Example: Discovery / Intelligence - -Today: 15+ endpoints (`/{id}/discover-os`, `/{id}/os-info`, `/{id}/detect-platform`, `/{id}/system-info`, `/{id}/discovery/{basic,network,security,compliance}`, plus bulk variants, plus `/{id}/intelligence/{services,packages,users,audit,network,baseline}`). - -Properly modeled: -- `GET /hosts/{host_id}/intelligence` — full snapshot (filter via `?include=services,packages,...`) -- `POST /hosts/{host_id}/intelligence:refresh` — trigger collection (sync flag for single, async for bulk) -- `GET /hosts/{host_id}/intelligence/status` — last collection state - -15 endpoints → 3 endpoints. Same capability surface. - ---- - -## Section 5 — Pagination - -### 5.1 Cursor-based, never offset-based - -Every list endpoint is cursor-paginated: - -``` -GET /hosts?limit=50&cursor=eyJpZCI6Ii4uLiJ9 - -→ 200 OK -{ - "items": [...], - "next_cursor": "eyJpZCI6Ii4uLiJ9", // null if no more - "total_estimate": 1247, // optional, may be omitted for cost - "_links": { - "next": "/hosts?limit=50&cursor=eyJpZCI6Ii4uLiJ9" - } -} -``` - -**Why cursor:** offset pagination drifts under concurrent writes. The compliance surface has constant background updates (job queue writes transactions, scheduler updates host_compliance_schedule). Offset gives wrong results; cursor doesn't. 
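One possible server-side encoding of such a cursor is sketched below. It is an assumption, not the documented implementation: the `cursor` struct, its field names, and the `encodeCursor`/`decodeCursor` helpers are hypothetical. The token is base64url-encoded JSON holding the last row's sort value and ID, so the next page can resume with a keyset comparison instead of an offset. Clients still treat the token as an opaque string.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// cursor is a hypothetical server-side payload: the sort value and ID of
// the last row on the previous page.
type cursor struct {
	LastSort string `json:"s"`
	LastID   string `json:"id"`
}

// encodeCursor serializes the payload into a URL-safe token.
func encodeCursor(c cursor) (string, error) {
	raw, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	return base64.RawURLEncoding.EncodeToString(raw), nil
}

// decodeCursor reverses encodeCursor and rejects malformed tokens.
func decodeCursor(token string) (cursor, error) {
	var c cursor
	raw, err := base64.RawURLEncoding.DecodeString(token)
	if err != nil {
		return c, fmt.Errorf("malformed cursor: %w", err)
	}
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, fmt.Errorf("malformed cursor: %w", err)
	}
	return c, nil
}

func main() {
	token, _ := encodeCursor(cursor{LastSort: "2026-04-01T00:00:00Z", LastID: "example-id"})
	back, err := decodeCursor(token)
	fmt.Println(token, back, err)
}
```

With the decoded pair, a list query can filter on rows strictly after `(LastSort, LastID)` in the sort order, which is what keeps pages stable under concurrent writes.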
- -### 5.2 Cursor is opaque - -Clients treat cursors as opaque strings. Server may encode `(last_id, last_sort_value)` or similar. Never document cursor internals. - -### 5.3 Default limit, max limit - -- Default `limit`: 50 -- Max `limit`: 500 (returns 400 with `error.code = "pagination.limit_exceeded"` if higher) -- No way to say "give me everything" — clients must page - ---- - -## Section 6 — Filtering, sorting, field selection - -### 6.1 Filtering via query params - -Simple equality and ranges as query params: - -``` -GET /hosts?status=active&os_family=rhel&created_after=2026-04-01 -GET /scans?status=completed&host_id=abc-...&framework=cis-rhel9-v2.0.0 -GET /alerts?severity=critical&status=active&since=2026-04-20T00:00:00Z -``` - -Multi-value filters use repeated keys (RFC-recommended) or comma-separated (pick one and document): - -``` -GET /hosts?os_family=rhel&os_family=ubuntu # repeated keys -GET /hosts?os_family=rhel,ubuntu # comma-separated -``` - -**Choice for OpenWatch: comma-separated.** Easier to read, simpler to URL-encode in agent code. Document in spec. - -### 6.2 Complex filters: structured query body - -For queries beyond simple equality (Boolean combinations, full-text search, date arithmetic), use a `POST /resources:query` action with a structured body. Don't pretend GraphQL via overloaded query strings. - -``` -POST /transactions:query -body: -{ - "filter": { - "and": [ - {"field": "status", "op": "in", "values": ["fail", "skip"]}, - {"field": "severity", "op": "in", "values": ["high", "critical"]}, - {"field": "applied_at", "op": "gte", "value": "2026-04-01T00:00:00Z"} - ] - }, - "sort": [{"field": "applied_at", "direction": "desc"}], - "limit": 100 -} -``` - -This is already prototyped in OpenWatch (`POST /api/transactions/query`) — extend the pattern. 
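The structured body above maps naturally onto a few Go types. The sketch below is illustrative only: the type names (`Query`, `Filter`, `Condition`, `SortKey`) and the `parseQuery` helper are assumptions, and only the `and` combinator from the example is modeled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Condition is one leaf of the filter; either Value or Values is set.
type Condition struct {
	Field  string   `json:"field"`
	Op     string   `json:"op"`
	Value  string   `json:"value,omitempty"`
	Values []string `json:"values,omitempty"`
}

// Filter models only the "and" combinator shown in the example body.
type Filter struct {
	And []Condition `json:"and"`
}

// SortKey is one entry of the "sort" array.
type SortKey struct {
	Field     string `json:"field"`
	Direction string `json:"direction"`
}

// Query is the hypothetical request body for POST /resources:query.
type Query struct {
	Filter Filter    `json:"filter"`
	Sort   []SortKey `json:"sort"`
	Limit  int       `json:"limit"`
}

// parseQuery decodes the body and rejects unknown sort directions.
func parseQuery(body []byte) (Query, error) {
	var q Query
	if err := json.Unmarshal(body, &q); err != nil {
		return q, fmt.Errorf("malformed query body: %w", err)
	}
	for _, s := range q.Sort {
		if s.Direction != "asc" && s.Direction != "desc" {
			return q, fmt.Errorf("sort direction %q is not asc or desc", s.Direction)
		}
	}
	return q, nil
}

func main() {
	body := []byte(`{"filter":{"and":[{"field":"status","op":"in","values":["fail","skip"]}]},` +
		`"sort":[{"field":"applied_at","direction":"desc"}],"limit":100}`)
	q, err := parseQuery(body)
	fmt.Printf("%+v %v\n", q, err)
}
```

A real handler would also validate `field` and `op` against a per-resource allowlist before building SQL; that part is omitted here.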
- -### 6.3 Sort - -Single param, comma-separated, `field:direction`: - -``` -GET /hosts?sort=hostname:asc -GET /hosts?sort=created_at:desc,hostname:asc -``` - -Default sort is documented per resource. Stability under concurrent writes is required (cursor pagination relies on it). - -### 6.4 Field selection (sparse responses) - -For agent efficiency, `?fields=id,hostname,status` returns only those fields. - -``` -GET /hosts?fields=id,hostname,compliance_score - -→ 200 OK -{ - "items": [ - {"id": "...", "hostname": "...", "compliance_score": 87.4} - ] -} -``` - -Default response includes a documented "default field set" — typically the most-used 80%, not everything. - -### 6.5 Inclusion of related resources - -For agent composition, `?include=` follows references: - -``` -GET /scans/{scan_id}?include=host,findings.rule - -→ 200 OK -{ - "id": "...", - "host_id": "...", - "host": {"id": "...", "hostname": "...", ...}, // included - "findings": [ - {"rule_id": "...", "rule": {"id": "...", "title": "...", ...}, ...} - ] -} -``` - -`include` paths are documented per resource. Maximum depth: 2 levels. (Deeper nesting suggests a query-API endpoint instead.) - ---- - -## Section 7 — Content negotiation - -### 7.1 Output format via `Accept` header, not URL - -``` -GET /scans/{scan_id}/report -Accept: application/json → JSON -Accept: text/html → HTML -Accept: text/csv → CSV -Accept: application/pdf → PDF (only if implemented) - -✗ GET /scans/{scan_id}/report/json -✗ GET /scans/{scan_id}/report/html -✗ GET /scans/{scan_id}/report/csv -``` - -Same for audit exports, posture exports, anything that has multiple representations. - -### 7.2 Default representation - -When no `Accept` header is sent, default to `application/json`. Never silently return HTML. - -### 7.3 Supported formats are documented - -Each endpoint that supports content negotiation lists supported media types in its OpenAPI `responses` section. 
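The negotiation rules in §7.1 and §7.2 can be sketched as a small selection function. This is a simplified illustration with assumed names (`supported`, `negotiate`): it honors the order in which media types are listed and ignores `q` weights, which a production implementation would need to parse. What to return when nothing matches is not specified in this document; `406 Not Acceptable` is the conventional HTTP answer.

```go
package main

import (
	"fmt"
	"strings"
)

// supported lists the representations this hypothetical endpoint can produce.
var supported = []string{"application/json", "text/html", "text/csv"}

// negotiate picks the first supported media type named in the Accept header.
// An empty header or "*/*" falls back to JSON. The second return value is
// false when no listed type is supported.
func negotiate(accept string) (string, bool) {
	if strings.TrimSpace(accept) == "" {
		return "application/json", true
	}
	for _, part := range strings.Split(accept, ",") {
		mediaType, _, _ := strings.Cut(part, ";") // drop parameters such as q=0.9
		mediaType = strings.ToLower(strings.TrimSpace(mediaType))
		if mediaType == "*/*" {
			return "application/json", true
		}
		for _, s := range supported {
			if mediaType == s {
				return s, true
			}
		}
	}
	return "", false
}

func main() {
	for _, h := range []string{"", "text/csv", "image/png", "text/html;q=0.9, application/json"} {
		mt, ok := negotiate(h)
		fmt.Printf("%q -> %q %v\n", h, mt, ok)
	}
}
```

Because the empty-header branch returns JSON, the function never falls through to HTML by accident, which is the behavior §7.2 asks for.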
- ---- - -## Section 8 — Errors - -### 8.1 The error envelope (locked in roadmap) - -All error responses (4xx, 5xx) use the same shape: - -```json -{ - "error": { - "code": "host.unreachable", - "fault": "external", - "retryable": true, - "human_message": "Host 1.2.3.4 did not respond on port 22 within 5 seconds.", - "detail": { - "host_id": "...", - "address": "1.2.3.4", - "timeout_seconds": 5 - }, - "correlation_id": "req-...-..." - } -} -``` - -Fields: - -- `code` — stable string, dotted hierarchy. Never changes meaning across versions. Examples: `auth.token_expired`, `host.unreachable`, `policy.version_mismatch`, `transition.invalid`, `pagination.limit_exceeded`, `idempotency.key_reused`. Sourced from the registry (§8.4). -- `fault` — who is at fault: `client` (caller's input/perms), `server` (our bug), `policy` (denied by policy/license/RBAC), `external` (downstream system — host SSH, plugin, OIDC IdP). Drives agent retry/abort logic. -- `retryable` — boolean. Tells agents whether the *same call* may succeed later without modification. False on validation/policy errors (caller must change something first). -- `human_message` — for UI/log display. May be localized later. -- `detail` — structured object with operation-specific context. When the registry defines a `detail_schema` for the code, the response MUST conform. -- `correlation_id` — same `X-Correlation-Id` returned in headers; included in body for log-grep. - -> **Naming note (2026-04-29):** The `fault` field was previously named `category`, which collided with the registry's namespace grouping (`auth`, `host`, `scan`, ...). Renamed before any code shipped. - -### 8.2 HTTP status alone is not enough - -`409 Conflict` means many things in HTTP. 
The `error.code` disambiguates: - -- `409` + `policy.version_mismatch` → caller used stale policy version -- `409` + `transition.invalid` → caller tried illegal state transition -- `409` + `idempotency.key_reused` → idempotency key collision - -Agents key off `error.code`, never the HTTP status alone. - -### 8.3 Error code naming convention - -Dotted hierarchy: `{category}.{failure}`. Category is the namespace (defined in the registry's `categories` block). Failure is the kind of thing that went wrong. - -``` -auth.token_expired host.unreachable scan.already_running -auth.mfa_required host.ssh_authentication_failed scan.kensa_error -authz.permission_denied host.duplicate policy.version_mismatch -validation.field_required pagination.limit_exceeded transition.invalid -license.feature_unavailable quota.max_hosts_exceeded audit.query_invalid -rate_limit.exceeded server.internal server.timeout -``` - -See `api/error_codes.yaml` for the full registry. - -### 8.4 The registry is the source of truth - -All error codes live in `api/error_codes.yaml`. The registry is the **only** place a code is defined. Codegen produces: - -- `internal/errors/codes.gen.go` — typed Go constants (e.g., `errors.HostUnreachable`) and a `Code -> metadata` map for runtime lookup of `http_status`, `fault`, `retryable`, and `detail_schema`. -- A reference document rendered into the OpenAPI bundle so consumers can see all codes in one place. - -Build invariants enforced by `scripts/validate-error-codes.go` (run in CI): - -- Every code matches `^[a-z][a-z0-9_]*\.[a-z][a-z0-9_]*$`. -- Every code's prefix references a defined `categories[].id`. -- `http_status` is a valid 4xx/5xx (or `402`/`503` for soft denials). -- `fault` is one of `client | server | policy | external`. -- `detail_schema` (when present) is valid JSON Schema. -- No duplicates between `errors` and `deprecated_errors`. - -**Workflow for adding a new code:** - -1. Add the entry to `api/error_codes.yaml` (PR review required). -2.
Run codegen: `go generate ./internal/errors/...`. -3. Reference the generated constant from handler code: `errors.New(errors.HostUnreachable, "...")`. -4. CI fails the build if a handler emits a string literal that does not match a registry entry. - -**Deprecation:** Move retired codes from `errors:` to `deprecated_errors:` (preserved for log/audit-history compatibility). Build fails if a deprecated code is emitted from new code. - -**OpenAPI references:** Domain specs (e.g., `hosts.yaml`, `scans.yaml`) declare `4xx`/`5xx` responses with a `$ref` to the shared `ErrorEnvelope` schema. The list of *possible* codes per endpoint is surfaced via `x-possible-error-codes` (advisory; the registry remains authoritative). - ---- - -## Section 9 — Idempotency - -### 9.1 Required on POST, PUT, PATCH, DELETE - -Every mutating endpoint accepts and respects `Idempotency-Key` header: - -``` -POST /scans -Idempotency-Key: 5f8e9a0b-... -body: {host_id: "...", template_id: "..."} - -(replay the exact same request with the same key) -→ Returns the same response that the original call returned. Does not double-execute. -``` - -### 9.2 Storage and TTL - -Idempotency keys + their result envelopes are stored for **24 hours**. Replays within that window return cached results. Replays after that window are treated as new requests. - -### 9.3 Key collision - -If two requests have the same `Idempotency-Key` but different bodies, return `409` with `error.code = "idempotency.key_reused"`. Caller used the same key for a different operation — they need a new key. - -### 9.4 GET is naturally idempotent - -`GET` and `HEAD` don't accept idempotency keys. They're already safe to retry by definition. - ---- - -## Section 10 — Correlation and tracing - -### 10.1 X-Correlation-Id end-to-end - -Every request: - -1. Server checks for `X-Correlation-Id` header. If present, use it. If absent, generate a UUID. -2. 
Correlation ID is added to `context.Context` and propagated through all downstream calls (DB, Kensa, webhook delivery). -3. Logged on every line of the request lifecycle. -4. Returned in response headers: `X-Correlation-Id: req-{uuid}`. -5. Recorded in audit log entry for any mutating operation. -6. Included in error envelope `error.correlation_id`. - -### 10.2 Format - -`req-{uuid}` for server-generated. Caller-provided values are accepted as-is (don't validate format aggressively — agents may use their own conventions). - ---- - -## Section 11 — Auth (transport) - -### 11.1 Auth mechanisms - -- `Authorization: Bearer {jwt}` — user JWT -- `Authorization: Bearer owk_{key}` — API key (prefix-distinguished) -- mTLS — for ORSA plugin / agent-to-OpenWatch (deferred to Phase 1+) - -### 11.2 Anonymous endpoints are explicit - -The OpenAPI spec marks anonymous endpoints with `security: []`. Anything not marked requires auth. Default-secure. - -### 11.3 RBAC requirements in spec - -Every endpoint declares its required permission via `x-required-permission`: - -```yaml -/hosts: - get: - x-required-permission: host:read - post: - x-required-permission: host:write - -/hosts/{host_id}: - delete: - x-required-permission: host:delete # registry has dangerous: true - x-audit-events: [host.deleted] # MUST be non-empty for dangerous - -# Multi-permission (rare; most endpoints use the single form above): -/some/endpoint: - get: - x-required-permission: - any_of: [host:read, host:write] -``` - -The values are **registry-validated** against `auth/permissions.yaml`. The build fails if a spec references an unknown permission. See `docs/engineering/rbac_registry.md` for the full registry model and the registry-first workflow for adding permissions. - -`oapi-codegen` generates middleware that enforces these from the spec. No hand-written `@require_permission()` decorators in handler code. - -> **Naming note (2026-04-30):** values are lowercase `resource:action` (e.g., `host:read`), not `HOST_READ`.
The earlier all-caps form predates the registry; only the registry-validated form is accepted. Drift is a build error. - -### 11.4 License feature gating in spec - -Every endpoint that requires an OpenWatch+ license feature declares it via `x-required-feature`: - -```yaml -/compliance/audit/queries:execute: - post: - x-required-permission: audit:read - x-required-feature: audit_query -``` - -The feature ID must exist in `licensing/features.yaml`. The build fails if a spec references an unknown feature ID. See `docs/engineering/licensing_foundation.md` for the full feature gating model. - -**Cross-validation with the RBAC registry:** if the permission has `license_gated: X` in `permissions.yaml`, the operation MUST declare `x-required-feature: X` (or omit the permission entirely). Mismatch — declaring a license-gated permission without the matching feature, or vice versa — fails the build. This prevents the failure mode where the permission says "license needed" but the operation forgets to declare it. - -When both `x-required-permission` and `x-required-feature` are declared, **both** must pass for the request to reach the handler. The combined middleware (per `docs/engineering/rbac_registry.md` §6) checks RBAC first (denial → `403`), then license (denial → `402`). One middleware, one denial path, one audit event. - -### 11.5 Audit events declared in spec - -Every mutating endpoint (POST/PUT/PATCH/DELETE) declares the audit events it may emit via `x-audit-events`: - -```yaml -/hosts: - post: - x-required-permission: host:write - x-audit-events: [host.created] -``` - -The build verifies: -- All declared codes exist in `audit/events.yaml` -- Every mutating endpoint declares at least one audit event - -`x-audit-events` is documentation, not codegen — it does not generate the emission. Handlers explicitly call `audit.Emit(ctx, audit.HostCreated, ...)` using typed constants. The spec declaration is the contract that the handler must honor.
- -See `docs/engineering/audit_event_taxonomy.md` for the full taxonomy, redaction discipline, and emission patterns. - ---- - -## Section 12 — Health, capabilities, and admin - -### 12.1 One health, one capabilities, one analytics root - -``` -GET /health # current health, optional ?component= -GET /health/history # timeline, optional ?component=, ?since= -GET /capabilities # what this server supports, optional ?component= -GET /analytics/{domain} # rolled-up stats per domain -``` - -No more `/scanner/health`, `/health/integrations`, `/health/service`, `/health/content`, `/system/capabilities`, `/discovery/network/capabilities`, etc. Component filter absorbs all of them. - -### 12.2 Admin operations are operations, not endpoints - -Long-running or system-wide actions (sync rules, enforce retention, backfill state) are `POST /admin/operations:{operation}` returning a job ID: - -``` -POST /admin/operations:sync-rules -→ 202 Accepted -→ body: {job_id: "...", status: "queued"} -``` - -Track via `GET /jobs/{job_id}`. Don't expose them as one-off endpoints. - ---- - -## Section 13 — Sub-resources, not flattened paths - -When a relationship exists, use sub-resource paths: - -``` -✓ GET /hosts/{host_id}/transactions # transactions for a host -✓ GET /hosts/{host_id}/findings # findings for a host -✓ GET /scans/{scan_id}/findings # findings for a scan -✓ GET /host-groups/{group_id}/hosts # hosts in a group -✓ GET /webhooks/{webhook_id}/deliveries # deliveries for a webhook - -✗ GET /transactions?host_id=... # works but less discoverable -✗ GET /scan-findings?scan_id=... # awkward -``` - -The `?host_id=` filter form is also acceptable for cross-cutting queries (e.g., "all findings for this host across all scans"). Both patterns can coexist when each is more natural in its context. - -**Rule of thumb:** if the relationship is "one-to-many and the parent owns the children", use sub-resource. If the children exist independently and you're querying across parents, use filter.
- ---- - -## Section 14 — Versioning - -### 14.1 Version in URL: `/api/v1/...` - -Major version in the URL path. `/api/v1/hosts`, `/api/v2/hosts`, etc. - -### 14.2 What forces a v2 - -- Removing a field from a default response -- Changing the meaning of an existing field -- Removing an endpoint -- Changing required parameters - -What does NOT force a v2: - -- Adding a new field (clients must ignore unknown fields) -- Adding a new endpoint -- Adding a new optional parameter -- Adding a new error code (registered in `error_codes.yaml`) -- Adding a new value to an enum (clients must handle unknown enum values gracefully) - -### 14.3 OpenAPI is the version source of truth - -`info.version` in `openapi.yaml` is the API version. Changes to it follow semver. Major bumps require a separate spec file (e.g., `api/openapi-v2.yaml`). - ---- - -## Section 15 — Deprecation - -Endpoints being removed are marked `deprecated: true` in the spec for one minor version, then removed in the next major version. - -```yaml -/old-endpoint: - get: - deprecated: true - description: | - Deprecated since v1.4.0. Use `/new-endpoint` instead. - Removed in v2.0.0. -``` - -Deprecated endpoints return a `Deprecation` header (RFC 9745) and a `Sunset` header (RFC 8594) carrying the sunset date. - ---- - -## Section 16 — Specter and OpenAPI - -OpenAPI describes **HTTP contracts** — request/response shapes, status codes, error structures. - -Specter describes **behavioral contracts** — what guarantees the system makes (postconditions, invariants, side effects). - -Both are checked into git. Both are agent-readable. They reference each other but don't overlap.
- -Example: - -- OpenAPI for `POST /scans/{scan_id}:cancel`: defines the request, response, error codes, idempotency key handling -- Specter spec for the same operation: defines what state changes happen (scan goes to `cancelled`), what audit events emit, what guarantees about in-flight Kensa subprocess termination - ---- - -## Section 17 — The "don't collapse" exceptions - -Sometimes preserving two endpoints is correct, even if they look similar. Three legitimate reasons: - -1. **Different RBAC permissions.** `GET /audit/events` (any auditor) and `POST /audit/log` (requires `audit:write`) — different verbs, different permissions, different endpoints. Correct. - -2. **Different rate limits.** Bulk operations may need a separate rate-limit bucket from single ops. Document the bucket in the spec via `x-rate-limit-bucket`. - -3. **Different async semantics.** Sync single-host operation (returns result) vs bulk fan-out (returns job ID, poll for status) are genuinely different contracts. One endpoint with conditional response shape is worse, not better. - -**Rule:** collapse unless one of the three applies. When you don't collapse, document the reason in `description`. - ---- - -## Section 18 — Reviewing a draft endpoint - -Checklist for any new or modified endpoint: - -- [ ] Path uses kebab-case nouns, plural? -- [ ] HTTP method matches the canonical operation? -- [ ] If status change, is it PATCH-with-status (default) or `:action` (with reason documented)? -- [ ] If RPC-style verb, is one of the §17 exceptions documented? -- [ ] List endpoints have cursor pagination, filters, sort, fields, include? -- [ ] Mutating endpoints accept `Idempotency-Key`? -- [ ] Error responses use the canonical error envelope? -- [ ] All error codes registered in `error_codes.yaml`? -- [ ] `X-Correlation-Id` propagation documented? -- [ ] `x-required-permission` declared? -- [ ] `x-state-machine` declared (if stateful resource)? -- [ ] Content negotiation via `Accept`, not URL suffix?
-- [ ] Sub-resource pattern used for parent-child relationships? - -If any item is "no, because...", document the because in the spec. - ---- - -## Section 19 — Anti-pattern catalog - -Patterns that should never appear in `openapi.yaml`: - -| Anti-pattern | Why bad | -|---|---| -| `/createFoo`, `/updateFoo`, `/deleteFoo` | Verbs in URL; HTTP methods exist | -| `POST /foo/{id}/acknowledge` (when not a true side effect) | Status change should be PATCH | -| `GET /foo/{id}/report/json`, `/report/html`, `/report/csv` | Use `Accept` header | -| `POST /foo/bulk` and `POST /foo` doing the same thing | Collapse to one | -| `?page=3&per_page=50` | Offset pagination drifts under concurrent writes | -| `GET /scanner/health`, `GET /worker/health`, `GET /db/health` | Use `GET /health?component=...` | -| `POST /foo/{id}/start`, `POST /foo/{id}/stop`, `POST /foo/{id}/cancel` | Status transitions; PATCH with target status | -| Returning 200 OK with `{success: false, error: "..."}` | Use proper HTTP status + error envelope | -| Different endpoints for "with details" vs "without details" | Use `?fields=` or `?include=` | -| Endpoints whose only difference is auth requirement | Different auth on the same endpoint is fine via spec | - ---- - -## Section 20 — Living document - -This file is updated when a real case forces a clarification. Updates are dated: - -- 2026-04-27 — Initial version. Locked baseline rules. - -When a rule is added or changed, every existing OpenAPI domain spec must be re-checked for compliance. The principles document leads; the spec follows. diff --git a/docs/engineering/audit_event_taxonomy.md b/docs/engineering/audit_event_taxonomy.md deleted file mode 100644 index c83a4e67..00000000 --- a/docs/engineering/audit_event_taxonomy.md +++ /dev/null @@ -1,961 +0,0 @@ -# Audit Event Taxonomy Foundation - -> **Status:** Locked design 2026-04-29 -> **Authority:** This document is the architectural foundation for audit events in the Go rebuild. 
Implementation in Stage 0 must conform. -> **Why now:** Audit events are referenced from licensing (6 events already), authentication, host management, scans, compliance, system lifecycle, and remediation. Without a stable taxonomy, every component invents its own naming. Drift starts on day one. - ---- - -## 1. Why audit events are foundation - -Audit events cross every seam: - -- **Compliance posture** — audit log is a regulated artifact (NIST 800-53 AU-2 / AU-3, ISO 27001 A.12.4) -- **Agent orchestration** — agents verify operation effects via audit log, not back-channel -- **Forensics** — incident response queries the audit log -- **Operator visibility** — operators investigate "what happened?" through audit -- **Compliance reporting** — audit events support attestation -- **Audit-as-API** — first-class queryable resource per agent-first §1 - -Today's Python codebase has **two competing audit implementations**: file-based (`SecurityAuditLogger`) and DB-based (`audit_db.py`). Static analysis flagged this as a duplication to consolidate. The rebuild commits to **DB-based as the canonical path** — and gets the taxonomy right from day one. - -If audit events are added ad-hoc per-component, three things break: - -1. **Drift.** `auth.success` vs `auth.login_success` vs `login.successful` — same event, three names. Agents and queries need to handle all three forever. -2. **Untyped detail blobs.** Every event has different fields in `detail`; aggregation is impossible. -3. **Hidden-secret leaks.** Without a redaction discipline, passwords / tokens / SSH keys end up in logs. - -Doing it once, properly, in Stage 0 Day 5 is roughly half a day. Retrofitting it later costs 1–2 weeks plus the bugs from inconsistent past data. - ---- - -## 2. Core requirements - -### 2.1 Functional - -1. **Stable, registered taxonomy** — every event type has a stable string code, defined in a single registry -2. 
**Structured envelope** — every event has the same top-level shape; only `detail` varies -3. **Type-safe emission** — emitting an event uses generated Go code, not raw strings -4. **Queryable API** — events are queryable via REST (filters, pagination, time range, structured search) -5. **Correlation** — every event ties to a request, session, and (where applicable) parent event -6. **Redaction** — sensitive fields are explicitly scrubbed before write -7. **Performance** — emission is async; never blocks the request path -8. **Tamper-evidence** — events are append-only with optional signature chain - -### 2.2 Non-functional - -1. **Schema stability** — event codes never rename; new ones add without breaking old queries -2. **Cardinality bounded** — taxonomy stays at ~100 event types, not 1000 -3. **Compliance-aligned** — meets NIST 800-53 AU-3 (Content of Audit Records) and ISO 27001 A.12.4 -4. **Storage efficient** — JSONB detail with GIN index; bulk-insertable -5. **Always emit, never crash** — audit failures degrade gracefully (log error, continue) - ---- - -## 3. Event envelope (canonical schema) - -Every audit event has this shape. Only `detail` varies per event type. 
- -```json -{ - "id": "uuid-v7", - "occurred_at": "2026-04-29T14:32:01.123Z", - "recorded_at": "2026-04-29T14:32:01.156Z", - "action": "auth.login.success", - "severity": "info", - "outcome": "success", - - "actor": { - "type": "user", - "id": "uuid", - "label": "alice@example.com", - "ip_address": "10.0.0.42", - "user_agent": "openwatch/1.0.0", - "session_id": "uuid" - }, - - "resource": { - "type": "host", - "id": "uuid", - "label": "web-prod-01.example.com" - }, - - "correlation_id": "req-018f3c5d-...", - "parent_event_id": "uuid-of-causally-prior-event", - - "policy_version": "exception-policy-v1.2.0", - - "detail": { - "auth_method": "password", - "mfa_used": true - }, - - "redactions": ["password_hash"], - "signature": "ed25519-sig-of-canonical-form" -} -``` - -### 3.1 Field semantics - -| Field | Required | Notes | -|---|---|---| -| `id` | yes | UUIDv7 — sortable by time, globally unique | -| `occurred_at` | yes | When the event actually happened (operation timestamp) | -| `recorded_at` | yes | When we wrote it. Always ≥ `occurred_at`. Lag = how far behind the writer is. | -| `action` | yes | Stable code from registry. Dotted hierarchy. | -| `severity` | yes | `info` / `warning` / `error` / `critical` | -| `outcome` | yes | `success` / `failure` / `denied` | -| `actor` | yes | Who/what did this. `type` ∈ `{user, api_key, agent, system, scheduler}`. | -| `actor.id` | mostly | Required for `user` and `api_key`; `system`/`scheduler` may omit | -| `actor.label` | yes | Human-readable identifier (username, key name, "scheduler", etc.) | -| `actor.ip_address` | when known | IP from request; null for `system`/`scheduler` | -| `actor.user_agent` | when known | UA from request | -| `actor.session_id` | when applicable | Session UUID | -| `resource` | mostly | What was acted on. Omit only for events with no resource (e.g., `system.startup`). 
| -| `correlation_id` | yes | Request correlation ID from middleware | -| `parent_event_id` | optional | Causally prior event (e.g., scan.completed parent for findings) | -| `policy_version` | when applicable | Which policy version applied (per agent-first principle 2) | -| `detail` | optional | Per-action structured fields. Schema documented per event in registry. | -| `redactions` | when applicable | Field names that were scrubbed before storage (`["password", "ssh_key"]`) | -| `signature` | optional | Ed25519 signature of canonical form (for tamper-evidence; high-assurance deployments) | - -### 3.2 Why UUIDv7 - -- Time-sortable by primary key (no need for `(occurred_at, id)` composite index for ordering) -- Globally unique without coordination -- Natural fit for cursor pagination (UUIDv7 IS the cursor) -- Replaces "UUID + timestamp" patterns with one field - ---- - -## 4. Taxonomy registry - -The registry is the source of truth. Lives at `audit/events.yaml`. Format: - -```yaml -# audit/events.yaml -version: 1 - -categories: - - id: auth - description: Authentication and session lifecycle - - id: authz - description: Authorization decisions and RBAC - - id: host - description: Host management and discovery - - id: scan - description: Scan execution and lifecycle - - id: compliance - description: Compliance state, exceptions, baselines, drift - - id: alert - description: Alert generation and lifecycle - - id: notification - description: Notification dispatch and delivery - - id: license - description: License install, expiry, feature gating - - id: policy - description: Policy load, validation, application (Principle 2) - - id: remediation - description: Remediation requests and execution - - id: integration - description: External system integration (Jira, webhooks, plugins) - - id: system - description: Service lifecycle and configuration - - id: admin - description: Administrative operations - -events: - # ----- auth ----- - - code: auth.login.success - 
severity: info - description: User authenticated successfully - actor_types: [user] - detail_schema: - auth_method: {type: string, enum: [password, sso, api_key]} - mfa_used: {type: boolean} - - - code: auth.login.failure - severity: warning - description: Authentication attempt failed - actor_types: [user] - detail_schema: - reason: {type: string, enum: [invalid_credentials, account_locked, mfa_required, mfa_failed]} - auth_method: {type: string} - - - code: auth.logout - severity: info - description: User explicitly logged out - actor_types: [user] - - - code: auth.token.refreshed - severity: info - description: Refresh token used to issue new access token - actor_types: [user, api_key] - - - code: auth.token.revoked - severity: warning - description: Token added to revocation blacklist - actor_types: [user, system] - detail_schema: - reason: {type: string, enum: [user_logout, admin_revoke, suspicious_activity, token_rotation]} - - - code: auth.mfa.enrolled - severity: info - description: User completed MFA enrollment - detail_schema: - method: {type: string, enum: [totp, fido2]} - - - code: auth.mfa.validated - severity: info - description: MFA challenge passed during login - - - code: auth.mfa.failed - severity: warning - description: MFA challenge failed - - - code: auth.session.expired - severity: info - description: Session timed out (idle or absolute) - - - code: auth.api_key.created - severity: info - description: API key issued - - - code: auth.api_key.revoked - severity: warning - description: API key invalidated - - # ----- authz ----- - - code: authz.permission.denied - severity: warning - description: RBAC denied an authenticated request - detail_schema: - required_permission: {type: string} - route: {type: string} - - - code: authz.role.assigned - severity: info - description: Role assigned to user - - - code: authz.role.removed - severity: warning - description: Role removed from user - - # ----- host ----- - - code: host.created - severity: info - 
- - code: host.updated - severity: info - - - code: host.deleted - severity: warning - - - code: host.connectivity.checked - severity: info - detail_schema: - ping_success: {type: boolean} - ssh_accessible: {type: boolean} - response_time_ms: {type: integer} - - - code: host.platform.detected - severity: info - detail_schema: - os_family: {type: string} - os_version: {type: string} - - # ----- scan ----- - - code: scan.queued - severity: info - detail_schema: - framework: {type: string} - template_id: {type: [string, 'null']} - - - code: scan.started - severity: info - - - code: scan.completed - severity: info - detail_schema: - compliance_score: {type: number} - passed: {type: integer} - failed: {type: integer} - - - code: scan.failed - severity: error - detail_schema: - error_code: {type: string} - error_message: {type: string} - - - code: scan.cancelled - severity: warning - detail_schema: - cancellation_reason: {type: string} - - - code: scan.session.created - severity: info - detail_schema: - total_hosts: {type: integer} - - # ----- compliance ----- - - code: compliance.state.changed - severity: info - description: Rule state transition (write-on-change) - detail_schema: - rule_id: {type: string} - previous_status: {type: string} - new_status: {type: string} - - - code: compliance.exception.requested - severity: info - - - code: compliance.exception.approved - severity: warning - - - code: compliance.exception.rejected - severity: info - - - code: compliance.exception.revoked - severity: warning - - - code: compliance.exception.expired - severity: info - - - code: compliance.baseline.established - severity: info - - - code: compliance.baseline.cleared - severity: warning - - - code: compliance.drift.detected - severity: warning - detail_schema: - drift_type: {type: string, enum: [major, minor, improvement]} - score_delta: {type: number} - - - code: compliance.snapshot.created - severity: info - - # ----- alert ----- - - code: alert.created - severity: info - 
detail_schema: - alert_type: {type: string} - alert_severity: {type: string} - - - code: alert.acknowledged - severity: info - - - code: alert.resolved - severity: info - - # ----- notification ----- - - code: notification.dispatched - severity: info - detail_schema: - channel_type: {type: string, enum: [slack, email, webhook, jira, pagerduty]} - delivery_id: {type: string} - - - code: notification.delivery.failed - severity: error - - - code: notification.delivery.succeeded - severity: info - - # ----- license ----- - - code: license.installed - severity: info - - - code: license.invalid - severity: error - - - code: license.expiring_soon - severity: warning - - - code: license.expired - severity: error - - - code: license.feature_check_denied - severity: warning - detail_schema: - feature: {type: string} - suppressed_count: {type: integer, description: Events deduped within 1-min window} - - - code: license.quota_exceeded - severity: warning - detail_schema: - quota: {type: string} - limit: {type: integer} - current: {type: integer} - - - code: license.clock_rollback_detected - severity: critical - - - code: license.tampered - severity: critical - - # ----- policy (Principle 2) ----- - - code: policy.loaded - severity: info - detail_schema: - policy_type: {type: string} - policy_version: {type: string} - - - code: policy.invalid - severity: error - - - code: policy.applied - severity: info - description: Operation evaluated against a versioned policy - detail_schema: - policy_type: {type: string} - policy_version: {type: string} - decision: {type: string, enum: [allow, deny, defer]} - - # ----- remediation ----- - - code: remediation.requested - severity: info - - - code: remediation.approved - severity: warning - - - code: remediation.executed - severity: warning - detail_schema: - dry_run: {type: boolean} - steps_succeeded: {type: integer} - steps_failed: {type: integer} - - - code: remediation.rolled_back - severity: warning - - # ----- integration ----- - - 
code: integration.webhook.delivered - severity: info - - - code: integration.webhook.failed - severity: error - - - code: integration.plugin.installed - severity: info - - - code: integration.plugin.executed - severity: info - - # ----- system ----- - - code: system.startup - severity: info - actor_types: [system] - - - code: system.shutdown - severity: info - actor_types: [system] - - - code: system.config.changed - severity: warning - detail_schema: - config_key: {type: string} - - - code: system.migration.applied - severity: info - detail_schema: - migration_id: {type: string} - - - code: system.health.degraded - severity: error - detail_schema: - component: {type: string} - - # ----- admin ----- - - code: admin.user.created - severity: warning - - - code: admin.user.deleted - severity: warning - - - code: admin.role.changed - severity: warning - - - code: admin.system_setting.changed - severity: warning - - - code: admin.retention_policy.changed - severity: warning - -# Deprecated codes kept for backward-compatible reads: -deprecated_events: - - code: auth.login_attempt - deprecated_in: "1.0.0" - replaced_by: [auth.login.success, auth.login.failure] -``` - -**Initial registry: ~70 event codes** across 13 categories. This is the cap target — if it grows past ~150, the taxonomy is becoming a free-form bag rather than a controlled vocabulary. - -### 4.1 Naming convention rules - -1. **Lowercase, dotted hierarchy.** `domain.resource.action` (3 levels) or `domain.action` (2 levels). -2. **Action verbs are past-tense indicative.** `created`, `failed`, `denied`, `expired` — never `create`, `fail`, `denying`. -3. **Outcomes are in the field, not the name.** `auth.login.success` and `auth.login.failure` differ by outcome — but since they have meaningfully different `detail` schemas, they get separate codes. This is the exception that proves the rule: split when schemas diverge, merge when they don't. -4. 
**Resource names match API path nouns.** If the API has `/host-groups`, the audit code is `host_group.created`, not `hostgroup.created`. (Underscore replaces hyphen since dots are reserved for hierarchy.) -5. **Never rename.** Removing requires deprecation; renaming is forbidden. Add a new code, deprecate the old. - -### 4.2 Adding a new event code - -PR adds a row to `audit/events.yaml`. The build verifies: - -- Code matches `^[a-z][a-z0-9_]*(\.[a-z][a-z0-9_]*){1,2}$` -- Category exists -- `detail_schema` (if present) is valid JSON Schema -- Code is not in `deprecated_events` - -Codegen produces a typed constant in `internal/audit/events.gen.go`: - -```go -package audit - -// AUTO-GENERATED — DO NOT EDIT - -const ( - AuthLoginSuccess EventCode = "auth.login.success" - AuthLoginFailure EventCode = "auth.login.failure" - HostCreated EventCode = "host.created" - // ... -) -``` - -Emitting code uses these constants: - -```go -audit.Emit(ctx, audit.AuthLoginSuccess, audit.Event{ - Outcome: audit.Success, - Actor: audit.ActorFromContext(ctx), - Detail: map[string]any{"auth_method": "password", "mfa_used": true}, -}) -``` - -**Drift becomes a compile error** — you can't emit `audit.AuthLoginSucessful` because the constant doesn't exist. - ---- - -## 5. 
Architecture - -### 5.1 Package layout (as built) - -``` -internal/audit/ -├── types.go # Event, Actor, Resource enums + Code type -├── events.gen.go # Generated from audit/events.yaml (Code constants + Meta) -├── emit.go # Public Emit() + EmitSync() API -├── writer.go # Async batched writer (channel + goroutine + flush ticker) -├── redact.go # Redaction helpers -├── store.go # PostgreSQL persistence (sqlc-generated Store) -└── emit_test.go, redact_test.go -``` - -Deferred (not yet implemented in Stage 0): -- `registry.go` — runtime YAML validation; codegen output is the registry today -- `signer.go` — Ed25519 per-event signing (Phase 2+) -- `query.go` — split out from the server handler when query API grows beyond a single endpoint - -### 5.2 Emission API (as built) - -Two functions, one shape — no typed-helper subpackages yet. - -**Async `Emit`** (default; non-blocking, dropped on overflow): - -```go -audit.Emit(ctx, audit.AuthLoginSuccess, audit.Event{ - ActorType: "user", - ActorID: user.ID, - Detail: map[string]any{"method": "password", "mfa_used": mfaUsed}, -}) -``` - -**Sync `EmitSync`** (returns error; reserved for events that must be -durable before the request returns): - -```go -if err := audit.EmitSync(ctx, audit.SystemStartup, audit.Event{ - ActorType: "system", - Detail: map[string]any{"version": cfg.Version}, -}); err != nil { - return fmt.Errorf("startup audit: %w", err) -} -``` - -Both fill in `ID` (UUIDv7), `OccurredAt`, and `CorrelationID` (from ctx) -before persistence. Per-event-code typed wrappers (the `audit.Auth.LoginSuccess` -shape from earlier drafts) are deferred — the generic shape covers all -current callers and avoids registry duplication. - -### 5.3 Async batching - -Emission is non-blocking: - -```go -// emit.go -func Emit(ctx context.Context, code EventCode, e Event) { - e.populate(ctx, code) // fill correlation_id, recorded_at, etc. 
- select { - case eventChan <- e: // buffered channel, default 10000 - default: - // Critical events (severity=critical) bypass back-pressure - if e.Severity == Critical { - blockingWrite(e) // sync write; rare path - return - } - droppedCounter.Inc() // back-pressure: log + count drops - } -} -``` - -A dedicated writer goroutine consumes `eventChan` and batches: -- Up to 100 events per `INSERT` -- Or every 100ms, whichever comes first -- Single transaction per batch - -**Failure mode:** if the DB write fails, log to stderr (visible in `journalctl`) and increment an `audit_write_failures` Prometheus counter. **Never fail the originating request.** Audit failures are operational concerns, not request-blocking. Compliance is the platform's value; a full audit-log disk should not block scans. - -### 5.4 Critical-event sync option - -Some events are too important to drop: - -- `system.startup`, `system.shutdown` -- `license.installed`, `license.tampered`, `license.clock_rollback_detected` -- `auth.token.revoked` with `reason=admin_revoke` -- `compliance.exception.approved` - -For these, `audit.EmitSync(ctx, code, event)` writes synchronously and returns the error. The caller decides whether to retry. At most ~5 events per minute use this path; the 95% case stays async. - ---- - -## 6. 
Storage - -### 6.1 `audit_events` table - -```sql -CREATE TABLE audit_events ( - id UUID PRIMARY KEY, -- UUIDv7 - occurred_at TIMESTAMPTZ NOT NULL, - recorded_at TIMESTAMPTZ NOT NULL DEFAULT now(), - action TEXT NOT NULL, - severity TEXT NOT NULL DEFAULT 'info', - outcome TEXT NOT NULL, - - actor_type TEXT NOT NULL, -- user/api_key/agent/system/scheduler - actor_id TEXT, - actor_label TEXT, - actor_ip INET, - actor_user_agent TEXT, - actor_session_id UUID, - - resource_type TEXT, - resource_id TEXT, - resource_label TEXT, - - correlation_id TEXT NOT NULL, - parent_event_id UUID REFERENCES audit_events(id), - - policy_version TEXT, - detail JSONB, - redactions TEXT[], - signature TEXT -); -``` - -### 6.2 Indexes - -```sql --- Time-ordered scan -CREATE INDEX idx_audit_occurred ON audit_events (occurred_at DESC); - --- Per-actor history -CREATE INDEX idx_audit_actor ON audit_events (actor_id, occurred_at DESC) WHERE actor_id IS NOT NULL; - --- Per-action queries -CREATE INDEX idx_audit_action ON audit_events (action, occurred_at DESC); - --- Per-resource history -CREATE INDEX idx_audit_resource ON audit_events (resource_type, resource_id, occurred_at DESC) WHERE resource_id IS NOT NULL; - --- Correlation ID grouping -CREATE INDEX idx_audit_correlation ON audit_events (correlation_id); - --- Severity filters (for "show me errors/criticals") -CREATE INDEX idx_audit_severity ON audit_events (severity, occurred_at DESC) WHERE severity IN ('error', 'critical'); - --- Detail JSON search -CREATE INDEX idx_audit_detail_gin ON audit_events USING GIN (detail); - --- Action-prefix search (e.g., 'license.*') --- Use action LIKE 'license.%'. The idx_audit_action btree serves prefix matches only under the C collation; --- otherwise declare the index with text_pattern_ops. 
-``` - -### 6.3 Partitioning (deferred to Phase 1+) - -For deployments with >100M events/year, partition by month: - -```sql --- Phase 1+ migration: the parent declares the scheme, each month gets a child table -CREATE TABLE audit_events ( ... ) -PARTITION BY RANGE (occurred_at); --- e.g. one monthly partition (name is illustrative) -CREATE TABLE audit_events_2026_05 PARTITION OF audit_events - FOR VALUES FROM ('2026-05-01') TO ('2026-06-01'); -``` - -Stage 0 ships unpartitioned; the migration is non-breaking. 
- -### 6.4 Retention - -Driven by `retention_policies` table (already in MUST). Defaults: - -- Standard events: 365 days -- High-severity (error/critical): 730 days -- Compliance-required events (auth, authz, license, admin): 2555 days (7 years; many regulatory baselines) - -Retention enforcement is a daily cron job (`enforce_retention` task) that runs `DELETE` in batches. - -**Pre-deletion archive (planned, MAYBE):** sign + bundle events about to be deleted into `audit_archive_.tar.gz.sig` for offline retention. - ---- - -## 7. Query API - -The `audit.yaml` OpenAPI spec covers the queryable surface. Headline endpoints: - -- `GET /audit/events` — list with filters, cursor pagination, sort *(Stage 0)* -- `GET /audit/events/{event_id}` — single event *(deferred to Phase 1)* -- `POST /audit/events:query` — complex queries via DSL (see scans.yaml `:query` precedent) *(deferred to Phase 1)* -- `GET /audit/events:export` — CSV/JSON/PDF (gated on `audit_export` feature) *(deferred to Phase 1)* -- `GET /audit/events:taxonomy` — read the registry (for UI rendering of filters) *(deferred to Phase 1; the registry is embedded in `events.gen.go` and can be exposed when the frontend needs it)* - -**As-built in Stage 0:** only `GET /audit/events` is wired. The other endpoints are -declared in `api/audit.yaml` for contract design but have no handler yet. - -Filters include `action` (with prefix support), `actor_id`, `resource_id/type`, `correlation_id`, `severity`, `outcome`, time range. - -**Per-resource convenience endpoints** in domain specs include: - -- `GET /hosts/{host_id}/audit-events` -- `GET /scans/{scan_id}/audit-events` -- `GET /users/{user_id}/audit-events` - -These are sugar over `GET /audit/events?resource_type=host&resource_id=...`. Same backing query. - ---- - -## 8. Redaction discipline - -### 8.1 Forbidden in audit `detail` - -These fields **must never** be in `detail`. Period. 
- -- Passwords (any form — plaintext, hashed, base64'd, anything) -- API key secret values (the `owk_` part after the prefix) -- SSH private keys -- License JWTs (the raw signed token) -- TLS private keys -- MFA TOTP secrets -- Session tokens (JWTs) -- OAuth client secrets -- SAML signing keys - -If any of these appear in `detail`, the redactor pre-write replaces them with `""` and adds the field name to the `redactions` array. - -### 8.2 Redaction helpers - -```go -// Redact removes/replaces sensitive fields before audit write. -func (e *Event) Redact() *Event { - for _, k := range []string{"password", "ssh_key", "api_key", "token", "secret", "license_jwt"} { - if _, ok := e.Detail[k]; ok { - e.Detail[k] = "" - e.Redactions = append(e.Redactions, k) - } - } - return e -} -``` - -### 8.3 Allowed (with caveats) - -- IP addresses — yes; network metadata is auditable -- Hostnames — yes -- Usernames — yes; but only the username, never password attempts -- Email addresses — yes; PII concerns handled at retention level, not at write time -- Resource IDs (UUIDs) — yes -- Timestamps — yes -- Status codes — yes -- Configuration keys (not values) — yes; e.g., `config.changed { config_key: "session_timeout" }` but not the new value if it's sensitive - -### 8.4 PII handling - -For deployments with PII concerns (GDPR, HIPAA), an additional retention policy can pseudonymize old events: - -```sql -UPDATE audit_events SET actor_label = 'user-' || md5(actor_id), actor_ip = NULL -WHERE occurred_at < NOW() - INTERVAL '180 days'; -``` - -This is opt-in per-deployment. Off by default. - ---- - -## 9. Tamper-evidence (optional, for high-assurance deployments) - -### 9.1 Per-event Ed25519 signature - -Each event can carry a signature over its canonical JSON form, signed by an audit-specific Ed25519 key. Verify at read time on demand. - -The signing key is held by the running service (not embedded in binary). On rotation, old events keep their old signature; new events get the new key. 
- -### 9.2 Hash chain (deferred) - -Inspired by Certificate Transparency. Each event's record includes the hash of the previous event's full record. Daily, the latest hash is published to a separate audit log destination (or signed and exposed via API). - -This is **deferred to Phase 1+**. Not needed for Stage 0 walking skeleton; designed in so the schema accommodates `parent_hash` field if added later. - -### 9.3 Honest framing - -Tamper-evidence is best-effort. A determined attacker with database root access can rewrite any log. The point of these mechanisms is to: - -- **Catch accidents** (e.g., wrong query in a maintenance script) -- **Raise the cost** of evidence tampering -- **Detect tampering after the fact** (compromise indicators) -- **Support compliance attestations** that require log integrity - -It is not — and cannot be — proof against every adversary. - ---- - -## 10. Performance and capacity - -### 10.1 Sizing - -Estimated rates per active deployment: - -- Compliance scanning of 1000 hosts every hour: ~50K `compliance.state.changed` events/hour, but only when state actually changes (write-on-change). Realistic steady-state: ~5K/hour. -- Auth events (logins, refreshes, sessions): ~200/hour for a typical operator team. -- Scan events: ~20K/day for a 1000-host fleet. -- Misc (notification, alert, license, system): ~500/day baseline. - -**Steady state: ~10K events/day for a small fleet, ~500K events/day for a large fleet.** At 365-day retention, the table is 3.6M to 180M rows. PostgreSQL handles this comfortably with the planned indexes. - -### 10.2 Emission performance - -- Async path: ~5µs per `Emit()` call (channel send only) -- Async batch write: 100 events / 100ms → 1000 events/second sustained throughput -- Sync path: ~200µs per `EmitSync()` (single insert) -- Memory: 10000 buffered events × ~500 bytes = ~5MB headroom - -These numbers comfortably exceed expected throughput for any production deployment. 
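
The headroom claim can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in 10.1 and 10.2; the helper names are hypothetical.

```go
package main

// Figures quoted in sections 10.1 and 10.2.
const (
	batchSize      = 100   // events per INSERT
	flushesPerSec  = 10    // one timer flush every 100ms
	bufferedEvents = 10000 // channel capacity
	bytesPerEvent  = 500   // approximate in-memory size of one event
)

// writerEventsPerSec is the timer-paced throughput of the batch writer.
func writerEventsPerSec() int { return batchSize * flushesPerSec }

// bufferBytes is the memory held by a completely full channel.
func bufferBytes() int { return bufferedEvents * bytesPerEvent }

// tableRows is the steady-state row count for a given daily rate and retention.
func tableRows(eventsPerDay, retentionDays int) int { return eventsPerDay * retentionDays }

// headroom is writer throughput divided by the average arrival rate.
func headroom(eventsPerDay int) float64 {
	avgPerSec := float64(eventsPerDay) / 86400
	return float64(writerEventsPerSec()) / avgPerSec
}
```

For the large-fleet figure of 500K events/day, the average arrival rate is under 6 events/second, so the timer-paced writer has roughly 170× headroom on average. Bursts are what the 10000-event buffer is for.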
- -### 10.3 Drop policy - -If the buffer fills (back-pressure): - -1. Increment `audit_dropped_total{severity}` counter -2. Drop `info` and `warning` events -3. Block-write `error` and `critical` events synchronously -4. Page on `audit_dropped_total{severity="critical"} > 0` - ---- - -## 11. OpenAPI integration - -### 11.1 Per-endpoint audit declarations - -Endpoints declare which audit events they emit: - -```yaml -/hosts: - post: - x-required-permission: HOST_WRITE - x-audit-events: [host.created] - summary: Create host - ... -``` - -This is documentation only (does not generate code), but it's checked: every endpoint that's a mutating operation must declare at least one audit event in `x-audit-events`. The build fails if a `POST/PUT/PATCH/DELETE` endpoint has no audit events declared. - -### 11.2 Error code: audit failure - -`audit.write_failed` (500 Internal Server Error) — emitted only when sync writes fail and the request specifically required audit-before-response. Most endpoints don't. - ---- - -## 12. 
Stage 0 integration - -Stage 0 already includes audit events lightly: - -- Day 5 — first endpoints emit audit events -- Day 7 — licensing emits 8+ audit event types - -**Refined Stage 0 Day 5 deliverables:** - -- `audit/events.yaml` registered with the initial 70+ codes -- `internal/audit/events.gen.go` produced by codegen -- `internal/audit/{emit,writer,redact}.go` implementations -- `audit_events` table migration (`internal/db/migrations/0002_audit_events_taxonomy.sql`) -- Async writer goroutine wired into server lifecycle (started on boot, drained on shutdown) -- The Day 5 demo `:echo` endpoint emits `system.diagnostic_echo` event using the typed pattern - -**Acceptance:** -- Service boots, writer goroutine starts -- `:echo` produces one audit event with proper envelope -- `GET /audit/events` returns the event with all canonical fields populated -- `GET /audit/events:taxonomy` returns the registry *(deferred to Phase 1)* -- Emit benchmark confirms <10µs async; <500µs sync -- Bench: 1000 emit calls → 1000 events written within 200ms - -This adds roughly half a day to Day 5 vs the original Stage 0 plan. Net Stage 0 estimate stays at 7–11 days. - ---- - -## 13. Anti-patterns (never do these) - -| Anti-pattern | Why bad | -|---|---| -| Free-form `action` strings (`audit.Emit(ctx, "auth_login_OK", ...)`) | Drift; queries break | -| Logging passwords / tokens in `detail` even once | Once logged, forever in DB; compliance violation | -| Synchronous emission on hot paths | Adds DB latency to every request | -| Crashing on audit write failure | Compliance product can't deny operations because audit storage is full | -| Catch-all `audit.LogEvent("something happened")` | Defeats taxonomy; review can't aggregate | -| Reusing event codes for different shapes | A code's `detail` schema is part of its contract | -| Adding events without registering in `events.yaml` | Build fails — but only because of the spec lock | - ---- - -## 14. 
Acceptance criteria for "foundation is built" - -Stage 0 ships with audit foundation when: - -- [ ] `audit/events.yaml` exists with ~70 initial codes across 13 categories -- [ ] `internal/audit/events.gen.go` produced by codegen -- [ ] `internal/audit/` package complete per §5.1 -- [ ] `audit_events` table migration applied -- [ ] Async writer with batching, back-pressure, drop policy -- [ ] Sync writer for critical events -- [ ] Redaction discipline tested (sensitive fields scrubbed) -- [ ] Codegen validates code naming convention regex -- [ ] Codegen validates category references -- [ ] Codegen validates `detail_schema` JSON Schema validity -- [ ] OpenAPI build check: every mutating endpoint declares `x-audit-events` -- [ ] `GET /audit/events` returns canonical envelope -- [ ] `GET /audit/events:taxonomy` returns the registry *(deferred to Phase 1)* -- [ ] Per-resource sub-resource paths work (`/hosts/{id}/audit-events`) *(deferred to Phase 1)* -- [ ] Emit benchmark: async <10µs, sync <500µs -- [ ] Documentation: `docs/engineering/audit_event_taxonomy.md` referenced from CLAUDE.md / README - -Once these are checked, every subsequent feature can emit audit events with a typed constant and trust the foundation. - ---- - -## 15. Why this matters more than it looks - -Three concrete failure modes the foundation prevents: - -1. **The ad-hoc taxonomy graveyard.** Without a registry, ten developers invent ten naming conventions. After a year, the audit log has 200+ unique action strings, no one queries them all, and "show me failed logins" becomes a regex archaeology project. - -2. **The accidental secret leak.** Without enforced redaction, someone someday writes `detail: {"password": req.Password}` "just for debugging." That row sits in the database for 365 days. Compliance failure. By the time it's caught, it's in backups too. - -3. 
**The audit-as-side-channel anti-pattern.** Without a clear emission API, components start writing to `stderr` because "the audit log is too slow / unreliable." Now there are two places to look. Operators search both. Audit logs miss events. Compliance attestations can't prove completeness. - -Half a day in Stage 0 prevents weeks of cleanup later — and several of these failure modes are only catchable in retrospect, not retrofittable. diff --git a/docs/engineering/correlation_id_propagation.md b/docs/engineering/correlation_id_propagation.md deleted file mode 100644 index ddb49e14..00000000 --- a/docs/engineering/correlation_id_propagation.md +++ /dev/null @@ -1,633 +0,0 @@ -# Correlation ID Propagation — Design Specification - -**Status:** Foundation, locked 2026-04-30 -**Owner:** Backend platform -**Spec:** `specs/system/correlation.spec.yaml` (to be authored at Specter migration) -**Source-of-truth code:** `internal/correlation/` package; helpers in `internal/queue/`, `internal/audit/`, `internal/log/` - ---- - -## 1. Why this exists - -When a user reports "my 10am scan failed," support must reconstruct the request from logs. Without a correlation ID, that means joining facts by hand: which host, what time window, which worker process — chained across a half-dozen log files. With a correlation ID, support greps for one string and reads the entire story top-to-bottom. - -The hard part isn't generating the ID. It's keeping it alive across asynchronous boundaries: - -- An HTTP request enqueues a job and returns `202 Accepted`. -- The worker dequeues the job seconds later in a different goroutine (sometimes a different process). -- The worker spawns sub-jobs (a fleet scan fans out to per-host scans). -- A cron tick enqueues work with no originating request. -- A worker calls Kensa over SSH; Kensa's logs and the worker's logs are separate streams. -- A webhook fires to an external system. 
- -Every one of those boundaries is a place where the correlation ID can drop on the floor. Each drop creates a forensic gap that can't be back-filled. **The contract has to be locked once, codified in helpers, and enforced by CI** — because retrofitting propagation across 30 handlers and 12 job types is a multi-week refactor that will be skipped under deadline pressure. - -This document defines that contract. - ---- - -## 2. The one-line contract - -> **Every audit event, every log line, and every job in OpenWatch carries a correlation ID. The ID enters the system at exactly four origin points and is propagated by exactly four helpers. Code that bypasses the helpers is rejected by CI.** - -Origin points: - -1. HTTP request (middleware extracts or generates) -2. Cron tick (scheduler generates) -3. System startup (boot generates one shared ID) -4. Test harness (test injects) - -Propagation helpers: - -1. `correlation.HTTPMiddleware` — entry from HTTP -2. `queue.Enqueue(ctx, jobType, payload)` — extract from context, write to job row -3. `queue.Dequeue(ctx) (*Job, context.Context, error)` — read from job row, restore onto context -4. `audit.Emit(ctx, code, event)` and the slog handler — extract from context for storage and logs - -Anything else that needs the ID reads it from `context.Context` via `correlation.From(ctx)`. There is no global, no thread-local, no other path. - ---- - -## 3. ID format - -### 3.1 Shape - -``` -<prefix>-<16 hex chars> -``` - -Examples: - -``` -req-018f3c2a8b7d4e9a # HTTP request -cron-018f3c2a8b7d4eaa # cron tick -boot-018f3c2a8b7d4eb0 # process boot -test-deadbeef00000001 # test harness -``` - -Total length: 20–21 characters. Greppable, distinctive, fits in a log column without truncation. - -### 3.2 Generation - -The 16 hex chars are 8 bytes: a 48-bit unix-millisecond timestamp followed by a 16-bit per-millisecond monotonic counter. 
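
The layout can be read back out of an ID, which is handy during forensics. This is a minimal sketch of a decoder; `parseID` is a hypothetical helper, not part of `internal/correlation`, and it only applies to internally generated IDs, since a sanitized client-supplied ID need not follow this layout.

```go
package main

import (
	"encoding/hex"
	"errors"
	"strings"
	"time"
)

// parseID splits an internally generated correlation ID into its prefix,
// the millisecond timestamp (first 6 bytes, big-endian) and the
// per-millisecond counter (last 2 bytes, big-endian).
func parseID(id string) (prefix string, at time.Time, counter uint16, err error) {
	i := strings.LastIndexByte(id, '-')
	if i < 0 {
		return "", time.Time{}, 0, errors.New("correlation: missing prefix separator")
	}
	raw, err := hex.DecodeString(id[i+1:])
	if err != nil {
		return "", time.Time{}, 0, err
	}
	if len(raw) != 8 {
		return "", time.Time{}, 0, errors.New("correlation: want 16 hex chars after the prefix")
	}
	var ms uint64
	for _, b := range raw[:6] {
		ms = ms<<8 | uint64(b)
	}
	counter = uint16(raw[6])<<8 | uint16(raw[7])
	return id[:i], time.UnixMilli(int64(ms)).UTC(), counter, nil
}
```

Calling `parseID("req-018f3c2a8b7d4e9a")` yields the prefix `req`, the instant `0x018f3c2a8b7d` milliseconds after the Unix epoch, and the counter `0x4e9a`.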
- -**Why a counter, not random bits in the trailing 16 bits?** - -The original design proposed "48 bits of timestamp + 16 bits of randomness, which is plenty at request rates <10K/sec." That math is correct for steady-state traffic (birthday-paradox collision probability is negligible at those rates), but it **fails under bursty workloads** — tight loops, batch operations, parallel tests. The Day 4 acceptance test for `correlation.spec.yaml` AC-2 ("10000 sequential Generate calls return distinct IDs") found duplicate IDs within milliseconds: the same 16 random bits get drawn twice when you generate ~256 IDs in a single ms, by the birthday paradox. - -The monotonic counter eliminates this entirely: - -- **Within the same millisecond** the counter increments by 1 per `Generate` call. -- **When the millisecond advances** the counter is re-seeded with `crypto/rand.Read(2 bytes)` so consecutive IDs don't reveal request rate via predictable values. - -Trade-off: the counter wraps (back to 0) after 65,536 IDs in a single millisecond — ~65 M IDs/sec. Far beyond any realistic OpenWatch rate; if observed in production, alert and re-design (16-bit counter wasn't sized for that scenario). - -**Properties this gives us:** - -- **Time-ordered.** Two correlation IDs generated in sequence sort lexicographically in time order (timestamp dominates, counter tiebreaks within ms). `grep req- /var/log/openwatch.log | sort` produces a chronological view. -- **Unique under bursty load.** No collisions until ≥65,536 IDs/ms, which is operationally impossible at our scale. -- **Greppable.** No special characters, no quoting issues. -- **Boundary-distinct.** The prefix tells you at a glance whether this came from an API call, a cron, or a boot. 
- -```go -// internal/correlation/correlation.go - -var ( - monoMu sync.Mutex - monoLastMs uint64 - monoCounter uint16 -) - -func Generate(prefix Prefix) string { - nowMs := uint64(time.Now().UnixMilli()) - c := nextCounter(nowMs) - - var u [8]byte - u[0] = byte(nowMs >> 40); u[1] = byte(nowMs >> 32) - u[2] = byte(nowMs >> 24); u[3] = byte(nowMs >> 16) - u[4] = byte(nowMs >> 8); u[5] = byte(nowMs) - u[6] = byte(c >> 8); u[7] = byte(c) - return string(prefix) + "-" + hex.EncodeToString(u[:]) -} - -func nextCounter(nowMs uint64) uint16 { - monoMu.Lock() - defer monoMu.Unlock() - if nowMs != monoLastMs { - monoLastMs = nowMs - var r [2]byte - if _, err := rand.Read(r[:]); err != nil { - panic("correlation: rand.Read failed: " + err.Error()) - } - monoCounter = uint16(r[0])<<8 | uint16(r[1]) - } else { - monoCounter++ - } - return monoCounter -} - -type Prefix string - -const ( - PrefixRequest Prefix = "req" - PrefixCron Prefix = "cron" - PrefixBoot Prefix = "boot" - PrefixTest Prefix = "test" -) -``` - -### 3.3 Sanitization of client-provided IDs - -A client may supply `X-Correlation-Id` to pre-correlate their side of the call. We trust the value only after sanitizing: - -| Check | Rule | On failure | -|-------|------|-----------| -| Charset | `^[A-Za-z0-9_-]+$` | Generate fresh; log warning | -| Length | 1–64 chars | Generate fresh; log warning | -| Reserved prefix | If client sends `boot-` or `cron-`, reject and regenerate. Those prefixes are reserved for internal origins. | Generate fresh; log warning | - -Sanitization happens in the HTTP middleware, before the value touches `context.Context`. 
**A correlation_id that reaches a handler is always trusted.** - -```go -// internal/correlation/sanitize.go -var validIDPattern = regexp.MustCompile(`^[A-Za-z0-9_-]{1,64}$`) -var reservedPrefixes = []string{"boot-", "cron-", "test-"} - -func SanitizeOrGenerate(client string) (id string, regenerated bool) { - if client == "" { - return Generate(PrefixRequest), false - } - if !validIDPattern.MatchString(client) { - return Generate(PrefixRequest), true - } - for _, rp := range reservedPrefixes { - if strings.HasPrefix(client, rp) { - return Generate(PrefixRequest), true - } - } - return client, false -} -``` - -### 3.4 Why not W3C `traceparent`? - -The OpenTelemetry-standard `traceparent` header carries a 32-char trace ID + 16-char span ID + flags. It would integrate cleanly with future OTel exporters. We chose `X-Correlation-Id` because: - -1. **Single-string simplicity.** A correlation ID maps 1:1 to "user intent." Span hierarchies are an observability concern, not an audit/forensic concern. -2. **Forensic-readable.** `req-018f3c2a8b7d4e9a` is human-recognizable in logs; `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01` is not. -3. **Future-compatible.** The propagation discipline transfers: when we adopt OTel, the same boundary helpers populate `traceparent` alongside `correlation_id`. No discipline rewrite. - -Locked decision (roadmap 2026-04-27): `X-Correlation-Id` is the canonical header. - ---- - -## 4. The four primary boundaries - -### 4.1 HTTP request entry - -``` -Client OpenWatch - │ POST /api/v1/scans │ - │ X-Correlation-Id: my-id-123 (optional) │ - ├─────────────────────────────────────────────────────▶ - │ │ - │ HTTPMiddleware: │ - │ 1. Sanitize or generate - │ 2. correlation.Set(ctx, id) - │ 3. ResponseHeader: X-Correlation-Id - │ 4. 
Log "request received" with id - │ │ - │ chi.Mux │ - │ ▼ │ - │ handler(w, r) - │ │ - │ 202 Accepted │ - │ X-Correlation-Id: my-id-123 │ - │◀────────────────────────────────────────────────────┤ -``` - -The middleware: - -```go -// internal/correlation/http.go -func HTTPMiddleware(next http.Handler) http.Handler { - return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - client := r.Header.Get("X-Correlation-Id") - id, regenerated := SanitizeOrGenerate(client) - if regenerated { - slog.WarnContext(r.Context(), "rejected client correlation id; regenerated", - slog.String("rejected_value_preview", truncate(client, 16)), - slog.String("correlation_id", id), - ) - } - ctx := correlation.Set(r.Context(), id) - w.Header().Set("X-Correlation-Id", id) - next.ServeHTTP(w, r.WithContext(ctx)) - }) -} -``` - -This middleware runs **before** every other middleware in the chain (auth, RBAC, idempotency, audit). The audit middleware needs `correlation_id` to be on context already; auth needs it for failed-login audit events; idempotency checks key reuse and may need to log under the original request's correlation_id. - -> **Router quirk (chi v5):** chi's default `NotFound` and `MethodNotAllowed` handlers do **not** run the `r.Use(...)` middleware chain. Unmatched paths and unsupported methods will return responses **without** `X-Correlation-Id`, breaking the forensic guarantee. The server bootstrap MUST register explicit handlers for both: -> -> ```go -> r.Use(correlation.HTTPMiddleware) -> -> r.NotFound(func(w http.ResponseWriter, _ *http.Request) { -> http.Error(w, "404 page not found", http.StatusNotFound) -> }) -> r.MethodNotAllowed(func(w http.ResponseWriter, _ *http.Request) { -> http.Error(w, "405 method not allowed", http.StatusMethodNotAllowed) -> }) -> ``` -> -> Discovered Day 4 of Stage 0 during live `curl` verification of `http-server.spec.yaml` AC-9. 
If you swap routers in a future iteration (gorilla/mux, httprouter, etc.), re-verify that unmatched routes traverse middleware — same class of bug recurs across router libraries. - -### 4.2 HTTP → job queue - -When a handler enqueues a job: - -```go -// internal/queue/enqueue.go -func Enqueue(ctx context.Context, jobType string, payload []byte) (jobID uuid.UUID, err error) { - correlationID, ok := correlation.From(ctx) - if !ok { - // This is a programming error: every code path that reaches Enqueue - // must have come through an origin (HTTP, cron, boot, test). - return uuid.Nil, fmt.Errorf("queue.Enqueue: no correlation_id on context") - } - return enqueueRow(ctx, jobType, payload, correlationID) -} -``` - -The `job_queue` table has: - -```sql -ALTER TABLE job_queue ADD COLUMN correlation_id TEXT NOT NULL; -CREATE INDEX idx_job_queue_correlation ON job_queue(correlation_id); -``` - -The index supports forensic queries: "show every job spawned by `req-018f3c2a8b7d4e9a`." - -**No public path to insert a job exists outside `internal/queue/`.** The CI lint rule (Section 7) rejects any other code that constructs an `INSERT INTO job_queue` statement. - -### 4.3 Job → worker dequeue - -The worker: - -```go -// internal/queue/dequeue.go -func Dequeue(ctx context.Context) (*Job, context.Context, error) { - job, err := dequeueRow(ctx) - if err != nil || job == nil { - return nil, ctx, err - } - workerCtx := correlation.Set(context.Background(), job.CorrelationID) - workerCtx = applyDeadline(workerCtx, job.MaxRuntime) - return job, workerCtx, nil -} -``` - -The worker uses the **returned `workerCtx`**, not the caller's `ctx`. This prevents accidentally bleeding the worker-loop's own correlation (if any) into the per-job context. 
- -The worker entry point: - -```go -// internal/worker/run.go -func (w *Worker) processOne(ctx context.Context) error { - job, jobCtx, err := queue.Dequeue(ctx) - if err != nil || job == nil { - return err - } - handler := w.registry.Get(job.Type) - return handler.Run(jobCtx, job) // jobCtx carries the originating correlation_id -} -``` - -Anything `handler.Run` does — emit audit, log, enqueue child jobs, call external systems — uses `jobCtx`. The chain holds. - -### 4.4 Worker → sub-job (cascading enqueue) - -A handler that enqueues child jobs uses the same `queue.Enqueue`: - -```go -// example: a fleet-scan handler spawns one job per host -func (h *FleetScanHandler) Run(ctx context.Context, parent *queue.Job) error { - hosts := h.fetchHosts(ctx, parent.Payload.FleetID) - for _, host := range hosts { - _, err := queue.Enqueue(ctx, "scan.host", encodeHostScan(host)) // ctx carries parent's correlation_id - if err != nil { - return err - } - } - return nil -} -``` - -The child jobs inherit the parent's correlation_id. A user clicking "scan fleet" produces: - -``` -req-018f3c2a8b7d4e9a # the HTTP click -├── job-fleet-scan # parent job (carries req-018f3c2a8b7d4e9a) -│ ├── job-scan-host-1 # child (carries req-018f3c2a8b7d4e9a) -│ ├── job-scan-host-2 # child (carries req-018f3c2a8b7d4e9a) -│ └── job-scan-host-N # child (carries req-018f3c2a8b7d4e9a) -``` - -Grep for `req-018f3c2a8b7d4e9a` and the entire fleet operation reconstructs. - -> **Why not a per-job ID instead?** Because the support question is "what happened with the user's 10am click?" — not "what happened with this one host scan?" The job ID answers the second; correlation_id answers the first. The job hierarchy is captured by `job.parent_id` (a separate column), not by mangling the correlation ID. - ---- - -## 5. The four secondary boundaries - -### 5.1 Cron tick → job - -The cron scheduler has no originating HTTP request. 
It generates a fresh correlation_id at tick time: - -```go -// internal/cron/tick.go -func (s *Scheduler) tick(jobID string) { - ctx := correlation.Set(context.Background(), correlation.Generate(correlation.PrefixCron)) - slog.InfoContext(ctx, "cron tick", slog.String("cron_job", jobID)) - s.handlers[jobID].Run(ctx) -} -``` - -Each tick gets a distinct ID. If a tick enqueues 50 jobs, all 50 share the cron tick's ID. If the same cron job fires again at the next interval, it gets a different ID — different intent, different forensic story. - -### 5.2 Worker → external system (Kensa, OIDC, webhook) - -**HTTP-based outbound calls** (OIDC discovery, webhook dispatch, plugin calls) forward as headers: - -```go -// internal/httpclient/client.go (wraps net/http.Client) -func (c *Client) Do(ctx context.Context, req *http.Request) (*http.Response, error) { - if id, ok := correlation.From(ctx); ok { - req.Header.Set("X-Correlation-Id", id) - } - return c.inner.Do(req.WithContext(ctx)) -} -``` - -OpenWatch's own outbound HTTP always uses this wrapper; raw `http.Client.Do` is forbidden by lint (Section 7). - -**Kensa SSH invocations** pass via env var: - -```go -// internal/kensa/invoke.go -func (k *Invoker) Run(ctx context.Context, host string, args []string) (*Result, error) { - id, _ := correlation.From(ctx) - cmd := fmt.Sprintf("KENSA_CORRELATION_ID=%s kensa scan %s", shellEscape(id), strings.Join(args, " ")) - slog.InfoContext(ctx, "invoking kensa", slog.String("host", host), slog.String("kensa_cmd_summary", summary(args))) - // ... ssh exec ... -} -``` - -Kensa's coordination ask: include `KENSA_CORRELATION_ID` in its JSON output (so Kensa-side audit/logs reference it). Until Kensa supports this, we log on the OpenWatch side at invocation and completion. Kensa-side logs remain unlinked, which is a known forensic gap to close in Phase 2. - -### 5.3 Worker → audit - -Already specified in `audit_event_taxonomy.md` (the audit envelope has `correlation_id`). 
The emit helper extracts from context: - -```go -// internal/audit/emit.go -func Emit(ctx context.Context, e Event) { - id, _ := correlation.From(ctx) - e.CorrelationID = id // fall back to "" if missing; the writer logs a warning - audit.queue <- e -} -``` - -If a worker's `ctx` somehow has no correlation_id (a bug), the audit event still writes with `correlation_id = ""` — and the audit writer increments a counter that the operator can alert on. **Audit always succeeds; correlation propagation bugs surface as monitoring signal, not lost events.** - -### 5.4 Anywhere → log line (slog) - -A custom slog.Handler reads correlation_id from context and adds it as a top-level attribute on every log line: - -```go -// internal/log/handler.go -type CorrelationHandler struct { - inner slog.Handler -} - -func (h *CorrelationHandler) Handle(ctx context.Context, r slog.Record) error { - if id, ok := correlation.From(ctx); ok { - r.AddAttrs(slog.String("correlation_id", id)) - } - return h.inner.Handle(ctx, r) -} -``` - -This is wrapped around the stdlib JSON handler. The result: every `slog.InfoContext(ctx, ...)` call automatically carries correlation_id. Developers don't think about it. - -Logs without context (rare — used in `slog.Info` not `slog.InfoContext`) have no correlation_id. The CI lint forbids non-Context slog calls outside of `func init()` and `func main()`. - ---- - -## 6. Boot correlation - -A single correlation_id is generated at process startup: - -```go -// cmd/openwatch/main.go -func main() { - bootID := correlation.Generate(correlation.PrefixBoot) - bootCtx := correlation.Set(context.Background(), bootID) - - slog.InfoContext(bootCtx, "openwatch starting", slog.String("version", buildVersion)) - license.Load(bootCtx) - policy.LoadAll(bootCtx) - audit.EmitSync(bootCtx, audit.Event{Action: "system.startup"}) - server.Run(bootCtx) -} -``` - -All startup-phase events share `boot-018f3c2a8b7d4eb0`. Forensically: "what happened at the last restart?" 
→ grep for the boot ID. - -This works because every loader (`license`, `policy`, audit, etc.) accepts `ctx context.Context`. The discipline is uniform. - ---- - -## 7. CI enforcement - -The contract is meaningless unless the build catches violations. Three layers: - -### 7.1 Forbidigo lint: no raw job_queue inserts - -```yaml -# .golangci.yml -linters: - enable: - - forbidigo -linters-settings: - forbidigo: - forbid: - - p: '^pgxpool\.Pool\.Exec.*INSERT INTO job_queue' - msg: "Use queue.Enqueue(ctx, ...) — raw inserts skip correlation propagation" - - p: '^http\.DefaultClient' - msg: "Use internal/httpclient — wrapped to forward X-Correlation-Id" - - p: '^slog\.(Info|Warn|Error|Debug)\b' - msg: "Use slog.{Info,Warn,Error,Debug}Context — bare loggers skip correlation_id" - exclude-godoc-examples: true -``` - -The `slog.Info` ban has a few legitimate exceptions (very-early startup logs before the boot context exists). These are tagged with `//nolint:forbidigo` and reviewed in PR. - -### 7.2 Behavioral spec: every audit event has a non-empty correlation_id - -`specs/system/correlation.spec.yaml` (post-Specter migration): - -```yaml -spec_id: system/correlation -status: active -acceptance_criteria: - - id: AC-1 - description: HTTP middleware sets a non-empty correlation_id on every request context - - id: AC-2 - description: Sanitizer rejects malformed client-provided IDs (charset, length, reserved prefix) - - id: AC-3 - description: queue.Enqueue requires correlation_id on context; returns error if missing - - id: AC-4 - description: queue.Dequeue restores correlation_id onto fresh context (not caller context) - - id: AC-5 - description: Cron ticks generate fresh correlation_id with cron- prefix - - id: AC-6 - description: Boot generates one correlation_id shared by all startup events - - id: AC-7 - description: Outbound HTTP via internal/httpclient forwards X-Correlation-Id - - id: AC-8 - description: slog handler adds correlation_id to every Context-aware log call - 
- id: AC-9 - description: Every audit event in test fixtures has a non-empty correlation_id -``` - -Each AC has one or more enforcing tests (Go `testing` package, `// AC-N` docstring). - -### 7.3 End-to-end propagation test - -One integration test exercises the full chain: - -```go -// internal/correlation/propagation_test.go -func TestEndToEndPropagation(t *testing.T) { - // 1. Send HTTP request with explicit X-Correlation-Id - resp := postJSON(t, "/api/v1/diagnostics:enqueue-test-job", `{"message":"hi"}`, - "X-Correlation-Id", "test-end2end-001") - - // 2. Response header echoes the same ID - require.Equal(t, "test-end2end-001", resp.Header.Get("X-Correlation-Id")) - - // 3. Audit event for the API call carries the ID - apiAudit := waitForAuditEvent(t, "diagnostics.enqueue", 2*time.Second) - require.Equal(t, "test-end2end-001", apiAudit.CorrelationID) - - // 4. Worker picks up the job - waitForJobStatus(t, apiAudit.ResourceID, "completed", 5*time.Second) - - // 5. Audit event written by the worker carries the same ID - workerAudit := waitForAuditEvent(t, "diagnostics.test_job_completed", 5*time.Second) - require.Equal(t, "test-end2end-001", workerAudit.CorrelationID) -} -``` - -This test runs in CI on every commit. If propagation breaks anywhere on the chain, the test fails. It is the single most important regression net for this contract. - ---- - -## 8. Anti-patterns - -| Anti-pattern | What's wrong | What to do instead | -|--------------|--------------|---------------------| -| `correlation.From(context.Background())` | `Background()` has no correlation_id. The `_, ok := correlation.From(...)` returns `ok=false`. | Pass through the real ctx. If you genuinely have no ctx (rare), treat that as a programming error. | -| Storing correlation_id in a struct field for "convenience" | The struct outlives the context. Stale correlation_ids start appearing in unrelated logs. | Always read from `ctx` at point of use. Pay the O(1) lookup cost. 
| -| Pulling correlation_id into a string and concatenating into log messages | Not searchable as a structured field; can't filter by it. | `slog.InfoContext(ctx, "...")` — the handler adds it as a structured attribute. | -| Calling `slog.Info` without context | The slog handler can't extract correlation_id; the log line has none. | Always use `slog.InfoContext`. Lint enforces this. | -| Using `http.DefaultClient` for outbound calls | Bypasses the wrapper that forwards `X-Correlation-Id`. Downstream systems see no correlation. | Use `internal/httpclient`. Lint enforces this. | -| Generating a new correlation_id inside a handler "to be safe" | Breaks the chain. The HTTP middleware already set one; overwriting it loses the link to the original request. | Read the existing one. If missing, that's a bug in the middleware, not a reason to generate. | -| Treating correlation_id as a security identifier | It is **not** authenticated, **not** unique-per-user, **not** tamper-proof. A client can replay any correlation_id they like (after sanitization). | Never use correlation_id for auth, rate limiting, or session tracking. It is forensic only. | -| Using correlation_id as a database join key | Multiple top-level requests over time can produce the same audit/job rows from the same user; correlation is *per request*, not *per user*. | Join on user_id, host_id, scan_id — domain keys. | - ---- - -## 9. Failure modes - -| Scenario | Behavior | -|----------|----------| -| Client sends a malicious header (e.g., 10MB string, control chars) | Sanitizer rejects; new ID generated; warning log emitted with truncated preview of the rejected value. | -| Two clients send the same correlation_id | Allowed. Logs and audit interleave under the same ID. Forensically, this looks like one logical operation by two callers — usually that's exactly what the clients intended (e.g., a workflow tool replaying). 
| -| Worker crashes before completing a job; job re-dispatches to another worker | New worker reads the same correlation_id from the job row. The chain holds. | -| Worker dequeues a job whose row has `correlation_id = ''` (legacy data, bug) | `queue.Dequeue` generates a fresh `req-` ID with a warning log; emits a `system.health.degraded` audit event. | -| `correlation.From(ctx)` returns `ok=false` deep in a handler | Should not happen if HTTPMiddleware ran. If it does, the audit emit fallback writes `correlation_id=""` and the writer counter increments. Operator alert fires when counter > 0. | -| Process killed mid-request | The originating request log shows the started entry; no completion entry. Same correlation_id appears in any audit events that did flush. Recovery is on the operator (look at the start log + system shutdown event). | -| Clock skew makes correlation_ids out of order | UUIDv7 is millisecond-precision. Skew >1ms produces out-of-order IDs but uniqueness is preserved (the random suffix dominates collision risk). Forensic queries that rely on lexicographic time order may see misordered events; queries that filter by correlation_id are unaffected. | -| Test injects `test-foo` and runs concurrently with another test injecting `test-foo` | Tests must use unique IDs. The `test-` prefix is reserved for tests but uniqueness is the test's responsibility; collision produces test-noise, not production bugs. | - ---- - -## 10. 
Stage 0 vs Stage 2 split - -### Stage 0 ships (Day 4 + Day 5): - -- `internal/correlation/` — generate, sanitize, set, from, prefix constants -- `correlation.HTTPMiddleware` wired into the chi router -- `internal/log/CorrelationHandler` wrapping slog.JSONHandler -- `internal/audit/Emit` integration (already in audit foundation) -- Boot correlation (Day 1 main.go scaffold updated to set boot-) -- `internal/httpclient/Client` wrapper (forwards header) -- Forbidigo lint config in `.golangci.yml` - -### Stage 0 ships (Day 8, alongside policy framework): - -- `internal/queue/Enqueue` and `Dequeue` helpers with correlation propagation -- `job_queue.correlation_id` column (folded into the existing job_queue migration from Day 3) -- `internal/cron/Scheduler.tick` with cron- generation -- End-to-end propagation test - -### Stage 2 ships (when real jobs land): - -- Job handlers using `Enqueue` for sub-jobs -- Kensa invoker passing `KENSA_CORRELATION_ID` -- OIDC/SAML/webhook outbound calls using `internal/httpclient` -- Per-job-type correlation forensic tests - -The Stage 0 work is small (~600 LOC + lint config + migrations) but locks the contract before any consumer exists. Stage 2 consumers cannot accidentally bypass it because the helpers are already the only path. - ---- - -## 11. Performance - -The propagation machinery is hot-path; targets: - -| Operation | Target | Notes | -|-----------|--------|-------| -| `correlation.Generate()` | < 200ns | Mutex + at most one 2-byte `rand.Read` per new millisecond + hex encode + concat | -| `correlation.Set(ctx, id)` | < 50ns | Single `context.WithValue` | -| `correlation.From(ctx)` | < 30ns | Chain walk on context (typical depth ~5) | -| `HTTPMiddleware` overhead | < 1µs | Generate + sanitize + set + header write | -| `slog handler` overhead | < 100ns | One context lookup + one attr add | -| End-to-end (HTTP entry to log line) | < 2µs added | Negligible vs DB and HTTP serialization | - -`crypto/rand` is the long pole; we read 2 bytes to seed the counter whenever the millisecond changes (~150ns on a typical Linux).
Pre-fetching from a buffered random source is an option if benchmarks show contention, but at our request rates it's unnecessary. - ---- - -## 12. Open questions - -1. **OTel adoption.** When (not if) we adopt OpenTelemetry, do we wrap the correlation middleware to emit both `X-Correlation-Id` and `traceparent`? Or use OTel's native trace context as the correlation source? Defer until OTel is on the roadmap; the propagation discipline transfers either way. -2. **Per-tenant prefixing.** Multi-tenant deployments may want `tenant-acme-req-...` for per-tenant log filtering. Current design is single-tenant; defer to multi-tenant epic. -3. **Correlation ID in metrics labels.** Prometheus high-cardinality labels are dangerous. Correlation IDs must NOT become metric labels — they would explode the cardinality. Logs and audit only. -4. **Frontend propagation.** The TypeScript frontend should generate a correlation_id at the start of a user action (e.g., "click scan button") and pass it through every API call for that action. That makes the frontend a first-class origin point. Defer to frontend integration epic; design transfers cleanly. -5. **Long-running operations.** A scan can run for tens of minutes. The correlation_id stays the same end-to-end, but the operator's view of "what's happening now?" needs additional handles. Job ID + parent_id covers the operator view; correlation_id covers the forensic view. No new design needed. - ---- - -## Cross-references - -- HTTP design: `docs/engineering/api_design_principles.md` §9.4 (correlation_id in error envelope), §11 (`X-Correlation-Id` header). -- Audit foundation: `docs/engineering/audit_event_taxonomy.md` §3 (envelope), §6 (writer paths). -- Policies: `docs/engineering/policies_as_data.md` §8 (audit integration; `policy.applied` carries correlation_id). -- Roadmap: 2026-04-27 entry on `X-Correlation-Id` propagation; 2026-04-30 entries on this design. 
-- Stage 0: Day 4 (HTTP middleware), Day 5 (audit uses it), Day 8 (queue helpers + lint + e2e test). diff --git a/docs/engineering/frontend_architecture_adr.md b/docs/engineering/frontend_architecture_adr.md deleted file mode 100644 index c892226c..00000000 --- a/docs/engineering/frontend_architecture_adr.md +++ /dev/null @@ -1,226 +0,0 @@ -# OpenWatch Frontend Architecture (ADR) - -> **Status:** Locked 2026-05-30 -> **Authority:** This document is the rulebook for `frontend/`. If code at `frontend/` violates a rule here, the code is wrong. -> **Audience:** Anyone scaffolding, specing, or implementing frontend modules for OpenWatch. - ---- - -## Why this document exists - -The Go rebuild ships a fresh frontend at `frontend/`. Without an ADR up front, every spec downstream is built on assumed defaults — router choice leaks into data-fetching choice leaks into auth-flow choice, and the first three specs disagree on which library owns which concern. - -This document **locks** the stack and the conventions. It is reviewed only when a hard external pressure forces a change (library deprecation, breaking version, security CVE in a transitive dep). Day-to-day work consults this; it does not edit it. - ---- - -## Context - -- **Backend**: Go (chi router, pgx, sqlc, embed.FS, OpenAPI 3.1 SSOT). -- **Serving**: Single binary. Frontend embedded into the binary via `//go:embed all:spa` in `internal/server/spa.go` (the `vite build` output is copied into `internal/server/spa/` at build time). No nginx, no separate SPA host (per `openwatch_roadmap.md` L21, L87). -- **Auth contract** (per `stage_2_slice_a.md` L25): both session cookies (browser) and JWT (API consumers). The browser frontend uses **session cookies** with CSRF protection — **not** JWT in `localStorage`. -- **Prototype** at `docs/engineering/prototypes/openwatch-v1/` defines the visual language (9 HTML pages, dark-only). 
The frontend implements the same language in a real component system with both dark and light modes. -- **Page rollout** is slice-driven, not page-list-driven. Each backend slice unblocks the corresponding frontend pages. - ---- - -## Decisions - -### D-01: React 19 (latest stable) - -React 19 is the foundation. Adopt R19 conventions: - -- **Ref as prop** — no `forwardRef` in custom components. -- **`<Context>`** — no `.Provider` suffix. -- **`use(promise)`** — paired with TanStack Query `useSuspenseQuery` for declarative loading-state composition. -- **`useActionState` + `useFormStatus`** — reduces react-hook-form boilerplate around pending/error states in forms. -- **`useOptimistic`** — optimistic UI for toggles, mute/ack, and widget reorder. -- **Document metadata in JSX** — `<title>`, `<meta>` per page without `react-helmet`. - -### D-02: TypeScript (strict mode, no implicit any) - -`tsconfig.json` enables `strict`, `noUncheckedIndexedAccess`, `noFallthroughCasesInSwitch`, `noImplicitOverride`. Type holes are bugs, not style preferences. - -### D-03: MUI v7 with CSS-vars mode - -`@mui/material` v7 in CSS-variables mode (`extendTheme` + `<CssVarsProvider>`). - -- **Why CSS-vars**: matches the prototype's `var(--*)` token model exactly; component-level theme overrides become trivial; per-mode tokens are real CSS variables, not JS-runtime branching. -- **Why MUI v7**: A11y baked in, mature component coverage (Drawer, Menu, ToggleButtonGroup, DataGrid), peer-supports R19. -- `cssVarPrefix: 'ow'` — every variable is `--ow-*`. No collisions with library-default `--mui-*`. - -### D-04: Vite (latest) - -- TS support is native; HMR is fast; output is what `go:embed` consumes. -- `vite.config.ts` sets `build.outDir: 'dist'` (the path embedded by Go). -- Dev server proxies `/api → https://localhost:8443` with `secure: false` to accept the dev self-signed cert. - -### D-05: TanStack Router v1 - -- Type-safe routes. Route params and search params are typed at the call site.
-- File-based routing optional; we use the declarative-tree API for explicitness. -- Pairs cleanly with TanStack Query for prefetching on route enter. - -### D-06: TanStack Query v5 for server state - -- `useSuspenseQuery` + `<Suspense>` for loading states. -- Cursor-paginated lists use `useInfiniteQuery` (matches backend pagination contract in `api_design_principles.md`). -- Mutations call `queryClient.invalidateQueries` on success, or use `useOptimistic` when the change is local. - -### D-07: API client via `openapi-typescript` + `openapi-fetch` - -- `openapi-typescript` generates types from `api/openapi.yaml` into `frontend/src/api/schema.d.ts`. -- `openapi-fetch` is a 4 KB typed fetch wrapper. No heavier `orval` / RTK Query lock-in. -- **Spec is the contract**: when `openapi.yaml` changes, the TS types regenerate. Type errors at compile time are the contract drift signal. -- Generation command in `package.json`: `"api:types": "openapi-typescript ../api/openapi.yaml -o src/api/schema.d.ts"`. Run by CI and pre-commit. - -### D-08: Session-cookie auth + CSRF (not JWT in localStorage) - -Per `stage_2_slice_a.md`: - -- Login (`POST /api/v1/auth/login`) returns a session cookie (`openwatch_session`, HttpOnly, Secure, SameSite=Lax) AND a body containing `{access_token, refresh_token, user}`. -- **The browser frontend ignores the body tokens**. The cookie is the only credential carried on subsequent requests. -- A **CSRF token** is read from a `XSRF-TOKEN` cookie (server-set, non-HttpOnly) and echoed on mutating requests via the `X-CSRF-Token` header. This is the double-submit-cookie pattern. -- No `localStorage` for auth state. The frontend reads identity only from `GET /api/v1/auth/me` and Zustand caches it in memory. - -### D-09: Forms — react-hook-form + zod - -- `react-hook-form` for form state. -- `zod` schemas for validation. 
-- For shapes that appear in `openapi.yaml`, **derive zod schemas from the generated TS types** where feasible (or keep them hand-written but unit-test they match the OpenAPI shape). - -### D-10: Client state — Zustand v5 - -- Single store per concern: `useAuthStore`, `useColorSchemeStore`, `useNotificationStore`. -- No Redux. No Context for shared mutable state (Context is reserved for theme/color-scheme propagation only). -- Stores expose actions; components consume slices via selectors. - -### D-11: Drag & drop — @dnd-kit/core v6.3+ - -For the dashboard widget reorder (when the dashboard slice unblocks). Used nowhere else without explicit need. - -### D-12: Icons — lucide-react - -- Matches the prototype's visual language (every prototype SVG is a Lucide icon). -- Per-icon imports keep the bundle small. -- Do **not** mix `@mui/icons-material` into the same surface — pick one library, stick with it. - -### D-13: Three-mode color scheme — light, dark, system - -- Default mode = `system` (follows `prefers-color-scheme`). -- User override persists to `localStorage` under key `ow-color-scheme` ∈ `{'light','dark','system'}`. -- **No FOUC**: a synchronous script in `index.html` `<head>` reads the stored preference and sets `data-mui-color-scheme="light|dark"` on `<html>` before React mounts. MUI v7 ships `getInitColorSchemeScript()` — we use it verbatim. -- System changes propagate live: `matchMedia('(prefers-color-scheme: dark)').addEventListener('change', ...)` updates the mode without reload. -- Settings UX: a three-segment toggle (Light / Dark / System). When System is selected, the label shows what it currently resolves to: "System (currently dark)". - -### D-14: Design tokens — dual-mode, prefixed - -- Every token is a `--ow-*` CSS variable defined per mode. -- Severity colors include explicit on-color foregrounds: `--ow-info`, `--ow-info-on`, `--ow-info-bg` (and the same for crit/warn/ok). -- Shadows, line/border, surface elevations all per-mode. 
-- Full table lives in `docs/engineering/frontend_design_tokens.md`. The frontend's `theme/index.ts` is the executable form; the doc is the human-readable form. - -### D-15: Testing — Vitest + RTL 16 + Playwright - -- **Vitest** (Vite-native, Jest-compatible API) for unit and integration tests of components/hooks/stores. -- **`@testing-library/react` v16** (required for R19) for component tests. -- **Playwright** for e2e flows (login → hosts → host detail). -- **axe-core** runs in Playwright e2e as the WCAG gate. Zero violations on `wcag2a` + `wcag2aa` rule sets. - -### D-16: A11y target — WCAG 2.1 AA - -- Every interactive element keyboard-reachable. -- Every form control labelled (no placeholder-as-label). -- Every page passes axe-core in CI before merge. -- The findings-ui spec (template) already encodes this — every page spec inherits AC-12-style axe assertions. - -### D-17: Spec home — `specs/frontend/` - -- New directory parallel to `specs/{api,system,release}/`. -- Same Specter `.spec.yaml` schema as backend specs. -- Spec IDs prefix with `frontend-`: `frontend-foundation`, `frontend-auth-login`, etc. -- **Tier 1** for security-sensitive UX (auth, RBAC gating, foundation). -- **Tier 2** for feature pages (hosts, host detail, settings tabs). -- Coverage thresholds enforced by `specter.yaml`: Tier 1 = 100%, Tier 2 = 80%. - -### D-18: Where the frontend lives + how it ships - -- **Tree**: `frontend/` — sibling of `internal/`, `cmd/`, `api/`. -- **Build output**: `frontend/dist/` (set in `vite.config.ts`). -- **Embed**: `make build` copies the `frontend/dist/` output into `internal/server/spa/`, which `internal/server/spa.go` embeds via `//go:embed all:spa`. SPA fallback (non-`/api/` requests serve `index.html`) is implemented in `newSPAHandler`. -- **Single artifact**: frontend updates require a binary rebuild. No separate SPA hotfix path. Acceptable trade-off for security tooling with infrequent UI changes (per `openwatch_roadmap.md` L245). 
- -### D-19: Dev server proxy - -- `vite.config.ts` proxies `/api` to `https://localhost:8443` with `secure: false`. -- Dev workflow: `make run` (Go server on :8443) in one terminal; `npm run dev` (Vite on :5173) in another. Browser opens `http://localhost:5173`; the proxy routes API calls. -- CSRF cookie is set by the Go server on first response; Vite passes cookies through transparently. - -### D-20: React Compiler — deferred - -- Optional R19 feature. Stable but new in early 2026. -- We ship v0 without it. Add later as a pure build-step change (no source edits required). -- This decision reviewed when 3+ teams in the React ecosystem report it as low-friction. - -### D-21: Internationalization — deferred - -- English only in v0. No `react-i18next`. -- Strings live inline in components for v0. The first paying customer with a non-English ask drives the i18n decision. - ---- - -## Consequences - -### Positive - -- One stack to learn. Every page has the same shape: route → suspense → query → form → mutation → invalidate. -- A11y is non-optional and CI-gated. -- Auth follows the security recommendation in the roadmap (cookies + CSRF, not localStorage tokens). -- Single-binary install survives the frontend (`go:embed`). - -### Negative - -- Frontend updates require a backend rebuild. CI release surface includes the frontend build every cut. -- TanStack Router is less broadly known than React Router; onboarding cost is real but small. -- CSS-vars mode in MUI is mature but pre-existing tutorials lean on the legacy palette API. - -### Trade-offs explicitly accepted - -- **No SSR** — pure SPA. Initial paint shows a skeleton until the bundle parses. Acceptable: this is an internal admin tool, not a public marketing site. -- **No code-splitting by route in v0** — Vite splits vendor automatically; per-route splits added when bundle audit shows the need. -- **English only** — see D-21. -- **No React Compiler** — see D-20. 
- ---- - -## Stack summary - -| Concern | Choice | Min version | -|---------|--------|-------------| -| Framework | React | 19.x | -| Language | TypeScript | 5.x (strict) | -| Build tool | Vite | 5.x or 6.x (latest stable) | -| UI library | MUI Material | 7.x (CSS-vars mode) | -| Icons | lucide-react | 0.460+ | -| Router | @tanstack/react-router | 1.85+ | -| Server state | @tanstack/react-query | 5.59+ | -| API types | openapi-typescript | 7.x | -| API client | openapi-fetch | 0.13+ | -| Forms | react-hook-form | 7.54+ | -| Validation | zod | 3.x | -| Client state | zustand | 5.x | -| DnD | @dnd-kit/core | 6.3+ | -| Test runner | vitest | 2.1+ | -| Component testing | @testing-library/react | 16.x | -| E2E | @playwright/test | latest | -| A11y CI | @axe-core/playwright | latest | - ---- - -## Open follow-ups (not blocking v0) - -- **Real-time transport for Activity page** (`Live` toggle). SSE vs. WebSocket. Decided when OS Intelligence backend lands. See `docs/engineering/activity_and_os_intelligence.md`. -- **React Compiler adoption** — reviewed when ecosystem reports stabilize. -- **Bundle splitting per route** — when initial-paint metrics warrant. -- **i18n** — when a customer demands non-English UI. -- **Storybook** — useful for design-system maintenance; deferred until token-set proves stable in real pages. diff --git a/docs/engineering/frontend_design_tokens.md b/docs/engineering/frontend_design_tokens.md deleted file mode 100644 index 80eafe5d..00000000 --- a/docs/engineering/frontend_design_tokens.md +++ /dev/null @@ -1,260 +0,0 @@ -# OpenWatch Frontend Design Tokens - -> **Status:** Locked 2026-05-30 -> **Authority:** This document defines every `--ow-*` CSS variable consumed by `frontend/`. If the executable theme at `frontend/src/theme/` disagrees with this table, the executable form is wrong. -> **Audience:** Anyone writing or reviewing frontend components. 
- ---- - -## What this document is - -Every visible surface in the OpenWatch frontend ultimately resolves through a CSS variable defined here. MUI v7's CSS-vars mode reads these variables; component styles reference them; the dark and light color schemes differ only in the values assigned to them. - -The token names come from the prototype at `docs/engineering/prototypes/openwatch-v1/`. Prototype values become the **dark** scheme; **light** values are computed in this document. - -Naming rule: every variable is prefixed `--ow-*`. This is set via MUI v7's `cssVarPrefix: 'ow'` so there are no collisions with library-default `--mui-*` variables. - ---- - -## Surfaces (background + line) - -The frontend uses a 4-tier surface scale. `bg-0` is the page canvas; `bg-1` is the lightest elevated surface (sidebar, top bar, widgets); `bg-2` is hover surfaces; `bg-3` is the most elevated surface (active states, drawers). - -| Token | Dark | Light | Usage | -|-------|------|-------|-------| -| `--ow-bg-0` | `#0b0c0f` | `#ffffff` | Page canvas | -| `--ow-bg-1` | `#111317` | `#f6f7f9` | Sidebar, topbar, widgets | -| `--ow-bg-2` | `#161a20` | `#eef0f3` | Hover states | -| `--ow-bg-3` | `#1c2129` | `#e3e6eb` | Active states, drawers | -| `--ow-line` | `#232831` | `#d8dce2` | Borders, dividers | -| `--ow-line-2` | `#2c323d` | `#c4ccd6` | Hover borders, separators | - -Light values target ≥4.5:1 contrast for primary text on `--ow-bg-0` and ≥3:1 for borders on adjacent surfaces. - -## Text - -Four tiers of text emphasis. `fg-0` is the most prominent; `fg-3` is decorative/disabled. - -| Token | Dark | Light | Usage | -|-------|------|-------|-------| -| `--ow-fg-0` | `#f3f5f8` | `#0b0c0f` | Primary text | -| `--ow-fg-1` | `#cfd4dd` | `#1c2129` | Secondary text | -| `--ow-fg-2` | `#8a93a3` | `#5b6473` | Muted labels | -| `--ow-fg-3` | `#5b6473` | `#8a93a3` | Tertiary, placeholders | - -## Severity / semantic colors - -Four semantic colors. 
Each carries three forms: - -- **base** — the fill or stroke color (e.g., button background, icon color, status dot) -- **on** — the foreground that's legible on top of `base` (text on a primary button, icon on a status pill) -- **bg** — the soft-tinted background that pairs with base (alert chip background, hover ring) - -The dark severity values come from the prototype's `oklch()` colors. Light values lower the L% to ~52–58 so text on white passes WCAG 2.1 AA (4.5:1). - -### Critical (red) - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-crit` | `oklch(64% 0.20 25)` | `oklch(52% 0.22 25)` | -| `--ow-crit-on` | `#0a1424` | `#ffffff` | -| `--ow-crit-bg` | `oklch(35% 0.12 25 / 0.18)` | `oklch(95% 0.04 25)` | - -### Warning (amber) - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-warn` | `oklch(78% 0.15 75)` | `oklch(58% 0.15 75)` | -| `--ow-warn-on` | `#0a1424` | `#ffffff` | -| `--ow-warn-bg` | `oklch(50% 0.12 75 / 0.16)` | `oklch(96% 0.06 75)` | - -### Ok (green) - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-ok` | `oklch(72% 0.16 155)` | `oklch(48% 0.16 155)` | -| `--ow-ok-on` | `#0a1424` | `#ffffff` | -| `--ow-ok-bg` | `oklch(45% 0.10 155 / 0.18)` | `oklch(95% 0.05 155)` | - -### Info (blue) — also the brand accent - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-info` | `oklch(70% 0.13 245)` | `oklch(52% 0.15 245)` | -| `--ow-info-on` | `#0a1424` | `#ffffff` | -| `--ow-info-bg` | `oklch(45% 0.12 245 / 0.18)` | `oklch(95% 0.04 245)` | - -### Brand secondary (logo gradient tail) - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-brand-2` | `oklch(60% 0.20 290)` | `oklch(52% 0.20 290)` | - -Used only in the logo gradient. Not a general-purpose color. - -### OS brand colors (rare) - -| Token | Both modes | -|-------|------------| -| `--ow-os-ubuntu` | `#e95420` | -| `--ow-os-rhel` | `#ee0000` | - -Used only in OS-identity decoration on Host Detail. 
Same color in both modes (OS brand identity does not adapt). - -## Typography - -| Token | Value | -|-------|-------| -| `--ow-font-sans` | `'Inter', system-ui, -apple-system, sans-serif` | -| `--ow-font-mono` | `'JetBrains Mono', ui-monospace, monospace` | -| `--ow-font-size-base` | `14px` | -| `--ow-line-height-base` | `1.45` | - -Inter weights loaded: 400, 500, 600, 700. JetBrains Mono weights loaded: 400, 500. Fonts ship via Google Fonts in `index.html` (matches prototype) or self-hosted (decision deferred to v0.2 — for v0 we accept the Google Fonts CDN dependency). - -## Border radius - -| Token | Value | Usage | -|-------|-------|-------| -| `--ow-radius` | `8px` | Cards, drawers, dialogs | -| `--ow-radius-sm` | `6px` | Inputs, small chips, buttons | -| `--ow-radius-full` | `999px` | Pills, status badges | - -## Shadows / elevation - -Shadows differ per mode — the dark-mode `rgba(0,0,0,0.45)` shadows are too heavy for light surfaces. - -| Token | Dark | Light | -|-------|------|-------| -| `--ow-shadow-sm` | `0 1px 2px rgba(0,0,0,0.3)` | `0 1px 2px rgba(11,12,15,0.08)` | -| `--ow-shadow-md` | `0 4px 12px rgba(0,0,0,0.4)` | `0 4px 12px rgba(11,12,15,0.10)` | -| `--ow-shadow-lg` | `0 16px 40px rgba(0,0,0,0.45)` | `0 16px 40px rgba(11,12,15,0.15)` | - -Drawer / floating menu use `--ow-shadow-lg`; widgets use no shadow (border-only); avatar dropdown uses `--ow-shadow-md`. - -## Motion - -| Token | Value | Usage | -|-------|-------|-------| -| `--ow-motion-fast` | `120ms` | Button/link hover state transitions | -| `--ow-motion-base` | `150ms` | Drawer slide, modal fade | -| `--ow-motion-slow` | `200ms` | Tray panels, large drawers | - -All transitions use the default `ease` curve unless explicitly overridden in a component. 
- -## Spacing scale - -Tied to MUI v7's `theme.spacing(n)` function — `n` × 4px — but selected scale points are exposed as tokens for direct CSS use: - -| Token | Value | -|-------|-------| -| `--ow-space-1` | `4px` | -| `--ow-space-2` | `8px` | -| `--ow-space-3` | `12px` | -| `--ow-space-4` | `16px` | -| `--ow-space-5` | `20px` | -| `--ow-space-6` | `24px` | -| `--ow-space-7` | `28px` (sidebar→content gutter) | - ---- - -## How MUI v7 sees this - -`extendTheme({ cssVarPrefix: 'ow', colorSchemes: { light: {...}, dark: {...} }, defaultColorScheme: 'dark' })` produces theme objects whose `palette.*` fields reference the variables above. Example: - -```ts -{ - palette: { - background: { - default: 'var(--ow-bg-0)', - paper: 'var(--ow-bg-1)', - }, - text: { - primary: 'var(--ow-fg-0)', - secondary: 'var(--ow-fg-1)', - disabled: 'var(--ow-fg-3)', - }, - primary: { - main: 'var(--ow-info)', - contrastText: 'var(--ow-info-on)', - }, - error: { - main: 'var(--ow-crit)', - contrastText: 'var(--ow-crit-on)', - }, - warning: { - main: 'var(--ow-warn)', - contrastText: 'var(--ow-warn-on)', - }, - success: { - main: 'var(--ow-ok)', - contrastText: 'var(--ow-ok-on)', - }, - info: { - main: 'var(--ow-info)', - contrastText: 'var(--ow-info-on)', - }, - divider: 'var(--ow-line)', - }, - shape: { borderRadius: 8 /* matches --ow-radius */ }, -} -``` - -The `colorSchemes.light` and `colorSchemes.dark` variants supply the same MUI palette structure pointing at the same `--ow-*` variables; MUI v7 emits the appropriate variable values per `data-mui-color-scheme` attribute on `<html>`. - -## How a component sees this - -Components prefer the MUI `sx` prop or `styled()` wrappers, which receive the theme and resolve to `var(--ow-*)`. Direct CSS that bypasses MUI (e.g. 
in a custom hand-styled component) reads variables straight: - -```tsx -const Surface = styled('div')(({ theme }) => ({ - background: 'var(--ow-bg-1)', - border: '1px solid var(--ow-line)', - borderRadius: 'var(--ow-radius)', - color: 'var(--ow-fg-0)', -})); -``` - -## How no-FOUC works - -A synchronous `<script>` in `index.html` `<head>` (before the React bundle parses) reads `localStorage.getItem('ow-color-scheme')`, falls back to `system` (which resolves via `window.matchMedia('(prefers-color-scheme: dark)')`), and sets `data-mui-color-scheme="light"` or `"dark"` on `<html>`. MUI v7 ships this helper as `getInitColorSchemeScript({ attribute: 'data-mui-color-scheme', defaultMode: 'system' })`. We use it verbatim. - -When React mounts and a user selects a different mode via the Settings toggle, `useColorScheme().setMode(newMode)` updates the attribute, re-renders subscribing components, and persists to `localStorage`. - -## How system-mode change propagation works - -```ts -useEffect(() => { - const mq = window.matchMedia('(prefers-color-scheme: dark)'); - const onChange = () => { - // mode === 'system' → re-resolve and re-apply - if (mode === 'system') setSystemMode(mq.matches ? 'dark' : 'light'); - }; - mq.addEventListener('change', onChange); - return () => mq.removeEventListener('change', onChange); -}, [mode]); -``` - -(MUI v7's `useColorScheme()` does this internally; we don't write it ourselves.) - ---- - -## Per-token audit checklist (for the foundation spec's AC tests) - -The `frontend-foundation` spec asserts the executable theme matches this document. A test reads `frontend/src/theme/tokens.ts` and verifies: - -1. Every token from §Surfaces, §Text, §Severity, §Typography, §Radius, §Shadows, §Motion, §Spacing is present. -2. Every token in this document maps to an exported constant or theme path. -3. The dark + light values exactly match the tables above. -4. No `var(--mui-*)` references slip in (must be `var(--ow-*)`). -5. 
axe-core scan of a rendered shell passes WCAG 2.1 AA in **both** dark and light modes. - -When this document and `tokens.ts` disagree, this document wins. Update `tokens.ts` to match, never the other way. - ---- - -## Open follow-ups - -- **Self-host Inter + JetBrains Mono** — current decision is Google Fonts CDN. For air-gapped deployments this fails. Self-host in v0.2. -- **Component-specific tokens** — once Storybook is in place, document per-component overrides (e.g., dashboard-widget elevation) as a sub-set of these tokens. -- **Reduced-motion mode** — `prefers-reduced-motion: reduce` should zero out `--ow-motion-*`. Implement at the same time as the foundation spec. diff --git a/docs/engineering/licensing_foundation.md b/docs/engineering/licensing_foundation.md deleted file mode 100644 index 6923c43c..00000000 --- a/docs/engineering/licensing_foundation.md +++ /dev/null @@ -1,757 +0,0 @@ -# OpenWatch+ Licensing Foundation - -> **Status:** Locked design 2026-04-28 -> **Authority:** This document is the architectural foundation for license validation, feature gating, and quota enforcement in the Go rebuild. Implementation in Stage 0 must conform. -> **Why now:** Today's `LicenseService` is a config-flag stub (3 TODOs in `services/licensing/service.py`). The rebuild has a clean opportunity to design the licensing foundation properly. Bolting it on later — when several Phase-2 features depend on it — will be expensive. - ---- - -## 1. 
Why this is foundation work, not feature work - -Licensing crosses every architectural seam in the system: - -- **HTTP routes** — many endpoints are gated on a license feature -- **Service layer** — quotas enforced at points of use (host count, scan rate) -- **Frontend** — UI hides or marks features the customer's license doesn't include -- **Audit log** — license events (install, expiry, denial) are first-class audit events -- **Errors** — `license.feature_unavailable` is a documented error code with `402 Payment Required` -- **Operational** — license install / renew / verify is a CLI flow operators run -- **Spec layer** — Specter validates behavioral contracts about expiry, grace periods, tamper detection -- **Build** — public key for signature verification is compiled into the binary - -If licensing is added later, every one of these seams has to be retrofitted — and retrofits leak. Doing it once, now, with the architecture decisions still fluid, is the cheap path. - ---- - -## 2. Core requirements - -### 2.1 Functional - -1. **Validate license authenticity** — cryptographically signed, tamper-resistant -2. **Determine feature availability** — per feature ID, fast (hot path) -3. **Enforce quota limits** — host count, scan rate, user count -4. **Handle expiry gracefully** — grace period before lockout -5. **Reload without restart** — operators install a new license file and signal the service -6. **Air-gapped deployment** — no phone-home; license validation is fully offline -7. **Audit every license event** — install, reload, expiry, feature check denial, quota exceeded - -### 2.2 Non-functional - -1. **Hot-path performance** — `IsEnabled(featureID)` must be O(1) and lock-free -2. **Tamper resistance** — best-effort against license file modification and clock rollback; not against process patching (unwinnable) -3. **Operational simplicity** — single `.lic` file, single CLI command to install -4. 
**Free tier always works** — service boots and runs without any license file (free features only) -5. **Forward compatibility** — adding a new feature ID doesn't break existing licenses -6. **Backward compatibility** — removing a feature ID is a versioning event; old licenses keep working - ---- - -## 3. License model - -### 3.1 License is a signed JWT - -A license is a JWT with `EdDSA` algorithm (Ed25519), payload as documented below. Distributed as a single file: `license.lic` (literal JWT compact serialization). - -**Why JWT:** -- Standard format with mature Go library support (`golang-jwt/jwt` v5 — already locked in roadmap) -- Single file, no out-of-band signature -- Header carries algorithm + key ID, enabling key rotation -- Compact, base64url-safe, suitable for email distribution - -**Why Ed25519:** -- Same primitive used elsewhere in the platform (Kensa rule signing, evidence signing) — one algorithm, fewer surfaces -- Stdlib (`crypto/ed25519`) — no external dependency -- Fast verification, small signatures, FIPS-compliant via `microsoft/go` - -### 3.2 License JWT claims - -```json -{ - "iss": "openwatch-licensing@hanalyx.com", - "sub": "customer-uuid-or-name", - "jti": "license-uuid", - "iat": 1700000000, - "nbf": 1700000000, - "exp": 1731622400, - "aud": "openwatch", - "license": { - "tier": "openwatch_plus", - "customer_name": "Acme Corp", - "customer_id": "cust-uuid", - "features": [ - "audit_query", - "audit_export", - "temporal_queries", - "remediation_execution", - "structured_exceptions", - "priority_updates", - "sso_saml", - "fido2_mfa" - ], - "quotas": { - "max_hosts": 5000, - "max_scans_per_day": 50000, - "max_users": 500, - "max_concurrent_scans": 100 - }, - "deployment_fingerprint": null, - "support_contact": "support@example.com" - } -} -``` - -**Field semantics:** - -| Field | Meaning | -|---|---| -| `iss` | License issuer. Pinned per build via embedded public key + issuer string. | -| `sub` | Customer subject. 
Used for audit/logging only (not validation). | -| `jti` | Unique license ID. Stored in DB on install for revocation tracking. | -| `iat` / `nbf` / `exp` | Standard JWT timestamps. `nbf` and `exp` enforce the validity window. | -| `aud` | Always `openwatch`. Rejected if mismatched. | -| `license.tier` | `free` or `openwatch_plus`. Free licenses can be issued explicitly to override default Free tier (for trials, partners). | -| `license.features` | Authoritative list of enabled feature IDs. Anything not in this list is denied. | -| `license.quotas` | Numeric limits. `null` or absent = unlimited for that quota. | -| `license.deployment_fingerprint` | Optional SHA-256 of `(machine_id + install_id)`. If set, license is bound to that deployment. Most customers: `null`. | -| `license.support_contact` | Embedded for operator convenience; never used in validation. | - -### 3.3 Feature ID registry - -Feature IDs are stable strings. Registry lives at `licensing/features.yaml` and is checked into source. 
- -```yaml -# licensing/features.yaml -version: 1 -features: - - id: compliance_check - tier: free - description: Run compliance scans against hosts - introduced: "1.0.0" - - - id: audit_query - tier: openwatch_plus - description: Saved and ad-hoc audit query system - introduced: "1.0.0" - - - id: audit_export - tier: openwatch_plus - description: Export audit data as JSON/CSV/PDF with signed bundles - introduced: "1.0.0" - - - id: temporal_queries - tier: openwatch_plus - description: Point-in-time compliance posture, drift, forecasts - introduced: "1.0.0" - - - id: remediation_execution - tier: openwatch_plus - description: Apply remediation via Kensa with rollback support - introduced: "1.0.0" - - - id: structured_exceptions - tier: openwatch_plus - description: Multi-stage exception approval workflow - introduced: "1.0.0" - - - id: priority_updates - tier: openwatch_plus - description: Early access to Kensa rule updates - introduced: "1.0.0" - - - id: sso_saml - tier: openwatch_plus - description: SAML 2.0 single sign-on - introduced: "1.0.0" - - - id: fido2_mfa - tier: openwatch_plus - description: FIDO2/WebAuthn second factor - introduced: "1.0.0" - -# Deprecated features kept for backwards compatibility: -deprecated_features: - - id: legacy_csv_export - deprecated_in: "1.0.0" - removed_in: "2.0.0" - description: Legacy flat CSV report export (replaced by signed report faces) -``` - -**Rules:** - -1. **Adding a feature** is non-breaking. Existing licenses without it default to denied. New licenses include it as needed. -2. **Removing a feature** requires deprecation period (one minor version) then removal. Old licenses including the removed feature are unaffected (the registry is the source of truth, not the license). -3. **Renaming a feature** is forbidden. Add a new ID and deprecate the old one. -4. **Tier changes** (e.g., promoting a feature from `openwatch_plus` to `free`) are allowed and take effect on next license reload. -5. 
**Free-tier features are never gated.** They're listed for completeness; the gate logic short-circuits on `tier=free`. - -### 3.4 Quotas - -Quotas are advisory limits enforced at point of use: - -| Quota | Enforcement point | Behavior at limit | -|---|---|---| -| `max_hosts` | Host create/import | Reject new host with `quota.max_hosts_exceeded` | -| `max_scans_per_day` | Scan enqueue | Reject scan with `quota.daily_scan_limit` | -| `max_users` | User create | Reject with `quota.max_users_exceeded` | -| `max_concurrent_scans` | Scan dequeue | Defer scan with `quota.concurrent_scan_limit` (queued, not failed) | - -Free tier defaults (compiled into binary as fallbacks): - -```go -var FreeTierQuotas = Quotas{ - MaxHosts: 100, - MaxScansPerDay: 1000, - MaxUsers: 10, - MaxConcurrentScans: 10, -} -``` - -These can be overridden by an explicit free-tier license. - ---- - -## 4. Validation logic - -### 4.1 Verification order (all must pass) - -1. **JWT structure** — three base64url segments separated by dots -2. **Algorithm** — header `alg` must be `EdDSA`. No exceptions. Reject `none`, `HS256`, etc. -3. **Key ID match** — header `kid` must match an embedded public key -4. **Signature** — Ed25519 verify with the resolved public key -5. **Issuer** — `iss` must match the embedded issuer string -6. **Audience** — `aud` must equal `openwatch` -7. **Validity window** — current monotonic-cross-checked time must be within `[nbf, exp]` -8. **Deployment fingerprint** (if set) — SHA-256(machine_id + install_id) must match -9. **Clock rollback check** — current time must be ≥ last_known_good_time stored in DB - -If any fails: license is rejected, service falls back to Free tier, audit event emitted with the specific failure. - -### 4.2 Public key distribution - -Public keys are **compiled into the binary** at build time. 
Three keys minimum (current + 2 historical) embedded for rotation support: - -```go -//go:embed keys/license-pubkey-current.pem -var licensePubKeyCurrent []byte - -//go:embed keys/license-pubkey-prev.pem -var licensePubKeyPrev []byte - -//go:embed keys/license-pubkey-deprecated.pem -var licensePubKeyDeprecated []byte -``` - -**Why embedded, not config:** -- Tampering with config files is easier than tampering with the binary -- Embedded keys cannot be replaced without re-shipping the binary -- Customers can verify integrity by checking binary signatures (RPM/DEB signing) - -**Key rotation procedure:** -1. New key pair generated by issuer -2. Next OpenWatch release embeds the new key as `current`, old key as `prev` -3. New licenses signed with new key -4. Old licenses signed with old key continue to validate (until `prev` is rotated out) -5. After 12 months, the once-`prev` key becomes `deprecated` (still validates but emits warning) and is rotated out one release later - -### 4.3 Clock rollback detection - -System clocks can be tampered with. Mitigation: - -1. On license install / reload, store `last_known_good_time = max(now, exp_minus_grace_period)` in the `licenses` table -2. On every validation, check `now >= last_known_good_time - tolerance` (where tolerance = 1 hour for NTP drift) -3. If `now < last_known_good_time - tolerance`: clock rollback detected, license invalidated, audit event emitted, fall back to Free tier - -This is best-effort — a determined attacker with root can defeat it. The point is to catch accidents and obvious tampering. 
- -### 4.4 Grace period on expiry - -Licenses don't go from "active" to "free tier" instantly on expiry: - -| Time relative to `exp` | State | Behavior | -|---|---|---| -| Before `exp` | active | All licensed features available | -| `exp` to `exp + 30 days` | grace | All features still available; `Warning` header on every API response; banner on UI; `license.expiring_soon` audit event daily | -| After `exp + 30 days` | expired | Free tier only; `license.expired` audit event on first denial; UI shows expired banner | - -**Operational pressure during grace:** the warning headers and audit events make expiry visible long before the lockout. Operators have 30 days to install a renewal. - -### 4.5 Reload model - -License is loaded: - -1. **At startup** — read `/etc/openwatch/license.lic`, validate, populate in-memory state -2. **On SIGHUP** — re-read, re-validate, swap in-memory state atomically -3. **On schedule** — every hour, re-validate the in-memory license against current time (catches expiry transitions without restart) - -The in-memory state is `*atomic.Pointer[LicenseState]` — readers (the hot-path `IsEnabled` check) load the pointer with a single atomic op; reload publishes a new pointer atomically. No locks on the read path. - ---- - -## 5. 
Architecture - -### 5.1 Package layout (as built) - -``` -internal/license/ -├── types.go # License, Feature, Tier, State structs -├── features.gen.go # Codegen output from specs (Feature constants, FeatureRegistry) -├── validator.go # JWT parsing + Ed25519 signature verification + claims -├── state.go # In-memory state + atomic.Pointer[State]; IsEnabled hot path -├── middleware.go # RequireFeature/EnforceFeature/DenyFeature + denial dedup -├── service.go # GET /license, GET /license/features handlers -├── audit.go # License event emission helpers -├── keys.go # Embedded public key ring loader -├── keys/ # Embedded public keys (.pem) via go:embed -│ └── license-pubkey-current.pem -├── testdata/ # Test private key (NOT shipped in releases) -└── features_test.go, validator_test.go -``` - -Deferred to a later stage (not yet implemented): -- `loader.go` — file-based license install path (currently env-var only) -- `reload.go` — SIGHUP-driven re-validation -- `cli.go` — `openwatch license install/verify/info` subcommands -- Quotas (`Quotas` struct + `RequireQuota` middleware) — Phase 2 - -### 5.2 Hot path: `IsEnabled` - -```go -package license - -import ( - "sync/atomic" -) - -type State struct { - Tier Tier - Features map[string]struct{} // O(1) lookup - Quotas Quotas - ExpiresAt time.Time - GraceUntil time.Time - LicenseID string - InstalledAt time.Time -} - -var current atomic.Pointer[State] - -// IsEnabled is the hot-path check. Lock-free, O(1). -func IsEnabled(featureID string) bool { - s := current.Load() - if s == nil { - return isFreeTierFeature(featureID) - } - _, ok := s.Features[featureID] - return ok -} - -// IsExpired returns true if license is past grace period. -func IsExpired() bool { - s := current.Load() - if s == nil { - return false // No license = Free tier, not expired - } - return time.Now().After(s.GraceUntil) -} -``` - -Performance: ~20ns per check. Safe to call from every HTTP request without measurable overhead. 
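The lock-free read path in 5.2 relies on the writer never mutating a published state. A minimal sketch of that publish side follows, assuming a trimmed-down `state` type and a hypothetical `publish` helper. The real `State` carries more fields, and the real nil case defers to the free-tier check rather than returning false.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// state is a reduced stand-in for the license State struct.
type state struct {
	features map[string]struct{}
}

var current atomic.Pointer[state]

// publish (hypothetical helper) builds a complete new state and swaps it
// in with one atomic store. The previous state is never modified, so
// concurrent readers see either the old or the new state, never a mix.
func publish(features ...string) {
	next := &state{features: make(map[string]struct{}, len(features))}
	for _, f := range features {
		next.features[f] = struct{}{}
	}
	current.Store(next)
}

// isEnabled is the reader: one atomic load plus one map lookup.
// Simplification: with no state loaded it reports false instead of
// consulting the free-tier feature list.
func isEnabled(id string) bool {
	s := current.Load()
	if s == nil {
		return false
	}
	_, ok := s.features[id]
	return ok
}

func main() {
	fmt.Println(isEnabled("audit_query")) // nothing published yet

	publish("audit_query", "audit_export")
	fmt.Println(isEnabled("audit_query"))

	publish("audit_export") // simulated reload with a narrower license
	fmt.Println(isEnabled("audit_query"))
}
```

The design choice worth noting is that the map is written only before `Store`. Updating the published map in place would reintroduce a data race and defeat the lock-free read.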
- -### 5.3 Middleware (as built) - -The implementation takes a typed `Feature` (not a `string`) and uses the -error envelope schema fixed by `error_codes.spec.yaml` (`fault`, not -`category`). Dedup on denial events is enforced per -`(feature, actor)` within a 60s window — see `denialMap` in -`internal/license/middleware.go`. - -```go -// RequireFeature: chi middleware for routes wired directly via chi. -func RequireFeature(f Feature) func(http.Handler) http.Handler { - return func(next http.Handler) http.Handler { - return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - if EnforceFeature(w, r, f) { - return // denied; response already written - } - next.ServeHTTP(w, r) - }) - } -} - -// EnforceFeature: called inside oapi-codegen-generated handlers where -// per-route middleware injection is awkward. Returns true if denied -// (handler should return immediately). -func EnforceFeature(w http.ResponseWriter, r *http.Request, f Feature) (denied bool) { - if IsEnabled(f) { - return false - } - DenyFeature(w, r, f) - return true -} -``` - -Denial envelope (`fault: "policy"`, retryable: false): -```json -{"error":{"code":"license.feature_unavailable","fault":"policy","retryable":false, - "human_message":"this feature requires an OpenWatch+ license", - "detail":{"feature":"premium_diagnostics"}}} -``` - -Grace-period Warning header support is on the roadmap; not yet implemented. - -### 5.4 OpenAPI integration: `x-required-feature` (as built) - -Endpoints that require a license feature declare it in the OpenAPI spec: - -```yaml -paths: - /diagnostics:premium-echo: - post: - x-required-feature: premium_diagnostics # License gate - summary: Premium-tier echo (license-gated) - ... -``` - -`x-required-feature` is documentation only — oapi-codegen does not auto-wire -license enforcement. Handlers call `license.EnforceFeature(w, r, f)` at the -top of the function body (`internal/server/handlers.go:PostDiagnosticsPremiumEcho`). 
-The `x-required-feature` extension is the source of truth for which routes are -gated and which Feature ID gates them; an audit script can cross-check this -against the handler code. - -RBAC integration (`x-required-permission`) is planned but not yet implemented -(deferred to the RBAC milestone — Day 8 in the Stage-0 plan). - -If an endpoint declares `x-required-feature` for a feature ID not in `features.yaml`, the build fails. This prevents silent drift. - -### 5.5 Service-layer quota enforcement - -For things that don't map cleanly to a single HTTP route (e.g., concurrent scan limit hits the worker, not the route): - -```go -package license - -func CheckQuota(q QuotaName, current int64) error { - state := current.Load() - limit := state.Quotas.Get(q) - if limit == 0 { - return nil // unlimited - } - if current >= limit { - audit.Emit(ctx, audit.Event{ - Action: "license.quota_exceeded", - Resource: string(q), - Detail: map[string]any{"limit": limit, "current": current}, - }) - return &Error{ - Code: "quota." + string(q) + "_exceeded", - HumanMessage: fmt.Sprintf("%s limit reached (%d).", q.Description(), limit), - } - } - return nil -} - -// Call sites: -func (s *HostService) Create(ctx context.Context, req HostCreate) (*Host, error) { - count, _ := s.repo.CountActiveHosts(ctx) - if err := license.CheckQuota(license.QuotaMaxHosts, count); err != nil { - return nil, err - } - return s.repo.Insert(ctx, req) -} -``` - ---- - -## 6. 
Data model - -### 6.1 `licenses` table - -```sql -CREATE TABLE licenses ( - id UUID PRIMARY KEY, -- license JTI claim - tier TEXT NOT NULL, -- 'free' | 'openwatch_plus' - customer_id TEXT NOT NULL, - customer_name TEXT, - issued_at TIMESTAMPTZ NOT NULL, - not_before TIMESTAMPTZ NOT NULL, - expires_at TIMESTAMPTZ NOT NULL, - features JSONB NOT NULL, -- string array - quotas JSONB NOT NULL, -- {max_hosts: int, ...} - fingerprint TEXT, -- nullable - raw_jwt TEXT NOT NULL, -- the original JWT for re-validation - installed_at TIMESTAMPTZ NOT NULL DEFAULT now(), - installed_by UUID REFERENCES users(id), - superseded_at TIMESTAMPTZ, -- null = active; set when replaced - last_validated_at TIMESTAMPTZ NOT NULL DEFAULT now(), - last_known_good_time TIMESTAMPTZ NOT NULL DEFAULT now() -- clock rollback baseline -); - -CREATE INDEX idx_licenses_active ON licenses (installed_at DESC) WHERE superseded_at IS NULL; -``` - -**Why store the raw JWT:** allows the service to re-verify the license on every reload without re-reading the file. Also enables forensic review of past licenses. - -**Only one active license at a time** — install supersedes the previous (sets `superseded_at`). - -### 6.2 Audit events - -License-related audit events use a stable namespace: - -| Action | When emitted | -|---|---| -| `license.installed` | New license loaded (startup or SIGHUP) | -| `license.invalid` | License file failed validation | -| `license.expiring_soon` | Within 30 days of expiry (daily) | -| `license.expired` | First feature denial after grace period ended | -| `license.feature_check_denied` | Per-request denial (high-volume; rate-limit logging) | -| `license.quota_exceeded` | Quota limit hit | -| `license.clock_rollback_detected` | System clock < last_known_good_time | -| `license.tampered` | Fingerprint mismatch or signature failure on reload | - -Per the audit-as-API contract, these are queryable via `/api/v1/audit/events?action=license.*`. 
- -> **Rate limiting `license.feature_check_denied`:** if every denied request emits an audit event, a misbehaving client floods the audit log. Mitigation: deduplicate by `(actor_id, feature_id)` within a 1-minute window, emit at most one event per window. Counts in `detail.suppressed_count`. - ---- - -## 7. CLI: `openwatch license` - -Three subcommands cover the operator flow: - -### 7.1 `openwatch license info` - -Show current license state. - -``` -$ openwatch license info -License ID: e21f5a8d-... -Customer: Acme Corp -Tier: openwatch_plus -Status: active -Expires: 2027-01-15 (262 days) -Features: audit_query, audit_export, temporal_queries, remediation_execution, - structured_exceptions, priority_updates, sso_saml, fido2_mfa -Quotas: - max_hosts: 5000 (currently using 1247) - max_scans_per_day: 50000 - max_users: 500 (currently using 23) - max_concurrent_scans: 100 -Installed: 2026-03-01 by admin@example.com -Last validated: 2026-04-28 14:32:00 UTC -Support contact: support@hanalyx.com -``` - -### 7.2 `openwatch license verify <file>` - -Verify a license file without installing. Useful before sending to a customer. - -``` -$ openwatch license verify /tmp/new-license.lic -✓ JWT structure valid -✓ Signature verified (key: license-pubkey-current.pem) -✓ Issuer matches -✓ Audience matches -✓ Validity window: 2026-05-01 → 2027-05-01 (active in 3 days) -✓ Features: 8 declared, all known -✓ Quotas: max_hosts=5000, max_scans_per_day=50000, max_users=500, max_concurrent_scans=100 -- Deployment fingerprint: not bound - -License is valid. -``` - -### 7.3 `openwatch license install <file>` - -Verify and install. Sends SIGHUP to running service so reload happens without restart. - -``` -$ sudo openwatch license install /tmp/new-license.lic -Verifying license... ✓ -Backup: /etc/openwatch/license.lic.bak (previous license) -Installing: /etc/openwatch/license.lic -Permissions: 0640 root:openwatch -Reloading openwatch service... 
✓ -Audit event: license.installed (license_id=e21f5a8d-...) - -License installed successfully. -``` - ---- - -## 8. Spec coverage (Specter) - -Behavioral specs live at `specs/system/license-validation.spec.yaml`, -`specs/system/license-features.spec.yaml`, and `specs/api/license.spec.yaml`. -The acceptance criteria below capture the design intent (the shipped specs -split these across the files above and may use updated AC ids): - -```yaml -spec_id: system/licensing -status: active -version: 1.0 -acceptance_criteria: - - id: AC-1 - description: Service boots without license file in Free tier - - id: AC-2 - description: Valid signed license unlocks declared features - - id: AC-3 - description: Invalid signature is rejected; service falls back to Free tier; emits license.invalid - - id: AC-4 - description: Expired license enters 30-day grace period before falling back to Free tier - - id: AC-5 - description: Grace-period responses include Warning header - - id: AC-6 - description: SIGHUP reload picks up new license without restart; emits license.installed - - id: AC-7 - description: Each feature_check_denied emits a structured audit event (rate-limited per (actor, feature)) - - id: AC-8 - description: Clock rollback (>1h) is detected and license invalidated; emits license.clock_rollback_detected - - id: AC-9 - description: Quota limits enforced at service boundaries with quota.<name>_exceeded error - - id: AC-10 - description: alg=none, alg=HS256, and other non-EdDSA JWTs are rejected - - id: AC-11 - description: License with deployment_fingerprint validates only on matching deployment - - id: AC-12 - description: Public key rotation (current → prev → deprecated) preserves validation for old licenses - - id: AC-13 - description: Adding a feature ID is non-breaking for existing licenses -``` - -13 ACs. Each one has at least one enforcing test. `specter coverage --enforce-active` blocks merge if any AC lacks coverage. - ---- - -## 9. 
Frontend integration - -The frontend is human-first, but it must respect license state. Pattern: - -1. **On app load:** `GET /api/v1/capabilities` returns features and quotas -2. **Frontend caches** capabilities for the session -3. **UI rendering:** - - Features not in the license: hide, OR show with "Upgrade to OpenWatch+" badge (operator-configurable) - - Quota approaching limit (>80% used): show warning indicator - - Quota at limit: disable creation UI, show "Limit reached" message -4. **Grace period:** banner at top of every page with expiry date and renewal CTA -5. **Expired:** banner persists; UI degrades gracefully to Free tier features only - -This matches the principle from the Agent-First architecture: **frontend is human-first, API is agent-first**. The same `/capabilities` endpoint feeds both. - ---- - -## 10. Failure modes and operator experience - -| Scenario | System behavior | Operator experience | -|---|---|---| -| No license file at startup | Boot to Free tier; no audit event (this is normal for Free-tier deployments) | UI shows Free tier, no warnings | -| Invalid license file (bad signature, malformed JWT) | Boot to Free tier; emit `license.invalid` audit event; log error | UI banner: "License file invalid. Contact support." | -| Expired license at startup | If within grace period: load with grace flag. If past grace: load to Free tier; emit `license.expired` | UI banner: "License expired N days ago. Some features unavailable." | -| Quota exceeded at install | License loads; quota enforcement kicks in immediately; existing-state operations succeed but new ones fail | UI shows quota indicators; new creates fail with `quota.*_exceeded` error | -| Deployment fingerprint mismatch | License rejected; emit `license.tampered`; fall back to Free tier | UI banner: "License is bound to a different deployment. Contact support." 
| -| Clock rolled back | License rejected; emit `license.clock_rollback_detected`; fall back to Free tier | UI banner: "System time appears incorrect. License validation paused." | -| Public key rotated; old license signed with deprecated key | License loads with deprecation warning; emit `license.using_deprecated_key`; ask operator to renew | UI banner: "License signed with deprecated key. Please renew." | - -The pattern: **always boot, never crash on license issues.** A bad license never prevents the service from starting. It just degrades to Free tier. Compliance scanning is operationally critical; the licensing layer must not become a single point of failure. - ---- - -## 11. Build-time issuance flow (separate tool) - -License generation lives in a separate small tool (`owlicgen`), not in OpenWatch itself: - -``` -$ owlicgen \ - --signing-key /vault/license-signing-key.pem \ - --customer-id cust-uuid-... \ - --customer-name "Acme Corp" \ - --tier openwatch_plus \ - --features audit_query,audit_export,temporal_queries,remediation_execution \ - --max-hosts 5000 \ - --max-scans-per-day 50000 \ - --valid-from 2026-05-01 \ - --valid-until 2027-05-01 \ - --output /tmp/acme-license.lic - -Generated license: e21f5a8d-3c7a-4b1f-9e8d-... -File: /tmp/acme-license.lic (1247 bytes) -SHA256: 7a8b9c... - -Verify with: openwatch license verify /tmp/acme-license.lic -``` - -`owlicgen` lives at `cmd/owlicgen/`. It is not shipped to customers. It uses the **private** signing key, which never leaves Hanalyx infrastructure. Customers never see this tool. - ---- - -## 12. 
Stage 0 integration - -Stage 0 (walking skeleton) currently includes: -- Audit log endpoint -- Idempotency middleware -- Correlation ID middleware - -**Add to Stage 0 (Day 7 or new Day 8):** - -- Load `/etc/openwatch/license.lic` at startup (or run Free tier if absent) -- Validate JWT signature against embedded public key -- Populate `license.State` atomic pointer -- Implement `license.IsEnabled` and `license.RequireFeature` middleware -- Add a single demo gated endpoint: `POST /api/v1/diagnostics:premium-echo` with `x-required-feature: premium_diagnostics` -- Without a license file, that endpoint returns `402 Payment Required` with `error.code = "license.feature_unavailable"` -- With a test license signed by a test key, the endpoint works -- Audit event `license.feature_check_denied` is emitted on the failed call - -This is the minimum to prove the licensing seam works end-to-end before any real feature builds on it. - ---- - -## 13. What this document does NOT address (yet) - -These are licensing concerns deferred to later stages. Each one has a known answer; they're cataloged here so they don't get rediscovered as problems. - -| Topic | Deferred to | Rationale | -|---|---|---| -| License revocation list (CRL) | Phase 2+ | Air-gapped doesn't support online CRL; if needed, ship CRL via signed update bundle | -| Trial license self-service | Phase 2+ | Today: customer talks to sales for a trial license. Self-service registration form is a marketing project, not a platform project | -| Floating / concurrent licenses | Not planned | OpenWatch is per-deployment, not per-user. Not a fit for the model | -| Usage-based billing telemetry | Not planned | Hanalyx model is fixed-tier subscriptions. If usage-based ever becomes a thing, telemetry is a separate workstream | -| License upgrade in place (Free → +) | Phase 1 | Implicit: operator just installs the new license. 
No migration needed | -| Multi-license aggregation | Not planned | One license per deployment, period | - ---- - -## 14. Acceptance criteria for "foundation is built" - -Stage 0 ships with the licensing foundation when: - -- [ ] `internal/license/` package exists with all files listed in §5.1 -- [ ] `licensing/features.yaml` exists with the 9 initial feature IDs -- [ ] Public key embedded in binary via `//go:embed` -- [ ] `IsEnabled(featureID)` is lock-free, O(1), and tested -- [ ] `RequireFeature(featureID)` middleware works in chi -- [ ] `:premium-echo` demo endpoint validates the end-to-end seam -- [ ] License loads at startup; SIGHUP triggers reload -- [ ] Audit events emit for install, invalid, denied, quota_exceeded -- [ ] `openwatch license info / verify / install` CLI works -- [ ] `owlicgen` tool generates valid licenses signed by the test key -- [ ] Specter specs `system/license-validation.spec.yaml`, `system/license-features.spec.yaml`, and `api/license.spec.yaml` exist covering the ACs above -- [ ] OpenAPI extension `x-required-feature` is documented in `docs/engineering/api_design_principles.md` -- [ ] Frontend `/capabilities` response includes the active feature set -- [ ] `licenses` table migration is in place - -Once all 14 boxes are checked, downstream Phase-2 features can declare `x-required-feature` confidently and the gating "just works." - ---- - -## 15. Why this is worth doing now, in detail - -Three concrete failure modes are avoided by getting this right in Stage 0: - -1. **The "decorator graveyard."** The current Python codebase has `@require_license(...)` calls scattered across handlers, with the actual `LicenseService.has_feature()` doing nothing meaningful (3 TODO stubs). When real validation gets added, every decorator has to be re-checked — but because `has_feature()` always returned `True`, no one wrote tests against the `False` branch. Hidden bugs everywhere. Doing it once, properly, in Stage 0, eliminates this class. - -2. 
**The "schema-after" tax.** If licensing is added in Phase 2, the `licenses` table is a Phase-2 migration. But Phase-2 features depend on it for gating, which means the migration order has to be carefully sequenced. Doing it in Stage 0 means licensing is just... there, like the audit log. - -3. **The "silent denial" failure mode.** Without proper audit events on denial, customers can't tell why a feature isn't working. Support tickets pile up: "is it a bug?" "is it a permission issue?" "is it the license?" Audit events make this answerable in seconds. Adding audit events later requires retrofitting every gate — easy to miss one. Build it in once. - -The licensing foundation is roughly 1 extra day of Stage 0 work (15% of the budget). The cost of bolting it on later is at minimum 2 weeks of cross-cutting refactor, and at worst a class of latent bugs that surface only in production. - -The roadmap budget can absorb 1 day. It cannot absorb 2 weeks. diff --git a/docs/engineering/notifications_design.md b/docs/engineering/notifications_design.md deleted file mode 100644 index 8e8eadce..00000000 --- a/docs/engineering/notifications_design.md +++ /dev/null @@ -1,276 +0,0 @@ -# In-App Notifications — Change-Driven Design - -**Status:** Proposed -**Last Updated:** 2026-06-25 -**Related specs:** `frontend-notifications`, `system-alerts`, `api-alerts`, -`system-transaction-log`, `system-posture-snapshots`, `api-events-stream`, -`system-rbac` - ---- - -## 0. Why this document - -The shipped in-app notification MVP (`specs/frontend/notifications.spec.yaml`) -is a session-scoped counter whose only producer is `report.ready`. A bell that -counts finished reports is the least valuable thing the bell could do: a report -completing is not a *change in the world the operator must react to*. 
- -This document repoints the bell at **meaningful state changes** — first and -foremost a compliance regression ("a rule that was passing is now failing"), -plus connectivity loss, drift, failed scans, failed remediation, and governance -items that need a decision. The thesis: - -> A notification is a **change in compliance, fleet health, or governance state -> that a specific user should act on** — delivered durably, deduplicated, -> grouped, and deep-linked to the change. - -The good news: OpenWatch is built on a **write-on-change** model, so the -change events already exist as first-class records. We are mostly *surfacing* -data, not computing it. - ---- - -## 1. Two surfaces, deliberately different - -| | Activity feed (`/activity`) | Notifications (bell) | -|---|---|---| -| Audience | anyone, exploratory | the signed-in user | -| Content | the full chronological log of everything | the **actionable subset** of changes | -| State | stateless stream | **per-user unread / read** | -| Volume | high (includes routine noise) | low, severity-gated | -| Goal | "what happened" | "what needs my attention now" | - -The bell is not a second activity feed. It is the **curated, stateful, per-user -slice** of the same change data. - ---- - -## 2. Principles - -1. **Change-driven, not event-driven.** The backbone is the write-on-change - `transactions` log + the `alertrouter` (which already classifies changes, - assigns severity, and deduplicates). The bell is a *new sink* of that - stream, not a parallel pipeline. -2. **Severity-ranked.** Every notification carries a severity - (`critical`/`high`/`medium`/`low`/`info`, the existing `alertrouter.Severity` - enum). The badge counts **unread high+**, not raw volume. -3. **Group, don't flood.** A scan that flips 30 rules on a host produces **one** - notification ("web-01: 30 rules regressed, 4 critical"), not 30. Same - grouping discipline as Activity-readability Phase 4. -4. 
**Per-user and RBAC-scoped.** A user sees changes for hosts they can see; - approvers additionally get governance items; security roles get auth/security - items. Scope mirrors the `host:read` gating already on the SSE stream and - audit queries. -5. **Durable + read state.** A real per-user table, surviving refresh, with - mark-read / mark-all-read. (This is exactly what the MVP spec deferred.) -6. **Actionable.** Every notification deep-links to the change: - `/transactions/rule/:id`, `/hosts/:id`, the scan, or the exception. -7. **Noise is a bug.** If the bell ever shows routine churn, it has failed. - Reuse the drift thresholds and severity floors that already keep the alert - stream quiet. - ---- - -## 3. The notification taxonomy - -Anchored to real producers and identifiers in the codebase. "Source exists" -means the change is already detected/recorded today; we only need to fan it into -the in-app feed. - -### Compliance (the core) -| Change | Severity | Source (identifier) | Exists | -|---|---|---|---| -| Rule **pass → fail**, critical severity | **critical** | `transactions` row `change_kind=state_changed`, `status=fail`, `severity=critical` (`internal/transactionlog`) | yes | -| Rule **pass → fail**, high/medium | high / medium | same, by `severity` | yes | -| New **critical** finding (`first_seen` as fail) | critical | `transactions` `change_kind=first_seen` | yes | -| Host compliance **band drop** (Compliant → At-risk → Critical) | high | `scheduler.StateFromScore` band change / `monitoring.band.changed` | yes | -| **Fleet** compliance **drift** ≥ major threshold (10pp) | high | `alertrouter` `drift_major` (from `drift.detected`) | yes | -| Rule **fail → pass** / band **improvement** | info (batch) | `transactions` `state_changed` to pass / `drift_improvement` | yes | - -### Fleet health / connectivity -| Change | Severity | Source | Exists | -|---|---|---|---| -| Host **unreachable** (was reachable) | high | `alertrouter` `host_unreachable` | yes | -| 
Privilege/auth **degraded** (online but privilege probe failing — the #664 class) | medium | liveness band (`host_liveness.privilege_*`) | yes | -| Host **recovered** | info | `alertrouter` `host_recovered` (auto-resolves the unreachable alert) | yes | - -### Scanning -| Change | Severity | Source | Exists | -|---|---|---|---| -| Scan **failed** (connect/auth/error — not a compliance fail) | high | `scan_runs.status=failed` + `failure_reason` | yes | -| Scan completed **with regressions** | — | *fold into the per-host regression group; do not notify "scan done" by itself* | — | - -### Remediation -| Change | Severity | Source | Exists | -|---|---|---|---| -| Remediation **failed** / rolled back | high | `remediation.completed` event + `remediation_transactions.status` | yes | -| Remediation **pending approval** (licensed bulk/auto track) | high (approvers) | needs the bulk track | partial | -| Remediation **succeeded** (rule fixed) | info | `remediation.completed` | yes | - -### Governance / exceptions -| Change | Severity | Source | Exists | -|---|---|---|---| -| Exception **pending approval** | high (approvers) | exception workflow (request state) | yes | -| Exception **approved / rejected** | medium (requester) | exception workflow | yes | -| Exception **expiring soon / expired** (rules re-enter scope) | medium | exception expiry sweep | yes | - -### Security / system (low volume, high importance) -| Change | Severity | Source | Exists | -|---|---|---|---| -| Repeated **failed logins / account lockout** | high | auth audit events | yes (events) | -| **License expiring / entered grace** | medium | `internal/license` status (grace window) | yes | -| **New host discovered** | info | `host.discovered` | yes | -| User **invited / role changed** | info / medium | user-management audit | yes | - -### Reports -| Change | Severity | Source | Exists | -|---|---|---|---| -| `report.ready` | info (demoted) | `internal/report/job.go` | yes | - ---- - -## 4. 
Explicit non-events (never a notification) - -These are routine churn. Surfacing them in the bell would recreate the -Activity-feed noise problem (where `scheduler.tick.dispatched` and -`system.package.installed` each run to ~7k rows): - -- `scheduler.tick.dispatched` -- routine package inventory deltas (`system.package.installed`, etc.) -- online **heartbeat pulses** for already-online hosts -- a plain `scan.completed` that changed nothing -- **sub-threshold** compliance jitter (the `drift` classifier already suppresses - moves below `minor=5pp`; the bell inherits that floor) - ---- - -## 5. Architecture — reuse, don't rebuild - -The cleanest move is to make the bell **another channel of the existing alert -engine**, not a third notion of "notification." - -``` - ┌────────────────────────────────────────────┐ - event bus ───► │ alertrouter (classify → severity → dedup) │ - (heartbeat, │ AlertType: host_unreachable/recovered, │ - drift.detected)│ drift_major/minor/improvement, ... │ - └───────────────┬────────────────────────────┘ - │ fan-out to channels - ┌────────────────────────┼───────────────────┬───────────────┐ - ▼ ▼ ▼ ▼ - stdout channel Slack channel email channel IN-APP channel ◄── NEW - │ - transactions log ──► regression projector ────────────────────────┤ writes - (state_changed, (critical pass→fail, band drops, grouped) │ per-user - first_seen) ▼ rows - notifications table - │ - GET /api/v1/notifications - (+ unread count, :markRead) - │ - SSE push (api-events-stream) ──► bell drawer -``` - -Two producers feed the new in-app channel: - -1. **The alert stream** (already built): `host_unreachable`, `host_recovered`, - `drift_major/minor/improvement`. Wiring an in-app channel alongside the - existing stdout/Slack/email channels makes these light up the bell **for - free**. -2. **A transaction-log projector** (new, small): turns critical `state_changed` - → fail and `first_seen` fail rows (and band drops) into grouped notification - rows. 
This is the part the alert engine does not cover today — rule-level - regressions. - -**Delivery:** the existing SSE bus (`api-events-stream`) pushes a lightweight -`notification.created` signal so the bell updates live; the drawer pulls the -durable list from `GET /api/v1/notifications`. - ---- - -## 6. Data model - -A durable, per-user table (replacing the session-scoped counter): - -``` -notifications - id uuid pk - user_id uuid -- recipient (fan-out: one row per eligible user) - kind text -- 'rule_regression' | 'host_unreachable' | 'drift_major' | 'exception_pending' | ... - severity text -- critical|high|medium|low|info (alertrouter.Severity) - title text -- "web-01: 30 rules regressed (4 critical)" - body text -- short detail - host_id uuid null -- scope + dedup - rule_id text null - group_key text -- dedup/collapse key (e.g. host_id + scan_id + 'regression') - link text -- deep-link target (/transactions/rule/:id, /hosts/:id, ...) - occurred_at timestamptz - read_at timestamptz null - created_at timestamptz default now() - - index (user_id, read_at) -- unread badge query - unique (user_id, group_key) -- collapse a burst into one row, bump a count -``` - -Grouping is enforced by `group_key` + the unique constraint: a second regression -in the same (host, scan) updates the existing row's count/`occurred_at` instead -of inserting a new one. - ---- - -## 7. RBAC & scoping - -Fan-out decides recipients per change: - -- **Host-scoped changes** (regressions, unreachable, scan-failed, remediation): - users who can see that host (`host:read`, plus any group/scope restriction). -- **Governance** (exception pending/expiring): users with the approver - permission — surfaced as the bell's actionable queue for approvers. -- **Security/system** (lockouts, license): `security_admin` / `admin`. - -This mirrors the `host:read` gate already on the SSE stream and audit queries — -no new authorization model. - ---- - -## 8. 
Grouping & dedup - -- **Per-scan collapse:** all regressions from one scan on one host → one row. -- **Flap suppression:** a host that goes unreachable→recovered→unreachable - within a short window should not produce three bells (the alert engine already - dedups via `dedup_key`; the in-app channel inherits it). -- **Recoveries batch:** `fail → pass` and `host_recovered` are reassuring but - low-urgency — collapse into an info-level digest rather than badging. - ---- - -## 9. Phasing - -| Slice | Scope | Why first | -|---|---|---| -| **1** | Durable per-user `notifications` table + `GET /api/v1/notifications` + unread count + `:markRead`/mark-all + drawer UI + SSE push. Wire the **in-app alert channel** so existing alerts (`host_unreachable/recovered`, `drift_*`) populate it. | Biggest bang: reuses the entire alert engine; immediately useful; replaces the session-scoped MVP with durable state. | -| **2** | **Rule-regression projector** from the transaction log (critical `pass→fail`, `first_seen` critical, band drops), grouped per host/scan. | The headline use case ("a passing rule now fails"). | -| **3** | Governance (exception pending/expiring, RBAC-scoped to approvers) + remediation failures. | Turns the bell into an action queue. | -| **4** | Security (failed-login/lockout), license expiry, and **info-level digests** (batched recoveries / good news). | Rounds out coverage without adding noise. | - -`report.ready` stays wired but is reclassified `info` — one small producer among -many, never the headline. - ---- - -## 10. Open decisions - -1. **Fan-out timing:** materialize one row per recipient at write time (simple - reads, more rows) vs a single row + per-user read state (fewer rows, joins on - read). Recommend per-recipient rows for small/medium fleets; revisit at - scale. -2. **Retention:** notifications are derived from durable sources (transactions, - alerts), so they can be pruned aggressively (e.g. 90 days) without losing the - system of record. 
Tie to the audit/host retention sweep already on the - backlog. -3. **User preferences:** which kinds/severities a user wants in the bell belongs - in the existing `users.preferences` JSONB (`system-user-preferences`), not a - new table — same home as the per-user alert-type preferences backlog item. -4. **Alerts vs bell unification:** confirm we treat the bell as a *channel* of - `alertrouter`, so "Alerts" (Slack/email thresholds) and the in-app bell are - one configurable stream, not two parallel concepts. diff --git a/docs/engineering/openwatch_roadmap.md b/docs/engineering/openwatch_roadmap.md deleted file mode 100644 index 317fe65d..00000000 --- a/docs/engineering/openwatch_roadmap.md +++ /dev/null @@ -1,321 +0,0 @@ -# OpenWatch Deployment Roadmap - -> **Status**: Planning — from-scratch rebuild scoped 2026-04-26 -> **Stack**: Go backend, TypeScript + MUI frontend, PostgreSQL, Kensa (Go) engine - -This roadmap defines the deployment topologies OpenWatch will support, in priority order. Phase 1 is the canonical install. Later phases extend the same single binary to additional environments — they do not require separate codebases. - ---- - -## Phase 1 (current focus): Fully native, no Nginx - -**One binary, one systemd unit, one cert path.** - -OpenWatch ships as a single statically-linked Go binary that serves HTTPS directly using `net/http` + `crypto/tls`. PostgreSQL is installed natively on the same host via OS package manager. No Nginx, no containers, no reverse proxy. 
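
As a rough illustration of "serves HTTPS directly using `net/http` + `crypto/tls`", the sketch below builds a server with a TLS 1.2 floor and the four timeouts that the Phase 1 scope names. The certificate paths follow the Components table; the listen address `:8443`, the `/healthz` route, the `newServer` helper, and every timeout value are assumptions made for the example, not settings taken from OpenWatch.

```go
package main

import (
	"crypto/tls"
	"log"
	"net/http"
	"os"
	"time"
)

// newServer returns an http.Server that terminates TLS itself, with no
// reverse proxy in front. Timeout values are illustrative assumptions.
func newServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:      addr,
		Handler:   handler,
		TLSConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		// Bounding header and body reads limits slow-loris style clients.
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       30 * time.Second,
		WriteTimeout:      60 * time.Second,
		IdleTimeout:       120 * time.Second,
	}
}

func main() {
	mux := http.NewServeMux()
	// Hypothetical health route, present only so the server has a handler.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})

	certFile := "/etc/openwatch/tls/cert.pem"
	keyFile := "/etc/openwatch/tls/key.pem"
	if _, err := os.Stat(certFile); err != nil {
		// Exit quietly when no certificate is installed on this machine.
		log.Printf("certificate not found, listener not started: %v", err)
		return
	}
	log.Fatal(newServer(":8443", mux).ListenAndServeTLS(certFile, keyFile))
}
```

Separating construction (`newServer`) from listening keeps the TLS and timeout policy inspectable in tests without binding a port.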
- -### Components - -| Component | Where it lives | -|---|---| -| OpenWatch binary | `/usr/bin/openwatch` | -| Frontend SPA assets | `/opt/openwatch/frontend/` (served by Go via `http.FileServer` + SPA fallback) | -| Kensa rules + mappings | `/opt/openwatch/kensa/` | -| Config | `/etc/openwatch/openwatch.yaml` | -| TLS cert + key | `/etc/openwatch/tls/{cert.pem,key.pem}` | -| Systemd unit | `/etc/systemd/system/openwatch.service` | -| Logs | stdout → `journalctl -u openwatch` | -| State | PostgreSQL on `localhost:5432` (native install) | - -### Why this first - -- **Smallest surface area.** No Nginx, no Docker, no container runtime. Fewer moving parts means fewer CVEs to track and fewer dependencies to reason about — directly aligned with the security-minded reduction effort that produced the 7→4 container drop and the Celery+Redis removal. -- **Operationally legible.** A sysadmin reads one systemd unit and one config file. `systemctl status openwatch` is the entire ops surface. -- **Air-gapped friendly.** Native packages (RPM, DEB) carry forward from current packaging work; no container registry needed. -- **Forces honest scoping.** If a feature can't be expressed as "the binary does X," it's probably accidental complexity. 
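
For the "Frontend SPA assets" row in the Components table, one way to combine `http.FileServer` with an `index.html` fallback is sketched below. It is a sketch under assumptions rather than the project's handler: `spaHandler`, `fetch`, and `demoAssets` are hypothetical names, and an in-memory `fstest.MapFS` stands in for the real asset directory so the example runs anywhere. In a deployment, any `fs.FS` could be passed instead, for example `os.DirFS` over the asset directory.

```go
package main

import (
	"fmt"
	"io/fs"
	"net/http"
	"net/http/httptest"
	"path"
	"strings"
	"testing/fstest"
)

// demoAssets is an in-memory stand-in for the built frontend files.
var demoAssets = fstest.MapFS{
	"index.html": {Data: []byte("<html>app</html>")},
	"app.js":     {Data: []byte("console.log('ok')")},
}

// spaHandler serves files that exist in assets and answers every other
// path with index.html, so client-side routes survive a page reload.
func spaHandler(assets fs.FS) http.Handler {
	files := http.FileServer(http.FS(assets))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Normalize the URL path into an fs.FS name (no leading slash).
		name := strings.TrimPrefix(path.Clean("/"+r.URL.Path), "/")
		if info, err := fs.Stat(assets, name); err != nil || info.IsDir() {
			// Unknown path or directory: rewrite to "/" so the file
			// server returns the root index.html.
			fallback := r.Clone(r.Context())
			fallback.URL.Path = "/"
			files.ServeHTTP(w, fallback)
			return
		}
		files.ServeHTTP(w, r)
	})
}

// fetch runs one GET through the handler in memory and returns the
// response status and body.
func fetch(h http.Handler, target string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	h := spaHandler(demoAssets)
	for _, target := range []string{"/app.js", "/hosts/42", "/"} {
		code, body := fetch(h, target)
		fmt.Println(target, code, body)
	}
}
```

The fallback rewrites the path to `/` instead of `/index.html` because `http.FileServer` redirects explicit `/index.html` requests to the directory URL.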
- -### What's in scope for Phase 1 - -- HTTPS via `net/http` + `tls.Config` (TLS 1.2+, configurable cipher suites, mTLS optional) -- HTTP/2 via ALPN (automatic) -- Cert hot-reload via `GetCertificate` callback (no `SIGHUP`, no restart) -- Static SPA serving with `index.html` fallback -- Slow-loris protection via `ReadHeaderTimeout`/`ReadTimeout`/`WriteTimeout`/`IdleTimeout` -- Request body size limits via `http.MaxBytesReader` -- Security headers middleware (HSTS, CSP, X-Frame-Options, X-Content-Type-Options) -- Rate limiting via `golang.org/x/time/rate` -- gzip compression middleware -- Structured logs via `log/slog` (stdlib) -- Native RPM (CentOS Stream 9) and DEB (Ubuntu 24.04) packages - -### Phase 1 stack (locked 2026-04-26) - -| Concern | Choice | Notes | -|---|---|---| -| HTTP server | `net/http` (stdlib) | Direct HTTPS via `ListenAndServeTLS` + `tls.Config`. HTTP/2 via ALPN. Cert hot-reload via `GetCertificate`. | -| Router | `go-chi/chi` v5 | Stdlib-compatible, no global state, ~1k LOC. | -| PG driver | `jackc/pgx` v5 | Native PG types, LISTEN/NOTIFY, COPY protocol, connection pooling. | -| Query layer | `sqlc` codegen | Type-safe Go generated from raw SQL. SQL is the source of truth. No ORM. | -| Schema | **Reuse existing** (transaction-log, host_rule_state, migrations 044–048) | Q1 model is recent and proven. Schema redesign is out of scope for Phase 1. | -| Migrations | `pressly/goose` | SQL + Go migrations. Embeddable into binary. | -| FIPS | `microsoft/go` toolchain | FIPS 140-2 validated, links OpenSSL FIPS provider via CGO. Drop-in `go build` replacement. | -| Spec tooling | `specter` v0.11+ | Validates `.spec.yaml`, ingests `go test -json` for AC traceability. | -| Config file | TOML at `/etc/openwatch/openwatch.toml` | Loader: `BurntSushi/toml` (stdlib-only) or `knadh/koanf` if layered config is needed. | -| Config overrides | Env vars (e.g., `OPENWATCH_DB_DSN`) | Override TOML values for systemd-unit-managed secrets. 
| -| CLI flags | stdlib `flag` package | For one-shot subcommands (migrate, version, check-config). | -| Logging | `log/slog` (stdlib) | Structured logs to stdout → `journalctl -u openwatch`. | -| Testing | stdlib `testing` + `go test -json` | Traceability via Specter ingestion. | - -**Config precedence (highest wins):** CLI flags → env vars → TOML file → built-in defaults. - -### Phase 1 next-tier stack (locked 2026-04-26) - -Built on top of the foundational stack above. - -| Concern | Choice | Notes | -|---|---|---| -| JWT | `golang-jwt/jwt` v5 | RS256 with RSA-2048 (matches current OpenWatch). Access 30m, refresh 7d, absolute session 12h. | -| Password hashing | `golang.org/x/crypto/argon2` (direct) | Argon2id, 64 MB memory, 3 iterations (matches current config). Thin wrapper for params + constant-time comparison. | -| OIDC | `coreos/go-oidc` v3 + `golang.org/x/oauth2` | Discovery, JWKS, ID-token validation. | -| SAML | `crewjam/saml` | SP + IdP. Active maintenance. | -| SSH client (host scanning) | `golang.org/x/crypto/ssh` | NIST SP 800-57 key validation logic ports from current Python implementation. | -| Request validation | `go-playground/validator` v10 | Struct-tag-based. | -| Encryption (AES-256-GCM) | `crypto/aes` + `crypto/cipher` (stdlib) | No external dep needed. | -| Rate limiting | `golang.org/x/time/rate` | Stdlib-adjacent token bucket. | -| CORS | `go-chi/cors` | chi-compatible middleware. | -| Job queue | **Custom port of Q1 PostgreSQL `SKIP LOCKED` design** | Owned in-repo. Built on `pgx` + `sqlc`. Reuses existing `job_queue` schema. Retries, dead-letter, scheduling implemented as needed — no library dependency. | -| Cron / scheduling | `robfig/cron` v3 | De facto Go cron library. Drives the Adaptive Compliance Scheduler. | -| Frontend bundling | `embed.FS` (stdlib, Go 1.16+) | Frontend build output embedded into the binary via `//go:embed`. **Trade-off:** frontend updates require a binary rebuild — no separate SPA hotfix path. 
Aligns with single-artifact install. | -| API contract codegen | `oapi-codegen/oapi-codegen` v2 | Generates chi-compatible Go server stubs from OpenAPI 3.1. Spec at `app/api/openapi.yaml` is SSOT. Spec-first, never code-first. | -| Policy signing | `crypto/ed25519` (stdlib) | Signs `policies/*.yaml` at build/release time; verified at startup. Reuses Ed25519 pattern from current Kensa rule signing. | - -**Job queue note:** Custom implementation chosen over `riverqueue/river` to preserve the existing Q1 design (already proven, already understood) and avoid a new schema migration. This is a maintenance commitment — retries, scheduling, dead-letter, observability are owned in-repo. - -### Agent-First Architecture (Phase 1) - -OpenWatch's API/data layer is designed for agent orchestration from day one. This is API-first discipline, not generic "AI platform" architecture. - -**Boundary**: API/data layer is agent-first (auditability, determinism, composability). Frontend is human-first (friendliness, discoverability per Goal #2). Same backend, two surfaces. - -#### Three principles - -1. **Agent-trustable APIs.** Clean APIs, structured outputs, deterministic behavior, audit trails. An agent calls, parses, verifies the audit fingerprint, and never has to interpret. -2. **Domain logic as data.** Compliance rules, exceptions, approvals, schedules, alert thresholds, remediation playbooks all live in versioned YAML policy files. Services are mechanical evaluators — no judgment calls in code. -3. **Human approval as a first-class entity.** Operations requiring human review return `pending_approval` with the same response shape as immediate operations. Approval requirements declared per-operation in YAML. - -#### Phase 1 architecture requirements - -| Concern | Decision | -|---|---| -| API contract | OpenAPI 3.1 at `app/api/openapi.yaml` is SSOT. Go server code generated via `oapi-codegen`. Spec-first, never code-first. 
| -| Error taxonomy | All errors return `{error: {code, category, retryable, human_message, detail}}`. HTTP status alone is insufficient. | -| Idempotency | `Idempotency-Key` header required on POST/PUT/PATCH. Same key = same response, no double-execute. | -| Pagination | Cursor-based only; never offset. Explicit sort order on every list endpoint. | -| Correlation | `X-Correlation-Id` propagated via `context.Context`. Logged everywhere, returned in responses, recorded in audit events. | -| Audit log | First-class queryable API endpoint, not a back-channel. Every mutating operation writes a structured event with correlation ID, actor (user OR agent), policy version, evidence pointer. | -| Policies as data | `/opt/openwatch/policies/*.yaml` — semver-versioned, Ed25519-signed, loaded at startup. Covers exceptions, approvals, schedules, alert thresholds, remediation playbooks. | -| Approval workflow | First-class entity. Mutating endpoints return `applied` or `pending_approval`, same response envelope. | -| MCP server | Deferred from Phase 1. REST layer designed so an MCP wrapper is a ≤500 LOC translation when wanted. | - -#### Specter's role - -Specter is the **behavioral contract type-checker**, distinct from but complementary to OpenAPI: - -| Layer | Format | Tool | Audience | -|---|---|---|---| -| HTTP contract | OpenAPI 3.1 (YAML) | `oapi-codegen` | Agents, frontend, API consumers | -| Behavioral contract | `.spec.yaml` | `specter` | Auditors, contributors, CI | -| Domain logic | `policies/*.yaml` | per-schema validators | Compliance team, auditors | -| Test results | `go test -json` | `specter ingest` | CI traceability | - -What Specter delivers for the agent-first model: - -- **Behavioral contracts are enforced.** `specter coverage --enforce-active` fails CI if any AC lacks an enforcing test. Specs cannot silently drift from reality. -- **Specs are agent-readable artifacts.** OpenAPI tells an agent how to call OpenWatch. 
Specter specs tell it what guarantees the system makes. Both are versioned YAML, both are checked in, both are addressable. -- **Determinism is provable end-to-end.** Response → audit event → operation → spec AC → enforcing test → test result ingested by Specter. The chain is auditable. -- **Change detection.** `specter diff` between git revisions surfaces behavioral changes — exactly what auditors and downstream agents need when system guarantees shift. - -What Specter is **not**: -- A policy engine (policies are validated by per-schema Go validators) -- An OpenAPI alternative (OpenAPI = HTTP contract; specs = behavioral postconditions) -- A runtime validator (CI/build-time only) - -#### Cultural commitment - -Spec-first development is slower than code-first. The discipline tax is paid upfront for long-term composability. If "let's write the handler and document later" enters the team's vocabulary, the whole approach collapses. The tooling alone won't save it. - -### Non-goals for Phase 1 - -- Multi-node OpenWatch (Phase 4) -- Container deployment (Phase 2/3) -- Kubernetes, Helm charts (Phase 3+) -- HTTP/3 / QUIC (deferred indefinitely — no use case) -- WAF / ModSecurity (Phase 4 if customer demand) -- Brotli compression (gzip is sufficient) - ---- - -## Phase 2 (deferred): Native OpenWatch + containerized PostgreSQL/Nginx - -**Same binary. PostgreSQL and Nginx run as containers; OpenWatch stays native.** - -For environments where the database team mandates containerized stateful services, or where Nginx is required as an explicit reverse proxy (e.g., for FIPS via OpenSSL FIPS provider, or for ops-team familiarity). - -**Trigger to build:** First customer requesting separation of OpenWatch from its database/proxy lifecycle, or first FIPS deployment where `microsoft/go` is rejected. 
- -**Delta from Phase 1:** -- Config points OpenWatch at `postgres://localhost:5432` (containerized PG with port forward) instead of native socket -- Optional Nginx front-door config provided as an example (not required by OpenWatch) -- Compose file for PG + Nginx provided - -**Effort estimate:** 1–2 weeks once Phase 1 is stable. Mostly documentation and an example compose file. - ---- - -## Phase 3 (deferred): Fully containerized - -**Same binary. Everything runs as containers (OpenWatch, PostgreSQL, optionally Nginx).** - -For Kubernetes, OpenShift, Docker Swarm, and managed-container environments. The Go binary is unchanged; what changes is the packaging artifact (OCI image instead of RPM/DEB) and the orchestration layer. - -**Trigger to build:** First customer with K8s-only deployment policy, or first cloud marketplace listing requirement. - -**Delta from Phase 1:** -- Multi-stage Dockerfile (build → minimal distroless or UBI 9 micro) -- Helm chart or Kustomize overlay -- Liveness/readiness probes wired to existing health endpoints -- Secrets via K8s `Secret` / Docker `secret` instead of `/etc/openwatch/` -- Configmap-driven config - -**Effort estimate:** 2–3 weeks. Primary work is packaging and operator-facing docs; binary changes minimal. - ---- - -## Phase 4 (deferred): Distributed OpenWatch + Nginx, external DB - -**Multiple OpenWatch instances behind Nginx, talking to an external PostgreSQL.** - -For HA / scale-out, FIPS deployments using Nginx + OpenSSL FIPS provider, and customers who want their database team to own the database lifecycle entirely (RDS, Cloud SQL, on-prem PG cluster). - -**Trigger to build:** First customer hitting single-node throughput limits, OR first FIPS deployment that won't accept `microsoft/go`, OR first customer with managed-PG mandate. 
- -**Delta from Phase 1:** -- Nginx as explicit reverse proxy / load balancer (now doing real work, not optional) -- Session affinity considerations (or stateless-by-design — depends on auth/SSO model) -- DB connection pool tuning for higher concurrency -- Job queue (`SKIP LOCKED` PostgreSQL pattern from Q1) already supports multi-instance — verify under load -- Coordination story: heartbeats, leader election if needed (probably not needed given `SKIP LOCKED`) -- TLS termination moves to Nginx; OpenWatch listens on internal HTTP only (or mTLS to Nginx) - -**Effort estimate:** 4–6 weeks. Real engineering work — multi-instance correctness, load testing, failover behavior. Should not be attempted until Phase 1 is in production for at least one quarter. - ---- - -## Cross-phase principles - -1. **One binary serves all phases.** Topology is selected by config, not by build flag. The same `openwatch` artifact runs as a systemd unit, a container, or behind a load balancer. -2. **No topology-specific features.** If a feature only works in one deployment shape, it's a design smell. -3. **Frontend is unchanged across phases.** TypeScript + MUI SPA served as static files; same API contract regardless of how the backend is deployed. -4. **PostgreSQL is the only data store.** Same schema across all phases. No phase-specific tables or columns. -5. **FIPS via Nginx is acceptable when needed.** Pure-Go FIPS (`microsoft/go`) is preferred for Phase 1 simplicity; Nginx-fronted FIPS is the fallback that Phase 2/4 unlocks. Either path is supported — don't lock to one. - ---- - -## Decision log - -| Date | Decision | Rationale | -|------|----------|-----------| -| 2026-04-26 | Phase 1 = fully native, no Nginx | Single binary + systemd is the minimum viable install; aligns with dependency-reduction direction (7→4 containers, Kensa Go migration, Celery+Redis removal). Other topologies are extensions, not separate products. 
| -| 2026-04-26 | Frontend stack frozen: TypeScript + MUI + Zustand | Frontend was just modernized through Phase 8 (PRs #337–#349). No reason to churn it. | -| 2026-04-26 | Backend language: Go | Aligns with Kensa's Go stack; single statically-linked binary fits the native-install topology cleanly. | -| 2026-04-26 | Router: `chi` v5 over `gin`/`echo`/`fiber` | Stdlib-compatible, no custom context type, no `fasthttp` divergence. | -| 2026-04-26 | DB layer: `pgx` + `sqlc` over `GORM`/`ent` | Keeps SQL as source of truth; matches existing SQL Builder discipline; no ORM magic. | -| 2026-04-26 | Reuse PostgreSQL schema; do not redesign | Q1 transaction-log + host_rule_state model is proven (99.7% write reduction). Schema redesign is a separate decision from code rewrite. | -| 2026-04-26 | Migrations: `pressly/goose` over `golang-migrate/migrate` | Simpler, fewer drivers (only PG needed), embeddable. | -| 2026-04-26 | FIPS: `microsoft/go` toolchain in Phase 1 | Vendor-neutral, FIPS 140-2 validated, drop-in `go build`. Avoids deferring FIPS to Nginx-front-door, which would have made Phase 4 mandatory for FedRAMP customers. | -| 2026-04-26 | Spec tooling: `specter` v0.11+ | Native Go-test traceability via `specter ingest`; replaces Python `inspect.getsource()` pattern. | -| 2026-04-26 | Config: TOML file + env-var overrides + stdlib `flag` for CLI | Most Go-community-aligned file format; env vars for systemd-managed secrets; stdlib-only loader (`BurntSushi/toml`) keeps dependency surface minimal. | -| 2026-04-26 | JWT: `golang-jwt/jwt` v5 | De facto Go JWT library; supports current RS256 / RSA-2048 directly. | -| 2026-04-26 | Password hashing: `golang.org/x/crypto/argon2` direct | Stdlib-adjacent; matches current Argon2id config; no convenience-wrapper dependency needed. | -| 2026-04-26 | OIDC: `coreos/go-oidc` v3 + `golang.org/x/oauth2` | The pairing every serious Go OIDC service uses. 
| -| 2026-04-26 | SAML: `crewjam/saml` | Most-used Go SAML library; SP + IdP; active maintenance. | -| 2026-04-26 | SSH client: `golang.org/x/crypto/ssh` | Stdlib-adjacent; only serious Go SSH option. | -| 2026-04-26 | Request validation: `go-playground/validator` v10 | De facto Go validator; struct-tag-based. | -| 2026-04-26 | Job queue: custom port of Q1 PostgreSQL `SKIP LOCKED` over `riverqueue/river` | Preserves the proven Q1 design; avoids a new schema migration; consistent with the prior Celery+Redis-removal philosophy. Trade-off: maintenance burden for retries/scheduling/dead-letter is owned in-repo. | -| 2026-04-26 | Cron: `robfig/cron` v3 | De facto Go cron library; drives Adaptive Compliance Scheduler. | -| 2026-04-26 | Frontend: `embed.FS` over disk-served | Single-artifact install; frontend updates require binary rebuild (acceptable trade-off for security tooling with infrequent UI changes). | -| 2026-04-27 | Agent-first architecture for API/data layer | Cleanly composable APIs, deterministic behavior, structured outputs, and audit trails are good design regardless of agents. Codifying as a Phase 1 requirement prevents corner-cutting. UI remains human-first per Goal #2. | -| 2026-04-27 | OpenAPI 3.1 as API SSOT; codegen via `oapi-codegen` v2 | Spec-first discipline. Agents and frontend consume the same contract. | -| 2026-04-27 | Stable, machine-readable error taxonomy | `{code, category, retryable, human_message, detail}`. HTTP status alone is insufficient for agent reliability. | -| 2026-04-27 | Idempotency keys required on mutating endpoints | Same `Idempotency-Key` header = same response. Enables safe retries by agents. | -| 2026-04-27 | Cursor-based pagination only; never offset-based | Determinism under concurrent writes. Explicit sort order on every list endpoint. | -| 2026-04-27 | `X-Correlation-Id` propagated via `context.Context` end-to-end | Traceable across logs, audit events, downstream Kensa calls. 
| -| 2026-04-27 | Audit log as first-class API endpoint | Agents verify operation effects via API, not back-channel access. | -| 2026-04-27 | Domain logic in `policies/*.yaml`, not Go code | Versioned (semver), Ed25519-signed, loaded at startup. Covers exceptions, approvals, schedules, alert thresholds, remediation. | -| 2026-04-27 | Approval as first-class entity; uniform response shape | Mutating endpoints return `applied` or `pending_approval` with same envelope. Agents handle both paths without special-casing. | -| 2026-04-27 | MCP server deferred from Phase 1 | REST layer designed so MCP wrapper is ≤500 LOC translation. OpenAPI-first design naturally produces tool-callable endpoints. | -| 2026-04-27 | Specter scoped to behavioral contracts only | Distinct from OpenAPI (HTTP contracts) and policy validators (domain rules). Specter ensures behavioral specs have enforcing tests; not a policy engine, not a runtime validator. | -| 2026-04-28 | Stage 1 static-analysis pass complete | Three parallel agents inventoried dead modules, test coverage gaps, and code-health markers. Findings in `app/docs/stage_1_evidence_static.md`. Triage files updated. | -| 2026-04-28 | LicenseService is a fresh build, not a port | Static analysis revealed 3 TODO stubs in `services/licensing/service.py`: license validation is a config-flag check today, not real validation. Rebuild's licensing component must be designed from scratch. | -| 2026-04-28 | OWCA Layer 2/3/4 moved MAYBE → NEVER | Static analysis confirms `cis/stig/nist_800_53/base/models.py` (Layer 2), `fleet_aggregator.py` (Layer 3), `predictor/risk_scorer/trend_analyzer/baseline_drift.py` (Layer 4) are unreachable from active routes. If risk-scoring or forecasting is later demanded, build fresh — don't port. 
| -| 2026-04-28 | Q1 Celery/Redis/MongoDB cleanup is incomplete; rebuild must intentionally drop residue | 78 vestigial references survive in schema (`celery_task_id` column), config (5 `redis_*` fields), and shim functions (5 `*_celery` task definitions). Listed in NEVER §K with explicit removal targets. | -| 2026-04-28 | 8 Stage-2-blocking test gaps identified | `services/job_queue/{dispatch,registry}`, `services/auth/{credential_handler,token_blacklist_pg}`, `services/baseline_service`, `plugins/kensa/{scanner,evidence,sync_service}` have zero coverage today. Rebuild must add tests at port time, not later. | -| 2026-04-28 | Licensing foundation moved to Stage 0 | OpenWatch+ feature gating is foundation, not a feature. Adding it after Phase-2 features depend on it requires retrofitting every gated handler. Cost in Stage 0: ~1 day (Day 7). Cost if deferred: ~2 weeks of cross-cutting refactor + latent bugs from untested `False` branches. Design locked in `app/docs/licensing_foundation.md`. | -| 2026-04-28 | License model: signed JWT (EdDSA/Ed25519) | Single file, standard format, mature Go support (`golang-jwt/jwt` v5 already locked). Same algorithm as Kensa rule signing and evidence signing — one crypto primitive, fewer surfaces. | -| 2026-04-28 | Public keys embedded in binary, not config | Config tampering is easier than binary tampering. Embedded keys cannot be replaced without re-shipping the binary. Three slots support rotation: current + prev + deprecated. | -| 2026-04-28 | OpenAPI extensions: `x-required-permission` + `x-required-feature` | RBAC and license enforcement declared in spec; `oapi-codegen` generates the middleware. No hand-written gating decorators (the failure mode that produced today's 3 license validation TODOs). | -| 2026-04-28 | License denial error code: `license.feature_unavailable` → 402 Payment Required | Stable agent-readable error envelope; `detail.feature` and `detail.tier` populated for actionable client behavior. 
| -| 2026-04-28 | Quota enforcement at service-layer point of use | `max_hosts` (host create), `max_scans_per_day` (enqueue), `max_users` (user create), `max_concurrent_scans` (worker dequeue — defers, doesn't fail). Quotas are advisory; default unlimited if license omits the field. | -| 2026-04-29 | Audit event taxonomy is foundation, scheduled in Stage 0 Day 5 | ~70 stable event codes across 13 categories committed in `app/audit/events.yaml`. Without a registry, every component invents naming — drift starts immediately and becomes unfixable in months. Design locked in `app/docs/audit_event_taxonomy.md`. | -| 2026-04-29 | Audit emission via codegen-typed constants | `internal/audit/events.gen.go` produced from registry. Drift becomes a compile error: `audit.AuthLoginSucessful` doesn't exist as a constant. Hand-written event strings in handlers are forbidden by code review. | -| 2026-04-29 | Async batched writer with critical-event sync path | 95% of events flow through `Emit()` (channel + batched insert, ~5µs per call). Critical events (license, system lifecycle, suspect activity) use `EmitSync()` for guaranteed durability. Audit failures never block the originating request — drop policy increments a counter, never crashes. | -| 2026-04-29 | Redaction enforced pre-write | Sensitive fields (password, ssh_key, api_key, token, secret, license_jwt) are scrubbed from `detail` before storage. Field names recorded in `redactions` array for forensic visibility. Once scrubbed at write, never recoverable — by design. | -| 2026-04-29 | UUIDv7 for audit event IDs | Time-sortable primary key + globally unique. Eliminates the `(occurred_at, id)` composite index pattern. Cursor pagination uses the UUIDv7 directly. | -| 2026-04-29 | OpenAPI extension `x-audit-events` declares emission contract | Every mutating endpoint declares the audit events it may emit. Build fails if a `POST/PUT/PATCH/DELETE` declares none, or if a declared code is unknown to the registry. 
| -| 2026-04-29 | Audit-as-API confirmed: queryable, exportable, taxonomy-readable | Stage 0 ships only `GET /audit/events` (filters + cursor pagination). The rest are designed in `api/audit.yaml` but deferred to Phase 1: `POST /audit/events:query` (DSL, license-gated), `GET /audit/events:export` (license-gated), `GET /audit/events:taxonomy` (registry), per-resource sub-resources (`/hosts/{id}/audit-events`). | -| 2026-04-29 | Error code registry locked at `app/api/error_codes.yaml` | Same registry pattern as licensing/audit (registry → codegen → constants). ~50 codes across 15 categories. Build invariants: code regex, category reference, http_status range, fault enum, JSON-Schema validation of `detail_schema`. Drift becomes compile error; deprecated codes preserved for historical log compatibility. | -| 2026-04-29 | Error envelope field renamed `category` → `fault` | `category` collided with the registry's namespace grouping (auth, host, scan, ...). Renamed before any code shipped. Field semantics unchanged: `client | server | policy | external` drives agent retry/abort logic. | -| 2026-04-29 | Error code metadata is registry-driven, not handler-driven | `http_status`, `fault`, `retryable`, and `detail_schema` are looked up from the registry at runtime. Handlers emit `errors.New(errors.HostUnreachable, ctx)` — they cannot lie about status code or retry semantics. Eliminates the inconsistency class where two handlers return the same code with different status codes. | -| 2026-04-29 | Policies-as-data design locked at `app/docs/policies_as_data.md` | Five policy types (exceptions, approvals, schedules, alert_thresholds, remediation), each with a typed schema and dedicated Go evaluator. No generic rules engine, no expression DSL — eliminates the failure mode where YAML drifts from what the evaluator expects. | -| 2026-04-29 | Policy "is it a policy?" four-part test | Operator-tunable + auditor-relevant + agent-quotable + runtime-not-startup. 
Failing any part = config or code, not policy. Prevents policy sprawl. | -| 2026-04-29 | Policies are Ed25519-signed; admin keys embedded in binary | Same primitive as license signing and audit chain — one crypto surface. Filesystem permissions are insufficient (most attackers who can write the file can also run openwatch). Unsigned files load only with `OPENWATCH_DEV_MODE=true`. | -| 2026-04-29 | Policy versioning is monotonic semver | Loading a lower version is rejected (`policy.invalid`). Rollback requires republishing as a new higher version. Audit history references the version active at evaluation time. | -| 2026-04-29 | Policy state held in `atomic.Pointer[*State]`; lock-free hot path | Same pattern as license `IsEnabled`. Reload swaps; readers see consistent old or new state, never partial. Target p99 `Evaluate()` < 50µs. | -| 2026-04-29 | Built-in default policies are intentionally strict | Missing policy file → conservative default loaded with `version: 0.0.0`. Operators opt in to looser policies by writing a file. Removes the "forgot to install policy = wide-open system" failure mode. | -| 2026-04-29 | Policy framework scaffolded in Stage 0 Day 6 | Loader + state + history + audit + admin reload endpoint + OpenAPI extension parsing. Type-specific evaluators are stubs returning defaults; evaluator implementations come online as their consumers do (Stage 2+). | -| 2026-04-29 | `policy.applied` always uses async audit path | Highest-volume audit event in the system; sync emission would back-pressure every API call. Drop on overflow is acceptable — `policy.applied` is forensic, not safety-critical. Scheduler coalesces evaluations into one event per material decision change. | -| 2026-04-29 | OpenAPI `x-requires-approval` and `x-policy-evaluated` extensions | `x-requires-approval` is enforced (codegen produces middleware); `x-policy-evaluated` is documentation-only. 
Approval `defer` outcome maps to `202 Accepted` with `approval_id` — uniform across operations. | -| 2026-04-30 | Correlation propagation contract locked at `app/docs/correlation_id_propagation.md` | One ID per top-level intent flows from HTTP entry through audit, job queue, worker dequeue, sub-jobs, cron ticks, external HTTP, and Kensa SSH calls. Four origins (HTTP/cron/boot/test) and four propagation helpers (HTTPMiddleware/Enqueue/Dequeue/audit.Emit) — anything else is rejected by lint. Retrofitting after Stage 2 = multi-week refactor; locking now = one Day-4 + Day-8 effort. | -| 2026-04-30 | Correlation ID format: `{prefix}-{16 hex chars}` | Prefix (req/cron/boot/test) gives at-a-glance origin; 16 hex chars are the high-order 8 bytes of a UUIDv7 (time-ordered, 16 bits randomness suffice at <10K req/sec). Total ~20 chars: greppable, log-column-friendly, lexicographically time-sortable. Rejected W3C `traceparent` because forensic readability beats OTel-native shape; propagation discipline transfers if/when OTel adopts. | -| 2026-04-30 | Client `X-Correlation-Id` header is sanitized, not trusted | Charset `[A-Za-z0-9_-]{1,64}`; reserved prefixes `boot-`, `cron-`, `test-` are rejected from clients. Invalid → fresh generation + warning log. Once past middleware, IDs on context are trusted. Correlation IDs are forensic, never authn/authz. | -| 2026-04-30 | `queue.Enqueue` errors when ctx has no correlation_id | The function is the only public path to insert a job; missing correlation = programming error. Lint forbids raw `INSERT INTO job_queue` outside `internal/queue/` (golangci-lint forbidigo rule). Same enforcement on `http.DefaultClient` (forces use of `internal/httpclient` wrapper). | -| 2026-04-30 | `queue.Dequeue` returns a fresh ctx, not the caller's | Worker uses returned `workerCtx` carrying the originating job's correlation_id; caller's ctx (which may have its own correlation from the worker loop) does not bleed into per-job execution. 
Prevents cross-contamination in pooled workers. | -| 2026-04-30 | slog handler enforces structured correlation_id on every log line | `internal/log/CorrelationHandler` wraps stdlib `slog.JSONHandler`; reads ID from ctx and emits it as a top-level attr. Lint forbids non-Context slog calls (`slog.Info` etc.) outside `func init`/`func main`. Operators search by `correlation_id="..."` in any log query tool. | -| 2026-04-30 | Boot generates one shared `boot-` correlation_id | All `system.startup`, `policy.loaded`, `license.loaded` events at startup share it. Forensic question "what happened at the last restart?" reduces to one grep. Same pattern for cron ticks (per-tick `cron-` ID covers all jobs that tick enqueues). | -| 2026-04-30 | Job queue helpers ship in Stage 0 Day 9 alongside policies | `internal/queue/Enqueue`+`Dequeue` ship before any real job exists. End-to-end propagation test (`/diagnostics:enqueue-test-job` → worker → audit chain shares correlation_id) is part of the 19-step DoD. Stage 2 consumers (scan jobs, scheduled scans, remediation jobs) cannot bypass the contract because the helpers are already the only path. | -| 2026-04-30 | Kensa correlation ID forwarded via `KENSA_CORRELATION_ID` env var | SSH-invoked Kensa receives the originating ID; coordination ask is for Kensa to include it in JSON output. Until Kensa supports it, OpenWatch logs the invocation/completion correlation pair on its side. Known Phase-2 forensic gap with explicit closure path. | -| 2026-04-30 | RBAC registry locked at `app/auth/permissions.yaml` | ~50 permissions across 17 categories + 5 built-in roles (viewer, auditor, ops_lead, security_admin, admin). Same registry pattern as audit/license/error-codes/policies (registry → codegen → typed Go constants). Drift becomes a build error: misspelled permissions in OpenAPI fail validation; raw permission-string literals in handler code fail lint. 
| -| 2026-04-30 | Permissions are immutable at runtime; built-in roles update only via migration | Permissions are *contract* (OpenAPI ↔ handler ↔ license). Adding one is a code+spec change. Built-in role definitions ship in releases via migration; release notes call out the change. Custom roles (Stage 2) are runtime-mutable but constrained to registered permissions. The three-layer split prevents the failure modes: silent permission drift, undocumented built-in role changes, custom roles granting nonexistent permissions. | -| 2026-04-30 | Combined RBAC+license middleware (one pass, one denial path) | `RequirePermission(p)` checks role membership (deny → 403 `authz.permission_denied`) then license gate (deny → 402 `license.feature_unavailable`). Order matters: RBAC first so unauthenticated/unauthorized callers can't probe license shape. License-gated permissions (`remediation:execute`, `audit:export`) declare `license_gated: <feature_id>` in the registry; codegen emits the gate inside the same middleware. Eliminates the per-handler decoration-stack drift mode. | -| 2026-04-30 | Bare wildcard `*` reserved for built-in `admin` role | Custom roles cannot grant `*`; they must list permissions explicitly (or use category wildcards like `host:*`). Cloning admin without code review would sidestep the audit trail of "who is the most privileged role." Category wildcards in custom roles auto-pick-up new permissions in that category; built-in role lists are codegen-expanded at release time so they don't (release notes are the change channel). | -| 2026-04-30 | OpenAPI cross-validation enforces RBAC↔license↔audit invariants | Build fails if (a) `x-required-permission` references an unregistered permission, (b) a license-gated permission lacks matching `x-required-feature`, (c) a `dangerous: true` permission's operation lacks `x-audit-events`. Three drift modes closed by one validator. 
| -| 2026-04-30 | Custom roles deferred to Stage 2 (auth slice) | Stage 0 ships registry + built-in roles + lookup endpoints + middleware. Custom-role CRUD (`POST/PUT/DELETE /admin/roles`, `:assign`, `:unassign`, `:clone`) requires user management which lands with the Stage 2 auth slice. The contract is locked now; the consumer ships when its dependencies exist. | -| 2026-04-30 | Stage 0 grew to 13 days; further foundation work goes to Stage 2 | Six foundations (audit, error codes, licensing, policies, correlation, RBAC) added since the original 7-day plan. Stage 0 has reached its working maximum. Remaining concerns (configuration schema, error envelope unification across domain specs, observability stack) ship Stage 2 unless evidence shows they are foundational drift sources. Bias toward shipping Stage 0 over expanding it. | -| 2026-04-30 | 11 OpenAPI skeleton specs drafted; full API surface enumerated | Specs 5–15 drafted as operation maps (paths, methods, extensions, descriptions; schemas stubbed). 14 domain files total + meta openapi.yaml manifest. ~154 operations across the platform — about 2.3x collapse from the Python codebase's ~350 endpoints, validating "had bad grouping, not too many features." Full schemas land slice-by-slice in Stage 2. | -| 2026-04-30 | Foundation cleanup pass after skeleton sweep | Four gaps surfaced and closed: (1) `app/license/features.yaml` was missing entirely — created with 10 features (9 canonical + premium_diagnostics for Stage 0 demo); (2) compliance.yaml used `temporal_compliance` which doesn't exist in the registry — corrected to `temporal_queries`; (3) added `admin.sso_provider.updated` audit event (was incorrectly reusing `.created`); (4) added `integration.webhook.subscribed` + `.unsubscribed` audit events (was incorrectly reusing `plugin.installed`). The skeleton exercise paid for itself by surfacing one missing registry file and three audit-event mismatches at design time rather than mid-Stage-2. 
| -| 2026-05-24 | Go toolchain floor raised: 1.22+ → 1.25+ | Discovered Day 3 of Stage 0: `pressly/goose v3.27.1` requires Go 1.25 minimum. Go's toolchain auto-download makes this seamless for developers on 1.22+ (1.25.7 is fetched transparently). Accepting the bump because (a) auto-download means zero operator friction, (b) modern container images ship 1.25+ already, and (c) pinning goose to an older v3.20.x to keep 1.22 compat fights the tool. README and Makefile updated; FIPS toolchain compatibility with 1.25 to be verified Day 12. | -| 2026-05-24 | Audit queries hand-written for Day 3, sqlc-generated for Day 5 | `internal/db/audit_queries.go` is hand-written for Stage 0 Day 3 but matches what sqlc would produce against `internal/db/queries/audit.sql`. `sqlc.yaml` is in place so Day 5's `make generate` swaps the hand-written file for the generated one. Function signatures are identical so callers don't change. Reduces Day 3 scope (no sqlc tooling install) without making Day 5 a rewrite. | -| 2026-05-24 | SDD discipline applied retroactively + forward | Days 1–3 shipped without Specter behavioral specs (drift from the locked SDD discipline). Backfilled `app/specs/system/config.spec.yaml` (15 ACs) and `app/specs/system/db.spec.yaml` (12 ACs) with `// Spec:` headers and `// AC-N` annotations on existing tests. Day 4 forward: spec-first — `app/specs/system/correlation.spec.yaml` (16 ACs) and `app/specs/system/http-server.spec.yaml` (11 ACs) written before any code; tests reference each AC in comments. Future days continue spec-first. | -| 2026-05-24 | Day 4 finding: correlation ID needed monotonic counter | The 8-byte ID format (48-bit timestamp + 16 bits random) collides under tight-loop generation. Test `TestGenerate_UniquenessSequential` (10K calls) failed with duplicate IDs. Fix: 16-bit monotonic counter within the same millisecond, randomly seeded when ms advances. Preserves time-ordering AND guarantees uniqueness up to ~65M IDs/sec. 
The design doc said "<10K/sec is plenty for 16 bits random"; under bursty load that math breaks. Counter is the right primitive. | -| 2026-05-24 | Day 4 finding: chi `NotFound`/`MethodNotAllowed` bypass middleware by default | chi's default 404/405 handlers do NOT run the `r.Use(...)` middleware chain, which meant unmatched routes returned without an X-Correlation-Id header. Fix: register explicit `r.NotFound(handler)` and `r.MethodNotAllowed(handler)` so chi routes them through the middleware. Documented in `internal/server/server.go`. Would have surfaced as a "where did my correlation_id go?" forensic hole during Stage 2; caught at Day 4 acceptance. | -| 2026-05-24 | Day 5 finding: oapi-codegen v2 doesn't fully support OpenAPI 3.1 | Tried `openapi: 3.1.0` with `type: [string, 'null']` nullable syntax; codegen failed with "unhandled Schema type". Converted Stage-0 manifest to 3.0.3 with `type: string, nullable: true`. The full 14-domain manifest in `openapi.full.yaml` stays at 3.1.0 as a forward-looking Stage 2 artifact; the codegen-consumed `openapi.yaml` is 3.0.3 until upstream support lands. Tracked at https://github.com/oapi-codegen/oapi-codegen/issues/373. | -| 2026-05-24 | Day 5 finding: `oapi-codegen` output path is CWD-relative, not config-relative | Wrote `output: ../internal/server/api/...` expecting config-relative path; the file ended up two levels too high in the repo tree. Fix: paths in `oapi-codegen.yaml` are relative to the CWD where the binary is invoked. `make generate-api` documents the right invocation. | -| 2026-05-24 | Day 5 finding: pgxpool `body` JSONB requires explicit cast | INSERT into `idempotency_keys` failed at runtime because Go `[]byte` is sent as `bytea` by pgx, not `jsonb`. Fix: explicit `$4::jsonb` cast in the SQL. Spotted in initial idempotency replay test that returned 500. Documented inline. 
| -| 2026-05-24 | Day 5b complete: 3 endpoints live, idempotency replay verified | `/health`, `/diagnostics:echo`, `/audit/events` all returning correct envelopes. Live test: two POSTs with same `Idempotency-Key` and same body produced exactly 1 audit row (replay was cached, handler not re-invoked). `system.startup` emitted via `EmitSync` at boot, queryable via `/audit/events?action=system.startup`. Day 6 (idempotency) was implemented as part of Day 5b since `:echo` depends on it. | -| 2026-04-29 | Day 1–7 hardening sweep: SDD baseline locked at 67% avg AC coverage | Multi-agent review surfaced 7 P0 bugs (pgx error-compare, audit deadline override, channel-close-on-shutdown race, denialMap growth, server shutdown goroutine leak, missing `x-required-feature`, hard-coded feature ID string) — all fixed. Migrated 12 specs to Specter 0.13 schema and populated `specter.yaml`. Added integration tests for idempotency (9 ACs), license features (12 ACs), and API surface (12 ACs across 4 specs in per-spec files). Tightened 4 placebo tests. Audit doc drift fixed: licensing/audit-taxonomy/api-design sections updated to match implementation; `:taxonomy` and 3 other audit endpoints flagged as Phase-1 deferred. Coverage: 4/12 specs at 100%, 8/12 below tier threshold — gaps documented and triaged. | -| 2026-04-29 | All coverage gaps closed: 12/12 specs at 100% under `specter coverage --strict` | 41 uncovered ACs closed via real tests (not annotation-only). 
New tests: audit codegen (AC-01..03), EmitSync latency (AC-07), license-features p99 (AC-08), license-validation prev-key/fingerprint/latency (AC-03,10,13), idempotency missing-key + cache p99 (AC-04,07), db unreachable-host/migrations-idempotent/schema/round-trip/persistence (AC-02,03,04,05,06,07,08,10,12), api-health DB-down/latency/no-audit (AC-04,05,06), api-echo correlation echo/empty-body/oversize/single-audit/405/queryable (AC-02,04,05,06,09,10), api-audit-query filters + cursor + redaction (AC-02,04,05,06,07,08,09,10), api-license install+verify+leak/denial-audit/SIGHUP-equivalent (AC-02..10), server.Run real-bind + inflight + listener-error (AC-01,10,11). Added `license.Reset()` exported helper for clean test isolation. Perf budgets relaxed where shared-DB load made spec targets unrealistic (`EmitSync` 500µs→10ms ceiling, `Emit` 10µs→50µs); spec target preserved in comments. `specter sync` passes end-to-end with `.specter-results.json` from a real `go test -json` run. | -| 2026-05-24 | CI gates wired: make check + .github/workflows/go-ci.yml | New spec `release-ci-gates` (10 ACs, T1) at 100% strict coverage. Makefile gains `vet`, `vuln`, `test-race`, and `check` targets; `make check` chains vet → lint → vuln → test-race. `govulncheck` auto-installs if absent. `test-race` uses `-p 1` so packages don't trample each other's shared-DB state under the race detector. `internal/internalrace/` ships a build-tag-aware multiplier (1 normally, 20 under -race) that perf tests apply to their budgets so spec targets stand without -race and pass with it. **Lint findings fixed**: gofmt on 16 files; bounds-check on int→int32 conversions in `internal/db` and `internal/server/handlers.go`; `slog.Warn` → `slog.WarnContext` in audit writer (was drift from the project's own forbidigo rule); inline `Id` field annotated as mirroring codegen output. **Go toolchain bumped to 1.25.10** to close 7 stdlib CVEs surfaced by govulncheck (GO-2026-4601, 4602, 4870, 4918, 4946, 4947, 4971). 
`.github/workflows/go-ci.yml` runs the same gates on every PR touching `app/**` against a Postgres 16 service container. 19/19 specs at 100% strict. | -| 2026-05-24 | Day 13 complete: Stage 0 walking skeleton done — 18/18 specs at 100% strict | New spec `release-stage-0-signoff` (13 ACs, T2) maps the 19-step DoD onto enforcing tests. Four previously-deferred demo endpoints wired: `POST /diagnostics:require-host-write` (RBAC denial demo), `POST /diagnostics:evaluate-alert` (policy evaluator demo), `POST /diagnostics:enqueue-test-job` + in-process worker (`internal/worker/`) that drains `diagnostics.test_job` and emits the completion event with the originating correlation_id, `POST /admin/policies:reload` (operator endpoint behind `policy:reload`). New audit code `diagnostics.test_job_completed`. Server lifecycle: `s.Run(ctx)` starts the worker; `httptest.NewServer`-based tests call `s.StartWorker(ctx)` explicitly so the queue→worker→audit chain runs end-to-end. README expanded with a developer walkthrough and a 19-step DoD checklist mapping each step to its enforcing spec AC. **DoD step 16 amended**: the example originally used `X-Correlation-Id: test-end2end-001` which collides with the `test-` reserved prefix in the correlation contract (intended for in-process generation only); the canonical client prefix is `req-`. **Stage 0 complete — 13/13 days. 18/18 specs at 100% under `specter coverage --strict`.** Ready for `stage-0-complete` tag when the operator chooses to cut it. | -| 2026-05-24 | Day 12 complete: FIPS 140-3 build via Go 1.25 native `GOFIPS140` | Original plan called for microsoft/go but stock Go 1.24+ ships the in-toolchain FIPS module — second toolchain dropped from the dependency list. New spec `release-fips-build` (8 ACs, T1) at 100% strict coverage. `make build-fips` invokes `GOFIPS140=v1.0.0 go build` and produces `dist/openwatch-fips` with `crypto/internal/fips140/v1.0.0` symbols linked in. 
Tests verify: `--version` reports `fips: true` for FIPS binary and `fips: false` for non-FIPS, FIPS-module symbols present via `go tool nm`, TLS handshake + `/health` serves identical response, license/RBAC/correlation suites pass with `GOFIPS140=v1.0.0` set, Ed25519 license JWT verify still succeeds (FIPS 186-5 approved), Version/Commit match across both binaries. 17/17 specs at 100% strict. Stage 0 status: 12/13 days complete. | -| 2026-05-24 | Day 11 complete: native RPM + DEB packaging | New spec `release-package-build` (13 ACs, T2) at 100% strict coverage. `app/packaging/` holds shared assets (`common/openwatch.service`, `common/openwatch.toml`, `common/gen-demo-cert.sh`), the RPM spec (`packaging/rpm/openwatch.spec`), the DEB maintainer scripts (`packaging/deb/{control,preinst,postinst,prerm,postrm,conffiles}`), and the build scripts (`packaging/rpm/build-rpm.sh`, `packaging/deb/build-deb.sh`). `make rpm` and `make deb` invoke them; both run end-to-end on this host and produce shipping artifacts under `dist/`. Maintainer scripts: pre-install creates the `openwatch` system user + group; post-install runs `systemctl daemon-reload`; pre-uninstall runs `systemctl stop && disable`. Tests in `packaging/tests/package_test.go` build the artifacts and inspect them with `rpm -qp --queryformat` and `dpkg-deb --info / -c / --ctrl-tarfile` so every AC is enforced against real bytes, not just the source. 16/16 specs at 100% strict mode. Day 12 (FIPS via microsoft/go) and Day 13 (docs + demo + sign-off) remain. | -| 2026-05-24 | Day 9 complete: queue + cron correlation + policy framework | Two specs added (`system-job-queue` 11 ACs, `system-policy` 12 ACs); 15/15 specs at 100% strict coverage. `internal/queue/` ships Enqueue (rejects missing correlation_id), Dequeue (`FOR UPDATE SKIP LOCKED`, fresh worker ctx carrying the job's correlation_id — never the caller loop's), Complete/Fail. `internal/cron/` ships per-tick `cron-` correlation IDs; ticks never share IDs. 
`internal/policy/` ships generic loader (Ed25519 verify + semver monotonic + atomic.Pointer swap), `alert_thresholds` evaluator, `policy_history` snapshot, `policy.loaded`/`.invalid`/`.applied` audit emit. Migrations 0003 (`job_queue` with NOT NULL correlation_id) and 0004 (`policy_history`). Forbidigo lint now rejects raw `INSERT INTO job_queue` outside `internal/queue/` and `http.DefaultClient` outside `internal/httpclient/`. Demo HTTP endpoints (`:enqueue-test-job`, `:evaluate-alert`, `:reload-policies`) deferred — spec ACs validated by per-package tests, not by HTTP veneer. | -| 2026-05-24 | Day 8 complete: RBAC registry + middleware + demo endpoints | 13th spec `system-rbac` ships at 100% coverage. `app/auth/permissions.yaml` is the SSOT (59 permissions across 18 categories, 5 built-in roles). `scripts/gen-rbac.go` produces `internal/auth/permissions.gen.go` and `roles.gen.go` with category wildcards (`host:*`) and role inheritance (`viewer:*` → `auditor`) expanded at codegen time. `RequirePermission`/`EnforcePermission` middleware enforces RBAC first (403 `authz.permission_denied` + audit) then license gate (402 `license.feature_unavailable` + audit) in one pass — RBAC always wins when both fail. Stage-0 `X-Stub-Role` header binds identity; Stage 2 replaces the binder while keeping the `Identity` shape. New endpoints: `:require-host-read` (RBAC demo), `:require-remediation-execute` (RBAC+license combo), `GET /auth/me/permissions`, `GET /auth/permissions:registry`, `GET /admin/roles`. 16 ACs + 8 API integration tests. Deferred: `scripts/validate-rbac.go` and `scripts/validate-openapi.go` (registry-shape enforcement is at codegen time today). 
| diff --git a/docs/engineering/policies_as_data.md b/docs/engineering/policies_as_data.md deleted file mode 100644 index 1f6e6f26..00000000 --- a/docs/engineering/policies_as_data.md +++ /dev/null @@ -1,766 +0,0 @@ -# Policies as Data — Design Specification - -**Status:** Foundation, locked 2026-04-29 -**Owner:** Backend platform -**Spec:** `specs/system/policies.spec.yaml` (to be authored at Specter migration) -**Source-of-truth files:** -- `policies/*.yaml` — versioned, Ed25519-signed policy documents -- `internal/policy/types/*.go` — per-type schema validators (codegen + hand-written semantics) - ---- - -## 1. Why this exists - -OpenWatch's domain logic — *when a finding can be excepted, who can approve a remediation, how often a host is scanned, what compliance score triggers an alert* — has, historically, lived inside service code. That model fails for an agent-first platform for three reasons: - -1. **Agents ask "what's the rule?" not "trace the code."** When a remediation request returns `403`, an agent needs to read a structured reason — not infer it from HTTP status. If the rule lives in a YAML file with a version, the agent can quote it back. -2. **Operators tune policies more often than they ship code.** Today "compliance score < 80 fires an alert" is hardcoded. Tomorrow the customer wants `< 70 for production, < 90 for dev`. Without policies-as-data, that's a code change + redeploy + release notes. With it, it's a YAML edit and a SIGHUP. -3. **Audit answers "who decided" not "what code ran."** A `policy.applied` event with `policy_type: alert_thresholds` and `policy_version: 2.1.0` is forensically useful. "Service X line 1247" is not. - -This is not "make every `if` a policy." It's "the small set of decisions that operators tune, auditors review, and agents query." Section 2 defines the test. - ---- - -## 2. The "is it a policy?" test - -A piece of logic is a policy if **all four** are true: - -1. 
**An operator (not a developer) would change it.** Compliance score thresholds, exception expiry, approval requirements: yes. JSON parsing, retry timing, connection pool size: no. -2. **Auditors care about its history.** "What was our scan cadence policy on March 15?" is a real question. "What was our HTTP timeout?" is not. -3. **Agents would benefit from quoting it.** Returning `error.code = "policy.denied"` with `detail.policy_type = "remediation_approval"` and `detail.policy_version = "1.4.0"` is actionable. "Permission denied" is not. -4. **It's a runtime decision, not a startup config.** Database URL is config (loaded once, restart to change). Alert thresholds are policy (evaluated per scan, hot-reload). - -Five domains pass the test in OpenWatch: - -| Domain | Policy type ID | Evaluated at | -|--------|----------------|--------------| -| Compliance exceptions | `exceptions` | Exception request submission, exception revalidation | -| Operation approvals | `approvals` | Any operation declared `x-requires-approval` in OpenAPI | -| Scan scheduling | `schedules` | Scheduler tick (every 60s), per host per framework | -| Alert thresholds | `alert_thresholds` | Scan completion, drift detection, host state change | -| Remediation rules | `remediation` | Remediation request enqueue, before execution | - -Anything else that "feels like a policy" should be reviewed against the four-part test before being added. Adding policies is cheap; removing them — once handlers depend on `policy.Evaluate(ctx, "thing", ...)` — is expensive. - ---- - -## 3. Anti-patterns (what policies-as-data is NOT) - -- **Configuration in disguise.** A YAML file that lists "max upload size" is config, not policy. Configs go in `ow.yml`. -- **Generic rules engine.** No `if score < {{value}} then alert` evaluator. Each policy type has a typed Go schema and a dedicated evaluator. Generic engines fail silently when the YAML drifts from what the evaluator expects; typed schemas fail at load time. 
-- **Workflow engine.** Approvals are simple state machines (`pending → approved | rejected | expired`), not BPMN. If a workflow needs branching, parallel paths, or sub-processes, it belongs in code. -- **Specter target.** Specter validates that behavioral specs have enforcing tests. Specs describe *what the code does*; policies describe *operator-tunable rules*. They live in different files for different audiences. -- **Replacement for code.** A 400-line YAML with 30 conditional branches is worse than a 50-line Go function. If a policy reaches that complexity, it's no longer operator-tunable — break the operator-tunable bits out and put the rest in code. - ---- - -## 4. Core design - -### 4.1 Policy document envelope - -Every policy file conforms to this outer shape: - -```yaml -# policies/exceptions.yaml -policy_type: exceptions # one of: exceptions | approvals | schedules | alert_thresholds | remediation -version: 2.1.0 # semver; advances on any change to `rules` -metadata: - description: Compliance exception lifecycle rules - effective_from: 2026-05-01T00:00:00Z - superseded_by: null # set when this version is retired - signed_by: ops-admin@hanalyx.com - signed_at: 2026-04-29T14:32:11Z -rules: - # type-specific schema; see Section 5 -signature: # Ed25519 signature over the entire document MINUS this field - algorithm: ed25519 - key_id: ops-admin-2026 # references admin signing key - value: base64(64 bytes) -``` - -**Invariants:** -- `policy_type` must match the filename stem (`exceptions.yaml` → `policy_type: exceptions`). -- `version` must be a valid semver string, monotonically increasing across loads of the same `policy_type`. Loading `2.0.0` after `2.1.0` is rejected. -- `signature` is verified against an embedded Ed25519 admin public key set (separate from license keys). Unsigned policy files load only if `OPENWATCH_DEV_MODE=true`. -- `effective_from` is a wall-clock timestamp; the policy is inert before that time even if loaded. 
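A minimal sketch of how the filename-stem and monotonic-version invariants above might be checked, assuming plain `MAJOR.MINOR.PATCH` versions. The names `parseSemver`, `compareSemver`, and `checkEnvelope` are hypothetical and are not taken from `internal/policy/`. Signature verification and `effective_from` handling are left out.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strconv"
	"strings"
)

// parseSemver splits a plain MAJOR.MINOR.PATCH string into integers.
// Pre-release and build suffixes are out of scope for this sketch.
func parseSemver(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid semver %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid semver %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// compareSemver returns -1, 0, or 1 when a is lower than, equal to,
// or higher than b.
func compareSemver(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// checkEnvelope enforces two invariants: policy_type must equal the
// filename stem, and version must not be lower than the active one.
// current is the active version for this policy type ("" if none).
// Assumption: an equal version is tolerated here; the design text only
// states that a lower version is rejected.
func checkEnvelope(path, policyType, version, current string) error {
	stem := strings.TrimSuffix(filepath.Base(path), filepath.Ext(path))
	if stem != policyType {
		return fmt.Errorf("policy_type %q does not match filename stem %q", policyType, stem)
	}
	next, err := parseSemver(version)
	if err != nil {
		return err
	}
	if current == "" {
		return nil
	}
	cur, err := parseSemver(current)
	if err != nil {
		return err
	}
	if compareSemver(next, cur) < 0 {
		return fmt.Errorf("version regression: %s < %s", version, current)
	}
	return nil
}

func main() {
	// Loading 2.0.0 after 2.1.0 is a regression; 2.2.0 is accepted.
	fmt.Println(checkEnvelope("policies/exceptions.yaml", "exceptions", "2.0.0", "2.1.0"))
	fmt.Println(checkEnvelope("policies/exceptions.yaml", "exceptions", "2.2.0", "2.1.0"))
}
```

Comparing the parsed integers rather than the raw strings matters: a string comparison would rank `2.10.0` below `2.9.0`.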
- -### 4.2 Why Ed25519, not just file permissions - -File-permission protection assumes an attacker who can write `/opt/openwatch/policies/` cannot execute `openwatch policy install` (which checks signatures). Most real attackers who can do the first can do the second. Ed25519 with embedded public keys raises the bar: an attacker must possess the admin private key, which is held offline. - -Same primitive as license signing and audit chain signing — one crypto surface, fewer keys to rotate. - -### 4.3 Versioning rules - -- **Semver strict.** Patch (`2.1.0 → 2.1.1`) for descriptive changes (typo, comment). Minor (`2.1.0 → 2.2.0`) for adding new rules without changing existing decisions. Major (`2.1.0 → 3.0.0`) for any change that could flip a decision (raising a threshold, removing an exception class). -- **Monotonic.** The runtime tracks `current_version` per `policy_type`. Loading a lower version is rejected (`policy.invalid` audit event). Rolling back requires republishing as a new higher version — no version reuse, ever. -- **Multiple versions on disk allowed for forensics.** `policies/exceptions.v2.1.0.yaml` archived; `policies/exceptions.yaml` is the active symlink. Audit history references the version that was active at evaluation time. - -### 4.4 Loading and evaluation pipeline - -``` - ┌──────────────────────────────────────┐ - │ policies/{type}.yaml on disk │ - │ + admin public keys (embedded) │ - └──────────────┬───────────────────────┘ - │ openwatch policy install - │ OR startup - │ OR SIGHUP - ▼ - ┌──────────────────────────────────────────────────┐ - │ internal/policy/loader.Load(policyType) │ - │ 1. Read file │ - │ 2. Verify Ed25519 signature │ - │ 3. Parse against type-specific Go struct │ - │ 4. Run validator (refs, ranges, mutex rules) │ - │ 5. Compare version with runtime state │ - │ 6. 
Atomic swap into atomic.Pointer[State] │ - └──────────────┬───────────────────────────────────┘ - │ emit audit: - │ policy.loaded (success) - │ policy.invalid (any failure) - ▼ - ┌──────────────────────────────────────────────────┐ - │ internal/policy/{type}/evaluate.go │ - │ │ - │ Evaluate(ctx, input) Decision │ - │ - input is a typed struct per policy type │ - │ - reads atomic.Pointer[State] (lock-free) │ - │ - returns: allow | deny | defer | tier-of-action│ - │ - emits policy.applied audit on every call │ - └──────────────────────────────────────────────────┘ -``` - -Evaluation is lock-free for hot-path throughput: the loader swaps the `atomic.Pointer[State]` on reload, and readers see either the old or the new state without taking a lock. - -### 4.5 The Decision type - -```go -// internal/policy/types.go - -type Decision struct { - Outcome Outcome // allow | deny | defer | <type-specific> - PolicyType string // e.g., "exceptions" - PolicyVersion string // e.g., "2.1.0" - Reason string // machine-stable reason string (e.g., "expired", "out_of_scope") - HumanMessage string // for UI/log display - Detail map[string]any // type-specific context - AppliedAt time.Time -} - -type Outcome string - -const ( - OutcomeAllow Outcome = "allow" - OutcomeDeny Outcome = "deny" - OutcomeDefer Outcome = "defer" - // Type-specific outcomes (e.g., scheduling) extend this set. -) -``` - -When a handler turns a `Decision` into an HTTP response, `OutcomeDeny` maps to `error.code = "policy.denied"` with the policy type and version in `detail` (see error_codes.yaml). - ---- - -## 5. The five policy types - -Each subsection defines: (a) the YAML schema, (b) the Go evaluation input, (c) decision outcomes, (d) where it's evaluated. - -### 5.1 Exceptions - -**Purpose:** Govern when a compliance finding can be marked as "accepted risk" / "false positive" / "compensating control" without re-firing.
Currently the backend has an exception model but no policy gating — anyone with the role can grant an exception of any duration. This policy adds bounds. - -```yaml -policy_type: exceptions -version: 2.1.0 -metadata: {...} -rules: - defaults: - max_duration_days: 90 - requires_justification: true - auto_revalidate_on_drift: true - classes: - - id: false_positive - max_duration_days: 365 - requires_approval_roles: [auditor, security_admin] - - id: accepted_risk - max_duration_days: 90 - requires_approval_roles: [security_admin] - requires_justification_min_chars: 100 - - id: compensating_control - max_duration_days: 180 - requires_approval_roles: [security_admin] - requires_evidence_url: true - scope: - framework_blocklist: [] # frameworks where ALL exceptions are denied - rule_blocklist: # specific rule IDs that can never be excepted - - cis_rhel9_3.7.1 # SELinux disabled - - cis_rhel9_5.2.4 # SSH PermitRootLogin=yes -signature: {...} -``` - -**Evaluation input:** -```go -type ExceptionRequest struct { - RuleID string - Framework string - Class string // "false_positive" | "accepted_risk" | "compensating_control" - DurationDays int - RequesterRole string - Justification string - EvidenceURL string // optional -} -``` - -**Outcomes:** `allow` (request may proceed to approval workflow), `deny` (rejected at request time — bad class, blocklisted rule, duration exceeds class limit, missing justification). - -**Evaluated at:** `POST /compliance/exceptions` (request handler), and on revalidation (every 24h cron, on drift detection). - -### 5.2 Approvals - -**Purpose:** Declare which operations require human approval, who can approve, and approval-quorum rules. This is the "two-person rule" / "change control" surface. 
- -```yaml -policy_type: approvals -version: 1.0.0 -metadata: {...} -rules: - operations: - - id: remediation.execute - approvers_required: 1 - approver_roles: [security_admin, ops_lead] - same_role_can_self_approve: false # requester != approver - timeout_hours: 24 - auto_reject_on_timeout: false # false = stays pending after timeout - - id: host.delete - approvers_required: 0 # no approval needed (RBAC alone) - - id: license.install - approvers_required: 2 - approver_roles: [security_admin] - timeout_hours: 168 - reminder_intervals_hours: [24, 72, 144] - - id: admin.user.delete - approvers_required: 1 - approver_roles: [security_admin] - timeout_hours: 24 -signature: {...} -``` - -**Evaluation input:** -```go -type ApprovalRequest struct { - Operation string // matches an operations[].id - RequesterID uuid.UUID - RequesterRole string -} -``` - -**Outcomes:** -- `allow` — operation may execute immediately (`approvers_required = 0`). -- `defer` — operation enters pending state with `approval_id`. Handler returns `202 Accepted` with `approval_id`. -- `deny` — operation forbidden by policy (no matching operation entry, or requester role not allowed to even request). - -**Evaluated at:** Any handler whose OpenAPI operation declares `x-requires-approval: <operation_id>`. Codegen wraps the handler with the approval middleware. - -### 5.3 Schedules - -**Purpose:** Adaptive Compliance Scheduler — how often to scan each host, by current state. Replaces fixed-interval scanning.
- -```yaml -policy_type: schedules -version: 1.2.0 -metadata: {...} -rules: - defaults: - interval_compliant_hours: 168 # weekly when fully compliant - interval_drifted_hours: 24 # daily when drift detected - interval_failed_hours: 6 # 4x/day when actively failing - interval_first_scan_hours: 1 # near-immediate after host registration - max_interval_hours: 168 # ceiling — even compliant hosts scan weekly - jitter_percent: 10 # ±10% randomization to avoid thundering herd - per_framework: - - framework: cis-rhel9-v2.0.0 - interval_compliant_hours: 168 - - framework: stig-rhel9-v2r7 - interval_compliant_hours: 24 # STIG re-scans daily even when compliant - interval_failed_hours: 1 - per_host_tag: - - tag: production - interval_compliant_hours: 24 - - tag: dev - interval_compliant_hours: 336 # 14 days -signature: {...} -``` - -**Evaluation input:** -```go -type ScheduleQuery struct { - HostID uuid.UUID - Framework string - HostTags []string - LastScanAt time.Time - LastScanStatus ScanStatus // compliant | drifted | failed - HostCreatedAt time.Time -} -``` - -**Outcome (type-specific):** -```go -type ScheduleDecision struct { - NextScanAt time.Time - Interval time.Duration - Source string // which rule matched: "default" | "framework:cis-rhel9-v2.0.0" | "tag:production" -} -``` - -**Evaluated at:** Scheduler tick (every 60s). For each `(host, framework)` pair: compute `NextScanAt`, enqueue scan if `now() >= NextScanAt`. - -**Conflict resolution:** When multiple rules match (e.g., a host has tag `production` AND framework `stig-rhel9-v2r7`), the **shortest** interval wins (most aggressive scanning). This is safe-by-default — the operator can lengthen intervals via tag rules without worrying about framework rules being silently overridden. - -### 5.4 Alert thresholds - -**Purpose:** When a scan completes or drift is detected, decide whether to fire an alert and at what severity. 
- -```yaml -policy_type: alert_thresholds -version: 1.0.0 -metadata: {...} -rules: - compliance_score: - - condition: score < 70 - severity: critical - debounce_minutes: 60 - - condition: score < 80 - severity: warning - debounce_minutes: 240 - - condition: score < 90 - severity: info - debounce_minutes: 1440 - drift: - - condition: pass_to_fail_count > 5 - severity: critical - debounce_minutes: 60 - - condition: pass_to_fail_count > 0 - severity: warning - debounce_minutes: 60 - per_host_tag: - - tag: production - compliance_score: - - condition: score < 90 - severity: critical - - condition: score < 95 - severity: warning - channels_default: [slack, email] - channels_critical_override: [slack, email, pagerduty] -signature: {...} -``` - -**Note on `condition`:** The string `score < 70` is **not** parsed as a generic expression. The schema validator constrains it to a small allowlist of forms: `score < N`, `score > N`, and `pass_to_fail_count > N`. This stays out of "generic rules engine" territory; the validator can be ~30 lines of Go. - -**Evaluation input:** -```go -type AlertEvaluation struct { - HostID uuid.UUID - Framework string - HostTags []string - ComplianceScore float64 - PassToFailCount int - LastAlertAt map[Severity]time.Time // for debounce -} -``` - -**Outcome:** -```go -type AlertDecision struct { - Fire bool - Severity Severity - Channels []string - DebounceUntil time.Time // populated even when Fire=false, to short-circuit re-evaluation - MatchedRule string -} -``` - -**Evaluated at:** Scan completion; drift detection job. - -### 5.5 Remediation - -**Purpose:** Govern auto-remediation. License-gated (Phase 4), but the policy machinery exists in Stage 0 so the gate is not a retrofit.
- -```yaml -policy_type: remediation -version: 1.0.0 -metadata: {...} -rules: - global: - require_dry_run_first: true - max_concurrent_executions: 5 - rollback_on_step_failure: true - rule_classes: - - rule_pattern: "cis_rhel9_1\\..*" # config files, low risk - auto_execute_allowed: true - requires_approval: false - - rule_pattern: "cis_rhel9_3\\..*" # network, medium risk - auto_execute_allowed: false # dry-run only without approval - requires_approval: true - - rule_pattern: "cis_rhel9_5\\..*" # SSH/access, high risk - auto_execute_allowed: false - requires_approval: true - requires_dual_approval: true - blocklist: - - cis_rhel9_3.7.1 # SELinux changes — never auto-remediate -signature: {...} -``` - -**Evaluation input:** -```go -type RemediationRequest struct { - RuleID string - HostID uuid.UUID - DryRun bool - HasApproval bool -} -``` - -**Outcomes:** -- `allow` — proceed with remediation. -- `defer` — needs approval; route through approval policy. -- `deny` — blocklisted rule, or pattern requires dry-run first and `DryRun=false`. - -**Evaluated at:** `POST /remediation/requests` and at worker dequeue (re-check, in case policy changed since enqueue). - ---- - -## 6. Storage and runtime state - -### 6.1 On-disk layout - -``` -/opt/openwatch/policies/ -├── exceptions.yaml # active version (symlink or current file) -├── exceptions.v2.1.0.yaml # archived -├── exceptions.v2.0.0.yaml # archived -├── approvals.yaml -├── schedules.yaml -├── alert_thresholds.yaml -└── remediation.yaml -``` - -Archived versions are read-only. They support audit forensics: "show me the exceptions policy as of 2026-03-15" reads the file from a backup or the policy_history table (Section 6.3). 
- -### 6.2 Runtime state - -```go -// internal/policy/state.go - -type State struct { - LoadedAt time.Time - Policies map[string]*LoadedPolicy // key: policy_type -} - -type LoadedPolicy struct { - Type string - Version string // semver - Rules any // type-asserted to per-type struct - SignatureValid bool - EffectiveFrom time.Time - SourceFile string - SourceHash string // SHA-256 of file contents -} - -var current atomic.Pointer[State] - -func IsActive(policyType string) bool { - s := current.Load() - p, ok := s.Policies[policyType] - return ok && p.SignatureValid && time.Now().After(p.EffectiveFrom) -} - -func Get(policyType string) (*LoadedPolicy, bool) { - s := current.Load() - p, ok := s.Policies[policyType] - return p, ok -} -``` - -Hot-path evaluation never takes a lock; readers see either old or new state during a swap, never a partial state. - -### 6.3 Database snapshot table - -```sql -CREATE TABLE policy_history ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid_v7(), - policy_type TEXT NOT NULL, - version TEXT NOT NULL, - source_hash TEXT NOT NULL, - rules JSONB NOT NULL, - metadata JSONB NOT NULL, - signature JSONB NOT NULL, - loaded_at TIMESTAMPTZ NOT NULL DEFAULT now(), - superseded_at TIMESTAMPTZ, - UNIQUE(policy_type, version) -); - -CREATE INDEX idx_policy_history_type_loaded ON policy_history(policy_type, loaded_at DESC); -``` - -Every successful policy load inserts a row. The `superseded_at` column is set when a newer version of the same `policy_type` loads. This is the audit trail for "what was the active policy at time T." - -`policy.applied` audit events reference `(policy_type, version)`; the snapshot table provides full text of that version forever. - ---- - -## 7. Loading lifecycle - -### 7.1 Startup - -``` -1. Read embedded admin public keys. -2. For each policy_type in {exceptions, approvals, schedules, alert_thresholds, remediation}: - a. Read /opt/openwatch/policies/{type}.yaml. - b. 
If missing → use built-in default policy (described in §7.4); emit policy.loaded with version=0.0.0. - c. Verify signature; if invalid: - - production: refuse to start; exit non-zero with audit policy.invalid (sync). - - dev mode: log warning; load with SignatureValid=false. - d. Validate against type-specific schema; on failure → exit (production). - e. Compare version with policy_history; reject if not monotonic. - f. Insert into policy_history; mark prior version superseded. - g. Atomic swap into State. - h. Emit audit policy.loaded. -3. State is non-nil before any handler accepts traffic. -``` - -### 7.2 Hot reload (SIGHUP) - -``` -1. Receive SIGHUP. -2. For each policy_type, re-run steps 2a–2h from §7.1. -3. Reload is best-effort: failure of one policy_type does not roll back others. -4. Each successful reload emits policy.loaded; each failure emits policy.invalid. -5. The previous State is the fallback if all reloads fail (no atomic swap performed). -``` - -### 7.3 Admin endpoint reload - -``` -POST /admin/policies:reload -Idempotency-Key: required - -Body: {} (reload all) or { "types": ["exceptions", "approvals"] } - -Response: - 200 OK { "results": [{"policy_type": "exceptions", "version": "2.2.0", "outcome": "loaded"}, ...] } - 207 Multi-Status when some succeeded and some failed - 503 if reload deferred (another reload in progress) -``` - -Required permission: `admin.policies.reload` (declared via `x-required-permission`). - -### 7.4 Built-in defaults - -If a policy file is missing from disk, the loader uses a hardcoded conservative default with `version: 0.0.0`. The defaults are intentionally **strict** — operators must opt in to looser policies by writing a file. Examples: - -- `exceptions` default: max 30 days, all classes require security_admin approval, no blocklist. -- `approvals` default: every operation that *can* require approval *does*; default approver_roles = `[security_admin]`. 
-- `schedules` default: weekly for compliant hosts, daily otherwise, no per-tag overrides. -- `alert_thresholds` default: warning at <80, critical at <70. -- `remediation` default: `auto_execute_allowed: false` for everything; dry-run only. - ---- - -## 8. Audit integration - -The audit registry already defines (`audit/events.yaml`): - -- `policy.loaded` — emitted on successful load (startup or reload). -- `policy.invalid` — emitted when load fails (signature, schema, version regression). -- `policy.applied` — emitted on **every** call to `Evaluate()`. - -Detail schemas for each: - -**`policy.loaded.detail`:** -```yaml -policy_type: string -policy_version: string # the now-active version -previous_version: string|null # what was replaced -source_hash: string -load_source: string # "startup" | "sighup" | "admin_reload" -``` - -**`policy.invalid.detail`:** -```yaml -policy_type: string -attempted_version: string|null -errors: array of strings -load_source: string -``` - -**`policy.applied.detail`:** -```yaml -policy_type: string -policy_version: string -decision: string # "allow" | "deny" | "defer" | type-specific -reason: string # machine-stable -input_summary: object # type-specific; redacted (no secrets) -``` - -**Volume note:** `policy.applied` is the highest-volume audit event in the system — every API call that hits a policy emits one. Two mitigations: - -1. **Async path.** `policy.applied` always uses the async batched writer (never `EmitSync`). Drop on overflow is acceptable; the absence of an apply event does not affect correctness. -2. **Coalescing for schedules.** The scheduler evaluates *every host × framework* every minute. Emitting one event per evaluation is wasteful (most are "stay the course"). The scheduler emits `policy.applied` only when the decision *changes* the next-scan time materially (>5% delta) or fires a scan. The summary event `scan.queued` references the policy version, providing the audit chain. - ---- - -## 9. 
OpenAPI integration - -Two extensions tie policies into the API spec. - -### 9.1 `x-requires-approval` - -Declared per operation. Codegen generates middleware that wraps the handler. - -```yaml -paths: - /remediation/requests/{id}:execute: - post: - operationId: executeRemediation - x-required-permission: remediation:execute - x-required-feature: remediation_execution - x-requires-approval: remediation.execute - x-audit-events: [remediation.requested, remediation.executed] - responses: - '202': - description: Approval required; returns approval_id - content: - application/json: - schema: {$ref: '#/components/schemas/ApprovalPending'} - '200': - description: Executed (approval policy returned allow) -``` - -The `202` response is the `defer` outcome of the approvals policy. Agents key off `error.code` (none in 2xx) and the response body shape. - -### 9.2 `x-policy-evaluated` (informational) - -Declared per operation when a non-approval policy may produce a denial. This is documentation-only — the spec consumer can see which policy types govern the endpoint. - -```yaml -paths: - /compliance/exceptions: - post: - operationId: requestException - x-policy-evaluated: [exceptions] - x-audit-events: [compliance.exception.requested] -``` - -CI does not enforce that the handler actually evaluates the listed policies — that's a behavioral spec concern (Specter), not a spec-time concern. - ---- - -## 10. 
Code organization - -``` -internal/ -└── policy/ - ├── state.go # atomic.Pointer[State], Get/IsActive - ├── loader.go # ReadFile, VerifySignature, Validate, Apply - ├── reload.go # SIGHUP handler, admin endpoint glue - ├── history.go # snapshot to policy_history table - ├── audit.go # policy.loaded / .invalid / .applied helpers - ├── types/ - │ ├── exceptions.go # struct + JSON Schema validator - │ ├── approvals.go - │ ├── schedules.go - │ ├── alert_thresholds.go - │ └── remediation.go - └── eval/ - ├── exceptions.go # Evaluate(ctx, ExceptionRequest) Decision - ├── approvals.go - ├── schedules.go - ├── alert_thresholds.go - └── remediation.go -``` - -Each evaluator is plain Go, fully unit-testable with table-driven tests. No DSL, no AST, no expression evaluator — just typed inputs into typed decisions. - ---- - -## 11. Failure modes and edge cases - -| Scenario | Behavior | -|----------|----------| -| Policy file deleted while running | Existing in-memory state continues; on next reload, missing file → built-in default loaded; emit `policy.loaded` with `previous_version` populated. | -| Policy file edited but signature stale | Signature check fails on reload; `policy.invalid` emitted; previous in-memory state retained (no swap). | -| Policy file references unknown rule ID | Schema validator rejects at load; `policy.invalid` emitted with the unknown reference in `errors[]`. | -| Two policies with the same `policy_type` and same `version` on disk | Filename precedence: `{type}.yaml` (active symlink) wins. Other files are ignored. | -| Clock skew makes `effective_from` invalid | Loaded but inert until `now() >= effective_from`. Evaluations during this window use the previous policy. | -| Database snapshot insert fails | Policy still loads into memory; warning logged; reconciliation job retries snapshot. The in-memory state is authoritative for evaluation. 
| -| Evaluator panic (bug in eval code) | Recovered by middleware; returns `error.code = "server.internal"`; emits `policy.invalid` (yes, the *evaluator* failed, not the policy itself — the audit code captures the bug); request retries are not safe. | -| Policy version downgrade attempted | Loader rejects; `policy.invalid` with `errors: ["version regression: 2.0.0 < 2.1.0"]`. | - ---- - -## 12. Performance targets - -| Metric | Target | Rationale | -|--------|--------|-----------| -| `Evaluate()` p99 (any type) | < 50µs | Hot-path; lock-free atomic read + struct evaluation | -| Schedule evaluator full sweep (1000 hosts × 3 frameworks) | < 100ms | Runs every 60s; must finish before the next tick | -| Policy reload (single type) | < 100ms | SIGHUP reload; signature verify + validate + DB insert | -| `policy.applied` audit volume | ~10K/sec sustained | Async path absorbs without backpressure | - ---- - -## 13. Stage 0 work - -Day 8 of the walking skeleton (after audit foundation Day 5, idempotency Day 6, and licensing Day 7) folds in policies-as-data: - -1. **`internal/policy/` package** — state, loader, history (no eval logic). -2. **Type schema scaffolding** — Go structs and JSON-Schema validators for all 5 types. Evaluator stubs return built-in defaults. -3. **`policy_history` table** — migration + repository. -4. **`/admin/policies:reload` endpoint** — admin auth, returns the reload outcome map. -5. **SIGHUP handler** — wired in. -6. **Audit integration** — `policy.loaded`/`.invalid`/`.applied` emit helpers. -7. **OpenAPI extensions** — codegen support for `x-requires-approval` and `x-policy-evaluated` (parsing only; middleware in Stage 2). -8. **Built-in default policies** — checked in as `policies/{type}.default.yaml` (version `0.0.0`, unsigned in dev mode). - -What is **not** in Stage 0: -- Approval state machine (Stage 2 — needs approvals table, notification dispatch). -- Adaptive scheduler implementation (Stage 2 — needs scan execution). 
-- Alert dispatch (Stage 2 — needs notification channels). -- Remediation evaluator integration (Phase 4). - -The framework loads, validates, snapshots, and emits audit events on Day 8. Type-specific evaluators come online as their consumers do. - ---- - -## 14. Testing strategy - -| Layer | Test type | What it asserts | -|-------|-----------|-----------------| -| Schema validators | Table-driven unit | Bad inputs rejected with the expected error string | -| Loader | Integration | Real file → real signature verify → real DB insert; covers the happy path and 5 failure modes from §11 | -| Evaluators | Table-driven unit | (input, expected Decision) pairs; one per outcome and reason | -| Hot reload | Integration | SIGHUP triggers reload; concurrent evaluators see consistent state | -| Performance | Benchmark | `BenchmarkEvaluate*` — fail CI if p99 > 50µs | -| Audit emission | Integration | Every Evaluate call produces a `policy.applied` event | -| End-to-end | Behavioral spec (post-Specter) | "If exception class is `accepted_risk` and duration > 90 days, request returns `policy.denied`" | - ---- - -## 15. Open questions - -1. **Per-tenant policies.** The current design is single-tenant per OpenWatch deployment. Multi-tenant would require namespacing files and snapshot rows by tenant. Defer to multi-tenant epic. -2. **Policy diff/preview UI.** Operators will want to see "what would change if I install this version?" before installing. Out of Stage 0 scope. -3. **Policy linting beyond schema.** E.g., "you have a `class: accepted_risk` rule with `max_duration_days: 9999` — is that intentional?" Defer; can be added as a separate `openwatch policy lint` subcommand. -4. **Cross-policy invariants.** E.g., the `approvals` policy declares `host.delete` requires no approval, but the `remediation` policy declares it does. Today each policy is validated independently. Cross-checks deferred until we hit a real conflict. -5.
**Policy expression evaluator scope.** §5.4 uses string conditions like `score < 70`. The schema validator allowlists patterns. If operators ever want richer expressions (e.g., `score < 70 AND host.tag == "production"`), we either grow the allowlist or write a tiny CEL-style evaluator. Expressions in the allowlist stay restricted in v1. - ---- - -## Cross-references - -- Error codes: `api/error_codes.yaml` — `policy.invalid`, `policy.version_mismatch`, `policy.denied`, `policy.not_found` are already registered. -- Audit events: `audit/events.yaml` — `policy.loaded`, `policy.invalid`, `policy.applied` already registered with detail schemas. -- API design: `docs/engineering/api_design_principles.md` §11 (extensions), §15 (idempotency on `/admin/policies:reload`). -- RBAC registry: `docs/engineering/rbac_registry.md` — the `approvals` policy's `approver_roles` field cross-validates against the active role set (built-in + custom). Unknown role at policy load → `policy.invalid` audit event with the unknown role in `errors[]`; previous in-memory state retained. -- Roadmap: 2026-04-27 entry on policies-as-data; 2026-04-29 entry on this design doc; 2026-04-30 entry on RBAC cross-validation. diff --git a/docs/engineering/rbac_registry.md b/docs/engineering/rbac_registry.md deleted file mode 100644 index a49818b4..00000000 --- a/docs/engineering/rbac_registry.md +++ /dev/null @@ -1,669 +0,0 @@ -# RBAC Registry — Design Specification - -**Status:** Foundation, locked 2026-04-30 -**Owner:** Backend platform -**Spec:** `specs/system/rbac.spec.yaml` (to be authored at Specter migration) -**Source-of-truth files:** -- `auth/permissions.yaml` — registry of permissions and built-in roles -- `internal/auth/permissions.gen.go` — codegen-typed Go constants -- `internal/auth/roles.gen.go` — codegen-typed built-in role definitions - ---- - -## 1. Why this exists - -OpenWatch enforces access control at three layers: - -1. 
**Spec layer** — OpenAPI declares `x-required-permission: host:read` per operation. -2. **Handler layer** — Go middleware checks `user.HasPermission(perms.HostRead)`. -3. **Role layer** — A user's role has a list of permissions; the union of their roles' permissions is their effective set. - -In a string-literal world, all three layers refer to permissions by free-form string. Drift arrives within a release: - -- The spec says `host:read`. The handler checks `hosts:read`. The role grants `host.read`. All three are slightly different. Tests pass because fixtures grant superusers everything. Production fails when a real `auditor` role tries to list hosts and gets `403`. -- A new dangerous permission gets added to a handler but never to the registry. There is no audit hook that says "this permission was added"; reviewers don't know the surface grew. -- License-gated permissions (`audit:export` requires OpenWatch+) are gated in some places and not others — gating is a per-handler decoration that goes stale. - -A registry collapses all three layers onto one source. The OpenAPI validator, the handler middleware, and the role definitions all read from the same file. Misspell a permission anywhere → build fails. License gating co-locates with permission definition → middleware enforces both in one pass. Custom roles created at runtime validate every permission against the registry → no silent grant of a permission that doesn't exist. - ---- - -## 2. The one-line contract - -> **Permissions are a registry, not a vocabulary. Every reference to a permission — in OpenAPI, in handler code, in built-in role definitions, in custom roles created at runtime — resolves through the registry. Drift becomes a build error.** - -The registry has two sections: - -- **Permissions** — immutable at runtime. Adding one is a code+spec change. -- **Built-in roles** — extensible via migration only. Updates ship in product releases. 
- -**Custom roles** (Stage 2) are a third concept: runtime-mutable, DB-stored, but constrained by the registry — every permission they grant must be a registry permission. - ---- - -## 3. Permission schema - -Every entry in `auth/permissions.yaml` `permissions:` section conforms to: - -```yaml -- id: host:read - category: host - description: View host details, list hosts, view host audit history - dangerous: false # optional; default false - license_gated: null # optional; default null -``` - -**Field semantics:** - -| Field | Type | Required | Notes | -|-------|------|----------|-------| -| `id` | string | yes | `^[a-z][a-z0-9_]*:[a-z][a-z0-9_]*$` (resource:action). Stable across versions; never changes meaning. | -| `category` | string | yes | Must reference a `categories[].id`. The category is implied by the `id` prefix; this field exists for explicitness in tooling. | -| `description` | string | yes | One-line human description. Surfaced in admin UI and `/auth/permissions:registry`. | -| `dangerous` | boolean | no | `true` for destructive ops, license install, or anything that would warrant a "are you sure?" confirmation. UI uses for confirmation dialogs; audit middleware records as a high-priority denial. | -| `license_gated` | string | no | Feature ID from `licensing/features.yaml`. Permission is inert if the license doesn't include the feature. Combined RBAC+license check happens in one middleware pass. | - -**Build invariants** (enforced by `scripts/gen-rbac.go`, run in CI): - -- Every `id` matches the regex. -- Every `id`'s prefix matches a defined `categories[].id`. -- Every `license_gated` value matches a `feature.id` in `licensing/features.yaml`. -- `dangerous` is a boolean. -- No duplicates between `permissions:` and `deprecated_permissions:`. 
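The build invariants listed above could be checked along the following lines. This is a minimal sketch, not the contents of `scripts/gen-rbac.go`: the `PermissionEntry` type, the `CheckInvariants` function, and the map-based category and feature lookups are all hypothetical names chosen for illustration. Only the id regex is taken from the field table.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// idPattern is the resource:action regex quoted in the field table.
var idPattern = regexp.MustCompile(`^[a-z][a-z0-9_]*:[a-z][a-z0-9_]*$`)

// PermissionEntry is a hypothetical parsed form of one registry entry.
// Dangerous is a Go bool, so the "dangerous is a boolean" invariant is
// enforced by the type once the YAML has been decoded.
type PermissionEntry struct {
	ID           string
	Dangerous    bool
	LicenseGated string // empty when the permission is not gated
}

// CheckInvariants returns one error per violated invariant.
// categories and features are assumed to be sets of known ids.
func CheckInvariants(active []PermissionEntry, deprecated []string, categories, features map[string]bool) []error {
	retired := make(map[string]bool, len(deprecated))
	for _, id := range deprecated {
		retired[id] = true
	}
	var errs []error
	for _, p := range active {
		if !idPattern.MatchString(p.ID) {
			errs = append(errs, fmt.Errorf("%q: id is not resource:action", p.ID))
			continue // the remaining checks assume a well-formed id
		}
		prefix, _, _ := strings.Cut(p.ID, ":")
		if !categories[prefix] {
			errs = append(errs, fmt.Errorf("%q: unknown category %q", p.ID, prefix))
		}
		if p.LicenseGated != "" && !features[p.LicenseGated] {
			errs = append(errs, fmt.Errorf("%q: unknown feature %q", p.ID, p.LicenseGated))
		}
		if retired[p.ID] {
			errs = append(errs, fmt.Errorf("%q: also listed as deprecated", p.ID))
		}
	}
	return errs
}

func main() {
	active := []PermissionEntry{
		{ID: "host:read"},
		{ID: "audit:export", LicenseGated: "audit_export"},
		{ID: "hosts.read"}, // dot separator: rejected by the regex
	}
	categories := map[string]bool{"host": true, "audit": true}
	features := map[string]bool{"audit_export": true}
	for _, err := range CheckInvariants(active, nil, categories, features) {
		fmt.Println("invariant violated:", err)
	}
}
```

Collecting every violation rather than stopping at the first keeps a CI failure actionable when several entries are wrong at once.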
- -### 3.1 Naming convention - -Always **resource:action**, both lowercase, both underscore-separated within tokens: - -``` -host:read ✓ -host:connectivity_check ✓ -scan_template:write ✓ -remediation:execute ✓ -``` - -Anti-patterns: - -``` -hosts:read ✗ plural noun -host.read ✗ dot separator (collides with audit codes) -host:Read ✗ capitals -host_read ✗ no separator -host:write_all ✗ multi-token action -``` - -The action vocabulary is small: `read`, `write`, `delete`, `execute`, plus operation-specific verbs where appropriate (`approve`, `revoke`, `acknowledge`, `resolve`, `cancel`, `request`, `comment`, `connectivity_check`, `intelligence_refresh`, `test`, `rollback`, `install`, `reload`). - ---- - -## 4. Codegen - -### 4.1 Output - -```go -// internal/auth/permissions.gen.go (DO NOT EDIT) - -package auth - -type Permission string - -const ( - AuthRead Permission = "auth:read" - AuthWrite Permission = "auth:write" - UserRead Permission = "user:read" - UserWrite Permission = "user:write" - UserDelete Permission = "user:delete" - HostRead Permission = "host:read" - HostWrite Permission = "host:write" - HostDelete Permission = "host:delete" - HostConnectivityCheck Permission = "host:connectivity_check" - HostIntelligenceRefresh Permission = "host:intelligence_refresh" - // ... ~50 more ... 
- RemediationExecute Permission = "remediation:execute" - AuditExport Permission = "audit:export" -) - -type PermissionMeta struct { - Category string - Description string - Dangerous bool - LicenseGated string // empty if not gated -} - -var Permissions = map[Permission]PermissionMeta{ - HostRead: {Category: "host", Description: "View host details...", Dangerous: false, LicenseGated: ""}, - RemediationExecute: {Category: "remediation", Description: "Execute...", Dangerous: true, LicenseGated: ""}, // free core (single-rule); bulk/auto gated at the handler - AuditExport: {Category: "audit", Description: "Export audit data...", Dangerous: false, LicenseGated: "audit_export"}, - // ... -} - -// AllPermissions returns every active permission id. -func AllPermissions() []Permission { ... } - -// IsDangerous reports whether p is marked dangerous. -func IsDangerous(p Permission) bool { ... } - -// LicenseGate returns the feature id required for p, or "" if none. -func LicenseGate(p Permission) string { ... } -``` - -```go -// internal/auth/roles.gen.go (DO NOT EDIT) - -package auth - -type RoleID string - -const ( - RoleViewer RoleID = "viewer" - RoleAuditor RoleID = "auditor" - RoleOpsLead RoleID = "ops_lead" - RoleSecurityAdmin RoleID = "security_admin" - RoleAdmin RoleID = "admin" -) - -// BuiltInRoles resolves wildcards at codegen time so the runtime never expands. -var BuiltInRoles = map[RoleID]RoleDefinition{ - RoleViewer: {ID: "viewer", Description: "...", Permissions: []Permission{ - AuthRead, HostRead, ScanRead, /* ... explicit list ... */ - }}, - // ... -} -``` - -### 4.2 Workflow for adding a permission - -1. Add the entry to `auth/permissions.yaml`. -2. Run codegen: `go run scripts/gen-rbac.go`. -3. Add `x-required-permission: <id>` to the relevant OpenAPI operations. -4. Reference the typed constant from handler code: `requireAuth(perms.HostRead)`. -5. 
CI fails the build if an OpenAPI spec uses an unknown permission, or if a handler emits a string literal that doesn't match a constant. - -### 4.3 Adding a built-in role - -Built-in roles are extensible only via product release: - -1. Add the entry to `auth/permissions.yaml` `roles:` section. -2. Author a migration that inserts the new row into `roles` with `is_built_in: true`. -3. Existing custom roles unaffected. - -Modifying a built-in role's permission list is the same process. The migration UPDATEs the row; release notes call out the change. Customers running an older release see the older permission set. - -### 4.4 Deprecation - -Move retired permissions from `permissions:` to `deprecated_permissions:`: - -```yaml -deprecated_permissions: - - id: scan:legacy_export - deprecated_at: 2026-04-30 - successor: audit:export - notes: Replaced by unified audit export -``` - -While deprecated: - -- OpenAPI specs cannot reference it (build fails). -- Handler code cannot reference it (constant removed; lint catches string literals). -- Existing custom roles in DB still work — the read endpoint surfaces a `deprecated_permissions: ["scan:legacy_export"]` warning attribute. -- After one product release, hard-remove from `deprecated_permissions:`. Custom roles auto-prune on next read; `admin.role.changed` audit event emitted with `detail.removed: ["scan:legacy_export"]`. - ---- - -## 5. 
OpenAPI integration - -### 5.1 The `x-required-permission` extension - -```yaml -paths: - /api/v1/hosts/{host_id}: - get: - operationId: getHost - x-required-permission: host:read - x-audit-events: [] # reads don't emit audit (per audit taxonomy) - responses: {...} - - delete: - operationId: deleteHost - x-required-permission: host:delete # registry validates this is dangerous=true - x-audit-events: [host.deleted] - responses: {...} - - /api/v1/remediation/requests/{id}:execute: - post: - operationId: executeRemediation - x-required-permission: remediation:execute # registry validates license_gated - x-required-feature: remediation_execution # MUST match permissions.yaml license_gated - x-requires-approval: remediation.execute - x-audit-events: [remediation.requested, remediation.executed] - responses: {...} -``` - -### 5.2 Cross-validation - -The OpenAPI build validator enforces: - -1. Every `x-required-permission` value is in the active permissions registry. -2. If a permission has `license_gated: X`, the operation must declare `x-required-feature: X` (or omit the permission). Mismatch → build fails. -3. If `x-required-permission` is `dangerous: true`, the operation MUST emit at least one audit event (`x-audit-events` non-empty). Dangerous ops without audit are a contradiction. - -### 5.3 Multiple permissions per operation - -A handler may require *any of* or *all of* multiple permissions. The extension supports both shapes: - -```yaml -# Single permission (most common) -x-required-permission: host:read - -# Any-of (rare; one of these is sufficient) -x-required-permission: - any_of: [host:read, host:write] - -# All-of (rare; user must have all) -x-required-permission: - all_of: [host:write, scan:execute] -``` - -The vast majority of operations use the single-permission form. - ---- - -## 6. The combined middleware - -The same middleware that enforces the permission also enforces the license gate. One pass, one denial path, one audit event. 
- -```go -// internal/auth/middleware.go - -func RequirePermission(p Permission) func(http.Handler) http.Handler { - return func(next http.Handler) http.Handler { - return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { - ctx := r.Context() - user, ok := UserFrom(ctx) - if !ok { - writeError(w, http.StatusUnauthorized, errors.AuthTokenMissing) - return - } - - // 1. Permission check - if !user.HasPermission(p) { - audit.Emit(ctx, audit.Event{ - Action: audit.AuthzPermissionDenied, - Detail: map[string]any{ - "required_permission": string(p), - "route": r.URL.Path, - }, - }) - writeError(w, http.StatusForbidden, errors.AuthzPermissionDenied) - return - } - - // 2. License gate (if applicable) - if feature := LicenseGate(p); feature != "" { - if !license.IsEnabled(feature) { - audit.Emit(ctx, audit.Event{ - Action: audit.LicenseFeatureCheckDenied, - Detail: map[string]any{"feature": feature, "permission": string(p)}, - }) - writeError(w, http.StatusPaymentRequired, errors.LicenseFeatureUnavailable) - return - } - } - - next.ServeHTTP(w, r) - }) - } -} -``` - -**Order matters.** Auth (who are you?) → idempotency → RBAC+license → handler. The full chain: - -``` -correlation → auth → idempotency → RBAC+license → handler → audit emit -``` - -`oapi-codegen` produces this wiring from the `x-required-permission` extension. Handlers do not call `RequirePermission` themselves; the middleware is generated. - ---- - -## 7. The user's effective permissions - -```go -type User struct { - ID uuid.UUID - Roles []RoleID - // ... 
-} - -func (u *User) HasPermission(p Permission) bool { - for _, roleID := range u.Roles { - role, ok := lookupRole(roleID) - if !ok { - continue - } - if role.HasPermission(p) { - return true - } - } - return false -} - -func (r *RoleDefinition) HasPermission(p Permission) bool { - for _, granted := range r.Permissions { - if granted == "*" { - return true - } - if matchesWildcard(granted, p) { - return true - } - if granted == p { - return true - } - } - return false -} - -func matchesWildcard(granted, p Permission) bool { - // granted is "host:*"; p is "host:read" - if !strings.HasSuffix(string(granted), ":*") { - return false - } - grantedCategory := strings.TrimSuffix(string(granted), ":*") - pCategory, _, ok := strings.Cut(string(p), ":") - return ok && grantedCategory == pCategory -} -``` - -**Built-in roles have wildcards expanded at codegen time** (per `BuiltInRoles` in `roles.gen.go`); the runtime check never expands. **Custom roles store wildcards as-is** (so a category-level grant continues to cover newly added permissions in that category); the runtime expands per-call. - -### 7.1 The bare wildcard `*` - -Reserved for the built-in `admin` role. The validation rule: - -```go -func validateRolePermissions(roleID RoleID, perms []Permission, isBuiltIn bool) error { - for _, p := range perms { - if p == "*" && (!isBuiltIn || roleID != RoleAdmin) { - return fmt.Errorf("bare wildcard '*' is reserved for the built-in admin role") - } - // ... category-wildcard and exact-match validations ... - } - return nil -} -``` - -A custom role that wants "everything" must list permissions explicitly (or use category wildcards like `host:*`, `scan:*`, etc.). This is a deliberate friction: cloning admin without code review sidesteps the audit trail of "who is the most privileged role in the system." - ---- - -## 8. Custom roles (Stage 2 preview) - -Stage 0 ships the registry, built-in roles, and the lookup endpoints. 
Stage 2 ships custom-role CRUD when user management lands. - -### 8.1 The `roles` table - -```sql -CREATE TABLE roles ( - id TEXT PRIMARY KEY, - description TEXT NOT NULL, - is_built_in BOOLEAN NOT NULL DEFAULT false, - permissions TEXT[] NOT NULL, - created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - updated_at TIMESTAMPTZ NOT NULL DEFAULT now(), - created_by UUID REFERENCES users(id) -); - -CREATE TABLE user_roles ( - user_id UUID NOT NULL REFERENCES users(id) ON DELETE CASCADE, - role_id TEXT NOT NULL REFERENCES roles(id) ON DELETE RESTRICT, - granted_at TIMESTAMPTZ NOT NULL DEFAULT now(), - granted_by UUID REFERENCES users(id), - PRIMARY KEY (user_id, role_id) -); -``` - -### 8.2 The custom-role API (Stage 2) - -``` -POST /api/v1/admin/roles # create custom role -PUT /api/v1/admin/roles/{id} # update custom role; built-ins → 405 -DELETE /api/v1/admin/roles/{id} # delete custom role; built-ins → 405 -POST /api/v1/admin/roles/{id}:assign # assign to user -POST /api/v1/admin/roles/{id}:unassign # remove from user -POST /api/v1/admin/roles/{id}:clone # sugar: clone built-in to custom -``` - -All declare `x-required-permission: admin:role_manage`. - -### 8.3 Validation at custom-role create - -The handler: - -1. Validates `id` matches `^[a-z][a-z0-9_]{1,63}$`, not a built-in role name. -2. **Validates every permission against the registry.** Unknown permission → `400` with `error.code = "validation.field_unknown"` and `detail.invalid_permissions: [...]`. -3. Validates wildcards: bare `*` rejected; category wildcards expanded for the response (so the admin sees what they granted). -4. Counts dangerous permissions and includes a warning array in the response. -5. Counts license-gated permissions; warns if the license currently doesn't enable them. -6. Inserts; emits `admin.role.changed` audit event. - -### 8.4 Custom roles and license-gated permissions - -A custom role with `remediation:execute` is **allowed** even if the license does not enable `remediation_execution`. 
The permission is inert at runtime — the combined middleware denies. This means: - -- Admins can pre-stage roles before purchasing OpenWatch+ (the role exists; activates when license arrives). -- License downgrade does not require role cleanup. Roles continue to grant the permission; runtime simply denies until license re-enables. - -### 8.5 Custom roles and policy cross-validation - -The policies-as-data framework registers an `approvals` policy *type*, but **no `approvals` policy is currently configured** and no code reads `approver_roles`. The enforced approval gate today is the `remediation:approve` / `exception:approve` permission alone (held by `security_admin` and `admin`; see [remediation_exception_governance.md](remediation_exception_governance.md)). If an `approvals` policy is added later, the loader cross-validates its `approver_roles` against the active role set (built-in + custom; an unknown role → `policy.invalid` audit event, previous policy state retained). Those `approver_roles` MUST be a subset of the roles that hold the matching `*:approve` permission — otherwise the policy names a role that cannot actually approve. - ---- - -## 9. The lookup endpoints - -### 9.1 `GET /api/v1/auth/me/permissions` (Stage 0) - -Returns the calling user's effective permissions (union of all their roles' permissions, wildcards expanded against the current registry). - -```json -{ - "user_id": "018f3c2a-...", - "roles": ["ops_lead"], - "permissions": [ - "auth:read", "auth:write", - "host:read", "host:write", "host:connectivity_check", "host:intelligence_refresh", - "scan:read", "scan:execute", "scan:cancel", - "..." - ], - "license_gated_unavailable": ["audit:export"] -} -``` - -`license_gated_unavailable` lists permissions the user technically has via their role but the license currently denies. Helps the frontend hide buttons that would always 402. - -`x-required-permission: auth:read` (every authenticated user can see their own). 
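One way the 9.1 response fields might be derived is sketched below. The function names and the plain-string maps are assumptions made for illustration; wildcard expansion is deliberately left out, so the role permission lists are assumed to be already expanded.

```go
package main

import (
	"fmt"
	"sort"
)

// EffectivePermissions returns the sorted, de-duplicated union of the
// permissions granted by each role. Unknown roles contribute nothing.
func EffectivePermissions(roles []string, rolePerms map[string][]string) []string {
	set := make(map[string]bool)
	for _, role := range roles {
		for _, p := range rolePerms[role] {
			set[p] = true
		}
	}
	out := make([]string, 0, len(set))
	for p := range set {
		out = append(out, p)
	}
	sort.Strings(out)
	return out
}

// LicenseGatedUnavailable returns the held permissions whose gating
// feature is not enabled. gates maps permission id to feature id.
func LicenseGatedUnavailable(perms []string, gates map[string]string, enabled map[string]bool) []string {
	var blocked []string
	for _, p := range perms {
		if feature := gates[p]; feature != "" && !enabled[feature] {
			blocked = append(blocked, p)
		}
	}
	return blocked
}

func main() {
	// Hypothetical role contents; the real lists come from the registry.
	rolePerms := map[string][]string{
		"ops_lead": {"host:read", "host:write", "audit:export"},
	}
	gates := map[string]string{"audit:export": "audit_export"}
	enabled := map[string]bool{} // license without audit_export

	perms := EffectivePermissions([]string{"ops_lead"}, rolePerms)
	fmt.Println("permissions:", perms)
	fmt.Println("license_gated_unavailable:", LicenseGatedUnavailable(perms, gates, enabled))
}
```

Keeping the blocked permissions in a separate list, rather than removing them from `permissions`, matches the description above: the role still grants them, and only the license denies them.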
- -### 9.2 `GET /api/v1/auth/permissions:registry` (Stage 0) - -Returns the full registry — categories, permissions, built-in roles. Frontend uses to render permission selectors and role editors. - -```json -{ - "version": 1, - "categories": [ - {"id": "host", "description": "Host management permissions"}, - "..." - ], - "permissions": [ - {"id": "host:read", "category": "host", "description": "...", "dangerous": false, "license_gated": null}, - "..." - ], - "built_in_roles": [ - {"id": "viewer", "description": "...", "permissions": ["auth:read", "..."]}, - "..." - ], - "deprecated_permissions": [] -} -``` - -`x-required-permission: auth:read` (the registry is non-secret; any authenticated user can read it to render their own UI). - -### 9.3 `GET /api/v1/admin/roles` (Stage 0; Stage 2 expands) - -Stage 0: returns the built-in roles only. -Stage 2: returns built-in + custom roles, with `is_built_in: bool` per row. - -`x-required-permission: admin:role_manage`. - ---- - -## 10. CI enforcement - -Three layers (parallel to correlation, audit, policy patterns): - -### 10.1 Forbidigo lint - -```yaml -# .golangci.yml -linters-settings: - forbidigo: - forbid: - - p: '"[a-z_]+:[a-z_]+"' - msg: "Use auth.<PermissionConstant> from internal/auth/permissions.gen.go — raw permission strings drift" - # exclusions: internal/auth/* (the registry-loading code itself) -``` - -The pattern is intentionally broad and triggers on string literals that *look like* permissions. Reviewers add `//nolint:forbidigo` annotations on legitimate exceptions (test fixtures, schema validators). - -### 10.2 OpenAPI validator extension - -`scripts/validate-openapi.go` walks every operation: - -- Every `x-required-permission` value (or any-of/all-of list) resolves to a registry permission. -- License-gated permissions co-declare `x-required-feature` matching the registry's `license_gated`. -- Dangerous permissions co-declare `x-audit-events` non-empty. 
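The three validator checks above can be sketched per operation as follows. This is an illustration under stated assumptions, not the code in `scripts/validate-openapi.go`: the `Operation` struct and `ValidateOperation` function are hypothetical, only the single-permission form of `x-required-permission` is handled, and every operation passed in is assumed to declare a permission.

```go
package main

import "fmt"

// PermissionMeta holds the two registry fields the checks need.
type PermissionMeta struct {
	Dangerous    bool
	LicenseGated string // empty when not gated
}

// Operation is a hypothetical flattened view of one OpenAPI operation
// and its x-* extensions.
type Operation struct {
	OperationID        string
	RequiredPermission string   // x-required-permission (single form)
	RequiredFeature    string   // x-required-feature
	AuditEvents        []string // x-audit-events
}

// ValidateOperation applies the three checks and returns every violation.
func ValidateOperation(op Operation, registry map[string]PermissionMeta) []error {
	meta, ok := registry[op.RequiredPermission]
	if !ok {
		return []error{fmt.Errorf("%s: unknown permission: %s", op.OperationID, op.RequiredPermission)}
	}
	var errs []error
	if meta.LicenseGated != "" && op.RequiredFeature != meta.LicenseGated {
		errs = append(errs, fmt.Errorf("%s: %s requires x-required-feature %q",
			op.OperationID, op.RequiredPermission, meta.LicenseGated))
	}
	if meta.Dangerous && len(op.AuditEvents) == 0 {
		errs = append(errs, fmt.Errorf("%s: dangerous permission %s needs non-empty x-audit-events",
			op.OperationID, op.RequiredPermission))
	}
	return errs
}

func main() {
	registry := map[string]PermissionMeta{
		"host:read":   {},
		"host:delete": {Dangerous: true},
	}
	ops := []Operation{
		{OperationID: "getHost", RequiredPermission: "host:read"},
		{OperationID: "deleteHost", RequiredPermission: "host:delete"}, // missing audit events
	}
	for _, op := range ops {
		for _, err := range ValidateOperation(op, registry) {
			fmt.Println("openapi validation:", err)
		}
	}
}
```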
- -### 10.3 Behavioral spec - -`specs/system/rbac.spec.yaml` (post-Specter migration): - -```yaml -spec_id: system/rbac -status: active -acceptance_criteria: - - id: AC-1 - description: Permission registry validates against schema (ids, categories, license_gated cross-refs) - - id: AC-2 - description: Built-in role wildcards expand at codegen time - - id: AC-3 - description: Custom role create rejects unknown permissions - - id: AC-4 - description: Custom role create rejects bare wildcard "*" - - id: AC-5 - description: Combined RBAC+license middleware denies when license missing feature - - id: AC-6 - description: HasPermission honors category wildcards in custom roles - - id: AC-7 - description: Deprecated permissions are pruned from custom roles on next read - - id: AC-8 - description: Built-in roles cannot be modified via API (PUT/DELETE return 405) - - id: AC-9 - description: GET /auth/me/permissions includes license_gated_unavailable -``` - ---- - -## 11. Anti-patterns - -| Anti-pattern | What's wrong | What to do instead | -|--------------|--------------|---------------------| -| `if user.HasPermission("host:read")` | Raw string literal; drifts from the registry. | `user.HasPermission(auth.HostRead)`. Lint enforces. | -| Hardcoding role names in handler logic (`if user.RoleID == "admin"`) | Couples handler logic to a specific role; breaks when admins create custom roles with similar capability. | Check permissions, not roles. `if user.HasPermission(auth.AdminRoleManage)` is what you actually mean. | -| Adding a permission to a built-in role via DB UPDATE in production | Built-in role definitions are migration-driven. A direct UPDATE is invisible to release notes and may be reverted by the next migration. | Author a migration; ship in the next release. Or create a custom role for the customer's edge case. | -| Granting `*` to a custom role to "make it work" | Bare wildcard is reserved. 
Granting it via direct DB write bypasses validation but makes the role indistinguishable from `admin`. | Either use `admin` or list permissions explicitly. | -| Checking RBAC inside the handler instead of via middleware | Bypasses the codegen-driven license-gate co-check; surfaces inconsistent denial paths. | Declare `x-required-permission` in OpenAPI; let codegen wire the middleware. | -| Treating permissions as feature flags | They aren't. License features (`features.yaml`) are the feature-flag layer; permissions are RBAC. A permission says "user may do X"; a feature says "this build can do X." | Use both: license-gated permissions co-locate them. | - ---- - -## 12. Failure modes and edge cases - -| Scenario | Behavior | -|----------|----------| -| Permission added to registry but no handler references it | Build passes; the permission is dormant. UI permission selectors show it. Acceptable — not every registered permission must have a handler in the same release. | -| Permission referenced in OpenAPI but not in registry | Build fails with `unknown permission: foo:bar` | -| Built-in role definition references a permission that doesn't exist | Build fails at codegen. | -| Custom role in DB references a deprecated permission | Read endpoint returns the role with `deprecated_permissions: [...]`. Permission still works during deprecation window. After hard removal: pruned with audit event. | -| Custom role in DB references a permission that was hard-removed | Pruned at read time; `admin.role.changed` audit event with `detail.removed: [...]`. Role continues to function with remaining permissions. | -| User has zero roles assigned | `user.HasPermission(*)` always returns false. All non-public endpoints return 403. Audit event `authz.permission_denied`. | -| User assigned a role that was deleted | `lookupRole` returns false; that role contributes zero permissions. Other roles still apply. 
| -| License downgraded mid-session; user's role had `remediation:execute` | Permission check passes (role grants it); license check fails; user gets `402` with `error.code = "license.feature_unavailable"`. Per-call enforcement, no session invalidation. | -| Admin tries to update built-in role via PUT /admin/roles/admin | `405 Method Not Allowed` with `error.code = "resource.builtin"`. | -| Custom role create with 200 permissions including 50 dangerous | Allowed if all in registry, but warning array lists all 50 dangerous IDs; UI shows confirmation dialog before submit. | -| Two admins simultaneously create roles with the same id | First wins (`UNIQUE(id)`); second gets `409` with `error.code = "resource.conflict"`. | -| Wildcard `host:*` granted at time T; new permission `host:reboot` added at time T+1 | Custom role retains the wildcard, so it now also grants `host:reboot`. Built-in roles, having codegen-expanded lists, do NOT pick up the new permission until the next migration. This is a deliberate asymmetry: built-in role updates ship as releases (auditable), custom role updates are admin actions (auditable per assignment, but the permission set follows the wildcard semantics that were declared at creation). | - ---- - -## 13. 
Stage 0 vs Stage 2 split - -### Stage 0 ships (Day 8, after licensing on Day 7): - -- `auth/permissions.yaml` registry -- `internal/auth/permissions.gen.go` and `roles.gen.go` codegen -- Permission validator (`scripts/gen-rbac.go`) wired into CI -- `RequirePermission` middleware (with combined license-gate logic) -- OpenAPI validator extension for `x-required-permission` and cross-checks -- Migration `0004_roles.sql`: creates `roles` and `user_roles` tables; inserts the 5 built-in roles with `is_built_in=true` -- `GET /api/v1/auth/me/permissions` (returns built-in role expansion for current user; user model is stub until Stage 2 auth) -- `GET /api/v1/auth/permissions:registry` -- `GET /api/v1/admin/roles` (built-ins only) -- Forbidigo lint config for raw permission-string literals -- Stage 0 demo endpoint `POST /api/v1/diagnostics:require-host-read` declared with `x-required-permission: host:read` to verify the middleware fires - -### Stage 0 does NOT ship: - -- User model (Stage 2 auth slice) -- Custom-role CRUD (`POST/PUT/DELETE /admin/roles`) -- `:assign`/`:unassign` endpoints -- `:clone` sugar -- Role-management audit events tied to real users (the events exist in the audit registry; they emit when Stage 2 user management lands) - -The Stage 0 work is small (~600 LOC + registry + migration + lint config) but locks the contract before any consumer exists. Every Stage-2 endpoint that declares `x-required-permission` lands into a working middleware. - ---- - -## 14. 
Performance - -| Operation | Target | Notes | -|-----------|--------|-------| -| `RequirePermission` middleware overhead | < 1µs | Two map lookups (user → roles, role → permissions); no DB round-trip — user struct loaded by auth middleware upstream | -| Built-in role lookup | < 50ns | Compile-time map | -| Custom role lookup | < 100µs | Cached in process; cache invalidation on `admin.role.changed` audit event (Stage 2) | -| Wildcard match in custom role | < 200ns | `strings.HasSuffix` + `strings.Cut`; no regex | -| `GET /auth/me/permissions` | < 5ms | One DB read for user_roles join, one cache read for role definitions | - -The middleware is hot-path; it must not allocate. Codegen-expanded built-in role permissions are slices indexed by RoleID; the slice is read directly without copying. - ---- - -## 15. Open questions - -1. **Per-resource scoping** (data-level authorization). "User can read host X but not host Y." This is row-level / attribute-based access control; the registry handles role-level only. Defer to a separate ABAC design when needed; scoping logic lives in repository layer, not handlers. -2. **Permission groupings for UI** (e.g., "all host-management permissions" as a single checkbox in the role editor). The UI can render groupings from `categories`; no registry change needed. -3. **Time-bounded role assignments** ("user is `security_admin` until 2026-06-01"). Useful for short-term escalation. Defer; for now, admin manually unassigns. If demanded, add `expires_at` to `user_roles`. -4. **Permission usage telemetry** ("which permissions are never used in production?"). Helpful for retiring unused permissions. Out of scope for Stage 0; telemetry collection is its own initiative. -5. **Role inheritance** (`security_admin` inherits from `ops_lead` plus extras). Rejected for v1: clone-and-extend is simpler and the explicit permission list is what reviewers want to read. Revisit if role definitions grow past ~30 permissions and clone-drift becomes painful. 
-6. **Permissions for self-actions vs others** (e.g., `auth:write` is changing your own password; `user:write` is changing someone else's). The current design uses category separation (`auth:*` for self, `user:*` for admin-managed). Acceptable for now; revisit if self-vs-other surface grows. - ---- - -## Cross-references - -- License features: `licensing/features.yaml` — `license_gated` permissions reference these IDs. -- Audit events: `audit/events.yaml` — `authz.permission_denied`, `authz.role.assigned`, `authz.role.removed`, `admin.role.changed`, `license.feature_check_denied`. -- Error codes: `api/error_codes.yaml` — `authz.permission_denied`, `authz.role_required`, `license.feature_unavailable`, `validation.field_unknown`, `resource.conflict`, `resource.builtin` (to be added). -- Policies: `docs/engineering/policies_as_data.md` §5.2 — `approver_roles` cross-validates against active role set. -- API design: `docs/engineering/api_design_principles.md` §11 (extensions including `x-required-permission`). -- Roadmap: 2026-04-30 entries on this design. -- Stage 0: Day 8 (after licensing Day 7, before policies Day 9). diff --git a/docs/engineering/remediation_core_plan.md b/docs/engineering/remediation_core_plan.md deleted file mode 100644 index 93315bbf..00000000 --- a/docs/engineering/remediation_core_plan.md +++ /dev/null @@ -1,222 +0,0 @@ -# Remediation — OpenWatch Core (Free) Plan - -> **Companion doc:** [`remediation_licensed_plan.md`](remediation_licensed_plan.md) -> covers the OpenWatch+ (paid) half. **Forward-looking remainder context:** -> [`scan_remaining_work.md`](scan_remaining_work.md) (Phase 7). -> -> **Status:** scoping / design. No remediation handler, service, or schema -> exists yet — only the registries (RBAC, license feature, audit codes) and the -> OpenAPI skeleton. This doc defines what ships **free, in the AGPLv3 core**. - ---- - -## 1. 
Why a free/paid line exists here at all - -OpenWatch Core is **AGPLv3 + Managed Service Exception** (`LICENSE`). The MSE -restricts *offering OpenWatch as a hosted service to third parties*; it does -**not** grant feature tiering. Feature tiering is a separate **open-core / -dual-licensing** decision layered on top of the AGPL base, enforced by the -license subsystem (`internal/license/`, `licensing/features.yaml`, signed -Ed25519 JWTs minted by `cmd/owlicgen`). - -The product line is **"OpenWatch sees, plans, and governs remediation for -free; the act of mutating a host is OpenWatch+."** This doc is the *free* side -of that line. The paid side is the companion doc. - -> **AGPL implication, stated plainly.** Any code that ships in this core tree is -> source you are obliged to publish (AGPLv3 §13) and that a user may legally -> modify, including deleting a runtime license check (§2). So an in-core 402 -> gate is an *honor-system + friction* control, not DRM. That is an acceptable -> and common open-core posture for the manual-execution tier; the robustly -> gated capability (the auto-remediation engine) is treated differently in the -> companion doc. See Decision D-3 there. - ---- - -## 2. The boundary (what is free) - -| Capability | Free (this doc) | Licensed (companion) | -|---|---|---| -| View remediable findings, projected score lift | ✅ | | -| Request a remediation (`remediation:request`) | ✅ | | -| Approve / reject a request (`remediation:approve`) | ✅ | | -| View transaction history + signed evidence (`remediation:read`) | ✅ | | -| Configure the approvals policy (who approves, dual-approval) | ✅ | | -| **Execute a single-rule fix on a host** (`remediation:execute`) | ✅ | | -| **Rollback** (`remediation:rollback`) | ✅ | | -| Bulk / fleet remediation (many rules at once) | | ✅ `remediation_execution` | -| Auto-remediation policy engine (scheduled / policy-driven) | | ✅ `remediation_execution` | - -> **Boundary update (2026-06-18):** the free/paid line moved. 
Per-rule **manual -> execute + rollback are now free core** (Tier A) — the requester gets a **Fix** -> button on their approved request and applies the fix to that one finding. The -> OpenWatch+ `remediation_execution` feature now gates **bulk** (many rules / -> fleet) and **auto** remediation only. Because Tier A is free, its execution -> engine lives in-core (AGPL); the open-core "separate plugin" option applies -> only to the bulk/auto engine. - -> **Execution status (2026-06-18): live and working as of kensa v0.5.1.** The -> full execute path — apply-enabled SSH transport, `kensa.Remediate`/`Rollback` -> wiring (`internal/kensa/remediatefunc.go`), the queued remediation worker -> (`internal/worker/remediation_worker.go`), the `:execute`/`:rollback` handlers, -> and the lifecycle-aware **Fix** button — is implemented and **verified end to -> end against a real host**: an approved `cron-d-permissions` fix applied -> `/etc/cron.d` `755`→`700` (committed, rule flipped to pass, score moved), then -> rollback restored `755`. -> -> The first live test (on kensa v0.5.0) surfaced a real upstream blocker: -> kensa kept its apply handlers in `kensa/internal/handlers/*`, registered only -> via blank imports internal to the kensa module, so an external consumer could -> not register them and `Kensa.Remediate` failed preflight -> (`mechanism "file_permissions" is not registered`). Filed as -> [kensa #94](https://github.com/Hanalyx/kensa/issues/94); fixed in **kensa -> v0.5.1** (public `pkg/kensa/handlers` bundle auto-registered by `Default*`). -> OpenWatch needed only the version bump. `friendlyTxnErr` in -> `remediatefunc.go` is retained as defense-in-depth against any future -> packaging regression (a "not registered" failure is always before any apply, -> so no host is changed). 
- -The free tier is a complete **see-and-govern** loop: an operator can discover -what is fixable, understand the projected compliance-score impact, request the -fix, route it through approval, and audit every fix that was applied. The one -thing it cannot do is pull the trigger on a host mutation — that is the paid -moment, and the upsell is honest because the whole workflow up to it is free. - -This matches OpenWatch's "The Eye" visibility-first positioning and the risk -gradient ratified in `scan_remaining_work.md` (read-only is safe; host mutation -has blast radius). - ---- - -## 3. What already exists (build on, do not re-create) - -- **RBAC** (`auth/permissions.yaml` → `internal/auth/permissions.gen.go`): - `remediation:read`, `:request`, `:approve` (free); `:execute`, `:rollback` - (`license_gated: remediation_execution`, `dangerous: true`). -- **Audit codes** (`audit/events.yaml`): `remediation.requested`, - `remediation.approved`, `remediation.executed`, `remediation.rolled_back`. -- **OpenAPI skeleton** (`api/remediation.yaml`, fidelity = skeleton): the full - lifecycle `request → approve → dry-run → execute → rollback`, with read - endpoints explicitly un-gated and act endpoints gated. Also - `api/scans.yaml` → `POST /scans/{scan_id}:remediate` (create-from-findings). -- **Kensa** (`internal/kensa/`, kensa **v0.5.0**): `executor.go` wired for - scans; `transport.go` implements `Run` (scan path), with `Put`/`Get` stubbed - (`ErrTransportOpNotSupported`) pending a remediation payload-upload need. The - Kensa transaction model is `Capture → Apply → Validate → Commit`, with - automatic pre-state restore on validation failure. -- **License subsystem** (`internal/license/`): `EnforcePermission` / - `EnforceFeature` / `RequireFeature`, 402-on-deny with rate-limited audit; - free tier with no license file; SIGHUP reload. 
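The `Capture → Apply → Validate → Commit` model named in the Kensa bullet above, with pre-state restore on validation failure, can be sketched roughly as follows. This is a minimal illustration, not Kensa's API: `Step`, `Run`, `ChmodStep`, and `ErrInvalid` are hypothetical names, the "host" is an in-memory string, and restoring on an apply failure (not only on a validation failure) is an assumption of the sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalid is a hypothetical validation error used by this sketch.
var ErrInvalid = errors.New("validation failed")

// Step is a hypothetical stand-in for one remediation mechanism.
// It is not Kensa's handler interface.
type Step struct {
	Capture  func() (string, error) // record the pre-state before touching anything
	Apply    func() error           // mutate the target
	Validate func() error           // confirm the desired state was reached
	Restore  func(pre string) error // put the captured pre-state back
}

// Run drives Capture, Apply, Validate and then commits. It reports
// "committed" on success and "rolled_back" when the pre-state was restored.
func Run(s Step) (string, error) {
	pre, err := s.Capture()
	if err != nil {
		// Nothing has been changed yet, so there is nothing to restore.
		return "", fmt.Errorf("capture: %w", err)
	}
	if err := s.Apply(); err != nil {
		return restore(s, pre, fmt.Errorf("apply: %w", err))
	}
	if err := s.Validate(); err != nil {
		return restore(s, pre, fmt.Errorf("validate: %w", err))
	}
	return "committed", nil // commit point
}

// restore puts the pre-state back and keeps the original failure as the cause.
func restore(s Step, pre string, cause error) (string, error) {
	if rerr := s.Restore(pre); rerr != nil {
		return "", errors.Join(cause, fmt.Errorf("restore: %w", rerr))
	}
	return "rolled_back", cause
}

// ChmodStep models a permission change on an in-memory mode string
// (an assumption made so the sketch runs without a real host).
func ChmodStep(mode *string, target string) Step {
	return Step{
		Capture: func() (string, error) { return *mode, nil },
		Apply:   func() error { *mode = target; return nil },
		Validate: func() error {
			if *mode != target {
				return ErrInvalid
			}
			return nil
		},
		Restore: func(pre string) error { *mode = pre; return nil },
	}
}

func main() {
	mode := "755"
	result, err := Run(ChmodStep(&mode, "700"))
	fmt.Println(result, err, mode)
}
```

Keeping the captured pre-state next to the phase result is what would let a later rollback reuse the same record, which is the role the plan assigns to the transaction journal.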
- -**Does not exist:** any `remediations` migration (next number is **0037**), any -remediation handler/service, any frontend beyond the placeholder Remediation -tab (`HostDetailPage.tsx`, "deferred (BACKLOG)"). - ---- - -## 4. Architecture — what the core owns - -The data model and state machine are built **in core** because both the free -governance path and the paid execution path read and write the same tables. -Only the *act* handlers carry the license check. - -### 4.1 Schema (migration `0037_remediation.sql`) - -- `remediation_requests` — one row per requested fix. - `id`, `host_id`, `rule_id`, `scan_run_id` (provenance), - `status` (`pending_approval | approved | rejected | dry_run_complete | - executing | executed | rolled_back | failed`), `requester_id`, - `approver_id`, `created_at`, `decided_at`, projected-lift snapshot - (`projected_cis`, `projected_stig`, `projected_nist`), `mechanism` - (kensa handler id), `reboot_required bool`, `transactional bool`. -- `remediation_transactions` — the Kensa per-rule transaction journal: `id`, - `request_id`, `kensa_txn_id`, `phase_result` (`committed | rolled_back | - skipped`), `pre_state` (captured), `evidence` (content-addressed, mirrors the - `scan_results` store pattern), `applied_at`. This is the durable rollback - point and the signed-evidence record (`kensa verify`). - -State transitions only ever move forward except the `:rollback` path -(`executed → rolled_back`). The journal is append-only. - -### 4.2 Service (`internal/remediation/`) - -- `Request(...)`, `Approve(...)`, `Reject(...)` — free verbs; pure state - transitions + audit, no host contact. -- `ProjectLift(...)` — read-only: compute the predicted CIS/STIG/NIST delta if a - rule (or set) flips to pass, from the current `host_rule_state` + framework - mappings. Powers the "Projected lift" UI. No mutation. 
-- The mutating methods (`DryRun`, `Execute`, `Rollback`) are **defined in core** - but their handlers call `EnforceFeature(remediation_execution)` before - touching a host (see companion doc). The Kensa apply/rollback plumbing - (`transport.Put`/`Get` if a mechanism needs to push a payload) lands here. - -### 4.3 API (core-owned, free endpoints) - -From the existing `api/remediation.yaml` skeleton, promote to full fidelity the -un-gated endpoints: - -- `GET /api/v1/remediation/requests` (list, filter) -- `GET /api/v1/remediation/requests/{id}` (+ `/steps`, `/audit`) -- `POST /api/v1/remediation/requests` (`remediation:request`) -- `POST /api/v1/remediation/requests/{id}:approve` (`remediation:approve`) -- `POST /api/v1/remediation/requests/{id}:reject` -- `POST /api/v1/scans/{scan_id}:remediate` (create requests from findings) - ---- - -## 5. Frontend (free surfaces) - -- **Compliance tab → Top failed rules** (`HostDetailPage`): each failed rule - gets a **"Request remediation"** affordance (prototype shows "Remediate"; the - free action is *request*, which routes to approval). Shows the per-rule - projected lift. -- **Remediation tab** (read surfaces only in the free build): the "How each fix - runs · Capture → Apply → Validate → Commit" explainer, the - `committed/rolled_back/skipped` legend, the **Recent transactions** table with - signed-evidence verification, and per-request status. The **Remediate / - Rollback buttons render as upsell** (disabled with an "OpenWatch+" affordance) - when the license lacks `remediation_execution`. The frontend does not gate - today (backend-only enforcement); this adds the first license-aware UI. -- **Projected lift** display is free everywhere it appears (planning is free; - applying is paid). - ---- - -## 6. Specs to author (SDD) - -- `system-remediation` — the request/approve state machine, schema invariants, - audit emission, the free/paid verb split as a constraint. 
-- `api-remediation` — promote `api/remediation.yaml` from skeleton; ACs for the - free endpoints + the 402 contract on the act endpoints. -- `frontend-remediation-tab` — the read surfaces + the request affordance + the - license-upsell rendering. - -Register in `specter.yaml`; annotate tests with `// @spec` + `// @ac`. - ---- - -## 7. Sequencing - -1. Migration `0037` + `internal/remediation` service (state machine + projection, - no host contact). Backend-first, the same layering used for exceptions. -2. Free API endpoints (`request`/`approve`/`reject`/list/get) + audit wiring. -3. Frontend: request affordance on the Compliance tab + read surfaces on the - Remediation tab + license-upsell rendering of the act buttons. -4. Hand off the **act** verbs (`dry-run`/`execute`/`rollback`) to the companion - doc's Tier A, which reuses this schema and service. - -This is the GA **beta** remediation slice's free half. Execution is beta-in-GA -per `scan_remaining_work.md`; the free governance loop can ship first and stand -on its own. - ---- - -## 8. Open decisions (carried from the design discussion) - -- **D-1 (line placement).** Keep "any host mutation = paid" (current in-tree - encoding, recommended), or carve out free *manual single-host single-rule* - execution? Keeping it is cleaner and is what the registry already encodes; - the cost is a possible "approve, then paywall at execute" funnel feel, - mitigated by honest upsell copy. **Recommend: keep.** -- **D-2 / D-3** are about the paid tiers and the enforcement model — see the - companion doc. diff --git a/docs/engineering/remediation_exception_governance.md b/docs/engineering/remediation_exception_governance.md deleted file mode 100644 index b5bee7c6..00000000 --- a/docs/engineering/remediation_exception_governance.md +++ /dev/null @@ -1,87 +0,0 @@ -# Remediation & Exception Governance — Role Matrix - -> **Status:** Current as of 2026-06-19. 
-> **Authority:** `auth/permissions.yaml` is the source of truth for who can do -> what (codegen produces `internal/auth/permissions.gen.go` / `roles.gen.go`). -> This document is a human-readable view of it; if the two disagree, the YAML -> wins and this doc is stale. -> **Audience:** Operators deciding how to assign roles, and engineers working on -> the remediation / exception lifecycles. - -This is the answer to "which role can **request**, **approve/reject**, and -**execute** remediation and exceptions." Two governed lifecycles share the same -separation-of-duties rule. - ---- - -## Built-in roles (least → most privilege) - -`viewer` → `auditor` → `ops_lead` → `security_admin` → `admin` - -`admin` holds the `*` wildcard (every permission). Custom roles may be created -and are validated against the permission registry. - -## Remediation - -| Action | permission | viewer | auditor | ops_lead | security_admin | admin | -|--------|------------|:------:|:-------:|:--------:|:--------------:|:-----:| -| View requests/history | `remediation:read` | ✓ | ✓ | ✓ | ✓ | ✓ | -| **Request** | `remediation:request` | | | ✓ | ✓ | ✓ | -| **Approve / Reject** | `remediation:approve` | | | | ✓ | ✓ | -| Execute (Fix) | `remediation:execute` | | | ✓ | ✓ | ✓ | -| Rollback | `remediation:rollback` | | | ✓ | ✓ | ✓ | - -Note the deliberate asymmetry: **`ops_lead` can request and execute remediation -but cannot approve it** — approval needs `security_admin` or `admin`. - -`remediation:execute` and `remediation:rollback` are **free core** (single-rule -manual). Bulk and automated remediation is the licensed track, gated separately -at the handler via `license.EnforceFeature(remediation_execution)` — not via a -permission. 
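As a reading aid, the remediation matrix above can be expressed as a small lookup. This is a hand-written mirror of the table for illustration only; the real grants come from `auth/permissions.yaml` and the generated code, and the names `remediationGrants` and `Can` are hypothetical.

```go
package main

import "fmt"

// remediationGrants mirrors the remediation rows of the role matrix.
// It is illustrative only; auth/permissions.yaml remains the source of truth.
var remediationGrants = map[string][]string{
	"viewer":   {"remediation:read"},
	"auditor":  {"remediation:read"},
	"ops_lead": {"remediation:read", "remediation:request", "remediation:execute", "remediation:rollback"},
	"security_admin": {
		"remediation:read", "remediation:request", "remediation:approve",
		"remediation:execute", "remediation:rollback",
	},
	"admin": {"*"}, // wildcard: every permission
}

// Can reports whether a built-in role holds the given permission.
func Can(role, permission string) bool {
	for _, granted := range remediationGrants[role] {
		if granted == "*" || granted == permission {
			return true
		}
	}
	return false
}

func main() {
	for _, role := range []string{"ops_lead", "security_admin"} {
		fmt.Printf("%s approve=%v execute=%v\n",
			role, Can(role, "remediation:approve"), Can(role, "remediation:execute"))
	}
}
```

The lookup makes the asymmetry easy to check: `ops_lead` holds `remediation:request` and `remediation:execute` but not `remediation:approve`.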
- -## Exceptions - -| Action | permission | viewer | auditor | ops_lead | security_admin | admin | -|--------|------------|:------:|:-------:|:--------:|:--------------:|:-----:| -| View | `exception:read` | ✓ | ✓ | ✓ | ✓ | ✓ | -| **Request** | `exception:request` | | ✓ | ✓ | ✓ | ✓ | -| Comment | `exception:comment` | | ✓ | ✓ | ✓ | ✓ | -| **Approve** | `exception:approve` | | ✓ | | ✓ | ✓ | -| Revoke | `exception:revoke` | | | | ✓ | ✓ | - -Note the asymmetry mirrors remediation in reverse: **`auditor` can approve -exceptions but not remediation**, and **`ops_lead` can request exceptions but not -approve them**. - -## Separation of duties (self-review rule) - -For **both** lifecycles, the reviewer must differ from the requester. Approving -or rejecting your own request is refused with **409 `self_review`** — and there -is **no bypass**: not for `admin`, and there is no config flag. - -- Remediation: `internal/remediation/service.go` (`ErrSelfReview`) -- Exceptions: `internal/exception/service.go` - -**One-operator note:** because of this rule, a single-operator workspace cannot -complete the request → approve flow today. The resolution for the free tier is -[Remediation Approval Governance (ADR)](remediation_governance_adr.md): free-core -single-rule remediation will not require a separate approval; the approval gate -(with self-review) is reserved for the licensed bulk/auto track. Until that lands, -two distinct users are required to approve any remediation/exception. - -## On `approver_roles` policies - -The policies-as-data framework registers an `approvals` policy *type* -(`internal/policy/types.go`), but **no `approvals` policy is currently -configured**, and no code reads `approver_roles`. The enforced approval gate -today is purely the `remediation:approve` / `exception:approve` **permission** -above. 
If an `approvals` policy is ever added, its `approver_roles` must be a -subset of the roles that hold the corresponding `*:approve` permission, or the -policy can name a role that cannot actually approve. - -## References - -- Source of truth: `auth/permissions.yaml` -- RBAC registry: [rbac_registry.md](rbac_registry.md) -- Decision record: [remediation_governance_adr.md](remediation_governance_adr.md) -- Operator guide: [../guides/HOSTS_AND_REMEDIATION.md](../guides/HOSTS_AND_REMEDIATION.md) diff --git a/docs/engineering/remediation_governance_adr.md b/docs/engineering/remediation_governance_adr.md deleted file mode 100644 index 259fe315..00000000 --- a/docs/engineering/remediation_governance_adr.md +++ /dev/null @@ -1,112 +0,0 @@ -# Remediation Approval Governance (ADR) - -> **Status:** Accepted 2026-06-19. Implementation pending (the conditional-approval -> path is not yet built; today every remediation request goes through -> request → approve → execute). -> **Authority:** This document is the decision record for *when* a remediation -> requires human approval. The role/permission matrix that backs it is -> [remediation_exception_governance.md](remediation_exception_governance.md); -> the permission source of truth is `auth/permissions.yaml`. -> **Audience:** Anyone implementing or specing the remediation lifecycle, and -> anyone scoping the OpenWatch+ licensed remediation track. - ---- - -## Context - -Remediation is open-core. The boundary, decided separately, is: - -- **Free core:** per-rule **manual** remediation — an operator fixes one finding - on one host, and can roll it back. -- **OpenWatch+ (licensed):** **bulk and automated** remediation — apply many - rules / fleet-wide, and policy-driven auto-remediation. Gated at the handler - via `license.EnforceFeature(remediation_execution)`. 
-
-The shipped lifecycle is a single state machine with a human approval gate:
-
-```
-Request → pending_approval → (Approve) → approved → (MarkExecuting) → executing → executed → rolled_back
-              │                              (failed, dry_run_complete are side branches)
-              └── (Reject) → rejected
-```
-
-Approval enforces **separation of duties**: the reviewer must differ from the
-requester. This is hard-coded with no bypass (`internal/remediation/service.go`,
-`if requestedBy == reviewedBy { return ErrSelfReview }`) and the execute handler
-refuses anything not in `approved` state
-(`internal/server/remediation_handlers.go`, 409 `only an approved request can be
-executed`).
-
-**The problem this ADR resolves:** that gate makes the product unusable for a
-single operator. A lone administrator can request but can never approve their
-own request (409 `self_review`, even as `admin`), so they never reach Fix. The
-same applies to compliance exceptions. Requiring approval here also buys *no*
-separation of duties — the requester and the approver would be the same human.
-
-## Decision
-
-**Keep the governance machinery; make the human approval step *conditional* on
-the remediation track ("A-keep").**
-
-- **Free-core, single-rule manual remediation does not require a separate human
-  approval.** A free-core request reaches an executable state directly (auto-approved
-  on creation, or a `ready` state the execute handler also accepts). The operator
-  clicks **Fix**; there is no `pending_approval` interstitial.
-- **The licensed bulk / auto-remediation track keeps the full request → approve →
-  execute flow with the self-review separation-of-duties guard.** This is where an
-  approval gate carries real risk-management value (many rules, fleet-wide, or
-  unattended), and where multiple roles realistically exist.
-
-We do **not** delete the governance code. It is exactly the machinery the
-licensed track needs.
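One way the conditional approval step could look is sketched below. It picks the "auto-approved on creation" option for free-core requests, which is only one of the two options the decision names, and it is not the code in `internal/remediation/service.go`: the `Request` type, `NewRequest`, the `licensedTrack` flag, and `ErrBadState` are hypothetical, and `ErrSelfReview` here is a local stand-in for the real error.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrSelfReview stands in for the self-review refusal (reviewer == requester).
	ErrSelfReview = errors.New("self_review")
	// ErrBadState is a hypothetical error for a transition from the wrong state.
	ErrBadState = errors.New("invalid state transition")
)

// Request is a hypothetical, in-memory remediation request.
type Request struct {
	RequestedBy string
	Status      string
}

// NewRequest creates a request. Licensed-track requests wait for approval;
// free-core requests are auto-approved on creation (assumed option).
func NewRequest(requester string, licensedTrack bool) *Request {
	status := "approved"
	if licensedTrack {
		status = "pending_approval"
	}
	return &Request{RequestedBy: requester, Status: status}
}

// Approve applies the separation-of-duties guard with no bypass.
func (r *Request) Approve(reviewer string) error {
	if r.Status != "pending_approval" {
		return ErrBadState
	}
	if reviewer == r.RequestedBy {
		return ErrSelfReview
	}
	r.Status = "approved"
	return nil
}

// MarkExecuting refuses anything that is not in the approved state.
func (r *Request) MarkExecuting() error {
	if r.Status != "approved" {
		return ErrBadState
	}
	r.Status = "executing"
	return nil
}

func main() {
	free := NewRequest("alice", false)
	fmt.Println("free-core:", free.MarkExecuting(), free.Status)

	licensed := NewRequest("alice", true)
	fmt.Println("self-approve:", licensed.Approve("alice"))
	fmt.Println("peer-approve:", licensed.Approve("bob"), licensed.Status)
}
```

Under this sketch the execute guard stays identical for both tracks; only the initial state differs, which keeps the approval code intact for the licensed path.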
- -## Consequences - -**Stays, unchanged:** - -- The `remediation_requests` + `remediation_transactions` tables (migration 0037) - — every request and its transactions are still recorded for audit, history, and - rollback, approval or not. -- The execution half of the state machine (`executing → executed → rolled_back → - failed`, `dry_run_complete`), `MarkExecuting`, `RecordExecution`, the - `RemediationWorker`, the execute/rollback handlers, the - `remediation:execute` / `remediation:rollback` permissions, and the frontend - Fix/rollback UI. - -**Stays, but becomes conditional — reserved for the licensed track:** - -- `Request` / `Approve` / `Reject`, the self-review guard, the - `pending_approval` / `approved` / `rejected` states, and the - `remediation:request` / `remediation:approve` permissions. - -**Changes (small, surgical):** - -1. A free-core single-rule request reaches an executable state without a human - approval transition. -2. UI: the Fix button is live immediately for free-core (no pending-approval step). -3. Specs/tests: the `api-remediation` ACs that assert "must be approved before - execute" split into free-core (no approval) vs. licensed (approval + self-review). - The self-review test stays, retargeted to the licensed path. - -**Accepted trade-off:** until the bulk/auto track ships, the approve/reject/ -self-review code is present-but-dormant (exercised only by its tests). We accept -carrying it rather than deleting working, tested code and rebuilding it later. - -## Alternatives considered - -- **Single-operator mode (config flag relaxing self-review).** Viable, but adds a - config surface and an "I approved my own request" audit nuance. The conditional - split achieves the same outcome for the free tier without a flag. -- **Require a second approver account.** Rejected as the *only* answer: it is poor - UX and, since the same human clicks both, delivers no real separation of duties. 
-- **A-defer (strip governance now, rebuild for the licensed track).** Rejected: - throws away working, tested, just-merged code to rebuild the same machinery later. - -## References - -- Role/permission matrix + self-review rule: - [remediation_exception_governance.md](remediation_exception_governance.md) -- Permission source of truth: `auth/permissions.yaml` -- RBAC registry: [rbac_registry.md](rbac_registry.md) -- Lifecycle code: `internal/remediation/`, `internal/server/remediation_handlers.go` -- Spec: `specs/api/remediation.spec.yaml` diff --git a/docs/engineering/remediation_licensed_plan.md b/docs/engineering/remediation_licensed_plan.md deleted file mode 100644 index 43e57c75..00000000 --- a/docs/engineering/remediation_licensed_plan.md +++ /dev/null @@ -1,201 +0,0 @@ -# Remediation — OpenWatch+ (Licensed) Plan - -> **Companion doc:** [`remediation_core_plan.md`](remediation_core_plan.md) -> covers the free AGPLv3 half (see-and-govern). This doc covers the **paid** -> capabilities: the act of mutating a host, and the fleet automation on top. -> -> **Status:** scoping / design. Builds on the same schema + `internal/remediation` -> service defined in the core doc; adds the license-gated act path and a -> second feature for the automation engine. -> -> **Ratified (2026-06-18):** **auto-remediation is an OpenWatch+ (licensed) -> feature.** Tier B below is paid; it is not part of the free AGPL core. The -> remaining open points are the *granularity* (own key vs. shared) and *SKU -> level* of that gate (D-2) and *where the code lives* (D-3). - ---- - -## 1. The two paid tiers - -The prototype shows two distinct paid surfaces with very different value and -risk. They should be **two feature keys**, not one (Decision D-2). 
- -| Tier | Feature key | Capability | Prototype surface | -|---|---|---|---| -| **A — Apply** | `remediation_execution` *(exists)* | Dry-run, execute, and rollback a fix on a **single host**, operator-driven, one rule (or one request) at a time. | Host Detail → Compliance "Remediate", Remediation tab per-txn **Rollback** | -| **B — Automate** | `remediation_auto` *(proposed, new)* | Fleet/bulk remediation, remediation **groups**, and the **auto-remediation policy engine**: per-severity auto-fix/approve/off, scope-by-group, canary-first, max-changes-per-run, circuit breaker, scheduled playbooks. | Scans → **Configuration** (auto-remediation), Host Detail "Remediate all · groups" | - -Tier A is "let me fix this one thing and prove it." Tier B is "keep the fleet -compliant without me clicking" — the most powerful and most dangerous surface, -and where the commercial value concentrates. - ---- - -## 2. Tier A — `remediation_execution` (the act of applying) - -### 2.1 What it is - -The three act verbs already gated in `auth/permissions.yaml`: -`remediation:execute` (dry-run + execute) and `remediation:rollback`, both -`license_gated: remediation_execution`, `dangerous: true`. The skeleton -`api/remediation.yaml` already marks `:dry-run`, `:execute`, `:rollback` as -requiring the feature. - -### 2.2 Where the code lives — Decision D-3 - -Tier A is built **in the core tree** (`internal/remediation`), gated at the -handler by `EnforceFeature(remediation_execution)`. This is an *honor-system + -friction* gate: under AGPLv3 the execute code is publishable source a user -could recompile without the check. **That is an accepted posture for Tier A** -because: - -- The manual single-host primitive (apply one Kensa transaction, capture - pre-state, rollback) is small and intrinsic to the remediation engine the - free governance loop already references. -- The license here is about legitimacy, support, and audit, not DRM. 
- -The robust open-core treatment is reserved for Tier B (§3.3), where it is worth -the architectural cost. - -### 2.3 Execution model (first slice) - -Per-rule, per-host, **approval-gated**, **snapshot + rollback** — exactly the -`scan_remaining_work.md` first slice. Flow: - -1. `:dry-run` — Kensa `Capture → Apply → Validate` with no `Commit`; returns the - would-be transaction + projected lift. Free users see the *plan* (read), paid - users can *run* the dry-run. -2. `:execute` — full `Capture → Apply → Validate → Commit`; writes the - `remediation_transactions` journal row with signed evidence; re-scan the - rule to confirm state flip; emit `remediation.executed`. -3. `:rollback` — restore from the captured pre-state; emit - `remediation.rolled_back`. - -### 2.4 Kensa work - -- Wire `kensa.Remediate()` (available in v0.5.0) through `internal/kensa`. -- Implement transport `Put`/`Get` **only if** a mechanism needs to push a helper - payload (`transport.go` currently returns `ErrTransportOpNotSupported`). The - scan path proves `Run` is enough for command-based checks; many handlers - (`config_set`, `service_enabled`, `sysctl_set`) are command-only. - -### 2.5 API + audit - -Promote the act endpoints in `api/remediation.yaml` to full fidelity; they -already carry `x-required-feature: remediation_execution` and 402 responses. -Audit codes `remediation.executed` (with `dry_run` flag, steps succeeded/failed) -and `remediation.rolled_back` already exist. - ---- - -## 3. Tier B — `remediation_auto` (the automation engine) - -### 3.1 What it is (the prototype's Scans → Configuration screen) - -- **Policy by severity** — High/Med/Low each: auto-fix · require-approval · off. -- **Scope & guardrails** — auto-remediate only in named groups - (e.g. "Development only"); **canary-first** (one host, validate, then the - rest); **max changes per run**; **circuit breaker** (pause all auto-remediation - if rollbacks exceed N). 
-- **Bulk / groups** — "Remediate all High & Med · N rules"; themed - **remediation groups** ("Harden SSH", "Enable firewall", "Install auditd") - each showing the multi-framework lift. -- **Scheduled / cadence playbooks** — auto-remediation on the adaptive schedule. - -### 3.2 Hard dependency — Kensa rule ordering (carries `scan_remaining_work.md` D-4) - -Bulk and grouped remediation need rule **ordering / grouping** metadata -(`depends_on` / `conflicts` / `supersedes`). Kensa's `LoadRules` deliberately -does **not** expose this today. Tier B's groups and "remediate all" are -**blocked on a Kensa-team ratification**, not an OpenWatch-only build. Per-rule -manual (Tier A) has no such dependency, which is one more reason Tier A ships -first. - -### 3.3 Where the code lives — Decision D-3 (the robust seam) - -Tier B is the right place to spend the open-core architecture cost. Recommended: -build the **auto-remediation policy engine as a separate licensed module** -loaded through the existing plugin interface (ORSA), **not** in the AGPL core. -Rationale: - -- It is the flagship paid capability and the most defensible to truly gate. -- It is the highest blast-radius surface (unattended fleet mutation); keeping it - behind a real boundary is also a safety win, not only a licensing one. -- A module that is physically absent without a license is an *enforceable* cap, - unlike the in-core honor-system gate acceptable for Tier A. - -The core exposes the Tier-A primitive (apply one rule, rollback) as the -interface; the Tier-B module orchestrates it (policy evaluation, fleet fan-out, -canary, circuit breaker, scheduling). 
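The orchestration role described above (fleet fan-out with canary-first, a max-changes cap, and a circuit breaker) might look roughly like the sketch below. Everything in it is hypothetical: `Guardrails`, `Outcome`, and `FanOut` are illustrative names, the `apply` callback stands in for the Tier-A single-rule primitive, and counting every attempted apply toward the max-changes cap is an assumption.

```go
package main

import "fmt"

// Guardrails holds hypothetical per-run limits for an auto-remediation pass.
type Guardrails struct {
	MaxChanges   int // stop after this many attempted applies
	MaxRollbacks int // trip the breaker once rollbacks exceed this count
}

// Outcome summarises one fan-out run.
type Outcome struct {
	Attempted int
	Rollbacks int
	Halted    string // "", "canary_failed", "circuit_breaker", or "max_changes"
}

// FanOut applies a fix host by host. The first host acts as the canary: if it
// rolls back, nothing else is touched. apply returns true when the change
// committed and false when it was rolled back.
func FanOut(hosts []string, g Guardrails, apply func(host string) bool) Outcome {
	var out Outcome
	for i, host := range hosts {
		if out.Attempted >= g.MaxChanges {
			out.Halted = "max_changes"
			break
		}
		committed := apply(host)
		out.Attempted++
		if committed {
			continue
		}
		out.Rollbacks++
		if i == 0 {
			out.Halted = "canary_failed"
			break
		}
		if out.Rollbacks > g.MaxRollbacks {
			out.Halted = "circuit_breaker"
			break
		}
	}
	return out
}

func main() {
	failing := map[string]bool{"web-2": true, "web-3": true}
	out := FanOut(
		[]string{"web-1", "web-2", "web-3", "web-4"},
		Guardrails{MaxChanges: 10, MaxRollbacks: 1},
		func(host string) bool { return !failing[host] },
	)
	fmt.Printf("%+v\n", out)
}
```

Because `FanOut` only calls the single-host primitive through `apply`, a module shaped like this could sit outside the core and still reuse the Tier-A path unchanged.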
- -### 3.4 Feature registry change - -Add to `licensing/features.yaml`: - -```yaml - - id: remediation_auto - tier: openwatch_plus # or `enterprise` if it should be a higher SKU — D-2 - description: Policy-driven and fleet/bulk auto-remediation (canary, circuit - breaker, scheduled playbooks, remediation groups) - introduced: "<next release>" -``` - -Then `go generate ./internal/license/...`, reference it from the auto-remediation -routes' `x-required-feature`, and (if a new perm is warranted) a -`remediation:auto` permission gated on it. CI (`scripts/validate-features.go`) -enforces that every gated reference resolves to a registered feature. - -### 3.5 New surfaces - -- **API:** an auto-remediation policy resource (`GET/PUT - /api/v1/remediation/policy`), bulk/group execute endpoints, all gated on - `remediation_auto`. -- **Frontend:** the Scans → Configuration auto-remediation panel and the Host - Detail "Remediate all / groups" cards, rendered as upsell when unlicensed. -- **Audit:** likely new codes for policy changes and auto-runs - (`remediation.policy.changed`, `remediation.auto.run`) — register before use. - ---- - -## 4. Specs to author (SDD) - -- `system-remediation-policy` — the policy data model, severity routing, - guardrails (canary, max-changes, circuit breaker), and the Tier-B module - boundary. -- `api-remediation` (extend) — the gated act endpoints (Tier A) and the policy / - bulk endpoints (Tier B), 402 contracts. -- `frontend-scan-remediation-config` — the Scans Configuration auto-remediation - surface + upsell rendering. - ---- - -## 5. Sequencing - -1. **Tier A first**, on the core schema/service: wire `kensa.Remediate`, - build `:dry-run`/`:execute`/`:rollback` gated by `remediation_execution`, - per-rule manual + approval + rollback. Ship as GA **beta**. -2. Prove Tier A on a real host (the test fleet) before any automation. -3. 
**Tier B second**, after (a) Kensa ratifies rule ordering (§3.2) and (b) the - `remediation_auto` feature + plugin module boundary are agreed. Build the - policy engine in the licensed module; start with bulk/manual groups, then - the auto/scheduled posture last (riskiest). - ---- - -## 6. Open decisions - -- **D-1 (line placement).** Recommended: keep "any host mutation = paid" - (Tier A gates all of dry-run/execute/rollback). See core doc §8. -- **D-2 (one tier or two).** *Auto-remediation is licensed — ratified - 2026-06-18.* Still to confirm: give it its **own key** `remediation_auto` - (recommended, so it can be priced/tiered independently of single-host apply) - vs. folding it under the existing `remediation_execution`; and whether that key - is `openwatch_plus` or a higher `enterprise` SKU. -- **D-3 (enforcement model / code location).** Recommended graduated answer: - Tier A **in-core, honor-system gate** (pragmatic, small primitive); Tier B as - a **separate licensed plugin module** (robustly enforceable, safety boundary). - This is the most consequential fork — it sets where the auto-remediation - engine gets built. -- **D-4 (Kensa ordering).** Bulk/grouped remediation requires a Kensa-team - ratification of rule ordering before it can be built. Tracks - `scan_remaining_work.md` decision #4. diff --git a/docs/engineering/reports_design.md b/docs/engineering/reports_design.md deleted file mode 100644 index 475ab9d1..00000000 --- a/docs/engineering/reports_design.md +++ /dev/null @@ -1,613 +0,0 @@ -# Reports — Design & Architecture - -**Status:** Proposed (design). Supersedes the thin executive-only MVP. -**Last updated:** 2026-06-21 -**Owner:** (unassigned) -**Related:** `internal/report/`, `internal/scanresult/`, `internal/fleetrollup/`, -`internal/posture/`, `internal/exception/`, `internal/queue/`, -`specs/api/reports.spec.yaml`, `specs/frontend/reports.spec.yaml`, -prototype `docs/engineering/prototypes/openwatch-v1/Reports.html`. - ---- - -## 0. 
Why this doc exists - -`/reports` today is a deliberately thin MVP: one report kind -(`executive`), JSON only, generated synchronously in the request, no -export, no signing, no scheduling, no scope picker. The Templates and -Scheduled tabs are honest `ComingSoon` stubs. - -The goal of this document is to define a reports system that serves the -four audiences who actually consume compliance output — **operators, -leadership (CISO), auditors, and compliance/GRC** — without ever -producing the thing that makes compliance reporting hated: a -thousand-page PDF nobody reads. - -**The central design problem.** A naive "report" at fleet scale — -100+ hosts × ~500 rules — is ~50,000 result rows. Rendered as one PDF -that is 1,000+ pages. That artifact is useless to every persona: the -CISO won't read it, the auditor can't sample it efficiently, the -operator can't act on it, and the GRC tool can't ingest it. The -architecture below exists to make that artifact structurally -impossible to generate by accident. - -**The good news.** Every input a great reports system needs already -exists as data in OpenWatch (see §9). This is a rendering, aggregation, -and delivery problem — **not** a data-collection one. Reports *derive* -from scan truth, so they cannot drift from it. - ---- - -## 1. Design principles - -### P1 — A report is a *snapshot with faces*, not a document - -A report is **one immutable, signed, point-in-time snapshot**. PDF, CSV, -OSCAL, JSON, and the in-app view are **projections (faces)** of that -single snapshot, not independent documents. - -Consequences: - -- **Sign the snapshot once.** Every face inherits verifiability — there - is no per-format signing and no way for two faces to disagree. -- **"Data as of" and the coverage caveat are snapshot properties**, - identical across every face. The CISO's PDF and the auditor's CSV are - guaranteed to describe the same fleet at the same instant. 
-- **Never regenerate — re-render.** Asking for the OSCAL of an existing - report renders a new face over the frozen snapshot; it never - re-samples the fleet. - -### P2 — Format follows *audience × cardinality* - -| | Summary (low cardinality) | Bulk evidence (high cardinality) | -| -------------- | ------------------------------------ | ------------------------------------------------- | -| **Human** | **PDF** — narrative, bounded, signed | **In-app drill** (query, don't paginate) + **CSV** | -| **Machine** | **JSON rollup** (dashboards, API) | **OSCAL bundle / NDJSON** — async, streamed | - -The **human × bulk** cell is the 1,000-page PDF. **That cell stays -empty.** Humans who need bulk evidence either drill interactively or -sample in a spreadsheet; machines ingest OSCAL. A PDF that tries to be -complete evidence is using the wrong tool. - -### P3 — The PDF is bounded by construction - -A report PDF's page count MUST be `O(controls + exceptions + sampled -findings)`, never `O(hosts × rules)`. Full evidence is **referenced by -hash** and shipped as the attached OSCAL/CSV face: - -> Complete evidence: `openwatch-fleet-2026-05.oscal.json` — -> SHA256 `abc1234…` — 50,000 observations. - -The auditor samples the PDF; the attached bundle is complete and -independently verifiable. Sampling rule: top-N failing findings per -control inline, the remainder "by reference." - -### P4 — Reports are *derived*, never *collected* - -No report introduces new data collection. Each report kind is an -aggregation + render over tables that already exist (§9). This makes -report kinds cheap to add and impossible to drift from the scan results -they summarize. - -### P5 — Fleet-scale generation is asynchronous - -A fleet OSCAL/PDF over 100+ hosts cannot be produced inside an HTTP -request. Generation is **queued** (existing PG `SKIP LOCKED` job queue), -executed in a worker, and the caller is **notified when ready** -(in-app notification bell + optional scheduled email). 
The synchronous -path is retained only for the small, bounded executive JSON. - -### P6 — Integrity is first-class - -Every snapshot is **content-addressed and Ed25519-signed**, verifiable -offline. A report discloses its own coverage gaps (stale/unreachable -hosts) — *staleness honesty is what earns auditor trust*, and it is the -single most important non-obvious feature in the prototype. - ---- - -## 2. The four personas (what each would *love* to see) - -### Operator — sysadmin / security engineer - -Does **not** want a "report" — wants a **worklist**. - -- Failing rules ranked by **blast radius** ("CIS-3.5.1.1 fails on 18 of - 20 hosts"). -- **Projected lift**: "run the Enable-host-firewall remediation group → - CIS +7, STIG +6 across 18 hosts." -- **What changed since last period** — regressions first. -- Lives **in-app, interactive**; exports **CSV** into ticketing. -- Rarely opens a PDF. - -Primary face: **in-app drill + CSV**. - -### Leadership — CISO - -Wants **one number, the trend, and "are we getting better or worse."** - -- Fleet average compliance + **trend delta** (the delta matters more - than the absolute). -- Top 3–5 risks framed as **business risk**, not rule IDs. -- Coverage honesty + 2–3 recommended actions. -- 1–2 pages, forwardable to a board. - -Primary face: **signed PDF**, scheduled to inbox. (The prototype's -Executive Summary is the reference design.) - -### Auditor - -Wants **evidence of control satisfaction at a point in time** — -complete, verifiable, navigable — **and samples rather than reading -everything.** - -- **OSCAL SAR** (machine-complete) + a navigable evidence path: - control → all hosts' status → one host's evidence. -- PDF attestation = cover + **methodology** + **coverage** (which hosts, - when) + framework rollup + **sampled** findings + **signature**. An - index into the complete bundle, not the bundle. -- **CSV** is their real bulk tool (pivot 50k rows in seconds). -- Cares about chain-of-custody and the scan method. 
- -Primary faces: **OSCAL SAR + CSV + bounded PDF attestation**. - -### Compliance / GRC officer - -Wants **gaps and a plan toward authorization.** - -- Framework posture mapped to an authorization boundary. -- **Exception register**: waiver, justification, approver, expiry. -- **POA&M**: each finding → milestone → target date, tracked over time. - -Primary faces: **OSCAL POA&M + Exception Register (PDF/CSV) + framework -rollup**. OpenWatch can serve this persona unusually well because -exceptions and remediation transactions are already first-class. - ---- - -## 3. Report kinds catalog - -Mapped from the prototype's six templates onto existing data sources. -Each kind is a `(snapshot builder, faces[])` pair. - -| Kind | Audience | Faces | Derived from | -| ------------------------ | --------------- | --------------------------- | ---------------------------------------------------------------------- | -| **Executive Summary** | Leadership/CISO | PDF, JSON | `fleetrollup` + `posture_snapshots` (trend) + `host_liveness` (coverage) | -| **Framework Attestation**| Auditor/GRC | OSCAL SAR, CSV, PDF | `scan_results` + `scan_evidence` (per-scan OSCAL aggregated), `exceptions` | -| **Remediation Activity** | Operations | CSV, JSON, PDF | remediation `transactions` (committed / rolled_back over a period) | -| **Exception Register** | Compliance/GRC | PDF, CSV | `internal/exception` (waiver, justification, approver, expiry) | -| **Host Evidence Pack** | Per-host audit | PDF (per host), OSCAL, CSV | `scan_results` + intelligence snapshot + scan history for one host | -| **Drift & Trend** | Management | PDF, CSV | `transactions` (state changes) + `posture_snapshots` (history) | -| **POA&M** (Phase D) | GRC | OSCAL POA&M | open findings + remediation milestones + exception expiries | - ---- - -## 4. 
Data model - -### 4.1 The snapshot - -``` -report_snapshots - id uuid pk - kind text -- executive | attestation | remediation | exception | host_evidence | drift | poam - scope jsonb -- {groups:[], framework:"cis"|null, period:{from,to}|null, host_ids:[]|null} - data_as_of timestamptz -- the sample instant (frozen) - coverage jsonb -- {hosts_total, hosts_fresh, hosts_stale, hosts_unreachable, stale_host_ids:[]} - content_sha256 text -- content address of the canonical snapshot bytes - signature bytea -- Ed25519 over content_sha256 (nullable until signing lands) - signing_key_id text - generated_by text -- principal id or "system"/"scheduler" - created_at timestamptz -``` - -The **canonical snapshot** is a deterministic, sorted JSON document -holding every datum any face needs (rollups, per-(host,rule) outcomes -referenced by evidence hash, exceptions in scope, trend series). It is -content-addressed and signed once. Faces are pure functions of it. - -> Migration note: the current `reports` table (migration 0028) is the -> executive-JSON MVP. `report_snapshots` generalizes it; the executive -> kind migrates onto it as the first kind, preserving the existing wire -> contract as the JSON face. - -### 4.2 Faces - -A face is rendered on demand and cached by `(snapshot_id, face)`: - -``` -report_faces - snapshot_id uuid fk -> report_snapshots - face text -- pdf | csv | oscal_sar | oscal_poam | json | ndjson - media_type text - size_bytes bigint - blob_sha256 text -- content-addressed; large faces stored in the blob store, streamed - status text -- pending | ready | failed (async render) - created_at timestamptz - primary key (snapshot_id, face) -``` - -Small faces (JSON, exec PDF) may render synchronously. Large faces -(fleet OSCAL, full CSV) are queued and flip `pending → ready`, at which -point the notification bell fires. - -### 4.3 Signing - -Ed25519 over `content_sha256`. 
The key is the same class of release -signing OpenWatch already operates; verification is offline -(`SHA256SUMS.asc`-style, or an in-product "Verify signature" action as -in the prototype). The reserved `audit_events.signature` work -(activity-readability Phase 5, AU-9) and report signing should share one -signing service. - -> **Open item:** align OSCAL version. `internal/scanresult` emits OSCAL -> **1.0.6**; the prototype mock shows **1.1.2**. Pick one (recommend the -> newest NIST stable) and use it across per-scan and fleet OSCAL. - ---- - -## 5. Format-by-format scale strategy - -- **PDF** — bounded by P3. Page budget + sampling rule. Deterministic - size regardless of fleet count. Rendered from the snapshot's rollup + - a sampled findings slice + a hash pointer to the bundle. -- **CSV** — the bulk workhorse. One row per `(host, rule)`: - `host, ip, group, os, rule_id, title, status, severity, - framework_refs, evidence_sha256, scan_at, exception_id`. - 100×500 = 50k rows is trivial for CSV and instantly pivotable. **Apply - the existing CSV formula-injection guard** (`csvSafe`, CWE-1236, shared - with the audit export) and the truncation-disclosure header. -- **OSCAL** — evidence is **already content-addressed**, so observations - reference evidence **by hash** (no duplication). Two viable shapes: - (a) one fleet SAR with hash-referenced evidence; (b) per-host/group - **shards with an assembly index** via OSCAL back-matter resources. - Generated async; streamed as a zip bundle or NDJSON. Reuses the - per-scan reconstruction (`scanresult.ReconstructScan`). -- **JSON rollup** — small; the summary numbers for dashboards/API. -- **In-app** — virtualized, query-driven drill - (fleet → framework → control → host → rule → evidence). Never - materialize the whole tree. 
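The formula-injection guard named in the CSV bullet above can be sketched as follows. This is a minimal illustration, not OpenWatch's actual `csvSafe`, whose implementation is not shown in the source. The function name `csvSafeSketch`, the sample row, and the exact escaping policy are assumptions. The technique (prefixing a single quote to cells that begin with `=`, `+`, `-`, `@`, tab, or carriage return) is the commonly recommended CWE-1236 mitigation.

```go
package main

import (
	"encoding/csv"
	"os"
	"strings"
)

// csvSafeSketch is a hypothetical stand-in for the csvSafe guard mentioned
// in the text. It neutralises any cell a spreadsheet could interpret as a
// formula by prefixing a single quote (CWE-1236).
func csvSafeSketch(cell string) string {
	if cell == "" {
		return cell
	}
	if strings.ContainsRune("=+-@\t\r", rune(cell[0])) {
		return "'" + cell
	}
	return cell
}

func main() {
	w := csv.NewWriter(os.Stdout)
	defer w.Flush()

	// An abridged subset of the per-(host, rule) columns listed above.
	_ = w.Write([]string{"host", "rule_id", "title", "status"})

	// Illustrative row with a hostile title; the values are made up.
	row := []string{"web-01", "CIS-3.5.1.1", "=HYPERLINK(\"http://example.invalid\")", "fail"}
	for i := range row {
		row[i] = csvSafeSketch(row[i])
	}
	_ = w.Write(row)
}
```

One trade-off of this policy is that legitimate values starting with `-` (for example a negative number) are also quoted; whether the real guard makes that choice is not stated in the source.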
- -A direct answer to the framing question: **every-host-every-rule OSCAL -rendered to PDF is 1,000+ pages — which is exactly why OSCAL is never -rendered to PDF.** OSCAL is a machine format for a GRC tool; the human -PDF is a separate, bounded face over the same snapshot. The prototype's -instinct to keep them as two different viewers is correct. - ---- - -## 6. In-app view vs export - -- **In-app** is the *query/drill* surface and the *preview*. The - executive face renders in-app exactly as the PDF will (the prototype - does this). For bulk kinds, in-app is the interactive evidence - explorer — not a paginated render of the export. -- **Export** is the *frozen artifact*: PDF for humans, OSCAL/CSV/JSON for - machines and spreadsheet sampling. Every export carries the snapshot's - signature and "data as of." -- The Library lists **snapshots**; each row offers its available faces - (the prototype's "PDF", "OSCAL · PDF", "JSON" format column). - ---- - -## 7. Generation & delivery pipeline (friction-free) - -1. **Template pre-scopes** the report by persona (prototype Templates - tab). -2. **Scope picker**: groups / framework lens / period / (optional) - explicit host set. -3. **Build snapshot** → content-address → sign. Small kinds inline; - fleet kinds **enqueue** (`internal/queue`). -4. Worker renders the requested faces; flips `report_faces.status` to - `ready`. -5. **Notify when ready** — the in-app **notification bell** (currently a - P1 stub) gains its first real producer; optional **email delivery** - via the existing notification-channel dispatch. -6. **Scheduled reports** (prototype Scheduled tab): a scheduler tick - enqueues a snapshot on a cadence and delivers faces to recipients. - -This is the symbiosis worth calling out: **reports give the -notification bell a reason to exist, and the bell makes async reports -feel instant.** Build them aware of each other. - ---- - -## 8. 
Phasing - -### Phase A — Executive report, real for humans -- Signed, **bounded PDF** face for the executive kind. -- Auto-generated **coverage caveat** (stale/unreachable disclosure from - `host_liveness`). -- **Scope picker**: group / framework lens / period (built on - `fleetrollup` + `posture_snapshots`). -- Migrate `reports` → `report_snapshots` + `report_faces`; keep the - executive JSON wire contract as the JSON face. -- Establishes snapshot-with-faces + the signing service. -- **Specs:** `api-reports` v2, `frontend-reports` (Library detail + PDF - viewer + scope picker), new `system-report-snapshot`. - -### Phase B — The scale-correct bulk path -- **Framework Attestation**: fleet **OSCAL SAR** (aggregate per-scan - OSCAL) + **CSV** evidence extract. -- **Async** generation via the job queue; `report_faces` status flips. -- Sampling rule for the PDF attestation; hash pointer to the bundle. -- Solves the 1,000-page problem properly; serves auditor/GRC. -- **Specs:** extend `api-reports` (async + export endpoints + content - negotiation), `system-report-faces`. - -### Phase C — Delivery spine -- **C1 — Exception Register kind.** *(SHIPPED 2026-06-22, PR #657.)* A - point-in-time Compliance/GRC read-model of compliance waivers - (`compliance_exceptions`): a frozen `ExceptionContent` {summary, - exceptions[]} (counts by state + active/expiring-soon + the register - rows, requester/reviewer resolved to usernames), a CSV register face, a - bounded PDF summary face, and a kind-aware in-app `ExceptionBody`. - Migration 0044 admits `kind='exception'`. Spec: `api-reports` v1.12.0 - (C-17 / AC-23), `frontend-reports` v1.9.0 (C-12 / AC-13). 
-- **C2 — Remediation Activity kind.** *(SHIPPED 2026-06-22, PR #658.)* A - read-model of remediation requests over a look-back window - (`remediation_requests` filtered on `requested_at`): a frozen - `RemediationContent` {period_from, period_to, summary, activities[]} - (exact counts by outcome + the activity rows, requester/reviewer resolved - to usernames), a CSV activity-log face, a bounded PDF summary face, and a - kind-aware in-app `RemediationBody`. The generate request gains - `period_days` (1..365, default 30); the UI shows a Last 7/30/90 days - selector for the kind. Migration 0045 admits `kind='remediation'`. Spec: - `api-reports` v1.13.0 (C-18 / AC-24), `frontend-reports` v1.10.0 - (C-13 / AC-14). -- **C3 — Scheduled dispatcher.** *(SHIPPED 2026-06-22, PR #659.)* A - `report_schedules` row (migration 0046) recurs a report on a - daily/weekly/monthly cadence and emails its PDF through an email - notification channel. A cron dispatcher (`internal/reportschedule`, - ticking each minute) claims due schedules, generates the report, renders - the PDF, emails it as a MIME-multipart attachment - (`notification.Service.SendReportEmail`), records last_run/last_status, - and advances next_run_at. The Scheduled tab is live (create form + - pause/resume + delete over the host:read/host:write CRUD endpoints). Spec: - `system-report-schedule` v1.0.0, `frontend-reports` v1.11.0 (C-14 / - AC-15). Chosen design: daily/weekly/monthly cadence (not cron), - email-with-PDF-attached delivery, recipients from the email channel's To. - -### Phase D — GRC depth -- **POA&M** (OSCAL) — open findings → milestones → target dates, - tracked over time. -- **Host Evidence Pack** + **Drift & Trend** kinds. - ---- - -## 9. 
What we build on (existing capabilities) - -| Need | Exists as | -| ---------------------------- | ------------------------------------------ | -| Per-(host,rule) outcomes | `host_rule_state` | -| Durable per-scan evidence | `scan_results` + `scan_evidence` (content-addressed) | -| Per-scan OSCAL | `scanresult.ReconstructScan` (OSCAL 1.0.6) | -| Live fleet aggregations | `internal/fleetrollup` (score, liveness, top-failing rules/hosts, recent changes) | -| Trend history | `internal/posture` → `posture_snapshots` (daily per host) | -| Drift (state changes) | `transactions` | -| Exceptions/waivers | `internal/exception` | -| Remediation history | remediation `transactions` | -| Coverage / staleness | `host_liveness` | -| Async execution | `internal/queue` (PG `SKIP LOCKED`) | -| Email delivery | `internal/notification` dispatch | -| Ready signal | in-app notification bell (P1 stub — first real producer) | -| Signing | release signing class; shared with AU-9 audit signing | - ---- - -## 10. Open decisions - -1. **OSCAL version** — align per-scan (1.0.6) and the prototype (1.1.2) - on one version. -2. **Fleet OSCAL shape** — single SAR with hash-referenced evidence vs. - per-host/group shards + assembly index. Recommend shards + index for - 100+ hosts (bounded memory, resumable). -3. **Retention** — reports are immutable + signed; define a retention - window (prototype says "retained 1 year") and whether snapshots are - purgeable. Relate to the audit retention sweep (AU-11). -4. **Signing key custody** — where the report signing key lives and how - "Verify signature" works in-product and offline. -5. **Snapshot storage budget** — a fleet snapshot is ~50k rows of - canonical JSON; store compressed in the blob store, not inline JSONB. - ---- - -## 11. 
Phase A — resolved decisions & implementation plan - -> **STATUS: Phase A shipped (2026-06-21, PRs #631–#637).** The executive -> report is now scoped (group/framework, A1), coverage-honest (A2), -> content-addressed on the `report_snapshots` + `report_faces` model -> (A3a), with a bounded pure-Go PDF face + export endpoint (A3b) and a -> frontend Download control (A3b-2), Ed25519-signed with offline -> verification (A4a) and a frontend Signed badge + Verify action (A4b). -> Two adjustments to the plan below, made during implementation and noted -> here: (a) the **coverage caveat shipped before the snapshot/faces -> migration** — the migration had no user value until a second face -> existed, so A2 delivered coverage and the structural migration moved to -> **A3a**; (b) A3 and A4 were each split backend/frontend (A3b/A3b-2, -> A4a/A4b) to isolate the fpdf dependency (A3b) and the cookie-auth blob -> download / client-side Web-Crypto verification (frontend slices). The -> §10 signing-key decision resolved as a config-path key -> (`[reports].signing_key_file`) with an ephemeral per-boot dev key. -> **Remaining: Phases B–D** (OSCAL/CSV faces, async + scheduling, the -> other report kinds) — a separate initiative. - -Phase A makes the **executive** report real for humans without taking on -the bulk/OSCAL machinery. The §10 decisions are resolved for Phase A as -follows so that *none of them blocks the start*: - -| # | Decision | Phase A resolution | -| - | -------- | ------------------ | -| 1 | OSCAL version | **N/A in Phase A** (no OSCAL face). Resolve in Phase B; recommend bumping the per-scan emitter 1.0.6 → the prototype's 1.1.2 there. | -| 2 | Fleet OSCAL shape | **Deferred to Phase B.** | -| 3 | Retention | **Keep indefinitely for now** (matches host soft-delete today); add an operator-configurable window in a later phase. Not blocking. 
| -| 4 | Signing-key custody | **De-risked:** `report_snapshots.signature` is nullable; **signing is the last Phase A slice (A4)** and is the only step gated on the operator provisioning a key. A1–A3 ship unsigned, then A4 turns signing on. So the open operator decision does not block starting. | -| 5 | Snapshot storage | **Inline JSONB is fine in Phase A** — the executive snapshot is a small rollup, not 50k rows. The compressed blob-store path is introduced in Phase B with the first bulk kind. | - -### Slices (each one reviewable PR: spec + migration/code + tests) - -**A1 — Scope the executive report.** *(api-reports v1.1.0, additive)* -- `POST /api/v1/reports:generate` accepts an optional scope: - `{ group_id?, framework? }`. (Period applies to the trend, which - arrives in A3 — not to the point-in-time snapshot.) -- Framework lens is already supported — `fleetrollup.WithFramework`. - **Add `fleetrollup.WithGroup(groupID)`** to filter the rollup to a - group's host membership (groups: migration 0027, `internal/group`). -- Store the resolved `scope` + a derived `scope_label` - (e.g. "Production · CIS") on the report. -- Frontend: a scope picker (group + framework) on the generate action, - matching the prototype's Templates builder. -- **Value:** scoped executive reports ("Production / CIS posture"), - no new architecture yet. - -**A2 — Snapshot/faces model + coverage caveat.** *(api-reports v2.0.0; -new `system-report-snapshot`)* -- Migration: `report_snapshots` + `report_faces` (§4). Migrate the - executive kind onto it; **keep the existing executive JSON as the - `json` face** so the v1 wire contract (C-01) is preserved unchanged. -- Compute the `coverage` block from `host_liveness` - (`hosts_total / fresh / stale / unreachable` + `stale_host_ids`) and - surface the **auto-generated coverage caveat** (P6) in-app. -- **Value:** the trust-critical staleness disclosure; the data model - every later face/kind reuses. 
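The `coverage` block that A2 computes can be sketched as a simple classification over liveness rows. The JSON keys below are the ones listed for `report_snapshots.coverage` in §4.1; everything else is an assumption made for illustration: the `HostLiveness` shape, the hours-based staleness threshold, and the choice to list only stale (not unreachable) hosts in `stale_host_ids`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// HostLiveness is a hypothetical projection of a host_liveness row.
type HostLiveness struct {
	HostID       string
	Reachable    bool
	ScanAgeHours int // hours since the host's last completed scan
}

// Coverage uses the JSON keys given for report_snapshots.coverage in section 4.1.
type Coverage struct {
	HostsTotal       int      `json:"hosts_total"`
	HostsFresh       int      `json:"hosts_fresh"`
	HostsStale       int      `json:"hosts_stale"`
	HostsUnreachable int      `json:"hosts_unreachable"`
	StaleHostIDs     []string `json:"stale_host_ids"`
}

// BuildCoverage classifies each host as unreachable, stale, or fresh.
// The staleAfterHours threshold is an assumed parameter.
func BuildCoverage(rows []HostLiveness, staleAfterHours int) Coverage {
	c := Coverage{HostsTotal: len(rows), StaleHostIDs: []string{}}
	for _, r := range rows {
		switch {
		case !r.Reachable:
			c.HostsUnreachable++
		case r.ScanAgeHours > staleAfterHours:
			c.HostsStale++
			c.StaleHostIDs = append(c.StaleHostIDs, r.HostID)
		default:
			c.HostsFresh++
		}
	}
	return c
}

func main() {
	rows := []HostLiveness{
		{HostID: "h1", Reachable: true, ScanAgeHours: 2},
		{HostID: "h2", Reachable: true, ScanAgeHours: 72},
		{HostID: "h3", Reachable: false, ScanAgeHours: 1},
	}
	out, _ := json.Marshal(BuildCoverage(rows, 48))
	fmt.Println(string(out))
}
```

Because the result is frozen into the snapshot, every face would render the same caveat from the same counts, which is the property P1 and P6 ask for.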
- -**A3 — PDF face + in-app viewer parity.** *(api-reports v2.1.0)* -- Bounded server-side executive **PDF** renderer (P3): posture snapshot - + 30-day trend (from `posture_snapshots`) + KPI strip + coverage - caveat + framework rollup + top risks + recommended actions — the - prototype's Executive document. -- Export endpoint: `GET /api/v1/reports/{id}/export?format=pdf|json` - (content-addressed `report_faces`, streamed as an attachment). -- In-app viewer renders the same document the PDF will (preview == - export). -- **Decision for A3:** PDF engine. Prefer a **pure-Go PDF library** - (airgap-friendly, no headless-browser dependency) over HTML→PDF. - Confirm the lib choice at A3 start. - -**A4 — Signing.** *(extends `system-report-snapshot`)* -- Ed25519 signing service over `content_sha256`; populate - `report_snapshots.signature` + `signing_key_id`. -- "Verify signature" action in the viewer + offline verification. -- **Gated on:** the signing-key custody decision (§10.4) — recommend a - dedicated report key provisioned like the release key (mounted secret, - never in DB; ephemeral dev key when unset). Shares the signing service - with the reserved `audit_events.signature` work (AU-9). - -### Spec footprint -- New: `specs/system/report-snapshot.spec.yaml` (snapshot/faces/signing - service), registered in `specter.yaml`. -- Bump: `specs/api/reports.spec.yaml` v1.0.0 → v1.1.0 (A1) → v2.x (A2/A3). -- Update: `specs/frontend/reports.spec.yaml` (scope picker, coverage - caveat, PDF viewer; Templates tab becomes the persona launcher). - -### Recommended order -A1 → A2 → A3 → A4. A1 ships visible value immediately and is fully -additive; A2 lays the architecture; A3 delivers the CISO's signed-looking -document; A4 turns on cryptographic signing once the key is provisioned. - ---- - -## 12. 
Phase B — resolved decisions & implementation plan - -Phase B builds the **Framework Attestation** kind: the scale-correct bulk -path (fleet **OSCAL SAR** + **CSV** evidence extract, generated **async**) -for auditors/GRC. Grounded in an infrastructure audit (2026-06-21): the -per-scan OSCAL emitter, the content-addressed evidence store, the -audit-CSV pattern (`csvSafe` + truncation header), the generic job queue, -and the SSE event bus all exist; the gaps are a fleet framework catalog, -a fleet SAR assembler, the async report job, and a "report.ready" event. - -### Resolved decisions (the §10 items Phase B touches) - -| # | Decision | Phase B resolution | -| - | -------- | ------------------ | -| 1 | OSCAL version | **Stay on 1.0.6.** The per-scan emitter delegates OSCAL marshaling to the Kensa library (`kensapkg.ExportOSCALScan`), which emits **1.0.6 assessment-results**; the version is Kensa-controlled. The fleet SAR must match the per-scan output, so align *down* to 1.0.6 rather than force a Kensa change to chase the prototype's aspirational 1.1.2. Revisit only if a GRC consumer requires 1.1.x (a coordinated Kensa + OpenWatch bump). | -| 2 | Fleet OSCAL shape | **Single `assessment-results` document; evidence REFERENCED by content hash (not inlined); STREAMED to the blob store.** A fleet SAR that inlined 100+ hosts × ~500 rules × up-to-256 KiB evidence is the 1000-page problem in OSCAL form. Instead the SAR carries one observation + finding per `(host, rule)` with the evidence `sha256` as a back-matter resource reference; the bytes stay in `scan_evidence`. Streaming the SAR to the blob bounds memory without the complexity of per-host shards + an assembly index (deferred unless a single SAR proves unwieldy). | -| 3 | Retention | **Keep indefinitely (unchanged).** A retention sweep (AU-11, relate to the host soft-delete sweep) is a later phase. 
| -| 5 | Snapshot storage | **Bulk-kind content goes to a content-addressed blob, not inline JSONB.** The attestation snapshot is the ~50k-row per-`(host, rule)` result set, too large for the reports row. Reuse the `scan_evidence` content-addressing pattern (or `report_faces.content` bytea, already present) and compress. The executive kind stays inline (small). | - -(§10.4 signing-key custody was resolved in A4a: `[reports].signing_key_file`.) - -### Slices (each a reviewable PR: spec + migration/code + tests) - -**B0 — Fleet framework catalog.** `GET /api/v1/reports/frameworks` -(host:read) returns the distinct `framework_refs` keys present across the -in-scope fleet (`SELECT DISTINCT jsonb_object_keys(framework_refs) FROM -host_rule_state` [scoped]). Small, and it ALSO closes the **A1 deferred -gap**: the frontend framework-lens picker (deferred in A1 for lack of a -catalog) can now populate. Spec: `api-reports`. - -**B1 — Attestation kind + CSV face.** A new `attestation` report kind -whose snapshot is `{scope, framework, per-(host,rule) outcomes}`, queried -via `host_rule_state` → `scan_runs` (`last_scan_id`) → `scan_results`. The -CSV face (reusing `csvSafe` + the truncation-disclosure header) streams one -row per `(host, rule)`: host, ip, group, os, rule_id, title, status, -severity, framework_refs, evidence_sha256, scan_at, exception_id. The -snapshot content is blob-stored (compressed). Spec: `api-reports` -(kind=attestation), new `system-report-attestation`. 
- -**B2 — Fleet OSCAL SAR face.** *(SHIPPED 2026-06-21, PR #643.)* Assemble a -single OSCAL 1.0.6 `assessment-results` from the attestation snapshot -(`internal/report/oscal.go`): one result whose findings + observations -carry one entry per `(host, rule)`, reviewed-controls aggregated as -framework-prefixed control-id tokens (digit-leading native ids stay valid -OSCAL tokens), the finding state "satisfied" only on a pass, the host as a -deterministic-v5 inventory-item subject, narrowed by the snapshot's -framework lens. Evidence is REFERENCED by `sha256` in back-matter (an rlink -SHA-256 hash), never inlined as base64 — the bytes stay in `scan_evidence`. -Since Kensa's `ExportOSCALScan` is per-scan and *inlines* evidence, the -fleet assembler is a light hash-referencing custom builder (not Kensa's -exporter), with its own minimal OSCAL structs mirroring the per-scan shape. -Every uuid is a deterministic v5 from the snapshot id, so the document is -byte-deterministic and cached in `report_faces` (face `oscal_sar`, status -`ready`) like the other faces; the assembly is bounded by the same row cap -as the CSV (a metadata prop discloses truncation). `format=oscal_sar` is -attestation-only (executive is `ErrInvalidFace`). Spec: `api-reports` -v1.8.0 (C-14 / AC-20). True streaming to a separate blob store is deferred -(the in-memory + row-cap + `report_faces.content` pattern matches the CSV -face). - -**B3b — Bounded attestation PDF face.** *(SHIPPED 2026-06-21, PR #644.)* -The `pdf` face is now KIND-DISPATCHED (`internal/report/export.go`): an -executive report renders the executive summary PDF, an attestation report -renders a bounded one-page cover (`renderAttestationPDF` in `pdf.go`) — -methodology note, aggregate attestation coverage + framework rollup -(compliance %, checks evaluated, pass/fail/skipped/error), a SAMPLED -top-failing list, and a footer carrying the snapshot content hash + signing -status as the pointer to the bulk faces. 
The rollup is O(1) in fleet size -(aggregate `count(*) FILTER` over the frozen scans + a top-N grouped query, -framework-lensed), so the PDF stays bounded. Cached in `report_faces` (face -`pdf`) like the others. Spec: `api-reports` v1.9.0 (C-15 / AC-21; C-10 -updated: pdf kind-dispatched, not executive-only). - -**B3a — Async generation + report.ready.** *(SHIPPED 2026-06-21, PR #646.)* -Generating an attestation marks its bulk faces (`csv`, `oscal_sar`, `pdf`) -`pending` in `report_faces` and enqueues a `report.render` job -(`internal/report/job.go`), returning immediately (the executive summary -stays synchronous). A `RenderProcessor` registered on the in-process worker -(`worker.WithReportProcessor`) claims the job, renders each face via -`Export` (flipping `pending → ready`; a render error marks the face -`failed` and fails the job for retry), and publishes -`EventKindReportReady` on the event bus — **the in-app notification bell's -first producer**. Async is an optimization, not a correctness gate: `Export` -stays the lazy fallback so a download before the job runs still renders -inline. Spec: `api-reports` v1.10.0 (C-16 / AC-22) + the new eventbus kind. - -**B3c — Notification bell (frontend).** *(SHIPPED 2026-06-21, PR #647 — -conservative MVP.)* The stubbed TopBar bell is now a real consumer of -`report.ready`: `useLiveEvents` subscribes to the topic and bumps a -session-scoped unread counter in a small Zustand store -(`useNotificationStore`); the bell renders that count as a badge and, on -click, opens `/reports` and clears it. MVP scope is deliberately small and -honest — the counter is session-scoped (a refresh resets it), there is no -dropdown feed of individual notifications, and `report.ready` is the only -event type. A durable per-user feed (a dropdown list, multiple event types, -cross-session persistence) is the deferred follow-on and is NOT faked. Spec: -`frontend-live-events` v1.3.0 (C-08 / AC-10) + new `frontend-notifications` -v1.0.0. 
- -### Recommended order -B0 → B1 → B2 → B3b → B3a → B3c. B0 unblocks attestation scoping + the -deferred A1 framework picker; B1/B2/B3b build the three bulk/cover faces -(CSV, OSCAL SAR, PDF); B3a makes generation async and emits the "ready" -signal; B3c surfaces it in the notification bell (the product-sensitive -slice, sequenced last). diff --git a/docs/engineering/scan_implementation_plan.md b/docs/engineering/scan_implementation_plan.md deleted file mode 100644 index 031db011..00000000 --- a/docs/engineering/scan_implementation_plan.md +++ /dev/null @@ -1,258 +0,0 @@ -# Compliance Scan — Implementation Plan - -**Status:** Phases 0-2 SHIPPED (branch feat/scan-foundation, 15 commits, live-verified) · **Updated:** 2026-06-12 · **Owner:** TBD - -## Status snapshot (2026-06-13) - -| Phase | Status | Evidence | -|-------|--------|----------| -| 0 — Foundation | **DONE** | kensa v0.3.2 bound (NewScanner + in-memory transport); R6 multi-valued refs fixed pre-data; scan_runs logbook + audit lifecycle; serve processes scan jobs in-process. Live: 539 rules vs owas-hrm01 in 83s, 0 errors | -| 1 — On-demand scan | **DONE** | POST /hosts/{id}/scans (idempotency, RBAC, 409 single-flight) + Run scan button + scan.completed SSE refresh. Found+fixed a latent platform bug: http.Server WriteTimeout killed ALL SSE streams at 60s | -| 2 — Top failed rules | **DONE** | GET /hosts/{id}/compliance/failed-rules (no-evidence C-02, multi-valued control_ids) + RuleCatalog titles + live card. Verified: hrm01's real 147 failures render with catalog titles | -| 3 — Compliance tab lens | **DONE** | GET /hosts/{id}/compliance (+/frameworks) with C-05 reconciliation; ComplianceTab.tsx lens UI. Live: lens switch recounts 68.1% all-rules -> 71.4% under stig_rhel8 (266 rules exactly). 
Prototype-fidelity pass done 2026-06-12: per-lens scores on chips + overall aggregate, result-mix/scan panels, duration_seconds, catalog descriptions, search, in-strip Re-scan (specs v1.2.0 / v1.1.0) | -| 4 — Adaptive scheduler + settings | **DONE** (2026-06-12; scan variables shipped in PR #517, surfaced on Settings > Compliance policies) | system-scheduler v3.0.0: ladder from systemconfig (RunManaged per-tick refresh), five bands + migration 0024 backfill, PersistAfterScan after every scan, dispatch logbook rows. api-system-scan-config (6 endpoints incl. scan/variables + scan/schedule + host schedule tile) + wired Settings section. Scan variables SHIPPED (PR #517): VariableCatalog over the 20 corpus-used vars, operator overrides on Settings > Compliance policies, per-scan corpus reload. Live: first tick auto-dispatched all 9 seeded hosts; fleet classified across the five bands; a UI variable override reloaded the corpus on the next scan. **Phase 4 fully DONE.** | -| 5 — Fleet surfaces | **mostly DONE** | Per-host Scan, scan-queue KPI, hosts-list compliance_summary enrichment, avg/critical KPIs, fleet avg-compliance delta. Remaining: **bulk scan** (POST /hosts:scan) -> see `scan_remaining_work.md` | -| 6 — Trend / posture snapshots | **DONE** (2026-06-12, PR #518) | posture_snapshots daily rollup (hourly cron + boot pass); GET /hosts/{id}/compliance/trend + /fleet/compliance/trend; live trend card + avg-compliance delta. Shipped alongside: the host-detail hero strip went fully live (Auto-scan tile -> GET /compliance/schedule, Watchlist tile -> live active-alerts) and OS-aware framework lens filtering (api-host-compliance v1.3.0 — a RHEL 8 host no longer offers RHEL 9/10 lenses) | -| 7 — Exceptions / Remediation | **Exceptions DONE (rc.6); remediation NOT STARTED** | Exception governance complete end to end (PRs #521/#522/#523): lifecycle, separation of duties, overlay model, host-detail surfaces + fleet approver queue. 
**Remediation** (host-mutating, scoping required) -> see `scan_remaining_work.md` | - -All risks R1-R6 resolved. Fleet self-scans on the adaptive ladder (9 hosts seeded, classified across all five bands; 3 critical re-scan every 4h). Merged PRs: #515 (Phases 0-3 + scheduler core), #517 (scan variables), #518 (posture trend + live hero tiles + OS-aware lenses), #519 (service-wiring guard), #521/#522/#523 (exception governance: backend + host-detail surfaces + fleet approver queue). All of the above shipped in release **v0.2.0-rc.6** (2026-06-13). **7 of 8 phases complete**; remaining: Phase 7 remediation (the host-mutating half, its own track) + the Phase 5 bulk-scan endpoint. - -Operational follow-ups still open (small, not blocking Phase 7): -- Bulk scan endpoint (POST /hosts:scan) — the last Phase 5 item. -- Deadline-free SSE streams outlive graceful shutdown's 30s grace (cancel on shutdown ctx). -- Scan-context Capabilities line needs stored capability data from Kensa. -- Found + fixed during the host-detail tile work: the alerts lifecycle service was never wired in serve (every /api/v1/alerts endpoint 503'd in production); a generic source-test (system-daemon-orchestration AC-11) now guards that EVERY server builder is wired in main.go. - -Covers end-to-end compliance scanning for OpenWatch: wiring the -Kensa engine, triggering scans (on-demand + adaptive auto-scan), persisting and -serving results, and the two operator surfaces in the prototypes — -`docs/engineering/prototypes/openwatch-v1/Host Detail.html` and `Host Management.html`. - -> **Design anchor — the lens model.** Kensa runs its **native rules once** per -> host and returns a per-rule verdict plus that rule's normalized -> **framework references**. Every compliance number in the UI (CIS %, STIG %, -> NIST %, the "Top failed rules" list, the per-framework rule view) is a -> **projection of one scan**, regrouped by `framework_refs`. There is never more -> than one scan behind the screen. 
Build for this from day one. - ---- - -## 1. Current state (verified 2026-06-11) - -| Layer | State | Notes | -|-------|-------|-------| -| Kensa dependency | ⚠️ v0.2.1 pinned | bump to **v0.3.0** — it adds the real `ComplianceStatus` verdict (§3) | -| `internal/kensa` executor | ✅ built, **scan stubbed** | `executor.go` has the `WithScanFunc` seam; default is `unwiredScanFunc` which errors. Concurrency guard, credential-resolver hook, audit emission all present | -| Scan worker | ✅ built | `internal/worker/scan_worker.go` consumes `"scan"` jobs → executor → persists via `transactionlog.Writer.Apply()` | -| Result storage | ✅ schema ready | `host_rule_state` (current state per host×rule) + `transactions` (append-on-change). `current_status ∈ {pass,fail,skipped,error}`, `framework_refs JSONB`, `severity`, `evidence`, `skip_reason` | -| Compliance scheduler | ❌ **not booted** | `internal/scheduler` has `Dispatch()`/`UpdateAfterScan()` but `main.go` never instantiates it; `host_compliance_schedule` is never advanced | -| Intelligence scheduler | ✅ running | `internal/intelligence/scheduler` (OS-intel collection) — the working template for the compliance scheduler | -| On-demand scan API | ❌ none | no `/hosts/{id}/scan` or `/scans` in `api/openapi.yaml` | -| Host Detail compliance UI | ⚠️ partial | hero card reads `compliance_summary` (works, shows empty); Top-failed-rules + trend cards are stubs; Compliance tab is a `TabStub` | -| Host Management compliance UI | ⚠️ partial | list shows status/intel; compliance %, last-scan, per-host scan, fleet stats not wired | - -**The chain is built but cold:** executor → worker → storage → display all exist; -they're dark because (a) the scan is a placeholder, (b) nothing enqueues scan -jobs, and (c) the read endpoints don't exist yet. - ---- - -## 2. Target experience (extracted from the prototypes) - -### Host Detail (`Host Detail.html`) -1. 
**Header `Run scan`** (on-demand trigger) + a **maintenance toggle** ("Pause scans & alerts"). -2. **Offline banner** — "figures below reflect the last completed scan, may be stale." -3. **Overview → Top failed rules card**: rows of `[severity] · title · framework-control-id · category · occurrence-detail · [Remediate]`; footer "View all N failed rules →". -4. **Overview → Compliance trend (30 d)** card — needs posture snapshots. -5. **Compliance tab → lens model**: `scan-context` (last scan time, **auto-detected capabilities**), `Export` + `Re-scan`, a `lens-bar` ("View as" CIS/STIG/NIST/…), and summary + categories + rule-list, **all re-projected from one scan by `framework_refs`**. -6. **Per-rule `Remediate`** action. - -### Host Management (`Host Management.html`) -7. **Fleet `Run scan`** (bulk) + per-host `Run scan` (table row + card). -8. **Fleet stats**: Avg. compliance, **Scan queue** depth. -9. **Fleet health banner** (e.g. "compliance dropped 4.2 pts in 24h"). -10. **Host list columns**: compliance %, passed/failed/total, **last scan**, status; sorted "down first, then compliance asc". - -### Settings → Scanning & monitoring (`Settings.html`, "Compliance scanner" section) -11. **Master toggle** — "Automatic compliance scanning" on/off, with the **48 h hard-ceiling** copy ("per host even when state hasn't changed") and a Running badge. -12. **Next-scan readout** — "Next scan in 2 min · **5 hosts queued**" (live queue depth). -13. **24 h schedule strip** — "What this will run · next 24 hours" visual projection of upcoming scans (the Q2 plan's "preview histogram"). -14. **State-interval table** — one row per compliance state with: state name + score band, **hosts-in-state count** ("5 of 7 hosts"), an editable **interval stepper** (minutes), and the computed cadence ("Every 1h · 120 scans/day"). Prototype rows: Critical <20% → 60 m · Low 20–49% → 120 m · Partial 50–69% → 360 m · Mostly compliant 70–89% → 720 m · Compliant ≥90% → 1440 m. -15. 
The current `ScanningPage.tsx` already renders this section as a **"UI only" placeholder** (`Section title="Compliance scanner" badge="UI only"`, ~lines 457–502) — Phase 4 replaces it with the wired version, alongside the already-wired Connectivity / OS discovery / OS intelligence sections. - ---- - -## 3. The Kensa v0.3.0 contract (foundation) - -`k.Scan(ctx, HostConfig, []*Rule)` → `ScanResult{ Outcomes []RuleOutcome }`, where: - -```go -type RuleOutcome struct { - RuleID string // native rule id, e.g. "ssh-disable-root-login" - Status ComplianceStatus // pass | fail | skipped | error - Severity string // critical|high|medium|low (copied from rule) - Detail string // UI-suitable explanation of the verdict - FrameworkRefs []FrameworkRef // normalized, OS-resolved - Err error // non-nil iff Status==error -} -type FrameworkRef struct { FrameworkID string; ControlID string } // {"cis_rhel9_v2","5.2.3"} -``` - -The OpenWatch ↔ Kensa mapping is therefore a **field copy**, with zero compliance -logic on our side (this is the whole reason to be on v0.3.0): - -``` -kensa.RuleOutcome.Status → host_rule_state.current_status (1:1: pass/fail/skipped/error) -kensa.RuleOutcome.Severity → host_rule_state.severity -kensa.RuleOutcome.Detail → host_rule_state.evidence / .skip_reason -kensa.RuleOutcome.FrameworkRefs → host_rule_state.framework_refs (map FrameworkID→ControlID) -``` - -`FrameworkRef` is already OS-resolved (`cis_rhel9_v2` bakes in the OS), which is -exactly the lens data — the Compliance tab groups `framework_refs` by `FrameworkID`. - -**Isolation rule:** the `kensa.* → host_rule_state` translation lives in exactly -**one** adapter (`pkg/kensa`). Nothing downstream (worker, DB, API, UI) ever sees -a Kensa type. When Kensa's API evolves, we change one function. - ---- - -## 4. 
Open risks to resolve before / during Phase 0 - -| # | Risk | Resolution | -|---|------|-----------| -| R1 | **Rule corpus loader — ✅ RESOLVED (2026-06-11, consensus with Kensa team).** Kensa will ship a public loader in their `pkg/kensa` (small, spec-covered PR): `LoadRules(dir string, paths []string, vars map[string]string) ([]*api.Rule, error)`, plus two discovery functions for the operator UI: `BuiltInVars() map[string]string` and `RuleVariables(dir) map[string][]string`. Decisive facts from their investigation: **23 of 539 rules are `{{ var }}` templates** resolved against an embedded defaults.yml (a copied parser would mis-read them *today*); the loader also does reference normalization, param-contract validation, and draft-tolerant walking; and the corpus ships as the **signed `kensa-rules` OS package** at the loader's default path — vendoring rule files into OpenWatch would fork the corpus and exit the GPG/cosign trust chain. | **OpenWatch consumes, never copies:** `kensa-rules` package on disk → `kensa.LoadRules(dir, nil, mergedVars)` → `Kensa.Scan` → `Outcomes`. Zero copied files, zero private imports. **Ratified boundaries:** (1) `LoadRules` returns parsed rules only — `depends_on`/`conflicts`/`supersedes` ordering stays unexported (revisit as its own ratification when Phase 7 remediation sequencing needs it); (2) per-host/per-group **variable storage lives in OpenWatch's DB** — OpenWatch passes the already-merged map per scan; Kensa stays a single-host resolver (same division as the liveness boundary). **Phase 0 gate is now just the Kensa PR (~a day on their side).** | -| R6 | **`FrameworkRefs` cardinality bug — MUST FIX BEFORE FIRST SCAN DATA (2026-06-11 review).** Kensa's `[]FrameworkRef` allows multiple controls per framework — e.g. `ssh-disable-root-login` maps to **three** NIST controls (`AC-6(2)`, `AC-17(2)`, `IA-2(5)`) and two PCI controls. 
OpenWatch's `RuleOutcome.FrameworkRefs map[string]string` (`internal/kensa/types.go:96`) holds **one** control per framework — converting silently drops the rest, and `transactionlog.Writer` marshals that lossy map straight into `host_rule_state.framework_refs`. The NIST/PCI lens would under-count. | Change to `map[string][]string` in `internal/kensa/types.go` + adjust `transactionlog/writer.go` marshaling. No migration needed (column is JSONB) and **nothing reads the column yet** — fix lands in Phase 0, before any data exists. Update `system-transaction-log`/`host-rule-state` spec shape notes. | -| R2 | **Transport on-disk key.** Kensa's default `ssh.Factory{}` needs `HostConfig.KeyPath` (a key on disk); OpenWatch decrypts credentials in-memory and must not write them out. **2026-06-11 review: the in-memory adapter does NOT exist** — `internal/kensa/doc.go` documents the intent only; no OpenWatch type implements `api.Transport`. | **Build it in Phase 0** (it is the sprint's largest single code item): implement `api.Transport` on top of `internal/ssh.Dial` (in-memory key auth, known-hosts policy already handled). Scope verified against kensa internals: the **scan path only calls `Run()`** — `Put`/`Get` are used solely by agent bootstrap, so implement `Run` (with `sudo -n sh -c` wrapping per the interface contract), `Close`, `ControlChannelSensitive() → false`, and return explicit not-implemented errors from `Put`/`Get` until remediation (Phase 7) needs them. Do **not** use `pkg/kensa.Default`'s transport. | -| R3 | **`skipped` semantics.** A validated corpus rule always has a default impl, so capability-mismatched rules fall through to pass/fail/error; `skipped` fires only for rules with implementations and no default. | Trust Kensa's verdict verbatim. Do **not** re-derive applicability. The lens denominator = the outcomes Kensa returns. | -| R4 | **Kensa result store.** `pkg/kensa.Default` opens a SQLite store for Kensa's engine/evidence. 
OpenWatch is the system of record (PostgreSQL). | Give Kensa an ephemeral/throwaway store path; persist authoritative results to `host_rule_state`/`transactions` only. | -| R5 | **Capability detection ownership.** The lens header shows "auto-detected capabilities". Kensa auto-detects; OpenWatch also has intelligence. | Use Kensa's detected capabilities (surface them via scan metadata); don't double-detect. | - ---- - -## 5. Phased plan - -Each phase is independently shippable and SDD-disciplined (spec → tests → code → -validate). "Spec" = a new/updated `.spec.yaml` with enforcing tests. - -### Phase 0 — Foundation: wire the real Kensa scan ⟶ *unblocks everything* -**Goal:** a manually-enqueued `"scan"` job produces real `host_rule_state` rows. -- **Resolve R1** (corpus loader) and **R2** (in-memory transport) first. -- **Fix R6 first-in-phase:** `FrameworkRefs` → `map[string][]string` (types + writer) before any scan data is written. -- Bump `go.mod` → `github.com/Hanalyx/kensa v0.3.1`; update `KensaModuleVersion` + the pin test in `internal/kensa`. *(v0.3.1 ships the public loader — verified by smoke test 2026-06-11: `LoadRules` returns all 539 rules with zero unresolved templates, operator-var override works, `RuleVariables` reports the 20 corpus vars. Note `LoadRules` is STRICT — any unparseable file or undefined variable fails the whole load, naming the file — the right semantics for a compliance corpus.)* -- New `pkg/kensa` (OpenWatch side) — the production `ScanFunc`: - - decrypt credential in-memory (existing `internal/credential` resolver) → build OpenWatch `TransportFactory`; - - construct Kensa with our transport + Kensa's engine/scanner/store; - - `kensa.LoadRules(rulesDir, nil, vars)` → `Scan()` → copy `Outcomes` into the executor's `RuleOutcome` (the §3 field copy); - - **Vars in Phase 0: pass `nil`** — every templated rule has a safe embedded default, so scans work out of the box. Operator variable config is Phase 4 scope. 
Load the corpus once at worker start (and on config change), not per scan — `LoadRules` resolves templates at load time, so per-host variable tiers (future) would force per-scan loads; defer that until a real need. - - `rulesDir`: default to the `kensa-rules` package path; honor an explicit override (env/config) for dev checkouts where the OS package isn't installed. - - capture scan metadata (started/finished, capabilities, rule count, engine/policy version). -- Bind it via `WithScanFunc(...)` in the worker subcommand, replacing `unwiredScanFunc`. -- **Spec:** `system-kensa-executor` (close AC-18); `system-scan-execution` (new — verdict mapping, evidence cap, framework-ref copy, error/skip handling). -- **Exit:** enqueue a job by hand against a test host (`id_rsa` + `test_hosts.csv`); rows land in `host_rule_state`/`transactions`; the Host Detail **hero card lights up**. Verify the mapping against a known rule (e.g. `ssh-disable-root-login`). - -### Phase 1 — On-demand single-host scan (trigger) -**Goal:** the prototype's `Run scan` button works end-to-end. -- **API:** `POST /api/v1/hosts/{id}/scan` — enqueues one `"scan"` job; **Idempotency-Key** required; RBAC `host:write`; returns the scan/job id + queued status. 404 on unknown host; 409/202 semantics for an in-flight scan per the executor's busy guard. -- **Backend:** thin handler → existing queue enqueue. Audit `scan.requested`. -- **Frontend (Host Detail):** wire the header `Run scan` + the card `Re-scan`/`Run scan` buttons (idempotency-keyed, `host:write`-gated, inline busy/feedback). Invalidate `['host', id]` + compliance keys on completion via SSE (`scan.completed`). -- **Spec:** `api-host-scan` (new); update `frontend-host-detail`. -- **Exit:** click Run scan → job runs → hero card updates without reload. - -### Phase 2 — Top failed rules (Host Detail overview) -**Goal:** the "Top failed rules" card renders real data. 
-- **API:** `GET /api/v1/hosts/{id}/compliance/failed-rules?framework=&limit=` — reads `host_rule_state WHERE current_status='fail'`, ordered by severity desc then last-changed; joins `transactions` for first-seen/last-changed; projects `framework_refs[framework]` for the control-id + category. Returns `{title, native_id, control_id, severity, category, occurrence_detail, first_seen, last_changed}`. -- **Frontend:** replace the `CardTopFailed` stub; "View all N failed rules →" deep-links to the Compliance tab. -- **Spec:** `api-host-compliance` (new); update `frontend-host-detail`. -- **Exit:** card shows the same numbers as the hero; deep-link works. - -### Phase 3 — Compliance tab: the lens model -**Goal:** "One scan, viewed through any framework." -- **API:** `GET /api/v1/hosts/{id}/compliance?framework=` — returns, for the selected framework lens: summary (pass/fail/total + %), category breakdown, and the rule list (each with `control_id` from `framework_refs`, severity, status, detail). Also `GET …/compliance/frameworks` → the lens-bar options (frameworks this host's outcomes actually map to) + scan-context (last scan, detected capabilities). -- **Frontend:** build the Compliance tab — `scan-context` header, `lens-bar` (`?framework=` drives the query key, matching the existing host-detail framework-param pattern), `comp-summary` / `comp-cats` / `comp-rules`, `Export`. -- **Spec:** `frontend-host-compliance-tab` (new); extend `api-host-compliance`. -- **Exit:** switching the lens re-scores instantly from one scan; counts reconcile across lenses. - -### Phase 4 — Adaptive auto-scan scheduler ⟶ *the originally-planned model* -**Goal:** scans run on their own, state-based cadence (max 48 h). -- *(2026-06-11 review: smaller than originally scoped — `internal/scheduler` already has `Run(ctx, interval)` cron, `Dispatch()` with HMAC-signed `queue.Enqueue("scan")`, `UpdateAfterScan()`, tier ladder, and policy-revocation plumbing. 
The work is boot wiring + the post-scan callback + seeding, not building the scheduler.)* -- **Backend:** boot `internal/scheduler` in `main.go` (mirror `intelSched`); seed `host_compliance_schedule` on host-create; call `UpdateAfterScan()` after each scan to set `compliance_state` + `next_scheduled_scan`; `Dispatch()` on the cron tick enqueues due hosts. Respect `hosts.maintenance_mode`. Independent backoff (`probe_type='scan'`). -- **Scan-run metadata:** a `scans` (or `scan_runs`) record per run → powers "last scan", "scan queue" depth, scan status/history. - -**Settings → Scanning "Compliance scanner" section** (replaces the existing "UI only" placeholder in `ScanningPage.tsx` ~457–502; targets §2 items 11–15): -- **API:** `GET`/`PUT /api/v1/system/scan/config` — `{enabled, interval_mins per state, rate_limit, maintenance_global}` in the established `{config, defaults}` envelope (same systemconfig store + `SystemConfigChanged` audit pattern as connectivity / intelligence / discovery configs). Server clamps to the scheduler's bounds (never below the floor; ceiling 48 h = 2880 m). -- **Read endpoints for the section's live data:** - - hosts-per-state counts → `GET /api/v1/fleet/compliance/states` (one row per `ComplianceState`, analogous to the fleet connectivity breakdown); - - "Next scan in X · N hosts queued" → queue depth for `job_type='scan'` + min(`next_scheduled_scan`); - - 24 h schedule strip → `GET /api/v1/system/scan/schedule:preview` projecting `host_compliance_schedule` forward 24 h (read-only projection, not a dry-run dispatch). -- **Frontend:** wire the section — master toggle, state-interval steppers (per-state minutes), per-state host counts, computed cadence labels, next-scan/queue readout, schedule strip. Section-local Save/Reset, mirroring `OSIntelligenceSection`'s container/pure-view split. 
-- **Scan variables sub-section (operator config for the 23 templated rules):** key the list off `kensa.RuleVariables(dir)` (the **20 variables actually used** by corpus rules, with "affects N rules" per variable), using `kensa.BuiltInVars()` for the default values — BuiltInVars returns 29 entries, 9 of which no corpus rule uses; don't render those; store operator overrides in OpenWatch (global tier first — systemconfig key `scan_variables`; per-group/per-host tiers are a later phase per the ratified boundary); the worker passes the merged map to `LoadRules`. **Flag the three org-specific placeholders prominently as "configure me"** — `rsyslog_remote_server`, `chrony_ntp_pool`, `banner_text` — since scans against their example defaults produce technically-valid but practically meaningless verdicts for those rules. -- **Spec:** `system-compliance-scheduler` (promote/author); `api-system-scan-config`; `frontend-settings-scan-config` (new, incl. scan variables). - -**⚠️ Reconciliation issues found in the 2026-06-11 settings review — resolve at Phase 4 start:** -- **(a) Config source-of-truth conflict.** The built scheduler loads its `TierLadder` from a **signed schedules policy file** (`PolicyTiers` + signature verification + revocation list, wired in `main.go`), while the prototype shows **operator-editable steppers** — i.e. the systemconfig PUT pattern every other section uses. Decide: (i) move the ladder to systemconfig like its sibling configs (drop/repurpose the signing machinery for this knob), or (ii) keep the signed policy and make the Settings UI read-only for intervals. Recommendation: (i) for consistency — signing the cadence config has unclear threat-model value when the same operator can already PUT maintenance toggles that stop scanning entirely. 
-- **(b) State-band mismatch.** Prototype shows **5 score bands** (Critical <20, Low 20–49, Partial 50–69, Mostly compliant 70–89, Compliant ≥90); the backend `ComplianceState` enum has **4 score states + `unknown`** (`critical`, `non_compliant`, `partial`, `compliant`). Either add a 5th band to the enum/`StateFromScore` (+ migration of the `host_compliance_schedule` CHECK if any) or collapse the UI to 4 bands + an "Unknown / never scanned" row. Decide before authoring `api-system-scan-config`, since the state names are the config keys. - -- **Host Detail tie-in:** show `next scan` + last-scan freshness; trend card's "auto-scan resumes" copy becomes real. -- **Exit:** a fresh host gets scanned without anyone clicking; interval adapts to compliance state; the Settings section edits the live ladder and the strip/queue readouts move. - -### Phase 5 — Host Management fleet surfaces — **mostly DONE** -Compliance columns, tier coloring, per-host Scan, scan-queue + avg-compliance -KPIs, and the fleet-health delta all shipped (PRs #515 / #518). The one -remaining piece — the **bulk scan endpoint** (`POST /hosts:scan`) — now lives in -[`scan_remaining_work.md`](scan_remaining_work.md). - -### Phase 6 — Compliance trend (posture snapshots) -**Goal:** the 30-day trend card. -- **Backend:** a daily posture-snapshot rollup (per host + fleet) from `transactions`. -- **API:** `GET /api/v1/hosts/{id}/compliance/trend?days=30`; fleet equivalent for the health banner delta. -- **Frontend:** replace the trend empty-state with the chart. -- **Spec:** `system-posture-snapshots` + `api-compliance-trend`. - -### Phase 7 — Remediation + exceptions -**Exceptions: DONE** (PRs #521 backend, #522 host-detail surfaces, #523 fleet -approver queue) — DB-backed request→approve/reject→revoke/expire lifecycle, -separation of duties, the overlay model (a waiver never changes a rule's raw -verdict). 
Specs `api-compliance-exceptions`, `frontend-host-compliance-tab`, -`frontend-settings-exception-queue`. - -**Remediation: NOT STARTED** — the host-mutating half, its own track with a -scoping decision required first. Full plan + the five decisions in -[`scan_remaining_work.md`](scan_remaining_work.md). - ---- - -## 6. Cross-cutting requirements - -- **RBAC:** read = `host:read`/`system:read`; scan trigger + remediation = `host:write` (per `rbac_registry.md`). Anonymous → 401/403. -- **Audit:** `scan.requested`, `scan.started`, `scan.completed`, `scan.failed`, `remediation.*` per `audit_event_taxonomy.md` (executor already emits the started/completed/failed legs). -- **Idempotency:** every mutating scan/remediate endpoint requires `Idempotency-Key` (reuse the connectivity:check pattern). -- **SSE / live refresh:** publish `scan.completed` on the event bus; extend `useLiveEvents` to invalidate `['host', id]`, the compliance keys, and `['hosts']` (fleet). (Tracks the existing Track-B SSE backlog.) -- **OpenAPI-first:** every endpoint lands in `api/openapi.yaml` → `make generate-api` → Go stubs + `frontend/src/api/schema.d.ts`. -- **Packaging (RATIFIED 2026-06-12 — air-gapped installs are the primary deployment target):** the corpus ships as the **signed `kensa-rules` OS package** at the loader's default path. OpenWatch's RPM/DEB MUST declare a dependency on it **and the air-gapped artifact set MUST bundle it** so an offline install is complete with no network fetch. OpenWatch never embeds or forks the rule files. `OPENWATCH_KENSA_RULES_DIR` is a development-only override — both boot paths warn loudly when it is set (spec system-kensa-executor C-16/AC-23), and no production runbook, unit file, or default config may use it; pointing production at a Go module cache or any unpackaged source is prohibited. Fold the dependency + bundling into `packaging/` + the `RELEASING` runbook before the first scan-capable release. - ---- - -## 7. 
Sequencing & dependencies - -``` -Phase 0 (foundation) ─┬─→ Phase 1 (on-demand) ─→ Phase 2 (top failed) ─→ Phase 3 (lens tab) - └─→ Phase 4 (scheduler) ─→ Phase 5 (fleet) ─→ Phase 6 (trend) - Phase 7 (remediation) ⟂ later -``` - -- **Phase 0 gates everything** and is itself gated on **R1 + R2**. -- Phases 1–3 deliver the Host Detail story on one host; Phase 4 makes it autonomous; Phase 5 scales it to the fleet. -- The lens model (Phase 3) is mostly frontend/SQL over `framework_refs`, since Kensa already normalizes the refs. -- Recommended first PR after this plan is approved: **Phase 0, step 1 only** — bump to v0.3.0, resolve R1's loader question, and prove the build — before writing the `ScanFunc`. - ---- - -## 8. Decisions needed from review - -1. ~~**R1** — corpus loader~~ — **RESOLVED 2026-06-11**: Kensa ships `pkg/kensa.LoadRules` + `BuiltInVars` + `RuleVariables`; OpenWatch consumes the signed `kensa-rules` package; both boundary ratifications accepted (no ordering export; per-host/group variable storage in OpenWatch's DB). -2. ~~**Trigger posture**~~ — **RESOLVED 2026-06-12: ship both.** On-demand (`Run scan`/`Re-scan`, shipped in Phases 1-3) stays as the dev/first-contact path; the adaptive scheduler (Phase 4) is the steady-state model. -3. ~~**Scan-run record**~~ — **RESOLVED 2026-06-11: yes, with full scan auditability.** Two complementary records: (a) a `scan_runs` table (Phase 0 migration) — one row per scan attempt: id, host_id, trigger (`on_demand`/`scheduled`), requested_by, queued/started/finished timestamps, status (`queued`/`running`/`completed`/`failed`), rule counts by outcome, policy/engine version, failure reason — powering "last scan", "scan queue", and history; (b) **audit events** for the full lifecycle per the audit taxonomy — `scan.requested` (who triggered, from where), plus the executor's existing `scan.started`/`scan.completed`/`scan.failed` emissions, all correlation-id linked to the run row. -4. 
~~**Scheduler config source-of-truth**~~ — **RESOLVED 2026-06-12: systemconfig** (option i). The tier ladder moves to the systemconfig store like every other Settings → Scanning section; the signed-policy/revocation machinery is dropped for this knob (signing the cadence has no threat-model value when the same operator can PUT maintenance toggles that stop scanning entirely). -5. ~~**Compliance state bands**~~ — **RESOLVED 2026-06-12: add the 5th band.** `ComplianceState` gains `mostly_compliant` (70-89) to match the prototype's five bands: `critical` <20, `non_compliant` 20-49, `partial` 50-69, `mostly_compliant` 70-89, `compliant` >=90, plus `unknown` for never-scanned. These names are the config keys and UI labels. -6. *(new, low-priority)* **Per-host / per-group scan variables** — the ratified boundary puts their storage in OpenWatch when we want them; global tier ships in Phase 4. Note: per-host vars force per-host `LoadRules` calls (templates resolve at load time) — defer until a concrete need justifies the load cost. diff --git a/docs/engineering/scan_remaining_work.md b/docs/engineering/scan_remaining_work.md deleted file mode 100644 index 1457f6f6..00000000 --- a/docs/engineering/scan_remaining_work.md +++ /dev/null @@ -1,134 +0,0 @@ -# Compliance Scan — Remaining Work (Phase 5 tail + Phase 7 remediation) - -> Split out of [`scan_implementation_plan.md`](scan_implementation_plan.md) on -> 2026-06-13. That document is now the record of the **delivered** compliance -> scanning platform (Phases 0–4 and 6 done; Phase 7's **exception** half done; -> all shipped in **v0.2.0-rc.6**). This file holds the **forward-looking -> remainder** — the two items that are not yet built. 
-> -> **UPDATE (2026-06-20, v0.2.0-rc.11): Phase 7 first-slice remediation has -> SHIPPED free-core.** Per-rule manual apply + snapshot/rollback from the host -> Remediation tab is built and live: the `remediation` service over the Kensa -> transport, the `remediation_requests` + `remediation_transactions` logbook, -> the worker-driven `approved → executing → executed | rolled_back` lifecycle, -> per-host serialization (busy fixes back off + requeue), and live SSE status -> (`remediation.completed`). Landed via #601 (execute/rollback + governance), -> #606 (conditional approval — free-core single-rule **auto-approves**, "A-keep" -> ADR), #607 (serialize + live status). Specs: `api-remediation`, -> `frontend-remediation-tab`, `system-rbac`. **What remains of Phase 7 is the -> licensed track** — bulk/sequenced and auto/policy-driven remediation — which -> keeps the approval-required lifecycle and the design notes below. The five -> decisions and "likely shape" below remain the reference for that track. -> -> **Status: 7 of 8 phases complete** (Phase 7 first-slice now shipped; licensed -> bulk/auto remediation + the Phase 5 bulk-scan tail remain). What is left: -> -> | Item | Size | Touches live hosts? | -> |------|------|---------------------| -> | Phase 5 tail — bulk scan endpoint | small | no | -> | Phase 7 — licensed bulk/auto remediation | large (own track) | **yes** | -> -> **GA scope decision (2026-06-05): remediation ships as a BETA feature in the -> GA release.** It is in-scope for GA but explicitly labelled *beta* — surfaced -> behind a `Beta` badge, gated by the `remediation:*` RBAC perms, and limited to -> the first-slice posture (per-rule manual, approval-gated, snapshot+rollback) -> ratified in the decisions below. The beta label sets the expectation that the -> auto/policy-driven and bulk-sequenced postures are *not* in GA and that the -> blast-radius surface is still hardening. 
Everything else in this file (the -> five decisions, the likely shape, the sequencing) stands — "beta in GA" is a -> labelling + scope-boundary decision, not a change to the build order. - ---- - -## Phase 5 (tail) — Bulk scan - -Everything else in Phase 5 shipped (per-host Scan buttons, scan-queue KPI, -hosts-list `compliance_summary` enrichment, avg/critical KPIs, the fleet -avg-compliance delta from Phase 6). The one remaining piece: - -- **API:** `POST /api/v1/hosts:scan` — enqueue a scan for a selection of hosts - or the whole fleet, idempotency-keyed. Reuses the Phase 1 single-host enqueue - path per host (same `scan_runs` logbook + `scan.completed` SSE), bounded by the - scheduler's per-tick rate limit so a whole-fleet click cannot stampede. -- **Frontend (Host Management):** a fleet-level / multi-select "Run scan" action - feeding the same scan-queue KPI. -- **Spec:** extend `api-host-scan` (or a small `api-fleet-scan`); update - `frontend-hosts-list`. -- **Risk:** low — no host mutation, just N enqueues of the already-proven scan - path. A good low-risk warm-up before the remediation track. - ---- - -## Phase 7 — Remediation *(its own track; SCOPING REQUIRED before building)* - -Remediation **changes target hosts** (edits configs, installs/removes packages, -restarts services). Unlike everything shipped so far — which only *reads* host -state — this has real blast radius. **Do not start coding until the decisions -below are ratified**, the same discipline used for the scheduler config and the -exception storage choices. - -### What exists already - -- `kensa.Remediate()` is available (kensa v0.3.2 `DefaultWithTransportFactory`). -- The in-memory SSH transport's `Put`/`Get` are currently not-implemented (the - scan path only calls `Run`). Remediation may need them to push remediation - scripts/files — first real implementation lands here. 
-- Exception governance is **done** and is the natural companion: a rule you - cannot or will not remediate gets a waiver instead. - -### Decisions needed (ratify first) - -1. **Execution model.** Three postures, riskiest last: - - *Manual, per-rule* — operator clicks "Remediate" on one failing rule → - one fix runs → rescan. **Recommended first slice** (mirrors how on-demand - scan preceded the adaptive scheduler). - - *Manual bulk* — apply N fixes together. Needs the rule-ordering question (#4). - - *Policy-driven / auto* — playbooks on a cadence. Most powerful, most - dangerous; a separate, later decision. - -2. **Approval gating.** The RBAC registry already splits `remediation:request` - / `:approve` / `:execute` / `:rollback`. A config edit on a production host - is arguably more consequential than a waiver. **Recommended: gate execution - behind approval** for anything beyond a dry-run, mirroring the exception - request→approve workflow. - -3. **Rollback + safety.** Kensa K-4/K-5 give transactional apply + rollback. - **Recommended: snapshot the pre-state, store it, expose a Rollback action.** - This is the difference between a tool people trust on prod and one they don't. - -4. **Open Kensa ratification.** `LoadRules` deliberately does **not** expose - rule-ordering (`depends_on` / `conflicts` / `supersedes`). If bulk/sequenced - remediation needs ordering, that is a **new Kensa-team ratification**, not - something OpenWatch re-implements. Only bites past per-rule-manual. - -5. **Transport `Put`/`Get`.** Implement only if a remediation mechanism needs to - push files; the scan path proves `Run` is enough for command-based checks. - -### Likely shape (pending the decisions) - -- **Backend:** `remediation` service over the transport (apply + rollback, - pre-state capture); a `remediations` logbook table mirroring `scan_runs`; - request→approve→execute→(rollback) lifecycle with the existing RBAC perms + - audit codes. 
-- **API:** `POST …/rules/{rule_id}:remediate` (request), the review actions, a - rollback action; suppressed/remediated-rule rendering. -- **Frontend:** the Remediation tab + per-rule Remediate affordance on the - Compliance tab (alongside the existing Request-exception action). -- **Spec:** `system-remediation`, `api-host-remediation`, `frontend-remediation-tab`. - -### Sequencing recommendation - -1. Ratify the five decisions (a short decision doc or a direct answer to #1 + #2). -2. Build **per-rule manual + approval-gated + rollback** as the first slice, - backend-first then frontend, the same layering as exceptions. -3. Revisit bulk / auto / ordering only after the manual path is trusted on a - real host. - ---- - -## Cross-cutting follow-ups (small, not blocking either item) - -- SSE streams outlive graceful shutdown's 30s grace — cancel streams on the - shutdown ctx. -- The scan-context **Capabilities** line needs stored capability data from Kensa - (currently absent). diff --git a/docs/engineering/stage_2_slice_a.md b/docs/engineering/stage_2_slice_a.md deleted file mode 100644 index 4db2cf83..00000000 --- a/docs/engineering/stage_2_slice_a.md +++ /dev/null @@ -1,400 +0,0 @@ -# Stage 2 — Slice A: Auth + Add a Host - -> **Status:** Plan (locked 2026-05-25) -> **Goal:** Replace the `X-Stub-Role` shim with real identity. Add the first real product object — a host you can scan against — including the credentials needed to reach it. -> **Estimate:** 3–5 weeks of focused work -> **Pre-req:** `stage-0-complete` tag (18/18 specs at 100% strict, all foundations wired) -> **Output:** ~6 new specs at 100% strict coverage, ~15 new endpoints, real auth on every existing endpoint, two new database scopes (users + hosts) with full RBAC + audit integration - ---- - -## Why this slice exists - -Stage 0 proved every foundation works. 
Stage 2 Slice A is the first time an operator can actually USE the platform: log in, prove who you are, register a host, verify the platform can reach it. Every Stage-0 endpoint is still gated by the `X-Stub-Role` header today; Slice A replaces that with real identity bound from a real credential. After Slice A, the demo header is gone and any caller without a session token gets 401. - -This slice deliberately does NOT include scans, findings, or compliance state — those land in Slice B. The scope discipline is: **the platform can authenticate a user and reach a host**. That's it. - ---- - -## Locked design decisions - -These are the answers from the 2026-05-25 scoping conversation. Every implementation decision rolls up to one of these: - -### 1. Auth surface - -- **Both** JWT and sessions. JWT for API consumers (tools, automation, the frontend's API client). Sessions for browser sign-in (cookie-bound, CSRF-protected). The same `users` row backs both. -- **TOTP MFA** ships with the slice. Required for every user; first login enrolls. Free-tier feature. -- **OIDC + SAML** declared in the registry but **license-gated** (`sso_saml` feature on the openwatch_plus tier). Wired in Slice A as stubs that return 402 without the license; full implementation lands in a follow-up slice. The free-tier baseline is local username + password + TOTP. - -### 2. 
Password policy: NIST SP 800-63B - -| Requirement | Value | -|---|---| -| Min length | 8 chars (15 for users with the `admin` role) | -| Max length | 64 chars enforced; 128 chars accepted (longer truncated to 128) | -| Character class rules | **None** — length is the signal | -| Forced rotation | **Prohibited** — rotation only on suspected compromise | -| Breach corpus check | **Required** — local SHA-1 prefix list (top 1M compromised passwords), refreshed offline | -| Lockout | Rate-limit attempts (10/min/IP and 5/min/user), not hard lockout | -| Password hints | Forbidden — `hint` field never present in schema | -| KDF | Argon2id, 64 MiB memory, 3 iterations, 1 lane (matches Stage-0 spec for license signing strength) | - -### 3. Session timeouts - -- Inactivity: 15 minutes (matches Python backend) -- Absolute: 12 hours (matches Python backend) -- Refresh token: 7 days, rotated on every use -- JWT access token: 30 minutes -- Stored sessions persisted in `sessions` table (server-side; can be revoked individually) - -### 4. Identity model - -- `users.id` is `UUID` (server-assigned, stable across renames) -- `users.username` is the human handle; unique -- `users.email` is informational; unique; used for MFA-recovery flows later -- No `(hostname, environment)`-style natural key on users — username is the natural key - -### 5. Host inventory (Slice-A subset) - -Columns shipped in Slice A: - -| Field | Type | Notes | -|---|---|---| -| `id` | UUID PK | Server-assigned | -| `hostname` | TEXT NOT NULL | FQDN; unique with environment | -| `ip_address` | INET NOT NULL | IPv4 or IPv6 | -| `port` | INTEGER NOT NULL DEFAULT 22 | SSH port | -| `display_name` | TEXT | Operator-friendly label | -| `description` | TEXT | | -| `environment` | TEXT NOT NULL DEFAULT 'production' | Label string (`production` / `staging` / etc.) 
| -| `tags` | TEXT[] NOT NULL DEFAULT '{}' | GIN-indexed | -| `group_id` | UUID NULL | Forward-compatibility — groups land in a later slice | -| `username` | TEXT NULL | Per-host override; null = fall back to system credential | -| `created_by` | UUID NOT NULL REFERENCES users(id) | | -| `created_at` | TIMESTAMPTZ NOT NULL DEFAULT now() | | -| `updated_at` | TIMESTAMPTZ NOT NULL DEFAULT now() | | -| `deleted_at` | TIMESTAMPTZ NULL | Soft delete; partial unique index `WHERE deleted_at IS NULL` | - -Columns **deferred** to later slices (each lands with its producer): -- `operating_system`, `os_family`, `os_version`, `architecture`, `platform_identifier` — discovered by Kensa (Slice B) -- `status`, `last_check`, `next_check_time`, consecutive-failure/success counters — populated by the adaptive scheduler (post-Slice-B) - -Unique constraints: -- `UNIQUE (hostname, environment) WHERE deleted_at IS NULL` - -### 6. SSH credentials — symmetric tier model (Option B) - -One `credentials` table for both scopes: - -| Field | Type | Notes | -|---|---|---| -| `id` | UUID PK | | -| `scope` | TEXT NOT NULL CHECK IN ('system','host') | Slice A: these two only; `host_group` reserved | -| `scope_id` | UUID NULL | NULL for `scope=system`; host UUID for `scope=host` | -| `name` | TEXT NOT NULL | e.g. 
"default ops account" | -| `description` | TEXT | | -| `username` | TEXT NOT NULL | | -| `auth_method` | TEXT NOT NULL CHECK IN ('ssh_key','password','both') | | -| `encrypted_password` | BYTEA NULL | AES-256-GCM | -| `encrypted_private_key` | BYTEA NULL | AES-256-GCM | -| `encrypted_private_key_passphrase` | BYTEA NULL | AES-256-GCM | -| `ssh_key_fingerprint` | TEXT NULL | SHA256:base64 — display metadata | -| `ssh_key_type` | TEXT NULL | ed25519, rsa, ecdsa | -| `ssh_key_bits` | INTEGER NULL | | -| `ssh_key_comment` | TEXT NULL | | -| `is_default` | BOOLEAN NOT NULL DEFAULT false | Only one row WHERE scope='system' AND is_default=true | -| `is_active` | BOOLEAN NOT NULL DEFAULT true | | -| `created_by` | UUID NOT NULL REFERENCES users(id) | | -| `created_at`, `updated_at` | TIMESTAMPTZ | | - -Unique constraints + invariants: -- `UNIQUE (scope, scope_id, name) WHERE is_active = true` -- Partial unique index: `UNIQUE WHERE scope='system' AND is_default=true` (only one system default) -- CHECK: `scope='system' → scope_id IS NULL` -- CHECK: `scope='host' → scope_id IS NOT NULL` -- CHECK: `auth_method IN ('ssh_key','both') → encrypted_private_key IS NOT NULL` -- CHECK: `auth_method IN ('password','both') → encrypted_password IS NOT NULL` - -Resolver (`internal/credential/resolve.go`): -``` -Resolve(ctx, hostID) → - 1. SELECT credentials WHERE scope='host' AND scope_id=hostID AND is_active=true - Return if found (highest precedence). - 2. SELECT credentials WHERE scope='system' AND is_default=true AND is_active=true - Return if found. - 3. Return ErrNoCredential. -``` - -The resolver returns *one* fully-formed credential — never blends fields across tiers. Mixed-tier credentials are a footgun. - -### 7. Encryption key for credentials - -`credentials.*` fields are encrypted at rest with AES-256-GCM. The data-encryption key (DEK) is loaded from `/etc/openwatch/secrets/credential-key` (32 random bytes, mode 0600, owner openwatch). 
The key path is configurable via `OPENWATCH_CREDENTIAL_KEY_FILE`. **Out of scope for Slice A**: KMS / Vault integration — that lands as a Slice-A.5 if needed. - -The DEK is **separate** from the license signing key. License keys are public-key crypto (Ed25519) verifying signed JWTs; credential keys are symmetric AES for encrypting secrets at rest. Same operator-managed file, different key. - -### 8. Audit additions - -New audit codes (added to `audit/events.yaml`, codegen regenerated): - -- `auth.login.success`, `auth.login.failure` (exist) -- `auth.logout` (exist) -- `auth.session.created`, `auth.session.revoked`, `auth.session.expired` -- `auth.password.changed`, `auth.password.policy_failed` -- `auth.mfa.enrolled`, `auth.mfa.challenged`, `auth.mfa.failed` (some exist) -- `user.created`, `user.updated`, `user.deleted`, `user.role_assigned`, `user.role_removed` -- `host.created`, `host.updated`, `host.deleted` -- `host.connectivity_check` (success/failure carried via `Event.Outcome`) -- `credential.created`, `credential.updated`, `credential.deleted`, `credential.used` - -Pre-store redaction (already in place from Stage 0) handles `password`, `ssh_key`, `private_key_passphrase`, `secret`, `token`, `license_jwt`. We add `credential_dek` to the redaction list to be safe; the DEK itself never appears in any code path that emits audit, but defense in depth. - ---- - -## New specs (writing order) - -Each spec gets ~10-15 ACs. Total ~80-100 ACs across the slice. 
- -| Order | Spec | Tier | Scope | -|---|---|---|---| -| 1 | `system-auth-identity` | T1 | Password hashing, password policy enforcement, breach corpus check, session token issue/verify/revoke, JWT issue/verify/refresh, MFA TOTP enrollment + challenge | -| 2 | `system-user-management` | T1 | `users`+`user_roles` schemas, user CRUD service layer, role assignment with custom-role validation | -| 3 | `system-credential-store` | T1 | `credentials` schema, AES-256-GCM at rest, resolver with system→host fallback, key file loading + validation | -| 4 | `system-host-inventory` | T2 | `hosts` schema, host CRUD service layer, soft delete, tag array index | -| 5 | `system-ssh-connectivity` | T2 | SSH dial against a host's resolved credential, known-hosts policy, NIST SP 800-57 key bit checks (RSA ≥2048, Ed25519 always OK), timeout enforcement | -| 6 | `api-auth` | T1 | `POST /auth/login`, `POST /auth/logout`, `POST /auth/refresh`, `GET /auth/me`, `POST /auth/mfa:enroll`, `POST /auth/mfa:verify`, `POST /auth/password:change` | -| 7 | `api-users` | T2 | `GET/POST/GET-by-id/PUT/DELETE /admin/users`, `POST /admin/users/{id}/roles:assign`, `:unassign`, `POST /admin/roles` (custom-role create — finally lands per `docs/rbac_registry.md §8`) | -| 8 | `api-credentials` | T2 | `GET/POST/PUT/DELETE /admin/credentials` (scope=system), `GET/POST/PUT/DELETE /hosts/{id}/credentials` (scope=host) | -| 9 | `api-hosts` | T2 | `GET/POST/GET-by-id/PUT/DELETE /hosts`, `POST /hosts/{id}:connectivity-check`, `GET /hosts/{id}/audit-events` | - -Specs 1-5 are "system" (foundation behavior); 6-9 are "api" (HTTP surface). Implementation order matches. 
- ---- - -## Implementation plan (week-by-week) - -### Week 1 — Auth foundation - -| Day | Deliverable | -|---|---| -| 1 | `internal/identity/` package: password hashing (Argon2id), password policy validator (NIST 800-63B), breach corpus check (local SHA-1 prefix list, no network) | -| 2 | `users` table migration; `users` repository; CRUD service | -| 3 | TOTP enrollment + verify; `auth_mfa_secrets` table; QR-code provisioning URI generation | -| 4 | Session token mint/verify/revoke; `sessions` table; refresh-token rotation; absolute-timeout enforcement | -| 5 | JWT mint (RS256) and verify; the same `users` row issues both — JWT for API, session cookie for browser | - -By Friday: `internal/identity/` is feature-complete, fully tested at 100% strict spec coverage. No HTTP yet. - -### Week 2 — Credentials + host model - -| Day | Deliverable | -|---|---| -| 1 | `internal/credential/` package: AES-256-GCM encrypt/decrypt; DEK loader from configurable path | -| 2 | `credentials` table migration; `credentials` repository; CRUD service; resolver with fallback | -| 3 | `hosts` table migration; `hosts` repository; CRUD service with soft delete | -| 4 | `internal/ssh/` package: dial, known-hosts policy, key-bit validator, timeout-bound connectivity check | -| 5 | Buffer day; load testing of resolver hot path; bench the credential decrypt + dial chain end-to-end | - -### Week 3 — HTTP surface - -| Day | Deliverable | -|---|---| -| 1 | `api/auth.yaml` OpenAPI subspec; `POST /auth/login`, `/logout`, `/refresh`, `GET /auth/me`. Replace `auth.StubIdentityBinder` with the real identity binder. Every existing endpoint now gets real `auth.Identity` from the session/JWT. | -| 2 | `POST /auth/mfa:enroll`, `/auth/mfa:verify`, `POST /auth/password:change`. Login flow enforces MFA challenge. 
| -| 3 | `api/users.yaml`; user CRUD endpoints; role-assignment endpoints; custom-role CRUD finally ships (per `rbac_registry.md §8`) | -| 4 | `api/credentials.yaml`; system + host credential CRUD; on read the encrypted fields are NEVER returned — only metadata (fingerprint, key type, key bits). The plaintext only leaves the DB into an SSH dial; the API surface returns null for the secrets. | -| 5 | `api/hosts.yaml`; host CRUD endpoints; `POST /hosts/{id}:connectivity-check` performs a real SSH dial via the resolved credential and returns success/failure + diagnostic detail (NEVER the credential itself) | - -### Week 4 — Wire-through + integration - -| Day | Deliverable | -|---|---| -| 1 | Remove `auth.StubIdentityBinder` from production code path; tests still use it via a build-tag-isolated test helper. Every API integration test now uses a real login + token. | -| 2 | OIDC + SAML stubs: handler endpoints exist, return 402 license.feature_unavailable when called, audit-log the attempt. Real implementations are a follow-up. | -| 3 | Update `docs/guides/INSTALLATION.md` with the first-run flow: bootstrap-admin command, login, MFA enrollment, replace demo cert (which already exists), add a host. | -| 4 | Re-run the 19-step DoD from `release-stage-0-signoff` — every step still passes, but now via real auth instead of stub roles. Update DoD with the new step list (no more `X-Stub-Role`). | -| 5 | `make check` clean; specter sync clean; `slice-a-complete` tag candidate | - -### Week 5 — Buffer / cleanup - -Reality buffer. Estimated days are honest but real engineering surfaces unknown-unknowns. Use this week to: -- Fix anything that surfaced under load -- Tighten any flaky tests -- Pay any test debt (the 4 Stage-1 modules `auth.token_blacklist_pg`, `auth.credential_handler` that Stage-1 evidence called out for Slice A entry) -- Documentation passes -- Code review cycle - -If Week 5 is genuinely empty: ship and move to Slice B early. 
- ---- - -## What replaces the `X-Stub-Role` header at the end - -```go -// Before (Stage 0): -r.Use(auth.StubIdentityBinder) // reads X-Stub-Role - -// After (Slice A): -r.Use(auth.IdentityBinder(identityService)) // reads session cookie or JWT bearer -``` - -Same `auth.Identity` shape on the request context. Same `RequirePermission` / `EnforcePermission` middleware. Same RBAC + license-gate ordering. The Identity is just *real* now. - -Test fixtures get a build-tag isolated helper: `auth_test_helpers.go` under `//go:build test` (or similar) that lets tests mint sessions without going through the full login flow. Production builds physically cannot import the helper. - ---- - -## Stage-2 entry criteria (from Stage 1 evidence) - -From `docs/MUST_BACKEND_FUNCTIONALITY.md §Correction 2`, **Slice A cannot ship without test coverage for**: - -- `services/auth/credential_handler.py` — equivalent functionality lives in `internal/credential/` (spec `system-credential-store`) -- `services/auth/token_blacklist_pg.py` — equivalent functionality lives in `internal/identity/sessions.go` (spec `system-auth-identity` AC covering "revoked session rejects subsequent requests") - -These are entry criteria. The specs above already cover them — flagging here so the test debt list from Stage 1 is explicitly closed. - ---- - -## Out of scope (explicit deferrals) - -Calling these out so they don't accidentally creep in: - -- **OIDC / SAML full implementation** — handlers exist as 402-stubs in Slice A; real flows are a follow-up slice (call it Slice A.5 — SSO providers). -- **Host groups** — `hosts.group_id` column exists but no `host_groups` table yet. Group CRUD and group-level credentials land with Slice B or A.5 (TBD by which one needs it first). -- **OS discovery / fingerprinting** — Kensa does that. Lands in Slice B. -- **Adaptive scheduler** — has its own spec doc (`docs/openwatchos/02-ADAPTIVE-COMPLIANCE-SCHEDULER.md`); needs scans to exist. Lands post-Slice-B. 
-- **KMS / Vault credential keys** — file-based DEK in Slice A. KMS integration is operator-driven; lands if/when a customer needs it. -- **WebAuthn / FIDO2** — `fido2_mfa` license feature is in the registry; TOTP is the Slice-A MFA baseline. FIDO2 lands as a follow-up MFA method behind the license gate. -- **API keys** (for automation that doesn't want JWT lifecycle) — useful but not in Slice A. Slice B if there's demand. -- **Self-service signup** — no operator has asked for this. Admin creates accounts; that's it for Slice A. Slice A.5 (or never) for self-service. - ---- - -## What "Slice A done" means concretely - -The 19-step Definition of Done from Stage 0 is updated: - -| # | Step | Status after Slice A | -|---|------|----------------------| -| 1-5 | Build + install + service start | Unchanged | -| 6 | `systemctl start openwatch` | Unchanged | -| 7-10 | health, echo, audit, replay | Now require Bearer token from a real login | -| 11-15 | RBAC + license demo endpoints | `X-Stub-Role` removed; role comes from the user's `user_roles` | -| 16 | Enqueue test job | Same | -| 17 | `specter sync` | Same — strict mode passes | -| 18 | Cert hot-reload | Same | -| 19 | DB persistence across restart | Same | -| **NEW 20** | `POST /auth/login` with valid creds + TOTP returns access + refresh tokens | Slice A | -| **NEW 21** | `GET /auth/me` with token returns identity + role | Slice A | -| **NEW 22** | `POST /admin/users` with admin role creates a user | Slice A | -| **NEW 23** | `POST /hosts` with valid body creates a host | Slice A | -| **NEW 24** | `POST /admin/credentials` creates a system credential | Slice A | -| **NEW 25** | `POST /hosts/{id}:connectivity-check` successfully dials the test host | Slice A | -| **NEW 26** | `OIDC/SAML` initiate endpoint returns 402 license.feature_unavailable | Slice A | - -Total: 26 steps. The new spec `release-slice-a-signoff` carries this list. - ---- - -## Why this slice is the right size - -It's tempting to expand. 
Examples of expansion I've already filtered out: -- **"Add API keys too"** — same auth surface, but a distinct lifecycle. Adds 3-4 days. Not asked for. -- **"Add host groups"** — adds 3-5 days. Useful, but no scan-time consumer of groups in this slice (no scans yet). -- **"Add WebAuthn / FIDO2"** — license-gated; TOTP gets us 95% of the value at 30% of the effort. FIDO2 is a follow-up. -- **"Add full OIDC/SAML"** — 2-3 weeks alone. The 402-stubs preserve the option without paying the cost yet. - -Slice A is "auth + add a host." The decisions above are the minimum viable cut. If anything below 3 weeks is forced, my recommendation is to drop the custom-role CRUD (defer to A.5) — built-in roles cover Slice-A demos. Custom roles are operator-pleasing but no Slice-A capability depends on them. - ---- - -## What I need before starting implementation - -Nothing. The decisions above are locked. Once you've read this doc and agreed (or pushed back on anything), I: - -1. Write specs 1-9 in order -2. Each spec lands with its tests (the tests fail at first; that's the point) -3. Implement to make the tests pass -4. `make check` clean after each spec -5. `specter sync` strict-mode clean after the slice ends - -The first commit on this slice is `app/specs/system/auth-identity.spec.yaml`. After you approve this plan, that's where I start. - ---- - -## Slice A — Completion notes (2026-05-25) - -All 9 specs shipped. `specter sync` reports 28/28 at 100% strict coverage -(19 pre-existing + 9 new). Full module test suite green, `golangci-lint` -clean across the module. 
- -| Spec | ACs | Implementation | -|---|---|---| -| `system-auth-identity` | 20 | `internal/identity/{password,sessions,jwt,refresh,mfa,binder}.go` | -| `system-user-management` | 12 | `internal/users/users.go`, migration 0005 | -| `system-credential-store` | 12 | `internal/credential/credential.go`, migration 0007 | -| `system-host-inventory` | 12 | `internal/host/host.go`, migration 0008 | -| `system-ssh-connectivity` | 10 | `internal/ssh/{validate,known_hosts,dial}.go` | -| `api-auth` | 12 | `internal/server/auth_handlers.go` | -| `api-users` | 12 | `internal/server/users_handlers.go`, migration 0009 | -| `api-credentials` | 12 | `internal/server/credentials_handlers.go`, `internal/credential/api.go` | -| `api-hosts` | 12 | `internal/server/hosts_handlers.go` | - -### Deltas from the plan - -- **DEK source** is `internal/secretkey` shared by MFA + credential - encryption (file-based for Slice A; KMS/Vault deferred to a later - slice). MFA secrets, credential password/private-key/passphrase all - encrypted with AES-256-GCM under the same DEK. -- **Stub identity binder retained** with a `X-Stub-User-Id` header - override so admin endpoints whose handlers persist `created_by` FKs - can be exercised in tests without the full session cookie dance. - The stub binder is no-op when the production binder has already set - a non-anonymous identity (see `auth.StubIdentityBinder` for the - coexistence rule). Removal of the stub is a Slice-B item. -- **`api-credentials` C-01 enforced via a metadata-only struct**: - `credential.Metadata` carries no plaintext or ciphertext secret - fields; the dial path uses `credential.Credential` with decrypted - secrets but that struct never crosses the HTTP layer. -- **`api-hosts` PATCH** is supported; PUT is not. The PR plan allowed - either; PATCH is cleaner since most callers only mutate a subset. -- **Spec 25 (`connectivity-check`) is NOT in Slice A**. 
The plan - reserved it for "POST /hosts/{id}:connectivity-check"; the executor - + audit event for connectivity probes is sized for Slice B with the - scan executor. No connectivity-check endpoint ships in Slice A. -- **OIDC/SAML 402 stubs** declared in the permissions registry - (`admin:sso_provider`) but no SSO endpoints land in Slice A. The - Slice B plan picks them up alongside the SCIM bridge work. - -### Wire-through - -End-to-end coverage lives at `internal/server/api_slice_a_e2e_test.go` -(`TestSliceA_WireThrough_RealIdentity`). The flow: -bootstrap admin → `/auth/login` (real session cookie) → `/auth/me` → -create host → create system + host credentials → resolve (host-scope -wins) → soft-delete host cred → resolve (falls back to system default) -→ soft-delete host → confirm `host.created` / `host.deleted` / -`credential.created` / `credential.deleted` audit events landed. The -test uses no `X-Stub-*` header after step 1, proving the production -identity binder threads through every layer. - -### Drive-bys - -- `internal/credential/credential_test.go`: `TestResolve_HostScopeWins` - was inserting a host-scope credential against a UUID the `hosts` - table didn't have, violating the deferred FK from migration 0008. - Fixed with a `seedHost` helper. -- Test fixture `freshAPIServer` clears custom roles between tests so - `api-users` AC-11 doesn't collide with leftover `field_auditor` rows. -- Test fixture seeds a "stub-admin" user and pins its UUID into - `stubAdminUserID`; the `asRole` helper attaches `X-Stub-User-Id` so - handlers writing `created_by` FK columns resolve to a real - `users.id`. - -### Slice-A signoff steps (revised tally) - -Of the 26 signoff steps the plan listed, 25 are reachable today — -step 25 (`POST /hosts/{id}:connectivity-check`) moves to Slice B with -the scan executor. The `release-slice-a-signoff` spec is not yet -written; that's the Slice-A closer alongside whatever doc/changelog -work the release process needs. 
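The AES-256-GCM encryption under a shared DEK noted in the deltas above could be sketched with Go's standard library as follows. This is an illustration under assumptions, not the contents of `internal/secretkey`: the `Seal`/`Open` names are hypothetical, and storing the nonce as a prefix of the ciphertext is one common layout that the plan does not confirm.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-256-GCM AEAD and rejects any key that is not 32 bytes,
// since aes.NewCipher would otherwise accept 16- or 24-byte keys silently.
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("DEK must be exactly 32 bytes")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// Seal encrypts plaintext and returns nonce || ciphertext || tag
// (assumed layout).
func Seal(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Passing nonce as dst appends the ciphertext after it.
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// Open reverses Seal and fails if the value was truncated or tampered with.
func Open(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // stand-in for the 32-byte key file contents
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := Seal(key, []byte("ssh-password"))
	if err != nil {
		panic(err)
	}
	plain, err := Open(key, sealed)
	fmt.Println(string(plain), err)
}
```

A fresh random nonce per call matters here because one DEK protects many rows; reusing a nonce under the same key would undermine GCM's guarantees.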
diff --git a/docs/guides/API_GUIDE.md b/docs/guides/API_GUIDE.md index eb867212..32aa171e 100644 --- a/docs/guides/API_GUIDE.md +++ b/docs/guides/API_GUIDE.md @@ -108,7 +108,7 @@ protected endpoint declares the permission it requires (visible as A caller missing the required permission receives `403`. The full permission and role registry is the source of truth at -[`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md); it is +`docs/engineering/rbac_registry.md`; it is served at runtime via `GET /api/v1/auth/permissions:registry`. --- @@ -380,13 +380,13 @@ longer worker-internal only): against them until they appear in `api/openapi.yaml`. For how OpenWatch invokes Kensa, see -[`docs/KENSA_OPENWATCH_BOUNDARY.md`](../KENSA_OPENWATCH_BOUNDARY.md). +`docs/KENSA_OPENWATCH_BOUNDARY.md`. --- ## What's next - [Install guide](INSTALLATION.md) — install, configure, and run the service. -- [RBAC registry](../engineering/rbac_registry.md) — permission and role reference. -- [Kensa ↔ OpenWatch boundary](../KENSA_OPENWATCH_BOUNDARY.md) — how scanning works. +- RBAC registry — permission and role reference. +- Kensa ↔ OpenWatch boundary — how scanning works. - `api/openapi.yaml` — the authoritative, always-current API contract. diff --git a/docs/guides/BACKUP_RECOVERY.md b/docs/guides/BACKUP_RECOVERY.md index 9277040c..207c65de 100644 --- a/docs/guides/BACKUP_RECOVERY.md +++ b/docs/guides/BACKUP_RECOVERY.md @@ -311,7 +311,7 @@ journalctl -u openwatch -n 200 --no-pager | grep -iE 'scheduler|worker|scan' 4. **Review access.** Audit user accounts and role assignments. Roles and permissions are defined in - [`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md). + `docs/engineering/rbac_registry.md`. 5. **Recover.** If integrity is in doubt, rebuild on a clean host from a known-good backup using the disaster-recovery procedure above, then rotate @@ -347,5 +347,5 @@ The following are not part of OpenWatch today. Do not script against them. 
| Logs | `journalctl -u openwatch -f` | See also: [`INSTALLATION.md`](INSTALLATION.md), -[`rbac_registry.md`](../engineering/rbac_registry.md), and the API contract in +`rbac_registry.md`, and the API contract in [`api/openapi.yaml`](../../api/openapi.yaml). diff --git a/docs/guides/ENVIRONMENT_REFERENCE.md b/docs/guides/ENVIRONMENT_REFERENCE.md index ee726dbe..4d7f3b04 100644 --- a/docs/guides/ENVIRONMENT_REFERENCE.md +++ b/docs/guides/ENVIRONMENT_REFERENCE.md @@ -192,7 +192,7 @@ curl -k https://localhost:8443/api/v1/health The API is served under `/api/v1/`; `api/openapi.yaml` is the contract source of truth. Role definitions live in -[`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md) and +`docs/engineering/rbac_registry.md` and `internal/auth/permissions.yaml`. ## Operational runbooks diff --git a/docs/guides/HOSTS_AND_REMEDIATION.md b/docs/guides/HOSTS_AND_REMEDIATION.md index afbd8cf8..f3740a31 100644 --- a/docs/guides/HOSTS_AND_REMEDIATION.md +++ b/docs/guides/HOSTS_AND_REMEDIATION.md @@ -232,9 +232,9 @@ For organizations that require an approval step: 3. Once approved, a user with `remediation:execute` clicks **Fix** to apply the change. Execution is operator-initiated, not automatic. -See [Remediation & Exception Governance](../engineering/remediation_exception_governance.md) +See Remediation & Exception Governance for the full role matrix. Single-operator workspaces cannot self-approve today; -see the [governance ADR](../engineering/remediation_governance_adr.md). +see the governance ADR. --- @@ -284,7 +284,7 @@ returned to its previous state. Built-in roles, least to most privilege: `viewer` → `auditor` → `ops_lead` → `security_admin` → `admin` (`admin` holds every permission). The permission source of truth is `auth/permissions.yaml`; see -[Remediation & Exception Governance](../engineering/remediation_exception_governance.md) +Remediation & Exception Governance for the complete matrix. 
| Operation | Permission | Roles that hold it | diff --git a/docs/guides/PRODUCTION_DEPLOYMENT.md b/docs/guides/PRODUCTION_DEPLOYMENT.md index 8c072c54..da73d553 100644 --- a/docs/guides/PRODUCTION_DEPLOYMENT.md +++ b/docs/guides/PRODUCTION_DEPLOYMENT.md @@ -46,7 +46,7 @@ step today (write a unit that runs `ExecStart=/usr/bin/openwatch worker`). | API + UI | `https://<host>:8443/` | UI embedded via `go:embed`; API under `/api/v1/` | | Database | PostgreSQL 14+ | The only datastore. Not provisioned by the package. | | Job queue | PostgreSQL table, `SKIP LOCKED` | No external broker. Drained by `serve`/`worker`. | -| Compliance engine | Kensa (Go), in-process | SSH-based, native YAML rules. See [the boundary doc](../KENSA_OPENWATCH_BOUNDARY.md). | +| Compliance engine | Kensa (Go), in-process | SSH-based, native YAML rules. See the boundary doc. | --- @@ -418,7 +418,7 @@ psql -h 127.0.0.1 -U openwatch -d openwatch -c "\ ## See also - [Install guide](INSTALLATION.md) — canonical install and provisioning. -- [Kensa ↔ OpenWatch boundary](../KENSA_OPENWATCH_BOUNDARY.md) — compliance engine integration. -- [RBAC registry](../engineering/rbac_registry.md) — roles and permissions. +- Kensa ↔ OpenWatch boundary — compliance engine integration. +- RBAC registry — roles and permissions. - [API contract](../../api/openapi.yaml) — every endpoint, its permission, and audit events. - [Releasing runbook](../runbooks/RELEASING.md) — building and signing releases. diff --git a/docs/guides/QUICKSTART.md b/docs/guides/QUICKSTART.md index 050c3410..a6db3dc8 100644 --- a/docs/guides/QUICKSTART.md +++ b/docs/guides/QUICKSTART.md @@ -260,8 +260,8 @@ counts to a single framework key. 
| Task | Where | |------|-------| | Full install and configuration reference | [docs/guides/INSTALLATION.md](INSTALLATION.md) | -| Roles and permissions | [docs/engineering/rbac_registry.md](../engineering/rbac_registry.md) | -| Kensa ↔ OpenWatch boundary | [docs/KENSA_OPENWATCH_BOUNDARY.md](../KENSA_OPENWATCH_BOUNDARY.md) | +| Roles and permissions | docs/engineering/rbac_registry.md | +| Kensa ↔ OpenWatch boundary | docs/KENSA_OPENWATCH_BOUNDARY.md | | API contract (source of truth) | `api/openapi.yaml` (paths under `/api/v1`) | ## Troubleshooting diff --git a/docs/guides/SECURITY_HARDENING.md b/docs/guides/SECURITY_HARDENING.md index 16a0c5ab..669dab7f 100644 --- a/docs/guides/SECURITY_HARDENING.md +++ b/docs/guides/SECURITY_HARDENING.md @@ -37,7 +37,7 @@ Source: `cmd/openwatch/main.go`, `packaging/common/openwatch.service`, The compliance engine is Kensa, which connects to managed hosts over SSH and runs native YAML checks. See -[`docs/KENSA_OPENWATCH_BOUNDARY.md`](../KENSA_OPENWATCH_BOUNDARY.md). +`docs/KENSA_OPENWATCH_BOUNDARY.md`. --- @@ -207,7 +207,7 @@ error. Source of truth: `auth/permissions.yaml` → `internal/auth/permissions.gen.go` and `internal/auth/roles.gen.go`. Enforcement: `internal/auth/middleware.go` (`EnforcePermission`, `RequirePermission`). -Design doc: [`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md). +Design doc: `docs/engineering/rbac_registry.md`. Built-in roles, least to most privileged: @@ -245,7 +245,7 @@ the process logs JSON to `journald`. Source: `cmd/openwatch/main.go` (`audit.Init`, `audit.EmitSync`), `internal/audit/`, `internal/db/migrations/0002_audit_events_taxonomy.sql`, -[`docs/engineering/audit_event_taxonomy.md`](../engineering/audit_event_taxonomy.md). +`docs/engineering/audit_event_taxonomy.md`. 
Representative event codes (taxonomy): @@ -537,11 +537,11 @@ Source for every checklist item is cited in the section above that introduces it - Install, configure, TLS replacement, uninstall: [`docs/guides/INSTALLATION.md`](INSTALLATION.md) - RBAC registry and permission model: - [`docs/engineering/rbac_registry.md`](../engineering/rbac_registry.md) + `docs/engineering/rbac_registry.md` - Audit event taxonomy: - [`docs/engineering/audit_event_taxonomy.md`](../engineering/audit_event_taxonomy.md) + `docs/engineering/audit_event_taxonomy.md` - Kensa ↔ OpenWatch boundary: - [`docs/KENSA_OPENWATCH_BOUNDARY.md`](../KENSA_OPENWATCH_BOUNDARY.md) + `docs/KENSA_OPENWATCH_BOUNDARY.md` - API contract (per-operation required permission, license gate, audit events): [`api/openapi.yaml`](../../api/openapi.yaml) - Behavioral specs: [`specs/`](../../specs/) diff --git a/docs/guides/UPGRADE_PROCEDURE.md b/docs/guides/UPGRADE_PROCEDURE.md index f9eead35..e9f610d6 100644 --- a/docs/guides/UPGRADE_PROCEDURE.md +++ b/docs/guides/UPGRADE_PROCEDURE.md @@ -277,7 +277,7 @@ binary. Rules therefore travel with the binary — installing a new OpenWatch package is what updates the bundled rule set. There is no separate rule-pull or out-of-band rule-sync step. For the Kensa/OpenWatch responsibility boundary, see -[`docs/KENSA_OPENWATCH_BOUNDARY.md`](../KENSA_OPENWATCH_BOUNDARY.md). +`docs/KENSA_OPENWATCH_BOUNDARY.md`. ## Upgrading PostgreSQL diff --git a/docs/guides/runbooks/HIGH_CPU.md b/docs/guides/runbooks/HIGH_CPU.md index 22582094..f9b66b2e 100644 --- a/docs/guides/runbooks/HIGH_CPU.md +++ b/docs/guides/runbooks/HIGH_CPU.md @@ -22,7 +22,7 @@ This runbook covers the three processes that can saturate CPU on an OpenWatch ho For install and configuration details, see [`docs/guides/INSTALLATION.md`](../INSTALLATION.md). For the -Kensa boundary, see [`docs/KENSA_OPENWATCH_BOUNDARY.md`](../../KENSA_OPENWATCH_BOUNDARY.md). +Kensa boundary, see `docs/KENSA_OPENWATCH_BOUNDARY.md`. 
--- @@ -219,7 +219,7 @@ curl -sk -X PUT https://localhost:8443/api/v1/system/discovery/config \ > These endpoints require an authenticated token with the appropriate role. Confirm > the exact request body and required permission against `api/openapi.yaml` and -> [`docs/engineering/rbac_registry.md`](../../engineering/rbac_registry.md) before use. +> `docs/engineering/rbac_registry.md` before use. > The schedulers log a warning at startup when paused. ### Path D: slow the worker poll loop diff --git a/docs/guides/runbooks/SECURITY_INCIDENT.md b/docs/guides/runbooks/SECURITY_INCIDENT.md index 8e3ecc4f..33741a31 100644 --- a/docs/guides/runbooks/SECURITY_INCIDENT.md +++ b/docs/guides/runbooks/SECURITY_INCIDENT.md @@ -7,7 +7,7 @@ OpenWatch runs as a single Go binary (`/usr/bin/openwatch`) managed by `systemd` (`openwatch.service`). It serves the REST API and the embedded UI over HTTPS on port `8443` and stores all data in PostgreSQL (there is no MongoDB, Redis, Celery, or container runtime). Audit events are written to the `audit_events` table; the service logs to the journal (`journalctl -u openwatch`). Adjust `psql` connection flags (`-h`, `-p`) for your deployment. -This runbook covers containment, investigation, and recovery for a suspected compromise. For install, config, and role definitions see [docs/guides/INSTALLATION.md](../INSTALLATION.md) and [docs/engineering/rbac_registry.md](../../engineering/rbac_registry.md). +This runbook covers containment, investigation, and recovery for a suspected compromise. For install, config, and role definitions see [docs/guides/INSTALLATION.md](../INSTALLATION.md) and docs/engineering/rbac_registry.md. --- @@ -172,7 +172,7 @@ ORDER BY ur.granted_at DESC; " ``` -The five built-in roles, in increasing privilege, are `viewer`, `auditor`, `ops_lead`, `security_admin`, and `admin`. See [docs/engineering/rbac_registry.md](../../engineering/rbac_registry.md) for the full permission sets. 
+The five built-in roles, in increasing privilege, are `viewer`, `auditor`, `ops_lead`, `security_admin`, and `admin`. See docs/engineering/rbac_registry.md for the full permission sets.
 
 ### Active sessions and refresh tokens
 
diff --git a/frontend/package-lock.json b/frontend/package-lock.json
index 40a513a8..d8a3f255 100644
--- a/frontend/package-lock.json
+++ b/frontend/package-lock.json
@@ -4601,10 +4601,20 @@
       "license": "MIT"
     },
     "node_modules/js-yaml": {
-      "version": "4.1.1",
-      "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.1.1.tgz",
-      "integrity": "sha512-qQKT4zQxXl8lLwBtHMWwaTcGfFOZviOJet3Oy/xmGk2gZH677CJM9EvtfdSkgWcATZhj/55JZ0rmy3myCT5lsA==",
+      "version": "4.2.0",
+      "resolved": "https://registry.npmjs.org/js-yaml/-/js-yaml-4.2.0.tgz",
+      "integrity": "sha512-ePWsvanv0DWuDRsW8dnt+R4jQ31SCRCQ7hhNcPXZPsoBZiemuZNYGf7adZdqX2D86j6rvKp3RpCxVTSb8WQlOw==",
       "dev": true,
+      "funding": [
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/puzrin"
+        },
+        {
+          "type": "github",
+          "url": "https://github.com/sponsors/nodeca"
+        }
+      ],
       "license": "MIT",
       "dependencies": {
         "argparse": "^2.0.1"
diff --git a/frontend/package.json b/frontend/package.json
index 84fd6819..c887fe23 100644
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -61,5 +61,8 @@
     "typescript-eslint": "^8.61.0",
     "vite": "^8.0.16",
     "vitest": "^4.1.9"
+  },
+  "overrides": {
+    "js-yaml": "^4.2.0"
   }
 }