Skip to content

fix(security): pre-release hardening batch — 8 SDD-specced fixes#584

Merged
remyluslosius merged 11 commits into
mainfrom
feat/security-hardening
Jun 17, 2026
Merged

fix(security): pre-release hardening batch — 8 SDD-specced fixes#584
remyluslosius merged 11 commits into
mainfrom
feat/security-hardening

Conversation

@remyluslosius

@remyluslosius remyluslosius commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Pre-release security hardening — 8 SDD-specced fixes

A complete, six-dimension security review (auth/session, authZ/RBAC,
crypto/secrets, injection/SSH/SSRF, web/HTTP/audit, supply-chain) was run
against the Go backend, the React frontend, and packaging/CI. Every
High/Critical finding was re-verified by hand against the source. All eight
actionable findings are fixed here under spec→test→code discipline (one commit
per finding); the full review is in
docs/PRE_RELEASE_SECURITY_REVIEW_2026-06-16.md.

Fixes (each: spec bumped, test with @spec/@ac tokens, then code)

# Sev Finding Spec
1 High GET /api/v1/audit/events had zero authorization (anonymous audit-trail disclosure) → require audit:read api-audit-events-query C-06/AC-11
2 High API-token privilege escalation (token:write holder could mint an admin bearer) → auth.RoleGrantsWithin subset check api-tokens C-03/AC-05
3 Med-High roles:assign had no subset guard → same RoleGrantsWithin check api-users C-05/AC-13
4 High No security headers (HSTS/CSP/nosniff/frame-deny) on a one-origin SPA+API → securityHeaders middleware system-http-server C-12/AC-17
5 High Breach-password check dead (every NewService got a nil corpus) → always-on embedded DefaultBreachCorpus (airgap-safe) + HIBP file override system-auth-identity C-15/AC-27
6 High No CSRF enforcement (frontend double-submit was theater) → csrfProtect constant-time double-submit, gated on the session cookie system-http-server C-14/AC-19
7 High No login rate-limiting → per-IP sliding-window limiter on /auth/login + /auth/mfa:verify (429 + Retry-After) system-http-server C-13/AC-18
8 Med-High SSH host-key in-memory per-process TOFU (MITM on first scan after each restart) → PostgreSQL-backed KnownHostsStore (migration 0036) at all 4 dial sites system-ssh-connectivity C-13/AC-22

Items 6/7/8 were the three release blockers; they are now closed.

Verification

  • gofmt clean · go vet ./... clean · go build ./... OK
  • specter check — all specs pass
  • Full internal/server suite green (CSRF + rate-limit middlewares exercised end-to-end); new internal/knownhosts store test passes against isolated PG

Notes for review

  • CSRF is gated on ambient authority: Bearer/token requests, /api/v1/auth/*, and cookieless requests are exempt — so the API-token path is unaffected.
  • Rate-limit keys on RemoteAddr (not X-Forwarded-For) so an attacker cannot rotate the bucket.
  • The breach corpus is embedded (go:embed), keeping airgap builds self-contained; operators can point OPENWATCH_BREACH_CORPUS_FILE at a full HIBP list.
  • Open medium/low findings (revocable JWTs, demo-cert handling, body-size limits, etc.) are catalogued in the review doc as fast-follow, not blockers.

🤖 Generated with Claude Code

Pre-release security review finding (HIGH, anonymous-exploitable):
GetAuditEvents had ZERO authorization — any unauthenticated caller could
read the entire audit trail (actor ids, IPs, resource ids, action detail)
with a single GET. The endpoint forgot the auth.EnforcePermission call
every sibling data handler makes, and no spec AC required it, so it shipped.

SDD fix:
- spec api-audit-events-query v1.1.0: add C-06 (MUST require audit:read;
  anonymous/unauthorized -> 403, no events) + AC-11 (deny-path AC).
- handler: EnforcePermission(AuditRead) as the first statement.
- tests: existing AC-01..10 now authenticate as auditor (they relied on the
  endpoint being open); new AC-11 asserts anonymous -> 403 with no list
  leaked, auditor -> 200.
…hin caller's grant)

Pre-release finding (HIGH): POST /tokens gated only on token:write and
passed role_id straight through, so a security_admin (holds token:write,
not the admin grant) could mint a role_id=admin bearer token and act as
full admin. No subset check existed anywhere.

SDD fix:
- spec api-tokens v1.1.0: C-03 (requested role's permissions MUST be a
  subset of the caller's; else 403, no token) + AC-05.
- internal/auth.RoleGrantsWithin(caller, requested): reusable anti-
  escalation primitive (also applies to role assignment + custom roles).
- PostAPIToken: 403 authz.role_exceeds_grant when the requested role
  exceeds the caller's permissions.
- test AC-05: security_admin→admin denied 403; admin→security_admin 201.
Pre-release finding (Medium/High): POST /users/{id}/roles:assign gated on
role:assign but never checked that the assigned role was within the
granter's own permissions. Only admin holds role:assign today (and admin
covers everything), so this is defense-in-depth now — but it becomes a live
escalation path the moment any lower/custom role is granted role:assign.

SDD fix:
- spec api-users v1.1.0: C-05 + AC-13 (assigned role's permissions MUST be
  a subset of the caller's; else 403, no row).
- handler: reuse auth.RoleGrantsWithin; 403 on exceeds-grant.
- AC-13 test asserts the helper denies security_admin->admin and allows
  admin->security_admin (within-grant 204 covered by AC-08).
…me-deny)

Pre-release finding (HIGH): the server set NO security headers. On an
origin serving both the SPA and the API with no edge proxy, that means
clickjacking (no X-Frame-Options/frame-ancestors), SSL-strip on first
contact (no HSTS), MIME-sniffing (no nosniff), and no CSP defense-in-depth.

SDD fix:
- spec system-http-server v1.1.0: C-12 + AC-17 (HSTS, CSP w/ frame-ancestors
  'none' + default-src 'self', nosniff, X-Frame-Options DENY,
  Referrer-Policy no-referrer; /docs gets a Swagger-compatible CSP that
  still denies framing). C-11/AC-15/16 left for the open static-delivery PR.
- securityHeaders middleware mounted right after correlation.
- test AC-17 asserts all five headers + the /docs CSP relaxation.

NOTE: the app CSP is SPA-safe in theory (Vite external bundles + MUI inline
styles) but should be smoke-tested against the running SPA and /docs before
release; it's a single tunable const.
…l corpus)

Pre-release finding (HIGH): every production users.NewService(pool, nil)
passed a nil breach corpus, and ValidatePassword skips the breach check when
the corpus is nil — so the HaveIBeenPwned apparatus was dead code in prod.
Users/admins could set known-compromised passwords (password123, etc.).

SDD fix:
- spec system-auth-identity v1.3.0: C-15 (prod MUST validate against a
  non-nil corpus) + AC-27.
- identity.DefaultBreachCorpus(): an always-on embedded baseline of the most
  common compromised passwords (internal/identity/common_passwords.txt, 129
  entries, SHA-1'd at load). Airgap-safe — no external file needed; operators
  can extend via a full HIBP dump (OPENWATCH_BREACH_CORPUS_FILE).
- wired all 3 prod sites (server.New, newHandlers, admin-create CLI) to
  DefaultBreachCorpus() instead of nil.
- test AC-27 rejects password123/qwerty123/Password1, accepts a strong pw.
- .gitignore: negate the blanket *password* rule for the embedded corpus
  (it's a public list, not a secret, and go:embed requires it tracked).
Follow-up to the audit:read gate (finding #1): api_echo_test and
api_signoff_test also queried GET /audit/events via the unauthenticated
doGet helper and now correctly 403. Convert them to authenticate as
auditor (a role holding audit:read).
…ict)

Consolidated, risk-ranked review behind PR #584's fixes: the 5 fixed
findings, the 3 open release blockers (CSRF, login rate-limiting, SSH
host-key persistence), the medium/low backlog, and the verified strengths.
Names per the *_SECURITY_REVIEW_*.md convention (gitignore-allowlisted).
Pre-release blocker (HIGH): POST /auth/login and /auth/mfa:verify had no
throttle, no lockout — unlimited online password/OTP guessing and
credential stuffing (flagged by both the auth and web audits).

SDD fix:
- spec system-http-server v1.2.0: C-13 + AC-18 (per-IP rate limit on the
  auth endpoints; 429 + Retry-After over the limit; IP from RemoteAddr, not
  a client-supplied forwarded header).
- dependency-free sliding-window rateLimiter + rateLimitAuth middleware
  (20/min/IP), mounted after securityHeaders. Per-server-instance so it's
  test-isolated.
- tests: limiter allows-to-limit/denies-limit+1/keys-independent, and an
  end-to-end POST /auth/login returns 429 after the budget.

Follow-up: per-username lockout + trusted-proxy client-IP extraction (the
X-Forwarded-For trust finding) are deferred.
Pre-release blocker (HIGH): state-changing endpoints authenticate via the
openwatch_session cookie, and the frontend advertised a double-submit CSRF
scheme — but NO server code set an XSRF-TOKEN cookie or validated
X-CSRF-Token. Protection reduced to SameSite=Lax; the client-side half was
theater.

SDD fix:
- spec system-http-server v1.3.0: C-14 + AC-19 (double-submit; issue
  non-HttpOnly XSRF-TOKEN at login/refresh; require matching X-CSRF-Token on
  unsafe cookie-auth requests; exempt Bearer/token, /api/v1/auth/*, and
  requests with no session cookie).
- csrfProtect middleware (constant-time compare) mounted after the rate
  limiter; setCSRFCookie at login + refresh-cookie.
- the frontend already echoes XSRF-TOKEN -> X-CSRF-Token (client.ts), so no
  frontend change was needed — only the server half.
- tests: AC-19 deny (no/mismatched token -> 403 authz.csrf_invalid) + proceed
  with a matching token + no-session-cookie exemption. doReq now simulates
  the browser double-submit centrally so the suite stays green; the deny
  test bypasses it deliberately.
Pre-release blocker (Med-High): every production dial used ModeTOFU +
NewMemoryStore(), so host keys were never persisted — TOFU re-trusted every
host on each daemon restart. A network-positioned attacker could MITM the
first scan after every restart and harvest the credentials presented to
each host.

SDD fix:
- spec system-ssh-connectivity v1.3.0: C-13 (production TOFU MUST be durable
  across restarts) + AC-22.
- migration 0036: ssh_known_hosts table (hostname PK, public_key, first/last
  seen).
- internal/knownhosts.Store: a pgx-backed ssh.KnownHostsStore (kept out of
  internal/ssh so the dial layer stays DB-free). Get degrades to not-found
  on a DB blip (first-seen TOFU), never fails closed; a CHANGED key is still
  rejected by the dial layer (ErrHostKeyMismatch).
- wired at all 4 production dial sites (scan worker + serve, discovery x2)
  in place of NewMemoryStore().
- test AC-22: Put/Get round-trip, absent→not-found, and a fresh Store on the
  same DB sees the key (durable TOFU across 'restart').

Follow-up (documented in the review): an optional ModeStrict deployment
path (pre-provisioned keys) and folding internal/sshprivilege's
InsecureIgnoreHostKey probe onto this same store.
@remyluslosius remyluslosius changed the title fix(security): pre-release hardening batch — 5 SDD-specced fixes fix(security): pre-release hardening batch — 8 SDD-specced fixes Jun 17, 2026
@remyluslosius remyluslosius merged commit 1f6e72d into main Jun 17, 2026
19 checks passed
remyluslosius added a commit that referenced this pull request Jun 17, 2026
…ides, future-proof upgrade test (#585)

* docs(changelog): spec a human-readable changelog + populate [Unreleased]

Add specs/release/changelog.spec.yaml (release-changelog): every entry is a
user-facing sentence (>=5 words, no commit-subject prefix), Keep a Changelog
categories, dated version sections, no emoji. Enforced by
packaging/tests/changelog_test.go (source-inspection, AC-01..05), scoped to
the actively-edited [Unreleased] section so history is grandfathered.

Populate [Unreleased] with the post-rc.7 work in that style: Settings
activation (users, notifications, security/SSO), per-host SSH auth/sudo
learning, one-command safe upgrade, airgap + fresh-install fixes, and the
pre-release security-hardening batch (#584).

* docs: refresh hardening + install guides; de-hardcode upgrade test

- SECURITY_HARDENING.md: drop the 'not yet implemented' rate-limit/headers
  callout (shipped in #584); document auth rate limiting, CSRF, and security
  headers as live; note always-on breach screening; add an outbound-SSH
  persistent known-hosts subsection; version line to 0.2.0 rc series.
- install_guide.md: fix the stale DEB version example (rc.5 -> rc.7) and add an
  'Upgrading' section documenting the one-command auto-migrate + backup +
  fail-safe path and /etc/openwatch/upgrade.conf.
- upgrade-container-test.sh / run-upgrade-container-test.sh: stop hardcoding
  migration 34->35 and the host_connection_profile table. The host driver now
  derives the head migration's goose version_id and its -- +goose Down SQL and
  passes them in; the test reverses exactly the head migration and asserts a
  generic prior->head advance, so it survives every future migration.

(CLAUDE.md was also refreshed locally for AI sessions but is intentionally
gitignored per commit 7a73353, so it is not part of this commit.)

* test(packaging): pin upgrade-test driver to the host RPM arch

dist/ can hold a leftover cross-built arm64 RPM (the packaging Go suite
cross-builds it for AC-14), and the rockylinux:9 container runs the host
platform, so the previous `cp dist/openwatch-*-1.*.rpm` swept in both arches
and `rpm -i` collided on /usr/bin/openwatch. Glob a single host-arch RPM.

Validated end-to-end: the container test now passes — real `rpm -U` runs the
%post helper ($1=2), which stops the service, takes a pg_dump restore point,
applies the head migration (35 -> 36), and restarts.
@remyluslosius remyluslosius deleted the feat/security-hardening branch June 17, 2026 03:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant