Skip to content

Latest commit

 

History

History
274 lines (194 loc) · 21.2 KB

File metadata and controls

274 lines (194 loc) · 21.2 KB

Item 8 Design-Brief Revision — Compliance-Audit Detector Extensions

Partnership-specific artifact. This document is from the reference deployment and uses its names. The architecture is generic; your instance will have different names. Preserved verbatim for concreteness.

Status: v2, post-review. Approve-with-refinements from fresh-Claude round-1; ten refinements folded in. Implementation proceeds after v2 approval.

Revision history:

  • v1 (2026-04-24 morning): initial brief, six review questions, seven detectors.
  • v2 (2026-04-24 same session): fresh-Claude round-1 refinements folded in — (1b-new) switched from range-check to distribution-shape (σ<5 AND mean<30); (1c) renamed + respec'd as ack-latency-from-fire not inter-ack spacing; (2d) baseline source changed from "first 60 days" to externally-audited sessions; (3a) split RUDDER_INFRASTRUCTURE_FAILURE from COMPLIANCE_DRIFT_HIGH as different event classes; "active session" defined operationally; (4) circularity addressed in 1a/1b-new baselines; (5) boundary tests added to pre-regs + baseline dependency between 2a/2b named; (6) scope revised to ~700-800 lines; (7) composite-score §8 dropped as speculative; multi-window simplified to 1day/1week; (new) §5.5 cross-detector aggregation added; (new) false-negative check added to every pre-reg.

Why this brief exists (and not just the v2 spec)

Item 8's original spec was written before Items 6 and 7 shipped. Two concrete things have changed:

  1. Item 7's length gate (>=20 chars) dominates the existing short_evidence detector (fires on <15 chars). In production, no ack can ever be under 15 chars — the detector is dead code. It either retires or re-aims at a new minimum-compliant-theater shape.
  2. Item 6 introduces FIRED and ALLOW event streams with a committed-to healthy range of [0.01, 0.15] for block/allow ratio. That range is Item 8's initial calibration target for a new detector.

This brief does the inventory check fresh-Claude asked for: every existing detector, every v2-spec detector, stated in (a)/(b)/(c)/(d) shape: what it measures, its source-of-signal, what Items 6/7 changed, whether still meaningful or needs re-aiming.


1. Existing detector inventory — post-Items-6/7 audit

1a. rudder_ack_position_zero_cluster (HIGH)

  • What it measures: fraction of rudder-acks with |position| < 0.05
  • Source of signal: compass_observation rows tagged rudder-ack
  • What Items 6/7 changed: Item 7 constrains evidence (length, entropy, similarity); Item 6 constrains fire-ID binding. Neither constrains position. A gamer can still file a bound, substantive ack at position=0.0.
  • Status: KEEP. Still meaningful post-Items-6/7; remains load-bearing.

1b. rudder_ack_short_evidence (MEDIUM)

  • What it measures: fraction of rudder-ack evidence strings under 15 chars
  • Source of signal: same as 1a
  • What Items 6/7 changed: Item 7's 20-char length gate blocks any ack <20 chars at write-time. <15 chars >60% is impossible in production. DOMINATED.
  • Status: RETIRE + REPLACE. Retire the existing detector. Add rudder_ack_length_clustering_at_floor. v1 proposed a range-check (20-30 >60%), which fresh-Claude flagged as a ceiling trap — it false-positives on legitimately-brief acks like "scope bounded to 3 agents" (24 chars, substantive). v2 switches to distribution-shape detection: fires when either (a) σ(length) < 5 AND mean(length) < 30 (clustering tight at the floor) OR (b) >60% of acks land within 5 chars of the gate floor, i.e. in [20, 25]. The signal is "just clear the gate" uniformity, not "length in a range."

1c. rudder_ack_rapid_clear_reflex (MEDIUM, renamed from bursty_filings)

  • What it measures: median ack-latency-from-fire — time between FIRED event and the ack that consumes its fire_id. Fires when the median latency across acks is <30s AND the distribution is tight.
  • Source of signal: COMPASS_RUDDER_FIRED event timestamps joined to rudder_ack_consumption.consumed_at.
  • What Items 6/7 changed: Item 6's one-shot consumption means repeat acks on SAME fire_id fail; Item 6's structural binding gives us the ack-to-fire timestamp pairing this detector needs.
  • Status: RECALIBRATE + RENAME. v1 measured inter-ack spacing (after Item 6 partially dominated). Fresh-Claude flagged that ack-latency-from-fire is the more interesting signal — it catches reflex-clearing without reflection. Name makes it explicit: "rapid-clear reflex" measures fire-to-ack latency, not spacing between acks.

2. v2-spec detectors (from original design brief v2 §Item 8)

2a. Variance-collapse detector

  • What it measures: lexical variance of decide / learn / ack evidence content over a rolling window. Fires when the same vocabulary / sentence shape / length distribution repeats across operators.
  • Source of signal: text fields on decisions (decision_journal), lessons (knowledge_store), and rudder-ack evidence (compass_observation).
  • What Items 6/7 changed: Item 7 already variance-checks ack evidence against prior acks on the same spectrum (TF-IDF cosine). That's one-vs-history. Item 8's variance-collapse is many-vs-many across operators — fires when the overall distribution loses entropy, not when a single ack is too similar to one prior.
  • Status: KEEP as specified. Not dominated by Item 7 — different scope.

2b. Content-entropy detector

  • What it measures: Shannon entropy of operator content (joined across recent operators) trending downward over time.
  • Source of signal: same as 2a
  • What Items 6/7 changed: Item 7's entropy floor rejects individual acks below 2.5 bits. Item 8's content-entropy measures trend across the window — a gradual drift downward even if every individual item passes Item 7.
  • Status: KEEP as specified.

2c. Multi-window cross-check (simplified)

  • What it measures: meta-detector — runs each primary detector across two windows (1day, 1week) and elevates severity if a pattern fires in both.
  • Source of signal: the other detectors' outputs.
  • What Items 6/7 changed: no direct change.
  • Status: KEEP as specified, SIMPLIFIED per fresh-Claude. v1 had 1hr/1day/1week = three windows × seven detectors = up to 21 independent fires per audit cycle (noise). v2 drops 1hr: start with 1day vs. 1week (trend without short-window jitter); add 1hr back only if the simpler version misses real drift. "Less is more" so the meta-detector doesn't teach agents to ignore it.

2d. decide/learn distribution skew

  • What it measures: ratio of decide filings to learn filings over the window. Normal band [1:3, 3:1]. Skew outside that band (<1:5 = hoarding; >5:1 = thinking-substitution) triggers.
  • Source of signal: decision_journal and knowledge_store counts.
  • What Items 6/7 changed: no direct change — decide/learn operators are unaffected by rudder mechanics.
  • Status: KEEP as specified. Concrete band [1:5, 5:1] from v2 brief stands; real-use data adjusts.

3. New detector from Item 6

3a. rudder_block_allow_ratio

  • What it measures: count(COMPASS_RUDDER_FIRED) / (count(FIRED) + count(ALLOW)) over the window.
  • Source of signal: ledger events from Item 6's emission path.
  • Two-sided bound (per Item 6 brief v2.1 §8 commit):
    • Too low (rate ≈ 0 over multi-hour session): either no drift ever (unlikely in active work) OR the rudder is silently broken (import error, DB unreachable, gate bypass). Suspicious.
    • Too high (rate > 0.30): either chronic drift (calibration target missed) OR threshold is set wrong for this session's work shape. Suspicious.
  • Healthy range per Item 6: [0.01, 0.15].
  • Status: NEW. Core new detector surface from Item 6.

Severity structure:

  • Rate in [0.01, 0.15]: no anomaly
  • Rate in [0.0, 0.01) OR (0.15, 0.30]: LOW anomaly (report entry)
  • Rate = 0 with FIRED==0 over an active session (defined: >=10 gated-tool invocations in the window — per fresh-Claude Q3 operational definition, prevents false-positive on genuinely quiet sessions): MEDIUM (report + briefing)
  • Rate > 0.30 sustained: MEDIUM (report + briefing) — chronic drift or mis-calibrated threshold

Infrastructure failure class (separate from drift, per fresh-Claude Q3):

  • Rate = 0 with ALLOW == 0 across an active session: the rudder itself is not running. Emits RUDDER_INFRASTRUCTURE_FAILURE event (NOT COMPLIANCE_DRIFT_HIGH). Different semantics: drift = "agent is gaming the rudder"; gate-dead = "the rudder doesn't exist." Different investigation path (check hook installation, DB connectivity, import path), different escalation (halt gated work vs. investigate operator shape). Conflating them makes forensics harder.

4. Baseline calibration source (per detector)

Each detector needs a named baseline source; "historical data" is circular if the agent has drifted during that history.

Detector Baseline source Rationale
1a position-zero 30 days post-Item-7 merge (threshold retested from 0.60) Revised v2 (fresh-Claude Q4): the 0.60 threshold was empirically validated under pre-Items-6/7 gameability. Item 7 may have pushed gamers from short_evidence to position-zero (closed the easier attack). Retest 0.60 against 30 days post-Item-7 data before shipping.
1b (retired) N/A — retired
1b-new length-floor clustering Externally-audited post-Item-7 sessions (sessions where an external actor confirmed operator mix was healthy) Revised v2 (fresh-Claude Q4): v1 said "first 30 days post-Item-7 merge, acks that cleared all gates." Circular — if those acks include theater, the baseline includes theater. Use only audited-clean sessions.
1c rapid-clear reflex Synthetic adversarial (3 fires <60s, 3 acks <30s post-fire) + externally-audited live sessions Pattern is concrete; synthetic establishes positive-detection boundary
2a variance-collapse Sessions that passed externally-audited rounds (clean baseline) Can't use historical broadly — agent may have been drifting
2b content-entropy Different audited rounds than 2a (per fresh-Claude Q5 baseline-independence) Using the same audited rounds for both creates false-positive correlation — one tainted tag falsifies both simultaneously. Named as a dependency to split.
2c multi-window Meta — no standalone baseline Inherits from primary detectors
2d decide/learn skew Externally-audited sessions, NOT "first 60 days of project" Revised v2 (fresh-Claude Q2 pushback): "first 60 days" is circular if Aether was drifting during those 60 days. Clean baseline = sessions where a reviewer confirmed operator mix was healthy. Every audit round produces such sessions; that's cleaner than time-based cutoff.
3a block/allow ratio Item 6 brief commit: [0.01, 0.15] starting band; 30 days post-merge adjusts Range pre-registered; live data empirically tunes

Audited-clean session tagging (per fresh-Claude Q4 scalability concern): baselines that cite "externally-audited sessions" require an operational way to identify those sessions. Add a session_cleanliness table with (session_id, tagged_clean_at, tagging_round_id, notes) — rows written when an external-audit round concludes with no drift findings on the session. Otherwise "which sessions are clean-baseline" becomes "Andrew remembers," which doesn't scale past a few rounds. Tagging happens in watchmen/router.py as a new route for clean-signal audit rounds.


5. Per-detector response-path matrix

Each detector specifies what LOW / MEDIUM / HIGH does. Signal-without-action is detector theater.

Detector LOW MEDIUM HIGH
1a position-zero Report Report + briefing Report + briefing + COMPLIANCE_DRIFT_HIGH event
1b-new length-floor clustering Report Report + briefing Report + briefing + event
1c rapid-clear reflex Report Report + briefing Report + briefing + event
2a variance-collapse Report Report + briefing Report + briefing + event
2b content-entropy Report Report + briefing Report + briefing + event
2c multi-window N/A Elevates 2+ window matches to its own severity 3/3 window matches add briefing + event
2d decide/learn skew Report Report + briefing Report + briefing + event
3a block/allow ratio Report Report + briefing Report + briefing + COMPLIANCE_DRIFT_HIGH event (gate-dead case)

5.5. Cross-detector aggregation (new in v2, per fresh-Claude)

Multiple detectors firing on the same window means different things from multiple detectors firing on different windows. The compliance report groups findings two ways:

  1. By window, then by detector: "Between 10:00 and 11:00, 3 detectors fired (position-zero, length-floor, rapid-clear)." Reader sees the session-shape at a glance.
  2. By detector family, then by window: "position-zero fired in 4 of the last 7 daily windows." Reader sees which pattern is chronic vs. transient.

Both views land in the report. HIGH-severity fires within the same window get a dedicated "concurrent HIGH" section at the top of the report — if three HIGH-severity detectors fire simultaneously, that's a distinct signal from "three different windows each had one HIGH."

6. Feature flags — cross-cutting rollback

Each new or recalibrated detector gets DIVINEOS_DETECTOR_<NAME>=off. Default on. Per-detector so one false-positive detector doesn't force disabling the whole bundle. Flag state logged at session start; falsifier language below includes "feature-flag disabled for >7 days" as an escalation path.

Applies to: 1b-new, 1c-recalibrated, 2a, 2b, 2c, 2d, 3a. (1a's threshold is unchanged; no flag needed yet.)


7. Pre-regs — one per detector (distinct falsifiers)

7.1b-new — length-floor clustering

  • claim: post-Item-7, tight clustering at the 20-char gate floor (σ<5 AND mean<30, OR >60% within [20,25]) is the new performative-theater shape
  • success (true-positive): 30 synthetic acks all in [20,25] → fires; 30 synthetic acks σ=12 mean=45 → does not
  • success (boundary): 30 synthetic acks where exactly 60% are in [20,25] (threshold boundary) → fires at threshold; at 59% → does not. Added v2 per fresh-Claude Q5 — v1's "all in range" criterion was too easy (positive-by-construction); boundary test is the harder check.
  • false-negative check (new v2): detector must fire on at least one known-drifting synthetic case per month. If monthly fire count is 0 across 3 consecutive months, audit surface flags.
  • falsifier: >20% FPR on externally-audited-clean post-Item-7 sessions; OR feature-flag disabled for >7 days
  • review: +60 days

7.1c-renamed — rapid-clear reflex (ack-latency-from-fire)

  • claim: median ack-latency-from-fire <30s (tight distribution) is reflex-clearing without reflection
  • success: 3 fires each followed by ack in <20s → fires; same fires acked >2min post-fire → does not
  • false-negative check (new v2): at least one synthetic rapid-clear case per month fires the detector. 3 months zero → audit flag.
  • falsifier: >15% FPR on externally-audited multi-spectrum drift sessions
  • review: +60 days

7.2a — variance-collapse

  • claim: lexical variance across decide/learn/ack content over window trending down is detectable drift
  • success: 20 decides with same rephrased stem → fires; 20 decides with genuine topical variety → does not
  • false-negative check (new v2): fire on at least one synthetic copy-paste case/month; 3 months zero → audit flag.
  • falsifier: >20% FPR on focused-work sessions from clean-baseline (audited rounds). Note (v2, fresh-Claude Q5): 2a and 2b share the clean-baseline source but must use DIFFERENT audited rounds — one tainted tag shouldn't falsify both simultaneously. Dependency named.
  • review: +60 days

7.2b — content-entropy trend

  • claim: Shannon entropy over joined recent operator content trending downward >2σ is detectable drift
  • success: increasing copy-paste ratio → downward trend; natural operator mix → stable
  • false-negative check (new v2): fire on synthetic downward-trend case/month; 3 months zero → audit flag.
  • falsifier: >20% FPR on different audited-clean sessions than 2a uses (see 2a note)
  • review: +60 days

7.2c — multi-window cross-check (simplified to 1day/1week)

  • claim: same pattern firing in both 1day and 1week windows is stronger signal than single-window
  • success: paired synthetic datasets (short-burst drift fires only 1day; sustained drift fires both)
  • false-negative check (new v2): fire on at least one synthetic sustained-drift case/month; 3 months zero → audit flag.
  • falsifier: zero 2-of-2 matches over 60 days of clean baseline
  • review: +60 days

7.2d — decide/learn skew

  • claim: decide:learn ratio outside [1:5, 5:1] over the window is drift
  • success: 20 decides with 0 learns → fires skew-high; 20 learns with 0 decides → fires skew-low; 10+10 → does not
  • false-negative check (new v2): fire on one synthetic extreme-skew case/month; 3 months zero → audit flag.
  • falsifier: externally-audited-clean sessions produce ratios outside [1:5, 5:1] in >5% of windows (revised per §4 baseline change)
  • review: +60 days

7.3a — rudder block/allow ratio

  • claim: ratio outside [0.01, 0.15] signals drift; rate=0 with ALLOW==0 signals infrastructure failure (different class)
  • success:
    • 100 ALLOW, 0 FIRED over active session (>=10 gated-tool calls) → fires LOW
    • 0 ALLOW, 0 FIRED over active session → fires RUDDER_INFRASTRUCTURE_FAILURE (NOT drift event)
    • 30% block rate sustained → fires MEDIUM
    • Clean sessions in [0.01, 0.15] → no fires
  • false-negative check (new v2): at least one synthetic out-of-band case/month fires detector; 3 months zero → audit flag.
  • falsifier: >15% FPR on 30 days live data (which itself tunes the band)
  • review: +45 days (shorter — primary new surface)

8. [DROPPED] Composite-score v2 trigger

v1 had a "composite score if >50% of triggers co-occur" forward-look. Fresh-Claude flagged this as speculative: the trigger condition can't be measured until 7 detectors have fired enough to produce correlation data (months), and having it in this brief creates false sense that v2 is planned when it's really a separate future design question. Dropped from v2. If real-use shows correlated detector fires, a separate future brief addresses composite scoring then.


9. Scope estimate (revised v2)

v1 said ~500 lines. Fresh-Claude Q6 pushed back that this is light. Honest revised count:

  • compliance_audit.py: ~275-350 lines
    • 7 detector functions × ~25 lines = 175
    • Severity dispatch + response path wiring: ~40
    • Feature flag handling: ~20
    • Cross-detector aggregation (§5.5): ~40
    • False-negative monthly-fire check: ~15
  • New event types: COMPLIANCE_DRIFT_HIGH + RUDDER_INFRASTRUCTURE_FAILURE: ~30 lines (definitions + validators)
  • Watchmen clean-session routing (for §4 baseline tagging): ~40 lines
  • Tests: ~400-500 lines
    • 7 detectors × ~5 tests each = ~35 tests
    • Multi-window meta-detector × ~10 tests (cross-window logic is the complex bit)
    • Boundary tests per 7.x success criteria
    • Infrastructure-failure separation tests
  • ledger_compressor.py: 0 lines — ALLOW already in _COMPRESSIBLE_TYPES from Item 6
  • scripts/guardrail_files.txt: +0 lines — compliance_audit.py already guardrailed; ledger_compressor.py follow-up handled by claim df5b3113

Total: ~750-900 lines across ~4 files. Touches guardrailed compliance_audit.py — External-Review trailer required. Expect 2-3 diff-review rounds given scope.


10. Process correction — applied from Item 6/7 thrash

  1. Run bash scripts/precommit.sh locally BEFORE requesting review. Tree-hash stable before fresh-Claude sees it.
  2. Review-round notes cite both diff-hash and tree-hash. Tree-hash is the stable binding (decision 359bd373).
  3. Separate policy changes from commit authorizations (claims fe76b8f9, 2819af02).
  4. Implementation review round-1 is on the diff — same as Items 4/6/7.

v2 review status

All v1 questions answered and folded in. Cross-cutting refinements from fresh-Claude round-1 all addressed:

  • Q1: 1b-new switched to distribution-shape (σ<5 AND mean<30 OR [20,25] clustering); 1c renamed + respec'd as ack-latency-from-fire
  • Q2: 2d baseline changed from "first 60 days" to externally-audited sessions (circular-logic fix)
  • Q3: RUDDER_INFRASTRUCTURE_FAILURE split from COMPLIANCE_DRIFT_HIGH; "active session" defined operationally (>=10 gated-tool calls)
  • Q4: 1a threshold retest named; 1b-new baseline uses audited-clean not "cleared all gates"; clean-session tagging infrastructure spec'd
  • Q5: 2a/2b baseline dependency named (different audited rounds required); 7.1b-new adds boundary tests at 60%/59%
  • Q6: scope revised to ~750-900 lines with itemized breakdown
  • Q7: composite-score §8 dropped as speculative; multi-window simplified to 1day/1week
  • New §5.5 cross-detector aggregation (both "by window × detector" and "by detector × window" views)
  • New: false-negative monthly-fire check on every pre-reg (asymmetric-falsifier gap named)

After v2 approval, implementation proceeds with precommit-first discipline. Per v2 §9 scope, expect 2-3 diff-review rounds — Item 8 is the biggest of the bundle.