AI-native-Systems-Research
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 20 additions & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 48 additions & 0 deletions
diff --git a/docs/architecture.md b/docs/architecture.md
Lines changed: 74 additions & 0 deletions
diff --git a/docs/campaign-authoring-guide.md b/docs/campaign-authoring-guide.md
Lines changed: 144 additions & 0 deletions
@@ -148,9 +148,29 @@ measurement). The validator floor (`validate_evidence`) rejects
 aspirational platitudes regardless of source. See `docs/data-model.md`
 for the schema.
 
+## Spec fidelity (issue #246 / F1, friction-report #245)
+
+Every campaign should declare ``locked_parameters`` (and, when
+applicable, ``locked_workload``) for every knob whose deviation
+would invalidate the experiment. The validator hard-fails any
+bundle whose ``experiment_spec.verified_parameters`` deviates from
+``locked_parameters`` — regardless of ``--auto-approve``. This
+closes the spec-fidelity gap that allowed paper-memorytime-mirage
+iter-1 to silently rewrite four locked workload parameters.
+
+Authoring discipline lives in ``docs/campaign-authoring-guide.md``
+(the "what to lock" inventory + the rehearsal-as-instrument
+worked example). The full friction-report resolution map is in
+``docs/friction-245-resolution.md``.
+
 ## See also
 
 - `docs/contributing/workflow.md` — full workflow doc.
 - `docs/security.md` — permission policy (#135).
 - `docs/architecture.md` — internals.
+- `docs/campaign-authoring-guide.md` — locked_parameters, the
+  "what to lock" inventory, rehearsal-as-instrument (#245
+  resolution).
+- `docs/friction-245-resolution.md` — F1..F21 → file map for
+  paper-memorytime-mirage friction report.
 - `docs/plans/CHECKPOINT.md` — current state of the #120 epic.
@@ -218,6 +218,54 @@ nous run campaign.yaml --auto-approve --max-iterations 1  # quick unattended run
 nous run campaign.yaml --bundle ./fig7_bundle.yaml --auto-approve
 ```
 
+#### `--auto-approve` safety preconditions (#255 / F10)
+
+`--auto-approve` skips the HUMAN_DESIGN_GATE and HUMAN_FINDINGS_GATE,
+which are nous's primary safety mechanisms for catching design-agent
+deviations from campaign intent. **Auto-approve is safe to use only
+when ALL of these hold**:
+
+1. The campaign declares ``locked_parameters`` (#246 / F1) for every
+   campaign-spec-critical knob (model, concurrency, duration, warmup,
+   K-class parameters like ``total_kv_blocks``, anything whose
+   deviation would silently invalidate the experiment). nous hard-fails
+   any bundle whose ``experiment_spec.verified_parameters`` contradicts
+   a locked parameter — regardless of ``--auto-approve``.
+2. If the campaign has a canonical workload, it declares
+   ``locked_workload`` (#265 / F20). The validator diffs
+   ``bundle.inputs/*.yaml`` against it. Deliberate deviations require
+   ``bundle.workload_changes_from_canonical`` to be populated.
+3. The target repo's docs do not contain example values that
+   contradict the campaign's locked spec (#247 / F2). When they do,
+   the methodology prompt's "campaign > target-repo-docs" hierarchy
+   covers it — but a stale methodology prompt would not. If you're
+   running a target whose docs heavily contradict the campaign,
+   verify the methodology prompt has the hierarchy clause before
+   trusting auto-approve.
+4. The campaign's apparatus checks are robust to design-agent
+   variation, and validate ATTRIBUTION (not just upstream totals,
+   #252 / F7).
+5. A stale ``principles.json`` ledger is acceptable. Auto-approve
+   never gates on it.
+
+**If any of these fail**, either run interactively (no
+``--auto-approve``) so a human reviewer sees the design at the gate,
+or invoke an external watchdog process to compare bundles against
+your campaign spec.
+
+Even under ``--auto-approve``, every design gate writes
+``gate_summary_design.json`` with a deterministic
+``campaign_spec_diff`` block (#249 / F4). For a watchdog-style audit:
+
+```bash
+jq '.campaign_spec_diff' "$NOUS_CAMPAIGN_PARENT"/<run>/runs/iter-*/gate_summary_design.json
+```
+
+Non-empty ``locked_parameters_violations`` means F1's hard-fail
+fired (the iteration won't have proceeded). ``depth_overrides_present``
+or ``workload_changes_from_canonical_declared`` flag deliberate
+deviations the design agent declared.
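The same audit can be scripted where `jq` is not available. The following is a minimal sketch, not part of nous: the helper name `find_spec_violations` is hypothetical, and it assumes `locked_parameters_violations` is a JSON list or object that is empty when no violation fired.

```python
import json
from pathlib import Path


def find_spec_violations(run_dir):
    """Map each design-gate summary path to its locked-parameter violations.

    Only summaries with a non-empty `locked_parameters_violations`
    entry are returned.
    """
    flagged = {}
    pattern = "runs/iter-*/gate_summary_design.json"
    for path in sorted(Path(run_dir).glob(pattern)):
        summary = json.loads(path.read_text())
        diff = summary.get("campaign_spec_diff", {})
        # Assumption: an empty list/dict (or a missing key) means "clean".
        violations = diff.get("locked_parameters_violations")
        if violations:
            flagged[str(path)] = violations
    return flagged
```

Calling `find_spec_violations(run_dir)` on a single run directory returns an empty dict when every design gate was clean, so a watchdog can exit non-zero whenever the result is non-empty.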
+
 ### Overnight / long-running campaigns
 
 For unattended runs, increase retries and timeout so transient failures don't kill the campaign:
 
@@ -240,6 +240,80 @@ Before each human gate, a formatted summary (`gate_summary_*.json`) is produced.
 
 Gates display the summary first, then the raw artifact (for those who want full detail).
 
+**Spec-fidelity diff (#249 / F4).** For the design-phase summary,
+``_augment_summary_with_spec_diff`` (in `orchestrator/iteration.py`)
+post-processes the LLM-generated summary to attach a deterministic
+``campaign_spec_diff`` block: locked_parameters violations, depth_overrides
+presence, and declared workload changes. It is always emitted, regardless of
+``--auto-approve``. ``nous status`` surfaces it in human-readable
+form.
+
+### Spec fidelity (`orchestrator/validate.py`)
+
+Two pure-Python validators close the gap between *self-consistency*
+(the executor matches the bundle) and *spec-fidelity* (the bundle
+matches the campaign):
+
+* `_validate_locked_parameters` (#246 / F1) — every entry in
+  ``campaign.locked_parameters`` must match
+  ``bundle.experiment_spec.verified_parameters`` exactly. Hard-fail
+  regardless of ``--auto-approve``.
+* `_validate_locked_workload` (#265 / F20) — walks the canonical
+  workload structure and diffs against ``bundle.inputs/*.yaml``.
+  Declared deviations (``bundle.workload_changes_from_canonical``)
+  are allowed; undeclared deviations hard-fail.
+
+`compute_campaign_spec_diff` exposes the same logic for read-only
+auditor use (the F4 gate-summary diff). See
+`docs/campaign-authoring-guide.md` for the discipline these enforce.
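As a rough illustration of the exact-match rule, the sketch below compares two plain dictionaries. It is not the code in `orchestrator/validate.py`: the function name, the `MISSING` marker, and the choice to treat an absent key as a violation are all assumptions made for the example.

```python
MISSING = "<missing>"  # hypothetical marker for a key the bundle never verified


def find_locked_violations(locked, verified):
    """Return (name, expected, actual) for every locked value that differs.

    `locked` stands in for campaign.locked_parameters and `verified`
    for bundle.experiment_spec.verified_parameters. Extra keys in
    `verified` are ignored; only locked names are checked.
    """
    violations = []
    for name, expected in sorted(locked.items()):
        actual = verified.get(name, MISSING)
        # Note: Python equality treats 1 and 1.0 as equal; a stricter
        # type-aware comparison would be a separate design choice.
        if actual != expected:
            violations.append((name, expected, actual))
    return violations
```

A caller would hard-fail when the returned list is non-empty, for example `find_locked_violations({"D": 1}, {"D": 8})` reports the rewritten `D`.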
+
+### Reproducibility metadata (`orchestrator/reproducibility.py`)
+
+`capture_reproducibility_metadata` (#262 / F17) runs at INIT and
+records target repo HEAD, dirty flag, hardware-config sha,
+language versions, gpu_memory_utilization, and latency-config file
+paths. The block is persisted in `state.json` (first-capture wins;
+re-running INIT preserves the original) and surfaced via `nous
+status`. Per-iteration `snapshot_iter_files` copies the actual
+hardware/latency config files into `runs/iter-N/snapshots/` so a
+future reviewer can diff exact numbers even after the operator
+edits the source-of-truth file.
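The first-capture-wins behaviour can be pictured with a small sketch. It assumes the state is a plain dictionary and that the block lives under a `reproducibility_metadata` key; the helper name and the `capture` callable are hypothetical, not the signature of `capture_reproducibility_metadata`.

```python
def ensure_reproducibility_metadata(state, capture):
    """Store capture() in state only if no metadata was captured before.

    `capture` is any zero-argument callable returning the metadata
    block. It is not invoked when a block already exists, so the
    original capture survives a re-run of INIT.
    """
    if "reproducibility_metadata" not in state:
        state["reproducibility_metadata"] = capture()
    return state["reproducibility_metadata"]
```

Skipping the second call entirely, rather than capturing and discarding, keeps a later dirty working tree from leaking into the record.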
+
+### Cross-campaign code reuse (`orchestrator/lineage.py`)
+
+`emit_cumulative_patch` (#266 / F21) runs at iteration completion,
+*before* the experiment branch is destroyed, capturing
+``git diff <main>..<branch>`` to ``runs/iter-N/patches/cumulative.patch``.
+Future campaigns reuse it via:
+
+```yaml
+derived_from:
+  campaign: paper-memorytime-mirage
+  iteration: 2          # or "final"
+```
+
+`apply_derived_from_patch` resolves and applies the cumulative
+patch to every experiment worktree as a preflight. `nous lineage
+<run_id>` surfaces the inheritance chain.
+
+### Per-phase silence threshold (`orchestrator/sdk_dispatch.py`)
+
+`_resolve_turn_silence_threshold(phase)` (#264 / F19) walks the
+resolution chain — bundle per-phase override → bundle scalar
+override → campaign per-phase value → phase default
+(design=600, execute_analyze=120, report=240). DESIGN's heavy
+reasoning between tool calls earns a longer threshold than
+EXECUTE_ANALYZE's frequent simulator calls, eliminating the active
+stall observed in paper-memorytime-mirage iter-3.
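The resolution order can be sketched as a first-non-null lookup. The phase defaults are the ones listed above; the key names `turn_silence_threshold_by_phase` and `turn_silence_threshold`, and the function name, are hypothetical stand-ins, since the actual bundle and campaign schema is not shown here.

```python
PHASE_DEFAULTS = {"design": 600, "execute_analyze": 120, "report": 240}


def resolve_threshold(phase, bundle=None, campaign=None):
    """Return the first value found along the resolution chain.

    Order: bundle per-phase override, bundle scalar override,
    campaign per-phase value, then the built-in phase default.
    """
    bundle = bundle or {}
    campaign = campaign or {}
    candidates = (
        bundle.get("turn_silence_threshold_by_phase", {}).get(phase),
        bundle.get("turn_silence_threshold"),
        campaign.get("turn_silence_threshold_by_phase", {}).get(phase),
    )
    for value in candidates:
        if value is not None:
            return value
    return PHASE_DEFAULTS[phase]
```

With no overrides, `resolve_threshold("design")` falls through to the phase default of 600.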
+
+### Plot specs + paper packaging (`orchestrator/plot_specs.py`, `nous package`)
+
+`invoke_plot_specs` (#263 / F18) reads `campaign.plot_specs` and
+invokes each user-supplied figure script with the `NOUS_RESULTS_DIR`
+and `NOUS_FIGURES_DIR` environment variables set. `nous package`
+tarballs work_dir + reproduce.sh + Dockerfile + README using the
+F17 reproducibility metadata.
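A minimal sketch of the environment-variable handoff to figure scripts follows. It assumes the scripts are Python files and that the plot specs reduce to a list of script paths; the function name is hypothetical, and the real shape of `campaign.plot_specs` may differ.

```python
import os
import subprocess
import sys


def run_figure_scripts(scripts, results_dir, figures_dir):
    """Run each script with NOUS_RESULTS_DIR and NOUS_FIGURES_DIR set."""
    env = dict(
        os.environ,
        NOUS_RESULTS_DIR=str(results_dir),
        NOUS_FIGURES_DIR=str(figures_dir),
    )
    # Create the output directory so scripts can write into it directly.
    os.makedirs(figures_dir, exist_ok=True)
    for script in scripts:
        # check=True raises CalledProcessError if a script exits non-zero.
        subprocess.run([sys.executable, str(script)], env=env, check=True)
```

A figure script then only needs `os.environ["NOUS_RESULTS_DIR"]` and `os.environ["NOUS_FIGURES_DIR"]` to locate its inputs and outputs.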
+
 
 ## Data Flow
 
 
@@ -0,0 +1,144 @@
+# Campaign authoring guide
+
+This guide covers the disciplines that make a Nous campaign
+**reproducible**, **spec-faithful**, and **defensible to a paper
+reviewer**. It collects the practices that emerged from the friction
+report on paper-memorytime-mirage (tracking issue #245) — most of
+which now have concrete tooling support behind them.
+
+## The "what to lock" inventory (#258 / F13)
+
+Before you start a campaign, enumerate every target-system parameter
+that could plausibly affect the experimental physics. For EACH,
+decide: **is deviation acceptable?** If no, lock it.
+
+The reactive failure mode, adding parameters to ``locked_parameters``
+only after each one bites you in a review round, turns a 2-week
+campaign into a 5-round dance with no end in sight. Up-front
+enumeration costs about an hour and avoids those rounds.
+
+### Inventory template (LLM-serving target)
+
+Adapt for your target system. The table is the discipline; the
+specific knob names vary.
+
+| Category | Parameter | Default | Lock? | Reasoning |
+|---|---|---|---|---|
+| Workload identity | ``model`` | (varies) | YES | Model determines π/δ in the latency model — different model = different physics |
+| Workload identity | ``concurrency_per_tenant`` | 32 | YES | Concurrency directly drives the metric |
+| Workload identity | ``duration_seconds`` | 600 | YES | Below ~120s, scale-dependent checks (PMF histogram, 99.9% backlog-nonempty) lose statistical power |
+| Workload identity | ``warmup_seconds`` | 30 | YES | Short warmup admits transients into measurement window |
+| KV / batching | ``total_kv_blocks`` | (derived from GPU) | YES if testing contention | K=1M with 16-token blocks = no contention; K=24576 ≈ realistic on H100 |
+| KV / batching | ``MaxModelLen`` | 4096 | YES if requests can exceed it | If set below P_max, requests are silently dropped |
+| KV / batching | ``MaxOutputLen`` | 1024 | YES if D matters | Overrides D=1 in the workload spec |
+| KV / batching | ``max_num_seqs`` | 256 | Maybe | Below 2 × concurrency, throttles closed-loop |
+| KV / batching | ``max_batched_tokens`` | 8192 | YES if prefill matters | Limits prefill batch composition |
+| KV / batching | ``gpu_memory_utilization`` | 0.9 | YES if K is derived | Affects K derivation |
+| KV / batching | ``BlockSize`` | 16 | Usually no | Architecture-dependent; document the value |
+| Latency model | ``MfuPrefill`` | (per-model file) | Snapshot via #262 | Snapshot the file SHA into reproducibility metadata |
+| Latency model | ``MfuDecode`` | (per-model file) | Snapshot via #262 | Same |
+| Latency model | TP factor | 1 | YES if testing distributed | TP=2 vs TP=1 changes π/δ |
+| Admission / gateway | ``AdmissionLatency`` | 0 | Usually no | Document; rarely matters |
+| Admission / gateway | ``RoutingLatency`` | 0 | Usually no | Document; rarely matters |
+| Admission / gateway | ``FlowControlEnabled`` | false | YES if relevant | Changes admission semantics |
+| Disaggregation | ``PDDecider`` | none | YES if testing | Changes architecture |
+| Disaggregation | ``PDTransfer*`` | (defaults) | YES if testing | Changes architecture |
+| Network | ``rtt_ms`` | 0 | YES if testing | Changes timing |
+| Network | ``bandwidth`` | unlimited | YES if testing | Changes timing |
+| Streaming | ``streaming`` | false | YES if testing | Changes per-token timing |
+
+## Pre-lock unit check (#259 / F14)
+
+Before locking a parameter to a specific value, **unit-check the
+closed-form prediction against your locked parameters**.
+
+### Worked example (paper-memorytime-mirage)
+
+The campaign locked ``D=8`` (output tokens per request) and ``K=1M``
+(KV blocks). Both choices were defensible in isolation; combined,
+they produced a regime where the campaign's own theory predicted
+ρ_mt ≈ 1.06 — a null result. The campaign author would have caught
+this by computing:
+
+```
+C_KV(P=1024, D=8) / C_KV(P=mixture, D=8)
+```
+
+Under realistic π/δ, that ratio comes out to ≈ 1.06 (decode
+dominates; equal-mean P_A=P_B masks the variance signal). A pre-lock
+unit check would have shown the D=8 error before iter-1 ran.
+
+The principle: *closed-form math is cheap; an iter-1 LLM run is
+expensive*. Walk the prediction by hand at your locked parameters
+before committing.
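One way to make the hand check routine is a small gate that evaluates a closed-form predictor at the locked values and compares it with the smallest effect worth running for. This is only a sketch: `pre_lock_check` is a hypothetical helper, and `toy_ratio` is a deliberately simple placeholder, not the campaign's C_KV model.

```python
def pre_lock_check(predict_effect, locked, min_effect):
    """Evaluate the closed-form prediction at the locked parameters.

    Returns the predicted effect and whether it clears `min_effect`,
    the smallest effect size the campaign could meaningfully detect.
    """
    predicted = predict_effect(locked)
    return {"predicted": predicted, "ok": predicted >= min_effect}


def toy_ratio(params):
    # Placeholder predictor for illustration only: the effect shrinks
    # toward 1.0 as D grows. Substitute the campaign's own formula.
    return 1.0 + 3.0 / params["D"]
```

With the real predictor plugged in, a result with `"ok": False` is the signal to revisit the locked values before any iteration runs.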
+
+## Rehearsal as scientific instrument (#259 / F14)
+
+This is the **affirmative case for the rehearsal mechanism**. In
+paper-memorytime-mirage iter-1, the rehearsal_subset (h-main arm,
+seed 42, both schedulers) ran at the campaign's locked parameters.
+Both Token-WFQ and KV-time-greedy produced ρ_mt ≈ 1.06 — vastly
+below the predicted 3.0×. Rather than reporting null findings, the
+agent ran a diagnostic D=1 probe, which produced ρ_mt ≈ 4.378 under
+WFQ. From the contrast, it correctly diagnosed two campaign-author
+errors:
+
+1. **D=8 puts the system in a decode-dominated regime** where
+   memory-time ∝ P·D, and equal-mean P_A=P_B masks the variance
+   signal. Recommendation: D=1.
+2. **K=1M blocks makes the bucket inoperative** (ω·K = 450K vs
+   ~152 actual occupancy). Recommendation: K ≤ 1000.
+
+The findings.json discrepancy_analysis was a clean post-mortem.
+The agent confirmed apparatus correctness (zero conservation
+violations, WFQ counter balance ratio 1.003) before declaring
+REFUTED with a diagnostic_note recommending specific parameter fixes
+for iter-2.
+
+**Why this matters.** The campaign author made two non-trivial
+workload-design errors that no amount of pre-run review caught.
+Iter-1 surfaced both with diagnostic precision, suggested fixes, and
+confirmed the underlying mechanism is real (4.38× mirage at D=1).
+Without rehearsal, iter-2 would have produced null results at full
+scale.
+
+**Use rehearsal as the diagnostic instrument it is.** When iter-1
+produces a result far from the predicted magnitude, don't just mark
+it REFUTED — probe the regime, contrast against a known-engaging
+configuration, and recommend specific fixes. ``rehearsal_subset``
+exists for this discipline; populate it generously.
+
+## Apparatus discipline (#252 / F7)
+
+Apparatus invariants must validate the **attribution** the experiment
+depends on, not an upstream total. See the methodology prompt for
+the worked example (the BLIS ``runningBatch`` vs ``RequestMap``
+case). Two-line summary: *if the bug-of-interest involves
+attribution among items, your invariant must distinguish per-item,
+not just sum*.
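A generic illustration of the difference, unrelated to the BLIS case itself: the two helpers below are hypothetical, and the per-tenant counts are made-up example values. A sum-only invariant passes when two items swap their attributed amounts, while a per-item invariant catches it.

```python
def totals_match(expected, observed):
    """Sum-only invariant: blind to how the total is split among items."""
    return sum(expected.values()) == sum(observed.values())


def attribution_mismatches(expected, observed):
    """Per-item invariant: return {item: (expected, observed)} where they differ.

    Items missing from either side are treated as 0.
    """
    keys = sorted(set(expected) | set(observed))
    return {
        key: (expected.get(key, 0), observed.get(key, 0))
        for key in keys
        if expected.get(key, 0) != observed.get(key, 0)
    }
```

For `expected = {"tenant_a": 700, "tenant_b": 300}` and an observation with the two values swapped, `totals_match` is satisfied while `attribution_mismatches` flags both tenants.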
+
+## Spec-fidelity (#246 / F1, #265 / F20)
+
+``locked_parameters`` and ``locked_workload`` are nous's
+spec-fidelity primitives. They hard-fail bundles that deviate from
+the campaign's intent, regardless of ``--auto-approve``. Use them
+liberally: they are the cheapest defense against silent
+design-agent rewrites.
+
+## Reproducibility (#262 / F17)
+
+nous auto-captures ``reproducibility_metadata`` at INIT (target
+repo commit, hardware-config sha, language versions, latency-config
+file snapshots). The first capture wins — re-running INIT on an
+existing campaign preserves the original commit, which is what
+reviewers want.
+
+To produce a paper-grade artifact tarball:
+
+```bash
+nous package <run_id>
+```
+
+This bundles the work_dir, a ``reproduce.sh`` template, a
+``Dockerfile`` pinning captured language versions, and a README.
+Drop the tarball in your artifact-evaluation submission.