Commit a30f09c

sriumcp and claude authored
fix: resolve all 21 sub-issues of friction-report tracker #245 (#267)
Closes #246 Closes #247 Closes #248 Closes #249 Closes #250 Closes #251 Closes #252 Closes #253 Closes #254 Closes #255 Closes #256 Closes #257 Closes #258 Closes #259 Closes #260 Closes #261 Closes #262 Closes #263 Closes #264 Closes #265 Closes #266 Closes #245

External campaign-author friction report from running the paper-memorytime-mirage campaign on nous against BLIS surfaced 21 distinct points of friction, clustered around five themes:

A. Spec-fidelity (F1, F2, F3, F4, F10, F13, F20): nous validated *self-consistency* (executor matches bundle) but not *spec-fidelity* (bundle matches campaign) under --auto-approve. Headline architectural primitive: campaign.locked_parameters (#246/F1) hard-fails any deviation, regardless of --auto-approve. Adoption: locked_workload (#265/F20), unlocked_parameters_audit (#261/F16), methodology hierarchy (#247/F2), depth_overrides+invalidates (#248/F3), gate-summary diff (#249/F4), auto-approve safety docs (#255/F10), create-campaign scaffold + authoring guide (#258/F13).

B. Apparatus discipline (F7, F14, F16): invariants must validate ATTRIBUTION, not upstream totals. Methodology prompt sections in design.md/execute_analyze.md cover the bug-class question with the BLIS runningBatch vs RequestMap worked example. Authoring guide covers rehearsal-as-instrument and pre-lock unit checks.

C. Lifecycle / portability (F5, F11, F12, F19, F21): per-phase silence threshold (#264/F19) closes the active stall where DESIGN's heavy reasoning trips an EXECUTE_ANALYZE-tuned watchdog. F21 lands cross-campaign code reuse via cumulative.patch + derived_from + nous lineage. F5 stop --immediate, F11 high-BUILD warning, F12 asyncio race fix.

D. Reproducibility (F17, F18): reproducibility_metadata auto-captured at INIT (target repo commit, hardware-config sha, language versions, latency-config snapshots). nous package tarballs work_dir + reproduce.sh + Dockerfile + README for paper artifact evaluation.

E. Hygiene (F6, F8, F9, F15): F6 worktree_extras tracked-path warning at campaign load; F8 nous resume diagnostic for work_dir confusion; F9 nous clean --orphaned; F15 physical_realism_check schema + soft warning.

See docs/friction-245-resolution.md for the per-F-entry → file map. Every change is tagged in code with (#NNN / F<n>) so git blame + issue tracker form a complete audit trail.

New modules:
- orchestrator/reproducibility.py: F17 capture
- orchestrator/lineage.py: F21 cumulative patches + derived_from
- orchestrator/plot_specs.py: F18 figure pipeline

New CLI subcommands: nous lineage, nous clean, nous package, nous stop --immediate

New schema fields:
- campaign: locked_parameters, locked_workload, derived_from, plot_specs, reproducibility_metadata, sdk_timeouts.turn_silence_threshold_seconds (per-phase map)
- bundle: physical_realism_check, unlocked_parameters_audit, workload_changes_from_canonical, rehearsal_subset.depth_overrides, timing_observations.recommended_turn_silence_threshold_seconds (per-phase map)

Tests: 32 new in tests/test_friction_245.py, 1278 total passing.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent efc748f commit a30f09c

20 files changed

Lines changed: 2969 additions & 25 deletions

CLAUDE.md

Lines changed: 20 additions & 0 deletions
@@ -148,9 +148,29 @@ measurement). The validator floor (`validate_evidence`) rejects
aspirational platitudes regardless of source. See `docs/data-model.md`
for the schema.

## Spec fidelity (issue #246 / F1, friction-report #245)

Every campaign should declare ``locked_parameters`` (and, when
applicable, ``locked_workload``) for every knob whose deviation
would invalidate the experiment. The validator hard-fails any
bundle whose ``experiment_spec.verified_parameters`` deviates from
``locked_parameters``, regardless of ``--auto-approve``. This
closes the spec-fidelity gap that allowed paper-memorytime-mirage
iter-1 to silently rewrite four locked workload parameters.

Authoring discipline lives in ``docs/campaign-authoring-guide.md``
(the "what to lock" inventory + the rehearsal-as-instrument
worked example). The full friction-report resolution map is in
``docs/friction-245-resolution.md``.

## See also

- `docs/contributing/workflow.md`: full workflow doc.
- `docs/security.md`: permission policy (#135).
- `docs/architecture.md`: internals.
- `docs/campaign-authoring-guide.md`: locked_parameters, the
"what to lock" inventory, rehearsal-as-instrument (#245
resolution).
- `docs/friction-245-resolution.md`: F1..F21 → file map for
paper-memorytime-mirage friction report.
- `docs/plans/CHECKPOINT.md`: current state of the #120 epic.

README.md

Lines changed: 48 additions & 0 deletions
@@ -218,6 +218,54 @@ nous run campaign.yaml --auto-approve --max-iterations 1 # quick unattended run
nous run campaign.yaml --bundle ./fig7_bundle.yaml --auto-approve
```

#### `--auto-approve` safety preconditions (#255 / F10)

`--auto-approve` skips the HUMAN_DESIGN_GATE and HUMAN_FINDINGS_GATE,
which are nous's primary safety mechanisms for catching design-agent
deviations from campaign intent. **Auto-approve is safe to use only
when ALL of these hold**:

1. The campaign declares ``locked_parameters`` (#246 / F1) for every
campaign-spec-critical knob (model, concurrency, duration, warmup,
K-class parameters like ``total_kv_blocks``, anything whose
deviation would silently invalidate the experiment). nous hard-fails
any bundle whose ``experiment_spec.verified_parameters`` contradicts
a locked parameter, regardless of ``--auto-approve``.
2. If the campaign has a canonical workload, declare
``locked_workload`` (#265 / F20). The validator diffs
``bundle.inputs/*.yaml`` against it. Deliberate deviations require
``bundle.workload_changes_from_canonical`` to be populated.
3. The target repo's docs do not contain example values that
contradict the campaign's locked spec (#247 / F2). When they do,
the methodology prompt's "campaign > target-repo-docs" hierarchy
covers it, but a stale methodology prompt would not. If you're
running a target whose docs heavily contradict the campaign,
verify the methodology prompt has the hierarchy clause before
trusting auto-approve.
4. The campaign's apparatus checks are robust to design-agent
variation, and validate ATTRIBUTION (not just upstream totals,
#252 / F7).
5. A stale ``principles.json`` ledger is acceptable. Auto-approve
never gates on it.

**If any of these fail**, either run interactively (no
``--auto-approve``) so a human reviewer sees the design at the gate,
or invoke an external watchdog process to compare bundles against
your campaign spec.

Even under ``--auto-approve``, every design gate writes
``gate_summary_design.json`` with a deterministic
``campaign_spec_diff`` block (#249 / F4). Watchdog-style audit:

```bash
jq '.campaign_spec_diff' "$NOUS_CAMPAIGN_PARENT"/<run>/runs/iter-*/gate_summary_design.json
```

Non-empty ``locked_parameters_violations`` means F1's hard-fail
fired (the iteration won't have proceeded). ``depth_overrides_present``
or ``workload_changes_from_canonical_declared`` flag deliberate
deviations the design agent declared.
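
The same audit can be scripted when a watchdog needs structured output rather than raw `jq` text. The following is a minimal sketch, not part of nous: the function name `audit_gate_summaries` is hypothetical, and it assumes the three keys named above sit at the top level of ``campaign_spec_diff``, with the violations stored as a list and the two flags as booleans.

```python
import json
from pathlib import Path


def audit_gate_summaries(run_dir):
    """Collect spec-fidelity flags from each iteration's design gate summary.

    Assumes the layout <run_dir>/runs/iter-*/gate_summary_design.json and
    that campaign_spec_diff carries the three keys read below (an assumption
    about the JSON shape, not a documented contract).
    """
    findings = []
    pattern = "runs/iter-*/gate_summary_design.json"
    for path in sorted(Path(run_dir).glob(pattern)):
        diff = json.loads(path.read_text()).get("campaign_spec_diff", {})
        findings.append({
            "iteration": path.parent.name,
            "violations": diff.get("locked_parameters_violations", []),
            "depth_overrides": bool(diff.get("depth_overrides_present", False)),
            "workload_changes": bool(
                diff.get("workload_changes_from_canonical_declared", False)
            ),
        })
    return findings
```

A watchdog could call this on a timer and alert whenever any entry has a non-empty ``violations`` list or a flag it did not expect.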

### Overnight / long-running campaigns

For unattended runs, increase retries and timeout so transient failures don't kill the campaign:

docs/architecture.md

Lines changed: 74 additions & 0 deletions
@@ -240,6 +240,80 @@ Before each human gate, a formatted summary (`gate_summary_*.json`) is produced.

Gates display the summary first, then the raw artifact (for those who want full detail).

**Spec-fidelity diff (#249 / F4).** For the design-phase summary,
``_augment_summary_with_spec_diff`` (in `orchestrator/iteration.py`)
post-processes the LLM-generated summary to attach a deterministic
``campaign_spec_diff`` block: locked_parameters violations, depth_overrides
presence, declared workload changes. Always emitted, regardless of
``--auto-approve``. ``nous status`` surfaces it in human-readable
form.

### Spec fidelity (`orchestrator/validate.py`)

Two pure-Python validators close the gap between *self-consistency*
(the executor matches the bundle) and *spec-fidelity* (the bundle
matches the campaign):

* `_validate_locked_parameters` (#246 / F1): every entry in
``campaign.locked_parameters`` must match
``bundle.experiment_spec.verified_parameters`` exactly. Hard-fail
regardless of ``--auto-approve``.
* `_validate_locked_workload` (#265 / F20): walks the canonical
workload structure and diffs against ``bundle.inputs/*.yaml``.
Declared deviations (``bundle.workload_changes_from_canonical``)
are allowed; undeclared are hard-fails.
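
The exact-match rule for locked parameters can be illustrated with a short sketch. This is not the code in `orchestrator/validate.py`: the function name `find_locked_parameter_violations`, the shape of the violation records, and the choice to treat a missing key as a violation are all assumptions made for illustration.

```python
def find_locked_parameter_violations(locked, verified):
    """Return one record per locked key that is missing or differs.

    `locked` stands in for campaign.locked_parameters and `verified` for
    bundle.experiment_spec.verified_parameters, both as flat dicts
    (an assumption; the real structures may be nested).
    """
    violations = []
    for name, expected in locked.items():
        if name not in verified:
            violations.append(
                {"parameter": name, "expected": expected,
                 "actual": None, "reason": "missing"}
            )
        elif verified[name] != expected:
            violations.append(
                {"parameter": name, "expected": expected,
                 "actual": verified[name], "reason": "mismatch"}
            )
    return violations


# Hypothetical values: the bundle silently raised total_kv_blocks.
locked = {"total_kv_blocks": 24576, "duration_seconds": 600}
verified = {"total_kv_blocks": 1000000, "duration_seconds": 600}
violations = find_locked_parameter_violations(locked, verified)
```

A hard-fail validator would raise as soon as this list is non-empty, without consulting any approval flag. Note that Python's `!=` treats `1` and `1.0` as equal, so a stricter notion of "exactly" would also compare types.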

`compute_campaign_spec_diff` exposes the same logic for read-only
auditor use (the F4 gate-summary diff). See
`docs/campaign-authoring-guide.md` for the discipline these enforce.

### Reproducibility metadata (`orchestrator/reproducibility.py`)

`capture_reproducibility_metadata` (#262 / F17) runs at INIT and
records target repo HEAD, dirty flag, hardware-config sha,
language versions, gpu_memory_utilization, latency-config file
paths. The block is persisted in `state.json` (first-capture wins;
re-running INIT preserves the original) and surfaced via
`nous status`. Per-iteration `snapshot_iter_files` copies the actual
hardware/latency config files into `runs/iter-N/snapshots/` so a
future reviewer can diff exact numbers even after the operator
edits the source-of-truth file.
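
The first-capture-wins rule is small enough to sketch. The snippet below is illustrative only: `record_reproducibility_metadata` is a hypothetical helper, the state is modelled as a plain dict rather than the real `state.json` handling, and the key names inside the metadata are invented for the example.

```python
def record_reproducibility_metadata(state, captured):
    """Store `captured` only if the state holds no metadata yet.

    Returns whatever metadata is in effect afterwards, so a second INIT
    gets the original capture back instead of overwriting it.
    """
    if "reproducibility_metadata" not in state:
        # Copy so later edits to `captured` cannot alter the stored block.
        state["reproducibility_metadata"] = dict(captured)
    return state["reproducibility_metadata"]


state = {}
# Hypothetical key names and values, for illustration only.
first = record_reproducibility_metadata(state, {"target_repo_head": "abc123"})
second = record_reproducibility_metadata(state, {"target_repo_head": "def456"})
```

After both calls the stored commit is still the one from the first capture, which is the behaviour the paragraph above describes for re-running INIT.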

### Cross-campaign code reuse (`orchestrator/lineage.py`)

`emit_cumulative_patch` (#266 / F21) runs at iteration completion,
*before* the experiment branch is destroyed, capturing
``git diff <main>..<branch>`` to ``runs/iter-N/patches/cumulative.patch``.
Future campaigns reuse it via:

```yaml
derived_from:
  campaign: paper-memorytime-mirage
  iteration: 2 # or "final"
```

`apply_derived_from_patch` resolves and applies the cumulative
patch to every experiment worktree as a preflight.
`nous lineage <run_id>` surfaces the inheritance chain.

### Per-phase silence threshold (`orchestrator/sdk_dispatch.py`)

`_resolve_turn_silence_threshold(phase)` (#264 / F19) walks the
resolution chain: bundle per-phase override → bundle scalar
override → campaign per-phase value → phase default in seconds
(design=600, execute_analyze=120, report=240). DESIGN's heavy
reasoning between tool calls earns a longer threshold than
EXECUTE_ANALYZE's frequent simulator calls, eliminating the active
stall observed in paper-memorytime-mirage iter-3.
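
The resolution chain reads naturally as a sequence of fallbacks. The sketch below is an illustration under stated assumptions, not the code in `orchestrator/sdk_dispatch.py`: the function and parameter names are hypothetical, and the three override sources are passed in explicitly instead of being read from the bundle and campaign objects. Only the default values come from the text above.

```python
# Phase defaults in seconds, as listed above.
PHASE_DEFAULTS = {"design": 600, "execute_analyze": 120, "report": 240}


def resolve_turn_silence_threshold(phase, bundle_per_phase=None,
                                   bundle_scalar=None,
                                   campaign_per_phase=None):
    """Return the first threshold found, walking the chain in order."""
    # 1. Bundle per-phase override.
    if bundle_per_phase and phase in bundle_per_phase:
        return bundle_per_phase[phase]
    # 2. Bundle scalar override (applies to every phase).
    if bundle_scalar is not None:
        return bundle_scalar
    # 3. Campaign per-phase value.
    if campaign_per_phase and phase in campaign_per_phase:
        return campaign_per_phase[phase]
    # 4. Built-in phase default.
    return PHASE_DEFAULTS[phase]
```

With no overrides, `resolve_turn_silence_threshold("design")` falls through to the 600-second default, while a campaign map such as `{"design": 900}` (a made-up value) would take effect only when the bundle sets nothing.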

### Plot specs + paper packaging (`orchestrator/plot_specs.py`, `nous package`)

`invoke_plot_specs` (#263 / F18) reads `campaign.plot_specs` and
invokes each user-supplied figure script with the `NOUS_RESULTS_DIR`
and `NOUS_FIGURES_DIR` environment variables set. `nous package`
tarballs work_dir + reproduce.sh + Dockerfile + README using the
F17 reproducibility metadata.
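
From the figure script's side, the contract is just the two environment variables. The following sketch shows how a user-supplied script might consume them; the helper names `resolve_dirs` and `write_inventory` are hypothetical, and the "figure" is a plain text inventory so the example stays dependency-free. What the results directory actually contains is not specified here, so the script only lists files.

```python
import os
from pathlib import Path


def resolve_dirs(env=os.environ):
    """Read the two directories passed in through the environment."""
    results_dir = Path(env["NOUS_RESULTS_DIR"])
    figures_dir = Path(env["NOUS_FIGURES_DIR"])
    # Create the output directory if the caller has not done so already.
    figures_dir.mkdir(parents=True, exist_ok=True)
    return results_dir, figures_dir


def write_inventory(results_dir, figures_dir):
    """Stand-in for real plotting: list the result files a figure would use."""
    names = sorted(p.name for p in results_dir.iterdir() if p.is_file())
    out = figures_dir / "inventory.txt"
    out.write_text("\n".join(names) + "\n")
    return out
```

A real script would end by calling `write_inventory(*resolve_dirs())` and would replace the inventory step with its plotting code, writing image files into the figures directory.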

## Data Flow

docs/campaign-authoring-guide.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# Campaign authoring guide

This guide covers the disciplines that make a nous campaign
**reproducible**, **spec-faithful**, and **defensible to a paper
reviewer**. It collects the practices that emerged from the friction
report on paper-memorytime-mirage (tracking issue #245), most of
which now have concrete tooling support behind them.

## The "what to lock" inventory (#258 / F13)

Before you start a campaign, enumerate every target-system parameter
that could plausibly affect the experimental physics. For EACH,
decide: **is deviation acceptable?** If no, lock it.

The reactive failure mode (adding parameters to ``locked_parameters``
only after each one bites you in a review round) turns a 2-week
campaign into a 5-round dance with no end in sight. Up-front
enumeration costs an hour and saves all of it.

### Inventory template (LLM-serving target)

Adapt for your target system. The table is the discipline; the
specific knob names vary.

| Category | Parameter | Default | Lock? | Reasoning |
|---|---|---|---|---|
| Workload identity | ``model`` | (varies) | YES | Model determines π/δ in the latency model; different model = different physics |
| Workload identity | ``concurrency_per_tenant`` | 32 | YES | Concurrency directly drives the metric |
| Workload identity | ``duration_seconds`` | 600 | YES | Below ~120s, scale-dependent checks (PMF histogram, 99.9% backlog-nonempty) lose statistical power |
| Workload identity | ``warmup_seconds`` | 30 | YES | Short warmup admits transients into measurement window |
| KV / batching | ``total_kv_blocks`` | (derived from GPU) | YES if testing contention | K=1M with 16-token blocks = no contention; K=24576 ≈ realistic on H100 |
| KV / batching | ``MaxModelLen`` | 4096 | YES if requests exceed | Below P_max, requests are silently dropped |
| KV / batching | ``MaxOutputLen`` | 1024 | YES if D matters | Overrides D=1 in the workload spec |
| KV / batching | ``max_num_seqs`` | 256 | Maybe | Below 2 × concurrency, throttles closed-loop |
| KV / batching | ``max_batched_tokens`` | 8192 | YES if prefill matters | Limits prefill batch composition |
| KV / batching | ``gpu_memory_utilization`` | 0.9 | YES if K is derived | Affects K derivation |
| KV / batching | ``BlockSize`` | 16 | Usually no | Architecture-dependent; document the value |
| Latency model | ``MfuPrefill`` | (per-model file) | Snapshot via #262 | Snapshot the file SHA into reproducibility metadata |
| Latency model | ``MfuDecode`` | (per-model file) | Snapshot via #262 | Same |
| Latency model | TP factor | 1 | YES if testing distributed | TP=2 vs TP=1 changes π/δ |
| Admission / gateway | ``AdmissionLatency`` | 0 | Usually no | Document; rarely matters |
| Admission / gateway | ``RoutingLatency`` | 0 | Usually no | Document; rarely matters |
| Admission / gateway | ``FlowControlEnabled`` | false | YES if relevant | Changes admission semantics |
| Disaggregation | ``PDDecider`` | none | YES if testing | Changes architecture |
| Disaggregation | ``PDTransfer*`` | (defaults) | YES if testing | Changes architecture |
| Network | ``rtt_ms`` | 0 | YES if testing | Changes timing |
| Network | ``bandwidth`` | unlimited | YES if testing | Changes timing |
| Streaming | ``streaming`` | false | YES if testing | Changes per-token timing |

## Pre-lock unit check (#259 / F14)

Before locking a parameter to a specific value, **unit-check the
closed-form prediction against your locked parameters**.

### Worked example (paper-memorytime-mirage)

The campaign locked ``D=8`` (output tokens per request) and ``K=1M``
(KV blocks). Both choices were defensible in isolation; combined,
they produced a regime where the campaign's own theory predicted
ρ_mt ≈ 1.06, a null result. The campaign author would have caught
this by computing:

```
C_KV(P=1024, D=8) / C_KV(P=mixture, D=8)
```

Under realistic π/δ, that ratio comes out to ≈ 1.06 (decode
dominates; equal-mean P_A=P_B masks the variance signal). A pre-lock
unit check would have shown the D=8 error before iter-1 ran.

The principle: *closed-form math is cheap; an iter-1 LLM run is
expensive*. Walk the prediction by hand at your locked parameters
before committing.

## Rehearsal as scientific instrument (#259 / F14)

This is the **affirmative case for the rehearsal mechanism**. In
paper-memorytime-mirage iter-1, the rehearsal_subset (h-main arm,
seed 42, both schedulers) ran at the campaign's locked parameters.
Both Token-WFQ and KV-time-greedy produced ρ_mt ≈ 1.06, vastly
below the predicted 3.0×. Rather than reporting null findings, the
agent ran a diagnostic D=1 probe, which produced ρ_mt ≈ 4.378 under
WFQ. From the contrast, it correctly diagnosed two campaign-author
errors:

1. **D=8 puts the system in a decode-dominated regime** where
memory-time ∝ P·D, and equal-mean P_A=P_B masks the variance
signal. Recommendation: D=1.
2. **K=1M blocks makes the bucket inoperative** (ω·K = 450K vs
~152 actual occupancy). Recommendation: K ≤ 1000.

The findings.json discrepancy_analysis was a clean post-mortem.
The agent confirmed apparatus correctness (zero conservation
violations, WFQ counter balance ratio 1.003) before declaring
REFUTED with a diagnostic_note recommending specific parameter fixes
for iter-2.

**Why this matters.** The campaign author made two non-trivial
workload-design errors that no amount of pre-run review caught.
Iter-1 surfaced both with diagnostic precision, suggested fixes, and
confirmed the underlying mechanism is real (4.38× mirage at D=1).
Without rehearsal, iter-2 would have produced null results at full
scale.

**Use rehearsal as the diagnostic instrument it is.** When iter-1
produces a result far from the predicted magnitude, don't just mark
it REFUTED: probe the regime, contrast against a known-engaging
configuration, and recommend specific fixes. ``rehearsal_subset``
exists for this discipline; populate it generously.

## Apparatus discipline (#252 / F7)

Apparatus invariants must validate the **attribution** the experiment
depends on, not an upstream total. See the methodology prompt for
the worked example (the BLIS ``runningBatch`` vs ``RequestMap``
case). Two-line summary: *if the bug-of-interest involves
attribution among items, your invariant must distinguish per-item,
not just sum*.
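
The difference between a total-only invariant and a per-item one shows up in a toy example. This is a generic sketch with made-up tenant counters, not the BLIS ``runningBatch`` vs ``RequestMap`` case; the function names `totals_match` and `attribution_mismatches` are hypothetical.

```python
def totals_match(expected, observed):
    """Total-only invariant: compares sums and ignores who got what."""
    return sum(expected.values()) == sum(observed.values())


def attribution_mismatches(expected, observed):
    """Per-item invariant: return {item: (expected, observed)} for each
    item whose count differs. An empty dict means attribution is intact."""
    keys = expected.keys() | observed.keys()
    return {
        key: (expected.get(key, 0), observed.get(key, 0))
        for key in sorted(keys)
        if expected.get(key, 0) != observed.get(key, 0)
    }


# Hypothetical counters: work was swapped between two tenants.
expected = {"tenant_a": 100, "tenant_b": 50}
observed = {"tenant_a": 50, "tenant_b": 100}

total_ok = totals_match(expected, observed)
mismatches = attribution_mismatches(expected, observed)
```

Here the total-only check passes because both sides sum to 150, while the per-item check reports both tenants. An attribution bug of this shape is invisible to any invariant that only sums.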

## Spec-fidelity (#246 / F1, #265 / F20)

``locked_parameters`` and ``locked_workload`` are nous's
spec-fidelity primitives. They hard-fail bundles that deviate from the
campaign's intent, regardless of ``--auto-approve``. Use them
liberally; they are the cheapest defense against silent
design-agent rewrites.

## Reproducibility (#262 / F17)

nous auto-captures ``reproducibility_metadata`` at INIT (target
repo commit, hardware-config sha, language versions, latency-config
file snapshots). The first capture wins: re-running INIT on an
existing campaign preserves the original commit, which is what
reviewers want.

To produce a paper-grade artifact tarball:

```
nous package <run_id>
```

This bundles the work_dir, a ``reproduce.sh`` template, a
``Dockerfile`` pinning captured language versions, and a README.
Drop the tarball in your artifact-evaluation submission.
