| 1 | +# Campaign authoring guide |
| 2 | + |
| 3 | +This guide covers the disciplines that make a Nous campaign |
| 4 | +**reproducible**, **spec-faithful**, and **defensible to a paper |
| 5 | +reviewer**. It collects the practices that emerged from the friction |
| 6 | +report on paper-memorytime-mirage (tracking issue #245) — most of |
| 7 | +which now have concrete tooling support behind them. |
| 8 | + |
| 9 | +## The "what to lock" inventory (#258 / F13) |
| 10 | + |
| 11 | +Before you start a campaign, enumerate every target-system parameter |
| 12 | +that could plausibly affect the experimental physics. For EACH, |
| 13 | +decide: **is deviation acceptable?** If no, lock it. |
| 14 | + |
| 15 | +The reactive failure mode — adding parameters to ``locked_parameters`` |
| 16 | +only after each one bites you in a review round — turns a 2-week |
| 17 | +campaign into a 5-round dance with no end in sight. Up-front |
| 18 | +enumeration costs an hour and avoids those extra review rounds. |
| 19 | + |
| 20 | +### Inventory template (LLM-serving target) |
| 21 | + |
| 22 | +Adapt for your target system. The table is the discipline; the |
| 23 | +specific knob names vary. |
| 24 | + |
| 25 | +| Category | Parameter | Default | Lock? | Reasoning | |
| 26 | +|---|---|---|---|---| |
| 27 | +| Workload identity | ``model`` | (varies) | YES | Model determines π/δ in the latency model — different model = different physics | |
| 28 | +| Workload identity | ``concurrency_per_tenant`` | 32 | YES | Concurrency directly drives the metric | |
| 29 | +| Workload identity | ``duration_seconds`` | 600 | YES | Below ~120s, scale-dependent checks (PMF histogram, 99.9% backlog-nonempty) lose statistical power | |
| 30 | +| Workload identity | ``warmup_seconds`` | 30 | YES | Short warmup admits transients into measurement window | |
| 31 | +| KV / batching | ``total_kv_blocks`` | (derived from GPU) | YES if testing contention | K=1M with 16-token blocks = no contention; K=24576 ≈ realistic on H100 | |
| 32 | +| KV / batching | ``MaxModelLen`` | 4096 | YES if requests may exceed it | If set below P_max, requests are silently dropped | |
| 33 | +| KV / batching | ``MaxOutputLen`` | 1024 | YES if D matters | Overrides D=1 in the workload spec | |
| 34 | +| KV / batching | ``max_num_seqs`` | 256 | Maybe | Below 2 × concurrency, throttles closed-loop | |
| 35 | +| KV / batching | ``max_batched_tokens`` | 8192 | YES if prefill matters | Limits prefill batch composition | |
| 36 | +| KV / batching | ``gpu_memory_utilization`` | 0.9 | YES if K is derived | Affects K derivation | |
| 37 | +| KV / batching | ``BlockSize`` | 16 | Usually no | Architecture-dependent; document the value | |
| 38 | +| Latency model | ``MfuPrefill`` | (per-model file) | Snapshot via #262 | Snapshot the file SHA into reproducibility metadata | |
| 39 | +| Latency model | ``MfuDecode`` | (per-model file) | Snapshot via #262 | Same | |
| 40 | +| Latency model | TP factor | 1 | YES if testing distributed | TP=2 vs TP=1 changes π/δ | |
| 41 | +| Admission / gateway | ``AdmissionLatency`` | 0 | Usually no | Document; rarely matters | |
| 42 | +| Admission / gateway | ``RoutingLatency`` | 0 | Usually no | Document; rarely matters | |
| 43 | +| Admission / gateway | ``FlowControlEnabled`` | false | YES if relevant | Changes admission semantics | |
| 44 | +| Disaggregation | ``PDDecider`` | none | YES if testing | Changes architecture | |
| 45 | +| Disaggregation | ``PDTransfer*`` | (defaults) | YES if testing | Changes architecture | |
| 46 | +| Network | ``rtt_ms`` | 0 | YES if testing | Changes timing | |
| 47 | +| Network | ``bandwidth`` | unlimited | YES if testing | Changes timing | |
| 48 | +| Streaming | ``streaming`` | false | YES if testing | Changes per-token timing | |
| 49 | + |
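One way to keep the inventory honest is to encode it as data and refuse to proceed while any row lacks a decision. The sketch below is illustrative only: `InventoryRow` and `build_locks` are hypothetical names, and the resulting dict is an assumed stand-in for whatever shape ``locked_parameters`` takes in your campaign file.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class InventoryRow:
    """One row of the inventory table (hypothetical structure)."""
    parameter: str
    value: Any
    lock: Optional[bool]  # None means "no decision made yet"
    reasoning: str = ""


def build_locks(rows: List[InventoryRow]) -> Dict[str, Any]:
    """Return {parameter: value} for locked rows; fail on undecided rows."""
    undecided = [row.parameter for row in rows if row.lock is None]
    if undecided:
        raise ValueError(f"no lock decision recorded for: {undecided}")
    return {row.parameter: row.value for row in rows if row.lock}


rows = [
    InventoryRow("duration_seconds", 600, True, "short runs lose statistical power"),
    InventoryRow("warmup_seconds", 30, True, "short warmup admits transients"),
    InventoryRow("BlockSize", 16, False, "architecture-dependent; documented only"),
]
print(build_locks(rows))
```

Adding a row with `lock=None` makes `build_locks` raise, which forces the "is deviation acceptable?" question to be answered for every parameter before the campaign starts.
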
| 50 | +## Pre-lock unit check (#259 / F14) |
| 51 | + |
| 52 | +Before locking a parameter to a specific value, **unit-check the |
| 53 | +closed-form prediction at the values you intend to lock**. |
| 54 | + |
| 55 | +### Worked example (paper-memorytime-mirage) |
| 56 | + |
| 57 | +The campaign locked ``D=8`` (output tokens per request) and ``K=1M`` |
| 58 | +(KV blocks). Both choices were defensible in isolation; combined, |
| 59 | +they produced a regime where the campaign's own theory predicted |
| 60 | +ρ_mt ≈ 1.06 — a null result. The campaign author would have caught |
| 61 | +this by computing: |
| 62 | + |
| 63 | +``` |
| 64 | +C_KV(P=1024, D=8) / C_KV(P=mixture, D=8) |
| 65 | +``` |
| 66 | + |
| 67 | +Under realistic π/δ, that ratio comes out to ≈ 1.06 (decode |
| 68 | +dominates; equal-mean P_A=P_B masks the variance signal). A pre-lock |
| 69 | +unit check would have exposed the D=8 error before iter-1 ran. |
| 70 | + |
| 71 | +The principle: *closed-form math is cheap; an iter-1 LLM run is |
| 72 | +expensive*. Walk the prediction by hand at your locked parameters |
| 73 | +before committing. |
| 74 | + |
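The hand calculation can also be scripted so it runs before every lock. The following is a minimal sketch under stated assumptions: `pre_lock_check` is a hypothetical helper, and `toy_ratio` is a made-up closed form (not the campaign's latency model, and not meant to reproduce its numbers) that only illustrates how a shared decode cost growing with D dilutes a ratio toward 1.

```python
from typing import Callable, Mapping


def pre_lock_check(
    predict: Callable[[Mapping[str, float]], float],
    locked: Mapping[str, float],
    min_effect: float,
) -> float:
    """Evaluate a closed-form prediction at the values about to be locked.

    Raises ValueError when the predicted effect is below min_effect,
    i.e. when the campaign's own theory predicts a near-null result.
    """
    predicted = predict(locked)
    if predicted < min_effect:
        raise ValueError(
            f"predicted effect {predicted:.2f} is below the required "
            f"{min_effect:.2f} at locked parameters {dict(locked)}"
        )
    return predicted


def toy_ratio(params: Mapping[str, float]) -> float:
    """TOY closed form, for illustration only.

    Two arms differ in a fixed prefill cost (3.0 vs 1.0, arbitrary units);
    both pay the same decode cost, which grows linearly with D.
    """
    decode = 0.5 * params["D"]
    return (3.0 + decode) / (1.0 + decode)
```

With this toy model, `pre_lock_check(toy_ratio, {"D": 1}, min_effect=2.0)` passes, while the same call with `{"D": 8}` raises before any run is scheduled. Substitute your campaign's actual closed form for `toy_ratio`.
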
| 75 | +## Rehearsal as scientific instrument (#259 / F14) |
| 76 | + |
| 77 | +This is the **affirmative case for the rehearsal mechanism**. In |
| 78 | +paper-memorytime-mirage iter-1, the rehearsal_subset (h-main arm, |
| 79 | +seed 42, both schedulers) ran at the campaign's locked parameters. |
| 80 | +Both Token-WFQ and KV-time-greedy produced ρ_mt ≈ 1.06 — vastly |
| 81 | +below the predicted 3.0×. Rather than reporting null findings, the |
| 82 | +agent ran a diagnostic D=1 probe, which produced ρ_mt ≈ 4.378 under |
| 83 | +WFQ. From the contrast, it correctly diagnosed two campaign-author |
| 84 | +errors: |
| 85 | + |
| 86 | +1. **D=8 puts the system in a decode-dominated regime** where |
| 87 | + memory-time ∝ P·D, and equal-mean P_A=P_B masks the variance |
| 88 | + signal. Recommendation: D=1. |
| 89 | +2. **K=1M blocks makes the bucket inoperative** (ω·K = 450K vs |
| 90 | + ~152 actual occupancy). Recommendation: K ≤ 1000. |
| 91 | + |
| 92 | +The findings.json discrepancy_analysis was a clean post-mortem. |
| 93 | +The agent confirmed apparatus correctness (zero conservation |
| 94 | +violations, WFQ counter balance ratio 1.003) before declaring |
| 95 | +REFUTED with a diagnostic_note recommending specific parameter fixes |
| 96 | +for iter-2. |
| 97 | + |
| 98 | +**Why this matters.** The campaign author made two non-trivial |
| 99 | +workload-design errors that pre-run review failed to catch. |
| 100 | +Iter-1 surfaced both with diagnostic precision, suggested fixes, and |
| 101 | +confirmed the underlying mechanism is real (4.38× mirage at D=1). |
| 102 | +Without rehearsal, iter-2 would have produced null results at full |
| 103 | +scale. |
| 104 | + |
| 105 | +**Use rehearsal as the diagnostic instrument it is.** When iter-1 |
| 106 | +produces a result far from the predicted magnitude, don't just mark |
| 107 | +it REFUTED — probe the regime, contrast against a known-engaging |
| 108 | +configuration, and recommend specific fixes. ``rehearsal_subset`` |
| 109 | +exists for this discipline; populate it generously. |
| 110 | + |
| 111 | +## Apparatus discipline (#252 / F7) |
| 112 | + |
| 113 | +Apparatus invariants must validate the **attribution** the experiment |
| 114 | +depends on, not an upstream total. See the methodology prompt for |
| 115 | +the worked example (the BLIS ``runningBatch`` vs ``RequestMap`` |
| 116 | +case). In short: *if the bug of interest involves attribution |
| 117 | +among items, your invariant must check each item individually, |
| 118 | +not just the sum*. |
| 119 | + |
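To make the distinction concrete, here is a small sketch with hypothetical per-tenant token counts (it is not the BLIS case from the methodology prompt). A misattribution between two tenants leaves the total intact, so a sum-based invariant passes while a per-item invariant fails.

```python
from typing import Mapping


def total_conserved(expected: Mapping[str, int], observed: Mapping[str, int]) -> bool:
    """Upstream-total invariant: only the sums are compared."""
    return sum(expected.values()) == sum(observed.values())


def attribution_conserved(expected: Mapping[str, int], observed: Mapping[str, int]) -> bool:
    """Per-item invariant: every item must match individually."""
    return dict(expected) == dict(observed)


# Hypothetical data: 100 of tenant_a's tokens were credited to tenant_b.
expected = {"tenant_a": 700, "tenant_b": 300}
observed = {"tenant_a": 600, "tenant_b": 400}

print(total_conserved(expected, observed))        # sum check passes
print(attribution_conserved(expected, observed))  # per-item check fails
```

If the experiment's claim depends on which tenant got what, only the second check validates the apparatus.
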
| 120 | +## Spec-fidelity (#246 / F1, #265 / F20) |
| 121 | + |
| 122 | +``locked_parameters`` and ``locked_workload`` are nous's |
| 123 | +spec-fidelity primitives. They hard-fail bundles that deviate from |
| 124 | +the campaign's intent, regardless of ``--auto-approve``. Use them |
| 125 | +liberally: they are the cheapest defense against silent |
| 126 | +design-agent rewrites. |
| 127 | + |
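The hard-fail behavior can be pictured as a plain comparison between the locked values and what a bundle proposes. This is a sketch of the idea only, not nous's implementation: `find_deviations` and `enforce_locks` are hypothetical names, and both the lock set and the bundle are assumed to be flat dicts.

```python
from typing import Any, Dict, Mapping, Tuple

MISSING = "<missing>"  # marker for a locked parameter the bundle omits


def find_deviations(
    locked: Mapping[str, Any], bundle: Mapping[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (locked_value, bundle_value)} for every mismatch."""
    deviations = {}
    for name, want in locked.items():
        got = bundle.get(name, MISSING)
        if got != want:
            deviations[name] = (want, got)
    return deviations


def enforce_locks(locked: Mapping[str, Any], bundle: Mapping[str, Any]) -> None:
    """Hard-fail on any deviation; there is deliberately no override flag."""
    deviations = find_deviations(locked, bundle)
    if deviations:
        raise ValueError(f"bundle deviates from locked parameters: {deviations}")
```

Parameters that appear in the bundle but not in the lock set pass through unchecked, which is why the up-front inventory matters: an unlocked parameter is an unguarded one.
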
| 128 | +## Reproducibility (#262 / F17) |
| 129 | + |
| 130 | +nous auto-captures ``reproducibility_metadata`` at INIT (target |
| 131 | +repo commit, hardware-config SHA, language versions, latency-config |
| 132 | +file snapshots). The first capture wins — re-running INIT on an |
| 133 | +existing campaign preserves the original commit, which is what |
| 134 | +reviewers want. |
| 135 | + |
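The "first capture wins" rule amounts to a write-once file. A minimal sketch, assuming the metadata is stored as a JSON file (the storage format and the `capture_once` helper are assumptions for illustration, not nous internals):

```python
import json
from pathlib import Path
from typing import Any, Dict, Mapping


def capture_once(path: Path, metadata: Mapping[str, Any]) -> Dict[str, Any]:
    """Write metadata only if no capture exists; return what is stored."""
    try:
        # Mode "x" refuses to open an existing file, so a later INIT
        # cannot overwrite the original capture.
        with path.open("x", encoding="utf-8") as fh:
            json.dump(dict(metadata), fh, indent=2, sort_keys=True)
    except FileExistsError:
        pass
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `capture_once` a second time with a different commit returns the first capture unchanged, which mirrors the behavior described above.
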
| 136 | +To produce a paper-grade artifact tarball: |
| 137 | + |
| 138 | +``` |
| 139 | +nous package <run_id> |
| 140 | +``` |
| 141 | + |
| 142 | +This bundles the work_dir, a ``reproduce.sh`` template, a |
| 143 | +``Dockerfile`` pinning captured language versions, and a README. |
| 144 | +Drop the tarball in your artifact-evaluation submission. |