[F14] Document the rehearsal-as-scientific-instrument positive pattern in methodology docs #259

Description

@sriumcp

Problem

This is a positive observation captured in the friction report, not a bug. Filing it as an issue so the lesson isn't lost.

Background

In paper-memorytime-mirage iter-1, the rehearsal_subset (h-main arm, seed 42, both schedulers) ran at the campaign's locked parameters. Both Token-WFQ and KV-time-greedy produced memorytime_share_ratio ≈ 1.06, far below the predicted 3.0×. Rather than report a null finding, the agent ran a diagnostic probe at D=1, which produced ρ_mt ≈ 4.378 under WFQ. From the contrast between the two runs, it correctly diagnosed two campaign-author errors:

  1. D=8 puts the system in a decode-dominated regime where memory-time ∝ P·D, and equal-mean P_A=P_B masks the variance signal. Recommended: D=1.
  2. K=1M blocks makes the bucket inoperative (ω·K = 450K vs ~152 actual occupancy). Recommended: K ≤ 1000.
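The second diagnosis is plain arithmetic, and a minimal sketch makes it concrete. It assumes ω = 0.45 (inferred from ω·K = 450K at K = 1M) and treats ω·K as the occupancy level at which the bucket starts to matter. The function name `bucket_headroom` is hypothetical and not part of nous.

```python
def bucket_headroom(omega: float, k_blocks: int, observed_occupancy: float) -> float:
    """Return how many times larger omega * K is than the observed occupancy."""
    if observed_occupancy <= 0:
        raise ValueError("observed_occupancy must be positive")
    return omega * k_blocks / observed_occupancy


OMEGA = 0.45      # assumption: inferred from omega * K = 450K at K = 1M
OBSERVED = 152    # approximate occupancy (blocks) reported in iter-1

locked = bucket_headroom(OMEGA, 1_000_000, OBSERVED)   # locked campaign value
proposed = bucket_headroom(OMEGA, 1_000, OBSERVED)     # upper end of the recommended K

print(f"K=1M headroom:   {locked:,.0f}x")
print(f"K=1000 headroom: {proposed:.2f}x")
```

At K = 1M the level ω·K sits thousands of times above observed occupancy, so the bucket cannot plausibly engage in this workload. At K = 1000 it drops to the same order of magnitude as occupancy.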

The findings.json discrepancy_analysis was a clean post-mortem. The agent confirmed apparatus correctness (zero conservation violations, WFQ counter balance ratio 1.003) before declaring REFUTED with diagnostic_note recommending specific parameter fixes for iter-2.
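The ordering described above (confirm the apparatus first, then declare a verdict) can be sketched as a small gate. This is only an illustration: the class, the function names, the tolerances, and the `APPARATUS_SUSPECT` and `CONFIRMED` labels are hypothetical. Only `REFUTED` and the two check values come from the iter-1 report.

```python
from dataclasses import dataclass


@dataclass
class ApparatusChecks:
    conservation_violations: int
    wfq_counter_balance_ratio: float


def apparatus_ok(checks: ApparatusChecks, balance_tol: float = 0.01) -> bool:
    """True when the simulator's own invariants hold (balance_tol is a hypothetical tolerance)."""
    return (
        checks.conservation_violations == 0
        and abs(checks.wfq_counter_balance_ratio - 1.0) <= balance_tol
    )


def verdict(observed: float, predicted: float, checks: ApparatusChecks,
            rel_tol: float = 0.25) -> str:
    """Refuse to judge the hypothesis until the apparatus checks pass."""
    if not apparatus_ok(checks):
        return "APPARATUS_SUSPECT"   # hypothetical label: fix the harness first
    if abs(observed - predicted) <= rel_tol * predicted:
        return "CONFIRMED"           # hypothetical label
    return "REFUTED"


# Values reported for iter-1: zero violations, balance ratio 1.003, 1.06 observed vs 3.0 predicted.
iter1 = ApparatusChecks(conservation_violations=0, wfq_counter_balance_ratio=1.003)
print(verdict(observed=1.06, predicted=3.0, checks=iter1))
```

Putting the apparatus check first is what lets a REFUTED verdict point at the workload parameters rather than at the simulator.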

Why this matters

This is the affirmative case for the rehearsal mechanism. The campaign author made two non-trivial workload-design errors that pre-run review did not catch. Iter-1 surfaced both with diagnostic precision, suggested fixes, and confirmed that the underlying mechanism is real (4.38× mirage at D=1). Without rehearsal, the full-scale run would have gone ahead with the broken parameters and produced null results.

Desired behavior

Capture this lesson in nous's documentation, in two places:

  1. Methodology docs (the page that explains experiment_spec.rehearsal_subset and the iter-1-as-rehearsal pattern): add a worked example illustrating the affirmative case. Show how a diagnostic-mode rehearsal can both (a) refute the campaign author's stated parameters and (b) recommend specific fixes, all before escalating to full-scale iter-2.

  2. Campaign-authoring guide: add a "unit-check the closed-form prediction against your locked parameters" step before locking. In the paper-memorytime-mirage case, evaluating C_KV(P=1024, D=8) / C_KV(P=mixture, D=8) under realistic π/δ would have shown ratio ≈ 1.06 (decode dominates), revealing the D=8 error pre-run. This step would have eliminated one of the two errors before iter-1 ran.
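One possible shape for the pre-lock unit check is sketched below. Everything in it is an assumption made for illustration: `placeholder_c_kv` is a hypothetical stand-in for the campaign's closed form (not the real C_KV), the π and δ values and the prompt-length mixture are invented, and the ratio is oriented as mixture over fixed so that a larger value means a stronger variance signal. The numbers are not expected to reproduce the 1.06 or 4.38 figures above.

```python
from typing import Callable, Sequence, Tuple

Mixture = Sequence[Tuple[float, float]]  # (prompt_length, probability) pairs


def placeholder_c_kv(p: float, d: float, pi: float, delta: float) -> float:
    """HYPOTHETICAL memory-time cost: KV occupancy integrated over time.

    Prefill grows the cache from 0 to p tokens at pi time units per token
    (area pi * p^2 / 2); decode holds roughly p + d/2 tokens for d steps of delta.
    """
    return pi * p * p / 2 + delta * d * (p + d / 2)


def expected_cost(mix: Mixture, d: float, cost: Callable[[float, float], float]) -> float:
    """Probability-weighted cost over a prompt-length mixture."""
    if abs(sum(w for _, w in mix) - 1.0) > 1e-9:
        raise ValueError("mixture probabilities must sum to 1")
    return sum(w * cost(p, d) for p, w in mix)


def pre_lock_ratio(mix_a: Mixture, mix_b: Mixture, d: float,
                   cost: Callable[[float, float], float]) -> float:
    return expected_cost(mix_a, d, cost) / expected_cost(mix_b, d, cost)


def check_before_lock(ratio: float, predicted: float, rel_tol: float = 0.25) -> bool:
    """True only if the closed form, at the locked parameters, supports the prediction."""
    return abs(ratio - predicted) <= rel_tol * predicted


PI, DELTA = 1.0, 500.0                   # hypothetical per-token prefill and decode times
FIXED = [(1024, 1.0)]                    # constant prompt length
MIXED = [(64, 0.5), (1984, 0.5)]         # hypothetical mixture with the same mean (1024)


def cost(p: float, d: float) -> float:
    return placeholder_c_kv(p, d, PI, DELTA)


ratio_d8 = pre_lock_ratio(MIXED, FIXED, 8, cost)
ratio_d1 = pre_lock_ratio(MIXED, FIXED, 1, cost)
print(f"D=8 ratio: {ratio_d8:.3f}, supports 3.0x prediction: {check_before_lock(ratio_d8, 3.0)}")
print(f"D=1 ratio: {ratio_d1:.3f}")
```

Under these placeholder numbers the D=8 ratio stays close to 1 because the decode term is linear in P, so equal means cancel. The check fails before any run is scheduled, which is the point of the step.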

Suggested implementation sketch

  1. Add a "Rehearsal as scientific instrument" section to the methodology docs with the paper-memorytime-mirage iter-1 worked example (the D=8 → D=1 and K=1M → K=1000 diagnoses).
  2. Add a "Pre-lock unit check" step to the campaign-authoring guide.
  3. Cross-link from the rehearsal_subset schema doc to the worked example.

Acceptance criteria

  • Methodology docs include a "rehearsal as scientific instrument" worked example.
  • Campaign-authoring guide includes a pre-lock unit-check step.
  • The friction report F14 row in the tracking issue is checked off.

Severity

N/A — positive case, recorded for completeness. Documentation-only.

Source

friction-report.md F14, paper-memorytime-mirage campaign (2026-05).


Part of friction-report tracking issue #245.

    Labels

    documentation (Improvements or additions to documentation)
    friction-report (From external campaign-author friction reports)
