Merge pull request #741 from PlanExeOrg/codex-napkin-math-priorities

neoneye · web-flow · commit 5206bf926299 · 2026-05-21T03:17:33.000+02:00
docs(napkin-math): prioritize compress variance follow-up
diff --git a/experiments/napkin_math/docs/20260520_plan.md b/experiments/napkin_math/docs/20260520_plan.md
@@ -76,13 +76,11 @@ separated below.
 
 - **PR #737** (merged) — Phase 1 compress prompts + initial extract threshold-pairing rule + `OPTIMIZE_INSTRUCTIONS` discipline banner. Substantive content described below under "PR #737 detail" for continuity.
 - **PR #739** (merged) — Proposal 141 ("Source-Preservation Audit for the Napkin Math Pipeline") landed as design only. No code or prompt change. Implementation deferred.
-
-### Open for merge
-
-- **PR #740** — Phase 2 extract-prompt rules. Three commits land in `extract-parameters-from-digest` and `extract-parameters-from-full`:
+- **PR #740** (merged) — Phase 2 extract-prompt rules. Four commits landed in `extract-parameters-from-digest` and `extract-parameters-from-full`:
   - `4cda70ba` — Source-arithmetic preservation rule (Patterns 1/2/3: aggregate sum, burn rate × duration, explicit decomposition block) + threshold-pairing parity backfill into the full-extract skill.
   - `19f927b7` — Tightened aggregate-sum wording so independent caps/envelopes are NOT collapsed into derived sums; reconciled discipline-shared paragraph with the cap-pressure paragraph.
   - `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
+  - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
   All edits applied symmetrically to both extract skills. No corpus literals introduced.
 
 ### PR #737 detail (already on main)
@@ -225,7 +223,7 @@ too:
 | Phase | Skill / module | Status |
 |---|---|---|
 | 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
-| 2 | `extract-parameters-from-{full,digest}` | **PARTIAL** — threshold-pairing on `from-digest` shipped in PR #737. Source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline are in PR #740, open for merge. After #740 merges, Phase 2's prompt-side work is complete for the original directives. |
+| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
 | 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
 | 4 | `generate-bounds` | not started |
 | 5 | `verify-bounds-citations` (new) | not started |
@@ -237,30 +235,42 @@ too:
 
 ### Next likely move
 
-After PR #740 merges, the next-most-load-bearing follow-ups, in
-preferred order:
-
-1. **Implement proposal 141** (`dropped_signals` schema in extract
+After PR #740, the next work should be ordered by what improves
+napkin_math output quality most directly, not by what is easiest to
+measure. Preferred order:
+
+1. **Compress-LLM variance handling.** Deterministic retry/merge or
+   lower-temperature reruns for high-impact compress buckets should
+   come next. The clearest driver is the paperclip OPC UA / latency
+   tripwires that v49 surfaced and v50/v51 drop at the compress
+   layer. This is upstream of extraction: if the digest does not
+   carry the tripwire, no extract prompt can recover it. Proposal
+   141 would classify this loss, but variance handling is the piece
+   that can restore the missing source signal.
+2. **Implement proposal 141** (`dropped_signals` schema in extract
    prompts + `audit_source_preservation.py` deterministic script).
-   The design is on main; without the implementation, v49 absences
-   across the probe set cannot be mechanically classified, and the
-   yellowstone-style cap-pressure tradeoffs have no place to record
-   their structural rationale in the artifact itself.
-2. **Compress-LLM variance handling.** Deterministic retry/merge or
-   lower-temperature reruns for high-impact buckets. The clearest
-   driver: the paperclip OPC UA / latency tripwires that v49
-   surfaced and v50/v51 drop at the compress layer.
-3. **Different-LLM behavioural validation** of the rules now in
-   #740. A Self-Improve run with the default napkin_math LLM
-   (Gemini Flash Lite) against the same v51 digests would close
-   the same-LLM same-session confound.
+   This should follow close behind variance handling. It is the
+   right guardrail for v49/v51 absences and cap-pressure tradeoffs,
+   but it is primarily a measurement and accountability layer: it
+   classifies preserved / replaced / dropped signals and records
+   rationale in the artifact. It does not by itself make the
+   compressor less lossy.
+3. **Different-LLM behavioural validation** of the rules now on
+   main. A Self-Improve run with the default napkin_math LLM
+   (Gemini Flash Lite) against the same digests would close the
+   same-LLM same-session confound. This should be treated as
+   validation of prompt generality, not as the next quality fix.
 4. **Prompt-hygiene pass** for the remaining domain-specific
    examples (e.g. `european_prepper_active_buyers`) in either
-   extract prompt. Small, scoped, can ride alongside any of the
-   above.
-
-These are four separate PRs, not one. Bundling them re-creates the
-scope creep PR #740 was extracted from.
+   extract prompt. This is worthwhile and small, but not
+   load-bearing for the currently observed napkin_math failures.
+
+These are separate PRs. The next PR should be compress variance only:
+no corpus literals, no hand-patched outputs, rerun compress + extract
+through the skills, validate regenerated `parameters.json`, and
+compare against v49/v50/v51 honestly. Bundling the audit,
+behavioural validation, or prompt hygiene into that PR would obscure
+whether the upstream signal-loss fix actually worked.
 
 ## Per-theme mapping