Commit 8d14232

docs(napkin-math): prioritize compress variance next
1 parent d73ff9b commit 8d14232

1 file changed

Lines changed: 36 additions & 26 deletions

experiments/napkin_math/docs/20260520_plan.md

@@ -76,13 +76,11 @@ separated below.
 
 - **PR #737** (merged) — Phase 1 compress prompts + initial extract threshold-pairing rule + `OPTIMIZE_INSTRUCTIONS` discipline banner. Substantive content described below under "PR #737 detail" for continuity.
 - **PR #739** (merged) — Proposal 141 ("Source-Preservation Audit for the Napkin Math Pipeline") landed as design only. No code or prompt change. Implementation deferred.
-
-### Open for merge
-
-- **PR #740** — Phase 2 extract-prompt rules. Three commits land in `extract-parameters-from-digest` and `extract-parameters-from-full`:
+- **PR #740** (merged) — Phase 2 extract-prompt rules. Four commits landed in `extract-parameters-from-digest` and `extract-parameters-from-full`:
 - `4cda70ba` — Source-arithmetic preservation rule (Patterns 1/2/3: aggregate sum, burn rate × duration, explicit decomposition block) + threshold-pairing parity backfill into the full-extract skill.
 - `19f927b7` — Tightened aggregate-sum wording so independent caps/envelopes are NOT collapsed into derived sums; reconciled discipline-shared paragraph with the cap-pressure paragraph.
 - `8f94c8cd` — 20-word `source_text` cap reinforced with explicit truncation discipline (drop the consequence clause, end with ellipsis if mid-sentence).
+- `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
 All edits applied symmetrically to both extract skills. No corpus literals introduced.
 
 ### PR #737 detail (already on main)
@@ -225,7 +223,7 @@ too:
 | Phase | Skill / module | Status |
 |---|---|---|
 | 1 | `compress_report_section.py` | **DONE on main via PR #737** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner) |
-| 2 | `extract-parameters-from-{full,digest}` | **PARTIAL** — threshold-pairing on `from-digest` shipped in PR #737. Source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline are in PR #740, open for merge. After #740 merges, Phase 2's prompt-side work is complete for the original directives. |
+| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
 | 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
 | 4 | `generate-bounds` | not started |
 | 5 | `verify-bounds-citations` (new) | not started |
@@ -237,30 +235,42 @@ too:
 
 ### Next likely move
 
-After PR #740 merges, the next-most-load-bearing follow-ups, in
-preferred order:
-
-1. **Implement proposal 141** (`dropped_signals` schema in extract
+After PR #740, the next work should be ordered by what improves
+napkin_math output quality most directly, not by what is easiest to
+measure. Preferred order:
+
+1. **Compress-LLM variance handling.** Deterministic retry/merge or
+lower-temperature reruns for high-impact compress buckets should
+come next. The clearest driver is the paperclip OPC UA / latency
+tripwires that v49 surfaced and v50/v51 drop at the compress
+layer. This is upstream of extraction: if the digest does not
+carry the tripwire, no extract prompt can recover it. Proposal
+141 would classify this loss, but variance handling is the piece
+that can restore the missing source signal.
+2. **Implement proposal 141** (`dropped_signals` schema in extract
 prompts + `audit_source_preservation.py` deterministic script).
-The design is on main; without the implementation, v49 absences
-across the probe set cannot be mechanically classified, and the
-yellowstone-style cap-pressure tradeoffs have no place to record
-their structural rationale in the artifact itself.
-2. **Compress-LLM variance handling.** Deterministic retry/merge or
-lower-temperature reruns for high-impact buckets. The clearest
-driver: the paperclip OPC UA / latency tripwires that v49
-surfaced and v50/v51 drop at the compress layer.
-3. **Different-LLM behavioural validation** of the rules now in
-#740. A Self-Improve run with the default napkin_math LLM
-(Gemini Flash Lite) against the same v51 digests would close
-the same-LLM same-session confound.
+This should follow close behind variance handling. It is the
+right guardrail for v49/v51 absences and cap-pressure tradeoffs,
+but it is primarily a measurement and accountability layer: it
+classifies preserved / replaced / dropped signals and records
+rationale in the artifact. It does not by itself make the
+compressor less lossy.
+3. **Different-LLM behavioural validation** of the rules now on
+main. A Self-Improve run with the default napkin_math LLM
+(Gemini Flash Lite) against the same digests would close the
+same-LLM same-session confound. This should be treated as
+validation of prompt generality, not as the next quality fix.
 4. **Prompt-hygiene pass** for the remaining domain-specific
 examples (e.g. `european_prepper_active_buyers`) in either
-extract prompt. Small, scoped, can ride alongside any of the
-above.
-
-These are four separate PRs, not one. Bundling them re-creates the
-scope creep PR #740 was extracted from.
+extract prompt. This is worthwhile and small, but not
+load-bearing for the currently observed napkin_math failures.
+
+These are separate PRs. The next PR should be compress variance only:
+no corpus literals, no hand-patched outputs, rerun compress + extract
+through the skills, validate regenerated `parameters.json`, and
+compare against v49/v50/v51 honestly. Bundling the audit,
+behavioural validation, or prompt hygiene into that PR would obscure
+whether the upstream signal-loss fix actually worked.
 
 ## Per-theme mapping
 
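The plan names "deterministic retry/merge" for high-impact compress buckets but does not say how the merge would work. One possible reading, as a minimal sketch: run the compressor several times and union the items in selected buckets, so a tripwire that appears in any run survives into the digest. The function names, the digest shape (bucket name mapped to a list of strings), and the choice of which buckets count as high-impact are all assumptions for illustration, not the project's actual code.

```python
from typing import Callable, Dict, Iterable, List

# Assumption: a digest maps bucket names to lists of short text items.
Digest = Dict[str, List[str]]

# Assumption: the plan does not say which buckets are "high-impact".
HIGH_IMPACT_BUCKETS = ("gates_and_thresholds", "numeric_values")


def _normalize(item: str) -> str:
    """Collapse whitespace and case so near-identical items dedupe."""
    return " ".join(item.lower().split())


def merge_digests(
    digests: Iterable[Digest],
    buckets: Iterable[str] = HIGH_IMPACT_BUCKETS,
) -> Digest:
    """Union the chosen buckets across runs.

    Buckets not listed are copied from the first run unchanged.
    For duplicates, the first-seen wording is kept.
    """
    runs = list(digests)
    if not runs:
        return {}
    merged: Digest = {name: list(items) for name, items in runs[0].items()}
    for bucket in buckets:
        seen: Dict[str, str] = {}
        for run in runs:
            for item in run.get(bucket, []):
                seen.setdefault(_normalize(item), item)
        # Sorting by normalized key keeps the item order stable across runs.
        merged[bucket] = [seen[key] for key in sorted(seen)]
    return merged


def compress_with_reruns(
    compress_fn: Callable[[str], Digest],  # hypothetical compress call
    section: str,
    runs: int = 3,
) -> Digest:
    """Call the compressor `runs` times and merge the results."""
    return merge_digests(compress_fn(section) for _ in range(runs))
```

A union can only add items, which matches the stated goal of restoring a signal that some runs drop. It would also raise pressure on any downstream cap, so the bucket list should probably stay short.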
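Item 2 of the new ordering describes the Proposal 141 audit as a layer that "classifies preserved / replaced / dropped signals". The diff does not show the `dropped_signals` schema or the contents of `audit_source_preservation.py`, so the following is only a sketch of what such a classification could look like. The field names `signal`, `replaced_by`, and `rationale`, and the extra `unaccounted` label, are hypothetical.

```python
from typing import Dict, List


def classify_signals(
    source_signals: List[str],
    parameters: List[dict],
    dropped_signals: List[dict],
) -> Dict[str, str]:
    """Label each source signal by how the extract output accounts for it.

    Assumed shapes (hypothetical, not the real schema):
      parameters:      [{"source_text": "..."}, ...]
      dropped_signals: [{"signal": "...", "replaced_by": "...", "rationale": "..."}]
    """
    # All quoted source text from the extracted parameters, lowercased.
    cited = " ".join(p.get("source_text", "") for p in parameters).lower()
    declared = {d["signal"].lower(): d for d in dropped_signals}

    labels: Dict[str, str] = {}
    for signal in source_signals:
        key = signal.lower()
        if key in cited:
            labels[signal] = "preserved"
        elif key in declared:
            # A declared drop that names a substitute counts as replaced.
            entry = declared[key]
            labels[signal] = "replaced" if entry.get("replaced_by") else "dropped"
        else:
            # Neither cited nor declared: the loss happened silently.
            labels[signal] = "unaccounted"
    return labels
```

As the plan notes, a check like this only measures loss. A signal missing from the digest stays missing until the compress step itself carries it through.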
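Commit `8f94c8cd` describes a 20-word `source_text` cap with truncation discipline: drop the consequence clause, and end with an ellipsis if the cut falls mid-sentence. That rule lives in the prompt and is applied by the LLM. A deterministic check on the output could look roughly like the sketch below. The helper names are hypothetical, the three-dot ASCII ellipsis is an assumption, and dropping a consequence clause is a semantic choice that this code does not attempt.

```python
MAX_WORDS = 20  # cap stated in the commit message


def within_cap(text: str, max_words: int = MAX_WORDS) -> bool:
    """True if the text has at most `max_words` whitespace-separated words."""
    return len(text.split()) <= max_words


def truncate_source_text(text: str, max_words: int = MAX_WORDS) -> str:
    """Clip to the word cap, marking a mid-sentence cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    # Drop trailing clause punctuation left dangling by the cut.
    clipped = " ".join(words[:max_words]).rstrip(",;:")
    if clipped.endswith((".", "!", "?")):
        return clipped  # the cut landed on a sentence boundary
    return clipped + "..."  # assumption: ASCII ellipsis
```

A validator could call `within_cap` on every `source_text` in a regenerated `parameters.json` and report violations without rewriting them, which keeps outputs free of hand patches.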