Commit b1c6716

neoneye and claude committed
docs(napkin-math): record PR PlanExeOrg#746 (Phase 3) and PR PlanExeOrg#747 (Phase 4 runtime) in 20260520 plan
Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.

Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.

Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 372ddc3 commit b1c6716

1 file changed

Lines changed: 23 additions & 12 deletions

experiments/napkin_math/docs/20260520_plan.md

@@ -83,7 +83,9 @@ separated below.
- `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
All edits applied symmetrically to both extract skills. No corpus literals introduced.
- **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
-- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #744** (merged) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #746** (merged) — Phase 3 validate-parameters. Two new deterministic structural checks added to `validate_parameters.py`. `aggregate_not_bounded` (R1.1): when an entry's `formula_hint` is a pure sum of named identifiers and its `output_name` also appears in `missing_values_to_estimate`, the aggregate is sampled independently of its constituents (a single Monte Carlo trial can pair sub-component p95s with a total p05). Detection is syntactic — RHS contains `+`, no other operators, every operand is a snake_case identifier. `requirement_has_margin` (R2.5): when a `key_value`'s id ends in `_required`, at least one calculation must reference it via formula RHS, contain a subtraction or ratio operator, AND emit an `output_name` with a positive-pass margin suffix (`_margin`/`_surplus`/`_buffer`/`_coverage`). All three properties required so a bare reference inside a sum (`combined = actual + required`) no longer satisfies the rule. 19 unit tests, 6 v51 plans still validate clean. Out of scope: the `sampling_discipline` enum expansion (lives in `run_monte_carlo.py`, deferred to Phase 4 to avoid a silent-shim antipattern).
+- **PR #747** (merged) — Phase 4 runtime + schema readiness. Code-side half of Phase 4: (1) `strip_threshold_bounds` extended with a `calculation-output` reason — variables that are the declared `output_name` of any calculation get stripped from bounds (R1.1 backstop); (2) `lognormal` and `pert` reserved in `VALID_DISCIPLINES`, `sample_one` raises `NotImplementedError` loudly when sampling is attempted (no silent fall-back to triangular); (3) the optional `correlations` top-level key reserved and preserved through the stripper; (4) the warning text after `strip_threshold_bounds` is now reason-branched (calculation-output strips say "simulation will compute it from calculations.py", suffix/formula-side strips keep the original threshold wording). 73 unit tests, 9/9 smoke checks, 0 false positives on the v48 corpus. The LLM-rule changes (base-anchoring conditional, citation-context-leak self-audit examples, plan_type-driven lognormal default, detailed correlations selection rules) ship in the Phase 4 follow-up.

### PR #737 detail (already on main)

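Illustrative sketch (not part of the commit): the PR #746 entry describes `aggregate_not_bounded` as a purely syntactic check, so it could look roughly like the code below. The helper name `is_pure_sum`, the dict-shaped calculation entries, and the treatment of `missing_values_to_estimate` as a plain list of names are assumptions made for illustration; the real `validate_parameters.py` may be structured differently.

```python
import re

# Assumption: snake_case identifiers start with a lowercase letter.
_SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def is_pure_sum(formula_hint: str) -> bool:
    """True when the RHS is a sum of two or more snake_case identifiers."""
    rhs = formula_hint.split("=", 1)[-1]
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator (-, *, /) leaves an operand that fails the pattern.
    return len(operands) >= 2 and all(
        _SNAKE_CASE.match(op) is not None for op in operands
    )


def aggregate_not_bounded(calculations, missing_values_to_estimate):
    """Return output names that are pure sums yet are also estimated directly."""
    to_estimate = set(missing_values_to_estimate)
    return [
        calc["output_name"]
        for calc in calculations
        if is_pure_sum(calc["formula_hint"])
        and calc["output_name"] in to_estimate
    ]


calculations = [
    {"output_name": "total_cost",
     "formula_hint": "total_cost = labor_cost + material_cost"},
    {"output_name": "net_margin",
     "formula_hint": "net_margin = revenue - total_cost"},
]
flagged = aggregate_not_bounded(calculations, ["total_cost", "net_margin"])
```

Here `flagged` contains only `total_cost`: it is a pure sum and is also listed for estimation, so it would be sampled independently of `labor_cost` and `material_cost`. `net_margin` is skipped because its formula uses subtraction.
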
@@ -229,10 +231,10 @@ too:

| Phase | Skill / module | Status |
|---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer). |
| 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
-| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
-| 4 | `generate-bounds` | not started |
+| 3 | `validate-parameters` | **DONE on main via PR #746** (R1.1 `aggregate_not_bounded` and R2.5 `requirement_has_margin` structural checks). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
+| 4 | `generate-bounds` | **Code-side runtime DONE on main via PR #747** (R1.1 calculation-output strip extension; R1.4 / R2.2 `lognormal` and `pert` reserved in `VALID_DISCIPLINES` with sample-time `NotImplementedError`; R1.3 `correlations` top-level key reserved and preserved). **Prompt-side LLM-rule changes (R1.2 base-anchoring conditional, R1.5 self-audit citation examples, R2.2 plan_type lognormal default, R1.3 detailed correlations rules) remain a follow-up.** |
| 5 | `verify-bounds-citations` (new) | not started |
| 6 | `generate-calculations` | no change required per the original plan |
| 7 | `run-scenarios` | not started |
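
Illustrative sketch (not part of the commit): the Phase 4 row describes reserving `lognormal` and `pert` in `VALID_DISCIPLINES` while refusing to sample them. A minimal version of that reserve-but-refuse pattern is shown below. Only `VALID_DISCIPLINES`, `sample_one`, the two reserved names, and the `NotImplementedError` behaviour come from the plan text; the `IMPLEMENTED` and `RESERVED` sets and the `sample_one` signature are assumptions.

```python
import random

# Assumption: triangular is the only implemented discipline in this sketch.
IMPLEMENTED = {"triangular"}
RESERVED = {"lognormal", "pert"}  # accepted by the schema, not yet sampled
VALID_DISCIPLINES = IMPLEMENTED | RESERVED


def sample_one(discipline, low, base, high, rng):
    """Draw one value, failing loudly for reserved disciplines."""
    if discipline not in VALID_DISCIPLINES:
        raise ValueError(f"unknown sampling discipline: {discipline!r}")
    if discipline in RESERVED:
        # No silent fall-back to triangular: the caller must see the gap.
        raise NotImplementedError(
            f"{discipline!r} is reserved but sampling is not implemented"
        )
    return rng.triangular(low, high, base)


rng = random.Random(0)
value = sample_one("triangular", 1.0, 2.0, 4.0, rng)

try:
    sample_one("lognormal", 1.0, 2.0, 4.0, rng)
    reserved_raised = False
except NotImplementedError:
    reserved_raised = True
```

Under these assumptions, schema validation can accept a plan that declares `lognormal` today, while any attempt to sample it stops the run instead of silently producing triangular draws.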
@@ -242,31 +244,40 @@ too:

### Next likely move

-After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
-tolerant quote verification), the remaining work is ordered by what
-improves napkin_math output quality most directly:
-
-1. **Bucket-categorisation discipline in compress.** The residual
+After PR #746 (validate-parameters structural checks) and PR #747
+(generate-bounds runtime + schema readiness), the remaining work is
+ordered by what improves napkin_math output quality most directly:
+
+1. **Phase 4 prompt-side follow-up.** With the bounds runtime now
+accepting `lognormal`/`pert` and stripping calculation outputs,
+the next move is the LLM-rule layer: base-anchoring conditional
+rewrite (R1.2 — source: assumption on commitment-default,
+source: data on a named anchor), self-audit examples for
+citation context-leak (R1.5), `plan_type`-driven lognormal
+default for megaproject CAPEX (R2.2), and the detailed
+correlations-emission rules (R1.3). Same-LLM same-session
+posture: regression check, not improvement claim.
+2. **Bucket-categorisation discipline in compress.** The residual
public-output miss in paperclip v53c is the LLM filing a
`$X exceeds threshold` tripwire under `risks_and_shocks` instead
of `gates_and_thresholds`. The bucket-prompt for
`gates_and_thresholds` could be tightened to claim any
"If <metric> <comparator> <numeric threshold>, then ..." sentence,
even when the source frames it as a downside risk. Verify across
the 6-plan probe set; do not overfit to the paperclip OPC UA case.
-2. **Implement proposal 141** (`dropped_signals` schema in extract
+3. **Implement proposal 141** (`dropped_signals` schema in extract
prompts + `audit_source_preservation.py` deterministic script).
This is the right guardrail for v49/v51 absences and
cap-pressure tradeoffs. Now that the upstream variance fixes
landed (#743, #744), the audit's classification of preserved /
replaced / dropped signals will be measuring against a less
leaky pipeline.
-3. **Different-LLM behavioural validation** of the rules now on
+4. **Different-LLM behavioural validation** of the rules now on
main. A Self-Improve run with the default napkin_math LLM
(Gemini Flash Lite) against the same digests would close the
same-LLM same-session confound. This should be treated as
validation of prompt generality, not as the next quality fix.
-4. **Prompt-hygiene pass** for the remaining domain-specific
+5. **Prompt-hygiene pass** for the remaining domain-specific
examples (e.g. `european_prepper_active_buyers`) in either
extract prompt. This is worthwhile and small, but not
load-bearing for the currently observed napkin_math failures.
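
Illustrative sketch (not part of the commit): the paraphrase-tolerant verification from PR #744, which this diff now records as merged, is described as a substring fast path plus a token-overlap fallback with a min-3-token gate. A minimal version under stated assumptions follows. The tokeniser (lowercased alphanumeric runs), the `_tokens` helper, and the `MIN_FALLBACK_TOKENS` constant name are hypothetical; only the function name `quote_is_in_source` and the overall behaviour come from the plan text.

```python
import re

MIN_FALLBACK_TOKENS = 3  # the "min-3-token gate" named in the PR entry


def _tokens(text):
    # Assumption: tokens are lowercased runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def quote_is_in_source(quote, source):
    """Verify a quote by exact substring, else by full token overlap."""
    if quote in source:
        return True  # fast path: verbatim quote
    quote_tokens = _tokens(quote)
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False  # too short for the fallback to be meaningful
    source_tokens = set(_tokens(source))
    # 100% threshold: every quote token must appear somewhere in the source.
    return all(token in source_tokens for token in quote_tokens)


source = "The OPC UA integration bid must not exceed $75k before the pilot."
verbatim = quote_is_in_source("OPC UA", source)
paraphrase = quote_is_in_source("OPC UA bid exceed $75k", source)
wrong_number = quote_is_in_source("bid exceed $90k", source)
too_short = quote_is_in_source("pilot bid", source)
```

With this tokeniser the reordered paraphrase verifies, the quote with a number absent from the source does not, and a two-token non-verbatim quote is rejected by the gate.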
