Marks PR PlanExeOrg#744 as merged (previously open), and adds PR PlanExeOrg#746 (Phase 3 validate-parameters — aggregate_not_bounded + requirement_has_margin) and PR PlanExeOrg#747 (Phase 4 runtime + schema readiness — calculation-output strip, lognormal/pert reserved, correlations key reserved, reason-branched warning text) to the landed-on-main section.
Phase 1 status row now references all three compress PRs (PlanExeOrg#737, PlanExeOrg#743, PlanExeOrg#744). Phase 3 row marks DONE via PR PlanExeOrg#746 with a note that the sampling_discipline enum bullet was routed to Phase 4. Phase 4 row marks the code-side DONE via PR PlanExeOrg#747 and lists the deferred prompt-side LLM-rule changes.
Next-likely-move list re-ordered: the Phase 4 prompt-side follow-up takes item 1 (was deferred from the previous update). Bucket-categorisation discipline, proposal 141 implementation, different-LLM validation, and prompt hygiene shift down to items 2-5.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
experiments/napkin_math/docs/20260520_plan.md
23 additions & 12 deletions
@@ -83,7 +83,9 @@ separated below.
 - `f9d90ebb` — Updated this plan-status section for PR #740's narrow scope and verification limits.
 All edits applied symmetrically to both extract skills. No corpus literals introduced.
 - **PR #743** (merged) — Compress emission-layer second pass. `compress_report_section.py` now makes a second LLM call per saturated bucket with the first-pass items as context, asking only for items the first pass missed. `merge_second_pass_items` deduplicates by normalised `source_quote`. Honest framing: this closes the *emission* side of the run-to-run variance problem (when a tripwire is skipped by the first pass, the second pass often catches it) but does not close the *ranking* side — items that emit with `quote_verified=False` can still be outranked at the deterministic top-N filter.
-- **PR #744** (open, CI green, awaiting merge) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #744** (merged) — Compress ranking-layer paraphrase tolerance. `quote_is_in_source` keeps the substring fast path and adds a token-overlap fallback that requires every quote token to appear in the source (min-3-token gate). Closes the case where an LLM paraphrase (reordered noun phrase, dropped intermediate words) flips `quote_verified` to False even though every content token came from the source. Empirical posture: 165 fallback-only verifies across 1206 `qv=True` items (13.7%), 30-sample audit all legitimate paraphrases. Threshold-tightening cost (90% → 100%) is 0 lost `qv=True` items on observed data. Paperclip 3× — 2/3 runs now have `$75k` OPC UA bid in public top-6 (vs 1/3 before); 3/3 verified-when-emitted. Out of scope: bucket-categorisation variance (v53c places the bid in `risks_and_shocks` rather than `gates_and_thresholds`) and the remaining emission-layer miss.
+- **PR #746** (merged) — Phase 3 validate-parameters. Two new deterministic structural checks added to `validate_parameters.py`. `aggregate_not_bounded` (R1.1): when an entry's `formula_hint` is a pure sum of named identifiers and its `output_name` also appears in `missing_values_to_estimate`, the aggregate is sampled independently of its constituents (a single Monte Carlo trial can pair sub-component p95s with a total p05). Detection is syntactic — RHS contains `+`, no other operators, every operand is a snake_case identifier. `requirement_has_margin` (R2.5): when a `key_value`'s id ends in `_required`, at least one calculation must reference it via formula RHS, contain a subtraction or ratio operator, AND emit an `output_name` with a positive-pass margin suffix (`_margin`/`_surplus`/`_buffer`/`_coverage`). All three properties required so a bare reference inside a sum (`combined = actual + required`) no longer satisfies the rule. 19 unit tests, 6 v51 plans still validate clean. Out of scope: the `sampling_discipline` enum expansion (lives in `run_monte_carlo.py`, deferred to Phase 4 to avoid a silent-shim antipattern).
+- **PR #747** (merged) — Phase 4 runtime + schema readiness. Code-side half of Phase 4: (1) `strip_threshold_bounds` extended with a `calculation-output` reason — variables that are the declared `output_name` of any calculation get stripped from bounds (R1.1 backstop); (2) `lognormal` and `pert` reserved in `VALID_DISCIPLINES`, `sample_one` raises `NotImplementedError` loudly when sampling is attempted (no silent fall-back to triangular); (3) the optional `correlations` top-level key reserved and preserved through the stripper; (4) the warning text after `strip_threshold_bounds` is now reason-branched (calculation-output strips say "simulation will compute it from calculations.py", suffix/formula-side strips keep the original threshold wording). 73 unit tests, 9/9 smoke checks, 0 false positives on the v48 corpus. The LLM-rule changes (base-anchoring conditional, citation-context-leak self-audit examples, plan_type-driven lognormal default, detailed correlations selection rules) ship in the Phase 4 follow-up.
 
 ### PR #737 detail (already on main)
 
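The paraphrase-tolerant verification described for PR #744 can be illustrated with a minimal sketch. This is not the repository's implementation: the tokenizer `_tokens`, the constant `MIN_FALLBACK_TOKENS`, and the empty-quote guard are assumptions introduced here for illustration. Only the overall shape (substring fast path, then an all-tokens-present fallback behind a min-3-token gate) comes from the description above.

```python
import re

# Assumption: mirrors the "min-3-token gate" mentioned in the PR description.
MIN_FALLBACK_TOKENS = 3


def _tokens(text: str) -> list[str]:
    """Hypothetical normalisation: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def quote_is_in_source(quote: str, source: str) -> bool:
    """Return True when the quote is verbatim in the source, or when every
    quote token appears somewhere in the source (order-insensitive)."""
    # Assumption: an empty quote is never treated as verified.
    if not quote.strip():
        return False
    # Fast path: verbatim substring match.
    if quote in source:
        return True
    quote_tokens = _tokens(quote)
    # Gate: very short quotes must match verbatim, otherwise a couple of
    # common words could verify almost anything.
    if len(quote_tokens) < MIN_FALLBACK_TOKENS:
        return False
    source_tokens = set(_tokens(source))
    # Fallback: 100% token overlap, so a reordered or shortened paraphrase
    # passes but a quote containing a foreign token does not.
    return all(tok in source_tokens for tok in quote_tokens)
```

With this sketch, a reordered noun phrase such as `"OPC UA bid $75k"` verifies against a source sentence containing those tokens in a different order, while a quote with a changed figure does not.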
@@ -229,10 +231,10 @@ too:
 
 | Phase | Skill / module | Status |
 |---|---|---|
-| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance). **PR #744 (open)** adds paraphrase-tolerant quote verification on the ranking layer. |
+| 1 | `compress_report_section.py` | **DONE on main via PR #737 + PR #743 + PR #744** (R2.3 numeric_values, R2.3 missing_data, R2.5 gates_and_thresholds, OPTIMIZE_INSTRUCTIONS banner; per-bucket emission-layer second pass for run-to-run variance; paraphrase-tolerant quote verification on the ranking layer). |
 | 2 | `extract-parameters-from-{full,digest}` | **DONE for prompt-side directives on main via PR #740** — threshold-pairing on `from-digest` shipped in PR #737; source-arithmetic preservation (Patterns 1/2/3 for R1.1, R2.3, R2.4), threshold-pairing parity into `from-full`, aggregate-sum tightening, and source_text truncation discipline shipped in PR #740. Behavioural validation on a different LLM remains a follow-up, not additional prompt-scope work. |
-| 3 | `validate-parameters` | not started for the no-dead-end / threshold-pair extensions in the plan. Note: `validate_parameters.py` itself exists and was used to validate v51. |
-| 4 | `generate-bounds` | not started |
+| 3 | `validate-parameters` | **DONE on main via PR #746** (R1.1 `aggregate_not_bounded` and R2.5 `requirement_has_margin` structural checks). The `sampling_discipline` enum expansion bullet that the original plan tucked here actually lives in `run_monte_carlo.py` and was routed to Phase 4 (PR #747) to avoid a silent shim. |
+| 4 | `generate-bounds` | **Code-side runtime DONE on main via PR #747** (R1.1 calculation-output strip extension; R1.4 / R2.2 `lognormal` and `pert` reserved in `VALID_DISCIPLINES` with sample-time `NotImplementedError`; R1.3 `correlations` top-level key reserved and preserved). **Prompt-side LLM-rule changes (R1.2 base-anchoring conditional, R1.5 self-audit citation examples, R2.2 plan_type lognormal default, R1.3 detailed correlations rules) remain a follow-up.** |
 | 5 | `verify-bounds-citations` (new) | not started |
 | 6 | `generate-calculations` | no change required per the original plan |
 | 7 | `run-scenarios` | not started |
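The two Phase 3 structural checks named in the table can be sketched as follows. This is a hedged illustration under stated assumptions, not the code in `validate_parameters.py`: the input shapes (a list of calculation dicts with `formula_hint` and `output_name`, a flat list of names for `missing_values_to_estimate`), the `name = expression` form of `formula_hint`, and the helpers `_rhs` and `is_pure_sum` are all hypothetical.

```python
import re

# Assumption: identifiers are lowercase snake_case, as the PR text suggests.
SNAKE = re.compile(r"[a-z_][a-z0-9_]*")
MARGIN_SUFFIXES = ("_margin", "_surplus", "_buffer", "_coverage")


def _rhs(formula_hint: str) -> str:
    """Hypothetical helper: text to the right of the first '='."""
    return formula_hint.split("=", 1)[-1].strip()


def is_pure_sum(formula_hint: str) -> bool:
    """True when the RHS is identifiers joined only by '+'."""
    rhs = _rhs(formula_hint)
    if "+" not in rhs:
        return False
    operands = [part.strip() for part in rhs.split("+")]
    # Any other operator, number, or parenthesis makes an operand fail the match.
    return all(SNAKE.fullmatch(op) for op in operands)


def aggregate_not_bounded(calculations, missing_values):
    """Return output names that are pure sums yet also listed for estimation."""
    missing = set(missing_values)
    return [
        c["output_name"]
        for c in calculations
        if is_pure_sum(c["formula_hint"]) and c["output_name"] in missing
    ]


def requirement_has_margin(required_id, calculations):
    """True when some calculation references the id on its RHS, uses a
    subtraction or ratio operator, and emits a margin-suffixed output."""
    pattern = re.compile(rf"\b{re.escape(required_id)}\b")
    for c in calculations:
        rhs = _rhs(c["formula_hint"])
        references = pattern.search(rhs) is not None
        has_operator = "-" in rhs or "/" in rhs
        has_suffix = c["output_name"].endswith(MARGIN_SUFFIXES)
        if references and has_operator and has_suffix:
            return True
    return False
```

Requiring all three properties together is what keeps a bare reference inside a sum, such as `combined = actual + required`, from satisfying the margin rule.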
@@ -242,31 +244,40 @@ too:
 
 ### Next likely move
 
-After PR #743 (emission-layer second pass) and PR #744 (paraphrase-
-tolerant quote verification), the remaining work is ordered by what
-improves napkin_math output quality most directly:
-
-1. **Bucket-categorisation discipline in compress.** The residual
+After PR #746 (validate-parameters structural checks) and PR #747
+(generate-bounds runtime + schema readiness), the remaining work is
+ordered by what improves napkin_math output quality most directly:
+
+1. **Phase 4 prompt-side follow-up.** With the bounds runtime now
+accepting `lognormal`/`pert` and stripping calculation outputs,
+the next move is the LLM-rule layer: base-anchoring conditional
+rewrite (R1.2 — source: assumption on commitment-default,
+source: data on a named anchor), self-audit examples for