spec 015: pipeline convergence protocol (closes #239) #250
…+ review-model overhaul (#239) Comprehensive Spec Kit specification for umbrella issue #239, grounded in the 2026-05-27 design doc SSoT and a code-verified audit. Covers: the inode-table summarize/desummarize primitive (no silent loss of check-critical elements), the generic identify->revise->re-review convergence engine + adaptive kickback, removal of the point system for unanimous-panel acceptance + advisory triage, per-step ReviewSpec adapters across the whole research + paper track, reviewer calibration (9 domains, held-out generality), end-to-end traversal proof, living-document discussion board, and all 10 audit bug fixes + arXiv resilience. Three scope decisions resolved with maintainer up front (living-doc=full; point cutover=migrate-forward; overflow floor=inode-table pointers). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clarifications integrated into the spec (Clarifications + FRs/SCs/scenarios/assumptions):
- Publish target: real public Zenodo/GitHub/site, but a MANDATORY manual maintainer sign-off before every DOI mint for the duration of this spec (new FR-054, SC-014; FR-036/FR-048 updated).
- E2E coverage: all 9 domains traverse end-to-end to posted (FR-045, SC-007).
- Calibration: differential clean-vs-injected test + manual adjudication + adaptive sensitivity tuning (no fixed over-flag % / K) (FR-042, FR-044, SC-005).
- Kickback budget: NO global cap; monotonic-improvement-until-convergence; per-step 3-round cap retained (FR-017, edge case, assumptions).
- Cutover: no posted/done projects exist -> migration applies to in-flight only (FR-025).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plan.md (Constitution Check: points-removal + no-global-cap tracked as authorized deviations -> constitution amendment task), research.md (10 grounded technical decisions incl. inode-table summarizer format, engine-as-callables, adaptive kickback, manual DOI sign-off, differential calibration), data-model.md (pydantic entities), quickstart.md, and 6 contracts (summarize-api, convergence-engine, reviewspec-registry, review-intake-triage, kickback-record, publisher-signoff). CLAUDE.md SPECKIT ref -> 015. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Organized by user story (US1-US8) with Setup/Foundational/Polish. TDD + real-call + manual-QC tasks included per spec. Dependency chain: summarizer first -> engine -> bug fixes -> review model -> per-step panels -> calibration (9 domains) -> e2e to posted (9 domains, manual DOI sign-off) -> living-doc -> polish. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closed 4 coverage/underspecification findings from /speckit-analyze (0 remain):
- C1 (HIGH): FR-006 authoring-side overflow routing + paper twins -> T054-T057
- C2 (MED): FR-026 repository_hygiene line-count/gitignore -> T043
- U1 (MED): FR-053 convergence principle encoding -> T007
- U2 (LOW): FR-017 ProgressRecord emission -> T026
Constitution point-conflict (CRITICAL) resolved by explicit amendment task T007.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- T001: new package dirs (convergence/, calibration/, agents/prompts/panels/)
- T002: STATUS.md living progress doc (FR-052)
- T003: Stage.AWAITING_PUBLICATION_SIGNOFF; config CONVERGENCE_MAX_ROUNDS=3 + CONVERGENCE_PER_ROUND_BUDGET_SECONDS=600.
Imports verified.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New SSoT primitive src/llmxive/tools/summarize.py: summarize()/desummarize() with on-disk inode-table pointer hierarchy. Deterministic no-loss guarantee (URLs/DOIs/arXiv/citations/FR-SC-task ids/numbers preserved verbatim; full content on disk, recursively paged in). 12 tests pass (7 edge cases + core no-loss + manifest contract + no-dangling-pointer); ruff + mypy clean. Remaining for US1: T009 real-call fidelity, T017 re-point paper_reviewer (SSoT), T018 real-call verification. See STATUS.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
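The no-loss guarantee described above can be illustrated with a small check. This is only a sketch: the regex set and the helper names `critical_elements` and `lost_elements` are assumptions made for illustration, not the patterns or API of `tools/summarize.py`.

```python
import re

# Hypothetical patterns for "check-critical" elements. The real summarizer's
# pattern set is not shown in this PR; these are illustrative only.
CRITICAL_PATTERNS = [
    re.compile(r"https?://[^\s)>\]]+"),               # URLs
    re.compile(r"\b10\.\d{4,9}/[^\s)>\]]+"),          # DOIs
    re.compile(r"\barXiv:\d{4}\.\d{4,5}(?:v\d+)?"),   # arXiv identifiers
    re.compile(r"\b(?:FR|SC)-\d{3}\b|\bT\d{3}\b"),    # FR/SC/task ids
]


def critical_elements(text: str) -> set[str]:
    """Collect every critical token that appears in the text."""
    found: set[str] = set()
    for pattern in CRITICAL_PATTERNS:
        found.update(pattern.findall(text))
    return found


def lost_elements(original: str, recovered: str) -> set[str]:
    """Return critical tokens of the original that are absent verbatim
    from the recovered (desummarized) text."""
    return {item for item in critical_elements(original) if item not in recovered}
```

A fidelity test in this style would assert that `lost_elements(source, recovered)` is empty after a summarize/desummarize round trip.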
_build_corpus_with_summaries now delegates context reduction to tools/summarize.summarize() (inode-table, no silent truncation), preserving the 1-arg summarize_fn contract + _cached_summarize memoization. Supersedes the old truncate-with-notice fallback (Const. I SSoT). Updated the 2 coupled unit tests to the new behavior (full source recoverable via desummarize); _chunk_corpus + its 3 tests untouched. 24 paper_reviewer + 12 summarizer tests pass; mypy-clean for the changed function. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tests/real_call/test_summarize_fidelity.py: real qwen3.5-122b summarize_fn over an over-budget doc; desummarize recovers EVERY critical element verbatim (no loss through a real-LLM reduction). PASSED in 334s. US1 (summarizer) fully done & verified: 12 offline + 1 real-call, ruff clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion (#239) - T004/T005: convergence/types.py — Severity (ordered + legacy mapping) and the Concern/ConcernResponse/Verdict/ProgressRecord/ConvergenceResult/KickbackRecord/ TriageRecord pydantic models + Reviewer/Reviser Protocols + ReviewSpec dataclass. - T006: tests/contract/test_convergence_types.py (7 pass; ruff + mypy clean). - T007: constitution -> v1.1.0; added Principle VI (Convergent Review, NON-NEGOTIABLE), replaced the point-based Review-thresholds gate with unanimous-panel convergence + advisory triage, Sync Impact Report updated. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
convergence/engine.py run_convergence: identify->revise->re-review loop with honest converged reporting (FR-016), 3-round cap, self-review/producer exclusion + stale-never-passes (FR-018), per-round wall-clock budget (FR-013), and overflow inputs routed through tools/summarize (FR-006). convergence/kickback.py route_kickback (adaptive worst-severity->stage, full-provenance KickbackRecord) + progress_record (FR-017). 15 unit tests pass; ruff + mypy clean. US2 remaining (coupled to US4/US3): T021 real-project integration, T025 advancement.py _produced_by stub, T027 tasker Mode-A/B refactor into the engine. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
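As a rough illustration of the identify -> revise -> re-review loop with a round cap and honest convergence reporting, consider the sketch below. It is not the signature of `run_convergence`: the `Result` dataclass, the callable shapes, and the function name are assumptions, and the wall-clock budget, producer exclusion, and overflow routing are omitted.

```python
from dataclasses import dataclass
from typing import Callable

MAX_ROUNDS = 3  # mirrors CONVERGENCE_MAX_ROUNDS=3 from the config


@dataclass
class Result:
    converged: bool            # True only if the final review raised no concerns
    rounds_used: int
    open_concerns: list[str]


def run_convergence_sketch(
    artifact: str,
    reviewers: list[Callable[[str], list[str]]],
    revise: Callable[[str, list[str]], str],
    max_rounds: int = MAX_ROUNDS,
) -> tuple[str, Result]:
    # Identify: every reviewer inspects the current artifact.
    concerns = [c for review in reviewers for c in review(artifact)]
    rounds = 0
    while concerns and rounds < max_rounds:
        artifact = revise(artifact, concerns)   # revise against open concerns
        rounds += 1
        # Re-review: the whole panel looks at the revised artifact again.
        concerns = [c for review in reviewers for c in review(artifact)]
    # Honest reporting: hitting the cap with open concerns is NOT convergence.
    return artifact, Result(not concerns, rounds, concerns)
```

The point of the design is that the cap ends the loop but never upgrades the verdict: `converged` is derived from the final review, not from the round count.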
Addressed the tech debt I had flagged (per "fix issues as you notice them"):
- types-PyYAML dev dep -> yaml stubs resolve under `python -m mypy` (clears yaml
errors codebase-wide).
- ReviewRecord.score: invalid Literal[float] -> float + field_validator (PEP 586;
identical {0.0,0.5,1.0} constraint).
- paper_reviewer: list[dict]->list[dict[str,Any]]; text coerced to str.
- removed 2 unused PaperReviewerAgent imports in test.
- FIX: T003 added Stage.AWAITING_PUBLICATION_SIGNOFF but not the project-state
schema enum -> contract test failed; added it (single SSoT schema).
- FIX: T001 panels dir was under src/llmxive/agents/prompts/ but prompts live at
repo-root agents/prompts/ -> relocated; corrected 7 path refs in tasks.md.
Finding (STATUS.md): project does NOT gate on ruff/mypy (no config, no CI step;
gates = pytest + checks.*). ~273 legacy mypy errors are pre-existing, out of #239.
Focused regression: 92 passed (all contract + score/paper_reviewer/convergence).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239) New agents/prompts/implementer_research.md: instructs the research speckit implementer to emit the artifacts/verdict YAML the parser expects (write real runnable code/data, no stubs/diffs, fail-loud verdicts). implement_cmd.py now renders it instead of the paper-revision LaTeX implementer.md (which stays for the separate paper-revision agent). Also fixed 2 pre-existing ruff nits in implement_cmd.py (I001 import sort, F541) since I touched the file. tests/integration/test_audit_bugfixes.py verifies the fix (2 pass). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
theoremsearch.search() now retries transient failures (429/500/502/503/504 + RequestException/timeout) with exponential backoff (MAX_TRANSIENT_RETRIES=3), then degrades via TransientBackendError (the librarian wrapper already treats that as "optional source unavailable"). Non-transient 4xx are not retried. retry_backoff_base_seconds is injectable (tests pass 0). 4 unit tests; ruff+mypy clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
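The retry behavior can be sketched as follows. This is a minimal illustration, not `theoremsearch.search()` itself: the `fetch` callable, the injectable `sleep`, the use of built-in `ConnectionError`/`TimeoutError` in place of `RequestException`, and the reading of `MAX_TRANSIENT_RETRIES=3` as three retries after the first attempt are all assumptions.

```python
import time
from typing import Callable

TRANSIENT_STATUS = {429, 500, 502, 503, 504}
MAX_TRANSIENT_RETRIES = 3  # assumed: retries after the first attempt


class TransientBackendError(RuntimeError):
    """Raised once transient failures exhaust the retry budget."""


def search_with_retry(
    fetch: Callable[[], tuple[int, str]],
    retry_backoff_base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(MAX_TRANSIENT_RETRIES + 1):
        status, body = 0, ""
        try:
            status, body = fetch()
            transient = status in TRANSIENT_STATUS
        except (ConnectionError, TimeoutError):
            transient = True  # network-level failure: treat as transient
        if not transient:
            if status >= 400:
                # Non-transient errors are surfaced immediately, never retried.
                raise ValueError(f"non-transient HTTP error {status}")
            return body
        if attempt < MAX_TRANSIENT_RETRIES:
            # Exponential backoff: base, 2*base, 4*base, ...
            sleep(retry_backoff_base_seconds * (2 ** attempt))
    raise TransientBackendError("backend unavailable after retries")
```

Passing `retry_backoff_base_seconds=0` (as the tests in this commit do) or a fake `sleep` keeps unit tests fast and offline.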
#239) Full offline suite verified green: tests/contract + 599 tests/unit (7.45s) + real-call summarize_fidelity. Flagged pre-existing live-PDF test in tests/unit (not CI-gated, hangs offline) for separate gating. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…yze (#239) Discrepancy #4 fix: ANALYZE_SYSTEM_PROMPT_PATH was defined but unused (inline prompt hardcoded; paper reused research tasker.md). Now there are TWO real analyze prompts that ARE used via render_prompt: - agents/prompts/analyze.md (research): requirements_coverage / internal_consistency / testability / scope / constitution_alignment lenses (same vocabulary as the US4 Tasks panel). - agents/prompts/paper_analyze.md (paper): reader_scenario_coverage / claims_supported / required_sections_figures / scope_vs_research / internal_consistency / constitution_alignment. run_analyze() gains kind={"research","paper"} + constitution_text kwargs. paper_tasks_cmd passes kind="paper" + paper constitution; tasks_cmd passes research constitution (FR-030: constitution is a standard analyze input from `specified` onward). 6 audit-bugfix tests + 38 phase4 integration tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clarifier.attempts_so_far was hardcoded 0 (escalation unreachable) and paper_clarifier never branched on verdict=escalate AND silently substituted a "Resolved by default" stub on missing patches, a no-silent-shortcuts violation. Fixes:
- New shared _clarify_attempts.py: persists per-project attempt count under .specify/memory/clarifier_attempts.yaml; bump/read/reset + write_human_input_needed.
- Both clarifiers now read REAL attempts and pass them to the prompt.
- Both branch on verdict=escalate -> write human_input_needed.yaml + raise.
- Both escalate at TASKER_MAX_REVISION_ROUNDS (=5) -> write human_input_needed.yaml + raise.
- paper_clarifier no longer substitutes the silent "Resolved by default" stub (matches research clarifier's loud failure behavior).
- Also removed 2 pre-existing F841 dead locals in clarify_cmd._spec_path.
29 tests pass (audit + phase3 integration); ruff clean for touched files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239) paper_specifier.md advertised `code_summary` / `data_summary` inputs that the code never supplied (silent drift between prompt and reality). paper_specify_cmd now injects both blocks into the user message, reusing research_reviewer's _summarize_tree() as the SSoT tree-summary helper — Const. I (share, don't fork). The advertised inputs ARE now present, grounding the paper-spec generation in the project's actual code/ and data/ trees. 11 audit-bugfix tests pass; ruff clean for touched files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…R-054) (#239) Discrepancy #2 fix (FR-036): graph._decide_next_stage no longer shortcuts PAPER_ACCEPTED -> POSTED. It now routes paper_accepted -> AWAITING_PUBLICATION_SIGNOFF, then AWAITING_PUBLICATION_SIGNOFF -> POSTED ONLY when the maintainer sign-off record exists. The PaperPublisher itself enforces the same gate (defense-in-depth) — at PAPER_ACCEPTED or AWAITING_PUBLICATION_SIGNOFF with NO signoff record it SKIPs with a clear "awaiting manual maintainer DOI sign-off (FR-054)" reason. No Zenodo DOI is minted without recorded approval. New surface: - src/llmxive/speckit/_publication_signoff.py: read/write/has/clear_signoff persistence under <project>/.specify/memory/publication_signoff.yaml; FR-054 who/when/what record (kinds "initial" / "version"). - `llmxive project publish-approve <PROJ-ID> --who X --what Y [--kind initial|version]` CLI command writes the sign-off record. - 6 new audit-bugfix tests + 27 publisher/graph regression tests pass. Also fixed 38 pre-existing ruff issues in touched files (auto-fix). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
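The sign-off gate amounts to a two-step transition in which the second step is conditional on a persisted record. The sketch below assumes the enum string values and the helper names; only the stage names and the record path come from the commit message, and the real `graph._decide_next_stage` handles many more stages.

```python
from enum import Enum
from pathlib import Path


class Stage(str, Enum):
    # String values are assumed for illustration.
    PAPER_ACCEPTED = "paper_accepted"
    AWAITING_PUBLICATION_SIGNOFF = "awaiting_publication_signoff"
    POSTED = "posted"


SIGNOFF_RELATIVE_PATH = Path(".specify/memory/publication_signoff.yaml")


def has_signoff(project_dir: Path) -> bool:
    """True when a maintainer sign-off record exists for the project."""
    return (project_dir / SIGNOFF_RELATIVE_PATH).is_file()


def decide_next_stage(stage: Stage, project_dir: Path) -> Stage:
    if stage is Stage.PAPER_ACCEPTED:
        # Never jump straight to POSTED: always park at the sign-off gate.
        return Stage.AWAITING_PUBLICATION_SIGNOFF
    if stage is Stage.AWAITING_PUBLICATION_SIGNOFF and has_signoff(project_dir):
        return Stage.POSTED
    return stage  # no sign-off yet: stay put
```

Having the publisher repeat the `has_signoff` check before minting a DOI gives the defense-in-depth the commit describes, since a routing bug alone cannot then publish.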
Discrepancy #7 fix (FR-018): advancement._produced_by was a stub returning None. It now scans state/run-log/<YYYY-MM>/*.jsonl for the latest entry whose outputs list contains the artifact path and returns that entry's agent_name. Exact + suffix path matching tolerates relative-vs-absolute bookkeeping. A repo_root kwarg keeps the production call (no repo_root) working while making tests hermetic. Defensive: returns None on missing run-log instead of raising. T029: the audit-bugfix test file (now 18 tests) verifies T030/T031/T032/T033/T034/T035/T025 fixes. 38 tests pass (audit + advancement regression). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
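The run-log scan can be sketched like this. It is an approximation under stated assumptions: "latest" is taken to mean the last matching line when month directories and files are read in sorted order, and `repo_root` is required here, whereas the real `_produced_by` makes it optional.

```python
import json
from pathlib import Path
from typing import Optional


def _paths_match(logged: str, wanted: str) -> bool:
    """Exact match, or one path is a '/'-aligned suffix of the other."""
    return (
        logged == wanted
        or logged.endswith("/" + wanted)
        or wanted.endswith("/" + logged)
    )


def produced_by(artifact_path: str, repo_root: Path) -> Optional[str]:
    log_root = repo_root / "state" / "run-log"
    if not log_root.is_dir():
        return None  # defensive: a missing run-log is not an error
    latest: Optional[str] = None
    # Assumed ordering: YYYY-MM directory names sort chronologically.
    for log_file in sorted(log_root.glob("*/*.jsonl")):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            entry = json.loads(line)
            if any(_paths_match(out, artifact_path) for out in entry.get("outputs", [])):
                latest = entry.get("agent_name")
    return latest
```

Knowing the producing agent is what lets the engine exclude self-review: a reviewer whose name equals the returned value can be skipped for that artifact.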
…to US3 (#239) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New convergence/triage.py: stage-aware triage for submitted human + simulated-personality reviews. Three filters: quality (length + evidence-indicator regex sweep: FR/SC/T ids, citations, URLs, DOIs, quoted phrases, code fences, scientific topic vocab), safety + on-topic (rule-based stop-list + stage/lens vocabulary overlap), and aspect-mapping to LLM reviewer lenses (preserved but mapped_lenses=[] when no match -> routes to the step's generic reviewer per FR-022). Injectable judge_fn for the real-LLM path (US4 wiring); rule-based default keeps unit tests offline. tests/integration/test_triage.py: 8 tests covering quality pass/fail, safety exclusion, off-topic exclusion, lens mapping, unmapped-but-preserved, record provenance, and the judge_fn injection override. All pass; ruff clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
) Rewrote the user-facing status-model descriptions in README + web/index.html + docs/index.html (HTML mirror copy) to convergence semantics: identify -> revise -> re-review; unanimous panel acceptance within a 3-round cap; advisory triage for human + simulated-personality reviews; no accumulated points. Replaces 6 stale "points threshold" / "Human reviews count double" passages. status_reporter.py + repository_hygiene.py needed no change for the new status model — their FR-026 duties (projects.json regen, GitHub issue comment/close on POSTED, line-count delta, gitignore assertions) are not point-dependent and remain in force unchanged. The points_research_total / points_paper_total fields the web JS displays will be removed in a follow-up (part of T041 point-system removal). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239) Discrepancy #9 + Const. I cleanup: the accumulated review-point system is gone from the advancement decision path. Unanimous LLM-panel acceptance is now the sole gate everywhere (research + paper both).
advancement.py:
- Research-review gate no longer reads `accept_total` / `RESEARCH_ACCEPT_THRESHOLD`. It now uses `_all_specialists_accept(records, required)` with a defensive backstop (require ≥1 accept AND zero non-accept records when the registry isn't loaded), mirroring the paper-side default.
- Paper-review gate's `_award_review_points` call removed (the all-specialists-accept-most-recent check was already the real decision).
- `_award_review_points` definition DELETED (no remaining callers).
- `RESEARCH_ACCEPT_THRESHOLD` import dropped; replaced with an FR-019 comment.
config.py:
- `RESEARCH_ACCEPT_THRESHOLD` and `PAPER_ACCEPT_THRESHOLD` constants kept for back-compat with `web/about.html` mirror consumers, but VALUES set to 0.0 and no advancement code reads them.
T038 tests (`tests/integration/test_no_points.py`, 3 tests): grep guard + behavioral assertion that no-accept records cannot trip the gate.
T044: per clarify Q3 there are no posted/done projects to grandfather; the gate change applies on next tick automatically, so no data-migration logic is needed.
Broad regression: 784 passed, 1 skipped (was 781; three new T038 tests added).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
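A unanimous-acceptance gate with the described backstop might look like the sketch below. The record shape (dicts with `reviewer` and `verdict` keys, in chronological order) is a hypothetical stand-in for the project's review records, and the function is not the actual `_all_specialists_accept`.

```python
def all_specialists_accept(records: list[dict[str, str]], required: set[str]) -> bool:
    """Unanimous gate: every required specialist's most recent verdict is 'accept'."""
    latest: dict[str, str] = {}
    for record in records:  # assumed chronological, so later records win
        latest[record["reviewer"]] = record["verdict"]
    if required:
        return all(latest.get(name) == "accept" for name in required)
    # Backstop when the registry is not loaded:
    # at least one accept AND zero non-accept records.
    verdicts = [record["verdict"] for record in records]
    return bool(verdicts) and all(v == "accept" for v in verdicts)
```

Unlike a points threshold, a single outstanding non-accept from a required specialist blocks advancement no matter how many accepts have accumulated.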
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
src/llmxive/convergence/reviewspecs.py: reviewspec_for(stage) -> ReviewSpec | None. 9 stage entries (idea + 4 research + 4 paper) matching contracts/reviewspec-registry.md; EXEMPT_STAGES frozenset of 7 mechanical steps. Constitution input is True for every spec from `specified` onward (FR-030); idea-stage opts out (no constitution yet). Kickback routing per the contract's worst-severity -> prior-stage table. Stages whose panel prompts (T049-T053) or wiring (T054-T059) haven't landed yet get _TodoReviewer / _TodoReviser placeholders that conform to the Protocol but raise NotImplementedError with a clear pointer to the follow-up task -- fail-loud SSoT structure, no silent empty verdicts. 15 contract tests pass; ruff clean. Also marked T060 (constitution-as-analyze-input, done in T031) and T061 (publisher wired into graph, done in T035) as already complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
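The fail-loud placeholder pattern can be sketched as below. Field names, the two registry entries, and the exempt stage name are hypothetical; the real registry has 9 entries and a richer `ReviewSpec`.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


class Reviser(Protocol):
    def revise(self, artifacts: dict[str, str], concerns: list[str]) -> dict[str, str]: ...


class TodoReviser:
    """Protocol-conformant placeholder that refuses to produce silent output."""

    def __init__(self, follow_up_task: str) -> None:
        self.follow_up_task = follow_up_task

    def revise(self, artifacts: dict[str, str], concerns: list[str]) -> dict[str, str]:
        raise NotImplementedError(
            f"reviser not wired yet; see follow-up task {self.follow_up_task}"
        )


@dataclass(frozen=True)
class ReviewSpec:
    stage: str
    reviser: Reviser
    needs_constitution: bool


# Hypothetical entries for illustration only.
EXEMPT_STAGES = frozenset({"hypothetical_mechanical_step"})
_REGISTRY = {
    "idea": ReviewSpec("idea", TodoReviser("T054-T059"), needs_constitution=False),
    "clarified": ReviewSpec("clarified", TodoReviser("T054-T059"), needs_constitution=True),
}


def reviewspec_for(stage: str) -> Optional[ReviewSpec]:
    """Return the stage's ReviewSpec, or None for exempt/unknown stages."""
    if stage in EXEMPT_STAGES:
        return None
    return _REGISTRY.get(stage)
```

Because the placeholder raises rather than returning an empty verdict, an engine that reaches an unwired stage stops with a pointer to the missing task instead of reporting a false pass.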
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
US4 panel-prompt authoring: 27 lens prompts + 1 SSoT shared block + a contract
test that catches future registry/file-name drift.
agents/prompts/_shared/panel_review_block.md
- SSoT (Constitution Principle I) for the panel R1/R3 output contract.
Severity vocabulary matches the spec-015 Severity enum (trivial → fatal);
identify and re-review phases both defined.
agents/prompts/panels/ — 27 files total
T049: panel_idea_{rq_validity,novelty,feasibility,idea_quality}.md
T050: panel_spec_{requirements_coverage,internal_consistency,testability,scope}.md
T051: panel_plan_{methodology,spec_coverage,data_resources,consistency}.md
T052: panel_tasks_{coverage,ordering,executability,constraint_preservation}.md
T053: panel_paper_spec_* (4) + panel_paper_plan_* (3) + panel_paper_tasks_* (4)
Each per-lens file is thin: lens + scope ("what NOT to flag") + inputs
(constitution from `specified` onward per FR-030) + per-severity-class
guidance + reference to the SSoT block. T054-T059 wiring will concatenate
lens-prompt + SSoT-block at render time.
tests/contract/test_panel_prompts.py (16 tests)
- Every lens in the ReviewSpec registry resolves to a real prompt file.
- Every panel file references the SSoT block (Principle I drift guard).
- Every panel file has `## Lens` and `## Output format` sections.
- Reuse-stages (research_review/paper_review) map to existing specialist
files, with the _research/_paper suffix convention preserved.
- The SSoT block enumerates every Severity enum value + defines R1 and R3.
Tech debt fixed inline (surfaced by ruff+mypy installation in venv):
- reviewspecs.py: _todo_reviewers now returns list[Reviewer] (list is
invariant). Removed an unused `# type: ignore`.
- triage.py: JudgeFn return-type narrowed to dict[str, object]; the
mapped_lenses access narrowed with isinstance(list|tuple) at the
callsite — honest about the contract boundary rather than ignore.
Verification:
- ruff check src/llmxive/convergence + summarize.py: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (7 source files)
- pytest tests/contract: 43 passed
- pytest 4 conv-related unit files: 27 passed
- pytest 3 spec-015 integration files: 29 passed
- llmxive.checks.prompts: OK (53 agents)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spec convergence unit: the new SpecReviser implements the Reviser Protocol
and folds BOTH `[NEEDS CLARIFICATION]` marker resolution AND every panel
concern into ONE LLM round. This is the spec-015 "collapse" — the previous
two-step author + refine flow becomes one R2 call that produces a fully-
revised spec.md plus a per-concern change-log.
src/llmxive/convergence/revisers/spec_reviser.py
- `SpecReviser` class (Reviser-protocol-conformant): constructed with
(backend, repo_root, project_id, model?, token_budget?, cache_dir?).
- `.revise(artifacts, concerns)`:
- Picks the spec.md artifact (suffix match; excludes paper-side spec).
- Gathers idea text from artifacts (`idea/` keys).
- Overflow routing (FR-006): when bundle approx-tokens > budget, routes
idea + comments_block through `tools.summarize.summarize` with a
preservation goal that pins FR/SC ids verbatim. spec.md itself is
NEVER summarized — the reviser must see what it's editing.
- Composes a system (clarifier.md SSoT) + user (current spec + concerns
+ remaining markers + comments) prompt asking for ONE JSON document
with `new_spec_md` + `responses[]`.
- Honest failure modes: missing `new_spec_md` raises; non-JSON raises;
fewer responses than concerns → padded with `<missing>` entries
(Constitution Principle II: no silent omission).
- `_scan_markers` + `_strip_json_fences` helpers (testable in isolation).
src/llmxive/convergence/revisers/__init__.py
- Package docstring documenting the build_*_reviewspec pattern.
src/llmxive/convergence/reviewspecs.py
- New `build_spec_reviewspec(backend=, repo_root=, project_id=, model=?)`
returns a LIVE ReviewSpec for the spec stage with the SpecReviser bound
as `.reviser`. Static `reviewspec_for("clarified")` still returns the
TodoReviser placeholder; the build_* path is the live wiring (T058 will
add reviewer-side wiring for the panel).
- Local import of SpecReviser keeps the static-registry import graph
clean for callers that never touch the live path.
tests/integration/test_spec_reviser.py (8 tests)
- `_scan_markers` handles bracket + bold marker forms; returns empty
on clean specs.
- `_strip_json_fences` handles fenced + bare JSON.
- End-to-end revise: backend called with system+user; new spec text
written; markers resolved; ConcernResponse per concern.
- Padded missing responses: backend omits one concern → `<missing>`
marker preserved (honest no-silent-omission).
- Missing `new_spec_md` → RuntimeError.
- Non-JSON reply → RuntimeError.
- No spec.md in artifacts → ValueError (engine misuse).
Verification
- ruff check src/llmxive/convergence + tests: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (9 source files)
- pytest tests/integration/test_spec_reviser.py + tests/contract: 51 passed
- pytest broader unit + integration suite: 52 passed (no regressions)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nalyze) Research verified live: math.pi/e/tau, golden ratio, scipy.constants CODATA (c/h/G), and sympy evaluation of every spec example (1+2=3, 1>2=False, identity, 5 km=5000 m, round(pi,2)=3.14). 29 tasks; analyze clean after one fix-loop (decouple US1 approximate from the US2 constants channel; mixed-claim routing; constants top authority rank). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oximate comparator (T001-T008) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…13-T017) US2 (T013/T014): fill/channels/constants.py wraps verify.constants into a zero-network FetchedSource; AUTHORITY["constants"]=0 (top rank); constants added to NUMERIC channel list in channels_for; wired in _get_channel. US5 (T016/T017): compute.py fully replaces the NotImplementedError placeholder with evaluate() (sympy parse_expr, no eval/exec; arithmetic, comparisons, percentages, unit conversions, algebraic identities), extract_expression() (deterministic regex for backend=None), and verify_computational() returning ComputeVerdict. Reuses approximate.is_valid_rounding for real-valued results. Also fixes test_fill_wikidata_parse.py authority assertion to use AUTHORITY lookup instead of hardcoded 1 (broken by the authority re-ranking). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…009-T012,T018-T019)
- resolve.py: mode router before kind dispatch (LLMXIVE_CLAIM_FILL=1); approximate path uses _extract_constant_from_text + verify.approximate for precision compare + correction; computational path uses verify_computational (sympy); RESULT-kind never goes to compute; not_evaluable/no-constant falls through to existing kind dispatch unchanged
- fill/extract.py: present_in_source mode-aware for constants channel only (decimal values, never bare integers; FR-003 exact-count gate untouched)
- tests/integration/test_verify_approximate_wireup.py: pi 3.14→VERIFIED, pi 3.15→corrected, knot count→exact route, 1+1=2→compute, 1+2=1→corrected (T011)
- tests/real_call/test_verify_pi_e_real.py: pi/e zero-network constants path (T012)
- tests/real_call/test_compute_real.py: arithmetic/comparison/pct/unit-conv via sympy (T019)
- tasks.md: T009/T010/T011/T012/T018/T019 marked [X]
Offline: 1838 passed (+5 vs 1833 baseline), 0 failures. Real-call: 11 passed (0.30s, zero HTTP for constants path).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
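The approximate-mode comparison (pi 3.14 verified, pi 3.15 corrected) can be illustrated with a precision-aware rounding check. This is a sketch under assumptions: the function names are hypothetical, round-half-up is an assumed rounding rule, and the real `approximate.is_valid_rounding` may accept other forms such as truncation.

```python
import math
from decimal import ROUND_HALF_UP, Decimal


def round_to_claimed_precision(claimed: str, true_value: float) -> Decimal:
    """Round true_value to the number of decimal places written in `claimed`."""
    exponent = Decimal(claimed).as_tuple().exponent   # "3.14" -> -2
    quantum = Decimal(1).scaleb(exponent)             # -2 -> Decimal("0.01")
    return Decimal(repr(true_value)).quantize(quantum, rounding=ROUND_HALF_UP)


def is_valid_rounding_sketch(claimed: str, true_value: float) -> bool:
    """A claim is valid when it equals the true value at the claim's own precision."""
    return round_to_claimed_precision(claimed, true_value) == Decimal(claimed)


# Usage: verify a claim, and derive a correction when it fails.
claim = "3.15"
if not is_valid_rounding_sketch(claim, math.pi):
    corrected = str(round_to_claimed_precision(claim, math.pi))
```

Comparing at the claim's own precision is what lets "3.14" and "3.1416" both verify against the same constant while "3.15" is corrected rather than merely rejected.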
- _FILLABLE_KINDS now includes MAGNITUDE and RELATIONAL (fill/service.py)
- channels_for(MAGNITUDE/RELATIONAL) → [wikidata, wikipedia, paper] (fill/channels/__init__.py)
- resolve_magnitude/resolve_relational wire _maybe_fill at NEI/REFUTED sites (claims/resolve.py)
- present_in_source uses entity-name check for MAGNITUDE/RELATIONAL (fill/extract.py)
- subject_query extracts entity name for RELATIONAL, category for MAGNITUDE (fill/subject_query.py)
- wikidata channel resolves referenced Q-IDs to labels (e.g. P36→Canberra) (fill/channels/wikidata.py)
- wikidata channel scans 60 P-claims (up from 20) to reach P36 at position 28 (fill/channels/wikidata.py)
- _chat_reasoning_safe always passes model kwarg to satisfy DartmouthBackend (fill/extract.py)
- Updated spec-017 deferral tests to reflect spec-018 enablement (test_fill_service_blocks.py, test_fill_service_logic.py, test_fill_conflict.py)
- New integration tests T021 + T024 (test_fill_magnitude_wireup.py, test_fill_relational_wireup.py)
- New real-call tests T022 + T025 (test_fill_superlative_real.py, test_fill_relational_real.py)
- T020-T025 marked [X] in specs/018/tasks.md
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…start (T026/T027/T029) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2e no-regress; all 29 tasks done Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update: claim-trustworthiness stack added (specs 016 → 017 → 018)

Beyond the spec-015 convergence protocol, this branch now adds a three-layer system that closes the fabricated-facts gap surfaced in the Part-7 shakeout (the spec reviser inventing "27,635 prime knots at 13 crossings"; correct = 9,988). All verified with real models/sources (no mocks); full offline gate 1858 passed, 0 regressions.

Spec 016: Claim-Verification Layer (detective)
Every check-worthy claim a doc-producing agent writes is extracted → registered → substituted with a pointer → resolved (external source or harness-signed execution receipt) → rendered from the verified value. Unresolved claims hard-block via the unified

Spec 017: Authoritative-Fill (constructive)
When an external claim can't be verified as written, the layer searches authoritative sources (OEIS b-file via the Wikipedia→A-number bridge, Wikipedia, Wikidata, papers, theorem search), extracts the correct value, verifies it is literally present in a fetched source (never model memory), substitutes it, and repairs the citation. Verified live, end-to-end through the real chokepoint: 27,635 → sourced 9,988 (OEIS A002863); capital of Australia Sydney → Canberra; unsourceable claims stay blocked.

Spec 018: Per-Claim Verification Modes
The verifier picks a mode per claim:

Plus the 017 fast-follow: magnitude/superlative ("largest planet is Saturn" → Jupiter) and set/relational fills. Each spec went through the full speckit pipeline (specify → clarify → plan → tasks → analyze → implement → verify). New dependency: |
…o longer drops every claim Part-7 finding (PROJ-552 spec stage): the extraction model emits a verbatim claim_text containing an embedded double-quoted paper title (e.g. "A Census of Knots."). That broke yaml.safe_load, _parse_extraction_reply returned [], so NO claims were extracted — a silent fabrication passthrough that let the wrong 27,635 prime-knot count survive un-flagged-and-un-filled (panel kicked back to a human instead of the fill layer correcting 27,635 -> 9,988 from OEIS A002863). General fix (applies to every project, not just PROJ-552): - _tolerant_parse_claims: line-oriented recovery parser that scans for the known field keys and takes the line remainder as the value (one outer quote pair stripped), robust to embedded quotes/colons. _parse_extraction_reply falls back to it on any YAML failure OR an empty strict-parse result. - prompts/claim_extraction.md: explicit quoting rules (single line per field; no raw embedded double-quotes — use single quotes for inner marks) to reduce malformed YAML at the source. - 6 new offline regression tests reproducing the exact embedded-quote failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
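A line-oriented recovery parser of the kind described might look like this sketch. The key set `KNOWN_KEYS` and the item-boundary rule are assumptions for illustration; the real `_tolerant_parse_claims` and its field names are not shown in this PR.

```python
import re

# Hypothetical field keys; the real extraction schema may differ.
KNOWN_KEYS = ("claim_text", "kind", "source")

_FIELD_RE = re.compile(r"^\s*(?:-\s*)?(" + "|".join(KNOWN_KEYS) + r")\s*:\s*(.*)$")


def _strip_outer_quotes(value: str) -> str:
    """Remove exactly one matching outer quote pair, leaving inner quotes alone."""
    value = value.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
        return value[1:-1]
    return value


def tolerant_parse_claims(reply: str) -> list[dict[str, str]]:
    claims: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in reply.splitlines():
        match = _FIELD_RE.match(line)
        if not match:
            continue  # ignore anything that is not a known "key: value" line
        key, raw = match.group(1), match.group(2)
        # A list dash or a repeated key marks the start of the next claim.
        if current and (line.lstrip().startswith("-") or key in current):
            claims.append(current)
            current = {}
        # The whole line remainder is the value, so embedded quotes and
        # colons cannot break parsing the way they break strict YAML.
        current[key] = _strip_outer_quotes(raw)
    if current:
        claims.append(current)
    return claims
```

Used as a fallback only after strict YAML parsing fails or yields nothing, a parser like this trades schema rigor for never silently dropping every claim.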
… 016-018) Ran ruff --fix (import sorting, unused imports, __all__/datetime/Optional modernization) and hand-fixed the remaining non-autofixable findings: RUF005 (iterable unpacking), B007/RUF059/F841 (unused loop/unpack/locals renamed to _ or removed as dead code), E402 (moved always-available imports above pytestmark), RUF002/RUF003 (non-semantic en-dashes -> ASCII hyphen). str+Enum classes keep the mixin (UP042 noqa) to preserve str() repr. No runtime behavior change. ruff check . clean; mypy src/llmxive unchanged (same 63 pre-existing errors); offline gate 1864 passed / 10 skipped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotation-only cleanup — no runtime behavior changes: - Add generic params (dict[str, Any], list[str], Callable[..., Verdict], re.Pattern[str], re.Match[str]) across verify/, fill/, results/, state/, claims/. - Add missing function param/return annotations (backend: Any, Iterator[Path], dict[str, Any] returns, channel callable, __getattr__ -> Any). - Widen repo_root passthrough to str | Path | None in select_mode / verify_computational (fixes [arg-type] without changing the forwarded value). - Make Any returns explicit via bool() wraps and str(m.group(0)). - Add [mypy-scipy.*] and [mypy-sympy.*] ignore_missing_imports sections. mypy src/llmxive: Success (0 errors); ruff clean; offline gate 1864 passed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s missing Part-7 finding (PROJ-552 spec stage): a single reviewer (scope_fidelity) emitted an opening `---` + valid metadata + concerns but NO closing `---` delimiter (the reasoning model ended after concerns: / the endpoint hung mid-response). The strict both-delimiters regex (^---\n(.*?)\n---) matched nothing, so _parse_response raised RuntimeError "no YAML frontmatter" — crashing the ENTIRE spec panel/run instead of degrading gracefully. General fix (every reviewer, every stage): - _extract_frontmatter: recovers the YAML frontmatter in three shapes — proper both-delimiters (fast path), opening + later doc-boundary (---/...), and opening with NO closing delimiter (take the longest leading line-block that still parses to a non-empty YAML mapping, dropping any unfenced trailing prose). Only a response with no opening --- at all is rejected. - panel_review_block.md: explicit instruction to always emit BOTH delimiters and start at column 0 with no leading blank lines. - 8 new offline tests incl. the exact PROJ-552 failure shape. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pping Deeper root cause of the PROJ-552 reviewer crash (commit 1c8451e handled the missing closing `---`, but the run still died identically). _parse_response stripped a ```yaml fence BEFORE extracting frontmatter — and a reviewer's prose body routinely contains a fenced YAML/code example. With no closing `---` on the frontmatter, _CODE_FENCE_RE.search hijacked `candidate` to the prose example's contents (e.g. "foo: bar"), which has no `---`, so extraction returned None and the whole panel/run crashed. Fix: run _extract_frontmatter on the RAW stripped response FIRST; only fall back to unwrapping a ```yaml fence when the raw response has no recoverable frontmatter (the wholly-fence-wrapped case). +4 tests: fenced example in prose (the exact crash), whole-response fence wrap, and proper-delims-with-fenced-prose no-hijack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
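The ordering fix (extract from the raw response first, unwrap a fence only as a fallback) together with the no-closing-delimiter recovery can be sketched as follows. Two loud assumptions: `looks_like_mapping` is a crude stdlib stand-in for "parses to a non-empty YAML mapping" (the real code presumably uses a YAML parser), and the middle shape (opening plus a later `...` boundary) is left out for brevity.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built programmatically so this example stays fence-safe
_BOTH_DELIMS_RE = re.compile(r"^---\n(.*?)\n---", re.DOTALL)
_FENCE_RE = re.compile(FENCE + r"(?:yaml)?\n(.*?)\n" + FENCE, re.DOTALL)
_KEY_LINE_RE = re.compile(r"^[A-Za-z_][\w-]*:(\s|$)")


def looks_like_mapping(block: str) -> bool:
    """Stand-in check: top-level lines must be 'key:' lines; indented or
    dash-led lines are accepted only after a key has been seen."""
    saw_key = False
    for line in block.split("\n"):
        if not line.strip():
            continue
        if line[0] in " \t-":
            if not saw_key:
                return False
        elif _KEY_LINE_RE.match(line):
            saw_key = True
        else:
            return False  # top-level prose: not a mapping
    return saw_key


def extract_frontmatter(text: str) -> Optional[str]:
    if not text.startswith("---\n"):
        return None  # no opening delimiter at all: reject
    strict = _BOTH_DELIMS_RE.match(text)
    if strict and looks_like_mapping(strict.group(1)):
        return strict.group(1)  # fast path: both delimiters present
    # No closing delimiter: take the longest leading block that still parses.
    lines = text.split("\n")[1:]
    for end in range(len(lines), 0, -1):
        block = "\n".join(lines[:end])
        if looks_like_mapping(block):
            return block.strip("\n")
    return None


def parse_response(response: str) -> str:
    raw = response.strip()
    frontmatter = extract_frontmatter(raw)  # RAW response first
    if frontmatter is None:
        fenced = _FENCE_RE.search(raw)      # fallback: wholly fence-wrapped reply
        if fenced:
            frontmatter = extract_frontmatter(fenced.group(1).strip())
    if frontmatter is None:
        raise RuntimeError("no YAML frontmatter")
    return frontmatter
```

Trying the raw text first is what prevents a fenced example inside the prose body from hijacking extraction when the frontmatter's closing delimiter is missing.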
The claim layer was not idempotent across convergence rounds. Two root
causes are fixed here:
A. Prose-preserving render (claims/pointer.py). A {{claim:<id>}} pointer
stands in for the claim's FULL raw_text sentence, but render() replaced
it with the BARE resolved_value — collapsing a whole sentence to a number
(PROJ-552 garble). render() now reconstructs the prose: for a VERIFIED
NUMERIC/RESULT claim it swaps ONLY the asserted numeric token (selected by
thousands-separator/idempotency heuristic) for resolved_value; an
already-correct assertion is returned byte-for-byte unchanged. ENTITY_FACT/
RELATIONAL/MAGNITUDE leave prose intact unless the object span is locatable.
A non-verified claim now PRESERVES the prose and APPENDS one inline
[UNRESOLVED-CLAIM:] marker instead of replacing the sentence with a marker.
B. Idempotent extraction (claims/gate.py strip_claim_artifacts + service.py).
process_document now strips prior-round [UNRESOLVED-CLAIM:] markers and
stray {{claim:<id>}} pointers BEFORE extraction, so the layer no longer
re-extracts its own marker bodies as new claims or accumulates markers.
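A minimal sketch of the strip-before-extract step from item B, assuming markers look like `[UNRESOLVED-CLAIM: ...]` and pointers like `{{claim:<id>}}`. The regexes are illustrative guesses, not the patterns used in claims/gate.py.

```python
import re

# Assumed shapes; the leading \s* also removes the space the marker was appended after.
_MARKER_RE = re.compile(r"\s*\[UNRESOLVED-CLAIM:[^\]]*\]")
_POINTER_RE = re.compile(r"\s*\{\{claim:[^}]*\}\}")


def strip_claim_artifacts(text: str) -> str:
    """Remove prior-round markers and stray pointers so extraction sees only prose."""
    text = _MARKER_RE.sub("", text)
    return _POINTER_RE.sub("", text)


round_one = "The bound is tight. [UNRESOLVED-CLAIM: no source] {{claim:c-7}} Next."
cleaned = strip_claim_artifacts(round_one)
```

Running the function on its own output changes nothing, which is the property that keeps markers from accumulating across convergence rounds.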
New render contract: a VERIFIED pointer renders the claim's sentence with only
the asserted token swapped for the verified value (idempotent); a non-verified
pointer renders the sentence followed by one [UNRESOLVED-CLAIM:] marker.
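The render contract can be illustrated with a small sketch. The `Claim` dataclass, the number regex, and the token-selection rule (prefer a token with a thousands separator, else the longest) are assumptions made for this example; the 9,978 figure is an invented wrong value, and the real heuristic in claims/pointer.py may differ.

```python
import re
from dataclasses import dataclass

# Thousands-separated numbers first, then plain integers/decimals.
_NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")
MARKER = "[UNRESOLVED-CLAIM:]"


@dataclass
class Claim:  # hypothetical shape, for illustration only
    raw_text: str
    resolved_value: str
    verified: bool


def _normalize(token: str) -> str:
    return token.replace(",", "")


def render(claim: Claim) -> str:
    if not claim.verified:
        # Non-verified: keep the prose and append exactly one marker.
        return f"{claim.raw_text} {MARKER}"
    tokens = list(_NUMBER_RE.finditer(claim.raw_text))
    target = _normalize(claim.resolved_value)
    # Already correct (or nothing to swap): return the sentence unchanged.
    if not tokens or any(_normalize(t.group()) == target for t in tokens):
        return claim.raw_text
    chosen = max(tokens, key=lambda t: ("," in t.group(), len(t.group())))
    return (
        claim.raw_text[: chosen.start()]
        + claim.resolved_value
        + claim.raw_text[chosen.end():]
    )


wrong = Claim("There are 9,978 prime knots with 13 crossings.", "9,988", True)
once = render(wrong)
twice = render(Claim(once, "9,988", True))
```

Only the asserted token changes; the crossing number and the rest of the sentence survive, and a second render is a no-op.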
Updated the chokepoint test's _make_claim fixture (raw_text now carries a
numeric token) — its old "some number"/resolved="9988" pair encoded the OLD
bare-value-substitution contract, which the prose-preserving render supersedes.
Added 13 tests (pointer prose-preservation incl. the exact PROJ-552 garble +
idempotency; gate strip_claim_artifacts; process_document double-run stability).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…shared parser
C. Drop the redundant F-19 reviser pass (convergence/revisers/_self_consistency.py).
The reviser chokepoint ran F-19 _ground_factual_claims THEN 016 _verify_claims
on its output; 016 then re-extracted F-19's inline [UNVERIFIED] marker reasons
as new "claims" and the two text-mutation models fought (PROJ-552 root cause 2).
_clean_citations now runs ONLY _verify_claims (spec 016 is the SSoT).
_ground_factual_claims remains defined for other importers but no longer runs
in the chokepoint; docstring updated. The grounding *service* (used BY 016
resolvers) is untouched.
D. Extraction precision (claims/extract.py). A purely promotional "standing"
statement (well-established / peer-reviewed / community-standard / widely-used /
well-known / established-reference / gold-standard) with no crisp checkable core
is now dropped — it cannot be substantiated and otherwise left a residual
[UNRESOLVED-CLAIM:] marker that blocked convergence (root cause 5). A statement
with a salient NUMBER or explicit citation still passes.
E. Shared tolerant parser (claims/extract.py + agents/grounding_guard.py). The
grounding guard had a SECOND YAML claim parser with the same embedded-quote
fragility already fixed in claims/extract — exactly why the bug recurred. The
tolerant field-recovery is now a shared tolerant_field_entries(); the grounding
guard's _parse_extraction_reply falls back to it on YAML failure or no usable
claims, recovering an embedded-quote cited claim ("A Census of Knots.") instead
of silently dropping every claim. Strict-path behavior preserved.
Added 4 tests (F-19-not-invoked-in-reviser-path; promotional-statement drop +
number/citation survival; grounding-guard embedded-quote recovery).
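As an illustration of item D, the promotional-statement filter could be approximated as below. The phrase list comes from the commit message; the citation pattern and the function name `keep_for_verification` are assumptions, and the real check in claims/extract.py is likely more careful about what counts as a "crisp checkable core".

```python
import re

_STANDING_RE = re.compile(
    r"\b(?:well[- ]established|peer[- ]reviewed|community[- ]standard|"
    r"widely[- ]used|well[- ]known|established[- ]reference|gold[- ]standard)\b",
    re.IGNORECASE,
)
_NUMBER_RE = re.compile(r"\d")
# Assumed citation cues; a real implementation would know the project's citation syntax.
_CITATION_RE = re.compile(r"doi:|https?://|et al\.", re.IGNORECASE)


def keep_for_verification(statement: str) -> bool:
    """Drop purely promotional 'standing' statements; keep everything else."""
    if not _STANDING_RE.search(statement):
        return True  # not a standing statement: normal extraction applies
    return bool(_NUMBER_RE.search(statement) or _CITATION_RE.search(statement))
```

A statement such as "KnotInfo is a widely-used, gold-standard reference." is dropped, while one that carries a number or a citation cue still goes on to verification.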
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…actually route
Part-7 finding (PROJ-552): a project that kicks back at a doc-stage convergence panel was STUCK re-running that stage forever and never escalating.
Root cause: a non-converging panel writes convergence_kickback.yaml then RAISES StagePanelKickback from inside the agent run. That raise propagated straight out of run_one_step (caught only by the CLI as a FAIL), so _decide_next_stage, the ONLY place consume_convergence_kickback runs, was never reached. The sentinel was never consumed, current_stage never advanced, and the per-stage kickback cap never incremented (so the 3-strikes→human escalation never fired either). The adaptive-kickback resilience (F-20 Part B) worked in its unit tests but was dead in the real `llmxive run` path because every test exercised _decide_next_stage in isolation, never the raise-through-run_one_step seam.
Fix: run_one_step now catches StagePanelKickback (controlled non-convergence) and StagePanelEscalation (engine failure) around the speckit agent call and falls through to _decide_next_stage, which consumes the sentinel and routes the project to the content stage (or to HUMAN_INPUT_NEEDED at the cap / on engine failure) instead of crashing the run loop.
+2 regression tests exercising the REAL run_one_step exception handling + real _decide_next_stage/_kickback routing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
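The seam described here (catch the controlled exceptions, then fall through to routing) can be sketched as follows. The dict-based project state, the `kickback_sentinel` key, and the cap arithmetic are simplifications invented for this example; only the exception names and the catch-then-route idea come from the commit message.

```python
from typing import Callable


class StagePanelKickback(Exception):
    """Controlled non-convergence: the panel left a kickback sentinel behind."""


class StagePanelEscalation(Exception):
    """Engine failure: the project needs a human."""


KICKBACK_CAP = 3  # assumed per-stage cap ("3 strikes")


def decide_next_stage(project: dict, escalated: bool) -> str:
    if escalated:
        return "human_input_needed"
    sentinel = project.pop("kickback_sentinel", None)  # consume the sentinel
    if sentinel is None:
        return project["next_stage"]
    stage = project["current_stage"]
    counts = project.setdefault("kickbacks", {})
    counts[stage] = counts.get(stage, 0) + 1
    if counts[stage] >= KICKBACK_CAP:
        return "human_input_needed"
    return sentinel["target_stage"]


def run_one_step(project: dict, run_agent: Callable[[dict], None]) -> str:
    escalated = False
    try:
        run_agent(project)
    except StagePanelKickback:
        pass  # expected outcome: routing below consumes the sentinel
    except StagePanelEscalation:
        escalated = True
    project["current_stage"] = decide_next_stage(project, escalated)
    return project["current_stage"]


def kicking_agent(project: dict) -> None:
    project["kickback_sentinel"] = {"target_stage": "flesh_out"}
    raise StagePanelKickback("panel did not converge")


project = {"current_stage": "spec", "next_stage": "clarified"}
result = run_one_step(project, kicking_agent)
```

The point of the sketch is that routing runs on every path out of the agent call, so the sentinel is always consumed and the cap always increments.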
Part-7 finding (PROJ-552 plan stage): the planner emitted a contracts/knot_record.schema.yaml whose body contained a `---` document separator (a second YAML doc at line 115). The FR-007 guard (_research_guard.assert_data_model_contracts_consistent) parses each schema with yaml.safe_load (single-document) and correctly rejected it as invalid — but the rejection hard-crashes the plan stage (plan_cmd unlinks all artifacts + re-raises → CLI FAIL), stranding the project at `clarified`. This commit addresses the TRIGGER: the planner prompt now explicitly forbids an internal `---` separator and tells the model to emit a separate `<!-- FILE: contracts/<name>.schema.yaml -->` block per schema. NOTE (deeper robustness gap, tracked separately in notes/spec-015-review-status): the deterministic plan guards (FR-005/006/007) fail-closed by raising, with NO revision loop — a malformed planner artifact strands the project instead of driving a bounded planner re-run with the guard feedback (the identify→revise philosophy of #239). That fix is a careful plan-stage flow change for a follow-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/ChunkedEncoding)
Part-7 finding (PROJ-552 plan stage, run #12): the run crashed with a raw ('Connection broken: IncompleteRead(77671 bytes read)', ...). The flaky Dartmouth endpoint dropped the connection while streaming the planner's ~75KB multi-file reply. The DartmouthBackend's transient-error classifier had "connection reset"/"connection refused" but NOT "connection broken" / IncompleteRead / ChunkedEncodingError, so the drop was classified Permanent → no retry → the plan stage failed and stranded the project at `clarified`.
Fix: add the connection-dropped-mid-stream markers (connection broken, incompleteread, chunkedencodingerror, connection aborted, remotedisconnected, remote end closed, broken pipe, eof occurred) to the transient set so _retry_with_backoff retries them. Also extracted the marker tuple + match into a module-level _is_transient_error_text() (was buried in a closure, untestable) and added regression tests for the exact failure + a no-over-match guard on genuine permanent errors.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
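A module-level classifier of the kind this commit describes could look like the sketch below. The marker strings are the ones listed in the commit message; the case-insensitive substring match and the shape of `retry_with_backoff` (including its `sleep` parameter) are assumptions for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

_TRANSIENT_MARKERS = (
    "connection reset", "connection refused",
    # connection-dropped-mid-stream markers added by this fix
    "connection broken", "incompleteread", "chunkedencodingerror",
    "connection aborted", "remotedisconnected", "remote end closed",
    "broken pipe", "eof occurred",
)


def is_transient_error_text(message: str) -> bool:
    """True when the error text matches a known transient marker."""
    lowered = message.lower()
    return any(marker in lowered for marker in _TRANSIENT_MARKERS)


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    for attempt in range(attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_transient_error_text(str(exc)):
                raise  # permanent error or budget exhausted
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable: attempts must be >= 1")


calls = []


def flaky() -> str:
    calls.append(1)
    if len(calls) == 1:
        raise ConnectionError("Connection broken: IncompleteRead(77671 bytes read)")
    return "planner reply"


reply = retry_with_backoff(flaky, sleep=lambda seconds: None)
```

Keeping the classifier as a plain function of the error text is what makes the "exact failure" and "no over-match" regression tests cheap to write.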
Part-7 (PROJ-552 plan run #13): after the single-document fix, the planner hit a SECOND YAML pitfall — an unquoted schema description "... (target: ≥95%)" whose bare ": " made yaml.safe_load read it as a nested mapping ("mapping values are not allowed here"), again rejected by the FR-007 guard. Prompt now requires quoting any string value containing a colon, '#', leading ≥/≤/%, or brackets. (The robust backstop — a bounded planner revision-with-feedback loop on guard failure — is implemented separately; prompt nudges alone are whack-a-mole.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A malformed planner artifact (e.g. a contracts schema with an internal `---` multi-document marker, or an unquoted `: ` that breaks YAML) previously hard-crashed PlannerAgent.write_artifacts on the first deterministic-guard failure, marking the run FAILED and stranding the project at 'clarified' with no retry.
Wrap the split → FR-005 → write → FR-007 → FR-006 pipeline in a bounded retry loop:
- Refactor the write+guard body into _write_and_validate(ctx, mechanical_output, response_text) -> list[str], which raises the guard exception (and unlinks partial writes) on failure, exactly as before.
- write_artifacts now calls it in a loop: on a guard exception, if attempts remain, re-call the planner LLM via _revise_with_feedback with ONE corrective user message quoting the exact guard error and demanding all five files re-emitted in the FILE-marker format; else re-raise the last guard exception (fail-closed preserved).
- Cap: MAX_PLAN_REVISION_RETRIES = 2 (up to 3 total attempts). Each retry logged at INFO.
- Offline-safety gate: the corrective re-call builds the backend via make_backend(ctx.default_backend.value); if it returns None or raises, the loop does NOT retry and re-raises the guard exception. Offline unit tests therefore stay network-free.
- The plan convergence panel runs only AFTER the artifacts pass the guards.
Tests:
- tests/unit/test_plan_revision_loop.py drives the real write_artifacts / _write_and_validate with a fake backend collaborator: invalid-then-valid retries-and-succeeds (both PROJ-552 failure modes), all-invalid raises the last guard exception with no partial artifacts, and no-backend / make_backend-raises both fail closed without retrying.
- Updated two existing phase4 tests that encoded the old hard-crash-on-first-failure contract (template-reject unlink, bad-URL unlink) to force the offline path (make_backend -> None), asserting the same fail-closed-and-unlink behavior via the loop's no-backend branch.
ruff + mypy clean; offline gate 1899 passed (baseline 1894 + 5), 0 failures.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
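The bounded revise-with-feedback loop can be sketched like this. `GuardError`, the callable-based signatures, and the feedback wording are hypothetical simplifications; only the cap of 2 retries, the fail-closed behavior, and the offline gate follow the commit message.

```python
from typing import Callable, List, Optional

MAX_PLAN_REVISION_RETRIES = 2  # up to 3 total attempts


class GuardError(Exception):
    """Hypothetical stand-in for the deterministic-guard exceptions."""


Backend = Callable[[str], str]


def _safe_backend(make_backend: Callable[[], Optional[Backend]]) -> Optional[Backend]:
    try:
        return make_backend()
    except Exception:
        return None  # a backend that cannot be built counts as offline


def write_artifacts(
    response_text: str,
    write_and_validate: Callable[[str], List[str]],
    make_backend: Callable[[], Optional[Backend]],
) -> List[str]:
    last_error: Optional[GuardError] = None
    for attempt in range(MAX_PLAN_REVISION_RETRIES + 1):
        try:
            return write_and_validate(response_text)
        except GuardError as exc:
            last_error = exc
        if attempt == MAX_PLAN_REVISION_RETRIES:
            break  # attempts exhausted
        backend = _safe_backend(make_backend)
        if backend is None:
            break  # offline-safety gate: no retry
        response_text = backend(
            f"The guard rejected your output: {last_error}. Re-emit all files."
        )
    assert last_error is not None
    raise last_error  # fail closed with the last guard exception


def validate(text: str) -> List[str]:
    if "\n---\n" in text:
        raise GuardError("expected a single YAML document")
    return ["plan.md"]


bad, good = "a: 1\n---\nb: 2", "a: 1\nb: 2"
healed = write_artifacts(bad, validate, lambda: (lambda prompt: good))
```

Passing `make_backend` as a collaborator keeps the loop testable without a network, which mirrors the offline gate described in the list above.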
…el to human
Part-7 finding (PROJ-552 plan panel, run #14): the qwen endpoint hung past its 180s deadline; the backend's own retry+backoff exhausted and surfaced a TransientBackendError. _stage_panel's engine-failure handler caught it like any exception → wrote human_input_needed.yaml + raised StagePanelEscalation → the project was stranded at human_input_needed. But a transiently-degraded model endpoint is NOT human-actionable: a human cannot fix a hung endpoint.
Fix: catch TransientBackendError separately in run_stage_panel and re-raise it AS-IS (no human_input_needed.yaml, no StagePanelEscalation wrap) so the run fails transiently and the project STAYS at its current stage to retry on the next scheduler tick when the endpoint recovers. run_one_step does not catch TransientBackendError, so it surfaces as a transient CLI FAIL with no stage change. Genuine (non-transient) engine failures still escalate to human as before.
+1 regression test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
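The distinction between transient and human-actionable failures comes down to the order of the `except` clauses. A minimal sketch, with a dict standing in for the project state and for human_input_needed.yaml (both assumptions of this example):

```python
from typing import Callable


class TransientBackendError(Exception):
    """The model endpoint is degraded; retrying later may succeed."""


class StagePanelEscalation(Exception):
    """A genuine engine failure that a human has to look at."""


def run_stage_panel(project: dict, run_engine: Callable[[dict], str]) -> str:
    try:
        return run_engine(project)
    except TransientBackendError:
        # Not human-actionable: write nothing, wrap nothing, keep the stage.
        raise
    except Exception as exc:
        project["human_input_needed"] = str(exc)  # stands in for the YAML file
        raise StagePanelEscalation(str(exc)) from exc


def hung_endpoint(project: dict) -> str:
    raise TransientBackendError("endpoint exceeded its 180s deadline")


def broken_engine(project: dict) -> str:
    raise ValueError("panel config is invalid")
```

Because `TransientBackendError` is listed first, it never reaches the generic handler, so the project state is left untouched for the next scheduler tick.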
Part-7 end-to-end pipeline hardening (driving PROJ-552 live)
Drove a real project (PROJ-552, knot-complexity, mathematics) through the live pipeline stage-by-stage per issue #239 Part 7, and fixed 8 general bugs, each surfaced by a real failure and fixed so it benefits every project, not just the test case.
Recurring lesson: controlled/expected failures (panel non-convergence, malformed planner output, connection drops, hung endpoints) must route / revise / fail-transiently, never hard-crash and strand the project.
Proven end-to-end: the convergence protocol works. PROJ-552 went brainstorm → idea → drift-kickback → flesh_out realign → idea-converge → spec converged (with the correct 9,988-prime-knots-at-13-crossings count verified against a real 2025 J. Knot Theory Ramifications DOI), and the plan stage now self-heals malformed artifacts + survives transient endpoint failures.
Offline gate: 1900 passed, 0 failures; ruff + mypy clean.
Traversal continues (plan panel → tasks → … → publisher, then the 9-domain repetition) once the (currently-degraded) model endpoint recovers.
🤖 Generated with Claude Code
…w notes Persists PROJ-552's pipeline state so the traversal can resume from the plan panel on another machine or in a GitHub Action while this laptop is suspended. Stage=clarified (spec converged at specs/002 with the verified 9,988 knot count); plan convergence panel is next. Includes the claims/citations registries, librarian cache, run-log telemetry, and the living review tracker (notes/spec-015-review-status.md) documenting all 8 fixes + the resume command. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The llmxive-pipeline workflow ran `llmxive run` but never committed the result,
so an Action's progress (advanced stage, new artifacts, run-log) was discarded
when the ephemeral runner tore down. Add an always() commit+push step so a
GitHub-Action run (e.g. driving PROJ-552 while a laptop is suspended) actually
persists. [skip ci] avoids retriggering. Pushes to the checked-out branch via
HEAD:${GITHUB_REF_NAME}.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nce-protocol # Conflicts: # .github/workflows/spec015-calibration.yml
…m/ContextLab/llmXive into 015-pipeline-convergence-protocol
…-552 plan panel Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Implements spec 015: Pipeline Convergence Protocol (issue #239). Replaces the legacy accumulated-review-points model (≥10 LLMs / ≥5 humans, 0.5/1.0 points) with a convergence-based gate: each reviewable stage runs `identify → revise → re-review` with its LLM panel and advances only on unanimous panel acceptance within a 3-round cap, else an adaptive kickback to the prior stage with full provenance. Human/personality reviews are advisory only and route through stage-aware triage.
Key behavior (selected FRs)
- `converged` reporting (FR-016).
- `dartmouth → local`.
Hardening in this PR
- `finally: return` bugs (implementer/publisher) that double-appended run-log entries on the skip path and swallowed re-raises.
- `NameError` in `agents/librarian.py` (loop var/body mismatch on the marginal-fallback path), surfaced by the mypy pass.
- `LLMXIVE_REPO_ROOT` repo-root override (centralized ~60 `__file__` climbs) and de-rotted the Phase-3 real-call e2e so it runs hermetically against a synthetic repo (verified: real Specifier+Clarifier run, 95s).
- `(str, Enum) → StrEnum` migration; mypy strict: 213 → 0; `ruff check .`: clean (repo-wide); offline suite 1232 passed.
Verification
- `ruff check .` → All checks passed
- `mypy src/llmxive` → 0 errors (154 files)
- `prompts-check` → OK (53 agents)
🤖 Generated with Claude Code