
spec 015: pipeline convergence protocol (closes #239) #250

Merged: jeremymanning merged 213 commits into main from 015-pipeline-convergence-protocol on May 31, 2026

@jeremymanning
Member

Summary

Implements spec 015 — Pipeline Convergence Protocol (issue #239). Replaces the legacy accumulated-review-points model (≥10 LLMs / ≥5 humans, 0.5/1.0 points) with a convergence-based gate: each reviewable stage runs identify → revise → re-review with its LLM panel and advances only on unanimous panel acceptance within a 3-round cap, else an adaptive kickback to the prior stage with full provenance. Human/personality reviews are advisory only and route through stage-aware triage.

Key behavior (selected FRs)

  • Convergence engine: R1 identify → R2 revise → R3 re-review; unanimous-acceptance gate; honest converged reporting (FR-016).
  • FR-012 selective re-review: dissenters always re-review; R1-accepters re-review only when R2 changed a lens-relevant artifact.
  • FR-011 reviser self-consistency: a second code-level audit call + one corrective re-pass, exception-guarded.
  • FR-048 living-document batched recompile: render Discussion → sha256 material-change → FR-054 sign-off gate → version DOI; cron auto-triggers but never auto-mints.
  • HF Inference-API backend removed — HF models run locally via transformers; backend chain is dartmouth → local.
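The FR-012 selection rule above can be sketched as follows. This is a minimal illustration, not the project's implementation: `R1Verdict`, `lens_artifacts`, and `reviewers_to_rerun` are hypothetical names, and "lens-relevant" is assumed to mean that the set of changed artifacts overlaps the set a reviewer's lens reads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class R1Verdict:
    """Hypothetical record of one reviewer's round-1 outcome."""

    reviewer: str
    accepted: bool
    lens_artifacts: frozenset[str]  # artifacts this reviewer's lens reads (assumed)


def reviewers_to_rerun(verdicts: list[R1Verdict], changed: set[str]) -> list[str]:
    """Dissenters always re-review; accepters only if their artifacts changed."""
    selected = []
    for verdict in verdicts:
        if not verdict.accepted or (verdict.lens_artifacts & changed):
            selected.append(verdict.reviewer)
    return selected


panel = [
    R1Verdict("scope", accepted=False, lens_artifacts=frozenset({"spec.md"})),
    R1Verdict("testability", accepted=True, lens_artifacts=frozenset({"spec.md"})),
    R1Verdict("novelty", accepted=True, lens_artifacts=frozenset({"idea.md"})),
]
rerun_after_spec_change = reviewers_to_rerun(panel, changed={"spec.md"})
rerun_no_change = reviewers_to_rerun(panel, changed=set())
print(rerun_after_spec_change)  # ['scope', 'testability']
print(rerun_no_change)          # ['scope']
```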

Hardening in this PR

  • Fixed 2 latent `finally: return` bugs (implementer/publisher) that double-appended run-log entries on the skip path and swallowed re-raises.
  • Fixed a real NameError in agents/librarian.py (loop var/body mismatch on the marginal-fallback path), surfaced by the mypy pass.
  • Introduced LLMXIVE_REPO_ROOT repo-root override (centralized ~60 __file__ climbs) and de-rotted the Phase-3 real-call e2e so it runs hermetically against a synthetic repo (verified: real Specifier+Clarifier run, 95s).
  • `(str, Enum)` → `StrEnum` migration; mypy strict: 213 → 0 errors; `ruff check .`: clean (repo-wide); offline suite: 1232 passed.
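For readers unfamiliar with the bug class in the first bullet, here is a small standalone illustration. It is not the implementer/publisher code; `buggy_step` and `fixed_step` are made-up names. A `return` inside `finally` replaces any pending return value and discards an in-flight exception, and bookkeeping placed in `finally` also runs on an early-return path.

```python
def buggy_step(skip: bool, run_log: list[str]) -> str:
    try:
        if skip:
            run_log.append("skipped")
            return "skip"
        raise RuntimeError("real failure")
    finally:
        run_log.append("done")  # runs on the skip path too: a second entry
        return "ok"  # replaces "skip" and discards the RuntimeError


def fixed_step(skip: bool, run_log: list[str]) -> str:
    if skip:
        run_log.append("skipped")
        return "skip"
    try:
        raise RuntimeError("real failure")
    finally:
        run_log.append("attempted")  # bookkeeping only; the error propagates


buggy_log: list[str] = []
buggy_results = [buggy_step(True, buggy_log), buggy_step(False, buggy_log)]

fixed_log: list[str] = []
fixed_skip = fixed_step(True, fixed_log)
try:
    fixed_step(False, fixed_log)
    error_propagated = False
except RuntimeError:
    error_propagated = True

print(buggy_results, buggy_log)  # ['ok', 'ok'] ['skipped', 'done', 'done']
print(fixed_skip, fixed_log, error_propagated)
```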

Verification

  • ruff check . → All checks passed
  • mypy src/llmxive → 0 errors (154 files)
  • offline suite → 1232 passed, 1 skipped, 2 deselected
  • Phase-3 e2e (real-call) → passes (95s); prompts-check → OK (53 agents)

Note: part 7 of #239 (full sequential end-to-end pipeline run with per-step artifact-quality review) is in progress as follow-up work on this branch.

🤖 Generated with Claude Code

jeremymanning and others added 30 commits May 27, 2026 20:08
…+ review-model overhaul (#239)

Comprehensive Spec Kit specification for umbrella issue #239, grounded in the
2026-05-27 design doc SSoT and a code-verified audit. Covers: the inode-table
summarize/desummarize primitive (no silent loss of check-critical elements),
the generic identify->revise->re-review convergence engine + adaptive kickback,
removal of the point system for unanimous-panel acceptance + advisory triage,
per-step ReviewSpec adapters across the whole research + paper track, reviewer
calibration (9 domains, held-out generality), end-to-end traversal proof,
living-document discussion board, and all 10 audit bug fixes + arXiv resilience.

Three scope decisions resolved with maintainer up front (living-doc=full;
point cutover=migrate-forward; overflow floor=inode-table pointers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Five clarifications integrated into the spec (Clarifications + FRs/SCs/scenarios/assumptions):
- Publish target: real public Zenodo/GitHub/site, but a MANDATORY manual
  maintainer sign-off before every DOI mint for the duration of this spec
  (new FR-054, SC-014; FR-036/FR-048 updated).
- E2E coverage: all 9 domains traverse end-to-end to posted (FR-045, SC-007).
- Calibration: differential clean-vs-injected test + manual adjudication +
  adaptive sensitivity tuning (no fixed over-flag % / K) (FR-042, FR-044, SC-005).
- Kickback budget: NO global cap; monotonic-improvement-until-convergence;
  per-step 3-round cap retained (FR-017, edge case, assumptions).
- Cutover: no posted/done projects exist -> migration applies to in-flight only (FR-025).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
plan.md (Constitution Check: points-removal + no-global-cap tracked as
authorized deviations -> constitution amendment task), research.md (10 grounded
technical decisions incl. inode-table summarizer format, engine-as-callables,
adaptive kickback, manual DOI sign-off, differential calibration), data-model.md
(pydantic entities), quickstart.md, and 6 contracts (summarize-api,
convergence-engine, reviewspec-registry, review-intake-triage, kickback-record,
publisher-signoff). CLAUDE.md SPECKIT ref -> 015.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Organized by user story (US1-US8) with Setup/Foundational/Polish. TDD + real-call
+ manual-QC tasks included per spec. Dependency chain: summarizer first ->
engine -> bug fixes -> review model -> per-step panels -> calibration (9 domains)
-> e2e to posted (9 domains, manual DOI sign-off) -> living-doc -> polish.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closed 4 coverage/underspecification findings from /speckit-analyze (0 remain):
- C1 (HIGH): FR-006 authoring-side overflow routing + paper twins -> T054-T057
- C2 (MED): FR-026 repository_hygiene line-count/gitignore -> T043
- U1 (MED): FR-053 convergence principle encoding -> T007
- U2 (LOW): FR-017 ProgressRecord emission -> T026
Constitution point-conflict (CRITICAL) resolved by explicit amendment task T007.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- T001: new package dirs (convergence/, calibration/, agents/prompts/panels/)
- T002: STATUS.md living progress doc (FR-052)
- T003: Stage.AWAITING_PUBLICATION_SIGNOFF; config CONVERGENCE_MAX_ROUNDS=3 +
  CONVERGENCE_PER_ROUND_BUDGET_SECONDS=600. Imports verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New SSoT primitive src/llmxive/tools/summarize.py: summarize()/desummarize() with
on-disk inode-table pointer hierarchy. Deterministic no-loss guarantee (URLs/DOIs/
arXiv/citations/FR-SC-task ids/numbers preserved verbatim; full content on disk,
recursively paged in). 12 tests pass (7 edge cases + core no-loss + manifest
contract + no-dangling-pointer); ruff + mypy clean.
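The no-loss idea can be sketched roughly as below. This is a single-level toy under stated assumptions, not the code in `src/llmxive/tools/summarize.py`: the pointer syntax `[[inode:...]]`, the `CRITICAL` patterns, the character budget, and the flat on-disk layout are all hypothetical, and the real primitive is described as recursive.

```python
import hashlib
import re
import tempfile
from pathlib import Path

# Hypothetical patterns for "check-critical" elements (URLs, DOIs, FR/SC/task ids).
CRITICAL = re.compile(r"https?://\S+|10\.\d{4,9}/\S+|\b(?:FR|SC)-\d+\b|\bT\d{3}\b")
POINTER = re.compile(r"\[\[inode:([0-9a-f]{16})\]\]")


def summarize(text: str, store: Path, budget: int = 80) -> str:
    """Shorten over-budget text; keep critical tokens and a pointer to the full copy."""
    if len(text) <= budget:
        return text
    inode = hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    (store / inode).write_text(text, encoding="utf-8")  # full content stays on disk
    kept = sorted(set(CRITICAL.findall(text)))
    return f"{text[:budget]}... keep: {' '.join(kept)} [[inode:{inode}]]"


def desummarize(summary: str, store: Path) -> str:
    """Follow the pointer back to the full text, if there is one."""
    match = POINTER.search(summary)
    if match is None:
        return summary
    return (store / match.group(1)).read_text(encoding="utf-8")


store = Path(tempfile.mkdtemp())
doc = "filler " * 40 + "Per FR-012, see https://example.org/spec and doi 10.1234/abcd.5678 (dataset)"
summary = summarize(doc, store)
restored = desummarize(summary, store)
print(len(doc), len(summary), restored == doc)
```

The critical tokens sit past the truncation point in `doc`, so they survive in `summary` only because they are copied verbatim, and the full text is recoverable only through the pointer.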

Remaining for US1: T009 real-call fidelity, T017 re-point paper_reviewer (SSoT),
T018 real-call verification. See STATUS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_build_corpus_with_summaries now delegates context reduction to
tools/summarize.summarize() (inode-table, no silent truncation), preserving the
1-arg summarize_fn contract + _cached_summarize memoization. Supersedes the old
truncate-with-notice fallback (Const. I SSoT). Updated the 2 coupled unit tests
to the new behavior (full source recoverable via desummarize); _chunk_corpus +
its 3 tests untouched. 24 paper_reviewer + 12 summarizer tests pass; mypy-clean
for the changed function.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tests/real_call/test_summarize_fidelity.py: real qwen3.5-122b summarize_fn over an
over-budget doc; desummarize recovers EVERY critical element verbatim (no loss
through a real-LLM reduction). PASSED in 334s. US1 (summarizer) fully done &
verified: 12 offline + 1 real-call, ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…tion (#239)

- T004/T005: convergence/types.py — Severity (ordered + legacy mapping) and the
  Concern/ConcernResponse/Verdict/ProgressRecord/ConvergenceResult/KickbackRecord/
  TriageRecord pydantic models + Reviewer/Reviser Protocols + ReviewSpec dataclass.
- T006: tests/contract/test_convergence_types.py (7 pass; ruff + mypy clean).
- T007: constitution -> v1.1.0; added Principle VI (Convergent Review,
  NON-NEGOTIABLE), replaced the point-based Review-thresholds gate with
  unanimous-panel convergence + advisory triage, Sync Impact Report updated.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
convergence/engine.py run_convergence: identify->revise->re-review loop with
honest converged reporting (FR-016), 3-round cap, self-review/producer exclusion
+ stale-never-passes (FR-018), per-round wall-clock budget (FR-013), and overflow
inputs routed through tools/summarize (FR-006). convergence/kickback.py
route_kickback (adaptive worst-severity->stage, full-provenance KickbackRecord)
+ progress_record (FR-017). 15 unit tests pass; ruff + mypy clean.
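As a rough picture of the loop described above, the sketch below reads "R1 identify → R2 revise → R3 re-review" literally as three phases. That reading is an assumption, as are the `Result` type and the callable signatures; budgets, producer exclusion, selective re-review, and kickback routing are all omitted.

```python
from dataclasses import dataclass
from typing import Callable

Reviewer = Callable[[str], list[str]]      # returns open concerns; [] means accept
Reviser = Callable[[str, list[str]], str]  # returns the revised artifact


@dataclass
class Result:
    artifact: str
    converged: bool
    open_concerns: list[str]


def run_convergence(artifact: str, panel: list[Reviewer], reviser: Reviser) -> Result:
    # R1 identify: an empty concern list from every panelist is unanimous acceptance.
    concerns = [c for review in panel for c in review(artifact)]
    if not concerns:
        return Result(artifact, True, [])
    # R2 revise: one pass over all concerns.
    revised = reviser(artifact, concerns)
    # R3 re-review: converged is reported True only if nobody still objects.
    remaining = [c for review in panel for c in review(revised)]
    return Result(revised, not remaining, remaining)


def wants_units(text: str) -> list[str]:
    return [] if "km" in text else ["missing units"]


def wants_title(text: str) -> list[str]:
    return [] if text.startswith("#") else ["missing title"]


def add_title(text: str, concerns: list[str]) -> str:
    return "# Draft\n" + text if "missing title" in concerns else text


ok = run_convergence("distance: 5 km", [wants_units, wants_title], add_title)
stuck = run_convergence("distance: 5", [wants_units, wants_title], add_title)
print(ok.converged, stuck.converged, stuck.open_concerns)
```

In the `stuck` case the caller would hand `open_concerns` to the kickback router rather than advance the stage.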

US2 remaining (coupled to US4/US3): T021 real-project integration, T025
advancement.py _produced_by stub, T027 tasker Mode-A/B refactor into the engine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Addressed the tech debt I had flagged (per "fix issues as you notice them"):
- types-PyYAML dev dep -> yaml stubs resolve under `python -m mypy` (clears yaml
  errors codebase-wide).
- ReviewRecord.score: invalid Literal[float] -> float + field_validator (PEP 586;
  identical {0.0,0.5,1.0} constraint).
- paper_reviewer: list[dict]->list[dict[str,Any]]; text coerced to str.
- removed 2 unused PaperReviewerAgent imports in test.
- FIX: T003 added Stage.AWAITING_PUBLICATION_SIGNOFF but not the project-state
  schema enum -> contract test failed; added it (single SSoT schema).
- FIX: T001 panels dir was under src/llmxive/agents/prompts/ but prompts live at
  repo-root agents/prompts/ -> relocated; corrected 7 path refs in tasks.md.

Finding (STATUS.md): project does NOT gate on ruff/mypy (no config, no CI step;
gates = pytest + checks.*). ~273 legacy mypy errors are pre-existing, out of #239.
Focused regression: 92 passed (all contract + score/paper_reviewer/convergence).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

New agents/prompts/implementer_research.md: instructs the research speckit
implementer to emit the artifacts/verdict YAML the parser expects (write real
runnable code/data, no stubs/diffs, fail-loud verdicts). implement_cmd.py now
renders it instead of the paper-revision LaTeX implementer.md (which stays for
the separate paper-revision agent). Also fixed 2 pre-existing ruff nits in
implement_cmd.py (I001 import sort, F541) since I touched the file.
tests/integration/test_audit_bugfixes.py verifies the fix (2 pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
theoremsearch.search() now retries transient failures (429/500/502/503/504 +
RequestException/timeout) with exponential backoff (MAX_TRANSIENT_RETRIES=3),
then degrades via TransientBackendError (the librarian wrapper already treats
that as "optional source unavailable"). Non-transient 4xx are not retried.
retry_backoff_base_seconds is injectable (tests pass 0). 4 unit tests; ruff+mypy clean.
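The retry policy can be sketched along these lines. `HTTPStatusError` is a hypothetical stand-in for whatever the HTTP client raises, the doubling schedule is one common form of exponential backoff, and treating `MAX_TRANSIENT_RETRIES=3` as "three retries after the first attempt" is an assumption.

```python
import time
from typing import Callable

TRANSIENT_STATUS = {429, 500, 502, 503, 504}
MAX_TRANSIENT_RETRIES = 3  # assumed: retries after the initial attempt


class TransientBackendError(RuntimeError):
    """Raised once the retry budget for a transient failure is spent."""


class HTTPStatusError(RuntimeError):
    """Hypothetical stand-in for an HTTP error that carries a status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def _is_transient(exc: Exception) -> bool:
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, HTTPStatusError) and exc.status in TRANSIENT_STATUS


def search_with_retry(
    call: Callable[[], str],
    retry_backoff_base_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(MAX_TRANSIENT_RETRIES + 1):
        try:
            return call()
        except (HTTPStatusError, TimeoutError) as exc:
            if not _is_transient(exc):
                raise  # non-transient 4xx: do not retry
            if attempt == MAX_TRANSIENT_RETRIES:
                raise TransientBackendError(str(exc)) from exc
            sleep(retry_backoff_base_seconds * 2**attempt)
    raise AssertionError("unreachable")


calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise HTTPStatusError(503)
    return "3 results"


result = search_with_retry(flaky, retry_backoff_base_seconds=0)
print(result, calls["n"])
```

Passing a base of 0, as the tests in the commit are said to do, keeps the retry path fast without changing its control flow.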

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
#239)

Full offline suite verified green: tests/contract + 599 tests/unit (7.45s) +
real-call summarize_fidelity. Flagged pre-existing live-PDF test in tests/unit
(not CI-gated, hangs offline) for separate gating.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…yze (#239)

Discrepancy #4 fix: ANALYZE_SYSTEM_PROMPT_PATH was defined but unused (inline
prompt hardcoded; paper reused research tasker.md). Now there are TWO real
analyze prompts that ARE used via render_prompt:
- agents/prompts/analyze.md (research): requirements_coverage / internal_consistency /
  testability / scope / constitution_alignment lenses (same vocabulary as the
  US4 Tasks panel).
- agents/prompts/paper_analyze.md (paper): reader_scenario_coverage /
  claims_supported / required_sections_figures / scope_vs_research /
  internal_consistency / constitution_alignment.

run_analyze() gains kind={"research","paper"} + constitution_text kwargs.
paper_tasks_cmd passes kind="paper" + paper constitution; tasks_cmd passes
research constitution (FR-030: constitution is a standard analyze input from
`specified` onward). 6 audit-bugfix tests + 38 phase4 integration tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
clarifier.attempts_so_far was hardcoded 0 (escalation unreachable) and
paper_clarifier never branched on verdict=escalate AND silently substituted a
"Resolved by default" stub on missing patches — a no-silent-shortcuts violation.

Fixes:
- New shared _clarify_attempts.py: persists per-project attempt count under
  .specify/memory/clarifier_attempts.yaml; bump/read/reset + write_human_input_needed.
- Both clarifiers now read REAL attempts and pass them to the prompt.
- Both branch on verdict=escalate -> write human_input_needed.yaml + raise.
- Both escalate at TASKER_MAX_REVISION_ROUNDS (=5) -> write human_input_needed.yaml + raise.
- paper_clarifier no longer substitutes the silent "Resolved by default" stub
  (matches research clarifier's loud failure behavior).
- Also removed 2 pre-existing F841 dead locals in clarify_cmd._spec_path.
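A minimal sketch of the persisted attempt counter and the two escalation triggers follows. It stores JSON instead of the YAML files named above so that it runs on the standard library alone, and `HumanInputNeeded`, `record_clarifier_round`, and the file names are hypothetical.

```python
import json
import tempfile
from pathlib import Path

TASKER_MAX_REVISION_ROUNDS = 5  # value stated in the commit message


class HumanInputNeeded(RuntimeError):
    """Hypothetical exception raised when the clarifier must escalate."""


def read_attempts(path: Path) -> int:
    if not path.exists():
        return 0
    return int(json.loads(path.read_text(encoding="utf-8"))["attempts"])


def bump_attempts(path: Path) -> int:
    attempts = read_attempts(path) + 1
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"attempts": attempts}), encoding="utf-8")
    return attempts


def record_clarifier_round(path: Path, verdict: str) -> int:
    """Count the round; escalate loudly on an explicit verdict or at the cap."""
    attempts = bump_attempts(path)
    if verdict == "escalate" or attempts >= TASKER_MAX_REVISION_ROUNDS:
        note = {"attempts": attempts, "verdict": verdict}
        path.with_name("human_input_needed.json").write_text(json.dumps(note), encoding="utf-8")
        raise HumanInputNeeded(f"clarifier escalated after {attempts} attempt(s)")
    return attempts


state = Path(tempfile.mkdtemp()) / ".specify" / "memory" / "clarifier_attempts.json"
rounds = [record_clarifier_round(state, "revise") for _ in range(4)]
try:
    record_clarifier_round(state, "revise")
    escalated = False
except HumanInputNeeded:
    escalated = True
print(rounds, escalated)
```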

29 tests pass (audit + phase3 integration); ruff clean for touched files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

paper_specifier.md advertised `code_summary` / `data_summary` inputs that the
code never supplied (silent drift between prompt and reality). paper_specify_cmd
now injects both blocks into the user message, reusing research_reviewer's
_summarize_tree() as the SSoT tree-summary helper — Const. I (share, don't fork).
The advertised inputs ARE now present, grounding the paper-spec generation in
the project's actual code/ and data/ trees.

11 audit-bugfix tests pass; ruff clean for touched files.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…R-054) (#239)

Discrepancy #2 fix (FR-036): graph._decide_next_stage no longer shortcuts
PAPER_ACCEPTED -> POSTED. It now routes paper_accepted -> AWAITING_PUBLICATION_SIGNOFF,
then AWAITING_PUBLICATION_SIGNOFF -> POSTED ONLY when the maintainer sign-off record
exists. The PaperPublisher itself enforces the same gate (defense-in-depth) — at
PAPER_ACCEPTED or AWAITING_PUBLICATION_SIGNOFF with NO signoff record it SKIPs with a
clear "awaiting manual maintainer DOI sign-off (FR-054)" reason. No Zenodo DOI is
minted without recorded approval.
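The routing rule reduces to a small state function. The sketch below is illustrative only: the `Stage` values and the boolean `has_signoff` parameter are assumptions, and the real `_decide_next_stage` handles many more stages.

```python
from enum import Enum


class Stage(Enum):
    PAPER_ACCEPTED = "paper_accepted"
    AWAITING_PUBLICATION_SIGNOFF = "awaiting_publication_signoff"
    POSTED = "posted"


def decide_next_stage(stage: Stage, has_signoff: bool) -> Stage:
    """Never jump from accepted to posted; posting requires a recorded sign-off."""
    if stage is Stage.PAPER_ACCEPTED:
        return Stage.AWAITING_PUBLICATION_SIGNOFF
    if stage is Stage.AWAITING_PUBLICATION_SIGNOFF:
        return Stage.POSTED if has_signoff else stage
    return stage


# Even with a sign-off already on file, an accepted paper passes through the gate stage.
print(decide_next_stage(Stage.PAPER_ACCEPTED, has_signoff=True).value)
print(decide_next_stage(Stage.AWAITING_PUBLICATION_SIGNOFF, has_signoff=False).value)
print(decide_next_stage(Stage.AWAITING_PUBLICATION_SIGNOFF, has_signoff=True).value)
```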

New surface:
- src/llmxive/speckit/_publication_signoff.py: read/write/has/clear_signoff
  persistence under <project>/.specify/memory/publication_signoff.yaml; FR-054
  who/when/what record (kinds "initial" / "version").
- `llmxive project publish-approve <PROJ-ID> --who X --what Y [--kind initial|version]`
  CLI command writes the sign-off record.
- 6 new audit-bugfix tests + 27 publisher/graph regression tests pass.

Also fixed 38 pre-existing ruff issues in touched files (auto-fix).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Discrepancy #7 fix (FR-018): advancement._produced_by was a stub returning None.
It now scans state/run-log/<YYYY-MM>/*.jsonl for the latest entry whose outputs
list contains the artifact path and returns that entry's agent_name. Exact +
suffix path matching tolerates relative-vs-absolute bookkeeping. A repo_root
kwarg keeps the production call (no repo_root) working while making tests
hermetic. Defensive: returns None on missing run-log instead of raising.

T029: the audit-bugfix test file (now 18 tests) verifies T030/T031/T032/T033/
T034/T035/T025 fixes. 38 tests pass (audit + advancement regression).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…to US3 (#239)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New convergence/triage.py — stage-aware triage for submitted human + simulated-
personality reviews. Three filters: quality (length + evidence-indicator regex
sweep — FR/SC/T ids, citations, URLs, DOIs, quoted phrases, code fences,
scientific topic vocab), safety + on-topic (rule-based stop-list + stage/lens
vocabulary overlap), and aspect-mapping to LLM reviewer lenses (preserved but
mapped_lenses=[] when no match -> routes to the step's generic reviewer per
FR-022). Injectable judge_fn for the real-LLM path (US4 wiring); rule-based
default keeps unit tests offline.

tests/integration/test_triage.py: 8 tests covering quality pass/fail, safety
exclusion, off-topic exclusion, lens mapping, unmapped-but-preserved, record
provenance, and the judge_fn injection override. All pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
)

Rewrote the user-facing status-model descriptions in README + web/index.html +
docs/index.html (HTML mirror copy) to convergence semantics: identify -> revise
-> re-review; unanimous panel acceptance within a 3-round cap; advisory
triage for human + simulated-personality reviews; no accumulated points.
Replaces 6 stale "points threshold" / "Human reviews count double" passages.

status_reporter.py + repository_hygiene.py needed no change for the new
status model — their FR-026 duties (projects.json regen, GitHub issue
comment/close on POSTED, line-count delta, gitignore assertions) are not
point-dependent and remain in force unchanged. The points_research_total /
points_paper_total fields the web JS displays will be removed in a follow-up
(part of T041 point-system removal).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…239)

Discrepancy #9 + Const. I cleanup: the accumulated review-point system is gone
from the advancement decision path. Unanimous LLM-panel acceptance is now the
sole gate everywhere (research + paper both).

advancement.py:
  - Research-review gate no longer reads `accept_total` / `RESEARCH_ACCEPT_THRESHOLD`.
    It now uses `_all_specialists_accept(records, required)` with a defensive
    backstop (require ≥1 accept AND zero non-accept records when the registry
    isn't loaded) — mirroring the paper-side default.
  - Paper-review gate's `_award_review_points` call removed (the all-specialists-
    accept-most-recent check was already the real decision).
  - `_award_review_points` definition DELETED (no remaining callers).
  - `RESEARCH_ACCEPT_THRESHOLD` import dropped; replaced with an FR-019 comment.

config.py:
  - `RESEARCH_ACCEPT_THRESHOLD` and `PAPER_ACCEPT_THRESHOLD` constants kept for
    back-compat with `web/about.html` mirror consumers, but VALUES set to 0.0 and
    no advancement code reads them.

T038 tests (`tests/integration/test_no_points.py`, 3 tests): grep guard +
behavioral assertion that no-accept records cannot trip the gate.

T044: per clarify Q3 there are no posted/done projects to grandfather; the gate
change applies on next tick automatically — no data-migration logic needed.

Broad regression: 784 passed, 1 skipped (was 781 — three new T038 tests added).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
src/llmxive/convergence/reviewspecs.py: reviewspec_for(stage) -> ReviewSpec | None.
9 stage entries (idea + 4 research + 4 paper) matching contracts/reviewspec-
registry.md; EXEMPT_STAGES frozenset of 7 mechanical steps. Constitution input
is True for every spec from `specified` onward (FR-030); idea-stage opts out
(no constitution yet). Kickback routing per the contract's worst-severity ->
prior-stage table.

Stages whose panel prompts (T049-T053) or wiring (T054-T059) haven't landed yet
get _TodoReviewer / _TodoReviser placeholders that conform to the Protocol but
raise NotImplementedError with a clear pointer to the follow-up task -- fail-loud
SSoT structure, no silent empty verdicts. 15 contract tests pass; ruff clean.

Also marked T060 (constitution-as-analyze-input, done in T031) and T061 (publisher
wired into graph, done in T035) as already complete.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
US4 panel-prompt authoring: 27 lens prompts + 1 SSoT shared block + a contract
test that catches future registry/file-name drift.

agents/prompts/_shared/panel_review_block.md
- SSoT (Constitution Principle I) for the panel R1/R3 output contract.
  Severity vocabulary matches the spec-015 Severity enum (trivial → fatal);
  identify and re-review phases both defined.

agents/prompts/panels/ — 27 files total
  T049: panel_idea_{rq_validity,novelty,feasibility,idea_quality}.md
  T050: panel_spec_{requirements_coverage,internal_consistency,testability,scope}.md
  T051: panel_plan_{methodology,spec_coverage,data_resources,consistency}.md
  T052: panel_tasks_{coverage,ordering,executability,constraint_preservation}.md
  T053: panel_paper_spec_* (4) + panel_paper_plan_* (3) + panel_paper_tasks_* (4)

Each per-lens file is thin: lens + scope ("what NOT to flag") + inputs
(constitution from `specified` onward per FR-030) + per-severity-class
guidance + reference to the SSoT block. T054-T059 wiring will concatenate
lens-prompt + SSoT-block at render time.

tests/contract/test_panel_prompts.py (16 tests)
- Every lens in the ReviewSpec registry resolves to a real prompt file.
- Every panel file references the SSoT block (Principle I drift guard).
- Every panel file has `## Lens` and `## Output format` sections.
- Reuse-stages (research_review/paper_review) map to existing specialist
  files, with the _research/_paper suffix convention preserved.
- The SSoT block enumerates every Severity enum value + defines R1 and R3.

Tech debt fixed inline (surfaced by ruff+mypy installation in venv):
- reviewspecs.py: _todo_reviewers now returns list[Reviewer] (list is
  invariant). Removed an unused `# type: ignore`.
- triage.py: JudgeFn return-type narrowed to dict[str, object]; the
  mapped_lenses access narrowed with isinstance(list|tuple) at the
  callsite — honest about the contract boundary rather than ignore.

Verification:
- ruff check src/llmxive/convergence + summarize.py: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (7 source files)
- pytest tests/contract: 43 passed
- pytest 4 conv-related unit files: 27 passed
- pytest 3 spec-015 integration files: 29 passed
- llmxive.checks.prompts: OK (53 agents)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spec convergence unit: the new SpecReviser implements the Reviser Protocol
and folds BOTH `[NEEDS CLARIFICATION]` marker resolution AND every panel
concern into ONE LLM round. This is the spec-015 "collapse" — the previous
two-step author + refine flow becomes one R2 call that produces a fully-
revised spec.md plus a per-concern change-log.

src/llmxive/convergence/revisers/spec_reviser.py
- `SpecReviser` class (Reviser-protocol-conformant): constructed with
  (backend, repo_root, project_id, model?, token_budget?, cache_dir?).
- `.revise(artifacts, concerns)`:
  - Picks the spec.md artifact (suffix match; excludes paper-side spec).
  - Gathers idea text from artifacts (`idea/` keys).
  - Overflow routing (FR-006): when bundle approx-tokens > budget, routes
    idea + comments_block through `tools.summarize.summarize` with a
    preservation goal that pins FR/SC ids verbatim. spec.md itself is
    NEVER summarized — the reviser must see what it's editing.
  - Composes a system (clarifier.md SSoT) + user (current spec + concerns
    + remaining markers + comments) prompt asking for ONE JSON document
    with `new_spec_md` + `responses[]`.
  - Honest failure modes: missing `new_spec_md` raises; non-JSON raises;
    fewer responses than concerns → padded with `<missing>` entries
    (Constitution Principle II: no silent omission).
- `_scan_markers` + `_strip_json_fences` helpers (testable in isolation).
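The reply-handling rules listed above could be sketched as follows. Only `new_spec_md` and `responses` come from the commit message; the per-response keys `concern_id` and `response`, and the function names, are hypothetical.

```python
import json
import re

# Matches a reply wrapped in a fenced block (three backticks, optional "json" tag).
_FENCE = re.compile(r"^\s*`{3}(?:json)?\s*\n(.*?)\n\s*`{3}\s*$", re.DOTALL)


def strip_json_fences(reply: str) -> str:
    match = _FENCE.match(reply)
    return match.group(1) if match else reply.strip()


def parse_reviser_reply(reply: str, concern_ids: list[str]) -> tuple[str, dict[str, str]]:
    try:
        doc = json.loads(strip_json_fences(reply))
    except json.JSONDecodeError as exc:
        raise RuntimeError("reviser reply is not JSON") from exc
    if not isinstance(doc, dict) or "new_spec_md" not in doc:
        raise RuntimeError("reviser reply lacks new_spec_md")
    given = {r["concern_id"]: r["response"] for r in doc.get("responses", [])}
    # Pad rather than drop: every concern gets an entry, even if the model skipped it.
    return doc["new_spec_md"], {cid: given.get(cid, "<missing>") for cid in concern_ids}


fence = "`" * 3
body = json.dumps(
    {"new_spec_md": "# Spec", "responses": [{"concern_id": "c1", "response": "fixed"}]}
)
reply = f"{fence}json\n{body}\n{fence}"
new_spec, responses = parse_reviser_reply(reply, ["c1", "c2"])
print(new_spec, responses)
```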

src/llmxive/convergence/revisers/__init__.py
- Package docstring documenting the build_*_reviewspec pattern.

src/llmxive/convergence/reviewspecs.py
- New `build_spec_reviewspec(backend=, repo_root=, project_id=, model=?)`
  returns a LIVE ReviewSpec for the spec stage with the SpecReviser bound
  as `.reviser`. Static `reviewspec_for("clarified")` still returns the
  TodoReviser placeholder; the build_* path is the live wiring (T058 will
  add reviewer-side wiring for the panel).
- Local import of SpecReviser keeps the static-registry import graph
  clean for callers that never touch the live path.

tests/integration/test_spec_reviser.py (8 tests)
- `_scan_markers` handles bracket + bold marker forms; returns empty
  on clean specs.
- `_strip_json_fences` handles fenced + bare JSON.
- End-to-end revise: backend called with system+user; new spec text
  written; markers resolved; ConcernResponse per concern.
- Padded missing responses: backend omits one concern → `<missing>`
  marker preserved (honest no-silent-omission).
- Missing `new_spec_md` → RuntimeError.
- Non-JSON reply → RuntimeError.
- No spec.md in artifacts → ValueError (engine misuse).

Verification
- ruff check src/llmxive/convergence + tests: All checks passed
- mypy src/llmxive/convergence + summarize.py: 0 errors (9 source files)
- pytest tests/integration/test_spec_reviser.py + tests/contract: 51 passed
- pytest broader unit + integration suite: 52 passed (no regressions)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
jeremymanning and others added 8 commits May 30, 2026 15:51
…nalyze)

Research verified live: math.pi/e/tau, golden ratio, scipy.constants CODATA
(c/h/G), and sympy evaluation of every spec example (1+2=3, 1>2=False, identity,
5 km=5000 m, round(pi,2)=3.14). 29 tasks; analyze clean after one fix-loop
(decouple US1 approximate from the US2 constants channel; mixed-claim routing;
constants top authority rank).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oximate comparator (T001-T008)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…13-T017)

US2 (T013/T014): fill/channels/constants.py wraps verify.constants into a
zero-network FetchedSource; AUTHORITY["constants"]=0 (top rank); constants
added to NUMERIC channel list in channels_for; wired in _get_channel.

US5 (T016/T017): compute.py fully replaces the NotImplementedError placeholder
with evaluate() (sympy parse_expr, no eval/exec; arithmetic, comparisons,
percentages, unit conversions, algebraic identities), extract_expression()
(deterministic regex for backend=None), and verify_computational() returning
ComputeVerdict. Reuses approximate.is_valid_rounding for real-valued results.

Also fixes test_fill_wikidata_parse.py authority assertion to use AUTHORITY
lookup instead of hardcoded 1 (broken by the authority re-ranking).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…009-T012,T018-T019)

- resolve.py: mode router before kind dispatch (LLMXIVE_CLAIM_FILL=1); approximate path
  uses _extract_constant_from_text + verify.approximate for precision compare + correction;
  computational path uses verify_computational (sympy); RESULT-kind never goes to compute;
  not_evaluable/no-constant falls through to existing kind dispatch unchanged
- fill/extract.py: present_in_source mode-aware for constants channel only (decimal values,
  never bare integers — FR-003 exact-count gate untouched)
- tests/integration/test_verify_approximate_wireup.py: pi 3.14→VERIFIED, pi 3.15→corrected,
  knot count→exact route, 1+1=2→compute, 1+2=1→corrected (T011)
- tests/real_call/test_verify_pi_e_real.py: pi/e zero-network constants path (T012)
- tests/real_call/test_compute_real.py: arithmetic/comparison/pct/unit-conv via sympy (T019)
- tasks.md: T009/T010/T011/T012/T018/T019 marked [X]

Offline: 1838 passed (+5 vs 1833 baseline), 0 failures.
Real-call: 11 passed (0.30s, zero HTTP for constants path).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- _FILLABLE_KINDS now includes MAGNITUDE and RELATIONAL (fill/service.py)
- channels_for(MAGNITUDE/RELATIONAL) → [wikidata, wikipedia, paper] (fill/channels/__init__.py)
- resolve_magnitude/resolve_relational wire _maybe_fill at NEI/REFUTED sites (claims/resolve.py)
- present_in_source uses entity-name check for MAGNITUDE/RELATIONAL (fill/extract.py)
- subject_query extracts entity name for RELATIONAL, category for MAGNITUDE (fill/subject_query.py)
- wikidata channel resolves referenced Q-IDs to labels (e.g. P36→Canberra) (fill/channels/wikidata.py)
- wikidata channel scans 60 P-claims (up from 20) to reach P36 at position 28 (fill/channels/wikidata.py)
- _chat_reasoning_safe always passes model kwarg to satisfy DartmouthBackend (fill/extract.py)
- Updated spec-017 deferral tests to reflect spec-018 enablement (test_fill_service_blocks.py, test_fill_service_logic.py, test_fill_conflict.py)
- New integration tests T021 + T024 (test_fill_magnitude_wireup.py, test_fill_relational_wireup.py)
- New real-call tests T022 + T025 (test_fill_superlative_real.py, test_fill_relational_real.py)
- T020-T025 marked [X] in specs/018/tasks.md

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…start (T026/T027/T029)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…2e no-regress; all 29 tasks done

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeremymanning
Member Author

Update — claim-trustworthiness stack added (specs 016 → 017 → 018)

Beyond the spec-015 convergence protocol, this branch now adds a three-layer system that closes the fabricated-facts gap surfaced in the Part-7 shakeout (the spec reviser inventing "27,635 prime knots at 13 crossings"; correct = 9,988). All verified with real models/sources (no mocks); full offline gate 1858 passed, 0 regressions.

Spec 016 — Claim-Verification Layer (detective)

Every check-worthy claim a doc-producing agent writes is extracted → registered → substituted with a pointer → resolved (external source or harness-signed execution receipt) → rendered from the verified value. Unresolved claims hard-block via the unified [UNRESOLVED-CLAIM:] marker and auto-route for re-resolution (no routine human input). Receipts are HMAC-signed by the harness; an LLM can never mint or alter one. Verified live: 27,635 fabrication blocked; receipt forgery rejected.
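To illustrate why an HMAC signature blocks forged or altered receipts, here is a generic sketch using the standard library. The receipt fields, the canonical-JSON encoding, and the key handling are assumptions; the project's actual receipt format is not shown in this thread.

```python
import hashlib
import hmac
import json


def sign_receipt(payload: dict[str, str], key: bytes) -> str:
    """Sign a canonical JSON encoding of the receipt with a harness-held key."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def verify_receipt(payload: dict[str, str], signature: str, key: bytes) -> bool:
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(sign_receipt(payload, key), signature)


key = b"harness-only-secret"  # illustrative; a real key would never appear in prompts
receipt = {"claim_id": "c-1", "value": "9988"}
signature = sign_receipt(receipt, key)

valid = verify_receipt(receipt, signature, key)
tampered = verify_receipt({**receipt, "value": "27635"}, signature, key)
forged = verify_receipt(receipt, "0" * 64, key)
print(valid, tampered, forged)  # True False False
```

Because the model never sees the key, any receipt it writes or edits fails verification.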

Spec 017 — Authoritative-Fill (constructive)

When an external claim can't be verified as written, the layer searches authoritative sources (OEIS b-file via the Wikipedia→A-number bridge, Wikipedia, Wikidata, papers, theorem search), extracts the correct value, verifies it is literally present in a fetched source (never model memory), substitutes it, and repairs the citation. Verified live, end-to-end through the real chokepoint: 27,635 → sourced 9,988 (OEIS A002863); capital of Australia Sydney → Canberra; unsourceable claims stay blocked.

Spec 018 — Per-Claim Verification Modes

The verifier picks a mode per claim:

  • exact-count (literal; the 9,988 path — unchanged, no regression)
  • approximate-constant — precision-aware rounding vs a library-backed constants table (math + scipy.constants CODATA, zero-network): "π is 3.14"/"about 3" verify; "π is 3.15" → corrected 3.14
  • computational — safe sympy evaluation (no eval/exec): "1 plus 2 is 1" → REFUTED→3, "1 is larger than 2" → REFUTED, unit conversions, algebraic identities — the evaluator computes, never the LLM
  • source-fact (016/017)
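The approximate-constant mode can be pictured with a small precision-aware check. This sketch assumes a claim is valid when it equals the constant rounded half-up to the claim's own number of decimal places; the project's `is_valid_rounding` may accept other forms, and `check_approximate` is a hypothetical name.

```python
import math
from decimal import ROUND_HALF_UP, Decimal

CONSTANTS = {"pi": math.pi, "e": math.e, "tau": math.tau}


def check_approximate(name: str, claimed: str) -> tuple[bool, str]:
    """Return (verified, constant rounded to the claim's precision)."""
    claim = Decimal(claimed)
    quantum = Decimal(1).scaleb(claim.as_tuple().exponent)  # "3.14" -> 0.01, "3" -> 1
    expected = Decimal(repr(CONSTANTS[name])).quantize(quantum, rounding=ROUND_HALF_UP)
    return claim == expected, str(expected)


pi_314 = check_approximate("pi", "3.14")
pi_315 = check_approximate("pi", "3.15")
pi_3 = check_approximate("pi", "3")
print(pi_314, pi_315, pi_3)  # (True, '3.14') (False, '3.14') (True, '3')
```

The second element doubles as the correction: a refuted "3.15" comes back with "3.14" at the same precision the author chose.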

Plus the 017 fast-follow: magnitude/superlative ("largest planet is Saturn" → Jupiter) and set/relational fills.

Each spec went through the full speckit pipeline (specify → clarify → plan → tasks → analyze → implement → verify). New dependency: sympy (free/open-source). Commits 74abda95aec34068.

jeremymanning and others added 13 commits May 30, 2026 18:55
…o longer drops every claim

Part-7 finding (PROJ-552 spec stage): the extraction model emits a verbatim
claim_text containing an embedded double-quoted paper title (e.g. "A Census of
Knots."). That broke yaml.safe_load, _parse_extraction_reply returned [], so NO
claims were extracted — a silent fabrication passthrough that let the wrong
27,635 prime-knot count survive un-flagged-and-un-filled (panel kicked back to a
human instead of the fill layer correcting 27,635 -> 9,988 from OEIS A002863).

General fix (applies to every project, not just PROJ-552):
- _tolerant_parse_claims: line-oriented recovery parser that scans for the known
  field keys and takes the line remainder as the value (one outer quote pair
  stripped), robust to embedded quotes/colons. _parse_extraction_reply falls back
  to it on any YAML failure OR an empty strict-parse result.
- prompts/claim_extraction.md: explicit quoting rules (single line per field; no
  raw embedded double-quotes — use single quotes for inner marks) to reduce
  malformed YAML at the source.
- 6 new offline regression tests reproducing the exact embedded-quote failure.
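
The line-oriented recovery described above could look roughly like this. It is a sketch under assumptions, not the actual `_tolerant_parse_claims`: only `claim_text` is a field name taken from the commit message, while `kind` and the function name are made up for illustration.

```python
from __future__ import annotations

# "claim_text" comes from the commit message; "kind" is a hypothetical field.
KNOWN_KEYS = ("claim_text", "kind")


def tolerant_parse_claims(reply: str) -> list[dict[str, str]]:
    claims: list[dict[str, str]] = []
    current: dict[str, str] | None = None
    for raw in reply.splitlines():
        line = raw.strip()
        starts_item = line.startswith("- ")
        if starts_item:
            line = line[2:].lstrip()
        for key in KNOWN_KEYS:
            prefix = key + ":"
            if not line.startswith(prefix):
                continue
            # Take the whole line remainder, so embedded quotes and colons
            # inside the value cannot break the parse.
            value = line[len(prefix):].strip()
            # Strip exactly one outer quote pair, leaving inner quotes alone.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
                value = value[1:-1]
            # A list dash or a repeated key starts a new claim record.
            if starts_item or current is None or key in current:
                current = {}
                claims.append(current)
            current[key] = value
            break
    return claims
```

A strict YAML parse would reject the first entry of a reply such as `- claim_text: "Listed in "A Census of Knots.""`, whereas this fallback keeps the title's inner quotes as part of the value.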

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 016-018)

Ran ruff --fix (import sorting, unused imports, __all__/datetime/Optional
modernization) and hand-fixed the remaining non-autofixable findings:
RUF005 (iterable unpacking), B007/RUF059/F841 (unused loop/unpack/locals
renamed to _ or removed as dead code), E402 (moved always-available imports
above pytestmark), RUF002/RUF003 (non-semantic en-dashes -> ASCII hyphen).
str+Enum classes keep the mixin (UP042 noqa) to preserve str() repr.

No runtime behavior change. ruff check . clean; mypy src/llmxive unchanged
(same 63 pre-existing errors); offline gate 1864 passed / 10 skipped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Annotation-only cleanup — no runtime behavior changes:
- Add generic params (dict[str, Any], list[str], Callable[..., Verdict],
  re.Pattern[str], re.Match[str]) across verify/, fill/, results/, state/, claims/.
- Add missing function param/return annotations (backend: Any, Iterator[Path],
  dict[str, Any] returns, channel callable, __getattr__ -> Any).
- Widen repo_root passthrough to str | Path | None in select_mode /
  verify_computational (fixes [arg-type] without changing the forwarded value).
- Make Any returns explicit via bool() wraps and str(m.group(0)).
- Add [mypy-scipy.*] and [mypy-sympy.*] ignore_missing_imports sections.

mypy src/llmxive: Success (0 errors); ruff clean; offline gate 1864 passed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s missing

Part-7 finding (PROJ-552 spec stage): a single reviewer (scope_fidelity) emitted
an opening `---` + valid metadata + concerns but NO closing `---` delimiter (the
reasoning model ended after concerns: / the endpoint hung mid-response). The
strict both-delimiters regex (^---\n(.*?)\n---) matched nothing, so
_parse_response raised RuntimeError "no YAML frontmatter" — crashing the ENTIRE
spec panel/run instead of degrading gracefully.

General fix (every reviewer, every stage):
- _extract_frontmatter: recovers the YAML frontmatter in three shapes — proper
  both-delimiters (fast path), opening + later doc-boundary (---/...), and
  opening with NO closing delimiter (take the longest leading line-block that
  still parses to a non-empty YAML mapping, dropping any unfenced trailing
  prose). Only a response with no opening --- at all is rejected.
- panel_review_block.md: explicit instruction to always emit BOTH delimiters and
  start at column 0 with no leading blank lines.
- 8 new offline tests incl. the exact PROJ-552 failure shape.
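
The three recovery shapes can be sketched as below. To keep the example dependency-free, a crude `looks_like_mapping` check stands in for `yaml.safe_load`; that stand-in, and both function names, are assumptions and are far less strict than a real YAML parse.

```python
from __future__ import annotations

import re
from typing import Optional

_KEY_LINE = re.compile(r"^[A-Za-z_][\w-]*:(\s.*)?$")


def looks_like_mapping(block: str) -> bool:
    """Crude stand-in for 'parses to a non-empty YAML mapping'."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    if not lines or not _KEY_LINE.match(lines[0]):
        return False
    # Every line is a top-level key, or an indented / list continuation.
    return all(_KEY_LINE.match(ln) or ln.startswith((" ", "- ")) for ln in lines)


def extract_frontmatter(response: str) -> Optional[str]:
    text = response.strip()
    if not text.startswith("---\n"):
        return None  # no opening delimiter at all: rejected
    body = text[4:].splitlines()
    # Shapes 1 and 2: cut at a closing '---' or a '...' document boundary.
    for i, ln in enumerate(body):
        if ln.strip() in ("---", "..."):
            body = body[:i]
            break
    # Shape 3 (and general fallback): longest leading block that still parses,
    # which drops any unfenced trailing prose.
    for end in range(len(body), 0, -1):
        candidate = "\n".join(body[:end])
        if looks_like_mapping(candidate):
            return candidate
    return None
```

With a reply that opens with `---`, lists `verdict` and `concerns`, and then trails off into prose without a closing delimiter, the sketch returns only the metadata lines instead of raising.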

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pping

Deeper root cause of the PROJ-552 reviewer crash (commit 1c8451e handled the
missing closing `---`, but the run still died identically). _parse_response
stripped a ```yaml fence BEFORE extracting frontmatter — and a reviewer's prose
body routinely contains a fenced YAML/code example. With no closing `---` on the
frontmatter, _CODE_FENCE_RE.search hijacked `candidate` to the prose example's
contents (e.g. "foo: bar"), which has no `---`, so extraction returned None and
the whole panel/run crashed.

Fix: run _extract_frontmatter on the RAW stripped response FIRST; only fall back
to unwrapping a ```yaml fence when the raw response has no recoverable
frontmatter (the wholly-fence-wrapped case). +4 tests: fenced example in prose
(the exact crash), whole-response fence wrap, and proper-delims-with-fenced-prose
no-hijack.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The claim layer was not idempotent across convergence rounds. Two root
causes are fixed here:

A. Prose-preserving render (claims/pointer.py). A {{claim:<id>}} pointer
   stands in for the claim's FULL raw_text sentence, but render() replaced
   it with the BARE resolved_value — collapsing a whole sentence to a number
   (PROJ-552 garble). render() now reconstructs the prose: for a VERIFIED
   NUMERIC/RESULT claim it swaps ONLY the asserted numeric token (selected by
   thousands-separator/idempotency heuristic) for resolved_value; an
   already-correct assertion is returned byte-for-byte unchanged. ENTITY_FACT/
   RELATIONAL/MAGNITUDE leave prose intact unless the object span is locatable.
   A non-verified claim now PRESERVES the prose and APPENDS one inline
   [UNRESOLVED-CLAIM:] marker instead of replacing the sentence with a marker.

B. Idempotent extraction (claims/gate.py strip_claim_artifacts + service.py).
   process_document now strips prior-round [UNRESOLVED-CLAIM:] markers and
   stray {{claim:<id>}} pointers BEFORE extraction, so the layer no longer
   re-extracts its own marker bodies as new claims or accumulates markers.

New render contract: a VERIFIED pointer renders the claim's sentence with only
the asserted token swapped for the verified value (idempotent); a non-verified
pointer renders the sentence followed by one [UNRESOLVED-CLAIM:] marker.
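
A rough sketch of the render contract and the pre-extraction strip, for illustration only. The regexes, the marker's reason text, and the token-selection heuristic (prefer a number with a thousands separator) are assumptions; the real `render()` and `strip_claim_artifacts` handle more claim kinds.

```python
from __future__ import annotations

import re

_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+|\d+(?:\.\d+)?")
_MARKER = re.compile(r"\s*\[UNRESOLVED-CLAIM:[^\]]*\]")
_POINTER = re.compile(r"\{\{claim:[^}]+\}\}")


def _norm(number: str) -> str:
    return number.replace(",", "")


def render_claim(raw_text: str, resolved_value: str, verified: bool,
                 reason: str = "unverified") -> str:
    if not verified:
        # Keep the prose and append one marker (marker body is an assumption).
        return f"{raw_text} [UNRESOLVED-CLAIM: {reason}]"
    tokens = list(_NUMBER.finditer(raw_text))
    # Idempotency: an already-correct sentence is returned unchanged.
    if not tokens or any(_norm(m.group(0)) == _norm(resolved_value) for m in tokens):
        return raw_text
    # Heuristic: prefer the token written with a thousands separator.
    target = next((m for m in tokens if "," in m.group(0)), tokens[0])
    return raw_text[:target.start()] + resolved_value + raw_text[target.end():]


def strip_claim_artifacts(text: str) -> str:
    """Remove prior-round markers and stray pointers before re-extraction."""
    return _POINTER.sub("", _MARKER.sub("", text))
```

Rendering the same sentence twice yields the same string, and stripping a marked sentence gives back the original prose, which is what keeps the layer stable across convergence rounds.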

Updated the chokepoint test's _make_claim fixture (raw_text now carries a
numeric token) — its old "some number"/resolved="9988" pair encoded the OLD
bare-value-substitution contract, which the prose-preserving render supersedes.

Added 13 tests (pointer prose-preservation incl. the exact PROJ-552 garble +
idempotency; gate strip_claim_artifacts; process_document double-run stability).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…shared parser

C. Drop the redundant F-19 reviser pass (convergence/revisers/_self_consistency.py).
   The reviser chokepoint ran F-19 _ground_factual_claims THEN 016 _verify_claims
   on its output; 016 then re-extracted F-19's inline [UNVERIFIED] marker reasons
   as new "claims" and the two text-mutation models fought (PROJ-552 root cause 2).
   _clean_citations now runs ONLY _verify_claims (spec 016 is the SSoT).
   _ground_factual_claims remains defined for other importers but no longer runs
   in the chokepoint; docstring updated. The grounding *service* (used BY 016
   resolvers) is untouched.

D. Extraction precision (claims/extract.py). A purely promotional "standing"
   statement (well-established / peer-reviewed / community-standard / widely-used /
   well-known / established-reference / gold-standard) with no crisp checkable core
   is now dropped — it cannot be substantiated and otherwise left a residual
   [UNRESOLVED-CLAIM:] marker that blocked convergence (root cause 5). A statement
   with a salient NUMBER or explicit citation still passes.

E. Shared tolerant parser (claims/extract.py + agents/grounding_guard.py). The
   grounding guard had a SECOND YAML claim parser with the same embedded-quote
   fragility already fixed in claims/extract — exactly why the bug recurred. The
   tolerant field-recovery is now a shared tolerant_field_entries(); the grounding
   guard's _parse_extraction_reply falls back to it on YAML failure or no usable
   claims, recovering an embedded-quote cited claim ("A Census of Knots.") instead
   of silently dropping every claim. Strict-path behavior preserved.
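
Item D's filter might be approximated as follows. This is a hedged sketch: the phrase list mirrors the commit message, but `should_drop_promotional` and the citation pattern are hypothetical and much simpler than a real "crisp checkable core" test.

```python
import re

_PROMOTIONAL = re.compile(
    r"\b(?:well[- ]established|peer[- ]reviewed|community[- ]standard|"
    r"widely[- ]used|well[- ]known|established[- ]reference|gold[- ]standard)\b",
    re.IGNORECASE,
)
_HAS_NUMBER = re.compile(r"\d")
# Assumed citation cues; the real extractor's notion of a citation may differ.
_HAS_CITATION = re.compile(r"\bdoi\b|et al\.", re.IGNORECASE)


def should_drop_promotional(statement: str) -> bool:
    """Drop a 'standing' statement unless it carries a number or a citation."""
    if not _PROMOTIONAL.search(statement):
        return False
    return not (_HAS_NUMBER.search(statement) or _HAS_CITATION.search(statement))
```

Dropping these statements at extraction time avoids leaving a marker that no resolver could ever clear.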

Added 4 tests (F-19-not-invoked-in-reviser-path; promotional-statement drop +
number/citation survival; grounding-guard embedded-quote recovery).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…actually route

Part-7 finding (PROJ-552): a project that kicks back at a doc-stage convergence
panel was STUCK re-running that stage forever and never escalating. Root cause: a
non-converging panel writes convergence_kickback.yaml then RAISES
StagePanelKickback from inside the agent run. That raise propagated straight out
of run_one_step (caught only by the CLI as a FAIL), so _decide_next_stage — the
ONLY place consume_convergence_kickback runs — was never reached. The sentinel
was never consumed, current_stage never advanced, and the per-stage kickback cap
never incremented (so the 3-strikes→human escalation never fired either). The
adaptive-kickback resilience (F-20 Part B) worked in its unit tests but was dead
in the real `llmxive run` path because every test exercised _decide_next_stage in
isolation, never the raise-through-run_one_step seam.

Fix: run_one_step now catches StagePanelKickback (controlled non-convergence) and
StagePanelEscalation (engine failure) around the speckit agent call and falls
through to _decide_next_stage, which consumes the sentinel and routes the project
to the content stage (or to HUMAN_INPUT_NEEDED at the cap / on engine failure)
instead of crashing the run loop. +2 regression tests exercising the REAL
run_one_step exception handling + real _decide_next_stage/_kickback routing.
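
The control-flow change can be reduced to a few lines. The exception names come from the commit message; everything else here (the callables passed in, the empty exception bodies) is a stand-in for illustration.

```python
from typing import Callable


class StagePanelKickback(Exception):
    """Controlled non-convergence (stand-in definition)."""


class StagePanelEscalation(Exception):
    """Engine failure (stand-in definition)."""


def run_one_step(run_agent: Callable[[], None],
                 decide_next_stage: Callable[[], str]) -> str:
    try:
        run_agent()
    except (StagePanelKickback, StagePanelEscalation):
        # Expected outcomes: swallow the raise so routing below still runs
        # and can consume the sentinel written by the panel.
        pass
    # Any other exception propagates and still fails the run.
    return decide_next_stage()
```

The regression the commit describes is exactly the seam this shows: a test that calls the routing function directly never notices that the raise prevents it from being reached.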

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Part-7 finding (PROJ-552 plan stage): the planner emitted a
contracts/knot_record.schema.yaml whose body contained a `---` document
separator (a second YAML doc at line 115). The FR-007 guard
(_research_guard.assert_data_model_contracts_consistent) parses each schema with
yaml.safe_load (single-document) and correctly rejected it as invalid — but the
rejection hard-crashes the plan stage (plan_cmd unlinks all artifacts + re-raises
→ CLI FAIL), stranding the project at `clarified`.

This commit addresses the TRIGGER: the planner prompt now explicitly forbids an
internal `---` separator and tells the model to emit a separate
`<!-- FILE: contracts/<name>.schema.yaml -->` block per schema.

NOTE (deeper robustness gap, tracked separately in notes/spec-015-review-status):
the deterministic plan guards (FR-005/006/007) fail-closed by raising, with NO
revision loop — a malformed planner artifact strands the project instead of
driving a bounded planner re-run with the guard feedback (the identify→revise
philosophy of #239). That fix is a careful plan-stage flow change for a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…/ChunkedEncoding)

Part-7 finding (PROJ-552 plan stage, run #12): the run crashed with a raw
('Connection broken: IncompleteRead(77671 bytes read)', ...) — the flaky
Dartmouth endpoint dropped the connection while streaming the planner's ~75KB
multi-file reply. The DartmouthBackend's transient-error classifier had
"connection reset"/"connection refused" but NOT "connection broken" /
IncompleteRead / ChunkedEncodingError, so the drop was classified Permanent → no
retry → the plan stage failed and stranded the project at `clarified`.

Fix: add the connection-dropped-mid-stream markers (connection broken,
incompleteread, chunkedencodingerror, connection aborted, remotedisconnected,
remote end closed, broken pipe, eof occurred) to the transient set so
_retry_with_backoff retries them. Also extracted the marker tuple + match into a
module-level _is_transient_error_text() (was buried in a closure, untestable) and
added regression tests for the exact failure + a no-over-match guard on genuine
permanent errors.
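
A sketch of the module-level classifier, using the markers listed above. The function name follows the commit message without the leading underscore; matching on a lower-cased message is an assumption suggested by the lowercase markers.

```python
# Markers named in the commit message; the first two predate this fix.
TRANSIENT_MARKERS = (
    "connection reset",
    "connection refused",
    "connection broken",
    "incompleteread",
    "chunkedencodingerror",
    "connection aborted",
    "remotedisconnected",
    "remote end closed",
    "broken pipe",
    "eof occurred",
)


def is_transient_error_text(text: str) -> bool:
    """True when the error message looks like a dropped or refused connection."""
    lowered = text.lower()
    return any(marker in lowered for marker in TRANSIENT_MARKERS)
```

Keeping the predicate at module level means it can be tested directly, including the guard that genuine permanent errors are not over-matched.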

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Part-7 (PROJ-552 plan run #13): after the single-document fix, the planner hit a
SECOND YAML pitfall — an unquoted schema description "... (target: ≥95%)" whose
bare ": " made yaml.safe_load read it as a nested mapping ("mapping values are
not allowed here"), again rejected by the FR-007 guard. Prompt now requires
quoting any string value containing a colon, '#', leading ≥/≤/%, or brackets.
(The robust backstop — a bounded planner revision-with-feedback loop on guard
failure — is implemented separately; prompt nudges alone are whack-a-mole.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A malformed planner artifact (e.g. a contracts schema with an internal
`---` multi-document marker, or an unquoted `: ` that breaks YAML)
previously hard-crashed PlannerAgent.write_artifacts on the first
deterministic-guard failure — marking the run FAILED and stranding the
project at 'clarified' with no retry.

Wrap the split → FR-005 → write → FR-007 → FR-006 pipeline in a bounded
retry loop:
- Refactor the write+guard body into _write_and_validate(ctx,
  mechanical_output, response_text) -> list[str], which raises the guard
  exception (and unlinks partial writes) on failure, exactly as before.
- write_artifacts now calls it in a loop: on a guard exception, if
  attempts remain, re-call the planner LLM via _revise_with_feedback with
  ONE corrective user message quoting the exact guard error and demanding
  all five files re-emitted in the FILE-marker format; else re-raise the
  last guard exception (fail-closed preserved).
- Cap: MAX_PLAN_REVISION_RETRIES = 2 (up to 3 total attempts). Each retry
  logged at INFO.
- Offline-safety gate: the corrective re-call builds the backend via
  make_backend(ctx.default_backend.value); if it returns None or raises,
  the loop does NOT retry and re-raises the guard exception. Offline unit
  tests therefore stay network-free.
- The plan convergence panel runs only AFTER the artifacts pass the guards.
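
The loop above can be sketched as follows. Only `MAX_PLAN_REVISION_RETRIES = 2` and the overall shape come from the commit message; `GuardError` and the injected callables are simplified stand-ins for the real guard exceptions, `_write_and_validate`, `make_backend`, and `_revise_with_feedback`.

```python
from __future__ import annotations

from typing import Callable

MAX_PLAN_REVISION_RETRIES = 2  # up to 3 total attempts


class GuardError(Exception):
    """Stand-in for the deterministic guard exceptions."""


def write_artifacts(
    response_text: str,
    write_and_validate: Callable[[str], list[str]],
    make_backend: Callable[[], object],
    revise_with_feedback: Callable[[object, str], str],
) -> list[str]:
    for attempt in range(MAX_PLAN_REVISION_RETRIES + 1):
        try:
            return write_and_validate(response_text)
        except GuardError as exc:
            if attempt == MAX_PLAN_REVISION_RETRIES:
                raise  # attempts exhausted: fail closed
            try:
                backend = make_backend()
            except Exception:
                backend = None
            if backend is None:
                raise exc  # offline: do not retry
            # One corrective message quoting the exact guard error.
            response_text = revise_with_feedback(backend, str(exc))
    raise AssertionError("unreachable")
```

Because the backend is built inside the failure branch, a run with no backend available behaves exactly like the old fail-closed path.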

Tests:
- tests/unit/test_plan_revision_loop.py drives the real write_artifacts /
  _write_and_validate with a fake backend collaborator: invalid-then-valid
  retries-and-succeeds (both PROJ-552 failure modes), all-invalid raises
  the last guard exception with no partial artifacts, and no-backend /
  make_backend-raises both fail closed without retrying.
- Updated two existing phase4 tests that encoded the old
  hard-crash-on-first-failure contract (template-reject unlink, bad-URL
  unlink) to force the offline path (make_backend -> None), asserting the
  same fail-closed-and-unlink behavior via the loop's no-backend branch.

ruff + mypy clean; offline gate 1899 passed (baseline 1894 + 5), 0 failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…el to human

Part-7 finding (PROJ-552 plan panel, run #14): the qwen endpoint hung past its
180s deadline; the backend's own retry+backoff exhausted and surfaced a
TransientBackendError. _stage_panel's engine-failure handler caught it like any
exception → wrote human_input_needed.yaml + raised StagePanelEscalation → the
project was stranded at human_input_needed. But a transiently-degraded model
endpoint is NOT human-actionable: a human cannot fix a hung endpoint.

Fix: catch TransientBackendError separately in run_stage_panel and re-raise it
AS-IS (no human_input_needed.yaml, no StagePanelEscalation wrap) so the run fails
transiently and the project STAYS at its current stage to retry on the next
scheduler tick when the endpoint recovers. run_one_step does not catch
TransientBackendError, so it surfaces as a transient CLI FAIL with no stage
change. Genuine (non-transient) engine failures still escalate to human as
before. +1 regression test.
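
The distinction between transient and human-actionable failures might be expressed like this. The exception names are from the commit message; the function signature and the injected callables are assumptions for the sake of a self-contained sketch.

```python
from typing import Callable


class TransientBackendError(Exception):
    """Backend retries exhausted on a transient fault (stand-in definition)."""


class StagePanelEscalation(Exception):
    """Genuine engine failure that needs a human (stand-in definition)."""


def run_stage_panel(run_engine: Callable[[], str],
                    write_human_input_needed: Callable[[str], None]) -> str:
    try:
        return run_engine()
    except TransientBackendError:
        # Not human-actionable: re-raise as-is, write nothing, and let the
        # project stay at its current stage for the next scheduler tick.
        raise
    except Exception as exc:
        write_human_input_needed(str(exc))
        raise StagePanelEscalation(str(exc)) from exc
```

Ordering matters here: the narrower `except` clause must come first, otherwise the generic handler would still wrap the transient error.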

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeremymanning
Member Author

Part-7 end-to-end pipeline hardening (driving PROJ-552 live)

Drove a real project (PROJ-552, knot-complexity, mathematics) through the live pipeline stage-by-stage per issue #239 Part 7, and fixed 8 general bugs — each surfaced by a real failure and fixed so it benefits every project, not just the test case:

| # | Bug (effect before fix) | Commit |
|---|---|---|
| 1 | Claim extraction dropped all claims on embedded-quote YAML (paper titles) → fabrications passed un-checked | 1eb590d3 |
| 2 | Reviewer crashed the whole panel on frontmatter edge cases (missing closing ---, fenced prose) | 1c8451ee, ff2ba357 |
| 3 | Claim layer not idempotent under the convergence loop → garbled/accumulating prose (016 vs F-19 conflict) | d2b76008, 14814f40 |
| 4 | Kickbacks never routed: StagePanelKickback bypassed _decide_next_stage → project looped forever, no escalation | 1032b6b0 |
| 5A | Planner emitted invalid YAML schemas (internal ---, unquoted colons) → FR-007 hard-crash | 84dd3137, 67f1a001 |
| 5B | Plan deterministic guards hard-crashed instead of driving a bounded planner revision-with-feedback loop | 0d230f2d |
| 6 | Connection-dropped-mid-stream (IncompleteRead/ChunkedEncoding) not classified transient → plan crash | 7ea73cc0 |
| 8 | A transient backend failure (hung endpoint) wrongly escalated a panel to human_input_needed | 26d1c14a |
|  | Cleared accumulated lint/type debt from specs 016/017/018 (ruff 217→0, mypy 63→0) | cf9f0d79, 00d5f171 |

Recurring lesson: controlled/expected failures (panel non-convergence, malformed planner output, connection drops, hung endpoints) must route / revise / fail-transiently, never hard-crash and strand the project.

Proven end-to-end: the convergence protocol works — PROJ-552 went brainstorm → idea → drift-kickback → flesh_out realign → idea-converge → spec converged (with the correct 9,988-prime-knots-at-13-crossings count verified against a real 2025 J. Knot Theory Ramifications DOI), and the plan stage now self-heals malformed artifacts + survives transient endpoint failures.

Offline gate: 1900 passed, 0 failures; ruff + mypy clean. Traversal continues (plan panel → tasks → … → publisher, then the 9-domain repetition) once the (currently-degraded) model endpoint recovers.

🤖 Generated with Claude Code

jeremymanning and others added 5 commits May 31, 2026 07:57
…w notes

Persists PROJ-552's pipeline state so the traversal can resume from the plan
panel on another machine or in a GitHub Action while this laptop is suspended.
Stage=clarified (spec converged at specs/002 with the verified 9,988 knot count);
plan convergence panel is next. Includes the claims/citations registries,
librarian cache, run-log telemetry, and the living review tracker
(notes/spec-015-review-status.md) documenting all 8 fixes + the resume command.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The llmxive-pipeline workflow ran `llmxive run` but never committed the result,
so an Action's progress (advanced stage, new artifacts, run-log) was discarded
when the ephemeral runner tore down. Add an always() commit+push step so a
GitHub-Action run (e.g. driving PROJ-552 while a laptop is suspended) actually
persists. [skip ci] avoids retriggering. Pushes to the checked-out branch via
HEAD:${GITHUB_REF_NAME}.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nce-protocol

# Conflicts:
#	.github/workflows/spec015-calibration.yml
@jeremymanning jeremymanning merged commit df968f0 into main May 31, 2026
3 of 5 checks passed
jeremymanning added a commit that referenced this pull request May 31, 2026
…-552 plan panel

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>