Commit 6173f33
Phase 4 validation (planner + tasker) + deterministic dataset resolver (#223)
* feat(014): Planner research guards + Tasker per-round inspection capture
T003: new canonical _research_guard.py (FR-005/006/007, stdlib only):
IncompleteArtifactSet/UnreachableReference/InconsistentDataModel +
assert_artifact_set_complete/assert_urls_reachable/assert_data_model_contracts_consistent.
T004: wire the three gates into PlannerAgent.write_artifacts; unlink every
artifact written this invocation on any raise (parity with guard_emit).
T005: capture() gains optional rounds=, persisted under top-level 'rounds'
(default []); escalated added to valid outcomes; rounds added to required keys.
T006: _maybe_write_inspection reads agent._inspection_rounds and passes rounds=.
T007: TaskerAgent accumulates one sub-record per analyze round (observability
only; no decision/branch reads it).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(014): scripts/validate_phase4.py driver (preflight/reset/run/verify + manifests)
T008: preflight (Dartmouth key, runner import, stage==clarified else FR-019
decline, spec.md real, inspections dir writable).
T009: FR-018 reset (delete Phase-4 outputs + memory markers, PRESERVE spec.md).
T010: run with LLMXIVE_INSPECTION_DIR set; capture exit + run-id.
T011: post-run verify — stage chain, five plan artifacts + tasks.md, >=10 T###
lines, FR-010 ordering (check_task_ordering), FR-012 constraint-non-deletion
(fr_sc_counts across Mode-B rounds), FR-020 Constitution Check.
T022/T023: emit_carry_forward + emit_phase_report per the contracts. Pure
helpers (check_task_ordering, fr_sc_counts, constitution_check_ok) importable
by tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(014): Phase-4 regression + schema tests; self-guard destructive Phase-3 e2e
tests/integration/test_phase4_plan_tasks.py (T016-T021,T025): FILE-marker split,
FR-005 completeness, FR-008 template reject (real PlannerAgent.write_artifacts
unlink), FR-006 URL reachability via a REAL local http.server (200 pass; 404/500/
connection-refused raise; Planner unlinks on bad URL), FR-007 consistency,
FR-016(c) prose-stub Mode-A reject, FR-016(d) diff-leak, FR-016(e) header-clobber,
FR-012 constraint non-deletion, FR-016(f) analyze-loop cap escalation (real
tasks_cmd loop, synthetic analyze/Mode-B; no real LLM), FR-010 ordering, inspection
schema incl rounds + _redact no-secrets, carry-forward + phase-report schema.
test_phase3_specify_clarify.py: the spec-011 gated e2e is DESTRUCTIVE (rolls
PROJ-261 back to project_initialized, deletes spec.md). Skip it unless the project
is still at its Phase-3 entry stage, so it can no longer clobber the Phase-4 input.
ruff clean on all new files; stdlib-only guard; no new pip dependency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs(014): mark T003-T011,T016-T023,T025-T028 done in tasks.md
Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002)
left unchecked for the parent to complete.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(audit): body-density no longer mis-flags table/diagram-heavy artifacts
The template-vs-real auditor stripped fenced blocks (incl. mermaid) before
measuring section bodies and counted parent headings as empty, so a
substantive data-model.md (ER diagram + per-entity attribute tables + fenced
CSV schemas) classified 'partial'. This blocked the Planner from advancing
ANY project past 'clarified'.
Now tables/fenced blocks/lists count as real content and parent-of-subsection
headings are not 'short'; genuinely-empty/stub sections are still flagged and
literal templates are still caught by the phrase/bracket rules.
Found during spec-014 Phase-4 validation; cited by failing inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json
(TemplateRefused body_density_short>=60pct=9/13 on a real data-model.md).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(014): FR-007 = robust structural check, not fragile 1:1 name match
Real PROJ-261 planner output revealed the FR-007 guard was over-specified:
it required every data-model entity to have a same-named contracts/ schema
(and vice versa), but the Planner prompt only mandates >=1 schema, and schema
filenames legitimately differ from entity headings (e.g.
code_duplication_metrics.schema.yaml for a CloneDensityMetric entity). It also
mis-counted 'Data Flow'/'Entity Relationships'/CSV-filename headings as entities.
FR-007 now verifies (a) data-model.md defines real entities (table/diagram/
headings) and (b) every contracts/*.yaml is a non-empty parseable schema;
cardinality + naming are unconstrained. Updated spec/contracts/data-model and
the regression tests (incl. an explicit test that the prior real-world
mismatch now passes). Cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(audit): don't learn structural task labels ([US1],[Story]) as template phrases
The literal-template-phrase rule extracted [US1]/[US2]/[US3]/[Story] from
tasks-template.md and then flagged any real tasks.md (which MUST use those
format labels) as 'template'. This blocked the Tasker from ever committing a
tasks.md. Structural labels ([P],[ID],[TaskID],[Story],[USn]) are now excluded
from the learned placeholder set; genuine placeholders ([FEATURE NAME],
[Entity], [Service], ...) still trigger template detection, so the template
files themselves remain correctly classified.
Found during spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/tasker.json
(TemplateRefused literal_template_phrases>=3 sample=['[Story]','[US1]','[US2]']).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(014): add --force re-validation rollback to validate_phase4 driver
A prior partial run (planner advanced clarified->planned, tasker then failed)
left a project at 'planned', so the FR-019 preflight correctly declined to
re-run. --force rolls such a project back to 'clarified' (logged in
history.jsonl) so the full planner->tasker chain can be re-validated from the
canonical entry state. Default FR-019 decline behavior is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(audit): bracket-density rule ignores fenced/diagram/link content
Rule 2 (unfilled_bracket_density>=6) counted bracketed node labels inside a
fenced ASCII/mermaid data-flow chart (e.g. '[Dataset Download] -> data/raw/')
as unfilled placeholders, rejecting a real planner data-model.md as 'template'.
Rule 2 now scans a view with fenced blocks, HTML comments, and markdown links
stripped, and excludes structural labels; Rule 1 (learned phrases) still uses
the full text so genuine templates remain caught (verified: plan/spec/tasks
templates still classify 'template'). Also gave the planner-guard test a real
.specify/templates dir so it exercises Rule 1 like production.
Found during spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json
(unfilled_bracket_density on a fenced data-flow chart).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(014): step driver to 'analyzed'; align spec to best-effort cap-hit
Validation finding: the Tasker advances across TWO runner steps (planned->tasked,
tasked->analyzed) per the pipeline graph (STAGE_AFTER_AGENT), so a fixed
--max-tasks 2 left the project stuck at 'tasked'. _run_pipeline now steps one
agent at a time until a terminal Phase-4 stage and STOPS at 'analyzed' (never
invoking the Phase-5 implementer).
Per the 2026-05-21 decision, analyze-loop cap-hit WITHOUT convergence is
best-effort: the Tasker accepts tasks.md, records converged:false, and the
project advances to 'analyzed' (downstream reviewers catch issues);
human_input_needed is reserved for an explicit Mode-B escalate verdict or a
backend failure. Updated spec FR-013 / Background / US1 / US2 / US3 / edge case
/ data-model, and added a regression test for the best-effort cap-hit path
(the explicit-escalate path test is retained).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(014): raise per-step timeout to 3600s; allow resume from mid-Phase-4 stage
The Tasker re-runs its full 5-round analyze loop on each runner step, and the
second step (tasked->analyzed) exceeded the 1900s per-step timeout. Raised to
3600s. Also: when a run is interrupted mid-Phase-4 (project left at planned/
tasked/analyze_in_progress), the driver now RESUMES from there (no reset, no
rollback) and steps to a terminal stage, instead of requiring --force. FR-019
still declines projects that have COMPLETED Phase 4 (analyzed+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(014): FR-006 URL extraction strips wrapping backticks (false-404 fix)
PROJ-262's planner cited a URL wrapped in markdown backticks; _URL_RE captured
the closing backtick into the path ('.../realKnownCause/`'), producing a false
404. Backtick is now excluded from the URL/doi character classes and added to
the trailing-strip set. Regression test added. (Note: the planner is also
citing genuinely-dead/irrelevant dataset URLs for PROJ-262 — figshare 404, NAB
dir 404 — which FR-006 correctly hard-fails per the 2026-05-21 strict-mode
decision; that is the gate working, distinct from this extraction bug.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
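A minimal sketch of the backtick fix, assuming a simplified pattern. `URL_RE`, `TRAILING`, and `extract_urls` are hypothetical stand-ins; the real `_URL_RE` is not shown in this message and is likely more involved (it also handles DOIs).

```python
import re

# Hypothetical simplified pattern: the backtick is excluded from the URL
# character class, so a closing markdown backtick ends the match.
URL_RE = re.compile(r"https?://[^\s`<>)\]]+")

# Characters stripped from the end of a match (backtick included as a backstop).
TRAILING = ".,;:`"


def extract_urls(text: str) -> list:
    """Return URLs found in text, without wrapping markdown backticks."""
    return [m.group(0).rstrip(TRAILING) for m in URL_RE.finditer(text)]
```

With the backtick inside the character class, a citation such as `https://example.org/data/realKnownCause/` wrapped in backticks would have been captured with the closing backtick appended to the path, which is the false-404 shape described above.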
* fix(PROJ-262): correct dead QM9 dataset DOI in Phase-3 spec.md
The Phase-3 spec.md cited a dead QM9 DOI (10.6084/m9.figshare.9981994 -> HTTP
404), which the Planner faithfully carried into research.md, so Phase-4 FR-006
(reachability) correctly hard-failed the plan. Replaced with the canonical,
verified-reachable QM9 dataset DOI 10.1038/sdata.2014.22 (Ramakrishnan et al.,
Sci Data 2014; HTTP 200). A Verified-Accuracy (constitution Principle II) fix.
Surfaced by spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-262-predicting-molecular-dipole-moments-with/planner.json.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* design(dataset-resolver): deterministic web-search dataset resolution for the Planner
Brainstormed design (Approach A): a librarian/dataset_resolver.py module, called
by the Planner's mechanical step, finds real datasets via HF Hub + figshare/
Zenodo/DataCite + reused Semantic Scholar/arXiv, verifies reachability + a
sample-stream format sniff, and injects the top-N verified candidates per
dataset into the Planner prompt (cite-only, never invent). Removes URL
generation from the LLM; FR-006 stays as the safety net. Root-cause fix for the
PROJ-262 hallucinated-dataset-URL finding from spec-014.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* plan(dataset-resolver): bite-sized TDD implementation plan
8 tasks: DatasetCandidate + HF Hub source; figshare/Zenodo/DataCite sources;
sample-stream format sniff; verify_candidate (reuses verify.py); intent
extraction + resolve_datasets top-N orchestration; manifest + escalation +
cite-only planner block; wire into Planner; full suite + real PROJ-262 re-run.
All real-call tests (HF Hub + registry APIs + local http.server fixture).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): DatasetCandidate + HuggingFace Hub source
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): figshare/Zenodo/DataCite sources
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): sample-stream format sniff
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): verify_candidate (reachability + sniff, reuses verify.py)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): intent extraction + resolve_datasets top-N orchestration
HF candidates now point at the resolve URL of an actual data file (csv/parquet/
etc.) instead of the HTML landing page, so reachability+sniff verification can
succeed. The landing page is HTML and is correctly rejected by the sniffer; the
design calls for the HF resolve URL / streaming first rows. Data file is picked
deterministically by extension preference then path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
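The deterministic pick ("extension preference then path") might look like the sketch below. The preference order in `PREFERRED_EXTS` is an assumption for illustration; the real `_HF_DATA_EXTS` list and its ordering are not given in this message, and `pick_data_file` is a hypothetical name.

```python
from typing import Optional, Sequence

# Hypothetical preference order; the real _HF_DATA_EXTS list may differ.
PREFERRED_EXTS = (".parquet", ".csv", ".xyz", ".sdf")


def pick_data_file(paths: Sequence[str]) -> Optional[str]:
    """Pick one data file: best-ranked extension first, then lexicographic path."""
    ranked = []
    for path in paths:
        for rank, ext in enumerate(PREFERRED_EXTS):
            if path.lower().endswith(ext):
                ranked.append((rank, path))
                break
    # min() over (rank, path) tuples is independent of listing order.
    return min(ranked)[1] if ranked else None
```

Sorting on a `(rank, path)` tuple means the same repository listing always yields the same file, regardless of the order in which the Hub returns siblings.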
* feat(dataset-resolver): manifest write + planner block + unresolved helper
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dataset-resolver): xyz/sdf/tar sniffers + granular candidates_tried audit
FIX 1: add xyz/sdf/tar detectors to _detect_and_parse so the HF picker
(_HF_DATA_EXTS advertises .xyz/.sdf) and the sniffer are consistent -- QM9 is
natively .xyz, which was picked then wrongly rejected. xyz = integer atom-count
header or "<El> x y z" coordinate lines; sdf/mol = V2000/V3000 or "$$$$"
delimiter; tar = "ustar" magic at offset 257.
FIX 2: split the generic "rejected" candidates_tried status into "verified",
"unreachable" (reachability step failed) and "wrong_format" (reachable but the
sniff failed) via new probe_candidate() + VerifyResult; verify_candidate keeps
its original contract. Verified-selection behavior unchanged.
FIX 4: document that the Semantic Scholar/arXiv paper-linked-data source is
DEFERRED (yields paper pages, not sniffable files) and that repo_root is
reserved for that future reuse (Task 7's plan_cmd still passes it).
Real-call/local-http.server tests added for xyz + sdf detection and for the
404->unreachable / HTML->wrong_format audit statuses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
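A rough sketch of the three detectors from FIX 1, using only the signatures named above. `sniff_format` is a hypothetical name, not the real `_detect_and_parse`, and these heuristics are intentionally loose (for example, any file whose first line is a bare integer is treated as xyz).

```python
import re
from typing import Optional

# "<El> x y z": element symbol followed by three numeric coordinates.
_COORD_RE = re.compile(
    r"^[A-Z][a-z]?\s+-?\d+(\.\d+)?\s+-?\d+(\.\d+)?\s+-?\d+(\.\d+)?\s*$"
)


def sniff_format(sample: bytes) -> Optional[str]:
    """Guess 'tar', 'sdf', or 'xyz' from the first bytes of a file, else None."""
    # tar: the "ustar" magic sits at byte offset 257 of the first header block.
    if sample[257:262] == b"ustar":
        return "tar"
    text = sample.decode("utf-8", errors="replace")
    # sdf/mol: V2000/V3000 on the counts line, or the "$$$$" record delimiter.
    if "V2000" in text or "V3000" in text or "$$$$" in text:
        return "sdf"
    lines = text.splitlines()
    # xyz: integer atom-count header on the first line.
    if lines and lines[0].strip().isdigit():
        return "xyz"
    # xyz: otherwise look for coordinate-style lines.
    if any(_COORD_RE.match(line) for line in lines):
        return "xyz"
    return None
```

An HTML landing page matches none of these signatures and returns `None`, which corresponds to the "wrong_format" audit status in FIX 2.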
* test(dataset-resolver): de-vacuify DataCite test with a real DataCite DOI
FIX 3: 10.1038/sdata.2014.22 is a Crossref DOI (not DataCite-registered), so
search_datacite returned [] and the test was vacuously true. Replace with the
Zenodo-minted DOI 10.5281/zenodo.1227121 (api.datacite.org/dois/<doi> -> 200,
verified by curl), assert >=1 datacite candidate AND that its doi.org URL is
reachable (200) -- genuinely exercising the resolve path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dataset-resolver): wire resolver into Planner (inject verified URLs)
Task 7 of the dataset-resolver plan. PlannerAgent.mechanical_step now reads
spec.md, calls resolve_datasets + write_manifest, and returns a rendered
dataset_block; build_prompt injects the '# Verified datasets' block after the
plan template and before the comments/Task line (falling back to resolving in
build_prompt when mechanical_output lacks the key). planner.md replaces the
'NEVER invent URLs' rule with the cite-only rule. New offline test stubs
resolve_datasets to assert the verified block + URL reach the user message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(planner): remove contradictory dataset-substitution rule (NAB URL)
Task-7 review caught a leftover rule that told the LLM to 'substitute a
comparable open dataset that has a known-stable raw URL', hard-coding the exact
NAB raw.githubusercontent URL PROJ-262 hallucinated. It contradicted the new
cite-only rule and perpetuated the very hallucination the resolver removes.
Replaced with: reference ONLY the verified-datasets block (verified URL or a
well-known loader for that same dataset); never substitute or invent.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(dataset-resolver): store stable URL, not expiring presigned redirect target
PROJ-262 run 6 surfaced this: verify_candidate stored head.final_url, which for
a HuggingFace resolve URL is a short-lived presigned cas-bridge URL
(X-Amz-Expires=3600). The Planner cited it; ~32 min later FR-006's re-check hit
the expired signature -> HTTP 403 -> rejected. Now we store the STABLE original
c.url (the HF resolve URL, which HF re-signs on every access) while still
sniffing the live redirect target. Regression test: a redirecting URL keeps the
stable original in the verified record.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix(planner): strip wrapping/stray markdown code fences from multi-file split
PROJ-262 run 7: the Planner appended a stray trailing ``` to
contracts/prediction.schema.yaml, making it invalid YAML, which FR-007 (schema
validity) correctly rejected. _split_multi_file now strips (1) a fence that
wraps an entire file (```lang ... ```) and (2) a stray unmatched fence
(odd ``` count, e.g. a trailing closer), while leaving balanced code blocks
inside .md files intact. General fix for a common LLM output artifact across
all planner artifacts. Regression tests added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
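The two fence cases can be sketched as below. This is an assumption-laden illustration, not the real `_split_multi_file`: `strip_stray_fences` is a hypothetical helper, and it treats any line starting with three backticks as a fence.

```python
FENCE = "`" * 3  # built programmatically so this example nests cleanly


def strip_stray_fences(body: str) -> str:
    """Drop a whole-file wrapping fence or a single unmatched fence line."""
    lines = body.strip("\n").split("\n")
    fence_idx = [i for i, ln in enumerate(lines) if ln.lstrip().startswith(FENCE)]
    last = len(lines) - 1
    if fence_idx == [0, last]:
        # Exactly one fence pair wrapping the entire file: keep only the inside.
        lines = lines[1:-1]
    elif len(fence_idx) % 2 == 1:
        # Odd fence count: remove a stray trailing closer or leading opener.
        if fence_idx[-1] == last:
            lines = lines[:-1]
        elif fence_idx[0] == 0:
            lines = lines[1:]
    # Balanced fences elsewhere (e.g. code blocks in a .md file) are untouched.
    return "\n".join(lines) + "\n"
```

Counting fences before removing anything is what keeps balanced code blocks inside markdown artifacts intact while still repairing a YAML file with one trailing closer.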
* fix(tasker): FR-012 guard - refuse Mode-B spec.md patch that deletes requirements
PROJ-262 run 8 reached 'analyzed' but the Tasker's Mode-B had gutted the
project spec.md from 12 FR / 5 SC to 0 FR / 2 SC across rounds 1-4 — 'resolving'
analyze findings by DELETING requirements (the exact constraint-weakening the
constitution forbids). The Mode-B per-patch validation now refuses a spec.md
patch whose distinct FR-/SC- identifier set is smaller than the current file's,
alongside the existing diff/header/task-id guards. (The validation-layer
FR-012 check in validate_phase4 already flagged it post-hoc; this stops the
corruption at the source.) Restored PROJ-262's spec.md from git. Regression test
added.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
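A minimal sketch of the FR-012 comparison, assuming identifiers of the form FR-001 / SC-001. `requirement_ids` and `patch_shrinks_requirements` are hypothetical names; the real Mode-B validator runs this alongside its other guards.

```python
import re

# Assumed identifier shape: FR-<digits> or SC-<digits>.
REQ_ID_RE = re.compile(r"\b(?:FR|SC)-\d+\b")


def requirement_ids(text: str) -> set:
    """Distinct FR-/SC- identifiers; repeated mentions count once."""
    return set(REQ_ID_RE.findall(text))


def patch_shrinks_requirements(current: str, patched: str) -> bool:
    """True when the patched spec has fewer distinct ids than the current one."""
    return len(requirement_ids(patched)) < len(requirement_ids(current))
```

Comparing distinct identifiers rather than occurrences means a patch that merely drops a cross-reference is accepted, while one that removes a requirement outright is refused. A stricter variant could require the current set to be a subset of the patched set.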
* fix(audit): bracket-density counts only multi-word placeholders
PROJ-262 run 9: the Tasker annotated tasks.md with 20 single-token [REVISION]
tags, tripping unfilled_bracket_density (>=6) -> tasks.md rejected as template.
Rule 2 now counts ONLY multi-word descriptive placeholders ([FEATURE NAME],
[e.g., ...]) — the genuine 'saturated unfilled template' signal. Single-token
brackets are excluded: real ones ([FEATURE],[DATE]) are caught by Rule 1's
learned set, and LLM annotations/labels ([P],[US1],[REVISION],[X]) legitimately
appear in a real tasks.md. Templates still classify 'template' (verified
plan/spec/tasks). Root fix for the recurring bracket-density false-positives.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
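The multi-word restriction in Rule 2 can be illustrated with the sketch below. `multiword_placeholders` and `BRACKET_RE` are hypothetical names, and the stripping of fenced blocks, HTML comments, and markdown links that the real rule applies first is omitted here.

```python
import re

BRACKET_RE = re.compile(r"\[([^\[\]\n]+)\]")
THRESHOLD = 6  # from the rule name: unfilled_bracket_density>=6


def multiword_placeholders(text: str) -> list:
    """Return bracketed spans whose content has two or more words."""
    found = []
    for match in BRACKET_RE.finditer(text):
        inner = match.group(1).strip()
        # Single tokens ([P], [US1], [REVISION]) and checkboxes ([ ]) are skipped.
        if len(inner.split()) >= 2:
            found.append(match.group(0))
    return found
```

Under this sketch, a tasks.md carrying twenty `[REVISION]` tags contributes zero hits, so it stays below `THRESHOLD`, while a template saturated with `[FEATURE NAME]`-style placeholders still crosses it.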
* validate(014): both canonicals reach analyzed via dataset resolver; wrap-up
fr_sc_counts now counts DISTINCT requirement ids (not occurrences), fixing a
false-positive FR-012 finding (a dropped cross-reference is not a deleted
requirement). Added --verify-only mode to re-verify existing analyzed artifacts
+ emit manifests without a pipeline re-run.
Result: PROJ-261 + PROJ-262 both reach 'analyzed', 0 findings (verify-only:
2 passed). carry-forward.yaml + phase-report.md generated; both projects
'passed', Mode-B exercised on real content (5 rounds each). Commits the produced
plan artifacts + tasks.md + inspection records + analyzed project state. All 28
spec-014 tasks complete.
Known follow-up: extract_dataset_intents over-extracts non-dataset tokens
(GNN/MAE/MUST/FR-001) -> some resolve to irrelevant (but reachable) HF datasets;
a precision refinement, not a blocker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(014): commit Phase-4 validation run-log entries (FR-014 audit trail)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: publish_blocked schema gap (publisher crash) + stale spec-012 scheduler tests
Two pre-existing issues surfaced during spec-014 validation, fixed here
(handle-issues-as-you-go):
1. project-state schema's current_stage enum was MISSING 'publish_blocked' — the
only Stage value absent — so the publisher (agents/publisher.py FR-030, sets
publish_blocked after 5 Zenodo failures) would crash on save with a
ValidationError. Added it to the enum + a contract test asserting every Stage
value is in the schema enum (guards against future drift).
2. test_revision_in_progress_idempotency.py asserted READY_FOR_IMPLEMENTATION ∈
_NEVER_PICK — the spec-012 expectation that spec-013 deliberately REVERSED
(the llmXive-implementer agent now consumes those projects via the scheduler;
documented in scheduler.py + implementer.py). Updated the 2 stale tests to the
spec-013 behavior (preserving their idempotency intent with genuinely-locked
stages); did NOT revert the correct code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
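The drift guard from item 1 amounts to a set difference between the Stage enum and the schema enum. The sketch below uses a hypothetical three-value `Stage` and an inline schema dict; the real enum is larger and the real schema is loaded from the project-state schema file.

```python
from enum import Enum


class Stage(str, Enum):
    """Hypothetical three-value subset of the real Stage enum."""
    CLARIFIED = "clarified"
    ANALYZED = "analyzed"
    PUBLISH_BLOCKED = "publish_blocked"


def stages_missing_from_schema(schema: dict) -> set:
    """Stage values that the schema's current_stage enum does not allow."""
    allowed = set(schema["properties"]["current_stage"]["enum"])
    return {stage.value for stage in Stage} - allowed


# Inline stand-ins for the schema before and after the fix.
SCHEMA_BEFORE = {"properties": {"current_stage": {"enum": ["clarified", "analyzed"]}}}
SCHEMA_AFTER = {
    "properties": {
        "current_stage": {"enum": ["clarified", "analyzed", "publish_blocked"]}
    }
}
```

A contract test asserting that this difference is empty fails as soon as a new Stage value is added without a matching schema update.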
* fix(submission-intake): VALID_FIELDS reuses canonical LIBRARIAN_DEFAULT_FIELDS (SSoT)
submission_intake.py re-typed the librarian's canonical 9-field list as a prefix
of VALID_FIELDS, a third copy of the field list that violated Constitution
Principle I (single source of truth) and failed
test_librarian_default_fields::test_no_third_copy_of_the_field_list. VALID_FIELDS
is now frozenset(LIBRARIAN_DEFAULT_FIELDS) | {submission-only extras}; the
resulting set is byte-identical (17 fields), so classification/validation
behavior is unchanged. Pre-existing issue, fixed per handle-issues-as-you-go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: run-log entries from final regression runs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* test(phase1): make citation-resolver timeout test deterministic (drop flaky httpbin)
test_timeout_fires depended on the public httpbin.org/delay/30 endpoint sleeping
30s; when httpbin is overloaded it responds fast, taking the non-timeout
unreachable path (api_response_snippet=None) and failing the assertion. Replaced
with a LOCAL http.server that sleeps past the timeout, so the deadline
deterministically fires (TimeoutError path, snippet set). No third-party
dependency; tests the actual timeout behavior reliably.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
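A deterministic timeout fixture along these lines can be built from the standard library alone. This is a sketch, not the actual test: `SlowHandler` and `request_times_out` are hypothetical names, and the delay and timeout values are arbitrary.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    """Responds only after a delay longer than the client's timeout."""

    def do_GET(self):
        time.sleep(1.0)
        try:
            self.send_response(200)
            self.end_headers()
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep test output quiet


def request_times_out(timeout_s: float) -> bool:
    """Start a local slow server and report whether the request timed out."""
    server = http.server.HTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        urllib.request.urlopen(url, timeout=timeout_s)
        return False
    except socket.timeout:
        return True
    except urllib.error.URLError as exc:
        return isinstance(exc.reason, socket.timeout)
    finally:
        server.shutdown()
        server.server_close()
```

Because the server sleeps for a fixed interval on loopback, the deadline fires regardless of third-party load, which is the property the httpbin-based test lacked.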
* test(013): accept HTTP 202 from just-minted Zenodo sandbox DOIs
The two publisher sandbox tests HEAD the freshly-minted DOI as a smoke
check that the resolver knows about it. They only accepted (200,302,403),
but a just-minted DataCite/Zenodo sandbox DOI returns 202 ("Accepted",
still propagating) before it settles — a racy CI failure unrelated to the
publisher code. Broaden both assertions to any 2xx/3xx plus 403 (doi.org's
bare-HEAD response for sandbox DOIs); the only real failure is 404/5xx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
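The broadened acceptance rule reduces to a small predicate. `doi_head_status_ok` is a hypothetical name for illustration; the actual tests inline the assertion.

```python
def doi_head_status_ok(status: int) -> bool:
    """Accept any 2xx/3xx, plus 403 (doi.org's bare-HEAD reply for sandbox DOIs)."""
    return 200 <= status < 400 or status == 403
```

Under this predicate the just-minted 202 case passes, while 404 and 5xx remain failures.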
* fix(backends): bound LLM calls with a daemon-thread deadline (fixes 54-min CI hang)
Root cause of the real-call job timing out: the Dartmouth backend wrapped
client.invoke() in `with ThreadPoolExecutor() as ex: ex.submit(...).result(180)`.
When the 180s deadline fired and raised, exiting the `with` block invoked
ThreadPoolExecutor.__exit__ -> shutdown(wait=True), which BLOCKS until the
still-hung worker thread finishes. Since the worker was stuck in a socket read
with no HTTP timeout (langchain forwards `timeout` as a chat-completion body
param, not an HTTP/socket timeout), shutdown never returned — the implementer
e2e test stalled ~54 min until the 60-min job cap cancelled it (the test step
emitted zero output between 16:22 and the 17:16 cancel).
Replace the executor with a shared `invoke_with_deadline` helper in
backends/base.py that runs the call on a DAEMON thread and abandons it past the
deadline. A daemon thread never blocks interpreter exit (unlike an abandoned
ThreadPoolExecutor worker, which the executor's atexit join would wait on), so a
sick connection can no longer hang the process. Applied to both the Dartmouth
and HuggingFace backends (HF's bare client.invoke() had no deadline at all, so
the fallback path carried the same latent hang).
Verified: unit test asserts the caller regains control ~at the deadline (not
after the slow call finishes) and the abandoned worker is a daemon; a real
Dartmouth chat call still succeeds through the wrapper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
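The daemon-thread pattern can be sketched as follows. This is a minimal illustration under assumptions, not the helper in backends/base.py: the signature, the result box, and `slow_call` are hypothetical.

```python
import threading
import time
from typing import Any, Callable


def invoke_with_deadline(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Run fn on a daemon thread; raise TimeoutError if it overruns timeout_s."""
    box = {}

    def worker() -> None:
        try:
            box["value"] = fn()
        except BaseException as exc:  # re-raised in the caller below
            box["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # returns at the deadline even if fn is still stuck
    if thread.is_alive():
        # Abandon the worker: as a daemon it cannot block interpreter exit.
        raise TimeoutError(f"call exceeded {timeout_s}s deadline")
    if "error" in box:
        raise box["error"]
    return box["value"]


def slow_call() -> str:
    """Stand-in for a hung client.invoke()."""
    time.sleep(2.0)
    return "late"
```

Unlike leaving a `with ThreadPoolExecutor()` block, nothing here waits for the worker after the deadline, so the caller regains control at roughly `timeout_s` rather than when the slow call finishes.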
* fix(backends): enforce free-only Dartmouth models; correct registry model drift
Dartmouth's chat catalog now mixes FREE self-hosted models with PAID
external providers (gpt-5, claude, gemini, voyage, ...). The real-call CI
failed because (1) a test fell through to the PAID gpt-5.3-chat-latest,
which rejects temperature=0, and (2) a transient "model not found" on the
primary model was misclassified as permanent, so the router never fell
through to the free gpt-oss-120b peer.
- dartmouth.py: derive the free-model set from the API's explicit
input/output_cost_per_token fields (authoritative, not heuristic);
refuse any non-free model before calling it (Constitution Principle IV:
v1 cost==0); route list_models() through the working
chat.dartmouth.edu/api/models endpoint (ChatDartmouth.list() targets a
Dartmouth host that rejects the chat key and returns non-JSON); classify
"model not found" as transient so the router falls through to a free
peer; drop unsupported temperature on retry.
- router.py: MODEL_FALLBACKS use free models only (gemma-3-27b-it ->
gemma-4-31B-it).
- registry.yaml: lightweight agents -> gemma-4-31B-it; paper_implementer
and all intensive agents -> qwen.qwen3.5-122b.
- test_dartmouth_chat.py: select only free models (never a paid gpt-5).
Verified locally: the 4 originally-failing real-call tests pass; full
real_call suite 18 passed / 4 skipped (HF/network-gated); unit+contract
592 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
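Deriving the free set from explicit cost fields might look like this sketch. The payload shape (a list of dicts with an `id` key) and the sample cost values are assumptions for illustration; only the `input_cost_per_token` / `output_cost_per_token` field names come from this message, and `free_model_ids` is a hypothetical name.

```python
from typing import Iterable


def free_model_ids(models: Iterable[dict]) -> set:
    """Model ids whose input and output per-token costs are both explicitly zero."""
    free = set()
    for model in models:
        costs = (
            model.get("input_cost_per_token"),
            model.get("output_cost_per_token"),
        )
        # A missing cost field is NOT treated as free.
        if all(cost is not None and float(cost) == 0.0 for cost in costs):
            free.add(model["id"])
    return free


# Illustrative payload; the cost figures are made up.
SAMPLE_CATALOG = [
    {"id": "gpt-oss-120b", "input_cost_per_token": 0, "output_cost_per_token": 0},
    {"id": "gpt-5.3-chat-latest", "input_cost_per_token": 1e-06, "output_cost_per_token": 4e-06},
    {"id": "unknown-model"},
]
```

Treating a missing cost as non-free errs on the side of refusing a model, which fits the cost==0 constraint cited above.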
* test(implementer-e2e): raise SC-001 wall-clock budget 1200s -> 2400s
The free-only backend fix makes a transient "model not found" (a model
briefly unloaded on Dartmouth's vLLM cluster) retryable — the router now
walks the free peer-model fallback chain instead of failing fast. That
resilience is correct but adds real wall-clock when blips occur: a CI run
took 1264s on the 3-task fixture (vs the old 1200s budget set from
fail-fast timing). Bump to 2400s for generous headroom over the observed
worst case while still catching a genuine hang (bounded by the 180s
per-request deadline x the finite retry/fallback fan-out).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent d69d7b5 commit 6173f33
121 files changed
Lines changed: 12902 additions & 8539 deletions
File tree
- .specify
- agents
- prompts
- docs/superpowers
- plans
- specs
- notes
- projects
- PROJ-261-evaluating-the-impact-of-code-duplicatio
- .specify/memory
- specs/001-evaluating-the-impact-of-code-duplicatio
- contracts
- PROJ-262-predicting-molecular-dipole-moments-with
- .specify/memory
- specs/001-predicting-molecular-dipole-moments-with
- contracts
- scripts
- specs
- 001-agentic-pipeline-refactor/contracts
- 014-phase4-plan-tasks-testing
- .omc/state
- checklists
- contracts
- inspections
- PROJ-261-evaluating-the-impact-of-code-duplicatio
- PROJ-262-predicting-molecular-dipole-moments-with
- src/llmxive
- agents
- audit
- backends
- librarian
- speckit
- state
- citations
- librarian-cache
- projects
- run-log/2026-05
- tests
- contract
- integration
- phase1
- real_call
- unit