Commit 6173f33

Phase 4 validation (planner + tasker) + deterministic dataset resolver (#223)
* feat(014): Planner research guards + Tasker per-round inspection capture

T003: new canonical _research_guard.py (FR-005/006/007, stdlib only): IncompleteArtifactSet/UnreachableReference/InconsistentDataModel + assert_artifact_set_complete/assert_urls_reachable/assert_data_model_contracts_consistent.
T004: wire the three gates into PlannerAgent.write_artifacts; unlink every artifact written this invocation on any raise (parity with guard_emit).
T005: capture() gains optional rounds=, persisted under top-level 'rounds' (default []); escalated added to valid outcomes; rounds added to required keys.
T006: _maybe_write_inspection reads agent._inspection_rounds and passes rounds=.
T007: TaskerAgent accumulates one sub-record per analyze round (observability only; no decision/branch reads it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(014): scripts/validate_phase4.py driver (preflight/reset/run/verify + manifests)

T008: preflight (Dartmouth key, runner import, stage==clarified else FR-019 decline, spec.md real, inspections dir writable).
T009: FR-018 reset (delete Phase-4 outputs + memory markers, PRESERVE spec.md).
T010: run with LLMXIVE_INSPECTION_DIR set; capture exit + run-id.
T011: post-run verify -- stage chain, five plan artifacts + tasks.md, >=10 T### lines, FR-010 ordering (check_task_ordering), FR-012 constraint-non-deletion (fr_sc_counts across Mode-B rounds), FR-020 Constitution Check.
T022/T023: emit_carry_forward + emit_phase_report per the contracts.
Pure helpers (check_task_ordering, fr_sc_counts, constitution_check_ok) importable by tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(014): Phase-4 regression + schema tests; self-guard destructive Phase-3 e2e

tests/integration/test_phase4_plan_tasks.py (T016-T021,T025): FILE-marker split, FR-005 completeness, FR-008 template reject (real PlannerAgent.write_artifacts unlink), FR-006 URL reachability via a REAL local http.server (200 pass; 404/500/connection-refused raise; Planner unlinks on bad URL), FR-007 consistency, FR-016(c) prose-stub Mode-A reject, FR-016(d) diff-leak, FR-016(e) header-clobber, FR-012 constraint non-deletion, FR-016(f) analyze-loop cap escalation (real tasks_cmd loop, synthetic analyze/Mode-B; no real LLM), FR-010 ordering, inspection schema incl rounds + _redact no-secrets, carry-forward + phase-report schema.

test_phase3_specify_clarify.py: the spec-011 gated e2e is DESTRUCTIVE (rolls PROJ-261 back to project_initialized, deletes spec.md). Skip it unless the project is still at its Phase-3 entry stage, so it can no longer clobber the Phase-4 input.

ruff clean on all new files; stdlib-only guard; no new pip dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(014): mark T003-T011,T016-T023,T025-T028 done in tasks.md

Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002) left unchecked for the parent to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(audit): body-density no longer mis-flags table/diagram-heavy artifacts

The template-vs-real auditor stripped fenced blocks (incl. mermaid) before measuring section bodies and counted parent headings as empty, so a substantive data-model.md (ER diagram + per-entity attribute tables + fenced CSV schemas) classified 'partial'. This blocked the Planner from advancing ANY project past 'clarified'.
Now tables/fenced blocks/lists count as real content and parent-of-subsection headings are not 'short'; genuinely-empty/stub sections are still flagged and literal templates are still caught by the phrase/bracket rules.

Found during spec-014 Phase-4 validation; cited by failing inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (TemplateRefused body_density_short>=60pct=9/13 on a real data-model.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(014): FR-007 = robust structural check, not fragile 1:1 name match

Real PROJ-261 planner output revealed the FR-007 guard was over-specified: it required every data-model entity to have a same-named contracts/ schema (and vice versa), but the Planner prompt only mandates >=1 schema, and schema filenames legitimately differ from entity headings (e.g. code_duplication_metrics.schema.yaml for a CloneDensityMetric entity). It also mis-counted 'Data Flow'/'Entity Relationships'/CSV-filename headings as entities. FR-007 now verifies (a) data-model.md defines real entities (table/diagram/headings) and (b) every contracts/*.yaml is a non-empty parseable schema; cardinality + naming are unconstrained. Updated spec/contracts/data-model and the regression tests (incl. an explicit test that the prior real-world mismatch now passes).

Cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(audit): don't learn structural task labels ([US1],[Story]) as template phrases

The literal-template-phrase rule extracted [US1]/[US2]/[US3]/[Story] from tasks-template.md and then flagged any real tasks.md (which MUST use those format labels) as 'template'. This blocked the Tasker from ever committing a tasks.md.
Structural labels ([P],[ID],[TaskID],[Story],[USn]) are now excluded from the learned placeholder set; genuine placeholders ([FEATURE NAME], [Entity], [Service], ...) still trigger template detection, so the template files themselves remain correctly classified.

Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/tasker.json (TemplateRefused literal_template_phrases>=3 sample=['[Story]','[US1]','[US2]']).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(014): add --force re-validation rollback to validate_phase4 driver

A prior partial run (planner advanced clarified->planned, tasker then failed) left a project at 'planned', so the FR-019 preflight correctly declined to re-run. --force rolls such a project back to 'clarified' (logged in history.jsonl) so the full planner->tasker chain can be re-validated from the canonical entry state. Default FR-019 decline behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(audit): bracket-density rule ignores fenced/diagram/link content

Rule 2 (unfilled_bracket_density>=6) counted bracketed node labels inside a fenced ASCII/mermaid data-flow chart (e.g. '[Dataset Download] -> data/raw/') as unfilled placeholders, rejecting a real planner data-model.md as 'template'. Rule 2 now scans a view with fenced blocks, HTML comments, and markdown links stripped, and excludes structural labels; Rule 1 (learned phrases) still uses the full text so genuine templates remain caught (verified: plan/spec/tasks templates still classify 'template'). Also gave the planner-guard test a real .specify/templates dir so it exercises Rule 1 like production.
Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (unfilled_bracket_density on a fenced data-flow chart).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(014): step driver to 'analyzed'; align spec to best-effort cap-hit

Validation finding: the Tasker advances across TWO runner steps (planned->tasked, tasked->analyzed) per the pipeline graph (STAGE_AFTER_AGENT), so a fixed --max-tasks 2 left the project stuck at 'tasked'. _run_pipeline now steps one agent at a time until a terminal Phase-4 stage and STOPS at 'analyzed' (never invoking the Phase-5 implementer).

Per the 2026-05-21 decision, analyze-loop cap-hit WITHOUT convergence is best-effort: the Tasker accepts tasks.md, records converged:false, and the project advances to 'analyzed' (downstream reviewers catch issues); human_input_needed is reserved for an explicit Mode-B escalate verdict or a backend failure. Updated spec FR-013 / Background / US1 / US2 / US3 / edge case / data-model, and added a regression test for the best-effort cap-hit path (the explicit-escalate path test is retained).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(014): raise per-step timeout to 3600s; allow resume from mid-Phase-4 stage

The Tasker re-runs its full 5-round analyze loop on each runner step, and the second step (tasked->analyzed) exceeded the 1900s per-step timeout. Raised to 3600s. Also: when a run is interrupted mid-Phase-4 (project left at planned/tasked/analyze_in_progress), the driver now RESUMES from there (no reset, no rollback) and steps to a terminal stage, instead of requiring --force. FR-019 still declines projects that have COMPLETED Phase 4 (analyzed+).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(014): FR-006 URL extraction strips wrapping backticks (false-404 fix)

PROJ-262's planner cited a URL wrapped in markdown backticks; _URL_RE captured the closing backtick into the path ('.../realKnownCause/`'), producing a false 404. Backtick is now excluded from the URL/doi character classes and added to the trailing-strip set. Regression test added. (Note: the planner is also citing genuinely-dead/irrelevant dataset URLs for PROJ-262 -- figshare 404, NAB dir 404 -- which FR-006 correctly hard-fails per the 2026-05-21 strict-mode decision; that is the gate working, distinct from this extraction bug.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(PROJ-262): correct dead QM9 dataset DOI in Phase-3 spec.md

The Phase-3 spec.md cited a dead QM9 DOI (10.6084/m9.figshare.9981994 -> HTTP 404), which the Planner faithfully carried into research.md, so Phase-4 FR-006 (reachability) correctly hard-failed the plan. Replaced with the canonical, verified-reachable QM9 dataset DOI 10.1038/sdata.2014.22 (Ramakrishnan et al., Sci Data 2014; HTTP 200). A Verified-Accuracy (constitution Principle II) fix.

Surfaced by spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-262-predicting-molecular-dipole-moments-with/planner.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* design(dataset-resolver): deterministic web-search dataset resolution for the Planner

Brainstormed design (Approach A): a librarian/dataset_resolver.py module, called by the Planner's mechanical step, finds real datasets via HF Hub + figshare/Zenodo/DataCite + reused Semantic Scholar/arXiv, verifies reachability + a sample-stream format sniff, and injects the top-N verified candidates per dataset into the Planner prompt (cite-only, never invent). Removes URL generation from the LLM; FR-006 stays as the safety net.
Root-cause fix for the PROJ-262 hallucinated-dataset-URL finding from spec-014.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* plan(dataset-resolver): bite-sized TDD implementation plan

8 tasks: DatasetCandidate + HF Hub source; figshare/Zenodo/DataCite sources; sample-stream format sniff; verify_candidate (reuses verify.py); intent extraction + resolve_datasets top-N orchestration; manifest + escalation + cite-only planner block; wire into Planner; full suite + real PROJ-262 re-run. All real-call tests (HF Hub + registry APIs + local http.server fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): DatasetCandidate + HuggingFace Hub source

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): figshare/Zenodo/DataCite sources

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): sample-stream format sniff

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): verify_candidate (reachability + sniff, reuses verify.py)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): intent extraction + resolve_datasets top-N orchestration

HF candidates now point at the resolve URL of an actual data file (csv/parquet/etc.) instead of the HTML landing page, so reachability+sniff verification can succeed. The landing page is HTML and is correctly rejected by the sniffer; the design calls for the HF resolve URL / streaming first rows. Data file is picked deterministically by extension preference then path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): manifest write + planner block + unresolved helper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dataset-resolver): xyz/sdf/tar sniffers + granular candidates_tried audit

FIX 1: add xyz/sdf/tar detectors to _detect_and_parse so the HF picker (_HF_DATA_EXTS advertises .xyz/.sdf) and the sniffer are consistent -- QM9 is natively .xyz, which was picked then wrongly rejected. xyz = integer atom-count header or "<El> x y z" coordinate lines; sdf/mol = V2000/V3000 or "$$$$" delimiter; tar = "ustar" magic at offset 257.

FIX 2: split the generic "rejected" candidates_tried status into "verified", "unreachable" (reachability step failed) and "wrong_format" (reachable but the sniff failed) via new probe_candidate() + VerifyResult; verify_candidate keeps its original contract. Verified-selection behavior unchanged.

FIX 4: document that the Semantic Scholar/arXiv paper-linked-data source is DEFERRED (yields paper pages, not sniffable files) and that repo_root is reserved for that future reuse (Task 7's plan_cmd still passes it).

Real-call/local-http.server tests added for xyz + sdf detection and for the 404->unreachable / HTML->wrong_format audit statuses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(dataset-resolver): de-vacuify DataCite test with a real DataCite DOI

FIX 3: 10.1038/sdata.2014.22 is a Crossref DOI (not DataCite-registered), so search_datacite returned [] and the test was vacuously true. Replace with the Zenodo-minted DOI 10.5281/zenodo.1227121 (api.datacite.org/dois/<doi> -> 200, verified by curl), assert >=1 datacite candidate AND that its doi.org URL is reachable (200) -- genuinely exercising the resolve path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dataset-resolver): wire resolver into Planner (inject verified URLs)

Task 7 of the dataset-resolver plan.
PlannerAgent.mechanical_step now reads spec.md, calls resolve_datasets + write_manifest, and returns a rendered dataset_block; build_prompt injects the '# Verified datasets' block after the plan template and before the comments/Task line (falling back to resolving in build_prompt when mechanical_output lacks the key). planner.md replaces the 'NEVER invent URLs' rule with the cite-only rule. New offline test stubs resolve_datasets to assert the verified block + URL reach the user message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(planner): remove contradictory dataset-substitution rule (NAB URL)

Task-7 review caught a leftover rule that told the LLM to 'substitute a comparable open dataset that has a known-stable raw URL', hard-coding the exact NAB raw.githubusercontent URL PROJ-262 hallucinated. It contradicted the new cite-only rule and perpetuated the very hallucination the resolver removes. Replaced with: reference ONLY the verified-datasets block (verified URL or a well-known loader for that same dataset); never substitute or invent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(dataset-resolver): store stable URL, not expiring presigned redirect target

PROJ-262 run 6 surfaced this: verify_candidate stored head.final_url, which for a HuggingFace resolve URL is a short-lived presigned cas-bridge URL (X-Amz-Expires=3600). The Planner cited it; ~32 min later FR-006's re-check hit the expired signature -> HTTP 403 -> rejected. Now we store the STABLE original c.url (the HF resolve URL, which HF re-signs on every access) while still sniffing the live redirect target. Regression test: a redirecting URL keeps the stable original in the verified record.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(planner): strip wrapping/stray markdown code fences from multi-file split

PROJ-262 run 7: the Planner appended a stray trailing ``` to contracts/prediction.schema.yaml, making it invalid YAML, which FR-007 (schema validity) correctly rejected. _split_multi_file now strips (1) a fence that wraps an entire file (```lang ... ```) and (2) a stray unmatched fence (odd ``` count, e.g. a trailing closer), while leaving balanced code blocks inside .md files intact. General fix for a common LLM output artifact across all planner artifacts. Regression tests added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(tasker): FR-012 guard - refuse Mode-B spec.md patch that deletes requirements

PROJ-262 run 8 reached 'analyzed' but the Tasker's Mode-B had gutted the project spec.md from 12 FR / 5 SC to 0 FR / 2 SC across rounds 1-4 -- 'resolving' analyze findings by DELETING requirements (the exact constraint-weakening the constitution forbids). The Mode-B per-patch validation now refuses a spec.md patch whose distinct FR-/SC- identifier set is smaller than the current file's, alongside the existing diff/header/task-id guards. (The validation-layer FR-012 check in validate_phase4 already flagged it post-hoc; this stops the corruption at the source.) Restored PROJ-262's spec.md from git. Regression test added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(audit): bracket-density counts only multi-word placeholders

PROJ-262 run 9: the Tasker annotated tasks.md with 20 single-token [REVISION] tags, tripping unfilled_bracket_density (>=6) -> tasks.md rejected as template. Rule 2 now counts ONLY multi-word descriptive placeholders ([FEATURE NAME], [e.g., ...]) -- the genuine 'saturated unfilled template' signal.
Single-token brackets are excluded: real ones ([FEATURE],[DATE]) are caught by Rule 1's learned set, and LLM annotations/labels ([P],[US1],[REVISION],[X]) legitimately appear in a real tasks.md. Templates still classify 'template' (verified plan/spec/tasks). Root fix for the recurring bracket-density false-positives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* validate(014): both canonicals reach analyzed via dataset resolver; wrap-up

fr_sc_counts now counts DISTINCT requirement ids (not occurrences), fixing a false-positive FR-012 finding (a dropped cross-reference is not a deleted requirement). Added --verify-only mode to re-verify existing analyzed artifacts + emit manifests without a pipeline re-run.

Result: PROJ-261 + PROJ-262 both reach 'analyzed', 0 findings (verify-only: 2 passed). carry-forward.yaml + phase-report.md generated; both projects 'passed', Mode-B exercised on real content (5 rounds each). Commits the produced plan artifacts + tasks.md + inspection records + analyzed project state. All 28 spec-014 tasks complete.

Known follow-up: extract_dataset_intents over-extracts non-dataset tokens (GNN/MAE/MUST/FR-001) -> some resolve to irrelevant (but reachable) HF datasets; a precision refinement, not a blocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(014): commit Phase-4 validation run-log entries (FR-014 audit trail)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: publish_blocked schema gap (publisher crash) + stale spec-012 scheduler tests

Two pre-existing issues surfaced during spec-014 validation, fixed here (handle-issues-as-you-go):

1. project-state schema's current_stage enum was MISSING 'publish_blocked' -- the only Stage value absent -- so the publisher (agents/publisher.py FR-030, sets publish_blocked after 5 Zenodo failures) would crash on save with a ValidationError.
Added it to the enum + a contract test asserting every Stage value is in the schema enum (guards against future drift).

2. test_revision_in_progress_idempotency.py asserted READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK -- the spec-012 expectation that spec-013 deliberately REVERSED (the llmXive-implementer agent now consumes those projects via the scheduler; documented in scheduler.py + implementer.py). Updated the 2 stale tests to the spec-013 behavior (preserving their idempotency intent with genuinely-locked stages); did NOT revert the correct code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(submission-intake): VALID_FIELDS reuses canonical LIBRARIAN_DEFAULT_FIELDS (SSoT)

submission_intake.py re-typed the librarian's canonical 9-field list as a prefix of VALID_FIELDS, a third copy of the field list that violated Constitution Principle I (single source of truth) and failed test_librarian_default_fields::test_no_third_copy_of_the_field_list. VALID_FIELDS is now frozenset(LIBRARIAN_DEFAULT_FIELDS) | {submission-only extras}; the resulting set is byte-identical (17 fields), so classification/validation behavior is unchanged. Pre-existing issue, fixed per handle-issues-as-you-go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: run-log entries from final regression runs

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(phase1): make citation-resolver timeout test deterministic (drop flaky httpbin)

test_timeout_fires depended on the public httpbin.org/delay/30 endpoint sleeping 30s; when httpbin is overloaded it responds fast, taking the non-timeout unreachable path (api_response_snippet=None) and failing the assertion. Replaced with a LOCAL http.server that sleeps past the timeout, so the deadline deterministically fires (TimeoutError path, snippet set). No third-party dependency; tests the actual timeout behavior reliably.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(013): accept HTTP 202 from just-minted Zenodo sandbox DOIs

The two publisher sandbox tests HEAD the freshly-minted DOI as a smoke check that the resolver knows about it. They only accepted (200,302,403), but a just-minted DataCite/Zenodo sandbox DOI returns 202 ("Accepted", still propagating) before it settles -- a racy CI failure unrelated to the publisher code. Broaden both assertions to any 2xx/3xx plus 403 (doi.org's bare-HEAD response for sandbox DOIs); the only real failure is 404/5xx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(backends): bound LLM calls with a daemon-thread deadline (fixes 54-min CI hang)

Root cause of the real-call job timing out: the Dartmouth backend wrapped client.invoke() in `with ThreadPoolExecutor() as ex: ex.submit(...).result(180)`. When the 180s deadline fired and raised, exiting the `with` block invoked ThreadPoolExecutor.__exit__ -> shutdown(wait=True), which BLOCKS until the still-hung worker thread finishes. Since the worker was stuck in a socket read with no HTTP timeout (langchain forwards `timeout` as a chat-completion body param, not an HTTP/socket timeout), shutdown never returned -- the implementer e2e test stalled ~54 min until the 60-min job cap cancelled it (the test step emitted zero output between 16:22 and the 17:16 cancel).

Replace the executor with a shared `invoke_with_deadline` helper in backends/base.py that runs the call on a DAEMON thread and abandons it past the deadline. A daemon thread never blocks interpreter exit (unlike an abandoned ThreadPoolExecutor worker, which its atexit join would wait on), so a sick connection can no longer hang the process. Apply it to both the Dartmouth and HuggingFace backends (HF's bare client.invoke() had no deadline at all -- same latent hang as a fallback).
Verified: unit test asserts the caller regains control ~at the deadline (not after the slow call finishes) and the abandoned worker is a daemon; a real Dartmouth chat call still succeeds through the wrapper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(backends): enforce free-only Dartmouth models; correct registry model drift

Dartmouth's chat catalog now mixes FREE self-hosted models with PAID external providers (gpt-5, claude, gemini, voyage, ...). The real-call CI failed because (1) a test fell through to the PAID gpt-5.3-chat-latest, which rejects temperature=0, and (2) a transient "model not found" on the primary model was misclassified as permanent, so the router never fell through to the free gpt-oss-120b peer.

- dartmouth.py: derive the free-model set from the API's explicit input/output_cost_per_token fields (authoritative, not heuristic); refuse any non-free model before calling it (Constitution Principle IV: v1 cost==0); route list_models() through the working chat.dartmouth.edu/api/models endpoint (ChatDartmouth.list() targets a Dartmouth host that rejects the chat key and returns non-JSON); classify "model not found" as transient so the router falls through to a free peer; drop unsupported temperature on retry.
- router.py: MODEL_FALLBACKS use free models only (gemma-3-27b-it -> gemma-4-31B-it).
- registry.yaml: lightweight agents -> gemma-4-31B-it; paper_implementer and all intensive agents -> qwen.qwen3.5-122b.
- test_dartmouth_chat.py: select only free models (never a paid gpt-5).

Verified locally: the 4 originally-failing real-call tests pass; full real_call suite 18 passed / 4 skipped (HF/network-gated); unit+contract 592 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(implementer-e2e): raise SC-001 wall-clock budget 1200s -> 2400s

The free-only backend fix makes a transient "model not found" (a model briefly unloaded on Dartmouth's vLLM cluster) retryable -- the router now walks the free peer-model fallback chain instead of failing fast. That resilience is correct but adds real wall-clock when blips occur: a CI run took 1264s on the 3-task fixture (vs the old 1200s budget set from fail-fast timing). Bump to 2400s for generous headroom over the observed worst case while still catching a genuine hang (bounded by the 180s per-request deadline x the finite retry/fallback fan-out).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
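
The daemon-thread deadline described in the fix(backends) entry above can be sketched as follows. This is a minimal sketch, not the repository's code: only the helper name invoke_with_deadline comes from the commit message, while the signature, the outcome dict, and the error handling are assumptions made for illustration.

```python
import threading
from typing import Any, Callable


def invoke_with_deadline(fn: Callable[[], Any], deadline_s: float) -> Any:
    """Run fn on a daemon thread; raise TimeoutError if it overruns deadline_s.

    Hypothetical signature: the real helper in backends/base.py may differ.
    """
    outcome: dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)  # returns at the deadline even if fn is still stuck
    if thread.is_alive():
        # Abandon the worker; as a daemon it cannot block interpreter exit.
        raise TimeoutError(f"call exceeded {deadline_s}s deadline")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Unlike leaving a `with ThreadPoolExecutor()` block, nothing here joins the worker after the deadline, so a call stuck in a socket read cannot hold up the caller. The trade-off is that the abandoned thread keeps running in the background until the process ends.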
1 parent d69d7b5 commit 6173f33

121 files changed

Lines changed: 12902 additions & 8539 deletions

.specify/feature.json

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-{"feature_directory": "specs/013-paper-revision-implementer"}
+{"feature_directory": "specs/014-phase4-plan-tasks-testing"}

CLAUDE.md

Lines changed: 1 addition & 1 deletion
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
 <!-- SPECKIT START -->
 For additional context about technologies to be used, project structure,
 shell commands, and other important information, read the current plan:
-[specs/013-paper-revision-implementer/plan.md](specs/013-paper-revision-implementer/plan.md).
+[specs/014-phase4-plan-tasks-testing/plan.md](specs/014-phase4-plan-tasks-testing/plan.md).
 <!-- SPECKIT END -->

agents/prompts/planner.md

Lines changed: 13 additions & 12 deletions
@@ -57,19 +57,20 @@ $schema: ...
 - For computational projects, `contracts/` MUST include at least one
 schema (e.g., dataset schema, output schema) that the
 Implementer's tests can validate against.
-- NEVER invent URLs or citations. If the spec/idea has cited URLs,
-copy them verbatim; do not add new ones, do not fabricate
-`(verified YYYY-MM-DD)` annotations. The Reference-Validator
-fetches every cited URL — fabricated URLs flip the verdict to
-mismatch.
+- For dataset/code/paper references in research.md, cite ONLY the URLs listed in
+the "# Verified datasets" block of the user message (these have been
+web-searched and reachability/format-verified for you). NEVER invent or guess
+a dataset URL. If the block says a dataset has NO verified source, describe the
+dataset by name but do NOT fabricate a URL.
 - For DATASETS specifically: `research.md`'s "Dataset Strategy"
-table MUST name only real, programmatically-fetchable sources.
-If the spec calls for "UCI Electricity" but the canonical UCI
-endpoint requires browser navigation, plan for the `ucimlrepo`
-Python package OR substitute a comparable open dataset that has
-a known-stable raw URL (e.g., NAB benchmark CSVs at
-`https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/`,
-or HuggingFace `datasets.load_dataset(...)`).
+table MUST reference ONLY the sources in the "# Verified datasets"
+block above — cite each dataset by its verified URL, or load that
+SAME dataset via a well-known programmatic loader (e.g.
+`datasets.load_dataset(...)` for a verified HuggingFace dataset, or
+`ucimlrepo` for a UCI dataset). Do NOT substitute a different dataset
+and do NOT invent or guess a raw URL. If a dataset the spec needs has
+NO verified source in the block, state that explicitly rather than
+fabricating one.
 - For COMPUTATIONAL TASK ORDERING: the plan MUST order phases so
 data is downloaded BEFORE any task that consumes it, models are
 fitted BEFORE any task that evaluates them, and figures are
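
The cite-only rule in the prompt change above can also be checked mechanically once the Planner has responded. A hedged sketch, assuming the verified URLs are available as a set: `extract_urls` and `unverified_urls` are hypothetical helpers, and the pattern below is a simplified stand-in for the repository's `_URL_RE`, written so that a wrapping markdown backtick never ends up inside the captured URL.

```python
import re

# Simplified, hypothetical pattern: backticks, whitespace, brackets, and quotes
# end the match, so a URL wrapped in markdown code formatting is captured cleanly.
_URL_RE = re.compile(r"https?://[^\s`<>()\[\]\"']+")
_TRAILING = ".,;:!?`"  # punctuation stripped from the end of a match


def extract_urls(markdown: str) -> set[str]:
    """Return every http(s) URL in the text, minus trailing punctuation."""
    return {match.rstrip(_TRAILING) for match in _URL_RE.findall(markdown)}


def unverified_urls(markdown: str, verified: set[str]) -> set[str]:
    """URLs cited in the text that are absent from the verified set."""
    return extract_urls(markdown) - verified
```

An empty result from `unverified_urls` means every cited URL came from the verified block. Anything else is a URL the model introduced on its own, which the reachability gate would then have to judge.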

agents/registry.yaml

Lines changed: 8 additions & 8 deletions
@@ -29,7 +29,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 wall_clock_budget_seconds: 300
 paid_opt_in: false
 - name: flesh_out
@@ -218,7 +218,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 tools:
 - citation_fetcher
 wall_clock_budget_seconds: 300
@@ -316,7 +316,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: qwen.qwen3.5-122b
 wall_clock_budget_seconds: 300
 paid_opt_in: false
 - name: paper_writing
@@ -399,7 +399,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 wall_clock_budget_seconds: 600
 paid_opt_in: false
 - name: latex_fix
@@ -445,7 +445,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 wall_clock_budget_seconds: 300
 paid_opt_in: false
 - name: repository_hygiene
@@ -461,7 +461,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 wall_clock_budget_seconds: 300
 paid_opt_in: false
 - name: task_atomizer
@@ -496,7 +496,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 wall_clock_budget_seconds: 300
 paid_opt_in: false
 - name: paper_reviewer_writing_quality
@@ -818,7 +818,7 @@ agents:
 fallback_backends:
 - huggingface
 - local
-default_model: google.gemma-3-27b-it
+default_model: google.gemma-4-31B-it
 tools: []
 wall_clock_budget_seconds: 300
 paid_opt_in: false
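
The model swaps above follow the free-only rule from the commit message, which derives the free set from the catalog's input/output_cost_per_token fields. A minimal sketch, assuming each catalog entry is a dict with an `id` key alongside those two cost fields; the entry shape and both function names are hypothetical, not the Dartmouth API's documented schema.

```python
from typing import Any


def free_model_ids(catalog: list[dict[str, Any]]) -> set[str]:
    """Ids of models whose input and output per-token costs are both zero."""
    free: set[str] = set()
    for entry in catalog:
        # A missing cost field compares unequal to 0, so it counts as
        # "not free" (fail closed).
        cost_in = entry.get("input_cost_per_token")
        cost_out = entry.get("output_cost_per_token")
        if cost_in == 0 and cost_out == 0:
            free.add(entry["id"])
    return free


def require_free(model_id: str, free: set[str]) -> None:
    """Refuse a non-free model before any call is made."""
    if model_id not in free:
        raise ValueError(f"model {model_id!r} is not in the free set")
```

Calling `require_free` on each `default_model` in the registry before dispatch would catch drift toward a paid model at load time rather than mid-run. Whether the repository performs the check at that point is not stated in the commit message.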
