
Phase 4 validation (planner + tasker) + deterministic dataset resolver#223

Merged
jeremymanning merged 40 commits into main from 014-phase4-plan-tasks-testing
May 27, 2026

Conversation

@jeremymanning
Member

Summary

Validates Phase 4 of the llmXive agentic pipeline (Spec Kit Plan → Tasks with the analyze loop; issue #48, agents planner #65 + tasker #66) end-to-end on two real projects, and — because the validation surfaced that the Planner hallucinates dataset URLs — adds a deterministic dataset resolver so the Planner cites real, verified datasets instead of inventing them.

Both reference projects now reach analyzed cleanly:

| Project | Final stage | Findings | Analyze rounds |
|---|---|---|---|
| PROJ-261 (code-duplication, CS) | analyzed | 0 | 5 |
| PROJ-262 (molecular dipoles, Chemistry) | analyzed | 0 | 5 |

specs/014-phase4-plan-tasks-testing/carry-forward.yaml lists both as passed (ready for Phase 5); phase-report.md maps every FR → evidence.

What's here

  • Phase-4 validation harness: scripts/validate_phase4.py (preflight, FR-018 reset, step-to-analyzed, --force rollback, --verify-only) + tests/integration/test_phase4_plan_tasks.py (FR-016 regression tests + schema/ordering tests, all real-call or local-http.server, no network mocks). Spec/plan/tasks/contracts under specs/014-phase4-plan-tasks-testing/.
  • Deterministic dataset resolver (new feature) — src/llmxive/librarian/{dataset_resolver,dataset_sources}.py: web-searches HuggingFace Hub + figshare/Zenodo/DataCite (+ reuses verify.py reachability), verifies a sample-stream format sniff, and injects the top-N verified dataset URLs into the Planner prompt (cite-only). Design + plan in docs/superpowers/.

Pipeline bugs found & fixed (the point of the validation)

Each is a separate commit with a regression test:

  • Auditor template_vs_real (×4) — false-positives that would reject legitimate rich artifacts: body-density on table/mermaid data-model.md; Rule-1 learning the [US1]/[Story] task labels; Rule-2 bracket-density counting fenced flowchart labels and single-token annotations ([REVISION]). Now counts only multi-word placeholders / strips fences / excludes structural labels — templates still detected.
  • Tasker Mode-B gutting spec.md — it "converged" analyze by deleting requirements (12 FR/5 SC → 0 FR/2 SC, observed on PROJ-262). Added an FR-012 guard refusing any Mode-B spec.md patch that drops requirement IDs.
  • _split_multi_file didn't strip ``` code fences → broke contracts/*.yaml.
  • Dataset resolver — stored the expiring presigned HF redirect target (→ 403) instead of the stable resolve/main/... URL; FR-006 URL extraction captured a wrapping backtick.

Pre-existing issues handled along the way

  • publish_blocked missing from the project-state schema enum — the publisher (FR-030, after 5 Zenodo failures) would crash on save; added the value + a contract test asserting every Stage is in the schema enum.
  • Stale spec-012 scheduler idempotency tests — they asserted READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK, which spec-013 deliberately reversed (the implementer agent consumes those projects); updated the tests to spec-013 behavior (code was correct).
  • VALID_FIELDS third copy of the field list — now built from the canonical LIBRARIAN_DEFAULT_FIELDS (SSoT, byte-identical set).
  • Flaky httpbin-dependent citation-timeout test — made deterministic with a local slow server.

Decisions

  • Analyze-loop cap-hit = best-effort advance to analyzed (recording converged: false); human_input_needed is reserved for an explicit Mode-B escalate verdict or backend failure.
  • FR-005/006/007 gates added to the Planner (agent hardening, per clarification); FR-006 hard-fails any non-2xx/3xx with no retry.
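
The strict FR-006 rule in the last bullet can be sketched as follows. This is a minimal illustration, not the project's `_research_guard.py`: only the names `assert_urls_reachable` and `UnreachableReference` come from the commit log, and the `fetch_status` parameter is a hypothetical injection point that lets the check run without network access.

```python
from typing import Callable, Iterable


class UnreachableReference(Exception):
    """Raised when a cited URL does not answer with a 2xx/3xx status."""


def assert_urls_reachable(
    urls: Iterable[str],
    fetch_status: Callable[[str], int],
) -> None:
    # fetch_status is a hypothetical hook: it returns the HTTP status for a
    # URL, or raises OSError when the connection itself fails.
    for url in urls:
        try:
            status = fetch_status(url)
        except OSError as exc:
            raise UnreachableReference(f"{url}: {exc}") from exc
        # Strict mode: a single attempt, and anything outside 200-399 fails.
        if not 200 <= status < 400:
            raise UnreachableReference(f"{url}: HTTP {status}")
```

Because there is no retry, a transient 5xx fails the plan in the same way as a dead link, which matches the hard-fail decision recorded above.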

Test status

unit + integration + contract: 735 passed, 0 failed · phase1: 23 · phase2: 202 · e2e (test_site.py): 5 · real_call: gated/skipped. The PROJ-261/262 runs and the resolver source tests are real-call (Dartmouth + HF/figshare/Zenodo/DataCite).

Known follow-ups (non-blocking)

  • extract_dataset_intents over-extracts non-dataset tokens (GNN, MAE, FR-001) → some resolve to irrelevant-but-reachable HF datasets; a precision refinement.
  • The Tasker re-runs its full analyze loop on both runner steps (≈2× cost) — an existing inefficiency, noted not fixed.

🤖 Generated with Claude Code

jeremymanning and others added 30 commits May 21, 2026 08:06
T003: new canonical _research_guard.py (FR-005/006/007, stdlib only):
IncompleteArtifactSet/UnreachableReference/InconsistentDataModel +
assert_artifact_set_complete/assert_urls_reachable/assert_data_model_contracts_consistent.
T004: wire the three gates into PlannerAgent.write_artifacts; unlink every
artifact written this invocation on any raise (parity with guard_emit).
T005: capture() gains optional rounds=, persisted under top-level 'rounds'
(default []); escalated added to valid outcomes; rounds added to required keys.
T006: _maybe_write_inspection reads agent._inspection_rounds and passes rounds=.
T007: TaskerAgent accumulates one sub-record per analyze round (observability
only; no decision/branch reads it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ify + manifests)

T008: preflight (Dartmouth key, runner import, stage==clarified else FR-019
decline, spec.md real, inspections dir writable).
T009: FR-018 reset (delete Phase-4 outputs + memory markers, PRESERVE spec.md).
T010: run  with
LLMXIVE_INSPECTION_DIR set; capture exit + run-id.
T011: post-run verify — stage chain, five plan artifacts + tasks.md, >=10 T###
lines, FR-010 ordering (check_task_ordering), FR-012 constraint-non-deletion
(fr_sc_counts across Mode-B rounds), FR-020 Constitution Check.
T022/T023: emit_carry_forward + emit_phase_report per the contracts. Pure
helpers (check_task_ordering, fr_sc_counts, constitution_check_ok) importable
by tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Phase-3 e2e

tests/integration/test_phase4_plan_tasks.py (T016-T021,T025): FILE-marker split,
FR-005 completeness, FR-008 template reject (real PlannerAgent.write_artifacts
unlink), FR-006 URL reachability via a REAL local http.server (200 pass; 404/500/
connection-refused raise; Planner unlinks on bad URL), FR-007 consistency,
FR-016(c) prose-stub Mode-A reject, FR-016(d) diff-leak, FR-016(e) header-clobber,
FR-012 constraint non-deletion, FR-016(f) analyze-loop cap escalation (real
tasks_cmd loop, synthetic analyze/Mode-B; no real LLM), FR-010 ordering, inspection
schema incl rounds + _redact no-secrets, carry-forward + phase-report schema.

test_phase3_specify_clarify.py: the spec-011 gated e2e is DESTRUCTIVE (rolls
PROJ-261 back to project_initialized, deletes spec.md). Skip it unless the project
is still at its Phase-3 entry stage, so it can no longer clobber the Phase-4 input.

ruff clean on all new files; stdlib-only guard; no new pip dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002)
left unchecked for the parent to complete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…facts

The template-vs-real auditor stripped fenced blocks (incl. mermaid) before
measuring section bodies and counted parent headings as empty, so a
substantive data-model.md (ER diagram + per-entity attribute tables + fenced
CSV schemas) classified 'partial'. This blocked the Planner from advancing
ANY project past 'clarified'.

Now tables/fenced blocks/lists count as real content and parent-of-subsection
headings are not 'short'; genuinely-empty/stub sections are still flagged and
literal templates are still caught by the phrase/bracket rules.

Found during spec-014 Phase-4 validation; cited by failing inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json
(TemplateRefused body_density_short>=60pct=9/13 on a real data-model.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real PROJ-261 planner output revealed the FR-007 guard was over-specified:
it required every data-model entity to have a same-named contracts/ schema
(and vice versa), but the Planner prompt only mandates >=1 schema, and schema
filenames legitimately differ from entity headings (e.g. code_duplication_
metrics.schema.yaml for a CloneDensityMetric entity). It also mis-counted
'Data Flow'/'Entity Relationships'/CSV-filename headings as entities.

FR-007 now verifies (a) data-model.md defines real entities (table/diagram/
headings) and (b) every contracts/*.yaml is a non-empty parseable schema;
cardinality + naming are unconstrained. Updated spec/contracts/data-model and
the regression tests (incl. an explicit test that the prior real-world
mismatch now passes). Cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plate phrases

The literal-template-phrase rule extracted [US1]/[US2]/[US3]/[Story] from
tasks-template.md and then flagged any real tasks.md (which MUST use those
format labels) as 'template'. This blocked the Tasker from ever committing a
tasks.md. Structural labels ([P],[ID],[TaskID],[Story],[USn]) are now excluded
from the learned placeholder set; genuine placeholders ([FEATURE NAME],
[Entity], [Service], ...) still trigger template detection, so the template
files themselves remain correctly classified.

Found during spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/tasker.json
(TemplateRefused literal_template_phrases>=3 sample=['[Story]','[US1]','[US2]']).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A prior partial run (planner advanced clarified->planned, tasker then failed)
left a project at 'planned', so the FR-019 preflight correctly declined to
re-run. --force rolls such a project back to 'clarified' (logged in
history.jsonl) so the full planner->tasker chain can be re-validated from the
canonical entry state. Default FR-019 decline behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rule 2 (unfilled_bracket_density>=6) counted bracketed node labels inside a
fenced ASCII/mermaid data-flow chart (e.g. '[Dataset Download] -> data/raw/')
as unfilled placeholders, rejecting a real planner data-model.md as 'template'.
Rule 2 now scans a view with fenced blocks, HTML comments, and markdown links
stripped, and excludes structural labels; Rule 1 (learned phrases) still uses
the full text so genuine templates remain caught (verified: plan/spec/tasks
templates still classify 'template'). Also gave the planner-guard test a real
.specify/templates dir so it exercises Rule 1 like production.

Found during spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json
(unfilled_bracket_density on a fenced data-flow chart).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation finding: the Tasker advances across TWO runner steps (planned->tasked,
tasked->analyzed) per the pipeline graph (STAGE_AFTER_AGENT), so a fixed
--max-tasks 2 left the project stuck at 'tasked'. _run_pipeline now steps one
agent at a time until a terminal Phase-4 stage and STOPS at 'analyzed' (never
invoking the Phase-5 implementer).

Per the 2026-05-21 decision, analyze-loop cap-hit WITHOUT convergence is
best-effort: the Tasker accepts tasks.md, records converged:false, and the
project advances to 'analyzed' (downstream reviewers catch issues);
human_input_needed is reserved for an explicit Mode-B escalate verdict or a
backend failure. Updated spec FR-013 / Background / US1 / US2 / US3 / edge case
/ data-model, and added a regression test for the best-effort cap-hit path
(the explicit-escalate path test is retained).
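
A rough sketch of the step-until-terminal driver described in this commit. The stage names come from the PR text; the `TERMINAL_STAGES` set, the `TRANSITIONS` table, and the function name `run_until_terminal` are assumptions made for illustration and are not the real `_run_pipeline`.

```python
from typing import Callable

# Assumed set of stages at which the Phase-4 driver stops.
TERMINAL_STAGES = {"analyzed", "human_input_needed"}


def run_until_terminal(
    stage: str,
    step: Callable[[str], str],
    max_steps: int = 10,
) -> str:
    """Advance one agent step at a time and stop at a terminal Phase-4 stage."""
    for _ in range(max_steps):
        if stage in TERMINAL_STAGES:
            return stage  # never step past 'analyzed' into Phase 5
        next_stage = step(stage)
        if next_stage == stage:
            return stage  # no progress: stop instead of looping forever
        stage = next_stage
    return stage


# Hypothetical transition table: clarified -> planned -> tasked -> analyzed.
TRANSITIONS = {"clarified": "planned", "planned": "tasked", "tasked": "analyzed"}
final = run_until_terminal("clarified", lambda s: TRANSITIONS.get(s, s))
print(final)  # analyzed
```

Checking for a terminal stage before each step, rather than counting steps, is what avoids the fixed `--max-tasks 2` problem: the Tasker's two runner steps are both taken, and the step function is never called once the project is at 'analyzed'.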

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-4 stage

The Tasker re-runs its full 5-round analyze loop on each runner step, and the
second step (tasked->analyzed) exceeded the 1900s per-step timeout. Raised to
3600s. Also: when a run is interrupted mid-Phase-4 (project left at planned/
tasked/analyze_in_progress), the driver now RESUMES from there (no reset, no
rollback) and steps to a terminal stage, instead of requiring --force. FR-019
still declines projects that have COMPLETED Phase 4 (analyzed+).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fix)

PROJ-262's planner cited a URL wrapped in markdown backticks; _URL_RE captured
the closing backtick into the path ('.../realKnownCause/`'), producing a false
404. Backtick is now excluded from the URL/doi character classes and added to
the trailing-strip set. Regression test added. (Note: the planner is also
citing genuinely-dead/irrelevant dataset URLs for PROJ-262 — figshare 404, NAB
dir 404 — which FR-006 correctly hard-fails per the 2026-05-21 strict-mode
decision; that is the gate working, distinct from this extraction bug.)
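
The extraction fix can be illustrated with a small sketch. The pattern and the trailing-strip set below are assumptions chosen for the example, not the project's actual `_URL_RE`; the point is only that the backtick appears both in the excluded character class and in the strip set.

```python
import re
from typing import List

# Hypothetical pattern: stop at whitespace, angle brackets, quotes, backticks.
URL_RE = re.compile(r"https?://[^\s<>\"'`]+")
# Punctuation that commonly trails a URL in prose or markdown.
TRAILING = ".,;:!?)]}`'\""


def extract_urls(text: str) -> List[str]:
    return [m.group(0).rstrip(TRAILING) for m in URL_RE.finditer(text)]


print(extract_urls("Data: `https://example.org/data/realKnownCause/` (see docs)."))
# ['https://example.org/data/realKnownCause/']
```

Stripping a trailing `)` would truncate URLs that legitimately end in a parenthesis, so a production extractor may need a more careful rule than this sketch.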

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Phase-3 spec.md cited a dead QM9 DOI (10.6084/m9.figshare.9981994 -> HTTP
404), which the Planner faithfully carried into research.md, so Phase-4 FR-006
(reachability) correctly hard-failed the plan. Replaced with the canonical,
verified-reachable QM9 dataset DOI 10.1038/sdata.2014.22 (Ramakrishnan et al.,
Sci Data 2014; HTTP 200). A Verified-Accuracy (constitution Principle II) fix.

Surfaced by spec-014 Phase-4 validation; cited by inspection record
specs/014-phase4-plan-tasks-testing/inspections/PROJ-262-predicting-molecular-dipole-moments-with/planner.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… for the Planner

Brainstormed design (Approach A): a librarian/dataset_resolver.py module, called
by the Planner's mechanical step, finds real datasets via HF Hub + figshare/
Zenodo/DataCite + reused Semantic Scholar/arXiv, verifies reachability + a
sample-stream format sniff, and injects the top-N verified candidates per
dataset into the Planner prompt (cite-only, never invent). Removes URL
generation from the LLM; FR-006 stays as the safety net. Root-cause fix for the
PROJ-262 hallucinated-dataset-URL finding from spec-014.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks: DatasetCandidate + HF Hub source; figshare/Zenodo/DataCite sources;
sample-stream format sniff; verify_candidate (reuses verify.py); intent
extraction + resolve_datasets top-N orchestration; manifest + escalation +
cite-only planner block; wire into Planner; full suite + real PROJ-262 re-run.
All real-call tests (HF Hub + registry APIs + local http.server fixture).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s verify.py)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chestration

HF candidates now point at the resolve URL of an actual data file (csv/parquet/
etc.) instead of the HTML landing page, so reachability+sniff verification can
succeed. The landing page is HTML and is correctly rejected by the sniffer; the
design calls for the HF resolve URL / streaming first rows. Data file is picked
deterministically by extension preference then path.
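
One way to read "extension preference then path" is a two-part sort key. The sketch below assumes a preference order (`EXT_PREFERENCE`) and a function name (`pick_data_file`) that do not appear in the PR; only the ordering principle is taken from the commit message.

```python
import os
from typing import List, Optional

# Hypothetical preference order; the PR states only that extension preference
# is applied first and the path second.
EXT_PREFERENCE = [".parquet", ".csv", ".jsonl", ".json", ".xyz", ".sdf"]


def pick_data_file(paths: List[str]) -> Optional[str]:
    rank = {ext: i for i, ext in enumerate(EXT_PREFERENCE)}

    def ext_of(path: str) -> str:
        return os.path.splitext(path)[1].lower()

    candidates = [p for p in paths if ext_of(p) in rank]
    if not candidates:
        return None  # e.g. a repo that contains only a README
    # Lowest extension rank wins; ties are broken by lexicographic path.
    return min(candidates, key=lambda p: (rank[ext_of(p)], p))


print(pick_data_file(["README.md", "data/b.csv", "data/a.csv", "x.json"]))
# data/a.csv
```

Because the key depends only on the file list, the same repository listing always yields the same candidate URL, which keeps the resolver's output reproducible across runs.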

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elper

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed audit

FIX 1: add xyz/sdf/tar detectors to _detect_and_parse so the HF picker
(_HF_DATA_EXTS advertises .xyz/.sdf) and the sniffer are consistent -- QM9 is
natively .xyz, which was picked then wrongly rejected. xyz = integer atom-count
header or "<El> x y z" coordinate lines; sdf/mol = V2000/V3000 or "$$$$"
delimiter; tar = "ustar" magic at offset 257.

FIX 2: split the generic "rejected" candidates_tried status into "verified",
"unreachable" (reachability step failed) and "wrong_format" (reachable but the
sniff failed) via new probe_candidate() + VerifyResult; verify_candidate keeps
its original contract. Verified-selection behavior unchanged.

FIX 4: document that the Semantic Scholar/arXiv paper-linked-data source is
DEFERRED (yields paper pages, not sniffable files) and that repo_root is
reserved for that future reuse (Task 7's plan_cmd still passes it).

Real-call/local-http.server tests added for xyz + sdf detection and for the
404->unreachable / HTML->wrong_format audit statuses.
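
The three detectors in FIX 1 can be sketched from the signatures listed there. This is an illustrative approximation, not the project's `_detect_and_parse`: the function name `sniff_format`, the regular expression, and the check order are assumptions.

```python
import re
from typing import Optional

# "<El> x y z": an element symbol followed by three numeric coordinates.
_XYZ_COORD = re.compile(
    r"^[A-Z][a-z]?\s+-?\d+(\.\d+)?\s+-?\d+(\.\d+)?\s+-?\d+(\.\d+)?"
)


def sniff_format(sample: bytes) -> Optional[str]:
    """Return 'tar', 'sdf', 'xyz', or None for the first bytes of a file."""
    # tar: the magic string "ustar" sits at byte offset 257 of the header.
    if sample[257:262] == b"ustar":
        return "tar"
    text = sample.decode("utf-8", errors="replace")
    # SDF/MOL: a V2000/V3000 counts line or the "$$$$" record delimiter.
    if "V2000" in text or "V3000" in text or "$$$$" in text:
        return "sdf"
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # XYZ: an integer atom-count header, or coordinate lines.
    if lines and (lines[0].isdigit() or any(_XYZ_COORD.match(ln) for ln in lines)):
        return "xyz"
    return None
```

An HTML landing page matches none of these signatures and returns `None`, which corresponds to the `wrong_format` audit status introduced in FIX 2.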

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… DOI

FIX 3: 10.1038/sdata.2014.22 is a Crossref DOI (not DataCite-registered), so
search_datacite returned [] and the test was vacuously true. Replace with the
Zenodo-minted DOI 10.5281/zenodo.1227121 (api.datacite.org/dois/<doi> -> 200,
verified by curl), assert >=1 datacite candidate AND that its doi.org URL is
reachable (200) -- genuinely exercising the resolve path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RLs)

Task 7 of the dataset-resolver plan. PlannerAgent.mechanical_step now reads
spec.md, calls resolve_datasets + write_manifest, and returns a rendered
dataset_block; build_prompt injects the '# Verified datasets' block after the
plan template and before the comments/Task line (falling back to resolving in
build_prompt when mechanical_output lacks the key). planner.md replaces the
'NEVER invent URLs' rule with the cite-only rule. New offline test stubs
resolve_datasets to assert the verified block + URL reach the user message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task-7 review caught a leftover rule that told the LLM to 'substitute a
comparable open dataset that has a known-stable raw URL', hard-coding the exact
NAB raw.githubusercontent URL PROJ-262 hallucinated. It contradicted the new
cite-only rule and perpetuated the very hallucination the resolver removes.
Replaced with: reference ONLY the verified-datasets block (verified URL or a
well-known loader for that same dataset); never substitute or invent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ect target

PROJ-262 run 6 surfaced this: verify_candidate stored head.final_url, which for
a HuggingFace resolve URL is a short-lived presigned cas-bridge URL
(X-Amz-Expires=3600). The Planner cited it; ~32 min later FR-006's re-check hit
the expired signature -> HTTP 403 -> rejected. Now we store the STABLE original
c.url (the HF resolve URL, which HF re-signs on every access) while still
sniffing the live redirect target. Regression test: a redirecting URL keeps the
stable original in the verified record.
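
The distinction between the URL that is cited and the URL that is sniffed can be made explicit in the record type. The classes and field names below are hypothetical; the PR confirms only that the stable original URL is stored while the live redirect target is still used for the sniff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadResult:
    # Hypothetical shape: the status plus the URL reached after redirects.
    status: int
    final_url: str


@dataclass(frozen=True)
class VerifiedRecord:
    url: str          # stable URL that is safe to cite later
    sniffed_url: str  # short-lived URL the sample bytes were read from


def build_verified_record(original_url: str, head: HeadResult) -> VerifiedRecord:
    # Cite the original URL; use the redirect target only for the live sniff.
    return VerifiedRecord(url=original_url, sniffed_url=head.final_url)
```

Keeping both values in the record means a later reachability re-check goes through the stable URL and receives a freshly signed redirect, instead of replaying an expired signature.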

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le split

PROJ-262 run 7: the Planner appended a stray trailing ``` to
contracts/prediction.schema.yaml, making it invalid YAML, which FR-007 (schema
validity) correctly rejected. _split_multi_file now strips (1) a fence that
wraps an entire file (```lang ... ```) and (2) a stray unmatched fence
(odd ``` count, e.g. a trailing closer), while leaving balanced code blocks
inside .md files intact. General fix for a common LLM output artifact across
all planner artifacts. Regression tests added.
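
A minimal sketch of the two stripping cases, under the assumption that an unmatched fence is always the last fence line in the file. The helper name `strip_stray_fences` is illustrative; the real `_split_multi_file` may differ in detail.

```python
FENCE = "`" * 3  # built programmatically so the example nests safely in markdown


def _is_fence(line: str) -> bool:
    return line.lstrip().startswith(FENCE)


def strip_stray_fences(body: str) -> str:
    lines = body.strip("\n").split("\n")
    # Case 1: one fence pair wraps the entire file.
    if len(lines) >= 2 and _is_fence(lines[0]) and lines[-1].strip() == FENCE:
        inner = lines[1:-1]
        if not any(_is_fence(ln) for ln in inner):
            return "\n".join(inner) + "\n"
    # Case 2: an odd number of fence lines means one fence is unmatched.
    fence_idx = [i for i, ln in enumerate(lines) if _is_fence(ln)]
    if len(fence_idx) % 2 == 1:
        # Assumption: the unmatched fence is the last one (a trailing closer).
        del lines[fence_idx[-1]]
    return "\n".join(lines) + "\n"
```

A markdown file with an even number of fence lines that does not both start and end with a fence passes through unchanged, so legitimate code blocks inside `.md` artifacts are preserved.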

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…requirements

PROJ-262 run 8 reached 'analyzed' but the Tasker's Mode-B had gutted the
project spec.md from 12 FR / 5 SC to 0 FR / 2 SC across rounds 1-4 — 'resolving'
analyze findings by DELETING requirements (the exact constraint-weakening the
constitution forbids). The Mode-B per-patch validation now refuses a spec.md
patch whose distinct FR-/SC- identifier set is smaller than the current file's,
alongside the existing diff/header/task-id guards. (The validation-layer
FR-012 check in validate_phase4 already flagged it post-hoc; this stops the
corruption at the source.) Restored PROJ-262's spec.md from git. Regression test
added.
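
The guard reduces to comparing sets of distinct identifiers. The sketch below uses hypothetical function names and a subset test, which is slightly stricter than comparing set sizes: it also refuses a patch that swaps one requirement ID for a different one.

```python
import re
from typing import Set

_REQ_ID = re.compile(r"\b(?:FR|SC)-\d+\b")


def requirement_ids(spec_text: str) -> Set[str]:
    """Distinct FR-/SC- identifiers, so repeated cross-references count once."""
    return set(_REQ_ID.findall(spec_text))


def patch_drops_requirements(current: str, patched: str) -> bool:
    """True when the patched spec.md lacks an identifier the current one has."""
    return not requirement_ids(current) <= requirement_ids(patched)


current = "FR-001 must hold. FR-002 depends on FR-001. SC-001 measures FR-002."
patched = "FR-001 must hold. SC-001 measures it."
print(patch_drops_requirements(current, patched))  # True: FR-002 was deleted
```

Counting distinct IDs rather than occurrences matters here: removing a cross-reference to FR-001 while keeping its definition leaves the set unchanged, so the patch is still accepted.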

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PROJ-262 run 9: the Tasker annotated tasks.md with 20 single-token [REVISION]
tags, tripping unfilled_bracket_density (>=6) -> tasks.md rejected as template.
Rule 2 now counts ONLY multi-word descriptive placeholders ([FEATURE NAME],
[e.g., ...]) — the genuine 'saturated unfilled template' signal. Single-token
brackets are excluded: real ones ([FEATURE],[DATE]) are caught by Rule 1's
learned set, and LLM annotations/labels ([P],[US1],[REVISION],[X]) legitimately
appear in a real tasks.md. Templates still classify 'template' (verified
plan/spec/tasks). Root fix for the recurring bracket-density false-positives.
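
The revised Rule 2 can be approximated as below. The regular expressions and function names are assumptions for illustration; the threshold of 6 and the stripped views (fenced blocks, HTML comments, markdown links) are taken from the commit messages in this PR.

```python
import re

FENCE = "`" * 3
_FENCED = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")
_BRACKET = re.compile(r"\[([^\[\]]+)\]")


def count_multiword_placeholders(text: str) -> int:
    # Scan a view with fenced blocks, HTML comments, and markdown links removed.
    view = _LINK.sub("", _COMMENT.sub("", _FENCED.sub("", text)))
    # Only bracketed text with more than one token counts, e.g. [FEATURE NAME].
    return sum(1 for m in _BRACKET.finditer(view) if len(m.group(1).split()) > 1)


def looks_like_unfilled_template(text: str, threshold: int = 6) -> bool:
    return count_multiword_placeholders(text) >= threshold
```

Single-token brackets such as `[REVISION]`, `[P]`, or an unchecked task box contribute nothing to the count, so a heavily annotated tasks.md no longer trips the rule.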

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rap-up

fr_sc_counts now counts DISTINCT requirement ids (not occurrences), fixing a
false-positive FR-012 finding (a dropped cross-reference is not a deleted
requirement). Added --verify-only mode to re-verify existing analyzed artifacts
+ emit manifests without a pipeline re-run.

Result: PROJ-261 + PROJ-262 both reach 'analyzed', 0 findings (verify-only:
2 passed). carry-forward.yaml + phase-report.md generated; both projects
'passed', Mode-B exercised on real content (5 rounds each). Commits the produced
plan artifacts + tasks.md + inspection records + analyzed project state. All 28
spec-014 tasks complete.

Known follow-up: extract_dataset_intents over-extracts non-dataset tokens
(GNN/MAE/MUST/FR-001) -> some resolve to irrelevant (but reachable) HF datasets;
a precision refinement, not a blocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jeremymanning and others added 10 commits May 22, 2026 04:23
…rail)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heduler tests

Two pre-existing issues surfaced during spec-014 validation, fixed here (handle-
issues-as-you-go):

1. project-state schema's current_stage enum was MISSING 'publish_blocked' — the
   only Stage value absent — so the publisher (agents/publisher.py FR-030, sets
   publish_blocked after 5 Zenodo failures) would crash on save with a
   ValidationError. Added it to the enum + a contract test asserting every Stage
   value is in the schema enum (guards against future drift).

2. test_revision_in_progress_idempotency.py asserted READY_FOR_IMPLEMENTATION ∈
   _NEVER_PICK — the spec-012 expectation that spec-013 deliberately REVERSED
   (the llmXive-implementer agent now consumes those projects via the scheduler;
   documented in scheduler.py + implementer.py). Updated the 2 stale tests to the
   spec-013 behavior (preserving their idempotency intent with genuinely-locked
   stages); did NOT revert the correct code.
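
The drift guard from item 1 amounts to a set difference between the enum and the schema. The `Stage` members and schema fragment below are an abbreviated, hypothetical subset used only to show the shape of such a contract test.

```python
from enum import Enum
from typing import Set


class Stage(str, Enum):
    # Abbreviated, hypothetical subset of the real Stage values.
    CLARIFIED = "clarified"
    PLANNED = "planned"
    TASKED = "tasked"
    ANALYZED = "analyzed"
    PUBLISH_BLOCKED = "publish_blocked"


# Hypothetical fragment of the project-state JSON schema.
PROJECT_STATE_SCHEMA = {
    "properties": {
        "current_stage": {
            "enum": ["clarified", "planned", "tasked", "analyzed", "publish_blocked"]
        }
    }
}


def stages_missing_from_schema(schema: dict) -> Set[str]:
    allowed = set(schema["properties"]["current_stage"]["enum"])
    return {stage.value for stage in Stage} - allowed


def test_every_stage_is_in_schema_enum() -> None:
    assert stages_missing_from_schema(PROJECT_STATE_SCHEMA) == set()
```

Iterating over the enum rather than a hand-written list means any Stage added later fails the test until the schema is updated.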

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…LT_FIELDS (SSoT)

submission_intake.py re-typed the librarian's canonical 9-field list as a prefix
of VALID_FIELDS, a third copy of the field list that violated Constitution
Principle I (single source of truth) and failed
test_librarian_default_fields::test_no_third_copy_of_the_field_list. VALID_FIELDS
is now frozenset(LIBRARIAN_DEFAULT_FIELDS) | {submission-only extras}; the
resulting set is byte-identical (17 fields), so classification/validation
behavior is unchanged. Pre-existing issue, fixed per handle-issues-as-you-go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flaky httpbin)

test_timeout_fires depended on the public httpbin.org/delay/30 endpoint sleeping
30s; when httpbin is overloaded it responds fast, taking the non-timeout
unreachable path (api_response_snippet=None) and failing the assertion. Replaced
with a LOCAL http.server that sleeps past the timeout, so the deadline
deterministically fires (TimeoutError path, snippet set). No third-party
dependency; tests the actual timeout behavior reliably.
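
A local slow server of the kind described here takes only a few lines of standard library code. The handler, delay, and timeout values below are assumptions for the sketch, not the project's test fixture.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    delay_s = 1.0  # longer than the client timeout used below

    def do_GET(self):
        time.sleep(self.delay_s)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client has already given up

    def log_message(self, *args):
        pass  # keep test output quiet


def request_times_out(timeout_s: float = 0.2) -> bool:
    server = HTTPServer(("127.0.0.1", 0), SlowHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        try:
            urllib.request.urlopen(url, timeout=timeout_s)
        except (socket.timeout, urllib.error.URLError):
            return True  # the deadline fired before any response arrived
        return False
    finally:
        server.shutdown()
        server.server_close()
```

Since the delay is controlled by the test itself, the timeout path is taken on every run regardless of third-party service load.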

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two publisher sandbox tests HEAD the freshly-minted DOI as a smoke
check that the resolver knows about it. They only accepted (200,302,403),
but a just-minted DataCite/Zenodo sandbox DOI returns 202 ("Accepted",
still propagating) before it settles — a racy CI failure unrelated to the
publisher code. Broaden both assertions to any 2xx/3xx plus 403 (doi.org's
bare-HEAD response for sandbox DOIs); the only real failure is 404/5xx.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4-min CI hang)

Root cause of the real-call job timing out: the Dartmouth backend wrapped
client.invoke() in `with ThreadPoolExecutor() as ex: ex.submit(...).result(180)`.
When the 180s deadline fired and raised, exiting the `with` block invoked
ThreadPoolExecutor.__exit__ -> shutdown(wait=True), which BLOCKS until the
still-hung worker thread finishes. Since the worker was stuck in a socket read
with no HTTP timeout (langchain forwards `timeout` as a chat-completion body
param, not an HTTP/socket timeout), shutdown never returned — the implementer
e2e test stalled ~54 min until the 60-min job cap cancelled it (the test step
emitted zero output between 16:22 and the 17:16 cancel).

Replace the executor with a shared `invoke_with_deadline` helper in
backends/base.py that runs the call on a DAEMON thread and abandons it past the
deadline. A daemon thread never blocks interpreter exit (unlike an
abandoned ThreadPoolExecutor worker, which its atexit join would wait on), so a
sick connection can no longer hang the process. Apply it to both the Dartmouth
and HuggingFace backends (HF's bare client.invoke() had no deadline at all —
same latent hang as a fallback).

Verified: unit test asserts the caller regains control ~at the deadline (not
after the slow call finishes) and the abandoned worker is a daemon; a real
Dartmouth chat call still succeeds through the wrapper.
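
The daemon-thread approach can be sketched as follows. Only the helper name `invoke_with_deadline` comes from the commit message; the signature and error handling are assumptions, and the real helper in backends/base.py may differ.

```python
import threading
import time
from typing import Any, Callable, Dict


def invoke_with_deadline(fn: Callable[[], Any], deadline_s: float) -> Any:
    """Run fn on a daemon thread; raise TimeoutError if it misses the deadline."""
    result: Dict[str, Any] = {}

    def worker() -> None:
        try:
            result["value"] = fn()
        except BaseException as exc:  # hand any failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(deadline_s)
    if thread.is_alive():
        # Abandon the worker: a daemon thread cannot block interpreter exit.
        raise TimeoutError(f"call exceeded {deadline_s}s deadline")
    if "error" in result:
        raise result["error"]
    return result["value"]


start = time.monotonic()
try:
    invoke_with_deadline(lambda: time.sleep(2.0), deadline_s=0.1)
except TimeoutError:
    print(f"gave up after {time.monotonic() - start:.1f}s")
```

The abandoned worker keeps running until its call returns, so this pattern bounds the caller's wait but does not release the underlying socket; an HTTP-level timeout would still be needed for that.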

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odel drift

Dartmouth's chat catalog now mixes FREE self-hosted models with PAID
external providers (gpt-5, claude, gemini, voyage, ...). The real-call CI
failed because (1) a test fell through to the PAID gpt-5.3-chat-latest,
which rejects temperature=0, and (2) a transient "model not found" on the
primary model was misclassified as permanent, so the router never fell
through to the free gpt-oss-120b peer.

- dartmouth.py: derive the free-model set from the API's explicit
  input/output_cost_per_token fields (authoritative, not heuristic);
  refuse any non-free model before calling it (Constitution Principle IV:
  v1 cost==0); route list_models() through the working
  chat.dartmouth.edu/api/models endpoint (ChatDartmouth.list() targets a
  Dartmouth host that rejects the chat key and returns non-JSON); classify
  "model not found" as transient so the router falls through to a free
  peer; drop unsupported temperature on retry.
- router.py: MODEL_FALLBACKS use free models only (gemma-3-27b-it ->
  gemma-4-31B-it).
- registry.yaml: lightweight agents -> gemma-4-31B-it; paper_implementer
  and all intensive agents -> qwen.qwen3.5-122b.
- test_dartmouth_chat.py: select only free models (never a paid gpt-5).

Verified locally: the 4 originally-failing real-call tests pass; full
real_call suite 18 passed / 4 skipped (HF/network-gated); unit+contract
592 passed.
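
Deriving the free-model set from explicit cost fields might look like the sketch below. The catalog entry shape (an `id` key next to `input_cost_per_token` and `output_cost_per_token`), the sample cost values, and the function names are assumptions; only the two cost field names and the model names are taken from this commit message.

```python
from typing import Dict, List

# Hypothetical catalog entries; the paid model's cost values are made up.
CATALOG: List[Dict] = [
    {"id": "gpt-oss-120b", "input_cost_per_token": 0.0, "output_cost_per_token": 0.0},
    {"id": "gpt-5.3-chat-latest", "input_cost_per_token": 1e-6, "output_cost_per_token": 4e-6},
    {"id": "unknown-model"},  # no cost fields reported
]


class PaidModelRefused(Exception):
    """Raised before any call is made to a model that is not provably free."""


def free_models(catalog: List[Dict]) -> List[str]:
    free = []
    for model in catalog:
        in_cost = model.get("input_cost_per_token")
        out_cost = model.get("output_cost_per_token")
        # A missing cost is treated as unknown, hence not provably free.
        if in_cost == 0 and out_cost == 0:
            free.append(model["id"])
    return free


def require_free(model_id: str, catalog: List[Dict]) -> None:
    if model_id not in free_models(catalog):
        raise PaidModelRefused(model_id)
```

Treating a missing cost field as "not free" keeps the refusal conservative: a newly listed model cannot be called until the API reports zero cost for it.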

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The free-only backend fix makes a transient "model not found" (a model
briefly unloaded on Dartmouth's vLLM cluster) retryable — the router now
walks the free peer-model fallback chain instead of failing fast. That
resilience is correct but adds real wall-clock when blips occur: a CI run
took 1264s on the 3-task fixture (vs the old 1200s budget set from
fail-fast timing). Bump to 2400s for generous headroom over the observed
worst case while still catching a genuine hang (bounded by the 180s
per-request deadline x the finite retry/fallback fan-out).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit 6173f33 into main May 27, 2026
5 checks passed
@jeremymanning jeremymanning deleted the 014-phase4-plan-tasks-testing branch May 27, 2026 23:35
