Phase 4 validation (planner + tasker) + deterministic dataset resolver by jeremymanning · Pull Request #223 · ContextLab/llmXive

jeremymanning · 2026-05-22T13:03:54Z

Summary

Validates Phase 4 of the llmXive agentic pipeline (Spec Kit Plan → Tasks with the analyze loop; issue #48, agents planner #65 + tasker #66) end-to-end on two real projects, and — because the validation surfaced that the Planner hallucinates dataset URLs — adds a deterministic dataset resolver so the Planner cites real, verified datasets instead of inventing them.

Both reference projects now reach analyzed cleanly:

| Project | Final stage | Findings | Analyze rounds |
| --- | --- | --- | --- |
| `PROJ-261` (code-duplication, CS) | `analyzed` | 0 | 5 |
| `PROJ-262` (molecular dipoles, Chemistry) | `analyzed` | 0 | 5 |

specs/014-phase4-plan-tasks-testing/carry-forward.yaml lists both as passed (ready for Phase 5); phase-report.md maps every FR → evidence.

What's here

- Phase-4 validation harness: scripts/validate_phase4.py (preflight, FR-018 reset, step-to-analyzed, --force rollback, --verify-only) + tests/integration/test_phase4_plan_tasks.py (FR-016 regression tests + schema/ordering tests, all real-call or local-http.server, no network mocks). Spec/plan/tasks/contracts under specs/014-phase4-plan-tasks-testing/.
- Deterministic dataset resolver (new feature), in src/llmxive/librarian/{dataset_resolver,dataset_sources}.py: searches HuggingFace Hub + figshare/Zenodo/DataCite (+ reuses verify.py reachability), verifies each candidate with a reachability check and a sample-stream format sniff, and injects the top-N verified dataset URLs into the Planner prompt (cite-only). Design + plan in docs/superpowers/.

Pipeline bugs found & fixed (the point of the validation)

Each is a separate commit with a regression test:

- Auditor template_vs_real (×4): false positives that would reject legitimate rich artifacts. These were body-density on a table/mermaid data-model.md; Rule 1 learning the [US1]/[Story] task labels; and Rule 2 bracket-density counting fenced flowchart labels and single-token annotations ([REVISION]). Rule 2 now counts only multi-word placeholders on a view with fences stripped, and structural labels are excluded; templates are still detected.
- Tasker Mode-B gutting spec.md: it "converged" analyze by deleting requirements (12 FR/5 SC → 0 FR/2 SC, observed on PROJ-262). Added an FR-012 guard refusing any Mode-B spec.md patch that drops requirement IDs.
- _split_multi_file didn't strip ``` code fences, which broke contracts/*.yaml.
- Dataset resolver: stored the expiring presigned HF redirect target (→ 403) instead of the stable resolve/main/... URL; FR-006 URL extraction captured a wrapping backtick.

Pre-existing issues handled along the way

- publish_blocked missing from the project-state schema enum: the publisher (FR-030, after 5 Zenodo failures) would crash on save; added the value + a contract test asserting every Stage is in the schema enum.
- Stale spec-012 scheduler idempotency tests: they asserted READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK, which spec-013 deliberately reversed (the implementer agent consumes those projects); updated the tests to spec-013 behavior (code was correct).
- VALID_FIELDS was a third copy of the field list: now built from the canonical LIBRARIAN_DEFAULT_FIELDS (SSoT, byte-identical set).
- Flaky httpbin-dependent citation-timeout test: made deterministic with a local slow server.

Decisions

- Analyze-loop cap-hit = best-effort advance to analyzed (recording converged: false); human_input_needed is reserved for an explicit Mode-B escalate verdict or backend failure.
- FR-005/006/007 gates added to the Planner (agent hardening, per clarification); FR-006 hard-fails any non-2xx/3xx with no retry.

Test status

unit + integration + contract: 735 passed, 0 failed · phase1: 23 · phase2: 202 · e2e (test_site.py): 5 · real_call: gated/skipped. The PROJ-261/262 runs and the resolver source tests are real-call (Dartmouth + HF/figshare/Zenodo/DataCite).

Known follow-ups (non-blocking)

- extract_dataset_intents over-extracts non-dataset tokens (GNN, MAE, FR-001) → some resolve to irrelevant-but-reachable HF datasets; a precision refinement.
- The Tasker re-runs its full analyze loop on both runner steps (≈2× cost), an existing inefficiency that is noted but not fixed here.

🤖 Generated with Claude Code

- T003: new canonical _research_guard.py (FR-005/006/007, stdlib only): IncompleteArtifactSet/UnreachableReference/InconsistentDataModel + assert_artifact_set_complete/assert_urls_reachable/assert_data_model_contracts_consistent.
- T004: wire the three gates into PlannerAgent.write_artifacts; unlink every artifact written this invocation on any raise (parity with guard_emit).
- T005: capture() gains optional rounds=, persisted under top-level 'rounds' (default []); escalated added to valid outcomes; rounds added to required keys.
- T006: _maybe_write_inspection reads agent._inspection_rounds and passes rounds=.
- T007: TaskerAgent accumulates one sub-record per analyze round (observability only; no decision/branch reads it).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
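The T004 cleanup behavior (unlink everything written in this invocation when any gate raises) can be sketched as follows. This is a minimal illustration under stated assumptions: `write_with_gates` is a hypothetical helper, and the gates are passed in as plain callables standing in for the FR-005/006/007 checks. The real `PlannerAgent.write_artifacts` signature is not shown in this PR text.

```python
from pathlib import Path
from typing import Callable, Iterable


def write_with_gates(
    files: dict[Path, str],
    gates: Iterable[Callable[[list[Path]], None]],
) -> list[Path]:
    """Write every file, run each gate, and unlink everything on any raise.

    Hypothetical helper: `gates` stand in for the FR-005/006/007 checks.
    """
    written: list[Path] = []
    try:
        for path, body in files.items():
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(body, encoding="utf-8")
            written.append(path)
        for gate in gates:
            gate(written)  # a gate signals failure by raising
    except Exception:
        # Leave no partial artifact set behind, then re-raise for the caller.
        for path in written:
            path.unlink(missing_ok=True)
        raise
    return written
```

Tracking only the paths written in this call keeps the rollback from touching artifacts that existed before the invocation.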

…ify + manifests) T008: preflight (Dartmouth key, runner import, stage==clarified else FR-019 decline, spec.md real, inspections dir writable). T009: FR-018 reset (delete Phase-4 outputs + memory markers, PRESERVE spec.md). T010: run with LLMXIVE_INSPECTION_DIR set; capture exit + run-id. T011: post-run verify — stage chain, five plan artifacts + tasks.md, >=10 T### lines, FR-010 ordering (check_task_ordering), FR-012 constraint-non-deletion (fr_sc_counts across Mode-B rounds), FR-020 Constitution Check. T022/T023: emit_carry_forward + emit_phase_report per the contracts. Pure helpers (check_task_ordering, fr_sc_counts, constitution_check_ok) importable by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…Phase-3 e2e tests/integration/test_phase4_plan_tasks.py (T016-T021,T025): FILE-marker split, FR-005 completeness, FR-008 template reject (real PlannerAgent.write_artifacts unlink), FR-006 URL reachability via a REAL local http.server (200 pass; 404/500/connection-refused raise; Planner unlinks on bad URL), FR-007 consistency, FR-016(c) prose-stub Mode-A reject, FR-016(d) diff-leak, FR-016(e) header-clobber, FR-012 constraint non-deletion, FR-016(f) analyze-loop cap escalation (real tasks_cmd loop, synthetic analyze/Mode-B; no real LLM), FR-010 ordering, inspection schema incl rounds + _redact no-secrets, carry-forward + phase-report schema. test_phase3_specify_clarify.py: the spec-011 gated e2e is DESTRUCTIVE (rolls PROJ-261 back to project_initialized, deletes spec.md). Skip it unless the project is still at its Phase-3 entry stage, so it can no longer clobber the Phase-4 input. ruff clean on all new files; stdlib-only guard; no new pip dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002) left unchecked for the parent to complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…facts The template-vs-real auditor stripped fenced blocks (incl. mermaid) before measuring section bodies and counted parent headings as empty, so a substantive data-model.md (ER diagram + per-entity attribute tables + fenced CSV schemas) classified 'partial'. This blocked the Planner from advancing ANY project past 'clarified'. Now tables/fenced blocks/lists count as real content and parent-of-subsection headings are not 'short'; genuinely-empty/stub sections are still flagged and literal templates are still caught by the phrase/bracket rules. Found during spec-014 Phase-4 validation; cited by failing inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (TemplateRefused body_density_short>=60pct=9/13 on a real data-model.md). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Real PROJ-261 planner output revealed the FR-007 guard was over-specified: it required every data-model entity to have a same-named contracts/ schema (and vice versa), but the Planner prompt only mandates >=1 schema, and schema filenames legitimately differ from entity headings (e.g. code_duplication_metrics.schema.yaml for a CloneDensityMetric entity). It also mis-counted 'Data Flow'/'Entity Relationships'/CSV-filename headings as entities. FR-007 now verifies (a) data-model.md defines real entities (table/diagram/headings) and (b) every contracts/*.yaml is a non-empty parseable schema; cardinality + naming are unconstrained. Updated spec/contracts/data-model and the regression tests (incl. an explicit test that the prior real-world mismatch now passes). Cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…plate phrases The literal-template-phrase rule extracted [US1]/[US2]/[US3]/[Story] from tasks-template.md and then flagged any real tasks.md (which MUST use those format labels) as 'template'. This blocked the Tasker from ever committing a tasks.md. Structural labels ([P],[ID],[TaskID],[Story],[USn]) are now excluded from the learned placeholder set; genuine placeholders ([FEATURE NAME], [Entity], [Service], ...) still trigger template detection, so the template files themselves remain correctly classified. Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/tasker.json (TemplateRefused literal_template_phrases>=3 sample=['[Story]','[US1]','[US2]']). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A prior partial run (planner advanced clarified->planned, tasker then failed) left a project at 'planned', so the FR-019 preflight correctly declined to re-run. --force rolls such a project back to 'clarified' (logged in history.jsonl) so the full planner->tasker chain can be re-validated from the canonical entry state. Default FR-019 decline behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rule 2 (unfilled_bracket_density>=6) counted bracketed node labels inside a fenced ASCII/mermaid data-flow chart (e.g. '[Dataset Download] -> data/raw/') as unfilled placeholders, rejecting a real planner data-model.md as 'template'. Rule 2 now scans a view with fenced blocks, HTML comments, and markdown links stripped, and excludes structural labels; Rule 1 (learned phrases) still uses the full text so genuine templates remain caught (verified: plan/spec/tasks templates still classify 'template'). Also gave the planner-guard test a real .specify/templates dir so it exercises Rule 1 like production. Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (unfilled_bracket_density on a fenced data-flow chart). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Validation finding: the Tasker advances across TWO runner steps (planned->tasked, tasked->analyzed) per the pipeline graph (STAGE_AFTER_AGENT), so a fixed --max-tasks 2 left the project stuck at 'tasked'. _run_pipeline now steps one agent at a time until a terminal Phase-4 stage and STOPS at 'analyzed' (never invoking the Phase-5 implementer). Per the 2026-05-21 decision, analyze-loop cap-hit WITHOUT convergence is best-effort: the Tasker accepts tasks.md, records converged:false, and the project advances to 'analyzed' (downstream reviewers catch issues); human_input_needed is reserved for an explicit Mode-B escalate verdict or a backend failure. Updated spec FR-013 / Background / US1 / US2 / US3 / edge case / data-model, and added a regression test for the best-effort cap-hit path (the explicit-escalate path test is retained). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
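The "step one agent at a time until a terminal stage" loop can be sketched as below. The stage names come from this PR; treating `analyzed` and `human_input_needed` as the only terminal Phase-4 stages, along with the `run_until_terminal` and `step_one_agent` names and the `max_steps` bound, are assumptions made for illustration and not the actual `_run_pipeline` code.

```python
from typing import Callable

# Stage names come from the PR text; treating these two as the terminal
# Phase-4 stages is an assumption of this sketch.
TERMINAL_STAGES = {"analyzed", "human_input_needed"}


def run_until_terminal(
    stage: str,
    step_one_agent: Callable[[str], str],
    max_steps: int = 10,
) -> list[str]:
    """Step one agent at a time; return the chain of stages visited."""
    chain = [stage]
    for _ in range(max_steps):
        if stage in TERMINAL_STAGES:
            break  # never step past Phase 4 into the implementer
        next_stage = step_one_agent(stage)
        if next_stage == stage:
            break  # no progress: stop instead of spinning
        stage = next_stage
        chain.append(stage)
    return chain
```

Checking the terminal set before each step is what keeps the driver from invoking a Phase-5 agent, which a fixed step count cannot guarantee.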

…e-4 stage The Tasker re-runs its full 5-round analyze loop on each runner step, and the second step (tasked->analyzed) exceeded the 1900s per-step timeout. Raised to 3600s. Also: when a run is interrupted mid-Phase-4 (project left at planned/tasked/analyze_in_progress), the driver now RESUMES from there (no reset, no rollback) and steps to a terminal stage, instead of requiring --force. FR-019 still declines projects that have COMPLETED Phase 4 (analyzed+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fix) PROJ-262's planner cited a URL wrapped in markdown backticks; _URL_RE captured the closing backtick into the path ('.../realKnownCause/`'), producing a false 404. Backtick is now excluded from the URL/doi character classes and added to the trailing-strip set. Regression test added. (Note: the planner is also citing genuinely-dead/irrelevant dataset URLs for PROJ-262 — figshare 404, NAB dir 404 — which FR-006 correctly hard-fails per the 2026-05-21 strict-mode decision; that is the gate working, distinct from this extraction bug.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
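A minimal sketch of the extraction fix, assuming a simplified pattern: `URL_RE`, `TRAILING`, and `extract_urls` are hypothetical stand-ins for the project's `_URL_RE` and its trailing-strip set, whose exact definitions are not shown here.

```python
import re

# Hypothetical simplified pattern: match http(s) URLs, excluding whitespace,
# angle brackets, quotes, and the backtick from the character class.
URL_RE = re.compile(r"https?://[^\s<>\"'`]+")

# Punctuation that commonly trails a URL in prose or markdown.
TRAILING = ".,;:!?)]}`'\""


def extract_urls(text: str) -> list[str]:
    """Return URLs found in text with trailing punctuation stripped."""
    return [m.group(0).rstrip(TRAILING) for m in URL_RE.finditer(text)]
```

Excluding the backtick in the character class handles inline-code wrapping, while the trailing strip covers sentence punctuation and closing brackets.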

The Phase-3 spec.md cited a dead QM9 DOI (10.6084/m9.figshare.9981994 -> HTTP 404), which the Planner faithfully carried into research.md, so Phase-4 FR-006 (reachability) correctly hard-failed the plan. Replaced with the canonical, verified-reachable QM9 dataset DOI 10.1038/sdata.2014.22 (Ramakrishnan et al., Sci Data 2014; HTTP 200). A Verified-Accuracy (constitution Principle II) fix. Surfaced by spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-262-predicting-molecular-dipole-moments-with/planner.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… for the Planner Brainstormed design (Approach A): a librarian/dataset_resolver.py module, called by the Planner's mechanical step, finds real datasets via HF Hub + figshare/Zenodo/DataCite + reused Semantic Scholar/arXiv, verifies reachability + a sample-stream format sniff, and injects the top-N verified candidates per dataset into the Planner prompt (cite-only, never invent). Removes URL generation from the LLM; FR-006 stays as the safety net. Root-cause fix for the PROJ-262 hallucinated-dataset-URL finding from spec-014. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

8 tasks: DatasetCandidate + HF Hub source; figshare/Zenodo/DataCite sources; sample-stream format sniff; verify_candidate (reuses verify.py); intent extraction + resolve_datasets top-N orchestration; manifest + escalation + cite-only planner block; wire into Planner; full suite + real PROJ-262 re-run. All real-call tests (HF Hub + registry APIs + local http.server fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s verify.py) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…chestration HF candidates now point at the resolve URL of an actual data file (csv/parquet/etc.) instead of the HTML landing page, so reachability+sniff verification can succeed. The landing page is HTML and is correctly rejected by the sniffer; the design calls for the HF resolve URL / streaming first rows. Data file is picked deterministically by extension preference then path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
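The "extension preference then path" selection can be illustrated with a short sketch. The preference order in `PREFERRED_EXTS` and the helper names are assumptions (the real `_HF_DATA_EXTS` list may differ); the URL shape follows the Hub's `resolve/<revision>/<path>` convention referenced elsewhere in this PR.

```python
import os
from typing import Optional
from urllib.parse import quote

# Hypothetical preference order; the real _HF_DATA_EXTS list may differ.
PREFERRED_EXTS = (".parquet", ".csv", ".json", ".xyz", ".sdf")


def pick_data_file(paths: list[str]) -> Optional[str]:
    """Pick one data file: best extension rank first, then lexicographic path."""
    ranked = []
    for path in paths:
        ext = os.path.splitext(path)[1].lower()
        if ext in PREFERRED_EXTS:
            ranked.append((PREFERRED_EXTS.index(ext), path))
    return min(ranked)[1] if ranked else None


def hf_resolve_url(repo_id: str, path: str, revision: str = "main") -> str:
    """Build the stable Hub resolve URL for a file in a dataset repo."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{quote(path)}"
```

Sorting on the `(rank, path)` tuple makes the choice independent of the order in which the Hub lists files, so repeated runs cite the same URL.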

…elper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ed audit FIX 1: add xyz/sdf/tar detectors to _detect_and_parse so the HF picker (_HF_DATA_EXTS advertises .xyz/.sdf) and the sniffer are consistent -- QM9 is natively .xyz, which was picked then wrongly rejected. xyz = integer atom-count header or "<El> x y z" coordinate lines; sdf/mol = V2000/V3000 or "$$$$" delimiter; tar = "ustar" magic at offset 257. FIX 2: split the generic "rejected" candidates_tried status into "verified", "unreachable" (reachability step failed) and "wrong_format" (reachable but the sniff failed) via new probe_candidate() + VerifyResult; verify_candidate keeps its original contract. Verified-selection behavior unchanged. FIX 4: document that the Semantic Scholar/arXiv paper-linked-data source is DEFERRED (yields paper pages, not sniffable files) and that repo_root is reserved for that future reuse (Task 7's plan_cmd still passes it). Real-call/local-http.server tests added for xyz + sdf detection and for the 404->unreachable / HTML->wrong_format audit statuses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
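The three detectors described in FIX 1 can be sketched roughly as follows. This is an illustrative approximation and not the project's `_detect_and_parse`: the function name, the regex, and the order of the checks are assumptions.

```python
import re
from typing import Optional

# "<El> x y z": an element symbol followed by three decimal coordinates.
_XYZ_COORD_RE = re.compile(r"^\s*[A-Z][a-z]?(\s+-?\d+\.\d+){3}")


def sniff_chem_or_tar(sample: bytes) -> Optional[str]:
    """Return 'tar', 'sdf', 'xyz', or None for the first bytes of a file."""
    # POSIX tar stores the magic string "ustar" at byte offset 257.
    if sample[257:262] == b"ustar":
        return "tar"
    lines = sample.decode("utf-8", errors="replace").splitlines()
    # SDF/MOL: a V2000/V3000 counts line or the "$$$$" record delimiter.
    if any(ln.strip() == "$$$$" or "V2000" in ln or "V3000" in ln for ln in lines):
        return "sdf"
    # XYZ: an integer atom-count header, or coordinate lines.
    if lines and lines[0].strip().isdigit():
        return "xyz"
    if any(_XYZ_COORD_RE.match(ln) for ln in lines):
        return "xyz"
    return None
```

An HTML landing page matches none of these checks and returns `None`, which corresponds to the `wrong_format` audit status from FIX 2.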

… DOI FIX 3: 10.1038/sdata.2014.22 is a Crossref DOI (not DataCite-registered), so search_datacite returned [] and the test was vacuously true. Replace with the Zenodo-minted DOI 10.5281/zenodo.1227121 (api.datacite.org/dois/<doi> -> 200, verified by curl), assert >=1 datacite candidate AND that its doi.org URL is reachable (200) -- genuinely exercising the resolve path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…RLs) Task 7 of the dataset-resolver plan. PlannerAgent.mechanical_step now reads spec.md, calls resolve_datasets + write_manifest, and returns a rendered dataset_block; build_prompt injects the '# Verified datasets' block after the plan template and before the comments/Task line (falling back to resolving in build_prompt when mechanical_output lacks the key). planner.md replaces the 'NEVER invent URLs' rule with the cite-only rule. New offline test stubs resolve_datasets to assert the verified block + URL reach the user message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Task-7 review caught a leftover rule that told the LLM to 'substitute a comparable open dataset that has a known-stable raw URL', hard-coding the exact NAB raw.githubusercontent URL PROJ-262 hallucinated. It contradicted the new cite-only rule and perpetuated the very hallucination the resolver removes. Replaced with: reference ONLY the verified-datasets block (verified URL or a well-known loader for that same dataset); never substitute or invent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ect target PROJ-262 run 6 surfaced this: verify_candidate stored head.final_url, which for a HuggingFace resolve URL is a short-lived presigned cas-bridge URL (X-Amz-Expires=3600). The Planner cited it; ~32 min later FR-006's re-check hit the expired signature -> HTTP 403 -> rejected. Now we store the STABLE original c.url (the HF resolve URL, which HF re-signs on every access) while still sniffing the live redirect target. Regression test: a redirecting URL keeps the stable original in the verified record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le split PROJ-262 run 7: the Planner appended a stray trailing ``` to contracts/prediction.schema.yaml, making it invalid YAML, which FR-007 (schema validity) correctly rejected. _split_multi_file now strips (1) a fence that wraps an entire file (```lang ... ```) and (2) a stray unmatched fence (odd ``` count, e.g. a trailing closer), while leaving balanced code blocks inside .md files intact. General fix for a common LLM output artifact across all planner artifacts. Regression tests added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
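The two fence cases can be sketched as below. This is a simplified illustration, not the actual `_split_multi_file` logic: `strip_stray_fences` is a hypothetical name, the wrapper case only handles a single fence pair, and dropping the last fence line when the count is odd is an assumption based on the "trailing closer" example.

```python
FENCE = "`" * 3  # built indirectly so this example nests cleanly in markdown


def strip_stray_fences(body: str) -> str:
    """Remove a whole-file fence wrapper or one unmatched fence line."""
    lines = body.strip("\n").split("\n")
    fence_idx = [i for i, ln in enumerate(lines) if ln.lstrip().startswith(FENCE)]
    if fence_idx == [0, len(lines) - 1] and len(lines) > 1:
        # Case 1: a single fence pair wraps the entire file.
        lines = lines[1:-1]
    elif len(fence_idx) % 2 == 1:
        # Case 2: odd fence count, so drop the last (unmatched) fence line.
        del lines[fence_idx[-1]]
    # Balanced fences inside the body are left untouched.
    return "\n".join(lines) + "\n"
```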

…requirements PROJ-262 run 8 reached 'analyzed' but the Tasker's Mode-B had gutted the project spec.md from 12 FR / 5 SC to 0 FR / 2 SC across rounds 1-4 — 'resolving' analyze findings by DELETING requirements (the exact constraint-weakening the constitution forbids). The Mode-B per-patch validation now refuses a spec.md patch whose distinct FR-/SC- identifier set is smaller than the current file's, alongside the existing diff/header/task-id guards. (The validation-layer FR-012 check in validate_phase4 already flagged it post-hoc; this stops the corruption at the source.) Restored PROJ-262's spec.md from git. Regression test added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
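A minimal sketch of the requirement-ID guard, assuming IDs look like `FR-012` or `SC-003`. The helper names are hypothetical. Note that the set difference used here is slightly stricter than the "identifier set is smaller" comparison described above, since it also catches a patch that swaps one ID for another.

```python
import re

# Assumed ID shape: "FR-" or "SC-" followed by digits.
REQ_ID_RE = re.compile(r"\b(?:FR|SC)-\d+\b")


def requirement_ids(spec_text: str) -> set[str]:
    """Distinct requirement IDs; repeated mentions count once."""
    return set(REQ_ID_RE.findall(spec_text))


def dropped_requirements(current: str, patched: str) -> set[str]:
    """IDs present in the current spec.md but missing from the patched one."""
    return requirement_ids(current) - requirement_ids(patched)


def patch_allowed(current: str, patched: str) -> bool:
    """Refuse any patch that removes at least one requirement ID."""
    return not dropped_requirements(current, patched)
```

Working on distinct IDs means that removing a cross-reference to a requirement that is still defined elsewhere does not trip the guard.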

PROJ-262 run 9: the Tasker annotated tasks.md with 20 single-token [REVISION] tags, tripping unfilled_bracket_density (>=6) -> tasks.md rejected as template. Rule 2 now counts ONLY multi-word descriptive placeholders ([FEATURE NAME], [e.g., ...]) — the genuine 'saturated unfilled template' signal. Single-token brackets are excluded: real ones ([FEATURE],[DATE]) are caught by Rule 1's learned set, and LLM annotations/labels ([P],[US1],[REVISION],[X]) legitimately appear in a real tasks.md. Templates still classify 'template' (verified plan/spec/tasks). Root fix for the recurring bracket-density false-positives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
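The revised Rule 2 can be approximated with the sketch below. The regexes and function names are assumptions for illustration; only the threshold of 6 and the "multi-word placeholders on a view with fences, HTML comments, and links stripped" behavior come from the commit messages in this PR.

```python
import re

FENCED_BLOCK_RE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
MD_LINK_RE = re.compile(r"\[[^\]]*\]\([^)]*\)")
BRACKET_RE = re.compile(r"\[([^\[\]\n]+)\]")

THRESHOLD = 6  # from the unfilled_bracket_density>=6 rule


def multiword_placeholder_count(text: str) -> int:
    """Count bracketed placeholders that contain two or more words."""
    view = text
    for pattern in (FENCED_BLOCK_RE, HTML_COMMENT_RE, MD_LINK_RE):
        view = pattern.sub("", view)
    return sum(
        1 for m in BRACKET_RE.finditer(view) if len(m.group(1).split()) >= 2
    )


def looks_like_unfilled_template(text: str) -> bool:
    return multiword_placeholder_count(text) >= THRESHOLD
```

Single-token brackets such as `[P]`, `[US1]`, or `[REVISION]` contribute nothing to the count, so an annotated real tasks.md stays below the threshold.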

…rap-up fr_sc_counts now counts DISTINCT requirement ids (not occurrences), fixing a false-positive FR-012 finding (a dropped cross-reference is not a deleted requirement). Added --verify-only mode to re-verify existing analyzed artifacts + emit manifests without a pipeline re-run. Result: PROJ-261 + PROJ-262 both reach 'analyzed', 0 findings (verify-only: 2 passed). carry-forward.yaml + phase-report.md generated; both projects 'passed', Mode-B exercised on real content (5 rounds each). Commits the produced plan artifacts + tasks.md + inspection records + analyzed project state. All 28 spec-014 tasks complete. Known follow-up: extract_dataset_intents over-extracts non-dataset tokens (GNN/MAE/MUST/FR-001) -> some resolve to irrelevant (but reachable) HF datasets; a precision refinement, not a blocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…heduler tests

Two pre-existing issues surfaced during spec-014 validation, fixed here (handle-issues-as-you-go):

1. project-state schema's current_stage enum was MISSING 'publish_blocked' (the only Stage value absent), so the publisher (agents/publisher.py FR-030, sets publish_blocked after 5 Zenodo failures) would crash on save with a ValidationError. Added it to the enum + a contract test asserting every Stage value is in the schema enum (guards against future drift).
2. test_revision_in_progress_idempotency.py asserted READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK, the spec-012 expectation that spec-013 deliberately REVERSED (the llmXive-implementer agent now consumes those projects via the scheduler; documented in scheduler.py + implementer.py). Updated the 2 stale tests to the spec-013 behavior (preserving their idempotency intent with genuinely-locked stages); did NOT revert the correct code.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
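The enum-drift contract check from item 1 can be sketched as follows. The `Stage` members listed are a subset taken from stage names mentioned in this PR, and the schema dict layout is an assumption; the real enum and JSON schema are larger.

```python
from enum import Enum


class Stage(str, Enum):
    # Illustrative subset of stage names that appear in this PR.
    CLARIFIED = "clarified"
    PLANNED = "planned"
    TASKED = "tasked"
    ANALYZED = "analyzed"
    PUBLISH_BLOCKED = "publish_blocked"


def stages_missing_from_schema(schema: dict) -> set[str]:
    """Stage values that the schema's current_stage enum does not allow."""
    allowed = set(schema["properties"]["current_stage"]["enum"])
    return {stage.value for stage in Stage} - allowed


# Hypothetical schema fragment, shaped like a JSON Schema object.
PROJECT_STATE_SCHEMA = {
    "properties": {
        "current_stage": {"enum": [stage.value for stage in Stage]}
    }
}
```

A contract test would then assert that `stages_missing_from_schema(PROJECT_STATE_SCHEMA)` is empty, so adding a new `Stage` member without updating the schema fails immediately.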

…LT_FIELDS (SSoT) submission_intake.py re-typed the librarian's canonical 9-field list as a prefix of VALID_FIELDS, a third copy of the field list that violated Constitution Principle I (single source of truth) and failed test_librarian_default_fields::test_no_third_copy_of_the_field_list. VALID_FIELDS is now frozenset(LIBRARIAN_DEFAULT_FIELDS) | {submission-only extras}; the resulting set is byte-identical (17 fields), so classification/validation behavior is unchanged. Pre-existing issue, fixed per handle-issues-as-you-go. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… flaky httpbin) test_timeout_fires depended on the public httpbin.org/delay/30 endpoint sleeping 30s; when httpbin is overloaded it responds fast, taking the non-timeout unreachable path (api_response_snippet=None) and failing the assertion. Replaced with a LOCAL http.server that sleeps past the timeout, so the deadline deterministically fires (TimeoutError path, snippet set). No third-party dependency; tests the actual timeout behavior reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
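A local slow server of the kind described can be sketched with the standard library alone. The handler name, the delay value, and the `request_times_out` helper are assumptions for illustration and do not reproduce the project's actual test.

```python
import http.server
import socket
import threading
import time
import urllib.error
import urllib.request


class SlowHandler(http.server.BaseHTTPRequestHandler):
    delay_s = 1.0  # sleep longer than the client timeout under test

    def do_GET(self):
        time.sleep(self.delay_s)
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"late")

    def log_message(self, *args):
        pass  # keep test output quiet


def request_times_out(timeout_s: float) -> bool:
    """Return True if a GET against the slow local server hits the timeout."""
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        urllib.request.urlopen(url, timeout=timeout_s).read()
        return False
    except socket.timeout:
        return True
    except urllib.error.URLError as exc:
        return isinstance(exc.reason, socket.timeout)
    finally:
        server.shutdown()
        server.server_close()
```

Binding to port 0 lets the OS pick a free port, so parallel test runs do not collide.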

The two publisher sandbox tests HEAD the freshly-minted DOI as a smoke check that the resolver knows about it. They only accepted (200,302,403), but a just-minted DataCite/Zenodo sandbox DOI returns 202 ("Accepted", still propagating) before it settles — a racy CI failure unrelated to the publisher code. Broaden both assertions to any 2xx/3xx plus 403 (doi.org's bare-HEAD response for sandbox DOIs); the only real failure is 404/5xx. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…4-min CI hang) Root cause of the real-call job timing out: the Dartmouth backend wrapped client.invoke() in `with ThreadPoolExecutor() as ex: ex.submit(...).result(180)`. When the 180s deadline fired and raised, exiting the `with` block invoked ThreadPoolExecutor.__exit__ -> shutdown(wait=True), which BLOCKS until the still-hung worker thread finishes. Since the worker was stuck in a socket read with no HTTP timeout (langchain forwards `timeout` as a chat-completion body param, not an HTTP/socket timeout), shutdown never returned — the implementer e2e test stalled ~54 min until the 60-min job cap cancelled it (the test step emitted zero output between 16:22 and the 17:16 cancel). Replace the executor with a shared `invoke_with_deadline` helper in backends/base.py that runs the call on a DAEMON thread and abandons it past the deadline. A daemon thread never blocks interpreter exit (unlike an abandoned ThreadPoolExecutor worker, which its atexit join would wait on), so a sick connection can no longer hang the process. Apply it to both the Dartmouth and HuggingFace backends (HF's bare client.invoke() had no deadline at all — same latent hang as a fallback). Verified: unit test asserts the caller regains control ~at the deadline (not after the slow call finishes) and the abandoned worker is a daemon; a real Dartmouth chat call still succeeds through the wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
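The daemon-thread approach can be sketched as below. This is a minimal illustration of the technique under stated assumptions: the argument list and error handling are guesses, and the real `invoke_with_deadline` in backends/base.py may differ.

```python
import threading
from typing import Any, Callable


def invoke_with_deadline(fn: Callable[[], Any], timeout_s: float) -> Any:
    """Run fn on a daemon thread and abandon it once the deadline passes."""
    outcome: dict[str, Any] = {}

    def worker() -> None:
        try:
            outcome["value"] = fn()
        except Exception as exc:  # hand the failure back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # returns at the deadline even if fn is stuck
    if thread.is_alive():
        # The worker is abandoned; as a daemon it cannot block process exit.
        raise TimeoutError(f"call exceeded {timeout_s}s deadline")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

Unlike the `with ThreadPoolExecutor()` form, nothing here waits for the worker after the deadline, so the caller regains control even when the underlying socket read never returns.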

…-testing

…odel drift

Dartmouth's chat catalog now mixes FREE self-hosted models with PAID external providers (gpt-5, claude, gemini, voyage, ...). The real-call CI failed because (1) a test fell through to the PAID gpt-5.3-chat-latest, which rejects temperature=0, and (2) a transient "model not found" on the primary model was misclassified as permanent, so the router never fell through to the free gpt-oss-120b peer.

- dartmouth.py: derive the free-model set from the API's explicit input/output_cost_per_token fields (authoritative, not heuristic); refuse any non-free model before calling it (Constitution Principle IV: v1 cost==0); route list_models() through the working chat.dartmouth.edu/api/models endpoint (ChatDartmouth.list() targets a Dartmouth host that rejects the chat key and returns non-JSON); classify "model not found" as transient so the router falls through to a free peer; drop unsupported temperature on retry.
- router.py: MODEL_FALLBACKS use free models only (gemma-3-27b-it -> gemma-4-31B-it).
- registry.yaml: lightweight agents -> gemma-4-31B-it; paper_implementer and all intensive agents -> qwen.qwen3.5-122b.
- test_dartmouth_chat.py: select only free models (never a paid gpt-5).

Verified locally: the 4 originally-failing real-call tests pass; full real_call suite 18 passed / 4 skipped (HF/network-gated); unit+contract 592 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
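Deriving the free-model set from the cost fields might look like the sketch below. The catalog entry shape (a list of dicts with an `id` key) and the choice to treat a missing cost field as non-free are assumptions; only the `input_cost_per_token` and `output_cost_per_token` field names come from the commit message.

```python
from typing import Any


def free_model_ids(catalog: list[dict[str, Any]]) -> set[str]:
    """Models whose input and output per-token costs are both exactly zero."""
    free: set[str] = set()
    for entry in catalog:
        costs = (
            entry.get("input_cost_per_token"),
            entry.get("output_cost_per_token"),
        )
        # Assumption: a missing cost field is treated as "not provably free".
        if all(cost is not None and float(cost) == 0.0 for cost in costs):
            free.add(entry["id"])
    return free


def assert_model_is_free(model_id: str, catalog: list[dict[str, Any]]) -> None:
    """Refuse a non-free model before any request is sent."""
    if model_id not in free_model_ids(catalog):
        raise ValueError(f"refusing non-free model: {model_id}")
```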

The free-only backend fix makes a transient "model not found" (a model briefly unloaded on Dartmouth's vLLM cluster) retryable — the router now walks the free peer-model fallback chain instead of failing fast. That resilience is correct but adds real wall-clock when blips occur: a CI run took 1264s on the 3-task fixture (vs the old 1200s budget set from fail-fast timing). Bump to 2400s for generous headroom over the observed worst case while still catching a genuine hang (bounded by the 180s per-request deadline x the finite retry/fallback fan-out). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

jeremymanning and others added 30 commits May 21, 2026 08:06

docs(014): mark T003-T011,T016-T023,T025-T028 done in tasks.md

54b6246

Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002) left unchecked for the parent to complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(dataset-resolver): DatasetCandidate + HuggingFace Hub source

60c75fb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(dataset-resolver): figshare/Zenodo/DataCite sources

8a19fdb

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(dataset-resolver): sample-stream format sniff

6671a90

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(dataset-resolver): verify_candidate (reachability + sniff, reuse…

8c9892f

…s verify.py) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(dataset-resolver): manifest write + planner block + unresolved h…

dfc1431

…elper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jeremymanning and others added 10 commits May 22, 2026 04:23

chore(014): commit Phase-4 validation run-log entries (FR-014 audit t…

0147f8a

…rail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: run-log entries from final regression runs

cfa6e76

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into 014-phase4-plan-tasks…

01dddb3

…-testing

jeremymanning merged commit 6173f33 into main May 27, 2026
5 checks passed

jeremymanning deleted the 014-phase4-plan-tasks-testing branch May 27, 2026 23:35

jeremymanning merged 40 commits into main from 014-phase4-plan-tasks-testing