Phase 4 validation (planner + tasker) + deterministic dataset resolver #223
Merged
T003: new canonical _research_guard.py (FR-005/006/007, stdlib only): IncompleteArtifactSet/UnreachableReference/InconsistentDataModel + assert_artifact_set_complete/assert_urls_reachable/assert_data_model_contracts_consistent. T004: wire the three gates into PlannerAgent.write_artifacts; unlink every artifact written this invocation on any raise (parity with guard_emit). T005: capture() gains optional rounds=, persisted under top-level 'rounds' (default []); escalated added to valid outcomes; rounds added to required keys. T006: _maybe_write_inspection reads agent._inspection_rounds and passes rounds=. T007: TaskerAgent accumulates one sub-record per analyze round (observability only; no decision/branch reads it). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ify + manifests) T008: preflight (Dartmouth key, runner import, stage==clarified else FR-019 decline, spec.md real, inspections dir writable). T009: FR-018 reset (delete Phase-4 outputs + memory markers, PRESERVE spec.md). T010: run with LLMXIVE_INSPECTION_DIR set; capture exit + run-id. T011: post-run verify — stage chain, five plan artifacts + tasks.md, >=10 T### lines, FR-010 ordering (check_task_ordering), FR-012 constraint-non-deletion (fr_sc_counts across Mode-B rounds), FR-020 Constitution Check. T022/T023: emit_carry_forward + emit_phase_report per the contracts. Pure helpers (check_task_ordering, fr_sc_counts, constitution_check_ok) importable by tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Phase-3 e2e tests/integration/test_phase4_plan_tasks.py (T016-T021,T025): FILE-marker split, FR-005 completeness, FR-008 template reject (real PlannerAgent.write_artifacts unlink), FR-006 URL reachability via a REAL local http.server (200 pass; 404/500/connection-refused raise; Planner unlinks on bad URL), FR-007 consistency, FR-016(c) prose-stub Mode-A reject, FR-016(d) diff-leak, FR-016(e) header-clobber, FR-012 constraint non-deletion, FR-016(f) analyze-loop cap escalation (real tasks_cmd loop, synthetic analyze/Mode-B; no real LLM), FR-010 ordering, inspection schema incl rounds + _redact no-secrets, carry-forward + phase-report schema. test_phase3_specify_clarify.py: the spec-011 gated e2e is DESTRUCTIVE (rolls PROJ-261 back to project_initialized, deletes spec.md). Skip it unless the project is still at its Phase-3 entry stage, so it can no longer clobber the Phase-4 input. ruff clean on all new files; stdlib-only guard; no new pip dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real-run-dependent tasks (T012-T015,T024) and parent-owned setup (T001-T002) left unchecked for the parent to complete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…facts The template-vs-real auditor stripped fenced blocks (incl. mermaid) before measuring section bodies and counted parent headings as empty, so a substantive data-model.md (ER diagram + per-entity attribute tables + fenced CSV schemas) classified 'partial'. This blocked the Planner from advancing ANY project past 'clarified'. Now tables/fenced blocks/lists count as real content and parent-of-subsection headings are not 'short'; genuinely-empty/stub sections are still flagged and literal templates are still caught by the phrase/bracket rules. Found during spec-014 Phase-4 validation; cited by failing inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (TemplateRefused body_density_short>=60pct=9/13 on a real data-model.md). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real PROJ-261 planner output revealed the FR-007 guard was over-specified: it required every data-model entity to have a same-named contracts/ schema (and vice versa), but the Planner prompt only mandates >=1 schema, and schema filenames legitimately differ from entity headings (e.g. code_duplication_metrics.schema.yaml for a CloneDensityMetric entity). It also mis-counted 'Data Flow'/'Entity Relationships'/CSV-filename headings as entities. FR-007 now verifies (a) data-model.md defines real entities (table/diagram/headings) and (b) every contracts/*.yaml is a non-empty parseable schema; cardinality + naming are unconstrained. Updated spec/contracts/data-model and the regression tests (incl. an explicit test that the prior real-world mismatch now passes). Cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plate phrases The literal-template-phrase rule extracted [US1]/[US2]/[US3]/[Story] from tasks-template.md and then flagged any real tasks.md (which MUST use those format labels) as 'template'. This blocked the Tasker from ever committing a tasks.md. Structural labels ([P],[ID],[TaskID],[Story],[USn]) are now excluded from the learned placeholder set; genuine placeholders ([FEATURE NAME], [Entity], [Service], ...) still trigger template detection, so the template files themselves remain correctly classified. Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/tasker.json (TemplateRefused literal_template_phrases>=3 sample=['[Story]','[US1]','[US2]']). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A prior partial run (planner advanced clarified->planned, tasker then failed) left a project at 'planned', so the FR-019 preflight correctly declined to re-run. --force rolls such a project back to 'clarified' (logged in history.jsonl) so the full planner->tasker chain can be re-validated from the canonical entry state. Default FR-019 decline behavior is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rule 2 (unfilled_bracket_density>=6) counted bracketed node labels inside a fenced ASCII/mermaid data-flow chart (e.g. '[Dataset Download] -> data/raw/') as unfilled placeholders, rejecting a real planner data-model.md as 'template'. Rule 2 now scans a view with fenced blocks, HTML comments, and markdown links stripped, and excludes structural labels; Rule 1 (learned phrases) still uses the full text so genuine templates remain caught (verified: plan/spec/tasks templates still classify 'template'). Also gave the planner-guard test a real .specify/templates dir so it exercises Rule 1 like production. Found during spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-261-evaluating-the-impact-of-code-duplicatio/planner.json (unfilled_bracket_density on a fenced data-flow chart). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validation finding: the Tasker advances across TWO runner steps (planned->tasked, tasked->analyzed) per the pipeline graph (STAGE_AFTER_AGENT), so a fixed --max-tasks 2 left the project stuck at 'tasked'. _run_pipeline now steps one agent at a time until a terminal Phase-4 stage and STOPS at 'analyzed' (never invoking the Phase-5 implementer). Per the 2026-05-21 decision, analyze-loop cap-hit WITHOUT convergence is best-effort: the Tasker accepts tasks.md, records converged:false, and the project advances to 'analyzed' (downstream reviewers catch issues); human_input_needed is reserved for an explicit Mode-B escalate verdict or a backend failure. Updated spec FR-013 / Background / US1 / US2 / US3 / edge case / data-model, and added a regression test for the best-effort cap-hit path (the explicit-escalate path test is retained). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-4 stage The Tasker re-runs its full 5-round analyze loop on each runner step, and the second step (tasked->analyzed) exceeded the 1900s per-step timeout. Raised to 3600s. Also: when a run is interrupted mid-Phase-4 (project left at planned/tasked/analyze_in_progress), the driver now RESUMES from there (no reset, no rollback) and steps to a terminal stage, instead of requiring --force. FR-019 still declines projects that have COMPLETED Phase 4 (analyzed+). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fix)
PROJ-262's planner cited a URL wrapped in markdown backticks; _URL_RE captured
the closing backtick into the path ('.../realKnownCause/`'), producing a false
404. Backtick is now excluded from the URL/doi character classes and added to
the trailing-strip set. Regression test added. (Note: the planner is also
citing genuinely-dead/irrelevant dataset URLs for PROJ-262 — figshare 404, NAB
dir 404 — which FR-006 correctly hard-fails per the 2026-05-21 strict-mode
decision; that is the gate working, distinct from this extraction bug.)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
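The extraction fix described above can be illustrated with a minimal sketch. `URL_RE`, `TRAILING`, and `extract_urls` are hypothetical stand-ins, not the real `_URL_RE` or its strip set; the only behavior taken from the commit is that the backtick is excluded from the URL character class and also stripped when trailing.

```python
import re

# Hypothetical stand-in for _URL_RE: the backtick is excluded from the
# character class so a markdown code span cannot leak into the path.
URL_RE = re.compile(r"""https?://[^\s`<>"')\]]+""")

# Trailing characters that commonly follow a URL in prose (assumed set).
TRAILING = ".,;:!?`"


def extract_urls(text: str) -> list:
    """Return URLs found in text, with trailing punctuation removed."""
    return [m.group(0).rstrip(TRAILING) for m in URL_RE.finditer(text)]


sample = "See `https://example.org/data/realKnownCause/` for the files."
print(extract_urls(sample))
```

With the backtick inside the character class, the same input would yield a path ending in a backtick, which is the false-404 shape the commit describes.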
The Phase-3 spec.md cited a dead QM9 DOI (10.6084/m9.figshare.9981994 -> HTTP 404), which the Planner faithfully carried into research.md, so Phase-4 FR-006 (reachability) correctly hard-failed the plan. Replaced with the canonical, verified-reachable QM9 dataset DOI 10.1038/sdata.2014.22 (Ramakrishnan et al., Sci Data 2014; HTTP 200). A Verified-Accuracy (constitution Principle II) fix. Surfaced by spec-014 Phase-4 validation; cited by inspection record specs/014-phase4-plan-tasks-testing/inspections/PROJ-262-predicting-molecular-dipole-moments-with/planner.json. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… for the Planner Brainstormed design (Approach A): a librarian/dataset_resolver.py module, called by the Planner's mechanical step, finds real datasets via HF Hub + figshare/Zenodo/DataCite + reused Semantic Scholar/arXiv, verifies reachability + a sample-stream format sniff, and injects the top-N verified candidates per dataset into the Planner prompt (cite-only, never invent). Removes URL generation from the LLM; FR-006 stays as the safety net. Root-cause fix for the PROJ-262 hallucinated-dataset-URL finding from spec-014. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
8 tasks: DatasetCandidate + HF Hub source; figshare/Zenodo/DataCite sources; sample-stream format sniff; verify_candidate (reuses verify.py); intent extraction + resolve_datasets top-N orchestration; manifest + escalation + cite-only planner block; wire into Planner; full suite + real PROJ-262 re-run. All real-call tests (HF Hub + registry APIs + local http.server fixture). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s verify.py) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chestration HF candidates now point at the resolve URL of an actual data file (csv/parquet/etc.) instead of the HTML landing page, so reachability+sniff verification can succeed. The landing page is HTML and is correctly rejected by the sniffer; the design calls for the HF resolve URL / streaming first rows. Data file is picked deterministically by extension preference then path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…elper Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ed audit FIX 1: add xyz/sdf/tar detectors to _detect_and_parse so the HF picker (_HF_DATA_EXTS advertises .xyz/.sdf) and the sniffer are consistent -- QM9 is natively .xyz, which was picked then wrongly rejected. xyz = integer atom-count header or "<El> x y z" coordinate lines; sdf/mol = V2000/V3000 or "$$$$" delimiter; tar = "ustar" magic at offset 257. FIX 2: split the generic "rejected" candidates_tried status into "verified", "unreachable" (reachability step failed) and "wrong_format" (reachable but the sniff failed) via new probe_candidate() + VerifyResult; verify_candidate keeps its original contract. Verified-selection behavior unchanged. FIX 4: document that the Semantic Scholar/arXiv paper-linked-data source is DEFERRED (yields paper pages, not sniffable files) and that repo_root is reserved for that future reuse (Task 7's plan_cmd still passes it). Real-call/local-http.server tests added for xyz + sdf detection and for the 404->unreachable / HTML->wrong_format audit statuses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
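A rough sketch of the three detectors named in FIX 1, assuming the sniffer receives the first few hundred bytes of the stream. `sniff_format` and its return values are hypothetical; the real `_detect_and_parse` is not shown in this PR text and likely does more (for example, actually parsing the sample).

```python
import re
from typing import Optional

# "<El> x y z": an element symbol followed by three decimal coordinates.
_XYZ_COORD = re.compile(r"^[A-Z][a-z]?\s+-?\d+\.\d+\s+-?\d+\.\d+\s+-?\d+\.\d+")


def sniff_format(head: bytes) -> Optional[str]:
    """Guess a format from the first bytes of a file; None if unrecognised."""
    # POSIX tar stores the magic string "ustar" at byte offset 257.
    if head[257:262] == b"ustar":
        return "tar"
    lines = head.decode("utf-8", errors="replace").splitlines()
    # SDF/MOL: a V2000/V3000 counts line or the "$$$$" record delimiter.
    if any("V2000" in ln or "V3000" in ln or ln.strip() == "$$$$" for ln in lines):
        return "sdf"
    # XYZ: an integer atom-count header, or coordinate-looking lines.
    if lines and lines[0].strip().isdigit():
        return "xyz"
    if any(_XYZ_COORD.match(ln.strip()) for ln in lines):
        return "xyz"
    return None


print(sniff_format(b"5\nmethane\nC 0.000 0.000 0.000\n"))
```

An HTML landing page matches none of the branches and falls through to `None`, which corresponds to the "reachable but wrong_format" audit status from FIX 2.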
… DOI FIX 3: 10.1038/sdata.2014.22 is a Crossref DOI (not DataCite-registered), so search_datacite returned [] and the test was vacuously true. Replace with the Zenodo-minted DOI 10.5281/zenodo.1227121 (api.datacite.org/dois/<doi> -> 200, verified by curl), assert >=1 datacite candidate AND that its doi.org URL is reachable (200) -- genuinely exercising the resolve path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…RLs) Task 7 of the dataset-resolver plan. PlannerAgent.mechanical_step now reads spec.md, calls resolve_datasets + write_manifest, and returns a rendered dataset_block; build_prompt injects the '# Verified datasets' block after the plan template and before the comments/Task line (falling back to resolving in build_prompt when mechanical_output lacks the key). planner.md replaces the 'NEVER invent URLs' rule with the cite-only rule. New offline test stubs resolve_datasets to assert the verified block + URL reach the user message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Task-7 review caught a leftover rule that told the LLM to 'substitute a comparable open dataset that has a known-stable raw URL', hard-coding the exact NAB raw.githubusercontent URL PROJ-262 hallucinated. It contradicted the new cite-only rule and perpetuated the very hallucination the resolver removes. Replaced with: reference ONLY the verified-datasets block (verified URL or a well-known loader for that same dataset); never substitute or invent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ect target PROJ-262 run 6 surfaced this: verify_candidate stored head.final_url, which for a HuggingFace resolve URL is a short-lived presigned cas-bridge URL (X-Amz-Expires=3600). The Planner cited it; ~32 min later FR-006's re-check hit the expired signature -> HTTP 403 -> rejected. Now we store the STABLE original c.url (the HF resolve URL, which HF re-signs on every access) while still sniffing the live redirect target. Regression test: a redirecting URL keeps the stable original in the verified record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le split PROJ-262 run 7: the Planner appended a stray trailing ``` to contracts/prediction.schema.yaml, making it invalid YAML, which FR-007 (schema validity) correctly rejected. _split_multi_file now strips (1) a fence that wraps an entire file (```lang ... ```) and (2) a stray unmatched fence (odd ``` count, e.g. a trailing closer), while leaving balanced code blocks inside .md files intact. General fix for a common LLM output artifact across all planner artifacts. Regression tests added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
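The two fence cases can be sketched as below. This is an illustration under assumptions, not the real `_split_multi_file`: `strip_stray_fences` is a hypothetical helper applied to one file body, and it treats "exactly two fences, on the first and last line" as the whole-file wrapper.

```python
FENCE = "`" * 3  # built indirectly so this example nests cleanly in markdown


def strip_stray_fences(body: str) -> str:
    """Remove a whole-file wrapper fence or a single unmatched fence line."""
    lines = body.strip("\n").split("\n")
    is_fence = [ln.lstrip().startswith(FENCE) for ln in lines]
    count = sum(is_fence)

    # Case 1: exactly one fence pair that wraps the entire file.
    if count == 2 and is_fence[0] and is_fence[-1]:
        return "\n".join(lines[1:-1]) + "\n"

    # Case 2: odd fence count. Drop the unmatched one, preferring an edge.
    if count % 2 == 1:
        if is_fence[-1]:
            drop = len(lines) - 1
        elif is_fence[0]:
            drop = 0
        else:
            drop = max(i for i, f in enumerate(is_fence) if f)
        del lines[drop]

    # Balanced fences inside the file (e.g. code blocks in .md) are untouched.
    return "\n".join(lines) + "\n"


print(strip_stray_fences("type: object\n" + FENCE + "\n"))
```

A wrapper fence around a file that also contains inner code blocks (four or more fences) is ambiguous and deliberately left alone in this sketch.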
…requirements PROJ-262 run 8 reached 'analyzed' but the Tasker's Mode-B had gutted the project spec.md from 12 FR / 5 SC to 0 FR / 2 SC across rounds 1-4 — 'resolving' analyze findings by DELETING requirements (the exact constraint-weakening the constitution forbids). The Mode-B per-patch validation now refuses a spec.md patch whose distinct FR-/SC- identifier set is smaller than the current file's, alongside the existing diff/header/task-id guards. (The validation-layer FR-012 check in validate_phase4 already flagged it post-hoc; this stops the corruption at the source.) Restored PROJ-262's spec.md from git. Regression test added. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
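A minimal sketch of the per-patch check, assuming requirement identifiers look like `FR-001` / `SC-001`. The helper names are hypothetical, and the sketch uses a set difference, which is slightly stricter than the "set is smaller" comparison described in the commit.

```python
import re

_REQ_ID = re.compile(r"\b(?:FR|SC)-\d+\b")


def requirement_ids(text: str) -> set:
    """Distinct FR-/SC- identifiers, so repeated mentions count once."""
    return set(_REQ_ID.findall(text))


def dropped_requirements(current: str, patched: str) -> set:
    """Return IDs present in the current spec but missing from the patch."""
    return requirement_ids(current) - requirement_ids(patched)


current = "FR-001 MUST load data. FR-002 builds on FR-001. SC-001 holds."
patched = "FR-001 MUST load data. SC-001 holds."
print(sorted(dropped_requirements(current, patched)))
```

Because the comparison is over distinct IDs, removing a cross-reference to `FR-001` while keeping its definition does not count as a deletion.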
PROJ-262 run 9: the Tasker annotated tasks.md with 20 single-token [REVISION] tags, tripping unfilled_bracket_density (>=6) -> tasks.md rejected as template. Rule 2 now counts ONLY multi-word descriptive placeholders ([FEATURE NAME], [e.g., ...]) — the genuine 'saturated unfilled template' signal. Single-token brackets are excluded: real ones ([FEATURE],[DATE]) are caught by Rule 1's learned set, and LLM annotations/labels ([P],[US1],[REVISION],[X]) legitimately appear in a real tasks.md. Templates still classify 'template' (verified plan/spec/tasks). Root fix for the recurring bracket-density false-positives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
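The revised Rule 2 can be approximated as follows. The regexes and `multiword_placeholder_count` are illustrative assumptions; the real auditor's patterns are not shown here. The result would then be compared against the `>=6` threshold mentioned above.

```python
import re

FENCE = "`" * 3  # built indirectly so this example nests cleanly in markdown
_FENCED = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_LINK = re.compile(r"\[[^\]]*\]\([^)]*\)")
_BRACKET = re.compile(r"\[([^\[\]\n]+)\]")


def multiword_placeholder_count(text: str) -> int:
    """Count bracketed spans of two or more words outside fences/comments/links."""
    view = _FENCED.sub("", text)
    view = _COMMENT.sub("", view)
    view = _LINK.sub("", view)
    return sum(1 for m in _BRACKET.finditer(view) if len(m.group(1).split()) >= 2)


real = (
    "- [ ] T001 [P] [US1] Build loader [REVISION]\n"
    + FENCE + "\n[Dataset Download] -> data/raw/\n" + FENCE + "\n"
    + "See [the docs](https://example.org).\n"
)
template = "# [FEATURE NAME]\n[e.g., Python 3.11]\n[DATE]\n"
print(multiword_placeholder_count(real), multiword_placeholder_count(template))
```

Single-token brackets such as `[DATE]` are not counted here; per the commit, those are left to Rule 1's learned phrase set.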
…rap-up fr_sc_counts now counts DISTINCT requirement ids (not occurrences), fixing a false-positive FR-012 finding (a dropped cross-reference is not a deleted requirement). Added --verify-only mode to re-verify existing analyzed artifacts + emit manifests without a pipeline re-run. Result: PROJ-261 + PROJ-262 both reach 'analyzed', 0 findings (verify-only: 2 passed). carry-forward.yaml + phase-report.md generated; both projects 'passed', Mode-B exercised on real content (5 rounds each). Commits the produced plan artifacts + tasks.md + inspection records + analyzed project state. All 28 spec-014 tasks complete. Known follow-up: extract_dataset_intents over-extracts non-dataset tokens (GNN/MAE/MUST/FR-001) -> some resolve to irrelevant (but reachable) HF datasets; a precision refinement, not a blocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…heduler tests Two pre-existing issues surfaced during spec-014 validation, fixed here (handle-issues-as-you-go):
1. project-state schema's current_stage enum was MISSING 'publish_blocked' (the only Stage value absent), so the publisher (agents/publisher.py FR-030, sets publish_blocked after 5 Zenodo failures) would crash on save with a ValidationError. Added it to the enum + a contract test asserting every Stage value is in the schema enum (guards against future drift).
2. test_revision_in_progress_idempotency.py asserted READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK, the spec-012 expectation that spec-013 deliberately REVERSED (the llmXive-implementer agent now consumes those projects via the scheduler; documented in scheduler.py + implementer.py). Updated the 2 stale tests to the spec-013 behavior (preserving their idempotency intent with genuinely-locked stages); did NOT revert the correct code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
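The drift guard from item 1 amounts to a set comparison between the enum and the schema. The sketch below uses a hypothetical five-member `Stage` subset and an assumed JSON-Schema-like layout (`properties.current_stage.enum`); the real enum and schema file are larger.

```python
from enum import Enum


class Stage(str, Enum):  # hypothetical subset of the real Stage enum
    CLARIFIED = "clarified"
    PLANNED = "planned"
    TASKED = "tasked"
    ANALYZED = "analyzed"
    PUBLISH_BLOCKED = "publish_blocked"


def stages_missing_from_schema(schema: dict) -> set:
    """Return Stage values that the schema's current_stage enum does not allow."""
    allowed = set(schema["properties"]["current_stage"]["enum"])
    return {s.value for s in Stage} - allowed


base = ["clarified", "planned", "tasked", "analyzed"]
schema_before = {"properties": {"current_stage": {"enum": base}}}
schema_after = {"properties": {"current_stage": {"enum": base + ["publish_blocked"]}}}
print(stages_missing_from_schema(schema_before))
```

A contract test would assert that the returned set is empty, so any future `Stage` member added without a schema update fails immediately.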
…LT_FIELDS (SSoT)
submission_intake.py re-typed the librarian's canonical 9-field list as a prefix
of VALID_FIELDS, a third copy of the field list that violated Constitution
Principle I (single source of truth) and failed
test_librarian_default_fields::test_no_third_copy_of_the_field_list. VALID_FIELDS
is now frozenset(LIBRARIAN_DEFAULT_FIELDS) | {submission-only extras}; the
resulting set is byte-identical (17 fields), so classification/validation
behavior is unchanged. Pre-existing issue, fixed per handle-issues-as-you-go.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flaky httpbin) test_timeout_fires depended on the public httpbin.org/delay/30 endpoint sleeping 30s; when httpbin is overloaded it responds fast, taking the non-timeout unreachable path (api_response_snippet=None) and failing the assertion. Replaced with a LOCAL http.server that sleeps past the timeout, so the deadline deterministically fires (TimeoutError path, snippet set). No third-party dependency; tests the actual timeout behavior reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
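A self-contained sketch of the local slow-server technique, using only the standard library. `SlowHandler` and `request_times_out` are hypothetical names, and the real test exercises the project's own reachability code rather than `urllib` directly.

```python
import socket
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class SlowHandler(BaseHTTPRequestHandler):
    delay = 1.0  # seconds to sleep before answering

    def do_GET(self):
        time.sleep(self.delay)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"late")
        except OSError:
            pass  # the client already gave up; nothing to send

    def log_message(self, *args):
        pass  # keep test output quiet


def request_times_out(delay: float, timeout: float) -> bool:
    """Serve one slow response locally and report whether the client timed out."""
    handler = type("Handler", (SlowHandler,), {"delay": delay})
    server = HTTPServer(("127.0.0.1", 0), handler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
        return False
    except socket.timeout:  # alias of TimeoutError on Python 3.10+
        return True
    finally:
        server.shutdown()
        server.server_close()


print(request_times_out(delay=1.0, timeout=0.2))
```

Because the delay is controlled by the test itself, the deadline fires regardless of how any third-party endpoint behaves.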
The two publisher sandbox tests HEAD the freshly-minted DOI as a smoke
check that the resolver knows about it. They only accepted (200,302,403),
but a just-minted DataCite/Zenodo sandbox DOI returns 202 ("Accepted",
still propagating) before it settles — a racy CI failure unrelated to the
publisher code. Broaden both assertions to any 2xx/3xx plus 403 (doi.org's
bare-HEAD response for sandbox DOIs); the only real failure is 404/5xx.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
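The broadened assertion reduces to a small predicate. `doi_head_status_ok` is a hypothetical helper name; the tests may inline the condition instead.

```python
def doi_head_status_ok(status: int) -> bool:
    """Accept any 2xx/3xx, plus 403; 404 and 5xx remain failures."""
    return 200 <= status < 400 or status == 403


for code in (200, 202, 302, 403, 404, 500):
    print(code, doi_head_status_ok(code))
```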
…4-min CI hang) Root cause of the real-call job timing out: the Dartmouth backend wrapped client.invoke() in `with ThreadPoolExecutor() as ex: ex.submit(...).result(180)`. When the 180s deadline fired and raised, exiting the `with` block invoked ThreadPoolExecutor.__exit__ -> shutdown(wait=True), which BLOCKS until the still-hung worker thread finishes. Since the worker was stuck in a socket read with no HTTP timeout (langchain forwards `timeout` as a chat-completion body param, not an HTTP/socket timeout), shutdown never returned — the implementer e2e test stalled ~54 min until the 60-min job cap cancelled it (the test step emitted zero output between 16:22 and the 17:16 cancel). Replace the executor with a shared `invoke_with_deadline` helper in backends/base.py that runs the call on a DAEMON thread and abandons it past the deadline. A daemon thread never blocks interpreter exit (unlike an abandoned ThreadPoolExecutor worker, which its atexit join would wait on), so a sick connection can no longer hang the process. Apply it to both the Dartmouth and HuggingFace backends (HF's bare client.invoke() had no deadline at all — same latent hang as a fallback). Verified: unit test asserts the caller regains control ~at the deadline (not after the slow call finishes) and the abandoned worker is a daemon; a real Dartmouth chat call still succeeds through the wrapper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
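The daemon-thread approach can be sketched as below. The signature is an assumption (the real `invoke_with_deadline` in backends/base.py is not shown in this PR text); the point is only that the caller regains control at the deadline and the abandoned worker cannot block interpreter exit.

```python
import threading
import time


def invoke_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn on a daemon thread; raise TimeoutError if it overruns timeout_s."""
    result = {}

    def worker():
        try:
            result["value"] = fn(*args, **kwargs)
        except Exception as exc:  # hand the failure back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)  # returns at the deadline even if fn is stuck
    if thread.is_alive():
        # The worker is abandoned, not cancelled; being a daemon thread,
        # it will not keep the process alive at exit.
        raise TimeoutError(f"call exceeded {timeout_s}s deadline")
    if "error" in result:
        raise result["error"]
    return result["value"]


start = time.monotonic()
try:
    invoke_with_deadline(time.sleep, 0.1, 2.0)  # a 2 s call with a 0.1 s deadline
except TimeoutError:
    print(f"gave up after {time.monotonic() - start:.2f}s")
```

Unlike the `with ThreadPoolExecutor()` form, nothing here joins the worker after the deadline, so there is no `shutdown(wait=True)` to stall on. The trade-off is that the abandoned call keeps running in the background until it returns on its own.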
…odel drift Dartmouth's chat catalog now mixes FREE self-hosted models with PAID external providers (gpt-5, claude, gemini, voyage, ...). The real-call CI failed because (1) a test fell through to the PAID gpt-5.3-chat-latest, which rejects temperature=0, and (2) a transient "model not found" on the primary model was misclassified as permanent, so the router never fell through to the free gpt-oss-120b peer.
- dartmouth.py: derive the free-model set from the API's explicit input/output_cost_per_token fields (authoritative, not heuristic); refuse any non-free model before calling it (Constitution Principle IV: v1 cost==0); route list_models() through the working chat.dartmouth.edu/api/models endpoint (ChatDartmouth.list() targets a Dartmouth host that rejects the chat key and returns non-JSON); classify "model not found" as transient so the router falls through to a free peer; drop unsupported temperature on retry.
- router.py: MODEL_FALLBACKS use free models only (gemma-3-27b-it -> gemma-4-31B-it).
- registry.yaml: lightweight agents -> gemma-4-31B-it; paper_implementer and all intensive agents -> qwen.qwen3.5-122b.
- test_dartmouth_chat.py: select only free models (never a paid gpt-5).
Verified locally: the 4 originally-failing real-call tests pass; full real_call suite 18 passed / 4 skipped (HF/network-gated); unit+contract 592 passed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
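Deriving the free set from explicit cost fields might look like the sketch below. The payload shape (a list of dicts with an `id` key) and the model IDs are made up for illustration; only the `input_cost_per_token` / `output_cost_per_token` field names come from the commit message.

```python
def free_model_ids(models: list) -> set:
    """Return IDs of models whose input and output per-token costs are both zero."""
    free = set()
    for model in models:
        in_cost = model.get("input_cost_per_token")
        out_cost = model.get("output_cost_per_token")
        # A missing cost field is treated as "not known to be free".
        if in_cost == 0 and out_cost == 0:
            free.add(model["id"])
    return free


def ensure_free(model_id: str, free: set) -> None:
    """Refuse a non-free model before any request is sent."""
    if model_id not in free:
        raise ValueError(f"refusing non-free model: {model_id}")


catalog = [  # illustrative values, not real catalog data
    {"id": "free-model-a", "input_cost_per_token": 0, "output_cost_per_token": 0},
    {"id": "paid-model-b", "input_cost_per_token": 1e-6, "output_cost_per_token": 2e-6},
    {"id": "unknown-model-c"},
]
print(sorted(free_model_ids(catalog)))
```

Treating a missing cost as non-free errs on the side of refusing a call, which matches the cost==0 constraint cited above.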
The free-only backend fix makes a transient "model not found" (a model briefly unloaded on Dartmouth's vLLM cluster) retryable — the router now walks the free peer-model fallback chain instead of failing fast. That resilience is correct but adds real wall-clock when blips occur: a CI run took 1264s on the 3-task fixture (vs the old 1200s budget set from fail-fast timing). Bump to 2400s for generous headroom over the observed worst case while still catching a genuine hang (bounded by the 180s per-request deadline x the finite retry/fallback fan-out). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Validates Phase 4 of the llmXive agentic pipeline (Spec Kit Plan → Tasks with the analyze loop; issue #48, agents planner #65 + tasker #66) end-to-end on two real projects. Because the validation surfaced that the Planner hallucinates dataset URLs, it also adds a deterministic dataset resolver so the Planner cites real, verified datasets instead of inventing them.

Both reference projects now reach `analyzed` cleanly:

- PROJ-261 (code-duplication, CS): `analyzed`
- PROJ-262 (molecular dipoles, Chemistry): `analyzed`

`specs/014-phase4-plan-tasks-testing/carry-forward.yaml` lists both as `passed` (ready for Phase 5); `phase-report.md` maps every FR → evidence.

What's here

- `scripts/validate_phase4.py` (preflight, FR-018 reset, step-to-analyzed, `--force` rollback, `--verify-only`) + `tests/integration/test_phase4_plan_tasks.py` (FR-016 regression tests + schema/ordering tests, all real-call or local `http.server`, no network mocks). Spec/plan/tasks/contracts under `specs/014-phase4-plan-tasks-testing/`.
- `src/llmxive/librarian/{dataset_resolver,dataset_sources}.py`: searches HuggingFace Hub + figshare/Zenodo/DataCite (reusing `verify.py` for reachability), verifies each candidate with a sample-stream format sniff, and injects the top-N verified dataset URLs into the Planner prompt (cite-only). Design + plan in `docs/superpowers/`.

Pipeline bugs found & fixed (the point of the validation)

Each is a separate commit with a regression test:

- `template_vs_real` (×4): false positives that would reject legitimate rich artifacts: body-density on a table/mermaid `data-model.md`; Rule 1 learning the `[US1]`/`[Story]` task labels; Rule 2 bracket-density counting fenced flowchart labels and single-token annotations (`[REVISION]`). The auditor now counts only multi-word placeholders, strips fences, and excludes structural labels; templates are still detected.
- Tasker Mode-B gutted `spec.md`: it "converged" analyze by deleting requirements (12 FR/5 SC → 0 FR/2 SC, observed on PROJ-262). Added an FR-012 guard refusing any Mode-B `spec.md` patch that drops requirement IDs.
- `_split_multi_file` didn't strip ``` code fences → broke `contracts/*.yaml`.
- HF `resolve/main/...` URL; FR-006 URL extraction captured a wrapping backtick.

Pre-existing issues handled along the way

- `publish_blocked` missing from the project-state schema enum: the publisher (FR-030, after 5 Zenodo failures) would crash on save; added the value + a contract test asserting every `Stage` is in the schema enum.
- Two stale tests asserted `READY_FOR_IMPLEMENTATION ∈ _NEVER_PICK`, which spec-013 deliberately reversed (the implementer agent consumes those projects); updated the tests to spec-013 behavior (code was correct).
- `VALID_FIELDS` third copy of the field list: now built from the canonical `LIBRARIAN_DEFAULT_FIELDS` (SSoT, byte-identical set).

Decisions

- Analyze-loop cap-hit without convergence is best-effort: the project advances to `analyzed` (recording `converged: false`); `human_input_needed` is reserved for an explicit Mode-B `escalate` verdict or backend failure.

Test status

unit + integration + contract: 735 passed, 0 failed · phase1: 23 · phase2: 202 · e2e (`test_site.py`): 5 · real_call: gated/skipped. The PROJ-261/262 runs and the resolver source tests are real-call (Dartmouth + HF/figshare/Zenodo/DataCite).

Known follow-ups (non-blocking)

- `extract_dataset_intents` over-extracts non-dataset tokens (`GNN`, `MAE`, `FR-001`) → some resolve to irrelevant-but-reachable HF datasets; a precision refinement.

🤖 Generated with Claude Code