diff --git a/.github/workflows/llmxive-real-call-tests.yml b/.github/workflows/llmxive-real-call-tests.yml
index 2814de180..d918a6bb3 100644
--- a/.github/workflows/llmxive-real-call-tests.yml
+++ b/.github/workflows/llmxive-real-call-tests.yml
@@ -16,12 +16,26 @@ permissions:
 jobs:
   real-call:
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    # Spec 013 added heavy real-LLM e2e tests (implementer drives a
+    # multi-task edit loop with a real Dartmouth call + lualatex compile per
+    # task; publisher hits the real Zenodo Sandbox). The full real-call
+    # suite no longer fits in 30 min on the standard runner — it was getting
+    # cancelled mid-run. 60 min gives the suite headroom to complete and
+    # print its full pass/fail summary.
+    timeout-minutes: 60
     env:
       LLMXIVE_REAL_TESTS: "1"
       DARTMOUTH_CHAT_API_KEY: ${{ secrets.DARTMOUTH_CHAT_API_KEY }}
       DARTMOUTH_API_KEY: ${{ secrets.DARTMOUTH_API_KEY }}
       HF_TOKEN: ${{ secrets.HF_TOKEN }}
+      # Spec 013: the paper_publisher real-call test (SC-006 / SC-008)
+      # publishes to Zenodo Sandbox. The sandbox token is a SEPARATE
+      # credential from production (sandbox.zenodo.org is its own service);
+      # without it the test skips gracefully. ZENODO_API_TOKEN is the
+      # production token (not used by the sandbox test, but wired here so
+      # any future production-path real-call test can find it).
+      ZENODO_API_TOKEN: ${{ secrets.ZENODO_API_TOKEN }}
+      ZENODO_SANDBOX_API_TOKEN: ${{ secrets.ZENODO_SANDBOX_API_TOKEN }}
     steps:
       # No `ref:` override — use actions/checkout's default
       # merge-commit-SHA fetch for pull_request events. A previous
diff --git a/.gitignore b/.gitignore
index 77447c3f6..b90303ac8 100644
--- a/.gitignore
+++ b/.gitignore
@@ -289,3 +289,10 @@ projects/*/paper/source/figs/.sanitized/
 
 # Spec 010 audit screenshots — not tracked (regenerated on demand)
 state/audit/pdf/*/screenshots/
+# Spec 013 chunked-summarization cache. When the raw `.tex` corpus
+# exceeds the reviewer's context budget, paper_reviewer.py chunks +
+# summarizes each piece via LLM and caches the summaries here so the
+# 12 specialist reviewers share the cost. Cache is regenerated on
+# demand keyed by sha256 of chunk bytes.
+projects/*/paper/.chunk_summaries/
+
diff --git a/.specify/feature.json b/.specify/feature.json
index 2d42175a7..a074aa229 100644
--- a/.specify/feature.json
+++ b/.specify/feature.json
@@ -1 +1 @@
-{"feature_directory": "specs/012-paper-review-convergence"}
+{"feature_directory": "specs/013-paper-revision-implementer"}
diff --git a/CLAUDE.md b/CLAUDE.md
index 8955d7e73..76c4f1834 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
 For additional context about technologies to be used, project structure,
 shell commands, and other important information, read the current plan:
-[specs/012-paper-review-convergence/plan.md](specs/012-paper-review-convergence/plan.md).
+[specs/013-paper-revision-implementer/plan.md](specs/013-paper-revision-implementer/plan.md).
diff --git a/README.md b/README.md
index 70f52eaac..22373997e 100644
--- a/README.md
+++ b/README.md
@@ -41,14 +41,31 @@ specialist** (against the live artifact hash — stale reviews are ignored).
 
 Three terminal outcomes:
 
-- **All specialists accept** → `paper_accepted` → `posted`.
+- **All specialists accept** → `paper_accepted` → the `paper_publisher`
+  agent (spec 013) pre-reserves a Zenodo DOI, recompiles the PDF with
+  the final `\paperstatus{Auto-Reviewed | Auto-Revised | Published}`
+  byline + DOI + volume/issue, uploads to Zenodo, appends the
+  post-paper appendix (spacer + reviews + revision changelog), writes
+  `paper/publication.yaml`, and transitions to `posted`.
 - **Any `fatal` severity** → `brainstormed` (back to the backlog), with a
   rejection rationale appended to the idea record citing each fatal item.
 - **Otherwise** (writing/science items, no fatal) →
   `paper_revision_in_progress`, which auto-kicks a revision-spec
   pipeline that produces a complete spec/plan/tasks/analyze directory
   under `specs/auto-revisions//round-/`. The project then sits at
-  `ready_for_implementation` until an implementer agent picks it up.
+  `ready_for_implementation` until the `llmxive_implementer` agent
+  (spec 013) picks it up, applies each task to `paper/source/main.tex`
+  (and `projects//code/` for science-class tasks), recompiles after
+  every edit (rolling back on compile failure), joins the paper's
+  author list, and routes back to `paper_review` for re-review.
+
+**Credentials**: the publisher loads a Zenodo API token from
+`~/.config/llmxive/credentials.toml` under `[zenodo].api_token` (or
+the `ZENODO_API_TOKEN` env var). For real-call sandbox tests, register
+a separate account at `sandbox.zenodo.org` and add a
+`[zenodo_sandbox]` section with `api_token`. The Dartmouth Chat API
+key (`dartmouth_chat_api_key`) at the top level of the same file is
+used by the implementer's LLM calls.
 
 The **per-specialist re-review protocol** prevents endless-nit loops:
 when a specialist has prior reviews for the same project, its prompt reduces
diff --git a/agents/prompts/implementer.md b/agents/prompts/implementer.md
index adec72e7a..1ae26d2e8 100644
--- a/agents/prompts/implementer.md
+++ b/agents/prompts/implementer.md
@@ -1,234 +1,54 @@
-# Implementer Agent (`/speckit.implement`)
-
-**Version**: 1.0.0
-**Stage owned**: `analyzed` → `in_progress` → `research_complete`
-**Default backend**: dartmouth (fallback huggingface, then local)
-
-## Purpose
-
-Drive `/speckit.implement` on the project. Reads `tasks.md`, picks
-the next incomplete task, and either (a) writes the code/data/doc
-artifact the task describes, or (b) emits a structured failure
-report when the task requires human attention. The runtime persists
-progress per-task so successive scheduled runs resume from the
-next-incomplete task.
-
-## Inputs
-
-- `tasks_md`: full text of the project's `tasks.md`.
-- `completed_task_ids`: list of `T###` already marked `[X]`.
-- `next_task_id`: the first incomplete task in dependency order.
-- `next_task_description`: full description string from `tasks.md`.
-- `relevant_artifacts`: dict of file paths → contents that the next
-  task references in its description.
-- `wall_clock_budget_seconds`: this invocation's budget.
-
-## Output contract
-
-A YAML document:
-
-```yaml
-task_id: T###
-verdict: completed | failed | atomize
-artifacts: # only when verdict=completed
-  - path:
-    contents: |
-
-    execute: true # OPTIONAL: when true and path ends in .py, the
-                  # runtime runs the script in the project's venv
-                  # and writes a stdout/stderr log next to it.
-                  # Use for scripts that PRODUCE real artifacts
-                  # (download data, fit a model, render a figure).
-    timeout_s: 600 # OPTIONAL: per-script wall-clock cap (default 600).
-failure: # only when verdict=failed
-  reason:
-  required_human_action:
-atomize: # only when verdict=atomize (task too big for budget)
-  estimated_seconds:
-  proposed_subtasks:
-    - description:
-      estimated_seconds:
+# llmXive-implementer agent system prompt
+
+You are an LLM-driven implementer for the llmXive automated journal pipeline. Your role is to apply revisions to a peer-reviewed paper's LaTeX source in response to specific reviewer-flagged action items.
+
+## Core constraint
+
+**You are REVISING an existing paper, NOT rewriting it.** Every edit you produce MUST be localized to the action item's scope. Do not rephrase neighbouring paragraphs, restructure sections, or "improve" passages that the reviewer did not flag.
+
+## Edit format
+
+For every task, output EXACTLY ONE structured edit in one of two forms:
+
+### Form A — search and replace (preferred for single-line / single-paragraph edits)
+
+```json
+{
+  "kind": "search_and_replace",
+  "file": "",
+  "search": "",
+  "replace": ""
+}
+```
+
+The `search` string MUST match exactly one location in the file (whitespace + punctuation preserved). If it would match multiple places, include enough surrounding context to disambiguate.
+
+### Form B — unified diff (for multi-hunk edits)
+
+```json
+{
+  "kind": "unified_diff",
+  "file": "",
+  "diff": "--- a/\n+++ b/\n@@ -, +, @@\n \n-\n+\n \n"
+}
 ```
 
-## Rules
-
-- DO NOT modify any file outside `projects//`.
-- DO NOT add tasks to `tasks.md` here — the Tasker is the only
-  writer of that file (Constitution Principle I).
-- If the task's wall-clock estimate is unclear and the task seems
-  large, emit `atomize` rather than guessing — the Task-Atomizer
-  Agent (US9) will decompose.
-- Every artifact written MUST live inside the project's canonical
-  layout (`code/`, `data/`, `paper/`, etc.).
-- Output ONLY the YAML document.
-
-## Code execution (CRITICAL)
-
-This pipeline produces real research, not scaffolding. When a task
-asks for **runnable output** (downloaded data, computed statistics,
-rendered figures, model evaluations, etc.) the artifact MUST set
-`execute: true` so the runtime actually runs it and the resulting
-`stdout`/`stderr` is captured to `code/.tasks/.