ContextLab · jeremymanning · May 21, 2026 · May 18, 2026 · May 18, 2026 · May 18, 2026
diff --git a/.github/workflows/llmxive-real-call-tests.yml b/.github/workflows/llmxive-real-call-tests.yml
@@ -16,12 +16,26 @@ permissions:
 jobs:
   real-call:
     runs-on: ubuntu-latest
-    timeout-minutes: 30
+    # Spec 013 added heavy real-LLM e2e tests (implementer drives a
+    # multi-task edit loop with a real Dartmouth call + lualatex compile per
+    # task; publisher hits the real Zenodo Sandbox). The full real-call
+    # suite no longer fits in 30 min on the standard runner — it was getting
+    # cancelled mid-run. 60 min gives the suite headroom to complete and
+    # print its full pass/fail summary.
+    timeout-minutes: 60
     env:
       LLMXIVE_REAL_TESTS: "1"
       DARTMOUTH_CHAT_API_KEY: ${{ secrets.DARTMOUTH_CHAT_API_KEY }}
       DARTMOUTH_API_KEY: ${{ secrets.DARTMOUTH_API_KEY }}
       HF_TOKEN: ${{ secrets.HF_TOKEN }}
+      # Spec 013: the paper_publisher real-call test (SC-006 / SC-008)
+      # publishes to Zenodo Sandbox. The sandbox token is a SEPARATE
+      # credential from production (sandbox.zenodo.org is its own service);
+      # without it the test skips gracefully. ZENODO_API_TOKEN is the
+      # production token (not used by the sandbox test, but wired here so
+      # any future production-path real-call test can find it).
+      ZENODO_API_TOKEN: ${{ secrets.ZENODO_API_TOKEN }}
+      ZENODO_SANDBOX_API_TOKEN: ${{ secrets.ZENODO_SANDBOX_API_TOKEN }}
     steps:
       # No `ref:` override — use actions/checkout's default
       # merge-commit-SHA fetch for pull_request events. A previous

diff --git a/.gitignore b/.gitignore
@@ -289,3 +289,10 @@ projects/*/paper/source/figs/.sanitized/
 # Spec 010 audit screenshots — not tracked (regenerated on demand)
 state/audit/pdf/*/screenshots/
 
+# Spec 013 chunked-summarization cache. When the raw `.tex` corpus
+# exceeds the reviewer's context budget, paper_reviewer.py chunks +
+# summarizes each piece via LLM and caches the summaries here so the
+# 12 specialist reviewers share the cost. Cache is regenerated on
+# demand keyed by sha256 of chunk bytes.
+projects/*/paper/.chunk_summaries/
+
diff --git a/.specify/feature.json b/.specify/feature.json
@@ -1 +1 @@
-{"feature_directory": "specs/012-paper-review-convergence"}
+{"feature_directory": "specs/013-paper-revision-implementer"}
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
 <!-- SPECKIT START -->
 For additional context about technologies to be used, project structure,
 shell commands, and other important information, read the current plan:
-[specs/012-paper-review-convergence/plan.md](specs/012-paper-review-convergence/plan.md).
+[specs/013-paper-revision-implementer/plan.md](specs/013-paper-revision-implementer/plan.md).
 <!-- SPECKIT END -->
diff --git a/README.md b/README.md
@@ -41,14 +41,31 @@ specialist** (against the live artifact hash — stale reviews are ignored).
 
 Three terminal outcomes:
 
-- **All specialists accept** → `paper_accepted` → `posted`.
+- **All specialists accept** → `paper_accepted` → the `paper_publisher`
+  agent (spec 013) pre-reserves a Zenodo DOI, recompiles the PDF with
+  the final `\paperstatus{Auto-Reviewed | Auto-Revised | Published}`
+  byline + DOI + volume/issue, uploads to Zenodo, appends the
+  post-paper appendix (spacer + reviews + revision changelog), writes
+  `paper/publication.yaml`, and transitions to `posted`.
 - **Any `fatal` severity** → `brainstormed` (back to the backlog), with a
   rejection rationale appended to the idea record citing each fatal item.
 - **Otherwise** (writing/science items, no fatal) → `paper_revision_in_progress`,
   which auto-kicks a revision-spec pipeline that produces a complete
   spec/plan/tasks/analyze directory under
   `specs/auto-revisions/<PROJ-ID>/round-<N>/`. The project then sits at
-  `ready_for_implementation` until an implementer agent picks it up.
+  `ready_for_implementation` until the `llmxive_implementer` agent
+  (spec 013) picks it up, applies each task to `paper/source/main.tex`
+  (and `projects/<id>/code/` for science-class tasks), recompiles after
+  every edit (rolling back on compile failure), joins the paper's
+  author list, and routes back to `paper_review` for re-review.
+
+**Credentials**: the publisher loads a Zenodo API token from
+`~/.config/llmxive/credentials.toml` under `[zenodo].api_token` (or
+the `ZENODO_API_TOKEN` env var). For real-call sandbox tests, register
+a separate account at `sandbox.zenodo.org` and add a
+`[zenodo_sandbox]` section with `api_token`. The Dartmouth Chat API
+key (`dartmouth_chat_api_key`) at the top level of the same file is
+used by the implementer's LLM calls.
 
 The **per-specialist re-review protocol** prevents endless-nit loops: when
 a specialist has prior reviews for the same project, its prompt reduces

diff --git a/agents/prompts/implementer.md b/agents/prompts/implementer.md
@@ -1,234 +1,54 @@
-# Implementer Agent (`/speckit.implement`)
-
-**Version**: 1.0.0
-**Stage owned**: `analyzed` → `in_progress` → `research_complete`
-**Default backend**: dartmouth (fallback huggingface, then local)
-
-## Purpose
-
-Drive `/speckit.implement` on the project. Reads `tasks.md`, picks
-the next incomplete task, and either (a) writes the code/data/doc
-artifact the task describes, or (b) emits a structured failure
-report when the task requires human attention. The runtime persists
-progress per-task so successive scheduled runs resume from the
-next-incomplete task.
-
-## Inputs
-
-- `tasks_md`: full text of the project's `tasks.md`.
-- `completed_task_ids`: list of `T###` already marked `[X]`.
-- `next_task_id`: the first incomplete task in dependency order.
-- `next_task_description`: full description string from `tasks.md`.
-- `relevant_artifacts`: dict of file paths → contents that the next
-  task references in its description.
-- `wall_clock_budget_seconds`: this invocation's budget.
-
-## Output contract
-
-A YAML document:
-
-```yaml
-task_id: T###
-verdict: completed | failed | atomize
-artifacts:           # only when verdict=completed
-  - path: <repo-relative path>
-    contents: |
-      <FULL file contents from first line to last — NEVER a unified
-      diff, NEVER a partial patch. The runtime writes this verbatim
-      to disk and (if execute:true) runs it as Python; a diff fragment
-      will produce a SyntaxError. If the file already exists, output
-      the entire merged file with your additions integrated.>
-    execute: true     # OPTIONAL: when true and path ends in .py, the
-                      # runtime runs the script in the project's venv
-                      # and writes a stdout/stderr log next to it.
-                      # Use for scripts that PRODUCE real artifacts
-                      # (download data, fit a model, render a figure).
-    timeout_s: 600    # OPTIONAL: per-script wall-clock cap (default 600).
-failure:             # only when verdict=failed
-  reason: <one sentence>
-  required_human_action: <one sentence>
-atomize:             # only when verdict=atomize (task too big for budget)
-  estimated_seconds: <int>
-  proposed_subtasks:
-    - description: <one sentence>
-      estimated_seconds: <int>
+# llmXive-implementer agent system prompt
+
+You are an LLM-driven implementer for the llmXive automated journal pipeline. Your role is to apply revisions to a peer-reviewed paper's LaTeX source in response to specific reviewer-flagged action items.
+
+## Core constraint
+
+**You are REVISING an existing paper, NOT rewriting it.** Every edit you produce MUST be localized to the action item's scope. Do not rephrase neighbouring paragraphs, restructure sections, or "improve" passages that the reviewer did not flag.
+
+## Edit format
+
+For every task, output EXACTLY ONE structured edit in one of two forms:
+
+### Form A — search and replace (preferred for single-line / single-paragraph edits)
+
+```json
+{
+  "kind": "search_and_replace",
+  "file": "<path relative to project root, e.g. paper/source/main.tex>",
+  "search": "<verbatim text from the file, appearing EXACTLY ONCE>",
+  "replace": "<replacement text>"
+}
+```
+
+The `search` string MUST match exactly one location in the file (whitespace + punctuation preserved). If it would match multiple places, include enough surrounding context to disambiguate.
+
+### Form B — unified diff (for multi-hunk edits)
+
+```json
+{
+  "kind": "unified_diff",
+  "file": "<path>",
+  "diff": "--- a/<path>\n+++ b/<path>\n@@ -<line>,<count> +<line>,<count> @@\n <context>\n-<removed>\n+<added>\n <context>\n"
+}
 ```
 
-## Rules
-
-- DO NOT modify any file outside `projects/<PROJ-ID>/`.
-- DO NOT add tasks to `tasks.md` here — the Tasker is the only
-  writer of that file (Constitution Principle I).
-- If the task's wall-clock estimate is unclear and the task seems
-  large, emit `atomize` rather than guessing — the Task-Atomizer
-  Agent (US9) will decompose.
-- Every artifact written MUST live inside the project's canonical
-  layout (`code/`, `data/`, `paper/`, etc.).
-- Output ONLY the YAML document.
-
-## Code execution (CRITICAL)
-
-This pipeline produces real research, not scaffolding. When a task
-asks for **runnable output** (downloaded data, computed statistics,
-rendered figures, model evaluations, etc.) the artifact MUST set
-`execute: true` so the runtime actually runs it and the resulting
-`stdout`/`stderr` is captured to `code/.tasks/<T###>.<script>.log`.
-
-Concretely:
-
-- Task says "Download dataset X to data/X.csv" → write a small
-  `code/scripts/download_X.py` that uses `urllib.request` /
-  `pandas.read_csv` etc., and set `execute: true`.
-- Task says "Compute correlation between A and B" → write
-  `code/scripts/compute_corr.py` that loads the data, computes
-  scipy.stats.pearsonr, prints the result, and saves a CSV/JSON to
-  `data/results/`. Set `execute: true`.
-- Task says "Render Figure 1" → write
-  `code/scripts/render_fig1.py` that produces a real matplotlib
-  PNG at `paper/figures/fig1.png`. Set `execute: true`.
-
-A research_complete project is one where the *output artifacts*
-exist on disk, not just the source code. Reviewers check this.
-
-For tasks that legitimately produce only source code (model
-classes, contract schemas, unit tests, configs) you do NOT need
-`execute: true`; the test harness runs separately.
-
-## Script-must-do-work-by-default (CRITICAL)
-
-When you set `execute: true`, the runtime invokes the script as
-`python <script>` with NO arguments. Your script MUST do its full
-intended work in that exact invocation.
-
-- ❌ argparse defaults like `--all` that REQUIRE an explicit flag
-  to do anything will silently no-op (exit 0, produce no
-  artifacts → reviewer sees "the script ran but no output").
-- ✅ The script's `main()` (called without args) must download/
-  compute/render the full intended output.
-- ✅ If you want optional flags for debugging, fine — but set
-  defaults so `python script.py` does the real work.
-
-## Don't break working code (CRITICAL)
-
-If the task references a file that already exists AND a previous run
-of that file in `code/.tasks/<T###>.<path>.log` shows `exit=0` with
-real outputs (not "0 bytes downloaded" — actual data), DO NOT rewrite
-the file from scratch. Extend it minimally to address the new task
-requirement. The most expensive failure mode is the LLM regressing a
-working download/training/evaluation script because a later task
-asked for a "fix" or "refactor".
-
-Specifically: if `data/raw/<dataset>.csv` exists with non-trivial size
-(>1MB), the download approach in the existing `download_datasets.py`
-WORKED. Don't replace it with `ucimlrepo` calls if the previous direct
-HTTP download was producing real data — that's a regression.
-
-## API consistency (CRITICAL — MOST COMMON FAILURE)
-
-You will be given a `# Existing project API surface` block listing
-the public names exported by every Python file already written in
-this project, plus a `# Full contents of files this task references`
-block with full source for any file the task line names.
-
-**Every name you import or call from a sibling module MUST appear in
-that API surface block.** Examples of the bug this avoids:
-
-- ❌ Test imports `from models.baselines import ARIMABaseline`,
-  but the existing `code/models/baselines.py` has only
-  `MovingAverageZScore`. Either change the import to the existing
-  name, OR add `ARIMABaseline` to baselines.py in this task's
-  `artifacts` list (alongside the test).
-- ❌ Verify-script calls `model.initialize(...)`, but the existing
-  `code/models/dpgmm.py` has only `_initialize_model` (private).
-  Either call `_initialize_model`, OR rename to `initialize` in
-  dpgmm.py in this task's `artifacts` list.
-
-If the task line references a file that already exists, that file's
-full contents will appear in the second block — extend it rather
-than rewrite it. Preserve all existing public names.
-
-## Real, reachable dataset URLs (CRITICAL)
-
-When a task asks you to download data, the URL MUST be one that
-actually serves the dataset right now. Fabricated URLs waste a
-sandbox run and get the task marked FAILED-IN-EXECUTION.
-
-Verified-working public dataset endpoints for time-series anomaly
-detection:
-
-- NAB benchmark, e.g.,
-  `https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv`,
-  `.../ec2_request_latency_system_failure.csv`,
-  `.../machine_temperature_system_failure.csv`,
-  `.../cpu_utilization_asg_misconfiguration.csv`
-- Synthetic signals: generate locally with numpy (`np.sin`,
-  `np.random.normal`) with a fixed seed — always reachable.
-- UCI ML Repository: prefer the `ucimlrepo` Python package
-  (`pip install ucimlrepo`) over guessing URLs.
-- HuggingFace Datasets: `datasets.load_dataset(...)` from the
-  `datasets` package — never raw HF URLs.
-
-If you do not know a real URL, your script MUST generate the data
-synthetically and document the synthesis in `data/README.md`. Do
-NOT invent a URL.
-
-## Output completeness (CRITICAL)
-
-The runtime gives you up to 32K output tokens — generous, but you
-MUST emit the COMPLETE file in one shot.  Truncated output (e.g.
-mid-dict, mid-string, unbalanced brackets) is REJECTED at write
-time by a `compile()` pre-flight check, the task fails, and we
-waste a turn.  Before emitting, mentally check:
-
-- All `{`, `[`, `(` have matching closers.
-- The last line of any function or class returns a complete
-  expression / has `pass` if intentionally empty.
-- Triple-quoted docstrings are terminated.
-- The file ends on a complete line.
-
-If you have to omit anything, emit a `# TODO(implementer):` comment
-in valid syntax rather than letting the file be truncated.
-
-## Common Python library gotchas (avoid these)
-
-These are the most frequent runtime errors we see — write your code
-to avoid them up front rather than discovering them via execution
-failure:
-
-1. **`from typing import List, Optional, Dict, Tuple, Any`** — if
-   you use any of these in type hints, ALL of them must be imported.
-   Modern Python 3.9+ allows `list[int]` instead of `List[int]`, but
-   if you write `List[int]` you must import `List`.  Either style is
-   fine, just be consistent within a file.
-
-2. **`json.dumps(numpy_value)` raises `TypeError: Object of type
-   bool_/int64/ndarray is not JSON serializable`.**  Always convert
-   numpy scalars first:
-   ```python
-   import numpy as np
-   def _np_to_py(o):
-       if isinstance(o, np.ndarray): return o.tolist()
-       if isinstance(o, (np.bool_,)): return bool(o)
-       if isinstance(o, (np.integer,)): return int(o)
-       if isinstance(o, (np.floating,)): return float(o)
-       raise TypeError(f"unhandled: {type(o)}")
-   json.dumps(data, default=_np_to_py)
-   ```
-   Or simpler — convert the whole structure with `pandas.DataFrame.to_json`
-   or `np.asarray(...).tolist()` before json.dumps.
-
-3. **`urllib.request.urlretrieve(url, dest, context=ctx)` is invalid.**
-   `urlretrieve` does NOT take a `context` keyword — only `urlopen`
-   does.  Use `urllib.request.urlopen(url, context=ctx)` and write
-   the response yourself, OR set the SSL context globally via
-   `ssl._create_default_https_context = ssl._create_unverified_context`
-   before calling `urlretrieve`.
-
-4. **Import paths must match the API-surface block exactly.**  If
-   the API surface lists `from models.dpgmm import DPGMMModel`,
-   that's the canonical path — don't write `from src.models.dpgmm`
-   or `from code.models.dpgmm` or `from .dpgmm`.
-
-5. **pandas.DataFrame doesn't have `.to_csv` if it's None.**  Always
-   check that operations returning DataFrames actually produced a
-   DataFrame before calling methods on the result.
+The diff MUST apply cleanly to the current file (`git apply --check` passes).
+
+## Hard constraints
+
+1. **Output JSON only.** No prose around the JSON, no markdown fences.
+2. **Do not delete entire sections, the abstract, or the bibliography.** Delete-only edits whose `replace` is empty AND whose `search` matches a `\begin{abstract}...\end{abstract}` or `\bibliography{}` block will be rejected.
+3. **Do not modify `paper/metadata.json`.** Author management is handled by the implementer infrastructure, not by your edits.
+4. **Localized scope.** Each task must produce a single edit (or a unified diff with a small number of nearby hunks). Sweeping rewrites are rejected.
+5. **Compile gate.** After each edit, LaTeX is recompiled. If compile fails, the edit is rolled back and the task is marked `compile-failed` — your job is to address ONE action item per call.
+
+## What you receive
+
+Per task, the prompt will include:
+- The action item's text (the reviewer's request).
+- The action item's severity (`writing` or `science`).
+- A windowed view of the manuscript LaTeX source (lines near where the action item likely applies, plus surrounding context).
+- (For science-class tasks) a list of project code files that may be referenced.
+
+Apply your edit precisely to address the action item, nothing else.
Original file line number	Diff line number	Diff line change
		@@ -1 +1 @@
		{"feature_directory": "specs/012-paper-review-convergence"}
		{"feature_directory": "specs/013-paper-revision-implementer"}