Merged
22 commits
bb1e856
spec(013): LLM Implementer + Author Management + PDF Regen — initial …
jeremymanning May 18, 2026
5058ba7
spec(013): add publication step (DOI via Zenodo + badges + citation f…
jeremymanning May 18, 2026
a26178f
spec(013): badge logic — auto-reviewed only if revision actually happ…
jeremymanning May 18, 2026
523671d
spec(013): coversheet prototype + rendered screenshots
jeremymanning May 18, 2026
9fce81a
spec(013): revised approach — no coversheet, extend llmxive.cls, appe…
jeremymanning May 18, 2026
d4523eb
spec(013): polish — drop bullet, vol non-breaking, larger spacer text…
jeremymanning May 18, 2026
feb3fbe
spec(013) polish v2 — vol/doi layout, display headings, page counts
jeremymanning May 18, 2026
16bd204
spec(013) polish v3 — 3-state badge, LLM co-authors, heading spacing
jeremymanning May 18, 2026
2cb02c6
spec(013) polish v4 — wider status column, email expansion, Qwen affi…
jeremymanning May 18, 2026
3817c32
spec(013): general paper-rendering fixes + chunked reviewer summariza…
jeremymanning May 19, 2026
ac34ccd
spec(013): plan, research, contracts, tasks + Phase 1 setup
jeremymanning May 19, 2026
1c21ec4
spec(013) Phase 2: foundational schemas + state I/O + Zenodo client +…
jeremymanning May 19, 2026
1a59eb8
spec(013) Phases 3-7: implementer + publisher + tests (53/58 tasks done)
jeremymanning May 19, 2026
02934af
spec(013) Phase 9 polish: dashboard + README updates (57/58 tasks done)
jeremymanning May 19, 2026
545f8a7
spec(013): real-call fixes — all 4 SC-001/SC-005/SC-006/SC-008 tests …
jeremymanning May 19, 2026
75089a4
spec(013) T056: dashboard revision-history modal section (58/58 tasks…
jeremymanning May 20, 2026
f00583b
spec(013): wire Zenodo secrets into the real-call CI workflow
jeremymanning May 20, 2026
2b8e17d
spec(013): general paper-rendering fixes from full-PDF audit + overfl…
jeremymanning May 20, 2026
6a4f822
Merge remote-tracking branch 'origin/main' into 013-paper-revision-im…
jeremymanning May 20, 2026
bf2d3d5
pipeline: re-render all 30 papers with the audited conversion fixes
jeremymanning May 20, 2026
51fe8e2
ci(013): real-call timeout 30->60 min + SC-001 budget 600->1200s
jeremymanning May 21, 2026
10798fa
ci(013): publisher sandbox test skips gracefully when env can't publish
jeremymanning May 21, 2026
16 changes: 15 additions & 1 deletion .github/workflows/llmxive-real-call-tests.yml
@@ -16,12 +16,26 @@ permissions:
jobs:
real-call:
runs-on: ubuntu-latest
timeout-minutes: 30
# Spec 013 added heavy real-LLM e2e tests (implementer drives a
# multi-task edit loop with a real Dartmouth call + lualatex compile per
# task; publisher hits the real Zenodo Sandbox). The full real-call
# suite no longer fits in 30 min on the standard runner — it was getting
# cancelled mid-run. 60 min gives the suite headroom to complete and
# print its full pass/fail summary.
timeout-minutes: 60
env:
LLMXIVE_REAL_TESTS: "1"
DARTMOUTH_CHAT_API_KEY: ${{ secrets.DARTMOUTH_CHAT_API_KEY }}
DARTMOUTH_API_KEY: ${{ secrets.DARTMOUTH_API_KEY }}
HF_TOKEN: ${{ secrets.HF_TOKEN }}
# Spec 013: the paper_publisher real-call test (SC-006 / SC-008)
# publishes to Zenodo Sandbox. The sandbox token is a SEPARATE
# credential from production (sandbox.zenodo.org is its own service);
# without it the test skips gracefully. ZENODO_API_TOKEN is the
# production token (not used by the sandbox test, but wired here so
# any future production-path real-call test can find it).
ZENODO_API_TOKEN: ${{ secrets.ZENODO_API_TOKEN }}
ZENODO_SANDBOX_API_TOKEN: ${{ secrets.ZENODO_SANDBOX_API_TOKEN }}
steps:
# No `ref:` override — use actions/checkout's default
# merge-commit-SHA fetch for pull_request events. A previous
7 changes: 7 additions & 0 deletions .gitignore
@@ -289,3 +289,10 @@ projects/*/paper/source/figs/.sanitized/
# Spec 010 audit screenshots — not tracked (regenerated on demand)
state/audit/pdf/*/screenshots/

# Spec 013 chunked-summarization cache. When the raw `.tex` corpus
# exceeds the reviewer's context budget, paper_reviewer.py chunks +
# summarizes each piece via LLM and caches the summaries here so the
# 12 specialist reviewers share the cost. Cache is regenerated on
# demand keyed by sha256 of chunk bytes.
projects/*/paper/.chunk_summaries/
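
The comment above describes a cache keyed by the sha256 of each chunk's bytes. A minimal sketch of that idea follows; it is not the code in `paper_reviewer.py`. The function name `cached_summary`, the one-file-per-digest layout, and the `summarize` callable (standing in for the LLM call) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_summary(
    chunk: bytes,
    cache_dir: Path,
    summarize: Callable[[bytes], str],  # hypothetical stand-in for the LLM call
) -> str:
    """Return the summary for `chunk`, calling `summarize` only on a cache miss."""
    # The cache key is the sha256 digest of the raw chunk bytes.
    key = hashlib.sha256(chunk).hexdigest()
    path = cache_dir / f"{key}.txt"  # assumed layout: one file per digest
    if path.exists():
        return path.read_text(encoding="utf-8")
    summary = summarize(chunk)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_text(summary, encoding="utf-8")
    return summary
```

Because the key depends only on the chunk bytes, every reviewer that sees the same chunk would reuse the same cached file, and editing the chunk produces a new key rather than a stale hit.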

2 changes: 1 addition & 1 deletion .specify/feature.json
@@ -1 +1 @@
-{"feature_directory": "specs/012-paper-review-convergence"}
+{"feature_directory": "specs/013-paper-revision-implementer"}
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -70,5 +70,5 @@ Since this is primarily a research documentation repository without traditional
<!-- SPECKIT START -->
For additional context about technologies to be used, project structure,
shell commands, and other important information, read the current plan:
-[specs/012-paper-review-convergence/plan.md](specs/012-paper-review-convergence/plan.md).
+[specs/013-paper-revision-implementer/plan.md](specs/013-paper-revision-implementer/plan.md).
<!-- SPECKIT END -->
21 changes: 19 additions & 2 deletions README.md
@@ -41,14 +41,31 @@ specialist** (against the live artifact hash — stale reviews are ignored).

Three terminal outcomes:

- **All specialists accept** → `paper_accepted` → `posted`.
- **All specialists accept** → `paper_accepted` → the `paper_publisher`
agent (spec 013) pre-reserves a Zenodo DOI, recompiles the PDF with
the final `\paperstatus{Auto-Reviewed | Auto-Revised | Published}`
byline + DOI + volume/issue, uploads to Zenodo, appends the
post-paper appendix (spacer + reviews + revision changelog), writes
`paper/publication.yaml`, and transitions to `posted`.
- **Any `fatal` severity** → `brainstormed` (back to the backlog), with a
rejection rationale appended to the idea record citing each fatal item.
- **Otherwise** (writing/science items, no fatal) → `paper_revision_in_progress`,
which auto-kicks a revision-spec pipeline that produces a complete
spec/plan/tasks/analyze directory under
`specs/auto-revisions/<PROJ-ID>/round-<N>/`. The project then sits at
`ready_for_implementation` until an implementer agent picks it up.
`ready_for_implementation` until the `llmxive_implementer` agent
(spec 013) picks it up, applies each task to `paper/source/main.tex`
(and `projects/<id>/code/` for science-class tasks), recompiles after
every edit (rolling back on compile failure), joins the paper's
author list, and routes back to `paper_review` for re-review.

**Credentials**: the publisher loads a Zenodo API token from
`~/.config/llmxive/credentials.toml` under `[zenodo].api_token` (or
the `ZENODO_API_TOKEN` env var). For real-call sandbox tests, register
a separate account at `sandbox.zenodo.org` and add a
`[zenodo_sandbox]` section with `api_token`. The Dartmouth Chat API
key (`dartmouth_chat_api_key`) at the top level of the same file is
used by the implementer's LLM calls.

The **per-specialist re-review protocol** prevents endless-nit loops: when
a specialist has prior reviews for the same project, its prompt reduces
284 changes: 52 additions & 232 deletions agents/prompts/implementer.md
@@ -1,234 +1,54 @@
# Implementer Agent (`/speckit.implement`)

**Version**: 1.0.0
**Stage owned**: `analyzed` → `in_progress` → `research_complete`
**Default backend**: dartmouth (fallback huggingface, then local)

## Purpose

Drive `/speckit.implement` on the project. Reads `tasks.md`, picks
the next incomplete task, and either (a) writes the code/data/doc
artifact the task describes, or (b) emits a structured failure
report when the task requires human attention. The runtime persists
progress per-task so successive scheduled runs resume from the
next-incomplete task.

## Inputs

- `tasks_md`: full text of the project's `tasks.md`.
- `completed_task_ids`: list of `T###` already marked `[X]`.
- `next_task_id`: the first incomplete task in dependency order.
- `next_task_description`: full description string from `tasks.md`.
- `relevant_artifacts`: dict of file paths → contents that the next
task references in its description.
- `wall_clock_budget_seconds`: this invocation's budget.

## Output contract

A YAML document:

```yaml
task_id: T###
verdict: completed | failed | atomize
artifacts:            # only when verdict=completed
  - path: <repo-relative path>
    contents: |
      <FULL file contents from first line to last — NEVER a unified
      diff, NEVER a partial patch. The runtime writes this verbatim
      to disk and (if execute:true) runs it as Python; a diff fragment
      will produce a SyntaxError. If the file already exists, output
      the entire merged file with your additions integrated.>
    execute: true     # OPTIONAL: when true and path ends in .py, the
                      # runtime runs the script in the project's venv
                      # and writes a stdout/stderr log next to it.
                      # Use for scripts that PRODUCE real artifacts
                      # (download data, fit a model, render a figure).
    timeout_s: 600    # OPTIONAL: per-script wall-clock cap (default 600).
failure:              # only when verdict=failed
  reason: <one sentence>
  required_human_action: <one sentence>
atomize:              # only when verdict=atomize (task too big for budget)
  estimated_seconds: <int>
  proposed_subtasks:
    - description: <one sentence>
      estimated_seconds: <int>
# llmXive-implementer agent system prompt

You are an LLM-driven implementer for the llmXive automated journal pipeline. Your role is to apply revisions to a peer-reviewed paper's LaTeX source in response to specific reviewer-flagged action items.

## Core constraint

**You are REVISING an existing paper, NOT rewriting it.** Every edit you produce MUST be localized to the action item's scope. Do not rephrase neighbouring paragraphs, restructure sections, or "improve" passages that the reviewer did not flag.

## Edit format

For every task, output EXACTLY ONE structured edit in one of two forms:

### Form A — search and replace (preferred for single-line / single-paragraph edits)

```json
{
  "kind": "search_and_replace",
  "file": "<path relative to project root, e.g. paper/source/main.tex>",
  "search": "<verbatim text from the file, appearing EXACTLY ONCE>",
  "replace": "<replacement text>"
}
```

The `search` string MUST match exactly one location in the file (whitespace + punctuation preserved). If it would match multiple places, include enough surrounding context to disambiguate.
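
The exactly-once rule can be enforced mechanically before any text is changed. The sketch below is an illustration under assumptions, not the pipeline's applier: the function name `apply_search_and_replace` and the choice of `ValueError` for rejections are hypothetical.

```python
def apply_search_and_replace(text: str, search: str, replace: str) -> str:
    """Apply a Form A edit, rejecting it unless `search` occurs exactly once."""
    if not search:
        raise ValueError("search must be non-empty")
    first = text.find(search)
    if first == -1:
        raise ValueError("search text not found in file")
    # Look for a second occurrence starting one character later, so that
    # overlapping matches also count as ambiguous.
    if text.find(search, first + 1) != -1:
        raise ValueError("search text is ambiguous: it matches more than once")
    return text[:first] + replace + text[first + len(search):]
```

Rejecting an ambiguous match outright, rather than replacing the first hit, keeps an edit from landing in the wrong paragraph.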

### Form B — unified diff (for multi-hunk edits)

```json
{
  "kind": "unified_diff",
  "file": "<path>",
  "diff": "--- a/<path>\n+++ b/<path>\n@@ -<line>,<count> +<line>,<count> @@\n <context>\n-<removed>\n+<added>\n <context>\n"
}
```

## Rules

- DO NOT modify any file outside `projects/<PROJ-ID>/`.
- DO NOT add tasks to `tasks.md` here — the Tasker is the only
writer of that file (Constitution Principle I).
- If the task's wall-clock estimate is unclear and the task seems
large, emit `atomize` rather than guessing — the Task-Atomizer
Agent (US9) will decompose.
- Every artifact written MUST live inside the project's canonical
layout (`code/`, `data/`, `paper/`, etc.).
- Output ONLY the YAML document.

## Code execution (CRITICAL)

This pipeline produces real research, not scaffolding. When a task
asks for **runnable output** (downloaded data, computed statistics,
rendered figures, model evaluations, etc.) the artifact MUST set
`execute: true` so the runtime actually runs it and the resulting
`stdout`/`stderr` is captured to `code/.tasks/<T###>.<script>.log`.

Concretely:

- Task says "Download dataset X to data/X.csv" → write a small
`code/scripts/download_X.py` that uses `urllib.request` /
`pandas.read_csv` etc., and set `execute: true`.
- Task says "Compute correlation between A and B" → write
`code/scripts/compute_corr.py` that loads the data, computes
scipy.stats.pearsonr, prints the result, and saves a CSV/JSON to
`data/results/`. Set `execute: true`.
- Task says "Render Figure 1" → write
`code/scripts/render_fig1.py` that produces a real matplotlib
PNG at `paper/figures/fig1.png`. Set `execute: true`.

A research_complete project is one where the *output artifacts*
exist on disk, not just the source code. Reviewers check this.

For tasks that legitimately produce only source code (model
classes, contract schemas, unit tests, configs) you do NOT need
`execute: true`; the test harness runs separately.

## Script-must-do-work-by-default (CRITICAL)

When you set `execute: true`, the runtime invokes the script as
`python <script>` with NO arguments. Your script MUST do its full
intended work in that exact invocation.

- ❌ argparse defaults like `--all` that REQUIRE an explicit flag
to do anything will silently no-op (exit 0, produce no
artifacts → reviewer sees "the script ran but no output").
- ✅ The script's `main()` (called without args) must download/
compute/render the full intended output.
- ✅ If you want optional flags for debugging, fine — but set
defaults so `python script.py` does the real work.

## Don't break working code (CRITICAL)

If the task references a file that already exists AND a previous run
of that file in `code/.tasks/<T###>.<path>.log` shows `exit=0` with
real outputs (not "0 bytes downloaded" — actual data), DO NOT rewrite
the file from scratch. Extend it minimally to address the new task
requirement. The most expensive failure mode is the LLM regressing a
working download/training/evaluation script because a later task
asked for a "fix" or "refactor".

Specifically: if `data/raw/<dataset>.csv` exists with non-trivial size
(>1MB), the download approach in the existing `download_datasets.py`
WORKED. Don't replace it with `ucimlrepo` calls if the previous direct
HTTP download was producing real data — that's a regression.

## API consistency (CRITICAL — MOST COMMON FAILURE)

You will be given a `# Existing project API surface` block listing
the public names exported by every Python file already written in
this project, plus a `# Full contents of files this task references`
block with full source for any file the task line names.

**Every name you import or call from a sibling module MUST appear in
that API surface block.** Examples of the bug this avoids:

- ❌ Test imports `from models.baselines import ARIMABaseline`,
but the existing `code/models/baselines.py` has only
`MovingAverageZScore`. Either change the import to the existing
name, OR add `ARIMABaseline` to baselines.py in this task's
`artifacts` list (alongside the test).
- ❌ Verify-script calls `model.initialize(...)`, but the existing
`code/models/dpgmm.py` has only `_initialize_model` (private).
Either call `_initialize_model`, OR rename to `initialize` in
dpgmm.py in this task's `artifacts` list.

If the task line references a file that already exists, that file's
full contents will appear in the second block — extend it rather
than rewrite it. Preserve all existing public names.

## Real, reachable dataset URLs (CRITICAL)

When a task asks you to download data, the URL MUST be one that
actually serves the dataset right now. Fabricated URLs waste a
sandbox run and get the task marked FAILED-IN-EXECUTION.

Verified-working public dataset endpoints for time-series anomaly
detection:

- NAB benchmark, e.g.,
`https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv`,
`.../ec2_request_latency_system_failure.csv`,
`.../machine_temperature_system_failure.csv`,
`.../cpu_utilization_asg_misconfiguration.csv`
- Synthetic signals: generate locally with numpy (`np.sin`,
`np.random.normal`) with a fixed seed — always reachable.
- UCI ML Repository: prefer the `ucimlrepo` Python package
(`pip install ucimlrepo`) over guessing URLs.
- HuggingFace Datasets: `datasets.load_dataset(...)` from the
`datasets` package — never raw HF URLs.

If you do not know a real URL, your script MUST generate the data
synthetically and document the synthesis in `data/README.md`. Do
NOT invent a URL.

## Output completeness (CRITICAL)

The runtime gives you up to 32K output tokens — generous, but you
MUST emit the COMPLETE file in one shot. Truncated output (e.g.
mid-dict, mid-string, unbalanced brackets) is REJECTED at write
time by a `compile()` pre-flight check, the task fails, and we
waste a turn. Before emitting, mentally check:

- All `{`, `[`, `(` have matching closers.
- The last line of any function or class returns a complete
expression / has `pass` if intentionally empty.
- Triple-quoted docstrings are terminated.
- The file ends on a complete line.

If you have to omit anything, emit a `# TODO(implementer):` comment
in valid syntax rather than letting the file be truncated.

## Common Python library gotchas (avoid these)

These are the most frequent runtime errors we see — write your code
to avoid them up front rather than discovering them via execution
failure:

1. **`from typing import List, Optional, Dict, Tuple, Any`**: every
name you use in a type hint must be imported; an unimported name
raises `NameError` when the annotation is evaluated. Python 3.9+
also accepts the built-in generics (`list[int]` instead of
`List[int]`), which need no import. Either style is fine, just be
consistent within a file.

2. **`json.dumps(numpy_value)` raises `TypeError: Object of type
bool_/int64/ndarray is not JSON serializable`.** Always convert
numpy scalars first:
```python
import json
import numpy as np

def _np_to_py(o):
    """Fallback for json.dumps: convert numpy values to plain Python."""
    if isinstance(o, np.ndarray):
        return o.tolist()
    if isinstance(o, np.bool_):
        return bool(o)
    if isinstance(o, np.integer):
        return int(o)
    if isinstance(o, np.floating):
        return float(o)
    raise TypeError(f"unhandled: {type(o)}")

# `data` is whatever structure you are serializing.
json.dumps(data, default=_np_to_py)
```
Or simpler — convert the whole structure with `pandas.DataFrame.to_json`
or `np.asarray(...).tolist()` before json.dumps.

3. **`urllib.request.urlretrieve(url, dest, context=ctx)` is invalid.**
`urlretrieve` does NOT take a `context` keyword — only `urlopen`
does. Use `urllib.request.urlopen(url, context=ctx)` and write
the response yourself, OR set the SSL context globally via
`ssl._create_default_https_context = ssl._create_unverified_context`
before calling `urlretrieve`.

4. **Import paths must match the API-surface block exactly.** If
the API surface lists `from models.dpgmm import DPGMMModel`,
that's the canonical path — don't write `from src.models.dpgmm`
or `from code.models.dpgmm` or `from .dpgmm`.

5. **`None` has no `.to_csv`.** Calling a DataFrame method on `None`
raises `AttributeError`, so always check that an operation expected
to return a DataFrame actually produced one before calling methods
on the result.
The diff MUST apply cleanly to the current file (`git apply --check` passes).
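
One way to run the `git apply --check` test before touching the file is sketched below. This is an assumption-laden illustration rather than the pipeline's code: the helper name `diff_applies_cleanly` is hypothetical, and `root` is assumed to be the directory that the diff's `a/` and `b/` paths are relative to. How paths resolve when the command runs from a subdirectory of a repository is worth checking against the git documentation.

```python
import subprocess
from pathlib import Path


def diff_applies_cleanly(diff_text: str, root: Path) -> bool:
    """Return True if `git apply --check` accepts the diff when run from `root`."""
    result = subprocess.run(
        ["git", "apply", "--check", "-"],  # "-" reads the patch from stdin
        input=diff_text,
        text=True,
        cwd=root,
        capture_output=True,  # keep git's error output off the console
    )
    # --check only verifies applicability; it never modifies files.
    return result.returncode == 0
```

Since `--check` leaves the working tree untouched, a rejected diff needs no cleanup.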

## Hard constraints

1. **Output JSON only.** No prose around the JSON, no markdown fences.
2. **Do not delete entire sections, the abstract, or the bibliography.** Delete-only edits whose `replace` is empty AND whose `search` matches a `\begin{abstract}...\end{abstract}` or `\bibliography{}` block will be rejected.
3. **Do not modify `paper/metadata.json`.** Author management is handled by the implementer infrastructure, not by your edits.
4. **Localized scope.** Each task must produce a single edit (or a unified diff with a small number of nearby hunks). Sweeping rewrites are rejected.
5. **Compile gate.** After each edit, LaTeX is recompiled. If compilation fails, the edit is rolled back and the task is marked `compile-failed`. Your job is to address ONE action item per call.
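
The compile gate in constraint 5 amounts to write, compile, and restore on failure. A minimal sketch, assuming a hypothetical `compile_ok` callable that wraps the real LaTeX build and returns `True` on success; the function name and the `"applied"` status string are also illustrative, and only `compile-failed` comes from the text above.

```python
from pathlib import Path
from typing import Callable


def apply_with_compile_gate(
    tex_path: Path,
    new_text: str,
    compile_ok: Callable[[Path], bool],  # hypothetical wrapper around the LaTeX build
) -> str:
    """Write `new_text`, recompile, and restore the original text on failure."""
    original = tex_path.read_text(encoding="utf-8")
    tex_path.write_text(new_text, encoding="utf-8")
    try:
        ok = compile_ok(tex_path)
    except Exception:
        # Treat a crashing compiler the same as a failed compile.
        ok = False
    if ok:
        return "applied"  # assumed status name for the success case
    tex_path.write_text(original, encoding="utf-8")  # roll back the edit
    return "compile-failed"
```

Keeping the original text in memory until the compile succeeds means a bad edit never outlives the call that introduced it.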

## What you receive

Per task, the prompt will include:
- The action item's text (the reviewer's request).
- The action item's severity (`writing` or `science`).
- A windowed view of the manuscript LaTeX source (lines near where the action item likely applies, plus surrounding context).
- (For science-class tasks) a list of project code files that may be referenced.

Apply your edit precisely to address the action item, nothing else.