Add /autoresearch workflow to transformerlab-cli skill#1979

Merged
deep1401 merged 25 commits into main from add/autoresearch-skill on May 6, 2026

Conversation

@aliasaria
Member

Summary

  • Adds a references/autoresearch.md deep-dive that defines the /autoresearch workflow — an autonomous experiment loop adapted from pi-autoresearch, but built natively on lab primitives (no extension, no autoresearch.jsonl, no separate widget).
  • Updates SKILL.md with a new /autoresearch trigger section and adds the autoresearch reference to the deep-dive list and skill description so the skill activates on "run autoresearch", "optimize X in a loop", and /autoresearch ….

What's different from pi-autoresearch

| pi-autoresearch | this workflow |
| --- | --- |
| `autoresearch.jsonl` per-run log | per-job `--description` + `lab job list --score-metric` |
| `autoresearch.sh` benchmark script | the task itself computes the metric and calls `lab.finish(score=…)` |
| `autoresearch.checks.sh` | `lab.error(…)` on correctness failure → FAILED jobs naturally drop out of "best" |
| Auto-commit on keep | server-side: keep ↔ `lab job discard --undo`, discard ↔ `lab job discard` (preserved, not deleted) |
| Custom widget | `lab job list --score-metric <primary>` |
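The "task computes the metric itself" row can be sketched as follows. This is a minimal illustration, not the skill's implementation: the `lab` object's `finish()`/`error()` signatures are assumptions based on this PR's description, and the metric is a placeholder.

```python
# Sketch: a task that computes its own score and reports it to Transformer Lab.
# The lab.finish(score=...) / lab.error(...) calls mirror this PR's description;
# the exact SDK surface is an assumption.

def compute_score(predictions, targets):
    """Fixed evaluator: fraction of exact matches (placeholder metric)."""
    if len(predictions) != len(targets):
        raise ValueError("prediction/target length mismatch")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def report(lab, predictions, targets):
    """Call lab.finish(score=...) on success, lab.error(...) on a correctness
    failure, so FAILED jobs drop out of the 'best' ranking naturally."""
    try:
        score = compute_score(predictions, targets)
    except ValueError as exc:
        lab.error(str(exc))
        return None
    lab.finish(score=score)
    return score
```

Because the score lives on the job itself, no `autoresearch.jsonl` side file is needed.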

Subcommands

`init <goal>`, `run`, `sweep <key=v1,v2,…>`, `status`, `keep <job_id>`, `discard <job_id>`, `idea <text>`, `finalize`, `stop`, `off`. `sweep` uses the `task.yaml` `sweeps:` block for parallel hyperparameter fan-out; the agent-driven loop is reserved for ideas that aren't pure parameter combos. Parallelism is explicit: it defaults to 1, and the agent must ask before raising it.
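The `sweep <key=v1,v2,…>` fan-out described above amounts to a Cartesian product over the supplied value lists. A minimal sketch, purely illustrative of the expansion (the skill delegates the real fan-out to the `task.yaml` `sweeps:` block):

```python
# Sketch: expand "key=v1,v2,..." sweep specs into per-job config combos.
from itertools import product

def expand_sweep(specs):
    """specs: list of 'key=v1,v2,...' strings -> list of {key: value} dicts,
    one per hyperparameter combination."""
    keys, value_lists = [], []
    for spec in specs:
        key, _, values = spec.partition("=")
        keys.append(key)
        value_lists.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]
```

For example, `expand_sweep(["lr=1e-5,3e-5", "bs=16,32"])` yields four combinations, which the real workflow would queue under one parent job.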

Test plan

  • Open the skill in a fresh Claude Code session and ask "set up autoresearch to optimize eval/loss on my-task" — verify the skill activates and the agent reads references/autoresearch.md before acting.
  • Confirm the agent runs lab experiment create … --set-default first and only writes autoresearch.md (no jsonl, no benchmark script).
  • After 2–3 queued jobs, confirm /autoresearch status produces a ranked summary via lab --format json job list --score-metric <primary>.
  • Confirm /autoresearch discard <job_id> calls lab job discard <id> (not lab job delete).
  • Confirm /autoresearch sweep lr=1e-5,3e-5,1e-4 adds a sweeps: block and applies it via lab task edit --from-files before queuing one parent job.
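The ranked-summary check above can be exercised against sample output. A hedged sketch of the ranking logic, assuming `lab --format json job list` returns a JSON array of jobs with `status` and per-metric `scores` fields (the field names are assumptions, not the CLI's documented schema):

```python
# Sketch: rank COMPLETE jobs by a score metric, skipping FAILED jobs,
# mirroring how /autoresearch status is meant to summarize runs.
import json

def best_jobs(raw_json, metric, lower_is_better=False, top=5):
    """raw_json: JSON array of job dicts -> top jobs ranked by metric."""
    jobs = json.loads(raw_json)
    scored = [
        j for j in jobs
        if j.get("status") == "COMPLETE" and metric in j.get("scores", {})
    ]
    scored.sort(key=lambda j: j["scores"][metric], reverse=not lower_is_better)
    return scored[:top]
```

FAILED jobs carry no score and are filtered out, matching the "FAILED jobs naturally drop out of best" behavior described above.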

Adds an autonomous experiment loop adapted from pi-autoresearch
(https://github.com/davebcn87/pi-autoresearch) but built natively on top of
the lab CLI: one experiment per session, one job per iteration, scores stored
on the job via lab.finish(score=...), keep/discard via lab job discard,
ranking via lab job list --score-metric. The only file the workflow writes
is autoresearch.md (objective, files in scope, backlog, what's been tried).

Subcommands: init, run, sweep, status, keep, discard, idea, finalize, stop, off.
sweep uses the task.yaml sweeps: block for parallel hyperparameter fan-out;
the agent-driven loop is reserved for ideas that aren't pure parameter combos.
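A hedged sketch of what the single autoresearch.md file might contain, combining the fields named in this description (objective, files in scope, backlog, what's been tried) with fields added by later commits on this PR (provider, off-limits evaluator, max iterations, stale-job timeout). The exact headings and values are illustrative assumptions, not the skill's template:

```markdown
# Autoresearch session plan

## Objective
Minimize eval/loss on my-task (hypothetical goal)

## Provider
Local (parallelism 1)

## Files in scope
solve.py

## Off limits
score.py (fixed evaluator feeding lab.finish(score=...))

## Max iterations
20

## Stale-job timeout
30 minutes

## Backlog
- (ideas queued via "add an idea")

## What's been tried
- (dated one-liners per iteration)
```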
@netlify

netlify Bot commented May 4, 2026

Deploy Preview for transformerlab canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | adff41a |
| 🔍 Latest deploy log | https://app.netlify.com/projects/transformerlab/deploys/69fb4bc869538200086e4be6 |

aliasaria added 3 commits May 4, 2026 14:54
Three follow-up additions inspired by pi-autoresearch's design discussion:

- Step 0 of every /autoresearch run iteration is now an explicit rehydrate:
  re-read autoresearch.md + tail of `lab job list --score-metric`. Treats
  in-context memory as untrustworthy across long sessions.

- Make the fixed-evaluator / mutable-implementation split explicit. The
  score-computation code feeding `lab.finish(score=…)` belongs in Off limits
  so the agent can't cheat the metric; the implementation under test is the
  only thing in Files in scope. Recommend splitting into score.py + solve.py.

- Optional `Max iterations` field in autoresearch.md as a cost cap. Loop
  stops and reports when reached. Reinforces the existing "never raise
  parallelism past 1 on non-Local providers" rule.
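The fixed-evaluator / mutable-implementation split described above can be sketched as two modules. Names (`score.py`, `solve.py`) follow the commit message's recommendation; the metric and implementation bodies are placeholders, not the skill's code:

```python
# Sketch of the recommended score.py + solve.py split.

# --- score.py (Off limits: the loop may read but never edit this) ---
def score(outputs, references):
    """Fixed evaluator: mean absolute error, the one number fed to
    lab.finish(score=...). Keeping it off limits stops the agent from
    cheating the metric."""
    return sum(abs(o - r) for o, r in zip(outputs, references)) / len(references)

# --- solve.py (Files in scope: the only code the loop may mutate) ---
def solve(inputs, scale=1.0):
    """Trivial stand-in for the implementation under test."""
    return [x * scale for x in inputs]
```

Each iteration edits only `solve()`, reruns, and lets the untouched `score()` decide whether the change helped.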
Four follow-up additions inspired by skypilot/examples/autoresearch:

- Provider question is now part of /autoresearch init — ask once up front
  and record in autoresearch.md so resuming agents don't re-prompt.

- "Search strategy depends on parallelism" callout: N=1 is depth (greedy
  hill-climb), N>=4 is breadth (grid search). Tell the user what kind of
  session their N choice implies.

- Loop step 6 now branches on parallelism: at N=1 wait after queueing,
  at N>=2 fire-and-advance immediately. Waiting after every queue
  collapses parallelism back to 1 — this is what enables grid-style
  search at scale.

- New step 3: stale-job sweep. minutes_requested is guidance, not
  enforcement, so the loop has to actively `lab job stop` runs that
  exceed the configured timeout or they consume the parallelism budget
  forever. Adds Stale-job timeout to the autoresearch.md template.

- Step 7 now scans all unprocessed COMPLETE jobs, not just the
  most-recently-queued one (necessary for fire-and-advance), and
  explicitly notes that FAILED jobs need no action since they're
  excluded from "best" naturally.
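The stale-job sweep in step 3 boils down to comparing run time against the configured timeout and stopping the offenders (via `lab job stop`). A minimal sketch of the detection half; the job-dict field names (`status`, `started_at`) are assumptions:

```python
# Sketch: find RUNNING jobs that have exceeded the stale-job timeout.
# minutes_requested is guidance, not enforcement, so the loop must do
# this check itself or stale runs hold the parallelism budget forever.
import time

def stale_job_ids(jobs, timeout_minutes, now=None):
    """jobs: list of job dicts -> ids of RUNNING jobs past the timeout."""
    now = now if now is not None else time.time()
    cutoff_seconds = timeout_minutes * 60
    return [
        j["id"] for j in jobs
        if j.get("status") == "RUNNING" and now - j["started_at"] > cutoff_seconds
    ]
```

Each returned id would then get a `lab job stop <id>` before the loop queues anything new.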
The original PR exposed ten /autoresearch subcommands, but seven of them
were just renames of single lab calls the parent skill already documents
(status, keep, discard, idea, stop, off, sweep). Pretending those were
slash commands made the surface look bigger than it is and hid the fact
that the agent should already know how to do them.

Keep only the three that bundle multi-step rituals:
- /autoresearch init     — multi-step setup
- /autoresearch run      — entry into the loop
- /autoresearch finalize — best-run summary + publish handoff

Rename "/autoresearch sweep" header to "Hyperparameter sweeps" — same
content, no longer a fake subcommand.

The other operations move into a "During-session operations" section
that maps natural-language requests ("show status", "keep job X",
"add an idea", "stop everything", "stop the loop") to the underlying
lab calls. Same content, more honest framing.

Update SKILL.md trigger blurb to match.
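The "During-session operations" mapping described above can be summarized as request-to-command pairs. The command strings are assumptions pieced together from this PR's description (keep maps to `lab job discard --undo`, discard to `lab job discard`), not a verified CLI surface:

```python
# Illustrative mapping of natural-language requests to underlying lab calls.
# Angle-bracket placeholders stand for values the agent fills in.
DURING_SESSION_OPS = {
    "show status": "lab --format json job list --score-metric <primary>",
    "keep job X": "lab job discard --undo <job_id>",
    "discard job X": "lab job discard <job_id>",
    "add an idea": "append the idea to the session backlog notes",
    "stop everything": "lab job stop <job_id> for each running job",
    "stop the loop": "finish the current iteration, queue nothing further",
}
```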
@aliasaria aliasaria marked this pull request as draft May 4, 2026 19:13
aliasaria added 7 commits May 5, 2026 10:12
Adds a new "Experiment Notes" section to SKILL.md covering
`lab notes show/edit/append`, plus a default flow recommending agents
load prior context at session start and append dated one-liners after
meaningful actions. Also expands the autoresearch reference with
related guidance.
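The "append dated one-liners after meaningful actions" convention could look like the helper below. The real command surface is `lab notes show/edit/append`; only the note format is sketched here, and that format is an assumption:

```python
# Sketch: format a dated one-liner for `lab notes append`-style logging.
from datetime import date

def note_line(action, day=None):
    """Return a '- YYYY-MM-DD: <action>' bullet for the experiment notes."""
    day = day or date.today()
    return f"- {day.isoformat()}: {action}"
```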
Reconciles the rename with upstream's migration of the session plan from
a local autoresearch.md file to experiment notes (lab notes).
…er toggle

- Align default metric with the chip in JobsList (prefer key 'score', else
  first numeric key) so chart and chip agree.
- Replace the floating Nivo tooltip with a fixed side panel inside the modal
  so long autoresearch run descriptions stay fully visible regardless of
  cursor position.
- Make the hovered point's job ID a RouterLink to the job detail page;
  clicking closes the modal and navigates.
- Add a "Lower is better" checkbox at the top, defaulting to the
  auto-detected value but user-overridable.
Documents the /lab-autoresearch agent loop: prereqs (provider, lab CLI,
agent-skill install), the three subcommands (init, run, finalize), the
experiment-notes session contract with the lab notes command surface,
hyperparameter sweeps, and resume semantics.
@aliasaria aliasaria marked this pull request as ready for review May 5, 2026 16:44
@sentry

sentry Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 38.46154% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ansformerlab/services/remote_job_status_service.py | 45.45% | 5 Missing and 1 partial ⚠️ |
| api/transformerlab/services/task_service.py | 0.00% | 2 Missing ⚠️ |


@deep1401 deep1401 merged commit 04dc42a into main May 6, 2026
19 checks passed
