Add /autoresearch workflow to transformerlab-cli skill#1979

Merged
deep1401 merged 25 commits into main from add/autoresearch-skill on May 6, 2026

Conversation

@aliasaria
Member

Summary

  • Adds a references/autoresearch.md deep-dive that defines the /autoresearch workflow — an autonomous experiment loop adapted from pi-autoresearch, but built natively on lab primitives (no extension, no autoresearch.jsonl, no separate widget).
  • Updates SKILL.md with a new /autoresearch trigger section and adds the autoresearch reference to the deep-dive list and skill description so the skill activates on "run autoresearch", "optimize X in a loop", and /autoresearch ….

What's different from pi-autoresearch

| pi-autoresearch | this workflow |
| --- | --- |
| `autoresearch.jsonl` per-run log | per-job `--description` + `lab job list --score-metric` |
| `autoresearch.sh` benchmark script | the task itself computes the metric and calls `lab.finish(score=…)` |
| `autoresearch.checks.sh` | `lab.error(…)` on correctness failure → FAILED jobs naturally drop out of "best" |
| Auto-commit on keep | server-side: keep ↔ `lab job discard --undo`, discard ↔ `lab job discard` (preserved, not deleted) |
| Custom widget | `lab job list --score-metric <primary>` |
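The "task computes the metric itself" row can be sketched as follows. This is a minimal illustration, not the skill's implementation: the `lab` object's `finish()`/`error()` signatures are assumptions based on this PR's description, and the metric is a placeholder.

```python
# Sketch: a task that computes its own score and reports it to Transformer Lab.
# The lab.finish(score=...) / lab.error(...) calls mirror this PR's description;
# the exact SDK surface is an assumption.

def compute_score(predictions, targets):
    """Fixed evaluator: fraction of exact matches (placeholder metric)."""
    if len(predictions) != len(targets):
        raise ValueError("prediction/target length mismatch")
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def report(lab, predictions, targets):
    """Call lab.finish(score=...) on success, lab.error(...) on a correctness
    failure, so FAILED jobs drop out of the 'best' ranking naturally."""
    try:
        score = compute_score(predictions, targets)
    except ValueError as exc:
        lab.error(str(exc))
        return None
    lab.finish(score=score)
    return score
```

Because the score lives on the job itself, no `autoresearch.jsonl` side file is needed.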

Subcommands

`init <goal>`, `run`, `sweep <key=v1,v2,…>`, `status`, `keep <job_id>`, `discard <job_id>`, `idea <text>`, `finalize`, `stop`, `off`. `sweep` uses the `task.yaml` `sweeps:` block for parallel hyperparameter fan-out; the agent-driven loop is reserved for ideas that aren't pure parameter combos. Parallelism is explicit: it defaults to 1, and the agent must ask before raising it.
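The `sweep <key=v1,v2,…>` fan-out described above amounts to a Cartesian product over the supplied value lists. A minimal sketch, purely illustrative of the expansion (the skill delegates the real fan-out to the `task.yaml` `sweeps:` block):

```python
# Sketch: expand "key=v1,v2,..." sweep specs into per-job config combos.
from itertools import product

def expand_sweep(specs):
    """specs: list of 'key=v1,v2,...' strings -> list of {key: value} dicts,
    one per hyperparameter combination."""
    keys, value_lists = [], []
    for spec in specs:
        key, _, values = spec.partition("=")
        keys.append(key)
        value_lists.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]
```

For example, `expand_sweep(["lr=1e-5,3e-5", "bs=16,32"])` yields four combinations, which the real workflow would queue under one parent job.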

Test plan

  • Open the skill in a fresh Claude Code session and ask "set up autoresearch to optimize eval/loss on my-task" — verify the skill activates and the agent reads references/autoresearch.md before acting.
  • Confirm the agent runs lab experiment create … --set-default first and only writes autoresearch.md (no jsonl, no benchmark script).
  • After 2–3 queued jobs, confirm /autoresearch status produces a ranked summary via lab --format json job list --score-metric <primary>.
  • Confirm /autoresearch discard <job_id> calls lab job discard <id> (not lab job delete).
  • Confirm /autoresearch sweep lr=1e-5,3e-5,1e-4 adds a sweeps: block and applies it via lab task edit --from-files before queuing one parent job.
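The ranked-summary check above can be exercised against sample output. A hedged sketch of the ranking logic, assuming `lab --format json job list` returns a JSON array of jobs with `status` and per-metric `scores` fields (the field names are assumptions, not the CLI's documented schema):

```python
# Sketch: rank COMPLETE jobs by a score metric, skipping FAILED jobs,
# mirroring how /autoresearch status is meant to summarize runs.
import json

def best_jobs(raw_json, metric, lower_is_better=False, top=5):
    """raw_json: JSON array of job dicts -> top jobs ranked by metric."""
    jobs = json.loads(raw_json)
    scored = [
        j for j in jobs
        if j.get("status") == "COMPLETE" and metric in j.get("scores", {})
    ]
    scored.sort(key=lambda j: j["scores"][metric], reverse=not lower_is_better)
    return scored[:top]
```

FAILED jobs carry no score and are filtered out, matching the "FAILED jobs naturally drop out of best" behavior described above.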

Adds an autonomous experiment loop adapted from pi-autoresearch
(https://github.com/davebcn87/pi-autoresearch) but built natively on top of
the lab CLI: one experiment per session, one job per iteration, scores stored
on the job via lab.finish(score=...), keep/discard via lab job discard,
ranking via lab job list --score-metric. The only file the workflow writes
is autoresearch.md (objective, files in scope, backlog, what's been tried).

Subcommands: init, run, sweep, status, keep, discard, idea, finalize, stop, off.
sweep uses the task.yaml sweeps: block for parallel hyperparameter fan-out;
the agent-driven loop is reserved for ideas that aren't pure parameter combos.
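A hedged sketch of what the single autoresearch.md file might contain, combining the fields named in this description (objective, files in scope, backlog, what's been tried) with fields added by later commits on this PR (provider, off-limits evaluator, max iterations, stale-job timeout). The exact headings and values are illustrative assumptions, not the skill's template:

```markdown
# Autoresearch session plan

## Objective
Minimize eval/loss on my-task (hypothetical goal)

## Provider
Local (parallelism 1)

## Files in scope
solve.py

## Off limits
score.py (fixed evaluator feeding lab.finish(score=...))

## Max iterations
20

## Stale-job timeout
30 minutes

## Backlog
- (ideas queued via "add an idea")

## What's been tried
- (dated one-liners per iteration)
```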
@netlify

netlify Bot commented May 4, 2026

Deploy Preview for transformerlab canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | adff41a |
| 🔍 Latest deploy log | https://app.netlify.com/projects/transformerlab/deploys/69fb4bc869538200086e4be6 |

aliasaria added 3 commits May 4, 2026 14:54
Three follow-up additions inspired by pi-autoresearch's design discussion:

- Step 0 of every /autoresearch run iteration is now an explicit rehydrate:
  re-read autoresearch.md + tail of `lab job list --score-metric`. Treats
  in-context memory as untrustworthy across long sessions.

- Make the fixed-evaluator / mutable-implementation split explicit. The
  score-computation code feeding `lab.finish(score=…)` belongs in Off limits
  so the agent can't cheat the metric; the implementation under test is the
  only thing in Files in scope. Recommend splitting into score.py + solve.py.

- Optional `Max iterations` field in autoresearch.md as a cost cap. Loop
  stops and reports when reached. Reinforces the existing "never raise
  parallelism past 1 on non-Local providers" rule.
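The fixed-evaluator / mutable-implementation split described above can be sketched as two modules. Names (`score.py`, `solve.py`) follow the commit message's recommendation; the metric and implementation bodies are placeholders, not the skill's code:

```python
# Sketch of the recommended score.py + solve.py split.

# --- score.py (Off limits: the loop may read but never edit this) ---
def score(outputs, references):
    """Fixed evaluator: mean absolute error, the one number fed to
    lab.finish(score=...). Keeping it off limits stops the agent from
    cheating the metric."""
    return sum(abs(o - r) for o, r in zip(outputs, references)) / len(references)

# --- solve.py (Files in scope: the only code the loop may mutate) ---
def solve(inputs, scale=1.0):
    """Trivial stand-in for the implementation under test."""
    return [x * scale for x in inputs]
```

Each iteration edits only `solve()`, reruns, and lets the untouched `score()` decide whether the change helped.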
Four follow-up additions inspired by skypilot/examples/autoresearch:

- Provider question is now part of /autoresearch init — ask once up front
  and record in autoresearch.md so resuming agents don't re-prompt.

- "Search strategy depends on parallelism" callout: N=1 is depth (greedy
  hill-climb), N>=4 is breadth (grid search). Tell the user what kind of
  session their N choice implies.

- Loop step 6 now branches on parallelism: at N=1 wait after queueing,
  at N>=2 fire-and-advance immediately. Waiting after every queue
  collapses parallelism back to 1 — this is what enables grid-style
  search at scale.

- New step 3: stale-job sweep. minutes_requested is guidance, not
  enforcement, so the loop has to actively `lab job stop` runs that
  exceed the configured timeout or they consume the parallelism budget
  forever. Adds Stale-job timeout to the autoresearch.md template.

- Step 7 now scans all unprocessed COMPLETE jobs, not just the
  most-recently-queued one (necessary for fire-and-advance), and
  explicitly notes that FAILED jobs need no action since they're
  excluded from "best" naturally.
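The stale-job sweep in step 3 boils down to comparing run time against the configured timeout and stopping the offenders (via `lab job stop`). A minimal sketch of the detection half; the job-dict field names (`status`, `started_at`) are assumptions:

```python
# Sketch: find RUNNING jobs that have exceeded the stale-job timeout.
# minutes_requested is guidance, not enforcement, so the loop must do
# this check itself or stale runs hold the parallelism budget forever.
import time

def stale_job_ids(jobs, timeout_minutes, now=None):
    """jobs: list of job dicts -> ids of RUNNING jobs past the timeout."""
    now = now if now is not None else time.time()
    cutoff_seconds = timeout_minutes * 60
    return [
        j["id"] for j in jobs
        if j.get("status") == "RUNNING" and now - j["started_at"] > cutoff_seconds
    ]
```

Each returned id would then get a `lab job stop <id>` before the loop queues anything new.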
The original PR exposed ten /autoresearch subcommands, but seven of them
were just renames of single lab calls the parent skill already documents
(status, keep, discard, idea, stop, off, sweep). Pretending those were
slash commands made the surface look bigger than it is and hid the fact
that the agent should already know how to do them.

Keep only the three that bundle multi-step rituals:
- /autoresearch init     — multi-step setup
- /autoresearch run      — entry into the loop
- /autoresearch finalize — best-run summary + publish handoff

Rename "/autoresearch sweep" header to "Hyperparameter sweeps" — same
content, no longer a fake subcommand.

The other operations move into a "During-session operations" section
that maps natural-language requests ("show status", "keep job X",
"add an idea", "stop everything", "stop the loop") to the underlying
lab calls. Same content, more honest framing.

Update SKILL.md trigger blurb to match.
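The "During-session operations" mapping described above can be summarized as request-to-command pairs. The command strings are assumptions pieced together from this PR's description (keep maps to `lab job discard --undo`, discard to `lab job discard`), not a verified CLI surface:

```python
# Illustrative mapping of natural-language requests to underlying lab calls.
# Angle-bracket placeholders stand for values the agent fills in.
DURING_SESSION_OPS = {
    "show status": "lab --format json job list --score-metric <primary>",
    "keep job X": "lab job discard --undo <job_id>",
    "discard job X": "lab job discard <job_id>",
    "add an idea": "append the idea to the session backlog notes",
    "stop everything": "lab job stop <job_id> for each running job",
    "stop the loop": "finish the current iteration, queue nothing further",
}
```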
@aliasaria aliasaria marked this pull request as draft May 4, 2026 19:13
aliasaria added 7 commits May 5, 2026 10:12
Adds a new "Experiment Notes" section to SKILL.md covering
`lab notes show/edit/append`, plus a default flow recommending agents
load prior context at session start and append dated one-liners after
meaningful actions. Also expands the autoresearch reference with
related guidance.
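The "append dated one-liners after meaningful actions" convention could look like the helper below. The real command surface is `lab notes show/edit/append`; only the note format is sketched here, and that format is an assumption:

```python
# Sketch: format a dated one-liner for `lab notes append`-style logging.
from datetime import date

def note_line(action, day=None):
    """Return a '- YYYY-MM-DD: <action>' bullet for the experiment notes."""
    day = day or date.today()
    return f"- {day.isoformat()}: {action}"
```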
Reconciles the rename with upstream's migration of the session plan from
a local autoresearch.md file to experiment notes (lab notes).
…er toggle

- Align default metric with the chip in JobsList (prefer key 'score', else
  first numeric key) so chart and chip agree.
- Replace the floating Nivo tooltip with a fixed side panel inside the modal
  so long autoresearch run descriptions stay fully visible regardless of
  cursor position.
- Make the hovered point's job ID a RouterLink to the job detail page;
  clicking closes the modal and navigates.
- Add a "Lower is better" checkbox at the top, defaulting to the
  auto-detected value but user-overridable.
Documents the /lab-autoresearch agent loop: prereqs (provider, lab CLI,
agent-skill install), the three subcommands (init, run, finalize), the
experiment-notes session contract with the lab notes command surface,
hyperparameter sweeps, and resume semantics.
@aliasaria aliasaria marked this pull request as ready for review May 5, 2026 16:44
@sentry

sentry Bot commented May 5, 2026

Codecov Report

❌ Patch coverage is 38.46154% with 8 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ansformerlab/services/remote_job_status_service.py | 45.45% | 5 Missing and 1 partial ⚠️ |
| api/transformerlab/services/task_service.py | 0.00% | 2 Missing ⚠️ |


@deep1401 deep1401 merged commit 04dc42a into main May 6, 2026
19 checks passed
