Add /autoresearch workflow to transformerlab-cli skill#1979
Merged
Adds an autonomous experiment loop adapted from [pi-autoresearch](https://github.com/davebcn87/pi-autoresearch) but built natively on top of the lab CLI: one experiment per session, one job per iteration, scores stored on the job via `lab.finish(score=...)`, keep/discard via `lab job discard`, ranking via `lab job list --score-metric`. The only file the workflow writes is `autoresearch.md` (objective, files in scope, backlog, what's been tried). Subcommands: `init`, `run`, `sweep`, `status`, `keep`, `discard`, `idea`, `finalize`, `stop`, `off`. `sweep` uses the task.yaml `sweeps:` block for parallel hyperparameter fan-out; the agent-driven loop is reserved for ideas that aren't pure parameter combos.
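The loop's "best run" step can be sketched as plain data logic. This is a sketch only: the job-record field names (`status`, `scores`, `discarded`) are assumptions for illustration, not the real schema behind `lab job list --score-metric`.

```python
def best_job(jobs, metric="score", lower_is_better=False):
    """Pick the best COMPLETE job by a score metric.

    FAILED and discarded jobs drop out naturally: they are filtered
    before ranking, mirroring how discard preserves rather than
    deletes, and how FAILED jobs are excluded from "best".
    """
    candidates = [
        j for j in jobs
        if j.get("status") == "COMPLETE"
        and not j.get("discarded", False)
        and metric in j.get("scores", {})
    ]
    if not candidates:
        return None
    key = lambda j: j["scores"][metric]
    return min(candidates, key=key) if lower_is_better else max(candidates, key=key)

jobs = [
    {"id": 1, "status": "COMPLETE", "scores": {"score": 0.71}},
    {"id": 2, "status": "FAILED", "scores": {}},
    {"id": 3, "status": "COMPLETE", "scores": {"score": 0.84}},
    {"id": 4, "status": "COMPLETE", "discarded": True, "scores": {"score": 0.99}},
]
print(best_job(jobs)["id"])  # → 3
```

Note how the discarded job's higher score (0.99) never wins: discard removes a run from ranking without deleting its record, which is exactly why the workflow uses `lab job discard` rather than `lab job delete`.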
✅ Deploy Preview for transformerlab canceled.
Three follow-up additions inspired by pi-autoresearch's design discussion:
- Step 0 of every /autoresearch run iteration is now an explicit rehydrate: re-read autoresearch.md + the tail of `lab job list --score-metric`. Treats in-context memory as untrustworthy across long sessions.
- Make the fixed-evaluator / mutable-implementation split explicit. The score-computation code feeding `lab.finish(score=…)` belongs in Off limits so the agent can't cheat the metric; the implementation under test is the only thing in Files in scope. Recommend splitting into score.py + solve.py.
- Optional `Max iterations` field in autoresearch.md as a cost cap. The loop stops and reports when it is reached. Reinforces the existing "never raise parallelism past 1 on non-Local providers" rule.
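The rehydrate step and the `Max iterations` cap can be sketched as a small driver. The `rehydrate`/`iterate` hooks and the `stopped` flag are hypothetical names for illustration, not the shipped implementation:

```python
def run_loop(rehydrate, iterate, max_iterations=None):
    """Drive one autoresearch session.

    Every iteration starts by rehydrating state from disk (step 0:
    re-read the notes file and the ranked job tail) instead of
    trusting in-context memory. The optional max_iterations cap is
    the cost stop from the autoresearch.md template.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        state = rehydrate()   # step 0: re-read notes + job ranking
        if state.get("stopped"):
            break
        iterate(state)        # queue and score one idea
        done += 1
    return done

# Toy hooks: nothing ever sets the stop flag, so the cap fires.
ran = run_loop(lambda: {}, lambda state: None, max_iterations=3)
print(ran)  # → 3
```

The point of taking `rehydrate` as a callable is that the agent cannot skip it: re-reading the on-disk plan is structurally the first thing each iteration does.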
Five follow-up additions inspired by skypilot/examples/autoresearch:
- The provider question is now part of /autoresearch init: ask once up front and record the answer in autoresearch.md so resuming agents don't re-prompt.
- "Search strategy depends on parallelism" callout: N=1 is depth (greedy hill-climb), N>=4 is breadth (grid search). Tell the user what kind of session their choice of N implies.
- Loop step 6 now branches on parallelism: at N=1, wait after queueing; at N>=2, fire-and-advance immediately. Waiting after every queue collapses parallelism back to 1; this branch is what enables grid-style search at scale.
- New step 3: stale-job sweep. minutes_requested is guidance, not enforcement, so the loop has to actively `lab job stop` runs that exceed the configured timeout or they consume the parallelism budget forever. Adds a Stale-job timeout field to the autoresearch.md template.
- Step 7 now scans all unprocessed COMPLETE jobs, not just the most-recently-queued one (necessary for fire-and-advance), and explicitly notes that FAILED jobs need no action since they're excluded from "best" naturally.
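The stale-job sweep in step 3 reduces to a timeout check over running jobs; anything it returns is what the loop would pass to `lab job stop`. The record fields (`status`, `started_at` as epoch seconds) are assumptions for the sketch:

```python
import time

def stale_jobs(jobs, timeout_minutes, now=None):
    """Return IDs of RUNNING jobs past the stale-job timeout.

    minutes_requested is guidance, not enforcement, so the loop must
    stop these itself or they hold the parallelism budget forever.
    """
    now = time.time() if now is None else now
    cutoff_seconds = timeout_minutes * 60
    return [
        j["id"] for j in jobs
        if j["status"] == "RUNNING" and now - j["started_at"] > cutoff_seconds
    ]

jobs = [
    {"id": "a", "status": "RUNNING", "started_at": 0},     # 60 min old: stale
    {"id": "b", "status": "RUNNING", "started_at": 3000},  # 10 min old: fine
    {"id": "c", "status": "COMPLETE", "started_at": 0},    # finished: ignored
]
print(stale_jobs(jobs, timeout_minutes=30, now=3600))  # → ['a']
```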
The original PR exposed ten /autoresearch subcommands, but seven of them
were just renames of single lab calls the parent skill already documents
(status, keep, discard, idea, stop, off, sweep). Pretending those were
slash commands made the surface look bigger than it is and hid the fact
that the agent should already know how to do them.
Keep only the three that bundle multi-step rituals:
- /autoresearch init — multi-step setup
- /autoresearch run — entry into the loop
- /autoresearch finalize — best-run summary + publish handoff
Rename "/autoresearch sweep" header to "Hyperparameter sweeps" — same
content, no longer a fake subcommand.
The other operations move into a "During-session operations" section
that maps natural-language requests ("show status", "keep job X",
"add an idea", "stop everything", "stop the loop") to the underlying
lab calls. Same content, more honest framing.
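The natural-language-to-lab-call mapping could be summarized as a lookup table. The command strings below follow the descriptions in this PR; they are illustrative, not a verified CLI reference:

```python
# Hypothetical summary of the "During-session operations" section:
# what the user says, and the underlying lab call the agent runs.
DURING_SESSION_OPS = {
    "show status": "lab job list --score-metric <primary>",
    "keep job X": "lab job discard --undo <job_id>",
    "discard job X": "lab job discard <job_id>",
    "add an idea": "append it to the backlog in autoresearch.md",
    "stop everything": "lab job stop <job_id> for each running job",
    "stop the loop": "record the stop in autoresearch.md and exit the loop",
}

for request, call in DURING_SESSION_OPS.items():
    print(f"{request!r:20} -> {call}")
```

This is the "more honest framing": the agent resolves plain requests to lab calls it already knows, rather than pretending each row is a slash command.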
Update SKILL.md trigger blurb to match.
Adds a new "Experiment Notes" section to SKILL.md covering `lab notes show/edit/append`, plus a default flow recommending agents load prior context at session start and append dated one-liners after meaningful actions. Also expands the autoresearch reference with related guidance.
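The "dated one-liner" convention can be sketched as a tiny formatter; the exact line format is an assumption here, and in practice the agent would append the result with `lab notes append`:

```python
from datetime import date

def note_line(action, today=None):
    """Format a dated one-liner for the experiment notes.

    Assumed format (not the documented one): ISO date, colon, action.
    """
    today = today or date.today().isoformat()
    return f"{today}: {action}"

print(note_line("queued sweep over lr", today="2026-05-06"))
# → 2026-05-06: queued sweep over lr
```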
Reconciles the rename with upstream's migration of the session plan from a local autoresearch.md file to experiment notes (lab notes).
…er toggle
- Align the default metric with the chip in JobsList (prefer the key 'score', else the first numeric key) so chart and chip agree.
- Replace the floating Nivo tooltip with a fixed side panel inside the modal so long autoresearch run descriptions stay fully visible regardless of cursor position.
- Make the hovered point's job ID a RouterLink to the job detail page; clicking closes the modal and navigates.
- Add a "Lower is better" checkbox at the top, defaulting to the auto-detected value but user-overridable.
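The default-metric rule ("prefer the key 'score', else the first numeric key") is simple enough to state as a pure function. A sketch of the selection logic only, not the app's actual code (which lives in the TypeScript frontend):

```python
def default_metric(scores):
    """Pick the metric the chart and chip should both default to:
    'score' if present, else the first numeric key, else None.
    Booleans are excluded even though bool subclasses int."""
    if "score" in scores:
        return "score"
    for key, value in scores.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return key
    return None

print(default_metric({"loss": 0.2, "note": "baseline"}))   # → loss
print(default_metric({"accuracy": 0.9, "score": 0.5}))      # → score
```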
Documents the /lab-autoresearch agent loop: prereqs (provider, lab CLI, agent-skill install), the three subcommands (init, run, finalize), the experiment-notes session contract with the lab notes command surface, hyperparameter sweeps, and resume semantics.
dadmobile requested changes on May 5, 2026
Codecov Report ❌ Patch coverage is
…erlab/transformerlab-app into add/autoresearch-skill
dadmobile approved these changes on May 6, 2026
Summary
- `references/autoresearch.md`: a deep-dive that defines the `/autoresearch` workflow — an autonomous experiment loop adapted from pi-autoresearch, but built natively on `lab` primitives (no extension, no `autoresearch.jsonl`, no separate widget).
- Updates `SKILL.md` with a new `/autoresearch` trigger section and adds the autoresearch reference to the deep-dive list and skill description so the skill activates on "run autoresearch", "optimize X in a loop", and `/autoresearch …`.

What's different from pi-autoresearch
- `autoresearch.jsonl` per-run log → `--description` + `lab job list --score-metric`
- `autoresearch.sh` benchmark script → `lab.finish(score=…)`
- `autoresearch.checks.sh` → `lab.error(…)` on correctness failure → `FAILED` jobs naturally drop out of "best"
- keep → `lab job discard --undo`, discard ↔ `lab job discard` (preserved, not deleted)
- ranking: `lab job list --score-metric <primary>`

Subcommands
`init <goal>`, `run`, `sweep <key=v1,v2,…>`, `status`, `keep <job_id>`, `discard <job_id>`, `idea <text>`, `finalize`, `stop`, `off`. `sweep` uses the task.yaml `sweeps:` block for parallel hyperparameter fan-out; the agent-driven loop is reserved for ideas that aren't pure parameter combos. Parallelism is explicit — defaults to 1, the agent must ask before raising it.

Test plan
- Agent reads `references/autoresearch.md` before acting.
- Runs `lab experiment create … --set-default` first and only writes `autoresearch.md` (no jsonl, no benchmark script).
- `/autoresearch status` produces a ranked summary via `lab --format json job list --score-metric <primary>`.
- `/autoresearch discard <job_id>` calls `lab job discard <id>` (not `lab job delete`).
- `/autoresearch sweep lr=1e-5,3e-5,1e-4` adds a `sweeps:` block and `lab task edit --from-file`s it before queuing one parent job.
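The `sweep <key=v1,v2,…>` fan-out exercised above can be sketched as a parse-and-product step. A sketch of the spec-to-combinations expansion only, with an assumed space-separated multi-key form; the real parser is whatever the `sweeps:` block in task.yaml defines:

```python
from itertools import product

def expand_sweep(spec):
    """Expand a spec like 'lr=1e-5,3e-5 bs=8,16' into one config per
    combination: the parallel fan-out a sweeps: block describes."""
    keys, value_lists = [], []
    for part in spec.split():
        key, values = part.split("=", 1)
        keys.append(key)
        value_lists.append(values.split(","))
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

print(expand_sweep("lr=1e-5,3e-5,1e-4"))
# → [{'lr': '1e-5'}, {'lr': '3e-5'}, {'lr': '1e-4'}]
```

A single-key spec yields one job per value, as in the `/autoresearch sweep lr=1e-5,3e-5,1e-4` test above; adding a second key multiplies the combinations, which is why the loop reserves agent-driven iterations for ideas that aren't pure parameter combos.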