Skip to content

feat(autoresearch): agent metric-optimization mode with dashboard (GROW-103)#3072

Open
fercgomes wants to merge 8 commits into
mainfrom
posthog-code/grow-103-autoresearch
Open

feat(autoresearch): agent metric-optimization mode with dashboard (GROW-103)#3072
fercgomes wants to merge 8 commits into
mainfrom
posthog-code/grow-103-autoresearch

Conversation

@fercgomes

@fercgomes fercgomes commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements GROW-103: autoresearch as a PostHog Code mode, modeled on pi-autoresearch — a long-running agent loop that optimizes a metric. Under the hood it is the task's regular LLM session; a dedicated dashboard panel shows the run's progress.

Screenshot 2026-07-02 at 18 50 04 Screenshot 2026-07-02 at 18 50 20

How it works

  • Protocol — the kickoff prompt teaches the agent to end every turn with a fenced ```autoresearch report block (metric: <number>, summary: <one line>). Each iteration = one focused change + one measurement.
  • Auto-continue engine (packages/core/src/autoresearch/) — AutoresearchService subscribes to the sessions store, detects turn completion (isPromptPending true→false), parses the report from the transcript, records the iteration, and sends the next continuation prompt (with best-so-far and recent history) until the target is reached or the iteration budget is spent. A turn without a report gets one reminder; a second lapse fails the run. Session errors and send failures fail the run with a reason. Runs can be paused (iterations still recorded, no auto-continue), resumed, and stopped.
  • Dashboard (packages/ui/src/features/autoresearch/) — a new "Autoresearch" panel tab in task detail: status/direction badges, pause/resume/stop controls, stat cards (best / last / iterations / target), an SVG chart of the metric with best-so-far frontier and target line, an iterations table with deltas, a run-history selector, and a start-configuration dialog (metric, direction, optional target, budget, instructions). A ChartLineUp button in the task header (with a live-run indicator) opens the panel.
  • Architecture — business logic lives in @posthog/core behind DI; the service reaches the session through a narrow AutoresearchPromptClient seam bound in desktop-services.ts (same pattern as AGENT_PROMPT_SENDER). The UI degrades gracefully on hosts that don't bind the service (useServiceOptional), so web stays a portability smoke test.

Screenshots

The dashboard renders from the domain store; start a run from any task via the header chart button → Start autoresearch.

Test plan

  • 80 new unit tests (packages/core/src/autoresearch/*.test.ts): config schema validation, stats/decision functions, report parsing + transcript extraction, prompt builders, and a full service harness driving fake session turns (iteration recording, continuation, target/budget completion, reminders, pause/resume/stop, session errors, send failures, unrelated-task isolation, dispose).
  • Full suites pass: core 1944, ui 1120, shared 423, app 23 files.
  • pnpm --filter @posthog/core|@posthog/ui|code typecheck clean.
  • Biome clean; zero noRestrictedImports in core; node scripts/check-host-boundaries.mjs reports no new violations.
  • Manual smoke test in the desktop app (start a run against a real task, watch the dashboard update).

Note: @posthog/git and two workspace-server integration suites fail in the cloud sandbox because real git commit is blocked there — pre-existing and unrelated to this change.

🤖 Generated with Claude Code

…op and dashboard

Implements GROW-103: a long-running agent mode that iteratively optimizes a
metric over the task's regular agent session, with a dashboard panel showing
progress (like pi-autoresearch).

Core (packages/core/src/autoresearch):
- Report protocol: the agent ends each turn with a fenced ```autoresearch
  block (metric + summary); prompts.ts holds the kickoff/continuation/reminder
  builders, the report parser, and the transcript extractor.
- AutoresearchService watches sessionStore for turn completion, records
  iterations, and auto-continues until target reached, budget spent, or the
  run is paused/stopped/failed. One reminder on a missing report, then fail.
- Zod schemas, pure stats/decision functions, vanilla zustand domain store.
- 80 unit tests across schemas, stats, protocol, and a service harness.

UI (packages/ui/src/features/autoresearch):
- Dashboard panel tab: status badges, pause/resume/stop, stat cards, SVG
  metric chart (value + best-so-far frontier + target line), iterations
  table, run history selector, and a start-configuration dialog.
- Task-header button with live-run indicator opens the panel.
- Degrades gracefully on hosts without the service (useServiceOptional).

Wiring: new "autoresearch" panel tab kind; AUTORESEARCH_PROMPT_CLIENT host
seam bound in desktop-services.ts forwarding to SessionService.sendPrompt;
core module loaded in desktop-contributions.ts; RendererBindings entries.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown

React Doctor found 9 issues in 3 files · 9 warnings.

9 warnings

src/features/autoresearch/AutoresearchConfigDialog.tsx

src/features/autoresearch/stageModels.tsx

src/features/task-detail/components/TaskInput.tsx

Reviewed by React Doctor for commit 03dad33.

@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Reviews (1): Last reviewed commit: "feat(autoresearch): add autoresearch mod..." | Re-trigger Greptile

Comment thread packages/core/src/autoresearch/autoresearch.ts Outdated
Comment thread packages/core/src/autoresearch/autoresearch.ts
Comment thread packages/ui/src/features/autoresearch/AutoresearchConfigDialog.tsx Outdated
Comment thread packages/ui/src/features/autoresearch/AutoresearchPanel.tsx
fercgomes added 6 commits July 1, 2026 21:12
…ofitting existing ones

Design feedback: autoresearch should be a way to create a new task, not a
mode bolted onto an existing task (matches pi-autoresearch's model).

New entry point — the new-task composer:
- An "Autoresearch" button in TaskInput opens a config dialog (metric,
  direction, target, iteration budget); the armed mode shows as a banner
  under the editor with edit/exit controls, like Inbox mode.
- The composer prompt IS the optimization brief: on submit the kickoff
  protocol preamble is prepended to the prompt content (file/folder chips
  intact) and sent as the new task's initial message.
- useTaskCreation gains onTaskCreatedEffect — a side effect that runs with
  the created task without suppressing the default pending-view/navigation
  behavior. It registers the run and auto-opens the dashboard tab.

Core:
- AutoresearchService.registerRun() (register + subscribe, no send) split
  out of startRun() (register + send kickoff). startRun now only serves
  "New run" on a task that already ran autoresearch.
- buildKickoffPreamble() split from buildKickoffPrompt() so hosts can
  substitute the brief; new autoresearchDraftConfigSchema (config minus
  taskId/instructions).

Removed the retrofit affordances: the panel's "Start autoresearch" empty
state CTA is now informational, and the task-header button only renders
for tasks that actually have runs. StartAutoresearchDialog is generalized
into AutoresearchConfigDialog (also used, prefilled, for dashboard
re-runs).

Core autoresearch tests: 85 passing (registerRun takeover, guard sharing,
preamble/prompt composition).

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
…dings

react-doctor (blocking error no-adjust-state-on-prop-change): the config
dialog synced form state from the `initial` prop inside an effect. The form
now lives in a child component rendered inside Dialog.Content — Radix
unmounts closed content, so each open mounts a fresh form seeded via
useState initializers. Also collapses the field states into one values
object and hoists MetricChart's Intl formatters to module scope, clearing
the derived-state / cascading-set-state / prefer-useReducer / js-hoist-intl
warnings. This same change fixes Greptile's stale-form-across-opens note.

Greptile P1: a sendPrompt rejection landing after a run already ended
(user stop, session error) overwrote the terminal status with
"send-failed". send() now re-reads the run and returns if it is terminal;
covered by a new test (86 total).

Greptile P2s: the three-way terminal-status check, previously duplicated
in the service and the store, is now a shared isTerminalRunStatus() in
schemas.ts; and starting a new run from the dashboard resets the run
selector so the panel follows the new run instead of a previously
selected old one.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
…r-stage models

Rework after real-world unattended runs died silently (~iteration 9/25),
vanished from task surfaces, and lost their dashboards on app restart.

Resilience and persistence:
- Persist runs to a new SQLite autoresearch_runs table (migration 0017,
  repository, host-router procedures, narrow storage-client seam in core).
  Every mutation writes through; dashboards and failure reasons survive
  restarts, and the header entry point hydrates per task.
- New "interrupted" status with reason (session-error, rate-limited,
  send-failed, app-restart). Infrastructure obstacles no longer end runs:
  recovery retries with backoff (1-15 min, 20 attempts max), reconnects
  dead sessions via clearSessionError, and auto-resumes the loop the
  moment the session is usable, telling the agent to re-check the working
  tree. Boot rehydration brings mid-loop runs back as interrupted, so an
  app restart pauses a run instead of killing it.
- Fix latent races: a prompt-request-count cursor ignores isPromptPending
  flips that carry no new turn (a rate-limited send previously re-parsed
  the prior report and could duplicate iterations); the missing-report
  reminder waits a grace period so stop reasons win; stopReason
  "rate_limited" interrupts and "cancelled" pauses instead of re-prompting;
  user pause outranks all automation.

Composer UX (metric input removed):
- The Autoresearch button now toggles the mode; settings live in an inline
  strip under the editor (direction, target, iteration budget, stage
  models) instead of a dialog. The prompt itself is the optimization brief.
- The metric is no longer configured anywhere: the agent labels it via a
  name: line in its report blocks and the run adopts the first label for
  the dashboard and follow-up prompts.

Per-stage models (split iterations):
- Optional build/measure stage models. When set, each iteration runs as an
  implement turn on the build model and a measurement turn on the measure
  model (cheap models for experiment tool calls), switching the session
  model at phase boundaries. Opportunistic reports in implement turns skip
  the measure turn; the phase persists so resume re-applies the right
  model; pause/stop hands the session back on the build model.

Tests: 121 autoresearch core tests (recovery, rate limits, turn artifacts,
rehydration, naming, phase machine, model switching) plus repository tests
against the real migration; full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
…r controls into the prompt input

Review fixes (all 14 findings from a high-effort pass):
- Defer the split-run implement->measure advance through the same grace
  timer as reminders, so a cancelled or rate-limited turn pauses the run
  before any re-prompt reaches the agent.
- Sequence stage-model switches ahead of the send (switchThenSend) so a
  measure turn cannot race onto the wrong model; single-turn sends keep
  their synchronous path.
- Clear the reminder budget on interrupt/pause/cancel so a recovered run
  gets a fresh reminder instead of failing one strike early.
- Capture originalModel at registration and restore it when a split run
  pauses or ends, instead of leaving the session pinned on a stage model.
- Skip reconnect for cloud workspaces (the cloud watcher owns recovery;
  clearSessionError is local-only) rather than throwing on every attempt.
- Chain per-run persistence writes so saves land in call order; handle the
  "queued" stop reason; manual Resume no longer spends the automatic
  recovery budget.
- Clamp the iteration budget to the schema cap in both composer and dialog
  (an out-of-range value previously created the task but silently failed
  run registration, leaving an untracked kickoff).
- Cleanup: shared stageModels module (one sentinel, one options helper,
  one StageModelSelect for strip/dialog/panel), shared getBackoffDelay,
  single phase->prompt mapping (buildPhasePrompt), and a liveRunIds set so
  the session subscription skips terminal runs instead of scanning every
  run ever registered.

Composer UX: autoresearch controls now render inside the prompt input
itself via a new PromptInput headerAddon slot — one input view with the
direction/target/iteration inputs in a header row above the editor text,
stage models tucked behind a popover, and the guidance moved into the
editor placeholder. The attached strip below the composer is gone.

Tests: 127 autoresearch core tests (+6 regressions covering the grace
deferral, reminder reset, queued sends, recovery budget, and model
restoration); full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
…ve model line, staff feature flag

Stage efforts: each stage (implementation / experiment) now carries a
reasoning-effort level alongside its model. A run splits whenever the
stages differ in model OR effort — effort-only splits work — and
identical stages run as plain single-turn iterations. The session seam
gained setEffort (thought_level config category); switchStage applies
model+effort ahead of each send, and pause/stop/cancel restore the
model and effort captured at registration (originalModel/originalEffort).

Composer model UX: arming autoresearch hides the toolbar's model and
effort pickers — the stages popover in the header row becomes the single
control surface, seeded from the composer's values at arm time (and
backfilled if the preview config loads late). Its trigger shows a live
summary ("Opus · high → Haiku · low"). The task is created on the
measure stage, since the kickoff baseline is a measurement. Trade-off:
adapter switching is unavailable while armed (it lives in the hidden
model selector).

Live surfacing: the dashboard header shows the session's actual current
model/effort while the run is active ("Agent is on Haiku · low effort —
measure phase"), read from the live session config so stage switches
reflect immediately, plus a static summary of the configured stages.

Feature flag: the whole feature is gated behind
posthog-code-autoresearch (create the flag in PostHog and target staff).
UI entry points check useAutoresearchEnabled(); boot-time run
rehydration awaits a core-side AUTORESEARCH_GATE seam bound to
posthog-js flags (waits for the first flags load with a 10s fallback),
so ungated users get no restored runs, no auto-resume, no session
subscription. A panel with already-live runs stays controllable if the
flag is revoked mid-session. Dev builds (import.meta.env.DEV) are
always on.

Tests: 131 autoresearch core tests (+effort-only split alternation,
identical-stages single-turn, effort restoration, flag-off dormancy);
full core/ui/shared/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
…shboard value

The report protocol gains a structured unit: line (e.g. kB, ms, %) and
the name: guidance drops "with units", so the dashboard title stays a
clean label while every value carries the unit. The first report with a
unit sets run.metricUnit (first-wins, persisted, like the name); units
longer than 16 chars are ignored as prose, and unitless metrics render
bare as before.

Rendered via a shared withMetricUnit helper (space-separated, % hugs the
number) on: the Best/Last/Target stat cards, the chart's y-axis extremes,
target label and point tooltips, and the iteration table's Value and
delta columns. Continuation prompts include the unit in the history block
too, keeping the agent grounded in the units it reports.

Tests: 136 autoresearch core tests (unit parsing, too-long-unit guard,
first-unit-wins adoption, unit in history prompts, protocol example);
full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
@fercgomes fercgomes marked this pull request as ready for review July 2, 2026 21:48
@fercgomes fercgomes requested a review from a team July 2, 2026 21:48
@DanielVisca DanielVisca self-requested a review July 2, 2026 21:58
@greptile-apps

greptile-apps Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Reviews (2): Last reviewed commit: "feat(autoresearch): report and render th..." | Re-trigger Greptile

@DanielVisca DanielVisca left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ran the autoresearch suite locally, 136 passing, and the module layout follows the core conventions really cleanly. two questions inline but nothing blocking — Ive got an MCP eval metric ready to point at this the moment it lands

* ```autoresearch fenced block wins, so an agent quoting the protocol and
* then reporting still parses correctly.
*/
export function parseMetricReport(text: string): AutoresearchReport | null {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the metric is entirely self-reported — parseMetricReport over the agents own reply is the only source, and the host seam (AutoresearchSessionClient) has no way to run a measurement itself, so a single hallucinated finite value past the target ends the run as target-reached and sits as Best permanently. an agent under optimization pressure grading its own homework is the classic goodhart trap.. have you thought about a verifier seam (host-run command whose output cross-checks the reported number), even as a post-v1 follow-up? asking because Id much rather my eval harness compute the number than trust the block.

const reminders = this.remindersSent.get(runId) ?? 0;
if (reminders >= 1) {
this.endRun(runId, "failed", {
endReason: "missing-report",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noticed the task chat stays live during a run — a user prompt bumps the prompt-request cursor, so an ordinary chat reply without a report block burns the one reminder, and a second one ends the run as missing-report. is that intended? feels easy to hit without realizing (tests cover unrelated-task turns but not this one).

Resolve conflict in TaskInput.tsx: combine main's file-preview
restructure with the autoresearch composer additions.

Generated-By: PostHog Code
Task-Id: ed0d2aad-b4b6-4595-9677-b7a4f3959183
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants