feat(autoresearch): agent metric-optimization mode with dashboard (GROW-103) by fercgomes · Pull Request #3072 · PostHog/code

fercgomes · 2026-07-01T23:35:08Z

Summary

Implements GROW-103: autoresearch as a PostHog Code mode, modeled on pi-autoresearch — a long-running agent loop that optimizes a metric. Under the hood it is the task's regular LLM session; a dedicated dashboard panel shows the run's progress.

How it works

Protocol — the kickoff prompt teaches the agent to end every turn with a fenced ```autoresearch report block (metric: <number>, summary: <one line>). Each iteration = one focused change + one measurement.
Auto-continue engine (packages/core/src/autoresearch/) — AutoresearchService subscribes to the sessions store, detects turn completion (isPromptPending true→false), parses the report from the transcript, records the iteration, and sends the next continuation prompt (with best-so-far and recent history) until the target is reached or the iteration budget is spent. A turn without a report gets one reminder; a second lapse fails the run. Session errors and send failures fail the run with a reason. Runs can be paused (iterations still recorded, no auto-continue), resumed, and stopped.
Dashboard (packages/ui/src/features/autoresearch/) — a new "Autoresearch" panel tab in task detail: status/direction badges, pause/resume/stop controls, stat cards (best / last / iterations / target), an SVG chart of the metric with best-so-far frontier and target line, an iterations table with deltas, a run-history selector, and a start-configuration dialog (metric, direction, optional target, budget, instructions). A ChartLineUp button in the task header (with a live-run indicator) opens the panel.
Architecture — business logic lives in @posthog/core behind DI; the service reaches the session through a narrow AutoresearchPromptClient seam bound in desktop-services.ts (same pattern as AGENT_PROMPT_SENDER). The UI degrades gracefully on hosts that don't bind the service (useServiceOptional), so web stays a portability smoke test.
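The report protocol described above can be sketched as a small parser. This is a minimal illustration, not the PR's actual `parseMetricReport`: the `MetricReport` shape, the `parseReport` name, and the rule that the last block in a reply wins are assumptions made for the example.

```typescript
// Hypothetical report shape; the PR's real AutoresearchReport type is not shown here.
interface MetricReport {
  metric: number;
  summary: string;
}

// Built at runtime so this example never contains a literal fence marker.
const FENCE = "`".repeat(3);

// Scans a reply for fenced autoresearch blocks and returns the last valid one,
// or null when no block carries a finite metric. (Assumption: last block wins.)
function parseReport(text: string): MetricReport | null {
  const pattern = new RegExp(`${FENCE}autoresearch\\s*\\n([\\s\\S]*?)${FENCE}`, "g");
  let last: MetricReport | null = null;
  for (const match of text.matchAll(pattern)) {
    const fields = new Map<string, string>();
    for (const line of match[1].split("\n")) {
      const idx = line.indexOf(":");
      if (idx === -1) continue; // not a "key: value" line
      fields.set(line.slice(0, idx).trim().toLowerCase(), line.slice(idx + 1).trim());
    }
    const rawMetric = fields.get("metric");
    if (rawMetric === undefined || rawMetric === "") continue;
    const metric = Number(rawMetric);
    if (!Number.isFinite(metric)) continue; // reject NaN and Infinity
    last = { metric, summary: fields.get("summary") ?? "" };
  }
  return last;
}
```

A reply that quotes the protocol and then reports its own measurement yields the later block under this assumption; a reply with no block yields `null`, which is the case the reminder logic has to handle.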

Screenshots

The dashboard renders from the domain store; start a run from any task via the header chart button → Start autoresearch.

Test plan

80 new unit tests (packages/core/src/autoresearch/*.test.ts): config schema validation, stats/decision functions, report parsing + transcript extraction, prompt builders, and a full service harness driving fake session turns (iteration recording, continuation, target/budget completion, reminders, pause/resume/stop, session errors, send failures, unrelated-task isolation, dispose).
Full suites pass: core 1944, ui 1120, shared 423, app 23 files.
pnpm --filter @posthog/core|@posthog/ui|code typecheck clean.
Biome clean; zero noRestrictedImports in core; node scripts/check-host-boundaries.mjs reports no new violations.
Manual smoke test in the desktop app (start a run against a real task, watch the dashboard update).

Note: @posthog/git and two workspace-server integration suites fail in the cloud sandbox because real git commit is blocked there — pre-existing and unrelated to this change.

🤖 Generated with Claude Code

…op and dashboard

Implements GROW-103: a long-running agent mode that iteratively optimizes a metric over the task's regular agent session, with a dashboard panel showing progress (like pi-autoresearch).

Core (packages/core/src/autoresearch):
- Report protocol: the agent ends each turn with a fenced ```autoresearch block (metric + summary); prompts.ts holds the kickoff/continuation/reminder builders, the report parser, and the transcript extractor.
- AutoresearchService watches sessionStore for turn completion, records iterations, and auto-continues until target reached, budget spent, or the run is paused/stopped/failed. One reminder on a missing report, then fail.
- Zod schemas, pure stats/decision functions, vanilla zustand domain store.
- 80 unit tests across schemas, stats, protocol, and a service harness.

UI (packages/ui/src/features/autoresearch):
- Dashboard panel tab: status badges, pause/resume/stop, stat cards, SVG metric chart (value + best-so-far frontier + target line), iterations table, run history selector, and a start-configuration dialog.
- Task-header button with live-run indicator opens the panel.
- Degrades gracefully on hosts without the service (useServiceOptional).

Wiring: new "autoresearch" panel tab kind; AUTORESEARCH_PROMPT_CLIENT host seam bound in desktop-services.ts forwarding to SessionService.sendPrompt; core module loaded in desktop-contributions.ts; RendererBindings entries.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
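The "pure stats/decision functions" mentioned in this commit could look roughly like the sketch below. The `Direction` and `Decision` names and values are hypothetical; only the stop conditions (target reached, iteration budget spent) come from the description.

```typescript
// Hypothetical names: the PR's schemas are not shown in this page.
type Direction = "minimize" | "maximize";
type Decision = "continue" | "target-reached" | "budget-spent";

// Best value seen so far in the given direction, or null before any iteration.
function bestSoFar(values: number[], direction: Direction): number | null {
  if (values.length === 0) return null;
  return direction === "minimize" ? Math.min(...values) : Math.max(...values);
}

// Decides what the loop does after an iteration has been recorded.
// The target check runs first, so hitting the target on the final
// budgeted iteration still counts as "target-reached".
function decideNext(
  values: number[],
  direction: Direction,
  maxIterations: number,
  target?: number,
): Decision {
  const best = bestSoFar(values, direction);
  if (best !== null && target !== undefined) {
    const reached = direction === "minimize" ? best <= target : best >= target;
    if (reached) return "target-reached";
  }
  return values.length >= maxIterations ? "budget-spent" : "continue";
}
```

Keeping these functions free of session or store access is what makes them cheap to unit test, which matches the split between pure functions and the service harness in the test plan.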

github-actions · 2026-07-01T23:35:36Z

React Doctor found 9 issues in 3 files · 9 warnings.

9 warnings

src/features/autoresearch/AutoresearchConfigDialog.tsx

⚠️ src/features/autoresearch/AutoresearchConfigDialog.tsx:55 Empty default prop breaks memo (rerender-memo-with-default-value)
⚠️ src/features/autoresearch/AutoresearchConfigDialog.tsx:56 Empty default prop breaks memo (rerender-memo-with-default-value)

src/features/autoresearch/stageModels.tsx

⚠️ src/features/autoresearch/stageModels.tsx:24 Non-component export in component file (only-export-components)
⚠️ src/features/autoresearch/stageModels.tsx:35 Non-component export in component file (only-export-components)
⚠️ src/features/autoresearch/stageModels.tsx:44 Non-component export in component file (only-export-components)
⚠️ src/features/autoresearch/stageModels.tsx:49 Non-component export in component file (only-export-components)
⚠️ src/features/autoresearch/stageModels.tsx:54 Non-component export in component file (only-export-components)

src/features/task-detail/components/TaskInput.tsx

⚠️ src/features/task-detail/components/TaskInput.tsx:628 Event logic handled in an effect (no-event-handler)
⚠️ src/features/task-detail/components/TaskInput.tsx:1128 JSX element passed as a prop (jsx-no-jsx-as-prop)

Reviewed by React Doctor for commit 03dad33.

greptile-apps · 2026-07-01T23:39:54Z

Reviews (1): Last reviewed commit: "feat(autoresearch): add autoresearch mod..."

…ofitting existing ones

Design feedback: autoresearch should be a way to create a new task, not a mode bolted onto an existing task (matches pi-autoresearch's model).

New entry point (the new-task composer):
- An "Autoresearch" button in TaskInput opens a config dialog (metric, direction, target, iteration budget); the armed mode shows as a banner under the editor with edit/exit controls, like Inbox mode.
- The composer prompt IS the optimization brief: on submit the kickoff protocol preamble is prepended to the prompt content (file/folder chips intact) and sent as the new task's initial message.
- useTaskCreation gains onTaskCreatedEffect, a side effect that runs with the created task without suppressing the default pending-view/navigation behavior. It registers the run and auto-opens the dashboard tab.

Core:
- AutoresearchService.registerRun() (register + subscribe, no send) split out of startRun() (register + send kickoff). startRun now only serves "New run" on a task that already ran autoresearch.
- buildKickoffPreamble() split from buildKickoffPrompt() so hosts can substitute the brief; new autoresearchDraftConfigSchema (config minus taskId/instructions).

Removed the retrofit affordances: the panel's "Start autoresearch" empty state CTA is now informational, and the task-header button only renders for tasks that actually have runs. StartAutoresearchDialog is generalized into AutoresearchConfigDialog (also used, prefilled, for dashboard re-runs).

Core autoresearch tests: 85 passing (registerRun takeover, guard sharing, preamble/prompt composition).

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981

…dings

react-doctor (blocking error no-adjust-state-on-prop-change): the config dialog synced form state from the `initial` prop inside an effect. The form now lives in a child component rendered inside Dialog.Content; Radix unmounts closed content, so each open mounts a fresh form seeded via useState initializers. Also collapses the field states into one values object and hoists MetricChart's Intl formatters to module scope, clearing the derived-state / cascading-set-state / prefer-useReducer / js-hoist-intl warnings. This same change fixes Greptile's stale-form-across-opens note.

Greptile P1: a sendPrompt rejection landing after a run already ended (user stop, session error) overwrote the terminal status with "send-failed". send() now re-reads the run and returns if it is terminal; covered by a new test (86 total).

Greptile P2s: the three-way terminal-status check, previously duplicated in the service and the store, is now a shared isTerminalRunStatus() in schemas.ts; and starting a new run from the dashboard resets the run selector so the panel follows the new run instead of a previously selected old one.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
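The Greptile P1 fix can be illustrated with a small guard. This is a sketch under assumptions: the `RunStatus` values and the `Run` shape are hypothetical (the page does not list the three terminal statuses), and `applySendFailure` stands in for the rejection branch of `send()`.

```typescript
// Hypothetical status set; the three terminal statuses are assumed here.
type RunStatus = "running" | "paused" | "completed" | "stopped" | "failed";

interface Run {
  id: string;
  status: RunStatus;
  endReason?: string;
}

// One shared predicate instead of a check duplicated in service and store.
function isTerminalRunStatus(status: RunStatus): boolean {
  return status === "completed" || status === "stopped" || status === "failed";
}

// Called with the freshly re-read run when a prompt send rejects.
// A run that already ended keeps its terminal status and reason.
function applySendFailure(current: Run): Run {
  if (isTerminalRunStatus(current.status)) return current;
  return { ...current, status: "failed", endReason: "send-failed" };
}
```

The important detail is that the run is re-read at rejection time rather than captured before the send, so a user stop that lands while the send is in flight is not overwritten.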

…r-stage models

Rework after real-world unattended runs died silently (~iteration 9/25), vanished from task surfaces, and lost their dashboards on app restart.

Resilience and persistence:
- Persist runs to a new SQLite autoresearch_runs table (migration 0017, repository, host-router procedures, narrow storage-client seam in core). Every mutation writes through; dashboards and failure reasons survive restarts, and the header entry point hydrates per task.
- New "interrupted" status with reason (session-error, rate-limited, send-failed, app-restart). Infrastructure obstacles no longer end runs: recovery retries with backoff (1-15 min, 20 attempts max), reconnects dead sessions via clearSessionError, and auto-resumes the loop the moment the session is usable, telling the agent to re-check the working tree. Boot rehydration brings mid-loop runs back as interrupted, so an app restart pauses a run instead of killing it.
- Fix latent races: a prompt-request-count cursor ignores isPromptPending flips that carry no new turn (a rate-limited send previously re-parsed the prior report and could duplicate iterations); the missing-report reminder waits a grace period so stop reasons win; stopReason "rate_limited" interrupts and "cancelled" pauses instead of re-prompting; user pause outranks all automation.

Composer UX (metric input removed):
- The Autoresearch button now toggles the mode; settings live in an inline strip under the editor (direction, target, iteration budget, stage models) instead of a dialog. The prompt itself is the optimization brief.
- The metric is no longer configured anywhere: the agent labels it via a name: line in its report blocks and the run adopts the first label for the dashboard and follow-up prompts.

Per-stage models (split iterations):
- Optional build/measure stage models. When set, each iteration runs as an implement turn on the build model and a measurement turn on the measure model (cheap models for experiment tool calls), switching the session model at phase boundaries. Opportunistic reports in implement turns skip the measure turn; the phase persists so resume re-applies the right model; pause/stop hands the session back on the build model.

Tests: 121 autoresearch core tests (recovery, rate limits, turn artifacts, rehydration, naming, phase machine, model switching) plus repository tests against the real migration; full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
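The prompt-request-count cursor from the race fixes can be sketched as follows. The `SessionSnapshot` fields other than `isPromptPending` and the `TurnCursor` class are hypothetical names chosen for the example; the idea shown is only that a pending true→false flip counts as a finished turn when the request counter has advanced past the last one consumed.

```typescript
// Hypothetical snapshot shape; promptRequestCount is an assumed field name.
interface SessionSnapshot {
  isPromptPending: boolean;
  promptRequestCount: number;
}

class TurnCursor {
  private lastConsumedCount = 0;
  private wasPending = false;

  // Returns true only when a pending prompt has just finished AND it belongs
  // to a request that has not been consumed yet.
  observe(snapshot: SessionSnapshot): boolean {
    const finished = this.wasPending && !snapshot.isPromptPending;
    this.wasPending = snapshot.isPromptPending;
    if (!finished) return false;
    if (snapshot.promptRequestCount <= this.lastConsumedCount) {
      return false; // a flip with no new turn, e.g. a send that never started
    }
    this.lastConsumedCount = snapshot.promptRequestCount;
    return true;
  }
}
```

Without the counter check, a send that bounces (for example on a rate limit) would look like a completed turn, and the previous report would be parsed and recorded a second time.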

…r controls into the prompt input

Review fixes (all 14 findings from a high-effort pass):
- Defer the split-run implement->measure advance through the same grace timer as reminders, so a cancelled or rate-limited turn pauses the run before any re-prompt reaches the agent.
- Sequence stage-model switches ahead of the send (switchThenSend) so a measure turn cannot race onto the wrong model; single-turn sends keep their synchronous path.
- Clear the reminder budget on interrupt/pause/cancel so a recovered run gets a fresh reminder instead of failing one strike early.
- Capture originalModel at registration and restore it when a split run pauses or ends, instead of leaving the session pinned on a stage model.
- Skip reconnect for cloud workspaces (the cloud watcher owns recovery; clearSessionError is local-only) rather than throwing on every attempt.
- Chain per-run persistence writes so saves land in call order; handle the "queued" stop reason; manual Resume no longer spends the automatic recovery budget.
- Clamp the iteration budget to the schema cap in both composer and dialog (an out-of-range value previously created the task but silently failed run registration, leaving an untracked kickoff).
- Cleanup: shared stageModels module (one sentinel, one options helper, one StageModelSelect for strip/dialog/panel), shared getBackoffDelay, single phase->prompt mapping (buildPhasePrompt), and a liveRunIds set so the session subscription skips terminal runs instead of scanning every run ever registered.

Composer UX: autoresearch controls now render inside the prompt input itself via a new PromptInput headerAddon slot: one input view with the direction/target/iteration inputs in a header row above the editor text, stage models tucked behind a popover, and the guidance moved into the editor placeholder. The attached strip below the composer is gone.

Tests: 127 autoresearch core tests (+6 regressions covering the grace deferral, reminder reset, queued sends, recovery budget, and model restoration); full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
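For the shared backoff helper, here is a minimal sketch using the bounds stated earlier in this PR (1-15 min, 20 attempts max). The doubling schedule and the `backoffDelayMs` name are assumptions; the page does not say how the real getBackoffDelay grows between the two bounds.

```typescript
const MIN_DELAY_MS = 60_000; // 1 minute
const MAX_DELAY_MS = 15 * 60_000; // 15 minutes
const MAX_ATTEMPTS = 20;

// Delay before the given 1-based recovery attempt, or null once the
// automatic recovery budget is exhausted. Assumed schedule: double each
// attempt, capped at the maximum.
function backoffDelayMs(attempt: number): number | null {
  if (!Number.isInteger(attempt) || attempt < 1 || attempt > MAX_ATTEMPTS) {
    return null;
  }
  return Math.min(MIN_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}
```

Under this schedule the cap is reached on the fifth attempt (1, 2, 4, 8, then 15 minutes), so most of the 20 attempts wait the full 15 minutes.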

…ve model line, staff feature flag

Stage efforts: each stage (implementation / experiment) now carries a reasoning-effort level alongside its model. A run splits whenever the stages differ in model OR effort (effort-only splits work), and identical stages run as plain single-turn iterations. The session seam gained setEffort (thought_level config category); switchStage applies model+effort ahead of each send, and pause/stop/cancel restore the model and effort captured at registration (originalModel/originalEffort).

Composer model UX: arming autoresearch hides the toolbar's model and effort pickers; the stages popover in the header row becomes the single control surface, seeded from the composer's values at arm time (and backfilled if the preview config loads late). Its trigger shows a live summary ("Opus · high → Haiku · low"). The task is created on the measure stage, since the kickoff baseline is a measurement. Trade-off: adapter switching is unavailable while armed (it lives in the hidden model selector).

Live surfacing: the dashboard header shows the session's actual current model/effort while the run is active ("Agent is on Haiku · low effort — measure phase"), read from the live session config so stage switches reflect immediately, plus a static summary of the configured stages.

Feature flag: the whole feature is gated behind posthog-code-autoresearch (create the flag in PostHog and target staff). UI entry points check useAutoresearchEnabled(); boot-time run rehydration awaits a core-side AUTORESEARCH_GATE seam bound to posthog-js flags (waits for the first flags load with a 10s fallback), so ungated users get no restored runs, no auto-resume, no session subscription. A panel with already-live runs stays controllable if the flag is revoked mid-session. Dev builds (import.meta.env.DEV) are always on.

Tests: 131 autoresearch core tests (+effort-only split alternation, identical-stages single-turn, effort restoration, flag-off dormancy); full core/ui/shared/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
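The split rule and the popover summary from this commit reduce to two small functions. The `Stage` shape and both function names are hypothetical; the rule itself (split when model OR effort differs) and the summary format are taken from the description.

```typescript
// Hypothetical stage shape: a model label plus a reasoning-effort level.
interface Stage {
  model: string;
  effort: string;
}

// A run alternates implement/measure turns only when the stages differ.
// Identical stages mean plain single-turn iterations.
function isSplitRun(build: Stage, measure: Stage): boolean {
  return build.model !== measure.model || build.effort !== measure.effort;
}

// Summary in the style of the popover trigger, build stage first.
function stageSummary(build: Stage, measure: Stage): string {
  const label = (stage: Stage) => `${stage.model} · ${stage.effort}`;
  return `${label(build)} → ${label(measure)}`;
}
```

For example, `stageSummary({ model: "Opus", effort: "high" }, { model: "Haiku", effort: "low" })` produces the "Opus · high → Haiku · low" string quoted in the commit message.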

…shboard value

The report protocol gains a structured unit: line (e.g. kB, ms, %) and the name: guidance drops "with units", so the dashboard title stays a clean label while every value carries the unit. The first report with a unit sets run.metricUnit (first-wins, persisted, like the name); units longer than 16 chars are ignored as prose, and unitless metrics render bare as before.

Rendered via a shared withMetricUnit helper (space-separated, % hugs the number) on: the Best/Last/Target stat cards, the chart's y-axis extremes, target label and point tooltips, and the iteration table's Value and delta columns. Continuation prompts include the unit in the history block too, keeping the agent grounded in the units it reports.

Tests: 136 autoresearch core tests (unit parsing, too-long-unit guard, first-unit-wins adoption, unit in history prompts, protocol example); full core/ui/app suites green.

Generated-By: PostHog Code
Task-Id: 41d083af-f0ef-49c1-bce2-9d4a34046981
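The unit rules in this commit can be sketched in a few lines. The `withMetricUnit` name comes from the commit message, but this body is a guess at its behavior, and `normalizeUnit` is a hypothetical helper for the 16-character guard; number formatting is reduced to `String(value)` for the example.

```typescript
const MAX_UNIT_LENGTH = 16;

// Returns a usable unit, or undefined when the line is empty or so long
// that it is more likely prose than a unit.
function normalizeUnit(raw: string | undefined): string | undefined {
  const unit = raw?.trim();
  if (!unit || unit.length > MAX_UNIT_LENGTH) return undefined;
  return unit;
}

// Space-separated by default; "%" hugs the number; no unit renders bare.
function withMetricUnit(value: number, unit?: string): string {
  const text = String(value);
  if (!unit) return text;
  return unit === "%" ? `${text}%` : `${text} ${unit}`;
}
```

With these rules, `withMetricUnit(12.5, "kB")` gives "12.5 kB", `withMetricUnit(40, "%")` gives "40%", and a unitless metric such as `withMetricUnit(3)` gives "3".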

greptile-apps · 2026-07-02T22:03:13Z

Reviews (2): Last reviewed commit: "feat(autoresearch): report and render th..."

DanielVisca

ran the autoresearch suite locally, 136 passing, and the module layout follows the core conventions really cleanly. two questions inline but nothing blocking. I've got an MCP eval metric ready to point at this the moment it lands

DanielVisca · 2026-07-02T22:19:16Z

+ * ```autoresearch fenced block wins, so an agent quoting the protocol and
+ * then reporting still parses correctly.
+ */
+export function parseMetricReport(text: string): AutoresearchReport | null {


the metric is entirely self-reported: parseMetricReport over the agent's own reply is the only source, and the host seam (AutoresearchSessionClient) has no way to run a measurement itself, so a single hallucinated finite value past the target ends the run as target-reached and sits as Best permanently. an agent under optimization pressure grading its own homework is the classic Goodhart trap. have you thought about a verifier seam (host-run command whose output cross-checks the reported number), even as a post-v1 follow-up? asking because I'd much rather my eval harness compute the number than trust the block.
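To make the verifier idea concrete, here is a minimal sketch of the cross-check step only. Nothing here exists in the PR: `crossCheck`, `VerifierResult`, the reason strings, and the 1% relative tolerance are all hypothetical, and the host-run command that would produce `measured` is left out.

```typescript
// Hypothetical result of comparing the agent's reported value with a
// host-measured one.
interface VerifierResult {
  accepted: boolean;
  value: number; // the value the run should record
  reason?: "verifier-failed" | "mismatch";
}

// Accepts the report when it agrees with the host measurement within a
// relative tolerance. On disagreement the host-measured value is kept.
function crossCheck(reported: number, measured: number, relTolerance = 0.01): VerifierResult {
  if (!Number.isFinite(measured)) {
    // The verifier itself produced nothing usable; do not trust either side.
    return { accepted: false, value: reported, reason: "verifier-failed" };
  }
  const scale = Math.max(Math.abs(measured), Number.EPSILON);
  const drift = Math.abs(reported - measured) / scale;
  if (drift <= relTolerance) return { accepted: true, value: measured };
  return { accepted: false, value: measured, reason: "mismatch" };
}
```

In a design like this, a rejected report would most naturally be treated as a missing report rather than recorded as an iteration, so a hallucinated value could not end the run as target-reached.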

DanielVisca · 2026-07-02T22:19:16Z

+      const reminders = this.remindersSent.get(runId) ?? 0;
+      if (reminders >= 1) {
+        this.endRun(runId, "failed", {
+          endReason: "missing-report",


noticed the task chat stays live during a run — a user prompt bumps the prompt-request cursor, so an ordinary chat reply without a report block burns the one reminder, and a second one ends the run as missing-report. is that intended? feels easy to hit without realizing (tests cover unrelated-task turns but not this one).
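The behavior in question follows directly from a one-strike counter like the one in the diff hunk above. The sketch below isolates that logic; the `ReminderBudget` class and `MissingReportAction` type are hypothetical names, and `reset` models the "clear the reminder budget on interrupt/pause/cancel" fix from an earlier commit.

```typescript
type MissingReportAction = "remind" | "fail";

class ReminderBudget {
  private readonly remindersSent = new Map<string, number>();

  // Called for every finished turn on the run's task that has no report
  // block. The counter cannot tell an agent lapse from an ordinary chat reply.
  onMissingReport(runId: string): MissingReportAction {
    const reminders = this.remindersSent.get(runId) ?? 0;
    if (reminders >= 1) return "fail"; // second lapse ends the run
    this.remindersSent.set(runId, reminders + 1);
    return "remind";
  }

  // Gives the run a fresh reminder, e.g. after an interrupt or pause.
  reset(runId: string): void {
    this.remindersSent.delete(runId);
  }
}
```

Because the counter is keyed only by run, two user-initiated chat turns without report blocks are indistinguishable from two protocol lapses by the agent, which is the path described in the comment.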

Resolve conflict in TaskInput.tsx: combine main's file-preview restructure with the autoresearch composer additions.

Generated-By: PostHog Code
Task-Id: ed0d2aad-b4b6-4595-9677-b7a4f3959183

greptile-apps Bot reviewed Jul 1, 2026


fercgomes added 6 commits July 1, 2026 21:12

fercgomes marked this pull request as ready for review July 2, 2026 21:48

fercgomes requested a review from a team July 2, 2026 21:48

DanielVisca self-requested a review July 2, 2026 21:58

DanielVisca approved these changes Jul 2, 2026


Merge branch 'main' into posthog-code/grow-103-autoresearch

03dad33


fercgomes wants to merge 8 commits into main from posthog-code/grow-103-autoresearch
