feat(agent-wiki): add compare-outcomes pass for contrastive guidelines #274
vinodmut wants to merge 2 commits into
Adds a new pipeline skill, agent-wiki-compare-outcomes, that derives *contrastive* guidelines by comparing successful vs failed trajectories for the same/similar task, rather than mining rules from a single trajectory. It LLM-judges success/failure from the normalized transcript (no dependency on benchmark-specific outcome labels) and grounds each rule in evidence (task wording, observed tool/API calls, transcript/doc snippets). The bundled compare_outcomes.py is self-contained (stdlib only).

Wires it into the ingest orchestrator as a conditional Step 4.5 (after synthesize, before consolidate): the description, subagent list, pipeline diagram, and a new step section that spawns one agent-wiki-compare-outcomes subagent over the corpus when there's a success/failure contrast, renders any strong contrastive guidelines, and skips cleanly when there's no contrast.

Documents the new pass in the overview docs so it's discoverable: the README skills tree, and design.md's pipeline diagram, stage table, ingest narrative, and a short "learning from contrast" rationale.

Ported from the appworld-agent-wiki-experiment branch; scoped to just the new skill + its ingest wiring (the branch's separate consolidate "mine step" and synthesize changes are intentionally not included).

Builder/CI conventions followed: file-local `# mypy: ignore-errors` header matching sibling scripts.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py`:
- Around line 261-276: The client.chat.completions.create() call lacks an
explicit timeout configuration, which could cause the script to block
indefinitely if the API becomes unresponsive during batch processing. Add a
timeout to prevent excessive blocking: either add a timeout parameter when
instantiating the OpenAI client (e.g., timeout=60.0), or use the with_options()
method on the client immediately before calling chat.completions.create() to
apply the timeout at the request level. Choose whichever approach fits your
codebase structure best.
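The request-level option from the finding above could look roughly like the sketch below. It assumes `client` is an OpenAI Python SDK v1 client (or anything exposing the same `with_options()` and `chat.completions.create()` methods); the helper name `judge_with_timeout` and its parameters are hypothetical and are not taken from `compare_outcomes.py`.

```python
from __future__ import annotations

from typing import Any

# 60 seconds is the value suggested in the review comment.
DEFAULT_TIMEOUT_S = 60.0


def judge_with_timeout(
    client: Any,
    model: str,
    messages: list[dict[str, str]],
    timeout_s: float = DEFAULT_TIMEOUT_S,
) -> Any:
    """Issue one chat completion with a per-request timeout (hypothetical helper)."""
    # with_options() returns a copy of the client with the overridden setting,
    # so the caller's original client is left unchanged.
    scoped = client.with_options(timeout=timeout_s)
    return scoped.chat.completions.create(model=model, messages=messages)
```

The client-level alternative named in the same finding is to pass `timeout=60.0` once when the client is constructed, which covers every call in the batch loop without touching the call site.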
📒 Files selected for processing (5)
- explorations/agent-wiki/README.md
- explorations/agent-wiki/docs/design.md
- explorations/agent-wiki/skills/agent-wiki-compare-outcomes/SKILL.md
- explorations/agent-wiki/skills/agent-wiki-compare-outcomes/scripts/compare_outcomes.py
- explorations/agent-wiki/skills/agent-wiki-ingest/SKILL.md
Addresses CodeRabbit review finding: Add a timeout to the LLM API call

A 60s client-level timeout prevents the batch judge loop from blocking indefinitely if the API becomes unresponsive.
What this adds
A new pass in the `agent-wiki` ingest pipeline, `agent-wiki-compare-outcomes`, that derives contrastive guidelines by comparing successful vs failed trajectories for the same (or similar) task, rather than mining rules from a single trajectory.

Every other pass in the pipeline (summarize / extract / synthesize) learns from one trajectory at a time. This pass learns from the contrast: a rule is only promoted when it's backed by a failed path, a successful path, and concrete trajectory evidence (task wording, observed tool/API calls, transcript/doc snippets). It can LLM-judge success/failure straight from the normalized transcript, so it does not depend on benchmark-specific outcome labels.
Extends the agent-wiki exploration merged in #268; related to the offline extraction/consolidation idea in #256.
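The promotion rule described above (a failed path, a successful path, and concrete evidence) can be read as a simple gate. The sketch below is only an illustration of that reading: the `Candidate` dataclass, its field names, and the `classify` helper are hypothetical and do not reflect the actual structures in `compare_outcomes.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A candidate contrastive rule (hypothetical shape, for illustration only)."""

    task_id: str
    failed_run_ids: list[str] = field(default_factory=list)
    successful_run_ids: list[str] = field(default_factory=list)
    # Task wording, observed tool/API calls, transcript/doc snippets.
    evidence: list[str] = field(default_factory=list)


def classify(candidate: Candidate) -> str:
    """Return "rule" only when both sides of the contrast and evidence are present."""
    has_contrast = bool(candidate.failed_run_ids) and bool(candidate.successful_run_ids)
    if has_contrast and candidate.evidence:
        return "rule"
    # Anything weaker is kept as a hypothesis instead of being promoted.
    return "hypothesis"
```

A candidate with only failed runs, or with both sides but no recorded evidence, stays a hypothesis under this gate.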
Changes
- `SKILL.md`: a 3-step workflow that builds an evidence pack over normalized trajectories (grouped by `task_id`, success/failure judged or stored), inspects candidate rules, and promotes only strong ones (one failed + one successful run in the same group, a task-action tool/API or workflow difference, source IDs for both sides). Weak candidates stay hypotheses, not rules.
- `compare_outcomes.py`: groups traces, contrasts success/failed runs, extracts tool/API calls + transcript evidence, optionally LLM-judges outcomes (`--judge-outcomes never|missing|always`), and emits an analysis JSON + Markdown (and optional render-ready guideline entities). Stdlib-only, no repo-internal deps.

Scope
This ports only the compare-outcomes capability + its ingest wiring from the `appworld-agent-wiki-experiment` branch. That branch also bundled unrelated changes (a consolidate "mine step" rewrite, a synthesize faithfulness rule, a `run_agent_wiki_skill_pass.py` helper); those are intentionally not included here, to keep this PR focused.

Verification
- `ruff check` + `ruff format --check`: clean.
- `mypy .`: clean (the script carries the `# mypy: ignore-errors` header used by every sibling exploration/reference script).
- `detect-secrets`: passes.
- … `task_id`, contrasts the two runs, and emits the analysis without error.

No changes outside `explorations/agent-wiki/`.
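As a rough illustration of the grouping and the `--judge-outcomes never|missing|always` flag described under Changes, the sketch below shows one plausible reading. The semantics assigned to each mode, the record keys (`task_id`, `run_id`, `outcome`), and both function names are assumptions made for this example, not the script's actual behavior.

```python
from __future__ import annotations

from collections import defaultdict

JUDGE_MODES = ("never", "missing", "always")


def needs_judging(mode: str, stored_outcome: str | None) -> bool:
    """Decide whether a run should be sent to the LLM judge (assumed semantics)."""
    if mode not in JUDGE_MODES:
        raise ValueError(f"unknown judge mode: {mode!r}")
    if mode == "always":
        return True
    if mode == "missing":
        # Only judge runs that carry no stored outcome label.
        return stored_outcome is None
    return False  # "never": rely on stored labels only


def contrast_groups(runs: list[dict]) -> dict[str, dict[str, list[str]]]:
    """Group runs by task_id and keep groups with both a success and a failure."""
    groups: dict[str, dict[str, list[str]]] = defaultdict(
        lambda: {"success": [], "failure": []}
    )
    for run in runs:
        outcome = run.get("outcome")
        if outcome in ("success", "failure"):
            groups[run["task_id"]][outcome].append(run["run_id"])
    return {tid: g for tid, g in groups.items() if g["success"] and g["failure"]}
```

Under this reading, a corpus in which no `task_id` has both a successful and a failed run yields an empty result, which corresponds to the case where the ingest step skips the pass.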