The evaluation pipeline measures agent performance and feeds learnings back into prompts, memory, and configuration. In the MVP, evaluation is manual: reviewers inspect PRs and logs. Automated evaluation is added incrementally in later iterations.
- Use this doc for: understanding what gets evaluated, the tiered validation pipeline, memory effectiveness metrics, and the feedback loop.
- Related docs: MEMORY.md for how evaluation insights are stored, OBSERVABILITY.md for telemetry data sources, ORCHESTRATOR.md for prompt versioning in the data model.
The evaluation pipeline categorizes task outcomes to identify systemic issues and improvement opportunities:
| Category | Description |
|---|---|
| Reasoning errors | Agent misunderstood the task or made incorrect assumptions |
| Instruction non-compliance | Task spec was clear but agent did not follow it (skipped tests, wrong scope) |
| Missing verification | Agent did not run tests, linters, or document how to verify the change |
| Timeout | Agent hit the 8-hour or idle timeout before completing; partial work may be on the branch |
| Environment failure | GitHub API errors, clone failures, build failures the agent could not recover from |
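
To spot systemic issues, categorized outcomes can be tallied per repo. The following is a minimal sketch, not the platform's implementation: the enum values mirror the table above, while `FailureCategory`, `tally_failures`, and the `(repo, category)` input shape are hypothetical names chosen for illustration.

```python
from collections import Counter
from enum import Enum


class FailureCategory(Enum):
    # Values mirror the category table; the identifiers are assumptions.
    REASONING_ERROR = "reasoning_error"
    INSTRUCTION_NON_COMPLIANCE = "instruction_non_compliance"
    MISSING_VERIFICATION = "missing_verification"
    TIMEOUT = "timeout"
    ENVIRONMENT_FAILURE = "environment_failure"


def tally_failures(outcomes):
    """Count failure categories per repo.

    `outcomes` is an iterable of (repo, FailureCategory) pairs.
    Returns a dict mapping repo -> Counter of categories.
    """
    tallies = {}
    for repo, category in outcomes:
        tallies.setdefault(repo, Counter())[category] += 1
    return tallies
```

A repo whose tally is dominated by one category (for example, timeouts) is a candidate for a targeted prompt or configuration change.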
Evaluation consumes the same data that observability and code attribution capture:
| Source | What it provides |
|---|---|
| Task outcomes | Status, error message, PR URL, branch state |
| TaskEvents | Audit log: state transitions, step events, guardrail events |
| Agent logs and traces | CloudWatch logs, X-Ray spans, tool calls, reasoning steps |
| Code artifacts | PR description, commits, diff, repo/branch/issue links |
| PR outcome signals | Merged vs. closed-without-merge (via GitHub webhooks). Used as a positive or negative signal on task episodes. |
| Review feedback | PR review comments captured via the review feedback memory loop (see MEMORY.md) |
At task end, the platform prompts the agent: "What information, context, or instructions were missing that would have helped you complete this task more effectively?" The response is stored in long-term memory with insight_type: "agent_self_feedback" and retrieved during context hydration for future tasks on the same repo.
Recurring themes (e.g. "I needed to know this repo uses a custom linter") are surfaced in evaluation dashboards and used to update per-repo system prompts or onboarding artifacts. The cost is a single additional turn per task.
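
The self-feedback flow described above could be sketched as follows. Only the prompt text and the `insight_type` value come from this document; the `MemoryRecord` shape, its field names, and both helper functions are assumptions made for illustration (see MEMORY.md for the actual storage model).

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Prompt text as quoted in this document.
SELF_FEEDBACK_PROMPT = (
    "What information, context, or instructions were missing that would have "
    "helped you complete this task more effectively?"
)


@dataclass
class MemoryRecord:
    # Hypothetical record shape; only insight_type is named in the text.
    repo: str
    task_id: str
    insight_type: str
    content: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def build_self_feedback_record(repo, task_id, agent_response):
    """Wrap the agent's answer to SELF_FEEDBACK_PROMPT as a memory record."""
    return MemoryRecord(
        repo=repo,
        task_id=task_id,
        insight_type="agent_self_feedback",
        content=agent_response.strip(),
    )


def feedback_for_repo(records, repo):
    """Select self-feedback records for one repo, as context hydration might."""
    return [
        r for r in records
        if r.repo == repo and r.insight_type == "agent_self_feedback"
    ]
```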
System prompts are treated as versioned, testable artifacts. Each task records the prompt_version (SHA-256 hash of deterministic prompt parts) in the task record, enabling correlation: "did merge rates improve after prompt version X?"
- A/B comparison (planned) - Run the same task type with two prompt variants and compare outcomes (merge rate, failure rate, token usage). Requires variant assignment, outcome tracking per variant, and a comparison dashboard.
- Change tracking - Prompt diffs between versions are reviewable. Versions stored in a versioned store for audit and rollback.
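
One way to derive a `prompt_version` is shown below. This is a sketch under assumptions: the document states only that the version is a SHA-256 hash of the deterministic prompt parts, so the part ordering, the UTF-8 encoding, the length-prefixing, and the function name `compute_prompt_version` are illustrative choices.

```python
import hashlib


def compute_prompt_version(deterministic_parts):
    """Return a SHA-256 hex digest over the ordered prompt parts.

    `deterministic_parts` is a list of strings (for example, the base system
    prompt and per-repo instructions). Dynamic content such as retrieved
    memory should be excluded so the hash is stable across tasks.
    """
    hasher = hashlib.sha256()
    for part in deterministic_parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ["ab", "c"] and ["a", "bc"] hash differently.
        hasher.update(len(data).to_bytes(8, "big"))
        hasher.update(data)
    return hasher.hexdigest()
```

Because the digest changes whenever any deterministic part changes, task records grouped by `prompt_version` can be compared directly on merge rate or failure rate.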
The primary measure of memory's value: does the agent produce better PRs over time?
| Metric | How to measure | Improvement signal |
|---|---|---|
| First-review merge rate | % of PRs merged without revision requests | Increases over time |
| Revision cycles | Average review rounds before merge | Decreases over time |
| CI pass rate on first push | % of PRs where CI passes on initial push | Increases as agent learns build quirks |
| Review comment density | Reviewer comments per PR | Decreases over time |
| Repeated mistakes | Same reviewer feedback across multiple PRs | Drops to zero after feedback loop captures the rule |
| Time to PR | Duration from task submission to PR creation | Decreases as agent reuses past approaches |
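
Several of these metrics reduce to simple ratios over PR records. The sketch below assumes a hypothetical `PullRequestStats` shape and treats all agent-created PRs as the denominator for rates; the actual data model and denominators may differ.

```python
from dataclasses import dataclass


@dataclass
class PullRequestStats:
    # Hypothetical per-PR record; field names are assumptions.
    merged: bool
    revision_rounds: int        # review rounds that requested changes
    ci_passed_first_push: bool
    reviewer_comments: int


def memory_metrics(prs):
    """Compute a subset of the memory effectiveness metrics.

    Returns None when there are no PRs to measure.
    """
    total = len(prs)
    if total == 0:
        return None
    merged = [p for p in prs if p.merged]
    first_review_merges = sum(1 for p in merged if p.revision_rounds == 0)
    return {
        "first_review_merge_rate": first_review_merges / total,
        "avg_revision_cycles": (
            sum(p.revision_rounds for p in merged) / len(merged)
            if merged else None
        ),
        "ci_first_push_pass_rate": sum(p.ci_passed_first_push for p in prs) / total,
        "review_comment_density": sum(p.reviewer_comments for p in prs) / total,
    }
```

Computing the same dictionary over successive time windows gives the trend that the "Improvement signal" column describes.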
Repeated mistakes is the most telling metric. If a reviewer says "don't use the TypeScript any type" on PR #10 and the agent makes the same mistake on PR #15, the review feedback memory has failed. Detecting repeats requires embedding-based similarity between review comments, because reviewers rarely phrase the same feedback identically and simple string matching misses paraphrases. The review feedback extraction prompt normalizes comments into canonical rule forms, and new comments are compared against stored rules via semantic search.
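
The comparison step can be sketched as a nearest-rule lookup by cosine similarity. Everything here is an assumption for illustration: `toy_embed` is a bag-of-words stand-in that exists only to keep the example runnable (it does not capture paraphrases the way a real embedding model would), and the `0.6` threshold is arbitrary.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def find_repeated_rule(comment, stored_rules, embed=toy_embed, threshold=0.6):
    """Return the best-matching stored rule at or above threshold, else None."""
    query = embed(comment)
    best_rule, best_score = None, 0.0
    for rule in stored_rules:
        score = cosine_similarity(query, embed(rule))
        if score > best_score:
            best_rule, best_score = rule, score
    return best_rule if best_score >= threshold else None
```

In practice, `embed` would be replaced with a call to the embedding model behind the semantic search, and a match on a new PR would count toward the repeated mistakes metric.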
The platform validates agent-created content through three sequential tiers before PR finalization. Each tier targets a different class of defect. Tiers run as post-agent steps in the blueprint execution framework.
flowchart LR
T1["Tier 1<br/>Tool validation<br/>(build, test, lint)"] --> T2["Tier 2<br/>Code quality<br/>(DRY, SOLID, complexity)"]
T2 --> T3["Tier 3<br/>Risk analysis<br/>(blast radius, API changes)"]
T3 --> PR["PR created<br/>+ validation report<br/>+ risk label"]
Tier 1 (tool validation) collects deterministic, binary pass/fail signals from the repo's own tooling: test suites, linters, type checkers, SAST scanners, and build verification. Validation commands are discovered during onboarding or configured in the blueprint's custom_steps.
On failure, tool output is fed back to the agent for a fix cycle (up to 2 retries). If the failure is still unresolved, the PR is created with the failures documented in the validation report.
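
The retry behavior can be sketched as a bounded loop. The limit of 2 retries comes from the text above; `run_tier1`, the two callbacks, and the shape of the returned report are hypothetical.

```python
MAX_FIX_RETRIES = 2  # "up to 2 retries" per the text above


def run_tier1(run_validation, request_fix, max_retries=MAX_FIX_RETRIES):
    """Run validation and feed failures back to the agent for fix cycles.

    run_validation() -> (passed: bool, output: str)
    request_fix(output) asks the agent to address the tool output.
    Returns a small report dict; the PR is created either way, and an
    unresolved failure is recorded in the report rather than raised.
    """
    outputs = []
    passed, output = run_validation()
    outputs.append(output)
    retries = 0
    while not passed and retries < max_retries:
        request_fix(output)
        retries += 1
        passed, output = run_validation()
        outputs.append(output)
    return {"passed": passed, "retries_used": retries, "outputs": outputs}
```

With the default limit, validation runs at most three times: the initial run plus one run after each fix cycle.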
Tier 2 (code quality) assesses structural and design quality beyond what linters catch, using a combination of static analysis tools and LLM-based review:
| Dimension | Example finding |
|---|---|
| DRY violations | "Lines 45-62 in auth.ts duplicate logic in session.ts:30-47" |
| SOLID violations | "TaskHandler handles both validation and persistence - consider splitting" |
| Pattern adherence | "Existing services use repository pattern, but UserService queries DynamoDB directly" |
| Complexity | "processTask has cyclomatic complexity 18 (threshold: 10)" |
| Naming conventions | "get_data uses snake_case but codebase convention is camelCase" |
| Repo-specific rules | "TypeScript any type used - repo policy requires explicit types" |
Each finding has a severity level: error (blocking; triggers a fix cycle) or warning/info (advisory; included in the PR report). The blocking severity threshold is configurable per repo.
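
A configurable blocking threshold might be applied as shown below. The three severity names come from the text above; their ordering as integers, the finding dict shape, and `split_findings` are assumptions.

```python
# Assumed ordering of the severity levels named above.
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}


def split_findings(findings, blocking_threshold="error"):
    """Partition findings into (blocking, advisory) lists.

    Each finding is a dict with at least a "severity" key. Findings at or
    above `blocking_threshold` trigger a fix cycle; the rest are reported.
    """
    cutoff = SEVERITY_ORDER[blocking_threshold]
    blocking = [f for f in findings if SEVERITY_ORDER[f["severity"]] >= cutoff]
    advisory = [f for f in findings if SEVERITY_ORDER[f["severity"]] < cutoff]
    return blocking, advisory
```

A stricter repo could pass `blocking_threshold="warning"` so that warnings also trigger a fix cycle.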
Tier 3 (risk analysis) assesses the scope, impact, and regression risk of the agent's changes:
| Dimension | Method |
|---|---|
| Change surface area | Files, lines added/removed, modules touched |
| Dependency graph impact | Import/export analysis, downstream consumers of changed code |
| Public API changes | Exported functions, types, interfaces, endpoints, schemas |
| Shared infrastructure | Changes to shared utilities, base classes, CI/CD, config |
| Test coverage gaps | Cross-reference changes with existing test coverage |
| New external dependencies | Additions to package manifests (license, maintenance, security metadata) |
Every agent-created PR receives a computed risk level:
| Risk level | Criteria | PR behavior |
|---|---|---|
| Low | Small change, no API changes, high test coverage | Normal PR with risk:low label |
| Medium | Moderate surface, some dependents, partial coverage | risk:medium label + risk summary |
| High | Large surface, API changes, shared infra, low coverage | risk:high label + blast radius report |
| Critical | Breaking API changes, schema modifications, CI/CD changes | risk:critical label + optional hold for human approval |
Risk level is stored in the task record and emitted as a TaskEvent, enabling trending by repo, user, and prompt version.
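
The risk table could be turned into a rule cascade that checks the most severe criteria first. This is only a sketch: the document gives qualitative criteria, so every numeric threshold below (500 and 100 changed lines, 0.4 and 0.8 coverage) is an invented placeholder, and `ChangeSignals` is a hypothetical input shape.

```python
from dataclasses import dataclass


@dataclass
class ChangeSignals:
    # Hypothetical summary of Tier 3 analysis output.
    lines_changed: int
    test_coverage: float              # fraction of changed code covered, 0.0-1.0
    public_api_changed: bool = False
    breaking_api_change: bool = False
    schema_modified: bool = False
    ci_cd_changed: bool = False
    shared_infra_touched: bool = False
    downstream_dependents: int = 0


def compute_risk_level(s):
    """Map change signals to a risk level; thresholds are placeholders."""
    # Critical: breaking API changes, schema modifications, CI/CD changes.
    if s.breaking_api_change or s.schema_modified or s.ci_cd_changed:
        return "critical"
    # High: large surface, API changes, shared infra, or low coverage.
    if (s.lines_changed > 500 or s.public_api_changed
            or s.shared_infra_touched or s.test_coverage < 0.4):
        return "high"
    # Medium: moderate surface, some dependents, or partial coverage.
    if s.lines_changed > 100 or s.downstream_dependents > 0 or s.test_coverage < 0.8:
        return "medium"
    return "low"


def risk_label(level):
    """Format the PR label, e.g. risk:low."""
    return f"risk:{level}"
```

Ordering the checks from critical down to low means a small change that touches a schema is still labeled critical, which matches the intent of the table.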
The combined output of all three tiers is posted to the PR as a structured validation report (comment or GitHub Check Run).
| Phase | What it adds |
|---|---|
| Current | No automated evaluation. Manual inspection of PRs and logs. |
| Next | Agent self-feedback. Prompt versioning (hash stored with task records). Tiered validation pipeline (Tiers 1-3). PR risk level and validation reports. |
| Later | Review feedback memory loop. PR outcome tracking. Failure categorization. Memory effectiveness metrics. |
| Future | LLM-based trace analysis. A/B prompt comparison. Learned rules from memory in Tier 2. Historical risk correlation in Tier 3. Risk trending dashboards. |