Evaluation

The evaluation pipeline measures agent performance and feeds learnings back into prompts, memory, and configuration. In the MVP, evaluation is manual (inspecting PRs and logs). Automated evaluation is built incrementally across iterations.

  • Use this doc for: understanding what gets evaluated, the tiered validation pipeline, memory effectiveness metrics, and the feedback loop.
  • Related docs: MEMORY.md for how evaluation insights are stored, OBSERVABILITY.md for telemetry data sources, ORCHESTRATOR.md for prompt versioning in the data model.

What to evaluate

The evaluation pipeline categorizes task outcomes to identify systemic issues and improvement opportunities:

| Category | Description |
|---|---|
| Reasoning errors | Agent misunderstood the task or made incorrect assumptions |
| Instruction non-compliance | Task spec was clear but agent did not follow it (skipped tests, wrong scope) |
| Missing verification | Agent did not run tests, linters, or document how to verify the change |
| Timeout | Hit 8-hour or idle timeout before completing; partial work may be on the branch |
| Environment failure | GitHub API errors, clone failures, build failures the agent could not recover from |
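
As an illustration of how categorized outcomes could surface systemic issues, the sketch below tallies categories per repo and flags those that recur. The enum values, the `tally_by_repo` and `systemic_issues` helpers, and the `min_count` threshold are hypothetical names chosen for this example, not part of the platform.

```python
from collections import Counter
from enum import Enum


class FailureCategory(Enum):
    # One member per category in the table above; string values are assumed.
    REASONING_ERROR = "reasoning_error"
    INSTRUCTION_NON_COMPLIANCE = "instruction_non_compliance"
    MISSING_VERIFICATION = "missing_verification"
    TIMEOUT = "timeout"
    ENVIRONMENT_FAILURE = "environment_failure"


def tally_by_repo(labelled_tasks):
    """Count categories per repo from an iterable of (repo, FailureCategory)."""
    counts = {}
    for repo, category in labelled_tasks:
        counts.setdefault(repo, Counter())[category] += 1
    return counts


def systemic_issues(counts, min_count=3):
    """Return sorted (repo, category, n) tuples where n >= min_count."""
    return sorted(
        (repo, category.value, n)
        for repo, per_repo in counts.items()
        for category, n in per_repo.items()
        if n >= min_count
    )
```

A category that keeps recurring on one repo (for example, repeated timeouts) points at a repo-level fix such as a prompt or onboarding change rather than a one-off task problem.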

Data sources

Evaluation consumes the same data that observability and code attribution capture:

| Source | What it provides |
|---|---|
| Task outcomes | Status, error message, PR URL, branch state |
| TaskEvents | Audit log: state transitions, step events, guardrail events |
| Agent logs and traces | CloudWatch logs, X-Ray spans, tool calls, reasoning steps |
| Code artifacts | PR description, commits, diff, repo/branch/issue links |
| PR outcome signals | Merged vs. closed-without-merge (via GitHub webhooks). Positive/negative signal on task episodes. |
| Review feedback | PR review comments captured via the review feedback memory loop (see MEMORY.md) |
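
The PR outcome signal can be derived from GitHub's `pull_request` webhook event, whose payload carries an `action` field and a `pull_request.merged` boolean. The sketch below is one possible mapping; the function name and the shape of the returned dict are assumptions made for this example.

```python
def pr_outcome_signal(payload):
    """Map a GitHub `pull_request` webhook payload to an outcome signal.

    Returns None for events that do not close the PR (opened, synchronize, ...).
    The returned keys are illustrative, not the platform's schema.
    """
    if payload.get("action") != "closed":
        return None
    pr = payload.get("pull_request", {})
    merged = bool(pr.get("merged"))
    return {
        "pr_url": pr.get("html_url"),
        "outcome": "merged" if merged else "closed_without_merge",
        "signal": "positive" if merged else "negative",
    }
```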

Agent self-feedback

At task end, the platform prompts the agent: "What information, context, or instructions were missing that would have helped you complete this task more effectively?" The response is stored in long-term memory with insight_type: "agent_self_feedback" and retrieved during context hydration for future tasks on the same repo.

Recurring themes (e.g. "I needed to know this repo uses a custom linter") are surfaced in evaluation dashboards and used to update per-repo system prompts or onboarding artifacts. The cost is a single additional turn per task.
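
A minimal sketch of the store-and-retrieve flow described above, using an in-memory list as a stand-in for long-term memory. Only `insight_type: "agent_self_feedback"` comes from this document; the `MemoryRecord` fields and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

SELF_FEEDBACK_PROMPT = (
    "What information, context, or instructions were missing that would "
    "have helped you complete this task more effectively?"
)


@dataclass
class MemoryRecord:
    # Field names other than insight_type are assumptions for this sketch.
    repo: str
    task_id: str
    insight_type: str
    text: str
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_self_feedback(store, repo, task_id, response):
    """Append the agent's answer to the store; skip empty answers."""
    text = response.strip()
    if not text:
        return None
    record = MemoryRecord(repo, task_id, "agent_self_feedback", text)
    store.append(record)
    return record


def self_feedback_for_repo(store, repo):
    """Return earlier self-feedback for a repo, for use in context hydration."""
    return [
        r.text
        for r in store
        if r.repo == repo and r.insight_type == "agent_self_feedback"
    ]
```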

Prompt versioning

System prompts are treated as versioned, testable artifacts. Each task records the prompt_version (SHA-256 hash of deterministic prompt parts) in the task record, enabling correlation: "did merge rates improve after prompt version X?"

  • A/B comparison (planned) - Run the same task type with two prompt variants and compare outcomes (merge rate, failure rate, token usage). Requires variant assignment, outcome tracking per variant, and a comparison dashboard.
  • Change tracking - Prompt diffs between versions are reviewable. Versions are kept in a versioned store for audit and rollback.
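
One way to compute a `prompt_version` as described above is to serialize the deterministic prompt parts canonically and hash the result, so that the same parts always produce the same version. The part names, the JSON canonicalization, and the `merge_rate_by_version` helper are assumptions for illustration; the document only specifies a SHA-256 hash of the deterministic parts.

```python
import hashlib
import json


def prompt_version(deterministic_parts):
    """SHA-256 hex digest over deterministic prompt parts (dict: name -> text).

    Keys are sorted so that dict ordering does not change the hash.
    """
    canonical = json.dumps(
        deterministic_parts, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def merge_rate_by_version(tasks):
    """Merge rate per prompt_version from dicts with 'prompt_version' and 'merged'."""
    totals = {}
    for task in tasks:
        merged, total = totals.get(task["prompt_version"], (0, 0))
        totals[task["prompt_version"]] = (merged + int(task["merged"]), total + 1)
    return {version: m / n for version, (m, n) in totals.items()}
```

Per-task dynamic content (issue text, retrieved memory) would be left out of the hash; otherwise every task would get a distinct version and the correlation would be meaningless.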

Memory effectiveness metrics

The primary measure of memory's value: does the agent produce better PRs over time?

| Metric | How to measure | Improvement signal |
|---|---|---|
| First-review merge rate | % of PRs merged without revision requests | Increases over time |
| Revision cycles | Average review rounds before merge | Decreases over time |
| CI pass rate on first push | % of PRs where CI passes on initial push | Increases as agent learns build quirks |
| Review comment density | Reviewer comments per PR | Decreases over time |
| Repeated mistakes | Same reviewer feedback across multiple PRs | Drops to zero after feedback loop captures the rule |
| Time to PR | Duration from task submission to PR creation | Decreases as agent reuses past approaches |
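
To make the metric definitions concrete, the sketch below computes four of them from a list of PR records. The record fields (`merged`, `revision_requests`, `review_rounds`, `ci_first_push_passed`, `review_comments`) are hypothetical, and rates are computed over all agent-created PRs, which is one possible reading of the table.

```python
from statistics import mean


def memory_metrics(prs):
    """Compute per-window metrics from PR records (dicts with assumed field names)."""
    if not prs:
        return {}
    merged = [p for p in prs if p["merged"]]
    return {
        # Share of all PRs merged with zero revision requests.
        "first_review_merge_rate": sum(
            1 for p in merged if p["revision_requests"] == 0
        ) / len(prs),
        # Average review rounds, over merged PRs only.
        "revision_cycles": mean(p["review_rounds"] for p in merged) if merged else None,
        "ci_first_push_pass_rate": sum(
            1 for p in prs if p["ci_first_push_passed"]
        ) / len(prs),
        "review_comment_density": mean(p["review_comments"] for p in prs),
    }
```

Computing these per time window (for example, per week per repo) would give the trend that the "Improvement signal" column refers to.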

Repeated mistakes is the most telling metric. If a reviewer says "don't use any types" on PR #10 and the agent uses an any type again on PR #15, the review feedback memory has failed. Detecting this requires embedding-based similarity between review comments; simple string matching is insufficient because reviewers rarely phrase the same rule the same way twice. The review feedback extraction prompt normalizes comments into canonical rule forms, and new comments are compared against stored rules via semantic search.
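
The comparison step can be sketched as a cosine-similarity lookup against stored rule vectors. This example takes pre-computed vectors as input and does not call any embedding model; the `match_stored_rule` name and the 0.85 threshold are assumptions, and a real deployment would likely use a vector index rather than a linear scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_stored_rule(comment_vec, stored_rules, threshold=0.85):
    """Return (rule_text, score) for the closest stored rule, or None.

    stored_rules is a list of (rule_text, vector). A match means the new
    review comment restates a rule the memory already holds, i.e. a
    repeated mistake.
    """
    best = max(
        ((rule, cosine(comment_vec, vec)) for rule, vec in stored_rules),
        key=lambda pair: pair[1],
        default=None,
    )
    if best is not None and best[1] >= threshold:
        return best
    return None
```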

Tiered validation pipeline

The platform validates agent-created content through three sequential tiers before PR finalization. Each tier targets a different class of defect. Tiers run as post-agent steps in the blueprint execution framework.

flowchart LR
    T1["Tier 1<br/>Tool validation<br/>(build, test, lint)"] --> T2["Tier 2<br/>Code quality<br/>(DRY, SOLID, complexity)"]
    T2 --> T3["Tier 3<br/>Risk analysis<br/>(blast radius, API changes)"]
    T3 --> PR["PR created<br/>+ validation report<br/>+ risk label"]

Tier 1 - Tool validation

Deterministic, binary pass/fail signals from the repo's own tooling: test suites, linters, type checkers, SAST scanners, and build verification. Validation commands are discovered during onboarding or configured in the blueprint's custom_steps.

On failure: Tool output is fed back to the agent for a fix cycle (up to 2 retries). If unresolved, the PR is created with failures documented in the validation report.
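
The fix cycle above can be sketched as a bounded loop: run every validation command, hand failures to the agent, and stop after two retries. `tier1_validate`, `run_command`, and the `request_fix` callback are hypothetical names; how tool output actually reaches the agent is not specified in this document.

```python
import subprocess

MAX_FIX_RETRIES = 2  # "up to 2 retries" from the text above


def run_command(cmd):
    """Run one validation command; return (passed, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def tier1_validate(commands, request_fix, run=run_command, max_retries=MAX_FIX_RETRIES):
    """Run all commands; on failure ask the agent for a fix and re-run."""
    failures = {}
    for attempt in range(max_retries + 1):
        failures = {}
        for cmd in commands:
            passed, output = run(cmd)
            if not passed:
                failures[cmd] = output
        if not failures:
            return {"passed": True, "fix_cycles": attempt, "failures": {}}
        if attempt < max_retries:
            request_fix(failures)  # feed tool output back to the agent
    # Unresolved: caller documents these failures in the validation report.
    return {"passed": False, "fix_cycles": max_retries, "failures": failures}
```

Injecting `run` keeps the loop testable without invoking real tooling; in practice `commands` would come from onboarding discovery or the blueprint's custom_steps.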

Tier 2 - Code quality analysis

Structural and design quality beyond what linters catch, using a combination of static analysis tools and LLM-based review:

| Dimension | Example finding |
|---|---|
| DRY violations | "Lines 45-62 in auth.ts duplicate logic in session.ts:30-47" |
| SOLID violations | "TaskHandler handles both validation and persistence - consider splitting" |
| Pattern adherence | "Existing services use repository pattern, but UserService queries DynamoDB directly" |
| Complexity | "processTask has cyclomatic complexity 18 (threshold: 10)" |
| Naming conventions | "get_data uses snake_case but codebase convention is camelCase" |
| Repo-specific rules | "TypeScript any type used - repo policy requires explicit types" |

Findings have severity levels: error (blocking, triggers fix cycle), warning/info (advisory, included in PR report). The blocking severity threshold is configurable per repo.
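
A configurable blocking threshold could work as in the sketch below: findings at or above the threshold trigger a fix cycle, and the rest go into the PR report. The severity ordering (info < warning < error), the finding dict shape, and the function name are assumptions for this example.

```python
SEVERITY_ORDER = {"info": 0, "warning": 1, "error": 2}


def partition_findings(findings, blocking_threshold="error"):
    """Split findings into (blocking, advisory) by a per-repo severity threshold."""
    limit = SEVERITY_ORDER[blocking_threshold]
    blocking = [f for f in findings if SEVERITY_ORDER[f["severity"]] >= limit]
    advisory = [f for f in findings if SEVERITY_ORDER[f["severity"]] < limit]
    return blocking, advisory
```

With the default threshold only `error` findings block; a stricter repo could set the threshold to `warning` so that warnings also trigger a fix cycle.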

Tier 3 - Risk and blast radius analysis

Scope, impact, and regression risk of the agent's changes:

| Dimension | Method |
|---|---|
| Change surface area | Files, lines added/removed, modules touched |
| Dependency graph impact | Import/export analysis, downstream consumers of changed code |
| Public API changes | Exported functions, types, interfaces, endpoints, schemas |
| Shared infrastructure | Changes to shared utilities, base classes, CI/CD, config |
| Test coverage gaps | Cross-reference changes with existing test coverage |
| New external dependencies | Additions to package manifests (license, maintenance, security metadata) |
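
As an example of the first dimension, change surface area can be derived from the output of `git diff --numstat`, which prints one tab-separated `added`, `deleted`, `path` line per file and uses `-` for binary files. Treating the top-level directory as the "module" is an assumption made for this sketch, as is the function name.

```python
def change_surface(numstat_text):
    """Summarize `git diff --numstat` output into a change surface record."""
    files = added = removed = 0
    modules = set()
    for line in numstat_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        a, d, path = parts
        files += 1
        if a != "-":  # binary files report "-" instead of a count
            added += int(a)
        if d != "-":
            removed += int(d)
        # Assumed module boundary: first path segment, "." for root files.
        modules.add(path.split("/", 1)[0] if "/" in path else ".")
    return {
        "files": files,
        "lines_added": added,
        "lines_removed": removed,
        "modules": sorted(modules),
    }
```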

PR risk level

Every agent-created PR receives a computed risk level:

| Risk level | Criteria | PR behavior |
|---|---|---|
| Low | Small change, no API changes, high test coverage | Normal PR with risk:low label |
| Medium | Moderate surface, some dependents, partial coverage | risk:medium label + risk summary |
| High | Large surface, API changes, shared infra, low coverage | risk:high label + blast radius report |
| Critical | Breaking API changes, schema modifications, CI/CD changes | risk:critical label + optional hold for human approval |

Risk level is stored in the task record and emitted as a TaskEvent, enabling trending by repo, user, and prompt version.
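
The criteria in the risk table are qualitative, so any implementation has to pick thresholds. The sketch below is one possible mapping in which a single matching criterion is enough to raise the level; every threshold (500 and 100 changed lines, 0.4 and 0.8 coverage) and every field name is a placeholder chosen for this example, not a platform value.

```python
def risk_level(change):
    """Map Tier 3 findings (dict with assumed keys) to a risk label."""
    if (
        change.get("breaking_api_change")
        or change.get("schema_modified")
        or change.get("ci_cd_changed")
    ):
        level = "critical"
    elif (
        change.get("lines_changed", 0) > 500
        or change.get("public_api_changed")
        or change.get("shared_infra_touched")
        or change.get("test_coverage", 1.0) < 0.4
    ):
        level = "high"
    elif (
        change.get("lines_changed", 0) > 100
        or change.get("dependents", 0) > 0
        or change.get("test_coverage", 1.0) < 0.8
    ):
        level = "medium"
    else:
        level = "low"
    return f"risk:{level}"
```

Checking from most to least severe means a change that matches several rows receives the highest applicable label.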

The combined output of all three tiers is posted to the PR as a structured validation report (comment or GitHub Check Run).

Phasing

| Phase | What it adds |
|---|---|
| Current | No automated evaluation. Manual inspection of PRs and logs. |
| Next | Agent self-feedback. Prompt versioning (hash stored with task records). Tiered validation pipeline (Tiers 1-3). PR risk level and validation reports. |
| Later | Review feedback memory loop. PR outcome tracking. Failure categorization. Memory effectiveness metrics. |
| Future | LLM-based trace analysis. A/B prompt comparison. Learned rules from memory in Tier 2. Historical risk correlation in Tier 3. Risk trending dashboards. |