| 1 | +# Skill: Eval Gating |
| 2 | + |
| 3 | +## What It Is |
| 4 | + |
| 5 | +Eval gating uses evaluation results to make pass/fail decisions in CI/CD pipelines. It checks metrics against thresholds and baselines, blocking deployments when quality standards aren't met. |
| 6 | + |
| 7 | +## Why It Matters |
| 8 | + |
| 9 | +- **Quality Gates** — Prevent regressions from reaching production |
| 10 | +- **Automated Decisions** — Remove manual quality review bottlenecks |
| 11 | +- **Fast Feedback** — Catch issues before merge |
| 12 | +- **Consistent Standards** — Apply the same criteria to every change |
| 13 | + |
| 14 | +## How to Use It |
| 15 | + |
| 16 | +### Run Gate Evaluation |
| 17 | + |
| 18 | +```bash |
| 19 | +npx agent-eval-harness gate \ |
| 20 | + --results results/eval-123.json \ |
| 21 | + --gates gates.yaml \ |
| 22 | + --baseline results/baseline.json |
| 23 | +``` |
| 24 | + |
| 25 | +### Gate Configuration |
| 26 | + |
| 27 | +```yaml |
| 28 | +# gates.yaml |
| 29 | +gates: |
| 30 | + - name: overall-quality |
| 31 | + type: threshold |
| 32 | + metric: overall_score |
| 33 | + operator: ">=" |
| 34 | + threshold: 0.80 |
| 35 | + |
| 36 | + - name: cost-per-task |
| 37 | + type: threshold |
| 38 | + metric: avg_cost_per_task |
| 39 | + operator: "<=" |
| 40 | + threshold: 0.05 |
| 41 | + |
| 42 | + - name: latency-p99 |
| 43 | + type: threshold |
| 44 | + metric: latency_p99_ms |
| 45 | + operator: "<=" |
| 46 | + threshold: 5000 |
| 47 | + |
| 48 | + - name: no-regression |
| 49 | + type: baseline-comparison |
| 50 | + baseline: results/baseline.json |
| 51 | + metric: overall_score |
| 52 | + allow_regression: false |
| 53 | + |
| 54 | + - name: tool-correctness |
| 55 | + type: threshold |
| 56 | + metric: tool_correctness_rate |
| 57 | + operator: ">=" |
| 58 | + threshold: 0.95 |
| 59 | + |
| 60 | + - name: faithfulness |
| 61 | + type: threshold |
| 62 | + metric: avg_faithfulness_score |
| 63 | + operator: ">=" |
| 64 | + threshold: 0.85 |
| 65 | +``` |
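To make the semantics of a `threshold` gate concrete, here is a minimal sketch of how one entry from `gates.yaml` might be checked against a metrics object. The `ThresholdGate` type, the `checkThreshold` function, and the extra `>` and `<` operators are hypothetical and are not taken from `agent-eval-harness`. Treating a missing metric as a failure is also an assumption.

```typescript
// Hypothetical types for illustration; not the agent-eval-harness API.
type Operator = ">=" | "<=" | ">" | "<";

interface ThresholdGate {
  name: string;
  metric: string;
  operator: Operator;
  threshold: number;
}

interface GateOutcome {
  gate: string;
  passed: boolean;
  actual: number | undefined;
}

// One comparison function per supported operator.
const comparators: Record<Operator, (actual: number, limit: number) => boolean> = {
  ">=": (a, b) => a >= b,
  "<=": (a, b) => a <= b,
  ">": (a, b) => a > b,
  "<": (a, b) => a < b,
};

function checkThreshold(gate: ThresholdGate, metrics: Record<string, number>): GateOutcome {
  const actual = metrics[gate.metric];
  // Assumption: a missing or NaN metric fails the gate instead of silently passing.
  if (actual === undefined || Number.isNaN(actual)) {
    return { gate: gate.name, passed: false, actual };
  }
  return { gate: gate.name, passed: comparators[gate.operator](actual, gate.threshold), actual };
}

// Example metrics; the values are made up for the demonstration.
const sampleMetrics: Record<string, number> = { overall_score: 0.83, avg_cost_per_task: 0.07 };

const outcomes = [
  checkThreshold({ name: "overall-quality", metric: "overall_score", operator: ">=", threshold: 0.8 }, sampleMetrics),
  checkThreshold({ name: "cost-per-task", metric: "avg_cost_per_task", operator: "<=", threshold: 0.05 }, sampleMetrics),
];

for (const o of outcomes) {
  console.log(`${o.gate}: ${o.passed ? "pass" : "fail"} (actual ${o.actual})`);
}
```

With these sample values the quality gate passes and the cost gate fails, because 0.07 exceeds the 0.05 limit.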
| 66 | + |
| 67 | +### Programmatic Gate Evaluation |
| 68 | + |
| 69 | +```typescript |
| 70 | +import { createGateEngine } from 'agent-eval-harness'; |
| 71 | + |
| 72 | +const engine = createGateEngine([ |
| 73 | + { name: 'quality', metric: 'overall_score', operator: '>=', threshold: 0.80 }, |
| 74 | + { name: 'cost', metric: 'avg_cost_per_task', operator: '<=', threshold: 0.05 }, |
| 75 | +]); |
| 76 | + |
| 77 | +const result = await engine.evaluate(aggregatedResults); |
| 78 | + |
| 79 | +if (result.passed) { |
| 80 | + console.log('✅ All gates passed'); |
| 81 | + process.exit(0); |
| 82 | +} else { |
| 83 | + console.log('❌ Gates failed:'); |
| 84 | + for (const failure of result.failures) { |
| 85 | + console.log(` - ${failure.gate}: ${failure.actual} (expected ${failure.expected})`); |
| 86 | + } |
| 87 | + process.exit(1); |
| 88 | +} |
| 89 | +``` |
| 90 | + |
| 91 | +### CI Integration |
| 92 | + |
| 93 | +```yaml |
| 94 | +# .github/workflows/ci.yml |
| 95 | +- name: Run evaluation |
| 96 | + run: npx agent-eval-harness eval trajectories/*.jsonl --output results/ |
| 97 | + |
| 98 | +- name: Check gates |
| 99 | + run: | |
| 100 | + npx agent-eval-harness gate \ |
| 101 | + --results results/eval.json \ |
| 102 | + --gates gates.yaml \ |
| 103 | + --baseline results/baseline.json |
| 104 | +``` |
| 105 | + |
| 106 | +## Key Metrics |
| 107 | + |
| 108 | +| Metric | Description | Typical Threshold | |
| 109 | +|--------|-------------|-------------------| |
| 110 | +| `overall_score` | Combined quality | >= 0.80 | |
| 111 | +| `avg_cost_per_task` | Average cost per task (USD) | <= $0.05 | |
| 112 | +| `latency_p99_ms` | 99th-percentile latency | <= 5000 ms | |
| 113 | +| `tool_correctness_rate` | Tool usage accuracy | >= 0.95 | |
| 114 | +| `avg_faithfulness_score` | Context adherence | >= 0.85 | |
| 115 | + |
| 116 | +## Gate Types |
| 117 | + |
| 118 | +1. **Threshold Gates** — Check metric against fixed threshold |
| 119 | +2. **Baseline Gates** — Compare against previous run |
| 120 | +3. **Statistical Gates** — Require statistical significance |
| 121 | +4. **Composite Gates** — Combine multiple metrics |
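The baseline and composite gate types listed above could be sketched roughly as follows. This is illustrative only: the `higherIsBetter` flag, the `tolerance` field, and the `all`/`any` combination modes are assumptions, not documented `agent-eval-harness` options.

```typescript
// Hypothetical sketch; names and fields are assumptions, not the harness API.
type Metrics = Record<string, number>;

interface BaselineGate {
  metric: string;
  higherIsBetter: boolean; // true for scores, false for cost or latency
  tolerance: number;       // absolute regression allowed; 0 means none
}

function checkBaseline(gate: BaselineGate, current: Metrics, baseline: Metrics): boolean {
  const cur = current[gate.metric];
  const base = baseline[gate.metric];
  // Assumption: a metric missing from either run fails the gate.
  if (cur === undefined || base === undefined) return false;
  // A positive value means the current run is worse than the baseline.
  const regression = gate.higherIsBetter ? base - cur : cur - base;
  return regression <= gate.tolerance;
}

// A composite gate combines the boolean results of other checks.
type Check = () => boolean;

function checkComposite(mode: "all" | "any", checks: Check[]): boolean {
  return mode === "all" ? checks.every((c) => c()) : checks.some((c) => c());
}

// Made-up values: the score dropped slightly while latency improved.
const baselineRun: Metrics = { overall_score: 0.82, latency_p99_ms: 4200 };
const currentRun: Metrics = { overall_score: 0.81, latency_p99_ms: 4100 };

const noRegression = checkComposite("all", [
  () => checkBaseline({ metric: "overall_score", higherIsBetter: true, tolerance: 0 }, currentRun, baselineRun),
  () => checkBaseline({ metric: "latency_p99_ms", higherIsBetter: false, tolerance: 0 }, currentRun, baselineRun),
]);

console.log(noRegression ? "no regression" : "regression detected");
```

Here the composite gate fails because `overall_score` fell below the baseline with zero tolerance, even though latency improved. Switching the mode to `any` would let the run pass on the latency improvement alone, which is rarely what a regression gate should do.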
| 122 | + |
| 123 | +## Best Practices |
| 124 | + |
| 125 | +1. **Start conservative** — Set thresholds you're confident about |
| 126 | +2. **Use multiple gates** — Cover quality, cost, and performance |
| 127 | +3. **Update baselines** — When quality improves, raise the bar |
| 128 | +4. **Monitor false positives** — Adjust thresholds if gates are too strict |
| 129 | +5. **Document rationale** — Explain why each gate exists |
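One way to apply the "update baselines" practice is to promote a run to the new baseline only when it passed every gate and improved on the stored score. The sketch below assumes a hypothetical `RunSummary` shape and a `nextBaseline` helper; neither comes from `agent-eval-harness`, and where the baseline file is stored is left out.

```typescript
// Hypothetical shape of a stored run summary; not the harness's result format.
interface RunSummary {
  runId: string;
  gatesPassed: boolean;
  metrics: Record<string, number>;
}

// Returns the run that should serve as the baseline going forward.
function nextBaseline(
  current: RunSummary,
  candidate: RunSummary,
  metric: string = "overall_score",
): RunSummary {
  const cur = current.metrics[metric];
  const cand = candidate.metrics[metric];
  // Never promote a run that failed its gates or lacks the metric.
  if (!candidate.gatesPassed || cand === undefined) return current;
  // Promote only on a strict improvement, so the bar never drops.
  if (cur === undefined || cand > cur) return candidate;
  return current;
}

// Made-up runs for the demonstration.
const stored: RunSummary = { runId: "run-41", gatesPassed: true, metrics: { overall_score: 0.82 } };
const latest: RunSummary = { runId: "run-42", gatesPassed: true, metrics: { overall_score: 0.86 } };

console.log(`baseline is now ${nextBaseline(stored, latest).runId}`);
```

Requiring a strict improvement means the baseline only ever moves up. A team that wants to tolerate noise could add a minimum improvement margin instead.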
| 130 | + |
| 131 | +## Common Pitfalls |
| 132 | + |
| 133 | +- **Too many gates** — Start with critical metrics only |
| 134 | +- **Unrealistic thresholds** — Set achievable targets |
| 135 | +- **No baseline** — Always have something to compare against |
| 136 | +- **Ignoring trends** — Consider directional changes |
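To illustrate the "ignoring trends" pitfall, the sketch below fits a least-squares slope to a metric across recent runs, so a steady decline can be flagged even while every run still clears its threshold. The `trendSlope` helper and the sample scores are hypothetical and not part of `agent-eval-harness`.

```typescript
// Least-squares slope of a metric over equally spaced runs (oldest first).
// Returns 0 when there are fewer than two points.
function trendSlope(scores: number[]): number {
  const n = scores.length;
  if (n < 2) return 0;
  const meanX = (n - 1) / 2;
  const meanY = scores.reduce((sum, y) => sum + y, 0) / n;
  let num = 0;
  let den = 0;
  scores.forEach((y, x) => {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  });
  return num / den;
}

// Made-up history: every run is above a 0.80 threshold, yet quality is sliding.
const recentScores = [0.9, 0.88, 0.86, 0.84];
const slope = trendSlope(recentScores);

if (slope < 0) {
  console.log(`overall_score is trending down by about ${Math.abs(slope).toFixed(3)} per run`);
}
```

A threshold gate alone would pass all four runs here. Whether a negative slope should block a merge or only raise a warning is a policy choice for the team.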
| 137 | + |
| 138 | +## Related Skills |
| 139 | + |
| 140 | +- [Regression Suites](../regression-suites/skill.md) |
| 141 | +- [Golden Trajectories](../golden-trajectories/skill.md) |
| 142 | +- [Cost Tracking](../cost-tracking/skill.md) |
| 143 | +- [Latency Budgets](../latency-budgets/skill.md) |