docs/L4-standards-measurement.md
Lines changed: 66 additions & 28 deletions
@@ -424,40 +424,53 @@ Putting thresholds in CI workflow files rather than project config. When thresho
424
424
425
425
### Problem
426
426
427
-
L3 defines routing and optimization patterns — smart routing, intent classification, environment-aware tool selection. But without systematic evaluation, there is no evidence that the routing logic makes correct decisions. Changes to routing rules, new environment presets, or updated intent patterns can introduce regressions that go undetected until an agent session wastes tokens on a wrong tool choice. L4 requires evidence-based claims ([Pattern 4.1](#pattern-41--evidence-based-claims)); routing effectiveness needs the same rigor.
427
+
Agent behavior is governed by layers of decision logic — routing rules, intent classifiers, skill selection, enforcement pipelines, guardrail checks. Without systematic evaluation, there is no evidence that any of these layers make correct decisions. Changes to rules, new environment configurations, or updated patterns can introduce regressions that go undetected until an agent session wastes tokens, violates a constitutional rule, or takes a wrong path. L4 requires evidence-based claims ([Pattern 4.1](#pattern-41--evidence-based-claims)); agent decision-making needs the same rigor.
428
428
429
429
### Solution
430
430
431
-
**Context eval** is a structured evaluation pattern that scores an agent's routing and tool-selection decisions against expected outcomes across multiple scenarios and environments. Each scenario defines a tool call (command, tool name) and the expected routing decision for each environment preset. The evaluation runs the actual routing logic, compares results to expectations, and produces a scored report with category and environment breakdowns.
431
+
**Context eval** is a structured evaluation pattern that scores an agent's decisions against expected outcomes across multiple scenarios and configurations. The pattern is general — it applies to any agent decision layer where correctness can be defined as a mapping from input to expected output. The evaluation runs actual decision logic (not mocks), compares results to expectations, and produces a scored report that catches regressions before they reach production sessions.
432
+
433
+
**Evaluation targets — where context eval applies:**
434
+
435
+
| Agent Decision Layer | What Gets Evaluated | Example Scenarios |
Fail build if overall score below threshold (e.g., 0.7)
445
456
↓
446
-
Fix routing logic or update scenarios → re-evaluate
457
+
Fix decision logic or update scenarios → re-evaluate
447
458
```
448
459
449
-
**Why context eval matters:** Routing decisions are the highest-frequency agent actions. Every `grep`, `cat`, `find`, or `git status` passes through the routing layer. A 5% regression in routing accuracy translates to thousands of wasted tokens per session. Context eval catches regressions before they reach production sessions.
460
+
**Why context eval matters:** Agent decisions are high-frequency and compound — a tool-routing error on every `grep` call wastes thousands of tokens per session; a misfiring enforcement pipeline lets violations through or blocks legitimate work. Context eval provides the evidence that these decision layers work correctly and continue to work as they evolve.
450
461
451
462
**Key concepts:**
452
463
453
-
- **Scenarios**: Each scenario specifies a tool call (the command the agent issues) and the expected outcome (action + optional tool) for each environment preset
454
-
- **Environment presets**: Different tool availability combinations (e.g., both RTK and jcodemunch available, RTK only, neither). Routing that works in one environment may fail in another
455
-
- **Graduated scoring**: 1.0 for exact match (correct action and tool), 0.5 for partial match (correct action, wrong tool), 0.0 for miss
456
-
- **Category coverage**: Scenarios organized by category (bash, native, agent, pipe, edge) with per-category score reporting
464
+
- **Scenarios**: Each scenario specifies an input (command, context, tool call) and the expected outcome for each configuration variant
465
+
- **Configuration variants**: Different environmental conditions (tool availability, project type, agent phase, rule set). Logic that works in one configuration may fail in another
466
+
- **Graduated scoring**: 1.0 for exact match, 0.5 for partial match (correct category, wrong detail), 0.0 for miss. Binary scoring misses nuance — suggesting `Grep` when `rtk grep` is unavailable is correct behavior, not a failure
467
+
- **Category coverage**: Scenarios organized by category with per-category score reporting, directing attention to weak areas
457
468
- **Threshold gate**: Minimum overall score that must be met; build fails if unmet, with detailed failure output
458
-
- **Bidirectional coverage**: Not just "does the right tool get selected?" but also "does the wrong command get blocked?" — destructive operations (`sed -i`) must produce a `block` action regardless of environment
469
+
- **Bidirectional coverage**: Not just "does the right thing happen?" but also "does the wrong thing get prevented?" — destructive operations must be blocked, violations must be caught, regardless of configuration
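
The graduated scoring and threshold gate described above can be sketched as follows. This is a minimal illustration, not the rig implementation: the `Outcome` shape and the names `scoreOutcome` and `passesGate` are hypothetical, and treating an empty result set as a failure is an assumption of this sketch.

```typescript
// Hypothetical outcome shape for illustration; not taken from the rig repo.
interface Outcome {
  action: string; // e.g. 'allow', 'block', 'suggest'
  tool?: string;  // optional detail, e.g. 'Grep'
}

// Graduated scoring: 1.0 exact match, 0.5 correct action but wrong detail, 0.0 miss.
function scoreOutcome(expected: Outcome, actual: Outcome): number {
  if (expected.action !== actual.action) return 0.0;
  if (expected.tool === undefined || expected.tool === actual.tool) return 1.0;
  return 0.5;
}

// Threshold gate: the mean score must meet the minimum, otherwise the build fails.
function passesGate(scores: number[], threshold: number): boolean {
  if (scores.length === 0) return false; // assumption: no evidence counts as failure
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= threshold;
}
```

A test runner would typically assert `passesGate(scores, 0.7)` and print the failing scenarios when the assertion does not hold.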
459
470
460
-
### In Practice
471
+
### In Practice — Tool Routing Example
472
+
473
+
The following example shows context eval applied to tool routing — evaluating whether the routing layer selects the correct tool for each command across different environment configurations.
461
474
462
475
```typescript
463
476
// Scenario definition — what should happen for each environment
@@ -484,8 +497,8 @@ function scoreResult(expected: ExpectedOutcome, actual: ActualOutcome): number {
484
497
return 0.0; // miss
485
498
}
486
499
487
-
// Evaluation loop: every scenario × every environment
The same pattern applies to evaluating guardrail behavior. Here, scenarios test whether the enforcement pipeline fires at the correct severity for different code changes.
537
+
538
+
```typescript
539
+
const enforcementScenario: EvalScenario = {
540
+
id: 'mock-in-stack-test-should-block',
541
+
category: 'constitutional',
542
+
description: 'Mock pattern in stack test file triggers block',
unit_test_file: { action: 'allow' }, // mocks allowed in unit tests
550
+
}
551
+
};
552
+
```
553
+
554
+
The evaluation structure is the same — scenarios define inputs and expected outputs, the actual enforcement logic runs, results are scored and reported.
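
That shared structure can be sketched as a loop that is generic over the decision layer. All names here (`Scenario`, `Decision`, `EvalResult`, `runEval`) are hypothetical and chosen for illustration. The sketch assumes each scenario carries its expectations keyed by configuration variant, and that the caller passes in the real decision function and a scoring function.

```typescript
// Hypothetical types for illustration only.
type Decision = { action: string };

interface Scenario<Input> {
  id: string;
  category: string;
  input: Input;
  expected: Record<string, Decision>; // keyed by configuration variant
}

interface EvalResult {
  scenarioId: string;
  category: string;
  variant: string;
  score: number;
}

// Runs the supplied decision function (the real logic, not a mock) for every
// scenario under every variant the scenario declares an expectation for.
function runEval<Input>(
  scenarios: Scenario<Input>[],
  decide: (input: Input, variant: string) => Decision,
  score: (expected: Decision, actual: Decision) => number,
): EvalResult[] {
  const results: EvalResult[] = [];
  for (const scenario of scenarios) {
    for (const [variant, expected] of Object.entries(scenario.expected)) {
      const actual = decide(scenario.input, variant);
      results.push({
        scenarioId: scenario.id,
        category: scenario.category,
        variant,
        score: score(expected, actual),
      });
    }
  }
  return results;
}
```

Passing the decision and scoring functions as parameters keeps the loop identical for tool routing and enforcement; only the scenarios and the function under evaluation change.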
555
+
521
556
**Report structure** — the eval produces a structured report, not just pass/fail:
The category and environment breakdowns direct attention: a low `edge` score says "improve edge-case handling"; a low `neither` score says "test the degraded path more carefully."
583
+
The category and configuration breakdowns direct attention: a low `edge` score says "improve edge-case handling"; a low `neither` score says "test the degraded path more carefully."
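
One way to compute such breakdowns is to average scores per group. The sketch below is an assumption about how a report could be derived, not the rig report format; `ScoredResult` and `meanBy` are hypothetical names.

```typescript
// Hypothetical result row; the field names are assumptions for this sketch.
interface ScoredResult {
  category: string;
  variant: string;
  score: number;
}

// Mean score per category or per configuration variant.
function meanBy(
  results: ScoredResult[],
  key: 'category' | 'variant',
): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const r of results) {
    const name = r[key];
    const bucket = sums[name] ?? { total: 0, count: 0 };
    bucket.total += r.score;
    bucket.count += 1;
    sums[name] = bucket;
  }
  const means: Record<string, number> = {};
  for (const [name, { total, count }] of Object.entries(sums)) {
    means[name] = total / count;
  }
  return means;
}
```

Calling `meanBy(results, 'category')` and `meanBy(results, 'variant')` on the same result set gives the two views that point at weak categories and weak configurations respectively.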
549
584
550
585
### Reference Implementation
551
586
552
-
The [rig](https://github.com/franklywatson/claude-rig) repo implements context eval in [`tests/eval/`](https://github.com/franklywatson/claude-rig/tree/main/tests/eval) with:
587
+
The [rig](https://github.com/franklywatson/claude-rig) repo implements context eval for tool routing in [`tests/eval/`](https://github.com/franklywatson/claude-rig/tree/main/tests/eval) with:
553
588
554
589
- [`eval.test.ts`](https://github.com/franklywatson/claude-rig/blob/main/tests/eval/eval.test.ts) — Main evaluation loop: iterates all scenarios across all environment presets
555
590
- [`scenarios.ts`](https://github.com/franklywatson/claude-rig/blob/main/tests/eval/scenarios.ts) — 18 scenarios across 5 categories with 5 environment presets
@@ -558,18 +593,21 @@ The [rig](https://github.com/franklywatson/claude-rig) repo implements context e
558
593
559
594
### Anti-Pattern
560
595
561
-
- **Binary pass/fail**: Routing decisions have nuance — suggesting `Grep` when `rtk grep` is unavailable is correct, not a failure. Graduated scoring captures this.
562
-
- **Single-environment testing**: Routing that works when all tools are present may fail when tools are missing. Test across environment presets.
563
-
- **Evaluating mocked routing**: Running eval against a mock of the routing logic tests the mock, not the routing. Context eval exercises the actual `handlePreToolUse` function — the same code that runs in production sessions.
564
-
- **Threshold too low**: A threshold of 0.0 means every routing decision can be wrong and the build still passes. A meaningful threshold (0.7+) forces routing quality to improve or the build fails.
565
-
- **Scenarios that never change**: As routing logic evolves, scenarios must evolve with it. Stale scenarios test a routing system that no longer exists.
596
+
- **Binary pass/fail**: Agent decisions have nuance — degrading gracefully when a tool is unavailable is correct, not a failure. Graduated scoring captures this.
597
+
- **Single-configuration testing**: Logic that works in one environment may fail in another. Test across configuration variants.
598
+
- **Evaluating mocked logic**: Running eval against a mock of the decision layer tests the mock, not the logic. Context eval exercises the actual production code path.
599
+
- **Threshold too low**: A threshold of 0.0 means every decision can be wrong and the build still passes. A meaningful threshold (0.7+) forces quality to improve or the build fails.
600
+
- **Scenarios that never change**: As decision logic evolves, scenarios must evolve with it. Stale scenarios test a system that no longer exists.
601
+
- **Evaluating only one layer**: Tool routing is the most obvious evaluation target, but enforcement pipelines, skill selection, and constitutional compliance all benefit from the same structured eval approach.
566
602
567
603
### Cross-References
568
604
569
-
- [Pattern 3.1 — Smart Routing](L3-optimization.md#pattern-31--smart-routing--tool-selection) — The routing logic that context eval verifies
605
+
- [Pattern 3.1 — Smart Routing](L3-optimization.md#pattern-31--smart-routing--tool-selection) — Tool routing is one context eval application
570
606
- [Pattern 3.2 — Intent Classification](L3-optimization.md#pattern-32--intent-classification) — Intent parsing correctness is eval'd through scenarios
571
-
- [Pattern 3.3 — Environment-Aware Routing](L3-optimization.md#pattern-33--environment-aware-routing) — Environment presets in eval ensure degraded paths work
- [Pattern 3.3 — Environment-Aware Routing](L3-optimization.md#pattern-33--environment-aware-routing) — Configuration variants in eval ensure degraded paths work
608
+
- [Pattern 2.4 — Constitutional Rules](L2-behavioral-guardrails.md#pattern-24--constitutional-rules) — Constitutional compliance can be eval'd through scenarios
609
+
- [Pattern 2.6 — Enforcement Pipeline](L2-behavioral-guardrails.md#pattern-26--enforcement-pipeline-composition) — Pipeline behavior can be eval'd through scenarios
docs/cross-cutting/glossary.md
Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ Alphabetical reference for specialized terms used across the agentic-patterns li
31
31
### Context Eval
32
32
33
33
**Level**: L4 - Standards & Measurement
34
-
**Definition**: Structured evaluation pattern that scores an agent's routing and tool-selection decisions against expected outcomes across multiple scenarios and environment presets. Uses graduated scoring (1.0 exact, 0.5 partial, 0.0 miss) and fails the build if overall score falls below a threshold.
34
+
**Definition**: Structured evaluation pattern that scores an agent's decisions against expected outcomes across multiple scenarios and configurations. Applies to any decision layer (tool routing, enforcement pipelines, skill selection, constitutional compliance). Uses graduated scoring (1.0 exact, 0.5 partial, 0.0 miss) and fails the build if overall score falls below a threshold.