
Commit 1231856

franklywatson and claude committed
docs: reframe context eval (4.6) as general evaluation methodology
Broaden Pattern 4.6 beyond tool routing to cover all agent decision layers — enforcement pipelines, skill selection, constitutional compliance, environment detection. Tool routing becomes one example application. Add enforcement pipeline example alongside routing example. Update glossary to match. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 987842d commit 1231856

2 files changed

Lines changed: 67 additions & 29 deletions


docs/L4-standards-measurement.md

Lines changed: 66 additions & 28 deletions
@@ -424,40 +424,53 @@ Putting thresholds in CI workflow files rather than project config. When thresho
424424
425425
### Problem
426426
427-
L3 defines routing and optimization patterns — smart routing, intent classification, environment-aware tool selection. But without systematic evaluation, there is no evidence that the routing logic makes correct decisions. Changes to routing rules, new environment presets, or updated intent patterns can introduce regressions that go undetected until an agent session wastes tokens on a wrong tool choice. L4 requires evidence-based claims ([Pattern 4.1](#pattern-41--evidence-based-claims)); routing effectiveness needs the same rigor.
427+
Agent behavior is governed by layers of decision logic — routing rules, intent classifiers, skill selection, enforcement pipelines, guardrail checks. Without systematic evaluation, there is no evidence that any of these layers make correct decisions. Changes to rules, new environment configurations, or updated patterns can introduce regressions that go undetected until an agent session wastes tokens, violates a constitutional rule, or takes a wrong path. L4 requires evidence-based claims ([Pattern 4.1](#pattern-41--evidence-based-claims)); agent decision-making needs the same rigor.
428428
429429
### Solution
430430
431-
**Context eval** is a structured evaluation pattern that scores an agent's routing and tool-selection decisions against expected outcomes across multiple scenarios and environments. Each scenario defines a tool call (command, tool name) and the expected routing decision for each environment preset. The evaluation runs the actual routing logic, compares results to expectations, and produces a scored report with category and environment breakdowns.
431+
**Context eval** is a structured evaluation pattern that scores an agent's decisions against expected outcomes across multiple scenarios and configurations. The pattern is general — it applies to any agent decision layer where correctness can be defined as a mapping from input to expected output. The evaluation runs actual decision logic (not mocks), compares results to expectations, and produces a scored report that catches regressions before they reach production sessions.
432+
433+
**Evaluation targets — where context eval applies:**
434+
435+
| Agent Decision Layer | What Gets Evaluated | Example Scenarios |
436+
|---------------------|--------------------|--------------------|
437+
| Tool routing | Does the agent select the right tool for each command? | `grep` routes to structured search; `sed -i` gets blocked |
438+
| Intent classification | Does the classifier correctly identify operation type? | `cat file` → `file_read`; `sed -i` → `file_modify` |
439+
| Enforcement pipeline | Do guardrails fire at the right severity? | Mock in stack test → `block`; mock in unit test → `allow` |
440+
| Skill selection | Does the agent activate the right skill for the task? | Bug report → `debug+`; new feature → `plan+` |
441+
| Environment detection | Does the system adapt when tools are missing? | No RTK → degrade to `Grep`; no jcodemunch → raw search |
442+
| Constitutional compliance | Do rules hold across edge cases? | `sed -i` always blocked regardless of environment |
432443
433444
**The closed loop:**
434445
435446
```
436-
Define scenarios (tool call + expected outcome per environment)
447+
Define scenarios (input + expected outcome per configuration)
437448
438-
Run scenarios against actual routing logic
449+
Run scenarios against actual decision logic
439450
440451
Score each result (1.0 exact, 0.5 partial, 0.0 miss)
441452
442-
Generate report (overall score, per-category, per-environment, failures)
453+
Generate report (overall score, per-category, per-configuration, failures)
443454
444455
Fail build if overall score below threshold (e.g., 0.7)
445456
446-
Fix routing logic or update scenarios → re-evaluate
457+
Fix decision logic or update scenarios → re-evaluate
447458
```
448459
449-
**Why context eval matters:** Routing decisions are the highest-frequency agent actions. Every `grep`, `cat`, `find`, or `git status` passes through the routing layer. A 5% regression in routing accuracy translates to thousands of wasted tokens per session. Context eval catches regressions before they reach production sessions.
460+
**Why context eval matters:** Agent decisions are high-frequency and compound — a tool-routing error on every `grep` call wastes thousands of tokens per session; a misfiring enforcement pipeline lets violations through or blocks legitimate work. Context eval provides the evidence that these decision layers work correctly and continue to work as they evolve.
450461
451462
**Key concepts:**
452463
453-
- **Scenarios**: Each scenario specifies a tool call (the command the agent issues) and the expected outcome (action + optional tool) for each environment preset
454-
- **Environment presets**: Different tool availability combinations (e.g., both RTK and jcodemunch available, RTK only, neither). Routing that works in one environment may fail in another
455-
- **Graduated scoring**: 1.0 for exact match (correct action and tool), 0.5 for partial match (correct action, wrong tool), 0.0 for miss
456-
- **Category coverage**: Scenarios organized by category (bash, native, agent, pipe, edge) with per-category score reporting
464+
- **Scenarios**: Each scenario specifies an input (command, context, tool call) and the expected outcome for each configuration variant
465+
- **Configuration variants**: Different environmental conditions (tool availability, project type, agent phase, rule set). Logic that works in one configuration may fail in another
466+
- **Graduated scoring**: 1.0 for exact match, 0.5 for partial match (correct category, wrong detail), 0.0 for miss. Binary scoring misses nuance — suggesting `Grep` when `rtk grep` is unavailable is correct behavior, not a failure
467+
- **Category coverage**: Scenarios organized by category with per-category score reporting, directing attention to weak areas
457468
- **Threshold gate**: Minimum overall score that must be met; build fails if unmet, with detailed failure output
458-
- **Bidirectional coverage**: Not just "does the right tool get selected?" but also "does the wrong command get blocked?" — destructive operations (`sed -i`) must produce a `block` action regardless of environment
469+
- **Bidirectional coverage**: Not just "does the right thing happen?" but also "does the wrong thing get prevented?" — destructive operations must be blocked, violations must be caught, regardless of configuration
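
The graduated scoring and threshold gate described in the key concepts can be sketched as below. This is a minimal illustration under stated assumptions, not the reference implementation: the `Outcome` type and the names `gradeOutcome` and `meetsThreshold` are hypothetical, and "partial" is assumed to mean that the action matches while the detail differs.

```typescript
// Hypothetical outcome shape for illustration; not taken from the rig repo.
type Outcome = { action: string; detail?: string };

// Graduated scoring: 1.0 exact, 0.5 partial (same action, different detail), 0.0 miss.
function gradeOutcome(expected: Outcome, actual: Outcome): number {
  if (expected.action !== actual.action) return 0.0;
  // When the scenario does not pin a detail, matching the action is an exact match.
  if (expected.detail === undefined || expected.detail === actual.detail) return 1.0;
  return 0.5;
}

// Threshold gate: the mean score must reach the minimum, otherwise the build should fail.
function meetsThreshold(scores: number[], minScore: number): boolean {
  if (scores.length === 0) return false; // assumption: no evidence counts as a failure
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= minScore;
}
```

In a test runner, the gate would typically be a final assertion on the overall score, so that per-scenario failures are reported in full before the build fails.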
459470
460-
### In Practice
471+
### In Practice — Tool Routing Example
472+
473+
The following example shows context eval applied to tool routing — evaluating whether the routing layer selects the correct tool for each command across different environment configurations.
461474
462475
```typescript
463476
// Scenario definition — what should happen for each environment
@@ -484,8 +497,8 @@ function scoreResult(expected: ExpectedOutcome, actual: ActualOutcome): number {
484497
return 0.0; // miss
485498
}
486499
487-
// Evaluation loop: every scenario × every environment
488-
describe('Context Eval: routing effectiveness', () => {
500+
// Evaluation loop: every scenario × every configuration
501+
describe('Context Eval: tool routing', () => {
489502
const results: EvalResult[] = [];
490503
const MIN_OVERALL_SCORE = 0.7;
491504
@@ -518,12 +531,34 @@ describe('Context Eval: routing effectiveness', () => {
518531
});
519532
```
520533
534+
### In Practice — Enforcement Pipeline Example
535+
536+
The same pattern applies to evaluating guardrail behavior. Here, scenarios test whether the enforcement pipeline fires at the correct severity for different code changes.
537+
538+
```typescript
539+
const enforcementScenario: EvalScenario = {
540+
id: 'mock-in-stack-test-should-block',
541+
category: 'constitutional',
542+
description: 'Mock pattern in stack test file triggers block',
543+
input: {
544+
filePath: 'tests/stack/04-checkout.stack.test.ts',
545+
editContent: 'const mockDb = { getUser: () => ({}) }',
546+
},
547+
expected: {
548+
default: { action: 'block', reason: 'constitutional_rule_1_no_mocks_in_stack_tests' },
549+
unit_test_file: { action: 'allow' }, // mocks allowed in unit tests
550+
}
551+
};
552+
```
553+
554+
The evaluation structure is the same — scenarios define inputs and expected outputs, the actual enforcement logic runs, results are scored and reported.
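
As a rough sketch of that shared structure, the loop below runs one scenario against a decision function for every configuration the scenario names. All names here (`Scenario`, `runScenario`, `toyEnforce`) are hypothetical. In particular, `toyEnforce` is only a stand-in so the sketch runs on its own; a real eval would pass the production enforcement function instead, since evaluating a stand-in tests the stand-in.

```typescript
type Decision = { action: "allow" | "block" };

interface Scenario<I> {
  id: string;
  input: I;
  expected: Record<string, Decision>; // keyed by configuration name
}

interface RunResult {
  scenario: string;
  configuration: string;
  expected: string;
  actual: string;
  pass: boolean;
}

// Run one scenario against the decision function for each configuration it defines.
function runScenario<I>(
  scenario: Scenario<I>,
  decide: (input: I, configuration: string) => Decision
): RunResult[] {
  const results: RunResult[] = [];
  for (const configuration of Object.keys(scenario.expected)) {
    const expected = scenario.expected[configuration];
    if (!expected) continue;
    const actual = decide(scenario.input, configuration);
    results.push({
      scenario: scenario.id,
      configuration,
      expected: expected.action,
      actual: actual.action,
      pass: expected.action === actual.action,
    });
  }
  return results;
}

// Stand-in decision logic (assumption): block mock-like edits unless the
// configuration says the file is a unit test.
type EditInput = { filePath: string; editContent: string };
function toyEnforce(input: EditInput, configuration: string): Decision {
  const looksLikeMock = /\bmock/i.test(input.editContent);
  const blocked = looksLikeMock && configuration !== "unit_test_file";
  return { action: blocked ? "block" : "allow" };
}

const runResults = runScenario<EditInput>(
  {
    id: "mock-in-stack-test-should-block",
    input: {
      filePath: "tests/stack/04-checkout.stack.test.ts",
      editContent: "const mockDb = { getUser: () => ({}) }",
    },
    expected: { default: { action: "block" }, unit_test_file: { action: "allow" } },
  },
  toyEnforce
);
```

Keeping the runner generic over the input type is one way to reuse the same loop for routing, enforcement, and skill-selection scenarios.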
555+
521556
**Report structure** — the eval produces a structured report, not just pass/fail:
522557
523558
```
524559
{
525560
overallScore: 0.83,
526-
totalScenarios: 90, // 18 scenarios × 5 environments
561+
totalScenarios: 90, // 18 scenarios × 5 configurations
527562
passCount: 75,
528563
byCategory: {
529564
bash: 0.89,
@@ -532,24 +567,24 @@ describe('Context Eval: routing effectiveness', () => {
532567
pipe: 0.80,
533568
edge: 0.67 ← edge cases need attention
534569
},
535-
byEnvironment: {
570+
byConfiguration: {
536571
full: 0.94,
537572
rtk_only: 0.83,
538573
jm_only: 0.78,
539574
neither: 0.72,
540575
jm_not_indexed: 0.89
541576
},
542577
failures: [
543-
{ scenario: 'sed-i-blocks', preset: 'neither', expected: 'block', actual: 'allow', reason: 'Destructive edit not blocked' }
578+
{ scenario: 'sed-i-blocks', configuration: 'neither', expected: 'block', actual: 'allow', reason: 'Destructive edit not blocked' }
544579
]
545580
}
546581
```
547582
548-
The category and environment breakdowns direct attention: a low `edge` score says "improve edge-case handling"; a low `neither` score says "test the degraded path more carefully."
583+
The category and configuration breakdowns direct attention: a low `edge` score says "improve edge-case handling"; a low `neither` score says "test the degraded path more carefully."
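
One way such breakdowns could be computed is to average scores grouped by a chosen field. The sketch below is illustrative only: the `ScoredResult` shape and the `meanScoreBy` helper are assumptions, and the sample rows are made up for the example rather than taken from the report above.

```typescript
// Hypothetical per-result record; field names are assumptions for this sketch.
interface ScoredResult {
  scenario: string;
  category: string;
  configuration: string;
  score: number; // 1.0, 0.5, or 0.0
}

// Average the scores of all results that share the same value for `key`.
function meanScoreBy(
  results: ScoredResult[],
  key: "category" | "configuration"
): Record<string, number> {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  for (const r of results) {
    const group = r[key];
    sums[group] = (sums[group] ?? 0) + r.score;
    counts[group] = (counts[group] ?? 0) + 1;
  }
  const means: Record<string, number> = {};
  for (const group of Object.keys(sums)) {
    means[group] = (sums[group] ?? 0) / (counts[group] ?? 1);
  }
  return means;
}

// Made-up sample data, for illustration only.
const sample: ScoredResult[] = [
  { scenario: "sed-i-blocks", category: "edge", configuration: "full", score: 1.0 },
  { scenario: "sed-i-blocks", category: "edge", configuration: "neither", score: 0.0 },
  { scenario: "grep-routes", category: "bash", configuration: "full", score: 1.0 },
  { scenario: "grep-routes", category: "bash", configuration: "neither", score: 0.5 },
];

const byCategory = meanScoreBy(sample, "category");
const byConfiguration = meanScoreBy(sample, "configuration");
```

The same helper serves both breakdowns, which keeps the report fields consistent with each other.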
549584
550585
### Reference Implementation
551586
552-
The [rig](https://github.com/franklywatson/claude-rig) repo implements context eval in [`tests/eval/`](https://github.com/franklywatson/claude-rig/tree/main/tests/eval) with:
587+
The [rig](https://github.com/franklywatson/claude-rig) repo implements context eval for tool routing in [`tests/eval/`](https://github.com/franklywatson/claude-rig/tree/main/tests/eval) with:
553588
554589
- [`eval.test.ts`](https://github.com/franklywatson/claude-rig/blob/main/tests/eval/eval.test.ts) — Main evaluation loop: iterates all scenarios across all environment presets
555590
- [`scenarios.ts`](https://github.com/franklywatson/claude-rig/blob/main/tests/eval/scenarios.ts) — 18 scenarios across 5 categories with 5 environment presets
@@ -558,18 +593,21 @@ The [rig](https://github.com/franklywatson/claude-rig) repo implements context e
558593
559594
### Anti-Pattern
560595
561-
- **Binary pass/fail**: Routing decisions have nuance — suggesting `Grep` when `rtk grep` is unavailable is correct, not a failure. Graduated scoring captures this.
562-
- **Single-environment testing**: Routing that works when all tools are present may fail when tools are missing. Test across environment presets.
563-
- **Evaluating mocked routing**: Running eval against a mock of the routing logic tests the mock, not the routing. Context eval exercises the actual `handlePreToolUse` function, the same code that runs in production sessions.
564-
- **Threshold too low**: A threshold of 0.0 means every routing decision can be wrong and the build still passes. A meaningful threshold (0.7+) forces routing quality to improve or the build fails.
565-
- **Scenarios that never change**: As routing logic evolves, scenarios must evolve with it. Stale scenarios test a routing system that no longer exists.
596+
- **Binary pass/fail**: Agent decisions have nuance — degrading gracefully when a tool is unavailable is correct, not a failure. Graduated scoring captures this.
597+
- **Single-configuration testing**: Logic that works in one environment may fail in another. Test across configuration variants.
598+
- **Evaluating mocked logic**: Running eval against a mock of the decision layer tests the mock, not the logic. Context eval exercises the actual production code path.
599+
- **Threshold too low**: A threshold of 0.0 means every decision can be wrong and the build still passes. A meaningful threshold (0.7+) forces quality to improve or the build fails.
600+
- **Scenarios that never change**: As decision logic evolves, scenarios must evolve with it. Stale scenarios test a system that no longer exists.
601+
- **Evaluating only one layer**: Tool routing is the most obvious evaluation target, but enforcement pipelines, skill selection, and constitutional compliance all benefit from the same structured eval approach.
566602
567603
### Cross-References
568604
569-
- [Pattern 3.1 — Smart Routing](L3-optimization.md#pattern-31--smart-routing--tool-selection) — The routing logic that context eval verifies
605+
- [Pattern 3.1 — Smart Routing](L3-optimization.md#pattern-31--smart-routing--tool-selection) — Tool routing is one context eval application
570606
- [Pattern 3.2 — Intent Classification](L3-optimization.md#pattern-32--intent-classification) — Intent parsing correctness is eval'd through scenarios
571-
- [Pattern 3.3 — Environment-Aware Routing](L3-optimization.md#pattern-33--environment-aware-routing) — Environment presets in eval ensure degraded paths work
572-
- [Pattern 4.1 — Evidence-Based Claims](#pattern-41--evidence-based-claims) — Context eval produces evidence for routing effectiveness claims
607+
- [Pattern 3.3 — Environment-Aware Routing](L3-optimization.md#pattern-33--environment-aware-routing) — Configuration variants in eval ensure degraded paths work
608+
- [Pattern 2.4 — Constitutional Rules](L2-behavioral-guardrails.md#pattern-24--constitutional-rules) — Constitutional compliance can be eval'd through scenarios
609+
- [Pattern 2.6 — Enforcement Pipeline](L2-behavioral-guardrails.md#pattern-26--enforcement-pipeline-composition) — Pipeline behavior can be eval'd through scenarios
610+
- [Pattern 4.1 — Evidence-Based Claims](#pattern-41--evidence-based-claims) — Context eval produces evidence for decision-layer effectiveness claims
573611
- [Pattern 4.5 — CI Guardrails](#pattern-45--ci-guardrails) — Context eval runs in CI as a non-negotiable quality gate
574612
575613
---

docs/cross-cutting/glossary.md

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ Alphabetical reference for specialized terms used across the agentic-patterns li
3131
### Context Eval
3232

3333
**Level**: L4 - Standards & Measurement
34-
**Definition**: Structured evaluation pattern that scores an agent's routing and tool-selection decisions against expected outcomes across multiple scenarios and environment presets. Uses graduated scoring (1.0 exact, 0.5 partial, 0.0 miss) and fails the build if overall score falls below a threshold.
34+
**Definition**: Structured evaluation pattern that scores an agent's decisions against expected outcomes across multiple scenarios and configurations. Applies to any decision layer (tool routing, enforcement pipelines, skill selection, constitutional compliance). Uses graduated scoring (1.0 exact, 0.5 partial, 0.0 miss) and fails the build if overall score falls below a threshold.
3535
**See**: [Pattern 4.6 - Context Eval](../L4-standards-measurement.md#pattern-46--context-eval)
3636

3737
### Conditional Assertion (Anti-Pattern)
