**.test/CUSTOMIZATION_GUIDE.md** (1 addition, 1 deletion)
@@ -115,7 +115,7 @@ expected_patterns:
**What it is:** Natural-language evaluation criteria passed to the LLM judge. The judge scores how well the response follows each guideline.
-**How it steers optimization:** Guidelines are the most flexible steering mechanism. They influence the quality score (30% of total) and effectiveness score (40% of total).
+**How it steers optimization:** Guidelines are the most flexible steering mechanism. They influence the guideline adherence score (15% of total) and the quality composite (20% of total, which averages correctness + completeness + guideline adherence).
**Example focus prompt:** `"Must parameterize catalog names with a prefix variable"`
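For readers tracking the new weighting, here is a minimal arithmetic sketch of the contribution described in the added line. The 15%/20% weights and the three averaged subscores come from the diff; the function names and the sample inputs are illustrative, not the repository's actual code.

```python
# Hypothetical sketch of the scoring arithmetic described above; the
# weights come from the diff, the function and argument names are assumed.

def quality_composite(correctness: float, completeness: float,
                      guideline_adherence: float) -> float:
    """Average of the three judge subscores (each in [0, 1])."""
    return (correctness + completeness + guideline_adherence) / 3

def guideline_contribution(correctness: float, completeness: float,
                           guideline_adherence: float) -> float:
    """Total-score contribution of the two components guidelines touch:
    15% standalone adherence + 20% quality composite."""
    return (0.15 * guideline_adherence
            + 0.20 * quality_composite(correctness, completeness,
                                       guideline_adherence))

# Example: perfect adherence, middling correctness/completeness
print(round(guideline_contribution(0.6, 0.6, 1.0), 4))  # → 0.2967
```

So a guideline change can move at most 15% + 20%/3 of the total score, which is why the diff calls guidelines the most flexible steering mechanism.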
**.test/README.md** (11 additions, 12 deletions)
@@ -335,7 +335,7 @@ eval-criteria/
### How it works
-Judges receive a lightweight listing of available criteria in their system prompt. When a criteria's description matches the trace being evaluated, the judge calls `read_eval_criteria` to load the full rubric and `read_eval_reference` for detailed reference material. This keeps judge prompts small while giving access to deep domain knowledge.
+`discover_skill_paths()` in `judges.py` scans `.test/eval-criteria/` for subdirectories containing a `SKILL.md` file, filtering by `applies_to` metadata against the skill's `tool_modules`. The discovered paths are passed to `make_judge(skills=[...])` when MLflow supports the native `skills=` parameter, enabling on-demand loading of domain-specific rubrics during scoring.
### `applies_to` filtering
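The added discovery step could look roughly like the following sketch. The directory layout, the `SKILL.md` filename, and the `applies_to`/`tool_modules` keys come from the diff; the function signature and the naive metadata parsing are assumptions, not the repository's actual implementation.

```python
from pathlib import Path

def discover_skill_paths(root: str, tool_modules: set[str]) -> list[Path]:
    """Hypothetical sketch: find eval-criteria subdirectories whose
    SKILL.md declares an `applies_to` module this skill actually uses."""
    paths = []
    for skill_md in Path(root).glob("*/SKILL.md"):
        applies_to = _parse_applies_to(skill_md)
        # Keep criteria that are unrestricted, or that overlap our modules
        if not applies_to or applies_to & tool_modules:
            paths.append(skill_md.parent)
    return paths

def _parse_applies_to(skill_md: Path) -> set[str]:
    """Naive line-based scan for `applies_to: a, b` in the file header
    (assumed metadata format)."""
    for line in skill_md.read_text().splitlines():
        if line.startswith("applies_to:"):
            value = line.split(":", 1)[1]
            return {m.strip() for m in value.split(",") if m.strip()}
    return set()
```

The resulting paths would then be handed to `make_judge(skills=[...])` as the diff describes, so each judge loads only the rubrics relevant to its trace.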
@@ -376,7 +376,7 @@ Each candidate skill is evaluated per-task using a WITH vs WITHOUT comparison:
1. **Generate WITH-skill response** — LLM generates with SKILL.md in context
2. **Generate WITHOUT-skill response** — LLM generates without skill (cached)
**Categorical-to-float conversion:** `excellent=1.0`, `acceptable=0.6`, `poor=0.0`. The nonlinear scale incentivizes GEPA to push from "acceptable" to "excellent" (0.4 gap).
+**Binary-to-float conversion:** `yes=1.0`, `no=0.0`. Binary verdicts produce more reliable, consistent judgments than categorical or continuous scales.
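The two conversions above can be sketched as a pair of lookup tables. The mappings are taken directly from the text; the helper name is illustrative.

```python
# Verdict-to-float conversions as described above; the mapping values
# are from the doc, the function name is an illustrative assumption.

CATEGORICAL = {"excellent": 1.0, "acceptable": 0.6, "poor": 0.0}
BINARY = {"yes": 1.0, "no": 0.0}

def verdict_to_float(verdict: str) -> float:
    """Map a judge verdict (categorical or binary) onto [0, 1]."""
    key = verdict.strip().lower()
    table = CATEGORICAL if key in CATEGORICAL else BINARY
    return table[key]

# The deliberate 0.4 gap between "acceptable" and "excellent" is what
# pushes the optimizer past merely acceptable responses.
assert abs(verdict_to_float("excellent")
           - verdict_to_float("acceptable") - 0.4) < 1e-12
```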
### How GEPA uses evaluation feedback
@@ -430,15 +430,16 @@ Runs a real Claude Code agent and adds tool-call scoring:
| Component | Weight |
|-----------|--------|
-| Content quality | 20% |
-| Skill effectiveness | 20% |
-| Tool call correctness | 20% |
-| Behavioral compliance | 15% |
-| Execution success | 10% |
-| Tool call efficiency | 10% |
+| Effectiveness delta | 20% |
+| Correctness | 20% |
+| Completeness | 15% |
+| Guideline adherence | 15% |
+| Assertion coverage | 15% |
+| Execution success | 5% |
| Token efficiency | 5% |
+| Regression penalty | -5% |
-The agent evaluator also uses `assertions.py` for structured `Missing_Facts`/`Missing_Patterns` feedback. Tool-call judges use MLflow's `ToolCallCorrectness`/`ToolCallEfficiency` when available, falling back to deterministic trace scorers.
+The agent evaluator uses the same focused field-based judges as the proxy evaluator, plus `assertions.py` for structured `Missing_Facts`/`Missing_Patterns` feedback and deterministic trace scorers for behavioral compliance.
---
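A minimal sketch of how the revised weights might combine, assuming a simple weighted sum. The weight values and the regression penalty come from the table in this hunk; the component names and the aggregation itself are illustrative assumptions. Note that the positive weights shown total 95% before the penalty is applied.

```python
# Sketch of the agent evaluator's weighted total using the revised
# weights from the table above; component names are illustrative and
# the weighted-sum aggregation is an assumption, not confirmed code.

WEIGHTS = {
    "effectiveness_delta": 0.20,
    "correctness": 0.20,
    "completeness": 0.15,
    "guideline_adherence": 0.15,
    "assertion_coverage": 0.15,
    "execution_success": 0.05,
    "token_efficiency": 0.05,
    "regression_penalty": -0.05,  # subtracts when regressions are found
}

def total_score(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * components.get(name, 0.0)
               for name in WEIGHTS)
```

Under this reading, a flawless run with no regressions would score 0.95, and a regression can only pull the total down, never contribute positively.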
@@ -470,8 +471,6 @@ The agent evaluator also uses `assertions.py` for structured `Missing_Facts`/`Mi