
Commit a183359

token usage reduction to avoid excessive LLM spend (#361)

* token usage reduction to avoid excessive LLM spend
* linting

1 parent 17b80c7 commit a183359

File tree

10 files changed: +454 -856 lines changed

.test/CUSTOMIZATION_GUIDE.md

Lines changed: 1 addition & 1 deletion
@@ -115,7 +115,7 @@ expected_patterns:
 
 **What it is:** Natural-language evaluation criteria passed to the LLM judge. The judge scores how well the response follows each guideline.
 
-**How it steers optimization:** Guidelines are the most flexible steering mechanism. They influence the quality score (30% of total) and effectiveness score (40% of total).
+**How it steers optimization:** Guidelines are the most flexible steering mechanism. They influence the guideline adherence score (15% of total) and the quality composite (20% of total, which averages correctness + completeness + guideline adherence).
 
 **Example focus prompt:** `"Must parameterize catalog names with a prefix variable"`

.test/README.md

Lines changed: 11 additions & 12 deletions
@@ -335,7 +335,7 @@ eval-criteria/
 
 ### How it works
 
-Judges receive a lightweight listing of available criteria in their system prompt. When a criteria's description matches the trace being evaluated, the judge calls `read_eval_criteria` to load the full rubric and `read_eval_reference` for detailed reference material. This keeps judge prompts small while giving access to deep domain knowledge.
+`discover_skill_paths()` in `judges.py` scans `.test/eval-criteria/` for subdirectories containing a `SKILL.md` file, filtering by `applies_to` metadata against the skill's `tool_modules`. The discovered paths are passed to `make_judge(skills=[...])` when MLflow supports the native `skills=` parameter, enabling on-demand loading of domain-specific rubrics during scoring.
 
 ### `applies_to` filtering
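The discovery step described in the added line could look roughly like the sketch below. Only the function name `discover_skill_paths`, the `.test/eval-criteria/` layout, `SKILL.md`, `applies_to`, and `tool_modules` come from the diff; the metadata format and parsing logic here are assumptions for illustration.

```python
from pathlib import Path


def _parse_applies_to(skill_md_text: str) -> list:
    # Hypothetical metadata parser: looks for an "applies_to: a, b" line
    # in SKILL.md. The real metadata format is not shown in the diff.
    for line in skill_md_text.splitlines():
        if line.strip().startswith("applies_to:"):
            value = line.split(":", 1)[1]
            return [t.strip() for t in value.split(",") if t.strip()]
    return []  # no applies_to -> treat the skill as applying everywhere


def discover_skill_paths(root: str, tool_modules: list) -> list:
    """Scan root for subdirectories containing a SKILL.md file, keeping
    those whose applies_to metadata matches one of the tool_modules."""
    matches = []
    for skill_md in sorted(Path(root).glob("*/SKILL.md")):
        applies_to = _parse_applies_to(skill_md.read_text())
        if not applies_to or any(m in applies_to for m in tool_modules):
            matches.append(str(skill_md.parent))
    return matches
```

The returned directory paths would then be handed to the judge factory (e.g. `make_judge(skills=[...])` per the diff) so rubrics load on demand rather than inflating every judge prompt.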

@@ -376,7 +376,7 @@ Each candidate skill is evaluated per-task using a WITH vs WITHOUT comparison:
 
 1. **Generate WITH-skill response** — LLM generates with SKILL.md in context
 2. **Generate WITHOUT-skill response** — LLM generates without skill (cached)
-3. **Three focused judges** — each returns categorical `"excellent"` / `"acceptable"` / `"poor"` verdicts:
+3. **Three focused field-based judges** — each makes 1 LLM call and returns binary `"yes"` / `"no"` verdicts:
    - **Correctness judge** (WITH + WITHOUT) — facts, API references, code syntax accuracy
    - **Completeness judge** (WITH + WITHOUT) — all parts addressed, expected info present
    - **Guideline adherence judge** (WITH only) — Databricks-specific patterns and practices
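The WITH vs WITHOUT comparison above can be sketched as follows. `generate` and the entries of `judges` are hypothetical callables standing in for the real LLM calls; each judge issues one call and returns `"yes"` or `"no"`.

```python
def evaluate_candidate(task, generate, judges, skill_md):
    """Per-task WITH vs WITHOUT comparison using three focused judges.

    Illustrative only: the real generation/caching and judge plumbing
    live elsewhere in the repo and are not shown in this diff.
    """
    with_resp = generate(task, skill=skill_md)
    without_resp = generate(task, skill=None)  # cached across candidates in practice
    return {
        # Judged on both responses so the optimizer can measure the delta.
        "correctness": (judges["correctness"](task, with_resp),
                        judges["correctness"](task, without_resp)),
        "completeness": (judges["completeness"](task, with_resp),
                         judges["completeness"](task, without_resp)),
        # Only meaningful for the WITH-skill response.
        "guideline_adherence": judges["guideline_adherence"](task, with_resp),
    }
```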
@@ -397,7 +397,7 @@ Each candidate skill is evaluated per-task using a WITH vs WITHOUT comparison:
 | Structure | 5% | Syntax validation (Python, SQL, no hallucinated APIs) |
 | Regression penalty | -10% | Explicit penalty when regression_judge detects harm |
 
-**Categorical-to-float conversion:** `excellent=1.0`, `acceptable=0.6`, `poor=0.0`. The nonlinear scale incentivizes GEPA to push from "acceptable" to "excellent" (0.4 gap).
+**Binary-to-float conversion:** `yes=1.0`, `no=0.0`. Binary verdicts produce more reliable, consistent judgments than categorical or continuous scales.
 
 ### How GEPA uses evaluation feedback
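The new binary conversion, combined with the quality-composite description from the CUSTOMIZATION_GUIDE hunk earlier in this commit, can be written out as a few lines. The `yes=1.0` / `no=0.0` mapping and the averaging of correctness + completeness + guideline adherence are stated in the diff; the function names are illustrative.

```python
def verdict_to_float(verdict: str) -> float:
    # Binary mapping from the diff: yes=1.0, no=0.0.
    return {"yes": 1.0, "no": 0.0}[verdict.strip().lower()]


def quality_composite(correctness: str, completeness: str, adherence: str) -> float:
    # The quality composite (20% of the total score) averages the three
    # binary verdicts, per the CUSTOMIZATION_GUIDE change in this commit.
    verdicts = (correctness, completeness, adherence)
    return sum(verdict_to_float(v) for v in verdicts) / len(verdicts)
```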

@@ -430,15 +430,16 @@ Runs a real Claude Code agent and adds tool-call scoring:
 
 | Component | Weight |
 |-----------|--------|
-| Content quality | 20% |
-| Skill effectiveness | 20% |
-| Tool call correctness | 20% |
-| Behavioral compliance | 15% |
-| Execution success | 10% |
-| Tool call efficiency | 10% |
+| Effectiveness delta | 20% |
+| Correctness | 20% |
+| Completeness | 15% |
+| Guideline adherence | 15% |
+| Assertion coverage | 15% |
+| Execution success | 5% |
 | Token efficiency | 5% |
+| Regression penalty | -5% |
 
-The agent evaluator also uses `assertions.py` for structured `Missing_Facts`/`Missing_Patterns` feedback. Tool-call judges use MLflow's `ToolCallCorrectness`/`ToolCallEfficiency` when available, falling back to deterministic trace scorers.
+The agent evaluator uses the same focused field-based judges as the proxy evaluator, plus `assertions.py` for structured `Missing_Facts`/`Missing_Patterns` feedback and deterministic trace scorers for behavioral compliance.
 
 ---
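The updated weight table can be checked with a small aggregation sketch. The weights come from the table above (note they sum to 0.95 before the -5% regression penalty); how the implementation actually combines, normalizes, or clamps them is not shown in the diff, so the weighted sum below is an assumption.

```python
# Weights from the updated agent-evaluator table in this commit.
WEIGHTS = {
    "effectiveness_delta": 0.20,
    "correctness": 0.20,
    "completeness": 0.15,
    "guideline_adherence": 0.15,
    "assertion_coverage": 0.15,
    "execution_success": 0.05,
    "token_efficiency": 0.05,
}
REGRESSION_PENALTY = 0.05  # the -5% row, applied when a regression is detected


def agent_score(components: dict, regressed: bool) -> float:
    """Illustrative aggregation: weighted sum of per-component scores
    in [0, 1], minus the penalty, floored at zero."""
    total = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    if regressed:
        total -= REGRESSION_PENALTY
    return max(0.0, total)
```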

@@ -470,8 +471,6 @@ The agent evaluator also uses `assertions.py` for structured `Missing_Facts`/`Mi
 │ ├── assertions.py # Deterministic fact/pattern assertions (zero LLM cost)
 │ ├── assessment_fetcher.py # MLflow assessment injection
 │ ├── judges.py # MLflow quality judge factory + fallback chain
-│ ├── eval_criteria.py # Eval criteria discovery + SKILL.md parser
-│ ├── judge_tools.py # MLflow JudgeTool registration for criteria
 │ ├── config.py # Presets, model registration
 │ ├── splitter.py # Train/val dataset splitting
 │ ├── tools.py # MCP tool description extraction
