`skills/eval-driven-dev/references/3-define-evaluators.md`
For each eval criterion, choose an evaluator using this decision order:
1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog.
2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 6, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.
**Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.
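To make the split concrete: a manual custom evaluator is nothing more than a deterministic function over the output. A minimal sketch follows, in which the function name, the invoice fields, and the pass/fail return shape are all illustrative assumptions, not part of any pixie API:

```python
import json
import re


def check_invoice_structure(output: str) -> dict:
    """Mechanical, deterministic checks only -- no judgment calls.

    Hypothetical example: validate that a JSON output has required
    fields, a well-formed date string, and a non-negative total.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}

    # Field existence: structural, always has a right answer.
    missing = [f for f in ("invoice_id", "date", "total") if f not in data]
    if missing:
        return {"passed": False, "reason": f"missing fields: {missing}"}

    # Regex pattern matching: checks the YYYY-MM-DD shape only.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(data["date"])):
        return {"passed": False, "reason": "date not in YYYY-MM-DD form"}

    # Numeric threshold and type check.
    if not isinstance(data["total"], (int, float)) or data["total"] < 0:
        return {"passed": False, "reason": "total must be a non-negative number"}

    return {"passed": True, "reason": "all structural checks passed"}
```

Note what this function cannot decide: whether the invoice *values* are actually correct for the source document. That is a semantic question, so it stays with an agent evaluator.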
If any criterion requires a custom evaluator, implement it now.
### Agent evaluators (`create_agent_evaluator`) — the default
Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 6, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.
Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.
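That extraction step can be sketched as a small helper. The event shape assumed here, one JSON object per line carrying `purpose` and `value` keys, is an illustrative guess rather than the documented trace schema; `pixie trace filter <file.jsonl> --purpose input` performs the equivalent filtering from the CLI:

```python
import json


def extract_eval_input(trace_lines):
    """Collect values from wrap events tagged purpose="input".

    Assumes each trace line is a JSON object with "purpose" and
    "value" keys -- a hypothetical shape, not the documented schema.
    """
    values = []
    for line in trace_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        event = json.loads(line)
        if event.get("purpose") == "input":
            values.append(event["value"])
    return values
```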
Before writing the final dataset JSON, perform this self-audit:
2. **Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.
3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).
   - **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine.
   - **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.
4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry.
   - **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3.
   - If coverage is below threshold, add entries targeting the uncovered capabilities.
| `pixie trace --runnable <filepath:ClassName> --input <kwargs.json> --output <file.jsonl>` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). |
| `pixie format --input <trace.jsonl> --output <dataset_entry.json>` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). |
| `pixie trace filter <file.jsonl> --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. |