
Commit fca9a9c

address comments.
1 parent 32f91ac commit fca9a9c

File tree

5 files changed: +14 additions, −12 deletions


skills/eval-driven-dev/references/2c-capture-and-verify-trace.md

Lines changed: 1 addition & 1 deletion

@@ -87,7 +87,7 @@ Check that:
 Run `pixie format` to see the data in dataset-entry format:

 ```bash
-uv run pixie format <trace-file.jsonl>
+pixie format --input trace.jsonl --output dataset_entry.json
 ```

 The output shows:

skills/eval-driven-dev/references/3-define-evaluators.md

Lines changed: 4 additions & 4 deletions

@@ -11,7 +11,7 @@
 For each eval criterion, choose an evaluator using this decision order:

 1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog.
-2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
+2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 6, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
 3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.

 **Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.

@@ -26,7 +26,7 @@ If any criterion requires a custom evaluator, implement it now. Place custom eva

 ### Agent evaluators (`create_agent_evaluator`) — the default

-Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 5d, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.
+Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 6, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.

 ```python
 from pixie import create_agent_evaluator

@@ -56,9 +56,9 @@ schema_compliance = create_agent_evaluator(

 Reference agent evaluators in the dataset via `filepath:callable_name` (e.g., `"pixie_qa/evaluators.py:extraction_accuracy"`).

-During `pixie test`, agent evaluators show as `` in the console. They are graded in Step 5d.
+During `pixie test`, agent evaluators show as `` in the console. They are graded in Step 6.

-**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 5d. Make it specific and actionable:
+**Writing effective criteria**: The `criteria` string is the grading rubric you'll follow in Step 6. Make it specific and actionable:

 - **Bad**: "Check if the output is good" — too vague to grade consistently
 - **Bad**: "The response should be accurate" — doesn't say what to compare against
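
The `filepath:callable_name` convention referenced in the hunk above can be illustrated with a short sketch. `split_evaluator_ref` is a hypothetical helper, not part of pixie's API; it assumes a plain `path:attribute` string with a single separating colon.

```python
from pathlib import Path


def split_evaluator_ref(ref: str) -> tuple[Path, str]:
    """Split a 'filepath:callable_name' evaluator reference.

    Hypothetical helper for illustration only, not part of pixie's API.
    Assumes one colon separates the file path from the callable name.
    """
    path_str, sep, callable_name = ref.rpartition(":")
    if not sep or not path_str or not callable_name:
        raise ValueError(f"expected 'filepath:callable_name', got {ref!r}")
    return Path(path_str), callable_name


# The example reference from the diff above:
path, name = split_evaluator_ref("pixie_qa/evaluators.py:extraction_accuracy")
# path -> Path("pixie_qa/evaluators.py"), name -> "extraction_accuracy"
```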

skills/eval-driven-dev/references/4-build-dataset.md

Lines changed: 3 additions & 1 deletion

@@ -135,7 +135,7 @@ Then include the captured content in the entry's `eval_input`:
 For each set of `input_data`, run `pixie trace` to execute the app with real dependencies and capture all values:

 ```bash
-uv run pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input '{"prompt": "...", "source": "..."}'
+pixie trace --runnable pixie_qa/run_app.py:AppRunnable --input trace-input.json
 ```

 Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.

@@ -213,10 +213,12 @@ Before writing the final dataset JSON, perform this self-audit:
 2. **Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.

 3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).
+
 - **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine.
 - **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.

 4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry.
+
 - **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3.
 - If coverage is below threshold, add entries targeting the uncovered capabilities.
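
The two mandatory thresholds in the hunk above (at most 60% routine entries, at least 50% capability coverage) are mechanical enough to sanity-check in code. A minimal sketch, assuming each entry carries a difficulty label and the dataset is annotated with the capabilities it exercises; `audit_dataset` is a hypothetical helper, not part of the skill's tooling.

```python
def audit_dataset(difficulties: list[str],
                  covered: set[str],
                  listed: set[str]) -> list[str]:
    """Check the two mandatory self-audit thresholds.

    Hypothetical helper for illustration, not part of pixie or the skill.
    `difficulties`: one label per entry ("routine"/"moderate"/"challenging").
    `covered`/`listed`: capabilities exercised vs. listed in the analysis.
    """
    problems = []
    routine = sum(1 for d in difficulties if d == "routine")
    if routine > 0.6 * len(difficulties):
        problems.append("more than 60% routine entries")
    if "challenging" not in difficulties:
        problems.append("no challenging entry")
    if len(covered & listed) < 0.5 * len(listed):
        problems.append("covers fewer than 50% of listed capabilities")
    return problems


# The worked numbers from the audit: 5 entries with 3 routine (the maximum
# allowed), one challenging entry, and 3 of 6 capabilities covered (the floor).
ok = audit_dataset(
    ["routine", "routine", "routine", "moderate", "challenging"],
    covered={"a", "b", "c"},
    listed={"a", "b", "c", "d", "e", "f"},
)
# ok == [] : all thresholds met
```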

skills/eval-driven-dev/references/evaluators.md

Lines changed: 1 addition & 1 deletion

@@ -542,7 +542,7 @@ Create an evaluator whose grading is deferred to a coding agent.
 During `pixie test`, agent evaluators are not scored automatically.
 Instead, they raise `AgentEvaluationPending` and record a
 `PendingEvaluation` with the evaluation criteria. The coding agent
-(guided by Step 5d) reviews each entry's trace and output, then
+(guided by Step 6) reviews each entry's trace and output, then
 grades the pending evaluations.

 **When to use**: Quality dimensions that require holistic review of
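
The deferred-grading flow this hunk describes can be sketched in miniature. `AgentEvaluationPending` and `PendingEvaluation` are pixie's names, but the definitions below are illustrative stand-ins, not the library's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


class AgentEvaluationPending(Exception):
    """Stand-in for pixie's exception: signals grading is deferred."""


@dataclass
class PendingEvaluation:
    """Stand-in for pixie's record of a deferred evaluation."""
    entry_id: str
    criteria: str
    score: Optional[float] = None  # filled in later by the coding agent


pending: list[PendingEvaluation] = []


def run_agent_evaluator(entry_id: str, criteria: str) -> None:
    # During `pixie test`, the evaluator is not scored automatically:
    # it records a PendingEvaluation and raises.
    pending.append(PendingEvaluation(entry_id, criteria))
    raise AgentEvaluationPending


try:
    run_agent_evaluator("entry-1", "Did the extraction capture the source?")
except AgentEvaluationPending:
    pass

# Later, in Step 6, the coding agent reviews the trace and grades it:
pending[0].score = 1.0
```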

skills/eval-driven-dev/references/wrap-api.md

Lines changed: 5 additions & 5 deletions

@@ -21,11 +21,11 @@ processing pipeline. Its behavior depends on the active mode:

 ## CLI Commands

-| Command | Description |
-| ----------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
-| `pixie trace --runnable <filepath:ClassName> --input <kwargs.json> --output <file.jsonl>` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). |
-| `pixie format <file.jsonl>` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). |
-| `pixie trace filter <file.jsonl> --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. |
+| Command | Description |
+| ----------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- |
+| `pixie trace --runnable <filepath:ClassName> --input <kwargs.json> --output <file.jsonl>` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). |
+| `pixie format --input <trace.jsonl> --output <dataset_entry.json>` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). |
+| `pixie trace filter <file.jsonl> --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. |

 ---

0 commit comments
