`skills/eval-driven-dev/references/3-define-evaluators.md`
For each eval criterion, choose an evaluator using this decision order:
1. **Built-in evaluator** — if a standard evaluator fits the criterion (factual correctness → `Factuality`, exact match → `ExactMatch`, RAG faithfulness → `Faithfulness`). See `evaluators.md` for the full catalog.
2. **Agent evaluator** (`create_agent_evaluator`) — **the default for all semantic, qualitative, and app-specific criteria**. Agent evaluators are graded by you (the coding agent) in Step 6, where you review each entry's trace and output holistically. This is far more effective than automated scoring for criteria like "Did the extraction accurately capture the source content?", "Are there hallucinated values?", or "Did the app handle noisy input gracefully?"
3. **Manual custom evaluator** — ONLY for **mechanical, deterministic checks** where a programmatic function is definitively correct: field existence, regex pattern matching, JSON schema validation, numeric thresholds, type checking. **Never use manual custom evaluators for semantic quality** — if the check requires _judgment_ about whether content is correct, relevant, or complete, use an agent evaluator instead.
**Distinguish structural from semantic criteria**: For each criterion, ask: "Can this be checked with a simple programmatic rule that always gives the right answer?" If yes → manual custom evaluator. If no → agent evaluator. Most app-specific quality criteria are semantic, not structural.
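To make the split concrete: a manual custom evaluator is nothing more than a deterministic function over the output. A minimal sketch follows, in which the function name, the invoice fields, and the pass/fail return shape are all illustrative assumptions, not part of any pixie API:

```python
import json
import re


def check_invoice_structure(output: str) -> dict:
    """Mechanical, deterministic checks only -- no judgment calls.

    Hypothetical example: validate that a JSON output has required
    fields, a well-formed date string, and a non-negative total.
    """
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return {"passed": False, "reason": "output is not valid JSON"}

    # Field existence: structural, always has a right answer.
    missing = [f for f in ("invoice_id", "date", "total") if f not in data]
    if missing:
        return {"passed": False, "reason": f"missing fields: {missing}"}

    # Regex pattern matching: checks the YYYY-MM-DD shape only.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(data["date"])):
        return {"passed": False, "reason": "date not in YYYY-MM-DD form"}

    # Numeric threshold and type check.
    if not isinstance(data["total"], (int, float)) or data["total"] < 0:
        return {"passed": False, "reason": "total must be a non-negative number"}

    return {"passed": True, "reason": "all structural checks passed"}
```

Note what this function cannot decide: whether the invoice *values* are actually correct for the source document. That is a semantic question, so it stays with an agent evaluator.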
If any criterion requires a custom evaluator, implement it now.
### Agent evaluators (`create_agent_evaluator`) — the default
Use agent evaluators for **all semantic, qualitative, and judgment-based criteria**. These are graded by you (the coding agent) in Step 6, where you review each entry's trace and output with full context — far more effective than any automated approach for quality dimensions like accuracy, completeness, hallucination detection, or error handling.
Then extract the `purpose="input"` values from the resulting trace and use them as `eval_input`.
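That extraction step can be sketched as a small helper. The event shape assumed here, one JSON object per line carrying `purpose` and `value` keys, is an illustrative guess rather than the documented trace schema; `pixie trace filter <file.jsonl> --purpose input` performs the equivalent filtering from the CLI:

```python
import json


def extract_eval_input(trace_lines):
    """Collect values from wrap events tagged purpose="input".

    Assumes each trace line is a JSON object with "purpose" and
    "value" keys -- a hypothetical shape, not the documented schema.
    """
    values = []
    for line in trace_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the JSONL file
        event = json.loads(line)
        if event.get("purpose") == "input":
            values.append(event["value"])
    return values
```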
Before writing the final dataset JSON, perform this self-audit:
2. **Count distinct sources**: How many unique `eval_input` data sources are in the dataset? If more than 50% of entries share the same `eval_input` content (even with different prompts), the dataset lacks diversity. Prompt variations on the same input test the LLM's interpretation, not the app's data processing.
3. **Difficulty distribution (mandatory threshold)**: For each entry, label it as "routine" (confident it will pass), "moderate" (likely passes but non-trivial), or "challenging" (genuinely uncertain or targeting a known failure mode).
   - **Maximum 60% "routine" entries.** If you have 5 entries, at most 3 can be routine.
   - **At least one "challenging" entry** that targets a failure mode from `00-project-analysis.md` where you are genuinely uncertain about the outcome. If every entry is a guaranteed pass, the dataset cannot distinguish a good app from a broken one.
4. **Capability coverage (mandatory threshold)**: Count how many capabilities from `00-project-analysis.md` are exercised by at least one dataset entry.
   - **Must cover ≥50% of listed capabilities.** If the analysis lists 6 capabilities, the dataset must exercise at least 3.
   - If coverage is below threshold, add entries targeting the uncovered capabilities.
| `pixie trace --runnable <filepath:ClassName> --input <kwargs.json> --output <file.jsonl>` | Run the Runnable once with kwargs from the JSON file and write a trace file. `--input` is a **file path** (not inline JSON). |
| `pixie format --input <trace.jsonl> --output <dataset_entry.json>` | Convert a trace file to a formatted dataset entry template. Shows `input_data`, `eval_input`, and `eval_output` (the real captured output). |
| `pixie trace filter <file.jsonl> --purpose input` | Print only wrap events matching the given purposes. Outputs one JSON line per matching event. |