diff --git a/docs/README.skills.md b/docs/README.skills.md
index e3fd29872..316d77d05 100644
--- a/docs/README.skills.md
+++ b/docs/README.skills.md
@@ -129,7 +129,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md` `references/eval-tests.md` `references/instrumentation.md` `references/investigation.md` `references/pixie-api.md` `references/run-harness-patterns.md` `references/understanding-app.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md` `references/1-b-data-flow.md` `references/1-c-eval-criteria.md` `references/2-instrument-and-observe.md` `references/3-run-harness.md` `references/4-define-evaluators.md` `references/5-build-dataset.md` `references/6-run-tests.md` `references/7-investigation.md` `references/evaluators.md` `references/instrumentation-api.md` `references/run-harness-examples` `references/testing-api.md` `resources` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md` `references/excalidraw-schema.md` `scripts/.gitignore` `scripts/README.md` `scripts/add-arrow.py` `scripts/add-icon-to-diagram.py` `scripts/split-excalidraw-library.py` `templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md` `references/pyspark.md` |
| [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |
diff --git a/skills/eval-driven-dev/SKILL.md b/skills/eval-driven-dev/SKILL.md
index 498bca26b..2dff34e49 100644
--- a/skills/eval-driven-dev/SKILL.md
+++ b/skills/eval-driven-dev/SKILL.md
@@ -8,7 +8,9 @@ description: >
license: MIT
compatibility: Python 3.11+
metadata:
- version: 0.2.0
+ version: 0.4.0
+ pixie-qa-version: ">=0.4.0,<0.5.0"
+ pixie-qa-source: https://github.com/yiouli/pixie-qa/
---
# Eval-Driven Development for Python LLM Applications
@@ -27,352 +29,154 @@ This skill is about doing the work, not describing it. Read code, edit files, ru
## Before you start
-Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
-
-**Update the skill:**
+**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run setup:
```bash
-npx skills update
+bash resources/setup.sh
```
-**Upgrade the `pixie-qa` package**
-
-Make sure the python virtual environment is active and use the project's package manager:
-
-```bash
-# uv project (uv.lock exists):
-uv add pixie-qa --upgrade
-
-# poetry project (poetry.lock exists):
-poetry add pixie-qa@latest
-
-# pip / no lock file:
-pip install --upgrade pixie-qa
-```
+The script updates the `eval-driven-dev` skill and the `pixie-qa` Python package to their latest versions, and initializes the pixie working directory if it isn't already initialized. If the skill or package update fails, continue; do not let these failures block the rest of the workflow.
---
## The workflow
-Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
+Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
-**Two modes:**
+**How to work — read this before doing anything else:**
-- **Setup** ("set up evals", "add tests", "set up QA"): Complete Steps 1–5. After the test run, report results and ask whether to iterate.
-- **Iteration** ("fix", "improve", "debug"): Complete Steps 1–5 if not already done, then do one round of Step 6.
+- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–7 while working on Step 1.
+- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.
+- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
+- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.
-If ambiguous: default to setup.
+**Run the steps in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.
---
### Step 1: Understand the app and define eval criteria
-Read the source code to understand:
-
-1. **How it runs** — entry point, startup, config/env vars
-2. **The real entry point** — how a real user invokes the app (HTTP endpoint, CLI, function call). This is what the eval must exercise — not an inner function that bypasses the request pipeline.
-3. **The request pipeline** — trace the full path from entry point to response. What middleware, routing, state management, prompt assembly, retrieval, or formatting happens along the way? All of this is under test.
-4. **External dependencies (both directions)** — identify every external system the app talks to (databases, APIs, caches, queues, file systems, speech services). For each, understand:
- - **Data flowing IN** (external → app): what data does the app read from this system? What shapes, types, realistic values? You'll make up this data for eval scenarios.
- - **Data flowing OUT** (app → external): what does the app write, send, or mutate in this system? These are side-effects that evaluations may need to verify (e.g., "did the app create the right calendar entry?", "did it send the correct transfer request?").
- - **How to mock it** — look for abstract base classes, protocols, or constructor-injected backends (e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
-5. **Use cases** — distinct scenarios, what good/bad output looks like
-
-Read `references/understanding-app.md` for detailed guidance on mapping data flows and the MEMORY.md template.
-
-Write your findings to `pixie_qa/MEMORY.md` before moving on. Include:
-
-- The entry point and the full request pipeline
-- Every external dependency, what it provides/receives, and how you'll mock it
-- The testability seams (pluggable interfaces, patchable module-level objects)
+**First, check the user's prompt for specific requirements.** Before reading app code, examine what the user asked for:
-Determine **high-level, application-specific eval criteria**:
-
-**Good criteria are specific to the app's purpose.** Examples:
-
-- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation (under 3 sentences)?", "Does the agent route to the correct department based on the caller's request?"
-- Research report generator: "Does the report address all sub-questions in the query?", "Are claims supported by the retrieved sources?", "Is the report structured with clear sections?"
-- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when the context doesn't contain the answer?"
-
-**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
-
-At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
-
-Record the criteria in `pixie_qa/MEMORY.md` and continue.
-
-> **Checkpoint**: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
-
----
-
-### Step 2: Instrument and observe a real run
+- **Referenced documents or specs**: Does the prompt mention a file to follow (e.g., "follow the spec in EVAL_SPEC.md", "use the methodology in REQUIREMENTS.md")? If so, **read that file first** — it may specify datasets, evaluation dimensions, pass criteria, or methodology that override your defaults.
+- **Specified datasets or data sources**: Does the prompt reference specific data files (e.g., "use questions from eval_inputs/research_questions.json", "use the scenarios in call_scenarios.json")? If so, **read those files** — you must use them as the basis for your eval dataset, not fabricate generic alternatives.
+- **Specified evaluation dimensions**: Does the prompt name specific quality aspects to evaluate (e.g., "evaluate on factuality, completeness, and bias", "test identity verification and tool call correctness")? If so, **every named dimension must have a corresponding evaluator** in your test file.
-**Why this step**: You need to see the actual data flowing through the app before you can build anything. This step serves two goals:
+If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.
-1. **Learn the data shapes** — what data flows in from external dependencies, and what side-effects flow out? What types, structures, realistic values? You'll need to make up this data for eval scenarios later.
-2. **Verify instrumentation captures what evaluators need** — do the traces contain the data required to assess each eval criterion from Step 1? If a criterion is "does the agent route to the correct department," the trace must capture the routing decision.
+Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**
-**This is a normal app run with instrumentation — no mocks, no patches.**
+#### Sub-step 1a: Entry point & execution flow
-#### 2a. Decide what to instrument
+> **Reference**: Read `references/1-a-entry-point.md` now.
-This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
+Read the source code to understand how the app starts and how a real user invokes it. Write your findings to `pixie_qa/01-entry-point.md` before moving on.
-- **For each eval criterion**, ask: what observable data would prove this criterion is met or violated?
-- **Map that data to code locations** — which functions produce, consume, or transform that data?
-- **Those functions need `@observe`** — so their inputs and outputs are captured in traces.
+> **Checkpoint**: `pixie_qa/01-entry-point.md` written with entry point, execution flow, user-facing interface, and env requirements.
-Examples:
+#### Sub-step 1b: Processing stack & data flow (DAG artifact)
-| Eval criterion | Data needed | What to instrument |
-| ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------------------ |
-| "Routes to correct department" | The routing decision (which department was chosen) | The routing/dispatch function |
-| "Responses grounded in retrieved context" | The retrieved documents + the final response | The retrieval function AND the response function |
-| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
-| "Concise phone-friendly responses" | The final response text | The function that produces the LLM response |
+> **Reference**: Read `references/1-b-data-flow.md` now.
-**LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured** — `enable_storage()` activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need `@observe` on the function that calls `client.chat.completions.create()` just to see the LLM interaction.
+Starting from the entry point you documented, trace the full processing stack and produce a **structured DAG JSON file** at `pixie_qa/02-data-flow.json`. Its root is the common ancestor of the app's LLM calls; its nodes capture every data dependency, intermediate state, LLM call, and side-effect, each with metadata and parent pointers.
-**Use `@observe` for application-level functions** whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
-
-`enable_storage()` goes at application startup. Read `references/instrumentation.md` for the full rules, code patterns, and anti-patterns for adding instrumentation.
-
-#### 2b. Add instrumentation and run the app
-
-Add `@observe` to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a **reference trace**. Do NOT mock or patch anything. This is an observation run.
-
-If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running — a local Docker container, a test account, or ask the user for help. The goal is one real trace.
+After writing the JSON, validate it:
```bash
-uv run pixie trace list
-uv run pixie trace last
+uv run pixie dag validate pixie_qa/02-data-flow.json
```
-#### 2c. Examine the reference trace
+This checks the DAG structure, verifies code pointers exist, and generates a Mermaid diagram at `pixie_qa/02-data-flow.md`. If validation fails, fix the errors and re-run.
-Study the trace data carefully. This is your blueprint for everything that follows. Document:
+> **Checkpoint**: `pixie_qa/02-data-flow.json` written and `pixie dag validate` passes. Mermaid diagram generated at `pixie_qa/02-data-flow.md`.
+>
+> **Schema reminder**: DAG node `name` must be unique, meaningful, and lower_snake_case (for example, `handle_turn`). If a node represents an LLM provider call, set `is_llm_call: true` (otherwise omit it or set `false`). Name-matching rules for `@observe` / `start_observation(...)` are defined in instrumentation guidance (Step 2), not here.
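The two naming rules in the schema reminder can be sketched as a standalone check. This is illustrative only: `pixie dag validate` is the authoritative checker, and any field beyond `name`, `parent`, and `is_llm_call` here is an assumption, not the pixie-qa schema.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def check_node_names(nodes: list[dict]) -> list[str]:
    """Return human-readable problems for the naming rules stated above."""
    problems, seen = [], set()
    for node in nodes:
        name = node.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"node name {name!r} is not lower_snake_case")
        if name in seen:
            problems.append(f"duplicate node name {name!r}")
        seen.add(name)
        # is_llm_call is optional, but when present it must be a boolean
        if not isinstance(node.get("is_llm_call", False), bool):
            problems.append(f"{name!r}: is_llm_call must be a boolean")
    return problems

nodes = [
    {"name": "handle_turn", "parent": None},
    {"name": "generate_reply", "parent": "handle_turn", "is_llm_call": True},
]
assert check_node_names(nodes) == []
```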
-1. **Data from external dependencies (inbound)** — What did the app read from databases, APIs, caches? What are the shapes, types, and realistic value ranges? This is what you'll make up in eval_input for the dataset.
-2. **Side-effects (outbound)** — What did the app write to, send to, or mutate in external systems? These need to be captured by mocks and may be part of eval_output for verification.
-3. **Intermediate states** — What did the instrumentation capture beyond the final output? Tool calls, retrieved documents, routing decisions? Are these sufficient to evaluate every criterion from Step 1?
-4. **The eval_input / eval_output structure** — What does the `@observe`-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
+#### Sub-step 1c: Eval criteria
-**Check instrumentation completeness**: For each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more `@observe` decorators and re-run.
+> **Reference**: Read `references/1-c-eval-criteria.md` now.
-**Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.**
+Define the app's use cases and eval criteria. Use cases drive dataset creation (Step 5); eval criteria drive evaluator selection (Step 4). For each criterion, determine whether it applies to all scenarios or only a subset — this drives whether it becomes a dataset-level default evaluator or an item-level evaluator. Write your findings to `pixie_qa/03-eval-criteria.md` before moving on.
-> **Checkpoint**: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
+> **Checkpoint**: `pixie_qa/03-eval-criteria.md` written with use cases (each with a one-liner conveying input + expected behavior), eval criteria with applicability scope, and observability check. Do NOT read Step 2 instructions yet.
---
-### Step 3: Write a utility function to run the full app end-to-end
-
-**Why this step**: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
-
-#### The contract
-
-```
-run_app(eval_input) → eval_output
-```
-
-- **eval_input** = application input (what the user sends) + data from external dependencies (what databases/APIs would return)
-- **eval_output** = application output (what the user sees) + captured side-effects (what the app wrote to external systems, captured by mocks) + captured intermediate states (tool calls, routing decisions, etc., captured by instrumentation)
-
-#### How to implement
-
-1. **Patch external dependencies** — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or `unittest.mock.patch` the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
-
-2. **Call the app through its real entry point** — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use `TestClient` or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
-
-3. **Collect the response** — the app's output becomes eval_output, along with any side-effects captured by mock objects.
-
-Read `references/run-harness-patterns.md` for concrete examples of entry point invocation for different app types.
-
-**Do NOT call an inner function** like `agent.respond()` directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
-
-#### Verify
-
-Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify:
+### Step 2: Instrument and observe a real run
-- **Same structure** — same fields present, same types, same nesting
-- **Same code path** — same routing decisions, same intermediate states captured
-- **Sensible values** — eval_output fields have real, meaningful data (not null, not empty, not error messages)
+> **Reference**: Read `references/2-instrument-and-observe.md` now — it has the detailed sub-steps for DAG-based instrumentation, running the app, verifying the trace against the DAG, documenting the reference trace, and the `@observe` and `enable_storage()` rules and patterns.
-**If it fails after two attempts**, stop and ask the user for help.
+Add `@observe` to application-level functions identified in your DAG (`pixie_qa/02-data-flow.json`). Run the app normally (no mocks) to produce a reference trace. Verify the trace with `pixie trace verify`, then validate it matches the DAG with `pixie dag check-trace`. Document the data shapes.
-> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
+> **Checkpoint**: `pixie_qa/04-reference-trace.md` exists with eval_input/eval_output shapes and completeness verification. Instrumentation is in the source code. `pixie dag check-trace` passes. Do NOT read Step 3 instructions yet.
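Conceptually, `@observe`-style instrumentation records a function's inputs and outputs as a trace span. A library-free sketch of the idea (this is NOT the pixie-qa API; the real rules and patterns are in `references/2-instrument-and-observe.md`):

```python
import functools
import time

TRACE: list[dict] = []  # in-memory stand-in for pixie's trace storage

def observe_sketch(fn):
    """Hypothetical stand-in for an @observe-style decorator:
    captures inputs, output, and timing for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.time() - start,
        })
        return result
    return wrapper

@observe_sketch
def route_request(query: str) -> str:
    # A toy routing decision: the kind of intermediate state an
    # evaluator needs to see but an LLM call alone doesn't expose.
    return "billing" if "invoice" in query else "general"

route_request("Where is my invoice?")
assert TRACE[-1]["output"] == "billing"
```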
---
-### Step 4: Build the dataset
-
-**Why this step**: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
-
-#### 4a. Determine verification and expectations
-
-Before generating data, decide how each eval criterion from Step 1 will be checked.
-
-**Examine the reference trace from Step 2** and identify:
-
-- **Structural constraints** you can verify with code — JSON schema, required fields, value types, enum ranges, string length bounds. These become validation checks on your generated eval_inputs.
-- **Semantic constraints** that require judgment — "the mock customer profile should be realistic", "the conversation history should be topically coherent". Apply these yourself when crafting the data.
-- **Which criteria are universal vs. case-specific**:
- - **Universal criteria** apply to ALL test cases the same way → implement in the test function (e.g., "responses must be under 3 sentences", "must not hallucinate information not in context")
- - **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department")
-
-#### 4b. Generate eval_input items
-
-Create eval_input items that match the data shape from the reference trace:
-
-- **Application inputs** (user queries, requests) — make these up to cover the scenarios you identified in Step 1
-- **External dependency data** (database records, API responses, cache entries) — make these up in the exact shape you observed in the reference trace
-
-Each dataset item contains:
-
-- `eval_input`: the made-up input data (app input + external dependency data)
-- `expected_output`: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.
-
-At test time, `eval_output` is produced by the utility function from Step 3 and is not stored in the dataset itself.
-Read `references/dataset-generation.md` for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
-
-#### 4c. Validate the dataset
+### Step 3: Write a utility function to run the full app end-to-end
-After building:
+> **Reference**: Read `references/3-run-harness.md` now — it has the contract, implementation guidance, verification steps, and concrete examples by app type (FastAPI, CLI, standalone function).
-1. **Execute `build_dataset.py`** — don't just write it, run it
-2. **Verify structural constraints** — each eval_input matches the reference trace's schema (same fields, same types)
-3. **Verify diversity** — items have meaningfully different inputs, not just minor variations
-4. **Verify case-specific expectations** — `expected_output` values are specific and testable, not vague
-5. For conversational apps, include items with conversation history
+Write a `run_app(eval_input) → eval_output` function that patches external dependencies, calls the app through its real entry point, and collects the response. Verify it produces the same structure as the reference trace.
-> **Checkpoint**: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
+> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Do NOT read Step 4 instructions yet.
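The `run_app(eval_input) → eval_output` contract above can be sketched with `unittest.mock.patch`. The app module here is hypothetical and inlined so the sketch is self-contained; in a real project you would import the app and patch its real external client, per `references/3-run-harness.md`:

```python
from unittest.mock import patch

# Hypothetical app code (inlined for the sketch).
class CrmClient:
    def get_customer(self, customer_id):
        raise RuntimeError("real backend not available in tests")

crm = CrmClient()

def handle_request(customer_id: str) -> dict:  # the app's real entry point
    customer = crm.get_customer(customer_id)
    return {"reply": f"Hello {customer['name']}"}

def run_app(eval_input: dict) -> dict:
    """Patch the external dependency, call the real entry point,
    return app output plus captured side-effects."""
    captured = {"lookups": []}

    def fake_get_customer(customer_id):
        captured["lookups"].append(customer_id)   # record the side-effect
        return eval_input["crm_customer"]          # data the CRM "would" return

    with patch.object(crm, "get_customer", fake_get_customer):
        response = handle_request(eval_input["customer_id"])

    return {"app_output": response, "side_effects": captured}

out = run_app({"customer_id": "c-1", "crm_customer": {"name": "Ada"}})
assert out["app_output"] == {"reply": "Hello Ada"}
assert out["side_effects"]["lookups"] == ["c-1"]
```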
---
-### Step 5: Write and run eval tests
-
-**Why this step**: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
-
-#### 5a. Map criteria to evaluators
+### Step 4: Define evaluators
-For each eval criterion from Step 1, decide how to evaluate it:
+> **Reference**: Read `references/4-define-evaluators.md` now — it has the sub-steps for mapping criteria to evaluators, implementing custom evaluators, verifying discoverability, and producing the evaluator mapping artifact.
-- **Can it be checked with a built-in evaluator?** (factual correctness → `FactualityEval`, exact match → `ExactMatchEval`, RAG faithfulness → `FaithfulnessEval`)
-- **Does it need a custom evaluator?** Most app-specific criteria do — use `create_llm_evaluator` with a prompt that operationalizes the criterion.
-- **Is it universal or case-specific?** Universal criteria go in the test function. Case-specific criteria use `expected_output` from the dataset.
+Map each eval criterion from Step 1c to a concrete evaluator — implement custom ones where needed. Then produce the evaluator mapping artifact.
-For open-ended LLM text, **never** use `ExactMatchEval` — LLM outputs are non-deterministic.
+> **Checkpoint**: All evaluators implemented. `pixie_qa/05-evaluator-mapping.md` written with criterion-to-evaluator mapping using exact evaluator names (built-in names from `evaluators.md`, custom names in `filepath:callable_name` format). Do NOT read Step 5 instructions yet.
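For universal criteria that are deterministic, a custom evaluator can be plain code rather than an LLM judge. A hypothetical sketch for a criterion like "responses concise enough for a phone conversation" (the actual evaluator API and naming conventions are defined in `references/4-define-evaluators.md`, not here):

```python
import re

def conciseness_eval(eval_output: dict, max_sentences: int = 3) -> float:
    """Hypothetical deterministic evaluator: returns 1.0 if the reply
    has at most max_sentences sentences, else 0.0."""
    text = eval_output.get("reply", "")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 1.0 if len(sentences) <= max_sentences else 0.0

assert conciseness_eval({"reply": "Sure. Done."}) == 1.0
assert conciseness_eval({"reply": "One. Two! Three? Four."}) == 0.0
```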
-`AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
-
-Read `references/eval-tests.md` for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
+---
-#### 5b. Write the test file and run
+### Step 5: Build the dataset
-The test file wires together: a `runnable` (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
+> **Reference**: Read `references/5-build-dataset.md` now — it has the sub-steps for determining expectations, generating eval_input items, building the dataset JSON with the new format (runnable, evaluators, descriptions), and validating with `pixie dataset validate`.
-Read `references/eval-tests.md` for the exact `assert_dataset_pass` API, required parameter names, and common mistakes to avoid. **Re-read the API reference immediately before writing test code** — do not rely on earlier context.
+Create a dataset JSON file with made-up eval_input items that match the data shape from the reference trace. Set the `runnable` to the `filepath:callable_name` reference for the run function from Step 3 (e.g., `"pixie_qa/scripts/run_app.py:run_app"` — file path relative to project root). Assign evaluators based on the eval criteria (Step 1c) and the evaluator mapping (Step 4) — universal criteria become dataset-level defaults, case-specific criteria become item-level evaluators. Add a `description` for each item. Validate with `pixie dataset validate`.
-Run with `pixie test` — not `pytest`:
+> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/<name>.json` with diverse eval_inputs, runnable, evaluators, and descriptions. `pixie dataset validate` passes. Do NOT read Step 6 instructions yet.
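An illustrative dataset shape, built in Python for clarity. The top-level fields follow this document (`runnable`, `evaluators`, `description`, `eval_input`); the `items` nesting is an assumption, and the authoritative schema is whatever `pixie dataset validate` accepts:

```python
import json

dataset = {
    # filepath:callable_name reference, relative to the project root
    "runnable": "pixie_qa/scripts/run_app.py:run_app",
    # dataset-level defaults: universal criteria applied to every item
    "evaluators": ["conciseness"],
    "items": [  # assumed nesting; check the real schema with `pixie dataset validate`
        {
            "description": "known customer asks about an existing invoice",
            "eval_input": {
                "customer_id": "c-1",
                "crm_customer": {"name": "Ada", "plan": "pro"},
                "query": "Where is my latest invoice?",
            },
            # item-level evaluator for a case-specific criterion
            "evaluators": ["pixie_qa/scripts/evals.py:routing_eval"],
        },
    ],
}

# The dataset must serialize cleanly to JSON before validation.
serialized = json.dumps(dataset, indent=2)
assert json.loads(serialized)["runnable"].endswith(":run_app")
```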
-```bash
-uv run pixie test pixie_qa/tests/ -v
-```
+---
-**After running, verify the scorecard:**
+### Step 6: Run evaluation-based tests
-1. Shows "N/M tests passed" with real numbers
-2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
-3. Per-evaluator scores appear with real values
+> **Reference**: Read `references/6-run-tests.md` now — it has the sub-steps for running tests, verifying output, and running analysis.
-A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
+Run `pixie test` (without a path argument) to execute the full evaluation pipeline. Verify that real scores are produced. Once tests complete without setup errors, run `pixie analyze` to generate analysis.
-> **Checkpoint**: Tests run and produce real scores.
+> **Checkpoint**: Tests run and produce real scores. Analysis generated.
>
-> - **Setup mode**: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
-> - **Iteration mode**: Proceed directly to Step 6.
+> If the test errors out (import failures, missing keys, runnable resolution errors), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
>
-> If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
+> **STOP GATE — read this before doing anything else after tests produce scores:**
+>
+> - If the user's original prompt asks only for setup ("set up QA", "add tests", "add evals", "set up evaluations"), **STOP HERE**. Report the test results to the user: "QA setup is complete. Tests show N/M passing. [brief summary]. Want me to investigate the failures and iterate?" Do NOT proceed to Step 7.
+> - If the user's original prompt explicitly asks for iteration ("fix", "improve", "debug", "iterate", "investigate failures", "make tests pass"), proceed to Step 7.
---
-### Step 6: Investigate and iterate
-
-**Iteration mode only, or after the user confirmed in setup mode.**
-
-When tests fail, understand _why_ — don't just adjust thresholds until things pass.
+### Step 7: Investigate and iterate
-Read `references/investigation.md` for procedures and root-cause patterns.
-
-The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
+> **Reference**: Read `references/7-investigation.md` now — it has the stop/continue decision, analysis review, root-cause patterns, and investigation procedures. **Follow its instructions before doing any investigation work.**
---
-## Quick reference
-
-### Imports
+## Web server management
-```python
-from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
-from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
-```
-
-Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.
+pixie-qa runs a background web server that displays context, traces, and eval results to the user. The setup script starts it automatically, and it must be shut down explicitly once the display is no longer needed.
-### CLI commands
+When the user is done with the eval-driven-dev workflow, inform them that the web server is still running and offer to stop it with the following command:
```bash
-uv run pixie test pixie_qa/tests/ -v # Run eval tests (NOT pytest)
-uv run pixie trace list # List captured traces
-uv run pixie trace last # Show most recent trace
-uv run pixie trace show --verbose # Show specific trace
-uv run pixie dataset create # Create a new dataset
+bash resources/stop-server.sh
```
-### Directory layout
+Whenever you restart the workflow, run the setup script again to ensure the web server is running:
+```bash
+bash resources/setup.sh
```
-pixie_qa/
- MEMORY.md # your understanding and eval plan
- datasets/ # golden datasets (JSON)
- tests/ # eval test files (test_*.py)
- scripts/ # run_app.py, build_dataset.py
-```
-
-All pixie files go here — not at the project root, not in a top-level `tests/` directory.
-
-### Key concepts
-
-- **eval_input** = application input + data from external dependencies
-- **eval_output** = application output + captured side-effects + captured intermediate states (produced at test time by the utility function, NOT pre-populated in the dataset)
-- **expected_output** = case-specific evaluation reference (optional per dataset item)
-- **test function** = utility function (produces eval_output) + evaluators (check criteria)
-
-### Evaluator selection
-
-| Output type | Evaluator | Notes |
-| ------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------- |
-| Open-ended text with reference answer | `FactualityEval`, `ClosedQAEval` | Best default for most apps |
-| Open-ended text, no reference | `AnswerRelevancyEval` | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
-| Deterministic output | `ExactMatchEval`, `JSONDiffEval` | Never use for open-ended LLM text |
-| RAG with retrieved context | `FaithfulnessEval`, `ContextRelevancyEval` | Requires context capture in instrumentation |
-| Domain-specific quality | `create_llm_evaluator(name=..., prompt_template=...)` | Custom LLM-as-judge — use for app-specific criteria |
-
-### What goes where: SKILL.md vs references
-
-**This file** (SKILL.md) is loaded for the entire session. It contains the _what_ and _why_ — the reasoning, decision-making process, goals, and checkpoints for each step.
-
-**Reference files** are loaded when executing a specific step. They contain the _how_ — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
-
-When in doubt: if it's about _deciding what to do_, it's in SKILL.md. If it's about _how to implement that decision_, it's in a reference file.
-
-### Reference files
-
-| Reference | When to read |
-| ------------------------------------ | ---------------------------------------------------------------------------------- |
-| `references/understanding-app.md` | Step 1 — investigating the codebase, MEMORY.md template |
-| `references/instrumentation.md` | Step 2 — `@observe` and `enable_storage` rules, code patterns, anti-patterns |
-| `references/run-harness-patterns.md` | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
-| `references/dataset-generation.md` | Step 4 — crafting eval_input items, expected_output strategy, validation |
-| `references/eval-tests.md` | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API |
-| `references/investigation.md` | Step 6 — failure analysis, root-cause patterns |
-| `references/pixie-api.md` | Any step — full CLI and Python API reference |
diff --git a/skills/eval-driven-dev/references/1-a-entry-point.md b/skills/eval-driven-dev/references/1-a-entry-point.md
new file mode 100644
index 000000000..c5576333c
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-a-entry-point.md
@@ -0,0 +1,68 @@
+# Step 1a: Entry Point & Execution Flow
+
+Identify how the application starts and how a real user invokes it.
+
+---
+
+## What to investigate
+
+### 1. How the software runs
+
+What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
+
+Look for:
+
+- `if __name__ == "__main__"` blocks
+- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
+- CLI entry points in `pyproject.toml` (`[project.scripts]`)
+- Docker/compose configs that reveal startup commands
+
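+A minimal sketch of what such a CLI entry point looks like in `pyproject.toml` (the project and module names here are hypothetical):
+
+```toml
+[project.scripts]
+# running `myapp` on the command line invokes myapp.cli:main
+myapp = "myapp.cli:main"
+```
+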
+### 2. The real user entry point
+
+How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.
+
+- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
+- **CLI**: What command-line arguments does the user provide?
+- **Library/function**: What function does the caller import and call? What arguments?
+
+### 3. Environment and configuration
+
+- What env vars does the app require? (API keys, database URLs, feature flags)
+- What config files does it read?
+- What has sensible defaults vs. what must be explicitly set?
+
+---
+
+## Output: `pixie_qa/01-entry-point.md`
+
+Write your findings to this file. Keep it focused — only entry point and execution flow.
+
+### Template
+
+```markdown
+# Entry Point & Execution Flow
+
+## How to run
+
+
+
+## Entry point
+
+- **File**:
+- **Type**:
+- **Framework**:
+
+## User-facing endpoints / interface
+
+
+
+- **Endpoint / command**:
+- **Input format**:
+- **Output format**:
+
+## Environment requirements
+
+| Variable | Purpose | Required? | Default |
+| -------- | ------- | --------- | ------- |
+| ... | ... | ... | ... |
+```
diff --git a/skills/eval-driven-dev/references/1-b-data-flow.md b/skills/eval-driven-dev/references/1-b-data-flow.md
new file mode 100644
index 000000000..1ce946c07
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-b-data-flow.md
@@ -0,0 +1,187 @@
+# Step 1b: Processing Stack & Data Flow — DAG Artifact
+
+Map the complete data flow through the application by producing a **structured DAG JSON file** whose nodes cover every important point in the processing pipeline.
+
+---
+
+## What to investigate
+
+### 1. Find where the LLM provider client is called
+
+Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis. For each LLM call site, record:
+
+- The file and function where the call lives
+- Which LLM provider/client is used
+- The exact arguments being passed (model, messages, tools, etc.)
+
+### 2. Find the common ancestor entry point
+
+Identify the single function that is the common ancestor of all LLM calls — the application's entry point for a single user request. This becomes the **root** of your DAG.
+
+### 3. Track backwards: external data dependencies flowing IN
+
+Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt:
+
+- **Application inputs**: user messages, queries, uploaded files, config
+- **External dependency data**: database lookups (Redis, Postgres), retrieved context (RAG), cache reads, third-party API responses
+- **In-code data**: system prompts, tool definitions, prompt-building logic
+
+### 4. Track forwards: external side-effects flowing OUT
+
+Starting from each LLM call site, trace **forwards** to find every side-effect: database writes, API calls, messages sent, file writes.
+
+### 5. Identify intermediate states
+
+Along the paths between input and output, identify intermediate states needed for evaluation: tool call decisions, routing/handoff decisions, retrieval results, branching logic.
+
+### 6. Identify testability seams
+
+Look for abstract base classes, protocols, or constructor-injected backends. These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
+
+---
+
+## Output: `pixie_qa/02-data-flow.json`
+
+**Write a JSON file** (not markdown) containing a flat array of DAG nodes. Each node represents a significant point in the processing pipeline.
+
+### Node schema
+
+Each node is a JSON object with these fields:
+
+| Field | Type | Required | Description |
+| -------------- | -------------- | -------- | ------------------------------------------------------------------------------------------------------------------ |
+| `name` | string | Yes | Unique, meaningful lower_snake_case node name (for example, `handle_turn`). This is the node identity. |
+| `code_pointer` | string | Yes | **Absolute** file path with function/method name, optionally with line range. See format below. |
+| `description` | string | Yes | What this node does and why it matters for evaluation. |
+| `parent` | string or null | No | Parent node name (`null` or omitted for root). |
+| `is_llm_call` | boolean | No | Set `true` only if the node represents an LLM provider call. Defaults to `false` when omitted. |
+| `metadata` | object | No | Additional info: `mock_strategy`, `data_shape`, `credentials_needed`, `eval_relevant`, external system notes, etc. |
+
+### About `is_llm_call`
+
+- Use `is_llm_call: true` for nodes that represent real LLM provider spans.
+- Leave it omitted (or `false`) for all other nodes.
+
+### `code_pointer` format
+
+The `code_pointer` field uses **absolute file paths** with a symbol name, and an optional line number range:
+
+- `:` — points to a whole function or method. Use this when the entire function represents a single node in the DAG (most common case).
+- `:::` — points to a specific line range within a function. Use this when the function contains an **important intermediate state** — a code fragment that transforms some input into an output that matters for evaluation, but the fragment is embedded inside a larger function rather than being its own function.
+
+**When to use a line range (intermediate states):**
+
+Some functions do multiple important things sequentially. If one of those things produces an intermediate state that your evaluators need to see (e.g., a routing decision, a context assembly step, a tool-call dispatch), but it's not factored into its own function, use a line range to identify that specific fragment. The line range marks the input → output boundary of that intermediate state within the larger function.
+
+Examples of intermediate states that warrant a line range:
+
+- **Routing decision**: lines 51–71 of `main()` decide which agent to hand off to based on user intent — the input is the user message, the output is the selected agent
+- **Context assembly**: lines 30–45 of `handle_request()` gather documents from a vector store and format them into a prompt — the input is the query, the output is the assembled context
+- **Tool dispatch**: lines 80–95 of `process_turn()` parse the LLM's tool-call response and execute the selected tool — the input is the tool-call JSON, the output is the tool result
+
+If the intermediate state is already its own function, just use the function-level `code_pointer` (no line range needed).
+
+Examples:
+
+- `/home/user/myproject/app.py:handle_turn` — whole function
+- `/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response` — whole function
+- `/home/user/myproject/src/agents/agent.py:main:51:71` — lines 51–71 of `main()`, where a routing decision happens
+
+The symbol can be:
+
+- A function name: `my_func` → matches `def my_func` in the file
+- A class.method: `MyClass.func` → matches `def func` inside `class MyClass`
+
+### Example
+
+```json
+[
+ {
+ "name": "handle_turn",
+ "code_pointer": "/home/user/myproject/src/agents/agent.py:handle_turn",
+ "description": "Entry point for a single user request. Takes user message + conversation history, returns agent response.",
+ "parent": null,
+ "metadata": {
+ "data_shape": {
+ "input": "str (user message)",
+ "output": "str (response text)"
+ }
+ }
+ },
+ {
+ "name": "load_conversation_history",
+ "code_pointer": "/home/user/myproject/src/services/redis_client.py:get_history",
+ "description": "Reads conversation history from Redis. Returns list of message dicts.",
+ "parent": "handle_turn",
+ "metadata": {
+ "system": "Redis",
+ "data_shape": "list[dict] with role/content keys",
+ "mock_strategy": "Provide canned history list",
+ "credentials_needed": true
+ }
+ },
+ {
+ "name": "run_ai_response",
+ "code_pointer": "/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response",
+ "description": "Calls OpenAI API with system prompt + history + user message. Auto-captured by OpenInference.",
+ "parent": "handle_turn",
+ "is_llm_call": true,
+ "metadata": {
+ "provider": "OpenAI",
+ "model": "gpt-4o-mini"
+ }
+ },
+ {
+ "name": "save_conversation_to_redis",
+ "code_pointer": "/home/user/myproject/src/services/redis_client.py:save_history",
+ "description": "Writes updated conversation history back to Redis after LLM responds.",
+ "parent": "handle_turn",
+ "metadata": {
+ "system": "Redis",
+ "eval_relevant": false,
+ "mock_strategy": "Capture written data for assertions"
+ }
+ }
+]
+```
+
+### Conditional / optional branches
+
+Some apps have conditional code paths where only one branch executes per request — e.g., `transfer_call` vs `end_call` depending on the outcome. `pixie dag check-trace` (Step 2) validates against a **single** trace, so every DAG node must appear in that trace.
+
+**Rule**: If two or more functions are mutually exclusive (only one runs per request), model them as a **single dispatcher node** that covers the branching logic, not as separate DAG nodes. For example, instead of `end_call_tool` + `transfer_call_tool` as separate nodes, use `execute_tool` pointing at the dispatch function.
+
+If a function only runs under certain conditions but is the sole branch (not mutually exclusive), include it in the DAG — just ensure your reference trace (Step 2) exercises that code path.
+
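+For instance, a dispatcher node covering the mutually exclusive tool branches might look like this (the file path and surrounding node names are hypothetical, following the example above):
+
+```json
+{
+  "name": "execute_tool",
+  "code_pointer": "/home/user/myproject/src/agents/tools.py:execute_tool",
+  "description": "Dispatches the LLM's tool call to either end_call or transfer_call. Captures which branch was selected.",
+  "parent": "handle_turn",
+  "metadata": {
+    "data_shape": {
+      "input": "tool-call JSON from the LLM",
+      "output": "tool result"
+    }
+  }
+}
+```
+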
+### Validation checkpoint
+
+After writing `pixie_qa/02-data-flow.json`, validate the DAG:
+
+```bash
+uv run pixie dag validate pixie_qa/02-data-flow.json
+```
+
+This command:
+
+1. Checks the JSON structure is valid
+2. Verifies node names use lower_snake_case
+3. Verifies all node names are unique
+4. Verifies all parent references exist
+5. Checks exactly one root node exists (`parent` is null/omitted)
+6. Detects cycles
+7. Verifies code_pointer files exist on disk
+8. Verifies symbols exist in the referenced files
+9. Verifies line number ranges are valid (if present)
+10. **Generates a Mermaid diagram** at `pixie_qa/02-data-flow.md` if validation passes
+
+If validation fails, fix the errors and re-run. The error messages are specific — they tell you exactly which node has the problem and what's wrong.
+
+### Also document testability seams
+
+After the DAG JSON is validated, add a brief **testability seams** section at the bottom of the generated `pixie_qa/02-data-flow.md` (the Mermaid file). For each node that reads from or writes to an external system, note the mock interface:
+
+| Dependency node | Interface / module boundary | Mock strategy |
+| --------------- | --------------------------- | ------------- |
+| ... | ... | ... |
+
+This section supplements the DAG — the DAG captures _what_ the dependencies are, and this table captures _how_ to mock them.
diff --git a/skills/eval-driven-dev/references/1-c-eval-criteria.md b/skills/eval-driven-dev/references/1-c-eval-criteria.md
new file mode 100644
index 000000000..b37e8fc64
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-c-eval-criteria.md
@@ -0,0 +1,85 @@
+# Step 1c: Eval Criteria
+
+Define what quality dimensions matter for this app — based on the entry point (01-entry-point.md) and data flow (02-data-flow.md) you've already documented.
+
+This document serves two purposes:
+
+1. **Dataset creation (Step 5)**: The use cases tell you what kinds of eval_input items to generate — each use case should have representative items in the dataset.
+2. **Evaluator selection (Step 4)**: The eval criteria tell you what evaluators to choose and how to map them.
+
+Keep this concise — it's a planning artifact, not a comprehensive spec.
+
+---
+
+## What to define
+
+### 1. Use cases
+
+List the distinct scenarios the app handles. Each use case becomes a category of eval_input items in your dataset. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.
+
+**Good use case descriptions:**
+
+- "Reroute to human agent on account lookup difficulties"
+- "Answer billing question using customer's plan details from CRM"
+- "Decline to answer questions outside the support domain"
+- "Summarize research findings including all queried sub-topics"
+
+**Bad use case descriptions (too vague):**
+
+- "Handle billing questions"
+- "Edge case"
+- "Error handling"
+
+### 2. Eval criteria
+
+Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 4.
+
+**Good criteria are specific to the app's purpose.** Examples:
+
+- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
+- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
+- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"
+
+**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
+
+At this stage, don't pick evaluator classes or thresholds. That comes in Step 4.
+
+### 3. Check criteria applicability and observability
+
+For each criterion:
+
+1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 5 (dataset creation) because:
+ - **Universal criteria** → become dataset-level default evaluators
+ - **Case-specific criteria** → become item-level evaluators on relevant rows only
+
+2. **Verify observability** — check that the data flow in `02-data-flow.md` includes the data needed to evaluate each criterion. If a criterion requires data that isn't in the processing stack, note what additional instrumentation is needed in Step 2.
+
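+As a sketch of how this scoping might be recorded (hypothetical criteria drawn from the voice-support examples above):
+
+```markdown
+| Criterion                                   | Scope                 | Where it lives        |
+| ------------------------------------------- | --------------------- | --------------------- |
+| Verifies caller identity before transfer    | account requests only | item-level evaluator  |
+| Responses concise enough for a phone call   | all use cases         | dataset-level default |
+```
+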
+---
+
+## Output: `pixie_qa/03-eval-criteria.md`
+
+Write your findings to this file. **Keep it short** — the template below is the maximum length.
+
+### Template
+
+```markdown
+# Eval Criteria
+
+## Use cases
+
+1.