diff --git a/docs/README.skills.md b/docs/README.skills.md
index e3fd29872..316d77d05 100644
--- a/docs/README.skills.md
+++ b/docs/README.skills.md
@@ -129,7 +129,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md` `references/eval-tests.md` `references/instrumentation.md` `references/investigation.md` `references/pixie-api.md` `references/run-harness-patterns.md` `references/understanding-app.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md` `references/1-b-data-flow.md` `references/1-c-eval-criteria.md` `references/2-instrument-and-observe.md` `references/3-run-harness.md` `references/4-define-evaluators.md` `references/5-build-dataset.md` `references/6-run-tests.md` `references/7-investigation.md` `references/evaluators.md` `references/instrumentation-api.md` `references/run-harness-examples` `references/testing-api.md` `resources` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md` `references/excalidraw-schema.md` `scripts/.gitignore` `scripts/README.md` `scripts/add-arrow.py` `scripts/add-icon-to-diagram.py` `scripts/split-excalidraw-library.py` `templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md` `references/pyspark.md` |
| [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |
diff --git a/skills/eval-driven-dev/SKILL.md b/skills/eval-driven-dev/SKILL.md
index 498bca26b..2dff34e49 100644
--- a/skills/eval-driven-dev/SKILL.md
+++ b/skills/eval-driven-dev/SKILL.md
@@ -8,7 +8,9 @@ description: >
license: MIT
compatibility: Python 3.11+
metadata:
- version: 0.2.0
+ version: 0.4.0
+ pixie-qa-version: ">=0.4.0,<0.5.0"
+ pixie-qa-source: https://github.com/yiouli/pixie-qa/
---
# Eval-Driven Development for Python LLM Applications
@@ -27,352 +29,154 @@ This skill is about doing the work, not describing it. Read code, edit files, ru
## Before you start
-Run the following to keep the skill and package up to date. If any command fails or is blocked by the environment, continue — do not let failures here block the rest of the workflow.
-
-**Update the skill:**
+**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run setup:
```bash
-npx skills update
+bash resources/setup.sh
```
-**Upgrade the `pixie-qa` package**
-
-Make sure the python virtual environment is active and use the project's package manager:
-
-```bash
-# uv project (uv.lock exists):
-uv add pixie-qa --upgrade
-
-# poetry project (poetry.lock exists):
-poetry add pixie-qa@latest
-
-# pip / no lock file:
-pip install --upgrade pixie-qa
-```
+The script updates the `eval-driven-dev` skill and the `pixie-qa` Python package to their latest versions, and initializes the pixie working directory if it isn't already initialized. If the skill or package update fails, continue; do not let these failures block the rest of the workflow.
---
## The workflow
-Follow Steps 1–5 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
+Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
-**Two modes:**
+**How to work — read this before doing anything else:**
-- **Setup** ("set up evals", "add tests", "set up QA"): Complete Steps 1–5. After the test run, report results and ask whether to iterate.
-- **Iteration** ("fix", "improve", "debug"): Complete Steps 1–5 if not already done, then do one round of Step 6.
+- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–7 while working on Step 1.
+- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.
+- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
+- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.
-If ambiguous: default to setup.
+**Run the steps in sequence.** If the user's prompt makes it clear that earlier steps are already done (e.g., "run the existing tests", "re-run evals"), skip to the appropriate step. When in doubt, start from Step 1.
---
### Step 1: Understand the app and define eval criteria
-Read the source code to understand:
-
-1. **How it runs** — entry point, startup, config/env vars
-2. **The real entry point** — how a real user invokes the app (HTTP endpoint, CLI, function call). This is what the eval must exercise — not an inner function that bypasses the request pipeline.
-3. **The request pipeline** — trace the full path from entry point to response. What middleware, routing, state management, prompt assembly, retrieval, or formatting happens along the way? All of this is under test.
-4. **External dependencies (both directions)** — identify every external system the app talks to (databases, APIs, caches, queues, file systems, speech services). For each, understand:
- - **Data flowing IN** (external → app): what data does the app read from this system? What shapes, types, realistic values? You'll make up this data for eval scenarios.
- - **Data flowing OUT** (app → external): what does the app write, send, or mutate in this system? These are side-effects that evaluations may need to verify (e.g., "did the app create the right calendar entry?", "did it send the correct transfer request?").
- - **How to mock it** — look for abstract base classes, protocols, or constructor-injected backends (e.g., `TranscriptionBackend`, `SynthesisBackend`, `StorageBackend`). These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
-5. **Use cases** — distinct scenarios, what good/bad output looks like
-
-Read `references/understanding-app.md` for detailed guidance on mapping data flows and the MEMORY.md template.
-
-Write your findings to `pixie_qa/MEMORY.md` before moving on. Include:
-
-- The entry point and the full request pipeline
-- Every external dependency, what it provides/receives, and how you'll mock it
-- The testability seams (pluggable interfaces, patchable module-level objects)
+**First, check the user's prompt for specific requirements.** Before reading app code, examine what the user asked for:
-Determine **high-level, application-specific eval criteria**:
-
-**Good criteria are specific to the app's purpose.** Examples:
-
-- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation (under 3 sentences)?", "Does the agent route to the correct department based on the caller's request?"
-- Research report generator: "Does the report address all sub-questions in the query?", "Are claims supported by the retrieved sources?", "Is the report structured with clear sections?"
-- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when the context doesn't contain the answer?"
-
-**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
-
-At this stage, don't pick evaluator classes or thresholds. That comes later in Step 5, after you've seen the real data shape.
-
-Record the criteria in `pixie_qa/MEMORY.md` and continue.
-
-> **Checkpoint**: MEMORY.md written with app understanding + eval criteria. Proceed to Step 2.
-
----
-
-### Step 2: Instrument and observe a real run
+- **Referenced documents or specs**: Does the prompt mention a file to follow (e.g., "follow the spec in EVAL_SPEC.md", "use the methodology in REQUIREMENTS.md")? If so, **read that file first** — it may specify datasets, evaluation dimensions, pass criteria, or methodology that override your defaults.
+- **Specified datasets or data sources**: Does the prompt reference specific data files (e.g., "use questions from eval_inputs/research_questions.json", "use the scenarios in call_scenarios.json")? If so, **read those files** — you must use them as the basis for your eval dataset, not fabricate generic alternatives.
+- **Specified evaluation dimensions**: Does the prompt name specific quality aspects to evaluate (e.g., "evaluate on factuality, completeness, and bias", "test identity verification and tool call correctness")? If so, **every named dimension must have a corresponding evaluator** in your test file.
-**Why this step**: You need to see the actual data flowing through the app before you can build anything. This step serves two goals:
+If the prompt specifies any of the above, they take priority. Read and incorporate them before proceeding.
-1. **Learn the data shapes** — what data flows in from external dependencies, and what side-effects flow out? What types, structures, realistic values? You'll need to make up this data for eval scenarios later.
-2. **Verify instrumentation captures what evaluators need** — do the traces contain the data required to assess each eval criterion from Step 1? If a criterion is "does the agent route to the correct department," the trace must capture the routing decision.
+Step 1 has three sub-steps. Each reads its own reference file and produces its own output file. **Complete each sub-step fully before starting the next.**
-**This is a normal app run with instrumentation — no mocks, no patches.**
+#### Sub-step 1a: Entry point & execution flow
-#### 2a. Decide what to instrument
+> **Reference**: Read `references/1-a-entry-point.md` now.
-This is a reasoning step, not a coding step. Look at your eval criteria from Step 1 and your understanding of the codebase, and determine what data the evaluators will need:
+Read the source code to understand how the app starts and how a real user invokes it. Write your findings to `pixie_qa/01-entry-point.md` before moving on.
-- **For each eval criterion**, ask: what observable data would prove this criterion is met or violated?
-- **Map that data to code locations** — which functions produce, consume, or transform that data?
-- **Those functions need `@observe`** — so their inputs and outputs are captured in traces.
+> **Checkpoint**: `pixie_qa/01-entry-point.md` written with entry point, execution flow, user-facing interface, and env requirements.
-Examples:
+#### Sub-step 1b: Processing stack & data flow (DAG artifact)
-| Eval criterion | Data needed | What to instrument |
-| ------------------------------------------ | -------------------------------------------------- | ------------------------------------------------------------ |
-| "Routes to correct department" | The routing decision (which department was chosen) | The routing/dispatch function |
-| "Responses grounded in retrieved context" | The retrieved documents + the final response | The retrieval function AND the response function |
-| "Verifies caller identity before transfer" | Whether identity check happened, transfer decision | The identity verification function AND the transfer function |
-| "Concise phone-friendly responses" | The final response text | The function that produces the LLM response |
+> **Reference**: Read `references/1-b-data-flow.md` now.
-**LLM provider calls (OpenAI, Anthropic, etc.) are auto-captured** — `enable_storage()` activates OpenInference instrumentors that automatically trace every LLM API call with full input messages, output messages, token usage, and model parameters. You do NOT need `@observe` on the function that calls `client.chat.completions.create()` just to see the LLM interaction.
+Starting from the entry point you documented, trace the full processing stack and produce a **structured DAG JSON file** at `pixie_qa/02-data-flow.json`. Its root is the common ancestor of the app's LLM calls; its nodes capture every data dependency, intermediate state, LLM call, and side-effect, each with metadata and parent pointers.
-**Use `@observe` for application-level functions** whose inputs, outputs, or intermediate states your evaluators need but that aren't visible from the LLM call alone. Examples: the app's entry-point function (to capture what the user sent and what the app returned), retrieval functions (to capture what context was fetched), routing functions (to capture dispatch decisions).
-
-`enable_storage()` goes at application startup. Read `references/instrumentation.md` for the full rules, code patterns, and anti-patterns for adding instrumentation.
-
-#### 2b. Add instrumentation and run the app
-
-Add `@observe` to the functions you identified in 2a. Then run the app normally — with its real external dependencies, or by manually interacting with it — to produce a **reference trace**. Do NOT mock or patch anything. This is an observation run.
-
-If the app can't run without infrastructure you don't have (a real database, third-party service credentials, etc.), use the simplest possible approach to get it running — a local Docker container, a test account, or ask the user for help. The goal is one real trace.
+After writing the JSON, validate it:
```bash
-uv run pixie trace list
-uv run pixie trace last
+uv run pixie dag validate pixie_qa/02-data-flow.json
```
-#### 2c. Examine the reference trace
+This checks the DAG structure, verifies code pointers exist, and generates a Mermaid diagram at `pixie_qa/02-data-flow.md`. If validation fails, fix the errors and re-run.
-Study the trace data carefully. This is your blueprint for everything that follows. Document:
+> **Checkpoint**: `pixie_qa/02-data-flow.json` written and `pixie dag validate` passes. Mermaid diagram generated at `pixie_qa/02-data-flow.md`.
+>
+> **Schema reminder**: DAG node `name` must be unique, meaningful, and lower_snake_case (for example, `handle_turn`). If a node represents an LLM provider call, set `is_llm_call: true` (otherwise omit it or set `false`). Name-matching rules for `@observe` / `start_observation(...)` are defined in instrumentation guidance (Step 2), not here.
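The two naming rules in the schema reminder can be sketched as a standalone check. This is illustrative only: `pixie dag validate` is the authoritative checker, and any field beyond `name`, `parent`, and `is_llm_call` here is an assumption, not the pixie-qa schema.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def check_node_names(nodes: list[dict]) -> list[str]:
    """Return human-readable problems for the naming rules stated above."""
    problems, seen = [], set()
    for node in nodes:
        name = node.get("name", "")
        if not SNAKE_CASE.match(name):
            problems.append(f"node name {name!r} is not lower_snake_case")
        if name in seen:
            problems.append(f"duplicate node name {name!r}")
        seen.add(name)
        # is_llm_call is optional, but when present it must be a boolean
        if not isinstance(node.get("is_llm_call", False), bool):
            problems.append(f"{name!r}: is_llm_call must be a boolean")
    return problems

nodes = [
    {"name": "handle_turn", "parent": None},
    {"name": "generate_reply", "parent": "handle_turn", "is_llm_call": True},
]
assert check_node_names(nodes) == []
```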
-1. **Data from external dependencies (inbound)** — What did the app read from databases, APIs, caches? What are the shapes, types, and realistic value ranges? This is what you'll make up in eval_input for the dataset.
-2. **Side-effects (outbound)** — What did the app write to, send to, or mutate in external systems? These need to be captured by mocks and may be part of eval_output for verification.
-3. **Intermediate states** — What did the instrumentation capture beyond the final output? Tool calls, retrieved documents, routing decisions? Are these sufficient to evaluate every criterion from Step 1?
-4. **The eval_input / eval_output structure** — What does the `@observe`-decorated function receive as input and produce as output? Note the exact field names, types, and nesting.
+#### Sub-step 1c: Eval criteria
-**Check instrumentation completeness**: For each eval criterion from Step 1, verify the trace contains the data needed to evaluate it. If not, add more `@observe` decorators and re-run.
+> **Reference**: Read `references/1-c-eval-criteria.md` now.
-**Do not proceed until you understand the data shape and have confirmed the traces capture everything your evaluators need.**
+Define the app's use cases and eval criteria. Use cases drive dataset creation (Step 5); eval criteria drive evaluator selection (Step 4). For each criterion, determine whether it applies to all scenarios or only a subset — this drives whether it becomes a dataset-level default evaluator or an item-level evaluator. Write your findings to `pixie_qa/03-eval-criteria.md` before moving on.
-> **Checkpoint**: Instrumentation added based on eval criteria. Reference trace captured with real data. For each criterion, confirm the trace contains the data needed to evaluate it. Proceed to Step 3.
+> **Checkpoint**: `pixie_qa/03-eval-criteria.md` written with use cases (each with a one-liner conveying input + expected behavior), eval criteria with applicability scope, and observability check. Do NOT read Step 2 instructions yet.
---
-### Step 3: Write a utility function to run the full app end-to-end
-
-**Why this step**: You need a function that test cases can call. Given an eval_input (app input + mock data for external dependencies), it starts the real application with external dependencies patched, sends the input through the app's real entry point, and returns the eval_output (app response + captured side-effects).
-
-#### The contract
-
-```
-run_app(eval_input) → eval_output
-```
-
-- **eval_input** = application input (what the user sends) + data from external dependencies (what databases/APIs would return)
-- **eval_output** = application output (what the user sees) + captured side-effects (what the app wrote to external systems, captured by mocks) + captured intermediate states (tool calls, routing decisions, etc., captured by instrumentation)
-
-#### How to implement
-
-1. **Patch external dependencies** — use the mocking plan from Step 1 item 4. For each external dependency, either inject a mock implementation of its interface (cleanest) or `unittest.mock.patch` the module-level client. The mock returns data from eval_input and captures side-effects for eval_output.
-
-2. **Call the app through its real entry point** — the same way a real user or client would invoke it. Look at how the app is started: if it's a web server (FastAPI, Flask), use `TestClient` or HTTP requests. If it's a CLI, use subprocess. If it's a standalone function with no server or middleware, import and call it directly.
-
-3. **Collect the response** — the app's output becomes eval_output, along with any side-effects captured by mock objects.
-
-Read `references/run-harness-patterns.md` for concrete examples of entry point invocation for different app types.
-
-**Do NOT call an inner function** like `agent.respond()` directly just because it's simpler. The whole point is to test the app's real code path — request handling, state management, prompt assembly, routing. When you call an inner function directly, you skip all of that, and the test has to reimplement it. Now you're testing test code, not app code.
-
-#### Verify
-
-Take the eval_input from your Step 2 reference trace and feed it to the utility function. The outputs won't match word-for-word (non-deterministic), but verify:
+### Step 2: Instrument and observe a real run
-- **Same structure** — same fields present, same types, same nesting
-- **Same code path** — same routing decisions, same intermediate states captured
-- **Sensible values** — eval_output fields have real, meaningful data (not null, not empty, not error messages)
+> **Reference**: Read `references/2-instrument-and-observe.md` now — it has the detailed sub-steps for DAG-based instrumentation, running the app, verifying the trace against the DAG, documenting the reference trace, and the `@observe` and `enable_storage()` rules and patterns.
-**If it fails after two attempts**, stop and ask the user for help.
+Add `@observe` to application-level functions identified in your DAG (`pixie_qa/02-data-flow.json`). Run the app normally (no mocks) to produce a reference trace. Verify the trace with `pixie trace verify`, then validate it matches the DAG with `pixie dag check-trace`. Document the data shapes.
-> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Proceed to Step 4.
+> **Checkpoint**: `pixie_qa/04-reference-trace.md` exists with eval_input/eval_output shapes and completeness verification. Instrumentation is in the source code. `pixie dag check-trace` passes. Do NOT read Step 3 instructions yet.
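Conceptually, `@observe`-style instrumentation records a function's inputs and outputs as a trace span. A library-free sketch of the idea (this is NOT the pixie-qa API; the real rules and patterns are in `references/2-instrument-and-observe.md`):

```python
import functools
import time

TRACE: list[dict] = []  # in-memory stand-in for pixie's trace storage

def observe_sketch(fn):
    """Hypothetical stand-in for an @observe-style decorator:
    captures inputs, output, and timing for each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        TRACE.append({
            "name": fn.__name__,
            "input": {"args": args, "kwargs": kwargs},
            "output": result,
            "duration_s": time.time() - start,
        })
        return result
    return wrapper

@observe_sketch
def route_request(query: str) -> str:
    # A toy routing decision: the kind of intermediate state an
    # evaluator needs to see but an LLM call alone doesn't expose.
    return "billing" if "invoice" in query else "general"

route_request("Where is my invoice?")
assert TRACE[-1]["output"] == "billing"
```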
---
-### Step 4: Build the dataset
-
-**Why this step**: The dataset is a collection of eval_input items (made up by you) that define the test scenarios. Each item may also carry case-specific expectations. The eval_output is NOT pre-populated in the dataset — it's produced at test time by the utility function from Step 3.
-
-#### 4a. Determine verification and expectations
-
-Before generating data, decide how each eval criterion from Step 1 will be checked.
-
-**Examine the reference trace from Step 2** and identify:
-
-- **Structural constraints** you can verify with code — JSON schema, required fields, value types, enum ranges, string length bounds. These become validation checks on your generated eval_inputs.
-- **Semantic constraints** that require judgment — "the mock customer profile should be realistic", "the conversation history should be topically coherent". Apply these yourself when crafting the data.
-- **Which criteria are universal vs. case-specific**:
- - **Universal criteria** apply to ALL test cases the same way → implement in the test function (e.g., "responses must be under 3 sentences", "must not hallucinate information not in context")
- - **Case-specific criteria** vary per test case → carry as `expected_output` in the dataset item (e.g., "should mention the caller's appointment on Tuesday", "should route to billing department")
-
-#### 4b. Generate eval_input items
-
-Create eval_input items that match the data shape from the reference trace:
-
-- **Application inputs** (user queries, requests) — make these up to cover the scenarios you identified in Step 1
-- **External dependency data** (database records, API responses, cache entries) — make these up in the exact shape you observed in the reference trace
-
-Each dataset item contains:
-
-- `eval_input`: the made-up input data (app input + external dependency data)
-- `expected_output`: case-specific expectation text (optional — only for test cases with expectations beyond the universal criteria). This is a reference for evaluation, not an exact expected answer.
-
-At test time, `eval_output` is produced by the utility function from Step 3 and is not stored in the dataset itself.
-Read `references/dataset-generation.md` for the dataset creation API, data shape matching, expected_output strategy, and validation checklist.
-
-#### 4c. Validate the dataset
+### Step 3: Write a utility function to run the full app end-to-end
-After building:
+> **Reference**: Read `references/3-run-harness.md` now — it has the contract, implementation guidance, verification steps, and concrete examples by app type (FastAPI, CLI, standalone function).
-1. **Execute `build_dataset.py`** — don't just write it, run it
-2. **Verify structural constraints** — each eval_input matches the reference trace's schema (same fields, same types)
-3. **Verify diversity** — items have meaningfully different inputs, not just minor variations
-4. **Verify case-specific expectations** — `expected_output` values are specific and testable, not vague
-5. For conversational apps, include items with conversation history
+Write a `run_app(eval_input) → eval_output` function that patches external dependencies, calls the app through its real entry point, and collects the response. Verify it produces the same structure as the reference trace.
-> **Checkpoint**: Dataset created with diverse eval_inputs matching the reference trace's data shape. Proceed to Step 5.
+> **Checkpoint**: Utility function implemented and verified. When fed the reference trace's eval_input, it produces eval_output with the same structure and exercises the same code path. Do NOT read Step 4 instructions yet.
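The `run_app(eval_input) → eval_output` contract above can be sketched with `unittest.mock.patch`. The app module here is hypothetical and inlined so the sketch is self-contained; in a real project you would import the app and patch its real external client, per `references/3-run-harness.md`:

```python
from unittest.mock import patch

# Hypothetical app code (inlined for the sketch).
class CrmClient:
    def get_customer(self, customer_id):
        raise RuntimeError("real backend not available in tests")

crm = CrmClient()

def handle_request(customer_id: str) -> dict:  # the app's real entry point
    customer = crm.get_customer(customer_id)
    return {"reply": f"Hello {customer['name']}"}

def run_app(eval_input: dict) -> dict:
    """Patch the external dependency, call the real entry point,
    return app output plus captured side-effects."""
    captured = {"lookups": []}

    def fake_get_customer(customer_id):
        captured["lookups"].append(customer_id)   # record the side-effect
        return eval_input["crm_customer"]          # data the CRM "would" return

    with patch.object(crm, "get_customer", fake_get_customer):
        response = handle_request(eval_input["customer_id"])

    return {"app_output": response, "side_effects": captured}

out = run_app({"customer_id": "c-1", "crm_customer": {"name": "Ada"}})
assert out["app_output"] == {"reply": "Hello Ada"}
assert out["side_effects"]["lookups"] == ["c-1"]
```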
---
-### Step 5: Write and run eval tests
-
-**Why this step**: With the utility function built and the dataset ready, writing tests is straightforward — wire up the function, choose evaluators for each criterion, and run.
-
-#### 5a. Map criteria to evaluators
+### Step 4: Define evaluators
-For each eval criterion from Step 1, decide how to evaluate it:
+> **Reference**: Read `references/4-define-evaluators.md` now — it has the sub-steps for mapping criteria to evaluators, implementing custom evaluators, verifying discoverability, and producing the evaluator mapping artifact.
-- **Can it be checked with a built-in evaluator?** (factual correctness → `FactualityEval`, exact match → `ExactMatchEval`, RAG faithfulness → `FaithfulnessEval`)
-- **Does it need a custom evaluator?** Most app-specific criteria do — use `create_llm_evaluator` with a prompt that operationalizes the criterion.
-- **Is it universal or case-specific?** Universal criteria go in the test function. Case-specific criteria use `expected_output` from the dataset.
+Map each eval criterion from Step 1c to a concrete evaluator — implement custom ones where needed. Then produce the evaluator mapping artifact.
-For open-ended LLM text, **never** use `ExactMatchEval` — LLM outputs are non-deterministic.
+> **Checkpoint**: All evaluators implemented. `pixie_qa/05-evaluator-mapping.md` written with criterion-to-evaluator mapping using exact evaluator names (built-in names from `evaluators.md`, custom names in `filepath:callable_name` format). Do NOT read Step 5 instructions yet.
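For universal criteria that are deterministic, a custom evaluator can be plain code rather than an LLM judge. A hypothetical sketch for a criterion like "responses concise enough for a phone conversation" (the actual evaluator API and naming conventions are defined in `references/4-define-evaluators.md`, not here):

```python
import re

def conciseness_eval(eval_output: dict, max_sentences: int = 3) -> float:
    """Hypothetical deterministic evaluator: returns 1.0 if the reply
    has at most max_sentences sentences, else 0.0."""
    text = eval_output.get("reply", "")
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return 1.0 if len(sentences) <= max_sentences else 0.0

assert conciseness_eval({"reply": "Sure. Done."}) == 1.0
assert conciseness_eval({"reply": "One. Two! Three? Four."}) == 0.0
```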
-`AnswerRelevancyEval` is **RAG-only** — it requires a `context` value in the trace. Returns 0.0 without it. For general relevance without RAG, use `create_llm_evaluator` with a custom prompt.
-
-Read `references/eval-tests.md` for the evaluator catalog, custom evaluator examples, and the test file boilerplate.
+---
-#### 5b. Write the test file and run
+### Step 5: Build the dataset
-The test file wires together: a `runnable` (calls your utility function from Step 3), a reference to the dataset, and the evaluators you chose.
+> **Reference**: Read `references/5-build-dataset.md` now — it has the sub-steps for determining expectations, generating eval_input items, building the dataset JSON with the new format (runnable, evaluators, descriptions), and validating with `pixie dataset validate`.
-Read `references/eval-tests.md` for the exact `assert_dataset_pass` API, required parameter names, and common mistakes to avoid. **Re-read the API reference immediately before writing test code** — do not rely on earlier context.
+Create a dataset JSON file with made-up eval_input items that match the data shape from the reference trace. Set the `runnable` to the `filepath:callable_name` reference for the run function from Step 3 (e.g., `"pixie_qa/scripts/run_app.py:run_app"` — file path relative to project root). Assign evaluators based on the eval criteria (Step 1c) and the evaluator mapping (Step 4) — universal criteria become dataset-level defaults, case-specific criteria become item-level evaluators. Add a `description` for each item. Validate with `pixie dataset validate`.
-Run with `pixie test` — not `pytest`:
+> **Checkpoint**: Dataset JSON created at `pixie_qa/datasets/<name>.json` with diverse eval_inputs, runnable, evaluators, and descriptions. `pixie dataset validate` passes. Do NOT read Step 6 instructions yet.
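An illustrative dataset shape, built in Python for clarity. The top-level fields follow this document (`runnable`, `evaluators`, `description`, `eval_input`); the `items` nesting is an assumption, and the authoritative schema is whatever `pixie dataset validate` accepts:

```python
import json

dataset = {
    # filepath:callable_name reference, relative to the project root
    "runnable": "pixie_qa/scripts/run_app.py:run_app",
    # dataset-level defaults: universal criteria applied to every item
    "evaluators": ["conciseness"],
    "items": [  # assumed nesting; check the real schema with `pixie dataset validate`
        {
            "description": "known customer asks about an existing invoice",
            "eval_input": {
                "customer_id": "c-1",
                "crm_customer": {"name": "Ada", "plan": "pro"},
                "query": "Where is my latest invoice?",
            },
            # item-level evaluator for a case-specific criterion
            "evaluators": ["pixie_qa/scripts/evals.py:routing_eval"],
        },
    ],
}

# The dataset must serialize cleanly to JSON before validation.
serialized = json.dumps(dataset, indent=2)
assert json.loads(serialized)["runnable"].endswith(":run_app")
```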
-```bash
-uv run pixie test pixie_qa/tests/ -v
-```
+---
-**After running, verify the scorecard:**
+### Step 6: Run evaluation-based tests
-1. Shows "N/M tests passed" with real numbers
-2. Does NOT say "No assert_pass / assert_dataset_pass calls recorded" (that means missing `await`)
-3. Per-evaluator scores appear with real values
+> **Reference**: Read `references/6-run-tests.md` now — it has the sub-steps for running tests, verifying output, and running analysis.
-A test that passes with no recorded evaluations is worse than a failing test — it gives false confidence. Debug until real scores appear.
+Run `pixie test` (without a path argument) to execute the full evaluation pipeline. Verify that real scores are produced. Once tests complete without setup errors, run `pixie analyze` to generate analysis.
-> **Checkpoint**: Tests run and produce real scores.
+> **Checkpoint**: Tests run and produce real scores. Analysis generated.
>
-> - **Setup mode**: Report results ("QA setup is complete. Tests show N/M passing.") and ask: "Want me to investigate the failures and iterate?" Stop here unless the user says yes.
-> - **Iteration mode**: Proceed directly to Step 6.
+> If the test errors out (import failures, missing keys, runnable resolution errors), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
>
-> If the test errors out (import failures, missing keys), that's a setup bug — fix and re-run. But if tests produce real pass/fail scores, that's the deliverable.
+> **STOP GATE — read this before doing anything else after tests produce scores:**
+>
+> - If the user's original prompt asks only for setup ("set up QA", "add tests", "add evals", "set up evaluations"), **STOP HERE**. Report the test results to the user: "QA setup is complete. Tests show N/M passing. [brief summary]. Want me to investigate the failures and iterate?" Do NOT proceed to Step 7.
+> - If the user's original prompt explicitly asks for iteration ("fix", "improve", "debug", "iterate", "investigate failures", "make tests pass"), proceed to Step 7.
---
-### Step 6: Investigate and iterate
-
-**Iteration mode only, or after the user confirmed in setup mode.**
-
-When tests fail, understand _why_ — don't just adjust thresholds until things pass.
+### Step 7: Investigate and iterate
-Read `references/investigation.md` for procedures and root-cause patterns.
-
-The cycle: investigate root cause → fix (prompt, code, or eval config) → rebuild dataset if needed → re-run tests → repeat.
+> **Reference**: Read `references/7-investigation.md` now — it has the stop/continue decision, analysis review, root-cause patterns, and investigation procedures. **Follow its instructions before doing any investigation work.**
---
-## Quick reference
-
-### Imports
+## Web server management
-```python
-from pixie import enable_storage, observe, assert_dataset_pass, ScoreThreshold, last_llm_call
-from pixie import FactualityEval, ClosedQAEval, create_llm_evaluator
-```
-
-Only `from pixie import ...` — never subpackages (`pixie.storage`, `pixie.evals`, etc.). There is no `pixie.qa` module.
+pixie-qa runs a background web server that displays context, traces, and eval results to the user. The setup script starts it automatically, and it must be shut down explicitly once the display is no longer needed.
-### CLI commands
+When the user is done with the eval-driven-dev workflow, inform them that the web server is still running and offer to stop it with the following command:
```bash
-uv run pixie test pixie_qa/tests/ -v # Run eval tests (NOT pytest)
-uv run pixie trace list # List captured traces
-uv run pixie trace last # Show most recent trace
-uv run pixie trace show --verbose # Show specific trace
-uv run pixie dataset create # Create a new dataset
+bash resources/stop-server.sh
```
-### Directory layout
+Whenever you restart the workflow, run the setup script again to ensure the web server is running:
+```bash
+bash resources/setup.sh
```
-pixie_qa/
- MEMORY.md # your understanding and eval plan
- datasets/ # golden datasets (JSON)
- tests/ # eval test files (test_*.py)
- scripts/ # run_app.py, build_dataset.py
-```
-
-All pixie files go here — not at the project root, not in a top-level `tests/` directory.
-
-### Key concepts
-
-- **eval_input** = application input + data from external dependencies
-- **eval_output** = application output + captured side-effects + captured intermediate states (produced at test time by the utility function, NOT pre-populated in the dataset)
-- **expected_output** = case-specific evaluation reference (optional per dataset item)
-- **test function** = utility function (produces eval_output) + evaluators (check criteria)
-
-### Evaluator selection
-
-| Output type | Evaluator | Notes |
-| ------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------- |
-| Open-ended text with reference answer | `FactualityEval`, `ClosedQAEval` | Best default for most apps |
-| Open-ended text, no reference | `AnswerRelevancyEval` | **RAG only** — needs `context` in trace. Returns 0.0 without it. |
-| Deterministic output | `ExactMatchEval`, `JSONDiffEval` | Never use for open-ended LLM text |
-| RAG with retrieved context | `FaithfulnessEval`, `ContextRelevancyEval` | Requires context capture in instrumentation |
-| Domain-specific quality | `create_llm_evaluator(name=..., prompt_template=...)` | Custom LLM-as-judge — use for app-specific criteria |
-
-### What goes where: SKILL.md vs references
-
-**This file** (SKILL.md) is loaded for the entire session. It contains the _what_ and _why_ — the reasoning, decision-making process, goals, and checkpoints for each step.
-
-**Reference files** are loaded when executing a specific step. They contain the _how_ — tactical API usage, code patterns, anti-patterns, troubleshooting, and ready-to-adapt examples.
-
-When in doubt: if it's about _deciding what to do_, it's in SKILL.md. If it's about _how to implement that decision_, it's in a reference file.
-
-### Reference files
-
-| Reference | When to read |
-| ------------------------------------ | ---------------------------------------------------------------------------------- |
-| `references/understanding-app.md` | Step 1 — investigating the codebase, MEMORY.md template |
-| `references/instrumentation.md` | Step 2 — `@observe` and `enable_storage` rules, code patterns, anti-patterns |
-| `references/run-harness-patterns.md` | Step 3 — examples of how to invoke different app types (web server, CLI, function) |
-| `references/dataset-generation.md` | Step 4 — crafting eval_input items, expected_output strategy, validation |
-| `references/eval-tests.md` | Step 5 — evaluator selection, test file pattern, assert_dataset_pass API |
-| `references/investigation.md` | Step 6 — failure analysis, root-cause patterns |
-| `references/pixie-api.md` | Any step — full CLI and Python API reference |
diff --git a/skills/eval-driven-dev/references/1-a-entry-point.md b/skills/eval-driven-dev/references/1-a-entry-point.md
new file mode 100644
index 000000000..c5576333c
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-a-entry-point.md
@@ -0,0 +1,68 @@
+# Step 1a: Entry Point & Execution Flow
+
+Identify how the application starts and how a real user invokes it.
+
+---
+
+## What to investigate
+
+### 1. How the software runs
+
+What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?
+
+Look for:
+
+- `if __name__ == "__main__"` blocks
+- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
+- CLI entry points in `pyproject.toml` (`[project.scripts]`)
+- Docker/compose configs that reveal startup commands
+
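+A minimal sketch of what such a CLI entry point looks like in `pyproject.toml` (the project and module names here are hypothetical):
+
+```toml
+[project.scripts]
+# running `myapp` on the command line invokes myapp.cli:main
+myapp = "myapp.cli:main"
+```
+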
+### 2. The real user entry point
+
+How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.
+
+- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
+- **CLI**: What command-line arguments does the user provide?
+- **Library/function**: What function does the caller import and call? What arguments?
+
+### 3. Environment and configuration
+
+- What env vars does the app require? (API keys, database URLs, feature flags)
+- What config files does it read?
+- What has sensible defaults vs. what must be explicitly set?
+
+---
+
+## Output: `pixie_qa/01-entry-point.md`
+
+Write your findings to this file. Keep it focused — only entry point and execution flow.
+
+### Template
+
+```markdown
+# Entry Point & Execution Flow
+
+## How to run
+
+
+
+## Entry point
+
+- **File**:
+- **Type**:
+- **Framework**:
+
+## User-facing endpoints / interface
+
+
+
+- **Endpoint / command**:
+- **Input format**:
+- **Output format**:
+
+## Environment requirements
+
+| Variable | Purpose | Required? | Default |
+| -------- | ------- | --------- | ------- |
+| ... | ... | ... | ... |
+```
diff --git a/skills/eval-driven-dev/references/1-b-data-flow.md b/skills/eval-driven-dev/references/1-b-data-flow.md
new file mode 100644
index 000000000..1ce946c07
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-b-data-flow.md
@@ -0,0 +1,187 @@
+# Step 1b: Processing Stack & Data Flow — DAG Artifact
+
+Map the complete data flow through the application by producing a **structured DAG JSON file** whose nodes cover every important point in the processing pipeline.
+
+---
+
+## What to investigate
+
+### 1. Find where the LLM provider client is called
+
+Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis. For each LLM call site, record:
+
+- The file and function where the call lives
+- Which LLM provider/client is used
+- The exact arguments being passed (model, messages, tools, etc.)
+
+### 2. Find the common ancestor entry point
+
+Identify the single function that is the common ancestor of all LLM calls — the application's entry point for a single user request. This becomes the **root** of your DAG.
+
+### 3. Track backwards: external data dependencies flowing IN
+
+Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt:
+
+- **Application inputs**: user messages, queries, uploaded files, config
+- **External dependency data**: database lookups (Redis, Postgres), retrieved context (RAG), cache reads, third-party API responses
+- **In-code data**: system prompts, tool definitions, prompt-building logic
+
+### 4. Track forwards: external side-effects flowing OUT
+
+Starting from each LLM call site, trace **forwards** to find every side-effect: database writes, API calls, messages sent, file writes.
+
+### 5. Identify intermediate states
+
+Along the paths between input and output, identify intermediate states needed for evaluation: tool call decisions, routing/handoff decisions, retrieval results, branching logic.
+
+### 6. Identify testability seams
+
+Look for abstract base classes, protocols, or constructor-injected backends. These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
+
+---
+
+## Output: `pixie_qa/02-data-flow.json`
+
+**Write a JSON file** (not markdown) containing a flat array of DAG nodes. Each node represents a significant point in the processing pipeline.
+
+### Node schema
+
+Each node is a JSON object with these fields:
+
+| Field | Type | Required | Description |
+| -------------- | -------------- | -------- | ------------------------------------------------------------------------------------------------------------------ |
+| `name` | string | Yes | Unique, meaningful lower_snake_case node name (for example, `handle_turn`). This is the node identity. |
+| `code_pointer` | string | Yes | **Absolute** file path with function/method name, optionally with line range. See format below. |
+| `description` | string | Yes | What this node does and why it matters for evaluation. |
+| `parent` | string or null | No | Parent node name (`null` or omitted for root). |
+| `is_llm_call` | boolean | No | Set `true` only if the node represents an LLM provider call. Defaults to `false` when omitted. |
+| `metadata` | object | No | Additional info: `mock_strategy`, `data_shape`, `credentials_needed`, `eval_relevant`, external system notes, etc. |
+
+### About `is_llm_call`
+
+- Use `is_llm_call: true` for nodes that represent real LLM provider spans.
+- Leave it omitted (or `false`) for all other nodes.
+
+### `code_pointer` format
+
+The `code_pointer` field uses **absolute file paths** with a symbol name, and an optional line number range:
+
+- `:` — points to a whole function or method. Use this when the entire function represents a single node in the DAG (most common case).
+- `:::` — points to a specific line range within a function. Use this when the function contains an **important intermediate state** — a code fragment that transforms some input into an output that matters for evaluation, but the fragment is embedded inside a larger function rather than being its own function.
+
+**When to use a line range (intermediate states):**
+
+Some functions do multiple important things sequentially. If one of those things produces an intermediate state that your evaluators need to see (e.g., a routing decision, a context assembly step, a tool-call dispatch), but it's not factored into its own function, use a line range to identify that specific fragment. The line range marks the input → output boundary of that intermediate state within the larger function.
+
+Examples of intermediate states that warrant a line range:
+
+- **Routing decision**: lines 51–71 of `main()` decide which agent to hand off to based on user intent — the input is the user message, the output is the selected agent
+- **Context assembly**: lines 30–45 of `handle_request()` gather documents from a vector store and format them into a prompt — the input is the query, the output is the assembled context
+- **Tool dispatch**: lines 80–95 of `process_turn()` parse the LLM's tool-call response and execute the selected tool — the input is the tool-call JSON, the output is the tool result
+
+If the intermediate state is already its own function, just use the function-level `code_pointer` (no line range needed).
+
+Examples:
+
+- `/home/user/myproject/app.py:handle_turn` — whole function
+- `/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response` — whole function
+- `/home/user/myproject/src/agents/agent.py:main:51:71` — lines 51–71 of `main()`, where a routing decision happens
+
+The symbol can be:
+
+- A function name: `my_func` → matches `def my_func` in the file
+- A class.method: `MyClass.func` → matches `def func` inside `class MyClass`
+
+### Example
+
+```json
+[
+ {
+ "name": "handle_turn",
+ "code_pointer": "/home/user/myproject/src/agents/agent.py:handle_turn",
+ "description": "Entry point for a single user request. Takes user message + conversation history, returns agent response.",
+ "parent": null,
+ "metadata": {
+ "data_shape": {
+ "input": "str (user message)",
+ "output": "str (response text)"
+ }
+ }
+ },
+ {
+ "name": "load_conversation_history",
+ "code_pointer": "/home/user/myproject/src/services/redis_client.py:get_history",
+ "description": "Reads conversation history from Redis. Returns list of message dicts.",
+ "parent": "handle_turn",
+ "metadata": {
+ "system": "Redis",
+ "data_shape": "list[dict] with role/content keys",
+ "mock_strategy": "Provide canned history list",
+ "credentials_needed": true
+ }
+ },
+ {
+ "name": "run_ai_response",
+ "code_pointer": "/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response",
+ "description": "Calls OpenAI API with system prompt + history + user message. Auto-captured by OpenInference.",
+ "parent": "handle_turn",
+ "is_llm_call": true,
+ "metadata": {
+ "provider": "OpenAI",
+ "model": "gpt-4o-mini"
+ }
+ },
+ {
+ "name": "save_conversation_to_redis",
+ "code_pointer": "/home/user/myproject/src/services/redis_client.py:save_history",
+ "description": "Writes updated conversation history back to Redis after LLM responds.",
+ "parent": "handle_turn",
+ "metadata": {
+ "system": "Redis",
+ "eval_relevant": false,
+ "mock_strategy": "Capture written data for assertions"
+ }
+ }
+]
+```
+
+### Conditional / optional branches
+
+Some apps have conditional code paths where only one branch executes per request — e.g., `transfer_call` vs `end_call` depending on the outcome. `pixie dag check-trace` (Step 2) validates against a **single** trace, so every DAG node must appear in that trace.
+
+**Rule**: If two or more functions are mutually exclusive (only one runs per request), model them as a **single dispatcher node** that covers the branching logic, not as separate DAG nodes. For example, instead of `end_call_tool` + `transfer_call_tool` as separate nodes, use `execute_tool` pointing at the dispatch function.
+
+If a function only runs under certain conditions but is the sole branch (not mutually exclusive), include it in the DAG — just ensure your reference trace (Step 2) exercises that code path.
+
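+For instance, a dispatcher node covering the mutually exclusive tool branches might look like this (the file path and surrounding node names are hypothetical, following the example above):
+
+```json
+{
+  "name": "execute_tool",
+  "code_pointer": "/home/user/myproject/src/agents/tools.py:execute_tool",
+  "description": "Dispatches the LLM's tool call to either end_call or transfer_call. Captures which branch was selected.",
+  "parent": "handle_turn",
+  "metadata": {
+    "data_shape": {
+      "input": "tool-call JSON from the LLM",
+      "output": "tool result"
+    }
+  }
+}
+```
+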
+### Validation checkpoint
+
+After writing `pixie_qa/02-data-flow.json`, validate the DAG:
+
+```bash
+uv run pixie dag validate pixie_qa/02-data-flow.json
+```
+
+This command:
+
+1. Checks the JSON structure is valid
+2. Verifies node names use lower_snake_case
+3. Verifies all node names are unique
+4. Verifies all parent references exist
+5. Checks exactly one root node exists (`parent` is null/omitted)
+6. Detects cycles
+7. Verifies code_pointer files exist on disk
+8. Verifies symbols exist in the referenced files
+9. Verifies line number ranges are valid (if present)
+10. **Generates a Mermaid diagram** at `pixie_qa/02-data-flow.md` if validation passes
+
+If validation fails, fix the errors and re-run. The error messages are specific — they tell you exactly which node has the problem and what's wrong.
+
+### Also document testability seams
+
+After the DAG JSON is validated, add a brief **testability seams** section at the bottom of the generated `pixie_qa/02-data-flow.md` (the Mermaid file). For each node that reads from or writes to an external system, note the mock interface:
+
+| Dependency node | Interface / module boundary | Mock strategy |
+| --------------- | --------------------------- | ------------- |
+| ... | ... | ... |
+
+This section supplements the DAG — the DAG captures _what_ the dependencies are, and this table captures _how_ to mock them.
diff --git a/skills/eval-driven-dev/references/1-c-eval-criteria.md b/skills/eval-driven-dev/references/1-c-eval-criteria.md
new file mode 100644
index 000000000..b37e8fc64
--- /dev/null
+++ b/skills/eval-driven-dev/references/1-c-eval-criteria.md
@@ -0,0 +1,85 @@
+# Step 1c: Eval Criteria
+
+Define what quality dimensions matter for this app — based on the entry point (01-entry-point.md) and data flow (02-data-flow.md) you've already documented.
+
+This document serves two purposes:
+
+1. **Dataset creation (Step 5)**: The use cases tell you what kinds of eval_input items to generate — each use case should have representative items in the dataset.
+2. **Evaluator selection (Step 4)**: The eval criteria tell you what evaluators to choose and how to map them.
+
+Keep this concise — it's a planning artifact, not a comprehensive spec.
+
+---
+
+## What to define
+
+### 1. Use cases
+
+List the distinct scenarios the app handles. Each use case becomes a category of eval_input items in your dataset. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.
+
+**Good use case descriptions:**
+
+- "Reroute to human agent on account lookup difficulties"
+- "Answer billing question using customer's plan details from CRM"
+- "Decline to answer questions outside the support domain"
+- "Summarize research findings including all queried sub-topics"
+
+**Bad use case descriptions (too vague):**
+
+- "Handle billing questions"
+- "Edge case"
+- "Error handling"
+
+### 2. Eval criteria
+
+Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 4.
+
+**Good criteria are specific to the app's purpose.** Examples:
+
+- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
+- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
+- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"
+
+**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.
+
+At this stage, don't pick evaluator classes or thresholds. That comes in Step 4.
+
+### 3. Check criteria applicability and observability
+
+For each criterion:
+
+1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 5 (dataset creation) because:
+ - **Universal criteria** → become dataset-level default evaluators
+ - **Case-specific criteria** → become item-level evaluators on relevant rows only
+
+2. **Verify observability** — check that the data flow in `02-data-flow.md` includes the data needed to evaluate each criterion. If a criterion requires data that isn't in the processing stack, note what additional instrumentation is needed in Step 2.
+
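+As a sketch of how this scoping might be recorded (hypothetical criteria drawn from the voice-support examples above):
+
+```markdown
+| Criterion                                   | Scope                 | Where it lives        |
+| ------------------------------------------- | --------------------- | --------------------- |
+| Verifies caller identity before transfer    | account requests only | item-level evaluator  |
+| Responses concise enough for a phone call   | all use cases         | dataset-level default |
+```
+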
+---
+
+## Output: `pixie_qa/03-eval-criteria.md`
+
+Write your findings to this file. **Keep it short** — the template below is the maximum length.
+
+### Template
+
+```markdown
+# Eval Criteria
+
+## Use cases
+
+1.