
Commit 77407e0

update eval-driven-dev skill
1 parent 5f3d66c commit 77407e0

26 files changed (+2592 −1685 lines)

docs/README.skills.md

Lines changed: 1 addition & 1 deletion
@@ -129,7 +129,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
 | [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
 | [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
 | [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
-| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
+| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`<br />`references/1-b-data-flow.md`<br />`references/1-c-eval-criteria.md`<br />`references/2-instrument-and-observe.md`<br />`references/3-run-harness.md`<br />`references/4-define-evaluators.md`<br />`references/5-build-dataset.md`<br />`references/6-run-tests.md`<br />`references/7-investigation.md`<br />`references/evaluators.md`<br />`references/instrumentation-api.md`<br />`references/run-harness-examples`<br />`references/testing-api.md`<br />`resources` |
 | [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
 | [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
 | [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |

skills/eval-driven-dev/SKILL.md

Lines changed: 72 additions & 268 deletions
Large diffs are not rendered by default.
skills/eval-driven-dev/references/1-a-entry-point.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
# Step 1a: Entry Point & Execution Flow

Identify how the application starts and how a real user invokes it.

---

## What to investigate

### 1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

Look for:

- `if __name__ == "__main__"` blocks
- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
- CLI entry points in `pyproject.toml` (`[project.scripts]`)
- Docker/compose configs that reveal startup commands
### 2. The real user entry point

How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.

- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
- **CLI**: What command-line arguments does the user provide?
- **Library/function**: What function does the caller import and call? What arguments?

### 3. Environment and configuration

- What env vars does the app require? (API keys, database URLs, feature flags)
- What config files does it read?
- What has sensible defaults vs. what must be explicitly set?

---

## Output: `pixie_qa/01-entry-point.md`

Write your findings to this file. Keep it focused — only entry point and execution flow.

### Template

```markdown
# Entry Point & Execution Flow

## How to run

<Command to start the app, required env vars, config files>

## Entry point

- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>

## User-facing endpoints / interface

<For each way a user interacts with the app:>

- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>

## Environment requirements

| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ... | ... | ... | ... |
```
skills/eval-driven-dev/references/1-b-data-flow.md

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
# Step 1b: Processing Stack & Data Flow — DAG Artifact

Map the complete data flow through the application by producing a **structured DAG JSON file** that represents every important node in the processing pipeline.

---

## What to investigate

### 1. Find where the LLM provider client is called

Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis. For each LLM call site, record:

- The file and function where the call lives
- Which LLM provider/client is used
- The exact arguments being passed (model, messages, tools, etc.)

### 2. Find the common ancestor entry point

Identify the single function that is the common ancestor of all LLM calls — the application's entry point for a single user request. This becomes the **root** of your DAG.

### 3. Track backwards: external data dependencies flowing IN

Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt:

- **Application inputs**: user messages, queries, uploaded files, config
- **External dependency data**: database lookups (Redis, Postgres), retrieved context (RAG), cache reads, third-party API responses
- **In-code data**: system prompts, tool definitions, prompt-building logic

### 4. Track forwards: external side-effects flowing OUT

Starting from each LLM call site, trace **forwards** to find every side-effect: database writes, API calls, messages sent, file writes.

### 5. Identify intermediate states

Along the paths between input and output, identify intermediate states needed for evaluation: tool call decisions, routing/handoff decisions, retrieval results, branching logic.

### 6. Identify testability seams

Look for abstract base classes, protocols, or constructor-injected backends. These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
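As an illustration, a constructor-injected seam and its test double might look like the sketch below. All names here (`HistoryStore`, `FakeHistoryStore`, `myapp.redis_client`) are hypothetical, not from any real codebase:

```python
from typing import Protocol

class HistoryStore(Protocol):
    """The testability seam: the agent depends on this interface, not on Redis."""
    def get_history(self, session_id: str) -> list[dict]: ...

class FakeHistoryStore:
    """Canned in-memory backend an eval test injects in place of the real store."""
    def __init__(self, canned: list[dict]):
        self.canned = canned

    def get_history(self, session_id: str) -> list[dict]:
        return self.canned

# Without a clean interface, patch at the module boundary instead, e.g.:
#   with unittest.mock.patch("myapp.redis_client.get_history", return_value=[...]):
#       run_request(...)
```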
---

## Output: `pixie_qa/02-data-flow.json`

**Write a JSON file** (not markdown) containing a flat array of DAG nodes. Each node represents a significant point in the processing pipeline.

### Node schema

Each node is a JSON object with these fields:

| Field | Type | Required | Description |
| -------------- | -------------- | -------- | ----------- |
| `name` | string | Yes | Unique, meaningful lower_snake_case node name (for example, `handle_turn`). This is the node identity. |
| `code_pointer` | string | Yes | **Absolute** file path with function/method name, optionally with line range. See format below. |
| `description` | string | Yes | What this node does and why it matters for evaluation. |
| `parent` | string or null | No | Parent node name (`null` or omitted for root). |
| `is_llm_call` | boolean | No | Set `true` only if the node represents an LLM provider call. Defaults to `false` when omitted. |
| `metadata` | object | No | Additional info: `mock_strategy`, `data_shape`, `credentials_needed`, `eval_relevant`, external system notes, etc. |

### About `is_llm_call`

- Use `is_llm_call: true` for nodes that represent real LLM provider spans.
- Leave it omitted (or `false`) for all other nodes.

### `code_pointer` format

The `code_pointer` field uses **absolute file paths** with a symbol name, and an optional line number range:

- `<absolute_file_path>:<symbol>` — points to a whole function or method. Use this when the entire function represents a single node in the DAG (most common case).
- `<absolute_file_path>:<symbol>:<start_line>:<end_line>` — points to a specific line range within a function. Use this when the function contains an **important intermediate state** — a code fragment that transforms some input into an output that matters for evaluation, but the fragment is embedded inside a larger function rather than being its own function.

**When to use a line range (intermediate states):**

Some functions do multiple important things sequentially. If one of those things produces an intermediate state that your evaluators need to see (e.g., a routing decision, a context assembly step, a tool-call dispatch), but it's not factored into its own function, use a line range to identify that specific fragment. The line range marks the input → output boundary of that intermediate state within the larger function.

Examples of intermediate states that warrant a line range:

- **Routing decision**: lines 51–71 of `main()` decide which agent to hand off to based on user intent — the input is the user message, the output is the selected agent
- **Context assembly**: lines 30–45 of `handle_request()` gather documents from a vector store and format them into a prompt — the input is the query, the output is the assembled context
- **Tool dispatch**: lines 80–95 of `process_turn()` parse the LLM's tool-call response and execute the selected tool — the input is the tool-call JSON, the output is the tool result

If the intermediate state is already its own function, just use the function-level `code_pointer` (no line range needed).

Examples:

- `/home/user/myproject/app.py:handle_turn` — whole function
- `/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response` — whole function
- `/home/user/myproject/src/agents/agent.py:main:51:71` — lines 51–71 of `main()`, where a routing decision happens

The symbol can be:

- A function name: `my_func` → matches `def my_func` in the file
- A class.method: `MyClass.func` → matches `def func` inside `class MyClass`
### Example

```json
[
  {
    "name": "handle_turn",
    "code_pointer": "/home/user/myproject/src/agents/agent.py:handle_turn",
    "description": "Entry point for a single user request. Takes user message + conversation history, returns agent response.",
    "parent": null,
    "metadata": {
      "data_shape": {
        "input": "str (user message)",
        "output": "str (response text)"
      }
    }
  },
  {
    "name": "load_conversation_history",
    "code_pointer": "/home/user/myproject/src/services/redis_client.py:get_history",
    "description": "Reads conversation history from Redis. Returns list of message dicts.",
    "parent": "handle_turn",
    "metadata": {
      "system": "Redis",
      "data_shape": "list[dict] with role/content keys",
      "mock_strategy": "Provide canned history list",
      "credentials_needed": true
    }
  },
  {
    "name": "run_ai_response",
    "code_pointer": "/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response",
    "description": "Calls OpenAI API with system prompt + history + user message. Auto-captured by OpenInference.",
    "parent": "handle_turn",
    "is_llm_call": true,
    "metadata": {
      "provider": "OpenAI",
      "model": "gpt-4o-mini"
    }
  },
  {
    "name": "save_conversation_to_redis",
    "code_pointer": "/home/user/myproject/src/services/redis_client.py:save_history",
    "description": "Writes updated conversation history back to Redis after LLM responds.",
    "parent": "handle_turn",
    "metadata": {
      "system": "Redis",
      "eval_relevant": false,
      "mock_strategy": "Capture written data for assertions"
    }
  }
]
```

### Conditional / optional branches

Some apps have conditional code paths where only one branch executes per request — e.g., `transfer_call` vs `end_call` depending on the outcome. `pixie dag check-trace` (Step 2) validates against a **single** trace, so every DAG node must appear in that trace.

**Rule**: If two or more functions are mutually exclusive (only one runs per request), model them as a **single dispatcher node** that covers the branching logic, not as separate DAG nodes. For example, instead of `end_call_tool` + `transfer_call_tool` as separate nodes, use `execute_tool` pointing at the dispatch function.

If a function only runs under certain conditions but is the sole branch (not mutually exclusive), include it in the DAG — just ensure your reference trace (Step 2) exercises that code path.
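A dispatcher of that shape might look like this sketch, where `end_call` and `transfer_call` are stand-ins for the mutually exclusive branches in the example above:

```python
def end_call(args: dict) -> str:
    return "call ended"      # stand-in for the real branch

def transfer_call(args: dict) -> str:
    return "transferred"     # stand-in for the real branch

def execute_tool(tool_call: dict) -> str:
    """The single DAG node: it covers both mutually exclusive branches, so any
    reference trace that reaches tool dispatch exercises this node."""
    handlers = {"end_call": end_call, "transfer_call": transfer_call}
    try:
        handler = handlers[tool_call["name"]]
    except KeyError:
        raise ValueError(f"unknown tool {tool_call['name']!r}") from None
    return handler(tool_call.get("args", {}))
```

The DAG's `code_pointer` would then target `execute_tool`, not the individual branch functions.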
### Validation checkpoint

After writing `pixie_qa/02-data-flow.json`, validate the DAG:

```bash
uv run pixie dag validate pixie_qa/02-data-flow.json
```

This command:

1. Checks the JSON structure is valid
2. Verifies node names use lower_snake_case
3. Verifies all node names are unique
4. Verifies all parent references exist
5. Checks exactly one root node exists (`parent` is null/omitted)
6. Detects cycles
7. Verifies code_pointer files exist on disk
8. Verifies symbols exist in the referenced files
9. Verifies line number ranges are valid (if present)
10. **Generates a Mermaid diagram** at `pixie_qa/02-data-flow.md` if validation passes

If validation fails, fix the errors and re-run. The error messages are specific — they tell you exactly which node has the problem and what's wrong.
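Checks 1–6 are pure structure and can be sketched independently of the CLI. The following mirrors those rules for intuition only; it is not `pixie`'s implementation and skips the on-disk checks 7–9:

```python
import re

def check_dag_structure(nodes: list[dict]) -> list[str]:
    """Structural checks: naming, uniqueness, parent refs, single root, cycles."""
    errors = []
    names = [n.get("name", "") for n in nodes]
    # lower_snake_case names
    for name in names:
        if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
            errors.append(f"not lower_snake_case: {name!r}")
    # unique names
    if len(set(names)) != len(names):
        errors.append("duplicate node names")
    # exactly one root (parent null/omitted)
    roots = [n for n in nodes if n.get("parent") is None]
    if len(roots) != 1:
        errors.append(f"expected exactly 1 root, found {len(roots)}")
    # parent references exist
    parent_of = {n.get("name"): n.get("parent") for n in nodes}
    for n in nodes:
        parent = n.get("parent")
        if parent is not None and parent not in parent_of:
            errors.append(f"unknown parent {parent!r} on {n.get('name')!r}")
    # cycle detection: walk each parent chain; revisiting a name means a cycle
    for n in nodes:
        seen, cur = set(), n.get("name")
        while cur is not None:
            if cur in seen:
                errors.append(f"cycle through {cur!r}")
                break
            seen.add(cur)
            cur = parent_of.get(cur)
    return errors
```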
### Also document testability seams

After the DAG JSON is validated, add a brief **testability seams** section at the bottom of the generated `pixie_qa/02-data-flow.md` (the Mermaid file). For each node that reads from or writes to an external system, note the mock interface:

| Dependency node | Interface / module boundary | Mock strategy |
| --------------- | --------------------------- | ------------- |
| ... | ... | ... |

This section supplements the DAG — the DAG captures _what_ the dependencies are, and this table captures _how_ to mock them.
skills/eval-driven-dev/references/1-c-eval-criteria.md

Lines changed: 85 additions & 0 deletions
@@ -0,0 +1,85 @@
# Step 1c: Eval Criteria

Define what quality dimensions matter for this app — based on the entry point (01-entry-point.md) and data flow (02-data-flow.md) you've already documented.

This document serves two purposes:

1. **Dataset creation (Step 5)**: The use cases tell you what kinds of eval_input items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 4)**: The eval criteria tell you what evaluators to choose and how to map them.

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

## What to define

### 1. Use cases

List the distinct scenarios the app handles. Each use case becomes a category of eval_input items in your dataset. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

**Good use case descriptions:**

- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"

**Bad use case descriptions (too vague):**

- "Handle billing questions"
- "Edge case"
- "Error handling"

### 2. Eval criteria

Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 4.

**Good criteria are specific to the app's purpose.** Examples:

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"

**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 4.

### 3. Check criteria applicability and observability

For each criterion:

1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 5 (dataset creation) because:

   - **Universal criteria** → become dataset-level default evaluators
   - **Case-specific criteria** → become item-level evaluators on relevant rows only

2. **Verify observability** — check that the data flow in `02-data-flow.md` includes the data needed to evaluate each criterion. If a criterion requires data that isn't in the processing stack, note what additional instrumentation is needed in Step 2.
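To make the scoping concrete, here is one hypothetical shape for how the two kinds of criteria could land in a Step 5 dataset (the evaluator and use-case names are invented for illustration, not a real schema):

```python
# Universal criteria -> dataset-level defaults; case-specific -> item-level evaluators.
dataset = {
    "default_evaluators": ["stays_in_support_domain"],  # applies to every row
    "items": [
        {
            "eval_input": "I can't log in and you can't find my account",
            "use_case": "reroute_on_account_lookup_difficulties",
            "evaluators": ["reroutes_to_human"],        # this row only
        },
        {
            "eval_input": "How much does my current plan cost?",
            "use_case": "answer_billing_question",
            "evaluators": ["uses_crm_plan_details"],    # this row only
        },
    ],
}

def evaluators_for(item: dict) -> list[str]:
    """An item is judged by the dataset defaults plus its own evaluators."""
    return dataset["default_evaluators"] + item.get("evaluators", [])
```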
---

## Output: `pixie_qa/03-eval-criteria.md`

Write your findings to this file. **Keep it short** — the template below is the maximum length.

### Template

```markdown
# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| # | Criterion | Applies to | Observable data needed |
| --- | --------- | ------------- | ---------------------- |
| 1 | ... | All | ... |
| 2 | ... | Use case 1, 3 | ... |

## Observability check

| Criterion | Available in data flow? | Gap? |
| --------- | ----------------------- | ---- |
| ... | Yes / No | ... |
```
