2 changes: 1 addition & 1 deletion docs/README.skills.md
Original file line number Diff line number Diff line change
@@ -129,7 +129,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [ef-core](../skills/ef-core/SKILL.md) | Get best practices for Entity Framework Core | None |
| [email-drafter](../skills/email-drafter/SKILL.md) | Draft and review professional emails that match your personal writing style. Analyzes your sent emails for tone, greeting, structure, and sign-off patterns via WorkIQ, then generates context-aware drafts for any recipient. USE FOR: draft email, write email, compose email, reply email, follow-up email, analyze email tone, email style. | None |
| [entra-agent-user](../skills/entra-agent-user/SKILL.md) | Create Agent Users in Microsoft Entra ID from Agent Identities, enabling AI agents to act as digital workers with user identity capabilities in Microsoft 365 and Azure environments. | None |
| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/dataset-generation.md`<br />`references/eval-tests.md`<br />`references/instrumentation.md`<br />`references/investigation.md`<br />`references/pixie-api.md`<br />`references/run-harness-patterns.md`<br />`references/understanding-app.md` |
| [eval-driven-dev](../skills/eval-driven-dev/SKILL.md) | Set up eval-based QA for Python LLM applications: instrument the app, build golden datasets, write and run eval tests, and iterate on failures. ALWAYS USE THIS SKILL when the user asks to set up QA, add tests, add evals, evaluate, benchmark, fix wrong behaviors, improve quality, or do quality assurance for any Python project that calls an LLM model. | `references/1-a-entry-point.md`<br />`references/1-b-data-flow.md`<br />`references/1-c-eval-criteria.md`<br />`references/2-instrument-and-observe.md`<br />`references/3-run-harness.md`<br />`references/4-define-evaluators.md`<br />`references/5-build-dataset.md`<br />`references/6-run-tests.md`<br />`references/7-investigation.md`<br />`references/evaluators.md`<br />`references/instrumentation-api.md`<br />`references/run-harness-examples`<br />`references/testing-api.md`<br />`resources` |
| [excalidraw-diagram-generator](../skills/excalidraw-diagram-generator/SKILL.md) | Generate Excalidraw diagrams from natural language descriptions. Use when asked to "create a diagram", "make a flowchart", "visualize a process", "draw a system architecture", "create a mind map", or "generate an Excalidraw file". Supports flowcharts, relationship diagrams, mind maps, and system architecture diagrams. Outputs .excalidraw JSON files that can be opened directly in Excalidraw. | `references/element-types.md`<br />`references/excalidraw-schema.md`<br />`scripts/.gitignore`<br />`scripts/README.md`<br />`scripts/add-arrow.py`<br />`scripts/add-icon-to-diagram.py`<br />`scripts/split-excalidraw-library.py`<br />`templates` |
| [fabric-lakehouse](../skills/fabric-lakehouse/SKILL.md) | Use this skill to get context about Fabric Lakehouse and its features for software systems and AI-powered functions. It offers descriptions of Lakehouse data components, organization with schemas and shortcuts, access control, and code examples. This skill supports users in designing, building, and optimizing Lakehouse solutions using best practices. | `references/getdata.md`<br />`references/pyspark.md` |
| [fedora-linux-triage](../skills/fedora-linux-triage/SKILL.md) | Triage and resolve Fedora issues with dnf, systemd, and SELinux-aware guidance. | None |
340 changes: 72 additions & 268 deletions skills/eval-driven-dev/SKILL.md

Large diffs are not rendered by default.

68 changes: 68 additions & 0 deletions skills/eval-driven-dev/references/1-a-entry-point.md
@@ -0,0 +1,68 @@
# Step 1a: Entry Point & Execution Flow

Identify how the application starts and how a real user invokes it.

---

## What to investigate

### 1. How the software runs

What is the entry point? How do you start it? Is it a CLI, a server, a library function? What are the required arguments, config files, or environment variables?

Look for:

- `if __name__ == "__main__"` blocks
- Framework entry points (FastAPI `app`, Flask `app`, Django `manage.py`)
- CLI entry points in `pyproject.toml` (`[project.scripts]`)
- Docker/compose configs that reveal startup commands
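As a hypothetical illustration (names and arguments are invented, not from any real app), a minimal CLI entry point combining two of these markers — an argparse interface plus an `if __name__ == "__main__"` block — might look like:

```python
import argparse


def main(argv=None):
    # Parse the arguments a real user would provide on the command line.
    # Passing argv explicitly keeps the function callable from an eval harness.
    parser = argparse.ArgumentParser(description="Hypothetical LLM app entry point")
    parser.add_argument("--query", required=True, help="User query to process")
    args = parser.parse_args(argv)
    return f"processing: {args.query}"


if __name__ == "__main__":  # the marker to look for when locating the entry point
    print(main())
```

Note that `main(argv)` accepting an explicit argument list is itself a testability seam: the eval harness can call it directly without spawning a subprocess.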

### 2. The real user entry point

How does a real user or client invoke the app? This is what the eval must exercise — not an inner function that bypasses the request pipeline.

- **Web server**: Which HTTP endpoints accept user input? What methods (GET/POST)? What request body shape?
- **CLI**: What command-line arguments does the user provide?
- **Library/function**: What function does the caller import and call? What arguments?

### 3. Environment and configuration

- What env vars does the app require? (API keys, database URLs, feature flags)
- What config files does it read?
- What has sensible defaults vs. what must be explicitly set?
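A sketch of the defaults-vs-required distinction, assuming hypothetical variable names (`MODEL_NAME`, `DATABASE_URL` are illustrative, not from any specific app):

```python
import os


def load_config(env=None):
    # Accepting an env mapping (defaulting to os.environ) makes the
    # config loader easy to exercise in tests with a plain dict.
    if env is None:
        env = os.environ
    return {
        "model": env.get("MODEL_NAME", "gpt-4o-mini"),  # sensible default
        "db_url": env.get("DATABASE_URL"),              # must be set explicitly
    }
```

Documenting which values fall into each bucket tells you what the eval harness must provide and what it can leave alone.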

---

## Output: `pixie_qa/01-entry-point.md`

Write your findings to this file. Keep it focused — only entry point and execution flow.

### Template

```markdown
# Entry Point & Execution Flow

## How to run

<Command to start the app, required env vars, config files>

## Entry point

- **File**: <e.g., app.py, main.py>
- **Type**: <FastAPI server / CLI / standalone function / etc.>
- **Framework**: <FastAPI, Flask, Django, none>

## User-facing endpoints / interface

<For each way a user interacts with the app:>

- **Endpoint / command**: <e.g., POST /chat, python main.py --query "...">
- **Input format**: <request body shape, CLI args, function params>
- **Output format**: <response shape, stdout format, return type>

## Environment requirements

| Variable | Purpose | Required? | Default |
| -------- | ------- | --------- | ------- |
| ... | ... | ... | ... |
```
187 changes: 187 additions & 0 deletions skills/eval-driven-dev/references/1-b-data-flow.md

@@ -0,0 +1,187 @@
# Step 1b: Processing Stack & Data Flow — DAG Artifact

Map the complete data flow through the application by producing a **structured DAG JSON file** that represents every important node in the processing pipeline.

---

## What to investigate

### 1. Find where the LLM provider client is called

Locate every place in the codebase where an LLM provider client is invoked (e.g., `openai.ChatCompletion.create()`, `client.chat.completions.create()`, `anthropic.messages.create()`). These are the anchor points for your analysis. For each LLM call site, record:

- The file and function where the call lives
- Which LLM provider/client is used
- The exact arguments being passed (model, messages, tools, etc.)

### 2. Find the common ancestor entry point

Identify the single function that is the common ancestor of all LLM calls — the application's entry point for a single user request. This becomes the **root** of your DAG.

### 3. Track backwards: external data dependencies flowing IN

Starting from each LLM call site, trace **backwards** through the code to find every piece of data that feeds into the LLM prompt:

- **Application inputs**: user messages, queries, uploaded files, config
- **External dependency data**: database lookups (Redis, Postgres), retrieved context (RAG), cache reads, third-party API responses
- **In-code data**: system prompts, tool definitions, prompt-building logic

### 4. Track forwards: external side-effects flowing OUT

Starting from each LLM call site, trace **forwards** to find every side-effect: database writes, API calls, messages sent, file writes.

### 5. Identify intermediate states

Along the paths between input and output, identify intermediate states needed for evaluation: tool call decisions, routing/handoff decisions, retrieval results, branching logic.

### 6. Identify testability seams

Look for abstract base classes, protocols, or constructor-injected backends. These are testability seams — you'll create mock implementations of these interfaces. If there's no clean interface, you'll use `unittest.mock.patch` at the module boundary.
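A minimal sketch of the kind of seam this describes — an abstract backend injected into the request handler, with a fake implementation for the harness. All class and function names here are illustrative assumptions, not from any real codebase:

```python
from abc import ABC, abstractmethod


class HistoryStore(ABC):
    """Abstract seam: production code depends on this interface, not on Redis."""

    @abstractmethod
    def get_history(self, session_id: str) -> list:
        ...


class FakeHistoryStore(HistoryStore):
    """Mock implementation the eval harness injects with canned data."""

    def __init__(self, canned):
        self.canned = canned

    def get_history(self, session_id: str) -> list:
        return self.canned


def handle_turn(store: HistoryStore, session_id: str) -> int:
    # Because the store is passed in, tests can swap in FakeHistoryStore
    # without touching Redis or patching module globals.
    history = store.get_history(session_id)
    return len(history)
```

When no such interface exists, `unittest.mock.patch("app.services.redis_client.get_history")` at the module boundary achieves the same substitution, at the cost of coupling the test to the module path.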

---

## Output: `pixie_qa/02-data-flow.json`

**Write a JSON file** (not markdown) containing a flat array of DAG nodes. Each node represents a significant point in the processing pipeline.

### Node schema

Each node is a JSON object with these fields:

| Field | Type | Required | Description |
| -------------- | -------------- | -------- | ------------------------------------------------------------------------------------------------------------------ |
| `name` | string | Yes | Unique, meaningful lower_snake_case node name (for example, `handle_turn`). This is the node identity. |
| `code_pointer` | string | Yes | **Absolute** file path with function/method name, optionally with line range. See format below. |
| `description` | string | Yes | What this node does and why it matters for evaluation. |
| `parent` | string or null | No | Parent node name (`null` or omitted for root). |
| `is_llm_call` | boolean | No | Set `true` only if the node represents an LLM provider call. Defaults to `false` when omitted. |
| `metadata` | object | No | Additional info: `mock_strategy`, `data_shape`, `credentials_needed`, `eval_relevant`, external system notes, etc. |

### About `is_llm_call`

- Use `is_llm_call: true` for nodes that represent real LLM provider spans.
- Leave it omitted (or `false`) for all other nodes.

### `code_pointer` format

The `code_pointer` field uses **absolute file paths** with a symbol name, and an optional line number range:

- `<absolute_file_path>:<symbol>` — points to a whole function or method. Use this when the entire function represents a single node in the DAG (most common case).
- `<absolute_file_path>:<symbol>:<start_line>:<end_line>` — points to a specific line range within a function. Use this when the function contains an **important intermediate state** — a code fragment that transforms some input into an output that matters for evaluation, but the fragment is embedded inside a larger function rather than being its own function.

**When to use a line range (intermediate states):**

Some functions do multiple important things sequentially. If one of those things produces an intermediate state that your evaluators need to see (e.g., a routing decision, a context assembly step, a tool-call dispatch), but it's not factored into its own function, use a line range to identify that specific fragment. The line range marks the input → output boundary of that intermediate state within the larger function.

Examples of intermediate states that warrant a line range:

- **Routing decision**: lines 51–71 of `main()` decide which agent to hand off to based on user intent — the input is the user message, the output is the selected agent
- **Context assembly**: lines 30–45 of `handle_request()` gather documents from a vector store and format them into a prompt — the input is the query, the output is the assembled context
- **Tool dispatch**: lines 80–95 of `process_turn()` parse the LLM's tool-call response and execute the selected tool — the input is the tool-call JSON, the output is the tool result

If the intermediate state is already its own function, just use the function-level `code_pointer` (no line range needed).

Examples:

- `/home/user/myproject/app.py:handle_turn` — whole function
- `/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response` — whole function
- `/home/user/myproject/src/agents/agent.py:main:51:71` — lines 51–71 of `main()`, where a routing decision happens

The symbol can be:

- A function name: `my_func` → matches `def my_func` in the file
- A class.method: `MyClass.func` → matches `def func` inside `class MyClass`
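The format above can be parsed mechanically. A small sketch (my own helper, not part of the pixie API) showing the two accepted shapes:

```python
def parse_code_pointer(pointer: str) -> dict:
    """Parse '<path>:<symbol>' or '<path>:<symbol>:<start>:<end>' (illustrative)."""
    parts = pointer.split(":")
    if len(parts) == 2:
        path, symbol = parts
        return {"path": path, "symbol": symbol, "lines": None}
    if len(parts) == 4:
        path, symbol, start, end = parts
        return {"path": path, "symbol": symbol, "lines": (int(start), int(end))}
    raise ValueError(f"unrecognized code_pointer: {pointer!r}")
```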

### Example

```json
[
{
"name": "handle_turn",
"code_pointer": "/home/user/myproject/src/agents/agent.py:handle_turn",
"description": "Entry point for a single user request. Takes user message + conversation history, returns agent response.",
"parent": null,
"metadata": {
"data_shape": {
"input": "str (user message)",
"output": "str (response text)"
}
}
},
{
"name": "load_conversation_history",
"code_pointer": "/home/user/myproject/src/services/redis_client.py:get_history",
"description": "Reads conversation history from Redis. Returns list of message dicts.",
"parent": "handle_turn",
"metadata": {
"system": "Redis",
"data_shape": "list[dict] with role/content keys",
"mock_strategy": "Provide canned history list",
"credentials_needed": true
}
},
{
"name": "run_ai_response",
"code_pointer": "/home/user/myproject/src/agents/llm/openai_llm.py:run_ai_response",
"description": "Calls OpenAI API with system prompt + history + user message. Auto-captured by OpenInference.",
"parent": "handle_turn",
"is_llm_call": true,
"metadata": {
"provider": "OpenAI",
"model": "gpt-4o-mini"
}
},
{
"name": "save_conversation_to_redis",
"code_pointer": "/home/user/myproject/src/services/redis_client.py:save_history",
"description": "Writes updated conversation history back to Redis after LLM responds.",
"parent": "handle_turn",
"metadata": {
"system": "Redis",
"eval_relevant": false,
"mock_strategy": "Capture written data for assertions"
}
}
]
```

### Conditional / optional branches

Some apps have conditional code paths where only one branch executes per request — e.g., `transfer_call` vs `end_call` depending on the outcome. `pixie dag check-trace` (Step 2) validates against a **single** trace, so every DAG node must appear in that trace.

**Rule**: If two or more functions are mutually exclusive (only one runs per request), model them as a **single dispatcher node** that covers the branching logic, not as separate DAG nodes. For example, instead of `end_call_tool` + `transfer_call_tool` as separate nodes, use `execute_tool` pointing at the dispatch function.

If a function only runs under certain conditions but is the sole branch (not mutually exclusive), include it in the DAG — just ensure your reference trace (Step 2) exercises that code path.
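A hypothetical dispatcher of the kind the rule describes (function and tool names are invented): one function covers the mutually exclusive branches, so the DAG needs only a single `execute_tool` node pointing at it.

```python
def end_call_tool(args):
    # Only one of these tool functions runs per request
    return "call ended"


def transfer_call_tool(args):
    return f"transferred to {args['target']}"


TOOLS = {"end_call": end_call_tool, "transfer_call": transfer_call_tool}


def execute_tool(tool_call: dict) -> str:
    # Single DAG node: covers the branch decision plus the selected tool,
    # so any single reference trace passes through it regardless of branch.
    return TOOLS[tool_call["name"]](tool_call.get("args", {}))
```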

### Validation checkpoint

After writing `pixie_qa/02-data-flow.json`, validate the DAG:

```bash
uv run pixie dag validate pixie_qa/02-data-flow.json
```

This command:

1. Checks the JSON structure is valid
2. Verifies node names use lower_snake_case
3. Verifies all node names are unique
4. Verifies all parent references exist
5. Checks exactly one root node exists (`parent` is null/omitted)
6. Detects cycles
7. Verifies code_pointer files exist on disk
8. Verifies symbols exist in the referenced files
9. Verifies line number ranges are valid (if present)
10. **Generates a Mermaid diagram** at `pixie_qa/02-data-flow.md` if validation passes

If validation fails, fix the errors and re-run. The error messages are specific — they tell you exactly which node has the problem and what's wrong.

### Also document testability seams

After the DAG JSON is validated, add a brief **testability seams** section at the bottom of the generated `pixie_qa/02-data-flow.md` (the Mermaid file). For each node that reads from or writes to an external system, note the mock interface:

| Dependency node | Interface / module boundary | Mock strategy |
| --------------- | --------------------------- | ------------- |
| ... | ... | ... |

This section supplements the DAG — the DAG captures _what_ the dependencies are, and this table captures _how_ to mock them.
85 changes: 85 additions & 0 deletions skills/eval-driven-dev/references/1-c-eval-criteria.md
@@ -0,0 +1,85 @@
# Step 1c: Eval Criteria

Define what quality dimensions matter for this app — based on the entry point (01-entry-point.md) and data flow (02-data-flow.md) you've already documented.

This document serves two purposes:

1. **Dataset creation (Step 5)**: The use cases tell you what kinds of `eval_input` items to generate — each use case should have representative items in the dataset.
2. **Evaluator selection (Step 4)**: The eval criteria tell you what evaluators to choose and how to map them.

Keep this concise — it's a planning artifact, not a comprehensive spec.

---

## What to define

### 1. Use cases

List the distinct scenarios the app handles. Each use case becomes a category of `eval_input` items in your dataset. **Each use case description must be a concise one-liner that conveys both (a) what the input is and (b) what the expected behavior or outcome is.** The description should be specific enough that someone unfamiliar with the app can understand the scenario and its success criteria.

**Good use case descriptions:**

- "Reroute to human agent on account lookup difficulties"
- "Answer billing question using customer's plan details from CRM"
- "Decline to answer questions outside the support domain"
- "Summarize research findings including all queried sub-topics"

**Bad use case descriptions (too vague):**

- "Handle billing questions"
- "Edge case"
- "Error handling"

### 2. Eval criteria

Define **high-level, application-specific eval criteria** — quality dimensions that matter for THIS app. Each criterion will map to an evaluator in Step 4.

**Good criteria are specific to the app's purpose.** Examples:

- Voice customer support agent: "Does the agent verify the caller's identity before transferring?", "Are responses concise enough for phone conversation?"
- Research report generator: "Does the report address all sub-questions?", "Are claims supported by retrieved sources?"
- RAG chatbot: "Are answers grounded in the retrieved context?", "Does it say 'I don't know' when context is missing?"

**Bad criteria are generic evaluator names dressed up as requirements.** Don't say "Factual accuracy" or "Response relevance" — say what factual accuracy or relevance means for THIS app.

At this stage, don't pick evaluator classes or thresholds. That comes in Step 4.

### 3. Check criteria applicability and observability

For each criterion:

1. **Determine applicability scope** — does this criterion apply to ALL use cases, or only a subset? If a criterion is only relevant for certain scenarios (e.g., "identity verification" only applies to account-related requests, not general FAQ), mark it clearly. This distinction is critical for Step 5 (dataset creation) because:
- **Universal criteria** → become dataset-level default evaluators
- **Case-specific criteria** → become item-level evaluators on relevant rows only

2. **Verify observability** — check that the data flow in `02-data-flow.md` includes the data needed to evaluate each criterion. If a criterion requires data that isn't in the processing stack, note what additional instrumentation is needed in Step 2.

---

## Output: `pixie_qa/03-eval-criteria.md`

Write your findings to this file. **Keep it short** — the template below is the maximum length.

### Template

```markdown
# Eval Criteria

## Use cases

1. <Use case name>: <one-liner conveying input + expected behavior>
2. ...

## Eval criteria

| # | Criterion | Applies to | Observable data needed |
| --- | --------- | ------------- | ---------------------- |
| 1 | ... | All | ... |
| 2 | ... | Use case 1, 3 | ... |

## Observability check

| Criterion | Available in data flow? | Gap? |
| --------- | ----------------------- | ---- |
| ... | Yes / No | ... |
```