
Commit a8e3a88

jimbobbennett and claude committed
Add 3 Phoenix AI observability skills
Add skills for Phoenix (Arize open-source) covering CLI debugging, LLM evaluation workflows, and OpenInference tracing/instrumentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6020f64 commit a8e3a88

69 files changed

Lines changed: 6151 additions & 1 deletion


docs/README.skills.md

Lines changed: 4 additions & 1 deletion
@@ -208,6 +208,9 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [openapi-to-application-code](../skills/openapi-to-application-code/SKILL.md) | Generate a complete, production-ready application from an OpenAPI specification | None |
| [pdftk-server](../skills/pdftk-server/SKILL.md) | Skill for using the command-line tool pdftk (PDFtk Server) for working with PDF files. Use when asked to merge PDFs, split PDFs, rotate pages, encrypt or decrypt PDFs, fill PDF forms, apply watermarks, stamp overlays, extract metadata, burst documents into pages, repair corrupted PDFs, attach or extract files, or perform any PDF manipulation from the command line. | `references/download.md`<br />`references/pdftk-cli-examples.md`<br />`references/pdftk-man-page.md`<br />`references/pdftk-server-license.md`<br />`references/third-party-materials.md` |
| [penpot-uiux-design](../skills/penpot-uiux-design/SKILL.md) | Comprehensive guide for creating professional UI/UX designs in Penpot using MCP tools. Use this skill when: (1) Creating new UI/UX designs for web, mobile, or desktop applications, (2) Building design systems with components and tokens, (3) Designing dashboards, forms, navigation, or landing pages, (4) Applying accessibility standards and best practices, (5) Following platform guidelines (iOS, Android, Material Design), (6) Reviewing or improving existing Penpot designs for usability. Triggers: "design a UI", "create interface", "build layout", "design dashboard", "create form", "design landing page", "make it accessible", "design system", "component library". | `references/accessibility.md`<br />`references/component-patterns.md`<br />`references/platform-guidelines.md`<br />`references/setup-troubleshooting.md` |
| [phoenix-cli](../skills/phoenix-cli/SKILL.md) | Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, review experiments, inspect datasets, and query the GraphQL API. Use when debugging AI/LLM applications, analyzing trace data, working with Phoenix observability, or investigating LLM performance issues. | None |
| [phoenix-evals](../skills/phoenix-evals/SKILL.md) | Build and run evaluators for AI/LLM applications using Phoenix. | `references/axial-coding.md`<br />`references/common-mistakes-python.md`<br />`references/error-analysis-multi-turn.md`<br />`references/error-analysis.md`<br />`references/evaluate-dataframe-python.md`<br />`references/evaluators-code-python.md`<br />`references/evaluators-code-typescript.md`<br />`references/evaluators-custom-templates.md`<br />`references/evaluators-llm-python.md`<br />`references/evaluators-llm-typescript.md`<br />`references/evaluators-overview.md`<br />`references/evaluators-pre-built.md`<br />`references/evaluators-rag.md`<br />`references/experiments-datasets-python.md`<br />`references/experiments-datasets-typescript.md`<br />`references/experiments-overview.md`<br />`references/experiments-running-python.md`<br />`references/experiments-running-typescript.md`<br />`references/experiments-synthetic-python.md`<br />`references/experiments-synthetic-typescript.md`<br />`references/fundamentals-anti-patterns.md`<br />`references/fundamentals-model-selection.md`<br />`references/fundamentals.md`<br />`references/observe-sampling-python.md`<br />`references/observe-sampling-typescript.md`<br />`references/observe-tracing-setup.md`<br />`references/production-continuous.md`<br />`references/production-guardrails.md`<br />`references/production-overview.md`<br />`references/setup-python.md`<br />`references/setup-typescript.md`<br />`references/validation-evaluators-python.md`<br />`references/validation-evaluators-typescript.md`<br />`references/validation.md` |
| [phoenix-tracing](../skills/phoenix-tracing/SKILL.md) | OpenInference semantic conventions and instrumentation for Phoenix AI observability. Use when implementing LLM tracing, creating custom spans, or deploying to production. | `references/README.md`<br />`references/annotations-overview.md`<br />`references/annotations-python.md`<br />`references/annotations-typescript.md`<br />`references/fundamentals-flattening.md`<br />`references/fundamentals-overview.md`<br />`references/fundamentals-required-attributes.md`<br />`references/fundamentals-universal-attributes.md`<br />`references/instrumentation-auto-python.md`<br />`references/instrumentation-auto-typescript.md`<br />`references/instrumentation-manual-python.md`<br />`references/instrumentation-manual-typescript.md`<br />`references/metadata-python.md`<br />`references/metadata-typescript.md`<br />`references/production-python.md`<br />`references/production-typescript.md`<br />`references/projects-python.md`<br />`references/projects-typescript.md`<br />`references/sessions-python.md`<br />`references/sessions-typescript.md`<br />`references/setup-python.md`<br />`references/setup-typescript.md`<br />`references/span-agent.md`<br />`references/span-chain.md`<br />`references/span-embedding.md`<br />`references/span-evaluator.md`<br />`references/span-guardrail.md`<br />`references/span-llm.md`<br />`references/span-reranker.md`<br />`references/span-retriever.md`<br />`references/span-tool.md` |
| [php-mcp-server-generator](../skills/php-mcp-server-generator/SKILL.md) | Generate a complete PHP Model Context Protocol server project with tools, resources, prompts, and tests using the official PHP SDK | None |
| [planning-oracle-to-postgres-migration-integration-testing](../skills/planning-oracle-to-postgres-migration-integration-testing/SKILL.md) | Creates an integration testing plan for .NET data access artifacts during Oracle-to-PostgreSQL database migrations. Analyzes a single project to identify repositories, DAOs, and service layers that interact with the database, then produces a structured testing plan. Use when planning integration test coverage for a migrated project, identifying which data access methods need tests, or preparing for Oracle-to-PostgreSQL migration validation. | None |
| [plantuml-ascii](../skills/plantuml-ascii/SKILL.md) | Generate ASCII art diagrams using PlantUML text mode. Use when user asks to create ASCII diagrams, text-based diagrams, terminal-friendly diagrams, or mentions plantuml ascii, text diagram, ascii art diagram. Supports: Converting PlantUML diagrams to ASCII art, Creating sequence diagrams, class diagrams, flowcharts in ASCII format, Generating Unicode-enhanced ASCII art with -utxt flag | None |
@@ -288,7 +291,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [webapp-testing](../skills/webapp-testing/SKILL.md) | Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs. | `assets/test-helper.js` |
| [what-context-needed](../skills/what-context-needed/SKILL.md) | Ask Copilot what files it needs to see before answering a question | None |
| [winapp-cli](../skills/winapp-cli/SKILL.md) | Windows App Development CLI (winapp) for building, packaging, and deploying Windows applications. Use when asked to initialize Windows app projects, create MSIX packages, generate AppxManifest.xml, manage development certificates, add package identity for debugging, sign packages, publish to the Microsoft Store, create external catalogs, or access Windows SDK build tools. Supports .NET (csproj), C++, Electron, Rust, Tauri, and cross-platform frameworks targeting Windows. | None |
| [winmd-api-search](../skills/winmd-api-search/SKILL.md) | Find and explore Windows desktop APIs. Use when building features that need platform capabilities — camera, file access, notifications, UI controls, AI/ML, sensors, networking, etc. Discovers the right API for a task and retrieves full type details (methods, properties, events, enumeration values). | `.DS_Store`<br />`LICENSE.txt`<br />`scripts/Invoke-WinMdQuery.ps1`<br />`scripts/Update-WinMdCache.ps1`<br />`scripts/cache-generator` |
| [winui3-migration-guide](../skills/winui3-migration-guide/SKILL.md) | UWP-to-WinUI 3 migration reference. Maps legacy UWP APIs to correct Windows App SDK equivalents with before/after code snippets. Covers namespace changes, threading (CoreDispatcher to DispatcherQueue), windowing (CoreWindow to AppWindow), dialogs, pickers, sharing, printing, background tasks, and the most common Copilot code generation mistakes. | None |
| [workiq-copilot](../skills/workiq-copilot/SKILL.md) | Guides the Copilot CLI on how to use the WorkIQ CLI/MCP server to query Microsoft 365 Copilot data (emails, meetings, docs, Teams, people) for live context, summaries, and recommendations. | None |
| [write-coding-standards-from-file](../skills/write-coding-standards-from-file/SKILL.md) | Write a coding standards document for a project using the coding styles from the file(s) and/or folder(s) passed as arguments in the prompt. | None |

skills/phoenix-cli/SKILL.md

Lines changed: 161 additions & 0 deletions
---
name: phoenix-cli
description: Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, review experiments, inspect datasets, and query the GraphQL API. Use when debugging AI/LLM applications, analyzing trace data, working with Phoenix observability, or investigating LLM performance issues.
license: Apache-2.0
metadata:
  author: arize-ai
  version: "2.0.0"
---

# Phoenix CLI

## Invocation

```bash
px <resource> <action>                        # if installed globally
npx @arizeai/phoenix-cli <resource> <action>  # no install required
```

The CLI uses singular resource commands with subcommands like `list` and `get`:

```bash
px trace list
px trace get <trace-id>
px span list
px dataset list
px dataset get <name>
```

## Setup

```bash
export PHOENIX_HOST=http://localhost:6006
export PHOENIX_PROJECT=my-project
export PHOENIX_API_KEY=your-api-key  # if auth is enabled
```

Always use `--format raw --no-progress` when piping to `jq`.

## Traces

```bash
px trace list --limit 20 --format raw --no-progress | jq .
px trace list --last-n-minutes 60 --limit 20 --format raw --no-progress | jq '.[] | select(.status == "ERROR")'
px trace list --format raw --no-progress | jq 'sort_by(-.duration) | .[0:5]'
px trace get <trace-id> --format raw | jq .
px trace get <trace-id> --format raw | jq '.spans[] | select(.status_code != "OK")'
```

## Spans

```bash
px span list --limit 20                        # recent spans (table view)
px span list --last-n-minutes 60 --limit 50    # spans from last hour
px span list --span-kind LLM --limit 10        # only LLM spans
px span list --status-code ERROR --limit 20    # only errored spans
px span list --name chat_completion --limit 10 # filter by span name
px span list --trace-id <id> --format raw --no-progress | jq .  # all spans for a trace
px span list --include-annotations --limit 10  # include annotation scores
px span list output.json --limit 100           # save to JSON file
px span list --format raw --no-progress | jq '.[] | select(.status_code == "ERROR")'
```

### Span JSON shape

```
Span
  name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
  status_code ("OK"|"ERROR"|"UNSET"), status_message
  context.span_id, context.trace_id, parent_id
  start_time, end_time
  attributes (same as the trace span attributes below)
  annotations[] (with --include-annotations)
    name, result { score, label, explanation }
```
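The flattened attribute keys make aggregation straightforward once span JSON is saved (for example via `px span list output.json`). A minimal Python sketch — the span data below is invented for illustration; only the field names follow the documented shape:

```python
# Illustrative spans shaped like `px span list --format raw` output
# (flattened attribute keys). The values are made-up sample data.
spans = [
    {
        "name": "chat_completion",
        "span_kind": "LLM",
        "status_code": "OK",
        "attributes": {
            "llm.model_name": "gpt-4o",
            "llm.token_count.prompt": 120,
            "llm.token_count.completion": 30,
            "llm.token_count.total": 150,
        },
    },
    {
        "name": "chat_completion",
        "span_kind": "LLM",
        "status_code": "ERROR",
        "attributes": {
            "llm.model_name": "gpt-4o",
            "llm.token_count.prompt": 80,
            "llm.token_count.completion": 0,
            "llm.token_count.total": 80,
        },
    },
]

def token_totals_by_model(spans):
    """Sum llm.token_count.total per model across LLM spans."""
    totals = {}
    for span in spans:
        if span.get("span_kind") != "LLM":
            continue
        attrs = span.get("attributes", {})
        model = attrs.get("llm.model_name", "unknown")
        totals[model] = totals.get(model, 0) + attrs.get("llm.token_count.total", 0)
    return totals

print(token_totals_by_model(spans))  # {'gpt-4o': 230}
```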
76+
### Trace JSON shape
77+
78+
```
79+
Trace
80+
traceId, status ("OK"|"ERROR"), duration (ms), startTime, endTime
81+
rootSpan — top-level span (parent_id: null)
82+
spans[]
83+
name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT")
84+
status_code ("OK"|"ERROR"), parent_id, context.span_id
85+
attributes
86+
input.value, output.value — raw input/output
87+
llm.model_name, llm.provider
88+
llm.token_count.prompt/completion/total
89+
llm.token_count.prompt_details.cache_read
90+
llm.token_count.completion_details.reasoning
91+
llm.input_messages.{N}.message.role/content
92+
llm.output_messages.{N}.message.role/content
93+
llm.invocation_parameters — JSON string (temperature, etc.)
94+
exception.message — set if span errored
95+
```
96+
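When `jq` isn't enough, a few lines of Python can walk a fetched trace — for example, finding the root span and pulling exception messages from errored spans. A sketch with made-up data; only the field names follow the documented trace shape:

```python
# Illustrative trace shaped like `px trace get --format raw` output.
trace = {
    "traceId": "abc123",
    "status": "ERROR",
    "duration": 1250,
    "spans": [
        {"name": "agent", "span_kind": "CHAIN", "status_code": "OK",
         "parent_id": None, "context": {"span_id": "s1"}, "attributes": {}},
        {"name": "chat_completion", "span_kind": "LLM", "status_code": "ERROR",
         "parent_id": "s1", "context": {"span_id": "s2"},
         "attributes": {"exception.message": "rate limit exceeded"}},
    ],
}

def root_span(trace):
    """The root span is the one with no parent."""
    return next(s for s in trace["spans"] if s["parent_id"] is None)

def error_spans(trace):
    """Errored spans, paired with their exception messages if present."""
    return [
        (s["name"], s["attributes"].get("exception.message"))
        for s in trace["spans"]
        if s["status_code"] == "ERROR"
    ]

print(root_span(trace)["name"])  # agent
print(error_spans(trace))        # [('chat_completion', 'rate limit exceeded')]
```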
97+
## Sessions
98+
99+
```bash
100+
px session list --limit 10 --format raw --no-progress | jq .
101+
px session list --order asc --format raw --no-progress | jq '.[].session_id'
102+
px session get <session-id> --format raw | jq .
103+
px session get <session-id> --include-annotations --format raw | jq '.annotations'
104+
```
105+
106+
### Session JSON shape
107+
108+
```
109+
SessionData
110+
id, session_id, project_id
111+
start_time, end_time
112+
traces[]
113+
id, trace_id, start_time, end_time
114+
115+
SessionAnnotation (with --include-annotations)
116+
id, name, annotator_kind ("LLM"|"CODE"|"HUMAN"), session_id
117+
result { label, score, explanation }
118+
metadata, identifier, source, created_at, updated_at
119+
```
120+
121+
## Datasets / Experiments / Prompts
122+
123+
```bash
124+
px dataset list --format raw --no-progress | jq '.[].name'
125+
px dataset get <name> --format raw | jq '.examples[] | {input, output: .expected_output}'
126+
px experiment list --dataset <name> --format raw --no-progress | jq '.[] | {id, name, failed_run_count}'
127+
px experiment get <id> --format raw --no-progress | jq '.[] | select(.error != null) | {input, error}'
128+
px prompt list --format raw --no-progress | jq '.[].name'
129+
px prompt get <name> --format text --no-progress # plain text, ideal for piping to AI
130+
```
131+
132+
## GraphQL
133+
134+
For ad-hoc queries not covered by the commands above. Output is `{"data": {...}}`.
135+
136+
```bash
137+
px api graphql '{ projectCount datasetCount promptCount evaluatorCount }'
138+
px api graphql '{ projects { edges { node { name traceCount tokenCountTotal } } } }' | jq '.data.projects.edges[].node'
139+
px api graphql '{ datasets { edges { node { name exampleCount experimentCount } } } }' | jq '.data.datasets.edges[].node'
140+
px api graphql '{ evaluators { edges { node { name kind } } } }' | jq '.data.evaluators.edges[].node'
141+
142+
# Introspect any type
143+
px api graphql '{ __type(name: "Project") { fields { name type { name } } } }' | jq '.data.__type.fields[]'
144+
```
145+
146+
Key root fields: `projects`, `datasets`, `prompts`, `evaluators`, `projectCount`, `datasetCount`, `promptCount`, `evaluatorCount`, `viewer`.
147+
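The list fields use the Relay-style connection shape (`edges` wrapping `node` objects), which is why the `jq` pipelines above all end in `.edges[].node`. The same flattening in Python, with an invented response for illustration:

```python
# Illustrative `px api graphql` response: output is wrapped in {"data": ...}
# and list fields use the Relay connection shape (edges/node).
response = {
    "data": {
        "projects": {
            "edges": [
                {"node": {"name": "my-project", "traceCount": 42}},
                {"node": {"name": "staging", "traceCount": 7}},
            ]
        }
    }
}

def nodes(response, field):
    """Flatten a Relay-style connection into a plain list of nodes."""
    return [edge["node"] for edge in response["data"][field]["edges"]]

print(nodes(response, "projects"))
# [{'name': 'my-project', 'traceCount': 42}, {'name': 'staging', 'traceCount': 7}]
```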
148+
## Docs
149+
150+
Download Phoenix documentation markdown for local use by coding agents.
151+
152+
```bash
153+
px docs fetch # fetch default workflow docs to .px/docs
154+
px docs fetch --workflow tracing # fetch only tracing docs
155+
px docs fetch --workflow tracing --workflow evaluation
156+
px docs fetch --dry-run # preview what would be downloaded
157+
px docs fetch --refresh # clear .px/docs and re-download
158+
px docs fetch --output-dir ./my-docs # custom output directory
159+
```
160+
161+
Key options: `--workflow` (repeatable, values: `tracing`, `evaluation`, `datasets`, `prompts`, `integrations`, `sdk`, `self-hosting`, `all`), `--dry-run`, `--refresh`, `--output-dir` (default `.px/docs`), `--workers` (default 10).

skills/phoenix-evals/SKILL.md

Lines changed: 71 additions & 0 deletions
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
metadata:
  author: oss@arize.com
  version: "1.0.0"
  languages: Python, TypeScript
---

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

## Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | `setup-python`, `setup-typescript` |
| Decide what to evaluate | `evaluators-overview` |
| Choose a judge model | `fundamentals-model-selection` |
| Use pre-built evaluators | `evaluators-pre-built` |
| Build code evaluator | `evaluators-code-{python\|typescript}` |
| Build LLM evaluator | `evaluators-llm-{python\|typescript}`, `evaluators-custom-templates` |
| Batch evaluate DataFrame | `evaluate-dataframe-python` |
| Run experiment | `experiments-running-{python\|typescript}` |
| Create dataset | `experiments-datasets-{python\|typescript}` |
| Generate synthetic data | `experiments-synthetic-{python\|typescript}` |
| Validate evaluator accuracy | `validation`, `validation-evaluators-{python\|typescript}` |
| Sample traces for review | `observe-sampling-{python\|typescript}` |
| Analyze errors | `error-analysis`, `error-analysis-multi-turn`, `axial-coding` |
| RAG evals | `evaluators-rag` |
| Avoid common mistakes | `common-mistakes-python`, `fundamentals-anti-patterns` |
| Production | `production-overview`, `production-guardrails`, `production-continuous` |

## Workflows

**Starting Fresh:**
`observe-tracing-setup` → `error-analysis` → `axial-coding` → `evaluators-overview`

**Building Evaluator:**
`fundamentals` → `common-mistakes-python` → `evaluators-{code\|llm}-{python\|typescript}` → `validation-evaluators-{python\|typescript}`

**RAG Systems:**
`evaluators-rag` → `evaluators-code-*` (retrieval) → `evaluators-llm-*` (faithfulness)

**Production:**
`production-overview` → `production-guardrails` → `production-continuous`

## Rule Categories

| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |

## Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
