
Commit a8e3a88

jimbobbennett and claude committed
Add 3 Phoenix AI observability skills
Add skills for Phoenix (Arize open-source) covering CLI debugging, LLM evaluation workflows, and OpenInference tracing/instrumentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 6020f64 commit a8e3a88

69 files changed

Lines changed: 6151 additions & 1 deletion


docs/README.skills.md

Lines changed: 4 additions & 1 deletion
@@ -208,6 +208,9 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [openapi-to-application-code](../skills/openapi-to-application-code/SKILL.md) | Generate a complete, production-ready application from an OpenAPI specification | None |
| [pdftk-server](../skills/pdftk-server/SKILL.md) | Skill for using the command-line tool pdftk (PDFtk Server) for working with PDF files. Use when asked to merge PDFs, split PDFs, rotate pages, encrypt or decrypt PDFs, fill PDF forms, apply watermarks, stamp overlays, extract metadata, burst documents into pages, repair corrupted PDFs, attach or extract files, or perform any PDF manipulation from the command line. | `references/download.md`<br />`references/pdftk-cli-examples.md`<br />`references/pdftk-man-page.md`<br />`references/pdftk-server-license.md`<br />`references/third-party-materials.md` |
| [penpot-uiux-design](../skills/penpot-uiux-design/SKILL.md) | Comprehensive guide for creating professional UI/UX designs in Penpot using MCP tools. Use this skill when: (1) Creating new UI/UX designs for web, mobile, or desktop applications, (2) Building design systems with components and tokens, (3) Designing dashboards, forms, navigation, or landing pages, (4) Applying accessibility standards and best practices, (5) Following platform guidelines (iOS, Android, Material Design), (6) Reviewing or improving existing Penpot designs for usability. Triggers: "design a UI", "create interface", "build layout", "design dashboard", "create form", "design landing page", "make it accessible", "design system", "component library". | `references/accessibility.md`<br />`references/component-patterns.md`<br />`references/platform-guidelines.md`<br />`references/setup-troubleshooting.md` |
| [phoenix-cli](../skills/phoenix-cli/SKILL.md) | Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, review experiments, inspect datasets, and query the GraphQL API. Use when debugging AI/LLM applications, analyzing trace data, working with Phoenix observability, or investigating LLM performance issues. | None |
| [phoenix-evals](../skills/phoenix-evals/SKILL.md) | Build and run evaluators for AI/LLM applications using Phoenix. | `references/axial-coding.md`<br />`references/common-mistakes-python.md`<br />`references/error-analysis-multi-turn.md`<br />`references/error-analysis.md`<br />`references/evaluate-dataframe-python.md`<br />`references/evaluators-code-python.md`<br />`references/evaluators-code-typescript.md`<br />`references/evaluators-custom-templates.md`<br />`references/evaluators-llm-python.md`<br />`references/evaluators-llm-typescript.md`<br />`references/evaluators-overview.md`<br />`references/evaluators-pre-built.md`<br />`references/evaluators-rag.md`<br />`references/experiments-datasets-python.md`<br />`references/experiments-datasets-typescript.md`<br />`references/experiments-overview.md`<br />`references/experiments-running-python.md`<br />`references/experiments-running-typescript.md`<br />`references/experiments-synthetic-python.md`<br />`references/experiments-synthetic-typescript.md`<br />`references/fundamentals-anti-patterns.md`<br />`references/fundamentals-model-selection.md`<br />`references/fundamentals.md`<br />`references/observe-sampling-python.md`<br />`references/observe-sampling-typescript.md`<br />`references/observe-tracing-setup.md`<br />`references/production-continuous.md`<br />`references/production-guardrails.md`<br />`references/production-overview.md`<br />`references/setup-python.md`<br />`references/setup-typescript.md`<br />`references/validation-evaluators-python.md`<br />`references/validation-evaluators-typescript.md`<br />`references/validation.md` |
| [phoenix-tracing](../skills/phoenix-tracing/SKILL.md) | OpenInference semantic conventions and instrumentation for Phoenix AI observability. Use when implementing LLM tracing, creating custom spans, or deploying to production. | `references/README.md`<br />`references/annotations-overview.md`<br />`references/annotations-python.md`<br />`references/annotations-typescript.md`<br />`references/fundamentals-flattening.md`<br />`references/fundamentals-overview.md`<br />`references/fundamentals-required-attributes.md`<br />`references/fundamentals-universal-attributes.md`<br />`references/instrumentation-auto-python.md`<br />`references/instrumentation-auto-typescript.md`<br />`references/instrumentation-manual-python.md`<br />`references/instrumentation-manual-typescript.md`<br />`references/metadata-python.md`<br />`references/metadata-typescript.md`<br />`references/production-python.md`<br />`references/production-typescript.md`<br />`references/projects-python.md`<br />`references/projects-typescript.md`<br />`references/sessions-python.md`<br />`references/sessions-typescript.md`<br />`references/setup-python.md`<br />`references/setup-typescript.md`<br />`references/span-agent.md`<br />`references/span-chain.md`<br />`references/span-embedding.md`<br />`references/span-evaluator.md`<br />`references/span-guardrail.md`<br />`references/span-llm.md`<br />`references/span-reranker.md`<br />`references/span-retriever.md`<br />`references/span-tool.md` |
| [php-mcp-server-generator](../skills/php-mcp-server-generator/SKILL.md) | Generate a complete PHP Model Context Protocol server project with tools, resources, prompts, and tests using the official PHP SDK | None |
| [planning-oracle-to-postgres-migration-integration-testing](../skills/planning-oracle-to-postgres-migration-integration-testing/SKILL.md) | Creates an integration testing plan for .NET data access artifacts during Oracle-to-PostgreSQL database migrations. Analyzes a single project to identify repositories, DAOs, and service layers that interact with the database, then produces a structured testing plan. Use when planning integration test coverage for a migrated project, identifying which data access methods need tests, or preparing for Oracle-to-PostgreSQL migration validation. | None |
| [plantuml-ascii](../skills/plantuml-ascii/SKILL.md) | Generate ASCII art diagrams using PlantUML text mode. Use when user asks to create ASCII diagrams, text-based diagrams, terminal-friendly diagrams, or mentions plantuml ascii, text diagram, ascii art diagram. Supports: Converting PlantUML diagrams to ASCII art, Creating sequence diagrams, class diagrams, flowcharts in ASCII format, Generating Unicode-enhanced ASCII art with -utxt flag | None |
@@ -288,7 +291,7 @@ See [CONTRIBUTING.md](../CONTRIBUTING.md#adding-skills) for guidelines on how to
| [webapp-testing](../skills/webapp-testing/SKILL.md) | Toolkit for interacting with and testing local web applications using Playwright. Supports verifying frontend functionality, debugging UI behavior, capturing browser screenshots, and viewing browser logs. | `assets/test-helper.js` |
| [what-context-needed](../skills/what-context-needed/SKILL.md) | Ask Copilot what files it needs to see before answering a question | None |
| [winapp-cli](../skills/winapp-cli/SKILL.md) | Windows App Development CLI (winapp) for building, packaging, and deploying Windows applications. Use when asked to initialize Windows app projects, create MSIX packages, generate AppxManifest.xml, manage development certificates, add package identity for debugging, sign packages, publish to the Microsoft Store, create external catalogs, or access Windows SDK build tools. Supports .NET (csproj), C++, Electron, Rust, Tauri, and cross-platform frameworks targeting Windows. | None |
| [winmd-api-search](../skills/winmd-api-search/SKILL.md) | Find and explore Windows desktop APIs. Use when building features that need platform capabilities — camera, file access, notifications, UI controls, AI/ML, sensors, networking, etc. Discovers the right API for a task and retrieves full type details (methods, properties, events, enumeration values). | `.DS_Store`<br />`LICENSE.txt`<br />`scripts/Invoke-WinMdQuery.ps1`<br />`scripts/Update-WinMdCache.ps1`<br />`scripts/cache-generator` |
| [winui3-migration-guide](../skills/winui3-migration-guide/SKILL.md) | UWP-to-WinUI 3 migration reference. Maps legacy UWP APIs to correct Windows App SDK equivalents with before/after code snippets. Covers namespace changes, threading (CoreDispatcher to DispatcherQueue), windowing (CoreWindow to AppWindow), dialogs, pickers, sharing, printing, background tasks, and the most common Copilot code generation mistakes. | None |
| [workiq-copilot](../skills/workiq-copilot/SKILL.md) | Guides the Copilot CLI on how to use the WorkIQ CLI/MCP server to query Microsoft 365 Copilot data (emails, meetings, docs, Teams, people) for live context, summaries, and recommendations. | None |
| [write-coding-standards-from-file](../skills/write-coding-standards-from-file/SKILL.md) | Write a coding standards document for a project using the coding styles from the file(s) and/or folder(s) passed as arguments in the prompt. | None |

skills/phoenix-cli/SKILL.md

Lines changed: 161 additions & 0 deletions
---
name: phoenix-cli
description: Debug LLM applications using the Phoenix CLI. Fetch traces, analyze errors, review experiments, inspect datasets, and query the GraphQL API. Use when debugging AI/LLM applications, analyzing trace data, working with Phoenix observability, or investigating LLM performance issues.
license: Apache-2.0
metadata:
  author: arize-ai
  version: "2.0.0"
---

# Phoenix CLI

## Invocation

```bash
px <resource> <action>                        # if installed globally
npx @arizeai/phoenix-cli <resource> <action>  # no install required
```

The CLI uses singular resource commands with subcommands like `list` and `get`:

```bash
px trace list
px trace get <trace-id>
px span list
px dataset list
px dataset get <name>
```

## Setup

```bash
export PHOENIX_HOST=http://localhost:6006
export PHOENIX_PROJECT=my-project
export PHOENIX_API_KEY=your-api-key  # if auth is enabled
```

Always use `--format raw --no-progress` when piping to `jq`.

## Traces

```bash
px trace list --limit 20 --format raw --no-progress | jq .
px trace list --last-n-minutes 60 --limit 20 --format raw --no-progress | jq '.[] | select(.status == "ERROR")'
px trace list --format raw --no-progress | jq 'sort_by(-.duration) | .[0:5]'
px trace get <trace-id> --format raw | jq .
px trace get <trace-id> --format raw | jq '.spans[] | select(.status_code != "OK")'
```

## Spans

```bash
px span list --limit 20                        # recent spans (table view)
px span list --last-n-minutes 60 --limit 50    # spans from last hour
px span list --span-kind LLM --limit 10        # only LLM spans
px span list --status-code ERROR --limit 20    # only errored spans
px span list --name chat_completion --limit 10 # filter by span name
px span list --trace-id <id> --format raw --no-progress | jq .  # all spans for a trace
px span list --include-annotations --limit 10  # include annotation scores
px span list output.json --limit 100           # save to JSON file
px span list --format raw --no-progress | jq '.[] | select(.status_code == "ERROR")'
```

### Span JSON shape

```
Span
  name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT"|"RERANKER"|"GUARDRAIL"|"EVALUATOR"|"UNKNOWN")
  status_code ("OK"|"ERROR"|"UNSET"), status_message
  context.span_id, context.trace_id, parent_id
  start_time, end_time
  attributes (same as the trace span attributes below)
  annotations[] (with --include-annotations)
    name, result { score, label, explanation }
```
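The flattened attribute keys make aggregation straightforward once span JSON is saved (for example via `px span list output.json`). A minimal Python sketch — the span data below is invented for illustration; only the field names follow the documented shape:

```python
# Illustrative spans shaped like `px span list --format raw` output
# (flattened attribute keys). The values are made-up sample data.
spans = [
    {
        "name": "chat_completion",
        "span_kind": "LLM",
        "status_code": "OK",
        "attributes": {
            "llm.model_name": "gpt-4o",
            "llm.token_count.prompt": 120,
            "llm.token_count.completion": 30,
            "llm.token_count.total": 150,
        },
    },
    {
        "name": "chat_completion",
        "span_kind": "LLM",
        "status_code": "ERROR",
        "attributes": {
            "llm.model_name": "gpt-4o",
            "llm.token_count.prompt": 80,
            "llm.token_count.completion": 0,
            "llm.token_count.total": 80,
        },
    },
]

def token_totals_by_model(spans):
    """Sum llm.token_count.total per model across LLM spans."""
    totals = {}
    for span in spans:
        if span.get("span_kind") != "LLM":
            continue
        attrs = span.get("attributes", {})
        model = attrs.get("llm.model_name", "unknown")
        totals[model] = totals.get(model, 0) + attrs.get("llm.token_count.total", 0)
    return totals

print(token_totals_by_model(spans))  # {'gpt-4o': 230}
```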
76+
### Trace JSON shape
77+
78+
```
79+
Trace
80+
traceId, status ("OK"|"ERROR"), duration (ms), startTime, endTime
81+
rootSpan — top-level span (parent_id: null)
82+
spans[]
83+
name, span_kind ("LLM"|"CHAIN"|"TOOL"|"RETRIEVER"|"EMBEDDING"|"AGENT")
84+
status_code ("OK"|"ERROR"), parent_id, context.span_id
85+
attributes
86+
input.value, output.value — raw input/output
87+
llm.model_name, llm.provider
88+
llm.token_count.prompt/completion/total
89+
llm.token_count.prompt_details.cache_read
90+
llm.token_count.completion_details.reasoning
91+
llm.input_messages.{N}.message.role/content
92+
llm.output_messages.{N}.message.role/content
93+
llm.invocation_parameters — JSON string (temperature, etc.)
94+
exception.message — set if span errored
95+
```
96+
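When `jq` isn't enough, a few lines of Python can walk a fetched trace — for example, finding the root span and pulling exception messages from errored spans. A sketch with made-up data; only the field names follow the documented trace shape:

```python
# Illustrative trace shaped like `px trace get --format raw` output.
trace = {
    "traceId": "abc123",
    "status": "ERROR",
    "duration": 1250,
    "spans": [
        {"name": "agent", "span_kind": "CHAIN", "status_code": "OK",
         "parent_id": None, "context": {"span_id": "s1"}, "attributes": {}},
        {"name": "chat_completion", "span_kind": "LLM", "status_code": "ERROR",
         "parent_id": "s1", "context": {"span_id": "s2"},
         "attributes": {"exception.message": "rate limit exceeded"}},
    ],
}

def root_span(trace):
    """The root span is the one with no parent."""
    return next(s for s in trace["spans"] if s["parent_id"] is None)

def error_spans(trace):
    """Errored spans, paired with their exception messages if present."""
    return [
        (s["name"], s["attributes"].get("exception.message"))
        for s in trace["spans"]
        if s["status_code"] == "ERROR"
    ]

print(root_span(trace)["name"])  # agent
print(error_spans(trace))        # [('chat_completion', 'rate limit exceeded')]
```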
97+
## Sessions
98+
99+
```bash
100+
px session list --limit 10 --format raw --no-progress | jq .
101+
px session list --order asc --format raw --no-progress | jq '.[].session_id'
102+
px session get <session-id> --format raw | jq .
103+
px session get <session-id> --include-annotations --format raw | jq '.annotations'
104+
```
105+
106+
### Session JSON shape
107+
108+
```
109+
SessionData
110+
id, session_id, project_id
111+
start_time, end_time
112+
traces[]
113+
id, trace_id, start_time, end_time
114+
115+
SessionAnnotation (with --include-annotations)
116+
id, name, annotator_kind ("LLM"|"CODE"|"HUMAN"), session_id
117+
result { label, score, explanation }
118+
metadata, identifier, source, created_at, updated_at
119+
```
120+
121+
## Datasets / Experiments / Prompts
122+
123+
```bash
124+
px dataset list --format raw --no-progress | jq '.[].name'
125+
px dataset get <name> --format raw | jq '.examples[] | {input, output: .expected_output}'
126+
px experiment list --dataset <name> --format raw --no-progress | jq '.[] | {id, name, failed_run_count}'
127+
px experiment get <id> --format raw --no-progress | jq '.[] | select(.error != null) | {input, error}'
128+
px prompt list --format raw --no-progress | jq '.[].name'
129+
px prompt get <name> --format text --no-progress # plain text, ideal for piping to AI
130+
```
131+
132+
## GraphQL
133+
134+
For ad-hoc queries not covered by the commands above. Output is `{"data": {...}}`.
135+
136+
```bash
137+
px api graphql '{ projectCount datasetCount promptCount evaluatorCount }'
138+
px api graphql '{ projects { edges { node { name traceCount tokenCountTotal } } } }' | jq '.data.projects.edges[].node'
139+
px api graphql '{ datasets { edges { node { name exampleCount experimentCount } } } }' | jq '.data.datasets.edges[].node'
140+
px api graphql '{ evaluators { edges { node { name kind } } } }' | jq '.data.evaluators.edges[].node'
141+
142+
# Introspect any type
143+
px api graphql '{ __type(name: "Project") { fields { name type { name } } } }' | jq '.data.__type.fields[]'
144+
```
145+
146+
Key root fields: `projects`, `datasets`, `prompts`, `evaluators`, `projectCount`, `datasetCount`, `promptCount`, `evaluatorCount`, `viewer`.
147+
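The list fields use the Relay-style connection shape (`edges` wrapping `node` objects), which is why the `jq` pipelines above all end in `.edges[].node`. The same flattening in Python, with an invented response for illustration:

```python
# Illustrative `px api graphql` response: output is wrapped in {"data": ...}
# and list fields use the Relay connection shape (edges/node).
response = {
    "data": {
        "projects": {
            "edges": [
                {"node": {"name": "my-project", "traceCount": 42}},
                {"node": {"name": "staging", "traceCount": 7}},
            ]
        }
    }
}

def nodes(response, field):
    """Flatten a Relay-style connection into a plain list of nodes."""
    return [edge["node"] for edge in response["data"][field]["edges"]]

print(nodes(response, "projects"))
# [{'name': 'my-project', 'traceCount': 42}, {'name': 'staging', 'traceCount': 7}]
```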
148+
## Docs
149+
150+
Download Phoenix documentation markdown for local use by coding agents.
151+
152+
```bash
153+
px docs fetch # fetch default workflow docs to .px/docs
154+
px docs fetch --workflow tracing # fetch only tracing docs
155+
px docs fetch --workflow tracing --workflow evaluation
156+
px docs fetch --dry-run # preview what would be downloaded
157+
px docs fetch --refresh # clear .px/docs and re-download
158+
px docs fetch --output-dir ./my-docs # custom output directory
159+
```
160+
161+
Key options: `--workflow` (repeatable, values: `tracing`, `evaluation`, `datasets`, `prompts`, `integrations`, `sdk`, `self-hosting`, `all`), `--dry-run`, `--refresh`, `--output-dir` (default `.px/docs`), `--workers` (default 10).

skills/phoenix-evals/SKILL.md

Lines changed: 71 additions & 0 deletions
---
name: phoenix-evals
description: Build and run evaluators for AI/LLM applications using Phoenix.
license: Apache-2.0
metadata:
  author: oss@arize.com
  version: "1.0.0"
  languages: Python, TypeScript
---

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.

## Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | `setup-python`, `setup-typescript` |
| Decide what to evaluate | `evaluators-overview` |
| Choose a judge model | `fundamentals-model-selection` |
| Use pre-built evaluators | `evaluators-pre-built` |
| Build code evaluator | `evaluators-code-{python\|typescript}` |
| Build LLM evaluator | `evaluators-llm-{python\|typescript}`, `evaluators-custom-templates` |
| Batch evaluate DataFrame | `evaluate-dataframe-python` |
| Run experiment | `experiments-running-{python\|typescript}` |
| Create dataset | `experiments-datasets-{python\|typescript}` |
| Generate synthetic data | `experiments-synthetic-{python\|typescript}` |
| Validate evaluator accuracy | `validation`, `validation-evaluators-{python\|typescript}` |
| Sample traces for review | `observe-sampling-{python\|typescript}` |
| Analyze errors | `error-analysis`, `error-analysis-multi-turn`, `axial-coding` |
| RAG evals | `evaluators-rag` |
| Avoid common mistakes | `common-mistakes-python`, `fundamentals-anti-patterns` |
| Production | `production-overview`, `production-guardrails`, `production-continuous` |

## Workflows

**Starting Fresh:**
`observe-tracing-setup` → `error-analysis` → `axial-coding` → `evaluators-overview`

**Building Evaluator:**
`fundamentals` → `common-mistakes-python` → `evaluators-{code\|llm}-{python\|typescript}` → `validation-evaluators-{python\|typescript}`

**RAG Systems:**
`evaluators-rag` → `evaluators-code-*` (retrieval) → `evaluators-llm-*` (faithfulness)

**Production:**
`production-overview` → `production-guardrails` → `production-continuous`

## Rule Categories

| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |

## Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > generic | Build from your failures |
| Code first | Deterministic before LLM |
| Validate judges | >80% TPR/TNR |
| Binary > Likert | Pass/fail, not 1-5 |
