feat(pxi): code-evaluator authoring by anticorrelator · Pull Request #13397 · Arize-ai/phoenix

anticorrelator · 2026-05-23T16:16:04Z

Summary

Adds PXI-assisted code-evaluator authoring through the existing Phoenix UI surfaces.

Dataset handoff mode. When the user is on a dataset surface and the code-evaluator form is not mounted, PXI does not create an evaluator directly. It either opens the form in place from a dataset-backed playground via open_experiment_evaluator_form, or links the user to the dataset Evaluators create surface. Persistence still happens only when the user clicks Save in the slideover.

Form co-pilot mode. When a Create/Edit Code Evaluator dialog is open, PXI can:

- read_code_evaluator_draft: read the open draft, including sourceCode, sandboxConfigId, inputMapping, outputConfigs, testPayload, and a revision token.
- edit_code_evaluator_draft: propose an accept/reject diff against the open form. Accept applies the draft locally; reject leaves the form untouched. Revision checks guard both propose-time and accept-time staleness.
- test_code_evaluator_draft: run the current accepted draft through the preview path before the user saves.

The tools and context prompts now share the same availability gates: viewers get read-only behavior, create-mode edits require a usable sandbox, edit-mode draft edits can still be proposed without a sandbox, and draft tests require a usable sandbox.
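The gating rules above can be sketched as a small decision function. This is a minimal illustration, not the PR's implementation: `ToolAvailability`, `tool_availability`, and the parameter names are hypothetical, and "read-only" is assumed to mean that viewers can read the draft but can neither edit nor test it.

```python
from dataclasses import dataclass
from typing import Literal

FormMode = Literal["create", "edit"]  # hypothetical names for the two dialog modes


@dataclass(frozen=True)
class ToolAvailability:
    """Which draft tools are offered for the current turn (hypothetical shape)."""

    can_read: bool
    can_edit: bool
    can_test: bool


def tool_availability(
    *, is_viewer: bool, form_mode: FormMode, has_usable_sandbox: bool
) -> ToolAvailability:
    # Viewers get read-only behavior (assumed: read yes, edit/test no).
    if is_viewer:
        return ToolAvailability(can_read=True, can_edit=False, can_test=False)
    # Create-mode edits require a usable sandbox; edit-mode edits do not.
    can_edit = form_mode == "edit" or has_usable_sandbox
    # Draft tests always require a usable sandbox.
    return ToolAvailability(
        can_read=True, can_edit=can_edit, can_test=has_usable_sandbox
    )
```

Keeping the rules in one function is one way to let the tools and the context prompts consult the same answer, which is the property the paragraph above describes.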

Architecture

Dataset / playground context
        |
        |-- dataset-backed playground + sandbox + editor access
        |       -> open_experiment_evaluator_form mounts the create form
        |
        |-- dataset surface without mounted form
        |       -> PXI links to /datasets/<id>/evaluators?createCodeEvaluator=true
        |
        v
Mounted Create/Edit Code Evaluator form
        |
        |-- read_code_evaluator_draft captures current draft + revision
        |-- edit_code_evaluator_draft proposes user-approved local form edits
        |-- test_code_evaluator_draft previews accepted draft when sandbox is available
        |
        v
User clicks Save -> existing Relay mutations persist CodeEvaluator and dataset binding

Screen.Recording.2026-05-28.at.12.41.16.PM.mov

mintlify · 2026-05-23T16:16:18Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
arize-phoenix	🟢 Ready	View Preview	May 23, 2026, 4:17 PM

pkg-pr-new · 2026-05-23T16:38:42Z

@arizeai/phoenix-cli

npm i https://pkg.pr.new/@arizeai/phoenix-cli@13397

@arizeai/phoenix-client

npm i https://pkg.pr.new/@arizeai/phoenix-client@13397

@arizeai/phoenix-evals

npm i https://pkg.pr.new/@arizeai/phoenix-evals@13397

@arizeai/phoenix-mcp

npm i https://pkg.pr.new/@arizeai/phoenix-mcp@13397

@arizeai/phoenix-otel

npm i https://pkg.pr.new/@arizeai/phoenix-otel@13397

commit: 39fe5c5

Post-churn cleanup of the PXI code-evaluator authoring surfaces.

- Dataset context now defers code-evaluator draft guidance to the always-rendered <phoenix_code_evaluator_context> instead of carrying a duplicate, separately-gated copy; drops can_open/can_edit/can_test flags from dataset.py and renames is_code_evaluator_surface -> is_code_evaluator_form_mounted. Fixes the two failing dataset-context handoff tests that the prior template drift had left red.
- Extracts the inline open_experiment_evaluator_form and test_code_evaluator_draft tool instructions into .xml.j2 templates wired through AgentPrompts, matching the read/edit tool convention.
- Types _validate_code_evaluator_sandbox_config's decrypt as Callable[[bytes], bytes] (was Any); drops now-unused Any import.
- Removes dead createCodeEvaluatorActionContextSchema alias, trims the testPayload alias shim to ["test_payload"], makes toOutputConfigDraft module-private, and drops the unused searchParams barrel re-export.
- Unifies the playground experiment-evaluator-form prop vocabulary so the lifted-open state keeps one name from Playground to PlaygroundEvaluatorSelect.

- Regenerate schemas/openapi.json and the app + phoenix-client TS types so the DatasetContext schema description matches the trimmed Pydantic docstring (the docstring is surfaced as the OpenAPI description).
- Apply ruff format to edit_code_evaluator_draft.py (collapse a two-line boolean that fits on one line).

Fixes the Build OpenAPI Schema and Format and Lint CI checks.

The create/edit code-evaluator dialog now renders the output config's name in a disabled "Name" TextField, so the existing `getByRole("textbox", { name: /^Name.../ })` selector matched two elements (strict-mode violation). Scope the selector to the editable field with `disabled: false` so it targets the evaluator name input. Fixes the e2e-test (Playwright) Code Evaluators specs.

Catch the multi-line getByRole selector in the edit test that the prior single-line fix missed, so all four occurrences exclude the disabled output-config 'Name' field via disabled: false.

CreateCodeEvaluator / PatchCodeEvaluator no longer require the sandbox backend to be fully AVAILABLE (installed runtime deps + downloaded binary) at authoring time — that is an execution-time concern. The mutation still validates that the referenced sandbox config exists and is enabled, its provider is enabled, and the language matches. Reverts the over-strict create-time availability gate added in 6e15f60 ("harden code evaluator authoring flow"), which broke the TestSandboxAndCodeEvaluatorPermissions integration test: the integration server lacks the WASM runtime (wasmtime is a dev-only dependency), so the gate rejected evaluator creation there.
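The split between authoring-time validation and execution-time availability can be sketched as follows. This is an illustrative sketch only: `SandboxConfig`, `authoring_time_error`, and all field names are hypothetical, and every error string except "Sandbox config not found" (quoted elsewhere in this PR) is made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical, flattened view of a sandbox config row."""

    enabled: bool
    provider_enabled: bool
    language: str
    backend_available: bool  # runtime deps installed + binary downloaded


def authoring_time_error(
    config: Optional[SandboxConfig], evaluator_language: str
) -> Optional[str]:
    """Return an error message, or None if the evaluator may be created or patched."""
    if config is None:
        return "Sandbox config not found"
    if not config.enabled:
        return "Sandbox config is disabled"  # hypothetical wording
    if not config.provider_enabled:
        return "Sandbox provider is disabled"  # hypothetical wording
    if config.language != evaluator_language:
        return "Sandbox language does not match evaluator"  # hypothetical wording
    # backend_available is deliberately not consulted here:
    # whether the backend can run code is an execution-time concern.
    return None
```

Under this sketch, a server without the WASM runtime (such as the integration server described above) would still accept evaluator creation, because `backend_available` plays no part in the check.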

DeleteSandboxConfig was placed at the end of Tier 1, before the Tier 2 evaluator writes that reference the same setup config. For admin roles the delete succeeds, so the subsequent CreateCodeEvaluator failed the config-exists validation ("Sandbox config not found"). Move the delete to the actual end of the test — after every evaluator write and the Tier 3 reads — honoring the existing "delete runs last" intent. This surfaced only after relaxing the create-time backend-availability gate (which previously failed the test earlier, for all roles).

…luator_form The tool opens the dataset-backed create-code-evaluator form (CreateCodeDatasetEvaluatorSlideover) so the read/edit/test draft tools become available; it does not persist anything. "Experiment evaluator" is not part of the domain model — these are dataset (code) evaluators — so the name was misleading and inconsistent with the sibling *_code_evaluator_draft tools. Renames across the server↔client wire contract and all related symbols: the tool-name string, capability/class, prompt template + AgentPrompts field, frontend constant/type/parser/agent-tool, and the experimentEvaluatorForm* playground state/props (now codeEvaluatorForm*). No behavior change.

…nism Remove the optimistic-concurrency revision guard from the code-evaluator draft co-pilot tools. It is premature hardening for an unsaved, single-user draft: accept re-applies the agent's scoped set-operations to the current form state, so untouched fields are never clobbered, and the explicit accept gesture is the user's consent. Drops buildDraftRevision, the revision/expectedRevision fields and tool-schema params, the propose-time and accept-time checks, and the test-section staleness check (replaced by an isDraftMounted mount check). Simplifies the read/edit/test tool descriptions and prompt templates. prompt_instance is intentionally left unchanged.
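The argument that scoped set-operations never clobber untouched fields can be illustrated with a short sketch. The real logic lives in the frontend; this Python version is only a model of the idea, and `apply_set_operations` plus the sample field values are hypothetical.

```python
from typing import Any, Dict, Mapping


def apply_set_operations(
    current_form: Mapping[str, Any], operations: Mapping[str, Any]
) -> Dict[str, Any]:
    """Apply only the fields the agent proposed to the form as it is *now*.

    Fields that are absent from `operations` are copied through unchanged,
    so edits the user made after the proposal survive the accept.
    """
    updated = dict(current_form)
    updated.update(operations)
    return updated


# Form state when the agent proposed its edit.
form = {"sourceCode": "def evaluate(): ...", "inputMapping": {"output": "old"}}
proposal = {"sourceCode": "def evaluate(output): return len(output) > 0"}

# The user changes a different field before clicking Accept.
form["inputMapping"] = {"output": "new"}

accepted = apply_set_operations(form, proposal)
```

Here `accepted["inputMapping"]` keeps the user's newer value while `accepted["sourceCode"]` takes the proposal, which is why a revision token adds little for a single-user, unsaved draft.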

Close the three open consolidation seams of the code-evaluator PXI tools, all behavior-preserving:

- Extract a shared DiffAcceptRejectToolDetails<T> generic owning the diff render, accept/reject footer, and a single CSS namespace; reduce EditPromptToolDetails and EditCodeEvaluatorDraftToolDetails to thin wrappers that inject only snapshotToText/fileName/renderHeader/labels (collapses two ~95%-identical renderers; ToolPart.tsx switch untouched).
- Derive the agent-tool Zod enums at runtime from canonical @phoenix/types as-const tuples (CODE_EVALUATOR_LANGUAGES, EVALUATOR_OPTIMIZATION_DIRECTIONS) via z.enum(TUPLE), so the union type and validator share one source. Backfill tool-registry docstrings; drop two no-op dispatch tests.
- Render every active AgentContextPills context as a labeled pill; delete the isAuxiliaryFormContext display-hide (no hide, no cap). code_evaluator stays a peer context on the wire (/chat payload unchanged).

tsc --noEmit + 48 touched-spec tests green.

Close the structural/maintainability half of the PR #13397 review by bringing each flagged thread back onto its canonical seam.

Server (#2): route the sandbox inventory through the phoenix-gql seam. _load_sandbox_availability collapses from a select(models.SandboxConfig) row-load to a select(exists(...)) gate query; has_usable is stored directly on SandboxAvailability (SandboxConfigCapabilities/configs dropped). Both code-evaluator prompts drop <available_sandbox_configs> and invert the "do NOT issue a sandboxProviders query" instruction into an on-demand phoenix-gql fetch+filter (envVars{name}, never secretKey). Pre-turn capability gating is unchanged.

Frontend (#5/#9): derive the OutputConfigDraft element types from the canonical @phoenix/types AnnotationConfig + explicit `kind`; couple the Zod schemas with a `satisfies` guard; centralize the form->draft conversion behind one documented function (no compile-time exhaustiveness claim: canonical union is undiscriminated).

Frontend (#7): unify the three empty-input idioms onto one shared parseEmptyToolInput; delete the bespoke open-form parser.

Frontend (E): add minimal orienting comments naming the playgroundPrompt template + lifecycle; re-home open_code_evaluator_form's name-constant and Input type into the codeEvaluatorDraft module (registry owns only registration). No renames.

Frontend (#11/#13): remove the global vitest.setup.ts storage mock and delete AgentContextPills.test.tsx. The branch's jsdom bump ships a non-functional native localStorage, so the mock is replaced with an opt-in installTestStorage util imported by the store-mounting tests rather than reverted (which would regress on-main tests).

Lean out tests flagged as no-op or over-pinned in the PR #13397 review:

- codeEvaluatorDraft.test.ts: drop the sandbox-null create-mode cases and the model-alias edit/test-payload pending-edit cases
- delete EvaluatorNameInput.test.tsx
- test_capabilities.py: loosen exact-wording prompt assertions to behavior-level checks
- test_agents.py: drop provider-disabled / empty-inventory loader cases, the AgentDependencies override case, and the EditCodeEvaluatorDraft viewer-gate suite

…t collection The agent draft test-run tool lived in a production module named test_code_evaluator_draft.py with a Test*-prefixed capability class — both match pytest's default collection globs (test_*.py, python_classes=Test*), a latent footgun and a misleading inflater of the test-file surface. Rename the module to run_code_evaluator_draft.py and the class to RunCodeEvaluatorDraftCapability, matching the existing run_playground sibling. The model-facing tool name (test_code_evaluator_draft) and prompt accessor are unchanged.
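The collection footgun described above can be checked with a small sketch that mimics pytest's default name matching (`python_files` defaults to `test_*.py` and `*_test.py`, `python_classes` to the `Test` prefix). The helper names are hypothetical, and `TestCodeEvaluatorDraftCapability` is an assumed stand-in for the old class name, which the commit message does not spell out.

```python
from fnmatch import fnmatch

# pytest's default discovery patterns.
DEFAULT_FILE_GLOBS = ("test_*.py", "*_test.py")
DEFAULT_CLASS_PREFIX = "Test"


def looks_like_test_module(filename: str) -> bool:
    """True if pytest's default python_files globs would collect this file."""
    return any(fnmatch(filename, glob) for glob in DEFAULT_FILE_GLOBS)


def looks_like_test_class(class_name: str) -> bool:
    """True if pytest's default python_classes prefix would match this class."""
    return class_name.startswith(DEFAULT_CLASS_PREFIX)


old_module = "test_code_evaluator_draft.py"
new_module = "run_code_evaluator_draft.py"
old_class = "TestCodeEvaluatorDraftCapability"  # hypothetical old name
new_class = "RunCodeEvaluatorDraftCapability"
```

With these definitions the old module and class names both match the defaults and the renamed ones do not, which is the whole point of the rename.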

Trim storylike 'how it works' docstrings and temporal phrasing (no longer / now lives) from the code-evaluator authoring flow. Keep comments that disambiguate a contract the name or signature does not — per-field prop docs, the cross-boundary 'Python schema is the model-facing source of truth' note, and the silent-fall-through footgun in the output-config converter. No behavior change.

The rebase onto main (which added PXI auto-mode: permissions.edits manual|bypass) left two integration gaps:

- agents/types.py dropped the `Literal` import while keeping main's `edit_permission: Literal["manual", "bypass"]` field, breaking import.
- EditPromptToolDetails merged main's auto-accept label logic with our refactor to the shared DiffAcceptRejectToolDetails; reconcile the file (import PromptSnapshot, drop the dead summary path) while preserving main's "Auto-approved" state label.

Wire the edit_code_evaluator_draft confirmation dialog into the new PXI auto-mode (permissions.edits === "bypass"), mirroring edit_prompt and batch_span_annotate:

- accept() now takes { approvalSource } and stamps acceptedBy on output
- createEditCodeEvaluatorDraftClientAction takes shouldAutoAccept; when true it applies the edit immediately and skips setPendingCodeEvaluatorEdit so no accept/reject dialog is surfaced
- EditCodeEvaluatorDialogContent passes shouldAutoAccept reading agentStore.permissions.edits === "bypass"
- tool-details renders "Auto-approved" vs "Accepted"
- add a unit test asserting auto-mode never surfaces the confirmation
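The auto-mode branch can be modeled in a few lines. The actual code is TypeScript in the frontend; this Python sketch only mirrors the control flow, and `DraftForm`, `propose_edit`, `accept_pending`, and the `"auto"`/`"user"` values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Literal, Optional

EditPermission = Literal["manual", "bypass"]


@dataclass
class DraftForm:
    values: Dict[str, Any] = field(default_factory=dict)
    pending_edit: Optional[Dict[str, Any]] = None  # set => dialog is shown


def propose_edit(
    form: DraftForm, edit: Dict[str, Any], permission: EditPermission
) -> Dict[str, str]:
    should_auto_accept = permission == "bypass"
    if should_auto_accept:
        # Apply immediately; never set pending_edit, so no dialog appears.
        form.values.update(edit)
        return {"status": "accepted", "acceptedBy": "auto"}
    # Manual mode: park the edit until the user accepts or rejects it.
    form.pending_edit = edit
    return {"status": "pending"}


def accept_pending(form: DraftForm) -> Dict[str, str]:
    """User clicked Accept in the confirmation dialog."""
    if form.pending_edit is None:
        raise ValueError("no pending edit to accept")
    form.values.update(form.pending_edit)
    form.pending_edit = None
    return {"status": "accepted", "acceptedBy": "user"}
```

The `acceptedBy` stamp is what would let the tool-details view choose between an "Auto-approved" and an "Accepted" label.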

The anthropic SDK 0.105 added 'system' to MessageParam['role'] (Literal['user','assistant'] -> Literal['user','assistant','system']), which broke the exhaustiveness check in _RoleConversion.from_anthropic and failed the third-party SDK canary (pyright assert_never). Map an anthropic system message to a phoenix 'system' PromptMessage role and widen the return type, mirroring the openai helper's from_openai.
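The exhaustiveness pattern behind this fix can be sketched as below. This is not the Phoenix source: `from_anthropic` here is a standalone function, the Phoenix-side role values are assumptions, and `_assert_never` is a local stand-in for `typing.assert_never` so the sketch also runs on Python versions older than 3.11.

```python
from typing import Literal, NoReturn

AnthropicRole = Literal["user", "assistant", "system"]  # widened in SDK 0.105
PhoenixRole = Literal["user", "ai", "system"]  # assumed Phoenix-side values


def _assert_never(value: NoReturn) -> NoReturn:
    # A type checker reports an error at the call site if any member of the
    # Literal is still possible there, i.e. a branch was forgotten.
    raise AssertionError(f"Unhandled role: {value!r}")


def from_anthropic(role: AnthropicRole) -> PhoenixRole:
    if role == "user":
        return "user"
    if role == "assistant":
        return "ai"
    if role == "system":
        # The branch the SDK upgrade made necessary.
        return "system"
    _assert_never(role)
```

Removing the `"system"` branch would leave `role` narrowed to `Literal["system"]` at the final call, which is the kind of error the pyright canary reported after the SDK widened the type.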

anticorrelator requested review from a team as code owners May 23, 2026 16:16

github-project-automation Bot added this to phoenix May 23, 2026

github-project-automation Bot moved this to 📘 Todo in phoenix May 23, 2026

dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label May 23, 2026

anticorrelator marked this pull request as draft May 26, 2026 19:14

anticorrelator force-pushed the dustin/code-evaluator-tools branch from 0c4f19a to 3586803 Compare May 26, 2026 23:30

anticorrelator force-pushed the dustin/code-evaluator-tools branch from 0690262 to 99afb6f Compare May 27, 2026 15:23

anticorrelator and others added 27 commits June 1, 2026 18:36

fix(pxi): hide evaluator tab context in draft handoff

c853b36

refactor(pxi): remove evaluator surface context

5ec33ae

feat: add PXI evaluator form workflow

cf08046

test: trim PXI evaluator smoke coverage

e09bf13

fix(pxi): harden code evaluator authoring flow

eb0a909

fix: stabilize code evaluator draft flow

086663e

test(pxi): disambiguate name selector in code-evaluator edit spec

0984134

feat: fix icon

c97c383

mikyo cleanup

a289a55

refactor(pxi): trim code-evaluator enum comment to cross-boundary note

d6467e5

refactor(pxi): drop code-evaluator lineage note from index header

ffc030d

style(pxi): apply oxfmt formatting to satisfy CI

86df8a3

This was referenced Jun 1, 2026

chore(main): release arize-phoenix 16.6.0 #13563

Merged

chore(main): release arize-phoenix-client 2.8.0 #13348

Open

anticorrelator mentioned this pull request Jun 2, 2026

feat(pxi): LLM-evaluator authoring for the PXI agent #13579

Merged

feat(pxi): code-evaluator authoring #13397
anticorrelator merged 57 commits into main from dustin/code-evaluator-tools

4 participants
