feat(pxi): code-evaluator authoring#13397

Merged: anticorrelator merged 57 commits into main from dustin/code-evaluator-tools on Jun 1, 2026

@anticorrelator anticorrelator commented May 23, 2026

Summary

Adds PXI-assisted code-evaluator authoring through the existing Phoenix UI surfaces.

Dataset handoff mode. When the user is on a dataset surface and the code-evaluator form is not mounted, PXI does not create an evaluator directly. It either opens the form in place from a dataset-backed playground via open_experiment_evaluator_form, or links the user to the dataset Evaluators create surface. Persistence still happens only when the user clicks Save in the slideover.

Form co-pilot mode. When a Create/Edit Code Evaluator dialog is open, PXI can:

  • read_code_evaluator_draft — read the open draft, including sourceCode, sandboxConfigId, inputMapping, outputConfigs, testPayload, and a revision token.
  • edit_code_evaluator_draft — propose an accept/reject diff against the open form. Accept applies the draft locally; reject leaves the form untouched. Revision checks guard both propose-time and accept-time staleness.
  • test_code_evaluator_draft — run the current accepted draft through the preview path before the user saves.

The tools and context prompts now share the same availability gates: viewers get read-only behavior, create-mode edits require a usable sandbox, edit-mode draft edits can still be proposed without a sandbox, and draft tests require a usable sandbox.
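
The gates in the paragraph above can be summarized as a small decision function. This is a minimal sketch, not the PR's implementation: `DraftToolAvailability`, `draft_tool_availability`, and the parameter names are hypothetical, and only the four rules themselves come from the summary.

```python
from dataclasses import dataclass
from typing import Literal

Mode = Literal["create", "edit"]


@dataclass(frozen=True)
class DraftToolAvailability:
    """Hypothetical container for which draft tools are offered."""

    can_read: bool
    can_edit: bool
    can_test: bool


def draft_tool_availability(
    *, is_viewer: bool, mode: Mode, has_usable_sandbox: bool
) -> DraftToolAvailability:
    """Apply the gates from the summary to one mounted form."""
    if is_viewer:
        # Viewers get read-only behavior regardless of sandbox state.
        return DraftToolAvailability(can_read=True, can_edit=False, can_test=False)
    # Create-mode edits need a usable sandbox; edit-mode edits do not.
    can_edit = has_usable_sandbox if mode == "create" else True
    # Draft tests always need a usable sandbox.
    return DraftToolAvailability(
        can_read=True, can_edit=can_edit, can_test=has_usable_sandbox
    )
```

Keeping the rules in one function is one way for prompts and tools to read the same answer, which appears to be the intent of sharing the gates.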

Architecture

Dataset / playground context
        |
        |-- dataset-backed playground + sandbox + editor access
        |       -> open_experiment_evaluator_form mounts the create form
        |
        |-- dataset surface without mounted form
        |       -> PXI links to /datasets/<id>/evaluators?createCodeEvaluator=true
        |
        v
Mounted Create/Edit Code Evaluator form
        |
        |-- read_code_evaluator_draft captures current draft + revision
        |-- edit_code_evaluator_draft proposes user-approved local form edits
        |-- test_code_evaluator_draft previews accepted draft when sandbox is available
        |
        v
User clicks Save -> existing Relay mutations persist CodeEvaluator and dataset binding
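
The routing in the diagram above could be sketched roughly as follows. The `Surface` dataclass, its field names, and `handoff_action` are hypothetical; the tool name and the URL shape are taken from the diagram, and neither branch persists anything.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass(frozen=True)
class Surface:
    """Hypothetical snapshot of what the user currently has on screen."""

    dataset_id: str
    is_dataset_backed_playground: bool
    has_usable_sandbox: bool
    has_editor_access: bool
    is_form_mounted: bool


def handoff_action(surface: Surface) -> tuple[str, str]:
    """Return (kind, value) describing how to reach the code-evaluator form."""
    if surface.is_form_mounted:
        # The form is already open, so the read/edit/test draft tools apply.
        return ("draft_tools", "")
    if (
        surface.is_dataset_backed_playground
        and surface.has_usable_sandbox
        and surface.has_editor_access
    ):
        # Mount the create form in place from the playground.
        return ("tool", "open_experiment_evaluator_form")
    # Otherwise link the user to the dataset Evaluators create surface.
    query = urlencode({"createCodeEvaluator": "true"})
    return ("link", f"/datasets/{surface.dataset_id}/evaluators?{query}")
```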
Screen.Recording.2026-05-28.at.12.41.16.PM.mov

@anticorrelator anticorrelator requested review from a team as code owners May 23, 2026 16:16
@github-project-automation github-project-automation Bot moved this to 📘 Todo in phoenix May 23, 2026
@dosubot dosubot Bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label May 23, 2026

mintlify Bot commented May 23, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
arize-phoenix 🟢 Ready View Preview May 23, 2026, 4:17 PM



pkg-pr-new Bot commented May 23, 2026


@arizeai/phoenix-cli

npm i https://pkg.pr.new/@arizeai/phoenix-cli@13397

@arizeai/phoenix-client

npm i https://pkg.pr.new/@arizeai/phoenix-client@13397

@arizeai/phoenix-evals

npm i https://pkg.pr.new/@arizeai/phoenix-evals@13397

@arizeai/phoenix-mcp

npm i https://pkg.pr.new/@arizeai/phoenix-mcp@13397

@arizeai/phoenix-otel

npm i https://pkg.pr.new/@arizeai/phoenix-otel@13397

commit: 39fe5c5

@anticorrelator anticorrelator marked this pull request as draft May 26, 2026 19:14
@anticorrelator anticorrelator force-pushed the dustin/code-evaluator-tools branch from 0c4f19a to 3586803 Compare May 26, 2026 23:30
@anticorrelator anticorrelator force-pushed the dustin/code-evaluator-tools branch from 0690262 to 99afb6f Compare May 27, 2026 15:23
anticorrelator and others added 27 commits June 1, 2026 18:36
Post-churn cleanup of the PXI code-evaluator authoring surfaces.

- Dataset context now defers code-evaluator draft guidance to the
  always-rendered <phoenix_code_evaluator_context> instead of carrying a
  duplicate, separately-gated copy; drops can_open/can_edit/can_test flags
  from dataset.py and renames is_code_evaluator_surface ->
  is_code_evaluator_form_mounted. Fixes the two failing dataset-context
  handoff tests that the prior template drift had left red.
- Extracts the inline open_experiment_evaluator_form and
  test_code_evaluator_draft tool instructions into .xml.j2 templates wired
  through AgentPrompts, matching the read/edit tool convention.
- Types _validate_code_evaluator_sandbox_config's decrypt as
  Callable[[bytes], bytes] (was Any); drops now-unused Any import.
- Removes dead createCodeEvaluatorActionContextSchema alias, trims the
  testPayload alias shim to ["test_payload"], makes toOutputConfigDraft
  module-private, and drops the unused searchParams barrel re-export.
- Unifies the playground experiment-evaluator-form prop vocabulary so the
  lifted-open state keeps one name from Playground to PlaygroundEvaluatorSelect.
- Regenerate schemas/openapi.json and the app + phoenix-client TS types so
  the DatasetContext schema description matches the trimmed Pydantic
  docstring (the docstring is surfaced as the OpenAPI description).
- Apply ruff format to edit_code_evaluator_draft.py (collapse a two-line
  boolean that fits on one line).

Fixes the Build OpenAPI Schema and Format and Lint CI checks.

The create/edit code-evaluator dialog now renders the output config's name
in a disabled "Name" TextField, so the existing
`getByRole("textbox", { name: /^Name.../ })` selector matched two elements
(strict-mode violation). Scope the selector to the editable field with
`disabled: false` so it targets the evaluator name input.

Fixes the e2e-test (Playwright) Code Evaluators specs.

Catch the multi-line getByRole selector in the edit test that the prior
single-line fix missed, so all four occurrences exclude the disabled
output-config 'Name' field via disabled: false.

CreateCodeEvaluator / PatchCodeEvaluator no longer require the sandbox
backend to be fully AVAILABLE (installed runtime deps + downloaded binary)
at authoring time — that is an execution-time concern. The mutation still
validates that the referenced sandbox config exists and is enabled, its
provider is enabled, and the language matches.

Reverts the over-strict create-time availability gate added in 6e15f60
("harden code evaluator authoring flow"), which broke the
TestSandboxAndCodeEvaluatorPermissions integration test: the integration
server lacks the WASM runtime (wasmtime is a dev-only dependency), so the
gate rejected evaluator creation there.
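
The relaxed authoring-time validation described in this commit might look roughly like the sketch below. The `SandboxConfig` dataclass, its fields, and `validate_sandbox_config` are hypothetical stand-ins for the real models; only the "Sandbox config not found" message appears in this PR, and the other messages are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical view of a sandbox config as the mutation sees it."""

    enabled: bool
    provider_enabled: bool
    language: str
    backend_available: bool  # deliberately not consulted at authoring time


def validate_sandbox_config(
    config: Optional[SandboxConfig], evaluator_language: str
) -> None:
    """Raise ValueError if the config cannot be referenced by an evaluator."""
    if config is None:
        raise ValueError("Sandbox config not found")
    if not config.enabled:
        raise ValueError("Sandbox config is disabled")
    if not config.provider_enabled:
        raise ValueError("Sandbox provider is disabled")
    if config.language != evaluator_language:
        raise ValueError("Sandbox config language does not match the evaluator")
    # No backend_available check: runtime readiness is an execution-time concern.
```

Under this sketch, a server without the WASM runtime can still create an evaluator as long as the config it references is valid.
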
DeleteSandboxConfig was placed at the end of Tier 1, before the Tier 2
evaluator writes that reference the same setup config. For admin roles the
delete succeeds, so the subsequent CreateCodeEvaluator failed the
config-exists validation ("Sandbox config not found"). Move the delete to
the actual end of the test — after every evaluator write and the Tier 3
reads — honoring the existing "delete runs last" intent.

This surfaced only after relaxing the create-time backend-availability gate
(which previously failed the test earlier, for all roles).

…luator_form

The tool opens the dataset-backed create-code-evaluator form
(CreateCodeDatasetEvaluatorSlideover) so the read/edit/test draft tools
become available; it does not persist anything. "Experiment evaluator"
is not part of the domain model — these are dataset (code) evaluators —
so the name was misleading and inconsistent with the sibling
*_code_evaluator_draft tools.

Renames across the server↔client wire contract and all related symbols:
the tool-name string, capability/class, prompt template + AgentPrompts
field, frontend constant/type/parser/agent-tool, and the
experimentEvaluatorForm* playground state/props (now codeEvaluatorForm*).
No behavior change.

…nism

Remove the optimistic-concurrency revision guard from the code-evaluator draft co-pilot tools. It is premature hardening for an unsaved, single-user draft: accept re-applies the agent's scoped set-operations to the current form state, so untouched fields are never clobbered, and the explicit accept gesture is the user's consent.

Drops buildDraftRevision, the revision/expectedRevision fields and tool-schema params, the propose-time and accept-time checks, and the test-section staleness check (replaced by an isDraftMounted mount check). Simplifies the read/edit/test tool descriptions and prompt templates. prompt_instance is intentionally left unchanged.
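
The argument for dropping the revision guard rests on accept re-applying scoped set-operations to the current form state. A minimal sketch of that idea follows, assuming a flat form keyed by field name; the real form is a frontend store, and `apply_set_operations` and the sample values are hypothetical.

```python
from typing import Any, Mapping


def apply_set_operations(
    current: Mapping[str, Any], operations: Mapping[str, Any]
) -> dict[str, Any]:
    """Apply only the fields the agent proposed onto the current form state."""
    updated = dict(current)
    updated.update(operations)
    return updated


# Form state when the agent proposed its edit.
form_at_propose = {"name": "exact_match", "sourceCode": "return 0", "inputMapping": {}}
# The agent's proposal touches a single field.
proposed = {"sourceCode": "return int(output == expected)"}
# The user renames the evaluator while the proposal is still pending.
form_at_accept = {**form_at_propose, "name": "strict_exact_match"}

accepted = apply_set_operations(form_at_accept, proposed)
```

Because the operations are applied to the state at accept time rather than replacing it with a snapshot from propose time, the user's rename survives and only `sourceCode` changes.
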
Close the three open consolidation seams of the code-evaluator PXI tools,
all behavior-preserving:

- Extract a shared DiffAcceptRejectToolDetails<T> generic owning the diff
  render, accept/reject footer, and a single CSS namespace; reduce
  EditPromptToolDetails and EditCodeEvaluatorDraftToolDetails to thin
  wrappers that inject only snapshotToText/fileName/renderHeader/labels
  (collapses two ~95%-identical renderers; ToolPart.tsx switch untouched).
- Derive the agent-tool Zod enums at runtime from canonical @phoenix/types
  as-const tuples (CODE_EVALUATOR_LANGUAGES, EVALUATOR_OPTIMIZATION_DIRECTIONS)
  via z.enum(TUPLE), so the union type and validator share one source.
  Backfill tool-registry docstrings; drop two no-op dispatch tests.
- Render every active AgentContextPills context as a labeled pill; delete
  the isAuxiliaryFormContext display-hide (no hide, no cap). code_evaluator
  stays a peer context on the wire (/chat payload unchanged).

tsc --noEmit + 48 touched-spec tests green.

Close the structural/maintainability half of the PR #13397 review by
bringing each flagged thread back onto its canonical seam.

Server (#2): route the sandbox inventory through the phoenix-gql seam.
_load_sandbox_availability collapses from a select(models.SandboxConfig)
row-load to a select(exists(...)) gate query; has_usable is stored
directly on SandboxAvailability (SandboxConfigCapabilities/configs
dropped). Both code-evaluator prompts drop <available_sandbox_configs>
and invert the "do NOT issue a sandboxProviders query" instruction into
an on-demand phoenix-gql fetch+filter (envVars{name}, never secretKey).
Pre-turn capability gating is unchanged.
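
The difference between a row-load and an existence gate can be illustrated with the standard-library `sqlite3` module. The PR itself uses SQLAlchemy's `select(exists(...))`; the table name, column, and the definition of "usable" as simply `enabled = 1` below are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sandbox_configs (id INTEGER PRIMARY KEY, enabled INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO sandbox_configs (enabled) VALUES (?)", [(0,), (1,), (1,)])


def load_usable_rows(db: sqlite3.Connection) -> list:
    """Row-load style: fetch every matching row, then inspect them in Python."""
    return db.execute(
        "SELECT id, enabled FROM sandbox_configs WHERE enabled = 1"
    ).fetchall()


def has_usable_sandbox(db: sqlite3.Connection) -> bool:
    """Gate style: ask the database for a single boolean and nothing else."""
    (flag,) = db.execute(
        "SELECT EXISTS (SELECT 1 FROM sandbox_configs WHERE enabled = 1)"
    ).fetchone()
    return bool(flag)
```

The gate form returns one value however many configs exist, which fits a pre-turn capability check that only needs a yes or no.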

Frontend (#5/#9): derive the OutputConfigDraft element types from the
canonical @phoenix/types AnnotationConfig + explicit `kind`; couple the
Zod schemas with a `satisfies` guard; centralize the form->draft
conversion behind one documented function (no compile-time exhaustiveness
claim — canonical union is undiscriminated).

Frontend (#7): unify the three empty-input idioms onto one shared
parseEmptyToolInput; delete the bespoke open-form parser.

Frontend (E): add minimal orienting comments naming the playgroundPrompt
template + lifecycle; re-home open_code_evaluator_form's name-constant and
Input type into the codeEvaluatorDraft module (registry owns only
registration). No renames.

Frontend (#11/#13): remove the global vitest.setup.ts storage mock and
delete AgentContextPills.test.tsx. The branch's jsdom bump ships a
non-functional native localStorage, so the mock is replaced with an
opt-in installTestStorage util imported by the store-mounting tests
rather than reverted (which would regress on-main tests).

Lean out tests flagged as no-op or over-pinned in the PR #13397 review:
- codeEvaluatorDraft.test.ts: drop the sandbox-null create-mode cases and
  the model-alias edit/test-payload pending-edit cases
- delete EvaluatorNameInput.test.tsx
- test_capabilities.py: loosen exact-wording prompt assertions to
  behavior-level checks
- test_agents.py: drop provider-disabled / empty-inventory loader cases,
  the AgentDependencies override case, and the
  EditCodeEvaluatorDraft viewer-gate suite

…t collection

The agent draft test-run tool lived in a production module named
test_code_evaluator_draft.py with a Test*-prefixed capability class — both match
pytest's default collection globs (test_*.py, python_classes=Test*), a latent
footgun and a misleading inflater of the test-file surface. Rename the module to
run_code_evaluator_draft.py and the class to RunCodeEvaluatorDraftCapability,
matching the existing run_playground sibling. The model-facing tool name
(test_code_evaluator_draft) and prompt accessor are unchanged.
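
The footgun can be checked with an approximation of pytest's default collection rules: `python_files` defaults to `test_*.py` and `*_test.py`, and `python_classes` defaults to the `Test` prefix. The helper names are hypothetical, and the old class name below is assumed, since the commit only says it was `Test*`-prefixed.

```python
from fnmatch import fnmatchcase

# pytest's default python_files patterns.
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")
# pytest's default python_classes value is a plain prefix.
DEFAULT_CLASS_PREFIX = "Test"


def looks_like_test_module(filename: str) -> bool:
    """True if the file name matches a default pytest module pattern."""
    return any(fnmatchcase(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


def looks_like_test_class(class_name: str) -> bool:
    """True if the class name starts with the default pytest class prefix."""
    return class_name.startswith(DEFAULT_CLASS_PREFIX)


old_module = "test_code_evaluator_draft.py"
new_module = "run_code_evaluator_draft.py"
old_class = "TestCodeEvaluatorDraftCapability"  # assumed name, for illustration
new_class = "RunCodeEvaluatorDraftCapability"
```

Under these defaults the old names match and the new ones do not, so the renamed production module cannot be picked up as a test file.
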
Trim storylike 'how it works' docstrings and temporal phrasing (no longer /
now lives) from the code-evaluator authoring flow. Keep comments that
disambiguate a contract the name or signature does not — per-field prop docs,
the cross-boundary 'Python schema is the model-facing source of truth' note,
and the silent-fall-through footgun in the output-config converter. No behavior
change.

The rebase onto main (which added PXI auto-mode: permissions.edits
manual|bypass) left two integration gaps:

- agents/types.py dropped the `Literal` import while keeping main's
  `edit_permission: Literal["manual", "bypass"]` field, breaking import.
- EditPromptToolDetails merged main's auto-accept label logic with our
  refactor to the shared DiffAcceptRejectToolDetails; reconcile the file
  (import PromptSnapshot, drop the dead summary path) while preserving
  main's "Auto-approved" state label.

Wire the edit_code_evaluator_draft confirmation dialog into the new PXI
auto-mode (permissions.edits === "bypass"), mirroring edit_prompt and
batch_span_annotate:

- accept() now takes { approvalSource } and stamps acceptedBy on output
- createEditCodeEvaluatorDraftClientAction takes shouldAutoAccept; when
  true it applies the edit immediately and skips setPendingCodeEvaluatorEdit
  so no accept/reject dialog is surfaced
- EditCodeEvaluatorDialogContent passes shouldAutoAccept reading
  agentStore.permissions.edits === "bypass"
- tool-details renders "Auto-approved" vs "Accepted"
- add a unit test asserting auto-mode never surfaces the confirmation

The anthropic SDK 0.105 added 'system' to MessageParam['role']
(Literal['user','assistant'] -> Literal['user','assistant','system']),
which broke the exhaustiveness check in _RoleConversion.from_anthropic
and failed the third-party SDK canary (pyright assert_never).

Map an anthropic system message to a phoenix 'system' PromptMessage role
and widen the return type, mirroring the openai helper's from_openai.
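
The exhaustiveness pattern behind this fix can be sketched without the SDK. The `Literal` aliases below are local stand-ins for the anthropic and phoenix role types, and the user/assistant mappings are assumed; only the system-to-system mapping is stated in the commit.

```python
from typing import Literal, NoReturn

try:
    from typing import assert_never  # Python 3.11+
except ImportError:  # fallback for older interpreters

    def assert_never(value: NoReturn) -> NoReturn:
        raise AssertionError(f"Unhandled value: {value!r}")


# Stand-in for MessageParam["role"] after the SDK widened it.
AnthropicRole = Literal["user", "assistant", "system"]
# Stand-in for the phoenix PromptMessage role values used here.
PhoenixRole = Literal["user", "assistant", "system"]


def from_anthropic(role: AnthropicRole) -> PhoenixRole:
    """Map an anthropic role to a phoenix role, covering every literal."""
    if role == "user":
        return "user"
    if role == "assistant":
        return "assistant"
    if role == "system":
        return "system"
    # A type checker flags this call if a literal above is left unhandled.
    assert_never(role)
```

When an upstream type gains a new literal, a checker such as pyright reports the `assert_never` call until a branch is added, which is how the canary caught the SDK change.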

Labels

size:XXL This PR changes 1000+ lines, ignoring generated files.

Projects

Status: ✅ Done

4 participants