feat(pxi): code-evaluator authoring #13397
Merged
Post-churn cleanup of the PXI code-evaluator authoring surfaces.

- Dataset context now defers code-evaluator draft guidance to the always-rendered <phoenix_code_evaluator_context> instead of carrying a duplicate, separately-gated copy; drops can_open/can_edit/can_test flags from dataset.py and renames is_code_evaluator_surface -> is_code_evaluator_form_mounted. Fixes the two failing dataset-context handoff tests that the prior template drift had left red.
- Extracts the inline open_experiment_evaluator_form and test_code_evaluator_draft tool instructions into .xml.j2 templates wired through AgentPrompts, matching the read/edit tool convention.
- Types _validate_code_evaluator_sandbox_config's decrypt as Callable[[bytes], bytes] (was Any); drops now-unused Any import.
- Removes dead createCodeEvaluatorActionContextSchema alias, trims the testPayload alias shim to ["test_payload"], makes toOutputConfigDraft module-private, and drops the unused searchParams barrel re-export.
- Unifies the playground experiment-evaluator-form prop vocabulary so the lifted-open state keeps one name from Playground to PlaygroundEvaluatorSelect.
- Regenerate schemas/openapi.json and the app + phoenix-client TS types so the DatasetContext schema description matches the trimmed Pydantic docstring (the docstring is surfaced as the OpenAPI description).
- Apply ruff format to edit_code_evaluator_draft.py (collapse a two-line boolean that fits on one line).

Fixes the Build OpenAPI Schema and Format and Lint CI checks.
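The `decrypt` typing change in the cleanup above can be illustrated with a minimal sketch. Only the `Callable[[bytes], bytes]` annotation comes from the commit message; the function name `validate_sandbox_config`, its parameters, and the toy XOR cipher are hypothetical stand-ins, not Phoenix's implementation.

```python
from typing import Callable


def validate_sandbox_config(
    encrypted_config: bytes,
    decrypt: Callable[[bytes], bytes],  # was `Any` before the cleanup
) -> str:
    """Hypothetical validator: decrypt a stored blob and return it as text."""
    plaintext = decrypt(encrypted_config)
    if not plaintext:
        raise ValueError("sandbox config is empty")
    return plaintext.decode("utf-8")


def xor_decrypt(data: bytes) -> bytes:
    """Toy reversible cipher used only for this example."""
    return bytes(b ^ 0x2A for b in data)


# XOR is its own inverse, so the same helper "encrypts" the sample input.
encrypted = xor_decrypt(b'{"language": "python"}')
config_text = validate_sandbox_config(encrypted, xor_decrypt)
```

With the narrower annotation, a type checker can reject a callback that takes or returns `str`, which `Any` would have silently accepted.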
The create/edit code-evaluator dialog now renders the output config's name
in a disabled "Name" TextField, so the existing
`getByRole("textbox", { name: /^Name.../ })` selector matched two elements
(strict-mode violation). Scope the selector to the editable field with
`disabled: false` so it targets the evaluator name input.
Fixes the e2e-test (Playwright) Code Evaluators specs.
Catch the multi-line getByRole selector in the edit test that the prior single-line fix missed, so all four occurrences exclude the disabled output-config 'Name' field via disabled: false.
CreateCodeEvaluator / PatchCodeEvaluator no longer require the sandbox backend to be fully AVAILABLE (installed runtime deps + downloaded binary) at authoring time — that is an execution-time concern. The mutation still validates that the referenced sandbox config exists and is enabled, its provider is enabled, and the language matches. Reverts the over-strict create-time availability gate added in 6e15f60 ("harden code evaluator authoring flow"), which broke the TestSandboxAndCodeEvaluatorPermissions integration test: the integration server lacks the WASM runtime (wasmtime is a dev-only dependency), so the gate rejected evaluator creation there.
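A rough sketch of the relaxed authoring-time check, assuming a simplified config record. The `SandboxConfig` dataclass, its field names, and every error message except "Sandbox config not found" are hypothetical; the point is only that backend availability is not consulted when the evaluator is created or patched.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SandboxConfig:
    """Hypothetical, simplified view of a sandbox config row."""
    enabled: bool
    provider_enabled: bool
    language: str
    backend_available: bool  # runtime deps installed and binary downloaded


def validate_for_authoring(
    config: Optional[SandboxConfig], evaluator_language: str
) -> None:
    """Checks that still run at create/patch time."""
    if config is None:
        raise ValueError("Sandbox config not found")
    if not config.enabled:
        raise ValueError("Sandbox config is disabled")
    if not config.provider_enabled:
        raise ValueError("Sandbox provider is disabled")
    if config.language != evaluator_language:
        raise ValueError("Sandbox language does not match the evaluator")
    # config.backend_available is deliberately NOT checked here:
    # it is an execution-time concern.


# A server without the WASM runtime can still author an evaluator.
no_runtime = SandboxConfig(
    enabled=True, provider_enabled=True, language="python", backend_available=False
)
validate_for_authoring(no_runtime, "python")  # does not raise
```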
DeleteSandboxConfig was placed at the end of Tier 1, before the Tier 2
evaluator writes that reference the same setup config. For admin roles the
delete succeeds, so the subsequent CreateCodeEvaluator failed the
config-exists validation ("Sandbox config not found"). Move the delete to
the actual end of the test — after every evaluator write and the Tier 3
reads — honoring the existing "delete runs last" intent.
This surfaced only after relaxing the create-time backend-availability gate
(which previously failed the test earlier, for all roles).
…luator_form The tool opens the dataset-backed create-code-evaluator form (CreateCodeDatasetEvaluatorSlideover) so the read/edit/test draft tools become available; it does not persist anything. "Experiment evaluator" is not part of the domain model — these are dataset (code) evaluators — so the name was misleading and inconsistent with the sibling *_code_evaluator_draft tools. Renames across the server↔client wire contract and all related symbols: the tool-name string, capability/class, prompt template + AgentPrompts field, frontend constant/type/parser/agent-tool, and the experimentEvaluatorForm* playground state/props (now codeEvaluatorForm*). No behavior change.
…nism Remove the optimistic-concurrency revision guard from the code-evaluator draft co-pilot tools. It is premature hardening for an unsaved, single-user draft: accept re-applies the agent's scoped set-operations to the current form state, so untouched fields are never clobbered, and the explicit accept gesture is the user's consent. Drops buildDraftRevision, the revision/expectedRevision fields and tool-schema params, the propose-time and accept-time checks, and the test-section staleness check (replaced by an isDraftMounted mount check). Simplifies the read/edit/test tool descriptions and prompt templates. prompt_instance is intentionally left unchanged.
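The argument that untouched fields are never clobbered can be sketched as follows. The real tools live in the TypeScript frontend; this Python version, including the `apply_scoped_edit` helper and the sample field values, is an illustrative assumption rather than the actual code.

```python
from typing import Any, Dict

FormState = Dict[str, Any]


def apply_scoped_edit(current_form: FormState, set_operations: FormState) -> FormState:
    """Re-apply only the fields the agent proposed onto the form as it is now."""
    updated = dict(current_form)
    updated.update(set_operations)
    return updated


# Propose time: the agent only touches sourceCode.
form_at_propose = {"name": "exact_match", "sourceCode": "return 0"}
proposed = {"sourceCode": "return 1"}

# Before accepting, the user renames the evaluator by hand.
form_at_accept = {**form_at_propose, "name": "exact_match_v2"}

# Accept time: the edit lands on the current state, so the rename survives.
accepted = apply_scoped_edit(form_at_accept, proposed)
```

Because the edit is a set of per-field operations rather than a full snapshot, a stale proposal can only overwrite the fields it explicitly names.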
Close the three open consolidation seams of the code-evaluator PXI tools, all behavior-preserving:

- Extract a shared DiffAcceptRejectToolDetails<T> generic owning the diff render, accept/reject footer, and a single CSS namespace; reduce EditPromptToolDetails and EditCodeEvaluatorDraftToolDetails to thin wrappers that inject only snapshotToText/fileName/renderHeader/labels (collapses two ~95%-identical renderers; ToolPart.tsx switch untouched).
- Derive the agent-tool Zod enums at runtime from canonical @phoenix/types as-const tuples (CODE_EVALUATOR_LANGUAGES, EVALUATOR_OPTIMIZATION_DIRECTIONS) via z.enum(TUPLE), so the union type and validator share one source. Backfill tool-registry docstrings; drop two no-op dispatch tests.
- Render every active AgentContextPills context as a labeled pill; delete the isAuxiliaryFormContext display-hide (no hide, no cap). code_evaluator stays a peer context on the wire (/chat payload unchanged).

tsc --noEmit + 48 touched-spec tests green.
Close the structural/maintainability half of the PR #13397 review by bringing each flagged thread back onto its canonical seam.

- Server (#2): route the sandbox inventory through the phoenix-gql seam. _load_sandbox_availability collapses from a select(models.SandboxConfig) row-load to a select(exists(...)) gate query; has_usable is stored directly on SandboxAvailability (SandboxConfigCapabilities/configs dropped). Both code-evaluator prompts drop <available_sandbox_configs> and invert the "do NOT issue a sandboxProviders query" instruction into an on-demand phoenix-gql fetch+filter (envVars{name}, never secretKey). Pre-turn capability gating is unchanged.
- Frontend (#5/#9): derive the OutputConfigDraft element types from the canonical @phoenix/types AnnotationConfig + explicit `kind`; couple the Zod schemas with a `satisfies` guard; centralize the form->draft conversion behind one documented function (no compile-time exhaustiveness claim: canonical union is undiscriminated).
- Frontend (#7): unify the three empty-input idioms onto one shared parseEmptyToolInput; delete the bespoke open-form parser.
- Frontend (E): add minimal orienting comments naming the playgroundPrompt template + lifecycle; re-home open_code_evaluator_form's name-constant and Input type into the codeEvaluatorDraft module (registry owns only registration). No renames.
- Frontend (#11/#13): remove the global vitest.setup.ts storage mock and delete AgentContextPills.test.tsx. The branch's jsdom bump ships a non-functional native localStorage, so the mock is replaced with an opt-in installTestStorage util imported by the store-mounting tests rather than reverted (which would regress on-main tests).
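The row-load versus gate-query distinction in the server change can be sketched with the standard-library `sqlite3` module. The PR itself uses SQLAlchemy's `select(exists(...))`; the raw SQL, the `sandbox_configs` table, and its `enabled` column below are assumptions made so the example runs without extra dependencies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, minimal table standing in for the sandbox config model.
conn.execute(
    "CREATE TABLE sandbox_configs (id INTEGER PRIMARY KEY, enabled INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO sandbox_configs (enabled) VALUES (?)", [(0,), (1,), (1,)])


def has_usable_sandbox(connection: sqlite3.Connection) -> bool:
    """Gate query: ask only whether any usable row exists, loading no rows."""
    row = connection.execute(
        "SELECT EXISTS(SELECT 1 FROM sandbox_configs WHERE enabled = 1)"
    ).fetchone()
    return bool(row[0])


has_usable = has_usable_sandbox(conn)

conn.execute("UPDATE sandbox_configs SET enabled = 0")
has_usable_after_disable = has_usable_sandbox(conn)
```

The gate returns a single boolean, which matches storing `has_usable` directly on the availability object instead of carrying a list of config rows into the prompt.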
Lean out tests flagged as no-op or over-pinned in the PR #13397 review:

- codeEvaluatorDraft.test.ts: drop the sandbox-null create-mode cases and the model-alias edit/test-payload pending-edit cases
- delete EvaluatorNameInput.test.tsx
- test_capabilities.py: loosen exact-wording prompt assertions to behavior-level checks
- test_agents.py: drop provider-disabled / empty-inventory loader cases, the AgentDependencies override case, and the EditCodeEvaluatorDraft viewer-gate suite
…t collection The agent draft test-run tool lived in a production module named test_code_evaluator_draft.py with a Test*-prefixed capability class — both match pytest's default collection globs (test_*.py, python_classes=Test*), a latent footgun and a misleading inflater of the test-file surface. Rename the module to run_code_evaluator_draft.py and the class to RunCodeEvaluatorDraftCapability, matching the existing run_playground sibling. The model-facing tool name (test_code_evaluator_draft) and prompt accessor are unchanged.
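To make the collection footgun concrete, the sketch below approximates pytest's default name matching with `fnmatch`. The defaults (`python_files` of `test_*.py` and `*_test.py`, `python_classes` prefix `Test`) are pytest's documented defaults; the pre-rename class name is not spelled out in the commit message, so `TestCodeEvaluatorDraftCapability` is a placeholder, and real collection involves further rules (for example, classes with an `__init__` are skipped).

```python
from fnmatch import fnmatch

# pytest's default discovery patterns (simplified).
DEFAULT_PYTHON_FILES = ("test_*.py", "*_test.py")
DEFAULT_CLASS_PREFIX = "Test"


def looks_like_test_module(filename: str) -> bool:
    """True when the filename matches one of the default module globs."""
    return any(fnmatch(filename, pattern) for pattern in DEFAULT_PYTHON_FILES)


def looks_like_test_class(class_name: str) -> bool:
    """True when the class name carries the default Test prefix."""
    return class_name.startswith(DEFAULT_CLASS_PREFIX)


old_module_matches = looks_like_test_module("test_code_evaluator_draft.py")
new_module_matches = looks_like_test_module("run_code_evaluator_draft.py")

# Placeholder for the old Test*-prefixed capability class name.
old_class_matches = looks_like_test_class("TestCodeEvaluatorDraftCapability")
new_class_matches = looks_like_test_class("RunCodeEvaluatorDraftCapability")
```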
Trim storylike 'how it works' docstrings and temporal phrasing (no longer / now lives) from the code-evaluator authoring flow. Keep comments that disambiguate a contract the name or signature does not — per-field prop docs, the cross-boundary 'Python schema is the model-facing source of truth' note, and the silent-fall-through footgun in the output-config converter. No behavior change.
The rebase onto main (which added PXI auto-mode: permissions.edits manual|bypass) left two integration gaps:

- agents/types.py dropped the `Literal` import while keeping main's `edit_permission: Literal["manual", "bypass"]` field, breaking import.
- EditPromptToolDetails merged main's auto-accept label logic with our refactor to the shared DiffAcceptRejectToolDetails; reconcile the file (import PromptSnapshot, drop the dead summary path) while preserving main's "Auto-approved" state label.
Wire the edit_code_evaluator_draft confirmation dialog into the new PXI
auto-mode (permissions.edits === "bypass"), mirroring edit_prompt and
batch_span_annotate:
- accept() now takes { approvalSource } and stamps acceptedBy on output
- createEditCodeEvaluatorDraftClientAction takes shouldAutoAccept; when
true it applies the edit immediately and skips setPendingCodeEvaluatorEdit
so no accept/reject dialog is surfaced
- EditCodeEvaluatorDialogContent passes shouldAutoAccept reading
agentStore.permissions.edits === "bypass"
- tool-details renders "Auto-approved" vs "Accepted"
- add a unit test asserting auto-mode never surfaces the confirmation
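The auto-mode branch described above might look roughly like this. The real code is a TypeScript client action; this Python sketch, the `DraftStore` class, the result dictionary shape, and the `"auto"` value for `acceptedBy` are all assumptions used to show the control flow only.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional

EditPermission = Literal["manual", "bypass"]


@dataclass
class DraftStore:
    """Hypothetical stand-in for the form state plus the pending-edit slot."""
    form: Dict[str, str]
    pending_edit: Optional[Dict[str, str]] = None


def propose_edit(
    store: DraftStore, edit: Dict[str, str], edit_permission: EditPermission
) -> Dict[str, str]:
    should_auto_accept = edit_permission == "bypass"
    if should_auto_accept:
        # Apply immediately; no pending edit means no accept/reject dialog.
        store.form.update(edit)
        return {"status": "accepted", "acceptedBy": "auto"}
    # Manual mode: park the edit so the dialog can ask for confirmation.
    store.pending_edit = edit
    return {"status": "pending"}


manual_store = DraftStore(form={"sourceCode": "return 0"})
manual_result = propose_edit(manual_store, {"sourceCode": "return 1"}, "manual")

auto_store = DraftStore(form={"sourceCode": "return 0"})
auto_result = propose_edit(auto_store, {"sourceCode": "return 1"}, "bypass")
```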
The anthropic SDK 0.105 added 'system' to MessageParam['role'] (Literal['user','assistant'] -> Literal['user','assistant','system']), which broke the exhaustiveness check in _RoleConversion.from_anthropic and failed the third-party SDK canary (pyright assert_never). Map an anthropic system message to a phoenix 'system' PromptMessage role and widen the return type, mirroring the openai helper's from_openai.
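A minimal sketch of the exhaustiveness pattern involved, assuming simplified role types. The `AnthropicRole` and `PhoenixRole` aliases and the identity mapping for `user` and `assistant` are illustrative assumptions; only the `system` to `system` mapping and the use of `assert_never` come from the commit message.

```python
import sys
from typing import Literal

if sys.version_info >= (3, 11):
    from typing import assert_never
else:
    from typing import NoReturn

    def assert_never(value: object) -> NoReturn:
        """Fallback for interpreters without typing.assert_never."""
        raise AssertionError(f"Unhandled value: {value!r}")


# Widened by the SDK update: 'system' joined 'user' and 'assistant'.
AnthropicRole = Literal["user", "assistant", "system"]
# Hypothetical subset of the Phoenix prompt message roles.
PhoenixRole = Literal["user", "assistant", "system"]


def from_anthropic(role: AnthropicRole) -> PhoenixRole:
    if role == "user":
        return "user"
    if role == "assistant":
        return "assistant"
    if role == "system":
        return "system"
    # If the SDK adds another role, a type checker flags this call because
    # `role` is no longer narrowed to Never.
    assert_never(role)


converted = [from_anthropic(r) for r in ("user", "assistant", "system")]
```

Without the `system` branch, a checker such as pyright would report the `assert_never` call, which is how the canary failure surfaced.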
This was referenced Jun 1, 2026
Summary
Adds PXI-assisted code-evaluator authoring through the existing Phoenix UI surfaces.
Dataset handoff mode. When the user is on a dataset surface and the code-evaluator form is not mounted, PXI does not create an evaluator directly. It either opens the form in place from a dataset-backed playground via `open_experiment_evaluator_form`, or links the user to the dataset Evaluators create surface. Persistence still happens only when the user clicks Save in the slideover.

Form co-pilot mode. When a Create/Edit Code Evaluator dialog is open, PXI can:

- `read_code_evaluator_draft`: read the open draft, including `sourceCode`, `sandboxConfigId`, `inputMapping`, `outputConfigs`, `testPayload`, and a revision token.
- `edit_code_evaluator_draft`: propose an accept/reject diff against the open form. Accept applies the draft locally; reject leaves the form untouched. Revision checks guard both propose-time and accept-time staleness.
- `test_code_evaluator_draft`: run the current accepted draft through the preview path before the user saves.

The tools and context prompts now share the same availability gates: viewers get read-only behavior, create-mode edits require a usable sandbox, edit-mode draft edits can still be proposed without a sandbox, and draft tests require a usable sandbox.
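The shared availability gates can be summarized as a small decision function. This is a sketch of the rules as stated in the summary, not the server's implementation; `DraftCapabilities`, `resolve_capabilities`, and their parameter names are hypothetical.

```python
from dataclasses import dataclass
from typing import Literal

FormMode = Literal["create", "edit"]


@dataclass(frozen=True)
class DraftCapabilities:
    can_read: bool
    can_edit: bool
    can_test: bool


def resolve_capabilities(
    is_viewer: bool, form_mode: FormMode, has_usable_sandbox: bool
) -> DraftCapabilities:
    """Apply the gates listed in the summary to one open draft."""
    if is_viewer:
        # Viewers get read-only behavior regardless of sandbox state.
        return DraftCapabilities(can_read=True, can_edit=False, can_test=False)
    # Create-mode edits need a usable sandbox; edit-mode edits do not.
    can_edit = form_mode == "edit" or has_usable_sandbox
    # Draft tests always need a usable sandbox.
    return DraftCapabilities(can_read=True, can_edit=can_edit, can_test=has_usable_sandbox)


viewer = resolve_capabilities(True, "edit", True)
create_no_sandbox = resolve_capabilities(False, "create", False)
edit_no_sandbox = resolve_capabilities(False, "edit", False)
create_with_sandbox = resolve_capabilities(False, "create", True)
```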
Architecture
Screen.Recording.2026-05-28.at.12.41.16.PM.mov