feat: extract evaluation framework into uipath-eval package #1710
Chibionos wants to merge 1 commit into
Move src/uipath/eval to a new namespace package distribution (packages/uipath-eval, import path unchanged: uipath.eval) so the evaluation framework (evaluators, mocking, eval runtime) can be consumed standalone, e.g. by the python eval worker in the agents backend, without pulling in the CLI and the rest of the SDK.

- uipath-eval 0.1.0: depends only on uipath-core, uipath-platform, uipath-runtime (+ mockito, pydantic-function-models, coverage, which move out of the main package's dependencies)
- uipath 2.10.82 depends on uipath-eval>=0.1.0,<0.2.0; editable link via [tool.uv.sources]
- pure-eval tests move with the code (731 tests); CLI-coupled eval tests (discovery, telemetry, progress reporter, live tracking) stay in packages/uipath
- the three legacy evaluators' relative import of uipath._utils.constants now uses the eval-local constant
- CI: detect_changed_packages dependency graph, test/lint jobs, cd.yml publish tier (core -> platform -> eval -> uipath), labeler

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
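The pin `uipath-eval>=0.1.0,<0.2.0` accepts any 0.1.x release and rejects 0.2.0. A minimal sketch of that range check follows. It uses plain tuple comparison instead of a real dependency resolver, the helper names are hypothetical, and it assumes three-part numeric versions with no pre-release tags.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Split a dotted numeric version such as '0.1.7' into (0, 1, 7)."""
    return tuple(int(part) for part in version.split("."))


def satisfies_eval_pin(version: str) -> bool:
    """Return True when the version falls inside >=0.1.0,<0.2.0."""
    # Tuples compare element by element, which matches numeric version order
    # for plain three-part versions.
    return (0, 1, 0) <= parse_version(version) < (0, 2, 0)
```

Under this range a patch release of `uipath-eval` is picked up without touching `uipath`, while a 0.2.0 release would need an explicit bump of the pin.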
Pull request overview
Extracts the uipath.eval evaluation framework into a standalone uipath-eval distribution (while preserving from uipath.eval... import paths) and wires the monorepo/CI/release pipeline so uipath depends on uipath-eval instead of bundling eval internals directly.
Changes:
- Adds a new `packages/uipath-eval` package containing evaluators, models, mocks/simulation, and the eval runtime (+ its test suite).
- Updates `uipath` to depend on `uipath-eval` (and shifts eval-only deps like `mockito`/`coverage` accordingly).
- Updates CI and CD workflows/scripts to test, lint, and publish `uipath-eval` in the correct dependency tier.
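Preserving `from uipath.eval...` across two distributions relies on implicit namespace packages: when no `__init__.py` sits at the shared top-level directory, Python merges every matching directory on `sys.path` into one package. A minimal sketch of that mechanism, using a throwaway `demo_ns` namespace and temporary directories in place of the real `uipath` distributions (all names here are hypothetical):

```python
import importlib
import sys
import tempfile
from pathlib import Path


def build_distribution(root: Path, subpackage: str, body: str) -> None:
    """Create <root>/demo_ns/<subpackage>/__init__.py, with no demo_ns/__init__.py."""
    package_dir = root / "demo_ns" / subpackage
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(body)


def import_from_two_distributions() -> tuple[str, str, int]:
    """Import two subpackages of one namespace from two separate roots."""
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        build_distribution(Path(first), "cli", "NAME = 'cli'\n")
        build_distribution(Path(second), "eval", "NAME = 'eval'\n")
        sys.path[:0] = [first, second]
        try:
            importlib.invalidate_caches()
            cli = importlib.import_module("demo_ns.cli")
            evaluation = importlib.import_module("demo_ns.eval")
            # The namespace package's __path__ lists one portion per root.
            portions = len(list(sys.modules["demo_ns"].__path__))
            return cli.NAME, evaluation.NAME, portions
        finally:
            # Undo the sys.path and sys.modules changes made above.
            sys.path.remove(first)
            sys.path.remove(second)
            for name in ("demo_ns", "demo_ns.cli", "demo_ns.eval"):
                sys.modules.pop(name, None)
```

Adding an `__init__.py` at the shared top level in either root would turn that root into a regular package and hide the other portion.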
Reviewed changes
Copilot reviewed 34 out of 130 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/uipath/pyproject.toml | Add uipath-eval dependency; bump uipath version. |
| packages/uipath/uv.lock | Lockfile updates: add editable uipath-eval; move deps; bump uipath version. |
| packages/uipath-eval/README.md | New package README documenting modules and usage. |
| packages/uipath-eval/pyproject.toml | New package metadata, deps, tooling config, pytest/mypy/ruff settings. |
| packages/uipath-eval/.python-version | Pin package dev Python version. |
| packages/uipath-eval/CLAUDE.md | Package-specific dev notes and constraints. |
| CLAUDE.md | Repo doc updated for new 4th package and dependency chain. |
| .github/workflows/test-packages.yml | Add uipath-eval test matrix job. |
| .github/workflows/lint-packages.yml | Add uipath-eval lint/typecheck job. |
| .github/workflows/cd.yml | Add build/publish/wait tier for uipath-eval between platform and uipath. |
| .github/scripts/detect_changed_packages.py | Add uipath-eval to dependents graph. |
| .github/labeler.yml | Include uipath-eval globs for integration/langchain triggers. |
| packages/uipath-eval/src/uipath/eval/py.typed | Mark package as typed. |
| packages/uipath-eval/src/uipath/eval/constants.py | Eval package constants (folder names, custom prefix). |
| packages/uipath-eval/src/uipath/eval/_execution_context.py | Shared contextvars + span collector for runtime/mocks. |
| packages/uipath-eval/src/uipath/eval/_helpers/__init__.py | Helpers package init. |
| packages/uipath-eval/src/uipath/eval/_helpers/helpers.py | Helper utilities (e.g., emptiness checks, metrics wrapper). |
| packages/uipath-eval/src/uipath/eval/_helpers/output_path.py | Utility for resolving nested output paths (a.b[0]). |
| packages/uipath-eval/src/uipath/eval/_helpers/evaluators_helpers.py | Evaluator helper functions/constants used across evaluators. |
| packages/uipath-eval/src/uipath/eval/models/__init__.py | Public exports for eval models. |
| packages/uipath-eval/src/uipath/eval/models/models.py | Core eval result/trace models (minor typing tweak in diff). |
| packages/uipath-eval/src/uipath/eval/models/_conversational_utils.py | Conversational eval input/output helpers. |
| packages/uipath-eval/src/uipath/eval/models/evaluation_set.py | Eval set + item models (incl GUID id normalization). |
| packages/uipath-eval/src/uipath/eval/models/llm_judge_types.py | LLM judge prompt/output schema models. |
| packages/uipath-eval/src/uipath/eval/mocks/__init__.py | Public exports for mocks/simulation API. |
| packages/uipath-eval/src/uipath/eval/mocks/mockable.py | @mockable decorator for mocking/simulation. |
| packages/uipath-eval/src/uipath/eval/mocks/_types.py | Pydantic schemas for mocking/simulation config. |
| packages/uipath-eval/src/uipath/eval/mocks/_mocker.py | Mocker interface + mock-related exceptions. |
| packages/uipath-eval/src/uipath/eval/mocks/_mocker_factory.py | Factory to select LLM vs mockito mocker. |
| packages/uipath-eval/src/uipath/eval/mocks/_mockito_mocker.py | Mockito-backed mocker implementation. |
| packages/uipath-eval/src/uipath/eval/mocks/_llm_mocker.py | LLM tool-response mocking implementation. |
| packages/uipath-eval/src/uipath/eval/mocks/_input_mocker.py | LLM input-generation mocking implementation. |
| packages/uipath-eval/src/uipath/eval/mocks/_cache_manager.py | Cache manager for mocker responses (memory + disk). |
| packages/uipath-eval/src/uipath/eval/mocks/_mock_context.py | Contextvars + helpers for mock resolution/simulation checks. |
| packages/uipath-eval/src/uipath/eval/mocks/_mock_runtime.py | Runtime delegate wrapping execution with mock context. |
| packages/uipath-eval/src/uipath/eval/mocks/_structured_output.py | Structured-output helper used by mocking. |
| packages/uipath-eval/src/uipath/eval/helpers.py | Eval set loading/migration + evaluator loading helpers. |
| packages/uipath-eval/src/uipath/eval/runtime/__init__.py | Runtime public API re-exports (evaluate, context, types). |
| packages/uipath-eval/src/uipath/eval/runtime/context.py | UiPathEvalContext container for runtime execution. |
| packages/uipath-eval/src/uipath/eval/runtime/events.py | Event types + payload models for eval progress reporting. |
| packages/uipath-eval/src/uipath/eval/runtime/_evaluate.py | evaluate() entrypoint wrapper around UiPathEvalRuntime. |
| packages/uipath-eval/src/uipath/eval/runtime/runtime.py | Main eval runtime implementation. |
| packages/uipath-eval/src/uipath/eval/runtime/_parallelization.py | Async worker-queue parallel execution helper. |
| packages/uipath-eval/src/uipath/eval/runtime/_utils.py | Input override merging utilities. |
| packages/uipath-eval/src/uipath/eval/runtime/_types.py | Runtime result DTOs/types. |
| packages/uipath-eval/src/uipath/eval/runtime/_spans.py | Span persistence/extraction utilities. |
| packages/uipath-eval/src/uipath/eval/runtime/_exporters.py | Trace/log exporters integration. |
| packages/uipath-eval/src/uipath/eval/evaluators/__init__.py | Evaluator exports + EVALUATORS registry. |
| packages/uipath-eval/src/uipath/eval/evaluators/evaluator.py | Discriminated unions for coded vs legacy evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators/evaluator_factory.py | Factory for loading built-in and custom evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators/registration.py | CLI support for registering custom evaluators/types. |
| packages/uipath-eval/src/uipath/eval/evaluators/base_legacy_evaluator.py | Legacy evaluator base + line-by-line support. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_deterministic_evaluator_base.py | Shared deterministic evaluator utilities (canonical JSON). |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_exact_match_evaluator.py | Legacy deterministic exact match evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_json_similarity_evaluator.py | Legacy deterministic JSON similarity evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_llm_helpers.py | Legacy LLM function-calling helper utilities. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_llm_as_judge_evaluator.py | Legacy LLM-as-judge evaluator (split helpers/const use). |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_trajectory_evaluator.py | Legacy trajectory evaluator (split helpers/const use). |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_evaluator_utils.py | Legacy evaluator utilities (const import change). |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_context_precision_evaluator.py | Legacy context precision evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_faithfulness_evaluator.py | Legacy faithfulness evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/legacy_csv_exact_match_evaluator.py | Legacy CSV exact match evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/attachment_utils.py | Job-attachment URI download helpers. |
| packages/uipath-eval/src/uipath/eval/evaluators/line_by_line_utils.py | Line-by-line evaluation utilities used by legacy evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators/exact_match_evaluator.py | Coded exact-match evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/contains_evaluator.py | Coded contains evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/json_similarity_evaluator.py | Coded JSON similarity evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/llm_as_judge_evaluator.py | Coded LLM-as-judge core logic. |
| packages/uipath-eval/src/uipath/eval/evaluators/llm_judge_output_evaluator.py | Coded LLM judge output evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators/llm_judge_trajectory_evaluator.py | Coded LLM judge trajectory evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators/binary_classification_evaluator.py | Binary classification evaluator + aggregation. |
| packages/uipath-eval/src/uipath/eval/evaluators/multiclass_classification_evaluator.py | Multiclass classification evaluator + aggregation. |
| packages/uipath-eval/src/uipath/eval/evaluators/tool_call_order_evaluator.py | Tool call order evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/tool_call_args_evaluator.py | Tool call args evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/tool_call_count_evaluator.py | Tool call count evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/tool_call_output_evaluator.py | Tool call output evaluator. |
| packages/uipath-eval/src/uipath/eval/evaluators/base_evaluator.py | Core coded evaluator base infrastructure. |
| packages/uipath-eval/src/uipath/eval/evaluators/output_evaluator.py | Output extraction + aggregation helpers for coded evaluators. |
| packages/uipath-eval/src/uipath/eval/evaluators_types/generate_types.py | Script to generate JSON type specs. |
| packages/uipath-eval/src/uipath/eval/evaluators_types/*.json | Generated evaluator config/criteria/justification schemas. |
| packages/uipath-eval/tests/evaluators/__init__.py | Tests package init. |
| packages/uipath-eval/tests/evaluators/test_output_path.py | Tests for nested output-path resolution. |
| packages/uipath-eval/tests/evaluators/test_helpers.py | Tests for helper utilities (e.g., is_empty_value). |
| packages/uipath-eval/tests/evaluators/test_legacy_trajectory_evaluator.py | Regression test for legacy trajectory prompt compaction. |
| packages/uipath-eval/tests/evaluators/test_evaluator_factory.py | EvaluatorFactory tests (incl config prep and loading). |
| packages/uipath-eval/tests/evaluators/test_attachment_utils.py | Tests for attachment URI parsing/downloading helpers. |
| packages/uipath-eval/tests/evaluators/test_documentation_examples.py | Documentation example coverage tests. |
| packages/uipath-eval/tests/evaluators/test_eval_level_expected_output.py | Tests around expected output placement. |
| packages/uipath-eval/tests/evaluators/test_evaluator_aggregation.py | Aggregation behavior tests for evaluators. |
| packages/uipath-eval/tests/evaluators/test_evaluator_helpers.py | Tests for evaluator helper functions. |
| packages/uipath-eval/tests/evaluators/test_evaluator_methods.py | Broad evaluator behavior tests. |
| packages/uipath-eval/tests/evaluators/test_evaluator_schemas.py | Schema generation/validation tests. |
| packages/uipath-eval/tests/evaluators/test_legacy_target_output_key_paths.py | Legacy targetOutputKey path tests. |
| packages/uipath-eval/tests/evaluators/test_line_by_line_utils.py | Tests for line-by-line evaluation utilities. |
| packages/uipath-eval/tests/evaluators/test_llm_judge_placeholder_validation.py | Tests for LLM judge prompt placeholder validation. |
| packages/uipath-eval/tests/eval/test_evaluate.py | End-to-end eval runtime tests invoking evaluate(). |
| packages/uipath-eval/tests/eval/test_eval_tracing_integration.py | Tracing integration tests for runtime/evals. |
| packages/uipath-eval/tests/eval/test_eval_runtime_suspend_resume.py | Suspend/resume flow tests. |
| packages/uipath-eval/tests/eval/test_eval_runtime_metadata.py | Runtime metadata access tests. |
| packages/uipath-eval/tests/eval/test_eval_resume_flow.py | Resume-mode selection/validation tests. |
| packages/uipath-eval/tests/eval/test_eval_id_casing.py | Regression tests for case-insensitive GUID ids. |
| packages/uipath-eval/tests/eval/test_conversational_utils.py | Conversational eval conversion tests. |
| packages/uipath-eval/tests/eval/test_input_overrides_e2e.py | E2E tests for per-eval input overrides utilities. |
| packages/uipath-eval/tests/eval/test_apply_file_overrides.py | Tests for applying file/attachment overrides in inputs. |
| packages/uipath-eval/tests/eval/test_eval_runtime_spans.py | Span handling/persistence tests. |
| packages/uipath-eval/tests/eval/test_eval_set.py | Eval set parsing/migration tests. |
| packages/uipath-eval/tests/eval/test_eval_span_utils.py | Span utility tests. |
| packages/uipath-eval/tests/eval/test_eval_util.py | Misc eval util tests. |
| packages/uipath-eval/tests/eval/test_span_persistence.py | Span persistence behavior tests. |
| packages/uipath-eval/tests/eval/mocks/test_mockable_arg_collision.py | Regression test for @mockable arg-name collisions. |
| packages/uipath-eval/tests/eval/mocks/test_input_mocker.py | Tests for LLM input mock generation. |
| packages/uipath-eval/tests/eval/mocks/test_input_mocker_span.py | Tests for tracing spans during input mocking. |
| packages/uipath-eval/tests/eval/mocks/test_cache_manager.py | Tests for cache manager read/write/invalidations. |
| packages/uipath-eval/tests/eval/mocks/test_mocks.py | Broader mock/simulation behavior tests. |
| packages/uipath-eval/tests/eval/mocks/test_mockable_mocked_annotation.py | Tests for @mockable annotation handling. |
| packages/uipath-eval/tests/eval/mocks/test_structured_output.py | Tests for provider-agnostic structured output handling. |
| packages/uipath-eval/tests/eval/evals/evaluators/exact-match.json | Test evaluator spec fixture. |
| packages/uipath-eval/tests/eval/evals/eval-sets/default.json | Test eval-set fixture. |
| packages/uipath-eval/tests/eval/evals/eval-sets/multiple-evals.json | Test multi-eval-set fixture. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a5f44181ae
    UIPATH_FOLDER_KEY: ${{ secrets.UIPATH_MEMORY_FOLDER }}
    run: uv run pytest tests/services/test_memory_service_e2e.py -m e2e -v --no-cov

    test-uipath-eval:
Include eval tests in the required test gate
This new test-uipath-eval matrix job is not included in the test-gate job's needs list or failure check at the bottom of this workflow, so a PR can still get a passing required Test status even when all uipath-eval tests fail. Since this commit moves the eval framework and its tests into this package, the gate should depend on and check test-uipath-eval as well.
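One way to catch this kind of omission is a small check over the parsed workflow. The sketch below assumes the workflow has already been loaded into a dict (for example with a YAML parser), that test jobs share a `test-` prefix, and that the gate job is named `test-gate`. The function name and the `test-uipath` sample job are hypothetical and not copied from the repository.

```python
def jobs_missing_from_gate(
    workflow: dict, gate: str = "test-gate", prefix: str = "test-"
) -> list[str]:
    """List test jobs that the gate job does not declare in `needs`."""
    jobs = workflow["jobs"]
    needs = jobs[gate].get("needs", [])
    # GitHub Actions accepts either a single job id or a list for `needs`.
    if isinstance(needs, str):
        needs = [needs]
    return sorted(
        name
        for name in jobs
        if name.startswith(prefix) and name != gate and name not in needs
    )


# Illustrative workflow shape: the gate depends on one test job but not the new one.
sample_workflow = {
    "jobs": {
        "test-uipath": {},
        "test-uipath-eval": {},
        "test-gate": {"needs": ["test-uipath"]},
    }
}
```

Note that listing the job in `needs` is only half of the fix described above: the gate's failure check also has to inspect the new job's result.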
    build-uipath-eval:
      needs: [detect-publishable-packages, wait-for-uipath-platform]
Wait for core before building eval releases
When a release publishes a new uipath-core and uipath-eval version without also publishing uipath-platform or uipath, wait-for-uipath-core is skipped because its condition only mentions platform/uipath, and this new eval build only waits on wait-for-uipath-platform (which just skips if platform is not being published). In that scenario needs-relock: true runs uv lock --no-sources for eval before the new core version is visible on PyPI, causing intermittent release failures or locking against the previous core if the lower bound was not updated.
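The scenario can be modelled as a small graph problem: a package being published should wait for every package in its transitive dependency set that is also being published in the same release. The sketch below uses the publish chain named in this PR and assumes `uipath-platform` depends on `uipath-core`, as the tier order implies. The mapping and function name are illustrative, not the contents of `cd.yml`.

```python
# Direct dependencies among the published packages (assumed from the tier order
# core -> platform -> eval -> uipath described in this PR).
DEPENDS_ON: dict[str, set[str]] = {
    "uipath-core": set(),
    "uipath-platform": {"uipath-core"},
    "uipath-eval": {"uipath-core", "uipath-platform"},
    "uipath": {"uipath-eval"},
}


def required_waits(package: str, publishing: set[str]) -> set[str]:
    """Return the packages in this release that `package` must wait for."""
    seen: set[str] = set()
    stack = list(DEPENDS_ON[package])
    # Walk the transitive dependencies of the package.
    while stack:
        dependency = stack.pop()
        if dependency not in seen:
            seen.add(dependency)
            stack.extend(DEPENDS_ON[dependency])
    # Only dependencies published in the same release need a wait step.
    return seen & publishing
```

For a release that publishes only `uipath-core` and `uipath-eval`, the result for `uipath-eval` is `{"uipath-core"}`, which is the wait the comment above says is currently skipped.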
Summary
Extract the evaluation framework (`uipath.eval`) into a new standalone distribution `uipath-eval` (`packages/uipath-eval`), so that consumers, primarily the python eval worker in the agents backend, can depend on the evaluators, mocking system, and eval runtime without pulling in the CLI and the rest of the SDK. Today that worker pins the entire `uipath` SDK just to run evaluators.

Supersedes the goals of #1040 (closed as stale); the strategy-pattern reporting refactor follows as a separate PR on top of this extraction.

What moved
- `packages/uipath/src/uipath/eval/` → `packages/uipath-eval/src/uipath/eval/` (namespace package, same pattern as `uipath-platform` / `uipath-core`: no `__init__.py` at `src/uipath/`, `py.typed` marker included). Import paths are unchanged: `from uipath.eval...` works exactly as before for every existing consumer.
- CLI-coupled eval tests (discovery, telemetry, progress reporter, live tracking) stay in `packages/uipath`, as do the CLI progress reporters in `_cli/_evals/`.

Changes required by the split
- The three legacy evaluators imported `COMMUNITY_agents_SUFFIX` from `uipath._utils.constants` via a relative import, the only entanglement with non-extracted SDK internals. They now use the constant that already existed in `uipath.eval._helpers.evaluators_helpers`.
- `mockito` and `coverage` move from `uipath`'s dependencies to `uipath-eval`'s (only eval code uses them). `pydantic-function-models` stays in both (`cli_server.py` preloads it).

Versions & dependency chain
`uipath` 2.10.82 → `uipath-eval` 0.1.0 → `uipath-platform` / `uipath-runtime` / `uipath-core`

CI / release wiring
- `detect_changed_packages.py`: `uipath-eval` added to the dependents graph (core/platform changes test eval; eval changes test uipath)
- `test-packages.yml` / `lint-packages.yml`: dedicated `uipath-eval` jobs (same matrix as platform)
- `cd.yml`: new publish tier between platform and uipath (core → platform → eval → uipath), with `wait-for-uipath-eval` gating the uipath build
- `labeler.yml`: `uipath-eval` source globs added to the langchain/integration test triggers

Validation
- `uipath-eval`: 731 tests pass; ruff, ruff format, mypy clean; wheel + sdist build
- `uipath`: 1200 tests pass; ruff, custom httpx linter, mypy (src+tests) clean; wheel builds; `uipath --help` and `uipath eval --help` smoke-tested against the new layout
- `import uipath; import uipath.eval.evaluators; from uipath.eval.runtime import evaluate` resolves across the two distributions

🤖 Generated with Claude Code
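The dependents-graph rule in the CI wiring (core/platform changes test eval; eval changes test uipath) amounts to a transitive closure over reverse dependencies. The following is a sketch under assumptions: the actual structure of `detect_changed_packages.py` is not shown in this description, so the `DEPENDENTS` mapping and the `packages_to_test` function are hypothetical, and the core → platform edge is inferred from the publish tier order.

```python
from collections import deque

# Reverse dependency edges: package -> packages that depend on it directly.
DEPENDENTS: dict[str, list[str]] = {
    "uipath-core": ["uipath-platform", "uipath-eval"],
    "uipath-platform": ["uipath-eval"],
    "uipath-eval": ["uipath"],
    "uipath": [],
}


def packages_to_test(changed: set[str]) -> list[str]:
    """Return the changed packages plus everything downstream of them."""
    selected = set(changed)
    queue = deque(changed)
    # Breadth-first walk over dependents until no new package is reached.
    while queue:
        package = queue.popleft()
        for dependent in DEPENDENTS[package]:
            if dependent not in selected:
                selected.add(dependent)
                queue.append(dependent)
    return sorted(selected)
```

With this mapping, a change to `uipath-eval` selects `uipath` and `uipath-eval`, and a change to `uipath-platform` additionally selects `uipath-platform` itself.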