test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5) by radugheo · Pull Request #474 · UiPath/skills

radugheo · 2026-04-29T17:58:39Z

Summary

Adds four coded-agent tests covering the eval, deploy, and file-sync lifecycles — all uip codedagent subcommands not exercised by the framework tests in #473.

| Task | Tier | What it covers |
| --- | --- | --- |
| `skill-agent-coded-eval-exact-match` | e2e | Deterministic `adder` agent + `ExactMatchEvaluator`. Verifies the evaluator config under `evaluations/evaluators/`, the eval-set schema (`evaluatorRefs` matching the evaluator `id`), and the shape of `uip codedagent eval --no-report --output-file` output: every test case must report `status: "PASSED"` with score 1.0. |
| `skill-agent-coded-eval-llm-judges` | e2e | LangGraph classifier with two LLM judges in one eval set: `LLMJudgeOutputEvaluator` and `LLMJudgeTrajectoryEvaluator`. Asserts per-judge criteria shape (`expectedOutput` vs `expectedAgentBehavior`) and that both evaluator ids surface in the results file. |
| `skill-agent-coded-deploy-my-workspace` | e2e | Minimal echo agent through `uip codedagent pack` → `publish --my-workspace` (or combined `deploy --my-workspace`) → `invoke`. Asserts `pyproject.toml` carries `name`/`version`/`description`/`authors`, the `.nupkg` artifact lands in `.uipath/`, and `invoke` surfaces a `https://...` monitoring URL. |
| `skill-agent-coded-push-pull-roundtrip` | integration | Push, mutate `main.py` with a marker token, push `--overwrite` again, pull into a fresh sibling directory, assert the marker round-trips through Studio Web and the two trees converge. Requires `UIPATH_PROJECT_ID` provisioned in the test environment. |

What each check script asserts

- Eval tests: the evaluator JSON has the right `evaluatorTypeId` for its `id`; the eval-set `evaluatorRefs` and the per-test-case `evaluationCriterias` are keyed by the same id; the results file parses into the expected shape and (for ExactMatch) every case is `PASSED` with score 1.0. Negative case verified: flipping one status to `FAILED` makes the check fail.
- Deploy: `pyproject.toml` hygiene, presence of a `.nupkg` in `.uipath/`, non-empty `invoke-output.txt` (the YAML's `file_contains` separately checks for `https://`).
- Push-pull: both project trees exist, both have a `.env` with `UIPATH_PROJECT_ID`, the pulled `main.py` carries the post-mutation marker, and the local and pulled `main.py` are byte-equal after trim.
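
For readers who have not opened the check scripts, the id cross-referencing described for the eval tests could look roughly like the sketch below. It is not the code in this PR. The field names `id`, `evaluatorRefs` and `evaluationCriterias` come from the description above; the `evaluations` key for the test-case list, the sample ids and the function name are assumptions made for illustration.

```python
def check_eval_set_refs(evaluator: dict, eval_set: dict) -> list[str]:
    """Return a list of problems; an empty list means the ids line up."""
    problems = []
    evaluator_id = evaluator.get("id")

    # The eval set must reference the evaluator by its id.
    if evaluator_id not in eval_set.get("evaluatorRefs", []):
        problems.append(f"evaluatorRefs does not list {evaluator_id!r}")

    # Each test case must key its criteria by the same id.
    # ASSUMPTION: test cases live under an "evaluations" key.
    for index, case in enumerate(eval_set.get("evaluations", [])):
        if evaluator_id not in case.get("evaluationCriterias", {}):
            problems.append(f"case {index}: no criteria keyed by {evaluator_id!r}")
    return problems


# Hypothetical, hand-written inputs (not taken from the PR).
evaluator = {"id": "exact-match", "evaluatorTypeId": "hypothetical-type-id"}
eval_set = {
    "evaluatorRefs": ["exact-match"],
    "evaluations": [
        {"evaluationCriterias": {"exact-match": {"expectedOutput": {"result": 3}}}},
    ],
}
print(check_eval_set_refs(evaluator, eval_set))
```

Returning a list of problems rather than raising on the first mismatch lets a check script report every inconsistency in one run.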

Stacked on #473 (no _shared/ imports needed in this PR — the four checks are standalone). GitHub will retarget to main after #473 merges.

Test plan

- All four `check_*.py` scripts dry-run green against synthetic well-formed projects.
- Negative cases verified: a flipped `FAILED` status makes eval-exact-match fail; a stale pulled `main.py` makes push-pull fail.
- All eight task YAMLs in `coded/` parse, and their tag lists are consistent.

github-actions · 2026-04-29T17:59:03Z

Claude finished @radugheo's task in 1m 54s

PR Review: test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5)

Gather context (review criteria, project conventions, existing comments)
Read full diff and all changed files
Review YAML task files for correctness
Review Python check scripts for correctness
Check for security issues, secrets, cross-skill references
Post final review

Summary

Adds four coder_eval test tasks (8 files total) covering uip codedagent eval, deploy, and push/pull subcommands. Each task has a YAML definition and a Python check script. The tests are well-structured, follow the existing patterns established in the base branch (PR #473), and have good assertion coverage.

Change-by-Change Review

1. `tests/tasks/uipath-agents/coded/eval_exact_match/eval_exact_match.yaml`

Severity: OK

Well-structured e2e task. Tags follow the [uipath-agents, e2e, coded, ...] pattern matching existing tests. task_id follows the skill-agent-coded-* convention. Success criteria cover the full lifecycle: scaffold (uip codedagent new), eval run (--no-report), file existence for evaluator/eval-set/results, and the check script. Weights are reasonable (check script gets 5.0 — the heaviest — which makes sense since it validates correctness).

2. `tests/tasks/uipath-agents/coded/eval_exact_match/check_eval_exact_match.py`

Severity: Low

Thorough three-phase check: evaluator config shape, eval-set schema + cross-referencing evaluator ids, and results file validation. The results parser is resilient — tries multiple top-level keys (results, evaluations, testResults, tests) to handle potential shape drift.
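
A tolerant lookup of this kind might be sketched as follows. The four candidate keys are the ones named above; the function name, the handling of a bare top-level list, and the empty-list fallback are assumptions and may differ from what the check script does.

```python
import json

# Keys named in the review, tried in order.
CANDIDATE_KEYS = ("results", "evaluations", "testResults", "tests")


def extract_cases(payload) -> list:
    """Return the list of per-case results, or [] if none is found."""
    # ASSUMPTION: a bare top-level list is accepted as the case list.
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in CANDIDATE_KEYS:
            value = payload.get(key)
            if isinstance(value, list):
                return value
    return []


# Hypothetical results payload using one of the alternative keys.
raw = '{"testResults": [{"status": "PASSED", "score": 1.0}]}'
print(extract_cases(json.loads(raw)))
```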

One minor issue: expected_case_count parameter in check_results() (line 96) is accepted but never used inside the function body. This is dead code — the function validates all cases in the results file regardless.
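
If the parameter is kept rather than removed, one way to put it to use is sketched below. Only the names `check_results` and `expected_case_count` and the `status: "PASSED"` / score 1.0 expectations come from this PR; the `score` field name, the signature and the return type are assumptions.

```python
def check_results(cases: list[dict], expected_case_count: int) -> list[str]:
    """Return a list of problems; an empty list means the run is clean."""
    problems = []

    # Use the parameter: the results file should cover every eval-set case.
    if len(cases) != expected_case_count:
        problems.append(
            f"expected {expected_case_count} cases, found {len(cases)}"
        )

    for index, case in enumerate(cases):
        if case.get("status") != "PASSED":
            problems.append(f"case {index}: status is {case.get('status')!r}")
        # ASSUMPTION: the per-case score is stored under "score".
        if case.get("score") != 1.0:
            problems.append(f"case {index}: score is {case.get('score')!r}")
    return problems


# Hypothetical data: three passing cases, then the negative case
# from the test plan (one status flipped to FAILED).
passing = [{"status": "PASSED", "score": 1.0} for _ in range(3)]
flipped = passing[:2] + [{"status": "FAILED", "score": 0.0}]
print(check_results(passing, 3))
print(check_results(flipped, 3))
```

With the count check in place, a results file that silently drops a test case no longer passes just because the remaining cases are green.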

3. `tests/tasks/uipath-agents/coded/eval_llm_judges/eval_llm_judges.yaml`

Severity: OK

Tags and structure consistent with eval_exact_match. Good call using --mocker-cache for LLM-judge reproducibility. The initial_prompt is detailed enough to guide the agent through the dual-evaluator setup without over-specifying implementation.

4. `tests/tasks/uipath-agents/coded/eval_llm_judges/check_eval_llm_judges.py`

Severity: OK

Clean three-phase validation. Smart decision to not assert exact scores for LLM judges (continuous 0.0–1.0) — only checks structural correctness and that both evaluator ids surface in results. The EXPECTED_EVALUATORS dict at the top provides a single source of truth for id→typeId mappings.
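
A table-driven check of this shape could look like the sketch below. The dict name `EXPECTED_EVALUATORS` is from the check script; everything else is illustrative. The evaluator ids are made up, and using the two identifiers quoted in the PR's commit message as type ids is a guess about which field they belong to.

```python
# id -> evaluatorTypeId. The ids on the left are HYPOTHETICAL; the values are
# the identifiers quoted in the commit message, assumed here to be type ids.
EXPECTED_EVALUATORS = {
    "judge-output": "uipath-llm-judge-output-semantic-similarity",
    "judge-trajectory": "uipath-llm-judge-trajectory-similarity",
}


def check_evaluator_configs(configs: list[dict]) -> list[str]:
    """Compare evaluator config dicts against the expected id -> type table."""
    problems = []
    seen = {cfg.get("id"): cfg.get("evaluatorTypeId") for cfg in configs}
    for evaluator_id, type_id in EXPECTED_EVALUATORS.items():
        if evaluator_id not in seen:
            problems.append(f"missing evaluator {evaluator_id!r}")
        elif seen[evaluator_id] != type_id:
            problems.append(
                f"{evaluator_id!r}: expected type {type_id!r}, got {seen[evaluator_id]!r}"
            )
    return problems


def ids_missing_from_results(results_text: str) -> list[str]:
    """Structural check only: both ids must appear; scores are not asserted."""
    return [eid for eid in EXPECTED_EVALUATORS if eid not in results_text]
```

Keeping the mapping in one dict means that adding a third judge only touches the table, not the validation logic.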

5. `tests/tasks/uipath-agents/coded/deploy_my_workspace/deploy_my_workspace.yaml`

Severity: Low

Good lifecycle coverage: pack/publish command, invoke command, file existence, file_contains for the monitoring URL, and the check script. The initial_prompt includes a practical retry hint for Version already exists errors.

Minor: The initial_prompt says to write invoke stdout to invoke-output.txt "in the project root" — this is slightly ambiguous (could mean the sandbox root or the deploy-smoke/ project root). The check script expects it at deploy-smoke/invoke-output.txt (line 66), and the YAML's file_exists criterion (line 68) also expects deploy-smoke/invoke-output.txt, so the trio is consistent. The prompt could be marginally clearer, but the file_exists criterion anchors it.

6. `tests/tasks/uipath-agents/coded/deploy_my_workspace/check_deploy_my_workspace.py`

Severity: Low

The pyproject.toml field check (lines 40–45) uses simple substring matching (if needle not in text). This means name would match a comment like # rename this or a dependency name. For a test assertion this is pragmatically fine — the false-positive risk is negligible since pyproject.toml is generated by the scaffold. Not worth complicating with TOML parsing for this use case.

7. `tests/tasks/uipath-agents/coded/push_pull_roundtrip/push_pull_roundtrip.yaml`

Severity: OK

Correctly tagged as integration (not e2e) — matches the cloud-dependent nature of push/pull. The file_contains criteria for both local and pulled main.py provide fast-fail before the heavier check script runs. min_count: 2 for push commands correctly enforces the "push twice" roundtrip.

8. `tests/tasks/uipath-agents/coded/push_pull_roundtrip/check_push_pull_roundtrip.py`

Severity: OK

Clean validation: .env wiring check for both directories, marker propagation, and byte-equality (with .strip() tolerance for trailing whitespace). The regex for UIPATH_PROJECT_ID (line 31) correctly uses re.M to handle multi-line .env files.
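
The two mechanics mentioned here, a multi-line-aware regex and a trim-tolerant comparison, can be sketched as below. The exact pattern on line 31 of the check script is not quoted in this review, so the regex, the function names and the sample `.env` content are assumptions.

```python
import re

# With re.M, "^" matches at the start of every line, not only at the start
# of the file, so the variable is found wherever it sits in the .env.
# ASSUMPTION: the real pattern may differ from this one.
PROJECT_ID_RE = re.compile(r"^UIPATH_PROJECT_ID=\S+", re.M)


def has_project_id(env_text: str) -> bool:
    """True if some line starts with UIPATH_PROJECT_ID= and has a value."""
    return PROJECT_ID_RE.search(env_text) is not None


def same_after_trim(local_text: str, pulled_text: str) -> bool:
    """Equality that ignores leading and trailing whitespace only."""
    return local_text.strip() == pulled_text.strip()


# Hypothetical .env where the variable is not on the first line.
env_text = "UIPATH_URL=https://example.invalid\nUIPATH_PROJECT_ID=abc-123\n"
print(has_project_id(env_text))
print(same_after_trim("print('MARKER')\n", "print('MARKER')"))
```

Without `re.M`, the same pattern would only match when `UIPATH_PROJECT_ID` happens to be the first line of the file.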

What's Missing

- Unused parameter: `expected_case_count` in `check_eval_exact_match.py:96` is passed but never used. Should either be removed or used to validate that the result count matches the eval-set case count. (Low)
- No CODEOWNERS entry for `tests/tasks/uipath-agents/`: The skill path `/skills/uipath-agents/` is covered, but there is no explicit entry for the test directory. This matches the existing state (no test paths have dedicated CODEOWNERS entries in this repo), so it is not a regression, only an observation. (Informational)

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill changes (test tasks only) |
| E2E Tests | OK | 3 e2e + 1 integration task, all well-structured |
| Skill Body | N/A | No skill changes |
| References & Assets | N/A | No skill changes |
| Repo Hygiene | OK | No secrets, no cross-skill refs, scoped changes |

Issues for Manual Review

- Live execution: All four tasks require UiPath staging tenant connectivity (`UIPATH_URL` / `UIPATH_ACCESS_TOKEN`). Push-pull additionally needs `UIPATH_PROJECT_ID`. These can only be verified in CI with the provisioned test environment.
- CLI flag correctness: The `--mocker-cache` flag in `eval_llm_judges.yaml` and `--my-workspace` in `deploy_my_workspace.yaml` should be verified against the current `uip codedagent` CLI, since flag names may have changed since the skill was written.

Conclusion

Clean PR. The four test tasks are well-structured, internally consistent, and follow the patterns established by the framework tests in PR #473. The check scripts have thorough validation logic with good resilience to output shape variations. One low-severity unused parameter (expected_case_count) is the only code issue found. Approve.

- skill-agent-coded-eval-exact-match (e2e): deterministic adder agent + ExactMatchEvaluator + 3 test cases. Local-only with --no-report. Verifies evaluator config + eval-set schema + results-file shape.
- skill-agent-coded-eval-llm-judges (e2e): LangGraph classifier with two LLM judges in one eval set: LLMJudgeOutput (uipath-llm-judge-output-semantic-similarity) + LLMJudgeTrajectory (uipath-llm-judge-trajectory-similarity). Verifies per-judge criteria shape (expectedOutput vs expectedAgentBehavior).
- skill-agent-coded-deploy-my-workspace (e2e): minimal echo agent through pack -> publish --my-workspace -> invoke. Asserts pyproject hygiene, .nupkg artifact in .uipath/, and that invoke surfaced a monitoring URL.

radugheo mentioned this pull request Apr 29, 2026

test(uipath-agents): coded HITL / process / RAG / tracing tests (PR 4/5) #475

Merged

3 tasks

radugheo force-pushed the test/coded-tests-frameworks branch from 3673d4e to 56a1cbf on April 30, 2026 09:53

radugheo force-pushed the test/coded-tests-cli-commands branch from ac9ed4a to bfc8136 on April 30, 2026 09:53

radugheo force-pushed the test/coded-tests-frameworks branch from 56a1cbf to cb5be47 on April 30, 2026 09:56

radugheo force-pushed the test/coded-tests-cli-commands branch from bfc8136 to 737e97b on April 30, 2026 09:57

radugheo force-pushed the test/coded-tests-frameworks branch from cb5be47 to 9a4075e on April 30, 2026 11:40

radugheo force-pushed the test/coded-tests-cli-commands branch from 737e97b to fc46782 on April 30, 2026 11:42

radugheo force-pushed the test/coded-tests-frameworks branch from 9a4075e to c745b92 on April 30, 2026 12:04

radugheo force-pushed the test/coded-tests-cli-commands branch 2 times, most recently from 10872ad to f5c690d on May 4, 2026 09:37

radugheo force-pushed the test/coded-tests-frameworks branch from c745b92 to 97b1685 on May 4, 2026 10:07

radugheo force-pushed the test/coded-tests-cli-commands branch from f5c690d to 891fc53 on May 4, 2026 10:09

cosmyo approved these changes on May 4, 2026

radugheo force-pushed the test/coded-tests-frameworks branch from 97b1685 to 46bde1e on May 4, 2026 13:27

radugheo force-pushed the test/coded-tests-cli-commands branch 3 times, most recently from 1d918e1 to 57c0f60 on May 4, 2026 13:32

radugheo force-pushed the test/coded-tests-frameworks branch 2 times, most recently from 1d7ac35 to 9056e49 on May 4, 2026 15:41

radugheo force-pushed the test/coded-tests-cli-commands branch from 57c0f60 to 945bcb6 on May 4, 2026 15:42

radugheo force-pushed the test/coded-tests-frameworks branch from 9056e49 to 151b0f3 on May 4, 2026 15:57

radugheo force-pushed the test/coded-tests-cli-commands branch from 945bcb6 to bc8ac53 on May 4, 2026 15:59

radugheo force-pushed the test/coded-tests-frameworks branch from 151b0f3 to 3a0b67a on May 4, 2026 16:34

radugheo force-pushed the test/coded-tests-cli-commands branch from bc8ac53 to bd7c69b on May 4, 2026 16:35

radugheo force-pushed the test/coded-tests-frameworks branch from 3a0b67a to 409ea9b on May 4, 2026 17:24

radugheo force-pushed the test/coded-tests-cli-commands branch 2 times, most recently from 2ea8a8d to 9aaa640 on May 4, 2026 18:53

radugheo force-pushed the test/coded-tests-frameworks branch from 409ea9b to 0e7ec84 on May 5, 2026 08:21

Base automatically changed from test/coded-tests-frameworks to main on May 5, 2026 08:22

radugheo force-pushed the test/coded-tests-cli-commands branch from 9aaa640 to 95d6283 on May 5, 2026 08:26

radugheo merged commit d50dac8 into main on May 5, 2026
3 of 4 checks passed

radugheo deleted the test/coded-tests-cli-commands branch on May 5, 2026 08:26

radugheo merged 1 commit into main from test/coded-tests-cli-commands