test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5)#474

Merged: radugheo merged 1 commit into main from test/coded-tests-cli-commands on May 5, 2026

@radugheo (Contributor) commented Apr 29, 2026

Summary

Adds four coded-agent tests covering the eval, deploy, and file-sync lifecycles — all uip codedagent subcommands not exercised by the framework tests in #473.

| Task | Tier | What it covers |
| --- | --- | --- |
| skill-agent-coded-eval-exact-match | e2e | Deterministic adder agent + ExactMatchEvaluator. Verifies the evaluator config under evaluations/evaluators/, the eval-set schema (evaluatorRefs matching the evaluator id), and the shape of the file written by uip codedagent eval --no-report --output-file: every test case must report status: "PASSED" with score 1.0. |
| skill-agent-coded-eval-llm-judges | e2e | LangGraph classifier with two LLM judges in one eval set: LLMJudgeOutputEvaluator and LLMJudgeTrajectoryEvaluator. Asserts per-judge criteria shape (expectedOutput vs expectedAgentBehavior) and that both evaluator ids surface in the results file. |
| skill-agent-coded-deploy-my-workspace | e2e | Minimal echo agent through uip codedagent pack → publish --my-workspace (or combined deploy --my-workspace) → invoke. Asserts pyproject.toml carries name/version/description/authors, the .nupkg artifact lands in .uipath/, and invoke surfaces a https://... monitoring URL. |
| skill-agent-coded-push-pull-roundtrip | integration | Push, mutate main.py with a marker token, push --overwrite again, pull into a fresh sibling directory, assert the marker round-trips through Studio Web and the two trees converge. Requires UIPATH_PROJECT_ID provisioned in the test environment. |

What each check script asserts

  • Eval tests: evaluator JSON has the right evaluatorTypeId for its id; eval-set evaluatorRefs and per-test-case evaluationCriterias key the same id; results file shape parses and (for ExactMatch) every case is PASSED with score 1.0. Negative case verified: flipping one status to FAILED makes the check fail.
  • Deploy: pyproject.toml hygiene, presence of a .nupkg in .uipath/, non-empty invoke-output.txt (the YAML's file_contains separately checks for https://).
  • Push-pull: both project trees exist, both have .env with UIPATH_PROJECT_ID, the pulled main.py carries the post-mutation marker, and local + pulled main.py are byte-equal after trim.
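
The eval wiring described in the first bullet can be illustrated with a minimal sketch. It is not the PR's check script: the function name `check_eval_wiring` and the `evaluations` list holding the test cases are assumptions made for illustration; only `id`, `evaluatorTypeId`, `evaluatorRefs`, and `evaluationCriterias` are taken from the description above.

```python
from typing import Any


def check_eval_wiring(
    evaluator: dict[str, Any],
    eval_set: dict[str, Any],
    expected_type_id: str,
) -> list[str]:
    """Return a list of problems; an empty list means the wiring is consistent."""
    problems: list[str] = []
    evaluator_id = evaluator.get("id")

    # The evaluator config must carry the type id expected for its id.
    if evaluator.get("evaluatorTypeId") != expected_type_id:
        problems.append(f"evaluator {evaluator_id!r} has the wrong evaluatorTypeId")

    # The eval set must reference the evaluator by id.
    if evaluator_id not in eval_set.get("evaluatorRefs", []):
        problems.append(f"eval set does not reference evaluator {evaluator_id!r}")

    # Assumed layout: test cases live under "evaluations", each with an
    # "evaluationCriterias" mapping keyed by evaluator id.
    for index, case in enumerate(eval_set.get("evaluations", [])):
        if evaluator_id not in case.get("evaluationCriterias", {}):
            problems.append(f"case {index} has no criteria for {evaluator_id!r}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a check script report every inconsistency in a single run.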

Stacked on #473 (no _shared/ imports needed in this PR — the four checks are standalone). GitHub will retarget to main after #473 merges.

Test plan

  • All four check_*.py dry-run green against synthetic well-formed projects.
  • Negative cases verified: a flipped FAILED status makes eval-exact-match fail; a stale pulled main.py makes push-pull fail.
  • All eight task YAMLs in coded/ parse and tag lists are consistent.

@github-actions (bot, Contributor) commented Apr 29, 2026

Claude finished @radugheo's task in 1m 54s

PR Review: test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5)

  • Gather context (review criteria, project conventions, existing comments)
  • Read full diff and all changed files
  • Review YAML task files for correctness
  • Review Python check scripts for correctness
  • Check for security issues, secrets, cross-skill references
  • Post final review

Summary

Adds four coder_eval test tasks (8 files total) covering uip codedagent eval, deploy, and push/pull subcommands. Each task has a YAML definition and a Python check script. The tests are well-structured, follow the existing patterns established in the base branch (PR #473), and have good assertion coverage.

Change-by-Change Review

1. tests/tasks/uipath-agents/coded/eval_exact_match/eval_exact_match.yaml

Severity: OK

Well-structured e2e task. Tags follow the [uipath-agents, e2e, coded, ...] pattern matching existing tests. task_id follows the skill-agent-coded-* convention. Success criteria cover the full lifecycle: scaffold (uip codedagent new), eval run (--no-report), file existence for evaluator/eval-set/results, and the check script. Weights are reasonable (check script gets 5.0 — the heaviest — which makes sense since it validates correctness).

2. tests/tasks/uipath-agents/coded/eval_exact_match/check_eval_exact_match.py

Severity: Low

Thorough three-phase check: evaluator config shape, eval-set schema + cross-referencing evaluator ids, and results file validation. The results parser is resilient — tries multiple top-level keys (results, evaluations, testResults, tests) to handle potential shape drift.
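
A rough sketch of what such a shape-tolerant lookup might look like follows. The four key names come from the review text; the function name `extract_cases`, the key order, and the acceptance of a bare top-level list are assumptions, not the script's actual behavior.

```python
from typing import Any

# Candidate keys named in the review; their order here is an assumption.
CANDIDATE_KEYS = ("results", "evaluations", "testResults", "tests")


def extract_cases(payload: Any) -> list[dict[str, Any]]:
    """Return the list of per-case results, wherever the file nests it."""
    # Assumed: a bare top-level list is accepted as-is.
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        for key in CANDIDATE_KEYS:
            value = payload.get(key)
            if isinstance(value, list):
                return value
    raise ValueError("no recognizable list of test-case results found")
```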

One minor issue: expected_case_count parameter in check_results() (line 96) is accepted but never used inside the function body. This is dead code — the function validates all cases in the results file regardless.
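
One way the parameter could be put to use, sketched under assumptions: the signature below is hypothetical and does not reproduce the real `check_results()`; the `status` and `score` fields follow the PR description.

```python
def check_results(cases: list[dict], expected_case_count: int) -> list[str]:
    """Hypothetical variant that actually uses expected_case_count."""
    problems: list[str] = []

    # Fail if the results file covers fewer or more cases than the eval set.
    if len(cases) != expected_case_count:
        problems.append(
            f"expected {expected_case_count} results, found {len(cases)}"
        )

    # ExactMatch is deterministic, so every case must pass with a full score.
    for index, case in enumerate(cases):
        if case.get("status") != "PASSED":
            problems.append(f"case {index} has status {case.get('status')!r}")
        if case.get("score") != 1.0:
            problems.append(f"case {index} has score {case.get('score')!r}")

    return problems
```

With this variant, a results file that silently drops a test case fails the check even when every remaining case is PASSED.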

3. tests/tasks/uipath-agents/coded/eval_llm_judges/eval_llm_judges.yaml

Severity: OK

Tags and structure consistent with eval_exact_match. Good call using --mocker-cache for LLM-judge reproducibility. The initial_prompt is detailed enough to guide the agent through the dual-evaluator setup without over-specifying implementation.

4. tests/tasks/uipath-agents/coded/eval_llm_judges/check_eval_llm_judges.py

Severity: OK

Clean three-phase validation. Smart decision to not assert exact scores for LLM judges (continuous 0.0–1.0) — only checks structural correctness and that both evaluator ids surface in results. The EXPECTED_EVALUATORS dict at the top provides a single source of truth for id→typeId mappings.
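
The structural approach can be sketched as follows. The evaluator ids used as dictionary keys are hypothetical placeholders; the type ids are the ones quoted in the commit message. The range check on scores is an assumption about what a structural check could add, not something the script is known to do.

```python
import json

# Hypothetical evaluator ids mapped to the type ids from the commit message.
EXPECTED_EVALUATORS = {
    "llm-judge-output": "uipath-llm-judge-output-semantic-similarity",
    "llm-judge-trajectory": "uipath-llm-judge-trajectory-similarity",
}


def missing_evaluator_ids(results: object) -> list[str]:
    """Return expected evaluator ids that never appear in the results payload."""
    # Searching the serialized payload avoids assuming a particular nesting.
    serialized = json.dumps(results)
    return [eid for eid in EXPECTED_EVALUATORS if f'"{eid}"' not in serialized]


def score_is_plausible(score: object) -> bool:
    """Accept any number in [0.0, 1.0]; never compare against an exact value."""
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return False
    return 0.0 <= score <= 1.0
```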

5. tests/tasks/uipath-agents/coded/deploy_my_workspace/deploy_my_workspace.yaml

Severity: Low

Good lifecycle coverage: pack/publish command, invoke command, file existence, file_contains for the monitoring URL, and the check script. The initial_prompt includes a practical retry hint for Version already exists errors.

Minor: The initial_prompt says to write invoke stdout to invoke-output.txt "in the project root" — this is slightly ambiguous (could mean the sandbox root or the deploy-smoke/ project root). The check script expects it at deploy-smoke/invoke-output.txt (line 66), and the YAML's file_exists criterion (line 68) also expects deploy-smoke/invoke-output.txt, so the trio is consistent. The prompt could be marginally clearer, but the file_exists criterion anchors it.
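
For orientation, the artifact checks anchored to the project directory might look roughly like the sketch below. The function name `check_deploy_artifacts` is hypothetical; the `.uipath/` directory, the `.nupkg` extension, and `invoke-output.txt` come from the PR description.

```python
from pathlib import Path


def check_deploy_artifacts(project_dir: Path) -> list[str]:
    """Return problems found in the deploy outputs under project_dir."""
    problems: list[str] = []

    # The packed artifact is expected inside the project's .uipath/ directory.
    uipath_dir = project_dir / ".uipath"
    if not (uipath_dir.is_dir() and any(uipath_dir.glob("*.nupkg"))):
        problems.append("no .nupkg found in .uipath/")

    # The invoke output must exist in the project root and be non-empty.
    output = project_dir / "invoke-output.txt"
    if not output.is_file() or not output.read_text(encoding="utf-8").strip():
        problems.append("invoke-output.txt is missing or empty")

    return problems
```

Passing the project directory explicitly (for example `Path("deploy-smoke")`) removes the ambiguity about which root the output file belongs to.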

6. tests/tasks/uipath-agents/coded/deploy_my_workspace/check_deploy_my_workspace.py

Severity: Low

The pyproject.toml field check (lines 40–45) uses simple substring matching (if needle not in text). This means name would match a comment like # rename this or a dependency name. For a test assertion this is pragmatically fine — the false-positive risk is negligible since pyproject.toml is generated by the scaffold. Not worth complicating with TOML parsing for this use case.

7. tests/tasks/uipath-agents/coded/push_pull_roundtrip/push_pull_roundtrip.yaml

Severity: OK

Correctly tagged as integration (not e2e) — matches the cloud-dependent nature of push/pull. The file_contains criteria for both local and pulled main.py provide fast-fail before the heavier check script runs. min_count: 2 for push commands correctly enforces the "push twice" roundtrip.

8. tests/tasks/uipath-agents/coded/push_pull_roundtrip/check_push_pull_roundtrip.py

Severity: OK

Clean validation: .env wiring check for both directories, marker propagation, and byte-equality (with .strip() tolerance for trailing whitespace). The regex for UIPATH_PROJECT_ID (line 31) correctly uses re.M to handle multi-line .env files.
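
A minimal illustration of both points, assuming a pattern along these lines (the exact regex in the script is not shown in this thread, and the helper names are hypothetical):

```python
import re

# Assumed pattern: the variable at the start of any line, with a non-empty value.
PROJECT_ID_PATTERN = re.compile(r"^UIPATH_PROJECT_ID=\S+", re.M)


def has_project_id(env_text: str) -> bool:
    """True if some line of the .env text sets UIPATH_PROJECT_ID."""
    return PROJECT_ID_PATTERN.search(env_text) is not None


def same_after_trim(local: bytes, pulled: bytes) -> bool:
    """Byte-equality that tolerates leading/trailing whitespace differences."""
    return local.strip() == pulled.strip()
```

Without `re.M`, `^` matches only at the start of the whole string, so the variable would be missed whenever it is not on the first line of `.env`.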

What's Missing

  • Unused parameter: expected_case_count in check_eval_exact_match.py:96 is passed but never used. Should either be removed or used to validate result count matches eval-set case count. (Low)
  • No CODEOWNERS entry for tests/tasks/uipath-agents/: The skill path /skills/uipath-agents/ is covered, but there's no explicit entry for the test directory. This matches the existing state (no test paths have dedicated CODEOWNERS entries in this repo), so not a regression — just noting it. (Informational)

Area Ratings

| Area | Status | Notes |
| --- | --- | --- |
| Frontmatter | N/A | No skill changes; test tasks only |
| E2E Tests | OK | 3 e2e + 1 integration task, all well-structured |
| Skill Body | N/A | No skill changes |
| References & Assets | N/A | No skill changes |
| Repo Hygiene | OK | No secrets, no cross-skill refs, scoped changes |

Issues for Manual Review

  • Live execution: All four tasks require UiPath staging tenant connectivity (UIPATH_URL / UIPATH_ACCESS_TOKEN). Push-pull additionally needs UIPATH_PROJECT_ID. These can only be verified in CI with the provisioned test environment.
  • CLI flag correctness: The --mocker-cache flag in eval_llm_judges.yaml and --my-workspace in deploy_my_workspace.yaml should be verified against the current uip codedagent CLI — flag names may have changed since the skill was written.

Conclusion

Clean PR. The four test tasks are well-structured, internally consistent, and follow the patterns established by the framework tests in PR #473. The check scripts have thorough validation logic with good resilience to output shape variations. One low-severity unused parameter (expected_case_count) is the only code issue found. Approve.

radugheo force-pushed the test/coded-tests-frameworks branch from 3673d4e to 56a1cbf on April 30, 2026 09:53
radugheo force-pushed the test/coded-tests-cli-commands branch from ac9ed4a to bfc8136 on April 30, 2026 09:53
radugheo force-pushed the test/coded-tests-frameworks branch from 56a1cbf to cb5be47 on April 30, 2026 09:56
radugheo force-pushed the test/coded-tests-cli-commands branch from bfc8136 to 737e97b on April 30, 2026 09:57
radugheo force-pushed the test/coded-tests-frameworks branch from cb5be47 to 9a4075e on April 30, 2026 11:40
radugheo force-pushed the test/coded-tests-cli-commands branch from 737e97b to fc46782 on April 30, 2026 11:42
radugheo force-pushed the test/coded-tests-frameworks branch from 9a4075e to c745b92 on April 30, 2026 12:04
radugheo force-pushed the test/coded-tests-cli-commands branch 2 times, most recently from 10872ad to f5c690d on May 4, 2026 09:37
radugheo force-pushed the test/coded-tests-frameworks branch from c745b92 to 97b1685 on May 4, 2026 10:07
radugheo force-pushed the test/coded-tests-cli-commands branch from f5c690d to 891fc53 on May 4, 2026 10:09
radugheo force-pushed the test/coded-tests-frameworks branch from 97b1685 to 46bde1e on May 4, 2026 13:27
radugheo force-pushed the test/coded-tests-cli-commands branch 3 times, most recently from 1d918e1 to 57c0f60 on May 4, 2026 13:32
radugheo force-pushed the test/coded-tests-frameworks branch 2 times, most recently from 1d7ac35 to 9056e49 on May 4, 2026 15:41
radugheo force-pushed the test/coded-tests-cli-commands branch from 57c0f60 to 945bcb6 on May 4, 2026 15:42
radugheo force-pushed the test/coded-tests-frameworks branch from 9056e49 to 151b0f3 on May 4, 2026 15:57
radugheo force-pushed the test/coded-tests-cli-commands branch from 945bcb6 to bc8ac53 on May 4, 2026 15:59
radugheo force-pushed the test/coded-tests-frameworks branch from 151b0f3 to 3a0b67a on May 4, 2026 16:34
radugheo force-pushed the test/coded-tests-cli-commands branch from bc8ac53 to bd7c69b on May 4, 2026 16:35
radugheo force-pushed the test/coded-tests-frameworks branch from 3a0b67a to 409ea9b on May 4, 2026 17:24
radugheo force-pushed the test/coded-tests-cli-commands branch 2 times, most recently from 2ea8a8d to 9aaa640 on May 4, 2026 18:53
radugheo force-pushed the test/coded-tests-frameworks branch from 409ea9b to 0e7ec84 on May 5, 2026 08:21
Base automatically changed from test/coded-tests-frameworks to main May 5, 2026 08:22
- skill-agent-coded-eval-exact-match (e2e) — deterministic adder
  agent + ExactMatchEvaluator + 3 test cases. Local-only with
  --no-report. Verifies evaluator config + eval-set schema +
  results-file shape.
- skill-agent-coded-eval-llm-judges (e2e) — LangGraph classifier
  with two LLM judges in one eval set: LLMJudgeOutput
  (uipath-llm-judge-output-semantic-similarity) +
  LLMJudgeTrajectory (uipath-llm-judge-trajectory-similarity).
  Verifies per-judge criteria shape (expectedOutput vs
  expectedAgentBehavior).
- skill-agent-coded-deploy-my-workspace (e2e) — minimal echo agent
  through pack -> publish --my-workspace -> invoke. Asserts
  pyproject hygiene, .nupkg artifact in .uipath/, and that invoke
  surfaced a monitoring URL.
radugheo force-pushed the test/coded-tests-cli-commands branch from 9aaa640 to 95d6283 on May 5, 2026 08:26
radugheo merged commit d50dac8 into main on May 5, 2026
3 of 4 checks passed
radugheo deleted the test/coded-tests-cli-commands branch on May 5, 2026 08:26