test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5)#474
Claude finished @radugheo's task in 1m 54s.
PR Review: test(uipath-agents): coded eval / deploy / push-pull tests (PR 3/5)
Summary
Adds four coder_eval test tasks (8 files total) covering […]
Change-by-Change Review
1. […]
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill changes — test tasks only |
| E2E Tests | OK | 3 e2e + 1 integration task, all well-structured |
| Skill Body | N/A | No skill changes |
| References & Assets | N/A | No skill changes |
| Repo Hygiene | OK | No secrets, no cross-skill refs, scoped changes |
Issues for Manual Review
- Live execution: All four tasks require UiPath staging tenant connectivity (`UIPATH_URL` / `UIPATH_ACCESS_TOKEN`). Push-pull additionally needs `UIPATH_PROJECT_ID`. These can only be verified in CI with the provisioned test environment.
- CLI flag correctness: The `--mocker-cache` flag in `eval_llm_judges.yaml` and `--my-workspace` in `deploy_my_workspace.yaml` should be verified against the current `uip codedagent` CLI; flag names may have changed since the skill was written.
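Since live execution depends on those environment variables, a preflight guard along the following lines could make a missing variable obvious before any task runs. This is only a sketch: the variable names come from the review above, while the helper name `missing_env_vars` and its `needs_project_id` parameter are hypothetical and not part of the PR.

```python
import os

# Variable names are taken from the review above; the helper itself is hypothetical.
REQUIRED_VARS = ("UIPATH_URL", "UIPATH_ACCESS_TOKEN")
PUSH_PULL_EXTRA_VARS = ("UIPATH_PROJECT_ID",)


def missing_env_vars(env, needs_project_id=False):
    """Return the required variable names that are unset or blank in env."""
    required = REQUIRED_VARS + (PUSH_PULL_EXTRA_VARS if needs_project_id else ())
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    # The push-pull task is the only one that also needs the project id.
    missing = missing_env_vars(os.environ, needs_project_id=True)
    if missing:
        print("Skipping live run; missing: " + ", ".join(missing))
    else:
        print("Environment looks complete for a live run.")
```

Passing the environment in as a mapping keeps the helper testable without touching the real process environment.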
Conclusion
Clean PR. The four test tasks are well-structured, internally consistent, and follow the patterns established by the framework tests in PR #473. The check scripts have thorough validation logic with good resilience to output shape variations. One low-severity unused parameter (expected_case_count) is the only code issue found. Approve.
- skill-agent-coded-eval-exact-match (e2e): deterministic adder agent + ExactMatchEvaluator + 3 test cases. Local-only with `--no-report`. Verifies evaluator config + eval-set schema + results-file shape.
- skill-agent-coded-eval-llm-judges (e2e): LangGraph classifier with two LLM judges in one eval set: LLMJudgeOutput (`uipath-llm-judge-output-semantic-similarity`) + LLMJudgeTrajectory (`uipath-llm-judge-trajectory-similarity`). Verifies per-judge criteria shape (`expectedOutput` vs `expectedAgentBehavior`).
- skill-agent-coded-deploy-my-workspace (e2e): minimal echo agent through pack -> publish `--my-workspace` -> invoke. Asserts pyproject hygiene, `.nupkg` artifact in `.uipath/`, and that invoke surfaced a monitoring URL.
Summary
Adds four coded-agent tests covering the eval, deploy, and file-sync lifecycles: all `uip codedagent` subcommands not exercised by the framework tests in #473.
- `skill-agent-coded-eval-exact-match`: `adder` agent + `ExactMatchEvaluator`. Verifies the evaluator config under `evaluations/evaluators/`, the eval-set schema (`evaluatorRefs` matching the evaluator `id`), and `uip codedagent eval --no-report --output-file` shape. Every test case must report `status: "PASSED"` with score 1.0.
- `skill-agent-coded-eval-llm-judges`: `LLMJudgeOutputEvaluator` and `LLMJudgeTrajectoryEvaluator`. Asserts per-judge criteria shape (`expectedOutput` vs `expectedAgentBehavior`) and that both evaluator ids surface in the results file.
- `skill-agent-coded-deploy-my-workspace`: `uip codedagent pack` → `publish --my-workspace` (or combined `deploy --my-workspace`) → `invoke`. Asserts `pyproject.toml` carries `name` / `version` / `description` / `authors`, the `.nupkg` artifact lands in `.uipath/`, and `invoke` surfaces a `https://...` monitoring URL.
- `skill-agent-coded-push-pull-roundtrip`: […] `main.py` with a marker token, push `--overwrite` again, pull into a fresh sibling directory, assert the marker round-trips through Studio Web and the two trees converge. Requires `UIPATH_PROJECT_ID` provisioned in the test environment.

What each check script asserts
- […] `evaluatorTypeId` for its `id`; eval-set `evaluatorRefs` and per-test-case `evaluationCriterias` key the same id; results file shape parses and (for ExactMatch) every case is PASSED with score 1.0. Negative case verified: flipping one status to FAILED makes the check fail.
- […] `pyproject.toml` hygiene, presence of a `.nupkg` in `.uipath/`, non-empty `invoke-output.txt` (the YAML's `file_contains` separately checks for `https://`).
- […] `.env` with `UIPATH_PROJECT_ID`, the pulled `main.py` carries the post-mutation marker, and local + pulled `main.py` are byte-equal after trim.

Stacked on #473 (no `_shared/` imports needed in this PR; the four checks are standalone). GitHub will retarget to `main` after #473 merges.

Test plan
- `check_*.py` dry-run green against synthetic well-formed projects.
- […] `main.py` makes push-pull fail.
- […] `coded/` parse and tag lists are consistent.
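To illustrate the kind of dry run the test plan describes, here is a minimal sketch of an ExactMatch-style results check exercised against synthetic data, including a negative case with one status flipped to FAILED. The actual results-file schema is not shown in this PR description, so the top-level `cases` key and the per-case `status` / `score` layout are assumptions, and `all_cases_passed` is a hypothetical name.

```python
import copy


def all_cases_passed(results):
    """True when every case reports status "PASSED" with score 1.0."""
    # "cases" is an assumed key; the real results file may be shaped differently.
    cases = results.get("cases", [])
    # An empty case list is treated as a failure so a blank file cannot pass.
    return bool(cases) and all(
        case.get("status") == "PASSED" and case.get("score") == 1.0
        for case in cases
    )


# Synthetic well-formed results for three test cases.
good = {
    "cases": [
        {"name": f"case-{i}", "status": "PASSED", "score": 1.0} for i in range(3)
    ]
}

# Negative case: flip one status to FAILED; the check should now reject it.
bad = copy.deepcopy(good)
bad["cases"][1]["status"] = "FAILED"

print(all_cases_passed(good), all_cases_passed(bad))
```

Deep-copying the well-formed fixture before mutating it keeps the positive and negative cases independent of each other.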