Fix self-reporting report.json pattern in test cases by bai-uipath · Pull Request #507 · UiPath/skills

bai-uipath · 2026-04-30T22:28:02Z

Reduced tasks using self-report as primary success signal from 69 to 21 — a net removal of 48 tests that were verifying the agent's claim about what it did rather than what it actually did.

Self-report count

	Count
Tasks using self-report before this PR	69
Tasks fixed in this PR	48
Tasks still using self-report after	21

Still remaining (not fixed in this PR):

Not tackled (8)

gov-aops-policy/deployed_policy_smoke, deployment_discovery_smoke, discover_catalog_smoke, list_policies_smoke, template_bootstrap_smoke (5)
gov-access-policy/get_policy_smoke, list_policies_smoke (2)
maestro-case/registry_discovery (1)

Not tackled because these are policy-analysis tasks where the written output file is the agent's actual work product (structuring a policy recommendation or discovery report). The output file IS the deliverable, not a redundant self-description of what the agent did. Needs separate design work to determine the right verification approach.

Schema-design / partial-report (3)

hitl/quality_01_approval_gate_schema — the written schema_proposal.json is the deliverable; the task is schema design, not a CLI workflow
rpa/template_aware_create — reverted to pre-PR; CI showed score dropped 1.0 → 0.43 when criteria were overhauled
rpa/coded_test_case — reverted to pre-PR; the report.json prompt was keeping the agent on task (score dropped to 0.35 without it)

Motivation

Many tasks asked the agent to write a report.json summary and then read it back as the success criterion. A misbehaving agent passes by writing the right strings regardless of whether the underlying work happened. This PR replaces those checks with criteria that verify state directly — command body regex on actual CLI calls, .flow file content checks, or run_command validation.

What changed

Self-report removed, direct checks substituted (selected highlights)

Task	Removed	Replaced with
`hitl/e2e_01_invoice_approval_greenfield`	`report.json` (hitl_node_id, schema, validation_passed)	`file_contains`/`file_matches_regex` on `.flow` + `run_command` validate
`hitl/e2e_03_gdpr_compliance_greenfield`	`report.json` (timeout_configured, validation_passed)	`.flow` file checks + `run_command` validate
`hitl/smoke_07_neg_automated`	`recommendation.json` (hitl_needed: false)	`skill_triggered: no`
`hitl/smoke_08_neg_admin`	`answer.json` (hitl_authoring_needed: false)	`skill_triggered: no`
`hitl/quality_01_approval_gate_schema`	`file_contains` on schema_proposal.json (confirmed: false)	removed (no replacement needed)
`flow/hitl/quality_03_boolean_decision`	`report.json` (decision_condition, validation_passed)	existing `.flow` file checks
`hitl/e2e_07_apptask_brownfield`	`report.json` (inputs_type, validation_passed)	existing `.flow` file checks
`datafabric/smoke_entities`	`report.json` (fieldName key, field_id_source)	`command_executed` body regex (`(?s).fieldName`, `updateFields."id"`)
`datafabric/smoke_records`	`report.json` (filterGroup, pagination_method, Id in updates)	`command_executed` body regex
`tasks/smoke_assignment`	`report.json` (assign_by_email, assign_by_user_id, mutual exclusion)	`command_executed` regex per flag variant
`tasks/smoke_completion`	`report.json` (has_action, action_value=Approve, all_calls_include_type)	`command_executed` body regex
`tasks/smoke_discovery`	`report.json` (task_id_style=numeric, used_type_hint, used_as_admin)	`command_executed` flag-specific regex
`datafabric/smoke_bulk_import`	`json_check` on report.json (csv_includes_system_fields)	`file_check excludes` on `products.csv` directly

Still using self-report (deferred)

hitl/smoke_01–06 — converting to skill_triggered: yes caused all 6 to fail (agent not invoking the skill); reverted pending agent fix.
rpa/template_aware_create, rpa/coded_test_case — reverted after CI showed score regressions (0.43 and 0.35 respectively).
gov-aops-policy/ (5 tasks), gov-access-policy/ (2 tasks), maestro-case/registry_discovery — not tackled in this PR.

README

Added skill_triggered and command_not_executed sections to the criterion reference.
Removed the ~60-line "Anti-patterns to avoid" section.
Updated file_contains and json_check examples to use real artifacts.

Pass rate comparison

Baseline: run 2026-04-28_10-28-40 (pre-PR). Post-PR: run 2026-04-30_16-40-26. Scoped to the 47 tasks where self-report was removed and both runs have results.

47 tasks: 40/47 → 42/47 passing (+2 net).

Improved

Task	Before	After
`datafabric-smoke-entities`	FAILURE	SUCCESS
`flow-hitl-quality-schema-design`	MAX_TURNS	SUCCESS
`hitl-quality-priority-timeout`	FAILURE	SUCCESS
`sdd-gap-detection`	FAILURE	SUCCESS

Regressed — report.json removal exposed pre-existing failures

Task	Before	After	Note
`flow-hitl-quality-boolean-decision`	SUCCESS	FAILURE	`.flow` file criteria were already failing; report.json score (weight 6.0) was masking it
`datafabric-integration-preseeded-entity`	SUCCESS	FAILURE	Same — large json_check (weight 6.0) masked that remaining criteria weren't passing

Status change but still failing

Task	Before	After
`hitl-e2e-apptask-brownfield`	MAX_TURNS	FAILURE
`tasks-e2e-fetch-tasks`	FAILURE	MAX_TURNS

github-actions · 2026-05-01T00:47:46Z

Claude finished @bai-uipath's task in 3s —— View job

PR Review: Fix self-reporting report.json pattern in test cases

Read review criteria and project conventions
Read full PR diff and changed files
Review README and CONTRIBUTING changes
Check for remaining self-report patterns
Validate success_criteria structure across files
Post review feedback

Summary

Removes report.json / recommendation.json self-reporting from 50 test tasks, replacing them with direct verification: command_executed regex, file_contains/file_matches_regex on real artifacts (.flow files, CSVs), run_command for state validation, and skill_triggered for activation checks. The test task changes are clean and well-executed. The README and CONTRIBUTING doc updates have a few issues.

Change-by-Change Review

1. Data Fabric test tasks (13 files)

Severity: OK

All 13 data-fabric YAML files are clean — zero remaining report.json references. Replacements are appropriate:

Smoke tests (smoke_entities, smoke_records, smoke_files) now use only command_executed to verify correct CLI syntax. Good fit for offline smoke tests.
smoke_bulk_import verifies the CSV artifact directly with file_contains (header names) and file_check (excludes system fields). Strong improvement.
Integration tests use command_executed with body-matching regex (e.g., (?s).*fieldName, filterGroup, logicalOperator) — verifies the agent sent the right request bodies.
E2E tests add run_command checks (awk row counts, cmp -s for byte-equality) — verifiable, un-fakeable.

All criteria have proper type, description, weight, and pass_threshold fields.

2. HITL test tasks (13 files)

Severity: OK

E2E tests (e2e_01, e2e_03) now check .flow file content directly (file_contains for node types, handles, outcomes; file_matches_regex for $vars.<id>.result; run_command for uip flow validate). This is the ideal verification pattern — the .flow artifact is the ground truth.

Remaining E2E files (e2e_02, e2e_04, e2e_05, e2e_06, e2e_07) had report.json criteria removed, leaving existing .flow file checks as the sole validation. Quality tests (quality_04, quality_05, quality_07) similarly trimmed. All clean.

Negative smoke tests (smoke_07, smoke_08) now use skill_triggered: "no" — the correct primitive for "agent should NOT invoke this skill." Clean replacement.

3. Maestro Flow test tasks (8 files)

Severity: OK

All report.json / registry_report.json criteria removed from HITL quality tests, smoke tests, and init_validate. The remaining file_contains and command_executed criteria on .flow files are sufficient. registry_discovery.yaml now uses only command_executed — correct for a registry-exploration test.

4. Platform / Integration Service test tasks (5 files)

Severity: OK

Each file had a json_check on report.json removed, leaving command_executed and skill_triggered as the verification. Clean.

5. Review, RPA, SDD, Tasks test tasks (11 files)

Severity: OK

Clean removals across the board. review_multi_project_solution and review_rpa_project verify review_report.md artifact content. coded_test_case checks generated .cs files. SDD tasks verify the generated SDD markdown. Tasks tests use command_executed exclusively.

6. `template_aware_create.yaml` — partial self-report retained

Severity: Low

This task asks the agent to write a template-selection decision to report.json because create-project cannot run in the sandbox. The report.json IS the deliverable here, not a redundant summary. The command_executed criterion for search-templates anchors the test to real behavior. file_matches_regex and file_check verify the decision content. This is a reasonable boundary case, correctly called out in the PR description.

7. `CONTRIBUTING.md` — broken cross-reference

Severity: Medium

Line 255 adds a new rule:

See the "Anti-patterns to avoid" section in tests/README.md.

That section does not exist in the README. The old README had a ~60-line "Anti-patterns to avoid" section which this PR removed but did not replace with a named section the CONTRIBUTING reference can point to. The reference is now a dead link.

Fix this →

8. `tests/README.md` — Smoke Test Example still shows the self-report pattern

Severity: Medium

The "Smoke Test Example" (lines 196–285) shows init_validate.yaml with:

initial_prompt asking the agent to write report.json (line 212)
json_check criterion scoring report.json (lines 270–284)

But the actual init_validate.yaml file no longer uses report.json — it was cleaned up (either in this PR or previously). The README example is now out of sync with the real file and directly contradicts the new rule in CONTRIBUTING.md line 255. The example also references weight: 2.0 for report.json as "Critical validation" in the Weight table (line 408), which was correctly updated to "Important artifact content" elsewhere but the example YAML still demonstrates the old pattern.

Fix this →

9. `tests/README.md` — inaccurate reference to `registry_discovery.yaml`

Severity: Low

Line 294 says:

For another example using file_contains and run_command criteria, see tasks/uipath-maestro-flow/smoke/registry_discovery.yaml. That test also demonstrates overriding a single field (agent: max_turns: 14).

The actual file uses neither file_contains nor run_command (only command_executed), and max_turns is 28, not 14. This reference was likely correct before the PR's changes but is now stale.

Fix this →

10. `tests/README.md` — undocumented criterion types

Severity: Low

This PR introduces usage of two criterion types not documented in the Success Criteria Reference:

file_check — used in smoke_bulk_import.yaml and template_aware_create.yaml (2 files). Appears to work like file_contains but with excludes only.
file_matches_regex — used in e2e_01_invoice_approval_greenfield.yaml, e2e_03_gdpr_compliance_greenfield.yaml, and template_aware_create.yaml (4 uses across 3 files).

Neither appears in the README's criterion reference section. The README documents file_contains with an excludes: note (line 340), so file_check may be redundant — or it may be a distinct type in coder_eval. Worth documenting or consolidating.

What's Missing

README Smoke Test Example is stale — still shows the report.json anti-pattern despite the actual file being clean. This is the most prominent example in the repo and will mislead contributors.
CONTRIBUTING.md cross-reference is broken — points to a removed section.
README reference to registry_discovery.yaml is inaccurate — claims criteria types and max_turns that don't match the file.
file_check and file_matches_regex undocumented — added in this PR but not in the Success Criteria Reference.
Negative-guard tests still use self-report — smoke_negative_guards.yaml in both uipath-data-fabric and uipath-tasks use json_check on report.json to verify refusal behavior. Not changed in this PR and not mentioned in the PR description's "still remaining" list. Arguably these are like the governance tests (the written reasoning IS the deliverable), but worth calling out for completeness.

Area Ratings

Area	Status	Notes
Frontmatter	N/A	No skill SKILL.md files changed
E2E Tests	OK	50 tasks cleaned, replacements are sound
Skill Body	N/A	No skills modified
References & Assets	Issue	README example stale, broken cross-refs
Repo Hygiene	OK	No secrets, no cross-skill refs, scoped changes

Issues for Manual Review

file_check vs file_contains with excludes: Verify in coder_eval whether file_check is a distinct valid criterion type or whether file_contains with excludes should be used instead. If file_check is not a recognized type, the criterion in smoke_bulk_import.yaml:84 and template_aware_create.yaml:100 will silently pass or error.
Pass-rate regressions: The PR transparently documents 2 regressions (flow-hitl-quality-boolean-decision, datafabric-integration-preseeded-entity) where removing report.json unmasked pre-existing failures. Verify these are acceptable before merging.

Conclusion

The test task changes are well-executed — the right criterion types are used for each scenario, and the removal of self-reporting is thorough across the 50 tasks. The two Medium issues (broken CONTRIBUTING cross-reference, stale README example) should be fixed before merge since they will actively mislead future contributors into using the pattern this PR is trying to eliminate. The Low issues (stale registry_discovery reference, undocumented criterion types) are worth addressing but not blocking.

bai-uipath added 9 commits April 30, 2026 15:02

delete self-authoring

bceaa2d

data fabric checks

cf4a457

hard cases

aa76c9b

negative guards, update readme

12fa5a4

fix typo

538fe93

docs cleanup

a268b6e

fixone more

73390fc

revert 6 smoke tests

446d7e9

restore negative guards

5209294

bai-uipath marked this pull request as ready for review May 1, 2026 00:47

bai-uipath requested review from baishalighosh, chandusailella and dushyant-uipath as code owners May 1, 2026 00:47

bai-uipath added 2 commits April 30, 2026 17:51

revert 2 more

704ebc0

update readme

367e34b

rockymadden approved these changes May 1, 2026

View reviewed changes

dushyant-uipath approved these changes May 1, 2026

View reviewed changes

bai-uipath mentioned this pull request May 1, 2026

feat(uipath-platform): add traces_fetch smoke test + traces_e2e full round-trip test #480

Merged

7 tasks

baishalighosh approved these changes May 3, 2026

View reviewed changes

AlvinStanescu approved these changes May 4, 2026

View reviewed changes

bai-uipath added 2 commits May 4, 2026 13:30

Merge branch 'main' into bai/skills-self-report-fixup

9e8e473

fix some more

187ede6

bai-uipath merged commit 33f0e2d into main May 4, 2026
4 of 5 checks passed

bai-uipath deleted the bai/skills-self-report-fixup branch May 4, 2026 21:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix self-reporting report.json pattern in test cases#507

Fix self-reporting report.json pattern in test cases#507
bai-uipath merged 13 commits into
mainfrom
bai/skills-self-report-fixup

bai-uipath commented Apr 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 1, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

bai-uipath commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Self-report count

Motivation

What changed

Self-report removed, direct checks substituted (selected highlights)

Still using self-report (deferred)

README

Pass rate comparison

Improved

Regressed — report.json removal exposed pre-existing failures

Status change but still failing

Uh oh!

github-actions Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Fix self-reporting report.json pattern in test cases

Summary

Change-by-Change Review

1. Data Fabric test tasks (13 files)

2. HITL test tasks (13 files)

3. Maestro Flow test tasks (8 files)

4. Platform / Integration Service test tasks (5 files)

5. Review, RPA, SDD, Tasks test tasks (11 files)

6. template_aware_create.yaml — partial self-report retained

7. CONTRIBUTING.md — broken cross-reference

8. tests/README.md — Smoke Test Example still shows the self-report pattern

9. tests/README.md — inaccurate reference to registry_discovery.yaml

10. tests/README.md — undocumented criterion types

What's Missing

Area Ratings

Issues for Manual Review

Conclusion

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

bai-uipath commented Apr 30, 2026 •

edited

Loading

github-actions Bot commented May 1, 2026 •

edited

Loading

6. `template_aware_create.yaml` — partial self-report retained

7. `CONTRIBUTING.md` — broken cross-reference

8. `tests/README.md` — Smoke Test Example still shows the self-report pattern

9. `tests/README.md` — inaccurate reference to `registry_discovery.yaml`

10. `tests/README.md` — undocumented criterion types