Skip to content

Fix self-reporting report.json pattern in test cases#507

Merged
bai-uipath merged 13 commits into
mainfrom
bai/skills-self-report-fixup
May 4, 2026
Merged

Fix self-reporting report.json pattern in test cases#507
bai-uipath merged 13 commits into
mainfrom
bai/skills-self-report-fixup

Conversation

@bai-uipath
Copy link
Copy Markdown
Contributor

@bai-uipath bai-uipath commented Apr 30, 2026

Reduced tasks using self-report as primary success signal from 69 to 21 — a net removal of 48 tests that were verifying the agent's claim about what it did rather than what it actually did.

Self-report count

Count
Tasks using self-report before this PR 69
Tasks fixed in this PR 48
Tasks still using self-report after 21

Still remaining (not fixed in this PR):

Not tackled (8)

  • gov-aops-policy/deployed_policy_smoke, deployment_discovery_smoke, discover_catalog_smoke, list_policies_smoke, template_bootstrap_smoke (5)
  • gov-access-policy/get_policy_smoke, list_policies_smoke (2)
  • maestro-case/registry_discovery (1)

Not tackled because these are policy-analysis tasks where the written output file is the agent's actual work product (structuring a policy recommendation or discovery report). The output file IS the deliverable, not a redundant self-description of what the agent did. Needs separate design work to determine the right verification approach.

Schema-design / partial-report (3)

  • hitl/quality_01_approval_gate_schema — the written schema_proposal.json is the deliverable; the task is schema design, not a CLI workflow
  • rpa/template_aware_create — reverted to pre-PR; CI showed score dropped 1.0 → 0.43 when criteria were overhauled
  • rpa/coded_test_case — reverted to pre-PR; the report.json prompt was keeping the agent on task (score dropped to 0.35 without it)

Motivation

Many tasks asked the agent to write a report.json summary and then read it back as the success criterion. A misbehaving agent passes by writing the right strings regardless of whether the underlying work happened. This PR replaces those checks with criteria that verify state directly — command body regex on actual CLI calls, .flow file content checks, or run_command validation.

What changed

Self-report removed, direct checks substituted (selected highlights)

Task Removed Replaced with
hitl/e2e_01_invoice_approval_greenfield report.json (hitl_node_id, schema, validation_passed) file_contains/file_matches_regex on .flow + run_command validate
hitl/e2e_03_gdpr_compliance_greenfield report.json (timeout_configured, validation_passed) .flow file checks + run_command validate
hitl/smoke_07_neg_automated recommendation.json (hitl_needed: false) skill_triggered: no
hitl/smoke_08_neg_admin answer.json (hitl_authoring_needed: false) skill_triggered: no
hitl/quality_01_approval_gate_schema file_contains on schema_proposal.json (confirmed: false) removed (no replacement needed)
flow/hitl/quality_03_boolean_decision report.json (decision_condition, validation_passed) existing .flow file checks
hitl/e2e_07_apptask_brownfield report.json (inputs_type, validation_passed) existing .flow file checks
datafabric/smoke_entities report.json (fieldName key, field_id_source) command_executed body regex ((?s).*fieldName, updateFields.*"id")
datafabric/smoke_records report.json (filterGroup, pagination_method, Id in updates) command_executed body regex
tasks/smoke_assignment report.json (assign_by_email, assign_by_user_id, mutual exclusion) command_executed regex per flag variant
tasks/smoke_completion report.json (has_action, action_value=Approve, all_calls_include_type) command_executed body regex
tasks/smoke_discovery report.json (task_id_style=numeric, used_type_hint, used_as_admin) command_executed flag-specific regex
datafabric/smoke_bulk_import json_check on report.json (csv_includes_system_fields) file_check excludes on products.csv directly

Still using self-report (deferred)

  • hitl/smoke_01–06 — converting to skill_triggered: yes caused all 6 to fail (agent not invoking the skill); reverted pending agent fix.
  • rpa/template_aware_create, rpa/coded_test_case — reverted after CI showed score regressions (0.43 and 0.35 respectively).
  • gov-aops-policy/ (5 tasks), gov-access-policy/ (2 tasks), maestro-case/registry_discovery — not tackled in this PR.

README

  • Added skill_triggered and command_not_executed sections to the criterion reference.
  • Removed the ~60-line "Anti-patterns to avoid" section.
  • Updated file_contains and json_check examples to use real artifacts.

Pass rate comparison

Baseline: run 2026-04-28_10-28-40 (pre-PR). Post-PR: run 2026-04-30_16-40-26. Scoped to the 47 tasks where self-report was removed and both runs have results.

47 tasks: 40/47 → 42/47 passing (+2 net).

Improved

Task Before After
datafabric-smoke-entities FAILURE SUCCESS
flow-hitl-quality-schema-design MAX_TURNS SUCCESS
hitl-quality-priority-timeout FAILURE SUCCESS
sdd-gap-detection FAILURE SUCCESS

Regressed — report.json removal exposed pre-existing failures

Task Before After Note
flow-hitl-quality-boolean-decision SUCCESS FAILURE .flow file criteria were already failing; report.json score (weight 6.0) was masking it
datafabric-integration-preseeded-entity SUCCESS FAILURE Same — large json_check (weight 6.0) masked that remaining criteria weren't passing

Status change but still failing

Task Before After
hitl-e2e-apptask-brownfield MAX_TURNS FAILURE
tasks-e2e-fetch-tasks FAILURE MAX_TURNS

@bai-uipath bai-uipath marked this pull request as ready for review May 1, 2026 00:47
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 1, 2026

Claude finished @bai-uipath's task in 3s —— View job


PR Review: Fix self-reporting report.json pattern in test cases

  • Read review criteria and project conventions
  • Read full PR diff and changed files
  • Review README and CONTRIBUTING changes
  • Check for remaining self-report patterns
  • Validate success_criteria structure across files
  • Post review feedback

Summary

Removes report.json / recommendation.json self-reporting from 50 test tasks, replacing them with direct verification: command_executed regex, file_contains/file_matches_regex on real artifacts (.flow files, CSVs), run_command for state validation, and skill_triggered for activation checks. The test task changes are clean and well-executed. The README and CONTRIBUTING doc updates have a few issues.

Change-by-Change Review

1. Data Fabric test tasks (13 files)

Severity: OK

All 13 data-fabric YAML files are clean — zero remaining report.json references. Replacements are appropriate:

  • Smoke tests (smoke_entities, smoke_records, smoke_files) now use only command_executed to verify correct CLI syntax. Good fit for offline smoke tests.
  • smoke_bulk_import verifies the CSV artifact directly with file_contains (header names) and file_check (excludes system fields). Strong improvement.
  • Integration tests use command_executed with body-matching regex (e.g., (?s).*fieldName, filterGroup, logicalOperator) — verifies the agent sent the right request bodies.
  • E2E tests add run_command checks (awk row counts, cmp -s for byte-equality) — verifiable, un-fakeable.

All criteria have proper type, description, weight, and pass_threshold fields.

2. HITL test tasks (13 files)

Severity: OK

E2E tests (e2e_01, e2e_03) now check .flow file content directly (file_contains for node types, handles, outcomes; file_matches_regex for $vars.<id>.result; run_command for uip flow validate). This is the ideal verification pattern — the .flow artifact is the ground truth.

Remaining E2E files (e2e_02, e2e_04, e2e_05, e2e_06, e2e_07) had report.json criteria removed, leaving existing .flow file checks as the sole validation. Quality tests (quality_04, quality_05, quality_07) similarly trimmed. All clean.

Negative smoke tests (smoke_07, smoke_08) now use skill_triggered: "no" — the correct primitive for "agent should NOT invoke this skill." Clean replacement.

3. Maestro Flow test tasks (8 files)

Severity: OK

All report.json / registry_report.json criteria removed from HITL quality tests, smoke tests, and init_validate. The remaining file_contains and command_executed criteria on .flow files are sufficient. registry_discovery.yaml now uses only command_executed — correct for a registry-exploration test.

4. Platform / Integration Service test tasks (5 files)

Severity: OK

Each file had a json_check on report.json removed, leaving command_executed and skill_triggered as the verification. Clean.

5. Review, RPA, SDD, Tasks test tasks (11 files)

Severity: OK

Clean removals across the board. review_multi_project_solution and review_rpa_project verify review_report.md artifact content. coded_test_case checks generated .cs files. SDD tasks verify the generated SDD markdown. Tasks tests use command_executed exclusively.

6. template_aware_create.yaml — partial self-report retained

Severity: Low

This task asks the agent to write a template-selection decision to report.json because create-project cannot run in the sandbox. The report.json IS the deliverable here, not a redundant summary. The command_executed criterion for search-templates anchors the test to real behavior. file_matches_regex and file_check verify the decision content. This is a reasonable boundary case, correctly called out in the PR description.

7. CONTRIBUTING.md — broken cross-reference

Severity: Medium

Line 255 adds a new rule:

See the "Anti-patterns to avoid" section in tests/README.md.

That section does not exist in the README. The old README had a ~60-line "Anti-patterns to avoid" section which this PR removed but did not replace with a named section the CONTRIBUTING reference can point to. The reference is now a dead link.

Fix this →

8. tests/README.md — Smoke Test Example still shows the self-report pattern

Severity: Medium

The "Smoke Test Example" (lines 196–285) shows init_validate.yaml with:

  • initial_prompt asking the agent to write report.json (line 212)
  • json_check criterion scoring report.json (lines 270–284)

But the actual init_validate.yaml file no longer uses report.json — it was cleaned up (either in this PR or previously). The README example is now out of sync with the real file and directly contradicts the new rule in CONTRIBUTING.md line 255. The example also references weight: 2.0 for report.json as "Critical validation" in the Weight table (line 408), which was correctly updated to "Important artifact content" elsewhere but the example YAML still demonstrates the old pattern.

Fix this →

9. tests/README.md — inaccurate reference to registry_discovery.yaml

Severity: Low

Line 294 says:

For another example using file_contains and run_command criteria, see tasks/uipath-maestro-flow/smoke/registry_discovery.yaml. That test also demonstrates overriding a single field (agent: max_turns: 14).

The actual file uses neither file_contains nor run_command (only command_executed), and max_turns is 28, not 14. This reference was likely correct before the PR's changes but is now stale.

Fix this →

10. tests/README.md — undocumented criterion types

Severity: Low

This PR introduces usage of two criterion types not documented in the Success Criteria Reference:

  • file_check — used in smoke_bulk_import.yaml and template_aware_create.yaml (2 files). Appears to work like file_contains but with excludes only.
  • file_matches_regex — used in e2e_01_invoice_approval_greenfield.yaml, e2e_03_gdpr_compliance_greenfield.yaml, and template_aware_create.yaml (4 uses across 3 files).

Neither appears in the README's criterion reference section. The README documents file_contains with an excludes: note (line 340), so file_check may be redundant — or it may be a distinct type in coder_eval. Worth documenting or consolidating.

What's Missing

  • README Smoke Test Example is stale — still shows the report.json anti-pattern despite the actual file being clean. This is the most prominent example in the repo and will mislead contributors.
  • CONTRIBUTING.md cross-reference is broken — points to a removed section.
  • README reference to registry_discovery.yaml is inaccurate — claims criteria types and max_turns that don't match the file.
  • file_check and file_matches_regex undocumented — added in this PR but not in the Success Criteria Reference.
  • Negative-guard tests still use self-reportsmoke_negative_guards.yaml in both uipath-data-fabric and uipath-tasks use json_check on report.json to verify refusal behavior. Not changed in this PR and not mentioned in the PR description's "still remaining" list. Arguably these are like the governance tests (the written reasoning IS the deliverable), but worth calling out for completeness.

Area Ratings

Area Status Notes
Frontmatter N/A No skill SKILL.md files changed
E2E Tests OK 50 tasks cleaned, replacements are sound
Skill Body N/A No skills modified
References & Assets Issue README example stale, broken cross-refs
Repo Hygiene OK No secrets, no cross-skill refs, scoped changes

Issues for Manual Review

  • file_check vs file_contains with excludes: Verify in coder_eval whether file_check is a distinct valid criterion type or whether file_contains with excludes should be used instead. If file_check is not a recognized type, the criterion in smoke_bulk_import.yaml:84 and template_aware_create.yaml:100 will silently pass or error.
  • Pass-rate regressions: The PR transparently documents 2 regressions (flow-hitl-quality-boolean-decision, datafabric-integration-preseeded-entity) where removing report.json unmasked pre-existing failures. Verify these are acceptable before merging.

Conclusion

The test task changes are well-executed — the right criterion types are used for each scenario, and the removal of self-reporting is thorough across the 50 tasks. The two Medium issues (broken CONTRIBUTING cross-reference, stale README example) should be fixed before merge since they will actively mislead future contributors into using the pattern this PR is trying to eliminate. The Low issues (stale registry_discovery reference, undocumented criterion types) are worth addressing but not blocking.


@bai-uipath bai-uipath merged commit 33f0e2d into main May 4, 2026
4 of 5 checks passed
@bai-uipath bai-uipath deleted the bai/skills-self-report-fixup branch May 4, 2026 21:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants