Fix self-reporting report.json pattern in test cases#507
Conversation
|
Claude finished @bai-uipath's task in 3s —— View job PR Review: Fix self-reporting report.json pattern in test cases
SummaryRemoves Change-by-Change Review1. Data Fabric test tasks (13 files)Severity: OK All 13 data-fabric YAML files are clean — zero remaining
All criteria have proper 2. HITL test tasks (13 files)Severity: OK E2E tests ( Remaining E2E files ( Negative smoke tests ( 3. Maestro Flow test tasks (8 files)Severity: OK All 4. Platform / Integration Service test tasks (5 files)Severity: OK Each file had a 5. Review, RPA, SDD, Tasks test tasks (11 files)Severity: OK Clean removals across the board. 6.
|
| Area | Status | Notes |
|---|---|---|
| Frontmatter | N/A | No skill SKILL.md files changed |
| E2E Tests | OK | 50 tasks cleaned, replacements are sound |
| Skill Body | N/A | No skills modified |
| References & Assets | Issue | README example stale, broken cross-refs |
| Repo Hygiene | OK | No secrets, no cross-skill refs, scoped changes |
Issues for Manual Review
file_checkvsfile_containswithexcludes: Verify in coder_eval whetherfile_checkis a distinct valid criterion type or whetherfile_containswithexcludesshould be used instead. Iffile_checkis not a recognized type, the criterion insmoke_bulk_import.yaml:84andtemplate_aware_create.yaml:100will silently pass or error.- Pass-rate regressions: The PR transparently documents 2 regressions (
flow-hitl-quality-boolean-decision,datafabric-integration-preseeded-entity) where removingreport.jsonunmasked pre-existing failures. Verify these are acceptable before merging.
Conclusion
The test task changes are well-executed — the right criterion types are used for each scenario, and the removal of self-reporting is thorough across the 50 tasks. The two Medium issues (broken CONTRIBUTING cross-reference, stale README example) should be fixed before merge since they will actively mislead future contributors into using the pattern this PR is trying to eliminate. The Low issues (stale registry_discovery reference, undocumented criterion types) are worth addressing but not blocking.
Reduced tasks using self-report as primary success signal from 69 to 21 — a net removal of 48 tests that were verifying the agent's claim about what it did rather than what it actually did.
Self-report count
Still remaining (not fixed in this PR):
Not tackled (8)
gov-aops-policy/deployed_policy_smoke,deployment_discovery_smoke,discover_catalog_smoke,list_policies_smoke,template_bootstrap_smoke(5)gov-access-policy/get_policy_smoke,list_policies_smoke(2)maestro-case/registry_discovery(1)Not tackled because these are policy-analysis tasks where the written output file is the agent's actual work product (structuring a policy recommendation or discovery report). The output file IS the deliverable, not a redundant self-description of what the agent did. Needs separate design work to determine the right verification approach.
Schema-design / partial-report (3)
hitl/quality_01_approval_gate_schema— the written schema_proposal.json is the deliverable; the task is schema design, not a CLI workflowrpa/template_aware_create— reverted to pre-PR; CI showed score dropped 1.0 → 0.43 when criteria were overhauledrpa/coded_test_case— reverted to pre-PR; the report.json prompt was keeping the agent on task (score dropped to 0.35 without it)Motivation
Many tasks asked the agent to write a
report.jsonsummary and then read it back as the success criterion. A misbehaving agent passes by writing the right strings regardless of whether the underlying work happened. This PR replaces those checks with criteria that verify state directly — command body regex on actual CLI calls,.flowfile content checks, orrun_commandvalidation.What changed
Self-report removed, direct checks substituted (selected highlights)
hitl/e2e_01_invoice_approval_greenfieldreport.json(hitl_node_id, schema, validation_passed)file_contains/file_matches_regexon.flow+run_commandvalidatehitl/e2e_03_gdpr_compliance_greenfieldreport.json(timeout_configured, validation_passed).flowfile checks +run_commandvalidatehitl/smoke_07_neg_automatedrecommendation.json(hitl_needed: false)skill_triggered: nohitl/smoke_08_neg_adminanswer.json(hitl_authoring_needed: false)skill_triggered: nohitl/quality_01_approval_gate_schemafile_containson schema_proposal.json (confirmed: false)flow/hitl/quality_03_boolean_decisionreport.json(decision_condition, validation_passed).flowfile checkshitl/e2e_07_apptask_brownfieldreport.json(inputs_type, validation_passed).flowfile checksdatafabric/smoke_entitiesreport.json(fieldName key, field_id_source)command_executedbody regex ((?s).*fieldName,updateFields.*"id")datafabric/smoke_recordsreport.json(filterGroup, pagination_method, Id in updates)command_executedbody regextasks/smoke_assignmentreport.json(assign_by_email, assign_by_user_id, mutual exclusion)command_executedregex per flag varianttasks/smoke_completionreport.json(has_action, action_value=Approve, all_calls_include_type)command_executedbody regextasks/smoke_discoveryreport.json(task_id_style=numeric, used_type_hint, used_as_admin)command_executedflag-specific regexdatafabric/smoke_bulk_importjson_checkon report.json (csv_includes_system_fields)file_check excludesonproducts.csvdirectlyStill using self-report (deferred)
hitl/smoke_01–06— converting toskill_triggered: yescaused all 6 to fail (agent not invoking the skill); reverted pending agent fix.rpa/template_aware_create,rpa/coded_test_case— reverted after CI showed score regressions (0.43 and 0.35 respectively).gov-aops-policy/(5 tasks),gov-access-policy/(2 tasks),maestro-case/registry_discovery— not tackled in this PR.README
skill_triggeredandcommand_not_executedsections to the criterion reference.file_containsandjson_checkexamples to use real artifacts.Pass rate comparison
Baseline: run
2026-04-28_10-28-40(pre-PR). Post-PR: run2026-04-30_16-40-26. Scoped to the 47 tasks where self-report was removed and both runs have results.47 tasks: 40/47 → 42/47 passing (+2 net).
Improved
datafabric-smoke-entitiesflow-hitl-quality-schema-designhitl-quality-priority-timeoutsdd-gap-detectionRegressed — report.json removal exposed pre-existing failures
flow-hitl-quality-boolean-decision.flowfile criteria were already failing; report.json score (weight 6.0) was masking itdatafabric-integration-preseeded-entityStatus change but still failing
hitl-e2e-apptask-brownfieldtasks-e2e-fetch-tasks