fix(waa-live): gate app readiness and classify infra setup failures #107

Merged
abrichr merged 4 commits into main from fix/waa-readiness-infra on Mar 6, 2026

abrichr commented on Mar 4, 2026

Summary

  • fail tasks before step budget when post-setup app readiness cannot be verified
  • classify these as error_type: infrastructure and surface infra-adjusted success metrics
  • add deterministic LibreOffice recovery/open remediation in reset flow
  • persist infra metadata in execution/summary and show infra status in viewer + CLI output
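The gating and classification described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `TaskResult`, `run_task`, `success_rates`, and the callables passed in are hypothetical names, and the "infra-adjusted" rate is assumed here to mean success over tasks that did not fail for infrastructure reasons.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names mirror the PR text."""
    task_id: str
    success: bool
    num_steps: int
    error_type: Optional[str] = None  # "infrastructure" when setup failed


def run_task(
    task_id: str,
    app_ready: Callable[[], bool],
    run_agent: Callable[[], Tuple[bool, int]],
) -> TaskResult:
    """Fail fast, before any agent step, if readiness cannot be verified."""
    if not app_ready():
        # No step budget is spent: the failure is attributed to the
        # environment, not to the agent.
        return TaskResult(task_id, success=False, num_steps=0,
                          error_type="infrastructure")
    success, num_steps = run_agent()
    return TaskResult(task_id, success, num_steps)


def success_rates(results: List[TaskResult]) -> dict:
    """Report the raw rate and a rate that excludes infra failures.

    Assumption: 'infra-adjusted' drops infrastructure failures from
    the denominator.
    """
    scored = [r for r in results if r.error_type != "infrastructure"]
    raw = sum(r.success for r in results) / len(results) if results else 0.0
    adjusted = sum(r.success for r in scored) / len(scored) if scored else 0.0
    return {
        "raw": raw,
        "infra_adjusted": adjusted,
        "infra_failures": len(results) - len(scored),
    }
```

Keeping both rates side by side lets a reader see how much of a low score comes from setup problems rather than agent behavior.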

Why

The previous DC run on 04d9aeaf spent steps trying to recover from setup/app-open failures. This change marks those as infrastructure failures instead of agent failures and preserves score integrity.

Validation

  • python3 -m py_compile on touched files
  • rerun of 04d9aeaf now reports num_steps=0, error_type=infrastructure when app readiness fails before step 0
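For reference, the `py_compile` check can also be run programmatically through the standard-library module of the same name. The helper below is only a sketch; `compile_check` is a hypothetical name and the file list is whatever the caller considers "touched".

```python
import py_compile
from typing import Dict, Iterable


def compile_check(paths: Iterable[str]) -> Dict[str, str]:
    """Byte-compile each file and collect error messages for failures.

    Returns an empty dict when every file compiles. This catches syntax
    errors only; it does not import or execute the modules.
    """
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            # doraise=True turns compile errors into PyCompileError
            # instead of printing them to stderr.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures[path] = exc.msg
    return failures
```

Like the command-line form, this only confirms that the files parse, so it complements rather than replaces the rerun listed above.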

@abrichr abrichr force-pushed the fix/waa-readiness-infra branch from 282e7dc to 76bb1e5 Compare March 4, 2026 22:03
@abrichr abrichr merged commit 3c06897 into main Mar 6, 2026
1 check passed
abrichr added a commit that referenced this pull request Mar 6, 2026
…lures

The post-setup focus check (PR #107) defaults to strict mode, which
marks tasks as infrastructure failures when the a11y window enumeration
can't find the expected app title. In practice, LibreOffice windows
take longer to render titles than the check allows, causing ALL
LibreOffice tasks to fail as infra — even though the app IS open.

Changing default to False: focus check still runs and logs warnings,
but doesn't abort the task. The agent can recover from focus issues
on its own (it did in all prior trials without this check).

Use --strict-setup-readiness to opt into the fatal behavior when
the a11y detection is more reliable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
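The opt-in behavior described in this commit message could look roughly like the sketch below. Only the `--strict-setup-readiness` flag name comes from the message; `check_focus`, `SetupReadinessError`, the logger name, and the title-matching rule are assumptions made for illustration.

```python
import argparse
import logging
from typing import Iterable

logger = logging.getLogger("setup_readiness")  # hypothetical logger name


class SetupReadinessError(RuntimeError):
    """Hypothetical error raised when strict mode aborts a task."""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Off by default: the focus check warns but does not abort.
    parser.add_argument(
        "--strict-setup-readiness",
        action="store_true",
        default=False,
        help="treat a failed post-setup focus check as fatal",
    )
    return parser


def check_focus(expected_title: str, window_titles: Iterable[str],
                strict: bool) -> bool:
    """Return True if any window title contains the expected app title.

    On a miss, raise in strict mode; otherwise log a warning and let
    the agent continue.
    """
    wanted = expected_title.lower()
    if any(wanted in title.lower() for title in window_titles):
        return True
    message = f"expected window {expected_title!r} not found after setup"
    if strict:
        raise SetupReadinessError(message)
    logger.warning(message)
    return False
```

With the default, a slow-to-render title produces a log line and a `False` return value rather than an infrastructure failure, which matches the intent of leaving recovery to the agent.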
abrichr added a commit that referenced this pull request Mar 6, 2026
…lures (#113)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>