Skip to content

[FastDeploy] Retry first run-as query to absorb post-install data-dir race (XA0137)#11809

Merged
jonathanpeppers merged 3 commits into
mainfrom
copilot/fix-fast-deployment-error
Jul 1, 2026
Merged

[FastDeploy] Retry first run-as query to absorb post-install data-dir race (XA0137)#11809
jonathanpeppers merged 3 commits into
mainfrom
copilot/fix-fast-deployment-error

Conversation

Copilot AI commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

The Fast Deployment install-vs-run-as race (#7821) fires XA0137 (run-as: couldn't stat /data/user/0/<pkg>: No such file or directory) on the primary user (user 0) ~daily in CI. The existing EnsureUserIsRunning mitigation (#11322) only covers secondary users and explicitly skips user 0 — and am start-user -w 0 wouldn't help anyway, since user 0 is already running; the real issue is the brief window after pm install before /data/user/0/<pkg> is stat-able through run-as.

Changes

src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs:

  • QueryInternalPathWithRetry — wraps the first run-as <pkg> pwd query in CheckAppInstalledAndDebuggable with a bounded poll (10 attempts × 500 ms) that retries while the result matches the transient race signature, before falling through to RaiseRunAsError/XA0137. Applies to all users, including user 0; no-op when the directory already exists.
  • IsTransientRunAsStatRace — matches the couldn't stat … No such file or directory signature only, so genuine run-as failures (permission denied, not debuggable, corrupt install, etc.) still surface immediately.
async Task<string> QueryInternalPathWithRetry ()
{
    const int maxAttempts = 10;
    var delay = TimeSpan.FromMilliseconds (500);
    string result = await Device.RunAs (packageInfo, "pwd");
    for (int attempt = 1; attempt < maxAttempts && IsTransientRunAsStatRace (result); attempt++) {
        LogDiagnostic ($"run-as could not stat the data directory for {packageInfo.PackageName} yet (attempt {attempt}/{maxAttempts}); retrying in {delay.TotalMilliseconds:0} ms. Output: {result?.Trim ()}");
        await Task.Delay (delay, CancellationToken);
        result = await Device.RunAs (packageInfo, "pwd");
    }
    return result;
}

EnsureUserIsRunning is left intact; the retry sits after it, so secondary users retain both mitigations.

Notes

No unit-test project exists for Xamarin.Android.Build.Debugging.Tasks; the path is exercised by the on-device MSBuildDeviceIntegration install tests where the race was observed.

Co-authored-by: simonrozsival <374616+simonrozsival@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix XA0137 race condition in Fast Deployment install [FastDeploy] Retry first run-as query to absorb post-install data-dir race (XA0137) Jun 29, 2026
Copilot AI requested a review from simonrozsival June 29, 2026 21:47
@simonrozsival simonrozsival marked this pull request as ready for review June 30, 2026 08:01
Copilot AI review requested due to automatic review settings June 30, 2026 08:01

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR mitigates a Fast Deployment flake where the first run-as <pkg> pwd query can fail immediately after pm install because the app’s per-user data directory isn’t yet stat-able (surfacing as XA0137), including on primary user 0 in CI.

Changes:

  • Adds a bounded retry loop for the initial run-as <pkg> pwd query to absorb the post-install /data/user/N/<pkg> materialization race.
  • Introduces a narrow signature check to retry only for the transient couldn't stat … No such file or directory case, preserving fast-fail for genuine run-as errors.
Show a summary per file
File Description
src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs Retries the initial run-as pwd query when it matches the transient post-install stat race, avoiding spurious XA0137/FastDeploy disablement.

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 0

@simonrozsival simonrozsival added the ready-to-review This PR is ready to review/merge, I think any CI failures are just flaky (ignorable). label Jun 30, 2026
@jonathanpeppers

Copy link
Copy Markdown
Member

/review

@github-actions

github-actions Bot commented Jun 30, 2026

Copy link
Copy Markdown

Android PR Reviewer completed successfully!

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Android PR Reviewer

Verdict: ✅ LGTM (pending CI) — 0 errors, 0 warnings, 2 💡 suggestions.

The fix is correct, narrowly scoped, and conservative. I read the full FastDeploy.cs and traced both control-flow paths before forming this assessment.

Why it's sound:

  • The retry gate IsTransientRunAsStatRace requires both couldn't stat and No such file or directory, so genuine run-as failures (permission denied, wrong owner, SELinux, corrupt install, not a directory) still fast-fail to XA0137 via runas_codes — only the specific post-install /data/user/N/<pkg> materialization race is absorbed.
  • The retry sits after EnsureUserIsRunning and is reached by both CheckAppInstalledAndDebuggable call sites — the initial check and the post-install re-check at line 318 (InternalPath is reset to null first), so primary and secondary users get it. It complements, rather than duplicates, the am start-user mitigation (#11322).
  • await Task.Delay (delay, CancellationToken) correctly honors cancellation, so a cancelled build won't spin for the full ~4.5 s. The bound is finite (10 queries / 9 sleeps).
  • A genuinely-uninstalled package returns is unknown (the unknown path), not couldn't stat, so the retry penalty is not paid for missing apps — only inside the real race window.

Verified against project context:

  • ✅ No new null-deref risk — InternalPath usage is unchanged from the pre-existing await Device.RunAs (..., "pwd").
  • Device.RunAs has no CancellationToken overload, so the no-token calls match the established pattern in this file (lines 367, 480, 483, 740).
  • ✅ Consistent with the file's existing string.IsNullOrEmpty / OrdinalIgnoreCase conventions (project is netstandard2.0, but the file is not #nullable enable, so the IsNullOrEmpty() extension would add no annotation value here).
  • ✅ Named constants over magic numbers; static helper carries no instance state; doc comments accurately describe the implementation.

Non-blocking suggestions (posted inline):

  1. 💡 A small regression test for the pure IsTransientRunAsStatRace predicate would guard the race signature against future run-as wording drift (no test project exists for this assembly today).
  2. 💡 Document why ~4.5 s is expected to cover the user-0 window, mirroring the empirical justification already present in EnsureUserIsRunning.

CI:Pending — no checks have reported on 137f0c6 yet (status pending, total_count 0). This path is exercised only by the on-device MSBuildDeviceIntegration install tests, so the PR isn't mergeable until those go green — but I found no code-level blockers in this review.

Generated by Android PR Reviewer for issue #11809 · 167.6 AIC · ⌖ 20.6 AIC · ⊞ 6.7K
Comment /review to run again

/// install-vs-run-as race signature (<c>couldn't stat … No such file or directory</c>),
/// i.e. the per-user data directory has not yet materialized after <c>pm install</c>.
/// </summary>
static bool IsTransientRunAsStatRace (string result)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 💡 TestingIsTransientRunAsStatRace is a pure static predicate and the crux of this fix: it alone decides retry-vs-fast-fail from the run-as text. That makes it ideal to lock down with a unit test — e.g. run-as: couldn't stat ...: No such file or directorytrue; Permission denied, has wrong owner, or couldn't stat without No such filefalse. No test project exists for Xamarin.Android.Build.Debugging.Tasks today, so this is forward-looking rather than blocking, but a tiny regression test would guard the signature against future run-as wording drift (the message is produced by the Android run-as binary, not by us).

(Rule: Missing regression test for a bug fix)

/// </remarks>
async Task<string> QueryInternalPathWithRetry ()
{
const int maxAttempts = 10;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 💡 Documentation — This retry budget (10 attempts × 500 ms ≈ 4.5 s worst case; 9 sleeps) is an unexplained policy choice, whereas the sibling EnsureUserIsRunning documents N=50 measurements and explicitly notes that for secondary users "polling alone is not sufficient (30% never recover even after 30 s)". Since this path relies on polling alone, a one-line note on why ~4.5 s is expected to cover the user-0 /data/user/0/<pkg> materialization window — the key difference being that user 0 is already running, so the directory is guaranteed to appear and is only briefly delayed — would pre-empt the obvious "why not 30 s like the other case?" question.

(Rule: Document non-obvious behavioral decisions, Postmortem #13)

@simonrozsival

Copy link
Copy Markdown
Member

@copilot address the suggestions: add unit test for IsTransientRunAsStatRace and document the retry policy.

Co-authored-by: simonrozsival <374616+simonrozsival@users.noreply.github.com>

Copilot AI commented Jul 1, 2026

Copy link
Copy Markdown
Contributor Author

@copilot address the suggestions: add unit test for IsTransientRunAsStatRace and document the retry policy.

Done in commit Add unit tests for IsTransientRunAsStatRace and document retry policy:

  • IsTransientRunAsStatRace is now internal static so tests can reach it directly.
  • InternalsVisibleTo for Xamarin.Android.Build.Tests added to src/Xamarin.Android.Build.Debugging.Tasks/Properties/AssemblyInfo.cs (same public key as the other InternalsVisibleTo entries in the repo).
  • Unit tests added as FastDeployTests in DebuggingTasksTests.cs, covering transient-race inputs (primary user 0, secondary user, case variants, trailing whitespace) and genuine failures that must not be retried (not debuggable, unknown package, permission denied, partial substring matches, null/empty).
  • Retry policy documented in the QueryInternalPathWithRetry XML doc: "up to 10 attempts with a 500 ms delay between each, giving a maximum wait of 4.5 seconds before the error is surfaced as XA0137."

@simonrozsival simonrozsival added the copilot `copilot-cli` or other AIs were used to author this label Jul 1, 2026
@jonathanpeppers jonathanpeppers merged commit 814f5a0 into main Jul 1, 2026
36 of 38 checks passed
@jonathanpeppers jonathanpeppers deleted the copilot/fix-fast-deployment-error branch July 1, 2026 13:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

copilot `copilot-cli` or other AIs were used to author this flaky-tests ready-to-review This PR is ready to review/merge, I think any CI failures are just flaky (ignorable).

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fast Deployment XA0137 (run-as 'couldn't stat') races install on the PRIMARY user (user 0) ~daily in CI

4 participants