[FastDeploy] Retry first run-as query to absorb post-install data-dir race (XA0137)#11809
Conversation
Co-authored-by: simonrozsival <374616+simonrozsival@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This PR mitigates a Fast Deployment flake where the first run-as <pkg> pwd query can fail immediately after pm install because the app’s per-user data directory isn’t yet stat-able (surfacing as XA0137), including on primary user 0 in CI.
Changes:
- Adds a bounded retry loop for the initial
run-as <pkg> pwdquery to absorb the post-install/data/user/N/<pkg>materialization race. - Introduces a narrow signature check to retry only for the transient
couldn't stat … No such file or directorycase, preserving fast-fail for genuinerun-aserrors.
Show a summary per file
| File | Description |
|---|---|
| src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs | Retries the initial run-as pwd query when it matches the transient post-install stat race, avoiding spurious XA0137/FastDeploy disablement. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 0
|
/review |
|
✅ Android PR Reviewer completed successfully! |
There was a problem hiding this comment.
🤖 Android PR Reviewer
Verdict: ✅ LGTM (pending CI) — 0 errors, 0 warnings, 2 💡 suggestions.
The fix is correct, narrowly scoped, and conservative. I read the full FastDeploy.cs and traced both control-flow paths before forming this assessment.
Why it's sound:
- The retry gate
IsTransientRunAsStatRacerequires bothcouldn't statandNo such file or directory, so genuinerun-asfailures (permission denied, wrong owner, SELinux, corrupt install,not a directory) still fast-fail to XA0137 viarunas_codes— only the specific post-install/data/user/N/<pkg>materialization race is absorbed. - The retry sits after
EnsureUserIsRunningand is reached by bothCheckAppInstalledAndDebuggablecall sites — the initial check and the post-install re-check at line 318 (InternalPathis reset tonullfirst), so primary and secondary users get it. It complements, rather than duplicates, theam start-usermitigation (#11322). await Task.Delay (delay, CancellationToken)correctly honors cancellation, so a cancelled build won't spin for the full ~4.5 s. The bound is finite (10 queries / 9 sleeps).- A genuinely-uninstalled package returns
is unknown(theunknownpath), notcouldn't stat, so the retry penalty is not paid for missing apps — only inside the real race window.
Verified against project context:
- ✅ No new null-deref risk —
InternalPathusage is unchanged from the pre-existingawait Device.RunAs (..., "pwd"). - ✅
Device.RunAshas noCancellationTokenoverload, so the no-token calls match the established pattern in this file (lines 367, 480, 483, 740). - ✅ Consistent with the file's existing
string.IsNullOrEmpty/OrdinalIgnoreCaseconventions (project isnetstandard2.0, but the file is not#nullable enable, so theIsNullOrEmpty()extension would add no annotation value here). - ✅ Named constants over magic numbers;
statichelper carries no instance state; doc comments accurately describe the implementation.
Non-blocking suggestions (posted inline):
- 💡 A small regression test for the pure
IsTransientRunAsStatRacepredicate would guard the race signature against futurerun-aswording drift (no test project exists for this assembly today). - 💡 Document why ~4.5 s is expected to cover the user-0 window, mirroring the empirical justification already present in
EnsureUserIsRunning.
CI: ⏳ Pending — no checks have reported on 137f0c6 yet (status pending, total_count 0). This path is exercised only by the on-device MSBuildDeviceIntegration install tests, so the PR isn't mergeable until those go green — but I found no code-level blockers in this review.
Generated by Android PR Reviewer for issue #11809 · 167.6 AIC · ⌖ 20.6 AIC · ⊞ 6.7K
Comment /review to run again
| /// install-vs-run-as race signature (<c>couldn't stat … No such file or directory</c>), | ||
| /// i.e. the per-user data directory has not yet materialized after <c>pm install</c>. | ||
| /// </summary> | ||
| static bool IsTransientRunAsStatRace (string result) |
There was a problem hiding this comment.
🤖 💡 Testing — IsTransientRunAsStatRace is a pure static predicate and the crux of this fix: it alone decides retry-vs-fast-fail from the run-as text. That makes it ideal to lock down with a unit test — e.g. run-as: couldn't stat ...: No such file or directory ⇒ true; Permission denied, has wrong owner, or couldn't stat without No such file ⇒ false. No test project exists for Xamarin.Android.Build.Debugging.Tasks today, so this is forward-looking rather than blocking, but a tiny regression test would guard the signature against future run-as wording drift (the message is produced by the Android run-as binary, not by us).
(Rule: Missing regression test for a bug fix)
| /// </remarks> | ||
| async Task<string> QueryInternalPathWithRetry () | ||
| { | ||
| const int maxAttempts = 10; |
There was a problem hiding this comment.
🤖 💡 Documentation — This retry budget (10 attempts × 500 ms ≈ 4.5 s worst case; 9 sleeps) is an unexplained policy choice, whereas the sibling EnsureUserIsRunning documents N=50 measurements and explicitly notes that for secondary users "polling alone is not sufficient (30% never recover even after 30 s)". Since this path relies on polling alone, a one-line note on why ~4.5 s is expected to cover the user-0 /data/user/0/<pkg> materialization window — the key difference being that user 0 is already running, so the directory is guaranteed to appear and is only briefly delayed — would pre-empt the obvious "why not 30 s like the other case?" question.
(Rule: Document non-obvious behavioral decisions, Postmortem #13)
|
@copilot address the suggestions: add unit test for |
Co-authored-by: simonrozsival <374616+simonrozsival@users.noreply.github.com>
Done in commit
|
The Fast Deployment install-vs-
run-asrace (#7821) firesXA0137(run-as: couldn't stat /data/user/0/<pkg>: No such file or directory) on the primary user (user 0) ~daily in CI. The existingEnsureUserIsRunningmitigation (#11322) only covers secondary users and explicitly skips user 0 — andam start-user -w 0wouldn't help anyway, since user 0 is already running; the real issue is the brief window afterpm installbefore/data/user/0/<pkg>is stat-able throughrun-as.Changes
src/Xamarin.Android.Build.Debugging.Tasks/Tasks/FastDeploy.cs:QueryInternalPathWithRetry— wraps the firstrun-as <pkg> pwdquery inCheckAppInstalledAndDebuggablewith a bounded poll (10 attempts × 500 ms) that retries while the result matches the transient race signature, before falling through toRaiseRunAsError/XA0137. Applies to all users, including user 0; no-op when the directory already exists.IsTransientRunAsStatRace— matches thecouldn't stat … No such file or directorysignature only, so genuinerun-asfailures (permission denied, not debuggable, corrupt install, etc.) still surface immediately.EnsureUserIsRunningis left intact; the retry sits after it, so secondary users retain both mitigations.Notes
No unit-test project exists for
Xamarin.Android.Build.Debugging.Tasks; the path is exercised by the on-deviceMSBuildDeviceIntegrationinstall tests where the race was observed.