[ci] Retry on-device install, fail fast, diagnose failures, always capture logcat (#11831)

simonrozsival · Copilot · web-flow · commit 36df374a4307 · 2026-07-01T08:13:16.000-05:00
## Description

The on-device package-test template `apk-instrumentation.yaml` builds and installs the app with `-t:Install` and then runs `dotnet test`. When the emulator drops off ADB mid-install (`error XAGCPU7000: Mono.AndroidTools.AdbException: device offline`), the previous behavior was doubly bad:

1. The install ran with `continueOnError: true`, so the failure was swallowed and the pipeline still ran `dotnet test` against a device where the app was **never installed**, producing a misleading `INSTRUMENTATION_FAILED` and only going red at the final `fail if any issues occurred` gate, after wasting time.
2. A single transient ADB blip failed the whole lane with no attempt to recover, and we captured nothing to explain *why* the install failed.

## Changes

- **Retry the install** via the `run-dotnet-preview` template's existing `retryCountOnTaskFailure: 3`, so a transient `device offline` blip can recover on a retry instead of failing the lane.
- **Fail fast** (`continueOnError: false`) once retries are exhausted: the lane fails immediately and the `run` step (whose `condition` defaults to `succeeded()`) is skipped, since there is no point running tests against a device with no app installed.
- **Diagnose install failures**: a new `failed()`-gated, time-bounded step right after install snapshots device state so we can *classify* the failure next time instead of guessing: connectivity (`adb devices -l` / `get-state`), disk pressure (`df`, `dumpsys diskstats`), storage-service readiness (`dumpsys storaged`, for the `StorageStatsManager` NPE seen during `install-create`), boot completion, and accumulated test apps (`pm list packages -3`).
- **Never lose logcat**: the `capture logcat` step's condition is changed from the default `succeeded()` to `always()`, so the best-effort `adb logcat -d` runs on success, on a failed step (e.g. fail-fast install), **and** on job cancellation/timeout (e.g. a hung test). The capture is best-effort (`continueOnError: true` + `|| echo`) and tolerates an offline device. This matches the step's "Always capture full device logcat" intent and addresses review feedback.

## Why `always()` (logcat) and not just removing the condition

Omitting `condition:` is **not** neutral: an Azure Pipelines step with no condition defaults to `succeeded()`, i.e. it is skipped as soon as any prior step fails. That is exactly the old, buggy behavior. `succeededOrFailed()` fixes the failed-step case but still skips on cancellation (a job-level timeout counts as cancellation), which is precisely when a hung test's logcat is most valuable, so `always()` is used.

## Behavior change

- A genuinely unrecoverable install failure now fails the job at the install step (after retries) instead of the final gate.
- Because the flavors (Debug / aab / NoAab / CoreCLR / …) share one job and downstream steps are gated on `succeeded()`, a hard install failure in one flavor will now skip the remaining flavors in that job. This is acceptable: a device that stays offline across all retries is very unlikely to install the next flavor either, and fail-fast gives a clear signal.

## Context

Tracking issue for the underlying emulator/ADB install flakiness (`device offline` and the `StorageStatsManager` NPE): #11830

Observed on build [1488505](https://dev.azure.com/dnceng-public/public/_build/results?buildId=1488505) (`Package Tests > macOS > Tests > APKs 1`).

A companion PR adds an equivalent device-state snapshot to the `DeviceTest` on-failure teardown for the MSBuildDeviceIntegration `DeployToDevice`/`InstallAndRun` tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
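
For readers who want to reproduce the two patterns locally, the retry-then-fail-fast install and the best-effort capture can be sketched in plain shell. This is an illustration only: `retry` and `best_effort` are hypothetical helper names that do not exist in the repository, and the sketch does not claim to reflect how Azure Pipelines implements `retryCountOnTaskFailure` or `continueOnError` internally.

```shell
#!/usr/bin/env bash
# Illustrative sketch; `retry` and `best_effort` are hypothetical helpers.

# retry N cmd...: run cmd once, then up to N more times if it keeps failing.
# Returns non-zero once every attempt has failed, so the caller can stop
# (fail fast) instead of continuing against a device with no app installed.
retry() {
  local retries="$1"; shift
  local attempt=0
  until "$@"; do
    if [ "$attempt" -ge "$retries" ]; then
      echo "giving up after $((attempt + 1)) attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    echo "attempt $attempt failed; retrying: $*" >&2
  done
}

# best_effort cmd...: run cmd, report a failure, but never propagate it.
# This is the same shape as the template's `adb logcat -d ... || echo ...`.
best_effort() {
  "$@" || echo "best-effort step failed (ignored): $*"
  return 0
}
```

A caller would wrap the install in `retry 3 ...` and abort when it returns non-zero, while wrapping diagnostics such as `adb logcat -d` in `best_effort` so that an offline device cannot turn a diagnostic step into a second failure.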
diff --git a/build-tools/automation/yaml-templates/apk-instrumentation.yaml b/build-tools/automation/yaml-templates/apk-instrumentation.yaml
@@ -13,6 +13,13 @@ parameters:
 steps:
   # Build and install the APK to the device first, so that ComputeRunArguments
   # can resolve the package name from AndroidManifest.xml.
+  #
+  # Retry the install up to 3 times: the emulator/ADB connection occasionally
+  # drops mid-install (XAGCPU7000 "device offline") and a retry usually recovers.
+  # If every attempt fails, continueOnError: false fails the lane immediately -
+  # there is no point running the tests against a device where the app was never
+  # installed (that only wastes CI time and produces a misleading
+  # INSTRUMENTATION_FAILED). See dotnet/android#11830.
 - template: /build-tools/automation/yaml-templates/run-dotnet-preview.yaml
   parameters:
     configuration: ${{ parameters.buildConfiguration }}
@@ -24,7 +31,30 @@ steps:
       -bl:${{ parameters.xaSourcePath }}/bin/Test${{ parameters.configuration }}/build-${{ parameters.testName }}.binlog
       -c ${{ parameters.configuration }} ${{ parameters.extraBuildArgs }}
     condition: ${{ parameters.condition }}
-    continueOnError: true
+    retryCountOnTaskFailure: 3
+    continueOnError: false
+
+  # If the install above failed, snapshot device state so we can classify WHY on
+  # the next iteration instead of guessing: connectivity (adb devices/get-state),
+  # disk pressure (df), storage-service readiness (the StorageStatsManager NPE
+  # from install-create), boot completion, and how many test apps have piled up.
+  # Best-effort and time-bounded so it never hangs or fails the lane itself.
+- script: |
+    set +e
+    echo "===== adb devices -l ====="; adb devices -l
+    echo "===== adb get-state ====="; adb get-state
+    echo "===== boot completion ====="; adb shell getprop sys.boot_completed; adb shell getprop dev.bootcomplete
+    echo "===== df /data ====="; adb shell df /data
+    echo "===== df /storage/emulated/0 ====="; adb shell df /storage/emulated/0
+    echo "===== dumpsys diskstats ====="; adb shell dumpsys diskstats
+    echo "===== storaged (storage stats service) ====="; adb shell dumpsys storaged
+    echo "===== installed 3rd-party packages ====="; adb shell pm list packages -3
+    echo "===== end device state ====="
+    exit 0
+  displayName: diagnose install failure ${{ parameters.testName }}
+  condition: failed()
+  continueOnError: true
+  timeoutInMinutes: 3
 
   # Run dotnet test from the project directory so it finds the project-local
   # global.json with "test": { "runner": "Microsoft.Testing.Platform" }.
@@ -63,13 +93,19 @@ steps:
   # failures, JNI crashes, instrumentation errors, etc. The Android-side
   # instrumentation output (Log.Info/Log.Error from TestInstrumentation) and
   # any UnsatisfiedLinkError / dlopen messages only appear in logcat.
+  #
+  # Use always() (not the default succeeded()) so we still capture logcat when an
+  # earlier step failed or the job was canceled - in particular a failed -t:Install
+  # (which now fails the lane fast) or a hung/timed-out test run. Losing logcat in
+  # exactly those cases would defeat the diagnostics this template exists for; the
+  # capture is best-effort (continueOnError + '|| echo'). See dotnet/android#11830.
 - script: >
     DEST="$(Build.StagingDirectory)/Test${{ parameters.configuration }}/${{ parameters.artifactFolder }}/" &&
     mkdir -p "$DEST" &&
     adb logcat -d > "$DEST/logcat-${{ parameters.testName }}.txt" ||
     echo "logcat capture failed"
   displayName: capture logcat ${{ parameters.testName }}
-  condition: ${{ parameters.condition }}
+  condition: always()
   continueOnError: true
 
 - task: PublishTestResults@2