Commit 36df374

[ci] Retry on-device install, fail fast, diagnose failures, always capture logcat (#11831)
## Description

The on-device package-test template `apk-instrumentation.yaml` builds+installs the app with `-t:Install` and then runs `dotnet test`. When the emulator drops off ADB mid-install (`error XAGCPU7000: Mono.AndroidTools.AdbException: device offline`) the previous behavior was doubly bad:

1. The install ran with `continueOnError: true`, so the failure was swallowed and the pipeline still ran `dotnet test` against a device where the app was **never installed**, producing a misleading `INSTRUMENTATION_FAILED` and only going red at the final `fail if any issues occurred` gate, after wasting time.
2. A single transient ADB blip failed the whole lane with no attempt to recover, and we captured nothing to explain *why* the install failed.

## Changes

- **Retry the install** via the `run-dotnet-preview` template's existing `retryCountOnTaskFailure: 3`, so a transient `device offline` blip can recover on a retry instead of failing the lane.
- **Fail fast** (`continueOnError: false`) once retries are exhausted: the lane fails immediately and the `run` step (whose `condition` defaults to `succeeded()`) is skipped, since there is no point running tests against a device with no app installed.
- **Diagnose install failures**: a new `failed()`-gated, time-bounded step right after install snapshots device state so we can *classify* the failure next time instead of guessing: connectivity (`adb devices -l` / `get-state`), disk pressure (`df`, `dumpsys diskstats`), storage-service readiness (`dumpsys storaged`, for the `StorageStatsManager` NPE seen during `install-create`), boot completion, and accumulated test apps (`pm list packages -3`).
- **Never lose logcat**: the `capture logcat` step's condition is changed from the default `succeeded()` to `always()`, so the best-effort `adb logcat -d` runs on success, on a failed step (e.g. fail-fast install), **and** on job cancellation/timeout (e.g. a hung test). The capture is best-effort (`continueOnError: true` + `|| echo`) and tolerates an offline device. This matches the step's "Always capture full device logcat" intent and addresses review feedback.

## Why `always()` (logcat) and not just removing the condition

Omitting `condition:` is **not** neutral: an Azure Pipelines step with no condition defaults to `succeeded()`, i.e. it is skipped as soon as any prior step fails. That is exactly the old, buggy behavior. `succeededOrFailed()` fixes the failed-step case but still skips on cancellation (a job-level timeout counts as cancellation), which is precisely when a hung test's logcat is most valuable, so `always()` is used.

## Behavior change

- A genuinely unrecoverable install failure now fails the job at the install step (after retries) instead of the final gate.
- Because the flavors (Debug / aab / NoAab / CoreCLR / …) share one job and downstream steps are gated on `succeeded()`, a hard install failure in one flavor will now skip the remaining flavors in that job. This is acceptable: a device offline across all retries is very unlikely to install the next flavor either, and fail-fast gives a clear signal.

## Context

Tracking issue for the underlying emulator/ADB install flakiness (`device offline` and the `StorageStatsManager` NPE): #11830

Observed on build [1488505](https://dev.azure.com/dnceng-public/public/_build/results?buildId=1488505) (`Package Tests > macOS > Tests > APKs 1`).

A companion PR adds an equivalent device-state snapshot to the `DeviceTest` on-failure teardown for the MSBuildDeviceIntegration `DeployToDevice`/`InstallAndRun` tests.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
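
The condition choices described above can be checked against a small truth table. The following is only a toy model in shell, not pipeline code: `step_runs` is a hypothetical helper, and the status names follow the `Agent.JobStatus` values that the Azure Pipelines documentation uses to define these functions for steps. `SucceededWithIssues` is, to my understanding, the status a failed step leaves behind when it has `continueOnError: true`, which is why the old install failure did not stop `dotnet test`.

```shell
#!/usr/bin/env bash
# Toy model of Azure Pipelines step conditions (illustration only).
# step_runs CONDITION JOB_STATUS prints "run" or "skip".
# step_runs is a hypothetical helper, not part of any pipeline API.
step_runs() {
  local condition="$1" status="$2"
  case "$condition:$status" in
    "succeeded():Succeeded"|"succeeded():SucceededWithIssues")
      echo run ;;
    "failed():Failed")
      echo run ;;
    "succeededOrFailed():Succeeded"|"succeededOrFailed():SucceededWithIssues"|"succeededOrFailed():Failed")
      echo run ;;
    "always():"*)
      # always() runs regardless of status, including Canceled.
      echo run ;;
    *)
      echo skip ;;
  esac
}

# Print the full table: one row per (job status, condition) pair.
for status in Succeeded SucceededWithIssues Failed Canceled; do
  for condition in "succeeded()" "failed()" "succeededOrFailed()" "always()"; do
    printf '%-20s %-20s %s\n' "$status" "$condition" "$(step_runs "$condition" "$status")"
  done
done
```

Reading the `Canceled` rows shows why `always()` was chosen for logcat: it is the only one of the four that still runs after a job-level timeout.
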
1 parent 4f7efd5 commit 36df374

1 file changed

Lines changed: 38 additions & 2 deletions

build-tools/automation/yaml-templates/apk-instrumentation.yaml

@@ -13,6 +13,13 @@ parameters:
 steps:
 # Build and install the APK to the device first, so that ComputeRunArguments
 # can resolve the package name from AndroidManifest.xml.
+#
+# Retry the install up to 3 times: the emulator/ADB connection occasionally
+# drops mid-install (XAGCPU7000 "device offline") and a retry usually recovers.
+# If every attempt fails, continueOnError: false fails the lane immediately -
+# there is no point running the tests against a device where the app was never
+# installed (that only wastes CI time and produces a misleading
+# INSTRUMENTATION_FAILED). See dotnet/android#11830.
 - template: /build-tools/automation/yaml-templates/run-dotnet-preview.yaml
   parameters:
     configuration: ${{ parameters.buildConfiguration }}
@@ -24,7 +31,30 @@ steps:
       -bl:${{ parameters.xaSourcePath }}/bin/Test${{ parameters.configuration }}/build-${{ parameters.testName }}.binlog
       -c ${{ parameters.configuration }} ${{ parameters.extraBuildArgs }}
     condition: ${{ parameters.condition }}
-    continueOnError: true
+    retryCountOnTaskFailure: 3
+    continueOnError: false
+
+# If the install above failed, snapshot device state so we can classify WHY on
+# the next iteration instead of guessing: connectivity (adb devices/get-state),
+# disk pressure (df), storage-service readiness (the StorageStatsManager NPE
+# from install-create), boot completion, and how many test apps have piled up.
+# Best-effort and time-bounded so it never hangs or fails the lane itself.
+- script: |
+    set +e
+    echo "===== adb devices -l ====="; adb devices -l
+    echo "===== adb get-state ====="; adb get-state
+    echo "===== boot completion ====="; adb shell getprop sys.boot_completed; adb shell getprop dev.bootcomplete
+    echo "===== df /data ====="; adb shell df /data
+    echo "===== df /storage/emulated/0 ====="; adb shell df /storage/emulated/0
+    echo "===== dumpsys diskstats ====="; adb shell dumpsys diskstats
+    echo "===== storaged (storage stats service) ====="; adb shell dumpsys storaged
+    echo "===== installed 3rd-party packages ====="; adb shell pm list packages -3
+    echo "===== end device state ====="
+    exit 0
+  displayName: diagnose install failure ${{ parameters.testName }}
+  condition: failed()
+  continueOnError: true
+  timeoutInMinutes: 3

 # Run dotnet test from the project directory so it finds the project-local
 # global.json with "test": { "runner": "Microsoft.Testing.Platform" }.
@@ -63,13 +93,19 @@ steps:
 # failures, JNI crashes, instrumentation errors, etc. The Android-side
 # instrumentation output (Log.Info/Log.Error from TestInstrumentation) and
 # any UnsatisfiedLinkError / dlopen messages only appear in logcat.
+#
+# Use always() (not the default succeeded()) so we still capture logcat when an
+# earlier step failed or the job was canceled - in particular a failed -t:Install
+# (which now fails the lane fast) or a hung/timed-out test run. Losing logcat in
+# exactly those cases would defeat the diagnostics this template exists for; the
+# capture is best-effort (continueOnError + '|| echo'). See dotnet/android#11830.
 - script: >
     DEST="$(Build.StagingDirectory)/Test${{ parameters.configuration }}/${{ parameters.artifactFolder }}/" &&
     mkdir -p "$DEST" &&
     adb logcat -d > "$DEST/logcat-${{ parameters.testName }}.txt" ||
     echo "logcat capture failed"
   displayName: capture logcat ${{ parameters.testName }}
-  condition: ${{ parameters.condition }}
+  condition: always()
   continueOnError: true

 - task: PublishTestResults@2
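
The `capture logcat` step relies on shell operator grouping: `&&` and `||` have equal precedence and associate left to right, so `a && b && c || echo ...` reads as `(a && b && c) || echo ...`. The sketch below illustrates that idiom in isolation. `best_effort_capture` and its `capture` argument are hypothetical stand-ins for the real `adb logcat -d` call, which needs a device; the output file name is shortened for the example.

```shell
#!/usr/bin/env bash
# Sketch of the best-effort chain used by the logcat step (illustration only).
# best_effort_capture DEST CAPTURE_CMD
#   DEST        directory to write into (created if missing)
#   CAPTURE_CMD hypothetical stand-in for `adb logcat -d`
best_effort_capture() {
  local dest="$1" capture="$2"
  # If mkdir or the capture fails, the echo runs and the chain still
  # returns 0, so the failure is reported without failing the caller.
  mkdir -p "$dest" &&
  "$capture" > "$dest/logcat.txt" ||
  echo "logcat capture failed"
}

tmp="$(mktemp -d)"
# Healthy device: the stand-in command succeeds, nothing is printed.
best_effort_capture "$tmp/online" true
echo "status with working capture: $?"
# Offline device: the stand-in command fails, the fallback message prints.
best_effort_capture "$tmp/offline" false
echo "status with failing capture: $?"
rm -rf "$tmp"
```

Both calls report status 0, which is the point of the idiom: combined with `continueOnError: true`, an offline device cannot turn the capture step itself into a new failure. One side effect worth knowing is that the redirection creates the output file before the command runs, so a failed capture can leave an empty log file behind.
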
