# Benchmark Methodology

This document covers how benchmarks in this project are designed, what hardware conditions are
required for trustworthy results, why the build configuration is the way it is, how to read the
output metrics, and what the numbers cannot tell you.

---

## Device specification

### CI environment

CI runs macrobenchmarks on a GitHub-hosted runner using the
[`reactivecircus/android-emulator-runner`](https://github.com/ReactiveCircus/android-emulator-runner)
action:

| Property | Value |
|---|---|
| API level | 34 (Android 14) |
| Architecture | x86_64 |
| Target | default (AOSP, no Play Services) |
| Boot timeout | 600 s |
| Compilation mode | `CompilationMode.None()` — JIT only, no AOT |

Emulator results are inherently noisier than physical hardware (see [Limitations](#limitations)).
The emulator configuration intentionally suppresses the two errors the benchmark runner would
otherwise emit:

```kotlin
// benchmarks/build.gradle.kts
testInstrumentationRunnerArguments["androidx.benchmark.suppressErrors"] =
    "EMULATOR,DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION"
```

`EMULATOR` silences the "running on emulator" error. `DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION`
silences a permissions-check false positive that appears on API 34 emulators. Neither suppression
affects what is actually measured.
### Physical device setup

Running on physical hardware reduces variance significantly. Before measuring, lock the CPU and
GPU clocks so the SoC cannot throttle or boost mid-run.

**Prerequisites:** the device must be rooted or running a userdebug/eng build. Stock consumer
devices cannot lock clocks.

```bash
# 1. Connect the device and verify adb access
adb devices

# 2. Lock clocks using the AndroidX Benchmark Gradle task
#    (provided by the androidx.benchmark Gradle plugin applied to the benchmark module)
./gradlew :benchmarks:lockClocks

# 3. Run the benchmarks
./gradlew :benchmarks:connectedBenchmarkAndroidTest

# 4. Unlock clocks when done (leaving clocks locked degrades battery life)
./gradlew :benchmarks:unlockClocks
```

`lockClocks` pins the CPU frequency to a fixed mid-range value (not the maximum), disables the
interactive governor, and locks the GPU where the kernel exposes a control node. The fixed
frequency is intentionally below peak so thermal headroom is preserved across a full benchmark
run.

**Recommended device properties for reproducible results:**

- Disable Wi-Fi and mobile data (reduces background wakeups).
- Charge to ≥ 80 % or keep the device plugged in (battery-saver policies alter scheduling at
  low charge).
- Turn off notification delivery from other apps (`adb shell settings put global zen_mode 1`).
- Keep the display on (`adb shell svc power stayon true`) — some devices throttle when the
  screen is off.

---

## Why nonDebuggable builds are required

All macrobenchmarks in this project run against the `benchmark` build type, defined in
`app/build.gradle.kts`:

```kotlin
create("benchmark") {
    initWith(getByName("release"))                     // inherits minification + R8
    signingConfig = signingConfigs.getByName("debug")  // debug cert for CI
    isDebuggable = false
}
```

`isDebuggable = false` is not optional. Debug builds carry several sources of overhead that
inflate every metric and make before/after comparisons unreliable:

| Overhead source | Effect on benchmarks |
|---|---|
| JDWP agent always attached | Adds ~5–15 ms to every cold start; unpredictable per-frame cost |
| JIT profiling hooks | Extra bookkeeping per method call; suppresses some JIT optimisations |
| `StrictMode` and debug assertions | Extra allocations and thread checks on every UI operation |
| Compose `isDebugInspectorInfoEnabled` | Turns on slot-table inspection for Layout Inspector; adds recomposition overhead |
| R8 / ProGuard disabled | Dead code not stripped; more class loading; larger DEX → slower first-frame JIT |

The benchmark runner enforces this: if `isDebuggable = true`, it emits a `DEBUG_BUILD` error and
refuses to record results (unless you add `"DEBUG_BUILD"` to `suppressErrors`, which would
invalidate the data).

The `benchmark` build type keeps debug signing so the APK can be installed on CI without a
release keystore. The signing certificate has no effect on runtime performance.

---

## How to interpret frame timing metrics

`ScrollBenchmark` uses `FrameTimingMetric`, which records a distribution of frame durations over
5 iterations of 5 down-scrolls + 5 up-scrolls. The output JSON contains these fields per
benchmark:

```
frameDurationCpuMs.p50 — median frame duration (CPU time only)
frameDurationCpuMs.p90 — 90th percentile
frameDurationCpuMs.p95 — 95th percentile
frameDurationCpuMs.p99 — 99th percentile
frameOverrunMs         — signed wall-clock budget overrun (hardware-timestamp devices only)
jankyFrameCount        — frames that exceeded the 16.67 ms / 60 fps deadline
jankyFramePercent      — janky frames as a share of total frames rendered
```
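
The percentile fields are plain order statistics over the recorded frame durations. A minimal
sketch of how such values are derived, in plain Python; the nearest-rank method and the sample
numbers are illustrative assumptions, and the benchmark library's exact interpolation may differ:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Hypothetical frame durations (ms) from one scroll pass.
frames = [4.1, 4.3, 4.0, 5.2, 4.4, 4.2, 12.8, 4.5, 4.1, 18.3]
p50 = percentile(frames, 50)  # typical frame
p99 = percentile(frames, 99)  # worst-case frame
```

Note how a single 18.3 ms spike dominates p99 while leaving p50 untouched; that asymmetry is
exactly why the sections below gate on p99.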

### Reading the percentiles

Think of the percentile distribution as a story about different kinds of rendering problems:

**p50** reflects steady-state cost — what a typical frame costs when nothing unusual is happening.
A high p50 (> 8 ms on a 60 Hz display) means the per-frame work budget is already half-consumed
before any hiccup occurs. The optimised scroll screen targets a p50 of around 4–6 ms.

**p90** reflects how well the app handles light variation — minor GC pauses, occasional longer
layout passes, background service wakeups. A p90 below 10 ms means nine out of ten frames are
comfortable even under normal system noise.

**p99** is the headline regression gate in this project. It captures the worst 1 % of frames —
the frames a user would perceive as a visible stutter. The CI threshold is **16.0 ms**:

```python
# benchmarks/BenchmarkResultsParser.py
FRAME_P99_THRESHOLD_MS = 16.0
```

This is intentionally set below the 16.67 ms budget for 60 fps (a margin of roughly 4 %). The
reasoning: if p99 sits right at the deadline, a single additional GC pause or thermal event
pushes real-world p99 over the cliff, and even a p99 of 16 ms leaves very little headroom.

The threshold is only enforced for `scrollAnimatedList_optimized`. The unoptimized variant is
allowed to exceed it — its purpose is to confirm the baseline is genuinely slow, not to pass CI.
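
The selective gating can be pictured as a small check like the following. This is a simplified
sketch, not the actual `BenchmarkResultsParser.py`; the function name, the dict shape, and the
failure-message format are assumptions:

```python
FRAME_P99_THRESHOLD_MS = 16.0  # mirrors the CI constant

def gate_frame_p99(results: dict[str, float]) -> list[str]:
    """Return failure messages for gated benchmarks whose p99 exceeds the threshold."""
    gated = {"scrollAnimatedList_optimized"}  # the unoptimized variant is deliberately exempt
    failures = []
    for name, p99 in results.items():
        if name in gated and p99 > FRAME_P99_THRESHOLD_MS:
            failures.append(f"{name}: p99 {p99:.1f} ms > {FRAME_P99_THRESHOLD_MS} ms")
    return failures

failures = gate_frame_p99({
    "scrollAnimatedList_optimized": 17.2,    # over threshold: would fail CI
    "scrollAnimatedList_unoptimized": 24.9,  # over threshold but exempt by design
})
```

Only the optimized entry produces a failure; the unoptimized baseline is reported but never
gated.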

**p95** is not gated but is worth watching: a large gap between p90 and p95 typically signals
infrequent but expensive allocations (bitmaps, large `List` copies) rather than per-frame waste.

### `frameOverrunMs` vs `frameDurationCpuMs`

`frameDurationCpuMs` measures only CPU-side work (including the RenderThread). It is available
on all devices. `frameOverrunMs` measures wall-clock overrun relative to the frame deadline and
requires hardware GPU-timestamp support (most Pixel devices, some Snapdragon devices). On the CI
emulator, `frameOverrunMs` is absent from the JSON; do not treat its absence as a failure.
### `jankyFrameCount` vs p99

These are complementary, not redundant. p99 tells you how bad the worst frames are;
`jankyFrameCount` tells you how many frames crossed the 16.67 ms deadline. A run can have a
low p99 yet a non-zero jank count if a handful of frames spiked just barely over the deadline.
For 60 Hz content, a jank count of zero is the target; one or two janky frames per 100 is
acceptable on the emulator, where clocks cannot be locked.
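
The distinction can be made concrete with a toy calculation in plain Python (the frame values
are invented for illustration; the 16.67 ms deadline matches the 60 fps budget above):

```python
DEADLINE_MS = 16.67  # 60 fps frame budget

def jank_stats(frames: list[float]) -> tuple[int, float]:
    """Count frames over the deadline and their share of all frames rendered."""
    janky = sum(1 for f in frames if f > DEADLINE_MS)
    return janky, 100.0 * janky / len(frames)

# 100 frames: the bulk are fast, but two barely cross the deadline.
frames = [6.0] * 98 + [16.9, 17.1]
count, percent = jank_stats(frames)
```

Here the typical frame is comfortable, yet the jank count is non-zero (2 per 100), which is why
both numbers are worth reading together.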

---

## Startup timing metrics

`StartupBenchmark` and `AppStartupBenchmark` use `StartupTimingMetric` across 10 iterations:

```
timeToInitialDisplayMs — TTID: system-measured time from process start to first frame drawn
timeToFullDisplayMs    — TTFD: time until the app calls reportFullyDrawn()
```

**TTID** is reported by the system and cannot be manipulated by the app. It ends when the window
surface receives its first rendered frame — even if that frame shows only a blank background.

**TTFD** is the app-reported milestone. `MainActivity` calls `reportFullyDrawn()` after the
Compose layout pass completes and the feed `LazyColumn` is scrollable. TTFD is absent for
`StartupMode.HOT` because `onCreate()` is not called in that mode and `reportFullyDrawn()` is
never invoked.

The CI cold-start threshold is **800 ms TTID**:

```python
COLD_START_THRESHOLD_MS = 800
```

The optimised build targets 150–350 ms; the 800 ms gate is a wide safety margin designed to
catch regressions (e.g. an SDK initialisation accidentally moved back onto the main thread)
rather than to certify production quality.
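
A sketch of how such a gate might consume the 10 TTID samples, in plain Python. Whether the real
parser gates on the median or some other statistic is an assumption here; the function name and
sample values are invented for illustration:

```python
from statistics import median

COLD_START_THRESHOLD_MS = 800  # mirrors the CI constant

def cold_start_passes(ttid_samples_ms: list[float]) -> bool:
    """Gate on the median TTID so a single slow outlier cannot fail the whole run."""
    return median(ttid_samples_ms) <= COLD_START_THRESHOLD_MS

# Hypothetical 10-iteration run: the first iteration is slow while the JIT profiles.
samples = [612.0, 341.0, 298.0, 305.0, 287.0, 310.0, 296.0, 322.0, 301.0, 290.0]
ok = cold_start_passes(samples)
```

Using a robust statistic keeps a one-off 612 ms warm-up iteration from failing a run whose
typical cold start sits near 300 ms.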

The startup tests use `CompilationMode.None()` (JIT only, no AOT pre-compilation). This produces
the worst-case startup time — the same condition a user experiences on first install, before ART
has had time to profile and compile. Baseline Profiles are generated separately via
`./gradlew :app:generateBaselineProfile` and are measured independently.

---

## Limitations and variance expectations

### Emulator variance

CPU clock locking is not possible on the emulator. The emulator shares host CPU cores with other
processes and is subject to the host scheduler. Expect ±30–50 ms variance on startup metrics
and ±2–4 ms variance on p99 frame duration across runs. This is why:

- Startup uses 10 iterations (more samples reduce the impact of outliers).
- Scroll uses 5 iterations (each iteration's frame metrics aggregate hundreds of frames, so
  fewer iterations are needed for stable statistics).
- The CI threshold for cold start (800 ms) is set roughly 3× above the measured optimised value
  (~250 ms) to absorb emulator noise.
### `CompilationMode.None()` and JIT behaviour

All benchmarks in this project run with `CompilationMode.None()`. JIT compilation happens during
the benchmark run, which means the first iteration is always slower (the JIT is still profiling)
and later iterations are faster (hot methods have been compiled). The benchmark library accounts
for this by recording all iterations and reporting the full distribution — compare p50 and p90
across multiple runs rather than trusting a single run's median.
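
The warm-up effect on summary statistics is easy to see with toy numbers (invented for
illustration; plain Python):

```python
from statistics import mean, median

# Hypothetical per-iteration p50 values (ms): the first iteration is slow while the JIT profiles.
runs = [9.8, 5.1, 4.9, 5.0, 4.8]
avg = mean(runs)     # pulled upward by the warm-up iteration
mid = median(runs)   # robust to it
```

The mean is dragged toward the cold first iteration while the median reflects the settled
steady state, which is why distribution-aware reading beats single-number comparisons.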

If you switch to `CompilationMode.Full()` (AOT), numbers will be lower and more consistent, but
they will not represent install-fresh behaviour. `CompilationMode.None()` is the right choice
for detecting regressions under production conditions.

### Thermal throttling on physical devices

Even with locked clocks, sustained benchmark runs on physical hardware can trigger thermal
throttling if the device approaches its temperature limit. Signs of throttling:

- Startup times that increase monotonically across iterations (not random noise).
- Frame p99 that is higher for `scrollAnimatedList_optimized` than for
  `scrollAnimatedList_unoptimized` (impossible without throttling — the unoptimized path does
  more work).

If you observe these patterns, let the device cool for 5–10 minutes and re-run. Charging via
USB-C power delivery can worsen thermals on some devices; consider unplugging during the run.
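
The first sign can be checked mechanically. A sketch of a throttling heuristic in plain Python
(the function and its threshold-free "strictly increasing" rule are assumptions, not part of the
project's tooling):

```python
def looks_like_throttling(startup_ms: list[float]) -> bool:
    """Heuristic: every iteration slower than the last suggests heat build-up, not noise."""
    return all(b > a for a, b in zip(startup_ms, startup_ms[1:]))

throttled = looks_like_throttling([310.0, 334.0, 361.0, 402.0, 455.0])  # steadily climbing
noisy     = looks_like_throttling([310.0, 298.0, 347.0, 301.0, 322.0])  # random jitter
```

Random scheduler noise goes up and down between iterations; a strictly monotone climb across a
whole run is the thermal signature worth pausing for.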

### What the numbers do and do not represent

| The numbers DO reflect | The numbers DO NOT reflect |
|---|---|
| Regressions introduced in the code under test | Absolute production performance on a user's device |
| Relative improvement from a specific optimisation | Performance under network I/O or database load |
| Worst-case startup before ART profiling | Performance after a user's device has profiled and compiled the app |
| Per-frame Compose rendering cost | GPU-bound rendering (these benchmarks are CPU-bound) |
| Recomposition pass count (unit-test metric) | Number of composables recomposed within a single pass |

Recomposition counts in `RecompositionBenchmark` measure `Recomposer.changeCount` — the number
of complete composition passes applied, not the number of individual composables that re-ran.
One click that triggers one state change = one pass = `delta` of 1 in the optimised build.
The assertion `assertEquals(1L, delta)` verifies that no cascading second pass was triggered;
it does not verify which composables were skipped within that pass. Use Layout Inspector's
recomposition highlighting to inspect per-composable skip behaviour.