Commit 7afcba2

Merge pull request #4 from mohdaquib/feature/after-benchmarks
Feature/after benchmarks
2 parents 7fa29cc + 7cc8237 commit 7afcba2

14 files changed

Lines changed: 990 additions & 91 deletions

.github/workflows/ci.yml

Lines changed: 20 additions & 2 deletions
@@ -70,13 +70,31 @@ jobs:
           target: default
           arch: x86_64
           emulator-boot-timeout: 600
-          script: ./gradlew :benchmarks:connectedBenchmarkAndroidTest
+          disable-animations: true
+          emulator-options: -no-window -no-audio -no-boot-anim -gpu swiftshader_indirect
+          script: |
+            # Wait for boot to complete
+            adb wait-for-device
+            adb shell 'while [[ -z $(getprop sys.boot_completed) ]]; do sleep 1; done'
+            sleep 5
+
+            # Disable animations
+            adb shell settings put global window_animation_scale 0
+            adb shell settings put global transition_animation_scale 0
+            adb shell settings put global animator_duration_scale 0
+
+            # Verify device is responsive
+            adb shell getprop ro.build.version.release
+
+            # Run tests
+            ./gradlew :benchmarks:connectedBenchmarkBenchmarkAndroidTest

       - name: Parse Benchmark Results
         if: always()
         run: |
           echo "### Macrobenchmark Results" >> $GITHUB_STEP_SUMMARY
-          python3 benchmarks/BenchmarkResultsParser.py >> $GITHUB_STEP_SUMMARY
+          python3 benchmarks/BenchmarkResultsParser.py | tee -a $GITHUB_STEP_SUMMARY
+          exit ${PIPESTATUS[0]}

       - name: Upload benchmark JSON
         if: always()

METHODOLOGY.md

Lines changed: 260 additions & 0 deletions
@@ -0,0 +1,260 @@
# Benchmark Methodology

This document covers how benchmarks in this project are designed, what hardware conditions are
required for trustworthy results, why the build configuration is the way it is, how to read the
output metrics, and what the numbers cannot tell you.

---

## Device specification

### CI environment

CI runs macrobenchmarks on a GitHub-hosted runner using the
[`reactivecircus/android-emulator-runner`](https://github.com/ReactiveCircus/android-emulator-runner)
action:

| Property | Value |
|---|---|
| API level | 34 (Android 14) |
| Architecture | x86_64 |
| Target | default (AOSP, no Play Services) |
| Boot timeout | 600 s |
| Compilation mode | `CompilationMode.None()` (JIT only, no AOT) |

Emulator results are inherently noisier than physical hardware (see
[Limitations and variance expectations](#limitations-and-variance-expectations)).
The emulator configuration intentionally suppresses the two errors the benchmark runner would
otherwise emit:

```kotlin
// benchmarks/build.gradle.kts
testInstrumentationRunnerArguments["androidx.benchmark.suppressErrors"] =
    "EMULATOR,DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION"
```

`EMULATOR` silences the "running on emulator" error. `DYNAMIC_RECEIVER_NOT_EXPORTED_PERMISSION`
silences a permissions-check false positive that appears on API 34 emulators. Neither suppression
affects what is actually measured.

### Physical device setup

Running on physical hardware reduces variance significantly. Before measuring, lock the CPU and
GPU clocks so the SoC cannot throttle or boost mid-run.

**Prerequisites:** the device must be rooted or running a userdebug/eng build. Stock consumer
devices cannot lock clocks.

```bash
# 1. Connect the device and verify adb access
adb devices

# 2. Lock clocks using the AndroidX Benchmark Gradle task
#    (registered by the androidx.benchmark Gradle plugin, so the module must apply it)
./gradlew :benchmarks:lockClocks

# 3. Run the benchmarks
./gradlew :benchmarks:connectedBenchmarkAndroidTest

# 4. Unlock clocks when done (skipping this degrades battery life)
./gradlew :benchmarks:unlockClocks
```

`lockClocks` pins CPU frequency to a fixed mid-range value (not max), disables the interactive
governor, and locks the GPU where the kernel exposes a control node. The fixed frequency is
intentionally below peak so thermal headroom is preserved across a full benchmark run.

**Recommended device properties for reproducible results:**

- Disable Wi-Fi and mobile data (reduces background wakeups).
- Charge to ≥ 80 % or keep plugged in (battery saver policies alter scheduling at low charge).
- Turn off all notification delivery from other apps
  (`adb shell settings put global zen_mode 1`).
- Keep display on (`adb shell svc power stayon true`); some devices throttle when the
  screen is off.

---

## Why nonDebuggable builds are required

All macrobenchmarks in this project run against the `benchmark` build type, defined in
`app/build.gradle.kts`:

```kotlin
create("benchmark") {
    initWith(getByName("release"))                     // inherits minification + R8
    signingConfig = signingConfigs.getByName("debug")  // debug cert for CI
    isDebuggable = false
}
```

`isDebuggable = false` is not optional. Debug builds carry several sources of overhead that
inflate every metric and make before/after comparisons unreliable:

| Overhead source | Effect on benchmarks |
|---|---|
| JDWP agent always attached | Adds ~5–15 ms to every cold start; unpredictable per-frame cost |
| JIT profiling hooks | Extra bookkeeping per method call; suppresses some JIT optimisations |
| `StrictMode` and debug assertions | Extra allocations and thread checks on every UI operation |
| Compose `isDebugInspectorInfoEnabled` | Turns on slot-table inspection for Layout Inspector; adds recomposition overhead |
| R8 / ProGuard disabled | Dead code not stripped; more class loading; larger DEX → slower first-frame JIT |

The benchmark runner enforces this: if `isDebuggable = true`, it emits a `DEBUGGABLE` error and
refuses to record results (unless you add `"DEBUGGABLE"` to `suppressErrors`, which would
invalidate the data).

The `benchmark` build type keeps debug signing so the APK can be installed on CI without a
release keystore. The signing cert has no effect on runtime performance.

---

## How to interpret frame timing metrics

`ScrollBenchmark` uses `FrameTimingMetric`, which records a distribution of frame durations over
5 iterations of 5 down-scrolls + 5 up-scrolls. The output JSON contains these fields per
benchmark:

```
frameDurationCpuMs.p50   median frame duration (CPU time only)
frameDurationCpuMs.p90   90th percentile
frameDurationCpuMs.p95   95th percentile
frameDurationCpuMs.p99   99th percentile
frameOverrunMs           signed wall-clock budget overrun (hardware timestamp devices only)
jankyFrameCount          frames that exceeded the 16.67 ms / 60 fps deadline
jankyFramePercent        janky frames as a share of total frames rendered
```

### Reading the percentiles

Each percentile points to a different kind of rendering problem:

**p50** reflects steady-state cost: what a typical frame costs when nothing unusual is happening.
A high p50 (> 8 ms on a 60 Hz display) means roughly half of the 16.67 ms frame budget is consumed
before any hiccup occurs. The optimised scroll screen targets p50 around 4–6 ms.

**p90** reflects how well the app handles light variation: minor GC pauses, occasional longer
layout passes, background service wakeups. A p90 below 10 ms means nine out of ten frames are
comfortable even under normal system noise.

**p99** is the headline regression gate in this project. It captures the worst 1 % of frames,
the ones a user would perceive as a visible stutter. The CI threshold is **16.0 ms**:

```python
# benchmarks/BenchmarkResultsParser.py
FRAME_P99_THRESHOLD_MS = 16.0
```

This is intentionally about 4 % (0.67 ms) tighter than the 16.67 ms budget for 60 fps. The
reasoning: if p99 is already at the deadline, a single additional GC pause or thermal event
pushes real-world p99 over the cliff. Even a p99 of 16 ms leaves almost no headroom.

The threshold is only enforced for `scrollAnimatedList_optimized`. The unoptimized variant is
allowed to exceed it; its purpose is to confirm the baseline is genuinely slow, not to pass CI.
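
As a rough illustration of how such a gate could work, the sketch below checks p99 against the threshold for the optimised benchmark only and computes how much tighter the threshold is than the 60 fps budget. The JSON layout (`benchmarks[].name`, `sampledMetrics.frameDurationCpuMs.P99`) and the helper names `headroom_percent` and `p99_gate_failures` are assumptions made for this example; the project's actual `BenchmarkResultsParser.py` is not shown here and may differ.

```python
FRAME_BUDGET_MS = 1000 / 60      # 16.67 ms per frame at 60 fps
FRAME_P99_THRESHOLD_MS = 16.0    # value quoted above
GATED_BENCHMARKS = {"scrollAnimatedList_optimized"}


def headroom_percent(threshold_ms, budget_ms=FRAME_BUDGET_MS):
    """How much tighter the threshold is than the frame budget, in percent."""
    return (budget_ms - threshold_ms) / budget_ms * 100


def p99_gate_failures(results):
    """Return the names of gated benchmarks whose p99 exceeds the threshold.

    Assumes (hypothetically) that each entry looks like
    {"name": ..., "sampledMetrics": {"frameDurationCpuMs": {"P99": ...}}}.
    """
    failures = []
    for bench in results.get("benchmarks", []):
        if bench["name"] not in GATED_BENCHMARKS:
            continue  # the unoptimized baseline is allowed to be slow
        p99 = bench["sampledMetrics"]["frameDurationCpuMs"]["P99"]
        if p99 > FRAME_P99_THRESHOLD_MS:
            failures.append(bench["name"])
    return failures


# Made-up numbers, purely to exercise the functions.
example = {
    "benchmarks": [
        {"name": "scrollAnimatedList_optimized",
         "sampledMetrics": {"frameDurationCpuMs": {"P99": 12.4}}},
        {"name": "scrollAnimatedList_unoptimized",
         "sampledMetrics": {"frameDurationCpuMs": {"P99": 31.9}}},
    ]
}
print(p99_gate_failures(example))
print(round(headroom_percent(FRAME_P99_THRESHOLD_MS), 1))
```

Keeping the gated names in a set makes it explicit that the slow baseline is measured but never fails the build.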

**p95** is not gated but is worth watching: a large gap between p90 and p95 typically signals
infrequent but expensive allocations (bitmaps, large `List` copies) rather than per-frame waste.
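
To make the percentile reading concrete, here is a small sketch that derives the four percentiles from a list of per-frame CPU durations using only the standard library, then reports the p90 to p95 gap. The sample data and the function name `frame_percentiles` are invented for illustration; Macrobenchmark computes its own percentiles and may interpolate differently.

```python
import statistics


def frame_percentiles(durations_ms):
    """Return p50/p90/p95/p99 for a list of frame durations in milliseconds."""
    # quantiles(n=100) returns the 99 cut points p1..p99; index k - 1 holds pk.
    cuts = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {f"p{k}": cuts[k - 1] for k in (50, 90, 95, 99)}


# Hypothetical run: mostly 5 ms frames, some 9 ms frames, a few 14 ms spikes.
frames = [5.0] * 900 + [9.0] * 60 + [14.0] * 40
p = frame_percentiles(frames)
print(p)
print("p95 - p90 gap:", p["p95"] - p["p90"])
```

In this made-up data set the median stays at the steady-state cost while the jump between p90 and p95 exposes the less frequent, more expensive frames.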

### `frameOverrunMs` vs `frameDurationCpuMs`

`frameDurationCpuMs` measures only CPU-side work (including RenderThread). It is available on
all devices. `frameOverrunMs` measures wall-clock overrun relative to the frame deadline and
requires hardware GPU-timestamp support (most Pixel devices, some Snapdragon devices). On the CI
emulator, `frameOverrunMs` is absent from the JSON; do not treat its absence as a failure.

### `jankyFrameCount` vs p99

These are complementary, not redundant. p99 tells you how bad the worst frames are.
`jankyFrameCount` tells you how many frames crossed the 16.67 ms deadline. A test can have a
low p99 but a non-zero jank count if fewer than 1 % of its frames missed the deadline, because
those frames sit above the 99th percentile and do not move it.
For 60 Hz content, a jank count of zero is the target; one or two janky frames per 100 is
acceptable on the CI emulator, where clocks cannot be locked.
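
A small worked example may help show why the two metrics are complementary. The frame data is invented and `jank_summary` is a hypothetical helper, not part of the project's parser. With three spikes in 1000 frames, the jank count is non-zero while p99 stays at the steady-state value.

```python
import statistics

FRAME_DEADLINE_MS = 1000 / 60  # 16.67 ms at 60 Hz


def jank_summary(durations_ms):
    """Return (p99, janky frame count, janky frame percent) for one run."""
    p99 = statistics.quantiles(durations_ms, n=100, method="inclusive")[98]
    janky = sum(1 for d in durations_ms if d > FRAME_DEADLINE_MS)
    return p99, janky, 100 * janky / len(durations_ms)


# 1000 frames: 997 comfortable ones and 3 spikes past the deadline.
frames = [6.0] * 997 + [25.0] * 3
p99, count, percent = jank_summary(frames)
print(p99, count, percent)
```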

---

## Startup timing metrics

`StartupBenchmark` and `AppStartupBenchmark` use `StartupTimingMetric` across 10 iterations:

```
timeToInitialDisplayMs   TTID: system-measured time from launch intent to first frame drawn
timeToFullDisplayMs      TTFD: time until the app calls reportFullyDrawn()
```

**TTID** is reported by the system and cannot be manipulated by the app. It ends when the window
surface receives its first rendered frame, even if that frame shows only a blank background.

**TTFD** is the app-reported milestone. `MainActivity` calls `reportFullyDrawn()` after the
Compose layout pass completes and the feed `LazyColumn` is scrollable. TTFD is absent for
`StartupMode.HOT` because `onCreate()` is not called in that mode and `reportFullyDrawn()` is
never invoked.

The CI cold-start threshold is **800 ms TTID**:

```python
COLD_START_THRESHOLD_MS = 800
```

The optimised build targets 150–350 ms; the 800 ms gate is a wide safety margin designed to catch
regressions (e.g. an SDK accidentally moved back onto the main thread) rather than to certify
production quality.
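
A gate of this kind might be sketched as follows. Taking the median over the iterations and the helper name `cold_start_ok` are assumptions made for this example; the parser's real aggregation is not shown in this document.

```python
import statistics

COLD_START_THRESHOLD_MS = 800  # value quoted above


def cold_start_ok(ttid_runs_ms, threshold_ms=COLD_START_THRESHOLD_MS):
    """Return (passed, median) for a list of per-iteration TTID values."""
    median = statistics.median(ttid_runs_ms)
    return median <= threshold_ms, median


# Invented TTID samples for 10 iterations, in milliseconds.
runs = [262, 241, 255, 249, 310, 238, 251, 247, 244, 259]
passed, median = cold_start_ok(runs)
print(passed, median, round(COLD_START_THRESHOLD_MS / median, 1))
```

Using the median rather than the mean keeps a single slow iteration, such as the 310 ms sample here, from dominating the result.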

The startup tests use `CompilationMode.None()` (JIT only, no AOT pre-compilation). This produces
the worst-case startup time, the same condition a user experiences on first install before ART
has had time to profile and compile. Baseline Profiles are generated separately via
`./gradlew :app:generateBaselineProfile` and are measured independently.

---

## Limitations and variance expectations

### Emulator variance

CPU clock locking is not possible on the emulator. The emulator shares host CPU cores with other
processes and is subject to the host scheduler. Expect ±30–50 ms variance on startup metrics
and ±2–4 ms variance on p99 frame duration across runs. This is why:

- Startup uses 10 iterations (more samples reduce the impact of outliers).
- Scroll uses 5 iterations (frame metrics are percentiles over hundreds of frames, so
  fewer iterations are needed for stable statistics).
- The CI threshold for cold start (800 ms) is set at roughly 3× the measured optimised value
  (~250 ms) to absorb emulator noise.
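
The effect of the iteration count can be illustrated with a seeded simulation. The noise model (Gaussian, 40 ms standard deviation around 250 ms) is an assumption chosen to resemble the variance figures above, not a measurement, and `median_spread` is a hypothetical helper.

```python
import random
import statistics


def median_spread(iterations, trials=2000, seed=0):
    """Standard deviation of the per-run median across simulated runs."""
    rng = random.Random(seed)
    medians = [
        statistics.median([rng.gauss(250, 40) for _ in range(iterations)])
        for _ in range(trials)
    ]
    return statistics.stdev(medians)


print("5 iterations: ", round(median_spread(5), 1))
print("10 iterations:", round(median_spread(10), 1))
```

Under this model the reported median wanders less from run to run when each run has more iterations, which is the reason startup uses the larger count.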

### `CompilationMode.None()` and JIT behaviour

All benchmarks in this project run with `CompilationMode.None()`. JIT compilation happens during
the benchmark run, which means the first iteration is usually slower (the JIT is still profiling)
and later iterations are faster (hot methods are compiled). The benchmark library accounts for
this by recording all iterations and reporting the distribution; look at p50 and p90 across
multiple runs rather than a single median.

If you switch to `CompilationMode.Full()` (AOT), numbers will be lower and more consistent but
will not represent install-fresh behaviour. `CompilationMode.None()` is the right choice for
detecting regressions under worst-case, freshly installed conditions.

### Thermal throttling on physical devices

Even with locked clocks, sustained benchmarks on physical hardware can trigger thermal
throttling if the device approaches its temperature limit. Signs of throttling:

- Startup times that increase monotonically across iterations (not random noise).
- Frame p99 that is higher for `scrollAnimatedList_optimized` than for `scrollAnimatedList_unoptimized`
  (implausible without throttling, because the unoptimized path does more work).

If you observe these patterns, let the device cool for 5–10 minutes and re-run. Plugging in
USB-C power delivery can worsen thermals on some devices; consider unplugging during the run.
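
The first sign can be checked mechanically. The sketch below flags a run whose startup times rise at almost every step; the 80 % cut-off and the name `looks_throttled` are arbitrary illustrative choices, not values taken from this project.

```python
def looks_throttled(samples_ms, min_rising_fraction=0.8):
    """True if most consecutive iterations are slower than the one before."""
    if len(samples_ms) < 3:
        return False  # too few points to call it a trend
    steps = list(zip(samples_ms, samples_ms[1:]))
    rising = sum(1 for prev, cur in steps if cur > prev)
    return rising / len(steps) >= min_rising_fraction


# Invented startup times (ms) for 10 iterations each.
noisy = [250, 243, 262, 238, 255, 247, 260, 241, 252, 249]
heating = [240, 246, 251, 259, 266, 264, 275, 283, 290, 301]
print(looks_throttled(noisy), looks_throttled(heating))
```

A check like this only hints at throttling; device temperature readings, where available, are a more direct signal.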

### What the numbers do and do not represent

| The numbers DO reflect | The numbers DO NOT reflect |
|---|---|
| Regression introduced in the code under test | Absolute production performance on a user's device |
| Relative improvement from a specific optimisation | Performance under network I/O or database load |
| Worst-case startup before ART profiling | Performance after a user's device has profiled and compiled the app |
| Per-frame Compose rendering cost | GPU-bound rendering (these benchmarks are CPU-bound) |
| Recomposition pass count (unit test metric) | Number of composables recomposed within a single pass |

Recomposition counts in `RecompositionBenchmark` measure `Recomposer.changeCount`: the number
of complete composition passes applied, not the number of individual composables that re-ran.
One click that triggers one state change = one pass = `delta` of 1 in the optimised build.
The assertion `assertEquals(1L, delta)` verifies no cascading second pass was triggered; it
does not verify which composables were skipped within that pass. Use Layout Inspector's
recomposition highlighting to inspect per-composable skip behaviour.

app/build.gradle.kts

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@ android {

     buildFeatures {
         compose = true
+        buildConfig = true
     }
 }
