ROADMAP §7 Next-12.1. The Macrobenchmark harness lives at
benchmark/src/main/kotlin/dev/patrickgold/florisboard/benchmark/KeyboardLatencyBenchmark.kt. The adb harness scripts live under tools/.
Run on a clocks-locked device:
# Lock device clocks (Pixel 6 / S25 Ultra example).
adb shell cmd device_config put activity_manager max_phantom_processes 2147483647
adb shell input keyevent KEYCODE_WAKEUP
# … additional clock-locking per Macrobenchmark docs:
# https://developer.android.com/topic/performance/benchmarking/macrobenchmark-overview
./gradlew :app:assembleBenchmark :benchmark:assembleBenchmark
# Repeatable adb baseline for IME first render.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-first-render.ps1 -Iterations 5
# Repeatable adb baseline for cold first suggestion provider latency.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-suggestion-latency.ps1 -Iterations 5
# Repeatable adb baseline for dictionary cold load, preload, and lazy indexes.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-dictionary-load.ps1 -Iterations 5
# Repeatable adb baseline for candidate-row recomposition during warm typing.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-candidate-row.ps1 -Iterations 5
# Repeatable adb baseline for theme switching while the IME is visible.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-theme-switch.ps1 -Iterations 5
# Repeatable adb baseline for backup/restore on a representative default archive.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-backup-restore.ps1 -Iterations 5
# AndroidX Macrobenchmark trace/frame runs.
./gradlew :benchmark:connectedBenchmarkAndroidTest
# Offline trend compare for candidate JSON already collected under build/.
python scripts/check-benchmark-trends.py \
--baseline-dir docs/benchmark-results \
--candidate-dir build/benchmark-results \
--report build/benchmark-results/benchmark-trend-report.md \
--require-all-baselines

Collect output from
benchmark/build/outputs/connected_android_test_additional_output/ for
AndroidX Macrobenchmark runs. The adb harness scripts write JSON to
docs/benchmark-results/.
The manual GitHub Actions gate lives at
.github/workflows/benchmark-regression.yml. It builds the benchmark APK,
boots an emulator, writes candidate JSON under build/benchmark-results/,
runs scripts/check-benchmark-trends.py, and uploads both the candidate JSON
and markdown trend report.
| Benchmark | Device / iterations | Launch median | TraceSection / log median | Evidence |
|---|---|---|---|---|
| imeFirstRender (cold IME view inflation) | Samsung SM-S938B / Android 16, 5 runs | am start -W: TotalTime 31.0 ms, WaitTime 34.0 ms | SwiftFlorisPerf: swiftfloris.ime.firstRenderMs 18.335469 ms | baseline-2026-05-18-ime-first-render.json |
| firstSuggestionLatency (cold provider-direct teh) | Samsung SM-S938B / Android 16, 5 runs | am start -W: activity launch recorded per run | SwiftFlorisPerf: swiftfloris.nlp.firstSuggestionMs 1878.616249 ms, 8 candidates | baseline-2026-05-18-ime-suggestion-latency.json |
| dictionaryColdLoad (SCOWL load + preload + SymSpell indexes) | Samsung SM-S938B / Android 16, 5 runs | am start -W: activity launch recorded per run | SwiftFlorisPerf: swiftfloris.dict.loadMs 757.353333 ms; swiftfloris.dict.preloadMs 772.080625 ms; SymSpell d1 500.230156 ms / d2 532.298281 ms; post-preload spell 1030.179896 ms | baseline-2026-05-18-ime-dictionary-load.json |
| candidateRowRecomposition (warm typing phrase) | Samsung SM-S938B / Android 16, 5 runs | am start -W: activity launch recorded per run | SwiftFlorisPerf: 9.0 recompositions/run; median body 0.326563 ms; median max 0.770365 ms; median total 4.069529 ms; paired median swiftfloris.nlp.suggestMs 0.339896 ms | baseline-2026-05-18-ime-candidate-row.json |
| themeSwitch (Snygg stylesheet swap) | Samsung SM-S938B / Android 16, 5 runs | am start -W: activity launch recorded per run | SwiftFlorisPerf: 5.0 switches/run; median body 18.541197 ms; median max 19.587708 ms; median total 57.505571 ms; cold-step median 19.221354 ms; warm cached-step median 0.2808075 ms; 0 load failures | baseline-2026-05-18-ime-theme-switch.json |
| backupRestore (default prefs + keyboard/theme archive) | Samsung SM-S938B / Android 16, 5 runs | am start -W: activity launch recorded per run | SwiftFlorisPerf: backup create 12.653698 ms; archive 22,034 bytes; restore prepare 4.062604 ms; merge apply 5.727604 ms; restore total 9.874167 ms; 3/3 sections restored, 0 failed | baseline-2026-05-18-backup-restore.json |
The EI9 gate compares candidate JSON summary metrics against the latest
committed baseline for the same benchmark field. Timings target the baseline
or better; improvements of 5 % or more are called out in the report, and
regressions more than 8 % slower fail the workflow. Functional guardrails must
stay at zero where noted.
| Benchmark | Watched timing metrics | Target / pass range | Hold range | Guardrails |
|---|---|---|---|---|
| imeFirstRender | activityTotalTimeMedianMs, activityWaitTimeMedianMs, imeFirstRenderMedianMs | <= baseline, or <= +8 % noise window | > +8 % | n/a |
| firstSuggestionLatency | suggestionLatencyMedianMs | <= baseline, or <= +8 % noise window | > +8 % | n/a |
| dictionaryLoadAndPreload | dictionaryLoadMedianMs, dictionaryPreloadMedianMs, symSpellDistance1BuildMedianMs, symSpellDistance2BuildMedianMs, postPreloadSpellMedianMs, postPreloadSuggestionMedianMs | <= baseline, or <= +8 % noise window | > +8 % | n/a |
| candidateRowRecomposition | recomposeMedianOfRunMediansMs, recomposeMaxMedianMs, recomposeTotalMedianMs, nlpSuggestMedianOfRunMediansMs, nlpSuggestMaxMedianMs | <= baseline, or <= +8 % noise window | > +8 % | n/a |
| themeSwitch | themeSwitchMedianOfRunMediansMs, themeSwitchMaxMedianMs, themeSwitchTotalMedianMs, benchmarkStepMedianOfRunMediansMs, benchmarkStepMaxMedianMs, benchmarkColdStepMedianMs, benchmarkWarmStepMedianMs | <= baseline, or <= +8 % noise window | > +8 % | loadFailureCountMedian == 0 |
| backupRestore | backupCreateMedianMs, restorePrepareMedianMs, restoreApplyMedianMs, restoreTotalMedianMs | <= baseline, or <= +8 % noise window | > +8 % | missingSectionsMedian == 0, failedSectionsMedian == 0 |
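The pass and hold ranges in the table above can be sketched in a few lines of Python. This is an illustration of the thresholds only, not the logic of scripts/check-benchmark-trends.py; the function names are hypothetical, and treating exactly 5 % faster as an improvement and a missing guardrail field as zero are both assumptions.

```python
from typing import Iterable, Mapping

IMPROVEMENT_PCT = 5.0  # assumption: 5 % or more faster counts as an improvement
REGRESSION_PCT = 8.0   # more than 8 % slower falls in the hold range


def percent_delta(baseline_ms: float, candidate_ms: float) -> float:
    """Signed change relative to the baseline; positive means slower."""
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    return (candidate_ms - baseline_ms) * 100.0 / baseline_ms


def classify(baseline_ms: float, candidate_ms: float) -> str:
    """Return 'hold', 'improved', or 'pass' for one watched timing metric."""
    delta = percent_delta(baseline_ms, candidate_ms)
    if delta > REGRESSION_PCT:
        return "hold"
    if delta <= -IMPROVEMENT_PCT:
        return "improved"
    return "pass"  # at or below baseline, or inside the +8 % noise window


def guardrails_ok(summary: Mapping[str, float], guardrails: Iterable[str]) -> bool:
    """Guardrail fields (for example failedSectionsMedian) must all be zero.

    A field missing from the summary is treated as zero here (assumption).
    """
    return all(summary.get(name, 0) == 0 for name in guardrails)
```

For example, `classify(18.335469, 20.0)` returns `"hold"`, because 20.0 ms is about 9.1 % slower than the recorded imeFirstRender median.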
ROADMAP N1.4 now has the JVM-side replay and reporting pieces:
SwipeTraceImporter loads JSON Array / JSON Lines traces,
SwipeTraceReplay.toPointerData(...) maps normalized samples into
GlideTypingGesture.Detector.PointerData, and
SwipeTraceBenchmark.evaluate(...) computes top-1 / top-3 / top-N
accuracy, failures, average predictor latency, and capped miss samples.
The remaining evidence step needs the MIT-licensed FUTO swipe corpus
downloaded outside the repo and a device or host runner that wires the
records into StatisticalGlideTypingClassifier.
| Corpus | Engine | Records | Top-1 | Top-3 | Top-N | Avg latency | Notes |
|---|---|---|---|---|---|---|---|
| FUTO MIT swipe corpus | StatisticalGlideTypingClassifier | pending | pending | pending | pending | pending | Requires corpus download + replay runner |
| FUTO nightly reference model | external reference | pending | pending | pending | pending | pending | Record from published run only; do not ingest FUTO app code |
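To make the table's columns concrete, here is a minimal Python sketch of how top-1 / top-3 / top-N accuracy, failures, average latency, and capped miss samples could be computed from replay results. It is not the Kotlin SwipeTraceBenchmark.evaluate(...) implementation: the `ReplayResult` shape, the reading of "failure" as an empty prediction list, and the `max_miss_samples` cap are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class ReplayResult:
    """Hypothetical record shape; not the repo's actual type."""
    expected: str
    predictions: Sequence[str]  # ranked best-first
    latency_ms: float


def evaluate(results: List[ReplayResult], max_miss_samples: int = 3) -> Dict[str, object]:
    total = len(results)
    if total == 0:
        raise ValueError("no replay results")

    def hits(k):
        # k=None slices the whole list, which gives top-N.
        return sum(1 for r in results if r.expected in r.predictions[:k])

    misses = [r.expected for r in results if r.expected not in r.predictions]
    return {
        "records": total,
        "top1": hits(1) / total,
        "top3": hits(3) / total,
        "topN": hits(None) / total,
        # Assumption: a failure is a record with no predictions at all.
        "failures": sum(1 for r in results if not r.predictions),
        "avgLatencyMs": sum(r.latency_ms for r in results) / total,
        "missSamples": misses[:max_miss_samples],  # capped for readable reports
    }
```

Capping the miss samples keeps a report short even when a large corpus produces many misses.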
Every production hot path that we benchmark wraps itself with
androidx.tracing.Trace.beginSection("swiftfloris.<subsystem>.<action>").
Current sections:
| Section | Subsystem | When entered |
|---|---|---|
| swiftfloris.ime.firstRender | IME bootstrap | FlorisImeService.onCreateInputView start → first frame |
| swiftfloris.dict.load | NLP dictionary | LatinDictionaryStore.dictionaryForLanguage cold path |
| swiftfloris.nlp.symspell.build | NLP correction index | LatinDictionarySnapshot.symSpellIndex lazy-init |
| swiftfloris.nlp.suggest | NLP suggestion pipeline | NlpManager.suggest start → IO-bound completion |
| swiftfloris.smartbar.candidates.recompose | Smartbar UI | Candidates row Compose re-emit |
| swiftfloris.theme.switch | Theme engine | ThemeManager.updateActiveTheme swap |
Add a new section by wrapping its hot-path call site:
import androidx.tracing.Trace
Trace.beginSection("swiftfloris.<subsystem>.<action>")
try {
// … work …
} finally {
Trace.endSection()
}

Each baseline lives at docs/benchmark-results/baseline-YYYY-MM-DD*.json
so subsequent runs can compare against the recorded numbers. Format:
raw BenchmarkResult JSON as emitted by AndroidX Macrobenchmark, or the
script-emitted JSON for adb-only baselines.
| Date | Device | Build | Result file |
|---|---|---|---|
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.164 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-backup-restore.json |
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.163 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-ime-theme-switch.json |
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.162 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-ime-candidate-row.json |
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.161 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-ime-dictionary-load.json |
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.160 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-ime-suggestion-latency.json |
| 2026-05-18 | Samsung SM-S938B / Android 16 (SDK 36) | v1.8.159 benchmark APK (dev.patrickgold.florisboard.bench) | baseline-2026-05-18-ime-first-render.json |
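Because the result files in the table above carry an ISO date in their names, the most recent baseline for a given benchmark can be found by sorting on the filename alone. The sketch below assumes names of the form baseline-YYYY-MM-DD-<benchmark>.json, as in the table; `latest_baseline` is a hypothetical helper, not part of the repo's scripts.

```python
import re
from typing import Iterable, Optional

# Assumption: the text after the date is the benchmark name, e.g. "ime-first-render".
BASELINE_RE = re.compile(r"^baseline-(\d{4}-\d{2}-\d{2})-(.+)\.json$")


def latest_baseline(filenames: Iterable[str], benchmark: str) -> Optional[str]:
    """Return the newest baseline filename for `benchmark`, or None if absent."""
    dated = []
    for name in filenames:
        match = BASELINE_RE.match(name)
        if match and match.group(2) == benchmark:
            dated.append((match.group(1), name))
    # ISO dates sort chronologically when compared as strings.
    return max(dated)[1] if dated else None
```

A caller could feed it the names found under docs/benchmark-results/, for instance `(p.name for p in pathlib.Path("docs/benchmark-results").glob("baseline-*.json"))`.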
A watched median increase > 8 % vs the immediately-preceding baseline is the regression threshold. The manual workflow records both numbers, the % delta, the device labels, and the candidate JSON artifact so triage is one click. Compare against the same device + build configuration whenever possible; cross-device comparisons are useful for smoke evidence but should not become a new committed baseline without maintainer review.
A v1.X.Y release that touches the IME hot path includes:
- A fresh benchmark run on a clocks-locked device.
- The new JSON committed to docs/benchmark-results/.
- The latency baseline table above updated.
- A regression-or-improvement line in the release notes.
When the harness reports an improvement > 5 %, claim it in release notes. When it reports a regression > 8 %, the release is held until the regression is investigated.