SwiftFloris benchmark results

Tracked as ROADMAP §7 Next-12.1. The Macrobenchmark harness lives at benchmark/src/main/kotlin/dev/patrickgold/florisboard/benchmark/KeyboardLatencyBenchmark.kt, and the adb harness scripts live under tools/. Run on a clocks-locked device:

# Prepare the device (Pixel 6 / S25 Ultra example): lift the phantom-process cap and wake the screen.
adb shell cmd device_config put activity_manager max_phantom_processes 2147483647
adb shell input keyevent KEYCODE_WAKEUP
# … additional clock-locking per Macrobenchmark docs:
# https://developer.android.com/topic/performance/benchmarking/macrobenchmark-overview

./gradlew :app:assembleBenchmark :benchmark:assembleBenchmark

# Repeatable adb baseline for IME first render.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-first-render.ps1 -Iterations 5

# Repeatable adb baseline for cold first suggestion provider latency.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-suggestion-latency.ps1 -Iterations 5

# Repeatable adb baseline for dictionary cold load, preload, and lazy indexes.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-dictionary-load.ps1 -Iterations 5

# Repeatable adb baseline for candidate-row recomposition during warm typing.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-candidate-row.ps1 -Iterations 5

# Repeatable adb baseline for theme switching while the IME is visible.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-ime-theme-switch.ps1 -Iterations 5

# Repeatable adb baseline for backup/restore on a representative default archive.
pwsh -NoProfile -ExecutionPolicy Bypass -File tools/benchmark-backup-restore.ps1 -Iterations 5

# AndroidX Macrobenchmark trace/frame runs.
./gradlew :benchmark:connectedBenchmarkAndroidTest

# Offline trend compare for candidate JSON already collected under build/.
python scripts/check-benchmark-trends.py \
  --baseline-dir docs/benchmark-results \
  --candidate-dir build/benchmark-results \
  --report build/benchmark-results/benchmark-trend-report.md \
  --require-all-baselines

Collect output from benchmark/build/outputs/connected_android_test_additional_output/ for AndroidX Macrobenchmark runs. The adb harness scripts write JSON to docs/benchmark-results/.

The manual GitHub Actions gate lives at .github/workflows/benchmark-regression.yml. It builds the benchmark APK, boots an emulator, writes candidate JSON under build/benchmark-results/, runs scripts/check-benchmark-trends.py, and uploads both the candidate JSON and markdown trend report.

Latency baseline

Benchmark	Device / iterations	Launch median	TraceSection / log median	Evidence
`imeFirstRender` (cold IME view inflation)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: `TotalTime` 31.0 ms, `WaitTime` 34.0 ms	`SwiftFlorisPerf`: `swiftfloris.ime.firstRenderMs` 18.335469 ms	`baseline-2026-05-18-ime-first-render.json`
`firstSuggestionLatency` (cold provider-direct `teh`)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: activity launch recorded per run	`SwiftFlorisPerf`: `swiftfloris.nlp.firstSuggestionMs` 1878.616249 ms, 8 candidates	`baseline-2026-05-18-ime-suggestion-latency.json`
`dictionaryColdLoad` (SCOWL load + preload + SymSpell indexes)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: activity launch recorded per run	`SwiftFlorisPerf`: `swiftfloris.dict.loadMs` 757.353333 ms; `swiftfloris.dict.preloadMs` 772.080625 ms; SymSpell d1 500.230156 ms / d2 532.298281 ms; post-preload spell 1030.179896 ms	`baseline-2026-05-18-ime-dictionary-load.json`
`candidateRowRecomposition` (warm typing phrase)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: activity launch recorded per run	`SwiftFlorisPerf`: 9.0 recompositions/run; median body 0.326563 ms; median max 0.770365 ms; median total 4.069529 ms; paired median `swiftfloris.nlp.suggestMs` 0.339896 ms	`baseline-2026-05-18-ime-candidate-row.json`
`themeSwitch` (Snygg stylesheet swap)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: activity launch recorded per run	`SwiftFlorisPerf`: 5.0 switches/run; median body 18.541197 ms; median max 19.587708 ms; median total 57.505571 ms; cold-step median 19.221354 ms; warm cached-step median 0.2808075 ms; 0 load failures	`baseline-2026-05-18-ime-theme-switch.json`
`backupRestore` (default prefs + keyboard/theme archive)	Samsung SM-S938B / Android 16, 5 runs	`am start -W`: activity launch recorded per run	`SwiftFlorisPerf`: backup create 12.653698 ms; archive 22,034 bytes; restore prepare 4.062604 ms; merge apply 5.727604 ms; restore total 9.874167 ms; 3/3 sections restored, 0 failed	`baseline-2026-05-18-backup-restore.json`

Trend-regression ranges

The EI9 gate compares candidate JSON summary metrics against the latest committed baseline for the same benchmark field. Timings target the baseline or better: improvements of more than 5 % are called out in the report, and regressions of more than 8 % fail the workflow. Functional guardrails must stay at zero where noted.

Benchmark	Watched timing metrics	Target / pass range	Hold range	Guardrails
`imeFirstRender`	`activityTotalTimeMedianMs`, `activityWaitTimeMedianMs`, `imeFirstRenderMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	n/a
`firstSuggestionLatency`	`suggestionLatencyMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	n/a
`dictionaryLoadAndPreload`	`dictionaryLoadMedianMs`, `dictionaryPreloadMedianMs`, `symSpellDistance1BuildMedianMs`, `symSpellDistance2BuildMedianMs`, `postPreloadSpellMedianMs`, `postPreloadSuggestionMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	n/a
`candidateRowRecomposition`	`recomposeMedianOfRunMediansMs`, `recomposeMaxMedianMs`, `recomposeTotalMedianMs`, `nlpSuggestMedianOfRunMediansMs`, `nlpSuggestMaxMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	n/a
`themeSwitch`	`themeSwitchMedianOfRunMediansMs`, `themeSwitchMaxMedianMs`, `themeSwitchTotalMedianMs`, `benchmarkStepMedianOfRunMediansMs`, `benchmarkStepMaxMedianMs`, `benchmarkColdStepMedianMs`, `benchmarkWarmStepMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	`loadFailureCountMedian == 0`
`backupRestore`	`backupCreateMedianMs`, `restorePrepareMedianMs`, `restoreApplyMedianMs`, `restoreTotalMedianMs`	<= baseline, or <= +8 % noise window	> +8 %	`missingSectionsMedian == 0`, `failedSectionsMedian == 0`
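
The pass and hold ranges above reduce to a percent-delta check per watched metric. The following is a minimal sketch of that rule, not the logic of scripts/check-benchmark-trends.py itself; the function name `classify_delta` and its parameters are hypothetical, and the 5 % / 8 % defaults are taken from the thresholds stated in this section.

```python
from typing import Tuple


def classify_delta(
    baseline_ms: float,
    candidate_ms: float,
    improvement_pct: float = 5.0,  # assumed from "improvements > 5 %"
    regression_pct: float = 8.0,   # assumed from "regressions > 8 %"
) -> Tuple[str, float]:
    """Return (status, percent delta) for one watched timing metric.

    Hypothetical helper: a positive delta means the candidate is slower.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    delta_pct = (candidate_ms - baseline_ms) / baseline_ms * 100.0
    if delta_pct > regression_pct:
        return "hold", delta_pct          # outside the noise window
    if delta_pct < -improvement_pct:
        return "improvement", delta_pct   # worth calling out in the report
    return "pass", delta_pct              # at baseline or within noise


# Example against the imeFirstRender baseline of 18.335469 ms.
for candidate in (17.0, 19.0, 20.0):
    status, delta = classify_delta(18.335469, candidate)
    print(f"{candidate:.1f} ms -> {status} ({delta:+.2f} %)")
```

Guardrail fields such as `loadFailureCountMedian` are not timings and would need a separate equality check against zero.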

Glide trace benchmark — pending first corpus run

ROADMAP N1.4 now has the JVM-side replay and reporting pieces: SwipeTraceImporter loads JSON Array / JSON Lines traces, SwipeTraceReplay.toPointerData(...) maps normalized samples into GlideTypingGesture.Detector.PointerData, and SwipeTraceBenchmark.evaluate(...) computes top-1 / top-3 / top-N accuracy, failures, average predictor latency, and capped miss samples.
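
For readers unfamiliar with the accuracy figures named above, top-k accuracy is the share of traces whose expected word appears among the first k ranked candidates. The sketch below illustrates the metric only; it is not the Kotlin `SwipeTraceBenchmark.evaluate(...)` implementation, and the function name `top_k_accuracy` and the sample records are made up for illustration.

```python
from typing import Iterable, Sequence, Tuple


def top_k_accuracy(records: Iterable[Tuple[str, Sequence[str]]], k: int) -> float:
    """Fraction of records whose expected word is within the first k candidates.

    Each record is (expected_word, ranked_candidates). Hypothetical helper.
    """
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for expected, ranked in records if expected in ranked[:k])
    return hits / len(records)


# Illustrative data only, not drawn from any corpus.
sample = [
    ("hello", ["hello", "help", "hell"]),
    ("world", ["would", "world", "word"]),
    ("swipe", ["swift", "snipe", "wipe", "swipe"]),
    ("trace", []),  # predictor returned nothing: counts as a miss at every k
]

print(top_k_accuracy(sample, 1))  # 0.25
print(top_k_accuracy(sample, 3))  # 0.5
print(top_k_accuracy(sample, 4))  # 0.75
```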

The remaining evidence step needs the MIT-licensed FUTO swipe corpus downloaded outside the repo and a device or host runner that wires the records into StatisticalGlideTypingClassifier.

Corpus	Engine	Records	Top-1	Top-3	Top-N	Avg latency	Notes
FUTO MIT swipe corpus	`StatisticalGlideTypingClassifier`	pending	pending	pending	pending	pending	Requires corpus download + replay runner
FUTO nightly reference model	external reference	pending	pending	pending	pending	pending	Record from published run only; do not ingest FUTO app code

Trace-section naming convention

Every production hot path that we benchmark wraps itself with androidx.tracing.Trace.beginSection("swiftfloris.<subsystem>.<action>"). Current sections:

Section	Subsystem	When entered
`swiftfloris.ime.firstRender`	IME bootstrap	`FlorisImeService.onCreateInputView` start → first frame
`swiftfloris.dict.load`	NLP dictionary	`LatinDictionaryStore.dictionaryForLanguage` cold path
`swiftfloris.nlp.symspell.build`	NLP correction index	`LatinDictionarySnapshot.symSpellIndex` lazy-init
`swiftfloris.nlp.suggest`	NLP suggestion pipeline	`NlpManager.suggest` start → IO-bound completion
`swiftfloris.smartbar.candidates.recompose`	Smartbar UI	Candidates row Compose re-emit
`swiftfloris.theme.switch`	Theme engine	`ThemeManager.updateActiveTheme` swap

Add a new section by wrapping its hot-path call site:

import androidx.tracing.Trace

Trace.beginSection("swiftfloris.<subsystem>.<action>")
try {
    // … work …
} finally {
    Trace.endSection()
}
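
A new section name can be checked against the convention before it lands. The sketch below is one possible lint, written in Python to sit alongside the repo's other scripts; the regular expression is an assumption inferred from the sections listed in the table, some of which use more than two segments after the `swiftfloris` prefix (for example `swiftfloris.nlp.symspell.build`). The helper name `is_valid_section` is hypothetical.

```python
import re

# Assumed pattern: "swiftfloris" followed by at least two dot-separated
# segments, each starting with a lowercase letter (camelCase allowed).
_SECTION_RE = re.compile(r"^swiftfloris(?:\.[a-z][A-Za-z0-9]*){2,}$")


def is_valid_section(name: str) -> bool:
    """Return True if the name matches the assumed trace-section convention."""
    return _SECTION_RE.match(name) is not None


current_sections = [
    "swiftfloris.ime.firstRender",
    "swiftfloris.dict.load",
    "swiftfloris.nlp.symspell.build",
    "swiftfloris.nlp.suggest",
    "swiftfloris.smartbar.candidates.recompose",
    "swiftfloris.theme.switch",
]

for section in current_sections:
    print(section, is_valid_section(section))
```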

Historical baselines

Each baseline lives at docs/benchmark-results/baseline-YYYY-MM-DD*.json so subsequent runs can compare against the recorded numbers. Format: raw BenchmarkResult JSON as emitted by AndroidX Macrobenchmark, or the script-emitted JSON for adb-only baselines.

Date	Device	Build	Result file
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.164 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-backup-restore.json`
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.163 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-ime-theme-switch.json`
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.162 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-ime-candidate-row.json`
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.161 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-ime-dictionary-load.json`
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.160 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-ime-suggestion-latency.json`
2026-05-18	Samsung SM-S938B / Android 16 (SDK 36)	v1.8.159 benchmark APK (`dev.patrickgold.florisboard.bench`)	`baseline-2026-05-18-ime-first-render.json`
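
Because each file name carries its date, the latest baseline per benchmark can be picked from the directory listing alone. The following is a rough sketch under that assumption; it does not reproduce how scripts/check-benchmark-trends.py selects baselines, the helper name `latest_baselines` is hypothetical, and the 2026-04-02 file in the example is invented to show the selection.

```python
import re
from datetime import date
from typing import Dict, Iterable, Tuple

# Matches baseline-YYYY-MM-DD*.json and captures the optional suffix.
_BASELINE_RE = re.compile(r"^baseline-(\d{4})-(\d{2})-(\d{2})(?:-(.+))?\.json$")


def latest_baselines(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each benchmark suffix to its most recent baseline file name."""
    latest: Dict[str, Tuple[date, str]] = {}
    for name in filenames:
        match = _BASELINE_RE.match(name)
        if match is None:
            continue  # not a baseline file
        year, month, day, suffix = match.groups()
        key = suffix or ""
        recorded = date(int(year), int(month), int(day))
        if key not in latest or recorded > latest[key][0]:
            latest[key] = (recorded, name)
    return {key: name for key, (_, name) in latest.items()}


names = [
    "baseline-2026-05-18-ime-first-render.json",
    "baseline-2026-04-02-ime-first-render.json",  # hypothetical older run
    "baseline-2026-05-18-backup-restore.json",
    "notes.md",
]
print(latest_baselines(names))
```

In practice the list of names could come from `os.listdir("docs/benchmark-results")`.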

How to read a regression

A watched median that rises more than 8 % above the immediately preceding baseline counts as a regression. For example, against the `imeFirstRender` baseline of 18.335469 ms for `swiftfloris.ime.firstRenderMs`, a candidate median above roughly 19.80 ms (18.335469 × 1.08) would trip the gate. The manual workflow records both numbers, the % delta, the device labels, and the candidate JSON artifact so triage is one click. Compare against the same device + build configuration whenever possible; cross-device comparisons are useful for smoke evidence but should not become a new committed baseline without maintainer review.

Definition of done

A v1.X.Y release that touches the IME hot path includes:

- A fresh benchmark run on a clocks-locked device.
- The new JSON committed to docs/benchmark-results/.
- The latency baseline table above updated.
- A regression-or-improvement line in the release notes.

When the harness reports an improvement > 5 %, claim it in release notes. When it reports a regression > 8 %, the release is held until the regression is investigated.
