Commit 2b300a4

docs: refresh cycle 14 research queue
1 parent 857cfe0 commit 2b300a4

3 files changed

Lines changed: 122 additions & 4 deletions

Lines changed: 57 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -0,0 +1,57 @@
1+
# Cycle 14 Findings - 2026-06-04
2+
3+
## Scope
4+
5+
- Repository: `SwiftFloris`
6+
- Baseline: clean detached worktree at pushed `master` `857cfe0`
7+
(`docs: refresh cycle 13 research queue`), described as
8+
`v1.8.246-2-g857cfe0`.
9+
- Sync: `git pull --rebase origin master` reported up to date before this
10+
cycle.
11+
- Constraint: research/docs only. No feature source, tests, build files, or
12+
assets were edited.
13+
14+
## Anti-Duplicate Checks
15+
16+
- Did not duplicate R12-1. R12-1 targets safe file replacement after a flush;
17+
this cycle targets the token-safety precondition before flush writes TSV.
18+
- Did not duplicate R13-1. R13-1 targets stats/reset serialization; this cycle
19+
targets write-time rejection of tokens that cannot be represented in the TSV
20+
format.
21+
- Did not reopen v1.8.234 locale-scoped flush regression coverage. That release
22+
covers which locale is flushed, not whether token strings are TSV-safe.
23+
- Did not propose an escaping or migration layer. Rejection is the narrower
24+
implementation shape for this app-private learned-token format.
25+
26+
## Local Evidence
27+
28+
- `PersonalBigramStore.kt:82-88` and `PersonalTrigramStore.kt:86-92` normalize
29+
learned words by trimming edge punctuation, rejecting digits, and requiring a
30+
letter, but they do not reject interior tab, newline, carriage-return, NUL, or
31+
other ISO control characters.
32+
- `PersonalBigramStore.kt:101-111` and `PersonalTrigramStore.kt:107-119` reload
33+
persisted rows with `split('\t')`, so interior tabs change field counts.
34+
- `PersonalBigramStore.kt:303-315` and `PersonalTrigramStore.kt:305-319` write
35+
raw token strings separated by tabs and terminated with newlines.
36+
- `PersonalTrigramStore.kt:51` reserves `\u0000` as the in-memory context
37+
delimiter, so a committed NUL inside `prev2` or `prev1` can collide with
38+
context splitting.
39+
- Existing dictionary source tests cover locale-scoped flush/reset contracts,
40+
but not control-character rejection before persistence.
41+
- `docs/AUDIT_2026-05-28.md:66-68` records the tab/NUL corruption path.
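
The reload and context-delimiter hazards listed above can be reproduced in isolation. The following Kotlin sketch is illustrative only: the three-field row layout (`prev<TAB>next<TAB>count`) and the helper names `writeRawRow`, `fieldCount`, `contextKey`, and `contextParts` are assumptions made for this example, not code taken from either store.

```kotlin
// Assumed row layout for illustration only: prev<TAB>next<TAB>count.
const val EXPECTED_FIELDS = 3

// Builds a row the way a raw writer would: no escaping, no validation.
fun writeRawRow(prev: String, next: String, count: Int): String =
    "$prev\t$next\t$count"

// Number of fields a split('\t') reader sees for one row.
fun fieldCount(row: String): Int = row.split('\t').size

// Hypothetical trigram context key: prev2 and prev1 joined by NUL.
fun contextKey(prev2: String, prev1: String): String = "$prev2\u0000$prev1"

// Number of parts recovered when the context key is split on NUL.
fun contextParts(key: String): Int = key.split('\u0000').size
```

Under these assumptions, `fieldCount(writeRawRow("foo\tbar", "baz", 2))` is 4 rather than `EXPECTED_FIELDS`, and `contextParts(contextKey("a\u0000x", "b"))` is 3 rather than 2. Neither reader can tell which separator was part of the token.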
42+
43+
## Roadmap Changes Fed
44+
45+
- R14-1: Reject control separators before personal n-gram TSV persistence. The
46+
implementation should reject tab, newline, carriage-return, NUL, and other
47+
ISO control characters in both bigram and trigram normalized tokens before the
48+
tokens can reach in-memory maps or flush rows, with focused coverage that
49+
fails if a token can alter TSV field counts or trigram context splitting.
50+
51+
## Non-Adds
52+
53+
- No source fix was made in this cycle.
54+
- No new dictionary retention, export, permission, or network behavior was
55+
proposed.
56+
- No TSV escaping/migration work was proposed. Rejection is enough for the current
57+
app-private learned-token persistence format.

RESEARCH_REPORT.md

Lines changed: 22 additions & 4 deletions
@@ -1,6 +1,6 @@
11
# SwiftFloris Research Report
22

3-
This report summarizes current research conclusions. The full 2026-05-25 research plan is archived at `docs/archive/research/RESEARCH_FEATURE_PLAN_2026-05-25.md`. Deep-research pass refreshed **2026-06-03** (post-v1.8.204), with 2026-06-04 freshness notes through Cycle 13 and v1.8.246 implementation notes.
3+
This report summarizes current research conclusions. The full 2026-05-25 research plan is archived at `docs/archive/research/RESEARCH_FEATURE_PLAN_2026-05-25.md`. Deep-research pass refreshed **2026-06-03** (post-v1.8.204), with 2026-06-04 freshness notes through Cycle 14 and v1.8.246 implementation notes.
44

55
2026-06-04 implementation note: v1.8.241 closed R4-3. `MimeTypeFilter`
66
constructor stdout logging is removed, aggregate helper semantics are documented
@@ -38,6 +38,13 @@ must remain local generated output rather than review evidence.
3838
AndroidX Core `1.19.0` remains blocked on the API 37 behavior-gate because the
3939
published `core-1.19.0.aar` metadata declares `minCompileSdk=37`.
4040

41+
2026-06-04 Cycle 14 note: after the Cycle 13 docs push, `master` is clean at
42+
`857cfe0` (`v1.8.246-2-g857cfe0`). Cycle 14 rechecked the deferred
43+
personal n-gram TSV token-safety audit against live bigram/trigram
44+
normalization, load, and flush paths. This cycle adds R14-1: reject tab,
45+
newline, carriage-return, NUL, and other ISO control characters before learned
46+
tokens can enter personal n-gram TSV persistence.
47+
4148
2026-06-04 Cycle 13 note: after the Cycle 12 docs push, `master` is clean at
4249
`3df1e5b` (`v1.8.246-1-g3df1e5b`). Cycle 13 rechecked the deferred
4350
`totalEntryCount()` / `resetAndAwait()` audit against live personal bigram and
@@ -204,6 +211,7 @@ Top opportunities (one line each):
204211
25. **Preference-store init splash recovery** — async `initAndroid` failures now stage a crash report, unblock the splash wait, and redirect to crash recovery before normal Settings content renders (R11-1). [Closed]
205212
26. **Personal n-gram file replacement** — bigram/trigram flush fallback deletes the live file before a successful replacement exists (R12-1, P2). [Verified]
206213
27. **Personal n-gram stats/reset serialization** — `totalEntryCount()` can enumerate/load persisted bigram/trigram locales outside the reset lock while `resetAndAwait()` clears and deletes those files (R13-1, P2). [Verified]
214+
28. **Personal n-gram TSV token safety** — learned bigram/trigram tokens can still contain tab/newline/NUL/control separators that corrupt TSV rows or trigram context keys on reload (R14-1, P2). [Verified]
207215

208216
No Critical or Major reliability/security defects were found that are not already on the roadmap or in the deferred audit lists. The remaining heavy work (glide model training, Vosk addon, F-Droid submission, device-only visual verification) stays maintainer-gated as the existing roadmap records.
209217

@@ -287,7 +295,9 @@ Privacy-first multilingual IME. `:app` is Apache-2.0-ceiling, no network permiss
287295
deletes the destination before a second rename attempt. R12-1 keeps the
288296
last-known-good n-gram file until replacement succeeds. R13-1 adds the
289297
adjacent stats/reset consistency gap: `totalEntryCount()` should not reload or
290-
report stale locales around `resetAndAwait()` cleanup. [Verified]
298+
report stale locales around `resetAndAwait()` cleanup. R14-1 adds the
299+
write-time TSV token-safety gap for control separators before persistence.
300+
[Verified]
291301
- Established surfaces (autocorrect/SymSpell, glide classifier, clipboard, addons, voice handoff, sync, MCP, hardware-keyboard import) are covered by `COMPLETED.md` and the audits; no net-new gap surfaced beyond what the roadmap already tracks.
292302

293303
## Competitive Landscape
@@ -330,6 +340,10 @@ Privacy-first multilingual IME. `:app` is Apache-2.0-ceiling, no network permiss
330340
bigram/trigram `totalEntryCount()` file enumeration and `ensureLoaded()` under
331341
the reset-safe boundary, or compute counts from a reset-safe snapshot, so
332342
Settings stats cannot resurrect or display stale learning counts after reset.
343+
- **[Medium] Personal n-gram TSV token safety** → R14-1. Reject tab, newline,
344+
carriage-return, NUL, and other ISO control characters in bigram/trigram
345+
normalized tokens before they can corrupt tab-separated rows or trigram
346+
context keys.
333347
- **[Closed v1.8.219] Remaining diagnostic `printStackTrace()` paths** → R2-2. `RestoreScreen` failure diagnostics now use `flogError`, restore UI copy falls back to the existing "Unknown error" string for null/blank throwable messages, and `CrashUtility.writeToFile` logs through `LogTopic.CRASH_UTILITY`.
334348
- **[High] Local release ledger drift** → R3-1. Three code-fix commits after
335349
the v1.8.225 docs marker are untagged and absent from the release ledger.
@@ -423,7 +437,9 @@ Privacy-first multilingual IME. `:app` is Apache-2.0-ceiling, no network permiss
423437
atomic-replace contract so persistence failures cannot destroy the previous
424438
locale file. Cycle 13 adds the related read/reset boundary: stats counting
425439
should share reset serialization or use a reset-safe snapshot before it can
426-
call `ensureLoaded()` on persisted locale files.
440+
call `ensureLoaded()` on persisted locale files. Cycle 14 adds the TSV
441+
precondition boundary: learned tokens must reject control separators before
442+
the stores write raw tab/newline-delimited rows.
427443
- **User-dictionary navigation policy:** `UserDictionaryEntryPolicy` correctly
428444
centralizes leave/mutation/transfer gates. v1.8.232 keeps that policy and
429445
adds a visible response when Compose back handling blocks the gesture during
@@ -436,7 +452,7 @@ Privacy-first multilingual IME. `:app` is Apache-2.0-ceiling, no network permiss
436452

437453
## Security / Privacy / Data Safety
438454

439-
No net-new permission or data-egress finding. The settings-search additions are display/navigation only; the no-results Browse all settings action (RA-2), synonym keyword coverage (RA-3), and query-change scroll reset (RA-10) do not weaken the no-network posture. R2-1 and R2-2 closed as local diagnostic-safety work without adding network, telemetry, or broad file export. R11-1 closes the async side of startup diagnostics by surfacing preference-store init failures through the existing local crash recovery path without adding storage, permissions, or outbound data. R12-1 is local personal-prediction durability hardening and does not change dictionary retention, export, permissions, or outbound data. R13-1 is local stats/reset consistency hardening for the same personal n-gram files and likewise does not change retention, export, permissions, or outbound data. R3-2 is also local-only clipboard filtering. R3-3 closed as sync-crypto contract hardening before transport activation, with no new permission or native dependency. R4-1/R4-2/R4-3/R4-4 are closed local correctness/a11y/API-contract work. WS12 and WS10/WS15 are docs/resource-only and do not change permissions, retention, or storage behavior. R5-1 closed as trust-boundary hardening for optional addon APKs: it keeps the no-network addon screen but requires explicit trust before non-co-signed packages become active. R6-1 is local editor critical-section hardening and does not change storage, permissions, or outbound data. R7-1 closed as privacy posture hardening for the existing incognito mode and `FLAG_SECURE` contract, not a permission change. R9-1 is privacy-state hardening for existing local suggestion and smart-compose paths: it keeps the no-network posture and ensures `IME_FLAG_NO_PERSONALIZED_LEARNING` / incognito decisions are request-scoped across async work. R10-1 is local editor-session lifecycle hardening and does not change storage, permissions, or outbound data. 
R8-1 is UI feedback for an already-blocked dictionary operation path and does not change data retention, dictionary mutation, or export/import permissions. WS13 now explicitly includes the deferred `StickerMediaProvider.openFile` SAF allow-list validation so forged encoded sticker URIs are rejected without broadening file access. The deferred audit lists (`docs/AUDIT_2026-06-02.md`) remain the authority for crypto/parsing/lifecycle hardening; this pass does not duplicate them.
455+
No net-new permission or data-egress finding. The settings-search additions are display/navigation only; the no-results Browse all settings action (RA-2), synonym keyword coverage (RA-3), and query-change scroll reset (RA-10) do not weaken the no-network posture. R2-1 and R2-2 closed as local diagnostic-safety work without adding network, telemetry, or broad file export. R11-1 closes the async side of startup diagnostics by surfacing preference-store init failures through the existing local crash recovery path without adding storage, permissions, or outbound data. R12-1 is local personal-prediction durability hardening and does not change dictionary retention, export, permissions, or outbound data. R13-1 is local stats/reset consistency hardening for the same personal n-gram files and likewise does not change retention, export, permissions, or outbound data. R14-1 is local write-time token-safety hardening for existing personal n-gram persistence and does not add collection, retention, export, permissions, or outbound data. R3-2 is also local-only clipboard filtering. R3-3 closed as sync-crypto contract hardening before transport activation, with no new permission or native dependency. R4-1/R4-2/R4-3/R4-4 are closed local correctness/a11y/API-contract work. WS12 and WS10/WS15 are docs/resource-only and do not change permissions, retention, or storage behavior. R5-1 closed as trust-boundary hardening for optional addon APKs: it keeps the no-network addon screen but requires explicit trust before non-co-signed packages become active. R6-1 is local editor critical-section hardening and does not change storage, permissions, or outbound data. R7-1 closed as privacy posture hardening for the existing incognito mode and `FLAG_SECURE` contract, not a permission change. 
R9-1 is privacy-state hardening for existing local suggestion and smart-compose paths: it keeps the no-network posture and ensures `IME_FLAG_NO_PERSONALIZED_LEARNING` / incognito decisions are request-scoped across async work. R10-1 is local editor-session lifecycle hardening and does not change storage, permissions, or outbound data. R8-1 is UI feedback for an already-blocked dictionary operation path and does not change data retention, dictionary mutation, or export/import permissions. WS13 now explicitly includes the deferred `StickerMediaProvider.openFile` SAF allow-list validation so forged encoded sticker URIs are rejected without broadening file access. The deferred audit lists (`docs/AUDIT_2026-06-02.md`) remain the authority for crypto/parsing/lifecycle hardening; this pass does not duplicate them.
440456

441457
## UX & Accessibility
442458

@@ -466,6 +482,8 @@ The keyboard surface already has a strong a11y baseline (`ACCESSIBILITY.md`, `To
466482
decision is required.
467483
8. R13-1 needs a focused personal n-gram stats/reset test; no maintainer
468484
product decision is required.
485+
9. R14-1 needs focused personal n-gram token-safety tests for control
486+
separators; no maintainer product decision is required.
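
One possible shape for the focused token-safety test in item 9 is a write-then-reload round trip. The Kotlin sketch below is a simplified model under stated assumptions: `BigramRow`, `encode`, `decode`, and `survivesRoundTrip` are hypothetical names, and the three-field row layout is assumed rather than copied from the stores.

```kotlin
// Hypothetical in-memory model of one bigram row; not the stores' real types.
data class BigramRow(val prev: String, val next: String, val count: Int)

// Writes rows as raw TSV text: tab-separated fields, newline-terminated rows.
fun encode(rows: List<BigramRow>): String =
    rows.joinToString(separator = "") { "${it.prev}\t${it.next}\t${it.count}\n" }

// Reloads rows with split('\t'), skipping lines that do not have exactly
// three fields or whose count field is not an integer.
fun decode(text: String): List<BigramRow> =
    text.split('\n')
        .filter { it.isNotEmpty() }
        .mapNotNull { line ->
            val fields = line.split('\t')
            val count = fields.getOrNull(2)?.toIntOrNull()
            if (fields.size == 3 && count != null) {
                BigramRow(fields[0], fields[1], count)
            } else {
                null
            }
        }

// True when the rows read back are exactly the rows that were written.
fun survivesRoundTrip(rows: List<BigramRow>): Boolean = decode(encode(rows)) == rows
```

In this model an ordinary pair such as `BigramRow("good", "morning", 3)` survives the round trip, while a token with an interior tab or newline does not. A real test would drive the stores' own learn and flush paths and assert that such tokens are rejected before they are written.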
469487

470488
## Archived Evidence
471489

ROADMAP.md

Lines changed: 43 additions & 0 deletions
@@ -176,6 +176,49 @@ These are genuine blockers — each needs an account, key, sibling repo, ML infr
176176

177177
## Research-Driven Additions
178178

179+
### Researcher Queue (Cycle 14 - 2026-06-04)
180+
181+
- [x] 🔬 `personal-ngram-control-token-recheck-2026-06-04` - synced
182+
`master` after the Cycle 13 docs push, rechecked the deferred TSV
183+
control-character audit against live bigram/trigram normalization, load, and
184+
flush paths. This cycle adds one focused row for rejecting personal n-gram
185+
tokens that cannot be represented safely in the current TSV format.
186+
187+
#### Personal n-gram TSV token safety
188+
189+
- [ ] 🤖 P2 — Reject control separators before personal n-gram TSV persistence (R14-1)
190+
- Why: Both personal n-gram stores normalize learned words by trimming edge
191+
punctuation, rejecting digits, and requiring at least one letter, but they do
192+
not reject interior tab, newline, carriage-return, NUL, or other ISO control
193+
characters. Flush then writes tokens directly into tab-separated,
194+
newline-delimited files. A pasted or imported token like `foo\tbar` or a
195+
token containing `\u0000` can corrupt the next load by shifting TSV fields,
196+
splitting rows, or colliding with the trigram context delimiter.
197+
- Evidence: `PersonalBigramStore.kt:82-88` and
198+
`PersonalTrigramStore.kt:86-92` return lowercased trimmed words without any
199+
control-character rejection; `PersonalBigramStore.kt:101-111` and
200+
`PersonalTrigramStore.kt:107-119` parse persisted rows with `split('\t')`;
201+
`PersonalBigramStore.kt:303-315` and `PersonalTrigramStore.kt:305-319`
202+
write raw token strings separated by tabs and terminated with newlines; `PersonalTrigramStore.kt:51`
203+
reserves `\u0000` as the in-memory context delimiter; the current
204+
dictionary source tests cover locale-scoped flush/reset contracts but not
205+
write-time token safety; the deferred audit records the corruption path in
206+
`docs/AUDIT_2026-05-28.md:66-68`.
207+
- Touches: `PersonalBigramStore.kt`, `PersonalTrigramStore.kt`, and a focused
208+
JVM/source contract test for normalization or learn/flush rejection. Keep the
209+
current simple TSV format; this row is about rejecting unrepresentable tokens,
210+
not introducing an escaping migration.
211+
- Acceptance: after trimming and before lowercasing, both stores reject any
212+
normalized token containing `'\t'`, `'\n'`, `'\r'`, `'\u0000'`, or
213+
any other character for which `Char.isISOControl()` returns true; learned control-character tokens do not reach
214+
in-memory maps or persisted TSV rows; existing apostrophe/hyphen real-word
215+
cases continue to work; tests fail if either store can persist a token that
216+
changes TSV field count or trigram context splitting on reload.
217+
- Verify: `./gradlew.bat :app:testDebugUnitTest --tests
218+
"dev.patrickgold.florisboard.ime.dictionary.PersonalNgramFlushIsolationTest"`
219+
or a new focused personal n-gram token-safety test class.
220+
- Complexity: S
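
The acceptance criteria above could be met by a normalization step along the following lines. This is a minimal Kotlin sketch, not the stores' existing code: `normalizeToken` is a hypothetical helper, and its edge-trimming rule (strip anything that is not a letter or digit from both ends) is an assumption standing in for the real punctuation handling.

```kotlin
// Hypothetical helper, not the stores' actual normalization code.
// Returns the normalized token, or null when the token must not be learned.
fun normalizeToken(raw: String): String? {
    // Trim edge punctuation and whitespace; interior apostrophes and
    // hyphens are left in place.
    val trimmed = raw.trim { !it.isLetterOrDigit() }
    if (trimmed.isEmpty()) return null

    // Reject tokens the raw TSV format cannot represent. isISOControl()
    // covers U+0000..U+001F and U+007F..U+009F, which includes '\t',
    // '\n', '\r', and '\u0000'.
    if (trimmed.any { it.isISOControl() }) return null

    // Existing rules described in the roadmap row: no digits, at least one letter.
    if (trimmed.any { it.isDigit() }) return null
    if (trimmed.none { it.isLetter() }) return null

    // Lowercase only after the token has passed every rejection check.
    return trimmed.lowercase()
}
```

With this sketch, `normalizeToken("foo\tbar")` and `normalizeToken("a\u0000b")` return `null`, while `normalizeToken("don't")` and `normalizeToken("Well-known,")` return `"don't"` and `"well-known"`. Because the check runs on the trimmed token, a control character at the edge is stripped rather than causing rejection; whether that is acceptable is a design choice for the implementer.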
221+
179222
### Researcher Queue (Cycle 13 - 2026-06-04)
180223

181224
- [x] 🔬 `personal-ngram-stats-reset-race-recheck-2026-06-04` - synced
