fix(autocorrect): route typo-of-contraction-base through alias map (#101)

tribixbite · tribixbite · commit 39b9b525823a · 2026-05-21T07:26:56.000-04:00
Follow-up to Tier A. Covers the case where the new adjacency-aware dict scan converges on an alias-keyed bare form (`dont`, `hadnt`, `couldnt`, etc.) instead of the contracted target.

Before this change, typing `hadnr` (r↔t adjacent) returned `have` because:

1. `have` won the freq tiebreaker: the binary dict scale gives it ≈ 1M vs `hadnt`'s 5000, after `loadPrimaryContractionKeys` overwrites the existing freq with `currentDict[withApostrophe] ?: 5000` (the apostrophe form is never in the dict, so the 5000 anchor is unconditional).
2. Even when the alias key was the score-wise best match, the `autoCorrect` return path only consulted `contractionAliases` at step 0 (on the typed input), never on the dict-scan winner.

Fix routes through two layers:

- `ALIAS_KEY_FLOOR_FREQUENCY` (`Int.MAX_VALUE / 2`): alias-key candidates that clear the score threshold are floored above any non-alias freq, regardless of which loader populated the dict (JSON ≈ 100-10k, binary ≈ 5500-1M). Among multiple alias keys competing on the same typo (`couldnr` matches both `couldnt` and `couldve`), a `score * 1000` offset added to the floor makes the higher-scoring candidate win deterministically. This eliminates the hash-map iteration-order race that previously picked `could've` over `couldn't`.
- End-of-scan alias re-route: after `bestCandidate` is picked, a `contractionAliases[winner]` lookup substitutes the contracted form. Capitalization rules from step 0 (I-prefix → "I'm"; else `preserveCapitalization`) are reused.

Tests (3 new in AutocorrectTest, all instrumented):

- `aliasDirect_hadntIsLoaded`: sanity probe that the alias map is populated for this `@Before` setup (it was unclear earlier whether `loadDictionary` triggered contractions loading in tests).
- `contractionBaseTypo_hadnrToHadntContracted`: `hadnr → hadn't` via the new scan-then-reroute path.
- `contractionBaseTypo_couldnrToCouldntContracted`: `couldnr → couldn't`, validating the score tiebreaker among multiple alias-key candidates (`couldnt` vs `couldve`).

Verified: 1231 pure + 194 mock + 1279 instrumented tests pass.

-- opus 4.7
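
The floored tiebreaker described above can be sketched in isolation. This is a minimal illustration, not the code in `WordPredictor`: `Candidate`, `effectiveFrequency`, and `pickWinner` are hypothetical names, the scan input is reduced to a list of `(word, score, dictFrequency)` triples, and any scores passed in are made up for the example.

```kotlin
// Hypothetical stand-in for the real candidate type, which this commit does not show in full.
data class Candidate(val word: String, val effectiveFrequency: Int)

// Same value as the constant added in this commit.
const val ALIAS_KEY_FLOOR_FREQUENCY = Int.MAX_VALUE / 2

/**
 * Alias keys get the floor plus a score-scaled offset; every other word
 * keeps its plain dictionary frequency.
 */
fun effectiveFrequency(
    word: String,
    score: Float,
    dictFrequency: Int,
    aliases: Map<String, String>
): Int =
    if (word in aliases) {
        ALIAS_KEY_FLOOR_FREQUENCY + (score * 1000f).toInt()
    } else {
        dictFrequency
    }

/**
 * Picks the candidate with the highest effective frequency among those
 * whose score clears the threshold. Returns null if none qualifies.
 */
fun pickWinner(
    scored: List<Triple<String, Float, Int>>, // (word, score, dictFrequency)
    aliases: Map<String, String>,
    threshold: Float
): Candidate? =
    scored
        .filter { (_, score, _) -> score >= threshold }
        .map { (word, score, freq) ->
            Candidate(word, effectiveFrequency(word, score, freq, aliases))
        }
        .maxByOrNull { it.effectiveFrequency }
```

With an alias map containing `hadnt`, a scan that yields `have` at frequency 1,000,000 and `hadnt` at frequency 5000 selects `hadnt`, because the floor sits far above either loader's frequency range. When two alias keys both qualify, the one with the higher score wins regardless of the order in which they were scanned.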
diff --git a/src/androidTest/kotlin/tribixbite/cleverkeys/AutocorrectTest.kt b/src/androidTest/kotlin/tribixbite/cleverkeys/AutocorrectTest.kt
@@ -217,6 +217,55 @@ class AutocorrectTest {
         assertTrue("Min length should be reasonable", minLength >= 0 && minLength <= 5)
     }
 
+    // =========================================================================
+    // Issue #101 follow-up — typo of a contraction base must autocorrect to
+    // the CONTRACTED form, not the bare alias key. Reported 2026-05-21:
+    // Tier A's adjacency-aware dict scan now reaches alias-injected `dont` /
+    // `im` / `youre` entries via near-miss typos, but autoCorrect returned
+    // the bare alias key because the contractionAliases lookup only fired on
+    // the typed input, not on the dict-scan winner. Fix: re-route the winner
+    // through contractionAliases before returning.
+    // =========================================================================
+
+    @Test
+    fun testAutocorrect_aliasDirect_hadntIsLoaded() {
+        // Sanity probe: confirms `contractionAliases` is populated for this
+        // test's @Before setup. If this fails the typo-of-base tests below
+        // can't possibly work (boost is gated on `dictWord in contractionAliases`).
+        config.autocorrect_enabled = true
+        val result = predictor.autoCorrect("hadnt")
+        assertEquals("alias direct path: hadnt should map to hadn't via step 0",
+            "hadn't", result)
+    }
+
+    @Test
+    fun testAutocorrect_contractionBaseTypo_hadnrToHadntContracted() {
+        // `hadnr` is a 5-char typo of `hadnt` (t→r, both top-row adjacent).
+        // `hadnt` is alias-injected mapping to "hadn't". The `hadn?` prefix
+        // and ≥ 50% exact ratio leave `hadnt` (alias-keyed) as the only
+        // viable winner — no high-freq 5-char competitor matches the
+        // `hadn?` shape with exactRatio ≥ 0.5.
+        // Expected: the dict-scan winner `hadnt` is re-routed through the
+        // alias map to yield "hadn't".
+        config.autocorrect_enabled = true
+        config.autocorrect_prefix_length = 0
+        val result = predictor.autoCorrect("hadnr")
+        assertEquals("hadnr → hadn't (alias re-routed dict-scan winner)",
+            "hadn't", result)
+    }
+
+    @Test
+    fun testAutocorrect_contractionBaseTypo_couldnrToCouldntContracted() {
+        // `couldnr` is a 7-char typo of `couldnt` (t→r, adjacent). Both
+        // `couldnt` and `couldve` are alias-keyed 7-char candidates, so
+        // the score-scaled offset on the floor must pick `couldnt`.
+        config.autocorrect_enabled = true
+        config.autocorrect_prefix_length = 0
+        val result = predictor.autoCorrect("couldnr")
+        assertEquals("couldnr → couldn't (alias re-routed dict-scan winner)",
+            "couldn't", result)
+    }
+
     // =========================================================================
     // Config settings tests
     // =========================================================================
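
The end-of-scan re-route that the new tests exercise can be sketched as a standalone function. This is an illustration under assumptions: `rerouteWinner` is a hypothetical name, and `preserveCapitalizationSketch` is a simplified stand-in for the real `preserveCapitalization` helper, whose body is not part of this diff (the stand-in only carries over a leading capital).

```kotlin
/**
 * Simplified, hypothetical stand-in for preserveCapitalization:
 * capitalizes the target only when the typed word starts with a capital.
 */
fun preserveCapitalizationSketch(typed: String, target: String): String =
    if (typed.firstOrNull()?.isUpperCase() == true) {
        target.replaceFirstChar { it.uppercase() }
    } else {
        target
    }

/**
 * Maps a dict-scan winner to its contracted form when it is an alias key.
 * Targets starting with "i'" are always capitalized ("i'm" becomes "I'm");
 * everything else follows the typed word's capitalization.
 */
fun rerouteWinner(typed: String, winner: String, aliases: Map<String, String>): String {
    val aliasTarget = aliases[winner]
    return if (aliasTarget != null && aliasTarget.startsWith("i'")) {
        aliasTarget.replaceFirstChar { it.uppercase() }
    } else {
        // Non-alias winners pass through unchanged apart from capitalization.
        preserveCapitalizationSketch(typed, aliasTarget ?: winner)
    }
}
```

Under these assumptions, a winner of `hadnt` for the typed word `hadnr` comes back as `hadn't`, while a winner that is not an alias key is returned as the scan found it.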
diff --git a/src/main/kotlin/tribixbite/cleverkeys/WordPredictor.kt b/src/main/kotlin/tribixbite/cleverkeys/WordPredictor.kt
@@ -70,6 +70,28 @@ class WordPredictor {
          *   - `wuestion → something` (lenDiff=1, ed≈2.71): 2.71 > 1.5 ✗
          */
         private const val LENGTH_DIFF_ED_BUDGET = 0.5f
+
+        /**
+         * Floor frequency assigned to alias-key candidates (bare-form
+         * contractions like `dont`, `cant`, `hadnt`) during the dict-scan
+         * tiebreaker. Sized at `Int.MAX_VALUE / 2` so the alias-key wins
+         * against ANY non-alias candidate regardless of which freq scale
+         * is in use (JSON path: ≈100–10k, binary path: ≈5500–1,000,000).
+         *
+         * Product intent: when a typo is a near-match to a contraction
+         * base AND clears the score threshold, the contracted form is the
+         * far more likely intent than a similarly-scored common word —
+         * `donr → don't` beats `donr → done` because users typing `donr`
+         * almost always meant `don't`. Tied/near-tied scores are the norm
+         * for typos: a single adjacent-key substitution lands at score ≈
+         * 0.97 for many candidate words simultaneously, and the old
+         * frequency tiebreaker silently picked the wrong winner.
+         *
+         * Halved from `Int.MAX_VALUE` so the score-scaled offset added
+         * when aliases compete in the same scan cannot overflow `Int`,
+         * though that case is itself a corner (most typos match only one base).
+         */
+        private const val ALIAS_KEY_FLOOR_FREQUENCY = Int.MAX_VALUE / 2
         private const val MAX_EDIT_DISTANCE = 2
         private const val MAX_RECENT_WORDS = 20 // Keep last 20 words for language detection
         private const val PREFIX_INDEX_MAX_LENGTH = 3 // Index prefixes up to 3 chars
@@ -1845,17 +1867,42 @@ class WordPredictor {
             }
 
             if (score >= charMatchThreshold) {
-                // Tiebreaker: higher dictionary frequency wins.
-                if (bestCandidate == null || candidateFrequency > bestCandidate.score) {
-                    bestCandidate = WordCandidate(dictWord, candidateFrequency)
+                // Tiebreaker: higher dictionary frequency wins, but alias-
+                // keys (bare-form contractions like `dont`, `cant`, `hadnt`)
+                // are floored at `ALIAS_KEY_FLOOR_FREQUENCY` so they always
+                // beat non-alias competitors. Among multiple alias-keys
+                // (e.g. `couldnr` matches both `couldnt` AND `couldve`),
+                // the higher-scoring candidate wins via a score-scaled
+                // offset added to the floor. Without the offset, hash-map
+                // iteration order silently picks the wrong contraction.
+                val effectiveFrequency =
+                    if (dictWord in contractionAliases) {
+                        ALIAS_KEY_FLOOR_FREQUENCY + (score * 1000f).toInt()
+                    } else {
+                        candidateFrequency
+                    }
+                if (bestCandidate == null || effectiveFrequency > bestCandidate.score) {
+                    bestCandidate = WordCandidate(dictWord, effectiveFrequency)
                 }
             }
         }
 
         // 5. Apply correction only if confident candidate found.
         if (bestCandidate != null && bestCandidate.score >= frequencyFloor) {
-            val corrected = preserveCapitalization(typedWord, bestCandidate.word)
-            Log.d(TAG, "AUTO-CORRECT: '$typedWord' → '$corrected' (freq=${bestCandidate.score})")
+            // Re-route alias-keyed winners through contractionAliases so the
+            // returned form is the apostrophe-bearing contraction. Without
+            // this, `donr → dont` (the alias-key) would stop there; the
+            // user-visible result must be `don't`. The same I-capitalization
+            // rule from step 0 applies.
+            val winnerWord = bestCandidate.word
+            val aliasTarget = contractionAliases[winnerWord]
+            val outputWord = aliasTarget ?: winnerWord
+            val corrected = if (aliasTarget != null && aliasTarget.startsWith("i'")) {
+                aliasTarget.replaceFirstChar { it.uppercase() }
+            } else {
+                preserveCapitalization(typedWord, outputWord)
+            }
+            Log.d(TAG, "AUTO-CORRECT: '$typedWord' → '$corrected' (winner=$winnerWord, freq=${bestCandidate.score})")
             return corrected
         }