Skip to content

Commit 6b4553b

Browse files
fix(boost): rebuild trie on vocab change, not on every status flip
The phrase-boost trie rebuild effect depended on the raw `status` string as a proxy for "the model's tokenizer changed". But `status` also flips on recording start/stop, file transcribe, and each chunk-progress tick (setStatus with a percentage string). Every flip re-ran parseBoostPhrases (~88ms on the 64k-line medical list) and the trie rebuild, and pushed fresh boost-warning arrays that forced the giant phrase-list textarea to reconcile. On load, status cycles idle -> loadingModel -> creatingSessions -> modelReady, so this fired several times back to back: the "My Computer tab frozen on load" the user saw. The earlier fix (d103856) only stopped the casing-expand on the prebuilt path; the parse + rebuild + warning-state churn on every status value remained. Fix: publish the loaded tokenizer's vocab signature as `tokenizerVocabSig` state (set after the model is created, cleared on dispose) and key the rebuild effect on it instead of `status`. The vocab signature changes only on a real model load/swap, so the trie rebuilds exactly once per model change and unrelated status churn no longer touches it. Covered by a tier-3 e2e (boost-rebuild-on-status.spec.js): with verbose logging on and a Custom boost list, it asserts the "[Boost] rebuilding trie" count does not grow across a full transcribe cycle. Verified the test fails when the effect is reverted to depend on `status`. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 583ba67 commit 6b4553b

2 files changed

Lines changed: 120 additions & 3 deletions

File tree

app/ui/src/App.jsx

Lines changed: 26 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -517,6 +517,17 @@ export default function App() {
517517
const [boostRebuilding, setBoostRebuilding] = useState(false);
518518
const phraseBoostRef = useRef(null); // BoostingTrie | null (null = inert)
519519
const boostStrengthRef = useRef(1);
520+
// Vocab signature of the currently loaded tokenizer (null when no model is
521+
// loaded). This is the *only* model-side input the boost-trie rebuild needs:
522+
// it changes when (and only when) the model's vocab does, so the rebuild
523+
// effect keys on it instead of `status`. Keying on `status` was the bug
524+
// behind the "My Computer tab frozen on load": `status` also flips on every
525+
// recording start/stop, file transcribe, and even each chunk-progress tick
526+
// (setStatus with a percentage string), and each flip re-ran the heavy
527+
// parseBoostPhrases + trie rebuild and pushed fresh boost-warning arrays that
528+
// forced the giant phrase-list textarea to reconcile. Now the rebuild fires
529+
// once per real model change, not once per status string.
530+
const [tokenizerVocabSig, setTokenizerVocabSig] = useState(null);
520531
// Current verbose-logging flag for use inside the debounced rebuild closure
521532
// (which captures a stale `verboseLog`); synced by the effect below.
522533
const verboseLogRef = useRef(false);
@@ -1660,8 +1671,12 @@ export default function App() {
16601671
// is debounced (a large paste shouldn't re-encode per keystroke) and the
16611672
// encode runs in the worker, so the main thread never blocks on tokenizing a
16621673
// big list; only the cheap trie insert happens here. Strength is applied
1663-
// separately (below) so moving the slider does not force a re-encode. We
1664-
// rebuild on `status` so a model load/swap refreshes the trie.
1674+
// separately (below) so moving the slider does not force a re-encode. We key
1675+
// on `tokenizerVocabSig` (not `status`) so a model load/swap refreshes the
1676+
// trie exactly once: `status` also flips on every recording/transcribe/chunk
1677+
// transition, none of which change the vocab, and re-running the parse +
1678+
// rebuild + warning-state writes on each of those froze the UI on a large
1679+
// curated list (the textarea reconciled the whole list every time).
16651680
useEffect(() => {
16661681
// parseBoostPhrases is cheap (a line scan) and feeds the inline warnings,
16671682
// so it always runs. The expensive step is expandCasingVariants below, so
@@ -1746,7 +1761,7 @@ export default function App() {
17461761
}
17471762
}, BOOST_REBUILD_DEBOUNCE_MS);
17481763
return () => { cancelled = true; clearTimeout(timer); if (showSpinner) setBoostRebuilding(false); };
1749-
}, [boostPhrases, boostCaseInsensitive, status, encodeBoostPhrases]);
1764+
}, [boostPhrases, boostCaseInsensitive, tokenizerVocabSig, encodeBoostPhrases]);
17501765

17511766
// Apply the strength slider without rebuilding the trie.
17521767
useEffect(() => {
@@ -1769,6 +1784,9 @@ export default function App() {
17691784
console.log('[App] Disposing existing model before loading new one...');
17701785
modelRef.current.dispose();
17711786
modelRef.current = null;
1787+
// Drop the old vocab signature so the boost effect clears its trie now
1788+
// (no tokenizer) and rebuilds once the new model publishes its signature.
1789+
setTokenizerVocabSig(null);
17721790
}
17731791

17741792
setStatus('loadingModel');
@@ -1842,6 +1860,11 @@ export default function App() {
18421860
});
18431861

18441862
console.timeEnd('LoadModel');
1863+
// Publish the loaded tokenizer's vocab signature so the boost-trie rebuild
1864+
// effect runs now (model became ready) and on a later vocab-changing swap,
1865+
// but NOT on the unrelated status churn of recording/transcribing.
1866+
const tk = modelRef.current?.tokenizer;
1867+
setTokenizerVocabSig(tk?.id2token ? vocabSignature(tk.id2token) : 'ready');
18451868
setStatus('modelReady');
18461869
setProgressText('');
18471870
setProgressPct(null);
Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
// Tier-3 regression E2E: the phrase-boost trie must rebuild once per real model
2+
// change, NOT on every `status` transition. The boost rebuild effect used to
3+
// depend on the raw `status` string, which also flips on recording start/stop,
4+
// file transcribe, and each chunk-progress tick. On a large curated list each
5+
// flip re-ran parseBoostPhrases + the trie rebuild and pushed fresh
6+
// boost-warning arrays that forced the giant phrase-list textarea to reconcile,
7+
// freezing the "My Computer" tab. The fix keys the effect on the loaded
8+
// tokenizer's vocab signature instead, so unrelated status churn no longer
9+
// rebuilds.
10+
//
11+
// This pins it end to end: with verbose logging on, "[Boost] rebuilding trie"
12+
// is logged once per rebuild. We record the count once the model is ready, run
13+
// a full transcription (which flips `status` through transcribing -> modelReady
14+
// and would re-trigger the buggy effect when it lands back on modelReady), and
15+
// assert the count did not grow. Reuses the WASM-int8 local-model setup of
16+
// transcription.spec.js.
17+
//
18+
// Built with Claude Code.
19+
20+
import { test, expect } from '@playwright/test';
21+
import { fileURLToPath } from 'node:url';
22+
import { resolve, dirname } from 'node:path';
23+
24+
const here = dirname(fileURLToPath(import.meta.url));
25+
const FIXTURE_AUDIO = resolve(here, '../fixtures/sample.aac');
26+
27+
// Seed the settings DB so the app loads the model from /models with WASM, with
28+
// verbose logging on and a Custom boost list present (a few phrases is enough to
29+
// build a trie and emit the rebuild log; the bug is about *how often* it
30+
// rebuilds, not list size).
31+
async function seedSettings(page) {
32+
await page.evaluate(async () => {
33+
const DB = 'parakeetweb-settings-db', STORE = 'settings-store', PREFIX = 'parakeetweb_';
34+
const db = await new Promise((res, rej) => {
35+
const req = indexedDB.open(DB, 1);
36+
req.onupgradeneeded = (e) => {
37+
const d = e.target.result;
38+
if (!d.objectStoreNames.contains(STORE)) d.createObjectStore(STORE);
39+
};
40+
req.onsuccess = () => res(req.result);
41+
req.onerror = () => rej(req.error);
42+
});
43+
await new Promise((res, rej) => {
44+
const tx = db.transaction([STORE], 'readwrite');
45+
const os = tx.objectStore(STORE);
46+
os.put('local', PREFIX + 'modelSource');
47+
os.put('wasm', PREFIX + 'backend');
48+
os.put(true, PREFIX + 'verboseLog');
49+
os.put('__custom__', PREFIX + 'boostSource');
50+
os.put('venlafaxine\nacetaminophen\nmetoprolol', PREFIX + 'boostPhrases');
51+
tx.oncomplete = () => res();
52+
tx.onerror = () => rej(tx.error);
53+
});
54+
});
55+
}
56+
57+
test('boost trie does not rebuild on transcribe status churn', async ({ page }) => {
58+
let boostRebuilds = 0;
59+
page.on('console', (m) => {
60+
if (m.type() === 'log' && m.text().includes('[Boost] rebuilding trie')) boostRebuilds += 1;
61+
});
62+
63+
// First load creates the settings DB/store; seed it, then reload so the app
64+
// picks up local model source + wasm backend + the boost list.
65+
await page.goto('/');
66+
await seedSettings(page);
67+
await page.reload();
68+
69+
await page.locator('[data-umami-event="load_model_button"]').click();
70+
71+
// Model ready: the app renders a ✔ once weights are loaded and initialised.
72+
await expect(page.locator('body')).toContainText('✔', { timeout: 6 * 60 * 1000 });
73+
74+
// Let the post-model-ready rebuild (and its 300ms debounce) settle, then take
75+
// the baseline: this is the one legitimate rebuild (the vocab became known).
76+
await expect.poll(() => boostRebuilds, { timeout: 30 * 1000 }).toBeGreaterThanOrEqual(1);
77+
await page.waitForTimeout(1000);
78+
const afterReady = boostRebuilds;
79+
80+
// Transcribe the fixture clip. autoTranscribe is on by default, so feeding the
81+
// file starts it; this churns `status` (transcribing... -> modelReady) which
82+
// is exactly what used to re-trigger the rebuild.
83+
await page.locator('#audio-file-input').setInputFiles(FIXTURE_AUDIO);
84+
const historyText = page.locator('.history-text').first();
85+
await expect(historyText).toBeVisible({ timeout: 6 * 60 * 1000 });
86+
await expect(historyText).not.toBeEmpty({ timeout: 6 * 60 * 1000 });
87+
88+
// Give any (buggy) status-driven rebuild its debounce window to fire.
89+
await page.waitForTimeout(1000);
90+
expect(
91+
boostRebuilds,
92+
`boost trie rebuilt ${boostRebuilds - afterReady} extra time(s) on transcribe status churn (expected 0)`,
93+
).toBe(afterReady);
94+
});

0 commit comments

Comments
 (0)