Commit 75e4040

gorzell and Copilot committed
docs(bpe): note paper builds on our incremental algorithm
Per reviewer feedback, clarify that the cited paper extends this crate's incremental algorithm by combining the aho-corasick search with the compatibility test into a single automaton.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent: 35e387b

1 file changed

Lines changed: 1 addition & 0 deletions

crates/bpe/README.md

@@ -302,4 +302,5 @@ The Huggingface encoder scales better, but becomes slower and slower compared to
  ![worst-case encoding runtime comparison](./images/performance-worstcase.svg)

  For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).
+ Their work builds on our incremental algorithm and takes it one step further by combining the aho-corasick search with the compatibility test into a single automaton.
  Their implementation is available at [ModelTC/mtc-inc-bpe](https://github.com/ModelTC/mtc-inc-bpe).
