Commit 4ffa548

Merge pull request #130 from github/gorzell/cite-incremental-bpe-paper
docs(bpe): cite Incremental BPE Tokenization paper
2 parents 2c3ff25 + 75e4040 commit 4ffa548

1 file changed

Lines changed: 4 additions & 0 deletions

crates/bpe/README.md

@@ -300,3 +300,7 @@ This case is particularly challenging for tiktoken, which shows a quadratic grow
 The Huggingface encoder scales better, but becomes slower and slower compared to our implementation as input size increases.
 
 ![worst-case encoding runtime comparison](./images/performance-worstcase.svg)
+
+For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).
+Their work builds on our incremental algorithm and takes it one step further by combining the aho-corasick search with the compatibility test into a single automaton.
+Their implementation is available at [ModelTC/mtc-inc-bpe](https://github.com/ModelTC/mtc-inc-bpe).
