docs(bpe): cite Incremental BPE Tokenization paper #130
Add a reference to Jiang and Gong, "Incremental BPE Tokenization" (ICML 2026), in the tokenizer comparison section of the bpe README, linking to the paper's full runtime analysis and the authors' implementation at ModelTC/mtc-inc-bpe. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates crates/bpe documentation to better ground the crate’s incremental BPE discussion in recent academic work, by adding a citation and an implementation link in the “Comparison with other tokenizers” section.
Changes:
- Adds a citation to Jiang and Gong, “Incremental BPE Tokenization” (ICML 2026) with a stated worst-case runtime bound.
- Links to the authors’ reference implementation repository (ModelTC/mtc-inc-bpe).
Show a summary per file
| File | Description |
|---|---|
| crates/bpe/README.md | Adds an academic citation and implementation link to the incremental BPE runtime analysis. |
Copilot's findings
- Files reviewed: 1/1 changed files
- Comments generated: 1
Replace LaTeX-delimited complexity notation with inline code to match the rest of the README and render correctly on crates.io/docs.rs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).
You could write that this paper builds on our incremental algorithm and takes it one step further by combining the aho-corasick search with the compatibility test into one automaton.
Great context, thanks. Added a sentence noting their work builds on our incremental algorithm and combines the aho-corasick search with the compatibility test into a single automaton (75e4040).
Per reviewer feedback, clarify that the cited paper extends this crate's incremental algorithm by combining the aho-corasick search with the compatibility test into a single automaton. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
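For readers unfamiliar with the "compatibility test" mentioned in this thread, here is a minimal sketch in Rust. It assumes the test asks whether BPE-encoding the concatenation of two adjacent tokens yields exactly those two tokens. The names `build_ranks`, `encode`, and `is_compatible` and the toy merge list are hypothetical and are not the crate's API. The brute-force check only illustrates the definition; it is not how the crate or the paper implements it.

```rust
use std::collections::HashMap;

/// Hypothetical helper: map each merge rule to its priority (lower = applied first).
fn build_ranks(merges: &[(&str, &str)]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .map(|(i, &(a, b))| ((a.to_string(), b.to_string()), i))
        .collect()
}

/// Naive BPE encoder: start from single characters and repeatedly apply
/// the lowest-rank merge among adjacent pairs (leftmost on ties).
fn encode(text: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut parts: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index of left part)
        for i in 0..parts.len().saturating_sub(1) {
            let key = (parts[i].clone(), parts[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(best_rank, _)| rank < best_rank) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                // Merge parts[i] and parts[i + 1] into a single token.
                let right = parts.remove(i + 1);
                parts[i].push_str(&right);
            }
            None => return parts, // no applicable merge is left
        }
    }
}

/// Assumed definition of compatibility: encoding `left + right` from scratch
/// must reproduce exactly the two tokens `[left, right]`.
fn is_compatible(left: &str, right: &str, ranks: &HashMap<(String, String), usize>) -> bool {
    let joined = format!("{left}{right}");
    encode(&joined, ranks) == vec![left.to_string(), right.to_string()]
}
```

With the toy merges `("a", "b")` then `("b", "c")`, the text `abc` encodes to `ab`, `c`. So `is_compatible("ab", "c", &ranks)` holds, while `is_compatible("a", "bc", &ranks)` does not, even though both pairs spell the same text.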
Why
The crates/bpe README compares our encoder against tiktoken and Huggingface tokenizers but doesn't point readers to the broader academic work on incremental BPE. Recent work provides a formal worst-case runtime analysis that's directly relevant to anyone evaluating this crate's approach.
What
Adds a short citation at the end of the Comparison with other tokenizers section linking to:
O(n log^2 t)).
Documentation-only change, no code affected.