docs(bpe): cite Incremental BPE Tokenization paper #130

Merged
gorzell merged 3 commits into main from gorzell/cite-incremental-bpe-paper on Jun 17, 2026

@gorzell gorzell commented Jun 17, 2026

Contributor

Why

The crates/bpe README compares our encoder against tiktoken and Hugging Face tokenizers but doesn't point readers to the broader academic work on incremental BPE. Recent work provides a formal worst-case runtime analysis that's directly relevant to anyone evaluating this crate's approach.

What

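Adds a short citation at the end of the Comparison with other tokenizers section linking to the paper (Jiang and Gong, "Incremental BPE Tokenization", ICML 2026) and to the authors' implementation at ModelTC/mtc-inc-bpe.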

Documentation-only change, no code affected.

Add a reference to Jiang and Gong, "Incremental BPE Tokenization"
(ICML 2026), in the tokenizer comparison section of the bpe README,
linking to the paper's full runtime analysis and the authors'
implementation at ModelTC/mtc-inc-bpe.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 17, 2026 07:31
@gorzell gorzell requested a review from a team as a code owner June 17, 2026 07:31
GitHub Advanced Security started work on behalf of gorzell June 17, 2026 07:31 View session
@gorzell gorzell enabled auto-merge June 17, 2026 07:31

Copilot AI left a comment

Contributor


Pull request overview

This PR updates crates/bpe documentation to better ground the crate’s incremental BPE discussion in recent academic work, by adding a citation and an implementation link in the “Comparison with other tokenizers” section.

Changes:

  • Adds a citation to Jiang and Gong, “Incremental BPE Tokenization” (ICML 2026) with a stated worst-case runtime bound.
  • Links to the authors’ reference implementation repository (ModelTC/mtc-inc-bpe).
Summary per file:

| File | Description |
| --- | --- |
| crates/bpe/README.md | Adds an academic citation and implementation link to the incremental BPE runtime analysis. |

Copilot's findings

  • Files reviewed: 1/1 changed files
  • Comments generated: 1

Comment thread crates/bpe/README.md Outdated
GitHub Advanced Security finished work on behalf of gorzell June 17, 2026 07:32
Replace LaTeX-delimited complexity notation with inline code to match
the rest of the README and render correctly on crates.io/docs.rs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@gorzell gorzell disabled auto-merge June 17, 2026 07:35
GitHub Advanced Security started work on behalf of gorzell June 17, 2026 07:35 View session
GitHub Advanced Security finished work on behalf of gorzell June 17, 2026 07:36
Comment thread crates/bpe/README.md

![worst-case encoding runtime comparison](./images/performance-worstcase.svg)

For a full runtime analysis of incremental BPE tokenization, see Jiang and Gong, ["Incremental BPE Tokenization"](https://arxiv.org/abs/2605.30813) (ICML 2026), which presents an algorithm with a worst-case `O(n log^2 t)` complexity (where `n` is the input length and `t` is the maximum token length).

Collaborator


You could write that this paper builds on our incremental algorithm and takes it one step further by combining the aho-corasick search with the compatibility test into one automaton.

Contributor Author


Great context, thanks. Added a sentence noting their work builds on our incremental algorithm and combines the aho-corasick search with the compatibility test into a single automaton (75e4040).
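
For readers unfamiliar with the two steps mentioned in this thread (finding tokens that end at the current position, then running a compatibility test), here is a toy sketch of how they might fit together. It is not the crate's implementation and not the paper's automaton: the suffix search is a plain linear scan instead of aho-corasick, and the compatibility test simply re-encodes the token pair with a naive reference encoder. The names `toy_merges`, `naive_bpe`, and `incremental_bpe`, the two-entry merge table, and the ASCII-only input are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical toy merge table: (left, right) -> rank; lower rank merges first.
fn toy_merges() -> HashMap<(String, String), usize> {
    let mut m = HashMap::new();
    m.insert(("a".to_string(), "b".to_string()), 0);
    m.insert(("b".to_string(), "c".to_string()), 1);
    m
}

/// Reference encoder: repeatedly merge the leftmost lowest-rank adjacent pair.
fn naive_bpe(text: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    let mut toks: Vec<String> = text.chars().map(|c| c.to_string()).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, index of left token)
        for i in 0..toks.len().saturating_sub(1) {
            if let Some(&rank) = merges.get(&(toks[i].clone(), toks[i + 1].clone())) {
                if best.map_or(true, |(best_rank, _)| rank < best_rank) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            Some((_, i)) => {
                let right = toks.remove(i + 1);
                toks[i].push_str(&right);
            }
            None => return toks,
        }
    }
}

/// Sketch of the incremental idea: for every prefix, record its last token by
/// trying each token that ends there and keeping the one that is compatible
/// with the last token of the remaining prefix.
fn incremental_bpe(text: &str, merges: &HashMap<(String, String), usize>) -> Vec<String> {
    assert!(text.is_ascii(), "sketch assumes ASCII so byte and char offsets agree");
    // Merged tokens, longest first; single characters are added per position.
    let mut vocab: Vec<String> = merges.keys().map(|(l, r)| format!("{l}{r}")).collect();
    vocab.sort_by_key(|t| std::cmp::Reverse(t.len()));

    let n = text.len();
    // last[i] holds the last token of the encoding of text[..i].
    let mut last: Vec<String> = vec![String::new(); n + 1];
    for i in 1..=n {
        let prefix = &text[..i];
        let single = text[i - 1..i].to_string();
        let chosen = vocab
            .iter()
            .filter(|t| prefix.ends_with(t.as_str())) // linear scan stands in for aho-corasick
            .cloned()
            .chain(std::iter::once(single))
            .find(|cand| {
                let j = i - cand.len();
                // Brute-force compatibility test: the pair must re-encode to itself.
                j == 0
                    || naive_bpe(&format!("{}{}", last[j], cand), merges)
                        == vec![last[j].clone(), cand.clone()]
            })
            .expect("expected one compatible candidate for a well-formed merge table");
        last[i] = chosen;
    }

    // Walk the recorded last tokens backwards to recover the full encoding.
    let mut out = Vec::new();
    let mut i = n;
    while i > 0 {
        let tok = last[i].clone();
        i -= tok.len();
        out.push(tok);
    }
    out.reverse();
    out
}

fn main() {
    let merges = toy_merges();
    for text in ["abc", "abbc", "bcab"] {
        let encoded = incremental_bpe(text, &merges);
        assert_eq!(encoded, naive_bpe(text, &merges));
        println!("{text} -> {encoded:?}");
    }
}
```

In the `"abc"` case the longer candidate `bc` is rejected because `a` followed by `bc` does not re-encode to itself, so the shorter candidate `c` is chosen after `ab`. This brute-force check is far slower than a real compatibility test and is used here only to keep the sketch short.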

Per reviewer feedback, clarify that the cited paper extends this
crate's incremental algorithm by combining the aho-corasick search
with the compatibility test into a single automaton.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
GitHub Advanced Security started work on behalf of gorzell June 17, 2026 07:43 View session
GitHub Advanced Security finished work on behalf of gorzell June 17, 2026 07:44
@gorzell gorzell merged commit 4ffa548 into main Jun 17, 2026
8 checks passed
@gorzell gorzell deleted the gorzell/cite-incremental-bpe-paper branch June 17, 2026 07:50