Merged
Changes from 12 commits
3 changes: 2 additions & 1 deletion crates/bpe/README.md
@@ -203,7 +203,7 @@ We benchmarked the following scenarios:
The data structure we built specifically for this purpose can answer those interval counting requests in typically constant times after the initial linear preprocessing of the text.
This mode is not available in tiktoken, which only supports counting/encoding a complete text.

-All benchmarks were run single-threaded on a MacBook Pro M1.
+All benchmarks were run single-threaded on a MacBook Air M4.

### Encoding

@@ -219,6 +219,7 @@ Two additional encoders are included that are faster but deviate from the origin

- The greedy encoder picks the left-longest token.
- The minimal encoder computes an encoding with the minimal number of tokens.
- The minimal_dropout encoder implements a variant of the [BPE-Dropout](https://arxiv.org/abs/1910.13267) algorithm, randomly ignoring some multi-byte tokens at runtime.
Collaborator

I think we should leave a note here that this is using a different drop-out strategy than proposed in the paper and it was NOT tested with actual training sessions!

Contributor Author

indeed, thanks.
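
As an aside for readers of this thread, the README bullet on the greedy encoder can be illustrated with a minimal sketch, assuming a toy dictionary given as byte slices. `encode_greedy_toy` is a hypothetical helper written only for this illustration; the crate's actual greedy encoder may well do more than this.

```rust
/// Greedy longest-match tokenization over a toy dictionary (illustration only).
/// Returns indices into `dict`, or `None` if no token matches at some position.
fn encode_greedy_toy(dict: &[&[u8]], text: &[u8]) -> Option<Vec<usize>> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        // Among all non-empty tokens that match at `pos`, take the longest one.
        let (idx, tok) = dict
            .iter()
            .enumerate()
            .filter(|(_, t)| !t.is_empty() && text[pos..].starts_with(t))
            .max_by_key(|(_, t)| t.len())?;
        out.push(idx);
        pos += tok.len();
    }
    Some(out)
}
```

With the dictionary `a, b, c, d, ab, bcd` and the text `abcd`, this sketch emits `ab, c, d` (three tokens), whereas `a, bcd` would need only two. That gap is what a minimal encoder closes.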


The benchmark measured the runtime of encoding slices of lengths 10, 100, 1000, and 10000 from a random 20000-token original text using the o200k token set.
(All encodings were computed from scratch for each slice.)
23 changes: 23 additions & 0 deletions crates/bpe/benchmarks/performance.rs
@@ -9,8 +9,17 @@ use bpe_benchmarks::*;
use criterion::{
    criterion_group, criterion_main, AxisScale, BenchmarkId, Criterion, PlotConfiguration,
};
use rand::rngs::StdRng;
use rand::SeedableRng;
use rand::{rng, Rng};

fn get_rng(seed: u64) -> StdRng {
    // Expand the u64 seed to 32 bytes
    let mut seed_bytes = [0u8; 32];
    seed_bytes[..8].copy_from_slice(&seed.to_le_bytes());
    StdRng::from_seed(seed_bytes)
}

fn counting_benchmark(c: &mut Criterion) {
    for (name, bpe, _, _) in TOKENIZERS.iter() {
        let input = create_test_string(&bpe.bpe, 80_000);
@@ -92,6 +101,20 @@ fn encoding_benchmark(c: &mut Criterion) {
                    criterion::BatchSize::SmallInput,
                )
            });
            group.bench_with_input(
                BenchmarkId::new("minimal_dropout", bytes),
                &bytes,
                |b, bytes| {
                    b.iter_batched(
                        || select_test_string(&text, *bytes),
                        |text| {
                            bpe.bpe
                                .encode_minimal_dropout(text.as_bytes(), 0.1, get_rng(0))
                        },
                        criterion::BatchSize::SmallInput,
                    )
                },
            );
            group.bench_with_input(
                BenchmarkId::new("huggingface", bytes),
                &bytes,
97 changes: 35 additions & 62 deletions crates/bpe/images/performance-appending.svg
118 changes: 44 additions & 74 deletions crates/bpe/images/performance-comparison.svg
97 changes: 35 additions & 62 deletions crates/bpe/images/performance-counting.svg
160 changes: 74 additions & 86 deletions crates/bpe/images/performance-encoding.svg
162 changes: 73 additions & 89 deletions crates/bpe/images/performance-worstcase.svg
63 changes: 59 additions & 4 deletions crates/bpe/src/byte_pair_encoding.rs
@@ -526,9 +526,9 @@ impl BytePairEncoding {
    /// tokenization produced by the original BPE algorithm.
    pub fn encode_minimal(&self, text: &[u8]) -> Vec<u32> {
        let mut last_token: Vec<(u32, u32)> = Vec::with_capacity(text.len());
-       let mut state = self.overlapping_searcher.start_state();
-       for (pos, c) in text.iter().enumerate() {
-           let (s, iter) = self.overlapping_searcher.consume(state, pos + 1, *c);
+       let mut state = self.overlapping_searcher_rev.start_state();
+       for (pos, c) in text.iter().rev().enumerate() {
+           let (s, iter) = self.overlapping_searcher_rev.consume(state, pos + 1, *c);
            state = s;
            let mut best = (0, u32::MAX);
            for m in iter {
@@ -548,7 +548,62 @@ impl BytePairEncoding {
            encoded.push(token);
            pos -= self.token_len(token);
        }
-       encoded.reverse();
        encoded
    }

    /// This function computes the encoding while randomly rejecting some merges.
    /// The result is non-deterministic unless a seeded `rng` is provided.
    /// The implementation loosely follows the original BPE-dropout paper: https://arxiv.org/abs/1910.13267
    ///
    /// In more detail: the tokenization uses dynamic programming, i.e. it models the tokenization as a graph,
    /// where every position between text bytes is a node and two nodes are connected when the text slice between those two nodes matches a token.
    /// It then tries to find the shortest possible path from the beginning of the text to the end, i.e. it finds the shortest possible encoding.
    /// For this, it processes the nodes from left to right and visits all edges to the left. Then it picks the edge which results in the shortest path.
    /// The length of the shortest path is stored as the second value, the edge (or rather token) is stored as the first value.
Collaborator

Since you described all this, you should also add the last step where we walk in reverse direction through the table along the shortest path.
Note: the reason for constructing the table from back to front is that the reconstruction outputs the path from start till end (i.e. we don't have to reverse the path afterwards).
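
To make the point about table direction concrete, here is a minimal sketch, assuming a toy dictionary of byte slices and a plain suffix check in place of the reverse Aho-Corasick searcher. `encode_minimal_back_to_front` is a hypothetical name for this illustration, not a function in the crate. The table is filled from the end of the text, so the final walk starts at position 0 and emits tokens already in reading order.

```rust
/// Toy minimal encoding that fills the DP table from the END of the text.
/// Returns indices into `dict`, or `None` if the text cannot be encoded.
fn encode_minimal_back_to_front(dict: &[&[u8]], text: &[u8]) -> Option<Vec<usize>> {
    let n = text.len();
    // best[i] = (token starting at byte i, number of tokens needed for text[i..]).
    let mut best: Vec<Option<(usize, usize)>> = vec![None; n + 1];
    best[n] = Some((usize::MAX, 0)); // the empty suffix needs zero tokens
    for i in (0..n).rev() {
        for (idx, tok) in dict.iter().enumerate() {
            if tok.is_empty() || !text[i..].starts_with(tok) {
                continue;
            }
            if let Some((_, rest)) = best[i + tok.len()] {
                if best[i].map_or(true, |(_, cur)| rest + 1 < cur) {
                    best[i] = Some((idx, rest + 1));
                }
            }
        }
    }
    // Walk forward along the stored edges; no `reverse()` is needed afterwards.
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < n {
        let (idx, _) = best[pos]?;
        out.push(idx);
        pos += dict[idx].len();
    }
    Some(out)
}
```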

Contributor Author

added, thanks!

    ///
    /// For the dropout (when dropout > 0.0), we uniformly drop edges from the graph, but always keep the one-byte tokens so that the graph stays connected.
    /// Note: this is very different from how BPE works and cannot produce the same output as the algorithm
    /// in the [paper's repository](https://github.com/VProv/BPE-Dropout/blob/master/bpe.py#L98), for two main reasons:
    /// - `encode_minimal` already doesn't follow the original heap-based BPE procedure
    /// - the randomness source in dropout works differently in Rust and Python
Collaborator

this one you can drop IMO, since it shouldn't matter if a reasonable random number generator was chosen

Contributor Author

yeah makes sense, I think previous two reasons are enough to not claim complete reproducibility.

    /// - the BPE-dropout authors discard all multi-byte tokens for each word separately, while this implementation does not split the "sentence" into words first
    ///   and hence may include a previously discarded token later in the byte stream. At the sentence level, though, we don't expect this to make much difference.
Comment on lines +572 to +573
Collaborator

These are at least two/three distinct points in here

  1. word splitting
  2. set of possible merge operations
  3. choosing from the set with drop-out

Contributor Author

I think 1 and 3 were covered, I added 2.

    #[cfg(feature = "rand")]
    pub fn encode_minimal_dropout<R: rand::Rng>(
        &self,
        text: &[u8],
        dropout: f32,
        mut rng: R,
    ) -> Vec<u32> {
        assert!(0.0 <= dropout);
        assert!(dropout <= 1.0);

        let mut last_token: Vec<(u32, u32)> = Vec::with_capacity(text.len());
        let mut state = self.overlapping_searcher_rev.start_state();
        // The text is consumed in reverse, so positions refer to the reversed text.
        for (pos, c) in text.iter().rev().enumerate() {
            let (s, iter) = self.overlapping_searcher_rev.consume(state, pos + 1, *c);
            state = s;
            let mut best = (0, u32::MAX);
            for m in iter {
                // Randomly skip multi-byte tokens; one-byte tokens are always kept.
                if m.end() > m.start() + 1 && dropout >= rng.random() {
                    continue;
                }
                if m.start() == 0 {
Collaborator

Is there some paper explaining in more detail how the randomization is supposed to work?
It's not quite obvious what properties the implementation actually has.

Also, some documentation would be nice (as part of some readme and/or doc comment).

If this is a one-to-one implementation of some paper, then we can probably just link to that paper.

Contributor Author

The paper: https://arxiv.org/abs/1910.13267

We're interested in Algorithm 1 (page 3).

Improvements rationale can be seen on Figure 6.

I don't think it's a one-to-one implementation, since encode_minimal does not follow the original BPE, hence its modification won't follow BPE_dropout. But I still see it as a valuable addition (at least I'm planning to use it in my current project).

Contributor Author
@marinegor, Feb 25, 2026

...although I admit I don't really understand encode_minimal well enough to ensure I'm actually rejecting merges, and not doing something different (which I probably am).

Intuition is that dropout roughly equals number of rejected merges in the final encoding, e.g. dropout ~=1 would result in almost single-byte encoding. However, I don't see that with dropout=0.99:

 t_1    s_1     t_2    s_2
   1      b       1      b
   2     ab       2     ab
   0      a       0      a
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
   2     ab       2     ab
  -1      -       0      a   <----
   2     ab       1      b   <----
   1      b       1      b
e1=' b ab  a ab ab ab ab __ ab  b'
e2=' b ab  a ab ab ab ab  a  b  b'

where dictionary is a b ab and string is babaabababababb.

So I'd appreciate any directions if you have any :)

Collaborator

encode_minimal uses dynamic programming, i.e. it models the tokenization as a graph where every position between text bytes is a node and two nodes are connected when the text slice between those two nodes matches a token.
It then tries to find the shortest possible path from the beginning of the text till the end, i.e. it finds the shortest possible encoding.
For this, it processes the nodes from left to right and visits all edges to the left. Then it picks the edge which results in the shortest path. The length of the shortest path is stored as the second value, the edge (or rather token) is stored as the first value.
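
The shortest-path view described above can be sketched as follows, assuming a toy dictionary of byte slices and a naive suffix check instead of the crate's Aho-Corasick searcher. `encode_minimal_toy` is a hypothetical name for this illustration and is not the crate's `encode_minimal`.

```rust
/// Toy shortest-path tokenization: nodes are byte positions, edges are tokens.
/// Returns indices into `dict`, or `None` if the text cannot be encoded.
fn encode_minimal_toy(dict: &[&[u8]], text: &[u8]) -> Option<Vec<usize>> {
    // last[end] = (token ending at `end`, number of tokens needed for text[..end]).
    let mut last: Vec<Option<(usize, usize)>> = vec![None; text.len() + 1];
    last[0] = Some((usize::MAX, 0)); // the empty prefix needs zero tokens
    for end in 1..=text.len() {
        // Visit every edge that arrives at `end` from the left.
        for (idx, tok) in dict.iter().enumerate() {
            if tok.is_empty() || !text[..end].ends_with(tok) {
                continue;
            }
            if let Some((_, before)) = last[end - tok.len()] {
                if last[end].map_or(true, |(_, cur)| before + 1 < cur) {
                    last[end] = Some((idx, before + 1));
                }
            }
        }
    }
    // Follow the stored edges from the end back to the start, then reverse.
    let mut out = Vec::new();
    let mut pos = text.len();
    while pos > 0 {
        let (idx, _) = last[pos]?;
        out.push(idx);
        pos -= dict[idx].len();
    }
    out.reverse();
    Some(out)
}
```

On the toy case from earlier in this thread (dictionary `a b ab`, string `babaabababababb`), this sketch yields nine tokens: `b ab a ab ab ab ab ab b`.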

Note: this is very different from how BPE works and cannot produce the same output as the algorithm in the paper.

The only implementation in this crate which follows the "standard" BPE algorithm is encode_into_bitfield, since it uses the "standard" heap approach. But instead of storing some complicated doubly linked list or whatever, it uses a compact bitfield to encode the start and end positions of tokens which makes this implementation probably the fastest standard one.
But it is still slow compared to the other algorithms in this crate. But those operate VERY differently and again I'm not sure if it's possible to emulate the exact probability distribution suggested in the paper.

The problem with the algorithm in the paper is that it is VERY slow.
And the dropout implementation I found here: https://github.com/VProv/BPE-Dropout/blob/master/bpe.py#L98 is just as bad (or maybe even worse)?

So, maybe it is good enough to pick a different randomization process which follows the idea of the paper in spirit?
I don't have the time right now to do this research myself though... Happy to review any proposals though.
It would also make sense IMO to think about some procedure which plots somehow the "quality" of the BPE based on the dropout value and compare that graph with the original algorithm.
I mean you probably want some kind of proof that this actually has the desired properties.
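
For comparison, a naive merge-based dropout can be sketched as below. This is only an illustration in the spirit of the paper's idea (each applicable merge is skipped with probability `p` in every round); whether it matches the paper's Algorithm 1 in every detail is an assumption. `bpe_dropout_naive` and the tiny `XorShift` generator are hypothetical helpers, the latter standing in for a real RNG so the sketch needs no external crate. Rescanning all pairs in every round is what makes this approach slow.

```rust
/// Tiny xorshift PRNG so the sketch is dependency-free (NOT for real use).
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Returns a float in [0, 1) built from the top 24 bits of the state.
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// `merges[r] = (left, right)`; a lower index `r` means a higher merge priority.
fn bpe_dropout_naive(
    merges: &[(&[u8], &[u8])],
    text: &[u8],
    p: f32,
    rng: &mut XorShift,
) -> Vec<Vec<u8>> {
    // Start from single bytes and merge one adjacent pair per round.
    let mut parts: Vec<Vec<u8>> = text.iter().map(|&b| vec![b]).collect();
    loop {
        let mut best: Option<(usize, usize)> = None; // (rank, position)
        for i in 0..parts.len().saturating_sub(1) {
            let rank = merges
                .iter()
                .position(|(l, r)| *l == &parts[i][..] && *r == &parts[i + 1][..]);
            if let Some(rank) = rank {
                // Drop this candidate merge with probability `p`.
                if rng.next_f32() < p {
                    continue;
                }
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        // Stop once no candidate survived this round.
        let Some((_, i)) = best else { break };
        let right = parts.remove(i + 1);
        parts[i].extend(right);
    }
    parts
}
```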

Contributor Author
@marinegor, Feb 26, 2026

@aneubeck thanks for the explanation, that's actually very helpful. I guess that the only thing that matters is just being able to drop some merges before actually building tokenization.

Could you have a look at the updated approach? I've changed the approach that I had before (which I think was very wrong), and instead now consider "best" tokens if they are not in "forbidden_tokens", which have been constructed prior to tokenization. My only worry is the single-byte tokens -- I'm not sure how they're handled, and I wouldn't like to discard them from the allowed tokens, but I'm not sure how to handle that properly. I'm talking about this line:

...
& (!(forbidden_tokens_set.contains(&m.value())) | ((m.end() - m.start()) == 1))
...

I'm not sure if the second condition should be present or not, basically.
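
One way to think about the second condition is sketched below, assuming the forbidden set is sampled once before tokenization. `sample_allowed` and `XorShift` are hypothetical helpers for illustration only (the generator is a stand-in for a real RNG). Exempting one-byte tokens means every byte of the input keeps at least one usable token, so the text stays encodable however many longer tokens are forbidden.

```rust
/// Tiny xorshift PRNG so the sketch is dependency-free (NOT for real use).
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Returns a float in [0, 1) built from the top 24 bits of the state.
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Decide once, before tokenizing, which dictionary entries may be used.
/// `allowed[i]` is true if token `i` survives; one-byte tokens always survive.
fn sample_allowed(dict: &[&[u8]], p: f32, rng: &mut XorShift) -> Vec<bool> {
    dict.iter()
        .map(|tok| tok.len() == 1 || rng.next_f32() >= p)
        .collect()
}
```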

Collaborator

Thanks for the changes!
I made some more improvements to your code here
c215210.
Can you copy them into your branch?
(I changed the way the rng is passed in, since this should make it easier to use IMO).

There was a little bug with how you treated tokens which started at the beginning of the text (you didn't filter larger tokens out there...).
I updated the test to detect this.

I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!
Thereby I also found a way to speed up things by another 20-30% by going through the text in reverse order and using the reverse aho corasick lookup table. This way one can avoid the final reversing of the token output which improves throughput further.

It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).
And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

On my Macbook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.
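
The edge-dropping idea described in this comment can be sketched on a toy dictionary as follows. This is a minimal illustration under stated assumptions: `encode_minimal_dropout_toy` and `XorShift` are hypothetical names, the generator is a stand-in for a real RNG, matching uses a naive prefix check, and the drop test is written as `next_f32() < p`, which differs slightly from the comparison in the PR.

```rust
/// Tiny xorshift PRNG so the sketch is dependency-free (NOT for real use).
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    /// Returns a float in [0, 1) built from the top 24 bits of the state.
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 40) as f32 / (1u32 << 24) as f32
    }
}

/// Toy minimal encoding with per-edge dropout; returns indices into `dict`.
fn encode_minimal_dropout_toy(
    dict: &[&[u8]],
    text: &[u8],
    p: f32,
    rng: &mut XorShift,
) -> Option<Vec<usize>> {
    let n = text.len();
    // best[i] = (token starting at byte i, number of tokens needed for text[i..]).
    let mut best: Vec<Option<(usize, usize)>> = vec![None; n + 1];
    best[n] = Some((usize::MAX, 0));
    for i in (0..n).rev() {
        for (idx, tok) in dict.iter().enumerate() {
            if tok.is_empty() || !text[i..].starts_with(tok) {
                continue;
            }
            // Drop this edge with probability `p`, but never a one-byte edge,
            // so that the graph stays connected.
            if tok.len() > 1 && rng.next_f32() < p {
                continue;
            }
            if let Some((_, rest)) = best[i + tok.len()] {
                if best[i].map_or(true, |(_, cur)| rest + 1 < cur) {
                    best[i] = Some((idx, rest + 1));
                }
            }
        }
    }
    // Table was built back to front, so this walk emits tokens in order.
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < n {
        let (idx, _) = best[pos]?;
        out.push(idx);
        pos += dict[idx].len();
    }
    Some(out)
}
```

In this sketch, `p = 0.0` reproduces the minimal encoding and `p = 1.0` falls back to one token per byte. Because the decision is made per edge, a token dropped at one position can still be used at a later one.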

Contributor Author

thanks for the changes as well!

> Can you copy them into your branch?
> (I changed the way the rng is passed in, since this should make it easier to use IMO).

Done, and thanks. I have practically no experience with rngs, so I appreciate it here.

> I also got rid of the pretty expensive lookup tables which you were computing. Those would slow down the processing drastically!

I'm thinking it might be technically different from the paper: from how I'm reading their algorithm, it's impossible to get a dropped token in the tokenization once it has been dropped, while in your implementation it may appear later in the text. But in all fairness, they also split by words first, which at the sentence level makes things the same as your implementation.

> It would be nice if you could extend the comment of this function describing in more detail what it does (i.e. we uniformly drop edges from the graph I described above, but always keep the one-byte tokens such that the graph stays connected).

Will do!

> And we need some benchmark test + some update of the README.md mentioning this new feature and its performance... It would be a bonus to measure the performance of other dropout implementations and mention them as well (just to show what difference it makes).

I'll spend some time playing around with a toy example (with a b ab dictionary), and then update the README / tests with that.

> On my Macbook I measured about 30 million input characters/sec with dropout and 40 million/sec with the "standard" minimal_encoding implementation.

that's pretty cool :)

Contributor Author

@aneubeck I've added some explanation and updated README slightly

I'm running benchmarks now -- I guess it's simply cargo criterion and cd scripts && ./copy-results, right?

Also, I'm running them on m4 -- should I update the description in README accordingly, or would you prefer to run it on your machine?

Collaborator

Will try to review the changes tomorrow.
I'm on-call this week, so things are more busy than usual...

                    best = (m.value(), 1);
                    break;
                } else if last_token[m.start() - 1].1 + 1 < best.1 {
                    best = (m.value(), last_token[m.start() - 1].1 + 1);
                }
            }
            last_token.push(best);
        }
        let mut encoded = Vec::with_capacity(last_token.last().map(|l| l.1 as usize).unwrap_or(0));
        let mut pos = text.len();
        while pos > 0 {
            let token = last_token[pos - 1].0;
            encoded.push(token);
            pos -= self.token_len(token);
        }
        encoded
    }
}
3 changes: 3 additions & 0 deletions crates/bpe/tests/Cargo.toml
@@ -8,3 +8,6 @@ bpe-openai = { path = "../../bpe-openai" }
itertools = "0.14"
rand = "0.9"
tiktoken-rs = "0.9"

[dev-dependencies]
rand_chacha = { version = "0.9" }
Collaborator

we don't need this dependency anymore I think

Contributor Author

removed, thanks!

43 changes: 43 additions & 0 deletions crates/bpe/tests/src/lib.rs
@@ -1,5 +1,7 @@
#[cfg(test)]
mod tests {
    use std::time;

    use itertools::Itertools;
    use rand::{rng, Rng};
    use tiktoken_rs::cl100k_base_singleton;
@@ -141,4 +143,45 @@ mod tests {
            assert_eq!(enc.token_count(), bpe.count(&input[i..]));
        }
    }

    #[test]
    fn test_bpe_dropout() {
        use rand::rngs::StdRng;
        use rand::SeedableRng;

        fn get_rng(seed: u64) -> StdRng {
            // Expand the u64 seed to 32 bytes
            let mut seed_bytes = [0u8; 32];
            seed_bytes[..8].copy_from_slice(&seed.to_le_bytes());
            StdRng::from_seed(seed_bytes)
        }

        let bpe = &cl100k_base().bpe;
        for bytes in [10000, 20000] {
            for _ in 0..8 {
                let input = create_test_bytes(bpe, bytes);
                let encoded = bpe.encode_minimal(&input);
                let encoded_d_min = bpe.encode_minimal_dropout(&input, 0.2, get_rng(0));
                let encoded_d_max = bpe.encode_minimal_dropout(&input, 0.9, get_rng(1));
                let encoded_d_1_0 = bpe.encode_minimal_dropout(&input, 1.0, get_rng(2));
                let decoded = bpe.decode_tokens(&encoded);
                let decoded_min = bpe.decode_tokens(&encoded_d_min);
                let decoded_max = bpe.decode_tokens(&encoded_d_max);
                let decoded_max_again = bpe.decode_tokens(&encoded_d_1_0);
                println!("Input length: {}, Encoded length: {}, Encoded with dropout length: {}-{}, max {}",
                    input.len(), encoded.len(), encoded_d_min.len(), encoded_d_max.len(), encoded_d_1_0.len());
                assert_eq!(input, decoded);
                assert_eq!(input, decoded_min);
                assert_eq!(input, decoded_max);
                assert_eq!(input, decoded_max_again);
                assert_eq!(input.len(), encoded_d_1_0.len());
                assert!(encoded_d_min.len() >= encoded.len());
                assert!(encoded_d_max.len() > encoded.len());

                assert_ne!(encoded, encoded_d_min);
                assert_ne!(encoded, encoded_d_max);
                assert_ne!(encoded_d_max, encoded_d_1_0);
Collaborator

replace these assertions with explicit numbers, i.e. something like

assert_eq!(encoded.len(), 2000);
assert_eq!(encoded_d_0_2.len(), 3000);
assert_eq!(encoded_d_0_9.len(), 9000);
assert_eq!(encoded_d_1_0.len(), 10000);

            }
        }
    }
}