Add voyage3 model#62
Merged
Merged
Conversation
Contributor
There was a problem hiding this comment.
Pull Request Overview
This PR adds basic Unicode normalization support and introduces the new voyage3 model, along with test and benchmark updates to accommodate optional Tiktoken tokenizers and fixed equivalence tests.
- Introduce
Normalizable/NormalizedStringand wire NFC normalization into theTokenizerAPI - Add
voyage3_baseBPE model (build script, library, benchmarks) and make Tiktoken optional - Fix equivalence tests by disabling Hugging Face pretokenizer and adding a byte-char mapping
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| crates/bpe/benchmarks/performance.rs | Guard Tiktoken benchmarks behind Option check |
| crates/bpe/benchmarks/lib.rs | Make Tiktoken optional and add voyage3 entry |
| crates/bpe/benchmarks/equivalence.rs | Implement char_to_byte, fix equivalence tests |
| crates/bpe/benchmarks/Cargo.toml | Enable rand and tiktoken features for bpe |
| crates/bpe-openai/src/normalizer.rs | New NormalizedString type and Normalizable trait |
| crates/bpe-openai/src/lib.rs | Wire in NFC, add voyage3_base, update Tokenizer |
| crates/bpe-openai/build.rs | Serialize voyage3_base in build script |
| crates/bpe-openai/Cargo.toml | Add unicode-normalization dependency |
Comments suppressed due to low confidence (3)
crates/bpe-openai/build.rs:25
- The build script directive is incorrect. It should be
cargo:rerun-if-changed=build.rs(single colon) to inform Cargo properly.
println!("cargo::rerun-if-changed=build.rs");
crates/bpe/benchmarks/performance.rs:166
- [nitpick] Shadowing the outer
tiktokenvariable may reduce clarity. Consider renaming the inner binding (e.g.if let Some(tok)) to avoid confusion.
if let Some(tiktoken) = tiktoken {
crates/bpe-openai/src/lib.rs:266
- The new
voyage3_basemodel isn’t covered by this test. Consider adding assertions forvoyage3_base().count_till_limit(...)to ensure it behaves correctly.
fn test_count_till_limit() {
Contributor
|
I pushed a commit which fixed a |
jorendorff
approved these changes
May 6, 2025
|
|
||
| use unicode_normalization::UnicodeNormalization; | ||
|
|
||
| /// Type which represents a normalized string. |
Contributor
There was a problem hiding this comment.
Suggested change
| /// Type which represents a normalized string. | |
| /// Type which represents an NFC normalized string. |
Contributor
|
@aneubeck This is a breaking change, so it requires a version bump, right? |
Co-authored-by: Jason Orendorff <jorendorff@github.com>
Co-authored-by: Jason Orendorff <jorendorff@github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
voyage3 requires unicode normalization for which I introduced basic support.
I also fixed the broken equivalence tests in the benchmarks folder.
It is important to note that huggingface uses some byte to char conversion, so that it can run the BPE algorithm on unicode strings. In the test, the pretokenizer were not properly disabled.
We could consider adding features for the different models.