Skip to content

Add voyage3 model#62

Merged
aneubeck merged 9 commits into
mainfrom
aneubeck/voyage
May 7, 2025
Merged

Add voyage3 model#62
aneubeck merged 9 commits into
mainfrom
aneubeck/voyage

Conversation

@aneubeck
Copy link
Copy Markdown
Collaborator

@aneubeck aneubeck commented May 5, 2025

voyage3 requires unicode normalization for which I introduced basic support.
I also fixed the broken equivalence tests in the benchmarks folder.
It is important to note that huggingface uses some byte to char conversion, so that it can run the BPE algorithm on unicode strings. In the test, the pretokenizer were not properly disabled.

We could consider adding features for the different models.

Copilot AI review requested due to automatic review settings May 5, 2025 12:08
@aneubeck aneubeck requested a review from a team as a code owner May 5, 2025 12:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds basic Unicode normalization support and introduces the new voyage3 model, along with test and benchmark updates to accommodate optional Tiktoken tokenizers and fixed equivalence tests.

  • Introduce Normalizable/NormalizedString and wire NFC normalization into the Tokenizer API
  • Add voyage3_base BPE model (build script, library, benchmarks) and make Tiktoken optional
  • Fix equivalence tests by disabling Hugging Face pretokenizer and adding a byte-char mapping

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
crates/bpe/benchmarks/performance.rs Guard Tiktoken benchmarks behind Option check
crates/bpe/benchmarks/lib.rs Make Tiktoken optional and add voyage3 entry
crates/bpe/benchmarks/equivalence.rs Implement char_to_byte, fix equivalence tests
crates/bpe/benchmarks/Cargo.toml Enable rand and tiktoken features for bpe
crates/bpe-openai/src/normalizer.rs New NormalizedString type and Normalizable trait
crates/bpe-openai/src/lib.rs Wire in NFC, add voyage3_base, update Tokenizer
crates/bpe-openai/build.rs Serialize voyage3_base in build script
crates/bpe-openai/Cargo.toml Add unicode-normalization dependency
Comments suppressed due to low confidence (3)

crates/bpe-openai/build.rs:25

  • The build script directive is incorrect. It should be cargo:rerun-if-changed=build.rs (single colon) to inform Cargo properly.
println!("cargo::rerun-if-changed=build.rs");

crates/bpe/benchmarks/performance.rs:166

  • [nitpick] Shadowing the outer tiktoken variable may reduce clarity. Consider renaming the inner binding (e.g. if let Some(tok) ) to avoid confusion.
if let Some(tiktoken) = tiktoken {

crates/bpe-openai/src/lib.rs:266

  • The new voyage3_base model isn’t covered by this test. Consider adding assertions for voyage3_base().count_till_limit(...) to ensure it behaves correctly.
fn test_count_till_limit() {

Comment thread crates/bpe/benchmarks/equivalence.rs Outdated
@CleanCut
Copy link
Copy Markdown
Contributor

CleanCut commented May 6, 2025

I pushed a commit which fixed a cargo fmt difference in Rust 1.86.

Copy link
Copy Markdown
Contributor

@jorendorff jorendorff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread crates/bpe-openai/src/normalizer.rs Outdated
Comment thread crates/bpe-openai/src/normalizer.rs Outdated

use unicode_normalization::UnicodeNormalization;

/// Type which represents a normalized string.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Type which represents a normalized string.
/// Type which represents an NFC normalized string.

@jorendorff
Copy link
Copy Markdown
Contributor

@aneubeck This is a breaking change, so it requires a version bump, right?

aneubeck and others added 3 commits May 7, 2025 08:42
Co-authored-by: Jason Orendorff <jorendorff@github.com>
Co-authored-by: Jason Orendorff <jorendorff@github.com>
@aneubeck aneubeck enabled auto-merge May 7, 2025 09:47
@aneubeck aneubeck merged commit bcb4204 into main May 7, 2025
7 checks passed
@aneubeck aneubeck deleted the aneubeck/voyage branch May 7, 2025 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants