Add voyage3 model by aneubeck · Pull Request #62 · github/rust-gems

aneubeck · 2025-05-05T12:08:16Z

voyage3 requires Unicode normalization, for which I introduced basic support.
I also fixed the broken equivalence tests in the benchmarks folder.
Note that Hugging Face uses a byte-to-char conversion so that it can run the BPE algorithm on Unicode strings. In the tests, the pretokenizer was not properly disabled.
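
To illustrate the byte-to-char conversion mentioned above, here is a minimal sketch, assuming the conversion is the GPT-2 style byte-level mapping commonly used by Hugging Face byte-level BPE tokenizers. The function names (`byte_to_char_table`, `char_to_byte_table`, `encode_bytes`, `decode_chars`) are hypothetical and are not taken from this PR; the actual `char_to_byte` helper in `equivalence.rs` may differ.

```rust
use std::collections::HashMap;

/// Bytes that the GPT-2 style mapping keeps as the same code point.
/// (Assumption: this is the mapping the Hugging Face tokenizer uses.)
fn is_kept_as_is(b: u8) -> bool {
    matches!(b, b'!'..=b'~' | 0xA1..=0xAC | 0xAE..=0xFF)
}

/// Builds the byte -> char table. Bytes that are not kept as-is are
/// assigned the code points 256, 257, ... in increasing byte order.
fn byte_to_char_table() -> [char; 256] {
    let mut table = ['\0'; 256];
    let mut next = 256u32;
    for b in 0..=255u8 {
        table[b as usize] = if is_kept_as_is(b) {
            char::from(b)
        } else {
            let c = char::from_u32(next).expect("valid code point");
            next += 1;
            c
        };
    }
    table
}

/// Builds the inverse char -> byte table.
fn char_to_byte_table() -> HashMap<char, u8> {
    let mut map = HashMap::with_capacity(256);
    for (b, &c) in byte_to_char_table().iter().enumerate() {
        map.insert(c, b as u8);
    }
    map
}

/// Maps raw bytes to the printable string a byte-level tokenizer sees.
fn encode_bytes(bytes: &[u8]) -> String {
    let table = byte_to_char_table();
    bytes.iter().map(|&b| table[b as usize]).collect()
}

/// Maps such a string back to raw bytes; returns None if a char is
/// outside the 256-entry table.
fn decode_chars(s: &str) -> Option<Vec<u8>> {
    let table = char_to_byte_table();
    s.chars().map(|c| table.get(&c).copied()).collect()
}
```

Under this mapping a space becomes `Ġ` (U+0120), which is why vocabulary entries from such tokenizers have to be converted back to bytes before they can be compared with a byte-based BPE implementation.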

We could consider adding features for the different models.

Copilot

Pull Request Overview

This PR adds basic Unicode normalization support and introduces the new voyage3 model, along with test and benchmark updates to accommodate optional Tiktoken tokenizers and fixed equivalence tests.

Introduce Normalizable/NormalizedString and wire NFC normalization into the Tokenizer API
Add voyage3_base BPE model (build script, library, benchmarks) and make Tiktoken optional
Fix equivalence tests by disabling Hugging Face pretokenizer and adding a byte-char mapping
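
The `Normalizable`/`NormalizedString` design in the first item can be sketched as a newtype pattern: the tokenizer only accepts a wrapper type, so callers cannot pass unnormalized text by accident. This is a sketch under stated assumptions, not the PR's implementation. The method names and the `count_chars` consumer are hypothetical, and `toy_nfc` is a stand-in that composes only one sequence; the PR itself depends on the `unicode-normalization` crate for real NFC.

```rust
/// Stand-in for NFC normalization (assumption for illustration only):
/// composes just 'e' followed by U+0301 into U+00E9.
fn toy_nfc(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        if c == 'e' && chars.peek() == Some(&'\u{0301}') {
            chars.next(); // consume the combining acute accent
            out.push('\u{00E9}');
        } else {
            out.push(c);
        }
    }
    out
}

/// Wrapper that can only be built through normalization.
pub struct NormalizedString(String);

impl NormalizedString {
    pub fn as_str(&self) -> &str {
        &self.0
    }
}

/// Types that can be turned into a `NormalizedString`.
pub trait Normalizable {
    fn normalize(self) -> NormalizedString;
}

impl<'a> Normalizable for &'a str {
    fn normalize(self) -> NormalizedString {
        NormalizedString(toy_nfc(self))
    }
}

/// Hypothetical consumer: it takes the wrapper type, so the
/// normalization step is enforced by the type system.
pub fn count_chars(text: &NormalizedString) -> usize {
    text.as_str().chars().count()
}
```

With this shape, `count_chars(&"cafe\u{0301}".normalize())` sees the composed form, while passing a plain `&str` does not compile.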

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

File	Description
crates/bpe/benchmarks/performance.rs	Guard Tiktoken benchmarks behind `Option` check
crates/bpe/benchmarks/lib.rs	Make Tiktoken optional and add `voyage3` entry
crates/bpe/benchmarks/equivalence.rs	Implement `char_to_byte`, fix equivalence tests
crates/bpe/benchmarks/Cargo.toml	Enable `rand` and `tiktoken` features for bpe
crates/bpe-openai/src/normalizer.rs	New `NormalizedString` type and `Normalizable` trait
crates/bpe-openai/src/lib.rs	Wire in NFC, add `voyage3_base`, update `Tokenizer`
crates/bpe-openai/build.rs	Serialize `voyage3_base` in build script
crates/bpe-openai/Cargo.toml	Add `unicode-normalization` dependency

Comments suppressed due to low confidence (3)

crates/bpe-openai/build.rs:25

The build script directive is incorrect. It should be `cargo:rerun-if-changed=build.rs` (single colon) to inform Cargo properly.

println!("cargo::rerun-if-changed=build.rs");

crates/bpe/benchmarks/performance.rs:166

[nitpick] Shadowing the outer `tiktoken` variable may reduce clarity. Consider renaming the inner binding (e.g. `if let Some(tok)`) to avoid confusion.

if let Some(tiktoken) = tiktoken {
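
A minimal sketch of the rename suggested in this nitpick, using hypothetical names: `ReferenceTokenizer` and `reference_count` are stand-ins for illustration and do not appear in the PR.

```rust
/// Hypothetical stand-in for an optional reference tokenizer.
struct ReferenceTokenizer;

impl ReferenceTokenizer {
    /// Toy "tokenization": counts whitespace-separated words.
    fn count(&self, text: &str) -> usize {
        text.split_whitespace().count()
    }
}

/// Runs the reference tokenizer only when one is available.
fn reference_count(tiktoken: Option<&ReferenceTokenizer>, text: &str) -> Option<usize> {
    // Binding the inner value as `tok` keeps the outer `tiktoken`
    // (the Option) visibly distinct from the unwrapped tokenizer.
    if let Some(tok) = tiktoken {
        Some(tok.count(text))
    } else {
        None
    }
}
```

Shadowing with `if let Some(tiktoken) = tiktoken` is also valid Rust; the rename is purely a readability choice.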

crates/bpe-openai/src/lib.rs:266

The new `voyage3_base` model isn’t covered by this test. Consider adding assertions for `voyage3_base().count_till_limit(...)` to ensure it behaves correctly.

fn test_count_till_limit() {

CleanCut · 2025-05-06T17:44:49Z

I pushed a commit which fixed a cargo fmt difference in Rust 1.86.

jorendorff

👍

jorendorff · 2025-05-06T23:17:58Z

+
+use unicode_normalization::UnicodeNormalization;
+
+/// Type which represents a normalized string.


Suggested change

-/// Type which represents a normalized string.
+/// Type which represents an NFC normalized string.

jorendorff · 2025-05-06T23:21:29Z

@aneubeck This is a breaking change, so it requires a version bump, right?

Add voyage3 model

6566c1c

Copilot AI review requested due to automatic review settings May 5, 2025 12:08

aneubeck requested a review from a team as a code owner May 5, 2025 12:08

Copilot AI reviewed May 5, 2025

Comment thread crates/bpe/benchmarks/equivalence.rs Outdated

aneubeck and others added 3 commits May 5, 2025 14:12

Update lib.rs

2a0e272

Update equivalence.rs

c335e66

cargo +1.86.0 fmt --all

3a5835c

jorendorff approved these changes May 6, 2025

aneubeck and others added 3 commits May 7, 2025 08:42

Update crates/bpe-openai/src/normalizer.rs

0ebe6df

Co-authored-by: Jason Orendorff <jorendorff@github.com>

Update crates/bpe-openai/src/normalizer.rs

e982966

Co-authored-by: Jason Orendorff <jorendorff@github.com>

bump version

023257b

aneubeck enabled auto-merge May 7, 2025 09:47

aneubeck added 2 commits May 7, 2025 11:49

Update lib.rs

799d04e

Update equivalence.rs

a22f108

aneubeck merged commit bcb4204 into main May 7, 2025
7 checks passed

aneubeck deleted the aneubeck/voyage branch May 7, 2025 10:45

aneubeck merged 9 commits into main from aneubeck/voyage
