Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json#7625
Add public SentencePieceTokenizer factory methods for Unigram from vocab list and tokenizer.json#7625Copilot wants to merge 15 commits into
Conversation
…erJson APIs Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
ericstj
left a comment
There was a problem hiding this comment.
Thanks for adding this — the in-memory Create(vocab, ...) overload is clean and the ID-preservation (JSON vocab index = token id) is the right call. I implemented this same JSON-only Unigram capability recently against real Hugging Face models, and hit two correctness issues that this PR's single test model happens to mask. Details inline; summary here.
Bugs (untested by the current suite)
-
BOS/EOS positional fallback corrupts real pieces.
FindSpecialTokenId(pieces, "<s>", 1)/("</s>", 2)fall back to positions 1/2 when the vocab has no piece literally named<s>/</s>. Many HF Unigram tokenizers don't use those names — e.g.minishlab/potion-multilingual-128M(bge-m3 family) hasunk_id=1, vocab[0]="[PAD]", [1]="[UNK]", [2]=",". ThereeosId→2 marks","asControland drops it from the Viterbi trie (it can never be emitted), andbosId→1 collides withunkIdand clobbers the unknown entry. This is structural (independent ofaddBos/addEos). -
Normalizer steps beyond
Precompiledare silently dropped.ExtractPrecompiledCharsMapextracts only the charsmap and discards sibling normalizers. Real Unigram models often have a richer chain (potion/bge-m3:Sequence[Precompiled, Replace(punctuation spacing), Replace("\\s+"->" "), Strip]), whichSentencePieceNormalizercannot reproduce — soCreateFromTokenizerJsonsilently yields different tokens than HF. Since the charsmap must run before thoseReplacesteps, they can't just be reordered into SP; at minimum this should throw on unrecognized normalizer types rather than silently ignore them.
Why the test stays green
Paraphrase-multilingual-MiniLM-L12-v2 names its specials (<s>=0, </s>=2) so the fallback never fires, and its normalizer is a single Precompiled, so the dropped-sibling path is never hit. Recommend adding fixtures that (a) place specials at non-conventional positions / omit <s>/</s>, and (b) use a Sequence normalizer with Replace/Strip, asserting against HF reference ids.
Minor
removeExtraWhitespacesis hard-codedtruein both factories rather than derived from the JSON.added_tokensfrom the JSON aren't auto-wired; correctness depends on the caller passingspecialTokens. Worth documenting or reading them.
(Posting as comments only — not an approval or change request.)
|
@copilot please address feedback |
…end_scheme handling Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
|
@copilot address feedback |
…tion, and add tests Co-authored-by: ericstj <8918108+ericstj@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #7625 +/- ##
==========================================
+ Coverage 69.59% 69.74% +0.14%
==========================================
Files 1484 1485 +1
Lines 273606 275443 +1837
Branches 27949 28161 +212
==========================================
+ Hits 190410 192096 +1686
- Misses 75832 75916 +84
- Partials 7364 7431 +67
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR extends SentencePieceTokenizer to support Hugging Face JSON-only Unigram tokenizers by adding new public factory APIs that can construct a Unigram tokenizer from either an in-memory vocab list or a tokenizer.json stream, avoiding the current requirement for a SentencePiece .model protobuf.
Changes:
- Add
SentencePieceTokenizer.Create(IEnumerable<(string Piece, float Score)> vocab, ...)for constructing a Unigram tokenizer directly from a vocab list. - Add
SentencePieceTokenizer.CreateFromTokenizerJson(Stream tokenizerJsonStream, ...)for parsing HFtokenizer.json(Unigram) including vocab,unk_id, precompiled charsmap, and Metaspace settings. - Add internal constructors/refactoring to build a
SentencePieceUnigramModelfrom vocab pieces and config values, plus new tests covering these creation paths.
Show a summary per file
| File | Description |
|---|---|
| test/Microsoft.ML.Tokenizers.Tests/UnigramTests.cs | Adds unit tests for vocab-based and tokenizer.json-based Unigram construction and behavior parity checks. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceUnigramModel.cs | Adds Unigram model constructors that build vocab/trie from (piece, score) inputs and detect special tokens by name. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceTokenizer.cs | Adds public factories for vocab and tokenizer.json, plus JSON parsing helpers for normalizer/pre-tokenizer extraction. |
| src/Microsoft.ML.Tokenizers/Model/SentencePieceBaseModel.cs | Adds a new base-model constructor taking explicit config/token IDs instead of ModelProto. |
Copilot's findings
- Files reviewed: 4/4 changed files
- Comments generated: 4
|
The new JSON parser never reads the : base(addBos, addEos,
...,
addDummyPrefix, escapeWhiteSpaces, treatWhitespaceAsSuffix, byteFallback: false,
precompiledCharsmap, removeExtraWhitespaces, specialTokens)Hugging Face Unigram tokenizers serialize this as a top-level "model": {
"type": "Unigram",
"unk_id": 0,
"byte_fallback": true,
"vocab": [ ... ]
}For any model that enables byte fallback, hardcoding Suggested fix: read the flag in bool byteFallback = modelElement.TryGetProperty("byte_fallback", out JsonElement bf) && bf.GetBoolean();defaulting to |
…loader Harden the tokenizer.json parser so malformed inputs fail with a diagnostic InvalidDataException instead of an InvalidOperationException/ArgumentException escaping from CreateFromTokenizerJson: - added_tokens: require string content and numeric id. - post_processor TemplateProcessing: require a string SpecialToken.id and a numeric special_tokens ids[0]. - RobertaProcessing/BertProcessing cls/sep: check the [token, id] value kinds before reading. - Metaspace pre_tokenizer: require a boolean add_prefix_space and a string replacement. - Replace normalizer: require an object pattern with a string String/Regex, and wrap invalid Regex construction as InvalidDataException. - Read all normalizer/pre_tokenizer/post_processor 'type' fields through a shared GetStringOrNull helper so a non-string type is handled in a controlled way rather than throwing. Adds tests covering the new validation paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nsistency ResolveTemplateTokenId now verifies that a post_processor special_tokens id maps back to the referenced token (via added tokens or the vocabulary), mirroring the RobertaProcessing/BertProcessing affix validation, so an inconsistent tokenizer.json fails with InvalidDataException instead of emitting an id whose decoded token differs from the template's token name. Adds an inconsistent-template-special-token test. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
CreateFromTokenizerJson returns SentencePieceTokenizer, so the explicit casts in the new tests were unnecessary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Require a normalizer entry to be a JSON object at the start of Build, so a non-object entry in a Sequence array fails with InvalidDataException instead of InvalidOperationException. - Validate the Precompiled 'precompiled_charsmap' is a string before reading it. - Validate the Prepend 'prepend' is a string before reading it. Adds tests for the non-object Sequence entry, non-string Precompiled charsmap, and non-string Prepend cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Overall the changes look good. A few gaps, since a full
It would be good to address 1 and 2 in this PR. None block merging for the standard HF Unigram/Metaspace scenario the tests cover. Suggest opening a tracking issue to support the remaining parts (e.g. |
tarekgh
left a comment
There was a problem hiding this comment.
Left minor comment, LGTM otherwise.
SentencePieceTokenizeronly exposedCreate(Stream)requiring a SentencePiece protobuf (.model), making it impossible to load Hugging Face JSON-only Unigram tokenizers that have no.modelfile.New public APIs
From in-memory vocab:
From
tokenizer.json:CreateFromTokenizerJsonreadsmodel.vocab,model.unk_id, extractsprecompiled_charsmapfrom aPrecompiledorSequencenormalizer, and reads Metaspace pre-tokenizer settings (add_prefix_space,replacement,prepend_scheme). It validatesmodel.type == "Unigram".Internal changes
SentencePieceBaseModel: new constructor taking individual config parameters instead ofModelProtoSentencePieceUnigramModel: new constructors building vocab fromIReadOnlyList<(string, float)>; BOS/EOS/PAD IDs auto-detected by piece name (<s>,</s>,<pad>) with SentencePiece-conventional positional fallbacksNote on token IDs
HF
tokenizer.jsontypically uses a different special-token ordering than the SentencePiece protobuf (e.g.<s>=0, <pad>=1, </s>=2, <unk>=3vs.<unk>=0, <s>=1, </s>=2). Piece strings produced are identical; numeric IDs will differ by the vocab offset introduced by the extra special tokens.