Unigram lattice walks Unicode scalars (#352, Bug 3)#356
Merged
pcuenca merged 4 commits intoMay 16, 2026
Conversation
…face#352, Bug 3) `UnigramTokenizer.tokenize(text:)` and `TokenLattice` indexed the input by Swift `Character` (extended grapheme clusters). SentencePiece Unigram vocabularies are scalar-indexed, so an input grapheme that spans multiple scalars never gets exposed as its constituent scalars to the trie walk and the vocab lookup. For example `"1️⃣"` is one grapheme but three scalars (digit `1`, VS-16, combining keycap U+20E3); HF Python emits `▁1 <unk> </s>` for it, while Swift previously returned `▁ <unk> </s>` — the digit was silently dropped because the entire keycap grapheme occupied a single lattice slot that didn't match any vocab key. Switch the iteration unit to `Unicode.Scalar`: - `UnigramTokenizer.trie` is now `Trie<Unicode.Scalar>` and is fed each vocab entry's `unicodeScalars` view. - `tokenize(text:)` walks `Array(text.unicodeScalars)` and reconstructs token strings from `String.UnicodeScalarView` slices. - `TokenLattice.chars` is `[Unicode.Scalar]`. The convenience `init(sentence:)` and the `piece(_:)` reconstruction follow the same convention. Lattice offsets and lengths are now in scalar units; the Viterbi algorithm is unchanged (it only sees integer positions). Adds a regression test using google-t5/t5-small over the keycap emoji. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pcuenca
reviewed
May 15, 2026
pcuenca
left a comment
Member
There was a problem hiding this comment.
Thanks, it looks great!
I toned down comments a bit; I think it's enough to keep the one in TokenLattice, and the tests. Let me know if you think it's important to preserve others.
Comment on lines
+17
to
+26
| /// `Unicode.Scalar` view of the input String for performance. | ||
| /// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct | ||
| /// access through the array does not pay the cost of | ||
| /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences). | ||
| private let chars: [Character] | ||
| /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for | ||
| /// sequences). Scalars rather than `Character` so the lattice positions | ||
| /// align 1:1 with the per-scalar vocab lookup used by SentencePiece | ||
| /// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji | ||
| /// keycaps, combining-mark scripts). | ||
| /// Reference: https://github.com/huggingface/swift-transformers/issues/352 | ||
| private let chars: [Unicode.Scalar] |
Member
There was a problem hiding this comment.
Suggested change
| /// `Unicode.Scalar` view of the input String for performance. | |
| /// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct | |
| /// access through the array does not pay the cost of | |
| /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences). | |
| private let chars: [Character] | |
| /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for | |
| /// sequences). Scalars rather than `Character` so the lattice positions | |
| /// align 1:1 with the per-scalar vocab lookup used by SentencePiece | |
| /// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji | |
| /// keycaps, combining-mark scripts). | |
| /// Reference: https://github.com/huggingface/swift-transformers/issues/352 | |
| private let chars: [Unicode.Scalar] | |
| /// `Unicode.Scalar` view of the input String. Lattice offsets and lengths are in | |
| /// scalar units so multi-scalar grapheme clusters are addressable at the same | |
| /// granularity the SentencePiece vocab uses. | |
| private let chars: [Unicode.Scalar] |
Comment on lines
+111
to
+113
| // SentencePiece vocabularies are scalar-indexed; iterating the token | ||
| // strings by `Unicode.Scalar` keeps the trie keys aligned with the | ||
| // scalar-level traversal used in `tokenize(text:)` below. |
Member
There was a problem hiding this comment.
Suggested change
| // SentencePiece vocabularies are scalar-indexed; iterating the token | |
| // strings by `Unicode.Scalar` keeps the trie keys aligned with the | |
| // scalar-level traversal used in `tokenize(text:)` below. |
Comment on lines
+138
to
+143
| // Walk the input by `Unicode.Scalar` rather than by grapheme cluster. | ||
| // SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three | ||
| // separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating | ||
| // `Character` would silently swallow scalars that belong to combined | ||
| // graphemes — including the digit at the start of an emoji keycap. | ||
| // Reference: https://github.com/huggingface/swift-transformers/issues/352 |
Member
There was a problem hiding this comment.
Suggested change
| // Walk the input by `Unicode.Scalar` rather than by grapheme cluster. | |
| // SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three | |
| // separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating | |
| // `Character` would silently swallow scalars that belong to combined | |
| // graphemes — including the digit at the start of an emoji keycap. | |
| // Reference: https://github.com/huggingface/swift-transformers/issues/352 |
Per @pcuenca's review on huggingface#356: - Replace the multi-line TokenLattice.chars docstring with the shorter three-line version @pcuenca suggested. - Remove the in-body comment blocks at the trie-build site in `UnigramTokenizer.init` and at the top of `UnigramTokenizer.tokenize(text:)`. The remaining TokenLattice docstring plus the regression test in TokenizerTests carry the explanation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apocryphx
added a commit
to apocryphx/swift-transformers
that referenced
this pull request
May 16, 2026
…on to huggingface#356 Verified locally on a combined branch (huggingface#354 ⊕ huggingface#355 ⊕ huggingface#356 ⊕ this PR): all three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16 segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run fired cleanup hints for both, alongside the emoji-keycap-and-flags hint. Root cause is the same as the keycap case: a vocab-relevant scalar (TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that the old `Character`-based Unigram lattice never decomposed. Moving the iteration unit to `Unicode.Scalar` exposes it. `expectedDivergences` count drops from 10 pending-investigation entries to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pcuenca
approved these changes
May 16, 2026
pcuenca
reviewed
May 16, 2026
apocryphx
added a commit
to apocryphx/swift-transformers
that referenced
this pull request
May 16, 2026
…on to huggingface#356 Verified locally on a combined branch (huggingface#354 ⊕ huggingface#355 ⊕ huggingface#356 ⊕ this PR): all three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16 segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run fired cleanup hints for both, alongside the emoji-keycap-and-flags hint. Root cause is the same as the keycap case: a vocab-relevant scalar (TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that the old `Character`-based Unigram lattice never decomposed. Moving the iteration unit to `Unicode.Scalar` exposes it. `expectedDivergences` count drops from 10 pending-investigation entries to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apocryphx
added a commit
to apocryphx/swift-transformers
that referenced
this pull request
May 16, 2026
…e#355 / huggingface#356 merged) The three bug-fix PRs landed in main, simplifying the table from 35 entries down to 8. The struct loses its `fixedBy` field — all surviving entries are the two new bug clusters this corpus surfaces and that have no PR yet, so the table is now just `{modelId, inputId, note}`. The cleanup-hint message prints `note` instead of `fixedBy`. What remains is two clusters under huggingface#352 worth filing as follow-up issues: - 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs collapsing to single `▁` tokens instead of producing a multi-space vocab entry (e.g. `▁▁▁▁` id 268). - 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge ordering than HF Python on Thai byte sequences. The same corpus runs in the companion Obj-C port (https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and hits 7/7 byte-identity on these inputs — so both clusters look like upstream-only bugs and the Obj-C source is a usable reference for the follow-up fixes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses Bug 3 of #352.
UnigramTokenizer.tokenize(text:)andTokenLatticeindexed the input by SwiftCharacter(extended grapheme clusters). SentencePiece Unigram vocabularies are scalar-indexed, so an input grapheme that spans multiple scalars never gets exposed as its constituent scalars to the trie walk and the vocab lookup.Example:
"1️⃣"is one grapheme but three scalars (digit1U+0031, VS-16 U+FE0F, combining keycap U+20E3). HF Python emits▁1 <unk> </s>for it. swift-transformers previously returned▁ <unk> </s>— the digit was silently dropped because the entire keycap grapheme occupied a single lattice slot that didn't match any vocab key.Switch the iteration unit to
Unicode.Scalar:UnigramTokenizer.trieis nowTrie<Unicode.Scalar>and is fed each vocab entry'sunicodeScalarsview.tokenize(text:)walksArray(text.unicodeScalars)and reconstructs token strings fromString.UnicodeScalarViewslices.TokenLattice.charsis[Unicode.Scalar]. The convenienceinit(sentence:)and thepiece(_:)reconstruction follow the same convention. Lattice offsets and lengths are now in scalar units; the Viterbi algorithm itself is unchanged (it only sees integer positions).This matches the iteration-unit shape established by #353 for WordPiece and #355 for BPE.
Test plan
swift test --filter TokenizerTests— all 46 tests pass (no regressions).t5UnigramKeycapEmoji()usesgoogle-t5/t5-smallover"1️⃣". Pre-fix:▁ <unk> </s>(digit lost); post-fix:▁1 <unk> </s>([209, 2, 1]), matching HF Python transformers==4.57.1 byte-for-byte.🤖 Generated with Claude Code