Unigram lattice walks Unicode scalars (#352, Bug 3) by apocryphx · Pull Request #356 · huggingface/swift-transformers

apocryphx · 2026-05-15T03:08:41Z

Addresses Bug 3 of #352.

UnigramTokenizer.tokenize(text:) and TokenLattice indexed the input by Swift Character (extended grapheme clusters). SentencePiece Unigram vocabularies are scalar-indexed, so an input grapheme that spans multiple scalars never gets exposed as its constituent scalars to the trie walk and the vocab lookup.

Example: "1️⃣" is one grapheme but three scalars (digit 1 U+0031, VS-16 U+FE0F, combining keycap U+20E3). HF Python emits ▁1 <unk> </s> for it. swift-transformers previously returned ▁ <unk> </s> — the digit was silently dropped because the entire keycap grapheme occupied a single lattice slot that didn't match any vocab key.

Switch the iteration unit to Unicode.Scalar:

UnigramTokenizer.trie is now Trie<Unicode.Scalar> and is fed each vocab entry's unicodeScalars view.
tokenize(text:) walks Array(text.unicodeScalars) and reconstructs token strings from String.UnicodeScalarView slices.
TokenLattice.chars is [Unicode.Scalar]. The convenience init(sentence:) and the piece(_:) reconstruction follow the same convention. Lattice offsets and lengths are now in scalar units; the Viterbi algorithm itself is unchanged (it only sees integer positions).

This matches the iteration-unit shape established by #353 for WordPiece and #355 for BPE.

Test plan

swift test --filter TokenizerTests — all 46 tests pass (no regressions).
New test t5UnigramKeycapEmoji() uses google-t5/t5-small over "1️⃣". Pre-fix: ▁ <unk> </s> (digit lost); post-fix: ▁1 <unk> </s> ([209, 2, 1]), matching HF Python transformers==4.57.1 byte-for-byte.

🤖 Generated with Claude Code

…face#352, Bug 3) `UnigramTokenizer.tokenize(text:)` and `TokenLattice` indexed the input by Swift `Character` (extended grapheme clusters). SentencePiece Unigram vocabularies are scalar-indexed, so an input grapheme that spans multiple scalars never gets exposed as its constituent scalars to the trie walk and the vocab lookup. For example `"1️⃣"` is one grapheme but three scalars (digit `1`, VS-16, combining keycap U+20E3); HF Python emits `▁1 <unk> </s>` for it, while Swift previously returned `▁ <unk> </s>` — the digit was silently dropped because the entire keycap grapheme occupied a single lattice slot that didn't match any vocab key. Switch the iteration unit to `Unicode.Scalar`: - `UnigramTokenizer.trie` is now `Trie<Unicode.Scalar>` and is fed each vocab entry's `unicodeScalars` view. - `tokenize(text:)` walks `Array(text.unicodeScalars)` and reconstructs token strings from `String.UnicodeScalarView` slices. - `TokenLattice.chars` is `[Unicode.Scalar]`. The convenience `init(sentence:)` and the `piece(_:)` reconstruction follow the same convention. Lattice offsets and lengths are now in scalar units; the Viterbi algorithm is unchanged (it only sees integer positions). Adds a regression test using google-t5/t5-small over the keycap emoji. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pcuenca

Thanks, it looks great!

I toned down comments a bit; I think it's enough to keep the one in TokenLattice, and the tests. Let me know if you think it's important to preserve others.

pcuenca · 2026-05-15T14:22:36Z

+    /// `Unicode.Scalar` view of the input String for performance.
+    /// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct
    /// access through the array does not pay the cost of
-    /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences).
-    private let chars: [Character]
+    /// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for
+    /// sequences). Scalars rather than `Character` so the lattice positions
+    /// align 1:1 with the per-scalar vocab lookup used by SentencePiece
+    /// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji
+    /// keycaps, combining-mark scripts).
+    /// Reference: https://github.com/huggingface/swift-transformers/issues/352
+    private let chars: [Unicode.Scalar]


Suggested change

/// `Unicode.Scalar` view of the input String for performance.

/// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct

/// access through the array does not pay the cost of

/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences).

private let chars: [Character]

/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for

/// sequences). Scalars rather than `Character` so the lattice positions

/// align 1:1 with the per-scalar vocab lookup used by SentencePiece

/// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji

/// keycaps, combining-mark scripts).

/// Reference: https://github.com/huggingface/swift-transformers/issues/352

private let chars: [Unicode.Scalar]

/// `Unicode.Scalar` view of the input String. Lattice offsets and lengths are in

/// scalar units so multi-scalar grapheme clusters are addressable at the same

/// granularity the SentencePiece vocab uses.

private let chars: [Unicode.Scalar]

pcuenca · 2026-05-15T14:22:54Z

+        // SentencePiece vocabularies are scalar-indexed; iterating the token
+        // strings by `Unicode.Scalar` keeps the trie keys aligned with the
+        // scalar-level traversal used in `tokenize(text:)` below.


Suggested change

// SentencePiece vocabularies are scalar-indexed; iterating the token

// strings by `Unicode.Scalar` keeps the trie keys aligned with the

// scalar-level traversal used in `tokenize(text:)` below.

pcuenca · 2026-05-15T14:23:53Z

+        // Walk the input by `Unicode.Scalar` rather than by grapheme cluster.
+        // SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three
+        // separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating
+        // `Character` would silently swallow scalars that belong to combined
+        // graphemes — including the digit at the start of an emoji keycap.
+        // Reference: https://github.com/huggingface/swift-transformers/issues/352


Suggested change

// Walk the input by `Unicode.Scalar` rather than by grapheme cluster.

// SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three

// separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating

// `Character` would silently swallow scalars that belong to combined

// graphemes — including the digit at the start of an emoji keycap.

// Reference: https://github.com/huggingface/swift-transformers/issues/352

@pcuenca

Per @pcuenca's review on huggingface#356: - Replace the multi-line TokenLattice.chars docstring with the shorter three-line version @pcuenca suggested. - Remove the in-body comment blocks at the trie-build site in `UnigramTokenizer.init` and at the top of `UnigramTokenizer.tokenize(text:)`. The remaining TokenLattice docstring plus the regression test in TokenizerTests carry the explanation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on to huggingface#356 Verified locally on a combined branch (huggingface#354 ⊕ huggingface#355 ⊕ huggingface#356 ⊕ this PR): all three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16 segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run fired cleanup hints for both, alongside the emoji-keycap-and-flags hint. Root cause is the same as the keycap case: a vocab-relevant scalar (TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that the old `Character`-based Unigram lattice never decomposed. Moving the iteration unit to `Unicode.Scalar` exposes it. `expectedDivergences` count drops from 10 pending-investigation entries to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e#355 / huggingface#356 merged) The three bug-fix PRs landed in main, simplifying the table from 35 entries down to 8. The struct loses its `fixedBy` field — all surviving entries are the two new bug clusters this corpus surfaces and that have no PR yet, so the table is now just `{modelId, inputId, note}`. The cleanup-hint message prints `note` instead of `fixedBy`. What remains is two clusters under huggingface#352 worth filing as follow-up issues: - 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs collapsing to single `▁` tokens instead of producing a multi-space vocab entry (e.g. `▁▁▁▁` id 268). - 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge ordering than HF Python on Thai byte sequences. The same corpus runs in the companion Obj-C port (https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and hits 7/7 byte-identity on these inputs — so both clusters look like upstream-only bugs and the Obj-C source is a usable reference for the follow-up fixes. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>

apocryphx mentioned this pull request May 15, 2026

Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback #352

Open

john-rocky mentioned this pull request May 15, 2026

Add multilingual conformance tests for byte-identical Python parity #357

Closed

pcuenca reviewed May 15, 2026

View reviewed changes

apocryphx mentioned this pull request May 16, 2026

Add MultilingualConformanceTests for byte-identical Python parity (#352) #360

Open

pcuenca approved these changes May 16, 2026

View reviewed changes

Merge branch 'main' into fix/unigram-scalar-iteration

063ed2d

pcuenca reviewed May 16, 2026

View reviewed changes

Comment thread Tests/TokenizersTests/TokenizerTests.swift Outdated

I hate linters

a5868d6

pcuenca merged commit 2fa33e1 into huggingface:main May 16, 2026
4 checks passed

jkrukowski mentioned this pull request May 17, 2026

Bump swift-transformers to 1.3.3 jkrukowski/swift-embeddings#27

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Unigram lattice walks Unicode scalars (#352, Bug 3)#356

Unigram lattice walks Unicode scalars (#352, Bug 3)#356
pcuenca merged 4 commits into
huggingface:mainfrom
apocryphx:fix/unigram-scalar-iteration

apocryphx commented May 15, 2026

Uh oh!

pcuenca left a comment

Uh oh!

pcuenca May 15, 2026

Uh oh!

pcuenca May 15, 2026

Uh oh!

pcuenca May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	// SentencePiece vocabularies are scalar-indexed; iterating the token
	// strings by `Unicode.Scalar` keeps the trie keys aligned with the
	// scalar-level traversal used in `tokenize(text:)` below.

Uh oh!

Conversation

apocryphx commented May 15, 2026

Test plan

Uh oh!

pcuenca left a comment

Choose a reason for hiding this comment

Uh oh!

pcuenca May 15, 2026

Choose a reason for hiding this comment

Uh oh!

pcuenca May 15, 2026

Choose a reason for hiding this comment

Uh oh!

pcuenca May 15, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants