Skip to content

Unigram lattice walks Unicode scalars (#352, Bug 3)#356

Merged
pcuenca merged 4 commits into
huggingface:mainfrom
apocryphx:fix/unigram-scalar-iteration
May 16, 2026
Merged

Unigram lattice walks Unicode scalars (#352, Bug 3)#356
pcuenca merged 4 commits into
huggingface:mainfrom
apocryphx:fix/unigram-scalar-iteration

Conversation

@apocryphx

Copy link
Copy Markdown
Contributor

Addresses Bug 3 of #352.

UnigramTokenizer.tokenize(text:) and TokenLattice indexed the input by Swift Character (extended grapheme clusters). SentencePiece Unigram vocabularies are scalar-indexed, so an input grapheme that spans multiple scalars never gets exposed as its constituent scalars to the trie walk and the vocab lookup.

Example: "1️⃣" is one grapheme but three scalars (digit 1 U+0031, VS-16 U+FE0F, combining keycap U+20E3). HF Python emits ▁1 <unk> </s> for it. swift-transformers previously returned ▁ <unk> </s> — the digit was silently dropped because the entire keycap grapheme occupied a single lattice slot that didn't match any vocab key.

Switch the iteration unit to Unicode.Scalar:

  • UnigramTokenizer.trie is now Trie<Unicode.Scalar> and is fed each vocab entry's unicodeScalars view.
  • tokenize(text:) walks Array(text.unicodeScalars) and reconstructs token strings from String.UnicodeScalarView slices.
  • TokenLattice.chars is [Unicode.Scalar]. The convenience init(sentence:) and the piece(_:) reconstruction follow the same convention. Lattice offsets and lengths are now in scalar units; the Viterbi algorithm itself is unchanged (it only sees integer positions).

This matches the iteration-unit shape established by #353 for WordPiece and #355 for BPE.

Test plan

  • swift test --filter TokenizerTests — all 46 tests pass (no regressions).
  • New test t5UnigramKeycapEmoji() uses google-t5/t5-small over "1️⃣". Pre-fix: ▁ <unk> </s> (digit lost); post-fix: ▁1 <unk> </s> ([209, 2, 1]), matching HF Python transformers==4.57.1 byte-for-byte.

🤖 Generated with Claude Code

…face#352, Bug 3)

`UnigramTokenizer.tokenize(text:)` and `TokenLattice` indexed the input by Swift
`Character` (extended grapheme clusters). SentencePiece Unigram vocabularies
are scalar-indexed, so an input grapheme that spans multiple scalars never gets
exposed as its constituent scalars to the trie walk and the vocab lookup. For
example `"1️⃣"` is one grapheme but three scalars (digit `1`, VS-16, combining
keycap U+20E3); HF Python emits `▁1 <unk> </s>` for it, while Swift previously
returned `▁ <unk> </s>` — the digit was silently dropped because the entire
keycap grapheme occupied a single lattice slot that didn't match any vocab key.

Switch the iteration unit to `Unicode.Scalar`:
  - `UnigramTokenizer.trie` is now `Trie<Unicode.Scalar>` and is fed each vocab
    entry's `unicodeScalars` view.
  - `tokenize(text:)` walks `Array(text.unicodeScalars)` and reconstructs token
    strings from `String.UnicodeScalarView` slices.
  - `TokenLattice.chars` is `[Unicode.Scalar]`. The convenience
    `init(sentence:)` and the `piece(_:)` reconstruction follow the same
    convention. Lattice offsets and lengths are now in scalar units; the
    Viterbi algorithm is unchanged (it only sees integer positions).

Adds a regression test using google-t5/t5-small over the keycap emoji.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@pcuenca pcuenca left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it looks great!

I toned down comments a bit; I think it's enough to keep the one in TokenLattice, and the tests. Let me know if you think it's important to preserve others.

Comment thread Sources/Tokenizers/TokenLattice.swift Outdated
Comment on lines +17 to +26
/// `Unicode.Scalar` view of the input String for performance.
/// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct
/// access through the array does not pay the cost of
/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences).
private let chars: [Character]
/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for
/// sequences). Scalars rather than `Character` so the lattice positions
/// align 1:1 with the per-scalar vocab lookup used by SentencePiece
/// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji
/// keycaps, combining-mark scripts).
/// Reference: https://github.com/huggingface/swift-transformers/issues/352
private let chars: [Unicode.Scalar]

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// `Unicode.Scalar` view of the input String for performance.
/// Lattice offsets and lengths are in `Unicode.Scalar` units, so direct
/// access through the array does not pay the cost of
/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for sequences).
private let chars: [Character]
/// `String.index(_:offsetBy:)` traversal (O(N) per token, quadratic for
/// sequences). Scalars rather than `Character` so the lattice positions
/// align 1:1 with the per-scalar vocab lookup used by SentencePiece
/// Unigram for inputs whose graphemes span multiple scalars (e.g. emoji
/// keycaps, combining-mark scripts).
/// Reference: https://github.com/huggingface/swift-transformers/issues/352
private let chars: [Unicode.Scalar]
/// `Unicode.Scalar` view of the input String. Lattice offsets and lengths are in
/// scalar units so multi-scalar grapheme clusters are addressable at the same
/// granularity the SentencePiece vocab uses.
private let chars: [Unicode.Scalar]

Comment on lines +111 to +113
// SentencePiece vocabularies are scalar-indexed; iterating the token
// strings by `Unicode.Scalar` keeps the trie keys aligned with the
// scalar-level traversal used in `tokenize(text:)` below.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// SentencePiece vocabularies are scalar-indexed; iterating the token
// strings by `Unicode.Scalar` keeps the trie keys aligned with the
// scalar-level traversal used in `tokenize(text:)` below.

Comment on lines +138 to +143
// Walk the input by `Unicode.Scalar` rather than by grapheme cluster.
// SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three
// separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating
// `Character` would silently swallow scalars that belong to combined
// graphemes — including the digit at the start of an emoji keycap.
// Reference: https://github.com/huggingface/swift-transformers/issues/352

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Walk the input by `Unicode.Scalar` rather than by grapheme cluster.
// SentencePiece vocabularies are scalar-indexed (a keycap "1️⃣" is three
// separate vocab lookup targets: U+0031, U+FE0F, U+20E3), so iterating
// `Character` would silently swallow scalars that belong to combined
// graphemes — including the digit at the start of an emoji keycap.
// Reference: https://github.com/huggingface/swift-transformers/issues/352

Per @pcuenca's review on huggingface#356:

- Replace the multi-line TokenLattice.chars docstring with the shorter
  three-line version @pcuenca suggested.
- Remove the in-body comment blocks at the trie-build site in
  `UnigramTokenizer.init` and at the top of `UnigramTokenizer.tokenize(text:)`.
  The remaining TokenLattice docstring plus the regression test in
  TokenizerTests carry the explanation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apocryphx added a commit to apocryphx/swift-transformers that referenced this pull request May 16, 2026
…on to huggingface#356

Verified locally on a combined branch (huggingface#354huggingface#355huggingface#356 ⊕ this PR): all
three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two
were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16
segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run
fired cleanup hints for both, alongside the emoji-keycap-and-flags hint.

Root cause is the same as the keycap case: a vocab-relevant scalar
(TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that
the old `Character`-based Unigram lattice never decomposed. Moving the
iteration unit to `Unicode.Scalar` exposes it.

`expectedDivergences` count drops from 10 pending-investigation entries
to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread Tests/TokenizersTests/TokenizerTests.swift Outdated
@pcuenca pcuenca merged commit 2fa33e1 into huggingface:main May 16, 2026
4 checks passed
apocryphx added a commit to apocryphx/swift-transformers that referenced this pull request May 16, 2026
…on to huggingface#356

Verified locally on a combined branch (huggingface#354huggingface#355huggingface#356 ⊕ this PR): all
three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two
were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16
segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run
fired cleanup hints for both, alongside the emoji-keycap-and-flags hint.

Root cause is the same as the keycap case: a vocab-relevant scalar
(TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that
the old `Character`-based Unigram lattice never decomposed. Moving the
iteration unit to `Unicode.Scalar` exposes it.

`expectedDivergences` count drops from 10 pending-investigation entries
to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
apocryphx added a commit to apocryphx/swift-transformers that referenced this pull request May 16, 2026
…e#355 / huggingface#356 merged)

The three bug-fix PRs landed in main, simplifying the table from 35 entries
down to 8. The struct loses its `fixedBy` field — all surviving entries are
the two new bug clusters this corpus surfaces and that have no PR yet, so
the table is now just `{modelId, inputId, note}`. The cleanup-hint message
prints `note` instead of `fixedBy`.

What remains is two clusters under huggingface#352 worth filing as follow-up issues:

  - 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs
    collapsing to single `▁` tokens instead of producing a multi-space
    vocab entry (e.g. `▁▁▁▁` id 268).
  - 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge
    ordering than HF Python on Thai byte sequences.

The same corpus runs in the companion Obj-C port
(https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and
hits 7/7 byte-identity on these inputs — so both clusters look like
upstream-only bugs and the Obj-C source is a usable reference for the
follow-up fixes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants