BPE merge by Unicode scalar, not grapheme cluster (#352, Bug 4)#355

Open

apocryphx wants to merge 1 commit into huggingface:main from apocryphx:fix/bpe-scalar-iteration

Conversation

@apocryphx

Addresses Bug 4 of #352.

`BPETokenizer.bpe(token:)` decomposed the input into initial BPE symbols via `Array(token).map { String($0) }`, which iterates Swift `Character` (extended grapheme clusters). Non-spacing combining marks that the BPE vocab and merge table treat as standalone scalars — e.g. Thai vowel mark U+0E31, Devanagari halant U+094D, the variation selector + combining keycap behind emoji keycaps — therefore fused with their base character into a single symbol, preventing the merge loop from ever considering the combining-mark scalar as its own atom. Llama-7B's `byte_fallback: true` then byte-encoded the unmatched fused symbol, even though both halves were direct vocab entries.
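
The decomposition difference is easy to reproduce in isolation. A minimal standalone sketch (not code from this PR), using the same Thai sample as the test plan:

```swift
let token = "สวัส"  // ส, ว, U+0E31 (combining mark), ส

// Grapheme-cluster iteration: ว and the non-spacing mark U+0E31
// fuse into the single Character "วั", leaving only 3 symbols.
let byCharacter = Array(token).map { String($0) }
// ["ส", "วั", "ส"]

// Unicode-scalar iteration keeps the combining mark as its own atom,
// matching how the BPE vocab and merge table index symbols.
let byScalar = token.unicodeScalars.map { String($0) }
// ["ส", "ว", "\u{E31}", "ส"]
```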

There were actually two grapheme-cluster traps in the BPE path:

  1. The initial symbol decomposition at the top of `bpe(token:)` — switched to `token.unicodeScalars.map { String($0) }`, plus the early-return guard for trivial inputs counts scalars rather than `Character`s.

  2. The downstream re-split of the BPE result. `bpe(token:)` returned its pieces joined by ASCII space, and the caller re-split with `bpe(token: text).split(separator: " ").map { String($0) }`. But Swift's `split` operates on grapheme clusters: if a piece begins with a non-spacing mark, the preceding ASCII space fuses with the mark into a single grapheme and `split` silently swallows the boundary. The merged substring (including the literal space) then byte-fallbacks. To remove the round-trip entirely, `bpe(token:)` now returns `[String]` directly.
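
The second trap can likewise be shown standalone (a sketch, assuming two pieces analogous to the PR's Thai example):

```swift
// Two BPE pieces where the second begins with a non-spacing mark.
let pieces = ["ว", "\u{E31}"]

// Join with an ASCII space, as the old bpe(token:) return value did.
let joined = pieces.joined(separator: " ")

// Character-based split: the space and U+0E31 fuse into the single
// grapheme " ั", so no Character equals " " and the boundary vanishes.
let roundTripped = joined.split(separator: " ").map { String($0) }
// roundTripped == ["ว ั"] — one piece, with the literal space inside it
```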

The benchmark test in Tests/Benchmarks/BPETokenizerBenchmarkTests.swift calls bpe(token:) via _ = model.bpe(token: encoded), so the return-type change does not affect it.

Test plan

  • `swift test --filter TokenizerTests` — all 46 tests pass (no regressions).

  • New test `llama7bCombiningMarks()` uses huggyllama/llama-7b over Thai "สวัส". Pre-fix: byte-fallback on `ว` and `ั`; post-fix: 4 direct vocab matches plus the leading `▁`, matching HF Python transformers==4.57.1 byte-for-byte.

🤖 Generated with Claude Code


@pcuenca pcuenca left a comment


Thank you! Suggesting a couple of minor nits.

Comment on lines +246 to +252

```swift
// Iterate Unicode scalars rather than grapheme clusters. Combining marks
// (e.g. Thai vowel U+0E31, Devanagari halant U+094D, the variation selector
// and combining keycap U+FE0F + U+20E3) form a single Swift `Character`
// with their base, but the BPE vocab and merge table are scalar-indexed —
// grouping them prevents merges and forces spurious byte-fallback even
// when both scalars are direct vocab entries.
// Reference: https://github.com/huggingface/swift-transformers/issues/352
```

Suggested change

```diff
-// Iterate Unicode scalars rather than grapheme clusters. Combining marks
-// (e.g. Thai vowel U+0E31, Devanagari halant U+094D, the variation selector
-// and combining keycap U+FE0F + U+20E3) form a single Swift `Character`
-// with their base, but the BPE vocab and merge table are scalar-indexed —
-// grouping them prevents merges and forces spurious byte-fallback even
-// when both scalars are direct vocab entries.
 // Reference: https://github.com/huggingface/swift-transformers/issues/352
```

Quoted context for the next comment:

```swift
let initialSymbols = token.unicodeScalars.map { String($0) }
if initialSymbols.count <= 1 {
    return [token]
```

Suggested change

```diff
-    return [token]
+    return symbols.isEmpty ? [] : [token]
```

Technically, if the input is `""` I think we should return an empty array of pieces rather than an array containing one empty piece. Both will result in the same thing, but the former skips the attempt to tokenize the empty piece.

Using `symbols` instead of `initialSymbols` as per the related comment.
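
A sketch of how the guard reads with the suggested change applied (hypothetical surrounding function; the merge loop is elided):

```swift
func bpePieces(for token: String) -> [String] {
    // One initial symbol per Unicode scalar, per this PR's fix.
    let symbols = token.unicodeScalars.map { String($0) }
    // Suggested guard: "" yields no pieces at all rather than [""],
    // skipping a pointless tokenization attempt on an empty piece.
    if symbols.count <= 1 {
        return symbols.isEmpty ? [] : [token]
    }
    // ... merge loop over `symbols` would run here ...
    return symbols
}
```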

The hunk under discussion:

```diff
-var symbols = Array(token).map { String($0) }
+// Initial symbols: one entry per Unicode scalar of `token`. We keep these
+// as a doubly linked list embedded in parallel arrays of indices.
+var symbols = initialSymbols
```

`initialSymbols` is only used for the `isEmpty` test and immediately reassigned. I'd go for `var symbols = token.unicodeScalars.map { String($0) }` above and remove `initialSymbols`.
