BPE merge by Unicode scalar, not grapheme cluster (#352, Bug 4)#355

Open

apocryphx wants to merge 1 commit into huggingface:main from apocryphx:fix/bpe-scalar-iteration

Conversation

@apocryphx

Addresses Bug 4 of #352.

`BPETokenizer.bpe(token:)` decomposed the input into initial BPE symbols via `Array(token).map { String($0) }`, which iterates Swift `Character` (extended grapheme clusters). Non-spacing combining marks that the BPE vocab and merge table treat as standalone scalars — e.g. Thai vowel mark U+0E31, Devanagari halant U+094D, the variation selector + combining keycap behind emoji keycaps — therefore fused with their base character into a single symbol, preventing the merge loop from ever considering the combining-mark scalar as its own atom. Llama-7B's `byte_fallback: true` then byte-encoded the unmatched fused symbol, even though both halves were direct vocab entries.
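
The decomposition difference is easy to reproduce in isolation. A minimal standalone sketch (not code from this PR), using the same Thai sample as the test plan:

```swift
let token = "สวัส"  // ส, ว, U+0E31 (combining mark), ส

// Grapheme-cluster iteration: ว and the non-spacing mark U+0E31
// fuse into the single Character "วั", leaving only 3 symbols.
let byCharacter = Array(token).map { String($0) }
// ["ส", "วั", "ส"]

// Unicode-scalar iteration keeps the combining mark as its own atom,
// matching how the BPE vocab and merge table index symbols.
let byScalar = token.unicodeScalars.map { String($0) }
// ["ส", "ว", "\u{E31}", "ส"]
```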

There were actually two grapheme-cluster traps in the BPE path:

  1. The initial symbol decomposition at the top of `bpe(token:)` — switched to `token.unicodeScalars.map { String($0) }`, plus the early-return guard for trivial inputs counts scalars rather than `Character`s.

  2. The downstream re-split of the BPE result. `bpe(token:)` returned its pieces joined by ASCII space, and the caller re-split with `bpe(token: text).split(separator: " ").map { String($0) }`. But Swift's `split` operates on grapheme clusters: if a piece begins with a non-spacing mark, the preceding ASCII space fuses with the mark into a single grapheme and `split` silently swallows the boundary. The merged substring (including the literal space) then byte-fallbacks. To remove the round-trip entirely, `bpe(token:)` now returns `[String]` directly.
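
The second trap can likewise be shown standalone (a sketch, assuming two pieces analogous to the PR's Thai example):

```swift
// Two BPE pieces where the second begins with a non-spacing mark.
let pieces = ["ว", "\u{E31}"]

// Join with an ASCII space, as the old bpe(token:) return value did.
let joined = pieces.joined(separator: " ")

// Character-based split: the space and U+0E31 fuse into the single
// grapheme " ั", so no Character equals " " and the boundary vanishes.
let roundTripped = joined.split(separator: " ").map { String($0) }
// roundTripped == ["ว ั"] — one piece, with the literal space inside it
```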

The benchmark test in Tests/Benchmarks/BPETokenizerBenchmarkTests.swift calls bpe(token:) via _ = model.bpe(token: encoded), so the return-type change does not affect it.

Test plan

  • `swift test --filter TokenizerTests` — all 46 tests pass (no regressions).

  • New test `llama7bCombiningMarks()` uses huggyllama/llama-7b over Thai "สวัส". Pre-fix: byte-fallback on `ว` and `ั`; post-fix: 4 direct vocab matches plus the leading `▁`, matching HF Python transformers==4.57.1 byte-for-byte.

🤖 Generated with Claude Code


@pcuenca pcuenca left a comment


Thank you! Suggesting a couple of minor nits.

Comment on lines +246 to +252

```swift
// Iterate Unicode scalars rather than grapheme clusters. Combining marks
// (e.g. Thai vowel U+0E31, Devanagari halant U+094D, the variation selector
// and combining keycap U+FE0F + U+20E3) form a single Swift `Character`
// with their base, but the BPE vocab and merge table are scalar-indexed —
// grouping them prevents merges and forces spurious byte-fallback even
// when both scalars are direct vocab entries.
// Reference: https://github.com/huggingface/swift-transformers/issues/352
```

Suggested change

```diff
-// Iterate Unicode scalars rather than grapheme clusters. Combining marks
-// (e.g. Thai vowel U+0E31, Devanagari halant U+094D, the variation selector
-// and combining keycap U+FE0F + U+20E3) form a single Swift `Character`
-// with their base, but the BPE vocab and merge table are scalar-indexed —
-// grouping them prevents merges and forces spurious byte-fallback even
-// when both scalars are direct vocab entries.
 // Reference: https://github.com/huggingface/swift-transformers/issues/352
```

Quoted context for the next comment:

```swift
let initialSymbols = token.unicodeScalars.map { String($0) }
if initialSymbols.count <= 1 {
    return [token]
```

Suggested change

```diff
-    return [token]
+    return symbols.isEmpty ? [] : [token]
```

Technically, if the input is `""` I think we should return an empty array of pieces rather than an array containing one empty piece. Both will result in the same thing, but the former skips the attempt to tokenize the empty piece.

Using `symbols` instead of `initialSymbols` as per the related comment.
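
A sketch of how the guard reads with the suggested change applied (hypothetical surrounding function; the merge loop is elided):

```swift
func bpePieces(for token: String) -> [String] {
    // One initial symbol per Unicode scalar, per this PR's fix.
    let symbols = token.unicodeScalars.map { String($0) }
    // Suggested guard: "" yields no pieces at all rather than [""],
    // skipping a pointless tokenization attempt on an empty piece.
    if symbols.count <= 1 {
        return symbols.isEmpty ? [] : [token]
    }
    // ... merge loop over `symbols` would run here ...
    return symbols
}
```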

The hunk under discussion:

```diff
-var symbols = Array(token).map { String($0) }
+// Initial symbols: one entry per Unicode scalar of `token`. We keep these
+// as a doubly linked list embedded in parallel arrays of indices.
+var symbols = initialSymbols
```

`initialSymbols` is only used for the `isEmpty` test and immediately reassigned. I'd go for `var symbols = token.unicodeScalars.map { String($0) }` above and remove `initialSymbols`.
