Add MultilingualConformanceTests for byte-identical Python parity (#352) by apocryphx · Pull Request #360 · huggingface/swift-transformers

apocryphx · 2026-05-16T00:12:13Z

Addresses #42 (improve tokenizer testing) and provides a place to land regression coverage for the bug catalogue in #352. Picks up where @john-rocky's closed #357 left off — see attribution at the bottom.

What

A new test target that compares Swift tokenizer.encode(text:) output against HuggingFace Python transformers==4.57.1 references across six tokenizer kernels and an 83-input corpus covering script boundaries the existing English-leaning fixtures don't reach: CJK simplified + traditional, Japanese with voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script stress lines, programming code, IPA, and the German Kompositum torture set.

How

Three pieces:

1. Tools/generate_tokenizer_baselines.py regenerates per-kernel baselines from the transformers version pinned in Tools/requirements.txt. It asserts is_fast so a slow-only model can't silently produce a non-comparable reference. Each baseline carries a metadata block with the transformers version, generation timestamp, and entry count.
2. Tests/TokenizersTests/Resources/MultilingualConformance/:
   - inputs.json: 83 entries, each {id, category, text}. Stable ids keep baselines aligned across regenerations; categories let test output cite the broken axis directly (japanese-voiced-kana, emoji-keycap, thai-combining-marks, devanagari, …).
   - baselines/<slug>_multilingual.json: one per kernel, holds input_ids, convert_ids_to_tokens strings, and the two decoded forms (decoded_with_special / decoded_skip_special). The decoded fields are intentionally produced now and left unused by the current Swift tests; they are forward-compatible material for a decoder-side parity test follow-up.
3. Tests/TokenizersTests/MultilingualConformanceTests.swift: three tests under @Suite("Multilingual Conformance"):
   - Corpus is well-formed: unique ids, non-empty fields.
   - Baselines cover the corpus exactly: parameterised per kernel, guards against drift between the corpus and any baseline file.
   - Byte-identical token ids vs HF Python: parameterised per kernel; on divergence it prints a windowed diff with token strings around the first mismatch and the corpus category, so the broken axis is visible in the failure message.
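
As a rough illustration of the first piece, the sketch below shows how a per-kernel baseline could be assembled. It is not the actual contents of Tools/generate_tokenizer_baselines.py: the function name `build_baseline` and the JSON key names (`tokens`, `generated_at`, `transformers_version`, `entry_count`, `entries`) are assumptions made for the example. The tokenizer is passed in, and is assumed to behave like a Hugging Face fast tokenizer (`is_fast`, `encode`, `convert_ids_to_tokens`, `decode`).

```python
from datetime import datetime, timezone


def build_baseline(tokenizer, entries, transformers_version):
    """Build one per-kernel baseline dict from corpus entries.

    Key names in the returned dict are hypothetical, chosen for this sketch.
    """
    # A slow-only tokenizer would yield a reference that is not comparable.
    assert tokenizer.is_fast, "baseline requires a fast tokenizer"

    results = []
    for entry in entries:
        ids = tokenizer.encode(entry["text"])
        results.append({
            "id": entry["id"],
            "input_ids": ids,
            "tokens": tokenizer.convert_ids_to_tokens(ids),
            "decoded_with_special": tokenizer.decode(ids, skip_special_tokens=False),
            "decoded_skip_special": tokenizer.decode(ids, skip_special_tokens=True),
        })

    return {
        "metadata": {
            "transformers_version": transformers_version,
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "entry_count": len(results),
        },
        "entries": results,
    }
```

With the real library, the tokenizer argument would presumably come from `transformers.AutoTokenizer.from_pretrained(...)` for each model in the kernel matrix, and the version string from `transformers.__version__`.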

Kernel matrix

| Kernel | Model | Why |
| --- | --- | --- |
| WordPiece | `BAAI/bge-small-en-v1.5` | Most used Apple-Silicon embedder |
| Unigram (SentencePiece) | `google-t5/t5-small` | Canonical Unigram + Metaspace |
| Byte-level BPE | `openai-community/gpt2` | Round-trip baseline |
| Byte-level BPE + RobertaProcessing | `FacebookAI/roberta-base` | Post-processor surface no other kernel hits |
| Byte-level BPE (modern) | `Qwen/Qwen2.5-0.5B` | Catches kernel- vs vocab-shape bugs |
| BPE + byte-fallback | `TinyLlama/TinyLlama-1.1B-Chat-v1.0` | Llama family, no HF auth gate |

TinyLlama is picked over huggyllama/llama-7b so the Llama family is covered without an auth-gating CI risk.

`expectedDivergences`

Some (model, input) pairs diverge from the Python reference today because a fix is in review. The cleanup-hint pattern lets the target land green while bug fixes are in flight:

- Unexpected divergence: a (model, input) pair that diverges but isn't in the table is a hard failure. This is the regression catch.
- Unexpected match: a listed pair that now matches Python emits a printed cleanup hint but doesn't fail the test, so a freshly merged improvement doesn't break CI on this file.
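
The four possible outcomes for a (model, input) pair can be sketched as a small decision function. This is an illustration of the rule described above, not the Swift test code; the function name, the outcome labels, and the ids used in the usage example are hypothetical.

```python
def classify(model_id, input_id, swift_ids, python_ids, expected_divergences):
    """Classify one (model, input) comparison.

    `expected_divergences` is a set of (model_id, input_id) pairs.
    Returns "pass", "expected-divergence", "cleanup-hint", or "fail".
    """
    listed = (model_id, input_id) in expected_divergences
    matches = swift_ids == python_ids

    if matches and not listed:
        return "pass"
    if not matches and listed:
        return "expected-divergence"  # known bug, fix in flight
    if matches and listed:
        return "cleanup-hint"         # printed only, the test still passes
    return "fail"                     # unexpected divergence: regression catch


# Hypothetical table entry and ids, for illustration only.
table = {("model-a", "input-1")}
print(classify("model-a", "input-1", [1, 2], [1, 2], table))  # cleanup-hint
print(classify("model-a", "input-2", [1, 2], [1, 3], table))  # fail
```

Only the last branch fails the test, which is what lets a table entry go stale without breaking CI.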

Entries are grouped by fixedBy PR number:

- 11 entries fixedBy: 354 (Mn-filter widening in BertNormalizer.stripAccents). Covers Japanese voiced kana, Devanagari halant, Arabic diacritics, emoji VS-16, and similar.
- 3 entries fixedBy: 356 (Unigram + TokenLattice scalar iteration). Covers T5 emoji-keycap, the ™️ (U+2122 + VS-16) cluster, and ZWJ-between-text in escape sequences.
- 13 entries fixedBy: 355 (BPE scalar iteration on TinyLlama). Covers Thai combining marks, Devanagari conjuncts, emoji ZWJ, emoji keycap, and several multi-script stress lines.
- 8 entries fixedBy: 0, pending investigation under #352 (Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback). These form two distinct new bug clusters that this corpus surfaces but none of the in-flight PRs address:
  - Metaspace leading-whitespace runs collapse to single ▁ tokens on SentencePiece BPE instead of producing the multi-space vocab entry (e.g. ▁▁▁▁ id 268 on TinyLlama). 5 entries on TinyLlama. Worth filing as a follow-up issue.
  - Qwen2.5 byte-level BPE merge-ordering on Thai. 3 entries: the kernel picks a different merge ordering than HF Python on Thai byte sequences. Byte-level encoding rules out the combining-mark trap from Bug 4, so this is its own surface.

These two clusters should land as separate follow-up issues. Adding entries to expectedDivergences rather than gating the PR on them keeps the regression-catch infrastructure shippable today.

Both directions of the mechanism were verified locally — removing an entry from the table causes a hard failure with the windowed diff, and adding a phantom entry for a passing input emits the cleanup hint.
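
A windowed diff of this kind could look roughly like the sketch below. The output layout, the function names, and the default window radius are assumptions for illustration; the real failure message produced by the Swift test may be formatted differently.

```python
def first_mismatch(a, b):
    """Index of the first differing position, or None if the sequences are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def windowed_diff(expected_ids, actual_ids, expected_tokens, category, radius=3):
    """Describe the neighbourhood of the first divergence, or return None."""
    i = first_mismatch(expected_ids, actual_ids)
    if i is None:
        return None
    lo = max(0, i - radius)
    hi = i + radius + 1
    return "\n".join([
        f"category: {category}",
        f"first mismatch at index {i}",
        f"expected ids:    {expected_ids[lo:hi]}",
        f"actual ids:      {actual_ids[lo:hi]}",
        f"expected tokens: {expected_tokens[lo:hi]}",
    ])
```

Reporting only a window around the first mismatch keeps the failure message short even for long inputs such as the programming-code entries.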

Verified against the combined fix branch

A combined branch (#354 ⊕ #355 ⊕ #356 ⊕ this PR) was run end-to-end to confirm Pedro's requested changes don't introduce regressions and that the cleanup hints fire where they should:

```
swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"
✔ Test run with 52 tests in 4 suites passed after 9.13s
0 failures
27 cleanup hints fired (11 BGE / 3 T5 / 13 TinyLlama)
```

All 27 cleanup hints correspond to the fixedBy: 354|355|356 entries — every entry the three fix PRs are supposed to clean up does in fact get cleaned up. The 8 fixedBy: 0 entries persisted (correctly — neither of the two new clusters is addressed by the in-flight PRs).
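
The arithmetic behind these counts can be checked directly from the per-bucket entry counts listed in the `expectedDivergences` section. The dictionary below is just a restatement of those counts, not data read from the test file.

```python
# Entry counts per fixedBy bucket, as listed in the PR description.
entries_by_fixed_by = {354: 11, 356: 3, 355: 13, 0: 8}

total = sum(entries_by_fixed_by.values())
# Every entry tied to a fix PR should fire one cleanup hint on the combined branch.
expected_hints = sum(n for pr, n in entries_by_fixed_by.items() if pr != 0)
# Entries with fixedBy: 0 have no fix yet, so they should persist.
remaining = entries_by_fixed_by[0]

print(total, expected_hints, remaining)  # prints "35 27 8"
```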

Verification

Local on macOS, Swift 6.3.2, against current main:

```
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed (unique ids, non-empty fields)" passed
✔ Test "Baselines cover the corpus exactly" with 6 test cases passed
✔ Test "Byte-identical token ids vs HF Python" with 6 test cases passed
✔ Test run with 3 tests in 1 suite passed after 2.4s
```

Full `swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"`: 48 tests in 4 suites pass on this branch alone (52 on the combined branch).

Regenerating baselines from a fresh `pip install -r Tools/requirements.txt` reproduces the bundled JSON against `transformers==4.57.1`.
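
Because each baseline carries a generation timestamp in its metadata block, a reproduction check presumably has to compare everything except that field. A minimal sketch of such a comparison follows; the key name `generated_at` and both function names are assumptions, and the actual check used for this PR is not shown in the thread.

```python
import copy


def comparable(baseline, volatile_keys=("generated_at",)):
    """Return a copy of a baseline dict with volatile metadata keys removed."""
    stripped = copy.deepcopy(baseline)
    for key in volatile_keys:
        stripped.get("metadata", {}).pop(key, None)
    return stripped


def same_baseline(bundled, regenerated):
    """True when two baselines agree on everything except volatile metadata."""
    return comparable(bundled) == comparable(regenerated)
```

Both arguments would be the parsed JSON of a bundled baseline file and of its freshly regenerated counterpart, for example loaded with `json.load`.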

Attribution

@john-rocky opened #357 a day before this PR with significant overlap. They closed it with explicit "no need to preserve attribution" — but the design ideas this PR borrows from them are worth recording:

The `expectedDivergences` cleanup-hint pattern, which makes the target land-able while bug fixes are in flight without losing regression-catch behaviour.
Decoded forms in baselines for forward-compatible decoder-parity work.
TinyLlama instead of huggyllama/llama-7b for auth-gate-free Llama coverage.

The corpus, regen-script split, and category-tagged input shape are ported from the companion Obj-C port at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, which was the conceptual harness that diagnosed the bug catalogue in #352.

🤖 Generated with Claude Code

john-rocky · 2026-05-16T00:56:33Z

Pulled the branch and ran the new suite locally — looks great. Thanks for picking this up, and for the attribution.

Env: Apple Silicon (arm64), macOS 26.0, Swift 6.2.1, branch `apocryphx/feat/multilingual-conformance-tests@2c0b5ca`.

```
swift build # Build complete!
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed (unique ids, non-empty fields)" passed after 0.004s
✔ Test "Baselines cover the corpus exactly" with 6 test cases passed after 0.014s
✔ Test "Byte-identical token ids vs HF Python" with 6 test cases passed after 2.531s
✔ Suite "Multilingual Conformance" passed after 2.531s
✔ Test run with 3 tests in 1 suite passed.

swift test --parallel --filter MultilingualConformance
✔ ... 2.474s

swift test # full package
✔ Test run with 158 tests in 21 suites passed after 15.337s
```

Corpus stats (just to surface them in the thread): 83 inputs across 30 categories, the biggest being `programming-code` (17), `multiscript-stress` (8), `german-compound` (6); plus all the script edges you called out (CJK both flavors, Hangul, Devanagari, Thai combining, Arabic / Hebrew RTL, BMP+astral emoji, ZWJ, IPA, etc.).
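
For readers who want to reproduce stats like these, the sketch below shows one way to tally categories and to mirror the first two tests (well-formedness and corpus/baseline coverage) over parsed `inputs.json` entries. The function names are hypothetical, and this is a Python approximation of checks that the PR implements in Swift.

```python
from collections import Counter


def corpus_problems(entries):
    """Return a list of well-formedness problems (empty list means OK)."""
    problems = []
    seen = set()
    for index, entry in enumerate(entries):
        for field in ("id", "category", "text"):
            if not entry.get(field):
                problems.append(f"entry {index}: empty or missing {field}")
        entry_id = entry.get("id")
        if entry_id in seen:
            problems.append(f"entry {index}: duplicate id {entry_id}")
        elif entry_id:
            seen.add(entry_id)
    return problems


def category_counts(entries):
    """Count corpus entries per category."""
    return Counter(entry["category"] for entry in entries)


def coverage_gaps(entries, baseline_ids):
    """Return (ids missing from the baseline, ids the baseline has but the corpus lacks)."""
    corpus_ids = {entry["id"] for entry in entries}
    baseline_ids = set(baseline_ids)
    return corpus_ids - baseline_ids, baseline_ids - corpus_ids
```

Calling `category_counts(entries).most_common(3)` on the real corpus would be expected to surface the three largest categories quoted above.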

The two T5 `expectedDivergences` re-classified in 2c0b5ca to point at #356 (Unigram-by-scalar) tracked cleanly — they're the only entries that exercise the `google-t5/t5-small` SentencePiece path on the multi-script stress lines, so it's a natural attribution.

No suggested changes from my side. Happy to verify regeneration locally too once #354/355/356 land if useful — running `Tools/generate_tokenizer_baselines.py` and confirming the new baselines reduce `expectedDivergences` would be a nice closing checkpoint for #352.

apocryphx · 2026-05-16T01:17:32Z

Thanks @john-rocky for running it through — useful coverage point on macOS 26.0 / Swift 6.2.1.

On the german-compound entries: small Kompositum easter egg — German speakers will find them humorous. They also exercise long-word handling, which is why they survived.

Yes please to the regeneration check once #354/#355/#356 land. Clean closing checkpoint for #352 — the regen script + pinned transformers==4.57.1 should reproduce byte-for-byte; the only divergence sources would be a Foundation Unicode-table drift on your machine or a config refresh upstream, both of which we'd want to know about.

For context: the same corpus runs in the companion ObjCTokenizer port and hits 7/7 byte-identity — including the 8 fixedBy: 0 entries this PR parks for follow-up (TinyLlama Metaspace whitespace-run collapse + Qwen2.5 BPE merge-ordering on Thai). Those look like genuine upstream-only bugs worth their own issues once the in-flight PRs settle.

Thank you for your ideas and suggestions. Appreciate you opening the door on this with #357 and then handing it off without friction — the expectedDivergences + decoded-fields shape made the whole thing land much faster than it would have.

pcuenca

Thanks both; it looks great.

Aside from simplifying the expected divergences after we reconcile with main, my only suggestion is to move the json files and the generation script to a Hugging Face bucket (or dataset). When the test suite starts we can download that repo via something like .snapshot(from: repo, matching: "*.json"). This would keep this repo lean and we can document the golden generation process there. As a matter of fact, I've wanted to move all the other static test resources to a separate repo for a while, and extend them with additional, more exhaustive test cases. This PR could become a seed for that separate effort.

I understand this may be out of scope for your contribution, so I'd be happy to help do that myself!

On attribution, I would suggest we add a co-authored-by line to the final commit so both of you appear in the merge commit:

Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>

pcuenca · 2026-05-16T11:29:21Z

+    ]
+}
+
+// MARK: - Divergences known to be in flight


We can simplify this now after we sync with main

@john-rocky

A new test target that compares Swift `tokenizer.encode(text:)` output to HuggingFace Python `transformers==4.57.1` references across six tokenizer kernels and 83 inputs covering CJK simplified/traditional, Japanese with voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script stress, programming code, IPA, and the German Kompositum torture set.

Three pieces:

1. `Tools/generate_tokenizer_baselines.py` regenerates per-kernel baselines from the `transformers` version pinned in `Tools/requirements.txt`. Asserts `is_fast` so a slow-only model can't silently produce a non-comparable reference. Each baseline carries a `metadata` block with the `transformers` version, generation timestamp, and entry count.
2. `Tests/TokenizersTests/Resources/MultilingualConformance/`:
   - `inputs.json`: 83 entries, each `{id, category, text}`. Stable ids keep baselines aligned across regenerations; categories let test output cite the broken axis directly (`japanese-voiced-kana`, `emoji-keycap`, `thai-combining-marks`, `devanagari`, …).
   - `baselines/<slug>_multilingual.json`: one per kernel, holds `input_ids`, `convert_ids_to_tokens` strings, and the two decoded forms. The decoded fields are unused by the current Swift tests; they're forward-compatible material for a decoder-side parity test.
3. `Tests/TokenizersTests/MultilingualConformanceTests.swift`: three tests under `@Suite("Multilingual Conformance")`:
   - Corpus is well-formed (unique ids, non-empty fields).
   - Baselines cover the corpus exactly (parameterised per kernel).
   - Byte-identical token ids vs HF Python (parameterised per kernel).

Kernel matrix (six entries):

- WordPiece BAAI/bge-small-en-v1.5
- Unigram google-t5/t5-small
- Byte-level BPE openai-community/gpt2
- Byte-level BPE + RBP FacebookAI/roberta-base
- Byte-level BPE modern Qwen/Qwen2.5-0.5B
- BPE byte-fallback TinyLlama/TinyLlama-1.1B-Chat-v1.0

TinyLlama is picked over huggyllama/llama-7b so the Llama family is covered without an HF auth gate. RoBERTa is kept alongside GPT-2 and Qwen2.5 for the RobertaProcessing post-processor coverage no other kernel exercises.

## expectedDivergences

Some (model, input) pairs are known to diverge from the Python reference today because a fix is in review. The cleanup-hint pattern (inspired by @john-rocky's closed huggingface#357) lets the target land green while bug fixes are in flight:

- Unexpected divergence: a (model, input) pair that diverges but isn't in the table hard-fails. This is the regression catch.
- Unexpected match: a listed pair that now matches Python prints a cleanup hint inviting removal of the entry, but doesn't fail. A freshly merged improvement won't break CI on this file.

Both directions were verified locally (removing an entry causes a hard failure with a windowed diff annotating the divergence point with token strings; adding a phantom entry for a passing input emits the cleanup hint).

Entries are grouped by `fixedBy` PR number:

- 11 entries `fixedBy: 354` (Mn-filter widening in BertNormalizer)
- 1 entry `fixedBy: 356` (Unigram scalar iteration)
- 13 entries `fixedBy: 355` (BPE scalar iteration on TinyLlama)
- 10 entries `fixedBy: 0` pending investigation, in three distinct new bug clusters that this corpus surfaces:
  * Metaspace leading-whitespace runs collapsing to single `▁` on SentencePiece BPE (5 entries, TinyLlama)
  * T5 Unigram segmentation around TM trademark / VS-16 (2 entries)
  * Qwen2.5 byte-level BPE merge-ordering on Thai (3 entries)

These should be filed as follow-up issues under huggingface#352.

## Verification

    swift test --filter MultilingualConformance
    ✔ Test "Corpus inputs.json is well-formed" passed
    ✔ Test "Baselines cover the corpus exactly" with 6 cases passed
    ✔ Test "Byte-identical token ids vs HF Python" with 6 cases passed
    ✔ Test run with 3 tests in 1 suite passed after 2.4s

Full `swift test --filter TokenizerTests|BertTokenizerTests|MultilingualConformance`: 48 tests in 4 suites pass.

## Attribution

@john-rocky's closed huggingface#357 is the source of two design decisions used here: the `expectedDivergences` cleanup-hint pattern, and shipping decoded forms in baselines for future decoder-parity work. They closed their PR with explicit "no need to preserve attribution"; credit is recorded anyway.

The corpus, regen-script split, and category-tagged input shape are ported from the multilingual conformance package shipped at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, the companion Obj-C port that diagnosed the bug catalogue in huggingface#352.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…on to huggingface#356

Verified locally on a combined branch (huggingface#354 ⊕ huggingface#355 ⊕ huggingface#356 ⊕ this PR): all three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16 segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run fired cleanup hints for both, alongside the emoji-keycap-and-flags hint.

Root cause is the same as the keycap case: a vocab-relevant scalar (TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that the old `Character`-based Unigram lattice never decomposed. Moving the iteration unit to `Unicode.Scalar` exposes it.

`expectedDivergences` count drops from 10 pending-investigation entries to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
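
To make the root cause above concrete: Python strings iterate by code point, which corresponds to Swift's `Unicode.Scalar` view, so the scalars that a grapheme-cluster-based (`Character`) walk would treat as a single unit can be listed directly. The helper name `scalars` is hypothetical; the two sample strings are the trademark-plus-VS-16 and keycap sequences mentioned in this thread.

```python
import unicodedata


def scalars(text):
    """List the code points of `text` as (U+XXXX, Unicode name) pairs."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


trademark_emoji = "\u2122\uFE0F"  # TM sign followed by VS-16
keycap_one = "1\uFE0F\u20E3"      # digit, VS-16, combining enclosing keycap

for label, text in [("trademark", trademark_emoji), ("keycap", keycap_one)]:
    print(label, len(text), [code for code, _ in scalars(text)])
# trademark 2 ['U+2122', 'U+FE0F']
# keycap 3 ['U+0031', 'U+FE0F', 'U+20E3']
```

Each sequence renders as one user-perceived character, yet a vocabulary lookup keyed on U+2122 or on the bare digit can only succeed if the tokenizer walks the individual scalars.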

…e#355 / huggingface#356 merged)

The three bug-fix PRs landed in main, simplifying the table from 35 entries down to 8. The struct loses its `fixedBy` field: all surviving entries are the two new bug clusters this corpus surfaces and that have no PR yet, so the table is now just `{modelId, inputId, note}`. The cleanup-hint message prints `note` instead of `fixedBy`.

What remains is two clusters under huggingface#352 worth filing as follow-up issues:

- 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs collapsing to single `▁` tokens instead of producing a multi-space vocab entry (e.g. `▁▁▁▁` id 268).
- 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge ordering than HF Python on Thai byte sequences.

The same corpus runs in the companion Obj-C port (https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and hits 7/7 byte-identity on these inputs, so both clusters look like upstream-only bugs and the Obj-C source is a usable reference for the follow-up fixes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>

apocryphx · 2026-05-16T15:29:40Z

Pushed ce84708:

- Rebased onto current main, which now carries #354 (Strip Japanese voiced-kana marks in BasicTokenizer; #352, Bug 2), #355 (BPE merge by Unicode scalar, not grapheme cluster; #352, Bug 4), and #356 (Unigram lattice walks Unicode scalars; #352, Bug 3).
- Dropped the 27 obsolete expectedDivergences entries, so the table is down from 35 to 8. Surviving entries are the two new bug clusters this corpus surfaces (TinyLlama Metaspace leading-whitespace runs, Qwen2.5 byte-level BPE merge-ordering on Thai), worth filing as separate follow-up issues under #352 (Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback).
- Simplified the struct accordingly to {modelId, inputId, note}. The fixedBy field is gone, since every surviving entry is in the same "no specific PR yet" bucket.
- Added Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com> to the commit footer so @john-rocky lands in the squash-and-merge commit alongside the Claude attribution.

For the Hub dataset / bucket migration — yes please, your offer to handle that yourself is appreciated. The corpus, baselines, and generate_tokenizer_baselines.py script live together in Tests/TokenizersTests/Resources/MultilingualConformance/ + Tools/, so they should move as a unit. The companion package at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance has the same shape (the Obj-C port consumes it as a regression bar) and could either mirror the Hub repo or become a downstream consumer — whichever ends up cleaner for the seed effort you described.

Thanks @pcuenca for the careful review on all four PRs and for landing them so quickly; it was a pleasure to collaborate.

pcuenca · 2026-05-22T12:37:20Z

Update: I'll see if we can land bucket support (huggingface/swift-huggingface#55) first, and then we can move the test data to a bucket. If progress on buckets is slow, I'll use a dataset instead.

Sorry for the delay, I'll keep you posted!

apocryphx · 2026-05-22T16:31:47Z

No rush at all @pcuenca. The conformance target sits happily in-tree as a local checkpoint until buckets land; the pinned transformers==4.57.1 in the regen script keeps baselines stable while we wait.

Looking forward to seeing swift-huggingface#55 land.

P.S.
(For context, not asking anything: ObjCTokenizer just shipped into VecD, a headless XPC daemon for BGE-M3 dense + sparse retrieval. The sparse head is token-keyed, so v1.3.3's byte-identity is load-bearing.)

pcuenca reviewed May 16, 2026

apocryphx and others added 3 commits May 16, 2026 08:26

apocryphx force-pushed the feat/multilingual-conformance-tests branch from 2c0b5ca to ce84708 Compare May 16, 2026 15:29

apocryphx wants to merge 3 commits into huggingface:main from apocryphx:feat/multilingual-conformance-tests
