Add MultilingualConformanceTests for byte-identical Python parity (#352) #360

Open
apocryphx wants to merge 3 commits into huggingface:main from apocryphx:feat/multilingual-conformance-tests

Conversation

@apocryphx

@apocryphx apocryphx commented May 16, 2026

Contributor

Addresses #42 (improve tokenizer testing) and provides a place to land regression coverage for the bug catalogue in #352. Picks up where @john-rocky's closed #357 left off — see attribution at the bottom.

What

A new test target that compares Swift tokenizer.encode(text:) output against HuggingFace Python transformers==4.57.1 references across six tokenizer kernels and an 83-input corpus covering script boundaries the existing English-leaning fixtures don't reach: CJK simplified + traditional, Japanese with voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script stress lines, programming code, IPA, and the German Kompositum torture set.

How

Three pieces:

  1. Tools/generate_tokenizer_baselines.py regenerates per-kernel baselines from the transformers version pinned in Tools/requirements.txt. Asserts is_fast so a slow-only model can't silently produce a non-comparable reference. Each baseline carries a metadata block with the transformers version, generation timestamp, and entry count.

  2. Tests/TokenizersTests/Resources/MultilingualConformance/:

    • inputs.json — 83 entries, each {id, category, text}. Stable ids keep baselines aligned across regenerations; categories let test output cite the broken axis directly (japanese-voiced-kana, emoji-keycap, thai-combining-marks, devanagari, …).
    • baselines/<slug>_multilingual.json — one per kernel, holds input_ids, convert_ids_to_tokens strings, and the two decoded forms (decoded_with_special / decoded_skip_special). The decoded fields are intentionally produced now and left unused by the current Swift tests — forward-compatible material for a decoder-side parity test follow-up.
  3. Tests/TokenizersTests/MultilingualConformanceTests.swift, with three tests under @Suite("Multilingual Conformance"):

    • Corpus is well-formed — unique ids, non-empty fields.
    • Baselines cover the corpus exactly — parameterised per kernel, guards against drift between the corpus and any baseline file.
    • Byte-identical token ids vs HF Python — parameterised per kernel; on divergence prints a windowed diff with token strings around the first mismatch and the corpus category, so the broken axis is visible in the failure message.
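
As a rough illustration of the first piece, here is a minimal Python sketch of what a baseline generator along these lines could look like. It is not the contents of Tools/generate_tokenizer_baselines.py: the function names, the `tokens` and `entries` keys, and the metadata field names are assumptions made for the example, while `input_ids`, `decoded_with_special`, and `decoded_skip_special` follow the field names listed above. It also assumes the default `add_special_tokens=True` behaviour of `encode` is what the Swift side is compared against.

```python
from datetime import datetime, timezone


def load_fast_tokenizer(model_id):
    """Load a tokenizer and refuse slow-only implementations."""
    # Imported lazily so build_baseline() can be exercised without transformers.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # A slow-only tokenizer would yield a reference that is not comparable.
    assert tokenizer.is_fast, f"{model_id} has no fast tokenizer"
    return tokenizer


def build_baseline(tokenizer, inputs, transformers_version):
    """Encode every corpus entry and wrap the results with a metadata block."""
    entries = []
    for item in inputs:
        ids = tokenizer.encode(item["text"])
        entries.append({
            "id": item["id"],
            "input_ids": ids,
            "tokens": tokenizer.convert_ids_to_tokens(ids),  # hypothetical key name
            "decoded_with_special": tokenizer.decode(ids, skip_special_tokens=False),
            "decoded_skip_special": tokenizer.decode(ids, skip_special_tokens=True),
        })
    return {
        "metadata": {  # field names below are assumptions
            "transformers_version": transformers_version,
            "generated_at": datetime.now(timezone.utc).isoformat(),
            "entry_count": len(entries),
        },
        "entries": entries,
    }
```

A caller would presumably pass `load_fast_tokenizer("openai-community/gpt2")`, the parsed inputs.json, and `transformers.__version__`, then write the result with `json.dump(..., ensure_ascii=False)`.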

Kernel matrix

| Kernel | Model | Why |
| --- | --- | --- |
| WordPiece | BAAI/bge-small-en-v1.5 | Most used Apple-Silicon embedder |
| Unigram (SentencePiece) | google-t5/t5-small | Canonical Unigram + Metaspace |
| Byte-level BPE | openai-community/gpt2 | Round-trip baseline |
| Byte-level BPE + RobertaProcessing | FacebookAI/roberta-base | Post-processor surface no other kernel hits |
| Byte-level BPE (modern) | Qwen/Qwen2.5-0.5B | Catches kernel- vs vocab-shape bugs |
| BPE + byte-fallback | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Llama family, no HF auth gate |

TinyLlama is picked over huggyllama/llama-7b so the Llama family is covered without an auth-gating CI risk.

expectedDivergences

Some (model, input) pairs diverge from the Python reference today because a fix is in review. The cleanup-hint pattern lets the target land green while bug fixes are in flight:

  • An unexpected divergence — a (model, input) pair that diverges but isn't in the table — is a hard failure. This is the regression catch.
  • An unexpected match — a listed pair that now matches Python — emits a printed cleanup hint, but doesn't fail the test. A freshly merged improvement doesn't break CI on this file.
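
The two rules above reduce to a four-way decision per (model, input) pair. The sketch below restates that decision in Python for readability. The actual test is Swift, and the `ExpectedDivergence` and `classify` names and the outcome strings are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedDivergence:
    model_id: str
    input_id: str
    fixed_by: int  # PR number; 0 means no fix is in flight yet


def classify(model_id, input_id, swift_ids, python_ids, table):
    """Return 'pass', 'expected', 'cleanup-hint', or 'fail' for one pair."""
    listed = any(e.model_id == model_id and e.input_id == input_id for e in table)
    diverges = swift_ids != python_ids
    if diverges and not listed:
        return "fail"          # unexpected divergence: the regression catch
    if diverges and listed:
        return "expected"      # known divergence, tolerated while a fix is in review
    if listed:
        return "cleanup-hint"  # unexpected match: entry can go, CI stays green
    return "pass"
```

Only the `fail` outcome should turn the run red. `cleanup-hint` is printed and otherwise ignored, which is what lets a fix PR merge without touching this file first.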

Entries are grouped by fixedBy PR number:

  • 11 entries fixedBy: 354 — Mn-filter widening in BertNormalizer.stripAccents. Covers Japanese voiced kana, Devanagari halant, Arabic diacritics, emoji VS-16, and similar.
  • 3 entries fixedBy: 356 — Unigram + TokenLattice scalar iteration. Covers T5 emoji-keycap, the ™️ (U+2122 + VS-16) cluster, and ZWJ-between-text in escape sequences.
  • 13 entries fixedBy: 355 — BPE scalar iteration on TinyLlama. Covers Thai combining marks, Devanagari conjuncts, emoji ZWJ, emoji keycap, and several multi-script stress lines.
  • 8 entries fixedBy: 0, pending investigation under #352 ("Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback"). These form two new bug clusters that this corpus surfaces but none of the in-flight PRs address:
    • Metaspace leading-whitespace runs collapse to single tokens on SentencePiece BPE instead of producing the multi-space vocab entry (e.g. ▁▁▁▁ id 268 on TinyLlama). 5 entries on TinyLlama. Worth filing as a follow-up issue.
    • Qwen2.5 byte-level BPE merge-ordering on Thai. 3 entries — the kernel picks a different merge ordering than HF Python on Thai byte sequences. Byte-level encoding rules out the combining-mark trap from Bug 4, so this is its own surface.

These two clusters should land as separate follow-up issues. Adding entries to expectedDivergences rather than gating the PR on them keeps the regression-catch infrastructure shippable today.
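
For readers unfamiliar with the `fixedBy: 354` group: to my understanding, the Python-side accent stripping decomposes the text to NFD and then drops every code point whose general category is Mn (nonspacing mark). The sketch below reproduces that reference behaviour with the standard library only. It is not the Swift `BertNormalizer.stripAccents` code, and the function name is chosen for the example.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop every nonspacing mark (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Voiced kana: U+304C decomposes to U+304B plus the combining mark U+3099.
print(strip_accents("\u304c") == "\u304b")
# Devanagari halant (U+094D) and VS-16 (U+FE0F) are both category Mn.
print(strip_accents("\u0915\u094d") == "\u0915")
print(strip_accents("\u2122\ufe0f") == "\u2122")
```

A filter that only recognises Latin combining diacritics would leave all three inputs unchanged, which is consistent with the categories listed for that group.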

Both directions of the mechanism were verified locally — removing an entry from the table causes a hard failure with the windowed diff, and adding a phantom entry for a passing input emits the cleanup hint.
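
The windowed diff mentioned above can be pictured with the following Python sketch. The output layout, the `radius` parameter, and the function name are assumptions. The Swift test's actual message format is not reproduced here.

```python
def windowed_diff(expected_ids, actual_ids, expected_tokens, actual_tokens, radius=3):
    """Describe the first mismatch with `radius` tokens of context on each side.

    Returns None when the two id sequences are identical.
    """
    if expected_ids == actual_ids:
        return None
    # First index where the sequences differ; a pure length mismatch lands
    # at the end of the shorter sequence.
    shared = min(len(expected_ids), len(actual_ids))
    first = next((i for i in range(shared) if expected_ids[i] != actual_ids[i]), shared)
    lo = max(0, first - radius)
    hi = first + radius + 1
    return "\n".join([
        f"first mismatch at index {first}",
        f"  python: {expected_tokens[lo:hi]}",
        f"  swift:  {actual_tokens[lo:hi]}",
    ])
```

Printing token strings rather than raw ids is what makes the broken axis readable: a mismatch inside a run of `▁` tokens or Thai byte tokens is recognisable at a glance.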

Verified against the combined fix branch

A combined branch (#354, #355, #356 ⊕ this PR) was run end-to-end to confirm Pedro's requested changes don't introduce regressions and that the cleanup hints fire where they should:

```
swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"
✔ Test run with 52 tests in 4 suites passed after 9.13s
0 failures
27 cleanup hints fired (11 BGE / 3 T5 / 13 TinyLlama)
```

All 27 cleanup hints correspond to the fixedBy: 354|355|356 entries — every entry the three fix PRs are supposed to clean up does in fact get cleaned up. The 8 fixedBy: 0 entries persisted (correctly — neither of the two new clusters is addressed by the in-flight PRs).

Verification

Local on macOS, Swift 6.3.2, against current main:

```
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed (unique ids, non-empty fields)" passed
✔ Test "Baselines cover the corpus exactly" with 6 test cases passed
✔ Test "Byte-identical token ids vs HF Python" with 6 test cases passed
✔ Test run with 3 tests in 1 suite passed after 2.4s
```
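
The first two tests in the run above are structural checks rather than tokenizer checks. A Python sketch of equivalent logic follows. The function names are hypothetical, and the baseline entries are assumed to carry the same `id` values as inputs.json.

```python
def corpus_problems(inputs):
    """Return a list of problems: empty id/category/text fields or duplicate ids."""
    problems = []
    seen = set()
    for index, item in enumerate(inputs):
        for field in ("id", "category", "text"):
            if not item.get(field):
                problems.append(f"entry {index}: empty {field}")
        item_id = item.get("id")
        if item_id and item_id in seen:
            problems.append(f"entry {index}: duplicate id {item_id}")
        seen.add(item_id)
    return problems


def coverage_drift(inputs, baseline_entries):
    """Return (missing, extra) ids between the corpus and one baseline file."""
    corpus_ids = {item["id"] for item in inputs}
    baseline_ids = {entry["id"] for entry in baseline_entries}
    return sorted(corpus_ids - baseline_ids), sorted(baseline_ids - corpus_ids)
```

An empty problem list and a `([], [])` drift result for every kernel correspond to the two passing lines in the output.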

Full `swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"`: 48 tests in 4 suites pass on this branch alone (52 on the combined branch).

Regenerating baselines from a fresh `pip install -r Tools/requirements.txt` reproduces the bundled JSON against `transformers==4.57.1`.
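
One way to check that claim mechanically is sketched below. It assumes the generation timestamp in the metadata block is the only field expected to change between runs, and the `metadata` and `generated_at` key names are guesses rather than the script's real schema.

```python
import json


def baselines_match(bundled_path, regenerated_path, volatile_keys=("generated_at",)):
    """Compare two baseline files, ignoring metadata keys that change every run."""
    def load(path):
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
        metadata = data.get("metadata", {})
        for key in volatile_keys:
            metadata.pop(key, None)  # the timestamp differs on every regeneration
        return data

    return load(bundled_path) == load(regenerated_path)
```

A `False` result would point at one of the drift sources worth investigating, such as a changed tokenizer config upstream.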

Attribution

@john-rocky opened #357 a day before this PR with significant overlap. They closed it with an explicit "no need to preserve attribution", but the design ideas this PR borrows from them are worth recording:

  • The `expectedDivergences` cleanup-hint pattern, which makes the target land-able while bug fixes are in flight without losing regression-catch behaviour.
  • Decoded forms in baselines for forward-compatible decoder-parity work.
  • TinyLlama instead of huggyllama/llama-7b for auth-gate-free Llama coverage.

The corpus, regen-script split, and category-tagged input shape are ported from the companion Obj-C port at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, which was the conceptual harness that diagnosed the bug catalogue in #352.

🤖 Generated with Claude Code

@john-rocky

Contributor

Pulled the branch and ran the new suite locally — looks great. Thanks for picking this up, and for the attribution.

Env: Apple Silicon (arm64), macOS 26.0, Swift 6.2.1, branch `apocryphx/feat/multilingual-conformance-tests@2c0b5ca`.

```
swift build # Build complete!
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed (unique ids, non-empty fields)" passed after 0.004s
✔ Test "Baselines cover the corpus exactly" with 6 test cases passed after 0.014s
✔ Test "Byte-identical token ids vs HF Python" with 6 test cases passed after 2.531s
✔ Suite "Multilingual Conformance" passed after 2.531s
✔ Test run with 3 tests in 1 suite passed.

swift test --parallel --filter MultilingualConformance
✔ ... 2.474s

swift test # full package
✔ Test run with 158 tests in 21 suites passed after 15.337s
```

Corpus stats (just to surface them in the thread): 83 inputs across 30 categories, the biggest being `programming-code` (17), `multiscript-stress` (8), `german-compound` (6); plus all the script edges you called out (CJK both flavors, Hangul, Devanagari, Thai combining, Arabic / Hebrew RTL, BMP+astral emoji, ZWJ, IPA, etc.).

The two T5 `expectedDivergences` re-classified in 2c0b5ca to point at #356 (Unigram-by-scalar) tracked cleanly — they're the only entries that exercise the `google-t5/t5-small` SentencePiece path on the multi-script stress lines, so it's a natural attribution.

No suggested changes from my side. Happy to verify regeneration locally too once #354/355/356 land if useful — running `Tools/generate_tokenizer_baselines.py` and confirming the new baselines reduce `expectedDivergences` would be a nice closing checkpoint for #352.

@apocryphx

Contributor Author

Thanks @john-rocky for running it through — useful coverage point on macOS 26.0 / Swift 6.2.1.

On the german-compound entries: small Kompositum easter egg — German speakers will find them humorous. They also exercise long-word handling, which is why they survived.

Yes please to the regeneration check once #354/#355/#356 land. Clean closing checkpoint for #352 — the regen script + pinned transformers==4.57.1 should reproduce byte-for-byte; the only divergence sources would be a Foundation Unicode-table drift on your machine or a config refresh upstream, both of which we'd want to know about.

For context: the same corpus runs in the companion ObjCTokenizer port and hits 7/7 byte-identity — including the 8 fixedBy: 0 entries this PR parks for follow-up (TinyLlama Metaspace whitespace-run collapse + Qwen2.5 BPE merge-ordering on Thai). Those look like genuine upstream-only bugs worth their own issues once the in-flight PRs settle.

Thank you for your ideas and suggestions. Appreciate you opening the door on this with #357 and then handing it off without friction — the expectedDivergences + decoded-fields shape made the whole thing land much faster than it would have.

@pcuenca pcuenca left a comment

Member

Thanks both; it looks great.

Aside from simplifying the expected divergences after we reconcile with main, my only suggestion is to move the json files and the generation script to a Hugging Face bucket (or dataset). When the test suite starts we can download that repo via something like .snapshot(from: repo, matching: "*.json"). This would keep this repo lean and we can document the golden generation process there. As a matter of fact, I've wanted to move all the other static test resources to a separate repo for a while, and extend them with additional, more exhaustive test cases. This PR could become a seed for that separate effort.

I understand this may be out of scope for your contribution, so I'd be happy to help do that myself!


On attribution, I would suggest we add a co-authored-by line to the final commit so both of you appear in the merge commit:

Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>

]
}

// MARK: - Divergences known to be in flight

Member

We can simplify this now after we sync with main

apocryphx and others added 3 commits May 16, 2026 08:26
A new test target that compares Swift `tokenizer.encode(text:)` output to
HuggingFace Python `transformers==4.57.1` references across six tokenizer
kernels and 83 inputs covering CJK simplified/traditional, Japanese with
voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining
marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script
stress, programming code, IPA, and the German Kompositum torture set.

Three pieces:

1. `Tools/generate_tokenizer_baselines.py` regenerates per-kernel baselines
   from the `transformers` version pinned in `Tools/requirements.txt`.
   Asserts `is_fast` so a slow-only model can't silently produce a
   non-comparable reference. Each baseline carries a `metadata` block
   with the `transformers` version, generation timestamp, and entry count.

2. `Tests/TokenizersTests/Resources/MultilingualConformance/`:
   - `inputs.json` — 83 entries, each `{id, category, text}`. Stable ids
     keep baselines aligned across regenerations; categories let test
     output cite the broken axis directly (`japanese-voiced-kana`,
     `emoji-keycap`, `thai-combining-marks`, `devanagari`, …).
   - `baselines/<slug>_multilingual.json` — one per kernel, holds
     `input_ids`, `convert_ids_to_tokens` strings, and the two decoded
     forms. The decoded fields are unused by the current Swift tests;
     they're forward-compatible material for a decoder-side parity test.

3. `Tests/TokenizersTests/MultilingualConformanceTests.swift` — three
   tests under `@Suite("Multilingual Conformance")`:
   - Corpus is well-formed (unique ids, non-empty fields).
   - Baselines cover the corpus exactly (parameterised per kernel).
   - Byte-identical token ids vs HF Python (parameterised per kernel).

Kernel matrix (six entries):
  - WordPiece              BAAI/bge-small-en-v1.5
  - Unigram                google-t5/t5-small
  - Byte-level BPE         openai-community/gpt2
  - Byte-level BPE + RBP   FacebookAI/roberta-base
  - Byte-level BPE modern  Qwen/Qwen2.5-0.5B
  - BPE byte-fallback      TinyLlama/TinyLlama-1.1B-Chat-v1.0

TinyLlama is picked over huggyllama/llama-7b so the Llama family is
covered without an HF auth gate. RoBERTa is kept alongside GPT-2 and
Qwen2.5 for the RobertaProcessing post-processor coverage no other
kernel exercises.

## expectedDivergences

Some (model, input) pairs are known to diverge from the Python reference
today because a fix is in review. The cleanup-hint pattern (inspired by
@john-rocky's closed huggingface#357) lets the target land green while bug fixes are
in flight:

  - Unexpected divergence — a (model, input) pair that diverges but
    isn't in the table — hard-fails. This is the regression catch.
  - Unexpected match — a listed pair that now matches Python — prints a
    cleanup hint inviting removal of the entry, but doesn't fail. A
    freshly merged improvement won't break CI on this file.

Both directions were verified locally (removing an entry causes a hard
failure with a windowed diff annotating the divergence point with token
strings; adding a phantom entry for a passing input emits the cleanup
hint).

Entries are grouped by `fixedBy` PR number:
  - 11 entries `fixedBy: 354`  (Mn-filter widening in BertNormalizer)
  -  1 entry  `fixedBy: 356`  (Unigram scalar iteration)
  - 13 entries `fixedBy: 355`  (BPE scalar iteration on TinyLlama)
  - 10 entries `fixedBy: 0`   pending investigation — three distinct
       new bug clusters that this corpus surfaces:
       * Metaspace leading-whitespace runs collapsing to single `▁`
         on SentencePiece BPE (5 entries, TinyLlama)
       * T5 Unigram segmentation around TM trademark / VS-16 (2 entries)
       * Qwen2.5 byte-level BPE merge-ordering on Thai (3 entries)
     These should be filed as follow-up issues under huggingface#352.

## Verification

  swift test --filter MultilingualConformance
  ✔ Test "Corpus inputs.json is well-formed"            passed
  ✔ Test "Baselines cover the corpus exactly"           with 6 cases passed
  ✔ Test "Byte-identical token ids vs HF Python"        with 6 cases passed
  ✔ Test run with 3 tests in 1 suite passed after 2.4s

Full `swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"`:
  48 tests in 4 suites pass.

## Attribution

@john-rocky's closed huggingface#357 is the source of two design decisions used here:
the `expectedDivergences` cleanup-hint pattern, and shipping decoded forms
in baselines for future decoder-parity work. They closed their PR with an
explicit "no need to preserve attribution"; credit is recorded anyway.

The corpus, regen-script split, and category-tagged input shape are
ported from the multilingual conformance package shipped at
https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, the
companion Obj-C port that diagnosed the bug catalogue in huggingface#352.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on to huggingface#356

Verified locally on a combined branch (huggingface#354, huggingface#355, huggingface#356 ⊕ this PR): all
three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two
were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16
segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run
fired cleanup hints for both, alongside the emoji-keycap-and-flags hint.

Root cause is the same as the keycap case: a vocab-relevant scalar
(TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that
the old `Character`-based Unigram lattice never decomposed. Moving the
iteration unit to `Unicode.Scalar` exposes it.
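
To make the root cause concrete, the Python sketch below lists the scalars inside two of the affected clusters. Python's `str` iterates by code point, which is roughly analogous to Swift's `unicodeScalars` view, whereas Swift `Character` iteration yields each of these strings as a single element. The `scalars` helper is illustrative only.

```python
import unicodedata


def scalars(text):
    """List each Unicode scalar in `text` as (code point, name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


trademark_emoji = "\u2122\ufe0f"  # TM sign + VS-16: one user-perceived character
keycap_one = "1\ufe0f\u20e3"      # digit + VS-16 + combining enclosing keycap

for label, text in (("trademark", trademark_emoji), ("keycap", keycap_one)):
    print(label, scalars(text))
```

A lattice that only ever sees the whole cluster cannot match a vocab entry keyed on U+2122 or on the digit alone, which fits the behaviour described above.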

`expectedDivergences` count drops from 10 pending-investigation entries
to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e#355 / huggingface#356 merged)

The three bug-fix PRs landed in main, simplifying the table from 35 entries
down to 8. The struct loses its `fixedBy` field — all surviving entries are
the two new bug clusters this corpus surfaces and that have no PR yet, so
the table is now just `{modelId, inputId, note}`. The cleanup-hint message
prints `note` instead of `fixedBy`.

What remains is two clusters under huggingface#352 worth filing as follow-up issues:

  - 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs
    collapsing to single `▁` tokens instead of producing a multi-space
    vocab entry (e.g. `▁▁▁▁` id 268).
  - 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge
    ordering than HF Python on Thai byte sequences.

The same corpus runs in the companion Obj-C port
(https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and
hits 7/7 byte-identity on these inputs — so both clusters look like
upstream-only bugs and the Obj-C source is a usable reference for the
follow-up fixes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>
@apocryphx apocryphx force-pushed the feat/multilingual-conformance-tests branch from 2c0b5ca to ce84708 Compare May 16, 2026 15:29
@apocryphx

Contributor Author

Pushed ce84708:

For the Hub dataset / bucket migration — yes please, your offer to handle that yourself is appreciated. The corpus, baselines, and generate_tokenizer_baselines.py script live together in Tests/TokenizersTests/Resources/MultilingualConformance/ + Tools/, so they should move as a unit. The companion package at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance has the same shape (the Obj-C port consumes it as a regression bar) and could either mirror the Hub repo or become a downstream consumer — whichever ends up cleaner for the seed effort you described.

Thanks @pcuenca for the careful review on all four PRs and for landing them so quickly; it was a pleasure to collaborate.

@pcuenca

pcuenca commented May 22, 2026

Member

Update: I'll see if we can land bucket support (huggingface/swift-huggingface#55) first, and then we can move the test data to a bucket. If progress on buckets is slow, I'll use a dataset instead.

Sorry for the delay, I'll keep you posted!

@apocryphx

Contributor Author

No rush at all @pcuenca. The conformance target sits happily in-tree as a local checkpoint until buckets land; the pinned transformers==4.57.1 in the regen script keeps baselines stable while we wait.

Looking forward to seeing swift-huggingface#55 land.

P.S.
(For context, not asking anything: ObjCTokenizer just shipped into VecD, a headless XPC daemon for BGE-M3 dense + sparse retrieval. The sparse head is token-keyed, so v1.3.3's byte-identity is load-bearing.)
