Add MultilingualConformanceTests for byte-identical Python parity (#352) #360
apocryphx wants to merge 3 commits into
Pulled the branch and ran the new suite locally; looks great. Thanks for picking this up, and for the attribution.

Env: Apple Silicon (arm64), macOS 26.0, Swift 6.2.1, branch `apocryphx/feat/multilingual-conformance-tests@2c0b5ca`.

```
swift test --parallel --filter MultilingualConformance
swift test # full package
```

Corpus stats (just to surface them in the thread): 83 inputs across 30 categories, the biggest being `programming-code` (17), `multiscript-stress` (8), `german-compound` (6); plus all the script edges you called out (CJK both flavors, Hangul, Devanagari, Thai combining, Arabic / Hebrew RTL, BMP+astral emoji, ZWJ, IPA, etc.).

The two T5 `expectedDivergences` re-classified in 2c0b5ca to point at #356 (Unigram-by-scalar) tracked cleanly. They're the only entries that exercise the `google-t5/t5-small` SentencePiece path on the multi-script stress lines, so it's a natural attribution.

No suggested changes from my side. Happy to verify regeneration locally too once #354/355/356 land if useful: running `Tools/generate_tokenizer_baselines.py` and confirming the new baselines reduce `expectedDivergences` would be a nice closing checkpoint for #352.
Thanks @john-rocky for running it through — useful coverage point on macOS 26.0 / Swift 6.2.1. On the Yes please to the regeneration check once #354/#355/#356 land. Clean closing checkpoint for #352 — the regen script + pinned For context: the same corpus runs in the companion Thank you for your ideas and suggestions. Appreciate you opening the door on this with #357 and then handing it off without friction — the |
pcuenca left a comment
Thanks both; it looks great.
Aside from simplifying the expected divergences after we reconcile with main, my only suggestion is to move the json files and the generation script to a Hugging Face bucket (or dataset). When the test suite starts we can download that repo via something like .snapshot(from: repo, matching: "*.json"). This would keep this repo lean and we can document the golden generation process there. As a matter of fact, I've wanted to move all the other static test resources to a separate repo for a while, and extend them with additional, more exhaustive test cases. This PR could become a seed for that separate effort.
I understand this may be out of scope for your contribution, so I'd be happy to help do that myself!
On attribution, I would suggest we add a co-authored-by line to the final commit so both of you appear in the merge commit:
Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>
// MARK: - Divergences known to be in flight
We can simplify this now after we sync with main
A new test target that compares Swift `tokenizer.encode(text:)` output to
HuggingFace Python `transformers==4.57.1` references across six tokenizer
kernels and 83 inputs covering CJK simplified/traditional, Japanese with
voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining
marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script
stress, programming code, IPA, and the German Kompositum torture set.
Three pieces:
1. `Tools/generate_tokenizer_baselines.py` regenerates per-kernel baselines
from the `transformers` version pinned in `Tools/requirements.txt`.
Asserts `is_fast` so a slow-only model can't silently produce a
non-comparable reference. Each baseline carries a `metadata` block
with the `transformers` version, generation timestamp, and entry count.
2. `Tests/TokenizersTests/Resources/MultilingualConformance/`:
- `inputs.json` — 83 entries, each `{id, category, text}`. Stable ids
keep baselines aligned across regenerations; categories let test
output cite the broken axis directly (`japanese-voiced-kana`,
`emoji-keycap`, `thai-combining-marks`, `devanagari`, …).
- `baselines/<slug>_multilingual.json` — one per kernel, holds
`input_ids`, `convert_ids_to_tokens` strings, and the two decoded
forms. The decoded fields are unused by the current Swift tests;
they're forward-compatible material for a decoder-side parity test.
3. `Tests/TokenizersTests/MultilingualConformanceTests.swift` — three
tests under `@Suite("Multilingual Conformance")`:
- Corpus is well-formed (unique ids, non-empty fields).
- Baselines cover the corpus exactly (parameterised per kernel).
- Byte-identical token ids vs HF Python (parameterised per kernel).
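The first two checks can be illustrated with a short Python sketch. This is not the Swift test code in this PR: the function names `check_corpus` and `check_coverage`, the sample ids, and the idea that a baseline can be reduced to a list of ids are all assumptions made for illustration. Only the `{id, category, text}` entry shape comes from the description above.

```python
def check_corpus(entries):
    """Return a list of problems: empty/missing fields or duplicate ids."""
    problems = []
    seen = set()
    for i, entry in enumerate(entries):
        for field in ("id", "category", "text"):
            if not entry.get(field):
                problems.append(f"entry {i}: empty or missing '{field}'")
        entry_id = entry.get("id")
        if entry_id:
            if entry_id in seen:
                problems.append(f"entry {i}: duplicate id '{entry_id}'")
            seen.add(entry_id)
    return problems


def check_coverage(corpus_ids, baseline_ids):
    """Return (missing, extra): corpus ids without a baseline, baseline ids not in the corpus."""
    corpus, baseline = set(corpus_ids), set(baseline_ids)
    return sorted(corpus - baseline), sorted(baseline - corpus)


# Hypothetical sample entries; the real corpus lives in inputs.json.
corpus = [
    {"id": "thai-001", "category": "thai-combining-marks", "text": "ภาษาไทย"},
    {"id": "keycap-001", "category": "emoji-keycap", "text": "1\ufe0f\u20e3"},
]
print(check_corpus(corpus))
print(check_coverage([e["id"] for e in corpus], ["thai-001"]))
```

"Exact" coverage here means both returned lists are empty: no corpus id lacks a baseline entry, and no baseline entry refers to an id the corpus no longer contains.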
Kernel matrix (six entries):
- WordPiece BAAI/bge-small-en-v1.5
- Unigram google-t5/t5-small
- Byte-level BPE openai-community/gpt2
- Byte-level BPE + RBP FacebookAI/roberta-base
- Byte-level BPE modern Qwen/Qwen2.5-0.5B
- BPE byte-fallback TinyLlama/TinyLlama-1.1B-Chat-v1.0
TinyLlama is picked over huggyllama/llama-7b so the Llama family is
covered without an HF auth gate. RoBERTa is kept alongside GPT-2 and
Qwen2.5 for the RobertaProcessing post-processor coverage no other
kernel exercises.
## expectedDivergences
Some (model, input) pairs are known to diverge from the Python reference
today because a fix is in review. The cleanup-hint pattern (inspired by
@john-rocky's closed huggingface#357) lets the target land green while bug fixes are
in flight:
- Unexpected divergence — a (model, input) pair that diverges but
isn't in the table — hard-fails. This is the regression catch.
- Unexpected match — a listed pair that now matches Python — prints a
cleanup hint inviting removal of the entry, but doesn't fail. A
freshly merged improvement won't break CI on this file.
Both directions were verified locally (removing an entry causes a hard
failure with a windowed diff annotating the divergence point with token
strings; adding a phantom entry for a passing input emits the cleanup
hint).
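The two-direction mechanism described above can be sketched in Python as follows. This is an illustration under assumptions, not the Swift implementation: the names `first_divergence`, `classify`, and `windowed_diff` are hypothetical, the table is modelled as a plain set of `(model, input)` pairs, and the window shows token ids only (the PR's diff also annotates token strings).

```python
def first_divergence(expected, actual):
    """Index of the first position where the id lists differ, or None if identical."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def classify(model_id, input_id, expected, actual, expected_divergences):
    """Classify one (model, input) pair against the expected-divergence table."""
    listed = (model_id, input_id) in expected_divergences
    diverges = first_divergence(expected, actual) is not None
    if diverges and not listed:
        return "fail"          # unexpected divergence: the regression catch
    if not diverges and listed:
        return "cleanup-hint"  # unexpected match: the entry can be removed
    return "ok"                # matches, or diverges exactly as recorded


def windowed_diff(expected, actual, radius=2):
    """Slices of both id lists around the first divergence, or None if identical."""
    at = first_divergence(expected, actual)
    if at is None:
        return None
    lo = max(0, at - radius)
    hi = at + radius + 1
    return {"index": at, "expected": expected[lo:hi], "actual": actual[lo:hi]}


table = {("model-a", "input-1")}  # hypothetical table entry
print(classify("model-a", "input-2", [1, 2, 3], [1, 9, 3], table))
print(classify("model-a", "input-1", [1, 2, 3], [1, 2, 3], table))
print(windowed_diff([1, 2, 3, 4, 5, 6], [1, 2, 3, 9, 5, 6], radius=1))
```

Only the `"fail"` outcome would break a test run in this sketch; `"cleanup-hint"` is informational, which is what lets a freshly merged fix land without touching the table first.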
Entries are grouped by `fixedBy` PR number:
- 11 entries `fixedBy: 354` (Mn-filter widening in BertNormalizer)
- 1 entry `fixedBy: 356` (Unigram scalar iteration)
- 13 entries `fixedBy: 355` (BPE scalar iteration on TinyLlama)
- 10 entries `fixedBy: 0` pending investigation — three distinct
new bug clusters that this corpus surfaces:
* Metaspace leading-whitespace runs collapsing to single `▁`
on SentencePiece BPE (5 entries, TinyLlama)
* T5 Unigram segmentation around TM trademark / VS-16 (2 entries)
* Qwen2.5 byte-level BPE merge-ordering on Thai (3 entries)
These should be filed as follow-up issues under huggingface#352.
## Verification
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed" passed
✔ Test "Baselines cover the corpus exactly" with 6 cases passed
✔ Test "Byte-identical token ids vs HF Python" with 6 cases passed
✔ Test run with 3 tests in 1 suite passed after 2.4s
Full `swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"`:
48 tests in 4 suites pass.
## Attribution
@john-rocky's closed huggingface#357 is the source of two design decisions used here:
the `expectedDivergences` cleanup-hint pattern, and shipping decoded forms
in baselines for future decoder-parity work. They closed their PR with
an explicit "no need to preserve attribution"; credit is recorded anyway.
The corpus, regen-script split, and category-tagged input shape are
ported from the multilingual conformance package shipped at
https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, the
companion Obj-C port that diagnosed the bug catalogue in huggingface#352.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on to huggingface#356

Verified locally on a combined branch (huggingface#354 ⊕ huggingface#355 ⊕ huggingface#356 ⊕ this PR): all three T5 divergences are fixed by huggingface#356's scalar-iteration switch. Two were originally listed as `fixedBy: 0` ("Unigram TM trademark / VS-16 segmentation" and "Unigram ZWJ-after-text edge"); the combined-branch run fired cleanup hints for both, alongside the emoji-keycap-and-flags hint.

Root cause is the same as the keycap case: a vocab-relevant scalar (TM glyph U+2122, ZWJ U+200D) is hidden inside a grapheme cluster that the old `Character`-based Unigram lattice never decomposed. Moving the iteration unit to `Unicode.Scalar` exposes it.

`expectedDivergences` count drops from 10 pending-investigation entries to 8 (5 TinyLlama Metaspace whitespace + 3 Qwen2.5 BPE merge-ordering).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
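To make the grapheme-versus-scalar point concrete, the sketch below lists the code points inside two of the clusters mentioned in this thread. It uses Python only because Python strings iterate by code point, which is analogous to iterating `Unicode.Scalar` values in Swift; the helper name `scalars` is hypothetical and the snippet says nothing about how the Swift lattice is implemented.

```python
import unicodedata


def scalars(text):
    """List each code point in `text` as (U+XXXX, Unicode character name)."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


trademark_emoji = "\u2122\ufe0f"  # TM sign followed by VS-16
keycap_one = "1\ufe0f\u20e3"      # digit, VS-16, combining enclosing keycap

for label, text in [("trademark", trademark_emoji), ("keycap", keycap_one)]:
    print(label, len(text), scalars(text))
```

Each of these renders as a single user-perceived character, yet contains two or three scalars. An iteration unit that treats the whole cluster as one element never presents U+2122 on its own for vocabulary lookup.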
…e#355 / huggingface#356 merged)

The three bug-fix PRs landed in main, simplifying the table from 35 entries down to 8. The struct loses its `fixedBy` field: all surviving entries are the two new bug clusters this corpus surfaces and that have no PR yet, so the table is now just `{modelId, inputId, note}`. The cleanup-hint message prints `note` instead of `fixedBy`.

What remains is two clusters under huggingface#352 worth filing as follow-up issues:

- 5 entries on TinyLlama: SentencePiece-BPE leading-whitespace runs collapsing to single `▁` tokens instead of producing a multi-space vocab entry (e.g. `▁▁▁▁` id 268).
- 3 entries on Qwen2.5-0.5B: byte-level BPE picks a different merge ordering than HF Python on Thai byte sequences.

The same corpus runs in the companion Obj-C port (https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance) and hits 7/7 byte-identity on these inputs, so both clusters look like upstream-only bugs and the Obj-C source is a usable reference for the follow-up fixes.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Daisuke Majima <rockyshikoku@gmail.com>
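The entry counts quoted across the three commit messages can be cross-checked with a small tally. The per-PR numbers below are taken from this thread (11 for #354, 13 for #355, 3 for #356 after the re-attribution, 8 still pending); the dictionary itself is only an illustrative structure, not something that exists in the repository.

```python
# Entries per fixedBy value after the #356 re-attribution; 0 means no fix PR yet.
by_fix = {354: 11, 355: 13, 356: 3, 0: 8}

total = sum(by_fix.values())
resolved = sum(count for pr, count in by_fix.items() if pr != 0)
remaining = total - resolved

print(total, resolved, remaining)  # 35 27 8
```

The 27 resolved entries match the 27 cleanup hints reported for the combined branch, and the remaining 8 are the entries the simplified table keeps.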
2c0b5ca to ce84708
Pushed
For the Hub dataset / bucket migration — yes please, your offer to handle that yourself is appreciated. The corpus, baselines, and Thanks @pcuenca for the careful review on all four PRs and for landing them so quickly; it was a pleasure to collaborate. |
Update: I'll see if we can land bucket support (huggingface/swift-huggingface#55) first, and then we can move the test data to a bucket. If progress on buckets is slow, I'll use a dataset instead. Sorry for the delay, I'll keep you posted!
No rush at all @pcuenca. The conformance target sits happily in-tree as a local checkpoint until buckets land; the pinned Looking forward to seeing swift-huggingface#55 land. P.S. |
Addresses #42 (improve tokenizer testing) and provides a place to land regression coverage for the bug catalogue in #352. Picks up where @john-rocky's closed #357 left off — see attribution at the bottom.
What
A new test target that compares Swift `tokenizer.encode(text:)` output against HuggingFace Python `transformers==4.57.1` references across six tokenizer kernels and an 83-input corpus covering script boundaries the existing English-leaning fixtures don't reach: CJK simplified + traditional, Japanese with voiced kana, Hangul, Arabic, Hebrew, Devanagari conjuncts, Thai combining marks, BMP + astral emoji, ZWJ sequences, emoji keycaps, multi-script stress lines, programming code, IPA, and the German Kompositum torture set.

How

Three pieces:

1. `Tools/generate_tokenizer_baselines.py` regenerates per-kernel baselines from the `transformers` version pinned in `Tools/requirements.txt`. Asserts `is_fast` so a slow-only model can't silently produce a non-comparable reference. Each baseline carries a `metadata` block with the `transformers` version, generation timestamp, and entry count.
2. `Tests/TokenizersTests/Resources/MultilingualConformance/`:
   - `inputs.json`: 83 entries, each `{id, category, text}`. Stable ids keep baselines aligned across regenerations; categories let test output cite the broken axis directly (`japanese-voiced-kana`, `emoji-keycap`, `thai-combining-marks`, `devanagari`, …).
   - `baselines/<slug>_multilingual.json`: one per kernel, holds `input_ids`, `convert_ids_to_tokens` strings, and the two decoded forms (`decoded_with_special` / `decoded_skip_special`). The decoded fields are intentionally produced now and left unused by the current Swift tests; they are forward-compatible material for a decoder-side parity test follow-up.
3. `Tests/TokenizersTests/MultilingualConformanceTests.swift`: three tests under `@Suite("Multilingual Conformance")`:
   - Corpus is well-formed (unique ids, non-empty fields).
   - Baselines cover the corpus exactly (parameterised per kernel).
   - Byte-identical token ids vs HF Python (parameterised per kernel).

Kernel matrix

- `BAAI/bge-small-en-v1.5`
- `google-t5/t5-small`
- `openai-community/gpt2`
- `FacebookAI/roberta-base`
- `Qwen/Qwen2.5-0.5B`
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0`

TinyLlama is picked over `huggyllama/llama-7b` so the Llama family is covered without an auth-gating CI risk.

expectedDivergences

Some (model, input) pairs diverge from the Python reference today because a fix is in review. The cleanup-hint pattern lets the target land green while bug fixes are in flight:

- Unexpected divergence (a pair that diverges but isn't in the table) hard-fails. This is the regression catch.
- Unexpected match (a listed pair that now matches Python) prints a cleanup hint inviting removal of the entry, but doesn't fail.

Entries are grouped by `fixedBy` PR number:

- `fixedBy: 354`: Mn-filter widening in `BertNormalizer.stripAccents`. Covers Japanese voiced kana, Devanagari halant, Arabic diacritics, emoji VS-16, and similar.
- `fixedBy: 356`: Unigram + TokenLattice scalar iteration. Covers T5 emoji-keycap, the ™️ (U+2122 + VS-16) cluster, and ZWJ-between-text in escape sequences.
- `fixedBy: 355`: BPE scalar iteration on TinyLlama. Covers Thai combining marks, Devanagari conjuncts, emoji ZWJ, emoji keycap, and several multi-script stress lines.
- `fixedBy: 0`: pending investigation under #352 (Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback). Two distinct new bug clusters that this corpus surfaces but none of the in-flight PRs address:
  - Leading-whitespace runs collapse to single `▁` tokens on SentencePiece BPE instead of producing the multi-space vocab entry (e.g. `▁▁▁▁` id 268 on TinyLlama). 5 entries on TinyLlama. Worth filing as a follow-up issue.
  - Byte-level BPE picks a different merge ordering than HF Python on Thai byte sequences. 3 entries on Qwen2.5-0.5B.

These two clusters should land as separate follow-up issues. Adding entries to `expectedDivergences` rather than gating the PR on them keeps the regression-catch infrastructure shippable today.

Both directions of the mechanism were verified locally: removing an entry from the table causes a hard failure with the windowed diff, and adding a phantom entry for a passing input emits the cleanup hint.
Verified against the combined fix branch
A combined branch (#354 ⊕ #355 ⊕ #356 ⊕ this PR) was run end-to-end to confirm Pedro's requested changes don't introduce regressions and that the cleanup hints fire where they should:
```
swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"
✔ Test run with 52 tests in 4 suites passed after 9.13s
0 failures
27 cleanup hints fired (11 BGE / 3 T5 / 13 TinyLlama)
```
All 27 cleanup hints correspond to the `fixedBy: 354|355|356` entries: every entry the three fix PRs are supposed to clean up does in fact get cleaned up. The 8 `fixedBy: 0` entries persisted (correctly, since neither of the two new clusters is addressed by the in-flight PRs).

Verification

Local on macOS, Swift 6.3.2, against current `main`:

```
swift test --filter MultilingualConformance
✔ Test "Corpus inputs.json is well-formed (unique ids, non-empty fields)" passed
✔ Test "Baselines cover the corpus exactly" with 6 test cases passed
✔ Test "Byte-identical token ids vs HF Python" with 6 test cases passed
✔ Test run with 3 tests in 1 suite passed after 2.4s
```
Full `swift test --filter "TokenizerTests|BertTokenizerTests|MultilingualConformance"`: 48 tests in 4 suites pass on this branch alone (52 on the combined branch).
Regenerating baselines from a fresh `pip install -r Tools/requirements.txt` reproduces the bundled JSON against `transformers==4.57.1`.
Attribution
@john-rocky opened #357 a day before this PR with significant overlap. They closed it with explicit "no need to preserve attribution" — but the design ideas this PR borrows from them are worth recording:
`huggyllama/llama-7b` for auth-gate-free Llama coverage.

The corpus, regen-script split, and category-tagged input shape are ported from the companion Obj-C port at https://github.com/apocryphx/ObjCTokenizer/tree/main/Conformance, which was the conceptual harness that diagnosed the bug catalogue in #352.
🤖 Generated with Claude Code