Add multilingual conformance tests for byte-identical Python parity #357
Closed
john-rocky wants to merge 1 commit into
Conversation
Adds a parity test target that compares Swift tokenization output against
HuggingFace Python `transformers` across 5 tokenizer kernels and a 30-line
multilingual corpus stressing CJK, voiced-kana, Hangul, Arabic/Hebrew RTL,
Devanagari, Thai, BMP+astral emoji, ZWJ grapheme clusters, mixed-script,
and combining-mark edge cases.
- Tools/generate_tokenizer_baselines.py regenerates the per-kernel JSON
baselines from `transformers.AutoTokenizer`. Asserts `is_fast` so
slow-only tokenizers don't silently produce non-comparable refs.
- Tests/TokenizersTests/Resources/MultilingualConformance/{inputs.json,
baselines/*.json} carry the corpus and the Python-produced refs.
- Tests/TokenizersTests/MultilingualConformanceTests.swift runs three
tests: byte-identical token id parity (parameterised over kernels),
a corpus-well-formed sanity check, and a baseline-covers-corpus
sanity check. Known divergences are listed in `expectedDivergences`
with upstream issue/PR references so the target lands green while
upstream fixes are in flight; unexpected divergences hard-fail
(regression catch); unexpected matches print a cleanup hint.
Addresses huggingface#42. Provides regression coverage for the bugs catalogued in
huggingface#352.
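The baseline-generation flow described above can be sketched roughly as follows. This is a hedged, hypothetical reconstruction, not the actual `Tools/generate_tokenizer_baselines.py`: the real script's CLI, schema, and helper names may differ, and model downloads need network access, so this shows only the per-input record construction for an already-loaded tokenizer.

```python
# Hypothetical sketch of what Tools/generate_tokenizer_baselines.py does per input.
# Field names follow the baselines/*.json schema described in this PR.
import json


def baseline_record(tokenizer, text):
    """Build one baseline entry for a single corpus input."""
    ids = tokenizer.encode(text)
    return {
        "input_ids": ids,
        "tokens": tokenizer.convert_ids_to_tokens(ids),
        "decoded_with_special": tokenizer.decode(ids, skip_special_tokens=False),
        "decoded_skip_special": tokenizer.decode(ids, skip_special_tokens=True),
    }


def write_baseline(tokenizer, inputs, path):
    """Write one per-kernel baseline file; `inputs` maps stable input id -> text."""
    # Slow-only tokenizers (no tokenizer.json) must not silently produce
    # non-comparable references.
    assert getattr(tokenizer, "is_fast", False), "slow-only tokenizer: refs not comparable"
    records = {iid: baseline_record(tokenizer, text) for iid, text in inputs.items()}
    with open(path, "w", encoding="utf-8") as f:
        # sort_keys + ensure_ascii=False keep regenerated output byte-for-byte stable.
        json.dump(records, f, ensure_ascii=False, sort_keys=True, indent=2)
```

Keying the records by stable input id (rather than list position) is what lets baselines stay aligned when corpus entries are added or reordered.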
pcuenca (Member) reviewed on May 15, 2026
Thanks @john-rocky. Seeing that @apocryphx is the initiator of this effort and he already offered to share his testing protocol, I'd rather wait before we review this PR, which might become unnecessary.
john-rocky (Contributor, Author)

Got it — thanks @pcuenca. I'll close this so it doesn't sit in the review queue while @apocryphx's testing protocol comes in. If any pieces here end up being useful when his work lands (the Python regenerator script, the input categories, …).
What
Adds a multilingual byte-identical conformance test target that compares Swift tokenization output against HuggingFace Python `transformers` (treated as the authoritative reference) across five tokenizer kernels and a 30-line corpus that exercises script boundaries the existing English-leaning fixtures don't reach: CJK simplified + traditional, Japanese (incl. voiced-kana), Hangul, Arabic / Hebrew (RTL + diacritics), Devanagari, Thai, BMP + astral-plane emoji, ZWJ grapheme clusters, mixed-script, and combining-mark edge cases.

Addresses #42 (Improve tokenizer testing) and provides a place to land regression coverage for the bugs catalogued in #352.
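The corpus-well-formedness check mentioned in this PR could, in spirit, look like the Python sketch below. The field names (`id`, `category`, `text`) are assumptions inferred from the description of `inputs.json`, not the actual schema, and the NFC note is an illustrative extra check rather than something the PR states it does.

```python
# Hypothetical sketch of a corpus-well-formed sanity check over inputs.json
# entries, assuming each entry carries "id", "category", and "text" fields.
import unicodedata


def validate_corpus(entries):
    """Raise if corpus entries are malformed; return the set of stable ids."""
    seen = set()
    for e in entries:
        # Every entry needs a non-empty id, category tag, and text.
        assert e.get("id") and e.get("category") and e.get("text"), f"incomplete entry: {e}"
        # Ids must be unique, since baselines are keyed by them.
        assert e["id"] not in seen, f"duplicate input id: {e['id']}"
        seen.add(e["id"])
        # Parity is byte-identical, so fixtures that aren't NFC-normalized are
        # worth surfacing early (they may well be intentional edge cases here).
        if unicodedata.normalize("NFC", e["text"]) != e["text"]:
            print(f"note: {e['id']} is not NFC-normalized (may be intentional)")
    return seen
```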
How
Three pieces:
- `Tools/generate_tokenizer_baselines.py` — re-generates the per-kernel JSON baselines from `transformers.AutoTokenizer` for every input in the corpus. The `transformers` version is pinned in `Tools/requirements.txt` so the references are reproducible. Asserts `tokenizer.is_fast` so a model without a `tokenizer.json` (slow-only) doesn't silently produce a non-comparable reference. Adding a new kernel or input is a one-line append followed by a rerun.
- `Tests/TokenizersTests/Resources/MultilingualConformance/`:
  - `inputs.json` — the 30-line corpus, each entry tagged with a category and a stable id so baselines stay aligned across regenerations.
  - `baselines/*.json` — one file per kernel, keyed by input id. Holds `input_ids`, the `convert_ids_to_tokens` strings, and the two decoded forms (`decoded_with_special` / `decoded_skip_special`). The decoded fields are intentionally produced now and left unused by the current Swift tests; they're forward-compatible material for a decoder-side parity follow-up.
- `Tests/TokenizersTests/MultilingualConformanceTests.swift` — three tests: byte-identical token-id parity (parameterised over kernels, comparing `tokenizer.encode(text:)` to the baseline `input_ids` for every input; on divergence it prints a windowed token diff around the first mismatch), a corpus-well-formed sanity check, and a baseline-covers-corpus sanity check.

Initial kernel matrix

- `BAAI/bge-small-en-v1.5`
- `google-t5/t5-small`
- `openai-community/gpt2`
- `Qwen/Qwen2.5-0.5B`
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0`

All loaded via their fast `tokenizer.json` path.

expectedDivergences

Some inputs are known to diverge from the Python reference today because of bugs being tracked in #352 / #353 / #354 / #355 / #356. The test file lists those (modelId, inputId) pairs in an `expectedDivergences` table with the upstream reference for each, so the test target lands green while the work is in flight. The mechanism is bidirectional: an unexpected divergence hard-fails (regression catch), and an unexpected match prints a cleanup hint so stale entries get removed.
Both directions were verified locally (removing an entry from the list causes a hard failure with the windowed diff; adding a phantom entry for a passing input emits the cleanup hint).
This way the same target works for adding inputs without needing to fix kernels first, adding kernels without needing to triage every cell up front, and catching regressions on inputs that already match Python.
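The gating described above can be sketched as follows. This is a hypothetical Python rendering of the logic, not the PR's Swift test code; the table entry, function names, and diff format are all illustrative.

```python
# Hypothetical sketch of the bidirectional expectedDivergences gating.
# The (modelId, inputId) entry below is illustrative, not a real table row.
EXPECTED_DIVERGENCES = {
    ("openai-community/gpt2", "thai-01"): "huggingface#354",  # hypothetical entry
}


def windowed_diff(expected, actual, window=3):
    """Return a short token diff around the first mismatching position."""
    n = min(len(expected), len(actual))
    first = next((i for i in range(n) if expected[i] != actual[i]), n)
    lo, hi = max(0, first - window), first + window + 1
    return f"first mismatch at {first}: expected {expected[lo:hi]} got {actual[lo:hi]}"


def check_parity(model_id, input_id, expected_ids, actual_ids):
    diverged = expected_ids != actual_ids
    known = (model_id, input_id) in EXPECTED_DIVERGENCES
    if diverged and not known:
        # Regression catch: an unexpected divergence hard-fails with a windowed diff.
        raise AssertionError(f"{model_id}/{input_id}: {windowed_diff(expected_ids, actual_ids)}")
    if not diverged and known:
        # Cleanup hint: the upstream fix landed, so the table entry is stale.
        return f"cleanup: drop ({model_id}, {input_id}), see {EXPECTED_DIVERGENCES[(model_id, input_id)]}"
    if diverged:
        return f"known divergence ({EXPECTED_DIVERGENCES[(model_id, input_id)]})"
    return "ok"
```

Keeping the upstream issue reference next to each entry is what makes the cleanup hint actionable: when a cell starts matching, the hint names exactly which fix landed.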
Verification
Local run on macOS 26 / Swift 6.2.1 / M-series, against current `main`. Also verified:

- `swift test` (full suite, sequential): 158 tests / 21 suites pass in ~14s.
- `swift test --parallel` (full suite, parallel): same, passes.
- Regenerating after `pip install -r Tools/requirements.txt` reproduces the bundled JSON byte-for-byte against `transformers==4.57.1`.

Notes on framing / future work
- The generator's `--output-dir` already separates baseline production from consumption, and a follow-up can add a `--push-to-hub` mode without changing the Swift side.
- A decoder-side parity follow-up can consume the `decoded_with_special` / `decoded_skip_special` fields the script already emits. Held back from this PR because at least one existing decoder path (`WordPieceDecoder.decode(tokens:)`'s `tokens.first!`) trips a fatal unwrap on shapes the encoder happily emits, which deserves its own surface.
- The `ObjCTokenizer` port (specifically the `make golden` pipeline) was the inspiration for the Python-reference / Swift-runner split this PR uses.
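For the decoder-side follow-up mentioned above, a parity check over the already-emitted decoded fields might take roughly this shape. This is a hypothetical Python sketch (the real follow-up would be Swift); `check_decode_parity` and the `decoder` callable are invented names, and the empty-token guard mirrors the `tokens.first!` crash shape called out in the notes.

```python
# Hypothetical decoder-side parity check over the decoded_skip_special field
# the baselines already carry. Explicitly exercises the empty-token-list shape
# that trips a force-unwrap in the current WordPieceDecoder.decode(tokens:).
def check_decode_parity(decoder, baseline):
    """Return True if `decoder` reproduces the Python reference decoding."""
    tokens = baseline["tokens"]
    if not tokens:
        # The encoder can emit this shape; a decoder must handle it, not crash.
        return decoder(tokens) == ""
    return decoder(tokens) == baseline["decoded_skip_special"]
```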