Add multilingual conformance tests for byte-identical Python parity #357

Closed

john-rocky wants to merge 1 commit into huggingface:main from john-rocky:test/multilingual-tokenizer-conformance


Conversation

@john-rocky
Contributor

What

Adds a multilingual byte-identical conformance test target that compares Swift tokenization output against HuggingFace Python transformers (treated as the authoritative reference) across five tokenizer kernels and a 30-line corpus that exercises script boundaries the existing English-leaning fixtures don't reach: CJK simplified + traditional, Japanese (incl. voiced-kana), Hangul, Arabic / Hebrew (RTL + diacritics), Devanagari, Thai, BMP + astral-plane emoji, ZWJ grapheme clusters, mixed-script, and combining-mark edge cases.

Addresses #42 (Improve tokenizer testing) and provides a place to land regression coverage for the bugs catalogued in #352.

How

Three pieces:

  1. Tools/generate_tokenizer_baselines.py — re-generates the per-kernel JSON baselines from transformers.AutoTokenizer for every input in the corpus. The transformers version is pinned in Tools/requirements.txt so the references are reproducible. Asserts tokenizer.is_fast so a model without a tokenizer.json (slow-only) doesn't silently produce a non-comparable reference. Adding a new kernel or input is a one-line append followed by a rerun.

  2. Tests/TokenizersTests/Resources/MultilingualConformance/:

    • inputs.json — the 30-line corpus, each entry tagged with a category and a stable id so baselines stay aligned across regenerations.
    • baselines/*.json — one file per kernel, keyed by input id. Holds input_ids, the convert_ids_to_tokens strings, and the two decoded forms (decoded_with_special / decoded_skip_special). The decoded fields are intentionally produced now and left unused by the current Swift tests; they're forward-compatible material for a decoder-side parity follow-up. A sketch of both resource shapes follows this list.
  3. Tests/TokenizersTests/MultilingualConformanceTests.swift — three tests:

    • Byte-identical token ids (parameterised over kernels) — compares Swift tokenizer.encode(text:) to the baseline input_ids for every input. On divergence prints a windowed token diff around the first mismatch.
    • Corpus is well-formed — schema + duplicate-id sanity check.
    • Baselines cover the corpus — guards against drift between the corpus and any baseline file.
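
For orientation, a minimal sketch of what the two resource shapes could look like as Swift Codable types. The snake_case JSON keys are the ones named above; the type names, the text and tokens key names, and the example entry are illustrative assumptions, not the PR's actual declarations.

import Foundation

// Hypothetical mirror of one inputs.json entry, e.g.
// { "id": "cjk-01", "category": "cjk-simplified", "text": "..." }
struct ConformanceInput: Codable {
    let id: String        // stable id that keys the baseline entries
    let category: String  // corpus axis, e.g. a script family or edge case
    let text: String      // the raw input line (key name assumed)
}

// Hypothetical mirror of one entry in baselines/<kernel>.json,
// which is keyed by input id at the top level.
struct BaselineEntry: Codable {
    let inputIds: [Int]             // Python input_ids
    let tokens: [String]            // convert_ids_to_tokens strings (key name assumed)
    let decodedWithSpecial: String  // decoded with special tokens kept
    let decodedSkipSpecial: String  // decoded with special tokens skipped

    enum CodingKeys: String, CodingKey {
        case inputIds = "input_ids"
        case tokens
        case decodedWithSpecial = "decoded_with_special"
        case decodedSkipSpecial = "decoded_skip_special"
    }
}

typealias BaselineFile = [String: BaselineEntry]  // one file per kernel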

Initial kernel matrix

Kernel | Model | Why
--- | --- | ---
WordPiece (BERT family) | BAAI/bge-small-en-v1.5 | Used by most Apple-Silicon embedding pipelines today
Unigram / SentencePiece | google-t5/t5-small | Canonical Unigram, ships tokenizer.json
Byte-level BPE (legacy) | openai-community/gpt2 | Family known to round-trip cleanly — passing baseline
Byte-level BPE (modern) | Qwen/Qwen2.5-0.5B | Modern vocab / merge table sharing the GPT-2 kernel
SentencePiece BPE with byte-fallback | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | Llama tokenizer without an auth gate

expectedDivergences

Some inputs are known to diverge from the Python reference today because of bugs being tracked in #352 / #353 / #354 / #355 / #356. The test file lists those (modelId, inputId) pairs in an expectedDivergences table with the upstream reference for each, so the test target lands green while the work is in flight.

The mechanism is bidirectional:

  • An unexpected divergence — a (model, input) pair that diverges but isn't listed — is a hard failure. This is the regression catch.
  • An unexpected match — a pair listed in the table that now matches Python — emits a printed hint inviting removal of the entry, but doesn't fail the test, so a freshly merged improvement doesn't break CI.

Both directions were verified locally (removing an entry from the list causes a hard failure with the windowed diff; adding a phantom entry for a passing input emits the cleanup hint).

This way the same target works for adding inputs without needing to fix kernels first, adding kernels without needing to triage every cell up front, and catching regressions on inputs that already match Python.
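
To make the mechanism concrete, here is a hedged sketch of how the table-driven check could be shaped, reusing the Codable types sketched earlier. EncodingTokenizer, checkParity, and the diff-window width are illustrative; tokenizer.encode(text:) is the one call taken from the PR text, and none of this is the actual test body.

import Testing

// Stand-in for the package's tokenizer type; only encode(text:) matters here.
protocol EncodingTokenizer {
    func encode(text: String) -> [Int]
}

struct ExpectedDivergence: Hashable {
    let modelId: String
    let inputId: String
    let reference: String  // upstream issue/PR, e.g. "#354"
}

func checkParity(
    modelId: String,
    tokenizer: some EncodingTokenizer,
    corpus: [ConformanceInput],
    baseline: BaselineFile,
    expectedDivergences: Set<ExpectedDivergence>
) {
    for input in corpus {
        guard let want = baseline[input.id]?.inputIds else {
            Issue.record("no baseline entry for \(input.id) under \(modelId)")
            continue
        }
        let got = tokenizer.encode(text: input.text)
        let listed = expectedDivergences.contains {
            $0.modelId == modelId && $0.inputId == input.id
        }
        if got != want && !listed {
            // Unexpected divergence: hard failure with a windowed diff
            // around the first mismatching token.
            var i = 0
            while i < min(got.count, want.count), got[i] == want[i] { i += 1 }
            let lo = max(0, i - 3)
            Issue.record("""
                \(modelId) / \(input.id): first mismatch at token \(i)
                swift:  \(Array(got[lo..<min(got.count, i + 4)]))
                python: \(Array(want[lo..<min(want.count, i + 4)]))
                """)
        } else if got == want && listed {
            // Unexpected match: invite cleanup, but don't fail CI.
            print("\(modelId) / \(input.id) now matches Python; consider removing it from expectedDivergences")
        }
    }
}

In practice a @Test(arguments:) parameterised over the five model ids from the kernel matrix would drive checkParity once per kernel.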

Verification

Local run on macOS 26 / Swift 6.2.1 / M-series, against current main:

$ swift test --filter MultilingualConformanceTests
✔ Test "Corpus is well-formed" passed after 0.004 seconds.
✔ Test "Baselines cover the corpus" with 5 test cases passed after 0.007 seconds.
✔ Test "Byte-identical token ids" with 5 test cases passed after 2.412 seconds.
✔ Test run with 3 tests in 1 suite passed after 2.412 seconds.

Also verified:

  • swift test (full suite, sequential): 158 tests / 21 suites pass in ~14s.
  • swift test --parallel (full suite, parallel): same, passes.
  • Regenerating baselines from a fresh pip install -r Tools/requirements.txt reproduces the bundled JSON byte-for-byte against transformers==4.57.1 (the exact commands are sketched below).
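
For reference, the regeneration loop amounts to the following; --output-dir comes from the script (see Notes below), and the path value shown here is an assumption:

$ pip install -r Tools/requirements.txt
$ python Tools/generate_tokenizer_baselines.py --output-dir Tests/TokenizersTests/Resources/MultilingualConformance
$ swift test --filter MultilingualConformanceTests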

Notes on framing / future work

  • The Python script writes to local JSON. Migrating the baselines to a Hub dataset (per the Improve tokenizer testing #42 description) is straightforward from here: the script's existing --output-dir already separates baseline production from consumption, and a follow-up can add a --push-to-hub mode without changing the Swift side.
  • The 30-line corpus is intentionally small and category-tagged rather than exhaustive. Intended to grow from here as more axes get classified.
  • A natural follow-up is a decoder-side parity test that reuses the decoded_with_special / decoded_skip_special fields the script already emits. Held back from this PR because at least one existing decoder path (WordPieceDecoder.decode(tokens:)'s tokens.first!) trips a fatal unwrap on shapes the encoder happily emits; that deserves a fix of its own (a defensive sketch follows this list).
  • Conceptual framing borrowed from @apocryphx's diagnostic work in #352 (Multilingual byte-divergence from HuggingFace Python: 4 distinct bugs across WordPiece / Unigram / BPE-byte-fallback), which catalogued the bug classes that motivated the corpus categories used here. The reproduction harness in their ObjCTokenizer port (specifically the make golden pipeline) was the inspiration for the Python-reference / Swift-runner split this PR uses.
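
Purely illustrative, and not the actual WordPieceDecoder source: assuming the decoder's [String] -> [String] shape and the usual "##" continuation-prefix convention, the unwrap above could be guarded like this:

// Hypothetical defensive variant: tolerate an empty token list instead
// of force-unwrapping tokens.first.
func decodeTolerantly(tokens: [String]) -> [String] {
    guard let first = tokens.first else { return [] }  // was: tokens.first!
    return [first] + tokens.dropFirst().map { token in
        token.hasPrefix("##") ? String(token.dropFirst(2)) : " " + token
    }
}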

Commit message:

Adds a parity test target that compares Swift tokenization output against
HuggingFace Python `transformers` across 5 tokenizer kernels and a 30-line
multilingual corpus stressing CJK, voiced-kana, Hangul, Arabic/Hebrew RTL,
Devanagari, Thai, BMP+astral emoji, ZWJ grapheme clusters, mixed-script,
and combining-mark edge cases.

  - Tools/generate_tokenizer_baselines.py regenerates the per-kernel JSON
    baselines from `transformers.AutoTokenizer`. Asserts `is_fast` so
    slow-only tokenizers don't silently produce non-comparable refs.
  - Tests/TokenizersTests/Resources/MultilingualConformance/{inputs.json,
    baselines/*.json} carry the corpus and the Python-produced refs.
  - Tests/TokenizersTests/MultilingualConformanceTests.swift runs three
    tests: byte-identical token id parity (parameterised over kernels),
    a corpus-well-formed sanity check, and a baseline-covers-corpus
    sanity check. Known divergences are listed in `expectedDivergences`
    with upstream issue/PR references so the target lands green while
    upstream fixes are in flight; unexpected divergences hard-fail
    (regression catch); unexpected matches print a cleanup hint.

Addresses huggingface#42. Provides regression coverage for the bugs catalogued in
huggingface#352.
@pcuenca
Member

Thanks @john-rocky. Seeing that @apocryphx is the initiator of this effort and he already offered to share his testing protocol, I'd rather wait before we review this PR, which might become unnecessary.

@john-rocky
Contributor Author

Got it — thanks @pcuenca. I'll close this so it doesn't sit in the review queue while @apocryphx's testing protocol comes in. If any pieces here end up being useful when his work lands (the Python regenerator script, the input categories, the expectedDivergences cleanup-hint pattern), happy for them to be picked up however is most convenient — no need to preserve attribution. Looking forward to seeing the cases.

@john-rocky closed this May 15, 2026