
Fix WordPieceDecoder crash on empty input and ByteFallbackDecoder trailing byte drop#358

Open
john-rocky wants to merge 1 commit into huggingface:main from john-rocky:fix/decoder-empty-input-and-trailing-bytes

Conversation

@john-rocky
Contributor

What

Fixes two independent bugs in Sources/Tokenizers/Decoder.swift:

  1. WordPieceDecoder.decode(tokens:) crashes on empty input. The first line force-unwrapped tokens.first!, so the decoder asserted with "Fatal error: Unexpectedly found nil while unwrapping an Optional value" whenever the calling chain handed it an empty token list. An empty list can legitimately arrive there — for example Tokenizer.decode(tokens:, skipSpecialTokens: true) on an id sequence whose non-special-token ids all fail vocab lookup (the compactMap upstream drops every entry), or decode(tokens: [<just special tokens>], skipSpecialTokens: true) after the special-token filter empties the list.

  2. ByteFallbackDecoder.decode(tokens:) silently drops trailing <0xHH> byte runs. The decoder accumulated <0xHH> byte tokens and flushed them to a UTF-8 string only when it encountered a non-byte token, but the post-loop flush was missing. When the input ended on a multi-byte UTF-8 run (e.g. a trailing emoji, CJK character, or a Llama-style <0x0A> newline at end-of-text), those bytes never reached newTokens and the output was truncated.
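The empty-input path in (1) is easy to reproduce in isolation. A minimal sketch, assuming a hypothetical tiny vocab and id sequence (the real lookup lives upstream in the tokenizer's decode path):

```swift
// Illustrative only: how an empty token list can reach the decoder.
// `vocab` and `ids` here are hypothetical, not from the library.
let vocab: [Int: String] = [1: "<s>", 2: "</s>"]
let ids = [99, 100] // no entry in vocab, no special tokens survive filtering
let tokens = ids.compactMap { vocab[$0] } // every lookup fails -> []
// Pre-fix, handing `tokens` to WordPieceDecoder.decode hit `tokens.first!`.
```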

Fixes

WordPieceDecoder

Early-returns [] when called with an empty list, matching the no-op behavior of the reference Rust implementation.
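A minimal sketch of the guard, using a simplified stand-in for the real method (the actual WordPieceDecoder reads its continuation prefix and cleanup options from config; `"##"` is assumed here):

```swift
// Simplified stand-in for WordPieceDecoder.decode(tokens:), not the
// actual Decoder.swift source. The "##" prefix is an assumption.
func wordPieceDecode(tokens: [String], prefix: String = "##") -> [String] {
    // The fix: early-return instead of force-unwrapping tokens.first!
    guard let first = tokens.first else { return [] }
    var decoded = [first]
    for token in tokens.dropFirst() {
        if token.hasPrefix(prefix) {
            // Continuation piece: strip the prefix, no leading space.
            decoded.append(String(token.dropFirst(prefix.count)))
        } else {
            // New word: re-insert the word boundary.
            decoded.append(" " + token)
        }
    }
    return decoded
}
```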

ByteFallbackDecoder

Factored the byte-flush into a small flushPendingBytes() closure and called it both at the start of every non-byte token (preserved behavior) and once after the loop (new). The intra-loop behavior is otherwise unchanged.
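The flush-in-two-places pattern can be sketched as follows. This is a self-contained model of the fix, not the actual Decoder.swift source; the `<0xHH>` matching below is a plausible reconstruction:

```swift
// Simplified sketch of the trailing-flush fix. `flushPendingBytes`
// matches the closure name in the PR; internal details may differ.
func byteFallbackDecode(tokens: [String]) -> [String] {
    var newTokens: [String] = []
    var pendingBytes: [UInt8] = []

    func flushPendingBytes() {
        guard !pendingBytes.isEmpty else { return }
        newTokens.append(String(decoding: pendingBytes, as: UTF8.self))
        pendingBytes.removeAll()
    }

    for token in tokens {
        // Match "<0xHH>" byte-fallback tokens (e.g. "<0xC2>").
        if token.count == 6, token.hasPrefix("<0x"), token.hasSuffix(">"),
           let byte = UInt8(token.dropFirst(3).dropLast(), radix: 16) {
            pendingBytes.append(byte)
        } else {
            flushPendingBytes() // pre-existing in-loop flush
            newTokens.append(token)
        }
    }
    flushPendingBytes() // the fix: flush bytes left over after the loop
    return newTokens
}
```

Without the final `flushPendingBytes()`, any input ending on a byte run leaves `pendingBytes` non-empty and the decoded text truncated.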

Test coverage

Two new regression tests in Tests/TokenizersTests/DecoderTests.swift:

  • WordPiece decoder no-ops on empty input — verifies WordPieceDecoder.decode(tokens: []) returns [] for both cleanup: true and cleanup: false. This is the direct regression for the crash.
  • ByteFallback decoder flushes trailing byte tokens — exercises six cases including ["<0xC2>", "<0xA9>"] → ["©"] (trailing 2-byte UTF-8), ["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"] → ["hi", "🚀"] (trailing 4-byte UTF-8 emoji), the pre-existing in-loop flush (["<0xC2>", "<0xA9>", "xyz"] → ["©", "xyz"]), empty input, and the all-non-byte case.

One updated test:

  • Jinja block whitespace control (Tests/TokenizersTests/ChatTemplateTests.swift) — the expected string was previously "Describe the Swift programming language.\nassistant" (no trailing newline). The rendered Jinja template actually ends with a newline (the trailing {% endif %} block sits on its own line), which Phi's tokenizer encodes as a <0x0A> byte-fallback token. The pre-fix ByteFallbackDecoder dropped that trailing token on the floor, so the bug-masked decoder output happened to equal the assertion. With the trailing-flush fix the test now compares against the un-truncated "…assistant\n" and the two adjacent assertions (!hasPrefix("\n"), !contains("\n\n")) still hold. A short comment in the test body explains why.

Verification

Local on macOS 26 / Swift 6.2.1 / M-series, from main:

$ swift test
✔ Test run with 157 tests in 20 suites passed after ~13s.

$ swift test --parallel
✔ Test run with 157 tests in 20 suites passed after ~13s.

$ xcrun swift-format lint Sources/Tokenizers/Decoder.swift Tests/TokenizersTests/{DecoderTests,ChatTemplateTests}.swift
(clean)

Without the fixes:

  • The new WordPiece decoder no-ops on empty input test crashes the process with the fatal-error message quoted above (verified by reverting the guard let first line locally).
  • The new ByteFallback decoder flushes trailing byte tokens cases that end on byte runs fail (["<0xC2>", "<0xA9>"] returns [] instead of ["©"]).

Commit message

Two independent bugs in Sources/Tokenizers/Decoder.swift:

- WordPieceDecoder.decode(tokens:) force-unwrapped `tokens.first!` and
  crashed when handed an empty list. An empty list can legitimately reach
  the decoder via `Tokenizer.decode(tokens:, skipSpecialTokens: true)`
  when the upstream compactMap or special-token filter drops every entry.
  Early-return [] in that case to match the reference Rust behavior.

- ByteFallbackDecoder.decode(tokens:) accumulated runs of `<0xHH>`
  byte-fallback tokens and only flushed them to a UTF-8 string when it
  saw a non-byte token. Inputs that ended on a byte run (trailing emoji,
  CJK, or `<0x0A>` newline) had those bytes silently dropped. Added a
  matching post-loop flush so `["<0xC2>", "<0xA9>"]` decodes to `["©"]`
  and `["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"]` decodes to
  `["hi", "🚀"]`.

Tests: two new regression tests covering both bugs. One existing test
("Jinja block whitespace control") was updated — the prior expected
string omitted the trailing newline that the rendered template
legitimately ends with, because the pre-fix ByteFallbackDecoder dropped
the `<0x0A>` byte token. A comment explains the dependency.
