
Fix WordPieceDecoder crash on empty input and ByteFallbackDecoder trailing byte drop#358

Open
john-rocky wants to merge 1 commit into huggingface:main from john-rocky:fix/decoder-empty-input-and-trailing-bytes

Conversation

@john-rocky
Contributor

What

Fixes two independent bugs in Sources/Tokenizers/Decoder.swift:

  1. WordPieceDecoder.decode(tokens:) crashes on empty input. The first line force-unwrapped tokens.first!, so the decoder asserted with "Fatal error: Unexpectedly found nil while unwrapping an Optional value" whenever the calling chain handed it an empty token list. An empty list can legitimately arrive there — for example Tokenizer.decode(tokens:, skipSpecialTokens: true) on an id sequence whose non-special-token ids all fail vocab lookup (the compactMap upstream drops every entry), or decode(tokens: [<just special tokens>], skipSpecialTokens: true) after the special-token filter empties the list.

  2. ByteFallbackDecoder.decode(tokens:) silently drops trailing <0xHH> byte runs. The decoder accumulated <0xHH> byte tokens and flushed them to a UTF-8 string only when it encountered a non-byte token, but the post-loop flush was missing. When the input ended on a multi-byte UTF-8 run (e.g. a trailing emoji, CJK character, or a Llama-style <0x0A> newline at end-of-text), those bytes never reached newTokens and the output was truncated.
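The empty-input path in (1) is easy to reproduce in isolation. A minimal sketch, assuming a hypothetical tiny vocab and id sequence (the real lookup lives upstream in the tokenizer's decode path):

```swift
// Illustrative only: how an empty token list can reach the decoder.
// `vocab` and `ids` here are hypothetical, not from the library.
let vocab: [Int: String] = [1: "<s>", 2: "</s>"]
let ids = [99, 100] // no entry in vocab, no special tokens survive filtering
let tokens = ids.compactMap { vocab[$0] } // every lookup fails -> []
// Pre-fix, handing `tokens` to WordPieceDecoder.decode hit `tokens.first!`.
```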

Fixes

WordPieceDecoder

Early-returns [] when called with an empty list, matching the no-op behavior of the reference Rust implementation.
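A minimal sketch of the guard, using a simplified stand-in for the real method (the actual WordPieceDecoder reads its continuation prefix and cleanup options from config; `"##"` is assumed here):

```swift
// Simplified stand-in for WordPieceDecoder.decode(tokens:), not the
// actual Decoder.swift source. The "##" prefix is an assumption.
func wordPieceDecode(tokens: [String], prefix: String = "##") -> [String] {
    // The fix: early-return instead of force-unwrapping tokens.first!
    guard let first = tokens.first else { return [] }
    var decoded = [first]
    for token in tokens.dropFirst() {
        if token.hasPrefix(prefix) {
            // Continuation piece: strip the prefix, no leading space.
            decoded.append(String(token.dropFirst(prefix.count)))
        } else {
            // New word: re-insert the word boundary.
            decoded.append(" " + token)
        }
    }
    return decoded
}
```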

ByteFallbackDecoder

Factored the byte-flush into a small flushPendingBytes() closure and called it both at the start of every non-byte token (preserved behavior) and once after the loop (new). The intra-loop behavior is otherwise unchanged.
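The flush-in-two-places pattern can be sketched as follows. This is a self-contained model of the fix, not the actual Decoder.swift source; the `<0xHH>` matching below is a plausible reconstruction:

```swift
// Simplified sketch of the trailing-flush fix. `flushPendingBytes`
// matches the closure name in the PR; internal details may differ.
func byteFallbackDecode(tokens: [String]) -> [String] {
    var newTokens: [String] = []
    var pendingBytes: [UInt8] = []

    func flushPendingBytes() {
        guard !pendingBytes.isEmpty else { return }
        newTokens.append(String(decoding: pendingBytes, as: UTF8.self))
        pendingBytes.removeAll()
    }

    for token in tokens {
        // Match "<0xHH>" byte-fallback tokens (e.g. "<0xC2>").
        if token.count == 6, token.hasPrefix("<0x"), token.hasSuffix(">"),
           let byte = UInt8(token.dropFirst(3).dropLast(), radix: 16) {
            pendingBytes.append(byte)
        } else {
            flushPendingBytes() // pre-existing in-loop flush
            newTokens.append(token)
        }
    }
    flushPendingBytes() // the fix: flush bytes left over after the loop
    return newTokens
}
```

Without the final `flushPendingBytes()`, any input ending on a byte run leaves `pendingBytes` non-empty and the decoded text truncated.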

Test coverage

Two new regression tests in Tests/TokenizersTests/DecoderTests.swift:

  • WordPiece decoder no-ops on empty input — verifies WordPieceDecoder.decode(tokens: []) returns [] for both cleanup: true and cleanup: false. This is the direct regression for the crash.
  • ByteFallback decoder flushes trailing byte tokens — exercises six cases including ["<0xC2>", "<0xA9>"] → ["©"] (trailing 2-byte UTF-8), ["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"] → ["hi", "🚀"] (trailing 4-byte UTF-8 emoji), the pre-existing in-loop flush (["<0xC2>", "<0xA9>", "xyz"] → ["©", "xyz"]), empty input, and the all-non-byte case.

One updated test:

  • Jinja block whitespace control (Tests/TokenizersTests/ChatTemplateTests.swift) — the expected string was previously "Describe the Swift programming language.\nassistant" (no trailing newline). The rendered Jinja template actually ends with a newline (the trailing {% endif %} block sits on its own line), which Phi's tokenizer encodes as a <0x0A> byte-fallback token. The pre-fix ByteFallbackDecoder dropped that trailing token on the floor, so the bug-masked decoder output happened to equal the assertion. With the trailing-flush fix the test now compares against the un-truncated "…assistant\n" and the two adjacent assertions (!hasPrefix("\n"), !contains("\n\n")) still hold. A short comment in the test body explains why.

Verification

Local on macOS 26 / Swift 6.2.1 / M-series, from main:

$ swift test
✔ Test run with 157 tests in 20 suites passed after ~13s.

$ swift test --parallel
✔ Test run with 157 tests in 20 suites passed after ~13s.

$ xcrun swift-format lint Sources/Tokenizers/Decoder.swift Tests/TokenizersTests/{DecoderTests,ChatTemplateTests}.swift
(clean)

Without the fixes:

  • The new WordPiece decoder no-ops on empty input test crashes the process with the fatal-error message quoted above (verified by reverting the guard let first line locally).
  • The new ByteFallback decoder flushes trailing byte tokens cases that end on byte runs fail (["<0xC2>", "<0xA9>"] returns [] instead of ["©"]).

Commit message

Two independent bugs in Sources/Tokenizers/Decoder.swift:

- WordPieceDecoder.decode(tokens:) force-unwrapped `tokens.first!` and
  crashed when handed an empty list. An empty list can legitimately reach
  the decoder via `Tokenizer.decode(tokens:, skipSpecialTokens: true)`
  when the upstream compactMap or special-token filter drops every entry.
  Early-return [] in that case to match the reference Rust behavior.

- ByteFallbackDecoder.decode(tokens:) accumulated runs of `<0xHH>`
  byte-fallback tokens and only flushed them to a UTF-8 string when it
  saw a non-byte token. Inputs that ended on a byte run (trailing emoji,
  CJK, or `<0x0A>` newline) had those bytes silently dropped. Added a
  matching post-loop flush so `["<0xC2>", "<0xA9>"]` decodes to `["©"]`
  and `["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"]` decodes to
  `["hi", "🚀"]`.

Tests: two new regression tests covering both bugs. One existing test
("Jinja block whitespace control") was updated — the prior expected
string omitted the trailing newline that the rendered template
legitimately ends with, because the pre-fix ByteFallbackDecoder dropped
the `<0x0A>` byte token. A comment explains the dependency.
