Fix WordPieceDecoder crash on empty input and ByteFallbackDecoder trailing byte drop#358
Open
john-rocky wants to merge 1 commit into
Open
Conversation
Two independent bugs in Sources/Tokenizers/Decoder.swift:
- WordPieceDecoder.decode(tokens:) force-unwrapped `tokens.first!` and
crashed when handed an empty list. An empty list can legitimately reach
the decoder via `Tokenizer.decode(tokens:, skipSpecialTokens: true)`
when the upstream compactMap or special-token filter drops every entry.
Early-return [] in that case to match the reference Rust behavior.
- ByteFallbackDecoder.decode(tokens:) accumulated runs of `<0xHH>`
byte-fallback tokens and only flushed them to a UTF-8 string when it
saw a non-byte token. Inputs that ended on a byte run (trailing emoji,
CJK, or `<0x0A>` newline) had those bytes silently dropped. Added a
matching post-loop flush so `["<0xC2>", "<0xA9>"]` decodes to `["©"]`
and `["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"]` decodes to
`["hi", "🚀"]`.
Tests: two new regression tests covering both bugs. One existing test
("Jinja block whitespace control") was updated — the prior expected
string omitted the trailing newline that the rendered template
legitimately ends with, because the pre-fix ByteFallbackDecoder dropped
the `<0x0A>` byte token. A comment explains the dependency.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes two independent bugs in
Sources/Tokenizers/Decoder.swift:WordPieceDecoder.decode(tokens:)crashes on empty input. The first line force-unwrappedtokens.first!, so the decoder asserted with "Fatal error: Unexpectedly found nil while unwrapping an Optional value" whenever the calling chain handed it an empty token list. An empty list can legitimately arrive there — for exampleTokenizer.decode(tokens:, skipSpecialTokens: true)on an id sequence whose non-special-token ids all fail vocab lookup (thecompactMapupstream drops every entry), ordecode(tokens: [<just special tokens>], skipSpecialTokens: true)after the special-token filter empties the list.ByteFallbackDecoder.decode(tokens:)silently drops trailing<0xHH>byte runs. The decoder accumulated<0xHH>byte tokens and flushed them to a UTF-8 string only when it encountered a non-byte token, but the post-loop flush was missing. When the input ended on a multi-byte UTF-8 run (e.g. a trailing emoji, CJK character, or a Llama-style<0x0A>newline at end-of-text), those bytes never reachednewTokensand the output was truncated.Fixes
WordPieceDecoderEarly-returns
[]when called with an empty list, matching the no-op behavior of the reference Rust implementation.ByteFallbackDecoderFactored the byte-flush into a small
flushPendingBytes()closure and called it both at the start of every non-byte token (preserved behavior) and once after the loop (new). The intra-loop behavior is otherwise unchanged.Test coverage
Two new regression tests in
Tests/TokenizersTests/DecoderTests.swift:WordPiece decoder no-ops on empty input— verifiesWordPieceDecoder.decode(tokens: [])returns[]for bothcleanup: trueandcleanup: false. This is the direct regression for the crash.ByteFallback decoder flushes trailing byte tokens— exercises six cases including["<0xC2>", "<0xA9>"] → ["©"](trailing 2-byte UTF-8),["hi", "<0xF0>", "<0x9F>", "<0x9A>", "<0x80>"] → ["hi", "🚀"](trailing 4-byte UTF-8 emoji), the pre-existing in-loop flush (["<0xC2>", "<0xA9>", "xyz"] → ["©", "xyz"]), empty input, and the all-non-byte case.One updated test:
Jinja block whitespace control(Tests/TokenizersTests/ChatTemplateTests.swift) — the expected string was previously"Describe the Swift programming language.\nassistant"(no trailing newline). The rendered Jinja template actually ends with a newline (the trailing{% endif %}block sits on its own line), which Phi's tokenizer encodes as a<0x0A>byte-fallback token. The pre-fixByteFallbackDecoderdropped that trailing token on the floor, so the bug-masked decoder output happened to equal the assertion. With the trailing-flush fix the test now compares against the un-truncated"…assistant\n"and the two adjacent assertions (!hasPrefix("\n"),!contains("\n\n")) still hold. A short comment in the test body explains why.Verification
Local on macOS 26 / Swift 6.2.1 / M-series, from
main:Without the fixes:
WordPiece decoder no-ops on empty inputtest crashes the process with the fatal-error message quoted above (verified by reverting theguard let firstline locally).ByteFallback decoder flushes trailing byte tokenscases that end on byte runs fail (["<0xC2>", "<0xA9>"]returns[]instead of["©"]).