
DeflateDecompressor/ZlibDecompressor error on trailing bytes after a complete stream (concatenated-stream policy) #107

Description

@mkitti

Originates from the Discourse thread Julia CodecZlib fails while Python succeeds.

Summary

When a raw-DEFLATE or zlib stream is followed by trailing bytes that are not themselves a valid follow-on stream, both transcode and the decompressor streams throw a ZlibError (e.g. invalid literal/length code (code: -3)). The error is raised only after the complete and correct output has already been decoded.

This happens because, on Z_STREAM_END, the codec resets the inflate state and attempts to decode the remaining input as a concatenated stream. That behavior is required for gzip (RFC 1952 — a gzip file is "a series of members") and CodecZlib handles concatenated gzip correctly. But for raw DEFLATE (RFC 1951) and zlib (RFC 1950), concatenation is not part of the format, so trailing bytes that aren't a valid next stream turn into a hard error — whereas the underlying zlib C API and other bindings simply stop at Z_STREAM_END and leave the extra bytes unconsumed.
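The difference between the two policies can be modeled in a few lines of Python on top of the standard zlib module. This is only an illustrative model of the behavior described above, not CodecZlib's actual code; the helper names decode_single, decode_reset_and_continue and raw_deflate are made up for this sketch.

```python
import zlib

def raw_deflate(payload):
    """Compress to a raw DEFLATE stream (negative wbits means no header or trailer)."""
    c = zlib.compressobj(wbits=-15)
    return c.compress(payload) + c.flush()

def decode_single(data):
    """Stop-at-end policy: decode one stream, return (output, unconsumed bytes)."""
    d = zlib.decompressobj(wbits=-15)
    out = d.decompress(data)
    if not d.eof:
        raise ValueError("stream ended before the final block")
    return out, d.unused_data

def decode_reset_and_continue(data):
    """Model of reset-and-continue: every leftover byte must begin another stream."""
    chunks = []
    while data:
        out, data = decode_single(data)   # raises if the leftover is not a stream
        chunks.append(out)
    return b"".join(chunks)

payload = b"The quick brown fox jumps over the lazy dog."
raw = raw_deflate(payload)
output, leftover = decode_single(raw + b"\xff\xff\xff")   # succeeds, leftover kept
```

Under this model, decode_single returns the payload plus the three leftover bytes, while decode_reset_and_continue on the same input raises because the leftover bytes do not form a stream.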

Minimal reproducer (no external data)

using CodecZlib

payload = b"The quick brown fox jumps over the lazy dog."
raw = transcode(DeflateCompressor, payload)        # raw DEFLATE (windowBits = -15)

transcode(DeflateDecompressor, vcat(raw, UInt8[0x00]))    # one extra 0x00 byte
# ERROR: ZlibError: the compressed stream may be truncated

transcode(DeflateDecompressor, vcat(raw, UInt8[0xf3,0x1b,0x33]))
# ERROR: ZlibError: invalid literal/length code (code: -3)

transcode(DeflateDecompressor, vcat(raw, UInt8[0xff,0xff,0xff]))
# ERROR: ZlibError: invalid block type (code: -3)

A single trailing byte is enough. The error message varies with the trailing bytes (truncated, invalid literal/length code, invalid block type), because the codec is trying to interpret them as the start of a new DEFLATE stream.
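To see why the message depends on the trailing bytes, it helps to read the first byte the way a DEFLATE decoder would at the start of a fresh stream: bit 0 is BFINAL and bits 1-2 are BTYPE (RFC 1951, section 3.2.3). The sketch below is plain bit arithmetic; block_header and BTYPE_NAMES are names invented for this illustration.

```python
def block_header(first_byte):
    """Return (BFINAL, BTYPE) from the first byte of a DEFLATE block."""
    bfinal = first_byte & 0b1           # bit 0: set on the last block
    btype = (first_byte >> 1) & 0b11    # bits 1-2: block type
    return bfinal, btype

BTYPE_NAMES = {
    0: "stored",
    1: "fixed Huffman",
    2: "dynamic Huffman",
    3: "reserved (invalid)",
}

# First trailing byte of each reproducer case above.
for b in (0x00, 0xF3, 0xFF):
    bfinal, btype = block_header(b)
    print(f"0x{b:02x}: BFINAL={bfinal} BTYPE={btype} ({BTYPE_NAMES[btype]})")
```

This is consistent with the three messages: 0x00 opens a stored block whose length fields are missing (truncated), 0xf3 opens a fixed-Huffman block whose remaining bits do not decode, and 0xff selects the reserved block type.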

GzipDecompressor, by contrast, correctly decodes genuinely concatenated members:

g = vcat(transcode(GzipCompressor, b"Hello, "), transcode(GzipCompressor, b"world!"))
String(transcode(GzipDecompressor, g))   # "Hello, world!"  ✔ (RFC 1952 multi-member)

How we hit this in the wild

We were handed a file described as "compressed via zlib." It turned out to be a zlib stream with its 2-byte header stripped and its 4-byte Adler-32 trailer truncated to 3 bytes, i.e. [deflate body][first 3 of 4 Adler-32 bytes]. Decoding it as raw DEFLATE recovers the payload everywhere — but CodecZlib then tried to decode the 3 orphaned Adler-32 bytes as a concatenated stream and threw invalid literal/length code. The data emitted before the throw was complete and correct (reading byte-by-byte from a DeflateDecompressorStream yields the full output and only throws at EOF).
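The shape of that file can be reproduced synthetically with Python's zlib module. This is a reconstruction under the layout described above, using the reproducer payload rather than the original file, which is not included here.

```python
import zlib

payload = b"The quick brown fox jumps over the lazy dog."
z = zlib.compress(payload)   # 2-byte header + DEFLATE body + 4-byte Adler-32
damaged = z[2:-1]            # header stripped, trailer cut to 3 bytes

d = zlib.decompressobj(wbits=-15)   # decode the body as raw DEFLATE
recovered = d.decompress(damaged)
leftover = d.unused_data            # whatever follows the final block

# RFC 1950 stores the Adler-32 checksum most-significant byte first.
adler = zlib.adler32(payload).to_bytes(4, "big")
print(recovered == payload, leftover == adler[:3])
```

The leftover bytes are exactly the first three bytes of the checksum, which is the input a reset-and-continue decoder then tries to parse as a new stream.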

Comparison to the standards

  • RFC 1951 (DEFLATE): defines a single sequence of blocks ending at the BFINAL block. No concatenation, no trailer, no framing for "what follows."
  • RFC 1950 (zlib): defines a single stream = 2-byte header + DEFLATE + 4-byte Adler-32. No concatenation in the spec.
  • RFC 1952 (gzip): §2.2 states explicitly that a file "consists of a series of members." A decompressor must therefore accept concatenated members, and CodecZlib's gzip handling is correct.

So the reset-and-continue policy is well-founded for gzip but has no standards basis for raw DEFLATE / zlib.
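The framing difference is visible in the first two bytes. gzip members start with the magic bytes 1f 8b (RFC 1952), and a zlib stream starts with a CMF/FLG pair whose 16-bit value is a multiple of 31 (RFC 1950), whereas raw DEFLATE has no marker at all. A rough sketch follows; sniff_container is a hypothetical helper, and the zlib check is only a heuristic because arbitrary bytes can pass it by chance.

```python
import gzip
import zlib

def sniff_container(data):
    """Guess the container from the first two bytes: 'gzip', 'zlib' or 'unknown'."""
    if data[:2] == b"\x1f\x8b":                 # gzip ID1, ID2
        return "gzip"
    if len(data) >= 2:
        cmf, flg = data[0], data[1]
        is_deflate = (cmf & 0x0F) == 8          # CM = 8 means DEFLATE
        window_ok = (cmf >> 4) <= 7             # CINFO above 7 is not allowed
        check_ok = (cmf * 256 + flg) % 31 == 0  # FCHECK makes this a multiple of 31
        if is_deflate and window_ok and check_ok:
            return "zlib"
    return "unknown"                            # raw DEFLATE cannot be recognized

print(sniff_container(gzip.compress(b"x")))
print(sniff_container(zlib.compress(b"x")))
print(sniff_container(b"\xf3\x1b\x33"))
```

A decoder that resets after a raw DEFLATE stream therefore has nothing to check the next bytes against before it starts decoding them.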

Comparison to other implementations

zlib C API: inflate() returns Z_STREAM_END when the final block is consumed and does not loop on its own; decoding another concatenated stream requires an explicit inflateReset(). Trailing bytes are simply left unconsumed (avail_in > 0), not treated as an error.

Python (zlib): stops at the stream end and exposes the remainder rather than erroring:

>>> import zlib
>>> payload = b"The quick brown fox jumps over the lazy dog."
>>> raw = zlib.compress(payload)[2:-4]          # raw deflate body
>>> d = zlib.decompressobj(-15)
>>> d.decompress(raw + b"\xf3\x1b\x33") == payload
True
>>> d.unused_data
b'\xf3\x1b3'                                     # trailing bytes surfaced, no error

A direct Zlib_jll inflate call (single call, windowBits = -15) on the same data returns Z_STREAM_END with the full output and leaves the trailing bytes unconsumed. In other words, the bug is not in zlib; it is in the reset-and-continue logic that drives it.
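The same division of labor holds even for gzip: one inflate pass stops at the end of the first member, and continuing is the caller's decision. The Python sketch below illustrates this with the standard zlib module, where creating a fresh decompressor object stands in for inflateReset(); decode_members is a name made up for this example.

```python
import gzip
import zlib

g = gzip.compress(b"Hello, ") + gzip.compress(b"world!")

# wbits=31 selects gzip framing; a single pass stops at the first member.
d = zlib.decompressobj(wbits=31)
first = d.decompress(g)
rest = d.unused_data            # the second member, left untouched

def decode_members(data):
    """Caller-driven loop over concatenated gzip members."""
    out = []
    while data:
        member = zlib.decompressobj(wbits=31)   # stands in for inflateReset()
        out.append(member.decompress(data))
        if not member.eof:
            raise ValueError("truncated gzip member")
        data = member.unused_data
    return b"".join(out)

print(first, decode_members(g))
```

Multi-member support is thus a policy layered on top of zlib, which is why it can be applied per format.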

Questions / possible resolutions

  1. Should DeflateDecompressor and ZlibDecompressor attempt concatenated-stream decoding at all, or should that be limited to GzipDecompressor (where the spec mandates it)?
  2. If concatenated raw-DEFLATE/zlib decoding is intentionally supported, could trailing bytes that don't begin a valid stream be treated as end-of-data (stop at Z_STREAM_END) rather than a hard error — at least optionally?
  3. Relatedly, is there a supported way to stop exactly at the first Z_STREAM_END and recover the unconsumed bytes (cf. zlib's avail_in / Python's unused_data)? This is the same need raised in Find end of Zlib stream #4.
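As a concrete reference point for options 2 and 3, here is a minimal Python sketch of a lenient policy: the first stream must decode, further streams are decoded while they are complete, and anything else is handed back as trailing bytes. The function name decode_lenient and its return shape are hypothetical and do not correspond to any existing CodecZlib or zlib API.

```python
import zlib

def decode_lenient(data, wbits=-15):
    """Return (output, trailing bytes); only the first stream is mandatory."""
    chunks = []
    first = True
    while data:
        d = zlib.decompressobj(wbits=wbits)
        try:
            out = d.decompress(data)
        except zlib.error:
            if first:
                raise                  # a broken first stream is still an error
            break                      # leftover is not a stream: stop here
        if not d.eof:
            if first:
                raise ValueError("first stream is truncated")
            break                      # incomplete follow-on stream: stop here
        chunks.append(out)
        data = d.unused_data
        first = False
    return b"".join(chunks), data

c = zlib.compressobj(wbits=-15)
raw = c.compress(b"The quick brown fox jumps over the lazy dog.") + c.flush()
output, trailing = decode_lenient(raw + raw + b"\xff\xff\xff")
```

One caveat of this design is that trailing bytes which happen to form a complete stream are still decoded, so a strict stop-at-first-stream mode (option 3) remains the only unambiguous choice for formats without framing.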

Happy to open a PR if there's a preferred direction.

Versions

  • CodecZlib.jl 0.7.8
  • TranscodingStreams.jl 0.11.3
  • Zlib_jll 1.3.1+2 (zlib runtime 1.3.1)
  • Julia 1.12.6

Investigation and reproducers prepared with Claude Code (Claude Opus 4.8).
