fix(decoder): auto-detect gzip magic bytes in GzipParser for APIs without Content-Encoding header#967
Conversation
…hout Content-Encoding header GzipParser now checks for gzip magic bytes (0x1f 0x8b) before attempting decompression. If data is not gzip-compressed, it passes through to the inner parser unchanged. This fixes APIs like Apple App Store Connect that return gzip bodies without Content-Encoding headers. Also updates create_gzip_decoder() to use gzip_parser (with auto-detection) as the fallback parser instead of gzip_parser.inner_parser. Co-Authored-By: bot_apk <apk@cognition.ai>
🤖 Devin AI EngineerI'll be helping with this pull request! Here's what you should know: ✅ I will automatically:
Note: I can only respond to comments from users who have write access to this repository. ⚙️ Control Options:
|
👋 Greetings, Airbyte Team Member!Here are some helpful tips and reminders for your convenience. 💡 Show Tips and TricksTesting This CDK VersionYou can test this version of the CDK using the following: # Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774625405-fix-gzip-decoder-auto-detect#egg=airbyte-python-cdk[dev]' --help
# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774625405-fix-gzip-decoder-auto-detectPR Slash CommandsAirbyte Maintainers can execute the following slash commands on your PR:
|
PyTest Results (Fast)4 020 tests +86 4 009 ✅ +86 7m 11s ⏱️ +12s Results for commit 115ed14. ± Comparison against base commit acafc75. This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
PyTest Results (Full)4 023 tests +86 4 011 ✅ +86 10m 43s ⏱️ -6s Results for commit 115ed14. ± Comparison against base commit acafc75. This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.♻️ This comment has been updated with latest results. |
Code Review: GzipParser auto-detect gzip magic bytesOverall AssessmentThe core fix is correct for the stated problem: APIs like Apple App Store Connect However, there are a few issues worth flagging — one significant, one minor, one cosmetic. 1. 🔴 Memory regression in streaming mode (significant)The new remaining = data.read()
full_data = io.BytesIO(header + remaining)In production mode ( This matters because Suggested fix: Use a lightweight wrapper to prepend the peeked bytes back onto the stream without buffering: class _PrefixedStream(io.RawIOBase):
"""Prepends already-read bytes back onto a stream without buffering everything."""
def __init__(self, prefix: bytes, stream: BufferedIOBase) -> None:
self._prefix = io.BytesIO(prefix)
self._stream = stream
self._prefix_done = False
def readable(self) -> bool:
return True
def read(self, n: int = -1) -> bytes:
if not self._prefix_done:
chunk = self._prefix.read(n)
if chunk:
if n != -1 and len(chunk) >= n:
return chunk
self._prefix_done = True
remaining = self._stream.read(n - len(chunk) if n != -1 else -1)
return chunk + (remaining or b"")
self._prefix_done = True
return self._stream.read(n)Then in def parse(self, data: BufferedIOBase) -> PARSER_OUTPUT_TYPE:
header = data.read(2)
if not header:
return
stream = _PrefixedStream(header, data)
if header == GZIP_MAGIC_BYTES:
with gzip.GzipFile(fileobj=stream, mode="rb") as gzipobj:
yield from self.inner_parser.parse(gzipobj)
else:
yield from self.inner_parser.parse(stream)This preserves streaming behavior for both gzip and non-gzip paths. 2. 🟡 Double-decompression edge case in builder mode (low risk)In builder mode (
But if decompressed content happens to start with bytes 3. 🟢 Cosmetic: constant placementGZIP_MAGIC_BYTES = b"\x1f\x8b"
import orjson
import requestsModule-level constant placed between import blocks — should be after all imports per standard Python style. Test Coverage ✅The 7 new tests cover: gzip CSV/JSONL without header, non-gzip passthrough for CSV/JSONL, empty data, fallback in VerdictLogic is correct and addresses the root cause. Main concern is the memory regression in the streaming path — this affects all connectors using |
…se buffering - Replace data.read() + BytesIO buffering with a lightweight _PrefixedStream wrapper that prepends the 2-byte magic-byte header back onto the original stream without reading the entire response into memory. - Move GZIP_MAGIC_BYTES constant after imports (now _GZIP_MAGIC_BYTES, private). - Add inline docstring note about the double-decompression edge case in builder mode where requests auto-decompresses Content-Encoding: gzip. Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>
Local Mock Test ResultsTested locally with a mock HTTP server returning gzip-compressed CSV data without Test 1: Bug Reproduction on
|
|
/autofix
|
Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>
There was a problem hiding this comment.
Pull request overview
Updates the declarative GzipDecoder/GzipParser behavior to correctly handle APIs that return gzip-compressed bodies without Content-Encoding: gzip, while preserving streaming behavior (no full buffering).
Changes:
- Add gzip magic-bytes auto-detection to
GzipParser.parse()with a_PrefixedStreamwrapper to reattach peeked bytes. - Update
create_gzip_decoder()to usegzip_parser(notinner_parser) in builder mode and as the header-mismatch fallback. - Add unit tests covering gzip-without-headers, passthrough behavior, empty input, fallback selection, non-streamed mode, and stream closing.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py |
Implements magic-bytes detection, adds _PrefixedStream, and changes GzipParser to passthrough non-gzip streams. |
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py |
Ensures GzipDecoder uses the gzip-aware parser in builder mode and as the fallback when headers don’t match. |
unit_tests/sources/declarative/decoders/test_composite_decoder.py |
Adds tests validating gzip auto-detection and passthrough across streaming/non-streaming scenarios. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Daryna Ishchenko (darynaishchenko)
left a comment
There was a problem hiding this comment.
Alfredo Garcia (@agarctfi) I was working on similar issue some time ago https://github.com/airbytehq/oncall/issues/11173#issuecomment-3967448166. Was able to fix it in manifest itself by adding inner parser as gzip decoder meaning that it will be used when headers don't contain info about gzip.
We have connectors like amazon-ads that already use this logic to fix the issue with headers, looks like this change is breaking for such connectors.
- Can the oc issue be fixed by adding inner parser as gzipdecoder in their builder project without a cdk fix?
- Can we implement a cdk fix that will be backward compatible and we don't need to update connectors in follow-up?
Summary
Some APIs (notably Apple App Store Connect
/v1/salesReports) return gzip-compressed response bodies without setting theContent-Encoding: gzipheader. The existingGzipParserunconditionally assumed gzip input, andcreate_gzip_decoder()used the inner parser (e.g.CsvParser) as the fallback when headers didn't match — so gzip data without the header was never decompressed, producing'utf-8' codec can't decode byte 0x8berrors.Changes:
GzipParser.parse()— reads the first 2 bytes and checks for gzip magic bytes (\x1f\x8b). If present, decompresses via a_PrefixedStreamwrapper that prepends the peeked bytes back onto the original stream (preserving streaming behavior). If not gzip, passes data through to the inner parser unchanged._PrefixedStream— a lightweightio.RawIOBasesubclass that chains a small prefix (the 2-byte header) with the underlying stream, avoiding buffering the entire response into memory.create_gzip_decoder()— usesgzip_parser(with auto-detection) instead ofgzip_parser.inner_parseras both the default parser in builder mode and the fallback in production mode.Resolves https://github.com/airbytehq/oncall/issues/11809:
Related: #914, #909, #895, #892
Review & Testing Checklist for Human
Content-Encoding: gzipIS present andstream_response=False, therequestslibrary already decompressesresponse.content.GzipParserthen receives decompressed bytes — the magic-byte check should correctly identify this as non-gzip and pass through. However, if decompressed content happens to start with\x1f\x8bbytes, it would be incorrectly re-decompressed. This is documented in the code docstring and is extremely unlikely for structured formats (CSV, JSON, JSONL), but worth assessing whether additional safeguards are needed._PrefixedStreamcorrectness with all inner parsers: The new stream wrapper is exercised by the existing 40 decoder tests (all passing), but verify thatread()andreadinto()behave correctly for all real-world read patterns — particularlyTextIOWrapper(used byCsvParser) andgzip.GzipFile, which may issue reads of varying chunk sizes./v1/salesReports(or mock a server returning gzip bytes withoutContent-Encoding) and confirm the response is correctly decompressed and parsed in both Builder test-read and sync modes.Notes
GzipDecoder.BytesIOfor magic-byte detection; this was replaced with a streaming_PrefixedStreamwrapper to avoid memory regression for large responses.by_headersmode, and non-streamed mode. All 40 decoder tests pass locally.Link to Devin session: https://app.devin.ai/sessions/1e67cd663c11402b82881eeb06a30745
Requested by: Alfredo Garcia (@agarctfi)