
fix(decoder): auto-detect gzip magic bytes in GzipParser for APIs without Content-Encoding header #967

Open
devin-ai-integration[bot] wants to merge 4 commits into main from devin/1774625405-fix-gzip-decoder-auto-detect

Conversation

@devin-ai-integration
Contributor

@devin-ai-integration devin-ai-integration bot commented Mar 27, 2026

Summary

Some APIs (notably Apple App Store Connect /v1/salesReports) return gzip-compressed response bodies without setting the Content-Encoding: gzip header. The existing GzipParser unconditionally assumed gzip input, and create_gzip_decoder() used the inner parser (e.g. CsvParser) as the fallback when headers didn't match — so gzip data without the header was never decompressed, producing 'utf-8' codec can't decode byte 0x8b errors.
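The failure mode described above can be reproduced with the standard library alone. This is a minimal sketch, not the CDK code path: `gzip.compress` stands in for the API response body, and the sample CSV payload is made up for illustration.

```python
import gzip

# Hypothetical payload standing in for a gzip-compressed API response body.
body = gzip.compress(b"id,name\n1,Alice\n")

# Every gzip member starts with the two magic bytes 0x1f 0x8b.
assert body[:2] == b"\x1f\x8b"

try:
    # Roughly what happens when a text parser receives the raw gzip bytes.
    body.decode("utf-8")
except UnicodeDecodeError as exc:
    # Byte 0 (0x1f) is valid ASCII; byte 1 (0x8b) is not a valid UTF-8 start byte.
    error = exc
    print(exc)
```

The decode fails at position 1 because the first magic byte is plain ASCII and only the second one is rejected, which matches the `0x8b` in the reported error.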

Changes:

  1. GzipParser.parse() — reads the first 2 bytes and checks for gzip magic bytes (\x1f\x8b). If present, decompresses via a _PrefixedStream wrapper that prepends the peeked bytes back onto the original stream (preserving streaming behavior). If not gzip, passes data through to the inner parser unchanged.
  2. _PrefixedStream — a lightweight io.RawIOBase subclass that chains a small prefix (the 2-byte header) with the underlying stream, avoiding buffering the entire response into memory.
  3. create_gzip_decoder() — uses gzip_parser (with auto-detection) instead of gzip_parser.inner_parser as both the default parser in builder mode and the fallback in production mode.

Resolves https://github.com/airbytehq/oncall/issues/11809

Related: #914, #909, #895, #892

Review & Testing Checklist for Human

  • Double decompression edge case in builder mode: When Content-Encoding: gzip IS present and stream_response=False, the requests library already decompresses response.content. GzipParser then receives decompressed bytes — the magic-byte check should correctly identify this as non-gzip and pass through. However, if decompressed content happens to start with \x1f\x8b bytes, it would be incorrectly re-decompressed. This is documented in the code docstring and is extremely unlikely for structured formats (CSV, JSON, JSONL), but worth assessing whether additional safeguards are needed.
  • _PrefixedStream correctness with all inner parsers: The new stream wrapper is exercised by the existing 40 decoder tests (all passing), but verify that read() and readinto() behave correctly for all real-world read patterns — particularly TextIOWrapper (used by CsvParser) and gzip.GzipFile, which may issue reads of varying chunk sizes.
  • Recommended manual test: Build a connector against Apple App Store Connect /v1/salesReports (or mock a server returning gzip bytes without Content-Encoding) and confirm the response is correctly decompressed and parsed in both Builder test-read and sync modes.
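One way to exercise the read patterns called out in the checklist is a small standalone harness. The sketch below is an assumption-laden stand-in, not the PR's `_PrefixedStream`: `PrefixedStream` and `parse_csv_auto` are hypothetical names, the wrapper only implements `readinto()`, and it is placed behind an `io.BufferedReader` so that `gzip.GzipFile` and `io.TextIOWrapper` see an ordinary buffered stream.

```python
import csv
import gzip
import io
from typing import Iterator

GZIP_MAGIC = b"\x1f\x8b"


class PrefixedStream(io.RawIOBase):
    """Illustrative stand-in: serves `prefix` first, then the wrapped stream."""

    def __init__(self, prefix: bytes, stream: io.BufferedIOBase) -> None:
        super().__init__()
        self._prefix = prefix
        self._stream = stream

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        view = memoryview(buffer).cast("B")
        if not len(view):
            return 0
        if self._prefix:
            # Hand back the peeked bytes before touching the wrapped stream.
            n = min(len(view), len(self._prefix))
            view[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        data = self._stream.read(len(view)) or b""
        view[: len(data)] = data
        return len(data)


def parse_csv_auto(data: io.BufferedIOBase) -> Iterator[dict]:
    """Peek two bytes, then parse CSV from either gzip or plain input."""
    header = data.read(2)
    if not header:
        return  # Empty body: no records.
    stream = io.BufferedReader(PrefixedStream(header, data))
    binary = gzip.GzipFile(fileobj=stream, mode="rb") if header == GZIP_MAGIC else stream
    text = io.TextIOWrapper(binary, encoding="utf-8", newline="")
    yield from csv.DictReader(text)


rows = b"id,name\n1,Alice\n2,Bob\n"
gzip_records = list(parse_csv_auto(io.BytesIO(gzip.compress(rows))))
plain_records = list(parse_csv_auto(io.BytesIO(rows)))
empty_records = list(parse_csv_auto(io.BytesIO(b"")))
```

Both the gzip and the plain input should yield the same records, and the empty input should yield none. A real network stream may return fewer than two bytes from the first `read(2)`, which this sketch does not handle.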

Notes

  • This is a CDK-level fix affecting all manifest-only/low-code connectors that use GzipDecoder.
  • Not a breaking change — strictly additive behavior (auto-detection is a superset of the old unconditional gzip path).
  • The initial revision buffered the entire response into a BytesIO for magic-byte detection; this was replaced with a streaming _PrefixedStream wrapper to avoid memory regression for large responses.
  • 7 new unit tests cover: gzip without headers (CSV, JSONL), non-gzip passthrough (CSV, JSONL), empty data, fallback in by_headers mode, and non-streamed mode. All 40 decoder tests pass locally.

Link to Devin session: https://app.devin.ai/sessions/1e67cd663c11402b82881eeb06a30745
Requested by: Alfredo Garcia (@agarctfi)

…hout Content-Encoding header

GzipParser now checks for gzip magic bytes (0x1f 0x8b) before attempting
decompression. If data is not gzip-compressed, it passes through to the
inner parser unchanged. This fixes APIs like Apple App Store Connect that
return gzip bodies without Content-Encoding headers.

Also updates create_gzip_decoder() to use gzip_parser (with auto-detection)
as the fallback parser instead of gzip_parser.inner_parser.

Co-Authored-By: bot_apk <apk@cognition.ai>
@devin-ai-integration
Contributor Author

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774625405-fix-gzip-decoder-auto-detect#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774625405-fix-gzip-decoder-auto-detect

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment
📚 Show Repo Guidance

Helpful Resources


@github-actions

github-actions bot commented Mar 27, 2026

PyTest Results (Fast)

4 020 tests  +86   4 009 ✅ +86   7m 11s ⏱️ +12s
    1 suites ± 0      11 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 115ed14. ± Comparison against base commit acafc75.

This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_cgroup_v1_emits_warning
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.utils.test_memory_monitor ‑ test_no_warning_below_threshold
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_empty_data_returns_no_records
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_csv_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_fallback_in_by_headers_mode
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_jsonl_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_data_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_jsonl_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_streamed_gzip_without_content_encoding
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_prefixed_stream_closes_wrapped_stream
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_expands_non_mapping_safely
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_interpolated_expand_path
…

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Mar 27, 2026

PyTest Results (Full)

4 023 tests  +86   4 011 ✅ +86   10m 43s ⏱️ -6s
    1 suites ± 0      12 💤 ± 0 
    1 files   ± 0       0 ❌ ± 0 

Results for commit 115ed14. ± Comparison against base commit acafc75.

This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.
unit_tests.utils.test_memory_monitor ‑ test_cgroup_v1_emits_warning
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.utils.test_memory_monitor ‑ test_no_warning_below_threshold
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_empty_data_returns_no_records
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_csv_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_fallback_in_by_headers_mode
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_jsonl_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_data_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_jsonl_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_streamed_gzip_without_content_encoding
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_prefixed_stream_closes_wrapped_stream
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_expands_non_mapping_safely
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_interpolated_expand_path
…

♻️ This comment has been updated with latest results.

@devin-ai-integration
Contributor Author

Code Review: GzipParser auto-detect gzip magic bytes

Overall Assessment

The core fix is correct for the stated problem: APIs like Apple App Store Connect /v1/salesReports return gzip-compressed bodies without Content-Encoding: gzip, and the old code had no way to handle this. Magic-byte detection (\x1f\x8b) is the standard approach. Tests are solid (7 new tests covering the key scenarios).

However, there are a few issues worth flagging — one significant, one minor, one cosmetic.


1. 🔴 Memory regression in streaming mode (significant)

The new parse() method calls data.read() which buffers the entire response body into memory:

remaining = data.read()
full_data = io.BytesIO(header + remaining)

In production mode (stream_response=True), data is response.raw (urllib3 HTTPResponse). The old code passed this directly to gzip.GzipFile(fileobj=data), which streamed decompression without buffering the full response. The new code reads everything into a BytesIO first.

This matters because GzipParser is now also used as the fallback parser in by_headers mode (the second change in create_gzip_decoder). So every response that doesn't match the Content-Encoding header will be fully buffered — even if it's hundreds of MB of plain CSV/JSONL.
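The difference between the two paths can be observed with a small experiment. This is a sketch under stated assumptions: `RecordingStream` is a hypothetical helper that stands in for `response.raw`, and random bytes are used only because they barely compress, so the gzip body stays large.

```python
import gzip
import io
import os


class RecordingStream:
    """Wraps a binary stream and records the size of every read() request."""

    def __init__(self, raw: io.BytesIO) -> None:
        self._raw = raw
        self.requests = []

    def read(self, size: int = -1) -> bytes:
        self.requests.append(size)
        return self._raw.read(size)


# Random bytes barely compress, so the gzip body stays close to 1 MiB.
compressed = gzip.compress(os.urandom(1024 * 1024), compresslevel=1)

# Streaming path: GzipFile pulls from the source in bounded chunks.
source = RecordingStream(io.BytesIO(compressed))
with gzip.GzipFile(fileobj=source, mode="rb") as gz:
    while gz.read(64 * 1024):
        pass
largest_streaming_request = max(source.requests)

# Buffering path: a bare read() materializes the whole body at once.
buffered = RecordingStream(io.BytesIO(compressed)).read()
```

On the streaming path the largest single request is bounded by the gzip module's internal buffer size, which varies across Python versions, while the bare `read()` returns the full body in one piece.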

Suggested fix: Use a lightweight wrapper to prepend the peeked bytes back onto the stream without buffering:

import io
from io import BufferedIOBase


class _PrefixedStream(io.RawIOBase):
    """Prepends already-read bytes back onto a stream without buffering everything."""

    def __init__(self, prefix: bytes, stream: BufferedIOBase) -> None:
        self._prefix = io.BytesIO(prefix)
        self._stream = stream
        self._prefix_done = False

    def readable(self) -> bool:
        return True

    def read(self, n: int = -1) -> bytes:
        if n is None or n < 0:
            n = -1  # Normalize "read everything" requests.
        if n == 0:
            return b""  # A zero-length read must not consume the prefix.
        if not self._prefix_done:
            # Serve the peeked bytes first.
            chunk = self._prefix.read(n)
            if chunk:
                if n != -1 and len(chunk) >= n:
                    return chunk
                # Prefix exhausted mid-read: top up from the underlying stream.
                self._prefix_done = True
                remaining = self._stream.read(n - len(chunk) if n != -1 else -1)
                return chunk + (remaining or b"")
            self._prefix_done = True
        return self._stream.read(n)

Then in parse():

def parse(self, data: BufferedIOBase) -> PARSER_OUTPUT_TYPE:
    header = data.read(2)
    if not header:
        return

    stream = _PrefixedStream(header, data)
    if header == GZIP_MAGIC_BYTES:
        with gzip.GzipFile(fileobj=stream, mode="rb") as gzipobj:
            yield from self.inner_parser.parse(gzipobj)
    else:
        yield from self.inner_parser.parse(stream)

This preserves streaming behavior for both gzip and non-gzip paths.


2. 🟡 Double-decompression edge case in builder mode (low risk)

In builder mode (_emit_connector_builder_messages=True, stream_response=False):

  • CompositeRawDecoder.decode() calls response.content, which the requests library auto-decompresses when Content-Encoding: gzip is set
  • GzipParser.parse() receives already-decompressed bytes
  • The magic-byte check correctly identifies this as non-gzip → passes through ✅

But if decompressed content happens to start with bytes \x1f\x8b, it would be incorrectly double-decompressed. Extremely unlikely for CSV/JSONL (first bytes would be column names or {"), but worth adding an inline comment so future readers understand the assumption.
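The assumption behind this edge case can be illustrated directly. In the sketch below, `looks_like_gzip` is a hypothetical helper rather than CDK code, and the counter-example payload is contrived to show what a false positive would look like.

```python
import gzip
import zlib

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(first_bytes: bytes) -> bool:
    """Return True when the payload starts with the gzip magic bytes."""
    return first_bytes[:2] == GZIP_MAGIC


# Typical structured payloads start with printable characters, never 0x1f 0x8b.
csv_detected = looks_like_gzip(b"id,name\n1,Alice\n")  # CSV header row
jsonl_detected = looks_like_gzip(b'{"id": 1}\n')       # JSON / JSONL

# Contrived counter-example: already-decompressed bytes that happen to start
# with the magic bytes would be sent down the gzip path and fail there.
false_positive = GZIP_MAGIC + b"not really gzip data"
misdetected = looks_like_gzip(false_positive)
try:
    gzip.decompress(false_positive)
    decompress_failed = False
except (OSError, EOFError, zlib.error):
    decompress_failed = True
```

A misdetected payload fails in the gzip layer rather than being parsed as garbage, so the error would surface as a decompression failure instead of silently wrong records.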


3. 🟢 Cosmetic: constant placement

GZIP_MAGIC_BYTES = b"\x1f\x8b"

import orjson
import requests

Module-level constant placed between import blocks — should be after all imports per standard Python style.


Test Coverage ✅

The 7 new tests cover: gzip CSV/JSONL without header, non-gzip passthrough for CSV/JSONL, empty data, fallback in by_headers mode, and non-streamed mode. Good coverage.

Verdict

Logic is correct and addresses the root cause. Main concern is the memory regression in the streaming path — this affects all connectors using GzipDecoder in production, not just Apple App Store. Recommend addressing before merge.

…se buffering

- Replace data.read() + BytesIO buffering with a lightweight _PrefixedStream
  wrapper that prepends the 2-byte magic-byte header back onto the original
  stream without reading the entire response into memory.
- Move GZIP_MAGIC_BYTES constant after imports (now _GZIP_MAGIC_BYTES, private).
- Add inline docstring note about the double-decompression edge case in
  builder mode where requests auto-decompresses Content-Encoding: gzip.

Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>
@devin-ai-integration
Contributor Author

Local Mock Test Results

Tested locally with a mock HTTP server returning gzip-compressed CSV data without Content-Encoding header (simulating Apple App Store Connect /v1/salesReports behavior). Requested by Alfredo Garcia (@agarctfi).
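A mock of this kind can be approximated with the standard library. The sketch below is not the server used for the test above: the handler name, the sample rows, and the use of `urllib` instead of the CDK's HTTP stack are all assumptions made for illustration.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CSV_BODY = b"id,name\n1,Alice\n2,Bob\n3,Charlie\n"


class GzipWithoutHeaderHandler(BaseHTTPRequestHandler):
    """Serves gzip bytes but deliberately omits Content-Encoding."""

    def do_GET(self) -> None:
        body = gzip.compress(CSV_BODY)
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format: str, *args: object) -> None:
        pass  # Keep the example quiet.


def fetch_from_mock():
    """Start the mock on a free local port, fetch once, and shut it down."""
    server = HTTPServer(("127.0.0.1", 0), GzipWithoutHeaderHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/v1/salesReports"
        # Bypass any proxy configured in the environment for the local call.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url, timeout=5) as response:
            return response.headers.get("Content-Encoding"), response.read()
    finally:
        server.shutdown()
        server.server_close()


content_encoding, raw = fetch_from_mock()
```

Because `urllib` does not decompress bodies on its own, `raw` holds the gzip bytes exactly as sent, which makes it a convenient input for checking the magic-byte detection.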

Test 1: Bug Reproduction on main branch

Result: FAILED (as expected — bug confirmed)

The create_gzip_decoder() fallback parser is CsvParser, which receives raw gzip bytes and throws:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Test 2: Fix Verification on PR branch

Result: SUCCESS — all 3 CSV records decoded correctly

GzipParser auto-detects gzip magic bytes (0x1f 0x8b), decompresses via streaming _PrefixedStream, and passes decompressed data to CsvParser:

RECORD: {'id': '1', 'name': 'Alice'}
RECORD: {'id': '2', 'name': 'Bob'}
RECORD: {'id': '3', 'name': 'Charlie'}

Test 3: Unit Tests on PR branch

41 passed, 0 failed (including 8 new auto-detection tests)

Screenshots

[Screenshot: bug reproduction on main branch]

[Screenshot: fix verified on PR branch, with unit tests]

Verdict

The fix correctly resolves the issue. APIs returning gzip-compressed data without Content-Encoding headers are now properly handled by the GzipDecoder.


Devin session

@agarctfi
Contributor

Alfredo Garcia (agarctfi) commented Apr 6, 2026

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>
@agarctfi Alfredo Garcia (agarctfi) marked this pull request as ready for review April 6, 2026 21:06
Copilot AI review requested due to automatic review settings April 6, 2026 21:06
Contributor

Copilot AI left a comment


Pull request overview

Updates the declarative GzipDecoder/GzipParser behavior to correctly handle APIs that return gzip-compressed bodies without Content-Encoding: gzip, while preserving streaming behavior (no full buffering).

Changes:

  • Add gzip magic-bytes auto-detection to GzipParser.parse() with a _PrefixedStream wrapper to reattach peeked bytes.
  • Update create_gzip_decoder() to use gzip_parser (not inner_parser) in builder mode and as the header-mismatch fallback.
  • Add unit tests covering gzip-without-headers, passthrough behavior, empty input, fallback selection, non-streamed mode, and stream closing.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py | Implements magic-bytes detection, adds _PrefixedStream, and changes GzipParser to passthrough non-gzip streams. |
| airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py | Ensures GzipDecoder uses the gzip-aware parser in builder mode and as the fallback when headers don’t match. |
| unit_tests/sources/declarative/decoders/test_composite_decoder.py | Adds tests validating gzip auto-detection and passthrough across streaming/non-streaming scenarios. |


Contributor


Alfredo Garcia (@agarctfi) I worked on a similar issue some time ago: https://github.com/airbytehq/oncall/issues/11173#issuecomment-3967448166. I was able to fix it in the manifest itself by setting the inner parser to a gzip decoder, so that it is used when the headers carry no gzip information.
We have connectors like amazon-ads that already use this logic to work around the header issue, and it looks like this change is breaking for such connectors.

  1. Can the oncall issue be fixed by adding the inner parser as a GzipDecoder in their Builder project, without a CDK fix?
  2. Can we implement a CDK fix that is backward compatible, so that we don't need to update connectors in a follow-up?
