fix(decoder): auto-detect gzip magic bytes in GzipParser for APIs without Content-Encoding header by devin-ai-integration[bot] · Pull Request #967 · airbytehq/airbyte-python-cdk

devin-ai-integration · 2026-03-27T15:33:11Z

Summary

Some APIs (notably Apple App Store Connect /v1/salesReports) return gzip-compressed response bodies without setting the Content-Encoding: gzip header. The existing GzipParser unconditionally assumed gzip input, and create_gzip_decoder() used the inner parser (e.g. CsvParser) as the fallback when headers didn't match — so gzip data without the header was never decompressed, producing 'utf-8' codec can't decode byte 0x8b errors.

Changes:

GzipParser.parse() — reads the first 2 bytes and checks for gzip magic bytes (\x1f\x8b). If present, decompresses via a _PrefixedStream wrapper that prepends the peeked bytes back onto the original stream (preserving streaming behavior). If not gzip, passes data through to the inner parser unchanged.
_PrefixedStream — a lightweight io.RawIOBase subclass that chains a small prefix (the 2-byte header) with the underlying stream, avoiding buffering the entire response into memory.
create_gzip_decoder() — uses gzip_parser (with auto-detection) instead of gzip_parser.inner_parser as both the default parser in builder mode and the fallback in production mode.

Resolves https://github.com/airbytehq/oncall/issues/11809

Related: #914, #909, #895, #892

Review & Testing Checklist for Human

Double decompression edge case in builder mode: When Content-Encoding: gzip IS present and stream_response=False, the requests library already decompresses response.content. GzipParser then receives decompressed bytes — the magic-byte check should correctly identify this as non-gzip and pass through. However, if decompressed content happens to start with \x1f\x8b bytes, it would be incorrectly re-decompressed. This is documented in the code docstring and is extremely unlikely for structured formats (CSV, JSON, JSONL), but worth assessing whether additional safeguards are needed.
_PrefixedStream correctness with all inner parsers: The new stream wrapper is exercised by the existing 40 decoder tests (all passing), but verify that read() and readinto() behave correctly for all real-world read patterns — particularly TextIOWrapper (used by CsvParser) and gzip.GzipFile, which may issue reads of varying chunk sizes.
Recommended manual test: Build a connector against Apple App Store Connect /v1/salesReports (or mock a server returning gzip bytes without Content-Encoding) and confirm the response is correctly decompressed and parsed in both Builder test-read and sync modes.
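
For the mock-server option in the checklist above, a minimal standard-library sketch could look like the following. The handler name, CSV payload, and ephemeral port are illustrative assumptions and not part of this PR. urllib is used as the client because it does not transparently decompress response bodies.

```python
import gzip
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Tuple

# Hypothetical payload standing in for a sales report.
CSV_BODY = b"id,name\n1,Alice\n2,Bob\n"


class GzipWithoutHeaderHandler(BaseHTTPRequestHandler):
    """Returns a gzip-compressed body but deliberately omits Content-Encoding."""

    def do_GET(self) -> None:
        payload = gzip.compress(CSV_BODY)
        self.send_response(200)
        self.send_header("Content-Type", "text/csv")
        self.send_header("Content-Length", str(len(payload)))
        # No Content-Encoding header is sent here, on purpose.
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format: str, *args: object) -> None:
        pass  # keep the demo quiet


def fetch_from_mock() -> Tuple[bytes, Optional[str]]:
    """Starts the mock on an ephemeral port, fetches once, then shuts it down."""
    server = HTTPServer(("127.0.0.1", 0), GzipWithoutHeaderHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/v1/salesReports"
        # Bypass any proxy configured in the environment for this local call.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with opener.open(url) as response:
            return response.read(), response.headers.get("Content-Encoding")
    finally:
        server.shutdown()
        server.server_close()


body, content_encoding = fetch_from_mock()
```

Pointing a connector under test at such a server should exercise the same header-less gzip path that the Apple endpoint triggers.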

Notes

This is a CDK-level fix affecting all manifest-only/low-code connectors that use GzipDecoder.
Not a breaking change — strictly additive behavior (auto-detection is a superset of the old unconditional gzip path).
The initial revision buffered the entire response into a BytesIO for magic-byte detection; this was replaced with a streaming _PrefixedStream wrapper to avoid memory regression for large responses.
7 new unit tests cover: gzip without headers (CSV, JSONL), non-gzip passthrough (CSV, JSONL), empty data, fallback in by_headers mode, and non-streamed mode. All 40 decoder tests pass locally.

Link to Devin session: https://app.devin.ai/sessions/1e67cd663c11402b82881eeb06a30745
Requested by: Alfredo Garcia (@agarctfi)

…hout Content-Encoding header

GzipParser now checks for gzip magic bytes (0x1f 0x8b) before attempting decompression. If data is not gzip-compressed, it passes through to the inner parser unchanged. This fixes APIs like Apple App Store Connect that return gzip bodies without Content-Encoding headers.

Also updates create_gzip_decoder() to use gzip_parser (with auto-detection) as the fallback parser instead of gzip_parser.inner_parser.

Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration · 2026-03-27T15:33:13Z

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

Disable automatic comment and CI monitoring

github-actions · 2026-03-27T15:33:23Z

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

💡 Show Tips and Tricks

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1774625405-fix-gzip-decoder-auto-detect#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1774625405-fix-gzip-decoder-auto-detect

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

/autofix - Fixes most formatting and linting issues
/poetry-lock - Updates poetry.lock file
/test - Runs connector tests with the updated CDK
/prerelease - Triggers a prerelease publish with default arguments
/poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
/poe <command> - Runs any poe command in the CDK environment

📚 Show Repo Guidance

Helpful Resources

CDK API Reference


github-actions · 2026-03-27T15:42:02Z

PyTest Results (Fast)

Tests: 4 020 total (+86), 4 009 ✅ (+86), 11 💤 (±0), 0 ❌ (±0)
Suites: 1 (±0), files: 1 (±0)
Duration: 7m 11s ⏱️ (+12s)

Results for commit 115ed14. ± Comparison against base commit acafc75.

This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.

unit_tests.utils.test_memory_monitor ‑ test_cgroup_v1_emits_warning
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.utils.test_memory_monitor ‑ test_no_warning_below_threshold

unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_empty_data_returns_no_records
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_csv_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_fallback_in_by_headers_mode
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_jsonl_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_data_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_jsonl_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_streamed_gzip_without_content_encoding
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_prefixed_stream_closes_wrapped_stream
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_expands_non_mapping_safely
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_interpolated_expand_path
…

♻️ This comment has been updated with latest results.

github-actions · 2026-03-27T15:45:28Z

PyTest Results (Full)

Tests: 4 023 total (+86), 4 011 ✅ (+86), 12 💤 (±0), 0 ❌ (±0)
Suites: 1 (±0), files: 1 (±0)
Duration: 10m 43s ⏱️ (-6s)

Results for commit 115ed14. ± Comparison against base commit acafc75.

This pull request removes 4 and adds 90 tests. Note that renamed tests count towards both.

unit_tests.utils.test_memory_monitor ‑ test_cgroup_v1_emits_warning
unit_tests.utils.test_memory_monitor ‑ test_logs_at_90_percent
unit_tests.utils.test_memory_monitor ‑ test_logs_on_every_check_above_90_percent
unit_tests.utils.test_memory_monitor ‑ test_no_warning_below_threshold

unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_empty_data_returns_no_records
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_csv_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_fallback_in_by_headers_mode
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_gzip_jsonl_without_content_encoding_header
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_data_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_gzip_jsonl_passthrough
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_non_streamed_gzip_without_content_encoding
unit_tests.sources.declarative.decoders.test_composite_decoder.TestGzipParserAutoDetection ‑ test_prefixed_stream_closes_wrapped_stream
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_expands_non_mapping_safely
unit_tests.sources.declarative.extractors.test_dpath_extractor ‑ test_dpath_extractor_interpolated_expand_path
…

♻️ This comment has been updated with latest results.

devin-ai-integration · 2026-04-01T22:32:19Z

Code Review: GzipParser auto-detect gzip magic bytes

Overall Assessment

The core fix is correct for the stated problem: APIs like Apple App Store Connect /v1/salesReports return gzip-compressed bodies without Content-Encoding: gzip, and the old code had no way to handle this. Magic-byte detection (\x1f\x8b) is the standard approach. Tests are solid (7 new tests covering the key scenarios).

However, there are a few issues worth flagging — one significant, one minor, one cosmetic.

1. 🔴 Memory regression in streaming mode (significant)

The new parse() method calls data.read() which buffers the entire response body into memory:

remaining = data.read()
full_data = io.BytesIO(header + remaining)

In production mode (stream_response=True), data is response.raw (urllib3 HTTPResponse). The old code passed this directly to gzip.GzipFile(fileobj=data), which streamed decompression without buffering the full response. The new code reads everything into a BytesIO first.

This matters because GzipParser is now also used as the fallback parser in by_headers mode (the second change in create_gzip_decoder). So every response that doesn't match the Content-Encoding header will be fully buffered — even if it's hundreds of MB of plain CSV/JSONL.

Suggested fix: Use a lightweight wrapper to prepend the peeked bytes back onto the stream without buffering:

class _PrefixedStream(io.RawIOBase):
    """Prepends already-read bytes back onto a stream without buffering everything."""

    def __init__(self, prefix: bytes, stream: BufferedIOBase) -> None:
        self._prefix = io.BytesIO(prefix)
        self._stream = stream
        self._prefix_done = False

    def readable(self) -> bool:
        return True

    def read(self, n: int = -1) -> bytes:
        if not self._prefix_done:
            chunk = self._prefix.read(n)
            if chunk:
                if n != -1 and len(chunk) >= n:
                    return chunk
                self._prefix_done = True
                remaining = self._stream.read(n - len(chunk) if n != -1 else -1)
                return chunk + (remaining or b"")
            self._prefix_done = True
        return self._stream.read(n)

Then in parse():

def parse(self, data: BufferedIOBase) -> PARSER_OUTPUT_TYPE:
    header = data.read(2)
    if not header:
        return

    stream = _PrefixedStream(header, data)
    if header == GZIP_MAGIC_BYTES:
        with gzip.GzipFile(fileobj=stream, mode="rb") as gzipobj:
            yield from self.inner_parser.parse(gzipobj)
    else:
        yield from self.inner_parser.parse(stream)

This preserves streaming behavior for both gzip and non-gzip paths.
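
To make the suggestion easier to try in isolation, here is a self-contained sketch of the same idea. It is not the code merged in this PR: PrefixedStream and parse_csv_auto are hypothetical names, csv.DictReader stands in for the CDK's CsvParser, and the wrapper implements readinto() and is placed behind io.BufferedReader so that short reads from the raw layer are absorbed.

```python
import csv
import gzip
import io
from typing import Dict, Iterator

GZIP_MAGIC = b"\x1f\x8b"


class PrefixedStream(io.RawIOBase):
    """Illustrative stand-in: replays `prefix`, then reads from `stream`."""

    def __init__(self, prefix: bytes, stream: io.BufferedIOBase) -> None:
        super().__init__()
        self._prefix = prefix
        self._stream = stream

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Serve leftover prefix bytes first; RawIOBase.read() builds on readinto().
        if self._prefix:
            n = min(len(buffer), len(self._prefix))
            buffer[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        # Prefix exhausted: pull at most len(buffer) bytes from the wrapped stream.
        chunk = self._stream.read(len(buffer)) or b""
        if chunk:
            buffer[: len(chunk)] = chunk
        return len(chunk)


def parse_csv_auto(data: io.BufferedIOBase) -> Iterator[Dict[str, str]]:
    """Peeks 2 bytes, then parses CSV from either gzip or plain input."""
    header = data.read(2)
    if not header:
        return
    # BufferedReader turns the raw wrapper's short reads into full reads.
    stream = io.BufferedReader(PrefixedStream(header, data))
    if header == GZIP_MAGIC:
        with gzip.GzipFile(fileobj=stream, mode="rb") as gz:
            text = io.TextIOWrapper(gz, encoding="utf-8", newline="")
            yield from csv.DictReader(text)
    else:
        text = io.TextIOWrapper(stream, encoding="utf-8", newline="")
        yield from csv.DictReader(text)


raw = b"id,name\n1,Alice\n2,Bob\n"
from_gzip = list(parse_csv_auto(io.BytesIO(gzip.compress(raw))))
from_plain = list(parse_csv_auto(io.BytesIO(raw)))
```

Only the 2 peeked bytes are held in memory by the wrapper. Everything else is pulled from the underlying stream on demand.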

2. 🟡 Double-decompression edge case in builder mode (low risk)

In builder mode (_emit_connector_builder_messages=True, stream_response=False):

CompositeRawDecoder.decode() calls response.content, which the requests library auto-decompresses when Content-Encoding: gzip is set
GzipParser.parse() receives already-decompressed bytes
The magic-byte check correctly identifies this as non-gzip → passes through ✅

But if decompressed content happens to start with bytes \x1f\x8b, it would be incorrectly double-decompressed. Extremely unlikely for CSV/JSONL (first bytes would be column names or {"), but worth adding an inline comment so future readers understand the assumption.
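
If an extra safeguard were wanted, one option is sketched below. This is an assumption for discussion, not something this PR implements: RFC 1952 sets the third header byte (the compression method) to 8 for deflate, so checking 3 bytes instead of 2 narrows the window for false positives. The helper name looks_like_gzip is hypothetical, and adopting it would mean peeking and re-prefixing 3 bytes rather than 2.

```python
import gzip

# RFC 1952 header: ID1=0x1f, ID2=0x8b, CM=0x08 (deflate).
GZIP_SIGNATURE = b"\x1f\x8b\x08"


def looks_like_gzip(head: bytes) -> bool:
    """Returns True if the leading bytes match the gzip IDs plus the deflate method byte."""
    return head[:3] == GZIP_SIGNATURE


compressed = gzip.compress(b'{"id": 1}\n')
plain_json = b'{"id": 1}\n'
plain_csv = b"id,name\n1,Alice\n"
# Starts with the 2 magic bytes but is not a gzip stream.
tricky = b"\x1f\x8bnot really gzip"

results = {
    "compressed": looks_like_gzip(compressed),
    "plain_json": looks_like_gzip(plain_json),
    "plain_csv": looks_like_gzip(plain_csv),
    "tricky": looks_like_gzip(tricky),
}
```

This still cannot distinguish a genuinely gzip-compressed payload that was meant to be delivered as-is, so it reduces the edge case rather than eliminating it.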

3. 🟢 Cosmetic: constant placement

GZIP_MAGIC_BYTES = b"\x1f\x8b"

import orjson
import requests

Module-level constant placed between import blocks — should be after all imports per standard Python style.

Test Coverage ✅

The 7 new tests cover: gzip CSV/JSONL without header, non-gzip passthrough for CSV/JSONL, empty data, fallback in by_headers mode, and non-streamed mode. Good coverage.

Verdict

Logic is correct and addresses the root cause. Main concern is the memory regression in the streaming path — this affects all connectors using GzipDecoder in production, not just Apple App Store. Recommend addressing before merge.

…se buffering

- Replace data.read() + BytesIO buffering with a lightweight _PrefixedStream wrapper that prepends the 2-byte magic-byte header back onto the original stream without reading the entire response into memory.
- Move GZIP_MAGIC_BYTES constant after imports (now _GZIP_MAGIC_BYTES, private).
- Add inline docstring note about the double-decompression edge case in builder mode where requests auto-decompresses Content-Encoding: gzip.

Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>


devin-ai-integration · 2026-04-06T20:45:01Z

Local Mock Test Results

Tested locally with a mock HTTP server returning gzip-compressed CSV data without Content-Encoding header (simulating Apple App Store Connect /v1/salesReports behavior). Requested by Alfredo Garcia (@agarctfi).

Test 1: Bug Reproduction on `main` branch

Result: FAILED (as expected — bug confirmed)

The create_gzip_decoder() fallback parser is CsvParser, which receives raw gzip bytes and throws:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
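
The same error can be reproduced without any HTTP server. The sketch below uses an arbitrary sample payload (an assumption for illustration): the first gzip magic byte 0x1f is valid ASCII, so decoding only fails on the second byte, 0x8b, at position 1.

```python
import gzip

# Arbitrary sample CSV, compressed the way the API would return it.
payload = gzip.compress(b"id,name\n1,Alice\n")

try:
    payload.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    # Capture the message so it can be compared with the traceback above.
    error_message = str(exc)

print(error_message)
```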

Test 2: Fix Verification on PR branch

Result: SUCCESS — all 3 CSV records decoded correctly

GzipParser auto-detects gzip magic bytes (0x1f 0x8b), decompresses via streaming _PrefixedStream, and passes decompressed data to CsvParser:

RECORD: {'id': '1', 'name': 'Alice'}
RECORD: {'id': '2', 'name': 'Bob'}
RECORD: {'id': '3', 'name': 'Charlie'}

Test 3: Unit Tests on PR branch

41 passed, 0 failed (including 8 new auto-detection tests)

Screenshots

Bug reproduction on main branch:

Fix verified on PR branch + unit tests:

Verdict

The fix correctly resolves the issue. APIs returning gzip-compressed data without Content-Encoding headers are now properly handled by the GzipDecoder.

Devin session

Alfredo Garcia (agarctfi) · 2026-04-06T20:45:46Z

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

🟦 Job completed successfully (no changes).

Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>

Copilot

Pull request overview

Updates the declarative GzipDecoder/GzipParser behavior to correctly handle APIs that return gzip-compressed bodies without Content-Encoding: gzip, while preserving streaming behavior (no full buffering).

Changes:

Add gzip magic-bytes auto-detection to GzipParser.parse() with a _PrefixedStream wrapper to reattach peeked bytes.
Update create_gzip_decoder() to use gzip_parser (not inner_parser) in builder mode and as the header-mismatch fallback.
Add unit tests covering gzip-without-headers, passthrough behavior, empty input, fallback selection, non-streamed mode, and stream closing.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py`	Implements magic-bytes detection, adds `_PrefixedStream`, and changes `GzipParser` to passthrough non-gzip streams.
`airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py`	Ensures `GzipDecoder` uses the gzip-aware parser in builder mode and as the fallback when headers don’t match.
`unit_tests/sources/declarative/decoders/test_composite_decoder.py`	Adds tests validating gzip auto-detection and passthrough across streaming/non-streaming scenarios.


airbyte_cdk/sources/declarative/decoders/composite_raw_decoder.py

Daryna Ishchenko (darynaishchenko)

Alfredo Garcia (@agarctfi) I was working on a similar issue some time ago: https://github.com/airbytehq/oncall/issues/11173#issuecomment-3967448166. I was able to fix it in the manifest itself by adding the inner parser as a gzip decoder, meaning that it will be used when the headers don't contain info about gzip.
We have connectors like amazon-ads that already use this logic to fix the issue with headers, and it looks like this change is breaking for such connectors.

Can the oncall issue be fixed by adding the inner parser as a GzipDecoder in their builder project, without a CDK fix?
Can we implement a CDK fix that is backward compatible, so that we don't need to update connectors in a follow-up?

Alfredo Garcia (agarctfi) temporarily deployed to DockerHub April 1, 2026 22:53 — with GitHub Actions Inactive

Alfredo Garcia (agarctfi) temporarily deployed to PyPi April 1, 2026 22:53 — with GitHub Actions Inactive

Alfredo Garcia (agarctfi) temporarily deployed to DockerHub April 1, 2026 22:53 — with GitHub Actions Inactive

fix(decoder): close wrapped stream in gzip parser and add regression …

dbb89a6

…test

fix(decoder): add type annotations to _PrefixedStream.__exit__ for mypy

115ed14

Co-Authored-By: alfredo.garcia@airbyte.io <freddy.garcia7.fg@gmail.com>

Alfredo Garcia (agarctfi) marked this pull request as ready for review April 6, 2026 21:06

Copilot AI review requested due to automatic review settings April 6, 2026 21:06

Copilot started reviewing on behalf of Alfredo Garcia (agarctfi) April 6, 2026 21:07 View session

Copilot AI reviewed Apr 6, 2026

Alfredo Garcia (agarctfi) requested review from Daryna Ishchenko (darynaishchenko), Patrick Nilan (pnilan) and Anatolii Yatsuk (tolik0) April 6, 2026 21:16

Daryna Ishchenko (darynaishchenko) requested changes Apr 7, 2026

devin-ai-integration[bot] wants to merge 4 commits into main from devin/1774625405-fix-gzip-decoder-auto-detect

3 participants
