fix(decoder): auto-detect gzip magic bytes for responses without Content-Encoding header #914

Draft
Airbyte Support (Airbyte-Support) wants to merge 2 commits into main from devin/1771566625-gzip-auto-detect-magic-bytes

@Airbyte-Support Airbyte Support (Airbyte-Support) commented Feb 20, 2026

Summary

Some APIs (notably Apple App Store Connect /v1/salesReports) return gzip-compressed response bodies without setting the Content-Encoding: gzip header. The existing GzipParser unconditionally assumed gzip input, and the create_gzip_decoder factory used the inner parser (e.g. CsvParser) as the fallback when headers didn't match — meaning gzip data without the header was never decompressed, producing 'utf-8' codec can't decode byte 0x8b errors.
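The decode error quoted above can be reproduced without any connector code. The following is a minimal sketch using only the standard library; the sample TSV rows are made up for illustration. It treats a gzip-compressed body as UTF-8 text, which is effectively what happens when the inner parser receives compressed bytes.

```python
import gzip

# Made-up sample payload standing in for a sales report body.
body = gzip.compress(b"Provider\tUnits\nAPPLE\t3\n")

try:
    # Byte 0 (0x1f) is a valid ASCII control character, but byte 1 (0x8b)
    # cannot start a UTF-8 sequence, so decoding fails at position 1.
    body.decode("utf-8")
    error_message = ""
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```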

Changes:

  1. GzipParser.parse() — now reads the first 2 bytes and checks for gzip magic bytes (\x1f\x8b). If present, decompresses; otherwise passes data through to the inner parser unchanged.
  2. create_gzip_decoder() — uses gzip_parser (with auto-detection) instead of gzip_parser.inner_parser as both the default parser in builder mode and the fallback in production mode.
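The detection step in change 1 can be sketched as follows. This is an illustration of the described behavior, not the code in this PR: `open_maybe_gzip` is a hypothetical helper name, and the sketch buffers the whole body in memory, as the review checklist notes the new `parse()` does.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def open_maybe_gzip(data: io.BufferedIOBase) -> io.BufferedIOBase:
    """Hypothetical helper: return a stream of decompressed (or untouched) bytes."""
    buffered = io.BytesIO(data.read())  # assumption: the full body fits in memory
    header = buffered.read(2)
    buffered.seek(0)  # rewind so the consumer sees the body from the start
    if header == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=buffered, mode="rb")
    return buffered  # not gzip: pass the bytes through unchanged
```

An inner parser would then read from the returned stream the same way in both cases, for example `open_maybe_gzip(stream).read()`.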

Updates since last revision

  • Added a parametrized test (test_gzip_parser_auto_detection) with 6 cases covering: gzip CSV/JSONL without Content-Encoding header, non-gzip passthrough, by_headers fallback path, non-streamed mode, and empty data.
  • All 39 decoder tests pass locally (33 existing + 6 new).
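For readers who want a feel for the kinds of cases listed above, here is a table-driven sketch in plain Python. It is not the test added in this PR: `decode_body` is a hypothetical stand-in for the parser, the payloads are invented, and a real suite would more likely express the table with `pytest.mark.parametrize`.

```python
import gzip


def decode_body(body: bytes) -> str:
    """Hypothetical stand-in: gunzip only when the gzip magic bytes are present."""
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    return body.decode("utf-8")


CSV_TEXT = "id,name\n1,alpha\n"
JSONL_TEXT = '{"id": 1}\n{"id": 2}\n'

# (case name, raw response body, expected decoded text)
CASES = [
    ("gzip CSV, no Content-Encoding", gzip.compress(CSV_TEXT.encode()), CSV_TEXT),
    ("gzip JSONL, no Content-Encoding", gzip.compress(JSONL_TEXT.encode()), JSONL_TEXT),
    ("plain CSV passthrough", CSV_TEXT.encode(), CSV_TEXT),
    ("empty body", b"", ""),
]


def run_cases() -> list[str]:
    """Return the names of the cases whose decoded text does not match."""
    return [name for name, body, expected in CASES if decode_body(body) != expected]
```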

Review & Testing Checklist for Human

  • Memory regression for large streaming responses: GzipParser.parse() now calls data.read() to buffer the entire response into a BytesIO. The old code streamed through gzip.GzipFile(fileobj=data) directly. For very large responses in production mode (stream_response=True), this could significantly increase memory usage. Consider whether a streaming-friendly approach (e.g., a prefixed stream wrapper) is needed.
  • Double decompression edge case in builder mode: When Content-Encoding: gzip IS present and stream_response=False, response.content is already decompressed by the requests library. GzipParser then receives the decompressed bytes — the magic-byte check should correctly identify this as non-gzip and pass through. However, if decompressed content happens to start with \x1f\x8b bytes, it would be incorrectly re-decompressed. Assess whether this is a realistic risk for your API consumers.
  • Recommended manual test plan: Build a connector against the Apple App Store Connect /v1/salesReports endpoint (or mock a server that returns gzip bytes without Content-Encoding) and confirm the response is correctly decompressed and parsed as TSV/CSV.
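The first checklist item mentions a prefixed stream wrapper as a streaming-friendly alternative to buffering. A minimal sketch of that idea follows, assuming a blocking source stream. `PrefixedStream` and `open_maybe_gzip_streaming` are hypothetical names and are not part of this PR; the point is that only the two peeked bytes are held back, and the rest of the body is still read incrementally.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"


class PrefixedStream(io.RawIOBase):
    """Replays already-consumed prefix bytes, then continues with the source."""

    def __init__(self, prefix: bytes, source) -> None:
        super().__init__()
        self._prefix = prefix
        self._source = source

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        if self._prefix:
            n = min(len(buffer), len(self._prefix))
            buffer[:n] = self._prefix[:n]
            self._prefix = self._prefix[n:]
            return n
        chunk = self._source.read(len(buffer))  # assumption: blocking read
        buffer[: len(chunk)] = chunk
        return len(chunk)


def open_maybe_gzip_streaming(source):
    """Hypothetical helper: peek 2 bytes without buffering the whole body."""
    head = b""
    while len(head) < 2:  # tolerate short reads from raw streams
        chunk = source.read(2 - len(head))
        if not chunk:
            break
        head += chunk
    stream = io.BufferedReader(PrefixedStream(head, source))
    if head == GZIP_MAGIC:
        return gzip.GzipFile(fileobj=stream, mode="rb")
    return stream
```

Whether this fits the CDK's parser interface is an open question for the reviewer; the sketch only shows that magic-byte detection does not by itself require reading the full response first.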

Notes

…ent-Encoding header

Co-Authored-By: syed.khadeer@airbyte.io <cloud-support@airbyte.io>
@devin-ai-integration
Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

⚙️ Control Options:

  • Disable automatic comment and CI monitoring

@github-actions

👋 Greetings, Airbyte Team Member!

Here are some helpful tips and reminders for your convenience.

Testing This CDK Version

You can test this version of the CDK using the following:

# Run the CLI from this branch:
uvx 'git+https://github.com/airbytehq/airbyte-python-cdk.git@devin/1771566625-gzip-auto-detect-magic-bytes#egg=airbyte-python-cdk[dev]' --help

# Update a connector to use the CDK from this branch ref:
cd airbyte-integrations/connectors/source-example
poe use-cdk-branch devin/1771566625-gzip-auto-detect-magic-bytes

PR Slash Commands

Airbyte Maintainers can execute the following slash commands on your PR:

  • /autofix - Fixes most formatting and linting issues
  • /poetry-lock - Updates poetry.lock file
  • /test - Runs connector tests with the updated CDK
  • /prerelease - Triggers a prerelease publish with default arguments
  • /poe build - Regenerate git-committed build artifacts, such as the pydantic models which are generated from the manifest JSON schema in YAML.
  • /poe <command> - Runs any poe command in the CDK environment

…Encoding header

Co-Authored-By: syed.khadeer@airbyte.io <cloud-support@airbyte.io>
@github-actions

PyTest Results (Fast)

3 875 tests (+6): 3 863 ✅ passed (+6), 12 💤 skipped (±0), 0 ❌ failed (±0)
1 suite (±0), 1 file (±0), run time 6m 15s ⏱️ (-22s)

Results for commit 670e1e3. ± Comparison against base commit cd7e369.

@github-actions

PyTest Results (Full)

3 878 tests: 3 866 ✅ passed, 12 💤 skipped, 0 ❌ failed
1 suite, 1 file, run time 10m 57s ⏱️

Results for commit 670e1e3.
