
fix(destination-s3-data-lake): produce lowercase column names for Glue catalog compatibility#76059

Draft
devin-ai-integration[bot] wants to merge 5 commits into master from devin/1775169931-s3-data-lake-lowercase-columns

devin-ai-integration Bot commented Apr 2, 2026

What

The S3 Data Lake connector does not produce lowercase column names when writing to the Glue catalog, which breaks downstream engines like Snowflake and Athena.

The connector was missing a custom TableSchemaMapper implementation and fell back to NoopTableSchemaMapper, which passes column names through unchanged. The sibling GCS Data Lake connector already has a working implementation of this pattern.

Resolves https://github.com/airbytehq/oncall/issues/11856
Original issue: #76058

How

  1. New S3DataLakeTableSchemaMapper — Implements TableSchemaMapper with @Singleton. toColumnName() calls Transformations.toAlphanumericAndUnderscore(name).lowercase() to sanitize and lowercase all column names. toFinalTableName() applies the same transformation to namespace and table name.

  2. transformSchemaWithMappedNames() in S3DataLakeStreamLoader — Post-processes the Iceberg schema produced by IcebergUtil.toIcebergSchema() to replace original column names with the mapped (lowercased) names from stream.tableSchema.columnSchema.inputToFinalColumnNames. Airbyte metadata columns are skipped since they are already valid.

Both changes follow the pattern established by GcsDataLakeTableSchemaMapper and GcsDataLakeStreamLoader.
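
The two steps above can be sketched in plain Kotlin. This is a minimal illustration, not the connector code: `toAlphanumericAndUnderscore` below is a hypothetical stand-in for the CDK's `Transformations.toAlphanumericAndUnderscore()` (the real helper may handle some characters differently), `Field` is a simplified model rather than Iceberg's `NestedField`, and the `_airbyte_` prefix check is an assumption about how metadata columns are recognized.

```kotlin
// Hypothetical stand-in for Transformations.toAlphanumericAndUnderscore();
// the real CDK helper may treat some characters differently.
fun toAlphanumericAndUnderscore(name: String): String =
    name.replace(Regex("[^A-Za-z0-9_]"), "_")

// Mirrors the described toColumnName(): sanitize first, then lowercase.
fun toColumnName(name: String): String =
    toAlphanumericAndUnderscore(name).lowercase()

// Simplified model of a schema field; not the Iceberg NestedField type.
data class Field(val id: Int, val name: String, val type: String)

// Rename fields using the input-to-final name map. Metadata columns
// (assumed here to start with "_airbyte_") and unmapped names are kept as-is.
fun renameFields(fields: List<Field>, inputToFinal: Map<String, String>): List<Field> =
    fields.map { f ->
        if (f.name.startsWith("_airbyte_")) f
        else f.copy(name = inputToFinal[f.name] ?: f.name)
    }

fun main() {
    val fields = listOf(
        Field(1, "_airbyte_raw_id", "string"),
        Field(2, "UserId", "long"),
        Field(3, "Order Total", "double"),
    )
    // Stand-in for stream.tableSchema.columnSchema.inputToFinalColumnNames.
    val mapping = fields.map { it.name }.associateWith(::toColumnName)
    println(renameFields(fields, mapping).map { it.name })
    // prints [_airbyte_raw_id, userid, order_total]
}
```

Field IDs and types are left untouched by the rename; only the name changes.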

Review guide

⚠️ Key concern for reviewers: transformSchemaWithMappedNames creates a new Schema(mappedFields) without preserving identifier field IDs from the original schema. The original code (icebergUtil.toIcebergSchema(stream)) returned a Schema that included identifier fields for dedup mode. The GCS version handles this explicitly with a withIdentifierFields boolean parameter — the S3 version does not. Please verify whether dropping identifier field IDs here causes a regression for dedup syncs, or whether they are re-applied downstream by IcebergTableSynchronizer.maybeApplySchemaChanges().

  1. S3DataLakeTableSchemaMapper.kt — New file. Core fix lives in toColumnName() (line 41). Review the toColumnType() mappings — these were copied from the GCS connector (BigLake/Parquet types) and should be verified for S3/Glue correctness.
  2. S3DataLakeStreamLoader.kt — Lines 46–89. The transformSchemaWithMappedNames() method and the changed incomingSchema initialization.
  3. S3DataLakeTableSchemaMapperTest.kt — Unit tests for the mapper.
  4. metadata.yaml — Version bump to 0.3.47.
  5. docs/integrations/destinations/s3-data-lake.md — Changelog entry.
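
The identifier-field concern raised at the top of this review guide can be illustrated with a small model. This is a sketch under stated assumptions: `Column` and `TableSchema` are hypothetical types standing in for Iceberg's schema classes, and both rename functions are illustrative rather than the actual `transformSchemaWithMappedNames()` code. It shows only why rebuilding a schema from its fields alone loses the identifier set.

```kotlin
// Simplified model of an Iceberg schema: columns plus the IDs of identifier
// (primary-key) fields used in dedup mode. Hypothetical types, not Iceberg's.
data class Column(val id: Int, val name: String)
data class TableSchema(
    val columns: List<Column>,
    val identifierFieldIds: Set<Int> = emptySet(),
)

// Rebuilds the schema from renamed columns only. The identifier IDs are lost,
// which is the behavior the review guide asks reviewers to check.
fun renameDroppingIdentifiers(schema: TableSchema, names: Map<String, String>): TableSchema =
    TableSchema(schema.columns.map { it.copy(name = names[it.name] ?: it.name) })

// Same rename, but the identifier IDs are carried over. Field IDs do not
// change when a column is renamed, so the original set stays valid.
fun renamePreservingIdentifiers(schema: TableSchema, names: Map<String, String>): TableSchema =
    schema.copy(columns = schema.columns.map { it.copy(name = names[it.name] ?: it.name) })

fun main() {
    val original = TableSchema(listOf(Column(1, "Id"), Column(2, "Name")), setOf(1))
    val names = mapOf("Id" to "id", "Name" to "name")
    println(renameDroppingIdentifiers(original, names).identifierFieldIds)   // prints []
    println(renamePreservingIdentifiers(original, names).identifierFieldIds) // prints [1]
}
```

Whether the dropped IDs matter in practice depends on whether they are re-applied downstream, which is the open question for reviewers.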

User Impact

Column names written to the Glue catalog will now be lowercase and sanitized (special characters replaced with underscores). This fixes compatibility with Snowflake, Athena, and other engines that require lowercase Glue identifiers.

For existing tables with mixed-case column names, the schema evolution logic in IcebergTableSynchronizer is expected to treat the lowercased names as new columns rather than as renames of the existing ones. Reviewers should verify this behavior; affected streams may require a full refresh.
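
To make the existing-table concern concrete, here is a rough sketch of a case-sensitive, name-based schema diff. It assumes columns are compared by name only; `SchemaDiff` and `diffByName` are hypothetical and do not reflect how IcebergTableSynchronizer is actually implemented.

```kotlin
// Hypothetical name-based diff between an existing table's columns and the
// incoming schema; IcebergTableSynchronizer's real logic may differ.
data class SchemaDiff(val added: Set<String>, val removed: Set<String>)

// Case-sensitive comparison: "UserId" and "userid" are different names.
fun diffByName(existing: Set<String>, incoming: Set<String>): SchemaDiff =
    SchemaDiff(added = incoming - existing, removed = existing - incoming)

fun main() {
    val existing = setOf("_airbyte_raw_id", "UserId", "email")
    val incoming = setOf("_airbyte_raw_id", "userid", "email")
    val diff = diffByName(existing, incoming)
    println(diff.added)   // prints [userid]
    println(diff.removed) // prints [UserId]
}
```

Under that assumption, a previously mixed-case column shows up as one added and one removed column, while columns that were already lowercase are unaffected.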

Can this PR be safely reverted and rolled back?

  • YES 💚

Link to Devin session: https://app.devin.ai/sessions/4328e94738f64f5880a3d21a37916762

…e catalog compatibility

Add S3DataLakeTableSchemaMapper to transform column names using
Transformations.toAlphanumericAndUnderscore() followed by .lowercase(),
matching the pattern used by the GCS Data Lake connector.

Also add transformSchemaWithMappedNames() to S3DataLakeStreamLoader to
post-process the Iceberg schema with the mapped column names from the
stream's table schema.

Resolves airbytehq/oncall#11856

Co-Authored-By: bot_apk <apk@cognition.ai>

devin-ai-integration Bot and others added 2 commits April 2, 2026 22:51
Co-Authored-By: bot_apk <apk@cognition.ai>

github-actions Bot commented Apr 2, 2026

destination-s3-data-lake Connector Test Results

25 tests: 24 passed ✅, 0 skipped 💤, 1 failed ❌
3 suites, 3 files, 3s ⏱️

For more details on these failures, see this check.

Results for commit e0728d7.

♻️ This comment has been updated with latest results.

Co-Authored-By: bot_apk <apk@cognition.ai>

github-actions Bot commented Apr 2, 2026

Deploy preview for airbyte-docs ready!

✅ Preview
https://airbyte-docs-vz12t84aj-airbyte-growth.vercel.app

Built with commit e0728d7.
This pull request is being automatically deployed with vercel-action
