Fix CHAR/VARCHAR length overflow when writing reconcile intermediate data #2428
Open
moomindani wants to merge 4 commits into main
Some data sources (e.g., Teradata) return CHAR(n) values with space
padding via JDBC, resulting in values that exceed the declared column
length. Delta enforces CHAR/VARCHAR length constraints through column
metadata (__CHAR_VARCHAR_TYPE_STRING), causing writes to fail for these
padded values.
Strip all column metadata via col.alias(name, metadata={}) before writing
intermediate DataFrames to Delta. This removes the constraint that
Delta uses for length enforcement.
Observed with Teradata via Lakehouse Federation but not with Lakebase
(PostgreSQL) via Lakehouse Federation.
Co-authored-by: Isaac
- black reformats list comprehension to single line in test helper
- ruff removes unused StringType import (was used in main, dropped after merge)

Co-authored-by: Isaac
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2428      +/-   ##
==========================================
- Coverage   65.78%   65.78%   -0.01%
==========================================
  Files          98       98
  Lines        9237     9242       +5
  Branches      992      992
==========================================
+ Hits         6077     6080       +3
- Misses       2984     2986       +2
  Partials      176      176
`_write_df_to_delta` is a module-level function and accessed ReconIntermediatePersist._strip_char_varchar_constraints from outside the class, which pylint flags as protected-access. Rename to public since the helper is effectively a utility. Also rename mock_select unused arg to *_cols and fix test fn names. Co-authored-by: Isaac
✅ 148/148 passed, 5 skipped, 24m59s total
Running from acceptance #4311
Changes
What does this PR do?
Strip CHAR(n)/VARCHAR(n) length constraints from DataFrames before writing intermediate data to Delta during reconciliation. This prevents `DELTA_EXCEED_CHAR_VARCHAR_LIMIT` errors when source data contains space-padded CHAR values.
Root cause
Some data sources (e.g., Teradata) return CHAR(n) values with space padding via JDBC, resulting in values that exceed the declared column length (e.g., a CHAR(16) column returning 16 digits + 16 spaces = 32 characters). Delta enforces CHAR/VARCHAR length constraints through column metadata (`__CHAR_VARCHAR_TYPE_STRING`), causing writes to fail for these padded values.
This was observed with Teradata via Lakehouse Federation but not with Lakebase (PostgreSQL) via Lakehouse Federation.
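To make the arithmetic in the root cause concrete, here is a minimal plain-Python sketch of the length comparison. It only mimics the idea of a declared-length check; it is not Delta's implementation, and the helper name `exceeds_declared_length` and the sample value are made up for illustration.

```python
DECLARED_LENGTH = 16  # the n in CHAR(16)

# Hypothetical value shaped like the example above: 16 digits followed by 16 pad spaces.
padded_value = "1234567890123456" + " " * 16


def exceeds_declared_length(value: str, declared_length: int) -> bool:
    """Return True when the value is longer than the declared CHAR/VARCHAR length."""
    return len(value) > declared_length


print(len(padded_value))                                               # 32
print(exceeds_declared_length(padded_value, DECLARED_LENGTH))           # True
print(exceeds_declared_length(padded_value.rstrip(" "), DECLARED_LENGTH))  # False
```

The digits alone fit the declared length; only the trailing padding pushes the value over it.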
Fix
Strip all column metadata via `col.alias(name, metadata={})` before writing intermediate DataFrames to Delta. This removes the constraint that Delta uses for length enforcement. The intermediate data is temporary and does not need metadata preservation.
Linked issues
Fixes #2389
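For reference, the call shape named in the Fix section can be sketched as follows. This is an illustrative helper under assumptions, not the code in this PR: the name `strip_column_metadata` is hypothetical, it handles top-level columns only, and it looks columns up with `df[name]`, which assumes column names without dots.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame


def strip_column_metadata(df: "DataFrame") -> "DataFrame":
    """Re-select every top-level column under its own name, passing empty metadata.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    # Alias each column to its existing name so names and order are unchanged,
    # and pass metadata={} as the Fix section describes. No cast is applied,
    # so column types are left as they are.
    columns = [
        df[field.name].alias(field.name, metadata={})
        for field in df.schema.fields
    ]
    return df.select(columns)
```

A caller would apply this to the intermediate DataFrame just before the Delta write. How an empty dict is handled by `alias` may depend on the PySpark version in use, so it is worth asserting that `field.metadata == {}` holds for every field of the returned schema on the target runtime before relying on it.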
Tests
Test plan
- `test_strip_char_varchar_constraints_strips_metadata`: verifies CHAR/VARCHAR metadata is stripped
- `test_strip_char_varchar_constraints_preserves_types`: verifies column types are preserved

Reopened from #2390 on an upstream branch to bypass the fork-PR OIDC restriction on JFrog auth (CI cannot run on fork PRs). All review comments and history are preserved on the original PR.