[SPARK-57773][SQL] Strip leaked internal metadata in DataSourceV2Relation.create while preserving column IDs #56887
Open
cloud-fan wants to merge 2 commits into
…tion.create while preserving column IDs

DataSourceV2Relation.create builds the relation output schema directly from table.columns without removing internal metadata. Any internal metadata key that leaked onto the table columns (the keys in INTERNAL_METADATA_KEYS, such as __metadata_col) therefore surfaces on the relation output and on df.schema, instead of being hidden as intended.

This strips internal metadata in create via removeInternalMetadata, but preserves column IDs: FIELD_ID_METADATA_KEY is listed in INTERNAL_METADATA_KEYS so that other paths drop it, while it is also deliberately surfaced onto the relation output (SPARK-57544). removeInternalMetadata gains a keepFieldIds flag (default false, preserving behavior for existing callers) that skips that single key in the same removal pass, so create keeps the IDs without a strip-then-re-add round-trip.
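The single-pass removal with a keepFieldIds flag can be illustrated with a minimal, self-contained sketch. This is not Spark's implementation: metadata is modeled as a plain Map instead of Spark's Metadata type, the object name InternalMetadataSketch is hypothetical, and the field-ID key string "__field_id" is an assumed placeholder (only __metadata_col and __qualified_access_only appear in this PR's text).

```scala
object InternalMetadataSketch {
  // Simplified stand-in for a column's metadata; Spark uses its own Metadata type.
  type Meta = Map[String, String]

  // Key strings for illustration. The first two appear in the PR text;
  // the field-ID key string is an ASSUMED placeholder, not Spark's actual value.
  val MetadataColKey = "__metadata_col"
  val QualifiedAccessOnlyKey = "__qualified_access_only"
  val FieldIdKey = "__field_id"

  // Illustrative subset of the internal keys, including the field-ID key.
  val InternalKeys: Set[String] =
    Set(MetadataColKey, QualifiedAccessOnlyKey, FieldIdKey)

  // Drops every internal key in one pass. When keepFieldIds is true, the
  // field-ID key is skipped in that same pass, so nothing is re-added later.
  def removeInternalMetadata(meta: Meta, keepFieldIds: Boolean = false): Meta =
    meta.filterNot { case (key, _) =>
      InternalKeys.contains(key) && !(keepFieldIds && key == FieldIdKey)
    }
}
```

Under these assumptions, calling removeInternalMetadata(meta, keepFieldIds = true) on a column that carries __metadata_col, the field-ID key, and a user-level key would keep the latter two, while the default call would keep only the user-level key.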
uros-b
reviewed
Jun 30, 2026
    val field = relation.schema.head

    // The leaked internal metadata key is stripped from the relation output ...
    assert(!field.metadata.contains(METADATA_COL_ATTR_KEY))
Nit: the test exercises one leaked key (METADATA_COL_ATTR_KEY) and one preserved key (FIELD_ID_METADATA_KEY); the other INTERNAL_METADATA_KEYS (QUALIFIED_ACCESS_ONLY, AUTO_GENERATED_ALIAS, PRESERVE_ON_*) are not separately asserted.
…data test

Use util.Set.of() instead of Collections.emptySet() to satisfy scalastyle, and assert that create() strips every key in INTERNAL_METADATA_KEYS (except the deliberately-surfaced column ID) rather than only METADATA_COL_ATTR_KEY.

Co-authored-by: Isaac
What changes were proposed in this pull request?
DataSourceV2Relation.create builds the relation output schema directly from table.columns().asSchema (after char/varchar replacement), without removing internal metadata. This PR makes create strip internal metadata (the keys in INTERNAL_METADATA_KEYS, via removeInternalMetadata) from the schema before it becomes the relation's output, while preserving column IDs. removeInternalMetadata gains a keepFieldIds: Boolean = false parameter that skips FIELD_ID_METADATA_KEY in the same removal pass (the default preserves behavior for existing callers). create calls it with keepFieldIds = true.
Why are the changes needed?
INTERNAL_METADATA_KEYS exists so that internal-only metadata keys (e.g. __metadata_col, __qualified_access_only) do not surface to users, and removeInternalMetadata is already applied on other schema-producing paths. But DataSourceV2Relation.create never applied it, so any internal metadata key that a v2 source attaches to its columns leaks straight onto the relation output and df.schema. This hardens create so internal-only keys stay internal, consistent with the other paths.
Column IDs need special handling: FIELD_ID_METADATA_KEY is listed in INTERNAL_METADATA_KEYS so that other paths drop it, but it is also deliberately surfaced onto the relation's output (see SPARK-57544). Removing it in create would defeat that, so the new keepFieldIds flag preserves it in the same pass that strips the rest.
Does this PR introduce any user-facing change?
No. It removes internal-only metadata keys (which were never meant to be user-visible) from the v2 relation output, and preserves the column IDs that are intentionally surfaced.
How was this patch tested?
New unit test in DataSourceV2RelationSuite that builds a table whose column carries both a column ID and a leaked internal metadata key, and asserts that create strips the internal key from the relation output while preserving the column ID. Verified the test fails without the source change.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)