[SPARK-57773][SQL] Strip leaked internal metadata in DataSourceV2Relation.create while preserving column IDs #56887

Open
cloud-fan wants to merge 2 commits into apache:master from cloud-fan:SPARK-57544-followup

@cloud-fan cloud-fan commented Jun 30, 2026

Contributor

What changes were proposed in this pull request?

DataSourceV2Relation.create builds the relation output schema directly from table.columns().asSchema (after char/varchar replacement), without removing internal metadata. This PR makes create strip internal metadata (the keys in INTERNAL_METADATA_KEYS, via removeInternalMetadata) from the schema before it becomes the relation's output, while preserving column IDs.

removeInternalMetadata gains a keepFieldIds: Boolean = false parameter that skips FIELD_ID_METADATA_KEY in the same removal pass (the default preserves behavior for existing callers). create calls it with keepFieldIds = true:

val schema = removeInternalMetadata(
  CharVarcharUtils.replaceCharVarcharWithStringInSchema(table.columns.asSchema),
  keepFieldIds = true)
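
To make the flag's effect concrete, here is a minimal, dependency-free sketch of a single-pass strip with a keepFieldIds switch. It is not Spark's implementation: metadata is modeled as a plain Map, the Field class and the object name are invented for illustration, the key set is a deliberately incomplete subset, and "__field_id" is a hypothetical stand-in for the value of FIELD_ID_METADATA_KEY, which this description does not give.

```scala
object StripInternalMetadataSketch {
  // Simplified column: Spark uses StructField with a Metadata object instead.
  final case class Field(name: String, metadata: Map[String, Any])

  // Hypothetical value; the PR names FIELD_ID_METADATA_KEY but not its string.
  val FieldIdKey = "__field_id"

  // Illustrative subset of INTERNAL_METADATA_KEYS (the first two names are quoted in this PR).
  val InternalKeys: Set[String] =
    Set("__metadata_col", "__qualified_access_only", FieldIdKey)

  // Drops internal keys in one pass; keepFieldIds spares only the field-ID key.
  def removeInternalMetadata(
      schema: Seq[Field],
      keepFieldIds: Boolean = false): Seq[Field] = {
    val toDrop = if (keepFieldIds) InternalKeys - FieldIdKey else InternalKeys
    schema.map { f =>
      f.copy(metadata = f.metadata.filter { case (key, _) => !toDrop.contains(key) })
    }
  }

  def main(args: Array[String]): Unit = {
    val leaked = Seq(
      Field("id", Map("__metadata_col" -> true, FieldIdKey -> 1L, "comment" -> "pk")))
    // create-style call: internal key removed, column ID and ordinary metadata kept.
    println(removeInternalMetadata(leaked, keepFieldIds = true).head.metadata.keySet)
    // Default call: the column ID is dropped as well, as for existing callers.
    println(removeInternalMetadata(leaked).head.metadata.keySet)
  }
}
```

Computing the set of keys to drop once, up front, is what lets the ID survive without being removed and then re-attached afterwards.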

Why are the changes needed?

INTERNAL_METADATA_KEYS exists so that internal-only metadata keys (e.g. __metadata_col, __qualified_access_only) do not surface to users, and removeInternalMetadata is already applied on other schema-producing paths. But DataSourceV2Relation.create never applied it, so any internal metadata key that a v2 source attaches to its columns leaks straight onto the relation output and df.schema. This hardens create so internal-only keys stay internal, consistent with the other paths.

Column IDs need special handling: FIELD_ID_METADATA_KEY is listed in INTERNAL_METADATA_KEYS so that other paths drop it, but it is also deliberately surfaced onto the relation's output (see SPARK-57544). Removing it in create would defeat that, so the new keepFieldIds flag preserves it in the same pass that strips the rest.

Does this PR introduce any user-facing change?

No. It removes internal-only metadata keys (which were never meant to be user-visible) from the v2 relation output, and preserves the column IDs that are intentionally surfaced.

How was this patch tested?

New unit test in DataSourceV2RelationSuite that builds a table whose column carries both a column ID and a leaked internal metadata key, and asserts that create strips the internal key from the relation output while preserving the column ID. Verified the test fails without the source change.
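
The shape of such a test can be sketched without Spark on the classpath. The model below is a stand-in built on assumptions, not the code in DataSourceV2RelationSuite: strip imitates removeInternalMetadata over a plain Map, "__field_id" is a hypothetical value for FIELD_ID_METADATA_KEY, and the key list is an illustrative subset of INTERNAL_METADATA_KEYS.

```scala
object CreateStripsInternalKeysCheck {
  // Hypothetical value; the real string behind FIELD_ID_METADATA_KEY is not shown in this PR.
  val FieldIdKey = "__field_id"

  // Illustrative subset of the internal keys.
  val InternalKeys: Set[String] =
    Set("__metadata_col", "__qualified_access_only", FieldIdKey)

  // Stand-in for removeInternalMetadata applied to one column's metadata.
  def strip(metadata: Map[String, Any], keepFieldIds: Boolean): Map[String, Any] = {
    val toDrop = if (keepFieldIds) InternalKeys - FieldIdKey else InternalKeys
    metadata.filter { case (key, _) => !toDrop.contains(key) }
  }

  def main(args: Array[String]): Unit = {
    // A column carrying every internal key plus one ordinary, user-visible key.
    val column: Map[String, Any] =
      InternalKeys.map(k => k -> (1L: Any)).toMap + ("comment" -> "primary key")

    val output = strip(column, keepFieldIds = true)

    // Every internal key except the column ID must be gone ...
    (InternalKeys - FieldIdKey).foreach(k => assert(!output.contains(k), s"$k leaked"))
    // ... while the column ID and ordinary metadata survive.
    assert(output.get(FieldIdKey).contains(1L))
    assert(output.get("comment").contains("primary key"))
  }
}
```

Iterating over the whole key set, rather than asserting on a single key, keeps the check valid if more internal keys are added later.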

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Opus 4.8)

cloud-fan force-pushed the SPARK-57544-followup branch from 171a850 to bb3c001 on June 30, 2026 05:10
cloud-fan changed the title from "[SPARK-57544][SQL][FOLLOWUP] Strip leaked internal metadata in DataSourceV2Relation.create while preserving column IDs" to "[SPARK-57773][SQL] Strip leaked internal metadata in DataSourceV2Relation.create while preserving column IDs" on Jun 30, 2026
@cloud-fan

Contributor Author

cc @aokolnychyi @gengliangwang

…tion.create while preserving column IDs

DataSourceV2Relation.create builds the relation output schema directly from
table.columns without removing internal metadata. Any internal metadata key
that leaked onto the table columns (the keys in INTERNAL_METADATA_KEYS, such as
__metadata_col) therefore surfaces on the relation output and on df.schema,
instead of being hidden as intended.

This strips internal metadata in create via removeInternalMetadata, but
preserves column IDs: FIELD_ID_METADATA_KEY is listed in INTERNAL_METADATA_KEYS
so that other paths drop it, while it is also deliberately surfaced onto the
relation output (SPARK-57544). removeInternalMetadata gains a keepFieldIds flag
(default false, preserving behavior for existing callers) that skips that single
key in the same removal pass, so create keeps the IDs without a strip-then-re-add
round-trip.
cloud-fan force-pushed the SPARK-57544-followup branch from bb3c001 to be8bb4f on June 30, 2026 05:25
val field = relation.schema.head

// The leaked internal metadata key is stripped from the relation output ...
assert(!field.metadata.contains(METADATA_COL_ATTR_KEY))

Member

Nit: the test exercises one leaked key (METADATA_COL_ATTR_KEY) and one preserved key (FIELD_ID_METADATA_KEY); the other INTERNAL_METADATA_KEYS (QUALIFIED_ACCESS_ONLY, AUTO_GENERATED_ALIAS, PRESERVE_ON_*) are not separately asserted.

@uros-b uros-b left a comment

Member

Thank you @cloud-fan!

…data test

Use util.Set.of() instead of Collections.emptySet() to satisfy scalastyle,
and assert that create() strips every key in INTERNAL_METADATA_KEYS (except
the deliberately-surfaced column ID) rather than only METADATA_COL_ATTR_KEY.

Co-authored-by: Isaac