
[SPARK-56346][SQL] Use PartitionPredicate in DSV2 Metadata Only Delete #55179

Closed
szehon-ho wants to merge 3 commits into apache:master from szehon-ho:delete_partition_filter

Conversation


@szehon-ho szehon-ho commented Apr 2, 2026

What changes were proposed in this pull request?

When OptimizeMetadataOnlyDeleteFromTable fails to translate all delete predicates to standard V2 filters, it now falls back to a second pass that converts partition-column filters to PartitionPredicates (reusing the SPARK-55596 infrastructure), translates any remaining data-column filters to standard V2 predicates, and combines them for table.canDeleteWhere. This mirrors the two-pass approach already used for scan filter pushdown in PushDownUtils.pushPartitionPredicates.
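The two-pass flow can be sketched in simplified form. This is an illustration only: `Filter`, `V2Pred`, and `PartPred` are made-up stand-ins for Spark's `Expression`, `Predicate`, and `PartitionPredicate`, not the rule's actual code.

```scala
sealed trait Filter
case class PartitionFilter(col: String, translatable: Boolean) extends Filter
case class DataFilter(col: String, translatable: Boolean) extends Filter

case class V2Pred(col: String)
case class PartPred(col: String)

// First pass: succeed only if every delete filter translates to a standard V2 predicate.
def firstPass(filters: Seq[Filter]): Option[Seq[V2Pred]] =
  if (filters.forall {
    case PartitionFilter(_, t) => t
    case DataFilter(_, t)      => t
  }) Some(filters.map {
    case PartitionFilter(c, _) => V2Pred(c)
    case DataFilter(c, _)      => V2Pred(c)
  }) else None

// Second pass: partition filters become PartitionPredicates regardless of V2
// translatability; the remaining data filters must all translate, or the whole
// pass fails and the rule falls back to a row-level delete.
def secondPass(filters: Seq[Filter]): Option[(Seq[PartPred], Seq[V2Pred])] = {
  val (parts, datas) = filters.partition(_.isInstanceOf[PartitionFilter])
  val partPreds = parts.collect { case PartitionFilter(c, _) => PartPred(c) }
  if (datas.forall { case DataFilter(_, t) => t; case _ => false })
    Some((partPreds, datas.collect { case DataFilter(c, _) => V2Pred(c) }))
  else None
}
```

In this sketch, a delete whose partition filter is untranslatable fails the first pass but still succeeds in the second, which is exactly the case the PR enables.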

Why are the changes needed?

Currently, OptimizeMetadataOnlyDeleteFromTable only attempts to translate all delete predicates to standard V2 filters. If any predicate cannot be translated (e.g. complex expressions on partition columns), the optimization falls back to an expensive row-level delete even though the table could accept the predicates as PartitionPredicates. This change enables the metadata-only delete path in more cases by leveraging the PartitionPredicate infrastructure introduced in SPARK-55596.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New test suite DataSourceV2EnhancedDeleteFilterSuite with 9 test cases covering: first-pass accept, second-pass accept/reject, mixed partition+data filters, UDF on non-contiguous partition columns, multiple PartitionPredicates, and row-level fallback. Existing suites verified for no regressions: DataSourceV2EnhancedPartitionFilterSuite, GroupBasedDeleteFromTableSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Cursor agent mode)

@szehon-ho force-pushed the delete_partition_filter branch from cb0ff92 to 33d100e on April 3, 2026 at 00:05
/**
 * Evaluates a single V2 predicate by resolving column values through the
 * given function. Supports =, <=>, IS_NULL, IS_NOT_NULL, and ALWAYS_TRUE.
 */
def evalPredicate(
Member Author:
Just a refactor for re-use in the new test InMemoryTable.
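For reference, the evaluator's contract can be sketched with simplified types. `SimplePred` is a hypothetical ADT standing in for Spark's V2 `Predicate`, and `resolve` returns `None` for a null or absent column; the real `evalPredicate` works on the actual connector API.

```scala
sealed trait SimplePred
case class EqualTo(col: String, value: Any) extends SimplePred       // =
case class EqualNullSafe(col: String, value: Any) extends SimplePred // <=>
case class IsNull(col: String) extends SimplePred
case class IsNotNull(col: String) extends SimplePred
case object AlwaysTrue extends SimplePred

// Evaluate one predicate against a column-resolution function.
def evalPredicate(pred: SimplePred, resolve: String => Option[Any]): Boolean = pred match {
  case EqualTo(c, v)       => resolve(c).contains(v)  // null never equals
  case EqualNullSafe(c, v) => resolve(c) == Option(v) // null <=> null is true
  case IsNull(c)           => resolve(c).isEmpty
  case IsNotNull(c)        => resolve(c).isDefined
  case AlwaysTrue          => true
}
```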

}

/**
* Separates partition filters from data filters and converts pushable partition
Member Author:
Again, a refactor for re-use in OptimizeMetadataOnlyDeleteQuery.

* Returns a map from flattened expression to original.
*/
private def normalizeNestedPartitionFilters(
private[v2] def flattenNestedPartitionFilters(
Member Author:
Renamed, because 'normalize' is already used in OptimizeMetadataOnlyDelete.

@cloud-fan cloud-fan left a comment

Summary

Prior state and problem: OptimizeMetadataOnlyDeleteFromTable could only perform metadata-only deletes when all filter expressions translated to standard V2 predicates (e.g., =, <=>, IS_NULL). Filters like IN, STARTS_WITH, or UDFs on partition columns caused a fallback to expensive row-level operations even though the table might accept PartitionPredicates.

Design approach: Add a second-pass fallback in the delete optimization rule that mirrors the existing two-pass approach in PushDownUtils.pushPartitionPredicates for scan filter pushdown. When the first pass (V2 translation) fails or is rejected, the second pass:

  1. Separates filters into partition-column and data-column categories
  2. Converts partition filters to PartitionPredicates via PartitionPredicateImpl
  3. Translates remaining data filters to standard V2 predicates
  4. Combines both and calls table.canDeleteWhere

Key design decisions:

  • No supportsIterativePushdown gate for the delete path (the scan path has one). This is intentional — canDeleteWhere already serves as the acceptance gate, and the supportsIterativePushdown opt-in is specific to ScanBuilder.
  • All-or-nothing semantics: if any remaining data filter can't translate to V2, the entire second pass fails and falls back to row-level. This differs from the scan path (which returns remaining filters for post-scan evaluation) because metadata-only deletes require complete filter acceptance.
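The all-or-nothing behavior amounts to sequencing the per-expression translations, where any single failure aborts the pass. A minimal sketch, with `translateOne` as a stand-in for the real expression-to-V2 translation:

```scala
// Translate every expression or fail the whole batch: one untranslatable
// expression yields None, which triggers the row-level fallback.
def translateAll[A, B](exprs: Seq[A])(translateOne: A => Option[B]): Option[Seq[B]] = {
  val translated = exprs.map(translateOne)
  if (translated.forall(_.isDefined)) Some(translated.flatten) else None
}
```

The scan path instead keeps the untranslated remainder for post-scan evaluation, which is why the two paths diverge here.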

Implementation sketch: OptimizeMetadataOnlyDeleteFromTable.apply → first tries tryTranslateToV2 (standard V2 path), on failure → tryDeleteWithPartitionPredicates (second pass via shared PushDownUtils.createPartitionPredicates and flattenNestedPartitionFilters), on failure → row-level plan. The shared methods were extracted from the existing pushPartitionPredicates and made package-private for reuse.

General comments

  • The logDebug message on the original first-pass success path was removed. With three possible outcomes now (first-pass V2, second-pass partition predicates, row-level fallback), adding logDebug for each path would help with debugging filter pushdown behavior.

}
if (fields.length == transforms.length) {
Some(fields.toSeq)
Some(fields.toSeq).filter(_.nonEmpty)
Contributor:

This .filter(_.nonEmpty) guard is redundant: the outer check at line 139 guarantees transforms.nonEmpty, and fields.length == transforms.length at line 151 ensures fields is non-empty.

Suggested change
Some(fields.toSeq).filter(_.nonEmpty)
Some(fields.toSeq)

candidateKeys
}

// Handle data predicates (simulate data source with data column statistics)
Contributor:

The comment says "data column statistics" but the code evaluates predicates row-by-row, not via statistics.

Suggested change
// Handle data predicates (simulate data source with data column statistics)
// Handle data predicates (simulate a data source applying row-level data filters)

Member Author:
Hm, yeah, I thought about this too. I think 'simulate a data source with data column statistics' is more accurate. The test table goes row-by-row, but it's a simulation.

This is for the use-case where a data-column predicate can still be completely handled by the data source, for example via Iceberg min-max statistics: delete from table t where x < 10, and all Iceberg data files have a max value less than 10.

Of course the InMemoryPartitionPredicateDeleteTable is not structured that way, so it's a 'simulation'. It'd be a bit overkill to implement min/max stats to make it a more accurate simulation; maybe in a follow-up.

By the way, this pushdown is not about PartitionPredicate but the existing case; I just wanted to add a test that mixes the two.
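For what it's worth, the min/max idea can be illustrated with a toy sketch; `FileStats` is made up here, and Iceberg's real per-file metadata is much richer.

```scala
// Toy per-file statistics, standing in for a data source's column min/max metadata.
case class FileStats(min: Int, max: Int)

// A predicate `x < bound` holds for every row of a file exactly when the
// file's max is below the bound, so the file can be dropped wholesale and
// the delete stays metadata-only.
def canDeleteWholeFiles(files: Seq[FileStats], bound: Int): Boolean =
  files.forall(_.max < bound)
```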

@cloud-fan
Contributor

LGTM if CI is green

@peter-toth
Contributor

@szehon-ho , can you please update the PR description to follow the template?

@peter-toth peter-toth left a comment

Just a nit and CI seems to fail with a linter issue, but LGTM.


/** Translates all expressions to V2 filters, or returns [[None]] if any fail. */
private def tryTranslateToV2(predicates: Seq[Expression]): Option[Array[Predicate]] = {
val filters = toDataSourceV2Filters(predicates)
Contributor:

Nit: this seems to be the only call site of toDataSourceV2Filters(), so you can probably combine them.

Member Author:

Good catch, thanks

@szehon-ho
Member Author

Done, thanks!

When `OptimizeMetadataOnlyDeleteFromTable` fails to push standard V2 predicates for a metadata-only delete, it now falls back to a second pass that converts partition-column filters to `PartitionPredicate`s (SPARK-55596) and combines them with translated V2 data filters.
- Remove redundant .filter(_.nonEmpty) guard in PushDownUtils
- Fix misleading comment in InMemoryPartitionPredicateDeleteTable
- Add logDebug for each delete optimization outcome path
@szehon-ho force-pushed the delete_partition_filter branch from 379b162 to 75e1f51 on April 8, 2026 at 00:58
@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 590b0d5 on Apr 8, 2026