Skip to content

perf: [iceberg] reduce nativeIcebergScanMetadata serialization points#3243

Merged
mbutrovich merged 1 commit into
apache:mainfrom
mbutrovich:iceberg_metadata
Jan 22, 2026
Merged

perf: [iceberg] reduce nativeIcebergScanMetadata serialization points#3243
mbutrovich merged 1 commit into
apache:mainfrom
mbutrovich:iceberg_metadata

Conversation

@mbutrovich
Copy link
Copy Markdown
Contributor

@mbutrovich mbutrovich commented Jan 22, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

During Iceberg workloads, profiling revealed high GC time caused by Java deserialization of List$SerializationProxy objects (22.1 MiB allocations per task).

CometIcebergNativeScanMetadata contains heavyweight Iceberg objects (FileScanTasks list, table, schemas) with nested Scala collections. These were being included in equals()/hashCode() methods, causing Java serializer to recursively traverse the entire object graph during task serialization, creating allocation pressure.

The metadata is only needed during query planning to convert Iceberg metadata to protobuf. After conversion, all necessary information exists in the serialized nativeOp, making the metadata unnecessary.

What changes are included in this PR?

  1. Mark nativeIcebergScanMetadata as @transient in CometBatchScanExec to prevent serialization to executors
  2. Remove nativeIcebergScanMetadata from equals()/hashCode() in both CometBatchScanExec and CometIcebergNativeScanExec to prevent object graph traversal during plan comparison
  3. Set nativeIcebergScanMetadata = None in CometBatchScanExec.doCanonicalize() to free driver memory during plan optimization

Equivalence checking remains correct because wrappedScan.equals() already compares the underlying BatchScanExec with all its Iceberg metadata.

How are these changes tested?

Existing tests. I will try to benchmark it locally as well.

…bergNativeScanExec since it's redundant to the wrapped scan object and introduces a ton of serde overhead.
@mbutrovich mbutrovich marked this pull request as ready for review January 22, 2026 19:48
Copy link
Copy Markdown
Contributor

@hsiang-c hsiang-c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jan 22, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.04%. Comparing base (f09f8af) to head (8c89a44).
⚠️ Report is 865 commits behind head on main.

Files with missing lines Patch % Lines
...rg/apache/spark/sql/comet/CometBatchScanExec.scala 80.00% 0 Missing and 1 partial ⚠️
...e/spark/sql/comet/CometIcebergNativeScanExec.scala 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3243      +/-   ##
============================================
+ Coverage     56.12%   60.04%   +3.92%     
- Complexity      976     1428     +452     
============================================
  Files           119      170      +51     
  Lines         11743    15769    +4026     
  Branches       2251     2604     +353     
============================================
+ Hits           6591     9469    +2878     
- Misses         4012     4981     +969     
- Partials       1140     1319     +179     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mbutrovich
Copy link
Copy Markdown
Contributor Author

Running TPC-H SF10 locally I see a reduction in serialization calls. Hopefully that yields reduced GC pressure since I can't seem to reproduce that behavior on my machine with my workloads.

@mbutrovich mbutrovich requested a review from andygrove January 22, 2026 20:08
Copy link
Copy Markdown
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@mbutrovich
Copy link
Copy Markdown
Contributor Author

Upon thinking about this more, I don't think this is a huge win for executor memory pressure, maybe driver side. I suspect we'll need to do more work to figure out how to not serialize every partitions' FileScanTasks in the Comet native plan.

@mbutrovich mbutrovich merged commit 5e9a1ca into apache:main Jan 22, 2026
170 checks passed
@mbutrovich mbutrovich deleted the iceberg_metadata branch March 13, 2026 18:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants