Acceleration : Iceberg table compaction [iceberg] by Shekharrajak · Pull Request #3519 · apache/datafusion-comet

Shekharrajak · 2026-02-14T09:45:14Z

Which issue does this PR close?

PR Description

Rationale for this change

Iceberg table compaction using Spark's default rewriteDataFiles() action is slow due to Spark shuffle and task scheduling overhead. This PR adds native Rust-based compaction using DataFusion for direct Parquet read/write, achieving 1.5-1.8x speedup over Spark's default compaction.

What changes are included in this PR?

Native Rust compaction: DataFusion-based Parquet read/write via JNI (iceberg_compaction_jni.rs)
Scala integration: CometNativeCompaction class that runs the native scan and write via JNI, then commits the result through the Iceberg Java API
Configuration: spark.comet.iceberg.compaction.enabled config option
Benchmark: TPC-H-based compaction benchmark comparing Spark and native performance

How are these changes tested?

Unit tests in CometIcebergCompactionSuite covering:
- Non-partitioned table compaction
- Partitioned table compaction (bucket, truncate, date partitions)
- Data correctness verification after compaction
TPC-H benchmark (CometIcebergTPCCompactionBenchmark) measuring performance on lineitem, orders, customer tables
Manual testing with SF1 TPC-H data showing:
- lineitem (6M rows): 7.2s → 4.4s (1.6x)
- orders (1.5M rows): 1.5s → 0.9s (1.8x)

Shekharrajak · 2026-02-14T09:48:46Z

+// under the License.
+
+//! Iceberg Parquet writer operator for writing RecordBatches to Parquet files
+//! with Iceberg-compatible metadata (DataFile structures).


DataFusion execution operator that writes Arrow RecordBatches to Parquet files with Iceberg-compatible metadata.

It enables native Rust to produce files that Iceberg's Java API can directly commit.
Metadata is serialized as JSON and passed back to JVM via JNI for commit.
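
As a rough illustration of the "metadata serialized as JSON" step, the sketch below hand-rolls the serialization with the standard library only. The struct `DataFileMeta`, its three fields, and the function `data_files_to_json` are assumptions made for this example; the PR's actual types and JSON layout may differ (the PR's config struct derives serde traits, which are left out here so the block needs no external crates).

```rust
/// Hypothetical, simplified stand-in for the per-file metadata a native
/// writer could report back to the JVM. Not the PR's actual type.
#[derive(Debug, Clone, PartialEq)]
pub struct DataFileMeta {
    pub file_path: String,
    pub record_count: u64,
    pub file_size_in_bytes: u64,
}

/// Escape the characters that must be escaped inside a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Serialize the list of written files as a JSON array of objects.
pub fn data_files_to_json(files: &[DataFileMeta]) -> String {
    let items: Vec<String> = files
        .iter()
        .map(|f| {
            format!(
                "{{\"file_path\":\"{}\",\"record_count\":{},\"file_size_in_bytes\":{}}}",
                escape_json(&f.file_path),
                f.record_count,
                f.file_size_in_bytes
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

A string like this could be returned across JNI and parsed on the JVM side before the commit; a real implementation would more likely use a JSON library than manual formatting.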

Shekharrajak · 2026-02-14T09:49:51Z

+
+//! JNI bridge for Iceberg compaction operations.
+//!
+//! This module provides JNI functions for native Iceberg compaction (scan + write).


JNI bridge that exposes native Rust compaction to Scala/JVM.

executeIcebergCompaction(): JNI entry point that reads Parquet files via DataFusion and writes the compacted output

Shekharrajak · 2026-02-14T09:50:49Z

+
+/// Configuration for Iceberg table metadata passed from JVM
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct IcebergTableConfig {


Table metadata from JVM (identifier, warehouse, snapshot ID, file IO props)

Shekharrajak · 2026-02-14T09:51:50Z

+
+    logDebug(s"Executing native compaction with config: $configJson")
+
+    val resultJson = native.executeIcebergCompaction(configJson)


JNI entry point - reads Parquet files via DataFusion, writes compacted output

Shekharrajak · 2026-02-14T09:52:06Z

+
+  def isAvailable: Boolean = {
+    try {
+      val version = new Native().getIcebergCompactionVersion()


Returns native library version for compatibility checks

Shekharrajak · 2026-02-14T09:56:07Z

-          "Iceberg reflection failure: Failed to get filter expressions from SparkScan: " +
-            s"${e.getMessage}")
-        None
+    findMethodInHierarchy(scan.getClass, "filterExpressions").flatMap { filterExpressionsMethod =>


Previously we assumed a fixed Iceberg class hierarchy. findMethodInHierarchy instead walks up the class tree to locate the method, which is a more robust approach.

For compaction to work, we need to extract FileScanTask objects from the scan. Different Iceberg scan types expose tasks differently:

SparkBatchQueryScan -> tasks() method
SparkStagedScan -> taskGroups() method (returns groups, need to extract tasks from each)
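
The difference between the two scan shapes can be sketched as follows. This is only a model in Rust for illustration: the PR does this on the Scala side via reflection, and the `Scan` enum, the simplified `FileScanTask` struct, and `extract_tasks` are hypothetical names introduced here.

```rust
/// Hypothetical stand-in for an Iceberg FileScanTask: only the file path.
#[derive(Debug, Clone, PartialEq)]
pub struct FileScanTask {
    pub path: String,
}

/// Illustrative model of the two scan shapes described above.
pub enum Scan {
    /// Like SparkBatchQueryScan: exposes a flat list of tasks.
    Batch(Vec<FileScanTask>),
    /// Like SparkStagedScan: exposes groups, each holding its own tasks.
    Staged(Vec<Vec<FileScanTask>>),
}

/// Return every task in the scan as one flat list, whatever the scan shape.
pub fn extract_tasks(scan: &Scan) -> Vec<FileScanTask> {
    match scan {
        Scan::Batch(tasks) => tasks.clone(),
        // Groups are flattened in order, preserving task order within each group.
        Scan::Staged(groups) => groups.iter().flatten().cloned().collect(),
    }
}
```

Either way the caller ends up with the same flat task list, which is what the compaction planner needs.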

mbutrovich · 2026-02-16T16:15:22Z

Interesting PR, thanks @Shekharrajak! To help start the review process, could you:

1. Provide a high level architecture diagram?
2. Explain where the performance benefit comes from, and why is it so much faster to pass batches over this JNI interface than the existing interface?

As part of (1), maybe we can find a way to break this down into smaller PRs.

Shekharrajak · 2026-02-16T17:35:40Z

Provide a high level architecture diagram?

Reference for the rewrite commit API: apache/iceberg-rust#2106. In this PR the commit happens in the JVM; future PRs can move it to native code as well.

Explain where the performance benefit comes from, and why is it so much faster to pass batches over this JNI interface than the existing interface?

Compaction is about reading small files and writing back larger ones, so it is I/O-intensive work.

Doing the read and write in Rust improves performance: the entire I/O pipeline (Parquet read -> Arrow RecordBatch -> Parquet write) runs in Rust, reading and writing Parquet through the same Arrow memory layout. This eliminates the whole Spark orchestration layer rather than just replacing individual operators within it.
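
The read-then-rewrite shape of that pipeline can be sketched with plain vectors standing in for Parquet files and Arrow record batches. The function `compact` and its `target_rows` parameter are hypothetical and exist only for this illustration; the PR's pipeline operates on real Parquet data through DataFusion.

```rust
/// Stream rows from many small inputs into outputs holding at most
/// `target_rows` rows each. Plain vectors stand in for files and batches;
/// this is an illustration of the data flow, not the PR's code.
pub fn compact(inputs: &[Vec<i64>], target_rows: usize) -> Vec<Vec<i64>> {
    assert!(target_rows > 0, "target_rows must be positive");
    let mut outputs = Vec::new();
    let mut current = Vec::with_capacity(target_rows);
    for row in inputs.iter().flatten() {
        current.push(*row);
        if current.len() == target_rows {
            // Roll over to a new output once the current one is full.
            outputs.push(std::mem::replace(
                &mut current,
                Vec::with_capacity(target_rows),
            ));
        }
    }
    // Flush whatever is left as a final, possibly smaller, output.
    if !current.is_empty() {
        outputs.push(current);
    }
    outputs
}
```

Rows move from input to output in a single pass, with no shuffle or intermediate representation in between, which is the property the comment above attributes the speedup to.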

Shekharrajak · 2026-02-16T17:50:55Z

+      // Measure Spark compaction (single run - compaction is destructive)
+      val sparkStart = System.nanoTime()
+      val sparkTable = Spark3Util.loadIcebergTable(spark, icebergTableName)
+      SparkActions.get(spark).rewriteDataFiles(sparkTable).binPack().execute()


default Spark action API

Shekharrajak · 2026-02-16T17:56:17Z

+      file_io_properties = fileIOProperties)
+  }
+
+  /** Plan file groups using bin-pack strategy. */


Right now, we use the bin-pack strategy to plan small files into groups.
We can extend this to other runners and planners later.
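
For readers unfamiliar with bin-packing, here is a minimal first-fit sketch of grouping files by size. It is not Iceberg's planner or the code in this PR: the function `bin_pack`, the `target_bytes` parameter, and the first-fit rule are assumptions chosen to keep the example short.

```rust
/// Group file sizes (in bytes) with a first-fit rule: each file goes into
/// the first group that still has room under `target_bytes`, otherwise it
/// starts a new group. A sketch only, not Iceberg's actual planner.
pub fn bin_pack(file_sizes: &[u64], target_bytes: u64) -> Vec<Vec<u64>> {
    let mut groups: Vec<Vec<u64>> = Vec::new();
    let mut totals: Vec<u64> = Vec::new();
    for &size in file_sizes {
        // Find the first group that can take this file without exceeding the target.
        match totals
            .iter()
            .position(|&t| t.saturating_add(size) <= target_bytes)
        {
            Some(i) => {
                groups[i].push(size);
                totals[i] += size;
            }
            None => {
                // No group has room, or the file alone exceeds the target.
                groups.push(vec![size]);
                totals.push(size);
            }
        }
    }
    groups
}
```

Each resulting group would then be rewritten into one larger file; a file bigger than the target simply ends up in a group of its own.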

Shekharrajak · 2026-02-17T04:33:15Z

+ * Integration tests for CALL rewrite_data_files() procedure intercepted by CometCompactionRule.
+ * Verifies that the SQL procedure path routes through native compaction when enabled.
+ */
+class CometIcebergCompactionProcedureSuite extends CometTestBase {


Validates invoking the rewrite API through the SQL procedure for Iceberg table compaction.

github-actions · 2026-04-27T02:19:14Z

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

Shekharrajak added 4 commits February 14, 2026 14:51

feat: add native Rust Iceberg compaction with DataFusion and JNI bridge

b7c7559

feat: add Scala JNI interface and CometNativeCompaction for Iceberg

9c01c57

feat: add COMET_ICEBERG_COMPACTION_ENABLED config option

80051d0

test: add Iceberg compaction unit tests and TPC-H benchmark

9dca0f3

Shekharrajak changed the title ~~Feature/iceberg compaction benchmark~~ Iceberg table compaction Feb 14, 2026

Shekharrajak commented Feb 14, 2026

Shekharrajak added 2 commits February 14, 2026 15:35

test: add comprehensive Iceberg compaction tests for partitions, schema evolution, nested types

1df0011

test: add file count validation and Spark vs Native comparison test

ad88f6e

Shekharrajak changed the title ~~Iceberg table compaction~~ Acceleration : Iceberg table compaction Feb 14, 2026

fix: scalastyle errors - remove unused imports and unnecessary string interpolation

9573823

Shekharrajak force-pushed the feature/iceberg-compaction-benchmark branch from 08ea1bf to 961ac46 Compare February 16, 2026 04:48

Shekharrajak added 2 commits February 16, 2026 22:52

fix: Scala 2.13 compilation errors and unused parameter warnings

b9b015c

fix: Scala 2.13 compilation errors and unused parameter warnings

326e6cc

Shekharrajak force-pushed the feature/iceberg-compaction-benchmark branch from ba454b4 to 326e6cc Compare February 16, 2026 17:23

Merge upstream/main into feature/iceberg-compaction-benchmark

5435d7b

Shekharrajak commented Feb 16, 2026

Shekharrajak added 4 commits February 17, 2026 00:36

feat: move CometNativeCompaction to main scope with provided Iceberg dep

aa78c26

fix: add enforcer ignore for Iceberg uber jar duplicate classes

23a8dd2

feat: add CometCompactionRule to intercept CALL rewrite_data_files

38095c3

test: add integration tests for CALL rewrite_data_files procedure

e2e9c33

Shekharrajak commented Feb 17, 2026

mbutrovich changed the title ~~Acceleration : Iceberg table compaction~~ Acceleration : Iceberg table compaction [iceberg] Feb 24, 2026

Shekharrajak added 2 commits February 25, 2026 11:30

Merge upstream/main into feature/iceberg-compaction-benchmark

3af0b12

fix(build): add enforcer ignores for iceberg-spark-runtime shaded classes

1baacaa

Shekharrajak force-pushed the feature/iceberg-compaction-benchmark branch from 414961a to 1baacaa Compare February 25, 2026 06:08

Merge branch 'main' into feature/iceberg-compaction-benchmark

9f53ca9

github-actions Bot added the Stale label Apr 27, 2026

github-actions Bot closed this May 4, 2026

Shekharrajak wants to merge 17 commits into apache:main from Shekharrajak:feature/iceberg-compaction-benchmark
