Conversation
@vaibhavk1992 can you update the PR description?
the-other-tim-brown
left a comment
@vaibhavk1992 can you complete a self-review of the code first to clean up a bit?
Merged 7 upstream commits:
- f991e31 Parquet Source: snapshot sync fixes (apache#806)
- 4307565 Parquet source: column stats support (apache#805)
- 5c25674 Remove wildcard imports and enforce with spotless (apache#809)
- fe7215e add .sdkmanrc to .gitignore (apache#793)
- abbf4b7 fix(iceberg): nested comments (apache#797)
- 8e58367 Remove redundant getSnapshotAt calls (apache#791)
- 8cab6a2 fix(delta): avoid NPE for binary in map/array (apache#795)

Resolved conflicts:
- TestDeltaKernelSchemaExtractor.java: kept StructField import needed for new tests
Fixed wildcard imports in Delta Kernel test files to comply with spotless rules enforced in upstream commit 5c25674.
The spotless:apply command removed wildcard imports but didn't add back all necessary specific imports. Added missing imports:

TestDeltaKernelReadWriteIntegration.java:
- Static assertions (assertEquals, assertTrue, assertFalse, assertNotNull)
- java.util.* (Random, UUID, List, Map, Set, Arrays, Collections, etc.)

TestDeltaKernelSync.java:
- Static assertions (including fail)
- java.util.* (Random, UUID, List, Map, Set, Arrays, Collections, etc.)

TestDeltaKernelDataFileUpdatesExtractor.java:
- Static assertions (assertEquals, assertTrue, assertFalse, assertNotNull)
- java.util.* (List, Arrays, Collections)

All tests now compile successfully.
I have tried to address all the comments @vinishjail97 @the-other-tim-brown
  }
} catch (Exception e) {
  // Log and continue to next commit
  log.warn(
log.warn("...", version, e.getMessage()) swallows the stack trace. Pass the exception as the last argument instead: log.warn("Failed to parse commit metadata for version {}", version, e). On-call engineers debugging a production issue won't know where the parse failure originated.
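A self-contained illustration of the difference. This sketch uses java.util.logging (stdlib) so it runs standalone rather than the project's SLF4J logger; the principle is the same — pass the exception object, not its message, as the final argument:

```java
import java.io.ByteArrayOutputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

public class LogThrowableDemo {
    // Logs a parse failure and returns the formatted log output for inspection
    public static String logParseFailure(long version) {
        Logger log = Logger.getLogger("LogThrowableDemo");
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        StreamHandler handler = new StreamHandler(buf, new SimpleFormatter());
        log.setUseParentHandlers(false);
        log.addHandler(handler);
        try {
            throw new IllegalStateException("bad commit json");
        } catch (Exception e) {
            // Passing the exception object (not e.getMessage()) keeps the stack trace
            log.log(Level.WARNING, "Failed to parse commit metadata for version " + version, e);
        }
        handler.flush();
        return buf.toString();
    }

    public static void main(String[] args) {
        String out = logParseFailure(42);
        // The exception class and stack frames appear in the output
        System.out.println(out.contains("IllegalStateException"));
        System.out.println(out.contains("logParseFailure"));
    }
}
```

With `log.warn("...", version, e.getMessage())` only the message string survives; the class name and frame list above are exactly what on-call engineers lose.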
try {
  Table table = Table.forPath(engine, basePath);
  this.latestSchema = table.getLatestSnapshot(engine).getSchema();
} catch (Exception e) {
catch (Exception e) silently sets this.latestSchema = null. If the error is a network issue, permissions error, or anything other than "table doesn't exist", this will silently proceed with a null schema and cause an NPE in addColumn() at line 366 (latestSchema.add(field)). Should catch only the specific "table not found" exception from Delta Kernel, not the broad Exception.
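A minimal pure-Java sketch of the narrow-catch pattern. The TableNotFoundException here is a stand-in class defined locally, not Delta Kernel's actual exception type, and the boolean flags simulate the failure modes:

```java
public class NarrowCatchDemo {
    // Stand-in for Delta Kernel's table-not-found exception type
    static class TableNotFoundException extends RuntimeException {
        TableNotFoundException(String msg) { super(msg); }
    }

    // Only the "table does not exist" case yields a null schema;
    // network/permission failures propagate to the caller
    static String loadLatestSchema(boolean tableMissing, boolean networkDown) {
        try {
            if (networkDown) {
                throw new RuntimeException("connection refused");
            }
            if (tableMissing) {
                throw new TableNotFoundException("no _delta_log at path");
            }
            return "schema";
        } catch (TableNotFoundException e) {
            return null; // expected for a brand-new table
        }
    }

    public static void main(String[] args) {
        System.out.println(loadLatestSchema(false, false)); // schema
        System.out.println(loadLatestSchema(true, false));  // null
        try {
            loadLatestSchema(false, true);
        } catch (RuntimeException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

With the broad `catch (Exception e)` in the PR, the third case would also return null and surface later as an NPE far from the root cause.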
…Kernel integration
@vinishjail97 I have implemented all the above suggestions.
for (RowBackedAction action : actions) {

  if (action instanceof io.delta.kernel.internal.actions.AddFile) {
    io.delta.kernel.internal.actions.AddFile addFile =
Can these classes from the delta kernel be imported?
DeltaKernelDataFileUpdatesExtractor.builder()
    .engine(engine)
    .basePath(targetTable.getBasePath())
    // Column statistics are not needed for conversion operations
When we convert from one format to another, we actually do want the statistics. This allows query engines to leverage them for their planning operations for improved efficiency.
if (action instanceof AddFile) {
  AddFile addFile = (AddFile) action;
  Row wrappedRow =
      io.delta.kernel.internal.actions.SingleAction.createAddFileSingleAction(
Let's start using imports throughout the PR please. Do a sanity check of the files and make sure you are using them throughout. Highlighting every line leads to a lot of noise on the reviews.
public static boolean tableExists(Engine engine, String basePath) {
  try {
    Table table = Table.forPath(engine, basePath);
    table.getLatestSnapshot(engine);
This looks like it will load the snapshot, is there a more lightweight way to do this?
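One lighter-weight option, sketched below for local filesystems only, is to probe for a non-empty _delta_log directory instead of replaying the log. This is an assumption about how existence could be checked, not the Kernel API's own mechanism, and object-store paths would need a filesystem abstraction such as Hadoop's FileSystem instead of java.io.File:

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;

public class DeltaLogProbeDemo {
    // A Delta table's presence is signalled by a non-empty _delta_log directory;
    // checking for it avoids a full snapshot/log replay. Local-filesystem only.
    public static boolean deltaLogExists(String basePath) {
        File log = new File(basePath, "_delta_log");
        File[] entries = log.listFiles();
        return log.isDirectory() && entries != null && entries.length > 0;
    }

    public static void main(String[] args) throws Exception {
        Path table = Files.createTempDirectory("tbl");
        System.out.println(deltaLogExists(table.toString())); // false: no log yet
        Path logDir = Files.createDirectories(table.resolve("_delta_log"));
        Files.createFile(logDir.resolve("00000000000000000000.json"));
        System.out.println(deltaLogExists(table.toString())); // true
    }
}
```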
DeltaKernelSchemaExtractor.getInstance().toInternalSchema(structRepresentation));
}

// ========== Tests for fromInternalSchema() - New Tests ==========
Remove this comment for New Tests?
// ========== Tests for fromInternalSchema() - New Tests ==========

@Test
Let's have tests with nested fields, lists and maps as well
// Verify we have AddFile actions
boolean hasAddFile = actionList.stream().anyMatch(action -> action instanceof AddFile);
assertTrue(hasAddFile, "Should contain AddFile actions");
Can we assert on the content of the AddFile to make sure it is aligned with our expectations?
actionList.stream().filter(action -> action instanceof RemoveFile).count();

// Verify: Should have AddFile for file3 (new file)
assertTrue(addFileCount >= 1, "Should have at least 1 AddFile action for new file (file3)");
These counts should be strict. Only 1 file is expected to be added and 1 removed
… type tests
- Enhanced TestDeltaKernelDataFileUpdatesExtractor with detailed AddFile content assertions
- Added strict count verification (== instead of >=) for differential sync tests
- Fixed path format inconsistency (Hadoop URI vs plain string) in test files
- Added 3 comprehensive tests for fromInternalSchema: nested records, lists, and maps
- Simplified test code by inlining nested schema builds with clear structural comments
- Fixed applyDiff signature and improved DeltaKernelDataFileUpdatesExtractor
- Added DeltaKernelUtils.tableExists helper method
- All 19 tests passing (16 schema + 3 data file updater tests)

Test coverage now includes:
- Multi-level nested structures (3 levels deep)
- Lists of primitives and complex types
- Maps of primitives and complex types
- Round-trip conversions with complex types
- Strict assertions on AddFile/RemoveFile actions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
if (!tableExists) {
  File tableDir = new File(basePath);
  if (!tableDir.exists()) {
    tableDir.mkdirs();
new File(basePath).mkdirs() only works for local filesystem paths. For S3/GCS/Azure/HDFS paths (e.g., s3://bucket/path), this silently fails or creates nonsensical local directories.
Consider removing this block — Table.forPath() with CREATE_TABLE operation should handle table directory creation. Or use Hadoop FileSystem.mkdirs() if explicit creation is needed.
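If explicit creation must stay, a guard like the following makes the local-only limitation of java.io.File visible; isLocalPath is a hypothetical helper illustrating the scheme check, not existing project code:

```java
import java.net.URI;

public class LocalPathCheckDemo {
    // java.io.File only understands local paths; object-store URIs must go
    // through a filesystem abstraction such as Hadoop's FileSystem.mkdirs()
    public static boolean isLocalPath(String basePath) {
        String scheme = URI.create(basePath).getScheme();
        return scheme == null || scheme.equals("file");
    }

    public static void main(String[] args) {
        System.out.println(isLocalPath("/tmp/warehouse/table"));        // true
        System.out.println(isLocalPath("file:///tmp/warehouse/table")); // true
        System.out.println(isLocalPath("s3://bucket/warehouse/table")); // false
    }
}
```

Without such a guard, `new File("s3://bucket/path").mkdirs()` quietly creates a local directory literally named "s3:" under the working directory.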
RemoveFile removeFile =
    new RemoveFile(addFile.toRemoveFileRow(false, Optional.of(snapshot.getVersion())));
String fullPath =
    DeltaKernelActionsConverter.getFullPathToFile(removeFile.getPath(), table);
Performance concern: DeltaKernelActionsConverter.getFullPathToFile() creates a new Configuration + DefaultEngine per call. In applySnapshot, this is invoked per-file in the existing snapshot — potentially thousands of times. Each new Configuration() scans the classpath.
Consider accepting an Engine parameter, or computing the full path directly from basePath + removeFile.getPath() without constructing a new engine each time.
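A sketch of the second suggestion: resolve the path with plain string/URI logic instead of constructing a Configuration and Engine per call. fullPathToFile is a hypothetical helper, and it skips any URL-decoding the real converter may perform:

```java
import java.net.URI;

public class FullPathDemo {
    // Resolve an action's (possibly relative) file path against the table base
    // path without any per-call engine or Hadoop Configuration construction
    public static String fullPathToFile(String basePath, String filePath) {
        if (URI.create(filePath).isAbsolute() || filePath.startsWith("/")) {
            return filePath; // already absolute, pass through
        }
        return basePath.endsWith("/") ? basePath + filePath : basePath + "/" + filePath;
    }

    public static void main(String[] args) {
        // relative path resolved against the base
        System.out.println(fullPathToFile("s3://bucket/table", "part-0001.parquet"));
        // absolute path (e.g. shallow-cloned file) passes through unchanged
        System.out.println(fullPathToFile("s3://bucket/table", "s3://other/part-0002.parquet"));
    }
}
```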
try {
  Table table = Table.forPath(engine, basePath);
  this.latestSchema = table.getLatestSnapshot(engine).getSchema();
Multiple full snapshot loads per sync cycle: the constructor loads the snapshot here for schema, commitTransaction() calls checkTableExists() which loads another snapshot (line 470 -> DeltaKernelUtils.tableExists), and syncFilesForSnapshot/syncFilesForDiff load yet another (lines 268, 280). That's 3-4 full log replays per sync.
Consider caching the snapshot (or at least the tableExists result) within a TransactionState lifecycle.
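A minimal memoization sketch of the caching idea. TransactionState and the supplier are hypothetical stand-ins for the real snapshot-loading call; the point is that the expensive load runs at most once per sync cycle:

```java
import java.util.function.Supplier;

public class MemoizedExistsDemo {
    // Caches the result of an expensive existence check (a full snapshot load)
    // for the lifetime of one transaction/sync cycle
    static class TransactionState {
        private final Supplier<Boolean> expensiveCheck;
        private Boolean tableExists; // null until first use

        TransactionState(Supplier<Boolean> expensiveCheck) {
            this.expensiveCheck = expensiveCheck;
        }

        boolean tableExists() {
            if (tableExists == null) {
                tableExists = expensiveCheck.get(); // snapshot loaded at most once
            }
            return tableExists;
        }
    }

    public static void main(String[] args) {
        int[] loads = {0};
        TransactionState state = new TransactionState(() -> { loads[0]++; return true; });
        state.tableExists();
        state.tableExists();
        state.tableExists();
        System.out.println(loads[0]); // 1: the expensive check ran only once
    }
}
```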
return DeltaKernelUtils.tableExists(engine, basePath);
}

private Map<String, String> getConfigurationsForDeltaSync() {
getConfigurationsForDeltaSync() does not set minReaderVersion/minWriterVersion, unlike the existing DeltaConversionTarget. Protocol versions default to Delta Kernel defaults, which may not match features used (e.g., generated columns require writer version 4). Is this intentional?
private final Engine engine;
private final String basePath;
private final boolean includeColumnStats;
includeColumnStats field is declared but never read — createAddFileAction always passes Optional.empty() for stats (line 177). Either implement stats conversion or remove the dead field. If deferred, add a // TODO(issue-link) so it doesn't get forgotten.
Optional.empty(), // tags
Optional.empty(), // baseRowId
Optional.empty(), // defaultRowCommitVersion
Optional.empty() // stats - TODO: convert column stats to DataFileStatistics
Stats are always Optional.empty(). The synced Delta table will have no column statistics on AddFile entries, degrading data skipping for downstream readers (Spark, Trino). Is there a tracking issue for adding stats support?
@@ -265,22 +265,27 @@ private void collectUnsupportedStats(Map<String, Object> additionalStats) {
*/
This class is ~300 lines of near-identical duplication from DeltaStatsExtractor in the Standalone package. The only material difference is the AddFile import. Consider extracting shared stats serialization logic (convertStatsToDeltaFormat, insertValueAtPath, flattenStatMap, DeltaStats) into a common utility to avoid maintaining two copies.
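Of the helpers named above, flattenStatMap is the most clearly format-agnostic. A hedged sketch of what a shared version might look like; the signature and behavior here are assumptions for illustration, not the existing implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatsFlattenDemo {
    // Flattens nested per-column stats (e.g. minValues for struct columns)
    // into dotted keys; it depends on neither Standalone nor Kernel types,
    // so both extractors could share one copy
    @SuppressWarnings("unchecked")
    public static Map<String, Object> flattenStatMap(String prefix, Map<String, Object> stats) {
        Map<String, Object> flat = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : stats.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map) {
                flat.putAll(flattenStatMap(key, (Map<String, Object>) e.getValue()));
            } else {
                flat.put(key, e.getValue());
            }
        }
        return flat;
    }

    public static void main(String[] args) {
        Map<String, Object> nested = new LinkedHashMap<>();
        nested.put("id", 1);
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("zip", 94105);
        nested.put("address", address);
        System.out.println(flattenStatMap("", nested)); // {id=1, address.zip=94105}
    }
}
```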
// Scan all files
ScanImpl scan = (ScanImpl) snapshot.getScanBuilder().build();
CloseableIterator<FilteredColumnarBatch> scanFiles = scan.getScanFiles(engine, false);
validateDeltaTable uses CloseableIterator<FilteredColumnarBatch> (line 440) and CloseableIterator<Row> (line 449) that are never closed. Use try-with-resources to prevent resource leaks:
try (CloseableIterator<FilteredColumnarBatch> scanFiles = scan.getScanFiles(engine, false)) {
  // ...
}

assertTrue(
    addFile.getPath().contains("test_data.parquet"),
    "AddFile path should contain the test file name");
assertTrue(addFile.getSize() > 0, "AddFile size should be greater than 0");
nit: This asserts addFile.getSize() > 0 but the test creates expectedFileSize = 1024L (line 128). Use assertEquals(1024L, addFile.getSize()) to validate size propagation — otherwise a wrong size still passes.
// Verify schema
InternalSchema readSchema = readTable.getReadSchema();
assertNotNull(readSchema);
assertEquals(schema.getFields().size(), readSchema.getFields().size());
nit: This only checks field count, not content. Use assertEquals(schema, readSchema) for full structural comparison — otherwise schema corruption that preserves field count goes undetected.
What is the purpose of the pull request
This PR migrates XTable's Delta Lake integration from Delta Standalone to Delta Kernel for writers.
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
(example:)