
Commit df7614a

Authored by xinlian12 and Copilot
[SparkConnector][No Review] Fix NoClassDefFoundError for MetadataVersionUtil (#48837)

* Fix NoClassDefFoundError for MetadataVersionUtil in Cosmos Spark connector

  Inline the version validation logic in ChangeFeedInitialOffsetWriter instead of depending on the Spark-internal MetadataVersionUtil, which has been relocated in Databricks Runtime 17.3 LTS (Spark 4.0).

* Add unit tests for inlined validateVersion logic

  Add ChangeFeedInitialOffsetWriterSpec with tests covering:
  - Valid version strings within the supported range
  - A version exceeding the max supported (UnsupportedLogVersion)
  - Malformed versions: non-numeric, empty, missing v prefix, v0, negative, bare v

  Widen companion object visibility to private[spark] for testability.

* Add change feed micro-batch streaming scenarios to Databricks live test notebooks

  Add structured streaming scenarios using cosmos.oltp.changeFeed to both the basicScenario.scala and basicScenarioAadManagedIdentity.scala notebooks. These scenarios exercise the ChangeFeedInitialOffsetWriter and HDFSMetadataLog code paths that can break on certain Spark distributions (e.g. Databricks Runtime 17.3+). Each scenario:
  - Creates a sink container
  - Reads the change feed from the source via readStream with micro-batching
  - Writes to the sink container via writeStream
  - Validates that records were copied
  - Cleans up both containers

* Fix change feed streaming checkpoint path in Databricks notebooks

  Use file:/tmp/ instead of /tmp/ for the checkpoint location to avoid DBFS access issues on Unity Catalog-enabled Databricks clusters. Also:
  - Remove the unused Trigger import
  - Stop the query before reading the sink to avoid race conditions

* Simplify change feed streaming test to use memory sink

  Replace the cosmos.oltp sink with an in-memory sink to eliminate the need for a separate sink container. This avoids 404 errors from sink container creation/resolution and removes checkpoint path concerns. The test still exercises the full ChangeFeedInitialOffsetWriter and HDFSMetadataLog code paths (readStream with cosmos.oltp.changeFeed), which is the goal for validating the MetadataVersionUtil fix.

* Remove change feed streaming scenarios from Databricks notebooks

* Re-add change feed streaming with shared logic in both notebooks

  Both notebooks now use the same pattern: derive changeFeedCfg from the existing cfg map (which already has the correct auth config) plus the change feed-specific options, and write to an in-memory sink to avoid container creation issues. This ensures the key-based and AAD/MSI notebooks exercise identical streaming logic.

* Remove change feed streaming from AAD/MSI notebook

  The MSI notebook shares a cluster with basicScenario, and the Cosmos client cache retains references from the first notebook's proactive connection init. When basicScenario drops the source container during cleanup, the MSI notebook's change feed streaming fails with a 404 on the cached (now-deleted) container. The change feed streaming test in basicScenario already provides sufficient coverage for the ChangeFeedInitialOffsetWriter code paths.

* Add diagnostic logging to MSI change feed streaming test

  Add detailed logging to capture:
  - The endpoint, database, container, and auth config used
  - The source container record count before streaming
  - The streaming query ID
  - Full exception details on failure

  This will help diagnose why the change feed streaming fails on the MSI notebook but succeeds on the key-based one.

* Remove change feed streaming from MSI notebook

  The MSI change feed test passes on a fresh cluster but fails when basicScenario runs first on the same cluster without a restart. basicScenario leaves cached Cosmos client state (proactive connection init on the ephemeral endpoint) that causes the MSI streaming query to resolve to the wrong endpoint, resulting in a 404. The change feed test in basicScenario provides sufficient coverage for the ChangeFeedInitialOffsetWriter/HDFSMetadataLog code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent d86dd5d commit df7614a
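The "shared logic" bullet above derives the streaming config by merging change feed options into the existing base config. A minimal sketch of that `cfg ++ Map(...)` pattern, using placeholder base values (the endpoint, database, and container names here are invented for illustration; only the three change feed options come from this commit):

```scala
// Base notebook config (placeholder values; the real notebooks build this
// with the correct endpoint and auth settings).
val cfg = Map(
  "spark.cosmos.accountEndpoint" -> "https://myaccount.documents.azure.com:443/", // placeholder
  "spark.cosmos.database" -> "sampleDB",                                          // placeholder
  "spark.cosmos.container" -> "sampleContainer"                                   // placeholder
)

// Derive the change feed config from the base config. With Map's ++,
// entries on the right-hand side win on key collisions, so change
// feed-specific options are layered on top of the shared auth config.
val changeFeedCfg = cfg ++ Map(
  "spark.cosmos.read.inferSchema.enabled" -> "false",
  "spark.cosmos.changeFeed.startFrom" -> "Beginning",
  "spark.cosmos.changeFeed.mode" -> "Incremental"
)
```

Because the key sets are disjoint here, the merged map simply carries all six entries; if a key appeared in both maps, the change feed value would override the base value.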

File tree

3 files changed: +137 additions, −2 deletions


sdk/cosmos/azure-cosmos-spark_3/src/main/scala/com/azure/cosmos/spark/ChangeFeedInitialOffsetWriter.scala

Lines changed: 34 additions & 2 deletions
@@ -3,7 +3,7 @@
 package com.azure.cosmos.spark

 import org.apache.spark.sql.SparkSession
-import org.apache.spark.sql.execution.streaming.{HDFSMetadataLog, MetadataVersionUtil}
+import org.apache.spark.sql.execution.streaming.HDFSMetadataLog

 import java.io.{BufferedWriter, InputStream, InputStreamReader, OutputStream, OutputStreamWriter}
 import java.nio.charset.StandardCharsets
@@ -33,7 +33,7 @@ private class ChangeFeedInitialOffsetWriter
         "Log file was malformed: failed to detect the log file version line.")
     }

-    MetadataVersionUtil.validateVersion(content.substring(0, indexOfNewLine), VERSION)
+    ChangeFeedInitialOffsetWriter.validateVersion(content.substring(0, indexOfNewLine), VERSION)
     content.substring(indexOfNewLine + 1)
   }
@@ -58,3 +58,35 @@ private class ChangeFeedInitialOffsetWriter
     override def toString: String = stringBuilder.toString()
   }
 }
+
+private[spark] object ChangeFeedInitialOffsetWriter {
+  /**
+   * Validates the version string from the log file.
+   * This is inlined to avoid a runtime dependency on MetadataVersionUtil,
+   * which has been relocated in some Spark distributions (e.g. Databricks Runtime 17.3+).
+   */
+  def validateVersion(versionText: String, maxSupportedVersion: Int): Int = {
+    if (versionText.nonEmpty && versionText(0) == 'v') {
+      val version =
+        try {
+          versionText.substring(1).toInt
+        } catch {
+          case _: NumberFormatException =>
+            throw new IllegalStateException(
+              s"Log file was malformed: failed to read correct log version from $versionText.")
+        }
+      if (version > 0 && version <= maxSupportedVersion) {
+        return version
+      }
+      if (version > maxSupportedVersion) {
+        throw new IllegalStateException(
+          s"UnsupportedLogVersion: maximum supported log version " +
+          s"is v$maxSupportedVersion, but encountered v$version. " +
+          s"The log file was produced by a newer version of Spark and cannot be read by this version. " +
+          s"Please upgrade.")
+      }
+    }
+    throw new IllegalStateException(
+      s"Log file was malformed: failed to read correct log version from $versionText.")
+  }
+}
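For readers who want to try the accept/reject rules outside the connector, here is a standalone sketch of the same validation logic. The `VersionCheck` object name is invented for illustration; in the commit the helper lives on the `ChangeFeedInitialOffsetWriter` companion object shown above:

```scala
// Standalone sketch of the inlined version-validation rules, for illustration only.
// Accepts "v<N>" where 1 <= N <= maxSupportedVersion; anything else is rejected.
object VersionCheck {
  def validateVersion(versionText: String, maxSupportedVersion: Int): Int = {
    if (versionText.nonEmpty && versionText(0) == 'v') {
      val version =
        try {
          versionText.substring(1).toInt
        } catch {
          case _: NumberFormatException =>
            // "vabc" and similar: the suffix is not an integer
            throw new IllegalStateException(
              s"Log file was malformed: failed to read correct log version from $versionText.")
        }
      if (version > 0 && version <= maxSupportedVersion) {
        return version
      }
      if (version > maxSupportedVersion) {
        throw new IllegalStateException(
          s"UnsupportedLogVersion: maximum supported log version is v$maxSupportedVersion, " +
          s"but encountered v$version.")
      }
      // Zero or negative versions fall through to the malformed case below.
    }
    throw new IllegalStateException(
      s"Log file was malformed: failed to read correct log version from $versionText.")
  }
}
```

Note how `"v-1"` parses successfully to `-1` but is still rejected: it fails both range checks and falls through to the final "malformed" exception, which is why the unit tests below assert on the "malformed" message rather than "UnsupportedLogVersion" for negative versions.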
ChangeFeedInitialOffsetWriterSpec.scala (new file)

Lines changed: 69 additions & 0 deletions

// Copyright (c) Microsoft Corporation. All rights reserved.
// Licensed under the MIT License.
package com.azure.cosmos.spark

class ChangeFeedInitialOffsetWriterSpec extends UnitSpec {

  "validateVersion" should "return version for valid version string within supported range" in {
    ChangeFeedInitialOffsetWriter.validateVersion("v1", 1) shouldBe 1
  }

  it should "return version when version is less than max supported" in {
    ChangeFeedInitialOffsetWriter.validateVersion("v1", 5) shouldBe 1
  }

  it should "return version when version equals max supported" in {
    ChangeFeedInitialOffsetWriter.validateVersion("v3", 3) shouldBe 3
  }

  it should "throw IllegalStateException for version exceeding max supported" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("v2", 1)
    }
    exception.getMessage should include("UnsupportedLogVersion")
    exception.getMessage should include("v1")
    exception.getMessage should include("v2")
  }

  it should "throw IllegalStateException for non-numeric version" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("vabc", 1)
    }
    exception.getMessage should include("malformed")
  }

  it should "throw IllegalStateException for empty string" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("", 1)
    }
    exception.getMessage should include("malformed")
  }

  it should "throw IllegalStateException for string without v prefix" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("1", 1)
    }
    exception.getMessage should include("malformed")
  }

  it should "throw IllegalStateException for v0 (zero version)" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("v0", 1)
    }
    exception.getMessage should include("malformed")
  }

  it should "throw IllegalStateException for negative version" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("v-1", 1)
    }
    exception.getMessage should include("malformed")
  }

  it should "throw IllegalStateException for version string with only v" in {
    val exception = intercept[IllegalStateException] {
      ChangeFeedInitialOffsetWriter.validateVersion("v", 1)
    }
    exception.getMessage should include("malformed")
  }
}

sdk/cosmos/azure-cosmos-spark_3/test-databricks/notebooks/basicScenario.scala

Lines changed: 34 additions & 0 deletions
@@ -111,5 +111,39 @@ df.filter(col("isAlive") === true)

 // COMMAND ----------

+// Change Feed - micro-batch structured streaming
+// This exercises the ChangeFeedInitialOffsetWriter and HDFSMetadataLog code paths
+// that can break on certain Spark distributions (e.g. Databricks Runtime 17.3+)
+
+val changeFeedCfg = cfg ++ Map(
+  "spark.cosmos.read.inferSchema.enabled" -> "false",
+  "spark.cosmos.changeFeed.startFrom" -> "Beginning",
+  "spark.cosmos.changeFeed.mode" -> "Incremental"
+)
+
+val testId = java.util.UUID.randomUUID().toString.replace("-", "")
+
+val changeFeedDF = spark
+  .readStream
+  .format("cosmos.oltp.changeFeed")
+  .options(changeFeedCfg)
+  .load()
+
+val microBatchQuery = changeFeedDF
+  .writeStream
+  .format("memory")
+  .queryName(testId)
+  .outputMode("append")
+  .start()
+
+microBatchQuery.processAllAvailable()
+microBatchQuery.stop()
+
+val sinkCount = spark.sql(s"SELECT * FROM $testId").count()
+println(s"Change Feed micro-batch streaming: $sinkCount records read via change feed")
+assert(sinkCount >= 2, s"Expected at least 2 records from change feed but found $sinkCount")
+
+// COMMAND ----------
+
 // cleanup
 spark.sql(s"DROP TABLE cosmosCatalog.${cosmosDatabaseName}.${cosmosContainerName};")
