Commit 5fd1438

[SPARK-56406][SS] Stream-stream join v4: skip writing secondary index if the operator will not evict from that side
### What changes were proposed in this pull request?

This PR proposes to skip writing the secondary index when the operator will never evict from that side. For simplicity, the column family is still created; only the writes to the secondary index are skipped.

The operator can determine that it will never evict from a side as follows. The obvious case is when neither side has an event time column: neither side evicts any state at all. The tricky case is when one side has an event time column and the other side may be able to deduce eviction from it:

* Equality join (event time column is in the join key): the non-watermarked side can actually evict its state.
* Time-interval join (event time column is on the value side): neither side can actually evict, since one side is unbounded and the other has to be bounded relative to it.

For the former, the logic detects that the non-watermarked side can evict and writes the secondary index for it. (See how `joinKeyOrdinalForWatermark` is constructed.) For the latter, technically both sides could skip writing the secondary index, but that is a fairly minor case, and the logic only lets the non-watermarked side skip. (The PR leaves this potential optimization as a TODO code comment; we are not going to file a JIRA ticket since we don't know whether we will ever need it.)

The main target of this optimization is a regular join where neither side has an event time column; covering a watermark on only one side is a bonus.

As a safety net, the `evict***` methods raise an exception if the operator skipped writing the secondary index, since those methods are NOT expected to be called in that case.

### Why are the changes needed?

There are several cases where the operator never leverages the secondary index on one (or both) side(s), so writing to the secondary index has no value.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude 4.6 Opus

Closes #55271 from HeartSaVioR/SPARK-56406.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
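The case analysis above can be sketched as a small decision function. This is a hypothetical illustration (the names `SecondaryIndexNeed` and `sidesNeedingIndex` are not in the PR), modeling which sides still need the secondary index written:

```scala
// Hypothetical sketch of the eviction case analysis described above; names are
// illustrative, not the PR's actual code.
object SecondaryIndexNeed {
  sealed trait Side
  case object Left extends Side
  case object Right extends Side

  // watermarkedSides: sides whose input stream defines a watermark.
  // eventTimeInJoinKey: true for an equality join on the event time column.
  def sidesNeedingIndex(
      watermarkedSides: Set[Side],
      eventTimeInJoinKey: Boolean): Set[Side] = {
    if (watermarkedSides.isEmpty) {
      // No event time anywhere: neither side ever evicts, so skip both indexes.
      Set.empty
    } else if (eventTimeInJoinKey) {
      // Equality join on event time: the join key carries the timestamp, so the
      // non-watermarked side can also evict; keep the index on both sides.
      Set(Left, Right)
    } else {
      // Time-interval join: the PR only skips the non-watermarked side
      // (skipping both is left as a TODO in the code comment).
      watermarkedSides
    }
  }
}
```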
1 parent d7df192 commit 5fd1438

3 files changed: 173 additions & 9 deletions


sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/operators/stateful/join/SymmetricHashJoinStateManager.scala

Lines changed: 17 additions & 1 deletion
```diff
@@ -244,6 +244,16 @@ class SymmetricHashJoinStateManagerV4(
     // pass the information. The information is in SQLConf.
     allowMultipleEventTimeColumns = false)

+  // When there is no event time column in the value and no watermark ordinal in the key,
+  // the secondary index (TsWithKey) will never be used for eviction. Skip writing to it
+  // to avoid unnecessary RocksDB merge overhead.
+  // TODO(SPARK-56536): This could be further optimized by also considering whether the state
+  // watermark predicate is defined. Even when an event time column exists, the secondary index
+  // is unused if eviction is not possible (e.g., only one side defines a watermark in a time
+  // interval join). That would require propagating the predicate information here.
+  private val hasEventTime: Boolean =
+    eventTimeColIdxOpt.isDefined || joinKeyOrdinalForWatermark.isDefined
+
   private val random = new scala.util.Random(System.currentTimeMillis())
   private val bucketCountForNoEventTime = 1024
   private val extractEventTimeFn: UnsafeRow => Long = { row =>
@@ -353,7 +363,9 @@ class SymmetricHashJoinStateManagerV4(
     val eventTime = extractEventTimeFnFromKey(key).getOrElse(extractEventTimeFn(value))
     // We always do blind merge for appending new value.
     keyWithTsToValues.append(key, eventTime, value, matched)
-    tsWithKey.add(eventTime, key)
+    if (hasEventTime) {
+      tsWithKey.add(eventTime, key)
+    }
   }

   override def getJoinedRows(
@@ -508,6 +520,8 @@ class SymmetricHashJoinStateManagerV4(
   }

   override def evictByTimestamp(endTimestamp: Long): Long = {
+    require(hasEventTime,
+      "evictByTimestamp requires event time; secondary index was not populated")
     var removed = 0L
     tsWithKey.scanEvictedKeys(endTimestamp).foreach { evicted =>
       val key = evicted.key
@@ -524,6 +538,8 @@ class SymmetricHashJoinStateManagerV4(
   }

   override def evictAndReturnByTimestamp(endTimestamp: Long): Iterator[KeyToValuePair] = {
+    require(hasEventTime,
+      "evictAndReturnByTimestamp requires event time; secondary index was not populated")
     val reusableKeyToValuePair = KeyToValuePair()

     tsWithKey.scanEvictedKeys(endTimestamp).flatMap { evicted =>
```
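The guard pattern in the diff above can be illustrated with a minimal, self-contained sketch. `TsIndex` here is hypothetical (it is not Spark's `SymmetricHashJoinStateManagerV4`): writes to the timestamp-ordered index are skipped when the side can never evict, and eviction fails fast if the index was never populated, mirroring the `require()` safety net:

```scala
import scala.collection.mutable

// Hypothetical sketch of the skip-write + fail-fast pattern; not Spark's
// actual state manager classes.
class TsIndex(hasEventTime: Boolean) {
  // Timestamp-ordered secondary index: event time -> keys written at that time.
  private val index = mutable.TreeMap.empty[Long, List[String]]

  def append(eventTime: Long, key: String): Unit = {
    // Skip the write entirely when nothing will ever evict from this side.
    if (hasEventTime) {
      index(eventTime) = key :: index.getOrElse(eventTime, Nil)
    }
  }

  def evictUpTo(endTs: Long): Seq[String] = {
    // Safety net: eviction is NOT expected to be called when the index was
    // never populated.
    require(hasEventTime, "eviction requires event time; index was not populated")
    val (evicted, kept) = index.partition { case (ts, _) => ts <= endTs }
    index.clear()
    index ++= kept
    evicted.values.flatten.toSeq
  }
}
```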

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinSuite.scala

Lines changed: 11 additions & 8 deletions
```diff
@@ -931,13 +931,16 @@ abstract class StreamingInnerJoinBase extends StreamingJoinSuite {
       AddData(input2, 1, 10),
       CheckNewAnswer((1, 2, 3)),
       Execute { query =>
-        val numInternalKeys =
+        val numInternalCfs =
           query.lastProgress
             .stateOperators(0)
             .customMetrics
-            .get("rocksdbNumInternalColFamiliesKeys")
-        // Number of internal column family keys should be nonzero for this join implementation
-        assert(numInternalKeys.longValue() > 0)
+            .get("rocksdbNumInternalColumnFamilies")
+        // The V4 virtual-column-family join uses internal column families for the
+        // secondary index, so the CF count should be nonzero for this join implementation.
+        // Note: we intentionally check the CF count (not the key count), because for joins
+        // without event time the secondary index is created but never populated.
+        assert(numInternalCfs.longValue() > 0)
       },
       StopStream,
       // Restart the query from the same checkpoint
@@ -948,13 +951,13 @@ abstract class StreamingInnerJoinBase extends StreamingJoinSuite {
       CheckNewAnswer((2, 4, 6), (2, 4, 6)),
       Execute { query =>
         // The join implementation should not have changed between runs
-        val numInternalKeys =
+        val numInternalCfs =
           query.lastProgress
             .stateOperators(0)
             .customMetrics
-            .get("rocksdbNumInternalColFamiliesKeys")
-        // Number of internal column family keys should still be nonzero for this join
-        assert(numInternalKeys.longValue() > 0)
+            .get("rocksdbNumInternalColumnFamilies")
+        // Number of internal column families should still be nonzero for this join
+        assert(numInternalCfs.longValue() > 0)
       },
       StopStream
     )
```

sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingJoinV4Suite.scala

Lines changed: 145 additions & 0 deletions
```diff
@@ -20,6 +20,7 @@ package org.apache.spark.sql.streaming
 import org.apache.hadoop.fs.Path
 import org.scalatest.Tag

+import org.apache.spark.sql.execution.datasources.v2.state.StateSourceOptions
 import org.apache.spark.sql.execution.streaming.checkpointing.CheckpointFileManager
 import org.apache.spark.sql.execution.streaming.operators.stateful.join.StreamingSymmetricHashJoinExec
 import org.apache.spark.sql.execution.streaming.runtime.MemoryStream
@@ -184,6 +185,150 @@ class StreamingInnerJoinV4Suite
       )
     }
   }
+
+  private def readStateStore(checkpointLoc: String, storeName: String): Long = {
+    spark.read.format("statestore")
+      .option(StateSourceOptions.PATH, checkpointLoc)
+      .option(StateSourceOptions.STORE_NAME, storeName)
+      .load()
+      .count()
+  }
+
+  testWithVirtualColumnFamilyJoins(
+    "SPARK-56406: secondary index is not populated for join without event time") {
+    withTempDir { checkpointDir =>
+      val input1 = MemoryStream[Int]
+      val input2 = MemoryStream[Int]
+
+      val df1 = input1.toDF()
+        .select($"value" as "key", ($"value" * 2) as "leftValue")
+      val df2 = input2.toDF()
+        .select($"value" as "key", ($"value" * 3) as "rightValue")
+      val joined = df1.join(df2, "key")
+
+      testStream(joined)(
+        StartStream(checkpointLocation = checkpointDir.getCanonicalPath),
+        AddData(input1, 1, 2, 3),
+        CheckAnswer(),
+        AddData(input2, 1, 2),
+        CheckNewAnswer((1, 2, 3), (2, 4, 6)),
+        Execute { _ =>
+          val checkpointLoc = checkpointDir.getCanonicalPath
+
+          assert(readStateStore(checkpointLoc, "left-keyWithTsToValues") > 0,
+            "left primary store should have rows")
+          assert(readStateStore(checkpointLoc, "right-keyWithTsToValues") > 0,
+            "right primary store should have rows")
+
+          assert(readStateStore(checkpointLoc, "left-tsWithKey") === 0,
+            "left secondary index should be empty without event time")
+          assert(readStateStore(checkpointLoc, "right-tsWithKey") === 0,
+            "right secondary index should be empty without event time")
+        },
+        StopStream
+      )
+    }
+  }
+
+  testWithVirtualColumnFamilyJoins(
+    "SPARK-56406: secondary index populated on both sides when watermark is on join key") {
+    withTempDir { checkpointDir =>
+      val input1 = MemoryStream[(Int, Int)]
+      val input2 = MemoryStream[(Int, Int)]
+
+      val df1 = input1.toDF().toDF("key", "time")
+        .select($"key", timestamp_seconds($"time") as "ts", ($"key" * 2) as "leftValue")
+        .withWatermark("ts", "10 seconds")
+      val df2 = input2.toDF().toDF("key", "time")
+        .select($"key", timestamp_seconds($"time") as "ts", ($"key" * 3) as "rightValue")
+      // Only left side has watermark; ts is part of the join key, so
+      // joinKeyOrdinalForWatermark is defined -> hasEventTime = true for both sides.
+
+      val joined = df1.join(df2, Seq("key", "ts"))
+        .select($"key", $"ts".cast("long"), $"leftValue", $"rightValue")
+
+      testStream(joined)(
+        StartStream(checkpointLocation = checkpointDir.getCanonicalPath),
+        // Use ts=20 for the row we expect to join against input2.
+        // withWatermark("ts", "10 seconds") causes batch 0 to advance the watermark to
+        // max(ts) - 10s = 10s. Because watermark-based cleanup is enabled,
+        // MicroBatchExecution fires a no-data batch (shouldRunAnotherBatch) after
+        // batch 0 that evicts any state rows with ts <= 10 (inclusive). Keeping
+        // ts=20 > 10 ensures the row survives that eviction so the input2 row in
+        // the following batch can match it.
+        AddData(input1, (1, 20), (2, 10)),
+        CheckAnswer(),
+        AddData(input2, (1, 20)),
+        CheckNewAnswer((1, 20, 2, 3)),
+        Execute { _ =>
+          val checkpointLoc = checkpointDir.getCanonicalPath
+
+          assert(readStateStore(checkpointLoc, "left-keyWithTsToValues") > 0,
+            "left primary store should have rows")
+          assert(readStateStore(checkpointLoc, "right-keyWithTsToValues") > 0,
+            "right primary store should have rows")
+
+          // Both secondary indexes should be populated because joinKeyOrdinalForWatermark
+          // is defined (watermark on join key applies to both sides).
+          assert(readStateStore(checkpointLoc, "left-tsWithKey") > 0,
+            "left secondary index should be populated when watermark is on join key")
+          assert(readStateStore(checkpointLoc, "right-tsWithKey") > 0,
+            "right secondary index should be populated when watermark is on join key")
+        },
+        StopStream
+      )
+    }
+  }
+
+  testWithVirtualColumnFamilyJoins(
+    "SPARK-56406: secondary index only populated on watermarked side for time interval join") {
+    withTempDir { checkpointDir =>
+      val leftInput = MemoryStream[(Int, Int)]
+      val rightInput = MemoryStream[(Int, Int)]
+
+      val df1 = leftInput.toDF().toDF("leftKey", "time")
+        .select($"leftKey", timestamp_seconds($"time") as "leftTime",
+          ($"leftKey" * 2) as "leftValue")
+        .withWatermark("leftTime", "10 seconds")
+      val df2 = rightInput.toDF().toDF("rightKey", "time")
+        .select($"rightKey", timestamp_seconds($"time") as "rightTime",
+          ($"rightKey" * 3) as "rightValue")
+      // Only left side has watermark; watermark is on a value column, not the join key.
+      // joinKeyOrdinalForWatermark is None -> only left has hasEventTime = true.
+      // Neither side can actually evict: the left state watermark is derived from the right
+      // side's watermark via the join condition, which is absent here. The left secondary
+      // index is populated but never used for eviction.
+
+      val joined = df1.join(df2,
+        expr("leftKey = rightKey AND " +
+          "leftTime BETWEEN rightTime - interval 5 seconds AND rightTime + interval 5 seconds"))
+        .select($"leftKey", $"leftTime".cast("int"), $"rightTime".cast("int"))
+
+      testStream(joined)(
+        StartStream(checkpointLocation = checkpointDir.getCanonicalPath),
+        AddData(leftInput, (1, 10), (2, 20)),
+        CheckAnswer(),
+        AddData(rightInput, (1, 12)),
+        CheckNewAnswer((1, 10, 12)),
+        Execute { _ =>
+          val checkpointLoc = checkpointDir.getCanonicalPath
+
+          assert(readStateStore(checkpointLoc, "left-keyWithTsToValues") > 0,
+            "left primary store should have rows")
+          assert(readStateStore(checkpointLoc, "right-keyWithTsToValues") > 0,
+            "right primary store should have rows")
+
+          // Left has watermark on a value column -> hasEventTime = true, secondary index populated.
+          assert(readStateStore(checkpointLoc, "left-tsWithKey") > 0,
+            "left secondary index should be populated (watermark on left value column)")
+          // Right has no watermark -> hasEventTime = false, secondary index empty.
+          assert(readStateStore(checkpointLoc, "right-tsWithKey") === 0,
+            "right secondary index should be empty (no watermark on right side)")
+        },
+        StopStream
+      )
+    }
+  }
 }

 @SlowSQLTest
```
