[SPARK-56326][SS] Include streaming query and batch ids in scheduling logs #55166
BrooksWalls wants to merge 14 commits into apache:master
Conversation
dichlorodiphen
left a comment
Generally looks good
```scala
 * Mix this trait into any scheduler component that has access to task
 * properties and needs streaming-aware log output.
 */
private[scheduler] trait StructuredStreamingIdAwareSchedulerLogging extends Logging {
```
Use a trait here so that all logs published from TaskSetManager include the query and batch Id when present.
```scala
 */
private[scheduler] trait StructuredStreamingIdAwareSchedulerLogging extends Logging {
  // we gather the query and batch Id from the properties of a given TaskSet
  protected def properties: Properties
```
Since we can't rely on thread-local properties, we need to gather the query and batch Id from the TaskSet's properties; this must be set by the class that mixes in the trait.
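The mixin pattern described here can be sketched outside of Spark. This is a minimal, simplified model, not Spark's real types: the trait name, the `queryId`/`batchId` property keys, and `FakeTaskSetManager` are all illustrative stand-ins.

```scala
import java.util.Properties

// Simplified stand-in for the real trait; the property keys below are
// hypothetical, not Spark's actual QUERY_ID_KEY / BATCH_ID_KEY values.
trait StreamingIdAwareLogging {
  // Must be provided by the class that mixes in the trait.
  protected def properties: Properties

  // Prefix a log message with the query/batch Id when present.
  protected def enrich(msg: String): String = {
    val props = Option(properties)
    val qId = props.flatMap(p => Option(p.getProperty("queryId")))
    val bId = props.flatMap(p => Option(p.getProperty("batchId")))
    qId.map(q => s"[queryId = $q] ").getOrElse("") +
      bId.map(b => s"[batchId = $b] ").getOrElse("") + msg
  }
}

// A stand-in for TaskSetManager wiring its TaskSet's properties into the trait.
class FakeTaskSetManager(taskSetProps: Properties) extends StreamingIdAwareLogging {
  override protected def properties: Properties = taskSetProps
  def logLine(msg: String): String = enrich(msg)
}
```

With both keys set, `logLine("Starting task 0.0")` yields `[queryId = …] [batchId = …] Starting task 0.0`; with no keys (or null properties), the message passes through unchanged, which mirrors the non-streaming path.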
```scala
 * Helpers for constructing log entries enriched with structured streaming
 * identifiers extracted from task properties.
 */
private[scheduler] object StructuredStreamingIdAwareSchedulerLogging extends Logging {
```
Uses a companion object here so that we can call the methods from SchedulableBuilder, which cannot set a single Properties object at construction since it is shared across TaskSets.
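As a sketch of why a companion-object-style helper fits SchedulableBuilder: the Properties are passed per call rather than fixed at construction, so one shared builder can log for many different TaskSets. The object name and key below are illustrative, not Spark's.

```scala
import java.util.Properties

// A static-style helper: each call supplies its own Properties, so a single
// shared caller can produce enriched log lines for many different TaskSets.
object StreamingLogHelper {
  def entry(props: Properties, msg: String): String = {
    val qId = Option(props).flatMap(p => Option(p.getProperty("queryId")))
    qId.map(q => s"[queryId = $q] $msg").getOrElse(msg)
  }
}
```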
```scala
// formatMessage truncates the queryId for readability,
// so we use a blank messageWithContext to write the full query Id into the context
formatMessage(
  queryId,
  batchId,
  entry
) + MessageWithContext("", constructStreamingContext(queryId, batchId))
```
This is a little clunky, but I wanted to truncate the query Id in the outputted log line so that it's more readable as you scan through, while still keeping the full query Id in the log context. To do that, we append a blank log line with the query context hashmap so it overrides the truncated query Id.
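The trick relies on context maps merging when two log messages are concatenated, with the later entry winning on duplicate keys. A simplified model of that behavior (`MsgWithCtx` is a stand-in for Spark's MessageWithContext, and the ids are made up):

```scala
// Simplified model: `+` concatenates messages and merges contexts,
// with the right-hand side winning on duplicate keys.
case class MsgWithCtx(message: String, context: Map[String, String]) {
  def +(other: MsgWithCtx): MsgWithCtx =
    MsgWithCtx(message + other.message, context ++ other.context)
}

val fullId = "1251e8f0-0000-0000-0000-000000000000"  // hypothetical query id
val truncated = MsgWithCtx("[queryId = 1251e] task started",
  Map("query_id" -> "1251e"))
// Appending a blank message whose context holds the full id overrides the
// truncated value in the context while leaving the rendered line short.
val combined = truncated + MsgWithCtx("", Map("query_id" -> fullId))
```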
```scala
// MDC places the log key in the context as all lowercase, so we do the same here
queryId.foreach(streamingContext.put(LogKeys.QUERY_ID.name.toLowerCase(Locale.ROOT), _))
batchId.foreach(streamingContext.put(LogKeys.BATCH_ID.name.toLowerCase(Locale.ROOT), _))
I'm not sure whether the lowercasing is necessary here, but I wanted to match the behavior of the log interpolator.
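For what it's worth, `Locale.ROOT` also gives locale-independent lowercasing: under a Turkish locale, a bare `toLowerCase` maps `I` to a dotless `ı`, which would silently change the context key. A quick illustration:

```scala
import java.util.Locale

val key = "QUERY_ID"
// Locale.ROOT lowercases the same way on every JVM, regardless of default locale.
val mdcKey = key.toLowerCase(Locale.ROOT)
// Explicitly using the Turkish locale shows the hazard Locale.ROOT avoids:
// "I" lowercases to the dotless "ı", producing a different key.
val turkish = key.toLowerCase(new Locale("tr"))
```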
```scala
    healthTracker: Option[HealthTracker] = None,
-   clock: Clock = new SystemClock()) extends Schedulable with Logging {
+   clock: Clock = new SystemClock())
+     extends Schedulable with StructuredStreamingIdAwareSchedulerLogging {
```
Since we extend StructuredStreamingIdAwareSchedulerLogging instead of Logging, all logs published will include the query and batch Id when handling a streaming query TaskSet.
```scala
  log"${MDC(LogKeys.POOL_NAME, poolName)}")

logInfo(
  StructuredStreamingIdAwareSchedulerLogging.constructStreamingLogEntry(
```
As I said above, since this method is called for many different TaskSets, we have to use the companion object's method.
@jiangxb1987 @Ngone51 Logging might take (very) slightly more time than before, so if we think this is on the critical path (I can't judge), we should be very careful. I'd love to hear from experts; I'd give this a go when I have at least one approval from CORE module experts. Thanks in advance!
```scala
if (isStreamingTaskSet(taskSet)) {
  streamingTaskSetManager(taskSet, maxTaskFailures)
```
Now only TaskSets that belong to streaming queries will get the streaming log line mixin applied. This way we avoid any overhead on the non-streaming path.
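The branching described here can be sketched with stand-in types; `Manager`, `StreamingLogging`, `isStreaming`, and the `queryId` key are all hypothetical names for illustration.

```scala
import java.util.Properties

class Manager(val props: Properties)   // stands in for TaskSetManager
trait StreamingLogging                 // stands in for the streaming logging trait

// Only TaskSets carrying a query id count as streaming.
def isStreaming(props: Properties): Boolean =
  props != null && props.getProperty("queryId") != null

// Mix in the logging trait only on the streaming path, so batch TaskSets
// pay no extra cost.
def createManager(props: Properties): Manager =
  if (isStreaming(props)) new Manager(props) with StreamingLogging
  else new Manager(props)
```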
```scala
// ensure log name matches the non-streaming version
override protected def logName: String = classOf[TaskSetManager].getName
```
This is done to ensure the streaming version has exactly the same logName as the non-streaming version. We could also move this into StructuredStreamingIdAwareSchedulerLogging, but it would look a little different:
```scala
private[scheduler] trait StructuredStreamingIdAwareSchedulerLogging extends Logging {
  protected def properties: Properties
  override protected def logName: String =
    this.getClass.getSuperclass.getName.stripSuffix("$")
}
```
Without one of these two options, the logName will be something like org.apache.spark.scheduler.TaskSchedulerImpl$$anon$1, but only for streaming TaskSetManagers, since we are doing the inline mixin.
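A small demonstration of why: an inline mixin creates an anonymous subclass, but its JVM superclass is still the base class, so the base name can be recovered via `getSuperclass`. The class and trait names here are made up.

```scala
class BaseManager    // stands in for TaskSetManager
trait ExtraLogging   // stands in for the streaming logging trait

// The inline mixin compiles to an anonymous subclass of BaseManager,
// so getClass.getName is something like ...$$anon$1.
val mixed = new BaseManager with ExtraLogging
// The JVM superclass of that anonymous class is still BaseManager, so
// deriving the log name from getSuperclass recovers the base name.
val recovered = mixed.getClass.getSuperclass.getName
```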
```scala
val logAppender = new LogAppender("streaming log name check")
// TSM constructor prints some debug logs we can use
logAppender.setThreshold(Level.DEBUG)
withLogAppender(logAppender,
  loggerNames = Seq(classOf[TaskSetManager].getName),
  level = Some(Level.DEBUG)) {
  val tsm = taskScheduler.createTaskSetManager(taskSet, 1)
  assert(tsm.isInstanceOf[StructuredStreamingIdAwareSchedulerLogging])
}
// when creating the streaming version we want the log name to match the
// non-streaming baseline case. By confirming our log appender contains
// logs we know the log name is correct
assert(logAppender.loggingEvents.nonEmpty,
  "Expected logs under TaskSetManager logger name")
```
I don't really like the way this test asserts the logName, since it relies on some debug log lines published in TaskSetManager's constructor which, if removed, would make this test fail for no reason. But I don't see a clearly better approach.
The reason we check the logs is to ensure that the code in TaskSchedulerImpl correctly sets the log name for the streaming version so it matches the non-streaming version. We have a few options:
- leave what is here and accept that we rely on unrelated debug logs
- manually call a method on TaskSetManager that publishes a log (this still relies on an unrelated log)
- move the log name override into the trait, like:
```scala
private[scheduler] trait StructuredStreamingIdAwareSchedulerLogging extends Logging {
  protected def properties: Properties
  override protected def logName: String =
    this.getClass.getSuperclass.getName.stripSuffix("$")
}
```
- don't have a test covering the production code that sets the log name, since we have a test in TaskSetManagerSuite that mirrors the code in TaskSchedulerImpl and confirms the log name is correct
```scala
  return entry
}
// wrap in log entry to defer until log is evaluated
new LogEntry({
```
We wrap this in a LogEntry since Claude pointed out that we were forcing eager evaluation of the provided logEntry even when the logging level was disabled in the environment. By wrapping everything in a LogEntry, the logic is only run when the logging level is enabled for the environment. This is important for things like debug and trace logs.
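The deferral boils down to Scala's by-name parameters: nothing inside the thunk runs until the message is forced. A minimal sketch, where `LazyEntry` is a stand-in for Spark's LogEntry:

```scala
// A stand-in for LogEntry: the by-name body only runs when `message`
// is first accessed, i.e. when the log line is actually rendered.
class LazyEntry(body: => String) {
  lazy val message: String = body
}

var evaluated = false
val entry = new LazyEntry({ evaluated = true; "expensive message" })
// `evaluated` is still false here; forcing entry.message flips it.
```

This is the same reason Spark's `logDebug(msg: => String)` overloads take by-name arguments: the message is never built when the level is disabled.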
```scala
  bId => log"[batchId = ${MDC(LogKeys.BATCH_ID, bId)}] " + toMessageWithContext(msg)
).getOrElse(toMessageWithContext(msg))
queryId.map(
  qId => log"[queryId = ${MDC(LogKeys.QUERY_ID, qId)}] " + msgWithBatchId
```
I ended up removing the truncation of the query Id here. The reason is that, before, when we added the hashmap containing the full query Id to the context, any log renderer that used the log context would place the full query Id into the log message anyway. Ultimately I think it's better to have the full query Id in the log context, and subsequently the log, than to have the truncated version. Open to reverting this change if others feel differently.
Also, the code is simpler without the truncation.
```scala
test("SPARK-56326: constructStreamingLogEntry with LogEntry - defers evaluation") {
  var evaluated = false
  val lazyEntry = new LogEntry({
    evaluated = true
    MessageWithContext("lazy message", java.util.Collections.emptyMap())
  })

  val result = StructuredStreamingIdAwareSchedulerLogging
    .constructStreamingLogEntry(propsWithBothIds(), lazyEntry)

  // Work should be deferred
  assert(!evaluated,
    "LogEntry should not be evaluated during constructStreamingLogEntry")

  // Accessing .message triggers evaluation
  result.message
  assert(evaluated, "LogEntry should be evaluated when .message is accessed")
}

test("SPARK-56326: constructStreamingLogEntry with LogEntry - defers property access") {
  var propertiesAccessed = false
  val props = new Properties() {
    override def getProperty(key: String): String = {
      propertiesAccessed = true
      super.getProperty(key)
    }
  }
  props.setProperty(QUERY_ID_KEY, testQueryId)
  props.setProperty(BATCH_ID_KEY, testBatchId)

  val entry = log"test message ${MDC(LogKeys.MESSAGE, "Dummy Context")}"
  val result = StructuredStreamingIdAwareSchedulerLogging
    .constructStreamingLogEntry(props, entry)

  assert(!propertiesAccessed,
    "Properties should not be accessed during constructStreamingLogEntry")

  result.message
  assert(propertiesAccessed,
    "Properties should be accessed when .message is called")
}
```
New tests for the eager evaluation issue I mentioned above.
```scala
    properties: Properties,
    msg: => String): LogEntry = {

private def constructStreamingContext(
    queryId: Option[String],
    batchId: Option[String]): java.util.HashMap[String, String] = {
  val streamingContext = new java.util.HashMap[String, String]()
```
Could you add it to the import list?
```scala
-private def formatMessage(
-    queryId: Option[String],
-    batchId: Option[String],
-    msg: => String): String = {
+private def formatMessage(
+    queryId: Option[String],
+    batchId: Option[String],
+    msg: => LogEntry): MessageWithContext = {

private[scheduler] def constructStreamingLogEntry(
    properties: Properties,
    msg: => String): LogEntry = {
  if (properties == null) {
```
Shall we also check that QUERY_ID_KEY is non-empty?
We could check it here. I haven't so far since it would mean checking the properties for the query Id on every log call, even if that log level is disabled. The overhead of the hashmap lookup is pretty small, so it's likely okay, but our code already handles the case where neither queryId nor batchId is set, and that path is currently not taken if the log level is disabled.
The only place this is currently a concern is FairSchedulableBuilder, since anything that uses this through TaskSetManager has already been confirmed to have the query Id set.
> The only place this is currently a concern is FairSchedulableBuilder

Yes. So is there a difference in the log, before and after, when the query Id is empty for a non-streaming query?
```scala
// will include query and batch Id in the logs
private def streamingTaskSetManager(taskSet: TaskSet, maxTaskFailures: Int): TaskSetManager = {
  new TaskSetManager(this, taskSet, maxTaskFailures, healthTrackerOpt, clock)
    with StructuredStreamingIdAwareSchedulerLogging {
```
I wonder, shall we actually override properties here:

override def properties: Properties = taskSet.properties

I don't get how the current way overrides that.
My understanding is that it relied on Scala mixin behavior: the compiler sees that TaskSetManager already has a method matching the signature, so it defaults to that. But it's probably best to set it explicitly here so we don't need the unrelated and seemingly unused method in TaskSetManager.
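Explicit wiring would look something like this sketch, where `Manager` and `PropsAware` are illustrative stand-ins for TaskSetManager and the logging trait:

```scala
import java.util.Properties

class Manager(val taskSetProps: Properties)          // stands in for TaskSetManager
trait PropsAware { protected def properties: Properties }

// Explicitly implement the trait's abstract member rather than relying on a
// structurally matching method that happens to exist on the base class.
def makeStreamingManager(props: Properties): Manager =
  new Manager(props) with PropsAware {
    override protected def properties: Properties = taskSetProps
  }
```

The explicit override makes the wiring obvious at the construction site and keeps the base class free of a member that exists only to satisfy the mixin.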
What changes were proposed in this pull request?
This change adds the streaming query Id and batch Id to some of the scheduling logs in order to aid in debugging structured streaming queries.
The following log lines have been updated to include the query and batch Id:
All log lines in TaskSetManager. Examples:
26/04/02 16:34:01 INFO TaskSetManager: [queryId = 1251e] [batchId = 5] Starting task 0.0 in stage 5.0 (TID 129) (..., executor driver, partition 0, PROCESS_LOCAL, 9728 bytes)
26/04/02 16:34:01 INFO TaskSetManager: [queryId = 1251e] [batchId = 5] Finished task 6.0 in stage 5.0 (TID 135) in 12 ms on ... (executor driver) (6/32)
One log in SchedulableBuilder:
26/04/02 16:39:09 INFO FairSchedulableBuilder: [queryId = f5660] [batchId = 5] Added task set TaskSet_5.0 to pool default
Why are the changes needed?
When debugging multiple streaming queries running at the same time it can be difficult to go through the scheduling logs. By including the query and batch Id it is much easier to isolate logs to specific queries and batches.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Unit tests were added.
Also manually tested by running the spark shell and redirecting info logs to a temporary file. Then ran a basic streaming query and grepped the temp file for the desired log lines to ensure they included the query and batch id. Also confirmed a batch query ran in the shell does not include the query and batch Id in its logs.
Was this patch authored or co-authored using generative AI tooling?
Yes, co-authored.
Generated-by: claude