[SPARK-57600][4.1][SQL] Declarative Pipelines should isolate per-flow SQL confs during parallel flow resolution

LuciferYang · LuciferYang · commit c8415bc567a9 · 2026-06-30T10:26:05.000+08:00
### What changes were proposed in this pull request?

Declarative Pipelines resolves flows in parallel on a shared `SparkSession` (`DataflowGraphTransformer`, parallelism 10). `FlowAnalysis.createFlowFunctionFromLogicalPlan` applied each flow's per-flow SQL confs by mutating that shared session's conf and restoring it afterwards. Because the session is shared, concurrent flows interleave those set/restore operations, so a flow can be analyzed under another flow's confs or have its own conf restored out from under it.

This gives each flow a private `SQLConf` instead: clone the session's conf, apply the flow's overrides to the clone, and install it for the analyzing thread with `SQLConf.withExistingConf` while that flow is analyzed (the analyzer reads conf through `SQLConf.get`). Analysis still runs on the shared session, so the catalog, current catalog/database, temp views, and the resolved DataFrames are all left on that session; only the confs the analyzer reads are isolated per flow.

### Why are the changes needed?

When more than one flow sets per-flow confs, parallel resolution can analyze a flow under another flow's confs, producing non-deterministic and occasionally incorrect analysis and schema inference.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added three tests in `ConnectValidPipelineSuite`:

- one checks that a flow's per-flow conf is what the analyzer reads during analysis, but does not leak onto the session the pipeline is run from;
- one resolves several flows in parallel, each setting a different value for the same conf, and asserts every flow's analysis observes its own value;
- one resolves a graph through `resolveToDataflowGraph()` with a per-flow `spark.sql.caseSensitive` override and checks that analysis actually honors it: a column reference resolves under the default but fails for the flow that turns on case sensitivity.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.8)

Closes #56861 from LuciferYang/SPARK-57600-4.1.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: yangjie01 <yangjie01@baidu.com>
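
The clone-then-install pattern described in the first section can be modeled without Spark. The following is a minimal sketch, not the code in this patch: `ToyConf`, `ToyConf.withConf`, `ToyConf.current`, and `FlowConfIsolationSketch` are hypothetical names that stand in for `SQLConf`, `SQLConf.withExistingConf`, and `SQLConf.get`. It assumes only that the installed conf is tracked per thread and that readers fall back to the shared conf when nothing is installed.

```scala
import java.util.concurrent.{ConcurrentHashMap, CyclicBarrier, TimeUnit}
import scala.jdk.CollectionConverters._

/** Toy stand-in for a mutable conf object. Hypothetical, NOT Spark's SQLConf. */
final class ToyConf(initial: Map[String, String] = Map.empty) {
  private val settings = new ConcurrentHashMap[String, String]()
  initial.foreach { case (k, v) => settings.put(k, v) }

  def set(key: String, value: String): Unit = settings.put(key, value)
  def get(key: String): Option[String] = Option(settings.get(key))

  /** Independent snapshot: later writes to the copy never reach the original. */
  def copy(): ToyConf = new ToyConf(settings.asScala.toMap)
}

object ToyConf {
  /** Conf shared by all threads, playing the role of the session conf. */
  val shared = new ToyConf()

  // Conf installed for the current thread only (null when none is installed).
  private val installed = new ThreadLocal[ToyConf]()

  /** What readers see: the thread's installed conf, else the shared one. */
  def current: ToyConf = Option(installed.get()).getOrElse(shared)

  /** Installs `conf` for the calling thread while `body` runs, then restores. */
  def withConf[T](conf: ToyConf)(body: => T): T = {
    val previous = installed.get()
    installed.set(conf)
    try body
    finally {
      if (previous == null) installed.remove() else installed.set(previous)
    }
  }
}

object FlowConfIsolationSketch {
  val Key = "toy.flow.value"

  /** Resolves `numFlows` toy flows in parallel; returns the value each one read. */
  def resolveFlows(numFlows: Int): Map[Int, String] = {
    val observed = new ConcurrentHashMap[Int, String]()
    // Hold every flow mid-"analysis" at the same time, after its conf is applied.
    val barrier = new CyclicBarrier(numFlows)
    val threads = (0 until numFlows).map { i =>
      val t = new Thread(() => {
        // Private clone plus this flow's override; the shared conf is untouched.
        val flowConf = ToyConf.shared.copy()
        flowConf.set(Key, s"flowValue_$i")
        ToyConf.withConf(flowConf) {
          barrier.await(30, TimeUnit.SECONDS)
          observed.put(i, ToyConf.current.get(Key).getOrElse("<unset>"))
          ()
        }
      })
      t.start()
      t
    }
    threads.foreach(_.join())
    observed.asScala.toMap
  }

  def main(args: Array[String]): Unit = {
    val seen = resolveFlows(4)
    // Each flow reads its own value, and nothing leaks onto the shared conf.
    assert(seen == (0 until 4).map(i => i -> s"flowValue_$i").toMap)
    assert(ToyConf.shared.get(Key).isEmpty)
  }
}
```

Because each thread writes only to its own clone, there is no set/restore sequence on the shared object for other threads to interleave with, which is the property the parallel test in this patch checks against the real `SQLConf`.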
diff --git a/sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowAnalysis.scala b/sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowAnalysis.scala
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.{AliasIdentifier, TableIdentifier}
 import org.apache.spark.sql.catalyst.analysis.{CTESubstitution, UnresolvedRelation}
 import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}
 import org.apache.spark.sql.classic.{DataFrame, Dataset, DataStreamReader, SparkSession}
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.pipelines.AnalysisWarning
 import org.apache.spark.sql.pipelines.graph.GraphIdentifierManager.{ExternalDatasetIdentifier, InternalDatasetIdentifier}
 import org.apache.spark.sql.pipelines.util.{BatchReadOptions, InputReadOptions, StreamingReadOptions}
@@ -46,17 +47,23 @@ object FlowAnalysis {
       confs: Map[String, String],
       queryContext: QueryContext,
       queryOrigin: QueryOrigin) => {
+      // Flows are resolved in parallel on a shared session, so applying per-flow confs by mutating
+      // that session's conf would race across flows. Instead, give each flow a private SQLConf
+      // (a clone of the session's conf plus this flow's overrides) and install it for the analyzing
+      // thread via SQLConf.withExistingConf. Analysis still runs on the shared session, so its
+      // catalog and the resolved DataFrames are unaffected; only the confs the analyzer reads are
+      // isolated per flow.
+      val spark = SparkSession.active
       val ctx = FlowAnalysisContext(
         allInputs = allInputs,
         availableInputs = availableInputs,
         queryContext = queryContext,
-        spark = SparkSession.active
+        spark = spark,
+        flowConf = spark.sessionState.conf.clone()
       )
-      val df = try {
+      val df = SQLConf.withExistingConf(ctx.flowConf) {
         confs.foreach { case (k, v) => ctx.setConf(k, v) }
         Try(FlowAnalysis.analyze(ctx, plan))
-      } finally {
-        ctx.restoreOriginalConf()
       }
       FlowFunctionResult(
         requestedInputs = ctx.requestedInputs.toSet,
@@ -74,9 +81,12 @@ object FlowAnalysis {
    * Constructs an analyzed [[DataFrame]] from a [[LogicalPlan]] by resolving Pipelines specific
    * TVFs and datasets that cannot be resolved directly by Catalyst.
    *
-   * This function shouldn't call any singleton as it will break concurrent access to graph
-   * analysis; or any thread local variables as graph analysis and this function will use
-   * different threads in python repl.
+   * This runs on the flow-resolution thread pool, which may differ from the thread that defined
+   * the flow (e.g. in a Python REPL), so it must not depend on ambient singletons or thread-locals
+   * carried over from that defining thread. The one piece of per-flow state it relies on - the
+   * flow's SQL confs - is installed on the analyzing thread by
+   * [[createFlowFunctionFromLogicalPlan]] via `SQLConf.withExistingConf`, so the Catalyst analysis
+   * this triggers reads them through `SQLConf.get`.
    *
    * @param plan     The [[LogicalPlan]] defining a flow.
    * @return An analyzed [[DataFrame]].
@@ -236,7 +246,7 @@ object FlowAnalysis {
     }
 
     val incompatibleViewReadCheck =
-      ctx.spark.conf.get("pipelines.incompatibleViewCheck.enabled", "true").toBoolean
+      ctx.flowConf.getConfString("pipelines.incompatibleViewCheck.enabled", "true").toBoolean
 
     // Wrap the DF in an alias so that columns in the DF can be referenced with
     // the following in the query:
diff --git a/sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowAnalysisContext.scala b/sql/pipelines/src/main/scala/org/apache/spark/sql/pipelines/graph/FlowAnalysisContext.scala
@@ -22,6 +22,7 @@ import scala.collection.mutable.ListBuffer
 
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.classic.SparkSession
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.pipelines.AnalysisWarning
 
 /**
@@ -32,7 +33,12 @@ import org.apache.spark.sql.pipelines.AnalysisWarning
  * @param queryContext         The context of the query being evaluated.
  * @param requestedInputs      A mutable buffer populated with names of all inputs that were
  *                             requested.
- * @param spark                the spark session to be used.
+ * @param spark                The (shared) spark session to be used.
+ * @param flowConf             A private [[SQLConf]] holding this flow's per-flow confs. It is
+ *                             installed for the analyzing thread via `SQLConf.withExistingConf`
+ *                             (see `FlowAnalysis.createFlowFunctionFromLogicalPlan`) so per-flow
+ *                             confs stay isolated from concurrently resolving flows and from the
+ *                             shared session, without cloning the session.
  * @param externalInputs The names of external inputs that were used to evaluate
  *                                 the flow's query.
  */
@@ -46,27 +52,20 @@ private[pipelines] case class FlowAnalysisContext(
     shouldLowerCaseNames: Boolean = false,
     analysisWarnings: mutable.Buffer[AnalysisWarning] = new ListBuffer[AnalysisWarning],
     spark: SparkSession,
+    flowConf: SQLConf,
     externalInputs: mutable.HashSet[TableIdentifier] = mutable.HashSet.empty
 ) {
 
   /** Map from `Input` name to the actual `Input` */
   val availableInput: Map[TableIdentifier, Input] =
     availableInputs.map(i => i.identifier -> i).toMap
 
-  /** The confs set in this context that should be undone when exiting this context. */
-  private val confsToRestore = mutable.HashMap[String, Option[String]]()
-
-  /** Sets a Spark conf within this context that will be undone by `restoreOriginalConf`. */
+  /**
+   * Sets a Spark conf for this flow's analysis. It is set on the per-flow [[flowConf]], which is
+   * active for the analyzing thread only, so it does not leak to other flows or to the shared
+   * session.
+   */
   def setConf(key: String, value: String): Unit = {
-    if (!confsToRestore.contains(key)) {
-      confsToRestore.put(key, spark.conf.getOption(key))
-    }
-    spark.conf.set(key, value)
-  }
-
-  /** Restores the Spark conf to its state when this context was creating by undoing confs set. */
-  def restoreOriginalConf(): Unit = confsToRestore.foreach {
-    case (k, Some(v)) => spark.conf.set(k, v)
-    case (k, None) => spark.conf.unset(k)
+    flowConf.setConfString(key, value)
   }
 }
diff --git a/sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/ConnectValidPipelineSuite.scala b/sql/pipelines/src/test/scala/org/apache/spark/sql/pipelines/graph/ConnectValidPipelineSuite.scala
@@ -20,7 +20,10 @@ package org.apache.spark.sql.pipelines.graph
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
 import org.apache.spark.sql.catalyst.plans.logical.Union
+import org.apache.spark.sql.classic.DataFrame
 import org.apache.spark.sql.execution.streaming.runtime.MemoryStream
+import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.pipelines.util.InputReadOptions
 import org.apache.spark.sql.pipelines.utils.{PipelineTest, TestGraphRegistrationContext}
 import org.apache.spark.sql.test.SharedSparkSession
 import org.apache.spark.sql.types._
@@ -528,4 +531,144 @@ class ConnectValidPipelineSuite extends PipelineTest with SharedSparkSession {
       s"Flow ${identifier.unquotedString} has the wrong schema"
     )
   }
+
+  test("per-flow confs are visible to the analyzer but do not leak onto the run session") {
+    val key = "pipelines.test.flowConfIsolation"
+    assert(spark.conf.getOption(key).isEmpty)
+
+    val inputId = TableIdentifier("conf_observer")
+    // (conf the analyzer reads via SQLConf.get, conf on the run session) captured during load().
+    var observed: (Option[String], Option[String]) = null
+    val runSession = spark
+    val observingInput = new Input {
+      override def identifier: TableIdentifier = inputId
+      override def origin: QueryOrigin = QueryOrigin()
+      override def load(readOptions: InputReadOptions): DataFrame = {
+        observed = (SQLConf.get.getAllConfs.get(key), runSession.conf.getOption(key))
+        runSession.range(1).toDF()
+      }
+    }
+
+    val result = FlowAnalysis
+      .createFlowFunctionFromLogicalPlan(UnresolvedRelation(Seq("conf_observer")))
+      .call(
+        allInputs = Set(inputId),
+        availableInputs = Seq(observingInput),
+        configuration = Map(key -> "flowValue"),
+        queryContext = QueryContext(currentCatalog = None, currentDatabase = None),
+        queryOrigin = QueryOrigin())
+
+    assert(result.dataFrame.isSuccess, s"flow analysis failed: ${result.dataFrame}")
+    assert(observed != null, "input.load was not invoked during analysis")
+    val (analyzerConf, runConf) = observed
+    // The per-flow conf is what the analyzer reads ...
+    assert(analyzerConf.contains("flowValue"))
+    // ... but it must not leak onto the session the pipeline is run from.
+    assert(
+      !runConf.contains("flowValue"),
+      "per-flow conf leaked onto the run session during flow analysis")
+    // ... and nothing is left behind on the run session afterwards.
+    assert(spark.conf.getOption(key).isEmpty)
+  }
+
+  test("per-flow confs stay isolated when flows are resolved in parallel") {
+    val key = "pipelines.test.flowConfIsolation"
+    assert(spark.conf.getOption(key).isEmpty)
+
+    val numFlows = 8
+    val runSession = spark
+    // The conf value each flow's analyzer reads for `key`.
+    val observed = new java.util.concurrent.ConcurrentHashMap[Int, String]()
+    val errors = new java.util.concurrent.ConcurrentLinkedQueue[Throwable]()
+    // Rendezvous so every flow is mid-analysis - its per-flow conf already applied - at the same
+    // time. That is exactly when applying confs to a shared session would let one flow observe
+    // another flow's value.
+    val barrier = new java.util.concurrent.CyclicBarrier(numFlows)
+
+    def observingInput(i: Int): Input = new Input {
+      override def identifier: TableIdentifier = TableIdentifier(s"conf_observer_$i")
+      override def origin: QueryOrigin = QueryOrigin()
+      override def load(readOptions: InputReadOptions): DataFrame = {
+        barrier.await(60, java.util.concurrent.TimeUnit.SECONDS)
+        observed.put(i, SQLConf.get.getConfString(key, "<unset>"))
+        runSession.range(1).toDF()
+      }
+    }
+
+    val threads = (0 until numFlows).map { i =>
+      val t = new Thread(() => {
+        try {
+          val result = FlowAnalysis
+            .createFlowFunctionFromLogicalPlan(UnresolvedRelation(Seq(s"conf_observer_$i")))
+            .call(
+              allInputs = Set(TableIdentifier(s"conf_observer_$i")),
+              availableInputs = Seq(observingInput(i)),
+              configuration = Map(key -> s"flowValue_$i"),
+              queryContext = QueryContext(currentCatalog = None, currentDatabase = None),
+              queryOrigin = QueryOrigin())
+          result.dataFrame.failed.foreach(errors.add)
+        } catch {
+          case t: Throwable => errors.add(t)
+        }
+      })
+      t.setName(s"flow-conf-isolation-$i")
+      t.start()
+      t
+    }
+    threads.foreach(_.join(120000))
+
+    assert(errors.isEmpty, s"flow analysis threads failed: ${errors.toArray.mkString(", ")}")
+    assert(
+      observed.size() == numFlows,
+      s"only ${observed.size()} of $numFlows flows recorded a conf")
+    (0 until numFlows).foreach { i =>
+      assert(
+        observed.get(i) == s"flowValue_$i",
+        s"flow $i observed '${observed.get(i)}' instead of its own per-flow conf")
+    }
+    // Nothing leaks onto the run session.
+    assert(spark.conf.getOption(key).isEmpty)
+  }
+
+  test("per-flow confs reach the analyzer through the full resolveToDataflowGraph() path") {
+    val caseSensitiveKey = SQLConf.CASE_SENSITIVE.key
+    // Pin the session default so the test is self-contained under the shared session. The per-flow
+    // override below is applied to the flow's own conf, never to this session conf.
+    withSQLConf(caseSensitiveKey -> "false") {
+      // With case-insensitive resolution `SELECT Foo FROM src` matches the `foo` column. Setting
+      // spark.sql.caseSensitive=true on the consumer flow makes that flow's analysis
+      // case-sensitive, so `Foo` no longer matches `foo`. Driving this through
+      // resolveToDataflowGraph() exercises a per-flow conf on the full resolution path (not just a
+      // direct FlowAnalysis call) and shows it is consumed by Catalyst analysis, not merely stored
+      // where SQLConf.get can read it. Cross-flow isolation under concurrency is covered by the
+      // parallel test above.
+
+      // Baseline: no per-flow conf, so `Foo` matches `foo` and the graph resolves.
+      val resolved = new TestGraphRegistrationContext(spark) {
+        registerPersistedView("src", query = dfFlowFunc(spark.range(1).toDF("foo")))
+        registerPersistedView("consumer", query = sqlFlowFunc(spark, "SELECT Foo FROM src"))
+      }.resolveToDataflowGraph()
+      assert(resolved.resolved, "pipeline should resolve under the default case-insensitive conf")
+
+      // Same query, but the consumer flow sets spark.sql.caseSensitive=true, so `Foo` no longer
+      // matches `foo` and analysis of that flow fails.
+      val unresolved = new TestGraphRegistrationContext(spark) {
+        registerPersistedView("src", query = dfFlowFunc(spark.range(1).toDF("foo")))
+        registerPersistedView(
+          "consumer",
+          query = sqlFlowFunc(spark, "SELECT Foo FROM src"),
+          sqlConf = Map(caseSensitiveKey -> "true"))
+      }.resolveToDataflowGraph()
+      assert(!unresolved.resolved, "case-sensitive consumer flow should fail to resolve")
+      val ex = intercept[UnresolvedPipelineException] {
+        unresolved.validate()
+      }
+      assertAnalysisException(
+        ex.directFailures(fullyQualifiedIdentifier("consumer")),
+        "UNRESOLVED_COLUMN.WITH_SUGGESTION")
+
+      // The per-flow conf must not leak onto the run session.
+      assert(spark.conf.get(caseSensitiveKey) == "false")
+    }
+  }
 }