[SPARK-57787][CONNECT] Reuse a persistent local Spark Connect server for faster local startup #56907
Open
ericm-db wants to merge 2 commits into
…for faster local startup
Adds an opt-in (SPARK_LOCAL_CONNECT_REUSE / spark.local.connect.reuse) so that
`SparkSession.builder.remote("local[*]").getOrCreate()` reconnects to a persistent local
Spark Connect server -- starting a detached one on the first run and reconnecting to it on
later runs -- instead of booting a fresh in-process server in every process. The first run
pays the cold start once; later runs reconnect in a fraction of a second.
Default behavior is unchanged when the opt-in is off. No protocol or Scala changes.
Trim verbose comments/docstrings, drop redundant noqa and the private-method example from the user docs, and consolidate exception handling. No functional change.
Contributor:
Is this part of https://lists.apache.org/thread/sg9o2gbb3nttz74f0s01v8f167zy8ltt?

Contributor, Author:
Oh yeah, this was to address a comment that Nicholas had left on the doc he had linked in that thread.
What changes were proposed in this pull request?
Adds an opt-in fast path for local Spark Connect development. Today
`SparkSession.builder.remote("local[*]").getOrCreate()` starts a fresh in-process Connect server in
every process (`SparkSession._start_connect_server`), so each run re-pays the cold start (JVM warmup,
`SparkContext`, and server boot).

When `SPARK_LOCAL_CONNECT_REUSE=1` (or `.config("spark.local.connect.reuse", "true")`) is set, a
local-mode remote session instead reconnects to a persistent local Connect server, starting one on
the first run:

- It reads the discovery file (`~/.spark/connect-local.json`, overridable via
  `SPARK_LOCAL_CONNECT_DISCOVERY`) and reuses the recorded server if its pid is alive, its port is
  open, and its Spark version matches; otherwise
- it starts a detached server (`pyspark/sql/connect/local_server.py`), waits until it is
  reachable, and records it for the next process.

The user's code is unchanged. The first run pays the cold start once; later runs reconnect in a
fraction of a second.
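
The reuse decision described above (pid alive, port open, Spark version matches) can be sketched with the standard library alone. This is an illustration, not the code in `pyspark/sql/connect/local_server.py`: the helper names and the JSON field names `pid`, `port`, and `version` are assumptions, since the PR description does not spell out the discovery file's layout.

```python
import json
import os
import socket
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this pid exists (POSIX semantics)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def reusable_server(path: Path, client_version: str) -> Optional[dict]:
    """Return the recorded server if it can be reused, else None.

    The field names below are hypothetical; the real discovery file may differ.
    """
    try:
        record = json.loads(path.read_text())
        pid = int(record["pid"])
        port = int(record["port"])
        version = record["version"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing, unreadable, or malformed discovery file
    if version != client_version:
        return None  # never reuse a server from another Spark version
    if not pid_alive(pid) or not port_open(port):
        return None  # stale record: the caller should start a new server
    return record
```

A caller would start a detached server and rewrite the discovery file whenever `reusable_server` returns `None`.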
Notes:
- No protocol or Scala changes.
- […] `0600`), which the client uses to authenticate.
- […] isolated artifacts) is fresh per run; only shared `SparkContext` state (catalog, global temp
  views, cached data) carries across runs.
- […] `spark.local.connect.server.idleTimeout` seconds idle (default `3600`; `0` disables).
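
The idle-timeout semantics of `spark.local.connect.server.idleTimeout` (a number of idle seconds, default `3600`, with `0` disabling the timeout) could be tracked roughly as follows. This is a minimal sketch: the `IdleTimer` class is hypothetical, and what the real server counts as activity is an assumption.

```python
import time
from typing import Callable


class IdleTimer:
    """Tracks time since the last activity; a timeout of 0 never expires."""

    def __init__(
        self,
        timeout_s: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # injectable so tests need not sleep
        self._last_activity = clock()

    def touch(self) -> None:
        """Record activity (assumed here: any client request)."""
        self._last_activity = self._clock()

    def expired(self) -> bool:
        """Return True once the idle period has reached the timeout."""
        if self._timeout_s <= 0:
            return False  # 0 disables the idle timeout
        return self._clock() - self._last_activity >= self._timeout_s
```

A monotonic clock is used so that wall-clock adjustments cannot trigger or postpone a shutdown.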
Open questions for the dev list:
- […] `SPARK_LOCAL_CONNECT_REUSE` / `spark.local.connect.reuse`);
- […] command is preferred;
Why are the changes needed?
Creating a local Spark session for a quick edit/run loop takes a few seconds, and that cost is
one-time-per-process -- it does not amortize across separate runs. Keeping a warm server alive and
reconnecting to it is the only way to make a repeated local dev/test loop fast. This makes that
behavior available behind a single opt-in, without changing user code or default behavior.
Does this PR introduce any user-facing change?
Only when the opt-in is enabled. With `SPARK_LOCAL_CONNECT_REUSE=1` (or
`spark.local.connect.reuse=true`), `SparkSession.builder.remote("local[*]").getOrCreate()` starts a
persistent local Connect server on the first run and reconnects on later runs, instead of booting a
fresh in-process server each time. With the opt-in unset (the default), behavior is unchanged. A new
documentation section describes the feature.
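
As an illustration of the opt-in check, the sketch below treats the feature as enabled when either the environment variable or the config key is set. The function name, the set of accepted truthy values, and the "either one enables it" rule are assumptions; the PR description only states that `SPARK_LOCAL_CONNECT_REUSE=1` or `spark.local.connect.reuse=true` turns the feature on.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the PR only documents "1" and "true".
_TRUTHY = {"1", "true"}


def reuse_enabled(
    conf: Optional[Mapping[str, str]] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Return True if the local-server reuse opt-in is switched on."""
    env = os.environ if env is None else env
    conf = {} if conf is None else conf
    env_value = env.get("SPARK_LOCAL_CONNECT_REUSE", "")
    conf_value = conf.get("spark.local.connect.reuse", "")
    return (
        env_value.strip().lower() in _TRUTHY
        or conf_value.strip().lower() in _TRUTHY
    )
```

With both settings absent the function returns `False`, matching the stated default of unchanged behavior.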
How was this patch tested?
New `python/pyspark/sql/tests/connect/test_connect_local_server.py`: unit tests for the discovery
and reuse-decision logic (version mismatch, dead pid, alive-and-listening, safe no-op stop), plus an
end-to-end test that starts a real detached server, confirms a second call reconnects to it (same
pid, no respawn), runs queries over two independent connections, and checks that a temp view in one
does not leak into the other.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.8)