Skip to content

[SPARK-57793][CONNECT] Add PathAwareChannelBuilder for SparkConnect python client#56933

Open
hiboyang wants to merge 3 commits into
apache:masterfrom
hiboyang:SPARK-57793-path-aware-channel-builder
Open

[SPARK-57793][CONNECT] Add PathAwareChannelBuilder for SparkConnect python client#56933
hiboyang wants to merge 3 commits into
apache:masterfrom
hiboyang:SPARK-57793-path-aware-channel-builder

Conversation

@hiboyang

@hiboyang hiboyang commented Jul 1, 2026

Copy link
Copy Markdown

What changes were proposed in this pull request?

This PR adds a new PathAwareChannelBuilder to the Spark Connect Python client to support a URL path in the Spark Connect connection string, e.g. sc://host1:15002/path1.

The new builder accepts both the standard connection form (sc://host[:port][/;params]) and the path-routed form sc://gateway:<port>/<prefix>. When a path prefix is present, it is prepended to every gRPC method via a client interceptor.

Related discussion: #56816
JIRA: https://issues.apache.org/jira/browse/SPARK-57793

Why are the changes needed?

In Kubernetes, users commonly set up an Ingress to expose a Spark Connect driver endpoint behind a matching URL, e.g. http://host1/path1 routed to the driver endpoint. The existing channel builder cannot carry a path prefix, so the Spark Connect client needs to be updated to support this path-based routing scenario.

Does this PR introduce any user-facing change?

No behavior change for existing connection strings — the existing host-based sc:// connection strings continue to work unchanged. This PR adds a new opt-in PathAwareChannelBuilder that additionally supports a path inside the URL, e.g.

spark = SparkSession.builder.channelBuilder(PathAwareChannelBuilder("sc://host1:443/path1")).getOrCreate()

How was this patch tested?

Added unit tests in python/pyspark/sql/tests/connect/test_connect_channel.py covering path parsing, port extraction from the final path segment, and the path-prefix interceptor. Also manually tested in a local environment.

Was this patch authored or co-authored using generative AI tooling?

assisted by: Claude Code

…ython client

Support URL path in the Spark Connect connection string, e.g.
`sc://host1/path1:15002`, so a Kubernetes Ingress can route by URL path to
the right Spark Connect driver endpoint. Adds PathAwareChannelBuilder which
parses an ingress path prefix from the connection string and prepends it to
every gRPC method via an interceptor.

See apache#56816 and
https://issues.apache.org/jira/browse/SPARK-57793

Generated-by: Claude Code
@hiboyang

hiboyang commented Jul 1, 2026

Copy link
Copy Markdown
Author

@viirya @dbtsai this is to support path routing in Spark Connect Python client, would you help to take a look?

@dbtsai

dbtsai commented Jul 1, 2026

Copy link
Copy Markdown
Member

Review feedback

Two issues in PathAwareChannelBuilder._extract_attributes (python/pyspark/sql/connect/client/core.py).

1. IPv6 endpoints are rejected (core.py:667)

netloc = self.url.netloc.split(":") with len(netloc) in (1, 2) breaks on IPv6 addresses:

  • sc://[::1]:15002/path1 -> netloc='[::1]:15002' -> split(':') yields ['[', '', '1]', '15002'] (4 parts) -> falls into the else branch and raises INVALID_CONNECT_URL.
  • sc://[::1]/path1 -> ['[', '', '1]'] (3 parts) -> same failure.

DefaultChannelBuilder (core.py:486-495) handles this correctly by using self.url.hostname / self.url.port and bracket-wrapping IPv6 hosts, so this new builder is a functional regression for IPv6 users. Two related issues on the same path:

  • sc://host:abc/p -> int(netloc[1]) raises a raw ValueError instead of the wrapped PySparkValueError.
  • sc://:15002/p -> ['', '15002'] (2 parts) -> host='' is silently accepted.

Suggested fix: parse host/port via self.url.hostname / self.url.port as DefaultChannelBuilder does (correct IPv6 bracket-wrapping, hostname validation, consistent error path), and derive netloc_has_port from self.url.port is not None.

2. Path port is lost when combined with the standard /;params form (core.py:688-700)

The trailing-path-port syntax loses the intended port when users combine it with Spark's existing /;param=value form. For example sc://gateway/app/driver:443/;token=abc parses (via urlparse) to path='/app/driver:443/', params='token=abc'. Then:

  • last_segment = prefix.rsplit("/", 1)[-1] on /app/driver:443/ -> '' (the trailing slash leaves an empty last segment).
  • ":" in last_segment is therefore False, so the port-extraction block (lines 693-700) is skipped.
  • Result: self._port stays at the fallback 15002, and _path_prefix becomes /app/driver:443 with :443 left literally in the prefix.

So the intended port 443 is silently dropped. This is the documented standard param form (the class docstring advertises sc://host[:port][/;params]), so combining it with path routing is a natural usage. Without the trailing slash (sc://gateway/app/driver:443;token=abc) it works correctly (last_segment='driver:443', port 443 extracted), which confirms the trailing / before ; is the trigger.

Suggested fix: parse the last non-empty path segment (strip a trailing / before rsplit) so the port is recognized. Given the docstring already promises the /;params form, fixing the parser seems preferable to documenting the limitation.

… methods

The gRPC interceptor methods lacked type annotations, failing mypy's
disallow_untyped_defs check ([no-untyped-def]).
@hiboyang

hiboyang commented Jul 1, 2026

Copy link
Copy Markdown
Author

Review feedback

Two issues in PathAwareChannelBuilder._extract_attributes (python/pyspark/sql/connect/client/core.py).

1. IPv6 endpoints are rejected (core.py:667)

netloc = self.url.netloc.split(":") with len(netloc) in (1, 2) breaks on IPv6 addresses:

  • sc://[::1]:15002/path1 -> netloc='[::1]:15002' -> split(':') yields ['[', '', '1]', '15002'] (4 parts) -> falls into the else branch and raises INVALID_CONNECT_URL.
  • sc://[::1]/path1 -> ['[', '', '1]'] (3 parts) -> same failure.

DefaultChannelBuilder (core.py:486-495) handles this correctly by using self.url.hostname / self.url.port and bracket-wrapping IPv6 hosts, so this new builder is a functional regression for IPv6 users. Two related issues on the same path:

  • sc://host:abc/p -> int(netloc[1]) raises a raw ValueError instead of the wrapped PySparkValueError.
  • sc://:15002/p -> ['', '15002'] (2 parts) -> host='' is silently accepted.

Suggested fix: parse host/port via self.url.hostname / self.url.port as DefaultChannelBuilder does (correct IPv6 bracket-wrapping, hostname validation, consistent error path), and derive netloc_has_port from self.url.port is not None.

2. Path port is lost when combined with the standard /;params form (core.py:688-700)

The trailing-path-port syntax loses the intended port when users combine it with Spark's existing /;param=value form. For example sc://gateway/app/driver:443/;token=abc parses (via urlparse) to path='/app/driver:443/', params='token=abc'. Then:

  • last_segment = prefix.rsplit("/", 1)[-1] on /app/driver:443/ -> '' (the trailing slash leaves an empty last segment).
  • ":" in last_segment is therefore False, so the port-extraction block (lines 693-700) is skipped.
  • Result: self._port stays at the fallback 15002, and _path_prefix becomes /app/driver:443 with :443 left literally in the prefix.

So the intended port 443 is silently dropped. This is the documented standard param form (the class docstring advertises sc://host[:port][/;params]), so combining it with path routing is a natural usage. Without the trailing slash (sc://gateway/app/driver:443;token=abc) it works correctly (last_segment='driver:443', port 443 extracted), which confirms the trailing / before ; is the trigger.

Suggested fix: parse the last non-empty path segment (strip a trailing / before rsplit) so the port is recognized. Given the docstring already promises the /;params form, fixing the parser seems preferable to documenting the limitation.

Thanks for the comments! Will make these changes after I get initial PR check passing.

@viirya

viirya commented Jul 1, 2026

Copy link
Copy Markdown
Member

Thanks for the update — the two issues from the earlier round (IPv6 handling and the path port being lost with the trailing-slash /;params form) are both fixed correctly in the current revision, with tests covering them.

A few more issues from a closer pass, roughly in order of importance:

1. The 443-implies-TLS rule doesn't match the docstring, and one form gets port 443 without TLS

The docstring says "TLS is enabled implicitly when the resolved port is 443 and ;use_ssl= was not specified", but the implication sits inside the if prefix != "/" block, so:

  • PathAwareChannelBuilder("sc://gateway:443") (no path) gets no implicit TLS, contradicting the docstring. This matches DefaultChannelBuilder behavior, so maybe it's the docstring that should be narrowed to "when a path prefix is present".
  • PathAwareChannelBuilder("sc://host/:443"): the port-extraction block runs (last_segment is :443, so self._port becomes 443), but the prefix then collapses to / and the guarded block is skipped — no TLS implication, no interceptor. The client attempts plaintext gRPC against an HTTPS port. The port assignment and the TLS implication should probably live under the same condition.

2. A ;key=value on a non-final path segment is silently absorbed into the prefix

urlparse only strips params from the last path segment, so for sc://gateway/app;token=abc/driver we get url.params == "" — the token is never set, and /app;token=abc/driver becomes the path prefix. Auth silently fails and the credential is sent inside the :path of every RPC (and lands in ingress access logs). Since ; can never be legal inside a route prefix under this grammar, rejecting a prefix containing ; with INVALID_CONNECT_URL would turn this silent failure into a clear error.

3. Consider subclassing DefaultChannelBuilder instead of copying it

The scheme check in __init__, the params/hostname/port block of _extract_attributes, the secure/use_ssl/host/endpoint properties, and the entire toChannel are byte-identical copies of DefaultChannelBuilder. The copies have already diverged in one place: the overridden default_port() hardcodes 15002 and drops DefaultChannelBuilder.default_port()'s SPARK_TESTING ephemeral-port resolution, so under the local test harness a portless URL connects to the wrong port.

Since DefaultChannelBuilder rejects non-empty paths in __init__, the subclass can pre-process the raw URL string instead: extract the path prefix (and the optional trailing :port), rebuild a path-free sc://host:port/;params URL, delegate to super().__init__, then set the SSL default and register the interceptor. That reduces the ~120 duplicated lines to ~30 lines of genuinely new logic, and future fixes to parsing/credential handling apply to both builders automatically.

4. The connection-string spec and other clients

sql/connect/docs/client-connection-string.md states "The path component must be empty" and lists sc://myhost.com:443/mypathprefix/;token=AAAAAAA as an explicitly invalid example — which is the exact form this PR makes valid, in the Python client only (the Scala client's verifyURI still rejects non-empty paths). Since the new form is opt-in via a separate builder and the default sc:// parser is unchanged, I don't think this blocks the PR, but the doc should be updated to describe the extension (or explicitly mark it Python-only), and a follow-up JIRA for Scala-client parity would be good so the same connection string stays portable.

5. Path-derived port skips validation the netloc port gets

The netloc form goes through urlparse, which enforces 0-65535, but the path form is a bare int(): driver:99999 -> port 99999, driver:-1 -> port -1, and driver:4_43 -> 443 via PEP 515 underscores, all silently accepted with the bogus :PORT stripped from the prefix. A range/isdigit() check raising INVALID_CONNECT_URL would make the two forms consistent.

6. SparkSession.Builder.channelBuilder annotation

session.py annotates channelBuilder (and _channel_builder at L129, and connection at L282) as DefaultChannelBuilder, but PathAwareChannelBuilder extends ChannelBuilder directly — so the usage advertised in this PR's description is an arg-type error under mypy/pyright. Runtime is fine, but those annotations should be widened to ChannelBuilder as part of this PR, since this is the first shipped builder that isn't a DefaultChannelBuilder.

Minor: the class docstring's doctest is malformed (the ... continuation line holds an independent expression, and the expected output uses double-quoted repr) — it's copied from DefaultChannelBuilder's existing docstring and never collected, but new code needn't repeat it. The four intercept_* methods with identical bodies could also collapse into one implementation aliased four times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants