[SPARK-57626][SQL] Share repeated nested JSON path parsing by sunchao · Pull Request #56685 · apache/spark

sunchao · 2026-06-23T00:33:27Z

What changes were proposed in this pull request?

This PR extends the internal shared parser introduced by SPARK-47670 and #56547 from simple top-level fields to repeated literal named object paths.

The proposed changes:

Parse literal named JSON paths into field-name segments. Both dot notation such as $.payload.user.id and quoted names such as $['payload']['user.name'] are supported.
Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan.
Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values.
Keep ancestor and descendant paths in separate shared parses. For example, $.a never shares a parse with $.a.b.
Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains.
Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal.
Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent GetJsonObject evaluation.
Continue using the existing internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration. The existing JSON expression optimization must also be enabled.
Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths.

For example, these prefix-free paths share one parse:

$.payload.user.id
$.payload.user.name
$.payload.request_id

For this ordered set of ancestor/descendant paths:

$.a
$.a.b
$.x
$.x.y

the optimizer creates two prefix-free groups in one invocation:

group 1: $.a,   $.x
group 2: $.a.b, $.x.y

Unsupported forms such as $.items[0].id, $.payload.*, and paths supplied by another column remain unchanged.

Why are the changes needed?

SPARK-57626 follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently.

For example, consider an events table whose json column contains:

{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}

An existing query might run:

SELECT
  get_json_object(json, '$.payload.user.id') AS user_id,
  get_json_object(json, '$.payload.user.name') AS user_name,
  get_json_object(json, '$.payload.request_id') AS request_id
FROM events;

Before this PR, Spark parses each row's json value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal MultiGetJsonObject, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change.

Does this PR introduce any user-facing change?

Yes, within the unreleased master branch only. When the existing internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration is enabled, eligible repeated simple nested get_json_object paths now use shared parsing; previously only top-level paths were eligible.

There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected.

How was this patch tested?

The following compilation, suites, and style checks passed on JDK 17:

build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"
build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"
build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle"
git diff --check

The complete OptimizeJsonExprsSuite passed 24 tests, the complete JsonFunctionsSuite passed 106 tests, and SparkConfigBindingPolicySuite passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.

The microbenchmark and its committed result files were regenerated with Spark's Run benchmarks GitHub Actions workflow for JDK 17, JDK 21, and JDK 25. All three runs passed and committed their JDK-specific results to this branch.

Best JDK 17 times for 200,000 cached rows with 32 fields under a nested payload object:

Nested paths extracted	Shared parsing off	Shared parsing on	Relative speedup
2	1,498 ms	894 ms	1.7x
4	2,969 ms	1,129 ms	2.6x
8	5,841 ms	1,378 ms	4.2x
16	11,856 ms	1,926 ms	6.2x

These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three SharedJsonParseBenchmark result files.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

sunchao · 2026-06-23T00:50:33Z

cc @cloud-fan @viirya @peter-toth @dongjoon-hyun a follow-up on #56547. Please take a look, thanks. 🙏

viirya · 2026-06-23T01:43:56Z

Before reviewing, seems the PR description is broken (not following full template) at the first glance.

sunchao · 2026-06-23T03:01:45Z

Oops. Updated the PR description.

cloud-fan

Reviewed with AI assistance. Overall a clean, well-tested PR — the nested-path sharing rewrite is result-equivalent to independent get_json_object evaluation. Two minor, non-blocking notes.

0 blocking, 2 non-blocking, 0 nits.

Design / architecture (1)

jsonExpressions.scala:198: MultiGetJsonObject.fieldNames is now vestigial — only its length is used; path info flows through fallbackPaths/namedPaths — see inline

Suggestions (1)

SharedJsonParseBenchmark.scala:88: benchmark gained nested-path cases but the committed SharedJsonParseBenchmark-*results.txt weren't regenerated — see inline

Verification

Traced the rewrite's equivalence to independent GetJsonObject evaluation across: per-path first-non-null over duplicate parent objects and duplicate keys (matched/hasUnmatched), null values at every level, non-object intermediates, prefix-conflict gating into separate prefix-free groups, depth/unsupported-path gating, and malformed-input/rendering-failure isolation — all match legacy semantics. Confirmed by the shared-on-vs-legacy tests in JsonFunctionsSuite, which assert independently hardcoded expected values rather than self-consistency.

….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)

….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)

….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)

viirya

Traced the trie evaluator and optimizer grouping against the duplicate-key / null / non-object semantics the tests assert; behavior matches legacy GetJsonObject. The fieldNames removal is a nice cleanup, and the committed JDK 17/21/25 benchmark results are a welcome addition. Two minor, non-blocking notes inline.

The design is consistent with the existing top-level path (idempotent rule, top-level fast path, full path as the ordinal key so $.a.b and $.f.b stay distinct), and coverage is thorough across both prefix-conflict directions, one-pass parallel-chain grouping, the 2000-path and depth-65 cases, duplicate keys, nulls, non-object intermediates, malformed/single-quoted input, rendering-failure isolation, and default-off plan equivalence.

viirya

Thanks for the quick turnaround - re-reviewed at 917b893, and all four threads are resolved well:

Depth-bound invariant: the note on MultiGetJsonObject (OptimizeCsvJsonExprs caps shared path depth to keep evaluator recursion stack-safe) captures exactly the invariant I was after - the recursive evaluator's stack safety now self-documents its reliance on the optimizer cap.
Multi-element terminalOrdinals: the comment in extractValue correctly explains both that optimizer paths are deduplicated and that the multi-ordinal fold is retained only as defensive support for directly constructed expressions. Clear now.
Vestigial fieldNames: cleanly removed from both MultiGetJsonObject and the evaluator, arity derived from fallbackPaths; I confirmed no stale references remain (the other fieldNames hits in these files are unrelated JsonTuple code).
Benchmark results: regenerated and committed for JDK 17/21/25.

The core logic I traced earlier (trie duplicate-key first-wins semantics, prefix-conflict grouping, the grouped(2) path parsing) is unchanged and still correct. LGTM.

dongjoon-hyun

+1, LGTM (with the last commit). For two nits, you can ignore them.

uros-b

Thank you @sunchao and @viirya @cloud-fan @dongjoon-hyun!

### What changes were proposed in this pull request? This PR extends the internal shared parser introduced by [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 from simple top-level fields to repeated literal named object paths. The proposed changes: - Parse literal named JSON paths into field-name segments. Both dot notation such as `$.payload.user.id` and quoted names such as `$['payload']['user.name']` are supported. - Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan. - Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values. - Keep ancestor and descendant paths in separate shared parses. For example, `$.a` never shares a parse with `$.a.b`. - Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains. - Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal. - Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent `GetJsonObject` evaluation. - Continue using the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration. The existing JSON expression optimization must also be enabled. - Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths. For example, these prefix-free paths share one parse: ```text $.payload.user.id $.payload.user.name $.payload.request_id ``` For this ordered set of ancestor/descendant paths: ```text $.a $.a.b $.x $.x.y ``` the optimizer creates two prefix-free groups in one invocation: ```text group 1: $.a, $.x group 2: $.a.b, $.x.y ``` Unsupported forms such as `$.items[0].id`, `$.payload.*`, and paths supplied by another column remain unchanged. ### Why are the changes needed? [SPARK-57626](https://issues.apache.org/jira/browse/SPARK-57626) follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently. For example, consider an `events` table whose `json` column contains: ```json {"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}} ``` An existing query might run: ```sql SELECT get_json_object(json, '$.payload.user.id') AS user_id, get_json_object(json, '$.payload.user.name') AS user_name, get_json_object(json, '$.payload.request_id') AS request_id FROM events; ``` Before this PR, Spark parses each row's `json` value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal `MultiGetJsonObject`, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change. ### Does this PR introduce _any_ user-facing change? Yes, within the unreleased `master` branch only. When the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration is enabled, eligible repeated simple nested `get_json_object` paths now use shared parsing; previously only top-level paths were eligible. There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected. ### How was this patch tested? The following compilation, suites, and style checks passed on JDK 17: ```bash build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile" build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite" build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite" build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite" build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle" git diff --check ``` The complete `OptimizeJsonExprsSuite` passed 24 tests, the complete `JsonFunctionsSuite` passed 106 tests, and `SparkConfigBindingPolicySuite` passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation. The microbenchmark and its committed result files were regenerated with Spark's `Run benchmarks` GitHub Actions workflow for [JDK 17](https://github.com/sunchao/spark/actions/runs/28035861052), [JDK 21](https://github.com/sunchao/spark/actions/runs/28035861792), and [JDK 25](https://github.com/sunchao/spark/actions/runs/28035859413). All three runs passed and committed their JDK-specific results to this branch. Best JDK 17 times for 200,000 cached rows with 32 fields under a nested `payload` object: | Nested paths extracted | Shared parsing off | Shared parsing on | Relative speedup | | ---: | ---: | ---: | ---: | | 2 | 1,498 ms | 894 ms | 1.7x | | 4 | 2,969 ms | 1,129 ms | 2.6x | | 8 | 5,841 ms | 1,378 ms | 4.2x | | 16 | 11,856 ms | 1,926 ms | 6.2x | These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three `SharedJsonParseBenchmark` result files. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenAI Codex (GPT-5) Closes #56685 from sunchao/codex/SPARK-57626-shared-nested-json-paths. Lead-authored-by: Chao Sun <chao@openai.com> Co-authored-by: sunchao <sunchao@users.noreply.github.com> Signed-off-by: Chao Sun <chao@openai.com> (cherry picked from commit f111d80) Signed-off-by: Chao Sun <chao@openai.com>

sunchao · 2026-06-23T23:01:15Z

Thanks all!!! Merged to master / branch-4.x

[SPARK-57626][SQL] Share repeated nested JSON path parsing

ee71fb1

sunchao marked this pull request as ready for review June 23, 2026 00:49

cloud-fan reviewed Jun 23, 2026

View reviewed changes

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

Comment thread ...rc/test/scala/org/apache/spark/sql/execution/datasources/json/SharedJsonParseBenchmark.scala

sunchao and others added 4 commits June 23, 2026 08:08

[SPARK-57626][SQL] Remove redundant shared JSON field names

bb0fff0

Benchmark results for org.apache.spark.sql.execution.datasources.json…

8c97ee4

….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.json…

6cff419

….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.json…

152293f

….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)

viirya reviewed Jun 23, 2026

View reviewed changes

Comment thread sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

Comment thread .../src/main/scala/org/apache/spark/sql/catalyst/expressions/json/JsonExpressionEvalUtils.scala

[SPARK-57626][SQL] Clarify shared JSON path invariants

917b893

viirya approved these changes Jun 23, 2026

View reviewed changes

dongjoon-hyun reviewed Jun 23, 2026

View reviewed changes

Comment thread .../src/main/scala/org/apache/spark/sql/catalyst/expressions/json/JsonExpressionEvalUtils.scala Outdated

dongjoon-hyun reviewed Jun 23, 2026

View reviewed changes

Comment thread .../src/main/scala/org/apache/spark/sql/catalyst/expressions/json/JsonExpressionEvalUtils.scala Outdated

dongjoon-hyun approved these changes Jun 23, 2026

View reviewed changes

[SPARK-57626][SQL] Address review nits

6754fbf

cloud-fan approved these changes Jun 23, 2026

View reviewed changes

uros-b approved these changes Jun 23, 2026

View reviewed changes

sunchao closed this in f111d80 Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685

[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685
sunchao wants to merge 7 commits into
apache:masterfrom
sunchao:codex/SPARK-57626-shared-nested-json-paths

sunchao commented Jun 23, 2026 •

edited

Loading

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

viirya commented Jun 23, 2026

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

cloud-fan left a comment

Uh oh!

Uh oh!

Uh oh!

viirya left a comment

Uh oh!

Uh oh!

Uh oh!

viirya left a comment

Uh oh!

Uh oh!

Uh oh!

dongjoon-hyun left a comment •

edited

Loading

Uh oh!

uros-b left a comment

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Uh oh!

Conversation

sunchao commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

viirya commented Jun 23, 2026

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

cloud-fan left a comment

Choose a reason for hiding this comment

Design / architecture (1)

Suggestions (1)

Verification

Uh oh!

Uh oh!

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

dongjoon-hyun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uros-b left a comment

Choose a reason for hiding this comment

Uh oh!

sunchao commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

sunchao commented Jun 23, 2026 •

edited

Loading

dongjoon-hyun left a comment •

edited

Loading