[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685
[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685sunchao wants to merge 7 commits into
Conversation
|
cc @cloud-fan @viirya @peter-toth @dongjoon-hyun a follow-up on #56547. Please take a look, thanks. 🙏 |
|
Before reviewing, seems the PR description is broken (not following full template) at the first glance. |
|
Oops. Updated the PR description. |
cloud-fan
left a comment
There was a problem hiding this comment.
Reviewed with AI assistance. Overall a clean, well-tested PR — the nested-path sharing rewrite is result-equivalent to independent get_json_object evaluation. Two minor, non-blocking notes.
0 blocking, 2 non-blocking, 0 nits.
Design / architecture (1)
- jsonExpressions.scala:198:
MultiGetJsonObject.fieldNamesis now vestigial — only its length is used; path info flows throughfallbackPaths/namedPaths— see inline
Suggestions (1)
- SharedJsonParseBenchmark.scala:88: benchmark gained nested-path cases but the committed
SharedJsonParseBenchmark-*results.txtweren't regenerated — see inline
Verification
Traced the rewrite's equivalence to independent GetJsonObject evaluation across: per-path first-non-null over duplicate parent objects and duplicate keys (matched/hasUnmatched), null values at every level, non-object intermediates, prefix-conflict gating into separate prefix-free groups, depth/unsupported-path gating, and malformed-input/rendering-failure isolation — all match legacy semantics. Confirmed by the shared-on-vs-legacy tests in JsonFunctionsSuite, which assert independently hardcoded expected values rather than self-consistency.
….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)
viirya
left a comment
There was a problem hiding this comment.
Traced the trie evaluator and optimizer grouping against the duplicate-key / null / non-object semantics the tests assert; behavior matches legacy GetJsonObject. The fieldNames removal is a nice cleanup, and the committed JDK 17/21/25 benchmark results are a welcome addition. Two minor, non-blocking notes inline.
The design is consistent with the existing top-level path (idempotent rule, top-level fast path, full path as the ordinal key so $.a.b and $.f.b stay distinct), and coverage is thorough across both prefix-conflict directions, one-pass parallel-chain grouping, the 2000-path and depth-65 cases, duplicate keys, nulls, non-object intermediates, malformed/single-quoted input, rendering-failure isolation, and default-off plan equivalence.
viirya
left a comment
There was a problem hiding this comment.
Thanks for the quick turnaround - re-reviewed at 917b893, and all four threads are resolved well:
- Depth-bound invariant: the note on
MultiGetJsonObject(OptimizeCsvJsonExprs caps shared path depth to keep evaluator recursion stack-safe) captures exactly the invariant I was after - the recursive evaluator's stack safety now self-documents its reliance on the optimizer cap. - Multi-element
terminalOrdinals: the comment inextractValuecorrectly explains both that optimizer paths are deduplicated and that the multi-ordinal fold is retained only as defensive support for directly constructed expressions. Clear now. - Vestigial
fieldNames: cleanly removed from bothMultiGetJsonObjectand the evaluator, arity derived fromfallbackPaths; I confirmed no stale references remain (the otherfieldNameshits in these files are unrelatedJsonTuplecode). - Benchmark results: regenerated and committed for JDK 17/21/25.
The core logic I traced earlier (trie duplicate-key first-wins semantics, prefix-conflict grouping, the grouped(2) path parsing) is unchanged and still correct. LGTM.
uros-b
left a comment
There was a problem hiding this comment.
Thank you @sunchao and @viirya @cloud-fan @dongjoon-hyun!
### What changes were proposed in this pull request? This PR extends the internal shared parser introduced by [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 from simple top-level fields to repeated literal named object paths. The proposed changes: - Parse literal named JSON paths into field-name segments. Both dot notation such as `$.payload.user.id` and quoted names such as `$['payload']['user.name']` are supported. - Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan. - Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values. - Keep ancestor and descendant paths in separate shared parses. For example, `$.a` never shares a parse with `$.a.b`. - Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains. - Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal. - Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent `GetJsonObject` evaluation. - Continue using the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration. The existing JSON expression optimization must also be enabled. - Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths. For example, these prefix-free paths share one parse: ```text $.payload.user.id $.payload.user.name $.payload.request_id ``` For this ordered set of ancestor/descendant paths: ```text $.a $.a.b $.x $.x.y ``` the optimizer creates two prefix-free groups in one invocation: ```text group 1: $.a, $.x group 2: $.a.b, $.x.y ``` Unsupported forms such as `$.items[0].id`, `$.payload.*`, and paths supplied by another column remain unchanged. ### Why are the changes needed? [SPARK-57626](https://issues.apache.org/jira/browse/SPARK-57626) follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently. For example, consider an `events` table whose `json` column contains: ```json {"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}} ``` An existing query might run: ```sql SELECT get_json_object(json, '$.payload.user.id') AS user_id, get_json_object(json, '$.payload.user.name') AS user_name, get_json_object(json, '$.payload.request_id') AS request_id FROM events; ``` Before this PR, Spark parses each row's `json` value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal `MultiGetJsonObject`, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change. ### Does this PR introduce _any_ user-facing change? Yes, within the unreleased `master` branch only. When the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration is enabled, eligible repeated simple nested `get_json_object` paths now use shared parsing; previously only top-level paths were eligible. There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected. ### How was this patch tested? The following compilation, suites, and style checks passed on JDK 17: ```bash build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile" build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite" build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite" build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite" build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle" git diff --check ``` The complete `OptimizeJsonExprsSuite` passed 24 tests, the complete `JsonFunctionsSuite` passed 106 tests, and `SparkConfigBindingPolicySuite` passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation. The microbenchmark and its committed result files were regenerated with Spark's `Run benchmarks` GitHub Actions workflow for [JDK 17](https://github.com/sunchao/spark/actions/runs/28035861052), [JDK 21](https://github.com/sunchao/spark/actions/runs/28035861792), and [JDK 25](https://github.com/sunchao/spark/actions/runs/28035859413). All three runs passed and committed their JDK-specific results to this branch. Best JDK 17 times for 200,000 cached rows with 32 fields under a nested `payload` object: | Nested paths extracted | Shared parsing off | Shared parsing on | Relative speedup | | ---: | ---: | ---: | ---: | | 2 | 1,498 ms | 894 ms | 1.7x | | 4 | 2,969 ms | 1,129 ms | 2.6x | | 8 | 5,841 ms | 1,378 ms | 4.2x | | 16 | 11,856 ms | 1,926 ms | 6.2x | These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three `SharedJsonParseBenchmark` result files. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: OpenAI Codex (GPT-5) Closes #56685 from sunchao/codex/SPARK-57626-shared-nested-json-paths. Lead-authored-by: Chao Sun <chao@openai.com> Co-authored-by: sunchao <sunchao@users.noreply.github.com> Signed-off-by: Chao Sun <chao@openai.com> (cherry picked from commit f111d80) Signed-off-by: Chao Sun <chao@openai.com>
|
Thanks all!!! Merged to |
What changes were proposed in this pull request?
This PR extends the internal shared parser introduced by SPARK-47670 and #56547 from simple top-level fields to repeated literal named object paths.
The proposed changes:
$.payload.user.idand quoted names such as$['payload']['user.name']are supported.$.anever shares a parse with$.a.b.GetJsonObjectevaluation.spark.sql.optimizer.getJsonObjectSharedParsing.enabledconfiguration. The existing JSON expression optimization must also be enabled.For example, these prefix-free paths share one parse:
For this ordered set of ancestor/descendant paths:
the optimizer creates two prefix-free groups in one invocation:
Unsupported forms such as
$.items[0].id,$.payload.*, and paths supplied by another column remain unchanged.Why are the changes needed?
SPARK-57626 follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently.
For example, consider an
eventstable whosejsoncolumn contains:{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}An existing query might run:
Before this PR, Spark parses each row's
jsonvalue independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internalMultiGetJsonObject, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change.Does this PR introduce any user-facing change?
Yes, within the unreleased
masterbranch only. When the existing internal, default-disabledspark.sql.optimizer.getJsonObjectSharedParsing.enabledconfiguration is enabled, eligible repeated simple nestedget_json_objectpaths now use shared parsing; previously only top-level paths were eligible.There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected.
How was this patch tested?
The following compilation, suites, and style checks passed on JDK 17:
The complete
OptimizeJsonExprsSuitepassed 24 tests, the completeJsonFunctionsSuitepassed 106 tests, andSparkConfigBindingPolicySuitepassed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.The microbenchmark and its committed result files were regenerated with Spark's
Run benchmarksGitHub Actions workflow for JDK 17, JDK 21, and JDK 25. All three runs passed and committed their JDK-specific results to this branch.Best JDK 17 times for 200,000 cached rows with 32 fields under a nested
payloadobject:These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three
SharedJsonParseBenchmarkresult files.Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex (GPT-5)