Commit f111d80
[SPARK-57626][SQL] Share repeated nested JSON path parsing
### What changes were proposed in this pull request?
This PR extends the internal shared parser introduced by [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 from simple top-level fields to repeated literal named object paths.
The proposed changes:
- Parse literal named JSON paths into field-name segments. Both dot notation such as `$.payload.user.id` and quoted names such as `$['payload']['user.name']` are supported.
- Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan.
- Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values.
- Keep ancestor and descendant paths in separate shared parses. For example, `$.a` never shares a parse with `$.a.b`.
- Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains.
- Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal.
- Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent `GetJsonObject` evaluation.
- Continue using the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration. The existing JSON expression optimization must also be enabled.
- Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths.
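As a rough illustration of the first bullet, the sketch below splits a literal named path into field-name segments and rejects the unsupported forms. It is not the Catalyst parser: `parse_named_path`, `MAX_NAMED_DEPTH`, and the regular expression are hypothetical names and a simplified grammar assumed for this example, and only the 64-field limit comes from the description above.

```python
import re
from typing import List, Optional

# Depth limit taken from the PR description; the constant name is hypothetical.
MAX_NAMED_DEPTH = 64

# One segment is either `.name` or `['quoted name']` (assumed, simplified grammar).
_SEGMENT = re.compile(r"\.([^.\[]+)|\['([^']+)'\]")


def parse_named_path(path: str) -> Optional[List[str]]:
    """Return field-name segments, or None when the path is not eligible."""
    if not path.startswith("$"):
        return None
    segments: List[str] = []
    pos = 1
    while pos < len(path):
        match = _SEGMENT.match(path, pos)
        if match is None:
            # Array subscripts such as [0] and any other syntax land here.
            return None
        name = match.group(1) if match.group(1) is not None else match.group(2)
        if name == "*":
            # Wildcards stay on the independent evaluation path.
            return None
        segments.append(name)
        pos = match.end()
    if not segments or len(segments) > MAX_NAMED_DEPTH:
        return None
    return segments
```

Under these assumptions, `$.payload.user.id` and `$['payload']['user.name']` both yield segment lists, while `$.items[0].id` and `$.payload.*` yield `None` and would keep their independent evaluation.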
For example, these prefix-free paths share one parse:
```text
$.payload.user.id
$.payload.user.name
$.payload.request_id
```
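To make the trie idea concrete, here is a minimal sketch that folds the three paths above into one nested structure. The representation (plain dictionaries with integer leaves holding the output index) and the name `build_trie` are assumptions for illustration, not the data structure used in the patch.

```python
from typing import Dict, List, Union

# An inner node maps a field name to a child; a leaf is the output index.
Trie = Dict[str, Union[int, "Trie"]]


def build_trie(paths: List[List[str]]) -> Trie:
    """Build a trie from prefix-free, distinct segment lists."""
    root: Trie = {}
    for index, segments in enumerate(paths):
        node = root
        for name in segments[:-1]:
            child = node.setdefault(name, {})
            if not isinstance(child, dict):
                # An ancestor path already ends here, so the set is not prefix-free.
                raise ValueError(f"prefix conflict at {name!r}")
            node = child
        if segments[-1] in node:
            # Either a duplicate path or a descendant path already passes through.
            raise ValueError(f"prefix conflict at {segments[-1]!r}")
        node[segments[-1]] = index
    return root


trie = build_trie([
    ["payload", "user", "id"],
    ["payload", "user", "name"],
    ["payload", "request_id"],
])
print(trie)
```

Because the group is prefix-free, every node is either an inner node or a leaf, never both, which is what lets a single scan decide at each key whether to descend, capture, or skip.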
For this ordered set of ancestor/descendant paths:
```text
$.a
$.a.b
$.x
$.x.y
```
the optimizer creates two prefix-free groups in one invocation:
```text
group 1: $.a, $.x
group 2: $.a.b, $.x.y
```
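The grouping above can be reproduced with a simple first-fit pass, sketched below. This is only one plausible reading of "greedy prefix-free groups": the function names are hypothetical, the input paths are assumed to be distinct, and the quadratic conflict check is chosen for clarity rather than taken from the optimizer rule.

```python
from typing import List, Tuple

Path = Tuple[str, ...]


def is_prefix(a: Path, b: Path) -> bool:
    """True when path a is an ancestor of (or equal to) path b."""
    return len(a) <= len(b) and b[:len(a)] == a


def conflicts(a: Path, b: Path) -> bool:
    """Ancestor/descendant pairs may not share a parse."""
    return is_prefix(a, b) or is_prefix(b, a)


def group_prefix_free(paths: List[Path]) -> List[List[Path]]:
    """Place each path, in order, into the first group it does not conflict with."""
    groups: List[List[Path]] = []
    for path in paths:
        for group in groups:
            if not any(conflicts(path, other) for other in group):
                group.append(path)
                break
        else:
            # No existing group can take this path, so open a new one.
            groups.append([path])
    return groups


groups = group_prefix_free([("a",), ("a", "b"), ("x",), ("x", "y")])
print(groups)
```

All groups are produced in a single pass over the ordered paths, which mirrors the stated goal of avoiding repeated fixed-point iterations for parallel prefix chains.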
Unsupported forms such as `$.items[0].id`, `$.payload.*`, and paths supplied by another column remain unchanged.
### Why are the changes needed?
[SPARK-57626](https://issues.apache.org/jira/browse/SPARK-57626) follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently.
For example, consider an `events` table whose `json` column contains:
```json
{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}
```
An existing query might run:
```sql
SELECT
get_json_object(json, '$.payload.user.id') AS user_id,
get_json_object(json, '$.payload.user.name') AS user_name,
get_json_object(json, '$.payload.request_id') AS request_id
FROM events;
```
Before this PR, Spark parses each row's `json` value three times, once per nested extraction. With shared parsing enabled, Catalyst rewrites the three calls into one internal `MultiGetJsonObject`, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change.
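The single-scan behavior can be approximated outside Spark with the sketch below. It is an illustration under stated assumptions, not the `MultiGetJsonObject` implementation: it parses the whole document with Python's `json` module instead of streaming, all function and class names are hypothetical, rendering is simplified, and malformed input simply yields all nulls. It does model the first non-null duplicate-key rule and skips non-object intermediate values.

```python
import json
from typing import Any, List, Optional


class JsonObject(list):
    """Ordered (key, value) pairs, so duplicate keys survive parsing."""


def _plain(value: Any) -> Any:
    # Convert back to plain containers for rendering (simplified).
    if isinstance(value, JsonObject):
        return {k: _plain(v) for k, v in value}
    if isinstance(value, list):
        return [_plain(v) for v in value]
    return value


def _render(value: Any) -> str:
    # Strings are returned unquoted; everything else as compact JSON text.
    if isinstance(value, str):
        return value
    return json.dumps(_plain(value), separators=(",", ":"))


def _build_trie(paths: List[List[str]]) -> dict:
    # Assumes a prefix-free group: leaves hold output indexes.
    root: dict = {}
    for index, segments in enumerate(paths):
        node = root
        for name in segments[:-1]:
            node = node.setdefault(name, {})
        node[segments[-1]] = index
    return root


def _walk(obj: JsonObject, node: dict, results: List[Optional[str]]) -> None:
    for key, value in obj:
        target = node.get(key)
        if target is None:
            continue  # key not requested by any path
        if isinstance(target, int):
            # Keep the first non-null match for this path.
            if results[target] is None and value is not None:
                results[target] = _render(value)
        elif isinstance(value, JsonObject):
            _walk(value, target, results)
        # Non-object intermediate values are skipped.


def extract_all(doc: str, paths: List[List[str]]) -> List[Optional[str]]:
    """Return one value per path from a single parse of doc."""
    results: List[Optional[str]] = [None] * len(paths)
    try:
        parsed = json.loads(doc, object_pairs_hook=JsonObject)
    except ValueError:
        return results
    if isinstance(parsed, JsonObject):
        _walk(parsed, _build_trie(paths), results)
    return results


doc = '{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}'
paths = [["payload", "user", "id"], ["payload", "user", "name"], ["payload", "request_id"]]
print(extract_all(doc, paths))
```

For the sample row, the three requested values come back from one parse, in the same order as the requested paths.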
### Does this PR introduce _any_ user-facing change?
Yes, within the unreleased `master` branch only. When the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration is enabled, eligible repeated simple nested `get_json_object` paths now use shared parsing; previously only top-level paths were eligible.
There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected.
### How was this patch tested?
The following compilation, suites, and style checks passed on JDK 17:
```bash
build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"
build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"
build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle"
git diff --check
```
The complete `OptimizeJsonExprsSuite` passed 24 tests, the complete `JsonFunctionsSuite` passed 106 tests, and `SparkConfigBindingPolicySuite` passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.
The microbenchmark and its committed result files were regenerated with Spark's `Run benchmarks` GitHub Actions workflow for [JDK 17](https://github.com/sunchao/spark/actions/runs/28035861052), [JDK 21](https://github.com/sunchao/spark/actions/runs/28035861792), and [JDK 25](https://github.com/sunchao/spark/actions/runs/28035859413). All three runs passed and committed their JDK-specific results to this branch.
Best JDK 17 times for 200,000 cached rows with 32 fields under a nested `payload` object:
| Nested paths extracted | Shared parsing off | Shared parsing on | Relative speedup |
| ---: | ---: | ---: | ---: |
| 2 | 1,498 ms | 894 ms | 1.7x |
| 4 | 2,969 ms | 1,129 ms | 2.6x |
| 8 | 5,841 ms | 1,378 ms | 4.2x |
| 16 | 11,856 ms | 1,926 ms | 6.2x |
These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three `SharedJsonParseBenchmark` result files.
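As a quick check of the table's arithmetic, the sketch below recomputes the relative speedups from the JDK 17 timings and also derives the per-path cost with shared parsing off. The variable and function names are illustrative; the numbers are copied from the table above.

```python
# (paths extracted) -> (shared parsing off, shared parsing on), in milliseconds
timings_ms = {
    2: (1498, 894),
    4: (2969, 1129),
    8: (5841, 1378),
    16: (11856, 1926),
}


def speedup(off_ms: int, on_ms: int) -> float:
    """Relative speedup rounded to one decimal, as reported in the table."""
    return round(off_ms / on_ms, 1)


for n_paths, (off_ms, on_ms) in timings_ms.items():
    per_path_off = off_ms / n_paths
    print(f"{n_paths:>2} paths: {speedup(off_ms, on_ms)}x, "
          f"{per_path_off:.0f} ms per path with sharing off")
```

With sharing off, the cost stays near 730 to 750 ms per extracted path, consistent with one full parse per path, while the shared-parsing time grows far more slowly as paths are added.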
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex (GPT-5)
Closes #56685 from sunchao/codex/SPARK-57626-shared-nested-json-paths.
Lead-authored-by: Chao Sun <chao@openai.com>
Co-authored-by: sunchao <sunchao@users.noreply.github.com>
Signed-off-by: Chao Sun <chao@openai.com>
1 parent eb69a3e commit f111d80
10 files changed
Lines changed: 596 additions & 117 deletions
File tree
- sql
- catalyst/src
- main/scala/org/apache/spark/sql
- catalyst
- expressions
- json
- optimizer
- internal
- test/scala/org/apache/spark/sql/catalyst/optimizer
- core
- benchmarks
- src/test/scala/org/apache/spark/sql
- execution/datasources/json
Lines changed: 138 additions & 34 deletions
Lines changed: 24 additions & 14 deletions