Skip to content

[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685

Closed
sunchao wants to merge 7 commits into
apache:masterfrom
sunchao:codex/SPARK-57626-shared-nested-json-paths
Closed

[SPARK-57626][SQL] Share repeated nested JSON path parsing#56685
sunchao wants to merge 7 commits into
apache:masterfrom
sunchao:codex/SPARK-57626-shared-nested-json-paths

Conversation

@sunchao

@sunchao sunchao commented Jun 23, 2026

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

This PR extends the internal shared parser introduced by SPARK-47670 and #56547 from simple top-level fields to repeated literal named object paths.

The proposed changes:

  • Parse literal named JSON paths into field-name segments. Both dot notation such as $.payload.user.id and quoted names such as $['payload']['user.name'] are supported.
  • Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan.
  • Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values.
  • Keep ancestor and descendant paths in separate shared parses. For example, $.a never shares a parse with $.a.b.
  • Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains.
  • Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal.
  • Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent GetJsonObject evaluation.
  • Continue using the existing internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration. The existing JSON expression optimization must also be enabled.
  • Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths.

For example, these prefix-free paths share one parse:

$.payload.user.id
$.payload.user.name
$.payload.request_id

For this ordered set of ancestor/descendant paths:

$.a
$.a.b
$.x
$.x.y

the optimizer creates two prefix-free groups in one invocation:

group 1: $.a,   $.x
group 2: $.a.b, $.x.y

Unsupported forms such as $.items[0].id, $.payload.*, and paths supplied by another column remain unchanged.

Why are the changes needed?

SPARK-57626 follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently.

For example, consider an events table whose json column contains:

{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}

An existing query might run:

SELECT
  get_json_object(json, '$.payload.user.id') AS user_id,
  get_json_object(json, '$.payload.user.name') AS user_name,
  get_json_object(json, '$.payload.request_id') AS request_id
FROM events;

Before this PR, Spark parses each row's json value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal MultiGetJsonObject, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change.

Does this PR introduce any user-facing change?

Yes, within the unreleased master branch only. When the existing internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration is enabled, eligible repeated simple nested get_json_object paths now use shared parsing; previously only top-level paths were eligible.

There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected.

How was this patch tested?

The following compilation, suites, and style checks passed on JDK 17:

build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"
build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"
build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle"
git diff --check

The complete OptimizeJsonExprsSuite passed 24 tests, the complete JsonFunctionsSuite passed 106 tests, and SparkConfigBindingPolicySuite passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.

The microbenchmark and its committed result files were regenerated with Spark's Run benchmarks GitHub Actions workflow for JDK 17, JDK 21, and JDK 25. All three runs passed and committed their JDK-specific results to this branch.

Best JDK 17 times for 200,000 cached rows with 32 fields under a nested payload object:

Nested paths extracted Shared parsing off Shared parsing on Relative speedup
2 1,498 ms 894 ms 1.7x
4 2,969 ms 1,129 ms 2.6x
8 5,841 ms 1,378 ms 4.2x
16 11,856 ms 1,926 ms 6.2x

These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three SharedJsonParseBenchmark result files.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

@sunchao sunchao marked this pull request as ready for review June 23, 2026 00:49
@sunchao

sunchao commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

cc @cloud-fan @viirya @peter-toth @dongjoon-hyun a follow-up on #56547. Please take a look, thanks. 🙏

@viirya

viirya commented Jun 23, 2026

Copy link
Copy Markdown
Member

Before reviewing, seems the PR description is broken (not following full template) at the first glance.

@sunchao

sunchao commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

Oops. Updated the PR description.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed with AI assistance. Overall a clean, well-tested PR — the nested-path sharing rewrite is result-equivalent to independent get_json_object evaluation. Two minor, non-blocking notes.

0 blocking, 2 non-blocking, 0 nits.

Design / architecture (1)

  • jsonExpressions.scala:198: MultiGetJsonObject.fieldNames is now vestigial — only its length is used; path info flows through fallbackPaths/namedPaths — see inline

Suggestions (1)

  • SharedJsonParseBenchmark.scala:88: benchmark gained nested-path cases but the committed SharedJsonParseBenchmark-*results.txt weren't regenerated — see inline

Verification

Traced the rewrite's equivalence to independent GetJsonObject evaluation across: per-path first-non-null over duplicate parent objects and duplicate keys (matched/hasUnmatched), null values at every level, non-object intermediates, prefix-conflict gating into separate prefix-free groups, depth/unsupported-path gating, and malformed-input/rendering-failure isolation — all match legacy semantics. Confirmed by the shared-on-vs-legacy tests in JsonFunctionsSuite, which assert independently hardcoded expected values rather than self-consistency.

sunchao and others added 4 commits June 23, 2026 08:08
….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Traced the trie evaluator and optimizer grouping against the duplicate-key / null / non-object semantics the tests assert; behavior matches legacy GetJsonObject. The fieldNames removal is a nice cleanup, and the committed JDK 17/21/25 benchmark results are a welcome addition. Two minor, non-blocking notes inline.

The design is consistent with the existing top-level path (idempotent rule, top-level fast path, full path as the ordinal key so $.a.b and $.f.b stay distinct), and coverage is thorough across both prefix-conflict directions, one-pass parallel-chain grouping, the 2000-path and depth-65 cases, duplicate keys, nulls, non-object intermediates, malformed/single-quoted input, rendering-failure isolation, and default-off plan equivalence.

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround - re-reviewed at 917b893, and all four threads are resolved well:

  • Depth-bound invariant: the note on MultiGetJsonObject (OptimizeCsvJsonExprs caps shared path depth to keep evaluator recursion stack-safe) captures exactly the invariant I was after - the recursive evaluator's stack safety now self-documents its reliance on the optimizer cap.
  • Multi-element terminalOrdinals: the comment in extractValue correctly explains both that optimizer paths are deduplicated and that the multi-ordinal fold is retained only as defensive support for directly constructed expressions. Clear now.
  • Vestigial fieldNames: cleanly removed from both MultiGetJsonObject and the evaluator, arity derived from fallbackPaths; I confirmed no stale references remain (the other fieldNames hits in these files are unrelated JsonTuple code).
  • Benchmark results: regenerated and committed for JDK 17/21/25.

The core logic I traced earlier (trie duplicate-key first-wins semantics, prefix-conflict grouping, the grouped(2) path parsing) is unchanged and still correct. LGTM.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM (with the last commit). For two nits, you can ignore them.

@uros-b uros-b left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sunchao sunchao closed this in f111d80 Jun 23, 2026
sunchao added a commit that referenced this pull request Jun 23, 2026
### What changes were proposed in this pull request?

This PR extends the internal shared parser introduced by [SPARK-47670](https://issues.apache.org/jira/browse/SPARK-47670) and #56547 from simple top-level fields to repeated literal named object paths.

The proposed changes:

- Parse literal named JSON paths into field-name segments. Both dot notation such as `$.payload.user.id` and quoted names such as `$['payload']['user.name']` are supported.
- Build a runtime path trie so multiple nested fields can be extracted in one streaming JSON scan.
- Preserve the first non-null duplicate-key match independently for each requested path, including duplicate parent objects and non-object intermediate values.
- Keep ancestor and descendant paths in separate shared parses. For example, `$.a` never shares a parse with `$.a.b`.
- Build all greedy prefix-free groups in one optimizer invocation, avoiding repeated fixed-point iterations for parallel prefix chains.
- Retain the existing top-level fast path so top-level sharing does not pay for runtime trie traversal.
- Leave dynamic paths, wildcards, array subscripts, and paths deeper than 64 named fields on their existing independent `GetJsonObject` evaluation.
- Continue using the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration. The existing JSON expression optimization must also be enabled.
- Extend optimizer, runtime, code-generation, malformed-input, and microbenchmark coverage for nested paths.

For example, these prefix-free paths share one parse:

```text
$.payload.user.id
$.payload.user.name
$.payload.request_id
```

For this ordered set of ancestor/descendant paths:

```text
$.a
$.a.b
$.x
$.x.y
```

the optimizer creates two prefix-free groups in one invocation:

```text
group 1: $.a,   $.x
group 2: $.a.b, $.x.y
```

Unsupported forms such as `$.items[0].id`, `$.payload.*`, and paths supplied by another column remain unchanged.

### Why are the changes needed?

[SPARK-57626](https://issues.apache.org/jira/browse/SPARK-57626) follows up on the initial shared-parsing optimization. The initial implementation intentionally supports only simple top-level fields, so repeated literal nested paths still parse the same JSON independently.

For example, consider an `events` table whose `json` column contains:

```json
{"payload":{"user":{"id":123,"name":"alice"},"request_id":"r-1"}}
```

An existing query might run:

```sql
SELECT
  get_json_object(json, '$.payload.user.id') AS user_id,
  get_json_object(json, '$.payload.user.name') AS user_name,
  get_json_object(json, '$.payload.request_id') AS request_id
FROM events;
```

Before this PR, Spark parses each row's `json` value independently for all three nested extractions. With shared parsing enabled, Catalyst rewrites them to one internal `MultiGetJsonObject`, so the input is scanned once and all three values are returned from that scan. The SQL and its results do not change.

### Does this PR introduce _any_ user-facing change?

Yes, within the unreleased `master` branch only. When the existing internal, default-disabled `spark.sql.optimizer.getJsonObjectSharedParsing.enabled` configuration is enabled, eligible repeated simple nested `get_json_object` paths now use shared parsing; previously only top-level paths were eligible.

There is no new API, expression, configuration, or query migration. Result semantics remain unchanged for malformed input, duplicate keys, nulls, non-object intermediate values, and rendering failures. With the flag disabled, existing analyzed and optimized plans remain unchanged. Released Spark versions are unaffected.

### How was this patch tested?

The following compilation, suites, and style checks passed on JDK 17:

```bash
build/sbt "catalyst/compile" "catalyst/Test/compile" "sql/Test/compile"
build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "hive/Test/testOnly org.apache.spark.sql.configaudit.SparkConfigBindingPolicySuite"
build/sbt "catalyst/scalastyle" "catalyst/Test/scalastyle" "sql/Test/scalastyle"
git diff --check
```

The complete `OptimizeJsonExprsSuite` passed 24 tests, the complete `JsonFunctionsSuite` passed 106 tests, and `SparkConfigBindingPolicySuite` passed 3 tests. The coverage includes nested sharing, both prefix-conflict directions, one-pass grouping of parallel prefix chains, a 2,000-path projection, unsupported paths, the depth limit, default-off plan equivalence, duplicate keys, nulls, malformed and single-quoted JSON, non-object intermediate values, rendering failures, and whole-stage code generation.

The microbenchmark and its committed result files were regenerated with Spark's `Run benchmarks` GitHub Actions workflow for [JDK 17](https://github.com/sunchao/spark/actions/runs/28035861052), [JDK 21](https://github.com/sunchao/spark/actions/runs/28035861792), and [JDK 25](https://github.com/sunchao/spark/actions/runs/28035859413). All three runs passed and committed their JDK-specific results to this branch.

Best JDK 17 times for 200,000 cached rows with 32 fields under a nested `payload` object:

| Nested paths extracted | Shared parsing off | Shared parsing on | Relative speedup |
| ---: | ---: | ---: | ---: |
| 2 | 1,498 ms | 894 ms | 1.7x |
| 4 | 2,969 ms | 1,129 ms | 2.6x |
| 8 | 5,841 ms | 1,378 ms | 4.2x |
| 16 | 11,856 ms | 1,926 ms | 6.2x |

These are GitHub-hosted expression-scaling measurements, not end-to-end production-job results. The complete output, including the JDK 21 and JDK 25 runs, is recorded in the three `SharedJsonParseBenchmark` result files.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

Closes #56685 from sunchao/codex/SPARK-57626-shared-nested-json-paths.

Lead-authored-by: Chao Sun <chao@openai.com>
Co-authored-by: sunchao <sunchao@users.noreply.github.com>
Signed-off-by: Chao Sun <chao@openai.com>
(cherry picked from commit f111d80)
Signed-off-by: Chao Sun <chao@openai.com>
@sunchao

sunchao commented Jun 23, 2026

Copy link
Copy Markdown
Member Author

Thanks all!!! Merged to master / branch-4.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants