Commit 5dacfe3
[SPARK-56654][SQL] Reject unpaired UTF-16 surrogates in Variant JSON parsing
Jackson's permissive surrogate handling let lone surrogates from \uXXXX escapes pass through `parse_json`, `try_parse_json`, and `from_json('variant')`, where `getBytes(UTF_8)` then silently substituted U+FFFD and corrupted the Variant. Validate the decoded strings before they enter the dictionary or write buffer, gated by a new internal SQL conf (default-on) for opt-out compatibility.
### What changes were proposed in this pull request?
This PR adds strict Unicode validation to the Variant JSON parser so it rejects strings containing unpaired UTF-16 surrogate code units (e.g. a lone `\uD835` high surrogate). The check runs inside `VariantBuilder.buildJson` for both JSON object keys and string values, before either is encoded to UTF-8 and committed to the Variant binary buffer.
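The check described above can be sketched in plain Java. This is a minimal illustration, not the code added to `VariantBuilder`: the class name `SurrogateCheck` and the method `firstUnpairedSurrogate` are hypothetical, and only the standard `Character.isHighSurrogate` / `Character.isLowSurrogate` calls are assumed.

```java
// Hypothetical helper for illustration; not the actual VariantBuilder code.
final class SurrogateCheck {
    /** Returns the index of the first unpaired UTF-16 surrogate in s, or -1 if none. */
    static int firstUnpairedSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
                    i++; // valid pair: skip its low half
                } else {
                    return i; // high surrogate with no low surrogate after it
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no high surrogate before it
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("ok \uD835\uDC00")); // -1 (valid pair)
        System.out.println(firstUnpairedSurrogate("bad \uD835 x"));    // 4  (lone high)
        System.out.println(firstUnpairedSurrogate("\uDC00"));          // 0  (lone low)
    }
}
```

Running the same scan on both object keys and string values, before any UTF-8 encoding happens, is what keeps a bad string out of the dictionary and the write buffer.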
The validation is gated by a new internal SQL conf `spark.sql.variant.validateUnicodeInJsonParsing`, defaulting to `true` so the strict, RFC 8259-compliant behavior is enabled by default. Setting the conf to `false` restores the legacy permissive behavior as a transitional escape hatch for pipelines that currently depend on it.
The fix applies to all three Variant-parsing entry points:
- `parse_json` — throws `MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION` on lone surrogates.
- `try_parse_json` — returns `NULL`.
- `from_json` — returns `NULL` in `PERMISSIVE` mode (default), throws in `FAILFAST`.
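The difference between the entry points comes down to how each one reacts to the same validation failure. The following is a plain-Java model of that pattern, not Spark code: `EntryPointSketch`, `parseStrict`, `tryParse`, and the `validateUnicode` flag (standing in for the SQL conf) are all hypothetical names, and `IllegalArgumentException` stands in for Spark's real error class.

```java
// Plain-Java model of the behavior described above; all names are hypothetical.
final class EntryPointSketch {
    /** True if s contains a surrogate code unit that is not part of a valid pair. */
    static boolean hasUnpairedSurrogate(String s) {
        // String.codePoints() merges valid pairs into one supplementary code point,
        // so any value still inside the surrogate range must be unpaired.
        return s.codePoints().anyMatch(
            cp -> cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE);
    }

    /** Models parse_json (and from_json in FAILFAST): throws on invalid input. */
    static String parseStrict(String s, boolean validateUnicode) {
        if (validateUnicode && hasUnpairedSurrogate(s)) {
            throw new IllegalArgumentException("MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION");
        }
        return s;
    }

    /** Models try_parse_json (and from_json in PERMISSIVE): null on invalid input. */
    static String tryParse(String s, boolean validateUnicode) {
        try {
            return parseStrict(s, validateUnicode);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String lone = "\uD835";
        System.out.println(tryParse(lone, true));          // null
        System.out.println(tryParse(lone, false) != null); // true (legacy mode accepts)
    }
}
```

With the flag off, both functions accept the input unchanged, which mirrors the legacy permissive path.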
### Why are the changes needed?
1. JSON containing a lone surrogate (e.g. `"\uD835"` not followed by a low surrogate) is invalid.
2. Strict parsers such as simdjson reject these inputs; Jackson's `ReaderBasedJsonParser`, which Spark uses on the JVM, accepts them and decodes the escape into a Java `char` containing the lone surrogate.
3. When that string is later encoded with `getBytes(UTF_8)`, the lone surrogate is replaced, so the Variant ends up containing `?` in place of the original character, with no error or warning. This is a silent data-corruption bug.
4. As a result, records containing `\uD835` were silently accepted with substituted characters on the JVM, but correctly rejected by Photon.
5. This PR closes that JVM ↔ Photon divergence at its root.
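The silent substitution in the list above can be reproduced with the JDK alone. The sketch below is a standalone demonstration, not part of the patch: `LoneSurrogateLoss` and `roundTrip` are hypothetical names, and the round trip stands in for writing a string into a UTF-8 buffer and reading it back.

```java
import java.nio.charset.StandardCharsets;

// Standalone demonstration; class and method names are hypothetical.
final class LoneSurrogateLoss {
    /** Encodes s to UTF-8 and decodes it back, as a write-then-read cycle would. */
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String pair = "\uD835\uDC00"; // valid surrogate pair (U+1D400)
        String lone = "\uD835";       // high surrogate with nothing after it

        System.out.println(roundTrip(pair).equals(pair)); // true
        System.out.println(roundTrip(lone).equals(lone)); // false: content was replaced
    }
}
```

`String.getBytes` never throws on malformed input. It substitutes a replacement instead, which is why the corruption went unnoticed until the decoded string was validated up front.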
### Does this PR introduce _any_ user-facing change?
Yes. With the default `spark.sql.variant.validateUnicodeInJsonParsing = true`, JSON input containing an unpaired UTF-16 surrogate (e.g. a lone `\uD835`) is now rejected instead of being silently accepted. Specifically:
- `parse_json` throws `MALFORMED_RECORD_IN_PARSING.WITHOUT_SUGGESTION`.
- `try_parse_json` returns `NULL`.
- `from_json(col, 'variant')` returns `NULL` in `PERMISSIVE` mode (default) and throws in `FAILFAST`.
Previously, the lone surrogate was decoded into a Java `char`, then silently substituted with the Unicode replacement character during UTF-8 encoding, producing a Variant value containing `?` with no error or warning. Setting `spark.sql.variant.validateUnicodeInJsonParsing = false` restores the previous permissive behavior as a transitional opt-out.
### How was this patch tested?
Added new test cases in two suites:
- `VariantExpressionEvalUtilsSuite`: unit tests for both the reject and legacy-mode paths, covering lone high/low surrogates as values and as object keys, plus valid surrogate pairs as a control.
- `VariantEndToEndSuite`: an end-to-end SQL test exercising `parse_json` / `try_parse_json` / `from_json` in both `PERMISSIVE` and `FAILFAST` modes, with the conf flipped on and off.
### Was this patch authored or co-authored using generative AI tooling?
Co-authored by: 'claude-opus-4.7'
Closes #55661 from NJAHNAVI2907/SPARK-56654-strict-unicode-json.
Lead-authored-by: Jahnavi Nelavelli <75218211+NJAHNAVI2907@users.noreply.github.com>
Co-authored-by: NJAHNAVI2907 <jahnavinelavelli29@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0594b12)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
1 parent 7ee145d commit 5dacfe3
15 files changed
Lines changed: 180 additions & 18 deletions
File tree
- common/variant/src/main/java/org/apache/spark/types/variant
- sql
- catalyst/src
- main/scala/org/apache/spark/sql
- catalyst
- expressions/variant
- json
- internal
- test/scala/org/apache/spark/sql/catalyst/expressions/variant
- connect/common/src/test/resources/query-tests/explain-results
- core/src/test/scala/org/apache/spark/sql
Lines changed: 69 additions & 5 deletions
Lines changed: 4 additions & 2 deletions
Lines changed: 3 additions & 2 deletions
Lines changed: 4 additions & 1 deletion
Lines changed: 13 additions & 0 deletions
Lines changed: 46 additions & 0 deletions
Lines changed: 1 addition & 1 deletion
sql/connect/common/src/test/resources/query-tests/explain-results/function_is_variant_null.explain
Lines changed: 1 addition & 1 deletion
Lines changed: 1 addition & 1 deletion
Lines changed: 1 addition & 1 deletion
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
1 | | - | |
| 1 | + | |
2 | 2 | | |
0 commit comments