fix: incorrect mask with additionalProperties:true and non-required properties#595
fix: incorrect mask with additionalProperties:true and non-required properties#595ianliuy wants to merge 2 commits into
Conversation
Generate a trie-based EBNF key pattern that excludes defined property names from the additional/unevaluated property matching. This prevents additional properties from matching defined property names, which could bypass type constraints. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Yiyang Liu <yiyangliu+microsoft@microsoft.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a trie-based mechanism to exclude defined property names from the 'additional properties' pattern, ensuring that defined properties adhere to their type constraints even when additional properties are allowed. Review feedback identified critical compilation errors due to illegal access of protected class members from an anonymous namespace and suggested escaping non-printable characters in the generated EBNF. Furthermore, a significant logic issue was raised regarding the potential bypass of property exclusion through JSON escape sequences in keys, which the current implementation does not handle.
| std::string BuildTrieBody(const TrieNode& node) { | ||
| std::string result; | ||
| bool first = true; | ||
| auto add = [&](const std::string& s) { | ||
| if (!first) result += " | "; | ||
| first = false; | ||
| result += s; | ||
| }; | ||
|
|
||
| // 1. Close quote - only if no excluded key ends here | ||
| if (!node.is_terminal) { | ||
| add("\"\\\"\""); | ||
| } | ||
|
|
||
| // 2. Negated char class - excludes edge chars + JSON specials | ||
| std::string neg = "[^"; | ||
| for (const auto& [c, _] : node.children) { | ||
| if (c == ']' || c == '\\' || c == '^' || c == '-') { | ||
| neg += "\\"; | ||
| } | ||
| neg += c; | ||
| } | ||
| neg += "\\0-\\x1f\\\"\\\\\\r\\n]"; | ||
| add(neg + " " + JSONSchemaConverter::kBasicStringSub); | ||
|
|
||
| // 3. Escape sequence | ||
| add("\"\\\\\" " + JSONSchemaConverter::kBasicEscape + " " + | ||
| JSONSchemaConverter::kBasicStringSub); | ||
|
|
||
| // 4. Trie edges - recurse | ||
| for (const auto& [c, child] : node.children) { | ||
| std::string child_body = BuildTrieBody(child); | ||
| std::string char_lit = "\""; | ||
| if (c == '"') { | ||
| char_lit += "\\\""; | ||
| } else if (c == '\\') { | ||
| char_lit += "\\\\"; | ||
| } else { | ||
| char_lit += c; | ||
| } | ||
| char_lit += "\""; | ||
| add(char_lit + " " + child_body); | ||
| } | ||
|
|
||
| return "(" + result + ")"; | ||
| } |
There was a problem hiding this comment.
The BuildTrieBody function in the anonymous namespace cannot access the protected static members kBasicStringSub and kBasicEscape of JSONSchemaConverter. This will cause a compilation error. Additionally, non-printable characters in property names (like \n, \r, \t) should be escaped when building EBNF literals to avoid breaking the generated grammar script.
std::string BuildTrieBody(
const TrieNode& node,
const std::string& basic_string_sub,
const std::string& basic_escape
) {
std::string result;
bool first = true;
auto add = [&](const std::string& s) {
if (!first) result += " | ";
first = false;
result += s;
};
// 1. Close quote - only if no excluded key ends here
if (!node.is_terminal) {
add("\"\\\"\"");
}
// 2. Negated char class - excludes edge chars + JSON specials
std::string neg = "[^";
for (const auto& [c, _] : node.children) {
if (c == ']' || c == '\\' || c == '^' || c == '-') {
neg += "\\";
}
neg += c;
}
neg += "\\0-\\x1f\\\"\\\\\\r\\n]";
add(neg + " " + basic_string_sub);
// 3. Escape sequence
add("\"\\\\\" " + basic_escape + " " + basic_string_sub);
// 4. Trie edges - recurse
for (const auto& [c, child] : node.children) {
std::string child_body = BuildTrieBody(child, basic_string_sub, basic_escape);
std::string char_lit = "\"";
if (c == '"') {
char_lit += "\\\"";
} else if (c == '\\') {
char_lit += "\\\\";
} else if (c == '\n') {
char_lit += "\\n";
} else if (c == '\r') {
char_lit += "\\r";
} else if (c == '\t') {
char_lit += "\\t";
} else {
char_lit += c;
}
char_lit += "\"";
add(char_lit + " " + child_body);
}
return "(" + result + ")";
}| } | ||
|
|
||
| // Generate EBNF body | ||
| std::string inner = BuildTrieBody(root); |
| std::string JSONSchemaConverter::GetKeyPatternExcluding( | ||
| const std::vector<ObjectSpec::Property>& properties, | ||
| const std::string& rule_name | ||
| ) { | ||
| if (properties.empty()) { | ||
| return GetKeyPattern(); | ||
| } | ||
|
|
||
| // Build trie from property names | ||
| TrieNode root; | ||
| for (const auto& prop : properties) { | ||
| TrieNode* cur = &root; | ||
| for (char c : prop.name) { | ||
| cur = &cur->children[c]; | ||
| } | ||
| cur->is_terminal = true; | ||
| } | ||
|
|
||
| // Generate EBNF body | ||
| std::string inner = BuildTrieBody(root); | ||
| std::string ws = GetWhitespacePattern(); | ||
| std::string body = "[\"] (" + inner + ") (= " + ws + " [,}\\]:])"; | ||
|
|
||
| return ebnf_script_creator_.AddRule(rule_name + "_addl_key", body); | ||
| } |
There was a problem hiding this comment.
The trie-based exclusion logic matches literal characters from the property names. However, JSON allows keys to be represented using escape sequences (e.g., "\u0061" instead of "a"). The current implementation transitions to the catch-all kBasicStringSub rule as soon as a backslash is encountered (Path 3 in BuildTrieBody), which means that any excluded property name can be bypassed by using escape sequences in the JSON input. This would re-introduce the original issue where the wrong type is accepted for a defined property.
There was a problem hiding this comment.
Pull request overview
Fixes an xgrammar JSON-schema conversion bug where additionalProperties: true (and unevaluatedProperties: true) could bypass type enforcement for defined but non-required properties by introducing an “additional key” pattern that excludes known property names.
Changes:
- Add
GetKeyPatternExcluding()to generate a trie-based key pattern that matches any JSON string key except the schema’s declared property names. - Use the excluding key pattern for additional/unevaluated property alternatives inside
GetPartialRuleForProperties(). - Add a Python regression test covering type enforcement with
additionalProperties: trueand optional typed properties.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/python/test_json_schema_converter.py | Adds regression coverage for #208 around optional typed properties with additionalProperties: true. |
| cpp/json_schema_converter.h | Declares a new helper for generating an excluding key pattern. |
| cpp/json_schema_converter.cc | Implements trie-based excluding key rule generation and swaps it into additional-property handling. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| /*! \brief Get a key pattern that excludes specific property names. */ | ||
| std::string GetKeyPatternExcluding( |
| continue | ||
| if num_properties < lower_bound or num_properties > upper_bound: | ||
| is_accepted = False | ||
| else: | ||
| is_accepted = ( | ||
| all(prop in instance for prop in required_properties) | ||
| and not have_additional | ||
| ) | ||
| check_schema_with_instance( | ||
| schema, instance, is_accepted=is_accepted, any_whitespace=False | ||
| ) | ||
|
|
||
| # Case 2.2. additional properties allowed | ||
| for required_properties in required_properties_instance: | ||
| schema = { | ||
| **base_schema, | ||
| "required": required_properties, | ||
| "additionalProperties": {"type": "string"}, | ||
| } | ||
| for instance, num_properties, have_additional in instances: | ||
| is_accepted = all(prop in instance for prop in required_properties) | ||
| check_schema_with_instance( | ||
| schema, instance, is_accepted=is_accepted, any_whitespace=False | ||
| ) | ||
|
|
||
| for lower_bound in range(8): | ||
| schema = { | ||
| **base_schema, | ||
| "minProperties": lower_bound, | ||
| "required": required_properties, | ||
| "additionalProperties": {"type": "string"}, | ||
| } | ||
| for instance, num_properties, have_additional in instances: | ||
| if lower_bound > len(base_schema["properties"]): | ||
| continue | ||
| if num_properties < lower_bound: | ||
| is_accepted = False | ||
| else: | ||
| is_accepted = all(prop in instance for prop in required_properties) | ||
| check_schema_with_instance( | ||
| schema, instance, is_accepted=is_accepted, any_whitespace=False | ||
| ) | ||
|
|
||
| for upper_bound in range(lower_bound, 8): | ||
| schema = { | ||
| **base_schema, | ||
| "minProperties": lower_bound, | ||
| "maxProperties": upper_bound, | ||
| "required": required_properties, | ||
| "additionalProperties": {"type": "string"}, | ||
| } | ||
| for instance, num_properties, have_additional in instances: | ||
| if lower_bound > len(base_schema["properties"]) or upper_bound < len( | ||
| required_properties | ||
| ): | ||
| if lower_bound > len( | ||
| base_schema["properties"] | ||
| ) or upper_bound < len(required_properties): | ||
| continue | ||
| if num_properties < lower_bound or num_properties > upper_bound: | ||
| is_accepted = False | ||
| else: | ||
| is_accepted = all(prop in instance for prop in required_properties) | ||
| is_accepted = all( | ||
| prop in instance for prop in required_properties | ||
| ) | ||
| check_schema_with_instance( | ||
| schema, instance, is_accepted=is_accepted, any_whitespace=False | ||
| ) |
What's broken?
When a JSON schema has
additionalProperties: truewith properties that are not in therequiredarray, xgrammar accepts values with the wrong type for those defined properties. For example,{"a": "wrong"}is accepted even though"a"is defined asinteger.Who is affected?
Anyone using
additionalProperties: true(orunevaluatedProperties: true) with non-required typed properties. This includes the common pattern of optional typed fields with free-form extra data. Schemas with all properties required oradditionalProperties: falseare NOT affected.When does it trigger?
Reproducible 100% of the time with:
{ "type": "object", "properties": { "a": {"type": "integer"} }, "additionalProperties": true, "required": [] }Input
{"a": "wrong"}is incorrectly accepted.Root Cause
GetKeyPattern()returnsbasic_string, which matches any JSON string key -- including defined property names like"a". When the grammar addsadditional_prop_patternas an alternative, the parser tracks both paths simultaneously:"a" : integer_rule-- correctly constrains value typebasic_string : any_rule-- accepts any value typeThe token bitmask is the union of valid tokens across all active paths, so the catch-all's
any_ruleeffectively overrides the typed constraint.Fix
Replace
GetKeyPattern()withGetKeyPatternExcluding(properties, rule_name)at 3 call sites inGetPartialRuleForProperties(). The new method generates a trie-based EBNF key pattern that matches any JSON string key except defined property names.For excluded key
"a", the generated pattern:"","b","ab"(not the excluded key)"a"(must use typed rule instead)Testing
Added
test_additional_properties_type_enforcementwith 7 test cases:Fixes #208
cc @Ubospica @Seven-Streams