[GEODE-10463] Fix lexical nondeterminism warning in OQL grammar between ALL_UNICODE and DIGIT rules (#7928)

JinwooHwang · web-flow · commit dbdec41174b1 · 2025-09-29T05:08:00.000-04:00
* GEODE-10463: Fix lexical nondeterminism warning in OQL grammar between ALL_UNICODE and DIGIT rules

Refactored the ALL_UNICODE rule to exclude the Unicode digit ranges that
overlap with the DIGIT rule, eliminating the lexical ambiguity in
RegionNameCharacter. The single ALL_UNICODE range is now split into 15
non-overlapping segments that skip the 14 non-ASCII digit ranges
(Arabic-Indic, Devanagari, Bengali, and the other scripts listed in DIGIT).

This makes tokenization deterministic: digits listed in the DIGIT rule are
always matched by DIGIT, and every other character from \u0061 through
\ufffd is matched by ALL_UNICODE.

* GEODE-10463: Add clarifying comment for ALL_UNICODE lexer rule

Add a documentation comment explaining that the ALL_UNICODE character
class excludes Unicode digit ranges to prevent lexical nondeterminism
with the DIGIT rule in the OQL grammar lexer.
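
The overlap that the commit message describes can be checked outside the lexer by treating each rule as a list of inclusive code point ranges. The following is a minimal sketch, not code from the Geode repository: the class `OqlCharRangeCheck` and its methods are hypothetical, and the non-ASCII DIGIT ranges are an assumption inferred from the gaps between the 15 ALL_UNICODE segments in the diff, since the diff shows only part of the DIGIT rule.

```java
/** Sketch only: compares lexer character ranges as plain code point intervals. */
final class OqlCharRangeCheck {

  // ALL_UNICODE before the change: a single range.
  static final int[][] OLD_ALL_UNICODE = {{0x0061, 0xfffd}};

  // ALL_UNICODE after the change: the 15 segments copied from the diff.
  static final int[][] NEW_ALL_UNICODE = {
      {0x0061, 0x065f}, {0x066a, 0x06ef}, {0x06fa, 0x0965}, {0x0970, 0x09e5},
      {0x09f0, 0x0a65}, {0x0a70, 0x0ae5}, {0x0af0, 0x0b65}, {0x0b70, 0x0be6},
      {0x0bf0, 0x0c65}, {0x0c70, 0x0ce5}, {0x0cf0, 0x0d65}, {0x0d70, 0x0e4f},
      {0x0e5a, 0x0ecf}, {0x0eda, 0x103f}, {0x104a, 0xfffd}};

  // ASSUMPTION: non-ASCII DIGIT ranges, inferred from the gaps between the segments above.
  static final int[][] DIGIT_NON_ASCII = {
      {0x0660, 0x0669}, {0x06f0, 0x06f9}, {0x0966, 0x096f}, {0x09e6, 0x09ef},
      {0x0a66, 0x0a6f}, {0x0ae6, 0x0aef}, {0x0b66, 0x0b6f}, {0x0be7, 0x0bef},
      {0x0c66, 0x0c6f}, {0x0ce6, 0x0cef}, {0x0d66, 0x0d6f}, {0x0e50, 0x0e59},
      {0x0ed0, 0x0ed9}, {0x1040, 0x1049}};

  /** Returns true if cp lies inside any inclusive [lo, hi] range. */
  static boolean matches(int[][] ranges, int cp) {
    for (int[] r : ranges) {
      if (cp >= r[0] && cp <= r[1]) {
        return true;
      }
    }
    return false;
  }

  /** Counts BMP code points that both rules could match, i.e. the ambiguous inputs. */
  static int countOverlap(int[][] a, int[][] b) {
    int n = 0;
    for (int cp = 0; cp <= 0xffff; cp++) {
      if (matches(a, cp) && matches(b, cp)) {
        n++;
      }
    }
    return n;
  }

  public static void main(String[] args) {
    System.out.println(countOverlap(OLD_ALL_UNICODE, DIGIT_NON_ASCII)); // prints 139
    System.out.println(countOverlap(NEW_ALL_UNICODE, DIGIT_NON_ASCII)); // prints 0
  }
}
```

Under these assumptions the old single range shares 139 code points with DIGIT (thirteen ranges of ten plus the nine-character Tamil range), and the split version shares none. U+0BE6 stays in ALL_UNICODE because, as the diff comment notes, the Tamil range in DIGIT starts at 0be7.
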
diff --git a/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g b/geode-core/src/main/antlr/org/apache/geode/cache/query/internal/parse/oql.g
@@ -133,8 +133,23 @@ DIGIT : ('\u0030'..'\u0039' |
        '\u1040'..'\u1049')
     ;
 
+// Exclude Unicode digit ranges to prevent lexical nondeterminism with DIGIT rule
 protected
-ALL_UNICODE : ('\u0061'..'\ufffd')	
+ALL_UNICODE : ('\u0061'..'\u065f' |   // exclude Arabic-Indic digits
+               '\u066a'..'\u06ef' |   // exclude Extended Arabic-Indic digits  
+               '\u06fa'..'\u0965' |   // exclude Devanagari digits
+               '\u0970'..'\u09e5' |   // exclude Bengali digits
+               '\u09f0'..'\u0a65' |   // exclude Gurmukhi digits
+               '\u0a70'..'\u0ae5' |   // exclude Gujarati digits
+               '\u0af0'..'\u0b65' |   // exclude Oriya digits
+               '\u0b70'..'\u0be6' |   // exclude Tamil digits (note: Tamil starts at 0be7)
+               '\u0bf0'..'\u0c65' |   // exclude Telugu digits
+               '\u0c70'..'\u0ce5' |   // exclude Kannada digits
+               '\u0cf0'..'\u0d65' |   // exclude Malayalam digits
+               '\u0d70'..'\u0e4f' |   // exclude Thai digits
+               '\u0e5a'..'\u0ecf' |   // exclude Lao digits
+               '\u0eda'..'\u103f' |   // exclude Myanmar digits
+               '\u104a'..'\ufffd')    // rest of Unicode
     ;
 
 /*