Commit 6c474df

h3n4l and claude authored
feat(cosmosdb): improve SQL parser compatibility for all 13 query feature areas (#62)
* feat(cosmosdb): rewrite grammar with unified scalar_expression and new clauses

  Add missing lexer tokens: !=, IN, BETWEEN, TOP, VALUE, ORDER, BY, GROUP, OFFSET, LIMIT, ASC, DESC, EXISTS, LIKE, HAVING, JOIN.
  Fix IDENTIFIER to allow leading underscore (for _ts, _etag, etc.).
  Merge scalar_expression and scalar_expression_in_where into a single unified scalar_expression rule.
  Add TOP, VALUE, ORDER BY, GROUP BY, OFFSET LIMIT, HAVING, JOIN, IN, BETWEEN, LIKE, EXISTS, NOT, and subquery support.
  Fix object_constant_field_pair to use COLON_SYMBOL.

  Resolves all 13 failing query feature areas from BYT-9043.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test(cosmosdb): add test SQL examples for all 13 feature areas

  Covers: SELECT TOP, WHERE operators (!=, IN, BETWEEN, _ts fields), functions in SELECT, ORDER BY, aggregation, GROUP BY, string/math/type-check functions, DISTINCT VALUE, VALUE keyword, OFFSET LIMIT, geospatial with JSON objects, and NOT EQUAL.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(cosmosdb): correct operator precedence and function name casing

  Split binary_operator into precedence-tiered rules (multiplicative, additive, shift, comparison) and reorder scalar_expression alternatives so AND/OR have lower precedence than comparison operators. Previously AND/OR bound tighter than =, >, <, which caused incorrect parse trees.
  Also fix STRINGTONUMBER to StringToNumber for consistency.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 39c76b4 commit 6c474df
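
The precedence fix described in the commit message can be illustrated outside ANTLR. The following is a minimal sketch, not the project's parser: a small precedence-climbing parser in Python whose two precedence tables are assumptions that approximate the new tier order and the old order (AND/OR listed ahead of one flat binary_operator alternative). The names `parse`, `tokenize`, `NEW_PRECEDENCE`, and `OLD_PRECEDENCE` are hypothetical and only a subset of the operators is covered.

```python
import re

# Binding strength per operator; a larger number binds tighter.
# NEW mirrors a subset of the tier order in the rewritten scalar_expression rule.
NEW_PRECEDENCE = {
    "OR": 1, "AND": 2,
    "=": 3, "!=": 3, "<": 3, "<=": 3, ">": 3, ">=": 3,
    "+": 4, "-": 4,
    "*": 5, "/": 5, "%": 5,
}
# OLD approximates the previous grammar (an assumption): AND/OR came before a
# single flat binary_operator alternative, so they bound tighter than the rest.
OLD_PRECEDENCE = {op: 1 for op in NEW_PRECEDENCE}
OLD_PRECEDENCE.update({"OR": 2, "AND": 3})

TOKEN_RE = re.compile(r"\s*(!=|<=|>=|[=<>+\-*/%()]|[A-Za-z_][A-Za-z_0-9.]*|\d+)")


def tokenize(text):
    """Split an expression into operator, name, and number tokens."""
    tokens, pos = [], 0
    text = text.strip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        word = match.group(1)
        tokens.append(word.upper() if word.upper() in ("AND", "OR") else word)
        pos = match.end()
    return tokens


def parse(text, precedence=NEW_PRECEDENCE):
    """Return a nested (operator, left, right) tuple for the expression."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def primary():
        token = advance()
        if token == "(":
            node = expression(1)
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        return token

    def expression(min_prec):
        left = primary()
        # Keep absorbing operators that bind at least as tightly as min_prec.
        while peek() in precedence and precedence[peek()] >= min_prec:
            op = advance()
            right = expression(precedence[op] + 1)  # +1 makes it left-associative
            left = (op, left, right)
        return left

    tree = expression(1)
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return tree


print(parse("c.a = 1 AND c.b = 2"))
# ('AND', ('=', 'c.a', '1'), ('=', 'c.b', '2'))
print(parse("c.a = 1 AND c.b = 2", OLD_PRECEDENCE))
# ('=', ('=', 'c.a', ('AND', '1', 'c.b')), '2')
```

With the old-style table the AND captures its neighbouring operands first, which is the kind of incorrect tree the commit message refers to.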

26 files changed: +4898 -2631 lines changed

cosmosdb/CosmosDBLexer.g4

Lines changed: 17 additions & 1 deletion
@@ -46,6 +46,21 @@ UDF_SYMBOL: 'UDF';
 WHERE_SYMBOL: 'WHERE';
 AND_SYMBOL: 'AND';
 OR_SYMBOL: 'OR';
+IN_SYMBOL: 'IN';
+BETWEEN_SYMBOL: 'BETWEEN';
+TOP_SYMBOL: 'TOP';
+VALUE_SYMBOL: 'VALUE';
+ORDER_SYMBOL: 'ORDER';
+BY_SYMBOL: 'BY';
+GROUP_SYMBOL: 'GROUP';
+OFFSET_SYMBOL: 'OFFSET';
+LIMIT_SYMBOL: 'LIMIT';
+ASC_SYMBOL: 'ASC';
+DESC_SYMBOL: 'DESC';
+EXISTS_SYMBOL: 'EXISTS';
+LIKE_SYMBOL: 'LIKE';
+HAVING_SYMBOL: 'HAVING';
+JOIN_SYMBOL: 'JOIN';

 AT_SYMBOL: '@';
 LC_BRACKET_SYMBOL: '{';
@@ -77,10 +92,11 @@ GREATER_THAN_EQUAL_OPERATOR: '>=';
 LEFT_SHIFT_OPERATOR: '<<';
 RIGHT_SHIFT_OPERATOR: '>>';
 ZERO_FILL_RIGHT_SHIFT_OPERATOR: '>>>';
+NOT_EQUAL_OPERATOR: '!=';


 /* Identifiers */
-IDENTIFIER: [a-z] [a-z_0-9]*;
+IDENTIFIER: [a-z_] [a-z_0-9]*;

 // White space handling
 WHITESPACE:
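
The effect of the IDENTIFIER change can be checked with equivalent regular expressions. This is a rough sketch in Python, not the generated lexer: the two patterns copy the character classes from the diff, and `re.IGNORECASE` reflects an assumption that the lexer matches case-insensitively (the diff does not show the lexer options). The helper name `is_identifier` is hypothetical.

```python
import re

# Character classes copied from the old and new IDENTIFIER rules.
# Assumption: the lexer is case-insensitive, hence re.IGNORECASE.
OLD_IDENTIFIER = re.compile(r"[a-z][a-z_0-9]*", re.IGNORECASE)
NEW_IDENTIFIER = re.compile(r"[a-z_][a-z_0-9]*", re.IGNORECASE)


def is_identifier(pattern, text):
    """True when the whole text matches the given IDENTIFIER pattern."""
    return pattern.fullmatch(text) is not None


for name in ("_ts", "_etag", "name", "9lives"):
    print(name, is_identifier(OLD_IDENTIFIER, name), is_identifier(NEW_IDENTIFIER, name))
```

System properties such as `_ts` and `_etag` fail the old pattern and pass the new one, while a name starting with a digit is rejected by both.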

cosmosdb/CosmosDBParser.g4

Lines changed: 122 additions & 82 deletions
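
The rewritten `select` rule in this file fixes the order of the optional clauses: select_clause, then from_clause, where_clause, group_by_clause, having_clause, order_by_clause, and offset_limit_clause, each at most once and all but the first optional. A minimal Python sketch of that ordering constraint follows; it works on clause labels rather than real tokens, and the names `CLAUSE_ORDER` and `clauses_in_grammar_order` are hypothetical.

```python
# Clause labels in the order the new `select` rule lists them.
CLAUSE_ORDER = [
    "SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "OFFSET LIMIT",
]


def clauses_in_grammar_order(clauses):
    """Check that clause labels follow the order of the `select` rule.

    SELECT is required and must come first; every other clause is optional
    and may appear at most once, in the listed order.
    """
    if not clauses or clauses[0] != "SELECT":
        return False
    if any(clause not in CLAUSE_ORDER for clause in clauses):
        return False
    positions = [CLAUSE_ORDER.index(clause) for clause in clauses]
    # Strictly increasing positions rule out both reordering and repetition.
    return all(a < b for a, b in zip(positions, positions[1:]))


print(clauses_in_grammar_order(["SELECT", "FROM", "WHERE", "ORDER BY"]))  # True
print(clauses_in_grammar_order(["SELECT", "ORDER BY", "WHERE"]))          # False
print(clauses_in_grammar_order(["SELECT"]))                               # True
```

The last call reflects that from_clause became optional in this commit, so a bare projection is accepted by the rule.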
@@ -6,137 +6,155 @@ options {

 root: select EOF;

-select: select_clause from_clause where_clause?;
+select:
+    select_clause from_clause? where_clause? group_by_clause? having_clause? order_by_clause?
+    offset_limit_clause?;

-select_clause: SELECT_SYMBOL select_specification;
+select_clause: SELECT_SYMBOL top_clause? select_specification;
+
+top_clause: TOP_SYMBOL DECIMAL;

 select_specification:
     MULTIPLY_OPERATOR
-    | DISTINCT_SYMBOL? object_property_list;
+    | DISTINCT_SYMBOL? VALUE_SYMBOL? object_property_list;

 from_clause: FROM_SYMBOL from_specification;

-where_clause: WHERE_SYMBOL scalar_expression_in_where;
+where_clause: WHERE_SYMBOL scalar_expression;
+
+group_by_clause:
+    GROUP_SYMBOL BY_SYMBOL scalar_expression (
+        COMMA_SYMBOL scalar_expression
+    )*;
+
+having_clause: HAVING_SYMBOL scalar_expression;
+
+order_by_clause:
+    ORDER_SYMBOL BY_SYMBOL sort_expression (
+        COMMA_SYMBOL sort_expression
+    )*;
+
+sort_expression: scalar_expression (ASC_SYMBOL | DESC_SYMBOL)?;
+
+offset_limit_clause: OFFSET_SYMBOL DECIMAL LIMIT_SYMBOL DECIMAL;

 from_specification: from_source;

-from_source: container_expression;
+from_source: container_expression (join_clause)*;

-container_expression: container_name (AS_SYMBOL? IDENTIFIER)?;
+container_expression: container_name (AS_SYMBOL? identifier)?;

-container_name: IDENTIFIER;
+join_clause:
+    JOIN_SYMBOL identifier IN_SYMBOL scalar_expression;
+
+container_name: identifier;

 object_property_list:
     object_property (COMMA_SYMBOL object_property)*;

-object_property: scalar_expression (AS_SYMBOL? property_alias)?;
+object_property:
+    scalar_expression (AS_SYMBOL? property_alias)?;

-property_alias: IDENTIFIER;
+property_alias: identifier;

-// scalar_expression: https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/scalar-expressions
+// Unified scalar_expression - used in both SELECT projections and WHERE clause.
+// https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/scalar-expressions
+// Alternatives are ordered from highest precedence (first) to lowest (last) per ANTLR4 semantics.
 scalar_expression:
-    input_alias
-    | scalar_expression DOT_SYMBOL property_name
-    | scalar_expression LS_BRACKET_SYMBOL (
-        (DOUBLE_QUOTE_STRING_LITERAL)
-        | (array_index)
-    ) RS_BRACKET_SYMBOL
-    | unary_operator scalar_expression;
-
-// TODO(zp): Merge scalar_expression and scalar_expression_in_where while supporting the project
-// fully. https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/scalar-expressions
-scalar_expression_in_where:
     constant
     | input_alias
     | parameter_name
-    | scalar_expression_in_where AND_SYMBOL scalar_expression_in_where
-    | scalar_expression_in_where OR_SYMBOL scalar_expression_in_where
-    | scalar_expression_in_where DOT_SYMBOL property_name
-    | scalar_expression_in_where LS_BRACKET_SYMBOL (
-        (DOUBLE_QUOTE_STRING_LITERAL)
-        | (array_index)
-    ) RS_BRACKET_SYMBOL
-    | unary_operator scalar_expression_in_where
-    | scalar_expression_in_where binary_operator scalar_expression_in_where
-    | scalar_expression_in_where QUESTION_MARK_SYMBOL scalar_expression_in_where COLON_SYMBOL
-        scalar_expression_in_where
     | scalar_function_expression
     | create_object_expression
     | create_array_expression
-    | LR_BRACKET_SYMBOL scalar_expression_in_where RR_BRACKET_SYMBOL;
-
-create_array_expression: array_constant;
+    | LR_BRACKET_SYMBOL scalar_expression RR_BRACKET_SYMBOL
+    | LR_BRACKET_SYMBOL select RR_BRACKET_SYMBOL
+    | EXISTS_SYMBOL LR_BRACKET_SYMBOL select RR_BRACKET_SYMBOL
+    | scalar_expression DOT_SYMBOL property_name
+    | scalar_expression LS_BRACKET_SYMBOL (
+        DOUBLE_QUOTE_STRING_LITERAL
+        | SINGLE_QUOTE_STRING_LITERAL
+        | array_index
+    ) RS_BRACKET_SYMBOL
+    | unary_operator scalar_expression
+    | NOT_SYMBOL scalar_expression
+    | scalar_expression multiplicative_operator scalar_expression
+    | scalar_expression additive_operator scalar_expression
+    | scalar_expression shift_operator scalar_expression
+    | scalar_expression BIT_AND_SYMBOL scalar_expression
+    | scalar_expression BIT_XOR_SYMBOL scalar_expression
+    | scalar_expression BIT_OR_SYMBOL scalar_expression
+    | scalar_expression DOUBLE_BAR_SYMBOL scalar_expression
+    | scalar_expression comparison_operator scalar_expression
+    | scalar_expression NOT_SYMBOL? IN_SYMBOL LR_BRACKET_SYMBOL (
+        scalar_expression (COMMA_SYMBOL scalar_expression)*
+    )? RR_BRACKET_SYMBOL
+    | scalar_expression NOT_SYMBOL? BETWEEN_SYMBOL scalar_expression AND_SYMBOL scalar_expression
+    | scalar_expression NOT_SYMBOL? LIKE_SYMBOL scalar_expression
+    | scalar_expression AND_SYMBOL scalar_expression
+    | scalar_expression OR_SYMBOL scalar_expression
+    | scalar_expression QUESTION_MARK_SYMBOL scalar_expression COLON_SYMBOL scalar_expression;
+
+create_array_expression:
+    LS_BRACKET_SYMBOL (
+        scalar_expression (COMMA_SYMBOL scalar_expression)*
+    )? RS_BRACKET_SYMBOL;
+
+create_object_expression:
+    LC_BRACKET_SYMBOL (
+        object_field_pair (COMMA_SYMBOL object_field_pair)*
+    )? RC_BRACKET_SYMBOL;

-create_object_expression: object_constant;
+object_field_pair:
+    (string_literal | property_name) COLON_SYMBOL scalar_expression;

 scalar_function_expression:
     udf_scalar_function_expression
     | builtin_function_expression;

 udf_scalar_function_expression:
-    UDF_SYMBOL DOT_SYMBOL IDENTIFIER LR_BRACKET_SYMBOL (
-        scalar_expression_in_where (
-            COMMA_SYMBOL scalar_expression_in_where
-        )*
-    ) RR_BRACKET_SYMBOL;
+    UDF_SYMBOL DOT_SYMBOL identifier LR_BRACKET_SYMBOL (
+        scalar_expression (COMMA_SYMBOL scalar_expression)*
+    )? RR_BRACKET_SYMBOL;

 builtin_function_expression:
-    IDENTIFIER LR_BRACKET_SYMBOL (
-        scalar_expression_in_where (
-            COMMA_SYMBOL scalar_expression_in_where
+    identifier LR_BRACKET_SYMBOL (
+        (MULTIPLY_OPERATOR | scalar_expression) (
+            COMMA_SYMBOL scalar_expression
         )*
-    ) RR_BRACKET_SYMBOL;
+    )? RR_BRACKET_SYMBOL;

-binary_operator:
+multiplicative_operator:
     MULTIPLY_OPERATOR
     | DIVIDE_SYMBOL
-    | MODULO_SYMBOL
-    | PLUS_SYMBOL
-    | MINUS_SYMBOL
-    | BIT_AND_SYMBOL
-    | BIT_XOR_SYMBOL
-    | BIT_OR_SYMBOL
-    | DOUBLE_BAR_SYMBOL
-    | EQUAL_SYMBOL
+    | MODULO_SYMBOL;
+
+additive_operator: PLUS_SYMBOL | MINUS_SYMBOL;
+
+shift_operator:
+    LEFT_SHIFT_OPERATOR
+    | RIGHT_SHIFT_OPERATOR
+    | ZERO_FILL_RIGHT_SHIFT_OPERATOR;
+
+comparison_operator:
+    EQUAL_SYMBOL
+    | NOT_EQUAL_OPERATOR
     | LESS_THAN_OPERATOR
     | LESS_THAN_EQUAL_OPERATOR
     | GREATER_THAN_OPERATOR
-    | GREATER_THAN_EQUAL_OPERATOR
-    | LEFT_SHIFT_OPERATOR
-    | RIGHT_SHIFT_OPERATOR
-    | ZERO_FILL_RIGHT_SHIFT_OPERATOR
-    ;
+    | GREATER_THAN_EQUAL_OPERATOR;

 unary_operator: BIT_NOT_SYMBOL | PLUS_SYMBOL | MINUS_SYMBOL;

-parameter_name: AT_SYMBOL IDENTIFIER;
+parameter_name: AT_SYMBOL identifier;

 // https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/query/constants
 constant:
     undefined_constant
     | null_constant
     | boolean_constant
     | number_constant
-    | string_constant
-    | array_constant
-    | object_constant;
-
-object_constant:
-    LC_BRACKET_SYMBOL (
-        object_constant_field_pair (
-            COMMA_SYMBOL object_constant_field_pair
-        )*
-    ) RC_BRACKET_SYMBOL;
-
-object_constant_field_pair: (
-        property_name
-        | (DOUBLE_QUOTE_SYMBOL property_name DOUBLE_QUOTE_SYMBOL)
-    ) COMMA_SYMBOL constant;
-
-array_constant:
-    LS_BRACKET_SYMBOL (constant (COMMA_SYMBOL constant)*)? RS_BRACKET_SYMBOL;
-
-string_constant: string_literal;
+    | string_constant;

 undefined_constant: UNDEFINED_SYMBOL;

@@ -146,6 +164,8 @@ boolean_constant: TRUE_SYMBOL | FALSE_SYMBOL;

 number_constant: decimal_literal | hexadecimal_literal;

+string_constant: string_literal;
+
 string_literal:
     SINGLE_QUOTE_STRING_LITERAL
     | DOUBLE_QUOTE_STRING_LITERAL;
@@ -154,8 +174,28 @@ decimal_literal: DECIMAL | REAL | FLOAT;

 hexadecimal_literal: HEXADECIMAL;

-property_name: IDENTIFIER;
+// Allow keywords to be used as identifiers (property names, aliases, etc.)
+// This is necessary because CosmosDB allows keywords as property names.
+identifier:
+    IDENTIFIER
+    | IN_SYMBOL
+    | BETWEEN_SYMBOL
+    | TOP_SYMBOL
+    | VALUE_SYMBOL
+    | ORDER_SYMBOL
+    | BY_SYMBOL
+    | GROUP_SYMBOL
+    | OFFSET_SYMBOL
+    | LIMIT_SYMBOL
+    | ASC_SYMBOL
+    | DESC_SYMBOL
+    | EXISTS_SYMBOL
+    | LIKE_SYMBOL
+    | HAVING_SYMBOL
+    | JOIN_SYMBOL;
+
+property_name: identifier;

 array_index: DECIMAL;

-input_alias: IDENTIFIER;
+input_alias: identifier;
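
The new `identifier` parser rule exists because the lexer now turns words such as `value` or `order` into keyword tokens, which the bare IDENTIFIER token no longer covers. The sketch below models that split in Python under stated assumptions: the soft-keyword set is copied from the rule above, the hard-keyword set is limited to the four literals visible in the lexer excerpt (the real lexer defines more), matching is assumed to be case-insensitive, and the names `token_type` and `usable_as_identifier` are hypothetical. It does not validate the shape of non-keyword words.

```python
# Keywords that the new `identifier` parser rule also accepts (from the diff).
SOFT_KEYWORDS = frozenset({
    "IN", "BETWEEN", "TOP", "VALUE", "ORDER", "BY", "GROUP", "OFFSET",
    "LIMIT", "ASC", "DESC", "EXISTS", "LIKE", "HAVING", "JOIN",
})
# Assumption: a sample of keywords NOT listed in `identifier`; only the
# literals visible in the lexer excerpt are included here.
HARD_KEYWORDS = frozenset({"WHERE", "AND", "OR", "UDF"})

# Token types the `identifier` rule matches.
IDENTIFIER_RULE = {"IDENTIFIER"} | {f"{kw}_SYMBOL" for kw in SOFT_KEYWORDS}


def token_type(word):
    """Mimic the lexer: keywords win over IDENTIFIER (case-insensitive)."""
    upper = word.upper()
    if upper in SOFT_KEYWORDS or upper in HARD_KEYWORDS:
        return f"{upper}_SYMBOL"
    return "IDENTIFIER"


def usable_as_identifier(word):
    """Mimic the parser: does the `identifier` rule accept this token?"""
    return token_type(word) in IDENTIFIER_RULE


for word in ("c", "value", "Order", "where"):
    print(word, token_type(word), usable_as_identifier(word))
```

Listing the keyword tokens in a parser rule keeps the lexer simple while still letting a query refer to a property such as `c.value` or `c.order`.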
