
Commit 1bdc914

add rst docs for regex cmd
Signed-off-by: Jialiang Liang <jiallian@amazon.com>
1 parent 066b658 commit 1bdc914

1 file changed

Lines changed: 184 additions & 0 deletions

File tree

docs/user/ppl/cmd/regex.rst

@@ -0,0 +1,184 @@
=============
regex
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
============
| The ``regex`` command filters search results by matching field values against a regular expression pattern. Only documents in which the specified field matches the pattern are included in the results.

Syntax
============
| regex <field> = <pattern>
| regex <field> != <pattern>
| regex <pattern>

* field: optional. The name of the field to match against. If omitted, the pattern is matched against the default field.
* pattern: mandatory string. The regular expression pattern to match. Supports Java regex syntax, including named groups, lookahead/lookbehind, and character classes.
* = : optional. Operator for positive matching: keeps documents whose field value matches the pattern (default behavior).
* != : optional. Operator for negative matching: excludes documents whose field value matches the pattern.

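The two operators can be pictured as a keep/drop predicate applied to each document. The following is a minimal Java sketch of that idea, not the plugin's implementation: the class ``RegexFilterSketch`` and its methods are hypothetical names, and unanchored matching through ``Matcher.find()`` is an assumption.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of the positive (=) and negative (!=) filter semantics. */
class RegexFilterSketch {

    /** Returns true when a value should stay in the result set. */
    static boolean keep(String value, String pattern, boolean negated) {
        // Assumption: the pattern may match anywhere in the value (find, not matches).
        boolean found = Pattern.compile(pattern).matcher(value).find();
        // "=" keeps matching values, "!=" keeps non-matching values.
        return negated != found;
    }

    /** Applies the predicate to every value, preserving order. */
    static List<String> filter(List<String> values, String pattern, boolean negated) {
        return values.stream()
                .filter(v -> keep(v, pattern, negated))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lastnames = List.of("Duke", "Bond", "Bates", "Adams");
        System.out.println(filter(lastnames, "^B", false)); // like: regex lastname = "^B"
        System.out.println(filter(lastnames, "^B", true));  // like: regex lastname != "^B"
    }
}
```

With the four sample last names, the first call keeps ``Bond`` and ``Bates`` and the second keeps ``Duke`` and ``Adams``.
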
Regular Expression Engine
==========================

The ``regex`` command uses Java's built-in regular expression engine (``java.util.regex``), which supports:

* **Standard regex features**: Character classes, quantifiers, anchors
* **Named capture groups**: ``(?<name>pattern)`` syntax
* **Lookahead/lookbehind**: ``(?=...)`` and ``(?<=...)`` assertions
* **Case sensitivity**: Patterns are case-sensitive by default

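The listed features can be tried directly against ``java.util.regex`` outside of PPL. The sketch below exercises each one on values taken from the sample data; the class ``RegexFeaturesSketch`` and its method names are hypothetical and exist only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo of the java.util.regex features listed above. */
class RegexFeaturesSketch {

    /** Named capture group: returns the "domain" group, or null when nothing matches. */
    static String domainOf(String email) {
        Matcher m = Pattern.compile("@(?<domain>[a-z]+)\\.com$").matcher(email);
        return m.find() ? m.group("domain") : null;
    }

    /** Lookahead: digits that are directly followed by " Holmes". */
    static boolean hasNumberBeforeHolmes(String address) {
        return Pattern.compile("\\d+(?= Holmes)").matcher(address).find();
    }

    /** Lookbehind: a capitalized word that is directly preceded by "Holmes ". */
    static boolean hasWordAfterHolmes(String address) {
        return Pattern.compile("(?<=Holmes )[A-Z][a-z]+").matcher(address).find();
    }

    /** Plain search; case-sensitive unless the pattern says otherwise. */
    static boolean contains(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(domainOf("amberduke@pyrami.com"));         // pyrami
        System.out.println(hasNumberBeforeHolmes("880 Holmes Lane")); // true
        System.out.println(hasWordAfterHolmes("880 Holmes Lane"));    // true
        System.out.println(contains("VA", "va"));                     // false
    }
}
```
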
Field Type Handling
===================

The ``regex`` command automatically converts non-string field values to strings before pattern matching:

* **String fields**: Used directly
* **Numeric fields**: Converted to string representation (e.g., ``42`` becomes ``"42"``)
* **Boolean fields**: Converted to ``"true"`` or ``"false"``
* **Null/missing fields**: Treated as non-matching

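The conversion rules above can be approximated in a few lines of Java. This is a sketch under assumptions, not the command's actual code: ``FieldCoercionSketch`` is a hypothetical class, ``String.valueOf`` stands in for whatever conversion the engine really performs (so the exact text produced for floating-point values may differ), and unanchored matching is assumed.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: coerce a field value to text, then apply the pattern. */
class FieldCoercionSketch {

    static boolean matches(Object fieldValue, String pattern) {
        if (fieldValue == null) {
            // Null or missing values never match. The guard matters because
            // String.valueOf((Object) null) would yield the text "null".
            return false;
        }
        // Assumption: numbers and booleans use their default Java string form.
        String text = String.valueOf(fieldValue);
        return Pattern.compile(pattern).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches(42, "^4\\d$"));     // numeric value seen as "42"
        System.out.println(matches(true, "^true$"));   // boolean value seen as "true"
        System.out.println(matches(null, ".*"));       // null never matches
    }
}
```
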
Example 1: Basic pattern matching
=================================

The example shows how to keep only documents whose ``lastname`` field consists of one uppercase letter followed by one or more lowercase letters.

PPL query::

    os> source=accounts | regex lastname="^[A-Z][a-z]+$" | fields account_number, firstname, lastname;
    fetched rows / total rows = 4/4
    +----------------+-----------+----------+
    | account_number | firstname | lastname |
    |----------------+-----------+----------|
    | 1              | Amber     | Duke     |
    | 6              | Hattie    | Bond     |
    | 13             | Nanette   | Bates    |
    | 18             | Dale      | Adams    |
    +----------------+-----------+----------+


Example 2: Negative matching
============================

The example shows how to exclude documents where the ``lastname`` field ends with "son". No last name in the sample data ends with "son", so all four documents are returned.

PPL query::

    os> source=accounts | regex lastname!=".*son$" | fields account_number, lastname;
    fetched rows / total rows = 4/4
    +----------------+----------+
    | account_number | lastname |
    |----------------+----------|
    | 1              | Duke     |
    | 6              | Bond     |
    | 13             | Bates    |
    | 18             | Adams    |
    +----------------+----------+


Example 3: Email domain matching
================================

The example shows how to keep only documents whose ``email`` field ends with the domain ``@pyrami.com``.

PPL query::

    os> source=accounts | regex email="@pyrami\.com$" | fields account_number, email;
    fetched rows / total rows = 1/1
    +----------------+----------------------+
    | account_number | email                |
    |----------------+----------------------|
    | 1              | amberduke@pyrami.com |
    +----------------+----------------------+


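The result above is consistent with unanchored matching: the pattern ``@pyrami\.com$`` has no leading ``.*``, yet it still selects ``amberduke@pyrami.com``. In Java terms this corresponds to ``Matcher.find()`` rather than ``Matcher.matches()``. The sketch below contrasts the two; that the command uses ``find()`` internally is an inference from the example, and ``PartialMatchSketch`` is a hypothetical class.

```java
import java.util.regex.Pattern;

/** Hypothetical comparison of partial (find) and whole-string (matches) matching. */
class PartialMatchSketch {

    /** True when the pattern matches some substring of the value. */
    static boolean partial(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).find();
    }

    /** True only when the pattern matches the entire value. */
    static boolean full(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).matches();
    }

    public static void main(String[] args) {
        String email = "amberduke@pyrami.com";
        System.out.println(partial(email, "@pyrami\\.com$"));  // true
        System.out.println(full(email, "@pyrami\\.com$"));     // false
        System.out.println(full(email, ".*@pyrami\\.com$"));   // true
    }
}
```

If whole-value matching is needed, anchoring the pattern with ``^`` and ``$`` gives the same effect under either method.
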
Example 4: Numeric field matching
=================================

The example shows how to match a pattern against a numeric field, which is automatically converted to a string. The pattern keeps ``account_number`` values that start with ``1`` followed by another digit.

PPL query::

    os> source=accounts | regex account_number="^1\d" | fields account_number, firstname;
    fetched rows / total rows = 2/2
    +----------------+-----------+
    | account_number | firstname |
    |----------------+-----------|
    | 13             | Nanette   |
    | 18             | Dale      |
    +----------------+-----------+


Example 5: Complex patterns with character classes
==================================================

The example shows how to combine character classes, quantifiers, and alternation: the pattern matches a three- or four-digit number, a capitalized word, and one of ``Street``, ``Lane``, or ``Court``.

PPL query::

    os> source=accounts | regex address="\d{3,4}\s+[A-Z][a-z]+\s+(Street|Lane|Court)" | fields account_number, address;
    fetched rows / total rows = 2/2
    +----------------+--------------------+
    | account_number | address            |
    |----------------+--------------------|
    | 6              | 671 Bristol Street |
    | 18             | 880 Holmes Lane    |
    +----------------+--------------------+


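A pattern of this size is easier to get right when it is first tried against a few sample values. The sketch below does that in plain Java with the two addresses from the output above plus two made-up negative cases; ``AddressPatternSketch`` is a hypothetical helper, and the inputs ``12 Elm Street`` and ``671 Bristol Avenue`` are invented for illustration. Note that backslashes are doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for trying the address pattern against sample values. */
class AddressPatternSketch {

    // Same pattern as the PPL query; "\\d" in Java source is the regex "\d".
    static final Pattern ADDRESS =
            Pattern.compile("\\d{3,4}\\s+[A-Z][a-z]+\\s+(Street|Lane|Court)");

    /** Assumption: the pattern may match anywhere in the address. */
    static boolean matches(String address) {
        return ADDRESS.matcher(address).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("671 Bristol Street"));  // true
        System.out.println(matches("880 Holmes Lane"));     // true
        System.out.println(matches("12 Elm Street"));       // false: fewer than three digits
        System.out.println(matches("671 Bristol Avenue"));  // false: suffix not in the alternation
    }
}
```
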
Example 6: Case-sensitive matching
==================================

The example demonstrates that regex matching is case-sensitive by default: the lowercase pattern ``va`` matches no document, while the uppercase pattern ``VA`` matches.

PPL query::

    os> source=accounts | regex state="va" | fields account_number, state;
    fetched rows / total rows = 0/0
    +----------------+-------+
    | account_number | state |
    |----------------+-------|
    +----------------+-------+

PPL query::

    os> source=accounts | regex state="VA" | fields account_number, state;
    fetched rows / total rows = 1/1
    +----------------+-------+
    | account_number | state |
    |----------------+-------|
    | 1              | VA    |
    +----------------+-------+


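Java's regex syntax includes the embedded flag ``(?i)``, which switches a pattern to case-insensitive matching. The sketch below shows its effect on the stored value ``VA`` in plain Java. Whether the ``regex`` command passes embedded flags through unchanged is an assumption that should be verified on a real cluster; ``CaseInsensitiveSketch`` is a hypothetical class.

```java
import java.util.regex.Pattern;

/** Hypothetical demo of default case-sensitive matching versus the (?i) flag. */
class CaseInsensitiveSketch {

    static boolean find(String value, String pattern) {
        return Pattern.compile(pattern).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(find("VA", "va"));      // false: case-sensitive by default
        System.out.println(find("VA", "(?i)va"));  // true: embedded flag ignores case
        // Equivalent to passing the flag at compile time:
        System.out.println(
                Pattern.compile("va", Pattern.CASE_INSENSITIVE).matcher("VA").find());
    }
}
```
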
Limitations
===========

* **Calcite engine required**: The ``regex`` command works only when the Calcite query engine is enabled (``plugins.calcite.enabled=true``).
* **Performance**: Complex regex patterns may slow queries, especially on large datasets.
* **Memory usage**: Compiled patterns are cached, so a very large number of unique patterns may consume significant memory.

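The trade-off behind the memory note can be illustrated with a small bounded cache of compiled patterns. The sketch below is one possible design (least-recently-used eviction on top of ``LinkedHashMap``), not a description of the plugin's actual cache; ``PatternCacheSketch`` and its ``maxEntries`` parameter are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical bounded cache that evicts the least recently used pattern. */
class PatternCacheSketch {

    private final Map<String, Pattern> cache;

    PatternCacheSketch(final int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    /** Returns the cached compiled pattern, compiling it on first use. */
    synchronized Pattern get(String regex) {
        Pattern compiled = cache.get(regex);
        if (compiled == null) {
            compiled = Pattern.compile(regex);
            cache.put(regex, compiled);
        }
        return compiled;
    }

    synchronized int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        PatternCacheSketch cache = new PatternCacheSketch(2);
        Pattern first = cache.get("a+");
        cache.get("b+");
        System.out.println(cache.get("a+") == first); // true: served from the cache
        cache.get("c+");                              // evicts "b+", the least recently used
        System.out.println(cache.size());             // 2
    }
}
```

Bounding the cache caps memory use at the cost of recompiling patterns that have been evicted.
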
Comparison with Related Commands
================================

**regex vs parse**:

* ``regex``: Filters documents based on pattern matching (boolean result)
* ``parse``: Extracts new fields from text using named capture groups

**regex vs where with LIKE**:

* ``regex``: Supports full Java regex syntax with advanced features
* ``LIKE``: Supports only basic SQL wildcards (``%`` and ``_``)

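Because ``LIKE`` has only two wildcards, any ``LIKE`` pattern can be rewritten as a regular expression, while the reverse is not true in general. The sketch below shows one way to do the translation in Java; ``LikeToRegexSketch`` is a hypothetical helper, it ignores ``LIKE`` escape characters, and it is not how the plugin evaluates ``LIKE``.

```java
import java.util.regex.Pattern;

/** Hypothetical translation of a SQL LIKE pattern into an anchored Java regex. */
class LikeToRegexSketch {

    static String likeToRegex(String like) {
        StringBuilder regex = new StringBuilder("^");
        StringBuilder literal = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%' || c == '_') {
                // Flush the pending literal text, quoted so regex metacharacters stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : "."); // % = any run, _ = any single character
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.append('$').toString();
    }

    static boolean like(String value, String likePattern) {
        return Pattern.compile(likeToRegex(likePattern)).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(likeToRegex("%son"));    // ^.*\Qson\E$
        System.out.println(like("Johnson", "%son")); // true
        System.out.println(like("Bond", "B_nd"));    // true
        System.out.println(like("abc", "a.c"));      // false: the dot is literal in LIKE
    }
}
```
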
Usage Notes
===========

* Use ``regex`` when filtering needs the full expressiveness of regular expressions
* For simple wildcard matching, ``where field LIKE pattern`` is an alternative
* Always test regex patterns against representative data to confirm that they match as intended and perform acceptably
