Commit 76d99da

msaelices, carolinefrasca, and yetalit authored
Regex package (#139)
* Bump mojo-websockets version, compatible with max 25.2
* Recipe for the regex library
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Set the build number 0 as it's a new package
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Fix the code to link to the github repo
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Fix revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* mojo-regex image
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* New revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* New revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Fix mojo package command
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* New revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Remove a lot of verbosity in the recipe description
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* New revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Point to the last revision
  Signed-off-by: Manuel Saelices <msaelices@gmail.com>
* Fixed mojo command and version pinning
* Possible fix on requirements part

---------

Signed-off-by: Manuel Saelices <msaelices@gmail.com>
Co-authored-by: Caroline Frasca <42614552+carolinefrasca@users.noreply.github.com>
Co-authored-by: Tiyagora <98420273+yetalit@users.noreply.github.com>
1 parent 67c4266 commit 76d99da

3 files changed

Lines changed: 226 additions & 0 deletions

recipes/regex/README.md

Lines changed: 197 additions & 0 deletions
@@ -0,0 +1,197 @@
# Mojo Regex
Regular Expressions Library for Mojo

`regex` is a regex library featuring a hybrid DFA/NFA engine architecture that automatically optimizes pattern matching based on pattern complexity.

It aims to provide an interface similar to the [re](https://docs.python.org/3/library/re.html) stdlib package while leveraging Mojo's performance capabilities.

## Disclaimer ⚠️

This software is in an early stage of development. Although it is functional, it is not yet feature-complete and may contain bugs. See the Implemented Features and TO-DO sections below for the current status.

## Implemented Features

### Basic Elements
- ✅ Literal characters (`a`, `hello`)
- ✅ Wildcard (`.`) - matches any character except newline
- ✅ Whitespace (`\s`) - matches space, tab, newline, carriage return, form feed
- ✅ Escape sequences (`\t` for tab, `\\` for literal backslash)

### Character Classes
- ✅ Character ranges (`[a-z]`, `[0-9]`, `[A-Za-z0-9]`)
- ✅ Negated ranges (`[^a-z]`, `[^0-9]`)
- ✅ Mixed character sets (`[abc123]`)
- ✅ Character ranges within groups (`(b|[c-n])`)

### Quantifiers
- ✅ Zero or more (`*`)
- ✅ One or more (`+`)
- ✅ Zero or one (`?`)
- ✅ Exact count (`{3}`)
- ✅ Range count (`{2,4}`)
- ✅ Minimum count (`{2,}`)
- ✅ Quantifiers on all elements (characters, wildcards, ranges, groups)
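
For reference, the counted forms above carry their conventional regex meaning. The sketch below uses Python's `re` module (the interface this library aims to resemble) rather than `regex` itself, so it illustrates the expected semantics under that assumption, not verified behavior of this package. The helper name `counted` is made up for the example.

```python
import re

# Python's stdlib `re`, used only as a reference for quantifier semantics.
# This is not the Mojo `regex` package; results there may differ.
def counted(pattern: str, text: str) -> bool:
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# {3}: exactly three repetitions
exact = [counted("a{3}", s) for s in ("aa", "aaa", "aaaa")]
# {2,4}: between two and four repetitions, inclusive
ranged = [counted("a{2,4}", s) for s in ("a", "aa", "aaaa", "aaaaa")]
# {2,}: two or more repetitions
at_least = [counted("a{2,}", s) for s in ("a", "aa", "aaaaaa")]

print(exact, ranged, at_least)
```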

### Anchors
- ✅ Start of string (`^`)
- ✅ End of string (`$`)
- ✅ Anchors in OR expressions (`^na|nb$`)
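
In conventional regex syntax, alternation binds more loosely than anchors, so `^na|nb$` reads as `(^na)|(nb$)`: either `na` at the start or `nb` at the end. A minimal sketch of that reading, using Python's `re` as a stand-in (an assumption; it is not the `regex` package itself):

```python
import re

# Reference semantics from Python's `re`, not from the Mojo `regex` package.
pattern = re.compile(r"^na|nb$")

starts_with_na = pattern.search("nab") is not None   # "na" at the start
ends_with_nb = pattern.search("xnb") is not None     # "nb" at the end
neither = pattern.search("xnbx") is not None         # "nb" is not at the end

# Grouping changes the meaning: the whole string must be "na" or "nb".
grouped = re.search(r"^(na|nb)$", "nab") is not None

print(starts_with_na, ends_with_nb, neither, grouped)
```

If both anchors should apply to every alternative, wrapping the alternatives in a group is the usual way to express it.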

### Groups and Alternation
- ✅ Capturing groups (`(abc)`)
- ✅ Alternation/OR (`a|b`)
- ✅ Complex OR patterns (`(a|b)`, `na|nb`)
- ✅ Nested alternations (`(b|[c-n])`)
- ✅ Group quantifiers (`(a)*`, `(abc)+`)

### Engine Features
- ✅ **Hybrid DFA/NFA Architecture** - Automatic engine selection for optimal performance
- ✅ **O(n) Performance** - DFA engine for simple patterns (literals, basic quantifiers, character classes)
- ✅ **Full Regex Support** - NFA engine with backtracking for complex patterns
- ✅ **Pattern Complexity Analysis** - Intelligent routing between engines
- ✅ **SIMD Optimization** - Vectorized character class matching
- ✅ **Pattern Compilation Caching** - Pre-compiled patterns for reuse
- ✅ **Match Position Tracking** - Precise start_idx, end_idx reporting
- ✅ **Simple API**: `match_first(pattern, text) -> Optional[Match]`
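
The routing idea behind the hybrid architecture can be pictured roughly as follows. This is a hypothetical Python sketch, not the library's code: the function name `choose_engine` and the heuristic (groups and alternation count as "complex", everything else as "simple") are assumptions made only to illustrate pattern-complexity analysis.

```python
# Hypothetical sketch of pattern-complexity routing. The names and the
# heuristic are illustrative assumptions, not the Mojo library's code.
COMPLEX_TOKENS = ("(", "|")  # groups and alternation: treated as "complex" here

def choose_engine(pattern: str) -> str:
    """Return "dfa" for simple patterns and "nfa" for everything else."""
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip an escaped character such as \. or \|
            continue
        if ch == "[":
            # Skip a whole character class; "(" and "|" are literals inside it.
            close = pattern.find("]", i + 1)
            if close == -1:
                return "nfa"  # unterminated class: leave it to the general engine
            i = close + 1
            continue
        if ch in COMPLEX_TOKENS:
            return "nfa"
        i += 1
    return "dfa"

for p in ("hello", "[a-z]+", "(com|org|net)"):
    print(p, "->", choose_engine(p))
```

A real analyzer would work on the parsed pattern rather than on raw characters; the string scan here only keeps the example short.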

## Installation

1. **Install [pixi](https://pixi.sh/latest/)**

2. **Add the Package** (at the top level of your project):

```bash
pixi add regex
```

## Example Usage

```mojo
from regex import match_first, findall


# Mojo source files run from a main() entry point.
def main():
    # Basic literal matching
    var result = match_first("hello", "hello world")
    if result:
        print("Match found:", result.value().match_text)

    # Find all matches
    var matches = findall("a", "banana")
    print("Found", len(matches), "matches:")
    for i in range(len(matches)):
        print("  Match", i, ":", matches[i].match_text, "at position", matches[i].start_idx)

    # Wildcard and quantifiers
    result = match_first(".*@.*", "user@domain.com")
    if result:
        print("Email found")

    # Find all numbers in text
    var numbers = findall("[0-9]+", "Price: $123, Quantity: 456, Total: $579")
    for i in range(len(numbers)):
        print("Number found:", numbers[i].match_text)

    # Character ranges
    result = match_first("[a-z]+", "hello123")
    if result:
        print("Letters:", result.value().match_text)

    # Groups and alternation
    result = match_first("(com|org|net)", "example.com")
    if result:
        print("TLD found:", result.value().match_text)

    # Find all domains in text
    var domains = findall("(com|org|net)", "Visit example.com or test.org for more info")
    for i in range(len(domains)):
        print("Domain found:", domains[i].match_text)

    # Anchors
    result = match_first("^https?://", "https://example.com")
    if result:
        print("Valid URL")

    # Complex patterns
    result = match_first("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", "user@example.com")
    if result:
        print("Valid email format")

    # Find all email addresses in text
    var emails = findall("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "Contact john@example.com or mary@test.org")
    for i in range(len(emails)):
        print("Email found:", emails[i].match_text)
```
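
In the example above, `findall` results expose `match_text` and `start_idx`. For readers coming from Python, the nearest `re` analogue is therefore `re.finditer`, which yields match objects, rather than `re.findall`, which returns plain strings. The sketch below is Python, shown only for comparison; the mapping to this library's fields is inferred from the example, not from its documentation.

```python
import re

# Python comparison only: m.group() plays the role of match_text and
# m.start() the role of start_idx in the Mojo example.
positions = [(m.group(), m.start()) for m in re.finditer("a", "banana")]

numbers = [
    m.group()
    for m in re.finditer("[0-9]+", "Price: $123, Quantity: 456, Total: $579")
]

print(positions)
print(numbers)
```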

## Building and Testing

```bash
# Build the package
./tools/build.sh

# Run tests
./tools/run-tests.sh

# Or run specific test
mojo test -I src/ tests/test_matcher.mojo

# Run benchmarks to see performance including SIMD optimizations
mojo benchmarks/bench_engine.mojo
```

## TO-DO: Missing Features

### High Priority
- [x] Global matching (`findall()`)
- [x] Hybrid DFA/NFA engine architecture
- [x] Pattern complexity analysis and optimization
- [x] SIMD-accelerated character class matching
- [x] SIMD-accelerated literal string search
- [x] SIMD capability detection and automatic routing
- [x] Vectorized quantifier processing for character classes
- [ ] Non-capturing groups (`(?:...)`)
- [ ] Named groups (`(?<name>...)` or `(?P<name>...)`)
- [ ] Predefined character classes (`\d`, `\w`, `\S`, `\D`, `\W`)
- [ ] Case insensitive matching options
- [ ] Match replacement (`sub()`, `gsub()`)
- [ ] String splitting (`split()`)
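
Until the predefined classes land, explicit ranges can express the same ASCII sets: `[0-9]` for `\d` and `[A-Za-z0-9_]` for `\w`, with the negated forms standing in for `\D` and `\W`. The equivalence can be checked in Python's `re`, used here as a reference with `re.ASCII` so that `\d` and `\w` are restricted to ASCII. Whether `regex` will define these classes the same way is an assumption.

```python
import re

# Reference check in Python's `re`; not the Mojo `regex` package.
text = "order_42 shipped: 7 items, id=A9"

digits_explicit = re.findall(r"[0-9]+", text)
digits_shorthand = re.findall(r"\d+", text, re.ASCII)

words_explicit = re.findall(r"[A-Za-z0-9_]+", text)
words_shorthand = re.findall(r"\w+", text, re.ASCII)

print(digits_explicit == digits_shorthand, words_explicit == words_shorthand)
```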

### Medium Priority
- [ ] Non-greedy quantifiers (`*?`, `+?`, `??`)
- [ ] Word boundaries (`\b`, `\B`)
- [ ] Match groups extraction and iteration
- [ ] Pattern compilation object
- [ ] Unicode character classes (`\p{L}`, `\p{N}`)
- [ ] Multiline mode (`^` and `$` match line boundaries)
- [ ] Dot-all mode (`.` matches newlines)
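
Non-greedy quantifiers differ from the greedy forms in how much text they consume. The contrast, along with a negated-class rewrite that needs only features listed as implemented, can be seen with Python's `re` (reference only; the behavior of `regex` is not verified here):

```python
import re

# Reference semantics from Python's `re`, not the Mojo `regex` package.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: `.+` runs to the last ">" in the text, giving one long match.
greedy = re.findall(r"<.+>", text)

# Non-greedy: `.+?` stops at the first ">" after each "<".
lazy = re.findall(r"<.+?>", text)

# Same result without a lazy quantifier, using a negated range.
negated_class = re.findall(r"<[^>]+>", text)

print(greedy)
print(lazy)
print(negated_class)
```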

### Advanced Features
- [ ] Positive lookahead (`(?=...)`)
- [ ] Negative lookahead (`(?!...)`)
- [ ] Positive lookbehind (`(?<=...)`)
- [ ] Negative lookbehind (`(?<!...)`)
- [ ] Backreferences (`\1`, `\2`)
- [ ] Atomic groups (`(?>...)`)
- [ ] Possessive quantifiers (`*+`, `++`)
- [ ] Conditional expressions (`(?(condition)yes|no)`)
- [ ] Recursive patterns
- [ ] Subroutine calls

### Engine Improvements
- [x] Hybrid DFA/NFA architecture with automatic engine selection
- [x] O(n) DFA engine for simple patterns
- [x] SIMD optimization for character class matching and literal string search
- [x] Pattern complexity analysis for optimal routing
- [x] SIMD capability detection for intelligent engine selection
- [x] Vectorized operations for quantifiers and repetition counting
- [ ] Additional DFA pattern support (more complex quantifiers and groups)
- [ ] Compile-time pattern specialization for string literals
- [ ] Aho-Corasick multi-pattern matching for alternations
- [ ] Advanced NFA optimizations (lazy quantifiers, cut operators)
- [ ] Parallel matching for multiple patterns

## Contributing

Contributions are welcome! To contribute, please follow the guidelines in the repository's [CONTRIBUTING.md](CONTRIBUTING.md) file.

## License

Mojo Regex is licensed under the [MIT license](LICENSE).

recipes/regex/image.jpeg

1.79 MB

recipes/regex/recipe.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
about:
  description: "# Mojo Regex\nRegular Expressions Library for Mojo\n\n`regex` is a\
    \ regex library featuring a hybrid DFA/NFA engine architecture that automatically\
    \ optimizes pattern matching based on complexity.\n\nIt aims to provide a similar\
    \ interface as the [re](https://docs.python.org/3/library/re.html) stdlib package\
    \ while leveraging Mojo's performance capabilities."
  homepage: https://github.com/msaelices/mojo-regex
  license: MIT
  license_file: LICENSE
  repository: https://github.com/msaelices/mojo-regex
  summary: Library for dealing with regular expressions in Mojo
build:
  number: 0
  script:
    - mkdir -p ${PREFIX}/lib/mojo
    - mojo package src/regex -o ${PREFIX}/lib/mojo/regex.mojopkg
context:
  version: 13.4.2
package:
  name: regex
  version: 0.1.0
requirements:
  host:
    - max =25.4.0
  run:
    - ${{ pin_compatible('max') }}
source:
  - git: https://github.com/msaelices/mojo-regex.git
    rev: b37e0e5c7d0548449d35fff2b419cf17eb91827c
