|
| 1 | +# Mojo Regex |
| 2 | +Regular Expressions Library for Mojo |
| 3 | + |
| 4 | +`regex` is a regex library featuring a hybrid DFA/NFA engine architecture that automatically selects the matching engine based on pattern complexity. |
| 5 | + |
| 6 | +It aims to provide an interface similar to Python's [re](https://docs.python.org/3/library/re.html) stdlib package while leveraging Mojo's performance capabilities. |
| 7 | + |
| 8 | +## Disclaimer ⚠️ |
| 9 | + |
| 10 | +This software is in an early stage of development. It is functional, but it is not yet feature-complete and may contain bugs. See the Implemented Features and TO-DO sections below for the current status. |
| 11 | + |
| 12 | +## Implemented Features |
| 13 | + |
| 14 | +### Basic Elements |
| 15 | +- ✅ Literal characters (`a`, `hello`) |
| 16 | +- ✅ Wildcard (`.`) - matches any character except newline |
| 17 | +- ✅ Whitespace (`\s`) - matches space, tab, newline, carriage return, form feed |
| 18 | +- ✅ Escape sequences (`\t` for tab, `\\` for literal backslash) |
| 19 | + |
| 20 | +### Character Classes |
| 21 | +- ✅ Character ranges (`[a-z]`, `[0-9]`, `[A-Za-z0-9]`) |
| 22 | +- ✅ Negated ranges (`[^a-z]`, `[^0-9]`) |
| 23 | +- ✅ Mixed character sets (`[abc123]`) |
| 24 | +- ✅ Character ranges within groups (`(b|[c-n])`) |
| 25 | + |
| 26 | +### Quantifiers |
| 27 | +- ✅ Zero or more (`*`) |
| 28 | +- ✅ One or more (`+`) |
| 29 | +- ✅ Zero or one (`?`) |
| 30 | +- ✅ Exact count (`{3}`) |
| 31 | +- ✅ Range count (`{2,4}`) |
| 32 | +- ✅ Minimum count (`{2,}`) |
| 33 | +- ✅ Quantifiers on all elements (characters, wildcards, ranges, groups) |
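
The counted quantifiers listed above can be tried with a minimal sketch like the following. It reuses only `findall`, `match_text` and `start_idx` as they appear in the Example Usage section; the input strings are made up for illustration, and the exact matches returned depend on the library's greedy and leftmost-match behavior, which is assumed here to follow Python's `re`.

```mojo
from regex import findall


def main():
    # {2,4}: runs of two to four digits (hypothetical input string)
    var runs = findall("[0-9]{2,4}", "7 42 123456")
    for i in range(len(runs)):
        print("2-4 digits:", runs[i].match_text, "at position", runs[i].start_idx)

    # {3}: exactly three lowercase letters per match
    var triples = findall("[a-z]{3}", "abcdef gh")
    for i in range(len(triples)):
        print("3 letters:", triples[i].match_text)

    # {2,}: two or more repetitions of a group
    var repeats = findall("(ab){2,}", "ab abab ababab")
    for i in range(len(repeats)):
        print("repeated group:", repeats[i].match_text)
```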
| 34 | + |
| 35 | +### Anchors |
| 36 | +- ✅ Start of string (`^`) |
| 37 | +- ✅ End of string (`$`) |
| 38 | +- ✅ Anchors in OR expressions (`^na|nb$`) |
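
A short sketch of how anchors combine with alternation. It assumes the same precedence as Python's `re`, where `^na|nb$` reads as "`na` at the start, or `nb` at the end" rather than anchoring both alternatives on both sides. The `report` helper and the input strings are hypothetical; only `match_first` and `match_text` come from this README.

```mojo
from regex import match_first


def report(pattern: String, text: String):
    # Hypothetical helper: print the matched text, or a note when nothing matches.
    var result = match_first(pattern, text)
    if result:
        print(text, "->", result.value().match_text)
    else:
        print(text, "-> no match")


def main():
    # Assumption: the pattern parses as (^na) OR (nb$), as in Python's `re`.
    report("^na|nb$", "nano")    # candidate for the `^na` branch
    report("^na|nb$", "turnb")   # candidate for the `nb$` branch
    report("^na|nb$", "banana")  # contains "na", but not at the start
```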
| 39 | + |
| 40 | +### Groups and Alternation |
| 41 | +- ✅ Capturing groups (`(abc)`) |
| 42 | +- ✅ Alternation/OR (`a|b`) |
| 43 | +- ✅ Complex OR patterns (`(a|b)`, `na|nb`) |
| 44 | +- ✅ Nested alternations (`(b|[c-n])`) |
| 45 | +- ✅ Group quantifiers (`(a)*`, `(abc)+`) |
| 46 | + |
| 47 | +### Engine Features |
| 48 | +- ✅ **Hybrid DFA/NFA Architecture** - Automatic engine selection for optimal performance |
| 49 | +- ✅ **O(n) Performance** - DFA engine matches simple patterns (literals, basic quantifiers, character classes) in time linear in the input length |
| 50 | +- ✅ **Full Regex Support** - NFA engine with backtracking for complex patterns |
| 51 | +- ✅ **Pattern Complexity Analysis** - Intelligent routing between engines |
| 52 | +- ✅ **SIMD Optimization** - Vectorized character class matching |
| 53 | +- ✅ **Pattern Compilation Caching** - Pre-compiled patterns for reuse |
| 54 | +- ✅ **Match Position Tracking** - Precise start_idx, end_idx reporting |
| 55 | +- ✅ **Simple API**: `match_first(pattern, text) -> Optional[Match]` |
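
As a minimal sketch of match position tracking with the simple API, the snippet below prints the position fields of a single match. `match_text` and `start_idx` are used the same way in the Example Usage section. `end_idx` is named in the feature list above, but reading it as a field on the match, and whether it is inclusive or exclusive, is an assumption here. The input string is made up.

```mojo
from regex import match_first


def main():
    # match_first returns an Optional[Match]; check it before reading fields.
    var result = match_first("[0-9]+", "order id: 4821")
    if result:
        print("text:     ", result.value().match_text)
        print("start_idx:", result.value().start_idx)
        # Assumption: end_idx is exposed alongside start_idx on the match.
        print("end_idx:  ", result.value().end_idx)
    else:
        print("no match")
```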
| 56 | + |
| 57 | +## Installation |
| 58 | + |
| 59 | +1. **Install [pixi](https://pixi.sh/latest/)** |
| 60 | + |
| 61 | +2. **Add the Package** (at the top level of your project): |
| 62 | + |
| 63 | + ```bash |
| 64 | + pixi add regex |
| 65 | + ``` |
| 66 | + |
| 67 | +## Example Usage |
| 68 | + |
| 69 | +```mojo |
| 70 | +from regex import match_first, findall |
| 71 | + |
| 72 | +# Basic literal matching |
| 73 | +var result = match_first("hello", "hello world") |
| 74 | +if result: |
| 75 | + print("Match found:", result.value().match_text) |
| 76 | + |
| 77 | +# Find all matches |
| 78 | +var matches = findall("a", "banana") |
| 79 | +print("Found", len(matches), "matches:") |
| 80 | +for i in range(len(matches)): |
| 81 | + print(" Match", i, ":", matches[i].match_text, "at position", matches[i].start_idx) |
| 82 | + |
| 83 | +# Wildcard and quantifiers |
| 84 | +result = match_first(".*@.*", "user@domain.com") |
| 85 | +if result: |
| 86 | + print("Email found") |
| 87 | + |
| 88 | +# Find all numbers in text |
| 89 | +var numbers = findall("[0-9]+", "Price: $123, Quantity: 456, Total: $579") |
| 90 | +for i in range(len(numbers)): |
| 91 | + print("Number found:", numbers[i].match_text) |
| 92 | + |
| 93 | +# Character ranges |
| 94 | +result = match_first("[a-z]+", "hello123") |
| 95 | +if result: |
| 96 | + print("Letters:", result.value().match_text) |
| 97 | + |
| 98 | +# Groups and alternation |
| 99 | +result = match_first("(com|org|net)", "example.com") |
| 100 | +if result: |
| 101 | + print("TLD found:", result.value().match_text) |
| 102 | + |
| 103 | +# Find all domains in text |
| 104 | +var domains = findall("(com|org|net)", "Visit example.com or test.org for more info") |
| 105 | +for i in range(len(domains)): |
| 106 | + print("Domain found:", domains[i].match_text) |
| 107 | + |
| 108 | +# Anchors |
| 109 | +result = match_first("^https?://", "https://example.com") |
| 110 | +if result: |
| 111 | + print("Valid URL") |
| 112 | + |
| 113 | +# Complex patterns |
| 114 | +result = match_first("^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$", "user@example.com") |
| 115 | +if result: |
| 116 | + print("Valid email format") |
| 117 | + |
| 118 | +# Find all email addresses in text |
| 119 | +var emails = findall("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "Contact john@example.com or mary@test.org") |
| 120 | +for i in range(len(emails)): |
| 121 | + print("Email found:", emails[i].match_text) |
| 122 | +``` |
| 123 | + |
| 124 | +## Building and Testing |
| 125 | + |
| 126 | +```bash |
| 127 | +# Build the package |
| 128 | +./tools/build.sh |
| 129 | + |
| 130 | +# Run tests |
| 131 | +./tools/run-tests.sh |
| 132 | + |
| 133 | +# Or run a specific test |
| 134 | +mojo test -I src/ tests/test_matcher.mojo |
| 135 | + |
| 136 | +# Run benchmarks to see performance including SIMD optimizations |
| 137 | +mojo benchmarks/bench_engine.mojo |
| 138 | +``` |
| 139 | + |
| 140 | +## TO-DO: Missing Features |
| 141 | + |
| 142 | +### High Priority |
| 143 | +- [x] Global matching (`findall()`) |
| 144 | +- [x] Hybrid DFA/NFA engine architecture |
| 145 | +- [x] Pattern complexity analysis and optimization |
| 146 | +- [x] SIMD-accelerated character class matching |
| 147 | +- [x] SIMD-accelerated literal string search |
| 148 | +- [x] SIMD capability detection and automatic routing |
| 149 | +- [x] Vectorized quantifier processing for character classes |
| 150 | +- [ ] Non-capturing groups (`(?:...)`) |
| 151 | +- [ ] Named groups (`(?<name>...)` or `(?P<name>...)`) |
| 152 | +- [ ] Predefined character classes (`\d`, `\w`, `\S`, `\D`, `\W`) |
| 153 | +- [ ] Case insensitive matching options |
| 154 | +- [ ] Match replacement (`sub()`, `gsub()`) |
| 155 | +- [ ] String splitting (`split()`) |
| 156 | + |
| 157 | +### Medium Priority |
| 158 | +- [ ] Non-greedy quantifiers (`*?`, `+?`, `??`) |
| 159 | +- [ ] Word boundaries (`\b`, `\B`) |
| 160 | +- [ ] Match groups extraction and iteration |
| 161 | +- [ ] Pattern compilation object |
| 162 | +- [ ] Unicode character classes (`\p{L}`, `\p{N}`) |
| 163 | +- [ ] Multiline mode (`^` and `$` match line boundaries) |
| 164 | +- [ ] Dot-all mode (`.` matches newlines) |
| 165 | + |
| 166 | +### Advanced Features |
| 167 | +- [ ] Positive lookahead (`(?=...)`) |
| 168 | +- [ ] Negative lookahead (`(?!...)`) |
| 169 | +- [ ] Positive lookbehind (`(?<=...)`) |
| 170 | +- [ ] Negative lookbehind (`(?<!...)`) |
| 171 | +- [ ] Backreferences (`\1`, `\2`) |
| 172 | +- [ ] Atomic groups (`(?>...)`) |
| 173 | +- [ ] Possessive quantifiers (`*+`, `++`) |
| 174 | +- [ ] Conditional expressions (`(?(condition)yes|no)`) |
| 175 | +- [ ] Recursive patterns |
| 176 | +- [ ] Subroutine calls |
| 177 | + |
| 178 | +### Engine Improvements |
| 179 | +- [x] Hybrid DFA/NFA architecture with automatic engine selection |
| 180 | +- [x] O(n) DFA engine for simple patterns |
| 181 | +- [x] SIMD optimization for character class matching and literal string search |
| 182 | +- [x] Pattern complexity analysis for optimal routing |
| 183 | +- [x] SIMD capability detection for intelligent engine selection |
| 184 | +- [x] Vectorized operations for quantifiers and repetition counting |
| 185 | +- [ ] Additional DFA pattern support (more complex quantifiers and groups) |
| 186 | +- [ ] Compile-time pattern specialization for string literals |
| 187 | +- [ ] Aho-Corasick multi-pattern matching for alternations |
| 188 | +- [ ] Advanced NFA optimizations (lazy quantifiers, cut operators) |
| 189 | +- [ ] Parallel matching for multiple patterns |
| 190 | + |
| 191 | +## Contributing |
| 192 | + |
| 193 | +Contributions are welcome! Please follow the guidelines in [CONTRIBUTING.md](CONTRIBUTING.md). |
| 194 | + |
| 195 | +## License |
| 196 | + |
| 197 | +This project is licensed under the [MIT license](LICENSE). |