While ChrysaLisp emphasizes high-performance text parsing through simple binary
search primitives (bfind, bskip), it also provides a robust and efficient
Regular Expression engine. Implemented as the Regexp class, this system is
used for complex pattern matching, global search, and sophisticated text
transformations in tools like grep, sed, and the Editor application.
The ChrysaLisp Regexp engine is a Non-deterministic Finite Automaton (NFA) implementation. It prioritizes correctness and predictable performance over the complex, backtracking-heavy engines found in many other environments.
The process follows a classic three-stage pipeline:
-
Tokenization: The pattern is broken into literals, operators, and character classes.
-
Postfix Conversion: The tokens are rearranged into postfix notation (Reverse Polish Notation) to handle operator precedence correctly.
-
Compilation: The postfix expression is compiled into a list of NFA instructions (meta-data), which represents the state machine.
The compiled "meta" object consists of instructions like:
-
(:lit char): Match a literal character. -
(:wild .): Match any character. -
(:class string): Match a character from an optimizedchar-class. -
(:nil offset1 [offset2]): A non-consuming split or jump instruction used to implement loops, branches, and quantifiers. -
(:assert type): Anchor matches like^(start),$(end), or!(word boundary).
ChrysaLisp supports both greedy (matching as much as possible) and lazy (matching as little as possible) quantifiers.
-
Greedy:
*(zero or more),+(one or more),?(zero or one). -
Lazy:
*?,+?,??.
The lazy behavior is achieved by the order of offsets in the (:nil ...) split
instructions during compilation. By swapping the priority of the loop vs. the
exit branch in the NFA, the engine naturally finds the shortest or longest
match.
Capture groups (...) are supported and automatically tracked during the
search. The results of a match include not just the full match bounds, but a
list of all submatches (capture groups).
(. regexp :search "2026-02-10" "(\\d+)-(\\d+)-(\\d+)")
; Returns a list of matches, each containing:
; ((0 10) (0 4) (5 7) (8 10))
; ^full ^$1 ^$2 ^$3
These submatches are essential for the replace-regex functionality, where they
can be referenced in the replacement string as $0, $1, etc.
For convenience, the Regexp functionality is wrapped in a set of global
functions in lib/text/searching.inc. These functions manage a global Regexp
instance (+regexp) and provide a simplified API:
-
match? (text pattern): Returns:tif the pattern matches anywhere. -
matches (text pattern): Returns a list of all matches (including submatches). -
replace-regex (text pattern replacement): Performs global search and replace.
Regexp replacement in ChrysaLisp is extremely efficient. The system
pre-compiles the replacement string into a "template" and then uses the
built-in splice primitive to perform the entire transformation in a
single, atomic operation. This avoids repeated string allocations and
copying.
The engine includes several optimizations:
-
Fast Path Scanning: Before running the full NFA, the
:match?and:searchmethods check if the pattern starts with a literal or a character class. If so, they use the high-speedbskipnorbskipprimitives to rapidly skip over text that cannot possibly match. -
Meta-Data Caching: Compiled NFA objects are cached in an
Fmapkeyed by the pattern string, ensuring that the compilation step only happens once per unique pattern. -
Graph Element Caching: During compilation, individual NFA segments are memoized to reduce object churn and memory usage.
By combining the theoretical elegance of NFAs with the practical speed of ChrysaLisp's string primitives, the Regexp system provides a powerful tool for complex text processing that remains consistent with the system's overall performance goals.