Text-native lexer/parser internals (closes #478) #480
Make the lexer's input tape, all Token payloads, and the P/Lex monad state operate on Data.Text rather than [Char]. This is the foundation for the Text-input parser API.

Changes:

* Token type's String fields become Text fields. Parser actions T.unpack at AST construction sites to keep the existing String-valued Syntax AST intact, so this is non-breaking for AST consumers.
* New module Language.Haskell.Exts.Parser.Text exposes Text-input variants of parseModule, parseExp, parseDecl, parseType, parsePat, parseStmt, parseImportDecl (with -WithMode and -WithComments variants for each). These skip the eager T.pack the existing String entry points perform at the boundary.
* lexWhile gains a Text-producing companion lexWhileT.
* getInput's String form is preserved for short lookaheads; hot loops (lexWhileT, discard, lexNewline, lexTab) work on Text natively.

The Syntax AST stays String-valued in this commit -- a Text-valued AST is a separate later change.

Adds dependency on the 'text' package.
Keyword/operator/pragma lookup tables (reserved_ops, special_varops, reserved_ids, special_varids, pragmas) now keep their keys as Text, so the lexer can look up directly against the Text spans produced by lexWhileT and skip a per-token T.pack/unpack round-trip. Numeric, escape, raw-pragma and identifier workers (lexOctal, lexBinary, lexHexadecimal, lexDecimal, lexExponent, lexEscape, lexRawPragma, lexIdents, lexConIdOrQual) now produce Text directly via lexWhileT instead of the [Char] -> T.pack pattern. parseInteger is generalised to fold over Text. String-cons accumulators in lexString and lexQQBody are kept (cons over [Char] is O(1); only the final T.pack pays for the conversion).
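A minimal sketch of the two ideas in this commit message, under stated assumptions: the token type `Tok`, the table `reservedIds`, and the names `classify` and `parseIntegerT` are hypothetical stand-ins, not the package's real definitions. It shows a lookup table whose keys are already Text, so a Text span can be looked up without a pack/unpack round-trip, and an integer parser written as a strict fold over Text.

```haskell
import           Data.Char (digitToInt)
import           Data.Maybe (fromMaybe)
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical, cut-down token type (the real Token has many more cases).
data Tok = KwLet | KwIn | KwWhere | Ident !Text
  deriving (Eq, Show)

-- Association list keyed by Text rather than String (illustrative entries only).
reservedIds :: [(Text, Tok)]
reservedIds =
  [ (T.pack "let",   KwLet)
  , (T.pack "in",    KwIn)
  , (T.pack "where", KwWhere)
  ]

-- Classify an identifier-shaped span directly; no T.unpack is needed
-- because the keys and the span have the same type.
classify :: Text -> Tok
classify s = fromMaybe (Ident s) (lookup s reservedIds)

-- Fold the digits of a literal in the given radix, working on Text.
-- Assumes every character is a valid digit for digitToInt.
parseIntegerT :: Integer -> Text -> Integer
parseIntegerT radix = T.foldl' step 0
  where
    step acc d = acc * radix + toInteger (digitToInt d)
```

With these definitions, `classify (T.pack "let")` is `KwLet`, `classify (T.pack "lettuce")` is `Ident "lettuce"`, and `parseIntegerT 16 (T.pack "ff")` is 255.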
Two surgical fixes to the lexer hot loops to remove allocation patterns that the Text-native rewrite would otherwise regress on.

* discard: strict bang on the let-bound 'rest' and 'newCh' so the monad continuation receives forced values, rather than a lazy T.drop closure that holds the input alive across continuations.
* lexWhileT: replace per-character recursion with a single T.span followed by a strict foldl' over the matched text for line/column tracking. This removes the O(n) thunk chain that recursing through the Lex CPS monad produces on long identifier-class runs.
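The lexWhileT change can be sketched roughly as follows. This is an illustration outside the Lex monad, not the PR's code: `Pos`, `advance`, and `spanWithPos` are hypothetical names, and tab handling is left out for brevity. The point is that one T.span finds the matched run and one strict fold updates the position, with a bang pattern forcing the result in the same spirit as the discard fix.

```haskell
{-# LANGUAGE BangPatterns #-}
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical position record; the real lexer keeps this in its monad state.
data Pos = Pos { posLine :: !Int, posCol :: !Int }
  deriving (Eq, Show)

-- Advance the position over one character (tabs are ignored in this sketch).
advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

-- Take the longest prefix satisfying p with a single T.span, then update
-- the position with one strict left fold over the matched slice.
spanWithPos :: (Char -> Bool) -> Pos -> Text -> (Text, Text, Pos)
spanWithPos p pos input =
  let (matched, rest) = T.span p input
      -- The bang forces the new position before the tuple is returned,
      -- so no chain of deferred position updates builds up.
      !pos'           = T.foldl' advance pos matched
  in (matched, rest, pos')
```

For example, `spanWithPos (/= ' ') (Pos 1 1) (T.pack "foo123 bar")` returns the slice "foo123", the remainder " bar", and `Pos 1 7`.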
* Make every Token payload field strict (!Text, !(Text,Text)). This forces the T.pack at token construction so the [Char] accumulator in lexString/lexQQBody is freed as soon as the token is yielded; the AST then holds only the materialized strict Text.
* Reformat lexString.loop, lexQQBody and lexWhiteChars to keep the one-action-per-line house style: first action immediately after 'do', vertically aligned matrix when consecutive case branches share shape.

A separate experiment with [Text] reverse-list accumulators (true strict-Text path, no Builder/Lazy) showed the [Char]-cons design is already optimal for lexString: per-char T.singleton allocates a ~48-byte Text record + ByteArray vs a 16-byte cons cell, making the strict-Text version 3.4x slower and using 90% more peak residency on the issue haskell-suite#478 stress case. The lower per-token-stream residency the [Text] approach showed on multi-literal corpora turned out to come from the AST's 'Literal l String String' representation, not from the lexer itself, and is left as Phase 2 work.
Adds lexTokenStreamText and lexTokenStreamTextWithMode in Language.Haskell.Exts.Lexer, the lexer-only counterparts of the Text-input parser entry points in Language.Haskell.Exts.Parser.Text. These let consumers that only need a token stream (linters, syntax highlighters, exact-print tooling) skip the eager Data.Text.pack at the String boundary that the existing String entry point performs. Also factors lexIt out as a top-level helper shared by both the String and Text variants.
Most string literals contain neither backslash escapes nor literal newlines. Previously every such literal was lexed character-by-character into a [Char] cons accumulator and then T.pack'd at the end, allocating one cons cell per character plus a fresh ByteArray.

This commit walks the input Text with T.span looking for the run of "plain" characters (not '"', '\\', or '\n'). If the run is followed by a closing '"', we emit StringTok with the slice itself -- no allocation per character, no T.pack copy, just a Text record (32 B) pointing at the existing input ByteArray. When an escape, gap or premature EOF is encountered, we fall back to the original cons-list 'loop' starting from the prefix already consumed.

Measured (15-trial 2σ-gated, vs Text PR tip without this fast path):
  big.hs (1 x 3 MB literal): alloc -99%, resid -99%, time -98%
  many.hs (200 x 50 kB literals): alloc -99%, resid -98%, time -97%
  repeat.hs (id-heavy, no literals): unchanged (fast path doesn't apply)

vs master (String API):
  big.hs: 234 MB -> 4 MB residency
  many.hs: 628 MB -> 17 MB residency

The slice keeps the input ByteArray alive while any token derived from it remains live; this is acceptable because the parser holds the whole input until parse completion anyway, and O(1) sharing beats the O(n) cons-and-pack cost we previously paid per literal.
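The fast path described above might look roughly like this in isolation. It is a sketch, not the PR's lexString: `isPlain` and `fastStringBody` are hypothetical names, the input is assumed to start just after the opening quote, and the fallback to the cons-list loop is represented only by returning Nothing.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- A character is "plain" if it needs no special handling inside a literal.
isPlain :: Char -> Bool
isPlain c = c /= '"' && c /= '\\' && c /= '\n'

-- Try the fast path on input that starts just after the opening quote.
--   Just (body, rest): body is a slice sharing the input's buffer,
--                      rest is the input after the closing quote.
--   Nothing:           an escape, newline or end of input was reached;
--                      a real lexer would fall back to its slow loop here.
fastStringBody :: Text -> Maybe (Text, Text)
fastStringBody input =
  case T.uncons rest of
    Just ('"', after) -> Just (body, after)
    _                 -> Nothing
  where
    (body, rest) = T.span isPlain input
```

Because T.span returns slices of the original Text, the successful case copies no characters, which matches the "just a Text record" cost the commit message describes.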
The original fast-path bailed to a [Char]-cons loop on the first escape or string-gap. This commit unifies fast and slow paths into a single scan-and-splice: at each iteration, slice the next plain run via T.span, then handle the terminator or escape, and continue.

Per-token allocation is now O(escapes), not O(chars):

- A literal with no escapes allocates one Text record (slice).
- A literal with K escapes allocates O(K) chunks + K T.singleton parsed-value records; intermediate runs are slices of the input.
- The reverse-ordered chunk list collapses to one strict Text via T.concat at token emission.

Wall-clock improvement vs the simpler fast-path (already 99% faster than master) on test corpora:
  big.hs -13.33% (0.015s -> 0.013s)
  many.hs -12.96% (0.054s -> 0.047s)
  repeat.hs no-effect

Allocation and residency unchanged on these escape-free corpora; the new path matters most on literals with sparse escapes (no test corpus available).
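A rough sketch of the scan-and-splice loop, under simplifying assumptions: `scanString`, `isPlain`, and `simpleEscape` are hypothetical names, only four single-character escapes are decoded, and string gaps and position tracking are omitted. It shows the shape of the technique: slice each plain run with T.span, push one chunk per run and per escape onto a reversed list, and join once with T.concat.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

isPlain :: Char -> Bool
isPlain c = c /= '"' && c /= '\\' && c /= '\n'

-- Decode a tiny, illustrative subset of escapes; the real lexer handles many more.
simpleEscape :: Char -> Maybe Char
simpleEscape 'n'  = Just '\n'
simpleEscape 't'  = Just '\t'
simpleEscape '\\' = Just '\\'
simpleEscape '"'  = Just '"'
simpleEscape _    = Nothing

-- Input starts just after the opening quote. On success, returns the decoded
-- body and the input after the closing quote.
scanString :: Text -> Maybe (Text, Text)
scanString = go []
  where
    go :: [Text] -> Text -> Maybe (Text, Text)
    go acc input =
      let (run, rest) = T.span isPlain input          -- slice, no copy
          acc'        = if T.null run then acc else run : acc
      in case T.uncons rest of
           -- Closing quote: join the reversed chunks exactly once.
           Just ('"', after)  -> Just (T.concat (reverse acc'), after)
           -- Escape: decode it, push a one-character chunk, keep scanning.
           Just ('\\', rest') -> do
             (e, rest'') <- T.uncons rest'
             c           <- simpleEscape e
             go (T.singleton c : acc') rest''
           -- Newline or end of input inside the literal.
           _                  -> Nothing
```

The chunk list grows by at most two entries per escape, so allocation in this sketch is proportional to the number of escapes rather than the number of characters.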
DanBurton
left a comment
A few nits. I'm still contemplating the larger change overall. The package hasn't been touched in over 6 years so I'm hesitant to make a breaking change after all this time.
Have you tried using Text for the internals but lazily converting back to String in the output, in order to preserve compatibility with the existing API? Or would that still incur enough of the costs from the pathological case so as to not actually solve the problem?
Or perhaps a compromise of both, exposing the new Text API, but preserving the existing String API as a thin wrapper around it? If a best-of-both-worlds solution that isn't a breaking change is possible, I'd prefer that.
Tangentially. I'd wager an LLM was used to assist in producing this. I'm fine with that, but perhaps at least adding attribution to the model(s) is warranted. Try asking it to rewrite the commits, citing itself as your co-author.
@@ -1,5 +1,5 @@
 Name: haskell-src-exts
-Version: 1.23.1
+Version: 1.23.2
This is an API-breaking change, so it would need to be a bump to 1.24.0.
eliminated by Phase 2)
Build.hs, Comments.hs, SrcLoc.hs (small): resid neutral to -38%,
alloc +11-13% (same small-file T.unpack overhead).
No need for all this in the changelog; just a single bullet for what the change is, and reference the PR number for the details.
Summary
Migration of the lexer/parser to Text-native input fixes the #478 OOM/slowness and at times gives improvements of 20%. Exposing a Text-based AST (while maintaining compatibility with the old API) is a follow-up PR.

Detailed changes by commit
* Text.
* Text.
* discard bangs the let-bound rest/newCh; lexWhileT is rewritten as a single T.span + foldl', removing the per-character thunk chain in the tokenizer hot loop.
* data Token = VarId !Text | …
* lexTokenStreamText/lexTokenStreamTextWithMode, the lexer-only counterparts.
* lexString fast-path: when a string literal has no escapes, slice the body directly from the input Text via T.span; avoids per-char allocation.
* lexString scan-and-splice. Allocates O(escapes) instead of O(chars).

Measurements
15-trial bench-mutex 2σ-gated, GHC 9.10, -O. Text API vs master parseModule:

Identifier-heavy code (repeat.hs) is unchanged: the AST scaffolding dominates residency there, not the lexer; an atom table at lex time would address it as a follow-up.

Real haskell-src-exts library files (Text API):

The +11–13% allocation on small files is the per-token T.unpack overhead at AST construction, which may be removed by migration to a Text-based AST.

Backwards compatibility
* Token payload fields change from String to Text. Consumers pattern-matching directly on Tokens see a type change. All other consumers (those reading the Syntax AST) see no change at all.
* parseModule/parseExp/etc. continue to take String. Behavior unchanged; just slower on small inputs by 11–13% allocation due to the round-trip via T.pack and T.unpack. Consumers that care about that switch to parseModuleText etc.
* text >= 1.2 is added as a build dependency.
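The compatibility story above amounts to keeping the String entry points as thin wrappers over Text-native ones. A toy sketch of that shape, with stand-in names: `parseWordsText` and `parseWords` are hypothetical and use T.words in place of a real parser, since the PR's actual signatures are not reproduced here.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical stand-in for a Text-native entry point.
parseWordsText :: Text -> [Text]
parseWordsText = T.words

-- String-facing wrapper: pack once at the boundary, unpack each result.
-- Existing String callers keep their types and pay only the conversion cost.
parseWords :: String -> [String]
parseWords = map T.unpack . parseWordsText . T.pack
```

Callers that already hold Text would call the Text-native function directly and skip both conversions, which is the trade-off the last two bullets describe.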