Text-native lexer/parser internals (closes #478) by mgajda · Pull Request #480 · haskell-suite/haskell-src-exts

mgajda · 2026-05-08T15:30:03Z

Summary

Migrating the lexer and parser to Text-native input fixes the OOM and slowness
reported in #478, and at times yields improvements of about 20%.

Exposing Text-based AST (while maintaining compatibility with old API) is a follow-up PR.

Detailed changes by commit

Move parser internals and entry points to Text.
Move lexer keyword tables and lex workers to Text.
Strict tokenizer. discard puts bangs on the let-bound rest/newCh;
lexWhileT is rewritten as a single T.span + foldl', removing
the per-character thunk chain in the tokenizer hot loop.
Strict Token fields. data Token = VarId !Text | …
Expose Text-input lexer entry points. lexTokenStreamText /
lexTokenStreamTextWithMode, the lexer-only counterparts.
lexString fast-path: when a string literal has no
escapes, slice the body directly from the input Text via
T.span; avoids per-char allocation.
lexString scan-and-splice. Allocates O(escapes) instead of O(chars).

Measurements

15-trial bench-mutex 2σ-gated, GHC 9.10, -O. Text API vs master
parseModule:

Corpus	Residency	Allocation	Wall clock
big.hs (1×3 MB literal, issue #478 stress)	−99.98% (234 MB → 52 KB)	−99.40% (1156 MB → 7 MB)	−98.71% (1.0 s → 13 ms)
many.hs (200×50 kB literals, 9.6 MB)	−99.05% (628 MB → 6 MB)	−99.22% (3858 MB → 30 MB)	−98.39% (2.9 s → 47 ms)
repeat.hs (5.1 MB, 1M id/100 unique)	no-effect	no-effect	no-effect

Identifier-heavy code (repeat.hs) is unchanged — the AST
scaffolding dominates residency there, not the lexer; an atom table
at lex time would address it as a follow-up.

Real haskell-src-exts library files (Text API):

File	Resid	Alloc	Time
InternalLexer.hs (58 kB)	−74%	−27%	−51%
ParseSyntax.hs (17 kB)	−54%	+13%	no-effect
Build.hs / Comments.hs / SrcLoc.hs (small)	neutral to −38%	+11–13%	no-effect / 1 ms slower

The +11–13% allocation on small files is the per-token T.unpack
overhead at AST construction, which may be removed by migration to Text-based AST.

Backwards compatibility

Token payload fields change from String to Text. Consumers
pattern-matching directly on Tokens see a type change. All other
consumers (those reading the Syntax AST) see no change at all.
parseModule/parseExp/etc. continue to take String. Behavior is
unchanged; small inputs allocate 11–13% more due to the round-trip
via T.pack and T.unpack. Consumers that care about that can switch
to parseModuleText etc.
Adds text >= 1.2 to build-depends.
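
The "String entry point as a thin wrapper over the Text one" arrangement described above can be sketched as follows. This is only an illustration of the shape: identsText and idents are hypothetical stand-ins, not functions from haskell-src-exts, and the toy "parser" just extracts alphanumeric runs.

```haskell
import           Data.Char (isAlphaNum)
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical Text-native core: returns the alphanumeric runs of the input.
identsText :: Text -> [Text]
identsText = filter (not . T.null) . T.split (not . isAlphaNum)

-- Hypothetical String-compatible wrapper: packs the input once at the
-- boundary and unpacks each result, which is the round-trip cost that
-- callers of the String API keep paying.
idents :: String -> [String]
idents = map T.unpack . identsText . T.pack
```

Callers that already hold Text would call the Text-native function directly and skip both conversions.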

Make the lexer's input tape, all Token payloads, and the P/Lex monad state operate on Data.Text rather than [Char]. This is the foundation for the Text-input parser API.

Changes:

* Token type's String fields become Text fields. Parser actions T.unpack at AST construction sites to keep the existing String-valued Syntax AST intact, so this is non-breaking for AST consumers.
* New module Language.Haskell.Exts.Parser.Text exposes Text-input variants of parseModule, parseExp, parseDecl, parseType, parsePat, parseStmt, parseImportDecl (with -WithMode and -WithComments variants for each). These skip the eager T.pack the existing String entry points perform at the boundary.
* lexWhile gains a Text-producing companion lexWhileT.
* getInput's String form is preserved for short lookaheads; hot loops (lexWhileT, discard, lexNewline, lexTab) work on Text natively.

The Syntax AST stays String-valued in this commit -- a Text-valued AST is a separate later change. Adds dependency on the 'text' package.

Keyword/operator/pragma lookup tables (reserved_ops, special_varops, reserved_ids, special_varids, pragmas) now keep their keys as Text, so the lexer can look up directly against the Text spans produced by lexWhileT and skip a per-token T.pack/unpack round-trip. Numeric, escape, raw-pragma and identifier workers (lexOctal, lexBinary, lexHexadecimal, lexDecimal, lexExponent, lexEscape, lexRawPragma, lexIdents, lexConIdOrQual) now produce Text directly via lexWhileT instead of the [Char] -> T.pack pattern. parseInteger is generalised to fold over Text. String-cons accumulators in lexString and lexQQBody are kept (cons over [Char] is O(1); only the final T.pack pays for the conversion).
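
A minimal sketch of the two ideas in this commit message: lookup tables keyed by Text, and an integer parser that folds over Text. The names Tok, reservedIds, classify and parseIntegerT are hypothetical, and using Data.Map for the table is an assumption; the commit message does not say which container the real tables use.

```haskell
import           Data.Char (digitToInt)
import qualified Data.Map.Strict as M
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical token type; the real Token has many more constructors.
data Tok = KwLet | KwIn | KwWhere | Ident !Text
  deriving (Eq, Show)

-- Keys are Text, so a span sliced from the input can be looked up
-- directly, without converting it to String first.
reservedIds :: M.Map Text Tok
reservedIds = M.fromList
  [ (T.pack "let",   KwLet)
  , (T.pack "in",    KwIn)
  , (T.pack "where", KwWhere)
  ]

-- Reserved words map to their keyword token; anything else is an identifier.
classify :: Text -> Tok
classify t = M.findWithDefault (Ident t) t reservedIds

-- Strict left fold over the digits; no intermediate String is built.
-- Assumes every character is a valid digit for the given base.
parseIntegerT :: Integer -> Text -> Integer
parseIntegerT base = T.foldl' step 0
  where
    step acc c = acc * base + toInteger (digitToInt c)
```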

Two surgical fixes to the lexer hot loops to remove allocation patterns that the Text-native rewrite would otherwise regress on.

* discard: strict bang on the let-bound 'rest' and 'newCh' so the monad continuation receives forced values, rather than a lazy T.drop closure that holds the input alive across continuations.
* lexWhileT: replace per-character recursion with a single T.span followed by a strict foldl' over the matched text for line/column tracking. This removes the O(n) thunk chain that recursing through the Lex CPS monad produces on long identifier-class runs.
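
The T.span plus strict fold pattern can be sketched outside the Lex monad as below. Pos, advance and takeWhileT are hypothetical names, and the column rule here is simplified (no tab handling), so this is a sketch of the technique rather than the PR's lexWhileT.

```haskell
{-# LANGUAGE BangPatterns #-}
import           Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical source position; strict fields so no thunks accumulate.
data Pos = Pos { posLine :: !Int, posCol :: !Int }
  deriving (Eq, Show)

-- Advance the position past one character (simplified: tabs count as 1).
advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

-- Take the longest prefix satisfying p with a single T.span, then update
-- the position with one strict fold over the matched slice.
takeWhileT :: (Char -> Bool) -> Pos -> Text -> (Text, Pos, Text)
takeWhileT p pos input =
  let (matched, rest) = T.span p input
      !pos'           = T.foldl' advance pos matched
  in (matched, pos', rest)
```

The matched prefix and the remainder are both slices of the input, so the only per-call work proportional to the run length is the scan itself.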

* Make every Token payload field strict (!Text, !(Text,Text)). This forces the T.pack at token construction so the [Char] accumulator in lexString/lexQQBody is freed as soon as the token is yielded; the AST then holds only the materialized strict Text.
* Reformat lexString.loop, lexQQBody and lexWhiteChars to keep the one-action-per-line house style: first action immediately after 'do', vertically aligned matrix when consecutive case branches share shape.

A separate experiment with [Text] reverse-list accumulators (true strict-Text path, no Builder/Lazy) showed the [Char]-cons design is already optimal for lexString: per-char T.singleton allocates a ~48-byte Text record + ByteArray vs a 16-byte cons cell, making the strict-Text version 3.4x slower and using 90% more peak residency on the issue haskell-suite#478 stress case. The lower per-token-stream residency the [Text] approach showed on multi-literal corpora turned out to come from the AST's 'Literal l String String' representation, not from the lexer itself, and is left as Phase 2 work.
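
What a strict constructor field changes can be seen in a small experiment. LazyTok, StrictTok and throwsOnWhnf are hypothetical names for illustration. A bang on a field forces that field to weak head normal form when the constructor is evaluated, so an unevaluated payload cannot be stored as a thunk.

```haskell
import Control.Exception (SomeException, evaluate, try)
import Data.Text (Text)

-- Same payload type; the only difference is the bang on the field.
data LazyTok   = LazyVarId Text
data StrictTok = StrictVarId !Text

-- True when evaluating the argument to weak head normal form throws.
throwsOnWhnf :: a -> IO Bool
throwsOnWhnf x = do
  r <- try (evaluate x)
  pure $ case r of
    Left e  -> const True (e :: SomeException)
    Right _ -> False
```

Under these definitions, `throwsOnWhnf (LazyVarId undefined)` reports no exception because the payload stays a thunk, while `throwsOnWhnf (StrictVarId undefined)` reports one because constructing the token evaluates the payload.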

Adds lexTokenStreamText and lexTokenStreamTextWithMode in Language.Haskell.Exts.Lexer, the lexer-only counterparts of the Text-input parser entry points in Language.Haskell.Exts.Parser.Text. These let consumers that only need a token stream (linters, syntax highlighters, exact-print tooling) skip the eager Data.Text.pack at the String boundary that the existing String entry point performs. Also factors lexIt out as a top-level helper shared by both the String and Text variants.

Most string literals contain neither backslash escapes nor literal newlines. Previously every such literal was lexed character-by-character into a [Char] cons accumulator and then T.pack'd at the end, allocating one cons cell per character plus a fresh ByteArray.

This commit walks the input Text with T.span looking for the run of "plain" characters (not '"', '\\', or '\n'). If the run is followed by a closing '"', we emit StringTok with the slice itself -- no allocation per character, no T.pack copy, just a Text record (32 B) pointing at the existing input ByteArray. When an escape, gap or premature EOF is encountered, we fall back to the original cons-list 'loop' starting from the prefix already consumed.

Measured (15-trial 2σ-gated, vs Text PR tip without this fast path):

big.hs (1 x 3 MB literal): alloc -99%, resid -99%, time -98%
many.hs (200 x 50 kB literals): alloc -99%, resid -98%, time -97%
repeat.hs (id-heavy, no literals): unchanged (fast path doesn't apply)

vs master (String API):

big.hs: 234 MB -> 4 MB residency
many.hs: 628 MB -> 17 MB residency

The slice keeps the input ByteArray alive while any token derived from it remains live; this is acceptable because the parser holds the whole input until parse completion anyway, and O(1) sharing beats the O(n) cons-and-pack cost we previously paid per literal.
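
The fast path described in this commit message can be sketched as a standalone function. plainStringBody is a hypothetical name, and the sketch assumes the input is positioned just after the opening quote; the real lexer also tracks source positions and falls back to its slow path instead of returning Nothing.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- Given the input just after an opening '"', return the literal body and
-- the input after the closing '"' when the body is entirely "plain".
-- Nothing means an escape, newline or end of input was hit first, and the
-- caller should fall back to the general path.
plainStringBody :: Text -> Maybe (Text, Text)
plainStringBody input =
  case T.uncons rest of
    Just ('"', after) -> Just (body, after)  -- body is a slice, not a copy
    _                 -> Nothing
  where
    (body, rest) = T.span isPlain input
    isPlain c    = c /= '"' && c /= '\\' && c /= '\n'
```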

The original fast-path bailed to a [Char]-cons loop on the first escape or string-gap. This commit unifies fast and slow paths into a single scan-and-splice: at each iteration, slice the next plain run via T.span, then handle the terminator or escape, and continue.

Per-token allocation is now O(escapes), not O(chars):

- A literal with no escapes allocates one Text record (slice).
- A literal with K escapes allocates O(K) chunks + K T.singleton parsed-value records; intermediate runs are slices of the input.
- The reverse-ordered chunk list collapses to one strict Text via T.concat at token emission.

Wall-clock improvement vs the simpler fast-path (already 99% faster than master) on test corpora:

big.hs -13.33% (0.015s -> 0.013s)
many.hs -12.96% (0.054s -> 0.047s)
repeat.hs no-effect

Allocation and residency unchanged on these escape-free corpora; the new path matters most on literals with sparse escapes (no test corpus available).
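
A scan-and-splice loop of this shape might look like the sketch below. lexStringBody and simpleEscape are hypothetical names, and only four single-character escapes are handled; string gaps, numeric escapes and position tracking are left out, so this illustrates the chunking strategy rather than the PR's lexString.

```haskell
import           Data.Text (Text)
import qualified Data.Text as T

-- Input is positioned just after the opening '"'. Returns the decoded
-- body and the input after the closing '"', or Nothing on a newline,
-- end of input, or an escape this sketch does not handle.
lexStringBody :: Text -> Maybe (Text, Text)
lexStringBody = go []
  where
    -- chunks holds the pieces seen so far, most recent first.
    go chunks input =
      let (run, rest) = T.span isPlain input   -- plain run is a slice
          chunks'     = if T.null run then chunks else run : chunks
      in case T.uncons rest of
           Just ('"', after) -> Just (T.concat (reverse chunks'), after)
           Just ('\\', esc)  -> do
             (c, after) <- simpleEscape esc
             go (T.singleton c : chunks') after
           _                 -> Nothing

    isPlain c = c /= '"' && c /= '\\' && c /= '\n'

    -- Toy subset of escapes, for illustration only.
    simpleEscape t = case T.uncons t of
      Just ('n', r)  -> Just ('\n', r)
      Just ('t', r)  -> Just ('\t', r)
      Just ('\\', r) -> Just ('\\', r)
      Just ('"', r)  -> Just ('"', r)
      _              -> Nothing
```

A literal without escapes goes through the loop once and produces a single chunk, so the escape-free case does no per-character work beyond the scan.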

DanBurton

A few nits. I'm still contemplating the larger change overall. The package hasn't been touched in over 6 years so I'm hesitant to make a breaking change after all this time.

Have you tried using Text for the internals but lazily converting back to String in the output, in order to preserve compatibility with the existing API? Or would that still incur enough of the costs from the pathological case so as to not actually solve the problem?

Or perhaps a compromise of both, exposing the new Text API, but preserving the existing String API as a thin wrapper around it? If a best-of-both-worlds solution that isn't a breaking change is possible, I'd prefer that.

Tangentially, I'd wager an LLM was used to assist in producing this. I'm fine with that, but perhaps at least adding attribution to the model(s) is warranted. Try asking it to rewrite the commits, citing itself as your co-author.

DanBurton · 2026-06-05T20:01:15Z

@@ -1,5 +1,5 @@
 Name:                   haskell-src-exts
-Version:                1.23.1
+Version:                1.23.2


This is an API-breaking change, so it would need to be a bump to 1.24.0.

DanBurton · 2026-06-05T20:02:42Z

+                                eliminated by Phase 2)
+    Build.hs, Comments.hs, SrcLoc.hs (small): resid neutral to -38%,
+    alloc +11-13% (same small-file T.unpack overhead).
+


No need for all this in the changelog; just a single bullet for what the change is, and reference the PR number for the details.

mgajda added 8 commits May 8, 2026 14:43

CHANGELOG: 1.23.2 entry with measured Phase 1 numbers

635133f

DanBurton requested changes Jun 5, 2026

mgajda wants to merge 8 commits into haskell-suite:master from mgajda:perf-text-lexer
