Pathological memory on oversize string literals (~95x source size) #478

@mgajda

On haskell-src-exts-1.23.1 (matches homplexity-0.4.8.1's pin),
Language.Haskell.Exts.parseModuleWithComments uses far more memory
than the size of its input when it hits a string literal with a few
MB of \-escaped bytes on a single line. Observed while running
homplexity over the GHC repository:

| File | Source size | Peak RSS | Ratio |
|------|-------------|----------|-------|
| libraries/base/GHC/Unicode/.../GeneralCategory.hs | 3.2 MB | 304 MB | ~95x |

Cause. The file contains a single literal of the form "\25\25\25..."
with a few MB of \-escaped bytes on one line. The lexer decodes each
escape into a Char and the parser materialises the literal as a String,
that is, a linked list of boxed Chars stored in the Literal node. A 3 MB
escaped literal expands to roughly 120 MB of : cons cells plus boxed
Chars, and the lexer holds additional intermediate state for the
duration.
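
For reference, the per-element cost behind that figure can be sketched as below. This is a back-of-envelope estimate, not a measurement: it assumes 64-bit GHC and no sharing of Char closures, and the names are made up for illustration. At 40 bytes per list element, 3 million elements come to 120 MB.

```haskell
-- Rough heap cost of a String on 64-bit GHC.
-- Assumption: every element pays for its own (:) cell and its own
-- boxed Char (no sharing), so this is an upper-bound style estimate.

wordBytes :: Int
wordBytes = 8

-- (:) cell: header word + pointer to the head + pointer to the tail.
consCellBytes :: Int
consCellBytes = 3 * wordBytes

-- C# box: header word + the unboxed character payload.
boxedCharBytes :: Int
boxedCharBytes = 2 * wordBytes

-- Estimated bytes of heap for a fully evaluated String of n elements.
stringHeapBytes :: Int -> Int
stringHeapBytes n = n * (consCellBytes + boxedCharBytes)
```

The remaining gap up to the observed 304 MB peak would then be lexer state and GC overhead, which this estimate does not try to model.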

Suggested fix. Store string-literal bytes as Text or ByteString
(or an offset+length into the source buffer) rather than eagerly
materialising a String. This changes the public AST type for string
literals, so it would be a breaking change, but it eliminates the
pathology entirely and is also a meaningful memory win on ordinary
source with many small literals.
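
A minimal sketch of the offset+length variant, to make the idea concrete. LitSpan and rawLiteral are hypothetical names and not part of haskell-src-exts; the sketch assumes the source is held as a strict ByteString and leaves escape decoding to whoever actually needs the decoded value.

```haskell
import qualified Data.ByteString.Char8 as BS

-- Hypothetical payload for a string-literal node: a slice of the
-- source buffer instead of a materialised String.
data LitSpan = LitSpan
  { litOffset :: !Int  -- byte offset of the first byte after the opening quote
  , litLength :: !Int  -- number of raw (still escaped) bytes in the literal
  } deriving (Eq, Show)

-- Recover the raw literal text on demand. For strict ByteStrings,
-- take and drop share the underlying buffer and copy nothing.
rawLiteral :: BS.ByteString -> LitSpan -> BS.ByteString
rawLiteral src (LitSpan off len) = BS.take len (BS.drop off src)
```

With this shape the AST holds two machine integers per literal regardless of its length, at the cost of keeping the source buffer alive for as long as the AST is.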

In the meantime, a pre-parse blob filter (reject lines > 1 KiB or string
literals > 4 KiB of raw escaped bytes) catches the worst ~0.1% of
blobs cheaply; that is what we ended up doing in homplexity-benchmark.
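
A filter along those lines could look roughly like the sketch below. It is not the code from homplexity-benchmark; the function names are made up, and the literal scan is deliberately naive: it does not understand comments, character literals such as '"', or quasi-quotes, so it can over- or under-count on unusual sources.

```haskell
import qualified Data.ByteString.Char8 as BS

maxLineBytes, maxLiteralBytes :: Int
maxLineBytes    = 1024  -- 1 KiB per line
maxLiteralBytes = 4096  -- 4 KiB of raw escaped bytes per literal

-- True when any single line is longer than the line limit.
hasOversizeLine :: BS.ByteString -> Bool
hasOversizeLine = any ((> maxLineBytes) . BS.length) . BS.lines

-- Scanner state: inside a literal? previous byte a backslash?
-- length of the current literal, and the longest one seen so far.
data Scan = Scan
  { inLit   :: !Bool
  , escaped :: !Bool
  , cur     :: !Int
  , best    :: !Int
  }

-- Length in raw bytes of the longest double-quoted run, treating a
-- backslash as escaping the next byte. An unterminated literal at the
-- end of the input is counted as well.
longestLiteral :: BS.ByteString -> Int
longestLiteral = finish . BS.foldl' step (Scan False False 0 0)
  where
    finish s = max (best s) (if inLit s then cur s else 0)
    step s c
      | not (inLit s) = if c == '"' then s { inLit = True, cur = 0 } else s
      | escaped s     = s { escaped = False, cur = cur s + 1 }
      | c == '\\'     = s { escaped = True, cur = cur s + 1 }
      | c == '"'      = s { inLit = False, best = max (best s) (cur s) }
      | otherwise     = s { cur = cur s + 1 }

-- Accept a blob only if it passes both checks.
acceptBlob :: BS.ByteString -> Bool
acceptBlob src =
  not (hasOversizeLine src) && longestLiteral src <= maxLiteralBytes
```

Both checks are a single strict pass over the bytes, so the filter itself stays cheap even on the inputs it is meant to reject.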

Happy to provide more reproduction material if useful.
