
search: pre-filter literal regex patterns with case-insensitive substring search #15825

Draft
kolmodin wants to merge 1 commit into NixOS:master from kolmodin:pr-15823-regex

Conversation

@kolmodin
Contributor

nix search spends more than 50% of its wall time running regex matching on the derivations' name, path and description.

I usually search for plain a-z/A-Z words and rarely use the regex capabilities, but I still pay the roughly 2x regex penalty.

This PR introduces a fast path when the search patterns can be matched with std::search rather than std::regex_search.
We still call out to std::regex_search if there's a match in order to get the highlighting.

Measured on nix search nixpkgs gemma, hyperfine 30 runs (3 warmup), warm cache:

before (vkwzvmrt):     1.193 s ± 0.012 s
after  (this commit):  0.568 s ± 0.006 s

That's a 2.10x speedup, ~625 ms / ~52% off wall time. Cold cache also benefits, though less so (14.690 s -> 13.738 s, ~6.5%), since cold time is dominated by Nix expression evaluation, not regex.

The total runtime for both before/after may be lower than on your machine, as I already had #15824 applied when I benchmarked.

I saw the discussion in #14876 about switching to a faster and safer regex engine, which I support, but I saw no consensus there. This work is orthogonal to that switch; those changes can still be made throughout the nix codebase.

Also see #15823.



@kolmodin kolmodin requested a review from edolstra as a code owner May 10, 2026 18:53
@github-actions github-actions Bot added the new-cli Relating to the "nix" command label May 10, 2026
…ring search

The new substring fast-path costs ~5-15 ms (~2-3% of wall time) — a
~40-60x reduction in string-matching cost.

`nix search nixpkgs gemma` against a warm eval cache spends ~52% of
its wall time inside libc++'s std::regex matcher. Every call into
regex_search / sregex_iterator is CPU-heavy (full NFA walk of the
input) and allocation-heavy (backtracking state on the heap, per
call). The overwhelming majority of derivations searched do not match
the user's pattern, so the regex engine does this work only to return
false.

When the pattern contains no POSIX-extended regex metacharacters
(e.g. plain words like 'gemma'), a case-insensitive substring search
is equivalent and orders of magnitude cheaper. Use it as a
pre-filter; if the literal isn't present in any of (path, name,
description), the regex cannot match either and we skip it. If it
*is* present, we fall through to the existing regex iterator so that
hiliteMatches still receives proper smatch objects.
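
A rough sketch of the pre-filter described above, not the code from this
commit. The names isLiteralPattern, containsIgnoreCase and findMatches are
hypothetical, the metacharacter set is a deliberately conservative guess, and
the case folding assumes ASCII-only patterns compiled with
std::regex::extended | std::regex::icase:

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: true if `pattern` contains no character that could be
// a POSIX-extended metacharacter, so it can only match itself (modulo case).
bool isLiteralPattern(std::string_view pattern)
{
    return pattern.find_first_of(".[]{}()*+?|^$\\") == std::string_view::npos;
}

// Hypothetical helper: case-insensitive substring test built on std::search.
// Assumes ASCII case folding is an acceptable stand-in for regex icase.
bool containsIgnoreCase(std::string_view haystack, std::string_view needle)
{
    if (needle.empty())
        return true; // the empty pattern matches everywhere
    auto eq = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(), eq) != haystack.end();
}

// Collects the regex matches in `text`, skipping the regex engine when the
// literal pre-filter already proves that nothing can match. The returned
// smatch objects hold iterators into `text`, so `text` must outlive them.
std::vector<std::smatch> findMatches(
    const std::string & text, const std::string & pattern, const std::regex & re)
{
    std::vector<std::smatch> matches;
    if (isLiteralPattern(pattern) && !containsIgnoreCase(text, pattern))
        return matches; // fast path: literal absent, so the regex cannot match
    // Slow path: run the regex so a highlighter still gets real smatch objects.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        matches.push_back(*it);
    return matches;
}
```

The regex is still the source of truth for match positions; the substring
test only decides whether it is worth running at all.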

Same treatment for excludeRegexes: literal exclude patterns are
skipped entirely if the literal is absent, avoiding the regex_search.
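
The exclude side could look roughly like the sketch below. It is an
illustration under assumptions, not the commit's code: ExcludePattern,
makeExclude and isExcluded are hypothetical names, a single text field stands
in for path, name and description, and the regex is still consulted when the
literal is present.

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: conservative check for POSIX-extended metacharacters.
bool isLiteralPattern(std::string_view pattern)
{
    return pattern.find_first_of(".[]{}()*+?|^$\\") == std::string_view::npos;
}

// Hypothetical helper: ASCII case-insensitive substring test via std::search.
bool containsIgnoreCase(std::string_view haystack, std::string_view needle)
{
    if (needle.empty())
        return true;
    auto eq = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(), eq) != haystack.end();
}

// Hypothetical record: an exclude pattern plus what we precomputed about it.
struct ExcludePattern
{
    std::string source;  // pattern text as typed by the user
    std::regex compiled; // compiled once, reused for every derivation
    bool literal;        // true if `source` has no metacharacters
};

ExcludePattern makeExclude(const std::string & source)
{
    return ExcludePattern{
        source,
        std::regex(source, std::regex::extended | std::regex::icase),
        isLiteralPattern(source)};
}

// True if any exclude pattern matches `text`. Literal patterns whose text is
// absent are skipped without touching the regex engine.
bool isExcluded(const std::string & text, const std::vector<ExcludePattern> & excludes)
{
    for (const auto & e : excludes) {
        if (e.literal && !containsIgnoreCase(text, e.source))
            continue; // literal absent: regex_search would return false
        if (std::regex_search(text, e.compiled))
            return true;
    }
    return false;
}
```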

Measured on `nix search nixpkgs gemma`, hyperfine 30 runs (3 warmup),
warm cache:

    before (vkwzvmrt):     1.193 s ± 0.012 s
    after  (this commit):  0.568 s ± 0.006 s

That's a 2.10x speedup, ~625 ms / ~52% off wall time. Cold cache also
benefits, though less so (14.690 s -> 13.738 s, ~6.5%), since cold
time is dominated by Nix expression evaluation, not regex.

Output is byte-for-byte identical for both literal and non-literal
patterns; verified with 'gemma', '^gem', '[Gg]emma', and
'gemma openllm' (AND-of-regexes).
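
One way to spot-check this kind of equivalence in isolation is to compare
the pre-filtered path against plain regex_search over a few patterns and
inputs. The sketch below is a hypothetical stand-alone harness, not the
verification actually used for this commit; all function names are made up,
and a multi-word query such as 'gemma openllm' is assumed to be passed as
separate patterns.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <regex>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: conservative check for POSIX-extended metacharacters.
bool isLiteralPattern(std::string_view pattern)
{
    return pattern.find_first_of(".[]{}()*+?|^$\\") == std::string_view::npos;
}

// Hypothetical helper: ASCII case-insensitive substring test via std::search.
bool containsIgnoreCase(std::string_view haystack, std::string_view needle)
{
    if (needle.empty())
        return true;
    auto eq = [](unsigned char a, unsigned char b) {
        return std::tolower(a) == std::tolower(b);
    };
    return std::search(haystack.begin(), haystack.end(),
                       needle.begin(), needle.end(), eq) != haystack.end();
}

// Reference behaviour: always ask the regex engine.
bool matchesSlow(const std::string & text, const std::regex & re)
{
    return std::regex_search(text, re);
}

// Candidate behaviour: literal pre-filter first, regex only if needed.
bool matchesFast(const std::string & text, const std::string & pattern, const std::regex & re)
{
    if (isLiteralPattern(pattern) && !containsIgnoreCase(text, pattern))
        return false;
    return std::regex_search(text, re);
}

// Counts the (pattern, text) pairs on which the two paths disagree.
std::size_t countDisagreements(
    const std::vector<std::string> & patterns, const std::vector<std::string> & texts)
{
    std::size_t bad = 0;
    for (const auto & p : patterns) {
        std::regex re(p, std::regex::extended | std::regex::icase);
        for (const auto & t : texts)
            if (matchesSlow(t, re) != matchesFast(t, p, re))
                ++bad;
    }
    return bad;
}
```

A result of zero on a small sample is only a sanity check; it does not cover
non-ASCII input, where regex case folding and std::tolower may differ.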
@kolmodin
Contributor Author

Friendly ping, @edolstra. Is this PR of interest?

@kolmodin kolmodin marked this pull request as draft May 18, 2026 07:44
@kolmodin
Contributor Author

I also converted search.cc to use boost::regex and, separately, RE2, just to compare (assisted by Claude Opus 4.7).

The boost::regex version is almost a complete drop-in replacement for std::regex, and it performs nearly as well as both this PR and the RE2 version.

RE2 is even faster than boost::regex and also guarantees linear execution time. boost::regex can take exponential time on some patterns, which mainly matters when the regex is untrusted.
RE2 pulls in absl though, which is a significant new dependency.

boost::regex seems to be the winner. I'll make a new PR with that version instead, stay tuned.

Converting this PR to a draft since it's only marginally faster than boost::regex and RE2, but brings in a lot of new code.

