Commit 78f196a
perf(scan): literal prefilter + lazy line counting in readAndGrep (#804)
## Summary
Two regex-level optimizations to narrow the perf gap with ripgrep on our
pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.
- **Literal prefilter** — ripgrep-style: extract a literal substring
from the regex source (e.g., `import` from `import.*from`), scan the
buffer with `indexOf` to locate candidate lines, only invoke the regex
engine on lines that contain the literal. V8's `indexOf` is roughly
SIMD-speed; skipping the regex engine on non-candidate lines is where
most of the win comes from.
- **Lazy line counting** — swapped `charCodeAt`-walk for `indexOf("\n",
cursor)` hops. 2-5× faster on the line-counting sub-loop because V8
implements `indexOf` in C++ without per-iteration JS interop.
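
A minimal sketch of the prefilter idea described above, not the code from this PR. It assumes the literal is non-empty, contains no newline, must appear in every match, and that matching is case-sensitive and per-line. The names `prefilterLines` and `LineSpan` are hypothetical.

```typescript
/** Half-open [start, end) offsets of one line inside the buffer. */
interface LineSpan {
  start: number;
  end: number;
}

/**
 * Return the spans of lines that match `regex`, running the regex only on
 * lines that contain `literal`. Lines without the literal are skipped by
 * `indexOf` and never reach the regex engine.
 */
function prefilterLines(buffer: string, regex: RegExp, literal: string): LineSpan[] {
  if (literal.length === 0) throw new Error("literal must be non-empty");
  const spans: LineSpan[] = [];
  let hit = buffer.indexOf(literal);
  while (hit !== -1) {
    // Expand the hit to the boundaries of its enclosing line.
    const start = hit === 0 ? 0 : buffer.lastIndexOf("\n", hit - 1) + 1;
    const newline = buffer.indexOf("\n", hit);
    const end = newline === -1 ? buffer.length : newline;

    // Reset lastIndex so global/sticky regexes behave the same on every line.
    regex.lastIndex = 0;
    if (regex.test(buffer.slice(start, end))) spans.push({ start, end });

    // Continue on the next line; one regex run per candidate line is enough.
    if (newline === -1) break;
    hit = buffer.indexOf(literal, end + 1);
  }
  return spans;
}
```

When the literal never occurs in the buffer, the first `indexOf` returns `-1` and the loop body never runs, which is the short-circuit case.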
## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)
| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |
The DSN workloads improve because `DSN_PATTERN` extracts `http` as its
literal — most source files don't contain `http` at all, so the
prefilter short-circuits before the regex runs.
No regressions on any benchmark. Pure-literal patterns (e.g.,
`SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer
path unchanged.
## What changed
### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)
Conservative literal extractor. Walks a regex source looking for the
longest contiguous run of literal bytes that every match must contain.
Ends the current run conservatively at character classes, groups,
lookarounds, quantifiers, and escape classes, and gives up entirely on
top-level alternation.
Handles escaped metacharacters intelligently: `Sentry\.init` yields
`Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields
`foo` (escape `\b` is an anchor, not a literal `b`).
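
To illustrate the escape and quantifier rules, here is a deliberately simplified extractor sketch. It is an assumption-laden toy, not `literal-extract.ts`: it gives up on any group, character class, brace, or alternation (more conservative than the extractor described here), treats every escaped letter or digit as a non-literal, and uses a hypothetical minimum length of 3.

```typescript
// Hypothetical threshold; the real minimum length is not stated in this PR.
const MIN_LITERAL_LENGTH = 3;

/**
 * Return the longest run of literal characters that every match must
 * contain, or null when no safe literal is found.
 */
function extractLiteralSketch(source: string, flags = ""): string | null {
  let best = "";
  let run = "";
  const endRun = (): void => {
    if (run.length > best.length) best = run;
    run = "";
  };

  for (let i = 0; i < source.length; i++) {
    const ch = source.charAt(i);
    // Groups, classes, braces, alternation: too complex for this sketch.
    if ("|()[]{}".includes(ch)) return null;
    if (ch === "\\") {
      if (i + 1 >= source.length) return null; // dangling backslash
      const next = source.charAt(++i);
      // \b, \d, \w, \1, \n ... are anchors, classes, or control escapes.
      if (/[A-Za-z0-9]/.test(next)) endRun();
      else run += next; // \. or \/ is the literal character itself
      continue;
    }
    if (ch === "*" || ch === "?" || ch === "+") {
      // The quantified character may be absent or repeated: drop it.
      run = run.slice(0, -1);
      endRun();
      continue;
    }
    if (ch === "." || ch === "^" || ch === "$") {
      endRun();
      continue;
    }
    run += ch;
  }
  endRun();
  if (best.length < MIN_LITERAL_LENGTH) return null;
  return flags.includes("i") ? best.toLowerCase() : best;
}
```

Dropping the character before `+` is stricter than necessary (it must occur at least once), but erring toward a shorter literal keeps the sketch safe.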
Exports:
- `extractInnerLiteral(source, flags)` — returns the literal, or null if
no safe extraction possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)` — true when the pattern IS a bare
literal with no metacharacters. Used by the grep pipeline to route
pure-literals to the whole-buffer path (V8's regex engine is
hyper-optimized for pure-literal patterns; the prefilter adds overhead
without benefit there).
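
A rough approximation of the pure-literal check, assuming "no metacharacters" means none of the usual regex syntax characters appear in the source. The name `isPureLiteralSketch` is hypothetical, and unlike the real export this sketch ignores `flags`.

```typescript
/**
 * True when `source` contains no regex syntax characters at all, so the
 * pattern matches exactly the text it spells out.
 */
function isPureLiteralSketch(source: string): boolean {
  return source.length > 0 && !/[\\^$.*+?()[\]{}|]/.test(source);
}
```

Under this definition `Sentry\.init` is not a pure literal, because the backslash is itself regex syntax, even though a literal can be extracted from it.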
### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)
Two-way dispatch in `readAndGrep` based on the extracted literal:
1. **`grepByLiteralPrefilter`** (new) — regex with extractable literal +
`multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs
the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`** — existing path, used for:
- Pure-literal patterns (V8 handles them optimally)
- Patterns with no extractable literal (complex regex, top-level
alternation)
- `multiline: false` mode (the fast path requires per-line semantics)
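
The routing rules above can be sketched as a small decision function. The input shape and the names `chooseGrepPath`, `DispatchInput`, and `GrepPath` are hypothetical; the actual `readAndGrep` may structure this differently.

```typescript
type GrepPath = "literal-prefilter" | "whole-buffer";

interface DispatchInput {
  /** Result of literal extraction, or null when nothing safe was found. */
  literal: string | null;
  /** True when the pattern is a bare literal with no metacharacters. */
  pureLiteral: boolean;
  /** Whether the grep runs with per-line (`multiline: true`) semantics. */
  multiline: boolean;
}

/** Pick the scan strategy for one pattern. */
function chooseGrepPath({ literal, pureLiteral, multiline }: DispatchInput): GrepPath {
  if (pureLiteral) return "whole-buffer"; // the engine already handles these well
  if (literal === null) return "whole-buffer"; // nothing to prefilter on
  if (!multiline) return "whole-buffer"; // prefilter needs per-line semantics
  return "literal-prefilter";
}
```

Keeping the decision separate from the scanning loops makes each route easy to test in isolation.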
Also: replaced the `charCodeAt`-walk that counted newlines char-by-char
with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx,
bounds)` as a shared helper to bundle the match-construction arguments.
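
For comparison, a sketch of the two newline-counting styles over a half-open range `[from, to)`. Function names are hypothetical; the real loop lives inside `grep.ts` and may track its cursor differently.

```typescript
/** Baseline: inspect every UTF-16 code unit and compare it to "\n" (10). */
function countNewlinesByWalk(text: string, from: number, to: number): number {
  let count = 0;
  for (let i = from; i < to; i++) {
    if (text.charCodeAt(i) === 10) count++;
  }
  return count;
}

/** Hop variant: let `indexOf` jump straight to each newline. */
function countNewlinesByHop(text: string, from: number, to: number): number {
  let count = 0;
  let cursor = text.indexOf("\n", from);
  while (cursor !== -1 && cursor < to) {
    count++;
    cursor = text.indexOf("\n", cursor + 1);
  }
  return count;
}
```

Counting becomes lazy when a caller remembers the last counted offset and only calls the hop variant for the stretch between that offset and the next match, so files without matches pay nothing for line numbers.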
### Tests added
- `test/lib/scan/literal-extract.test.ts` — **39 tests** covering the
extractor's rules (escape handling, quantifier drop, alternation bail,
case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts` — **7 new tests** for the prefilter fast
path: correctness vs whole-buffer, escaped-literal extraction,
case-insensitive flag, zero-literal-hit short-circuit, routing of pure
literals to whole-buffer, and alternation routing.
## Why this approach
From the ripgrep research (attached to PR #791): rg's central perf trick
is extracting a literal from each regex and prefiltering with SIMD
memchr. V8 doesn't expose SIMD directly but its
`String.prototype.indexOf` is compiled to a tight byte-level loop with
internal SIMD on x64 — functionally equivalent for our use case.
Three of the five techniques in the Loggly regex-perf guide were
evaluated:
- **Character classes over `.*`** — `DSN_PATTERN` already uses
`[a-z0-9]+`, no change needed.
- **Alternation order** — `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is
already correctly ordered (`.` more common than `:` in DSN hosts);
swapping the branches made perf marginally worse, within noise.
- **Anchors/word boundaries** — adding `\b` to `DSN_PATTERN` *regressed*
perf 2.8× on our workload. V8's existing fast character-mismatch
rejection on the first byte outperforms the boundary check overhead.
The remaining gap with rg is now primarily orchestration overhead
(async/await, `mapFilesConcurrent`, walker correctness features) rather
than regex speed. A worker-pool exploration may follow.
## Test plan
- [x] `bunx tsc --noEmit` — clean
- [x] `bun run lint` — clean (1 pre-existing warning in
`src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types` —
**5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated` — 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5` — all scan ops at or below
previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical
`GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the
prefilter fires
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 76adc9a commit 78f196a
5 files changed
Lines changed: 1168 additions & 103 deletions
File tree
- src/lib/scan
- test/lib/scan