
Commit 78f196a

BYK and claude authored
perf(scan): literal prefilter + lazy line counting in readAndGrep (#804)
## Summary

Two regex-level optimizations to narrow the perf gap with ripgrep on our pure-TS `collectGrep`/`grepFiles`. Follow-up to PR #791 and #797.

- **Literal prefilter** (ripgrep-style): extract a literal substring from the regex source (e.g., `import` from `import.*from`), scan the buffer with `indexOf` to locate candidate lines, only invoke the regex engine on lines that contain the literal. V8's `indexOf` is roughly SIMD-speed; skipping the regex engine on non-candidate lines is where most of the win comes from.
- **Lazy line counting**: swapped `charCodeAt`-walk for `indexOf("\n", cursor)` hops. 2-5× faster on the line-counting sub-loop because V8 implements `indexOf` in C++ without per-iteration JS interop.

## Perf impact (synthetic/large, 10k files, Bun 1.3.11, 4-core)

| Op | Before | After | Δ |
|---|---:|---:|---:|
| `scan.grepFiles` (DSN pattern) | 370 ms | **318 ms** | **−14%** |
| `detectAllDsns.cold` | 363 ms | **313 ms** | **−14%** |
| `detectDsn.cold` | 7.73 ms | **5.61 ms** | **−27%** |
| `scanCodeForFirstDsn` | 2.91 ms | **2.13 ms** | **−27%** |
| `scanCodeForDsns` | 342 ms | 333 ms | −3% (noise-equivalent) |
| `import.*from` uncapped (bench) | 1489 ms | **1178 ms** | **−21%** |

The DSN workloads improve because `DSN_PATTERN` extracts `http` as its literal. Most source files don't contain `http` at all, so the prefilter short-circuits before the regex runs.

No regressions on any benchmark. Pure-literal patterns (e.g., `SENTRY_DSN`, `NONEXISTENT_TOKEN_XYZ`) continue through the whole-buffer path unchanged.

## What changed

### New file: `src/lib/scan/literal-extract.ts` (~300 LOC)

Conservative literal extractor. Walks a regex source looking for the longest contiguous run of literal bytes that every match must contain. Bails out safely on top-level alternation, character classes, groups, lookarounds, quantifiers, and escape classes.
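
To make the extraction rule concrete, here is a minimal sketch of a conservative extractor. It is not the code in `src/lib/scan/literal-extract.ts`: the name `extractLiteralSketch`, the helper functions, the ASCII-only restriction, and the minimum length of 3 are all assumptions made for illustration. It gives up (returns `null`) whenever it sees something it does not model.

```typescript
// Hypothetical sketch only; names and thresholds are assumptions, not the PR's code.
// Escapes that consume one token but are not literal text (anchors, classes, control chars).
const NON_LITERAL_ESCAPES = "bBdDwWsSntrfv";

// `open` points at '['; returns the index of the closing ']' or -1.
function skipClass(src: string, open: number): number {
  for (let j = open + 1; j < src.length; j++) {
    const c = src.charAt(j);
    if (c === "\\") j++; // skip the escaped character
    else if (c === "]") return j;
  }
  return -1;
}

// `open` points at '('; returns the index of the matching ')' or -1.
function skipGroup(src: string, open: number): number {
  let depth = 0;
  for (let j = open; j < src.length; j++) {
    const c = src.charAt(j);
    if (c === "\\") j++;
    else if (c === "[") {
      j = skipClass(src, j);
      if (j < 0) return -1;
    } else if (c === "(") depth++;
    else if (c === ")" && --depth === 0) return j;
  }
  return -1;
}

function extractLiteralSketch(source: string, flags = "", minLength = 3): string | null {
  // Assumption: refuse the `v` flag and non-ASCII sources rather than model them.
  if (flags.includes("v") || /[^\x00-\x7f]/.test(source)) return null;
  let best = "";
  let run = "";
  const flush = (): void => {
    if (run.length > best.length) best = run;
    run = "";
  };
  for (let i = 0; i < source.length; i++) {
    const c = source.charAt(i);
    if (c === "|") return null; // top-level alternation: no literal is required
    if (c === "\\") {
      const next = source.charAt(++i);
      if (next === "") return null;
      if (/[A-Za-z0-9]/.test(next)) {
        // \b, \d, ... end the run; \x41, \u0041, \1, \p{..} are not modeled.
        if (!NON_LITERAL_ESCAPES.includes(next)) return null;
        flush();
      } else {
        run += next; // escaped metacharacter, e.g. `\.` is a literal '.'
      }
    } else if (c === "[") {
      flush();
      i = skipClass(source, i);
      if (i < 0) return null;
    } else if (c === "(") {
      flush();
      i = skipGroup(source, i);
      if (i < 0) return null;
    } else if (c === "*" || c === "?" || c === "{") {
      run = run.slice(0, -1); // the quantified character may be absent from a match
      flush();
      if (c === "{") {
        i = source.indexOf("}", i);
        if (i < 0) return null;
      }
    } else if (".+^$)]}".includes(c)) {
      flush(); // `+` keeps the preceding character but still ends the run
    } else {
      run += c;
    }
  }
  flush();
  if (best.length < minLength) return null;
  return flags.includes("i") ? best.toLowerCase() : best;
}
```

Under these rules, `import.*from` yields `import`, `Sentry\.init` yields `Sentry.init`, `\bfoo\b` yields `foo`, and `foo|bar` yields `null`. With the `i` flag the caller would also need to search a lowercased copy of the buffer, which this sketch leaves out.
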

The extractor handles escaped metacharacters intelligently: `Sentry\.init` yields `Sentry.init` (extracted via literal `\.` → `.`), while `\bfoo\b` yields `foo` (escape `\b` is an anchor, not a literal `b`).

Exports:

- `extractInnerLiteral(source, flags)`: returns the literal, or null if no safe extraction is possible. Honors `/i` by lowercasing.
- `isPureLiteral(source, flags)`: true when the pattern IS a bare literal with no metacharacters. Used by the grep pipeline to route pure-literals to the whole-buffer path (V8's regex engine is hyper-optimized for pure-literal patterns; the prefilter adds overhead without benefit there).

### Modified: `src/lib/scan/grep.ts` (~240 LOC changes)

Three-way dispatch in `readAndGrep` based on the extracted literal:

1. **`grepByLiteralPrefilter`** (new): regex with extractable literal + `multiline: true`. Uses `indexOf(literal)` to find candidate lines, runs the regex engine only on those. This is the main perf win.
2. **`grepByWholeBuffer`**: existing path, used for:
   - Pure-literal patterns (V8 handles them optimally)
   - Patterns with no extractable literal (complex regex, top-level alternation)
   - `multiline: false` mode (the fast path requires per-line semantics)

Also: replaced the `charCodeAt`-walk that counted newlines char-by-char with an `indexOf("\n", cursor)` hop loop. Extracted `buildMatch(ctx, bounds)` as a shared helper to bundle the match-construction arguments.

### Tests added

- `test/lib/scan/literal-extract.test.ts`: **39 tests** covering the extractor's rules (escape handling, quantifier drop, alternation bail, case-insensitive, minimum length).
- `test/lib/scan/grep.test.ts`: **7 new tests** for the prefilter fast path: correctness vs whole-buffer, escaped-literal extraction, case-insensitive flag, zero-literal-hit short-circuit, routing of pure literals to whole-buffer, and alternation routing.
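
The prefilter path and the newline hop loop described under `src/lib/scan/grep.ts` can be sketched together as follows. This is an illustration under stated assumptions, not the actual `grepByLiteralPrefilter`: the function name `grepWithPrefilterSketch` and the `LineMatch` shape are hypothetical, the pattern is assumed case-sensitive, the literal is assumed non-empty and free of newlines, and lines are assumed to end in `\n`.

```typescript
// Hypothetical result shape; the real pipeline returns GrepMatch objects.
interface LineMatch {
  line: number; // 1-based line number
  text: string; // full text of the matching line
}

function grepWithPrefilterSketch(content: string, regex: RegExp, literal: string): LineMatch[] {
  const matches: LineMatch[] = [];
  let lineNo = 1;      // line number of the line that starts at `countedUpTo`
  let countedUpTo = 0; // newlines before this offset are already counted
  let searchFrom = 0;

  while (searchFrom <= content.length) {
    // Prefilter: jump straight to the next occurrence of the required literal.
    const hit = content.indexOf(literal, searchFrom);
    if (hit === -1) break;

    // Expand the hit to the bounds of its line.
    const lineStart = hit === 0 ? 0 : content.lastIndexOf("\n", hit - 1) + 1;
    let lineEnd = content.indexOf("\n", hit);
    if (lineEnd === -1) lineEnd = content.length;

    // Lazy line counting: hop from newline to newline with indexOf,
    // and only over the region between the previous candidate and this one.
    let nl = content.indexOf("\n", countedUpTo);
    while (nl !== -1 && nl < lineStart) {
      lineNo++;
      nl = content.indexOf("\n", nl + 1);
    }
    countedUpTo = lineStart;

    // Only candidate lines ever reach the regex engine.
    const text = content.slice(lineStart, lineEnd);
    regex.lastIndex = 0; // matters only for regexes with the `g` or `y` flag
    if (regex.test(text)) matches.push({ line: lineNo, text });

    // Resume after this line so each line is tested at most once.
    searchFrom = lineEnd + 1;
  }
  return matches;
}
```

A call such as `grepWithPrefilterSketch(source, /import.*from/, "import")` never runs the regex on lines that lack `import`, and a buffer with no occurrence of the literal returns after a single `indexOf`. Testing the regex against the sliced line is one simple way to get per-line semantics; the PR does not say whether the real implementation slices lines or matches against the buffer.
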

## Why this approach

From the ripgrep research (attached to PR #791): rg's central perf trick is extracting a literal from each regex and prefiltering with SIMD memchr. V8 doesn't expose SIMD directly but its `String.prototype.indexOf` is compiled to a tight byte-level loop with internal SIMD on x64, which is functionally equivalent for our use case.

Three of the five techniques in the Loggly regex-perf guide were evaluated:

- **Character classes over `.*`**: `DSN_PATTERN` already uses `[a-z0-9]+`, no change needed.
- **Alternation order**: `DSN_PATTERN`'s `(?:\.[a-z]+|:[0-9]+)` is already correctly ordered (`.` more common than `:` in DSN hosts); swapping the branches regressed perf by a noise-level amount.
- **Anchors/word boundaries**: adding `\b` to `DSN_PATTERN` *regressed* perf 2.8× on our workload. V8's existing fast character-mismatch rejection on the first byte outperforms the boundary check overhead.

The remaining gap with rg is now primarily orchestration overhead (async/await, `mapFilesConcurrent`, walker correctness features) rather than regex speed. A worker-pool exploration may follow.

## Test plan

- [x] `bunx tsc --noEmit`: clean
- [x] `bun run lint`: clean (1 pre-existing warning in `src/lib/formatters/markdown.ts` unrelated to this PR)
- [x] `bun test --timeout 15000 test/lib test/commands test/types`: **5610 pass, 0 fail** (+58 new)
- [x] `bun test test/isolated`: 138 pass, 0 fail
- [x] `bun run bench --size large --runs 5`: all scan ops at or below previous baseline
- [x] Manually verified semantic parity: `collectGrep` returns identical `GrepMatch[]` on prefilter vs whole-buffer paths for patterns where the prefilter fires

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 76adc9a commit 78f196a

5 files changed

Lines changed: 1168 additions & 103 deletions
