fix(query-language): preserve grouped filters in regex mode #1138

brendan-kellam merged 2 commits into sourcebot-dev:main
Conversation
Walkthrough

The PR fixes regex mode query parsing by introducing offset-aware helpers.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ Passed checks (3 passed)
🧹 Nitpick comments (3)
packages/queryLanguage/src/tokens.ts (2)
198-298: Consider early-exiting once we know the inside isn't a single token.
`isRegexQueryGroupingAt` enumerates every top-level token into `topLevelTokens`, but the only facts that matter afterwards are (a) whether there is exactly one top-level token, and (b) the starting offset of that token (for the `firstCh` / prefix / `-` / `"` / `(` decisions). For a long paren body that already contains two or more top-level tokens, you can return `true` as soon as the second one is seen instead of finishing the scan.

♻️ Suggested simplification
```diff
-    const topLevelTokens: Array<{ start: number; end: number }> = [];
-
-    while (offset < closeOffset) {
-        const tokenStart = offset;
-        ...
-        topLevelTokens.push({ start: tokenStart, end: offset });
-        offset = skipWhitespace(input, offset);
-    }
-
-    if (topLevelTokens.length !== 1) {
-        return true;
-    }
-
-    const [{ start, end }] = topLevelTokens;
+    // Scan the first top-level token and bail out to "grouping" as soon as
+    // we discover a second one — we don't need the token list.
+    let firstTokenStart = -1;
+    let firstTokenEnd = -1;
+    let seenTokens = 0;
+    while (offset < closeOffset) {
+        const tokenStart = offset;
+        // ...same inner scan as today...
+        if (seenTokens === 0) {
+            firstTokenStart = tokenStart;
+            firstTokenEnd = offset;
+        }
+        seenTokens++;
+        if (seenTokens > 1) {
+            return true;
+        }
+        offset = skipWhitespace(input, offset);
+    }
+    if (seenTokens === 0) {
+        return true; // whitespace-only body
+    }
+    const start = firstTokenStart;
+    const end = firstTokenEnd;
```

Not a correctness issue — purely a small readability / worst-case improvement for deeply parenthesized regex bodies.
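Outside the diff, the early-exit pattern the comment describes can be shown as a tiny standalone function. This is only a sketch, not the real `tokens.ts` helper: the name `isGroupingBody` is hypothetical, and simple whitespace splitting stands in for the actual token scan.

```typescript
// Standalone sketch of the early-exit idea (hypothetical name, not the
// real tokens.ts helper): report "grouping" the moment a second
// whitespace-separated top-level token shows up, instead of collecting
// every token into an array first.
function isGroupingBody(body: string): boolean {
    const skipWs = (at: number): number => {
        while (at < body.length && /\s/.test(body.charAt(at))) at++;
        return at;
    };

    let firstTokenEnd = -1; // end offset of the first token; -1 = none seen yet
    let i = skipWs(0);
    while (i < body.length) {
        while (i < body.length && !/\s/.test(body.charAt(i))) i++; // consume one token
        if (firstTokenEnd !== -1) {
            return true; // second top-level token: grouping, stop scanning early
        }
        firstTokenEnd = i;
        i = skipWs(i);
    }
    if (firstTokenEnd === -1) {
        return true; // whitespace-only body counts as grouping
    }
    // Exactly one token: the real helper would now inspect its first
    // character (prefix, '-', '"', '('); this sketch just reports "single".
    return false;
}
```

The payoff is the same as in the diff: for a body with many top-level tokens, the scan stops at the second token rather than walking the whole paren body.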
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/queryLanguage/src/tokens.ts` around lines 198 - 298, The function isRegexQueryGroupingAt currently collects all top-level tokens into topLevelTokens but only needs to know if there's exactly one and the start of that one; change the loop so it tracks the first token's start (e.g., firstTokenStart) and a tokenCount instead of pushing into topLevelTokens, and as soon as tokenCount becomes 2 return true immediately (early-exit) to avoid scanning the rest of the paren body; keep the same logic afterward that uses the single token's start (firstTokenStart) and end to check startsWithPrefixAt, '-' handling, QUOTE, and nested OPEN_PAREN recursion.
300-335: Consider bounding the backward scan in `findMatchingOpenParenOffset`.

Unlike `hasUnmatchedOpenParen` (capped at 1000 chars backward), this helper walks the input backward without any upper bound. That wouldn't be a problem on its own, but in regex-mode `wordToken` it is invoked on every `)` encountered while consuming a word (lines 536–541), and it is also the first thing `closeParenToken` does in regex mode. On pathological inputs (e.g., a long regex word containing many `)` characters with no matching `(`), each `)` triggers a full backward scan of everything consumed so far — an easy path to O(n²) tokenization.

Two options:

- Match the 1000-char safety limit used in `hasUnmatchedOpenParen` so the scan is bounded. This is the smallest change and keeps behavior consistent within this file.
- Better: memoize the result (or keep a running paren stack) in the regex-mode `wordToken` loop so the information is computed once up front instead of recomputed per `)`.

♻️ Minimal bounded-scan fix
```diff
 function findMatchingOpenParenOffset(input: InputStream): number | null {
     if (input.next !== CLOSE_PAREN) {
         return null;
     }
     let offset = -1;
     let depth = 1;
-    while (true) {
+    const MIN_OFFSET = -1000;
+    while (offset >= MIN_OFFSET) {
         const ch = input.peek(offset);
         ...
     }
+    return null;
 }
```

🤖 Prompt for AI Agents
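The bounded-scan option can also be sketched as a self-contained function over a plain string rather than the real `InputStream`. Everything here is illustrative: the name `findMatchingOpenParen`, the string-based signature, and the deliberately simplified escape handling are assumptions, not the actual implementation.

```typescript
// Hypothetical string-based variant of the bounded backward paren scan:
// walk left from a ')' looking for its matching '(', but give up after
// MAX_SCAN characters so each call stays bounded even on pathological
// inputs full of unmatched ')' characters.
const MAX_SCAN = 1000;

function findMatchingOpenParen(text: string, closeIndex: number): number | null {
    if (text.charAt(closeIndex) !== ")") {
        return null;
    }
    let depth = 1;
    const lowest = Math.max(0, closeIndex - MAX_SCAN);
    for (let i = closeIndex - 1; i >= lowest; i--) {
        // Treat a backslash-escaped character (e.g. "\)" inside a regex)
        // as inert: skip it and the backslash together. (Simplified: an
        // escaped backslash "\\" is not handled.)
        if (i > 0 && text.charAt(i - 1) === "\\") {
            i--;
            continue;
        }
        const ch = text.charAt(i);
        if (ch === ")") {
            depth++;
        } else if (ch === "(") {
            depth--;
            if (depth === 0) {
                return i;
            }
        }
    }
    return null; // no match within the bound
}
```

The 1000-character cap mirrors the limit the review cites for `hasUnmatchedOpenParen`; the memoize/paren-stack option would remove the per-`)` rescan entirely rather than just bounding it.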
Verify each finding against the current code and only fix it if needed. In `@packages/queryLanguage/src/tokens.ts` around lines 300 - 335: the backward scan in findMatchingOpenParenOffset can run unbounded and cause O(n²) behavior; bound the scan like hasUnmatchedOpenParen by stopping after a fixed limit (e.g., MAX_UNMATCHED_SCAN = 1000) and returning null if exceeded. Edit findMatchingOpenParenOffset to track how many positions it has moved (or check offset <= -MAX_UNMATCHED_SCAN) and break/return null when the limit is reached; keep the rest of the escape and depth logic unchanged so callers like wordToken and closeParenToken get a safe, bounded result.

packages/queryLanguage/test/regex.txt (1)
56-74: Good regression coverage; minor inconsistency worth noting.

The new fixtures line up with the grammar (`ParenExpr { openParen query? closeParen }`, `NegateExpr { negate (PrefixExpr | ParenExpr) }`) and exercise the interesting regex-mode grouping paths (top-level OR group, AND with prefix + group, multi-filter AND + group with internal spaces, negated group).

Only nit: the `repo:` regex on line 62 encodes `-` as `\x2d` while line 67 uses a plain `-`. Both are equivalent regex atoms, and both should travel through the tokenizer as part of the same `Term`/prefix word, so the divergence isn't strictly needed — unless the intent was explicitly to cover the hex-escape path. Consider either aligning on plain `-` or adding a one-line comment above line 62 stating that `\x2d` is there to exercise hex escapes inside a regex prefix value.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@packages/queryLanguage/test/regex.txt` around lines 56 - 74, The fixture using repo:^github\.com/sourcebot\x2ddev/sourcebot$ is inconsistent with the later fixture that uses a plain hyphen; either change the hex-escaped \x2d to a plain - so both regex-prefix fixtures use the same form, or add a one-line comment above the repo:^github\.com/sourcebot\x2ddev/sourcebot$ fixture stating that \x2d is intentionally used to exercise hex-escape handling in the tokenizer (the surrounding grammar symbols to locate the case are the repo: prefix and the ParenExpr/NegateExpr fixtures such as Program(AndExpr(PrefixExpr(RepoExpr),ParenExpr(...)))).
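A quick way to convince yourself of the `\x2d` point from the regex.txt comment: in an engine with hex escapes (JavaScript shown here purely as an illustration), `\x2d` and a literal `-` compile to the same atom, so both fixture spellings accept the same repo string.

```typescript
// \x2d is the hex escape for '-', so these two patterns are equivalent
// atom-for-atom and match exactly the same inputs.
const hexEscaped = new RegExp("^github\\.com/sourcebot\\x2ddev/sourcebot$");
const plainHyphen = new RegExp("^github\\.com/sourcebot-dev/sourcebot$");

const repo = "github.com/sourcebot-dev/sourcebot";
const bothMatch = hexEscaped.test(repo) && plainHyphen.test(repo);
```

This only demonstrates regex-level equivalence; whether the query-language tokenizer treats the two spellings identically is what the fixtures themselves verify.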
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 10c94d85-06a4-4244-80bb-961710e07771
📒 Files selected for processing (4)
- CHANGELOG.md
- packages/queryLanguage/src/tokens.ts
- packages/queryLanguage/test/regex.txt
- packages/web/src/features/search/parser.ts
Summary
- `(foo|bar)` parsed as single regex terms or groups

Testing

- `../../node_modules/.bin/vitest --run` (from `packages/queryLanguage`)

🤖 Agent Plan
- `isRegexEnabled` changes query parsing and identify where field-specific filters are converted into regex queries.
- `(regex or regex)` grouping discussed in this thread.
- `packages/queryLanguage` Vitest suite and verify the failing queries parse correctly.
- `vendor/zoekt` submodule rewind, from the PR.

Summary by CodeRabbit
Release Notes