feat(pattern-matcher): support !-prefixed negation in excluded_patterns#1901
feat(pattern-matcher): support !-prefixed negation in excluded_patterns#1901SuperMarioYL wants to merge 2 commits into
Conversation
Enables gitignore-style exceptions within excluded_patterns so users
can exclude broad categories while allowing specific paths through.
A pattern beginning with ! un-excludes paths that would otherwise be
excluded by preceding patterns. For example:
PatternFilePathMatcher(
excluded_patterns=[
"**/.*", # exclude all dot-entries
"!**/.github/**", # but keep .github through
]
)
Directory traversal is handled correctly: is_dir_included() uses a
probe path (<dir>/__probe__) to detect whether any negation pattern
could match files inside an otherwise-excluded directory, preventing
premature pruning.
Fixes cocoindex-io#1778.
|
thanks a lot ! @georgeh0 can help take a look! |
| /// Negation patterns (``!``-prefixed in the original list, stored without the ``!``). | ||
| /// A path that matches one of these is *not* excluded even if it matches the regular | ||
| /// exclusion patterns. | ||
| negation_glob_set: Option<GlobSet>, |
There was a problem hiding this comment.
Consider call it negation_excluded_glob_set to be more clear.
There was a problem hiding this comment.
Done — renamed to negation_excluded_glob_set throughout (field, variable, and doc-comments).
| // Probe one level inside: catches patterns like `**/.github/**`. | ||
| let probe = format!("{}/__probe__", path); | ||
| if neg_gs.is_match(probe.as_str()) { | ||
| return true; | ||
| } |
There was a problem hiding this comment.
Will this work as expected?
e.g. if users put the following excluded_patterns:
dir1/**!dir1/dir2/dir3/a.yml
It won't work as expected.
I think when cases become more complex, we may consider using some more structured approach. e.g. we may consider allowing users to provide "sub matchers" for specific folders like:
sub_matchers: {
'dir1': {
exclude_patterns: ['**'],
sub_matchers: {
'dir2/dir3': {
inclue_patterns: ['a.yml'],
},
},
},
}
(throwing some rough thoughts, don't have a very concrete spec yet)
Besides, note that another feasible fallback option is post-filtering during iteration (and it's easier to do that in v1). Just wonder is it not sufficient for your case?
There was a problem hiding this comment.
Good catch — you are right that the original probe-only approach fails for this case.
With patterns ["dir1/**", "!dir1/dir2/dir3/a.yml"], the GlobSet probe (dir1/__probe__) does not match the exact-path negation pattern, so is_dir_included("dir1") used to return false, pruning the whole subtree before the exempt file could ever be reached.
I fixed this by retaining the raw negation strings in a negation_patterns_raw: Vec<String> field and adding a raw-prefix check inside is_dir_included: if any negation string starts with <dir>/, the directory is allowed through. The existing GlobSet probe is kept as-is — it still handles wildcard negations such as !**/.github/** where the pattern contains ** and the probe already works.
Added a dedicated test (test_negation_exact_path_deep_dir_traversal) that covers exactly the case you raised, and confirmed all 34 tests pass.
Regarding the structured sub_matchers idea for more complex cases — that is a clean design and would definitely give users more expressive power for nested policies. Happy to explore that in a follow-up if the team is interested; this PR just lands the flat-list negation semantics as a lower-friction first step.
There was a problem hiding this comment.
Thanks for iterating on this!
The original case works now, but this still doesn't work as expected:
dir1/**!dir?/dir2/dir3/a.yml
I'm worrying there can be some fundamental difficulty in supporting these negate-exclude every well under this API.
There was a problem hiding this comment.
Thanks @georgeh0 — you're right, the dir?/dir2/dir3/a.yml case is broken by both of the current checks:
- The GlobSet probe (
dir1/__probe__) fails because the negation pattern expects 4 segments ending ina.yml, and the 2-segment probe path can't match. - The raw-prefix check (
negation.starts_with("dir1/")) fails becausedir?is a literal?in the string and doesn't equaldir1.
I think this is fixable inside the current API without sub_matchers, by replacing the probe-and-prefix pair with a single ancestor-globset built from each negation pattern's path-prefix segments. For !dir?/dir2/dir3/a.yml the ancestors are:
dir?
dir?/dir2
dir?/dir2/dir3
is_dir_included then checks the candidate directory against the ancestor GlobSet; if any prefix matches, the directory is allowed through. The same construction subsumes the wildcard case — !**/.github/** produces ancestors ** and **/.github, which match any depth.
That said, I want to defer to your call before refactoring. Two options:
- Keep this PR scoped — extend
is_dir_includedwith the ancestor-globset approach above, add tests fordir?/...and a few more glob-in-dir-segment cases, leave the public API unchanged. Lower-risk increment. - Close this PR and design
sub_matchers— agree the negate-exclude shape is fighting the directory-pruning step, and the structured-sub-matcher API you sketched is the cleaner abstraction long-term. I'd open a design issue first, since the API shape is a larger conversation.
If (1) is acceptable I can push the ancestor-globset extension in a follow-up commit on this branch. If you'd rather go with (2), happy to close this PR after we have a sub_matchers issue to point users at. Either way, thanks for the careful review.
…traversal Addresses two review comments from @georgeh0: 1. Rename `negation_glob_set` → `negation_excluded_glob_set` for clarity. 2. Fix `is_dir_included` for exact-path negations like `!dir1/dir2/dir3/a.yml` combined with a broad exclusion like `dir1/**`. Previously the one-level GlobSet probe was sufficient only for wildcard negations (e.g. `!**/.github/**`); an exact-path negation's ancestor directories were pruned before the exempt file could ever be reached. Fix: retain raw negation strings in `negation_patterns_raw` and add a raw-prefix check in `is_dir_included` — if any negation pattern starts with `<dir>/` the directory must be traversed. The existing GlobSet probe is kept as-is for wildcard-based negations. New test: `test_negation_exact_path_deep_dir_traversal` covers the case raised in the review (exclude `dir1/**`, un-exclude `dir1/dir2/dir3/a.yml`). Signed-off-by: Lei Yu <itleiyu@gmail.com>
Summary
!negation support toexcluded_patternsinPatternMatcher(Rust) andPatternFilePathMatcher(Python).!un-excludes paths that would otherwise be excluded by preceding patterns.is_dir_included()probes one level inside an excluded dir (<dir>/__probe__) to detect whether a negation pattern could apply to its contents, preventing premature pruning.Motivation
Closes #1778.
Users maintaining large codebases hit a real pain point: they can't write
without also losing
.github/workflows/*.yml(CI config),.github/dependabot.yml, etc. The workaround is to enumerate every dot-directory to exclude explicitly — currently ~30 patterns and growing.With this change the same intent is expressed cleanly:
Implementation notes
split_excluded_patterns()splits the raw list into twoGlobSets:excluded_glob_set!prefixnegation_glob_set!prefix (stored without the!)is_excluded(path)returnstrueonly when the path matches the exclusion set and does not match the negation set.is_dir_included(path)checks:<dir>/__probe__match a negation pattern? (catches**/.github/**-style patterns) If yes → include.Test plan
cargo test --package cocoindex_ops_text— all 33 tests pass (5 new)uv run ruff format --check python/— no new violationsuv run ruff check python/cocoindex/resources/file.py— pre-existing F401 only (upstream issue, unrelated to this PR)