Skip to content

feat(pattern-matcher): support !-prefixed negation in excluded_patterns#1901

Open
SuperMarioYL wants to merge 2 commits into
cocoindex-io:mainfrom
SuperMarioYL:feat/pattern-negation-excluded
Open

feat(pattern-matcher): support !-prefixed negation in excluded_patterns#1901
SuperMarioYL wants to merge 2 commits into
cocoindex-io:mainfrom
SuperMarioYL:feat/pattern-negation-excluded

Conversation

@SuperMarioYL
Copy link
Copy Markdown

Summary

  • Adds gitignore-style ! negation support to excluded_patterns in PatternMatcher (Rust) and PatternFilePathMatcher (Python).
  • A pattern beginning with ! un-excludes paths that would otherwise be excluded by preceding patterns.
  • Directory traversal is handled correctly: is_dir_included() probes one level inside an excluded dir (<dir>/__probe__) to detect whether a negation pattern could apply to its contents, preventing premature pruning.
  • 5 new Rust unit tests covering the key scenarios; all 33 tests pass.

Motivation

Closes #1778.

Users maintaining large codebases hit a real pain point: they can't write

excluded_patterns=["**/.* "]   # exclude all dot-directories

without also losing .github/workflows/*.yml (CI config), .github/dependabot.yml, etc. The workaround is to enumerate every dot-directory to exclude explicitly — currently ~30 patterns and growing.

With this change the same intent is expressed cleanly:

PatternFilePathMatcher(
    excluded_patterns=[
        "**/.*",           # exclude all dot-entries
        "!**/.github/**",  # …except .github
    ]
)

Implementation notes

split_excluded_patterns() splits the raw list into two GlobSets:

Set Contents
excluded_glob_set patterns without ! prefix
negation_glob_set patterns with ! prefix (stored without the !)

is_excluded(path) returns true only when the path matches the exclusion set and does not match the negation set.

is_dir_included(path) checks:

  1. Is the directory matched by an exclusion pattern? If no → include.
  2. Is the directory itself matched by a negation pattern? If yes → include.
  3. Does a probe path <dir>/__probe__ match a negation pattern? (catches **/.github/**-style patterns) If yes → include.
  4. Otherwise → prune.

Test plan

  • cargo test --package cocoindex_ops_text — all 33 tests pass (5 new)
  • uv run ruff format --check python/ — no new violations
  • uv run ruff check python/cocoindex/resources/file.py — pre-existing F401 only (upstream issue, unrelated to this PR)

Enables gitignore-style exceptions within excluded_patterns so users
can exclude broad categories while allowing specific paths through.

A pattern beginning with ! un-excludes paths that would otherwise be
excluded by preceding patterns. For example:

    PatternFilePathMatcher(
        excluded_patterns=[
            "**/.*",           # exclude all dot-entries
            "!**/.github/**",  # but keep .github through
        ]
    )

Directory traversal is handled correctly: is_dir_included() uses a
probe path (<dir>/__probe__) to detect whether any negation pattern
could match files inside an otherwise-excluded directory, preventing
premature pruning.

Fixes cocoindex-io#1778.
@badmonster0
Copy link
Copy Markdown
Member

thanks a lot ! @georgeh0 can help take a look!

@badmonster0 badmonster0 requested a review from georgeh0 April 27, 2026 19:27
Copy link
Copy Markdown
Member

@georgeh0 georgeh0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for creating the PR!

Comment thread rust/ops_text/src/pattern_matcher.rs Outdated
/// Negation patterns (``!``-prefixed in the original list, stored without the ``!``).
/// A path that matches one of these is *not* excluded even if it matches the regular
/// exclusion patterns.
negation_glob_set: Option<GlobSet>,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider call it negation_excluded_glob_set to be more clear.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — renamed to negation_excluded_glob_set throughout (field, variable, and doc-comments).

Comment thread rust/ops_text/src/pattern_matcher.rs Outdated
Comment on lines +123 to +127
// Probe one level inside: catches patterns like `**/.github/**`.
let probe = format!("{}/__probe__", path);
if neg_gs.is_match(probe.as_str()) {
return true;
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this work as expected?

e.g. if users put the following excluded_patterns:

  • dir1/**
  • !dir1/dir2/dir3/a.yml

It won't work as expected.

I think when cases become more complex, we may consider using some more structured approach. e.g. we may consider allowing users to provide "sub matchers" for specific folders like:

sub_matchers: {
  'dir1': {
    exclude_patterns: ['**'],
    sub_matchers: {
      'dir2/dir3': {
        inclue_patterns: ['a.yml'],
      },
    },
  },
}

(throwing some rough thoughts, don't have a very concrete spec yet)

Besides, note that another feasible fallback option is post-filtering during iteration (and it's easier to do that in v1). Just wonder is it not sufficient for your case?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you are right that the original probe-only approach fails for this case.

With patterns ["dir1/**", "!dir1/dir2/dir3/a.yml"], the GlobSet probe (dir1/__probe__) does not match the exact-path negation pattern, so is_dir_included("dir1") used to return false, pruning the whole subtree before the exempt file could ever be reached.

I fixed this by retaining the raw negation strings in a negation_patterns_raw: Vec<String> field and adding a raw-prefix check inside is_dir_included: if any negation string starts with <dir>/, the directory is allowed through. The existing GlobSet probe is kept as-is — it still handles wildcard negations such as !**/.github/** where the pattern contains ** and the probe already works.

Added a dedicated test (test_negation_exact_path_deep_dir_traversal) that covers exactly the case you raised, and confirmed all 34 tests pass.

Regarding the structured sub_matchers idea for more complex cases — that is a clean design and would definitely give users more expressive power for nested policies. Happy to explore that in a follow-up if the team is interested; this PR just lands the flat-list negation semantics as a lower-friction first step.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating on this!

The original case works now, but this still doesn't work as expected:

  • dir1/**
  • !dir?/dir2/dir3/a.yml

I'm worrying there can be some fundamental difficulty in supporting these negate-exclude every well under this API.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @georgeh0 — you're right, the dir?/dir2/dir3/a.yml case is broken by both of the current checks:

  • The GlobSet probe (dir1/__probe__) fails because the negation pattern expects 4 segments ending in a.yml, and the 2-segment probe path can't match.
  • The raw-prefix check (negation.starts_with("dir1/")) fails because dir? is a literal ? in the string and doesn't equal dir1.

I think this is fixable inside the current API without sub_matchers, by replacing the probe-and-prefix pair with a single ancestor-globset built from each negation pattern's path-prefix segments. For !dir?/dir2/dir3/a.yml the ancestors are:

dir?
dir?/dir2
dir?/dir2/dir3

is_dir_included then checks the candidate directory against the ancestor GlobSet; if any prefix matches, the directory is allowed through. The same construction subsumes the wildcard case — !**/.github/** produces ancestors ** and **/.github, which match any depth.

That said, I want to defer to your call before refactoring. Two options:

  1. Keep this PR scoped — extend is_dir_included with the ancestor-globset approach above, add tests for dir?/... and a few more glob-in-dir-segment cases, leave the public API unchanged. Lower-risk increment.
  2. Close this PR and design sub_matchers — agree the negate-exclude shape is fighting the directory-pruning step, and the structured-sub-matcher API you sketched is the cleaner abstraction long-term. I'd open a design issue first, since the API shape is a larger conversation.

If (1) is acceptable I can push the ancestor-globset extension in a follow-up commit on this branch. If you'd rather go with (2), happy to close this PR after we have a sub_matchers issue to point users at. Either way, thanks for the careful review.

…traversal

Addresses two review comments from @georgeh0:

1. Rename `negation_glob_set` → `negation_excluded_glob_set` for clarity.

2. Fix `is_dir_included` for exact-path negations like `!dir1/dir2/dir3/a.yml`
   combined with a broad exclusion like `dir1/**`. Previously the one-level
   GlobSet probe was sufficient only for wildcard negations (e.g. `!**/.github/**`);
   an exact-path negation's ancestor directories were pruned before the exempt file
   could ever be reached.

   Fix: retain raw negation strings in `negation_patterns_raw` and add a
   raw-prefix check in `is_dir_included` — if any negation pattern starts with
   `<dir>/` the directory must be traversed.  The existing GlobSet probe is kept
   as-is for wildcard-based negations.

   New test: `test_negation_exact_path_deep_dir_traversal` covers the case raised
   in the review (exclude `dir1/**`, un-exclude `dir1/dir2/dir3/a.yml`).

Signed-off-by: Lei Yu <itleiyu@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support negation in excluded_patterns to allow exceptions

3 participants