Fix handling of word boundaries in Filter cog #6725
Jackenmen
left a comment
Using negative lookahead/lookbehind with \w for word boundaries seems like an improvement over \b since, unlike \b, it always checks that the adjacent character (preceding or following) is not a word character, instead of determining the behaviour based on the first/last character of the word.
For example, the current implementation (as seen in stable releases) for a word such as <h> has:
- false positives: "x<h>y"
- false negatives: " <h> "
This PR handles both of these correctly.
With that said, this case is also handled fine by a simpler:
rf"(?<!\w){re.escape(w)}(?!\w)"

I don't think we should really be deciding whether a word boundary should be checked based on whether the first/last character is a word character. Do you have some concrete cases where this would actually be an improvement rather than just making things possibly more confusing?
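To make the comparison concrete, here is a minimal sketch that runs both approaches against the two inputs above. The function names `matches_with_b` and `matches_with_lookaround` are hypothetical and only for illustration; they are not the cog's actual code, and the `\b` variant is assumed to mirror the stable behaviour described in this comment.

```python
import re


def matches_with_b(word: str, text: str) -> bool:
    # Assumed shape of the stable behaviour: plain \b on both sides.
    # \b only matches between a word and a non-word character, so the result
    # depends on whether the filtered word itself starts/ends with a word character.
    return re.search(rf"\b{re.escape(word)}\b", text) is not None


def matches_with_lookaround(word: str, text: str) -> bool:
    # The simpler pattern suggested above: only require that the characters
    # adjacent to the match are not word characters.
    return re.search(rf"(?<!\w){re.escape(word)}(?!\w)", text) is not None


if __name__ == "__main__":
    for text in ("x<h>y", " <h> "):
        print(
            repr(text),
            "\\b:", matches_with_b("<h>", text),
            "lookaround:", matches_with_lookaround("<h>", text),
        )
```

With the word <h>, the `\b` variant matches "x<h>y" but not " <h> ", while the lookaround variant does the opposite. For a plain word such as "cat" both variants agree.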
Applied change!
Description of the changes
Fixes #3682
Refactors word list pattern generation to use a dedicated _build_word_pattern helper, for more flexible and accurate word boundary handling.
Have the changes in this PR been tested?
Yes