add filter to suppress Windows system noise & prevent binary garbage chunks in parser output #91
Open
Vasco0x4 wants to merge 4 commits into blacklanternsecurity:master
- is_text_file() now rejects files where more than 1% of decoded characters are Unicode replacement characters (U+FFFD), stopping charset-normalizer false positives on PE/DLL/binary files
- extract_text() now checks the replacement character ratio after any extraction path (charset-normalizer or kreuzberg) and falls back to extract_strings_from_binary() when the ratio exceeds 1%
- Removed the grep -a flag to stop binary stdin being treated as text, which was causing massive single-line binary dumps even with -m 5

Fixes: large chunks of \xef\xbf\xbd garbage being logged as matches when binary files were misidentified as text or extracted with corrupt encoding.

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
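The replacement character check described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `replacement_ratio` and `looks_like_text` and the constant names are hypothetical, and the sketch decodes as UTF-8 directly rather than going through charset-normalizer or kreuzberg.

```python
# Hypothetical sketch of a U+FFFD ratio check; names are illustrative only.
REPLACEMENT_CHAR = "\ufffd"
MAX_REPLACEMENT_RATIO = 0.01  # the 1% threshold mentioned in the commit message


def replacement_ratio(text: str) -> float:
    """Return the fraction of characters in text that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Decode leniently, then reject content dominated by replacement chars."""
    # errors="replace" turns every undecodable byte sequence into U+FFFD,
    # so binary input produces a high ratio instead of raising an exception.
    decoded = data.decode(encoding, errors="replace")
    return replacement_ratio(decoded) <= MAX_REPLACEMENT_RATIO
```

A caller could run the same ratio check on the output of any extractor and fall back to a strings-style extraction when it fails, which is the behavior the commit describes for extract_text().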
…stem noise

Adds two preset modes that auto-populate exclude_dirnames and exclude_extensions with well-known Windows system paths and extensions that clutter results without containing useful data:

- moderate: PolicyDefinitions (ADMX/ADML), WinSxS, Servicing
- aggressive: also System32, SysWOW64, Assembly, Fonts, Spool, Defender

Both modes also suppress: .adml .admx .mui .mof .cat .manifest

The presets feed directly into the existing dir/extension blacklist infrastructure, so they compose cleanly with --exclude-dirnames and --exclude-extensions.

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
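One way the presets might compose with user-supplied exclusions is sketched below. The directory and extension lists come from the commit message; the function name `apply_preset`, the constant names, and the case-insensitive matching are assumptions made for illustration and may differ from the actual change.

```python
# Hypothetical sketch of preset expansion; not the PR's implementation.
NOISE_EXTENSIONS = {".adml", ".admx", ".mui", ".mof", ".cat", ".manifest"}

MODERATE_DIRS = {"policydefinitions", "winsxs", "servicing"}
AGGRESSIVE_DIRS = MODERATE_DIRS | {
    "system32", "syswow64", "assembly", "fonts", "spool", "defender",
}

PRESETS = {"moderate": MODERATE_DIRS, "aggressive": AGGRESSIVE_DIRS}


def apply_preset(mode, exclude_dirnames=(), exclude_extensions=()):
    """Merge a preset into the user's own exclusion lists.

    Returns (dirnames, extensions) as lowercase sets. Lowercasing is an
    assumption of this sketch, chosen because Windows paths are
    case-insensitive.
    """
    if mode not in PRESETS:
        raise ValueError(f"unknown preset: {mode!r}")
    dirnames = {d.lower() for d in exclude_dirnames} | PRESETS[mode]
    extensions = {e.lower() for e in exclude_extensions} | NOISE_EXTENSIONS
    return dirnames, extensions


# Example: a user-supplied directory and extension survive alongside the preset.
dirs, exts = apply_preset("moderate", ["Temp"], [".log"])
```

Because the preset only adds entries to the same sets the existing blacklist options populate, user-supplied exclusions and preset exclusions end up in one place and are filtered the same way.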
…FHCR Claude/review project structure dfhcr
Author
Hey,

Fix: binary garbage in parser output
Feat:

Both tested on infra. Let me know if you have questions or if you'd prefer these split into two separate PRs.