
Add filter to suppress Windows system noise & prevent binary garbage chunks in parser output #91

Open
Vasco0x4 wants to merge 4 commits into blacklanternsecurity:master from Vasco0x4:master


@Vasco0x4

No description provided.

claude and others added 4 commits February 24, 2026 14:25
- is_text_file() now rejects files where >1% of decoded chars are
  Unicode replacement chars (U+FFFD), stopping charset-normalizer
  false positives on PE/DLL/binary files
- extract_text() now checks replacement char ratio after ANY extraction
  path (charset-normalizer or kreuzberg) and falls back to
  extract_strings_from_binary() when ratio exceeds 1%
- Removed grep -a flag to stop binary stdin being treated as text,
  which was causing massive single-line binary dumps even with -m 5

Fixes: large chunks of \xef\xbf\xbd garbage being logged as matches
when binary files were misidentified as text or extracted with corrupt
encoding.
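
As a rough illustration of the check this commit describes, the replacement-character ratio test could be sketched as follows. This is not the PR's actual code: the helper names `replacement_ratio` and `looks_like_text` and the plain UTF-8 decode are assumptions made for the example; only the 1% threshold and the U+FFFD criterion come from the commit message.

```python
# Hypothetical sketch of the U+FFFD ratio check; names are illustrative,
# not taken from the PR.
REPLACEMENT_CHAR = "\ufffd"
MAX_REPLACEMENT_RATIO = 0.01  # the 1% threshold stated in the commit message


def replacement_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are U+FFFD."""
    if not text:
        return 0.0
    return text.count(REPLACEMENT_CHAR) / len(text)


def looks_like_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Decode leniently, then reject if more than 1% of chars are U+FFFD."""
    decoded = data.decode(encoding, errors="replace")
    return replacement_ratio(decoded) <= MAX_REPLACEMENT_RATIO
```

With this sketch, ordinary ASCII or UTF-8 content passes, while a PE-style byte sequence full of invalid UTF-8 bytes decodes to mostly U+FFFD and is rejected.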

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
…stem noise

Adds two preset modes that auto-populate exclude_dirnames and
exclude_extensions with well-known Windows system paths/extensions
that clutter results without containing useful data:

moderate: PolicyDefinitions (ADMX/ADML), WinSxS, Servicing
aggressive: also System32, SysWOW64, Assembly, Fonts, Spool, Defender

Both modes also suppress: .adml .admx .mui .mof .cat .manifest

The presets feed directly into the existing dir/extension blacklist
infrastructure, so they compose cleanly with --exclude-dirnames
and --exclude-extensions.
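
One way the preset composition described above might look is sketched below. The directory and extension names are copied from this commit message; the function name `apply_noise_filter`, the lowercase normalization, and the set-union merge are assumptions for illustration and may not match the PR's implementation.

```python
# Hypothetical sketch: presets expand into the same sets that the
# user-supplied exclude lists populate. Names are illustrative.
MODERATE_DIRS = {"PolicyDefinitions", "WinSxS", "Servicing"}
AGGRESSIVE_DIRS = MODERATE_DIRS | {
    "System32", "SysWOW64", "Assembly", "Fonts", "Spool", "Defender",
}
NOISE_DIRS = {"moderate": MODERATE_DIRS, "aggressive": AGGRESSIVE_DIRS}
NOISE_EXTENSIONS = {".adml", ".admx", ".mui", ".mof", ".cat", ".manifest"}


def apply_noise_filter(mode, exclude_dirnames=(), exclude_extensions=()):
    """Merge a preset into the user's exclusions; returns two lowercase sets."""
    dirs = {d.lower() for d in exclude_dirnames}
    exts = {e.lower() for e in exclude_extensions}
    if mode is not None:
        if mode not in NOISE_DIRS:
            raise ValueError(f"unknown noise filter mode: {mode!r}")
        dirs |= {d.lower() for d in NOISE_DIRS[mode]}
        exts |= NOISE_EXTENSIONS
    return dirs, exts
```

Because the preset is merged by set union, anything the user passes explicitly is kept, and passing no mode leaves the user's lists unchanged.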

https://claude.ai/code/session_01HhXFjA6jdctfoi1MTfG9jY
…FHCR

Claude/review project structure dfhcr
@Vasco0x4
Author

Hey,

Fix: binary garbage in parser output
Files that were misidentified as text (PE/DLL/binaries) were producing massive chunks of \xef\xbf\xbd replacement chars in match results. Parser now checks the replacement char ratio after every extraction path and falls back to string extraction when it exceeds 1%.
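
For readers unfamiliar with the fallback, here is a minimal sketch of the idea, assuming a `strings`-style scan for printable ASCII runs. The function names, the minimum run length of 4, and the UTF-8 decode are hypothetical; the PR's own `extract_strings_from_binary()` may behave differently.

```python
import re

# Hypothetical sketch of "check ratio, then fall back to string extraction".


def extract_strings_sketch(data: bytes, min_len: int = 4) -> str:
    """Return runs of printable ASCII (at least min_len bytes), one per line."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return "\n".join(m.decode("ascii") for m in pattern.findall(data))


def extract_text_with_fallback(data: bytes) -> str:
    """Decode as text; if >1% of chars are U+FFFD, use string extraction."""
    text = data.decode("utf-8", errors="replace")
    if text and text.count("\ufffd") / len(text) > 0.01:
        return extract_strings_sketch(data)
    return text
```

The effect is that a binary blob yields only its readable fragments, so a match is logged as a short string rather than a long run of `\xef\xbf\xbd` bytes.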

Feat: --noise-filter for Windows system noise
Adds moderate and aggressive presets to automatically exclude well-known Windows system paths and extensions (ADMX, MUI, manifests, etc.) that constantly clutter results without containing anything useful.

Both tested on infra. Let me know if you have questions or if you'd prefer these split into two separate PRs.

