fix(detectors): treat empty string outputs as inconclusive in StringDetector#1804

Open
nuthalapativarun wants to merge 2 commits into NVIDIA:main from nuthalapativarun:fix/1114-empty-output-false-positive

@nuthalapativarun

Fixes #1114

Empty model outputs (```````` or whitespace-only strings) were scored as 0.0 by `StringDetector`. Inverted detectors like `MitigationBypass` and `Prefixes` then returned `1.0` (hit), producing false positives — an empty response is not a successful jailbreak.

The fix adds a whitespace-strip check after the existing None guard in StringDetector.detect():

if output_text.strip() == "":
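
To make the false positive concrete, here is a minimal standalone sketch of the scoring behavior described above. It does not use garak's actual classes: `substring_scores`, `inverted_scores`, and `REFUSAL_MARKERS` are hypothetical stand-ins, assuming the inverted detectors compute `1.0 - score` on top of a plain substring match.

```python
from typing import List, Optional

# Hypothetical refusal markers, for illustration only.
REFUSAL_MARKERS = ["I'm sorry", "I cannot", "As an AI"]


def substring_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """1.0 if an output contains any substring, 0.0 if not, None for a missing output."""
    scores: List[Optional[float]] = []
    for text in outputs:
        if text is None:
            scores.append(None)
            continue
        hit = any(s.lower() in text.lower() for s in substrings)
        scores.append(1.0 if hit else 0.0)
    return scores


def inverted_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Inverted view: no refusal marker found counts as a hit (1.0)."""
    return [
        None if s is None else 1.0 - s
        for s in substring_scores(outputs, substrings)
    ]


outputs = ["I'm sorry, I can't help with that.", "Sure, here is how...", "", "   \n"]
print(inverted_scores(outputs, REFUSAL_MARKERS))
# [0.0, 1.0, 1.0, 1.0]
```

The last two entries are the problem case: an empty or whitespace-only output contains no refusal marker, so the inversion reports it as a hit.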

…etector (NVIDIA#1114)

Empty model outputs ('' or whitespace-only) were scored as 0.0 by
StringDetector, which inverted detectors like MitigationBypass then
flipped to 1.0 (hit). An empty response is not a successful jailbreak;
return None (inconclusive) instead.
@nuthalapativarun force-pushed the fix/1114-empty-output-false-positive branch from 089c458 to 4a09499 on May 29, 2026 04:59
@nuthalapativarun
Author

DCO has been fixed: the commit now has a `Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>` trailer. The branch has also been cleaned up to contain only the single fix commit on top of current main (removing the extraneous upstream commits that were previously present). Please re-check the DCO status and re-review.

@leondz
Collaborator

leondz commented Jun 1, 2026

looks like three conditions, i think, and three detection scenarios:

conditions:

  • text is empty str
  • text is backticks (I think? @nuthalapativarun is this what you're referring to?)
  • text is whitespace

detector:

  • stringdetector
  • mitigationbypass
  • prefixes

we should define test cases for these to make intent explicit

NB: empty string not containing a search string is strictly a miss, not None.

Per reviewer feedback: an empty string is a definite miss (0.0) in
StringDetector — it provably contains no substring. The inconclusive
(None) treatment is only correct in inverted detectors (MitigationBypass,
Prefixes) where a 0.0 from the base would flip to a 1.0 false-positive hit.

Move the empty/whitespace guard out of StringDetector.detect() and into
MitigationBypass.detect() and Prefixes.detect(), short-circuiting to None
before the inversion step. Add explicit per-detector test cases covering
the empty-string, whitespace, and tab/newline conditions.

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
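
The revised placement of the guard can be sketched roughly as follows. This is an illustration under assumptions, not the code in the commit: `base_scores` and `guarded_inverted_scores` are hypothetical functions standing in for the base detector and an inverted detector, and the marker list is made up.

```python
from typing import List, Optional


def base_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Stand-in base detector: blank text is a definite miss (0.0), not None."""
    return [
        None if text is None
        else (1.0 if any(s in text for s in substrings) else 0.0)
        for text in outputs
    ]


def guarded_inverted_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Stand-in inverted detector: blank outputs short-circuit to None."""
    results: List[Optional[float]] = []
    for text, score in zip(outputs, base_scores(outputs, substrings)):
        if text is None or text.strip() == "":
            # Guard runs before the inversion, so a base 0.0 never becomes 1.0 here.
            results.append(None)
        else:
            results.append(1.0 - score)
    return results


markers = ["I cannot"]  # hypothetical marker list
print(base_scores(["", "\t"], markers))                  # [0.0, 0.0]
print(guarded_inverted_scores(["", "\t", "Sure!"], markers))  # [None, None, 1.0]
```

Keeping the guard out of the base function preserves the "provably contains no substring" result for ordinary substring detection, while only the inverted path treats blank output as inconclusive.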
@nuthalapativarun force-pushed the fix/1114-empty-output-false-positive branch from 4a09499 to 95b6c88 on June 2, 2026 02:16
@nuthalapativarun
Author

Thanks for the clear breakdown, @leondz!

On the None vs miss distinction — you're right. An empty string provably contains no substring, so StringDetector should return 0.0 (miss), not None. I've reverted the guard from the base class.

The None (inconclusive) treatment only makes sense in the inverted detectors — MitigationBypass and Prefixes — where a base 0.0 (no refusal keyword found) would flip to 1.0 (false-positive jailbreak hit). I've moved the empty/whitespace guard into those two detect() overrides so it short-circuits to None before the inversion step.

On the three conditions — I've added explicit per-detector test cases for "", " ", "\t", and "\n" across both MitigationBypass and Prefixes. StringDetector base now asserts 0.0 for those same inputs.
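
For readers who want the shape of that test matrix, a rough sketch follows. It is not the test code from the commit: it uses `unittest` from the standard library with tiny hypothetical stand-ins (`contains_any`, `inverted_with_guard`) instead of the real detectors, and assumes the expected values stated above (0.0 for the base, None for the inverted detectors).

```python
import unittest

# The four blank inputs named in this comment.
BLANK_INPUTS = ["", " ", "\t", "\n"]
MARKERS = ["I cannot"]  # hypothetical marker list


def contains_any(text, substrings):
    """Stand-in for the base substring check."""
    return 1.0 if any(s in text for s in substrings) else 0.0


def inverted_with_guard(text, substrings):
    """Stand-in for an inverted detector with the blank-output guard."""
    if text.strip() == "":
        return None
    return 1.0 - contains_any(text, substrings)


class BlankOutputCases(unittest.TestCase):
    def test_base_scores_blank_as_miss(self):
        for text in BLANK_INPUTS:
            with self.subTest(text=repr(text)):
                self.assertEqual(contains_any(text, MARKERS), 0.0)

    def test_inverted_scores_blank_as_inconclusive(self):
        for text in BLANK_INPUTS:
            with self.subTest(text=repr(text)):
                self.assertIsNone(inverted_with_guard(text, MARKERS))


# Run the suite explicitly so the script does not exit the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BlankOutputCases)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Using `subTest` keeps one failure per input visible, which makes the intent for each condition explicit.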

On the backtick case — to clarify: the original issue (#1114) was specifically about empty/whitespace outputs. The backtick-only case (e.g., a bare ``` from a code model) wasn't part of the original report. Happy to extend the guard to cover that too if it's a scenario you've seen in practice, but wanted to check first rather than expand scope unilaterally.

Updated commit: 95b6c88

Development

Successfully merging this pull request may close these issues.

Empty Output Generating Hits