fix(detectors): treat empty string outputs as inconclusive in StringDetector #1804
nuthalapativarun wants to merge 2 commits into
…etector (NVIDIA#1114)

Empty model outputs ('' or whitespace-only) were scored as 0.0 by StringDetector, which inverted detectors like MitigationBypass then flipped to 1.0 (hit). An empty response is not a successful jailbreak; return None (inconclusive) instead.
089c458 to
4a09499
DCO has been fixed: the commit now has a `Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>` trailer. The branch has also been cleaned up to contain only the single fix commit on top of current main (removing the extraneous upstream commits that were previously present). Please re-check the DCO status and re-review.
looks like three conditions, i think, and three detection scenarios:

conditions:

detector:

we should define test cases for these to make intent explicit

NB: empty string not containing a search string is strictly a miss, not
Per reviewer feedback: an empty string is a definite miss (0.0) in StringDetector, since it provably contains no substring. The inconclusive (None) treatment is only correct in inverted detectors (MitigationBypass, Prefixes), where a 0.0 from the base would flip to a 1.0 false-positive hit.

Move the empty/whitespace guard out of StringDetector.detect() and into MitigationBypass.detect() and Prefixes.detect(), short-circuiting to None before the inversion step.

Add explicit per-detector test cases covering the empty-string, whitespace, and tab/newline conditions.

Signed-off-by: Varun Nuthalapati <nuthalapativarun@gmail.com>
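The behaviour described in this commit message can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: `string_scores` and `inverted_scores` are hypothetical stand-ins for the base detector and an inverted detector, and the list-of-strings interface is an assumption.

```python
from typing import List, Optional


def string_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Hypothetical base detector: 1.0 if any substring occurs, else 0.0."""
    scores: List[Optional[float]] = []
    for output in outputs:
        if output is None:
            scores.append(None)  # no output at all: nothing to score
            continue
        hit = any(s.lower() in output.lower() for s in substrings)
        # An empty string contains no substring, so it is a plain miss here.
        scores.append(1.0 if hit else 0.0)
    return scores


def inverted_scores(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Hypothetical inverted detector: a missing mitigation phrase is a hit."""
    base = string_scores(outputs, substrings)
    results: List[Optional[float]] = []
    for output, score in zip(outputs, base):
        if score is None or output.strip() == "":
            # Guard before inversion: empty or whitespace-only output is
            # inconclusive, not a successful bypass.
            results.append(None)
        else:
            results.append(1.0 - score)  # inversion step
    return results
```

With this split, the base function still reports `0.0` for `""`, while the inverted function reports `None` for `""`, `"   "`, and `"\t\n"` instead of flipping the miss into a `1.0` hit.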
4a09499 to
95b6c88
Thanks for the clear breakdown, @leondz!

On the The

On the three conditions: I've added explicit per-detector test cases for

On the backtick case, to clarify: the original issue (#1114) was specifically about empty/whitespace outputs. The backtick-only case (e.g., a bare

Updated commit: 95b6c88
Fixes #1114
Empty model outputs (`""` or whitespace-only strings) were scored as `0.0` by `StringDetector`. Inverted detectors like `MitigationBypass` and `Prefixes` then returned `1.0` (hit), producing false positives: an empty response is not a successful jailbreak.

The fix adds a whitespace-strip check after the existing `None` guard in `StringDetector.detect()`:
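A minimal sketch of what such a guard might look like, assuming a simplified detector that takes a list of output strings and a list of substrings. The function name `detect_with_strip_guard` and its signature are hypothetical and do not reproduce the project's real `detect()` method.

```python
from typing import List, Optional


def detect_with_strip_guard(
    outputs: List[Optional[str]], substrings: List[str]
) -> List[Optional[float]]:
    """Hypothetical substring detector with an empty-output guard."""
    scores: List[Optional[float]] = []
    for output in outputs:
        if output is None:
            scores.append(None)  # existing guard: no output to score
            continue
        if output.strip() == "":
            scores.append(None)  # added guard: empty or whitespace-only output
            continue
        hit = any(s in output for s in substrings)
        scores.append(1.0 if hit else 0.0)
    return scores
```

Placing the guard in the base detector makes every subclass treat empty output as inconclusive. The later commit message in this thread moves the guard into the inverted detectors instead, so that the base detector keeps reporting an empty string as a plain miss.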