Add cross-model results table to BENCHMARKS.md

sandeepl337 · sandeepl337 · commit ab172dae4fa7 · 2026-05-31T08:09:59.000-05:00
Same eval slice, same scoring code, 5 OSS baselines now in the table:
ProtectAI v2, deepset, fmops, Meta Prompt-Guard, Meta Prompt-Guard-2.

bench_oss.py now handles multi-class models: Prompt-Guard is scored
as P(INJECTION) + P(JAILBREAK), Prompt-Guard-2 as P(LABEL_1).
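
For reviewers, a minimal sketch of the scoring rule described above and of
how recall and FPR fall out of it at one threshold. It is not the code in
bench_oss.py: the helper names `attack_score` and `recall_and_fpr`, the
`>=` comparison against the threshold, and the sample scores are all
assumptions made for illustration.

```python
def attack_score(label_scores, attack_labels):
    """Sum the probabilities of every label that counts as an attack."""
    wanted = {label.upper() for label in attack_labels}
    return sum(o["score"] for o in label_scores if o["label"].upper() in wanted)


def recall_and_fpr(scores, is_attack, threshold):
    """Recall over attacks and false-positive rate over benigns."""
    # Assumption: a score at or above the threshold counts as "flagged".
    flagged = [s >= threshold for s in scores]
    attacks = sum(is_attack)
    benigns = len(is_attack) - attacks
    true_pos = sum(f and a for f, a in zip(flagged, is_attack))
    false_pos = sum(f and not a for f, a in zip(flagged, is_attack))
    return true_pos / attacks, false_pos / benigns


# Made-up per-label output for one input from a 3-class model.
example = [
    {"label": "BENIGN", "score": 0.25},
    {"label": "INJECTION", "score": 0.5},
    {"label": "JAILBREAK", "score": 0.25},
]
print(attack_score(example, ("INJECTION", "JAILBREAK")))

# Made-up scores: two attacks followed by two benigns.
print(recall_and_fpr([0.75, 0.25, 0.5, 0.125], [True, True, False, False], 0.5))
```

A single-label model fits the same helper by passing a one-element tuple,
for example `("LABEL_1",)`.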
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
@@ -46,11 +46,48 @@ Reproduce in two commands:
 
 ```bash
 node scripts/bench.mjs                  # promptpurify on the eval slice
-python3 scripts/bench_oss.py            # ProtectAI / deepset / fmops on the same slice
+python3 scripts/bench_oss.py            # OSS baselines on the same slice
 ```
 
 Full recipe: [REPRODUCE.md](REPRODUCE.md).
 
+## Results
+
+Same eval slice (`training/FROZEN_EVAL_SCORED.jsonl`, 791 attacks /
+132 benigns), same scoring code (`scripts/bench_oss.py`), each model
+at its published default threshold and at a cross-model neutral
+`0.5`.
+
+| Model | recall@default | FPR@default | recall@0.5 | FPR@0.5 |
+|---|---:|---:|---:|---:|
+| **promptpurify** | **83.94%** | **10.61%** | **87.10%** | **12.88%** |
+| ProtectAI v2 | 40.71% | 43.18% | 40.71% | 43.18% |
+| deepset | 97.22% | 59.85% | 97.22% | 59.85% |
+| fmops | 100.00% | 100.00% | 100.00% | 100.00% |
+| Meta Prompt-Guard | 67.00% | 88.64% | 67.00% | 88.64% |
+| Meta Prompt-Guard-2 | 12.77% | 1.52% | 12.77% | 1.52% |
+
+How to read this:
+
+- `promptpurify` ships at `0.95` and every other model at `0.5`, so
+  only its row differs between the default and `0.5` columns.
+- Its FPR is lower than every model's except Prompt-Guard-2, which gets
+  there by recalling only 12.77% of attacks (≈1 in 8).
+- Its recall beats ProtectAI v2, Prompt-Guard, and Prompt-Guard-2 on
+  this slice. `deepset` recalls more but blocks 60% of benigns, ~5.6x
+  the FPR at default thresholds; for most traffic that is worse.
+- `fmops` predicts the positive class for every input on this slice.
+  Treat the row as evidence the model is mis-calibrated for this
+  distribution, not as a real recall claim.
+- `Meta Prompt-Guard` is a 3-class model; we score it as
+  `P(INJECTION) + P(JAILBREAK)` (see `scripts/bench_oss.py`).
+
+The slice is deliberately hard — curated borderline cases, not a
+naturally-distributed sample. Numbers should be read as "relative
+behavior at the decision boundary", not as production recall on your
+traffic. Pick a threshold against your own data ([Operating
+points](#operating-points)).
+
 ## Operating points
 
 The right threshold depends on **your** traffic mix, not ours.
diff --git a/scripts/bench_oss.py b/scripts/bench_oss.py
@@ -52,6 +52,7 @@ class ModelSpec:
     hf_id: str
     injection_label: str
     default_threshold: float
+    sum_attack_labels: tuple[str, ...] = ()
     notes: str = ""
 
 
@@ -74,6 +75,21 @@ class ModelSpec:
         injection_label="INJECTION",
         default_threshold=0.5,
     ),
+    ModelSpec(
+        name="Meta Prompt-Guard",
+        hf_id="meta-llama/Prompt-Guard-86M",
+        injection_label="INJECTION",
+        default_threshold=0.5,
+        sum_attack_labels=("INJECTION", "JAILBREAK"),
+        notes="3-class; positive = P(INJECTION) + P(JAILBREAK)",
+    ),
+    ModelSpec(
+        name="Meta Prompt-Guard-2",
+        hf_id="meta-llama/Llama-Prompt-Guard-2-86M",
+        injection_label="LABEL_1",
+        default_threshold=0.5,
+        notes="LABEL_1 = injection class",
+    ),
 ]
 
 
@@ -110,16 +126,22 @@ def score_with_pipeline(spec: ModelSpec, texts: list[str]) -> list[float]:
         batch = [t[:4000] for t in texts[i : i + BATCH]]
         outputs = clf(batch)
         for out in outputs:
-            inj = next(
-                (
-                    o["score"]
-                    for o in out
-                    if o["label"].upper() == spec.injection_label.upper()
-                ),
-                None,
-            )
-            if inj is None:
-                inj = max(o["score"] for o in out)
+            if spec.sum_attack_labels:
+                wanted = {label.upper() for label in spec.sum_attack_labels}
+                inj = sum(
+                    o["score"] for o in out if o["label"].upper() in wanted
+                )
+            else:
+                inj = next(
+                    (
+                        o["score"]
+                        for o in out
+                        if o["label"].upper() == spec.injection_label.upper()
+                    ),
+                    None,
+                )
+                if inj is None:
+                    inj = max(o["score"] for o in out)
             scores.append(float(inj))
     return scores