Commit ab172da

Add cross-model results table to BENCHMARKS.md
Same eval slice, same scoring code, 5 OSS baselines now in the table: ProtectAI v2, deepset, fmops, Meta Prompt-Guard, Meta Prompt-Guard-2. bench_oss.py extended to handle multi-class models — Prompt-Guard is scored as P(INJECTION)+P(JAILBREAK), Prompt-Guard-2 as P(LABEL_1).
1 parent c4daefe commit ab172da
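The scoring rule described in the commit message can be sketched in isolation. This is a minimal illustration, not the code in `bench_oss.py`: the helper name `attack_score` and the sample outputs are hypothetical, and the per-label list of `{"label", "score"}` dicts is assumed to match what the classifier returns for one input.

```python
def attack_score(out, injection_label, sum_attack_labels=()):
    """Collapse one per-label classifier output into a single attack score.

    `out` is assumed to be a list of {"label": str, "score": float} dicts
    covering every class for one input text.
    """
    if sum_attack_labels:
        # Multi-class model: every listed attack class counts as positive.
        wanted = {label.upper() for label in sum_attack_labels}
        return float(sum(o["score"] for o in out if o["label"].upper() in wanted))
    target = injection_label.upper()
    for o in out:
        if o["label"].upper() == target:
            return float(o["score"])
    # Label not found: fall back to the highest class score.
    return float(max(o["score"] for o in out))


# Hypothetical outputs, chosen only to make the arithmetic easy to follow.
three_class = [
    {"label": "BENIGN", "score": 0.25},
    {"label": "INJECTION", "score": 0.5},
    {"label": "JAILBREAK", "score": 0.25},
]
two_class = [
    {"label": "LABEL_0", "score": 0.75},
    {"label": "LABEL_1", "score": 0.25},
]

pg_score = attack_score(three_class, "INJECTION", ("INJECTION", "JAILBREAK"))
pg2_score = attack_score(two_class, "LABEL_1")
```

With these made-up inputs the 3-class output scores 0.5 + 0.25 = 0.75 and the 2-class output scores 0.25, so only the first would be flagged at a `0.5` threshold.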

2 files changed

Lines changed: 70 additions & 11 deletions

docs/BENCHMARKS.md

Lines changed: 38 additions & 1 deletion
@@ -46,11 +46,48 @@ Reproduce in two commands:
 
 ```bash
 node scripts/bench.mjs # promptpurify on the eval slice
-python3 scripts/bench_oss.py # ProtectAI / deepset / fmops on the same slice
+python3 scripts/bench_oss.py # OSS baselines on the same slice
 ```
 
 Full recipe: [REPRODUCE.md](REPRODUCE.md).
 
+## Results
+
+Same eval slice (`training/FROZEN_EVAL_SCORED.jsonl`, 791 attacks /
+132 benigns), same scoring code (`scripts/bench_oss.py`), each model
+at its published default threshold and at a cross-model neutral
+`0.5`.
+
+| Model | recall@default | FPR@default | recall@0.5 | FPR@0.5 |
+|---|---:|---:|---:|---:|
+| **promptpurify** | **83.94%** | **10.61%** | **87.10%** | **12.88%** |
+| ProtectAI v2 | 40.71% | 43.18% | 40.71% | 43.18% |
+| deepset | 97.22% | 59.85% | 97.22% | 59.85% |
+| fmops | 100.00% | 100.00% | 100.00% | 100.00% |
+| Meta Prompt-Guard | 67.00% | 88.64% | 67.00% | 88.64% |
+| Meta Prompt-Guard-2 | 12.77% | 1.52% | 12.77% | 1.52% |
+
+How to read this:
+
+- `promptpurify` ships at `0.95`; everything else ships at `0.5`.
+- Lower FPR than every other model except Prompt-Guard-2, which buys
+its low FPR by recalling only 12.77% of attacks (≈1 in 8).
+- Higher recall than ProtectAI v2, Prompt-Guard, and Prompt-Guard-2
+on this slice. `deepset` reaches higher recall but at ~6x the FPR
+(60% of benigns blocked); for most production traffic that's worse,
+not better.
+- `fmops` predicts the positive class for every input on this slice.
+Treat the row as evidence the model is mis-calibrated for this
+distribution, not as a real recall claim.
+- `Meta Prompt-Guard` is a 3-class model; we score it as
+`P(INJECTION) + P(JAILBREAK)` (see `scripts/bench_oss.py`).
+
+The slice is deliberately hard — curated borderline cases, not a
+naturally-distributed sample. Numbers should be read as "relative
+behavior at the decision boundary", not as production recall on your
+traffic. Pick a threshold against your own data ([Operating
+points](#operating-points)).
+
 ## Operating points
 
 The right threshold depends on **your** traffic mix, not ours.

scripts/bench_oss.py

Lines changed: 32 additions & 10 deletions
@@ -52,6 +52,7 @@ class ModelSpec:
     hf_id: str
     injection_label: str
     default_threshold: float
+    sum_attack_labels: tuple = ()
     notes: str = ""
 
 
@@ -74,6 +75,21 @@ class ModelSpec:
         injection_label="INJECTION",
         default_threshold=0.5,
     ),
+    ModelSpec(
+        name="Meta Prompt-Guard",
+        hf_id="meta-llama/Prompt-Guard-86M",
+        injection_label="INJECTION",
+        default_threshold=0.5,
+        sum_attack_labels=("INJECTION", "JAILBREAK"),
+        notes="3-class; positive = P(INJECTION) + P(JAILBREAK)",
+    ),
+    ModelSpec(
+        name="Meta Prompt-Guard-2",
+        hf_id="meta-llama/Llama-Prompt-Guard-2-86M",
+        injection_label="LABEL_1",
+        default_threshold=0.5,
+        notes="LABEL_1 = injection class",
+    ),
 ]
 
 
@@ -110,16 +126,22 @@ def score_with_pipeline(spec: ModelSpec, texts: list[str]) -> list[float]:
         batch = [t[:4000] for t in texts[i : i + BATCH]]
         outputs = clf(batch)
         for out in outputs:
-            inj = next(
-                (
-                    o["score"]
-                    for o in out
-                    if o["label"].upper() == spec.injection_label.upper()
-                ),
-                None,
-            )
-            if inj is None:
-                inj = max(o["score"] for o in out)
+            if spec.sum_attack_labels:
+                wanted = {l.upper() for l in spec.sum_attack_labels}
+                inj = sum(
+                    o["score"] for o in out if o["label"].upper() in wanted
+                )
+            else:
+                inj = next(
+                    (
+                        o["score"]
+                        for o in out
+                        if o["label"].upper() == spec.injection_label.upper()
+                    ),
+                    None,
+                )
+                if inj is None:
+                    inj = max(o["score"] for o in out)
             scores.append(float(inj))
     return scores
 
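For readers checking the table columns, the sketch below shows how recall and FPR at a given threshold could be derived from the per-text scores that `score_with_pipeline` returns. It is an illustration under assumptions, not the repository's scoring code: the function name `recall_and_fpr`, the `>=` comparison, and the toy data are all hypothetical.

```python
def recall_and_fpr(scores, is_attack, threshold):
    """Return (recall, false-positive rate) for one threshold.

    scores:    attack scores in [0, 1], one per text
    is_attack: ground-truth booleans, aligned with `scores`
    A text is flagged when score >= threshold (assumed convention).
    """
    attack_flags = [s >= threshold for s, a in zip(scores, is_attack) if a]
    benign_flags = [s >= threshold for s, a in zip(scores, is_attack) if not a]
    if not attack_flags or not benign_flags:
        raise ValueError("need at least one attack and one benign example")
    recall = sum(attack_flags) / len(attack_flags)  # flagged attacks / all attacks
    fpr = sum(benign_flags) / len(benign_flags)     # flagged benigns / all benigns
    return recall, fpr


# Toy data (not from the eval slice): four attacks, then two benign texts.
toy_scores = [0.99, 0.97, 0.60, 0.20, 0.96, 0.10]
toy_labels = [True, True, True, True, False, False]

strict = recall_and_fpr(toy_scores, toy_labels, 0.95)
neutral = recall_and_fpr(toy_scores, toy_labels, 0.5)
```

On the toy data, lowering the threshold from `0.95` to `0.5` raises recall from 0.5 to 0.75 while FPR stays at 0.5, which is the same kind of trade-off the `recall@default` and `recall@0.5` columns report for a model that ships above `0.5`.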
