[FEATURE]: Prompt injection / jailbreak filter plugin using a deterministically-timed open source library #4219

## Summary

Build a first-party ContextForge plugin that screens prompts, tool arguments, and (optionally) tool outputs for **prompt injection** and **jailbreak** attempts, using an open source detection library whose execution time is **deterministic** (bounded, non-network-dependent, not gated on an upstream LLM call).

## Motivation

- Prompt injection is ranked LLM01 in the OWASP Top 10 for LLM Applications, which treats jailbreaking as a form of prompt injection. Detecting both is a repeat ask from security-conscious customers evaluating ContextForge for agent deployments.
- Existing moderation / PII plugins do not target this class of attack specifically.
- A **deterministically-timed** filter is a hard requirement for this workload: tool invocation latency budgets are tight, and a detector that occasionally blocks for multi-second LLM calls (e.g. calling out to a hosted judge model) would be unacceptable in production hot paths.
- We want heuristic / local-model detection that runs in predictable single- or double-digit milliseconds per check, with no network dependency by default.

## Requirements

### Functional
- New plugin (target package name: `cpex-prompt-injection-guard` or similar, following the `cpex-*` PyPI pattern from #3965).
- Hooks: at minimum `prompt_pre_fetch` and `tool_pre_invoke`; optionally `tool_post_invoke` for output screening.
- Configurable thresholds, categories (injection vs. jailbreak vs. system-prompt-leak), and response modes (`block`, `redact`, `flag-only`).
- Returns a structured decision including score, matched rule(s), and category — surfaced through the existing plugin violation/metadata plumbing.
- Supports per-tool / per-tenant binding via the plugin-bindings API introduced in #4143.
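
To make the functional requirements concrete, here is a minimal sketch of a structured decision and the three response modes. It is not ContextForge's plugin API and not any candidate library's API: `Mode`, `Decision`, `RULES`, and `screen` are hypothetical names, and the three regex rules are illustrative placeholders rather than a real rule set.

```python
import re
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    """Response modes named in the requirements (hypothetical enum)."""
    BLOCK = "block"
    REDACT = "redact"
    FLAG_ONLY = "flag-only"


# Hypothetical rules: (rule id, category, compiled pattern, score).
# The patterns are simple, with no nested quantifiers, so matching time stays bounded.
RULES = [
    ("ignore-previous", "injection",
     re.compile(r"ignore (all )?(previous|prior) instructions", re.I), 0.9),
    ("reveal-system-prompt", "system-prompt-leak",
     re.compile(r"(reveal|print|show) (your )?system prompt", re.I), 0.8),
    ("dan-persona", "jailbreak",
     re.compile(r"\byou are now DAN\b", re.I), 0.7),
]


@dataclass
class Decision:
    action: str                                   # "allow", "block", "redact", or "flag"
    score: float                                  # highest score among matched rules
    categories: list = field(default_factory=list)
    matched_rules: list = field(default_factory=list)
    text: str = ""                                # text to forward (empty when blocked)


def screen(text: str, mode: Mode = Mode.BLOCK, threshold: float = 0.5) -> Decision:
    """Apply every rule at or above the threshold and build a structured decision."""
    hits = [(rid, cat, pat, score) for rid, cat, pat, score in RULES
            if score >= threshold and pat.search(text)]
    if not hits:
        return Decision("allow", 0.0, [], [], text)

    top_score = max(score for _, _, _, score in hits)
    categories = sorted({cat for _, cat, _, _ in hits})
    rule_ids = [rid for rid, _, _, _ in hits]

    if mode is Mode.BLOCK:
        return Decision("block", top_score, categories, rule_ids, "")
    if mode is Mode.REDACT:
        redacted = text
        for _, _, pat, _ in hits:
            redacted = pat.sub("[REDACTED]", redacted)
        return Decision("redact", top_score, categories, rule_ids, redacted)
    return Decision("flag", top_score, categories, rule_ids, text)
```

In a real plugin, the `Decision` fields would be mapped onto the existing violation/metadata plumbing, and `mode` and `threshold` would come from per-tool or per-tenant configuration.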

### Non-functional
- **Deterministic timing**: p99 overhead target ≤ 25 ms per invocation on a single CPU core for typical prompt lengths (≤ 4 KB).
- No required outbound network calls at runtime (models, if any, must be loadable locally).
- Permissive open source license compatible with Apache-2.0 distribution.
- Actively maintained (commits within the last ~12 months).
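
The timing target above can be checked with a small harness like the following sketch. It uses only the standard library; `measure_latency_ms` and `within_budget` are hypothetical helper names, and `check` stands in for whichever detector call is being evaluated. Pinning the process to a single core and choosing representative prompt sizes are left to the benchmark setup.

```python
import statistics
import time


def measure_latency_ms(check, prompts, warmup=10, repeats=50):
    """Time check(prompt) for each prompt and return p50/p95/p99 in milliseconds."""
    for prompt in prompts[:warmup]:
        check(prompt)  # untimed calls so lazy initialisation does not skew results

    samples = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter_ns()
            check(prompt)
            samples.append((time.perf_counter_ns() - start) / 1e6)

    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98], "n": len(samples)}


def within_budget(stats, p99_budget_ms=25.0):
    """True when the measured p99 meets the per-invocation budget."""
    return stats["p99"] <= p99_budget_ms
```

For example, `measure_latency_ms(detector, corpus)` followed by `within_budget(stats)` gives a pass/fail signal that could gate a CI job. Tail percentiles are noisy on small samples, so the p99 is only meaningful with a few thousand timed calls.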

## Candidate Libraries to Evaluate

The following OSS libraries should be benchmarked against the deterministic-timing requirement. This list is a starting point — it needs validation (license, maintenance, actual latency characteristics) during the spike.

| Library | Approach | Deterministic timing? | License | Notes |
|---|---|---|---|---|
| **LLM Guard** (Protect AI) | Mix of regex scanners, small local transformer classifiers (`PromptInjection` scanner uses a fine-tuned DeBERTa) | Yes if using regex scanners or local CPU/GPU model; no network | MIT | Most mature OSS option; well-integrated scanner API. Primary candidate. |
| **Vigil-LLM** (deadbits) | YARA rules, heuristics, local embeddings against a canonical injection corpus | Yes (all local) | Apache-2.0 | Focused specifically on prompt injection; modular detector pipeline. Strong candidate. |
| **Rebuff** (Protect AI) | Heuristic + vector DB + LLM-based judge | Partially: the LLM-judge stage has unbounded, network-dependent latency | Apache-2.0 | Would need to run in heuristic-only mode to meet timing budget. |
| **Guardrails AI** (RAIL validators) | Validator framework; includes injection validators | Varies by validator | Apache-2.0 | More framework than detector; could wrap deterministic validators. |
| **NeMo Guardrails** | Orchestration framework, includes jailbreak rails | Rails can call LLMs — non-deterministic by default | Apache-2.0 | Probably too heavyweight and LLM-dependent for this plugin. |
| **garak** (NVIDIA) | Red-teaming / testing harness | N/A — offline testing tool | Apache-2.0 | Not a runtime filter, but useful to generate the **test corpus** for validating whichever library we choose. |
| **promptmap** | Injection test generator | N/A — testing | MIT | Same as garak: useful for test harness, not runtime. |

**Recommended starting point**: **LLM Guard** for production detection (regex + local DeBERTa classifier), with **garak** and **promptmap** generating adversarial test cases to validate catch rate and measure p50/p95/p99 latency before committing to a library.

## Scope / Spike Plan

1. **Library evaluation (1–2 days)**: pick 2–3 candidates from the table above, run them against a small corpus of known injection/jailbreak prompts (garak + promptmap output), measure p50/p95/p99 latency on CPU, note license and model-download requirements.
2. **Plugin skeleton**: package as `cpex-prompt-injection-guard` PyPI wheel, wire into `[plugins]` extra.
3. **Integration**: register on `prompt_pre_fetch` and `tool_pre_invoke` hooks; emit a structured violation that ContextForge can block/flag/redact on.
4. **Benchmarks**: publish per-hook overhead numbers in the plugin README.
5. **Documentation**: user-facing doc under `docs/docs/plugins/prompt-injection-guard.md`, including configuration, bindings, and a worked example of blocking a canonical injection.

## Acceptance Criteria

- [ ] Benchmarks recorded for chosen library showing p99 ≤ 25 ms per invocation for prompts ≤ 4 KB on a standard CI runner (single CPU core).
- [ ] Plugin published as `cpex-*` PyPI package, discoverable via the dynamic plugin registry.
- [ ] Detection validated against a curated adversarial corpus (garak / promptmap generated) with a documented catch rate.
- [ ] Documentation page published, including configuration reference and binding example.
- [ ] No required outbound network calls at plugin runtime.
- [ ] License of chosen library is Apache-2.0 compatible.
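
The catch-rate criterion could be computed along the lines of this sketch. `evaluate` is a hypothetical helper and `detector` is any callable that returns a truthy value when it flags a prompt. Reporting a false-positive rate on benign prompts is an assumption added here; the criteria above only require the catch rate.

```python
def evaluate(detector, adversarial, benign):
    """Return catch rate on adversarial prompts and false-positive rate on benign ones."""
    if not adversarial or not benign:
        raise ValueError("both corpora must be non-empty")
    caught = sum(1 for prompt in adversarial if detector(prompt))
    false_positives = sum(1 for prompt in benign if detector(prompt))
    return {
        "catch_rate": caught / len(adversarial),
        "false_positive_rate": false_positives / len(benign),
        "adversarial_count": len(adversarial),
        "benign_count": len(benign),
    }
```

The adversarial list would come from garak / promptmap output, and the benign list from ordinary prompts and tool arguments, so that a detector which flags everything does not score well.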

## References

- #3965 — `cpex-*` PyPI plugin distribution pattern
- #4143 — per-tool plugin bindings (relevant for scoping the filter per tenant/tool)
- OWASP Top 10 for LLM Applications (2025 edition): LLM01 Prompt Injection, LLM07 System Prompt Leakage
