Summary
Build a first-party ContextForge plugin that screens prompts, tool arguments, and (optionally) tool outputs for prompt injection and jailbreak attempts, using an open source detection library whose execution time is deterministic (bounded, non-network-dependent, not gated on an upstream LLM call).
Motivation
- Prompt injection, including jailbreaks, is ranked first (LLM01) in the OWASP Top 10 for LLM Applications, and protection against it is a repeat ask from security-conscious customers evaluating ContextForge for agent deployments.
- Existing moderation / PII plugins do not target this class of attack specifically.
- A deterministically timed filter is a hard requirement for this workload: tool invocation latency budgets are tight, and a detector that occasionally stalls for several seconds on an LLM call (e.g. to a hosted judge model) is unacceptable on production hot paths.
- We want heuristic / local-model detection that runs in predictable single- or double-digit milliseconds per check, with no network dependency by default.
Requirements
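Functional
- Packaged as `cpex-prompt-injection-guard` or similar, following the `cpex-*` PyPI pattern from "refactor(plugins): migrate in-tree plugins to PyPI packages (cpex-*)" (#3965).
- Registers on `prompt_pre_fetch` and `tool_pre_invoke`; optionally `tool_post_invoke` for output screening.
- Supports the actions `block`, `redact`, and `flag-only` on detection.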
Non-functional
- Deterministic timing: p99 overhead target ≤ 25 ms per invocation on a single CPU core for typical prompt lengths (≤ 4 KB).
- No required outbound network calls at runtime (models, if any, must be loadable locally).
- Permissive open source license compatible with Apache-2.0 distribution.
- Actively maintained (commits within the last ~12 months).
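The timing target above can be checked with a small harness. The following is a minimal sketch, not the project's benchmark tooling: `placeholder_detector`, `benchmark`, and `meets_budget` are hypothetical names, and the regex detector is only a stand-in for whichever candidate library is under test.

```python
import re
import statistics
import time
from typing import Callable, Sequence

# Hypothetical stand-in detector: one precompiled regex. Swap in the candidate
# library's scan call when running the real benchmark.
_PATTERN = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)


def placeholder_detector(text: str) -> bool:
    """Return True when the illustrative pattern matches."""
    return _PATTERN.search(text) is not None


def latency_percentiles_ms(samples_ms: Sequence[float]) -> dict[str, float]:
    """Return p50/p95/p99 from per-call latencies given in milliseconds."""
    # quantiles(n=100) yields the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def benchmark(
    detector: Callable[[str], object],
    prompts: Sequence[str],
    repeats: int = 200,
) -> dict[str, float]:
    """Time every detector call individually and summarise the distribution."""
    samples_ms: list[float] = []
    for _ in range(repeats):
        for prompt in prompts:
            start = time.perf_counter()
            detector(prompt)
            samples_ms.append((time.perf_counter() - start) * 1000.0)
    return latency_percentiles_ms(samples_ms)


def meets_budget(stats: dict[str, float], p99_budget_ms: float = 25.0) -> bool:
    """Compare the measured p99 against the 25 ms target from the requirements."""
    return stats["p99"] <= p99_budget_ms
```

For a representative run, the prompts would be drawn from the evaluation corpus and kept at or below 4 KB, with the process pinned to a single CPU core so the numbers match the stated target.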
Candidate Libraries to Evaluate
The following OSS libraries should be benchmarked against the deterministic-timing requirement. This list is a starting point — it needs validation (license, maintenance, actual latency characteristics) during the spike.
| Library | Approach | Deterministic timing? | License | Notes |
|---|---|---|---|---|
| LLM Guard (Protect AI) | Mix of regex scanners and small local transformer classifiers (the `PromptInjection` scanner uses a fine-tuned DeBERTa) | Yes if using regex scanners or a local CPU/GPU model; no network | MIT | Most mature OSS option; well-integrated scanner API. Primary candidate. |
| Vigil-LLM (deadbits) | YARA rules, heuristics, local embeddings against a canonical injection corpus | Yes (all local) | Apache-2.0 | Focused specifically on prompt injection; modular detector pipeline. Strong candidate. |
| Rebuff (Protect AI) | Heuristic + vector DB + LLM-based judge | Partially; the LLM-judge stage is non-deterministic | Apache-2.0 | Would need to run in heuristic-only mode to meet the timing budget. |
| Guardrails AI (RAIL validators) | Validator framework; includes injection validators | Varies by validator | Apache-2.0 | More framework than detector; could wrap deterministic validators. |
| NeMo Guardrails | Orchestration framework; includes jailbreak rails | No by default (rails can call LLMs) | Apache-2.0 | Probably too heavyweight and LLM-dependent for this plugin. |
| garak (NVIDIA) | Red-teaming / testing harness | N/A (offline testing tool) | Apache-2.0 | Not a runtime filter, but useful for generating the test corpus to validate whichever library we choose. |
| promptmap | Injection test generator | N/A (testing tool) | MIT | Same as garak: useful for the test harness, not runtime. |
Recommended starting pair: LLM Guard for production detection (regex + local DeBERTa classifier) and garak / promptmap for generating adversarial test cases to validate catch-rate and measure p50/p95/p99 latency before committing.
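To make the "regex + local classifier" shape and the catch-rate measurement concrete, here is a minimal sketch under stated assumptions. It does not use LLM Guard's API: `HEURISTIC_PATTERNS`, `Verdict`, `screen`, and `catch_rate` are hypothetical names, the patterns are illustrative only, and the optional `classifier` callable stands in for a locally loaded model that returns a score between 0 and 1.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Illustrative patterns only; a real deployment would rely on the chosen
# library's rule set rather than this hand-written list.
HEURISTIC_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the )?(system|developer) prompt", re.I),
    re.compile(r"you are now (in )?(dan|developer mode)", re.I),
]


@dataclass(frozen=True)
class Verdict:
    flagged: bool
    stage: str    # "heuristic", "classifier", or "none"
    score: float


def screen(
    text: str,
    classifier: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> Verdict:
    """Run the cheap regex stage first; call the classifier only if it misses."""
    for pattern in HEURISTIC_PATTERNS:
        if pattern.search(text):
            return Verdict(True, "heuristic", 1.0)
    if classifier is not None:
        score = classifier(text)  # assumed to be a local, non-network call
        return Verdict(score >= threshold, "classifier", score)
    return Verdict(False, "none", 0.0)


def catch_rate(
    attack_prompts: Sequence[str],
    classifier: Optional[Callable[[str], float]] = None,
    threshold: float = 0.5,
) -> float:
    """Fraction of known-malicious prompts that the pipeline flags."""
    if not attack_prompts:
        raise ValueError("attack_prompts must not be empty")
    caught = sum(screen(p, classifier, threshold).flagged for p in attack_prompts)
    return caught / len(attack_prompts)
```

Running `catch_rate` over the garak / promptmap output gives one number per candidate; a matching pass over benign prompts would be needed to estimate the false-positive rate, which this sketch does not cover.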
Scope / Spike Plan
- Library evaluation (1–2 days): pick 2–3 candidates from the table above, run them against a small corpus of known injection/jailbreak prompts (garak + promptmap output), measure p50/p95/p99 latency on CPU, note license and model-download requirements.
- Plugin skeleton: package as `cpex-prompt-injection-guard` PyPI wheel, wire into `[plugins]` extra.
- Integration: register on `prompt_pre_fetch` and `tool_pre_invoke` hooks; emit a structured violation that ContextForge can block/flag/redact on.
- Benchmarks: publish per-hook overhead numbers in the plugin README.
- Documentation: user-facing doc under `docs/docs/plugins/prompt-injection-guard.md`, including configuration, bindings, and a worked example of blocking a canonical injection.
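The integration step could look roughly like the sketch below. It is deliberately framework-free: `Violation`, `HookResult`, and `PromptInjectionGuard` are hypothetical stand-ins, not the ContextForge or `cpex` plugin API, and the real hooks would take the framework's own payload and context types. Only the hook names and the three actions come from this issue.

```python
import re
from dataclasses import dataclass
from typing import Any, Optional

# Illustrative pattern only; the real plugin would delegate to the chosen library.
_INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.I)
_ACTIONS = {"block", "redact", "flag-only"}


@dataclass
class Violation:
    # Hypothetical shape of the structured violation.
    code: str
    hook: str
    fields: list[str]
    description: str


@dataclass
class HookResult:
    # Hypothetical result type: whether to continue, the (possibly rewritten)
    # payload, and the violation if one was detected.
    continue_processing: bool = True
    payload: Optional[dict[str, Any]] = None
    violation: Optional[Violation] = None


class PromptInjectionGuard:
    def __init__(self, action: str = "block") -> None:
        if action not in _ACTIONS:
            raise ValueError(f"action must be one of {sorted(_ACTIONS)}")
        self.action = action

    def _screen(self, hook: str, values: dict[str, Any]) -> HookResult:
        # Only string values are screened; other argument types pass through.
        hits = [k for k, v in values.items()
                if isinstance(v, str) and _INJECTION.search(v)]
        if not hits:
            return HookResult(payload=values)
        violation = Violation("PROMPT_INJECTION", hook, hits,
                              "Possible prompt injection detected")
        if self.action == "block":
            return HookResult(continue_processing=False, violation=violation)
        if self.action == "redact":
            cleaned = {k: _INJECTION.sub("[REDACTED]", v) if k in hits else v
                       for k, v in values.items()}
            return HookResult(payload=cleaned, violation=violation)
        return HookResult(payload=values, violation=violation)  # flag-only

    def prompt_pre_fetch(self, args: dict[str, Any]) -> HookResult:
        return self._screen("prompt_pre_fetch", args)

    def tool_pre_invoke(self, args: dict[str, Any]) -> HookResult:
        return self._screen("tool_pre_invoke", args)
```

Keeping the action decision in one place means a `tool_post_invoke` hook for output screening could reuse `_screen` unchanged if it is added later.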
Acceptance Criteria
- Plugin ships as a `cpex-*` PyPI package, discoverable via the dynamic plugin registry.
References
- `cpex-*` PyPI plugin distribution pattern