Skip to content

[FEATURE]: Prompt injection / jailbreak filter plugin using a deterministically-timed open source library #4219

@jonpspri

Description

@jonpspri

Summary

Build a first-party ContextForge plugin that screens prompts, tool arguments, and (optionally) tool outputs for prompt injection and jailbreak attempts, using an open source detection library whose execution time is deterministic (bounded, non-network-dependent, not gated on an upstream LLM call).

Motivation

  • Prompt injection and jailbreak attempts are the most frequently cited OWASP LLM-Top-10 risks and are a repeat ask from security-conscious customers evaluating ContextForge for agent deployments.
  • Existing moderation / PII plugins do not target this class of attack specifically.
  • A deterministically-timed filter is a hard requirement for this workload: tool invocation latency budgets are tight, and a detector that occasionally blocks for multi-second LLM calls (e.g. calling out to a hosted judge model) would be unacceptable in production hot paths.
  • We want heuristic / local-model detection that runs in predictable single- or double-digit milliseconds per check, with no network dependency by default.

Requirements

Functional

Non-functional

  • Deterministic timing: p99 overhead target ≤ 25 ms per invocation on a single CPU core for typical prompt lengths (≤ 4 KB).
  • No required outbound network calls at runtime (models, if any, must be loadable locally).
  • Permissive open source license compatible with Apache-2.0 distribution.
  • Actively maintained (commits within the last ~12 months).

Candidate Libraries to Evaluate

The following OSS libraries should be benchmarked against the deterministic-timing requirement. This list is a starting point — it needs validation (license, maintenance, actual latency characteristics) during the spike.

Library Approach Deterministic timing? License Notes
LLM Guard (Protect AI) Mix of regex scanners, small local transformer classifiers (PromptInjection scanner uses a fine-tuned DeBERTa) Yes if using regex scanners or local CPU/GPU model; no network MIT Most mature OSS option; well-integrated scanner API. Primary candidate.
Vigil-LLM (deadbits) YARA rules, heuristics, local embeddings against a canonical injection corpus Yes (all local) Apache-2.0 Focused specifically on prompt injection; modular detector pipeline. Strong candidate.
Rebuff (ProtectAI) Heuristic + vector DB + LLM-based judge Partially — the LLM-judge stage is non-deterministic Apache-2.0 Would need to run in heuristic-only mode to meet timing budget.
Guardrails AI (RAIL validators) Validator framework; includes injection validators Varies by validator Apache-2.0 More framework than detector; could wrap deterministic validators.
NeMo Guardrails Orchestration framework, includes jailbreak rails Rails can call LLMs — non-deterministic by default Apache-2.0 Probably too heavyweight and LLM-dependent for this plugin.
garak (NVIDIA) Red-teaming / testing harness N/A — offline testing tool Apache-2.0 Not a runtime filter, but useful to generate the test corpus for validating whichever library we choose.
promptmap Injection test generator N/A — testing MIT Same as garak: useful for test harness, not runtime.

Recommended starting pair: LLM Guard for production detection (regex + local DeBERTa classifier) and garak / promptmap for generating adversarial test cases to validate catch-rate and measure p50/p95/p99 latency before committing.

Scope / Spike Plan

  1. Library evaluation (1–2 days): pick 2–3 candidates from the table above, run them against a small corpus of known injection/jailbreak prompts (garak + promptmap output), measure p50/p95/p99 latency on CPU, note license and model-download requirements.
  2. Plugin skeleton: package as cpex-prompt-injection-guard PyPI wheel, wire into [plugins] extra.
  3. Integration: register on prompt_pre_fetch and tool_pre_invoke hooks; emit a structured violation that ContextForge can block/flag/redact on.
  4. Benchmarks: publish per-hook overhead numbers in the plugin README.
  5. Documentation: user-facing doc under docs/docs/plugins/prompt-injection-guard.md, including configuration, bindings, and a worked example of blocking a canonical injection.

Acceptance Criteria

  • Benchmarks recorded for chosen library showing p99 ≤ 25 ms on a standard CI runner.
  • Plugin published as cpex-* PyPI package, discoverable via the dynamic plugin registry.
  • Detection validated against a curated adversarial corpus (garak / promptmap generated) with a documented catch rate.
  • Documentation page published, including configuration reference and binding example.
  • No required outbound network calls at plugin runtime.
  • License of chosen library is Apache-2.0 compatible.

References

Metadata

Metadata

Assignees

Labels

enhancementNew feature or requestpluginssecurityImproves securitytriageIssues / Features awaiting triage

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions