Proposal: system prompt defense audit guardrail action #1764

@ppcvote

Context

NeMo Guardrails provides runtime guardrails for conversational AI. A complementary layer would be pre-deployment system prompt auditing: checking, before any conversation starts, whether the system prompt itself includes defensive instructions.

Proposal

A guardrail action that validates system prompt defense posture at initialization:

define flow check system prompt defense
  $defense_result = execute check_prompt_defense
  if $defense_result.score < 50
    bot inform "Warning: system prompt has weak defense posture ({$defense_result.score}/100)"
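To make the shape of the action concrete, here is a minimal Python sketch of what `check_prompt_defense` could return. Everything in it is an assumption for illustration: the factory `make_check_prompt_defense`, the `weak` field, and especially `placeholder_auditor`, which is a keyword-counting stand-in and not the prompt-defense-audit scanner.

```python
import asyncio
from typing import Any, Callable, Dict

WEAK_THRESHOLD = 50  # same cutoff as the flow above


def make_check_prompt_defense(system_prompt: str, auditor: Callable[[str], int]):
    """Build an async action bound to one system prompt (hypothetical helper)."""

    async def check_prompt_defense() -> Dict[str, Any]:
        # Score the prompt once and report whether it falls below the cutoff.
        score = auditor(system_prompt)
        return {"score": score, "weak": score < WEAK_THRESHOLD}

    return check_prompt_defense


def placeholder_auditor(prompt: str) -> int:
    """Stand-in scorer (assumption): fraction of defensive phrases present, 0-100."""
    phrases = ("never reveal", "untrusted", "do not follow")
    hits = sum(phrase in prompt.lower() for phrase in phrases)
    return round(100 * hits / len(phrases))


action = make_check_prompt_defense("You are a helpful assistant.", placeholder_auditor)
result = asyncio.run(action())
print(result)  # {'score': 0, 'weak': True}
```

In a real integration the function would presumably be registered with the rails instance under the name `check_prompt_defense` so the flow can execute it. The exact registration call and how the flow reads the returned fields would need to be checked against the NeMo Guardrails version in use.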

The action would use prompt-defense-audit to check 12 attack vectors:

  • Role boundary, instruction boundary, data protection
  • Multi-language bypass, indirect injection, social engineering
  • Output control, unicode protection, input validation
  • And 3 more (OWASP LLM Top 10 mapped)
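As a rough illustration of how per-vector regex checks could roll up into a score, here is a minimal sketch covering three of the vectors. The patterns, the equal-weight scoring, and the names `RULES`, `AuditResult`, and `audit_prompt` are assumptions for illustration, not the actual rules or API of prompt-defense-audit.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical rules: each vector maps to a pattern whose match suggests
# the prompt contains a defense for that vector.
RULES = {
    "role-boundary": re.compile(
        r"\b(never|do not|don't)\b.*\b(change|abandon|break)\b.*\b(role|persona|character)\b",
        re.I,
    ),
    "data-protection": re.compile(
        r"\b(never|do not|don't)\b.*\b(reveal|share|disclose)\b.*\b(system prompt|instructions)\b",
        re.I,
    ),
    "indirect-injection": re.compile(
        r"\b(treat|consider)\b.*\b(retrieved|external|tool|document)\b.*\b(untrusted|data)\b",
        re.I,
    ),
}


@dataclass
class AuditResult:
    score: int
    missing: List[str] = field(default_factory=list)


def audit_prompt(prompt: str) -> AuditResult:
    """Score a prompt as the percentage of vectors with a matching defense."""
    missing = [name for name, pattern in RULES.items() if not pattern.search(prompt)]
    covered = len(RULES) - len(missing)
    return AuditResult(score=round(100 * covered / len(RULES)), missing=missing)


weak = audit_prompt("You are a helpful assistant.")
print(weak.score, weak.missing)
# 0 ['role-boundary', 'data-protection', 'indirect-injection']
```

Equal weighting is the simplest choice here; a real scanner may weight vectors differently or require several patterns per vector.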

Why This Matters

We scanned 1,646 leaked production system prompts: 97.8% have no indirect injection defense, and the average score is 36/100. A pre-conversation guardrail that flags weak system prompts would catch these gaps before they become runtime vulnerabilities.

Implementation

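prompt-defense-audit is on npm and exports a simple API. Since the package is JavaScript, the action would need either a Python wrapper around it or a Python port of its regex rules:

# Proposed Python interface: a wrapper around the npm package, or a port of the regex rules to Python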
from prompt_defense_audit import audit
result = audit("You are a helpful assistant.")
# result.score = 8, result.grade = 'F', result.missing = ['indirect-injection', ...]

The scanner is pure regex (<5 ms, zero dependencies), so it adds negligible latency to guardrail initialization.

Related: We also contributed 6 defense posture patterns to NVIDIA/garak based on the same data.

Happy to contribute a PR with the action implementation.
