Context
NeMo Guardrails provides runtime guardrails for conversational AI. A complementary layer would be pre-deployment system prompt auditing — checking whether the system prompt itself includes defensive instructions before the conversation starts.
Proposal
A guardrail action that validates system prompt defense posture at initialization:
define flow check system prompt defense
$defense_result = execute check_prompt_defense
if $defense_result.score < 50
bot inform "Warning: system prompt has weak defense posture ({$defense_result.score}/100)"
The action would use prompt-defense-audit to check 12 attack vectors:
- Role boundary, instruction boundary, data protection
- Multi-language bypass, indirect injection, social engineering
- Output control, unicode protection, input validation
- And 3 more (OWASP LLM Top 10 mapped)
Why This Matters
We scanned 1,646 leaked production system prompts — 97.8% have no indirect injection defense, average score 36/100. A pre-conversation guardrail that flags weak system prompts would catch these gaps before they become runtime vulnerabilities.
Implementation
prompt-defense-audit is on npm and exports a simple API:
# Python wrapper around the npm package, or port the regex rules to Python
from prompt_defense_audit import audit
result = audit("You are a helpful assistant.")
# result.score = 8, result.grade = 'F', result.missing = ['indirect-injection', ...]
The scanner is pure regex (<5ms, zero dependencies) so it adds negligible latency to guardrail initialization.
Related: We also contributed 6 defense posture patterns to NVIDIA/garak based on the same data.
Happy to contribute a PR with the action implementation.
Context
NeMo Guardrails provides runtime guardrails for conversational AI. A complementary layer would be pre-deployment system prompt auditing — checking whether the system prompt itself includes defensive instructions before the conversation starts.
Proposal
A guardrail action that validates system prompt defense posture at initialization:
The action would use prompt-defense-audit to check 12 attack vectors:
Why This Matters
We scanned 1,646 leaked production system prompts — 97.8% have no indirect injection defense, average score 36/100. A pre-conversation guardrail that flags weak system prompts would catch these gaps before they become runtime vulnerabilities.
Implementation
prompt-defense-audit is on npm and exports a simple API:
The scanner is pure regex (<5ms, zero dependencies) so it adds negligible latency to guardrail initialization.
Related: We also contributed 6 defense posture patterns to NVIDIA/garak based on the same data.
Happy to contribute a PR with the action implementation.