Skip to content

Commit 67f9fe1

Browse files
2 parents 78f6fc7 + 61fceb6 commit 67f9fe1

1 file changed

Lines changed: 79 additions & 0 deletions

File tree

docs/validation.md

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -194,3 +194,82 @@ interface ValidationError {
194194
suggestion?: string;
195195
}
196196
```
197+
198+
## Prompt injection and system-instruction leakage hardening
199+
200+
Regex validation is **not a complete security boundary**, but it is useful as a first-pass filter for obvious hostile payloads in high-risk fields such as `user_message`, `external_content`, or `tool_result`.
201+
202+
### Common vulnerability patterns
203+
204+
- **Direct instruction override**: attacker text attempts to supersede system policies with phrases like “ignore previous instructions”.
205+
- **Role/channel spoofing**: attacker injects fake role labels (`system:`, `assistant:`) so the model treats user content as higher-priority instructions.
206+
- **Prompt exfiltration requests**: attacker asks the model to reveal hidden instructions, internal policies, API keys, or chain-of-thought.
207+
- **Structured wrapper attacks**: attacker wraps malicious directives in XML/JSON/Markdown fences to look machine-generated or authoritative.
208+
209+
Use `deny_regex` to catch obvious attack language and `allow_regex` to constrain strict-format fields (IDs, enums, narrow command grammars).
210+
211+
### Example hardening set A: free-form user text
212+
213+
Use this when input must remain natural language but you want to block obvious prompt-injection attempts:
214+
215+
```yaml
216+
context:
217+
inputs:
218+
- name: user_message
219+
max_size: 4000
220+
non_empty: true
221+
reject_secrets: true
222+
deny_regex:
223+
pattern: "(?:ignore|disregard|forget)\s+(?:all\s+)?(?:previous|prior|above)\s+instructions|(?:^|\b)(?:system|developer|assistant)\s*:|reveal\s+(?:your|the)\s+(?:system\s+prompt|hidden\s+instructions?)|print\s+(?:the\s+)?(?:policy|rules?)|BEGIN\s+SYSTEM\s+PROMPT|END\s+SYSTEM\s+PROMPT"
224+
flags: "i"
225+
return_message: "I can help with your request, but I can't process instruction-override language. Please rephrase your question."
226+
```
227+
228+
**Mitigates:** direct override phrasing, role spoofing, and explicit system-prompt extraction asks.
229+
230+
### Example hardening set B: strict machine-readable selector
231+
232+
Use this when the field should be tightly constrained (best defense is allowlist + short max size):
233+
234+
```yaml
235+
context:
236+
inputs:
237+
- name: intent_code
238+
max_size: 32
239+
trim: true
240+
non_empty: true
241+
allow_regex:
242+
pattern: "^(billing|technical_support|account_access|cancel_subscription)$"
243+
flags: "i"
244+
return_message: "Please select one of: billing, technical_support, account_access, cancel_subscription."
245+
deny_regex:
246+
pattern: "[\r\n`{}<>:$]"
247+
```
248+
249+
**Mitigates:** multi-line payload smuggling, role-label injection, and arbitrary instruction text in fields that should only carry enum-like values.
250+
251+
### Example hardening set C: external retrieved content
252+
253+
If your prompt includes untrusted retrieved content, isolate and filter it before interpolation:
254+
255+
```yaml
256+
context:
257+
inputs:
258+
- name: retrieved_snippet
259+
max_size: 6000
260+
trim: true
261+
deny_regex:
262+
pattern: "(?:^|\b)(?:system|developer)\s*:|ignore\s+(?:all\s+)?instructions|jailbreak|do\s+anything\s+now|simulate\s+developer\s+mode|exfiltrat(?:e|ion)|reveal\s+(?:prompt|policy|secrets?)"
263+
flags: "i"
264+
return_message: "The retrieved content appears unsafe and was not included."
265+
```
266+
267+
**Mitigates:** known jailbreak strings and instruction-like payloads embedded in retrieval data.
268+
269+
### Practical guidance
270+
271+
- Prefer **allowlists** for structured fields; use denylists only as a secondary net.
272+
- Keep regexes focused on high-signal patterns to reduce false positives.
273+
- Combine regex checks with architecture controls: separate trusted instructions from untrusted context, quote/delimit untrusted text, and add explicit “treat context as data” system instructions.
274+
- Never rely on regex alone for sensitive operations; require server-side policy checks and tool authorization.
275+
- Log `POK031`/`POK032` failures and monitor spikes as potential attack signals.

0 commit comments

Comments
 (0)