Skip to content

4447 Update Guardrails#4915

Open
ivicac wants to merge 1 commit intomasterfrom
4447
Open

4447 Update Guardrails#4915
ivicac wants to merge 1 commit intomasterfrom
4447

Conversation

@ivicac
Copy link
Copy Markdown
Contributor

@ivicac ivicac commented May 5, 2026

Refactor the guardrails module: replace the single GuardrailsAdvisor
with two specialised Spring AI advisors (CheckForViolationsAdvisor for
blocking checks, SanitizeTextAdvisor for response masking), and split
the monolithic guardrails component into 12 per-guardrail components
(jailbreak, keywords, pii, llm-pii, nsfw, urls, secret-keys,
sanitize-text, check-for-violations, custom, custom-regex,
topical-alignment).

Add a sealed GuardrailException hierarchy with a stable
GuardrailExceptionKind tag for telemetry, per-guardrail FAIL_MODE
(FAIL_CLOSED / FAIL_OPEN) with a configuration-error override that
always fails closed for operator bugs (invalid regex, missing MODEL
child, programming errors), and new SPI types: GuardrailContext, sealed
MaskResult { Entities | Masked | Unchanged }, sealed Violation
{ Pattern | Classified | ExecutionFailure }, PreflightMasking mixin.

Introduce ReDoS-bounded detectors (PiiDetector, SecretKeyDetector,
UrlDetector, LlmPiiDetector, RegexParser, MaskEntityMap,
KeywordMatcher) and LlmClassifier with nonce-fenced user input for
prompt-injection hardening. Fail-open crashes are surfaced on the
response via SKIPPED_FAILURES_METADATA_KEY so downstream alerting can
detect silently degraded guardrails without grepping logs.

Cover fail-mode semantics, overlap masking, configuration-error
override, exception-message leakage, multi-turn bypass, concurrency,
streaming partial-leak semantics, ReDoS bounding, and unicode/RTL/emoji
edge cases with dedicated tests.

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Refactor the guardrails module: replace the single GuardrailsAdvisor
with two specialised Spring AI advisors (CheckForViolationsAdvisor for
blocking checks, SanitizeTextAdvisor for response masking), and split
the monolithic guardrails component into 12 per-guardrail components
(jailbreak, keywords, pii, llm-pii, nsfw, urls, secret-keys,
sanitize-text, check-for-violations, custom, custom-regex,
topical-alignment).

Add a sealed GuardrailException hierarchy with a stable
GuardrailExceptionKind tag for telemetry, per-guardrail FAIL_MODE
(FAIL_CLOSED / FAIL_OPEN) with a configuration-error override that
always fails closed for operator bugs (invalid regex, missing MODEL
child, programming errors), and new SPI types: GuardrailContext, sealed
MaskResult { Entities | Masked | Unchanged }, sealed Violation
{ Pattern | Classified | ExecutionFailure }, PreflightMasking mixin.

Introduce ReDoS-bounded detectors (PiiDetector, SecretKeyDetector,
UrlDetector, LlmPiiDetector, RegexParser, MaskEntityMap,
KeywordMatcher) and LlmClassifier with nonce-fenced user input for
prompt-injection hardening. Fail-open crashes are surfaced on the
response via SKIPPED_FAILURES_METADATA_KEY so downstream alerting can
detect silently degraded guardrails without grepping logs.

Cover fail-mode semantics, overlap masking, configuration-error
override, exception-message leakage, multi-turn bypass, concurrency,
streaming partial-leak semantics, ReDoS bounding, and unicode/RTL/emoji
edge cases with dedicated tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant