From f8fbb89e35964c781493bc2f0e0e5f43410f7312 Mon Sep 17 00:00:00 2001
From: micheleRP
Date: Tue, 28 Apr 2026 17:31:06 -0600
Subject: [PATCH] docs(governance): scaffold Guardrails pages for ADP GA
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Beta deliverable (master Week 3, May 4-8): single concept page covering
INPUT/OUTPUT phase model, the three evaluator types (PII / Toxicity /
Custom webhook), violation framing, and attachment scope. Anchors on RFC
0002 Phase 5; resource-model details left as TODO markers pending
team-ai's post-pivot answer.

GA deliverables (master Weeks 4-6): four pages stubbed with full outlines
and TODO markers keyed to specific Open Questions in the Confluence
companion plan, so live walkthroughs on adp-production can resolve them
in place:

- create-guardrail.adoc — how-to for configuring + attaching
- types-reference.adoc — config schema reference per evaluator
- violations.adoc — read & troubleshoot violations
- cost-tracking.adoc — per-evaluator cost shape + capping

Nav: replaces the flat governance:guardrails.adoc placeholder under
Trust & Governance with a Guardrails parent + 5 children. The 3-line
governance/pages/guardrails.adoc placeholder is deleted in this commit.

Cross-link to governance:dashboard/index.adoc is transient-broken until
adp-docs PR #5 (Governance Dashboard) merges; same pattern as Budgets
PR #6.

Refs: DOC-2113.
Plan: https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438

Co-Authored-By: Claude Opus 4.7 (1M context)

# Conflicts:
#	modules/ROOT/nav.adoc
---
 modules/ROOT/nav.adoc                         |  12 +-
 modules/governance/pages/guardrails.adoc      |   4 -
 .../pages/guardrails/cost-tracking.adoc       |  72 ++++++++++++
 .../pages/guardrails/create-guardrail.adoc    | 111 ++++++++++++++++++
 .../governance/pages/guardrails/index.adoc    |  77 ++++++++++++
 .../pages/guardrails/types-reference.adoc     |  80 +++++++++++++
 .../pages/guardrails/violations.adoc          |  99 ++++++++++++++++
 7 files changed, 446 insertions(+), 9 deletions(-)
 delete mode 100644 modules/governance/pages/guardrails.adoc
 create mode 100644 modules/governance/pages/guardrails/cost-tracking.adoc
 create mode 100644 modules/governance/pages/guardrails/create-guardrail.adoc
 create mode 100644 modules/governance/pages/guardrails/index.adoc
 create mode 100644 modules/governance/pages/guardrails/types-reference.adoc
 create mode 100644 modules/governance/pages/guardrails/violations.adoc

diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index f07e6a7..21ac76a 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -54,11 +54,13 @@
 *** For Builders
 **** xref:ai-gateway:builders/discover-gateways.adoc[Discover gateways]
 * Trust & Governance
-** Governance dashboard
-*** xref:governance:dashboard/index.adoc[Read the overview]
-*** xref:governance:dashboard/agent-network.adoc[Agent Network]
-*** xref:governance:dashboard/violations.adoc[Authorization denials and violations]
-** xref:governance:guardrails.adoc[Configure guardrails]
+** xref:governance:dashboard.adoc[Governance dashboard]
+** Guardrails
+*** xref:governance:guardrails/index.adoc[Overview]
+*** xref:governance:guardrails/create-guardrail.adoc[Create a guardrail]
+*** xref:governance:guardrails/types-reference.adoc[Evaluator types]
+*** xref:governance:guardrails/violations.adoc[Read violations]
+*** xref:governance:guardrails/cost-tracking.adoc[Cost tracking]
 ** xref:governance:budgets.adoc[Token budgets and limits]
 ** xref:governance:kill-switch.adoc[Kill switch]
 * Observability
diff --git a/modules/governance/pages/guardrails.adoc b/modules/governance/pages/guardrails.adoc
deleted file mode 100644
index c211181..0000000
--- a/modules/governance/pages/guardrails.adoc
+++ /dev/null
@@ -1,4 +0,0 @@
-= Configure Guardrails
-:description: Set up safety guardrails for AI agent operations.
-
-// TODO: Add content
diff --git a/modules/governance/pages/guardrails/cost-tracking.adoc b/modules/governance/pages/guardrails/cost-tracking.adoc
new file mode 100644
index 0000000..3ecc8a1
--- /dev/null
+++ b/modules/governance/pages/guardrails/cost-tracking.adoc
@@ -0,0 +1,72 @@
+= Guardrail Cost Tracking
+:description: See what each evaluator costs, where the cost surfaces in transcripts and dashboards, and how guardrail spend interacts with token budgets.
+:page-topic-type: reference
+:personas: platform_admin
+// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
+
+include::ROOT:partial$adp-la.adoc[]
+
+// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a should-ship deliverable; the cost-pool integration with Budgets fills in once eng confirms whether evaluator cost flows into the user-facing budget pool, a separate guardrail-evaluator pool, or both. Open Qs C2, C3 in the companion plan.
+
+Use this reference to:
+
+* [ ] Recognize the cost shape of each evaluator type (PII, Toxicity, Custom webhook)
+* [ ] Locate guardrail-attributed cost in transcripts, metrics, and the governance dashboard
+* [ ] Understand how guardrail spend interacts with token budgets and what knobs you have to cap it
+
+== Per-evaluator cost shape
+
+Each evaluator type has a different cost shape:
+
+[cols="1,2,2"]
+|===
+|Type |Cost source |Where it surfaces
+
+|*PII*
+|No per-call LLM cost. Compute time only — negligible for regex; non-trivial for entity recognition if an NER model ships at GA.
+|No transcript cost line. Compute time absorbed into gateway latency metrics.
+
+|*Toxicity*
+|Per-call LLM cost. Counts against the *evaluator's configured upstream provider* — typically a small classifier model, separate from the user-facing LLM.
+|Per-call cost line in the transcript, alongside the user-facing LLM call. Aggregated into provider-breakdown views in the governance dashboard.
+
+|*Custom webhook*
+|Gateway charges nothing per call. Your webhook's compute cost is your own infrastructure expense.
+|Not captured in transcripts. Track in your webhook's own observability surface.
+|===
+
+== Where guardrail cost shows up
+
+Guardrail-attributed cost surfaces in three places, ordered from most granular to most aggregated:
+
+* *Transcripts* — per-call cost line per fired evaluator, recorded alongside the user-facing LLM call. See xref:observability:transcripts.adoc[Read a transcript].
+* *Metrics* — aggregate cost per guardrail per provider per time window. See xref:observability:metrics.adoc[Metrics].
+* *Governance dashboard* — guardrail-attributed spend appears in the spend view, broken down by provider. See xref:governance:dashboard/index.adoc[Read the governance overview].
+
+// TODO: confirm whether the dashboard's spend view distinguishes guardrail-evaluator spend from user-facing LLM spend. Open Q C3 in the companion plan.
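As a rough illustration of the Toxicity cost shape above, the following sketch estimates daily evaluator spend from request volume and the number of phases the evaluator runs at (`BOTH` means it runs on the prompt and on the response). The request volume and per-call price are made-up placeholders, not published pricing, and the model assumes one classifier call per phase per request.

```shell
#!/usr/bin/env bash
# Hypothetical estimate of Toxicity evaluator spend.
# All numbers are illustrative assumptions, not real pricing.

# estimate_cost REQUESTS PHASES PRICE_MICRO_USD
# Assumes one classifier call per phase per request.
# Prices are in micro-dollars so the arithmetic stays in integers.
estimate_cost() {
  local requests=$1 phases=$2 price_micro_usd=$3
  echo $(( requests * phases * price_micro_usd ))
}

requests_per_day=100000   # assumed traffic
price_micro_usd=50        # assumed cost of one classifier call

both=$(estimate_cost "$requests_per_day" 2 "$price_micro_usd")         # BOTH phases
output_only=$(estimate_cost "$requests_per_day" 1 "$price_micro_usd")  # OUTPUT only

echo "BOTH:        $(( both / 1000000 )) USD/day"
echo "OUTPUT only: $(( output_only / 1000000 )) USD/day"
```

Real classifier cost is usually token-based, so the prompt-side and response-side calls rarely cost exactly the same. Treat the two-to-one ratio as an approximation.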
+
+== Capping guardrail cost
+
+Guardrail spend can grow unexpectedly when traffic spikes or when a Toxicity guardrail runs at `BOTH` phases on a high-throughput provider. Three knobs control it:
+
+* *Per-guardrail toggle* — disable a guardrail to short-circuit its evaluator. The guardrail config is preserved; re-enable when ready. Useful as a kill switch when an evaluator's cost runs away.
+* *Phase scoping* — running a Toxicity evaluator at `OUTPUT` only (instead of `BOTH`) removes one of the two classifier calls per request, which roughly halves the per-request evaluator cost.
+* *Token budgets* — see xref:governance:budgets.adoc[Token budgets and limits]. Guardrail evaluator cost flows into the same spending-event pipeline as user-facing LLM cost; per-provider breakdowns separate the two.
+
+// TODO: confirm whether evaluator cost flows into the same budget pool as user-facing LLM cost, or a separate guardrail-evaluator pool. The master plan calls for "guardrail-cost separation documented" in the Budgets workflow GA scope. Open Q C2 in the companion plan.
+
+== Cost versus latency tradeoff
+
+Each evaluator type has a different cost-versus-latency profile:
+
+* *PII* is cheap and fast — regex-based detection adds milliseconds, no LLM call.
+* *Toxicity* is expensive and slow — the classifier call adds tokens and latency.
+* *Custom webhook* is whatever your webhook makes it — you control your own infrastructure spend and latency profile.
+
+A typical optimization: disable Toxicity on `INPUT` and run it only on `OUTPUT`. Most policy violations are about what the model generates, not what the user asks; cutting the `INPUT` phase roughly halves both the cost and the latency of the Toxicity guardrail without losing meaningful coverage.
+
+== Next steps
+
+* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — config schemas per evaluator type.
+* xref:governance:budgets.adoc[Token budgets and limits] — the spending-event pipeline that aggregates guardrail and user-facing LLM cost.
+* xref:governance:dashboard/index.adoc[Read the governance overview] — provider-breakdown view that shows guardrail-attributed spend.
diff --git a/modules/governance/pages/guardrails/create-guardrail.adoc b/modules/governance/pages/guardrails/create-guardrail.adoc
new file mode 100644
index 0000000..67e4f76
--- /dev/null
+++ b/modules/governance/pages/guardrails/create-guardrail.adoc
@@ -0,0 +1,111 @@
+= Create a Guardrail
+:description: Configure a guardrail, pick an evaluator type and phase, attach it to one or more LLM providers, and verify that it fires.
+:page-topic-type: how-to
+:personas: platform_admin
+// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
+:learning-objective-1: Create and configure a guardrail of a chosen evaluator type
+:learning-objective-2: Attach the guardrail to one or more LLM providers and enable it
+:learning-objective-3: Verify the guardrail fires and trace the violation through the transcript
+
+include::ROOT:partial$adp-la.adoc[]
+
+// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a must-ship deliverable; the live walkthrough fills in once eng confirms the post-pivot Guardrail resource shape and `aigwctl` (or the ADP UI) is reachable from a sandbox cluster.
+
+This page walks through configuring your first guardrail end-to-end: pick an evaluator type, choose a phase, fill in the per-type config, attach it to LLM providers, and confirm it fires.
+
+After reading this page, you will be able to:
+
+* [ ] {learning-objective-1}
+* [ ] {learning-objective-2}
+* [ ] {learning-objective-3}
+
+== Prerequisites
+
+* An ADP environment with at least one LLM provider configured. See xref:ai-gateway:configure-provider.adoc[Configure your LLM provider].
+* For a *Custom webhook* evaluator, a publicly reachable HTTPS endpoint that implements the gateway's webhook contract. See xref:governance:guardrails/types-reference.adoc[Evaluator types reference].
+* For evaluators that need their own credentials (for example, a hosted PII service), the credential stored in the ADP secret store under an `UPPER_SNAKE_CASE` name.
+
+// TODO: standalone-ADP wording. Replace with the concrete sign-in URL, IAM role, and OIDC audience once the standalone product surface ships. Open Q D1 in the companion plan.
+
+== Open the Guardrails surface
+
+// TODO: finalize this section once the ADP UI ships a Guardrails surface. As of 2026-04-28, `apps/adp-ui/src/routes/` has no `guardrails/` route. The walkthrough may need to lead with `aigwctl` instead of the UI and add the UI flow in a later refresh. Open Qs C4 and C5 in the companion plan.
+
+In the ADP UI, open *Trust & Governance* → *Guardrails* → *Create guardrail*.
+
+== Pick an evaluator type
+
+Choose one of the supported evaluator types:
+
+* *PII* — detects personally identifiable information using regex and entity-recognition rules. No per-call LLM cost.
+* *Toxicity* — runs content through a toxicity classifier. Per-call LLM cost.
+* *Custom webhook* — delegates the decision to your HTTPS endpoint. Gateway charges nothing per call.
+
+For each type's full config schema and behavior, see xref:governance:guardrails/types-reference.adoc[Evaluator types reference].
+
+// TODO: confirm the evaluator type set at GA. Open Q A5 in the companion plan.
+
+== Pick a phase
+
+Pick the phase or phases at which the evaluator runs:
+
+* `INPUT` — runs against the user's prompt before the gateway forwards it upstream.
+* `OUTPUT` — runs against the model's response before the gateway returns it to the caller.
+* `BOTH` — runs at both phases.
+
+Decision rule:
+
+* PII guardrails typically run at `BOTH` (defend against data exfiltration in both directions).
+* Toxicity guardrails typically run at `OUTPUT` only (filter what the model generates; INPUT-side toxicity filtering rarely improves outcomes).
+* Custom webhook depends on what your webhook does — start with `INPUT` for prompt-injection heuristics, `OUTPUT` for brand-safety lists, `BOTH` for either-direction checks.
+
+== Configure the evaluator
+
+Fill in the per-type config block. The form fields differ per evaluator type; see xref:governance:guardrails/types-reference.adoc[Evaluator types reference] for the full schema of each type.
+
+// TODO: walk through the PII form as the exemplar (most common starting case) once the post-pivot field set is confirmed. Lift exact field names and labels from the proto. Open Qs A1, A2 in the companion plan.
+
+== Attach to LLM providers
+
+Select one or more LLM providers to attach the guardrail to. Multi-attach is supported — one guardrail can apply to many providers.
+
+// TODO: confirm whether guardrails also attach at other scopes (agents, MCP servers, organizations). The pre-pivot proto attached via `provider_ids[]` and `route_ids[]`; routes were removed in cloudv2 commit `7eff2ecbbf`. Open Qs A3, A4 in the companion plan.
+
+== Enable the guardrail
+
+Toggle the guardrail to *Enabled*. Disabled guardrails skip evaluation entirely — useful when staging a new policy before turning it on, or when troubleshooting whether a guardrail is responsible for unexpected blocks.
+
+== Verify the guardrail fires
+
+Send a request through one of the attached providers that should trigger the guardrail. For example, with a PII guardrail attached on `INPUT`:
+
+// TODO: replace with a working `curl` one-liner against the proxy URL once the live walkthrough resolves authentication. Standalone-ADP wording until then.
+
+[source,bash]
+----
+curl -X POST https://your-adp-environment/v1/chat/completions \
+  -H "Authorization: Bearer $ADP_TOKEN" -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"My SSN is 123-45-6789"}]}'
+----
+
+The request should return an error. Open the request's transcript and confirm a violation entry appears for the guardrail. See xref:observability:transcripts.adoc[Read a transcript] for the transcript walkthrough and xref:governance:guardrails/violations.adoc[Read violations] for what to do when a violation surprises you.
+
+== Edit, disable, or delete
+
+* *Edit* — change the per-type config or the attached providers. Changes apply on the next request.
+* *Disable* — short-circuit the middleware without losing the config. Useful when staging or troubleshooting.
+* *Delete* — permanently remove the guardrail. If the guardrail is currently firing on production traffic, the UI requires confirmation.
+
+// TODO: confirm exact UI labels and the delete-confirmation copy once the UI ships. Open Q C4 in the companion plan.
+
+== Troubleshooting
+
+* *Evaluator returns false positives* — see xref:governance:guardrails/violations.adoc[Read violations] for tuning patterns per evaluator type.
+* *Evaluator times out or is unavailable* — see xref:governance:guardrails/violations.adoc[Read violations] for the evaluator-down section.
+* *Attached provider doesn't fire the guardrail* — confirm attachment (right provider, right phase), enabled state, and that requests are actually reaching the gateway (not bypassing via a direct provider URL).
+
+== Next steps
+
+* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — config schemas and gotchas per evaluator type.
+* xref:governance:guardrails/violations.adoc[Read violations] — investigate fired guardrails and tune false-positive rates.
+* xref:governance:guardrails/cost-tracking.adoc[Cost tracking] — see what each evaluator costs and where it shows up.
diff --git a/modules/governance/pages/guardrails/index.adoc b/modules/governance/pages/guardrails/index.adoc
new file mode 100644
index 0000000..f8919d4
--- /dev/null
+++ b/modules/governance/pages/guardrails/index.adoc
@@ -0,0 +1,77 @@
+= Guardrails Overview
+:description: Learn what guardrails are, the evaluator types you can choose from, the INPUT and OUTPUT phase model, and where violations show up.
+:page-topic-type: overview
+:personas: platform_admin, evaluator, app_developer
+// TODO: confirm persona vocabulary against docs-team-standards. The Guardrails plan uses canonical personas; if a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
+:learning-objective-1: Describe what a guardrail does and why you would attach one to an LLM provider
+:learning-objective-2: Distinguish between the three evaluator types — PII, Toxicity, and Custom webhook — and the situations each fits
+:learning-objective-3: Recognize where a guardrail violation surfaces and which page to read next
+
+include::ROOT:partial$adp-la.adoc[]
+
+A *guardrail* is a configurable safety or policy filter that runs on the request or response side of every LLM call routed through AI Gateway. Use a guardrail to prevent personally identifiable information (PII) from leaving your organization, filter toxic or off-policy responses before they reach end users, or delegate the decision to a custom webhook that enforces policy your way.
+
+After reading this page, you will be able to:
+
+* [ ] {learning-objective-1}
+* [ ] {learning-objective-2}
+* [ ] {learning-objective-3}
+
+== Where a guardrail runs
+
+Every guardrail runs at one or both of two phases:
+
+* *INPUT* — the gateway evaluates the user's prompt before forwarding it upstream. Use INPUT to stop sensitive content from reaching a third-party model in the first place.
+* *OUTPUT* — the gateway evaluates the model's response before returning it to the caller. Use OUTPUT to filter what the model generates.
+* *BOTH* — runs the evaluator at both phases. Common for PII (defend in both directions); rare for Toxicity (where INPUT-side filtering is usually less useful).
+
+Streaming responses change the timing slightly: where async evaluation is supported, OUTPUT evaluators run alongside the stream rather than blocking it. Sync evaluators (and all INPUT evaluators) run before the request continues.
+
+// TODO: confirm shipping async-vs-sync behavior at GA. Open Q B1 in the companion plan.
+
+== The three evaluator types
+
+[cols="1,2,2,2"]
+|===
+|Type |What it does |Where it fits |Cost shape
+
+|*PII*
+|Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) in text using regex and entity-recognition rules.
+|Defending against data exfiltration to third-party models. Typically runs at `BOTH` phases.
+|No per-call LLM cost. Compute time only.
+
+|*Toxicity*
+|Runs the input or output through a toxicity classifier and flags content above a configurable threshold.
+|Filtering what the model generates. Typically runs at `OUTPUT` only.
+|Per-call LLM cost — counts against the evaluator's configured upstream provider, not the user-facing LLM.
+
+|*Custom webhook*
+|Delegates the decision to a user-provided HTTPS endpoint. The gateway POSTs the content to your endpoint and acts on the pass/block response.
+|Enforcing org-specific policy that doesn't fit PII or Toxicity (for example, prompt-injection heuristics, jailbreak detection, brand-safety lists).
+|No gateway charge per call. Your webhook's compute cost is your own.
+|===
+
+For per-type config schemas, supported phases, and behavior on match, see xref:governance:guardrails/types-reference.adoc[Evaluator types reference].
+
+// TODO: confirm the evaluator type set shipping at GA. RFC 0002 specifies PII + Toxicity + Custom webhook. The phase5-aigw-guardrails branch in cloudv2 ships PII + a "keyword" evaluator that may rename to Toxicity, stay as a fourth type, or be dropped. Open Q A5 in the companion plan.
+
+== What happens when a guardrail fires
+
+When an evaluator decides to block a request, the gateway stops forwarding it (or stops returning the response, on OUTPUT) and returns an error to the caller. Every fired guardrail records a *violation* entry on the request's transcript, captured in the same observability pipeline that records the LLM call itself. Read the transcript to see which guardrail fired, at which phase, and what content matched. See xref:observability:transcripts.adoc[Read a transcript].
+
+A different scenario — the evaluator itself errored out (for example, a custom webhook timed out or a classifier model is unavailable) — is handled separately. See xref:governance:guardrails/violations.adoc[Read violations] for evaluator-down behavior, fail-closed versus fail-open defaults, and per-guardrail overrides.
+
+// TODO: confirm fail-closed vs. fail-open default at GA, and whether it's configurable per guardrail. Open Qs B2 and B5 in the companion plan.
+
+== Where you attach a guardrail
+
+A guardrail attaches to one or more LLM providers. Each provider can carry many guardrails — a typical setup pairs one PII guardrail with one Toxicity guardrail on the same provider, then layers a Custom-webhook guardrail on top for org-specific policy.
+
+// TODO: confirm whether guardrails also attach at other scopes — agents, MCP servers, organizations — once team-ai answers the post-pivot resource-shape question. The pre-pivot proto attached via `provider_ids[]` and `route_ids[]`; routes were removed in cloudv2 commit `7eff2ecbbf`. Open Qs A1, A3, A4 in the companion plan.
+
+== Where to go next
+
+* xref:governance:guardrails/create-guardrail.adoc[Create a guardrail] — walk through configuring and attaching your first guardrail.
+* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — full config schemas for PII, Toxicity, and Custom-webhook evaluators.
+* xref:governance:guardrails/violations.adoc[Read violations] — investigate why a guardrail fired and tune false-positive rates.
+* xref:governance:guardrails/cost-tracking.adoc[Cost tracking] — see what each evaluator costs and where the cost surfaces.
diff --git a/modules/governance/pages/guardrails/types-reference.adoc b/modules/governance/pages/guardrails/types-reference.adoc
new file mode 100644
index 0000000..d1444a6
--- /dev/null
+++ b/modules/governance/pages/guardrails/types-reference.adoc
@@ -0,0 +1,80 @@
+= Evaluator Types Reference
+:description: Definitive reference for every evaluator type's config schema, supported phases, behavior on match, and gotchas.
+:page-topic-type: reference
+:personas: platform_admin
+// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
+
+include::ROOT:partial$adp-la.adoc[]
+
+// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a must-ship deliverable; the per-type config schemas are filled in once eng confirms the post-pivot Guardrail resource shape and the evaluator type set at GA. Open Qs A1, A2, A5, B3, B4 in the companion plan.
+
+Use this reference to:
+
+* [ ] Look up the config schema for a given evaluator type
+* [ ] Identify which phases each evaluator type supports
+* [ ] Recognize the behavior-on-match options each evaluator type exposes
+
+Each evaluator type has its own config schema, supported phase set, behavior-on-match options, and cost shape. The sections below cover all three types currently planned for GA.
+
+== PII evaluator
+
+*What it does* — detects personally identifiable information (names, emails, phone numbers, SSNs, addresses, and other entity types) in text using regex and entity-recognition rules.
+
+*Phases supported* — `INPUT`, `OUTPUT`, `BOTH`.
+
+// TODO: lift the full config schema from `apps/aigw/internal/guardrails/pii.go` once the post-pivot proto is final. Document each field's name, type, default, and example. Likely fields: entity types to detect (allowlist or denylist), locale (US-only patterns versus EU patterns), confidence threshold.
+
+*Behavior on match* — block (default). Redact-and-pass behavior may be configurable; confirm whether the GA build exposes redact-mode and how it interacts with the per-type config.
+
+// TODO: confirm block-vs-redact options at GA. Open Q B3 in the companion plan.
+
+*Cost* — none beyond compute. Regex is negligible; entity recognition can be non-trivial if an NER model ships at GA.
+
+*Gotchas:*
+
+* Regex-based detection produces false positives on uncommon PII formats (international phone numbers, non-US SSN equivalents).
+* Locale-specific patterns matter — a US-tuned config will miss EU PII patterns and vice versa.
+* PII matches in code blocks or quoted JSON payloads can produce surprising blocks; tune the entity allowlist if your traffic includes structured payloads.
+
+== Toxicity evaluator
+
+*What it does* — runs the input or output through a toxicity classifier and flags content above a configurable threshold.
+
+*Phases supported* — typically `OUTPUT`; `INPUT` and `BOTH` also valid but rarely useful.
+
+// TODO: confirm the config schema with eng. The phase5-aigw-guardrails branch in cloudv2 ships a "keyword" evaluator that may rename to Toxicity at GA, stay as a fourth type, or be dropped entirely. Open Q A5 in the companion plan. Likely fields: classifier model identifier, threshold (0.0–1.0), category set to flag (hate, harassment, self-harm, sexual, violence, etc.).
+
+*Behavior on match* — block.
+
+*Cost* — per-call LLM cost. Counts against the *evaluator's configured upstream provider* (typically a small classifier model, separate from the user-facing LLM). Token cost surfaces alongside the user-facing LLM call in the same transcript. See xref:governance:guardrails/cost-tracking.adoc[Cost tracking].
+
+*Gotchas:*
+
+* Threshold tuning matters — too aggressive blocks legitimate traffic; too permissive lets toxic content through. Start at the classifier's recommended default and tune from violation review.
+* The classifier call adds to overall response time. If async-OUTPUT evaluation isn't supported for your model's stream type, the user-visible latency includes the classifier call.
+* The classifier model itself can fail or be down. See xref:governance:guardrails/violations.adoc[Read violations] for the evaluator-down section.
+
+== Custom webhook evaluator
+
+*What it does* — delegates the evaluation to a user-provided HTTPS endpoint. The gateway POSTs the content to your endpoint and acts on the response.
+
+*Phases supported* — `INPUT`, `OUTPUT`, `BOTH`.
+
+*Webhook contract:*
+
+// TODO: lift the exact request and response shape from `apps/aigw/internal/guardrails/registry.go` and the custom-webhook handler once the webhook contract lands. Open Q B4 in the companion plan.
+
+* *Request shape* — the gateway POSTs a JSON document containing the phase (`INPUT` or `OUTPUT`), the content payload (prompt or response text), request metadata (request ID for correlation, model identifier, attached provider), and any extra fields the contract specifies.
+* *Response shape* — your endpoint returns a JSON document containing the decision (`pass` or `block`), an optional reason string surfaced in the violation entry, and (if redact-mode is supported) an optional redacted-content payload.
+* *Authentication* — the gateway authenticates to your webhook using a shared secret stored in the ADP secret store. mTLS or signed-JWT alternatives may be available.
+* *Retry / timeout* — the gateway honors a default per-call timeout. When the webhook is unavailable, the evaluator-down behavior applies (see xref:governance:guardrails/violations.adoc[Read violations]).
+
+// TODO: confirm webhook authentication options at GA. Open Q B4c in the companion plan.
+
+*Cost* — gateway charges nothing per call. Your webhook's compute cost is your own.
+
+*Gotchas:*
+
+* Slow webhooks add to user-visible latency, especially on `INPUT` (the request is blocked until the webhook responds).
+* Webhook errors should fail closed (block) by default for safety; override this per guardrail only if your use case favors availability.
+* Logging and observability of the webhook itself are your responsibility — the gateway only records the decision the webhook returned.
diff --git a/modules/governance/pages/guardrails/violations.adoc b/modules/governance/pages/guardrails/violations.adoc
new file mode 100644
index 0000000..ab4cd17
--- /dev/null
+++ b/modules/governance/pages/guardrails/violations.adoc
@@ -0,0 +1,99 @@
+= Read Violations
+:description: Investigate why a guardrail fired, distinguish a violation from an evaluator failure, and tune the configuration.
+:page-topic-type: how-to
+:personas: app_developer, platform_admin
+// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
+:learning-objective-1: Locate a violation entry in a transcript and identify which guardrail fired
+:learning-objective-2: Distinguish a guardrail violation from an evaluator failure and apply the right response
+:learning-objective-3: Recognize common false-positive patterns per evaluator type and tune the configuration
+
+include::ROOT:partial$adp-la.adoc[]
+
+// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a must-ship deliverable; the violation field shape and async-vs-sync timing fill in once eng confirms the transcript proto integration and the shipping evaluation behavior. Open Qs B1, B2, B5, C1 in the companion plan.
+
+A guardrail *violation* fires when an evaluator decides to block (or redact) a request. This page covers what a violation looks like in the transcript, how to distinguish it from an evaluator failure, common false-positive patterns per evaluator type, and what happens when an evaluator can't run.
+
+After reading this page, you will be able to:
+
+* [ ] {learning-objective-1}
+* [ ] {learning-objective-2}
+* [ ] {learning-objective-3}
+
+== What a violation is
+
+A violation is the gateway's record that a guardrail's evaluator matched content on a specific request, at a specific phase. Every violation carries the guardrail name, the phase (`INPUT` or `OUTPUT`), a redacted summary of what content matched, and the action the gateway took (block, redact, or pass-through-with-warning).
+
+A violation is distinct from an *evaluator failure*: a failure is when the evaluator itself errored out (custom webhook timed out, classifier model unavailable, regex parser crashed). Failures and violations surface differently and are handled differently. See xref:_evaluator_down_behavior[Evaluator-down behavior] below.
+
+== Where violations show up
+
+Violations surface in two places:
+
+* *Transcripts* — each request's transcript carries a violation entry per fired guardrail, alongside the LLM call entry, tool calls, and cost data. See xref:observability:transcripts.adoc[Read a transcript] for the full transcript walkthrough.
+* *Metrics* — aggregate violation counts per guardrail per provider per time window. See xref:observability:metrics.adoc[Metrics].
+
+// TODO: confirm the violation field shape in the transcript proto. The Transcripts plan (workflow #7) didn't call out a violation field specifically; coordinate with that workflow's author so the xref above resolves to a real proto field. Open Q C1 in the companion plan.
+
+== Read a violation entry
+
+Open the transcript for a request that fired a guardrail and walk through the violation entry:
+
+* *Guardrail name* — the human-readable identifier you assigned at create time.
+* *Phase* — `INPUT` (matched the user's prompt) or `OUTPUT` (matched the model's response).
+* *Matched content* — a redacted summary, not the full payload. The full payload remains in the request body itself; the violation entry is a pointer.
+* *Action taken* — `block` (request stopped, error returned to caller), `redact` (matched fields stripped, request continued), or `pass-through-with-warning` (request continued, violation logged for review).
+
+// TODO: confirm action-taken value set at GA. Open Q B3 in the companion plan.
+
+== Common false-positive patterns
+
+Use these patterns as a starting checklist when a guardrail fires unexpectedly.
+
+=== PII
+
+* *Regex too broad* — the entity allowlist matched on a benign substring. Tune the entity types to the specific PII categories you care about.
+* *Locale mismatch* — the config is tuned for US patterns but traffic includes EU PII (or vice versa). Add the relevant locale's pattern set or scope the guardrail to a per-region provider.
+* *Structured-payload matches* — code blocks, JSON payloads, or sample data contain strings that resemble PII. Adjust the allowlist or scope the guardrail to user-text fields only.
+
+=== Toxicity
+
+* *Threshold too aggressive* — the evaluator flags content above the threshold, so raise the threshold and re-evaluate against the violation history. Categories matter too — disable categories that don't apply to your use case.
+* *Wrong phase* — `INPUT` toxicity blocks the user from asking certain questions; usually `OUTPUT` is what you want. Switch the phase to `OUTPUT` only.
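To make the threshold direction concrete, here is a minimal sketch of the comparison a Toxicity evaluator is described as making. It assumes the classifier returns a score between 0.0 and 1.0 and that content is flagged when the score is above the threshold. The function name and the sample scores are hypothetical, not part of the gateway.

```shell
#!/usr/bin/env bash
# Hypothetical threshold check; not the gateway's implementation.
# Assumes scores in 0.0-1.0 and "flag when score is above threshold".

# toxicity_action SCORE THRESHOLD -> prints "block" or "pass"
toxicity_action() {
  local score=$1 threshold=$2
  # awk does the floating-point comparison; shell arithmetic is integer-only.
  if awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s > t) }'; then
    echo "block"
  else
    echo "pass"
  fi
}

toxicity_action 0.62 0.50   # block: a low threshold flags borderline content
toxicity_action 0.62 0.80   # pass: a higher threshold lets it through
toxicity_action 0.93 0.80   # block: clearly toxic content is still caught
```

Under these assumptions, a lower threshold blocks more traffic and a higher threshold blocks less, which is why an overly aggressive guardrail is tuned by moving the threshold up.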
+
+=== Custom webhook
+
+* *Webhook returning `block` for legitimate content* — the bug is in your webhook. Add logging to your webhook to see which inputs are being flagged.
+* *Webhook timing out* — see xref:_evaluator_down_behavior[Evaluator-down behavior] below; the gateway fails closed (or open, depending on configuration) when a webhook can't be reached.
+
+== Guardrail expected to fire but didn't
+
+If you expect a guardrail to fire and it doesn't:
+
+* Confirm the guardrail is *enabled*. Disabled guardrails skip evaluation entirely.
+* Confirm the *attached provider* is the one the request actually used. A guardrail attached to `provider-a` doesn't fire on requests routed to `provider-b`.
+* Confirm the *phase alignment*. A guardrail set to `INPUT` only doesn't fire on the response side. A guardrail set to `OUTPUT` only doesn't fire on the request side.
+* Confirm the request *actually reached the gateway* — direct-to-provider requests that bypass the gateway are invisible to guardrails.
+
+[#_evaluator_down_behavior]
+== Evaluator-down behavior
+
+When an evaluator can't run (custom webhook unreachable, classifier model down, internal evaluator panic), the gateway has two options:
+
+* *Fail closed* — block the request. This is the safety-first choice: it preserves the policy at the cost of availability.
+* *Fail open* — pass the request through. This is the availability-first choice: it preserves throughput at the cost of policy coverage.
+
+// TODO: confirm the GA default and whether it's configurable per guardrail. Open Qs B2, B5 in the companion plan.
+
+The default at GA is fail-closed. A per-guardrail override is available for guardrails where availability is more important than policy coverage (for example, a brand-safety webhook that's not critical to organizational compliance).
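The difference between a violation and an evaluator failure, and the fail-closed versus fail-open choice, can be sketched as below. This is a minimal illustration under stated assumptions, not the gateway's implementation: `FailureMode`, `EvaluatorUnavailable`, `decide`, and the two sample evaluators are hypothetical names invented for the example.

```python
from enum import Enum


class FailureMode(Enum):
    """Hypothetical per-guardrail setting; not a gateway config field."""
    FAIL_CLOSED = "fail_closed"
    FAIL_OPEN = "fail_open"


class EvaluatorUnavailable(Exception):
    """Raised by a (hypothetical) evaluator that cannot produce a decision."""


def decide(evaluator, content, failure_mode=FailureMode.FAIL_CLOSED):
    """Return (allowed, outcome).

    outcome is 'pass', 'violation', or 'evaluator_failure', so a policy
    match and a broken evaluator are never reported as the same thing.
    """
    try:
        decision = evaluator(content)
    except EvaluatorUnavailable:
        # The evaluator never ran: apply the configured failure mode.
        return failure_mode is FailureMode.FAIL_OPEN, "evaluator_failure"
    if decision == "block":
        return False, "violation"
    return True, "pass"


def keyword_evaluator(content):
    """Toy evaluator: blocks content containing the word 'forbidden'."""
    return "block" if "forbidden" in content else "pass"


def flaky_webhook(content):
    """Toy evaluator that simulates a webhook timeout."""
    raise EvaluatorUnavailable("webhook timed out")


print(decide(keyword_evaluator, "hello"))
print(decide(keyword_evaluator, "forbidden text"))
print(decide(flaky_webhook, "hello"))
print(decide(flaky_webhook, "hello", FailureMode.FAIL_OPEN))
```

Keeping the outcome separate from the allow/deny result is one way to make a blocked request caused by an outage distinguishable from one caused by a real policy match when reading logs.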
+
+== Async versus sync evaluation
+
+Per the AI Gateway design, evaluators run asynchronously where possible. Specifically, `OUTPUT` evaluators on non-streaming responses can complete in parallel with response delivery. `INPUT` evaluators always run synchronously: the request is held until the evaluator returns, because the pass/block decision has to be made before the request can dispatch upstream.
+
+// TODO: confirm shipping behavior at GA — the design intent is async OUTPUT where possible, but streaming responses change the timing. Open Q B1 in the companion plan.
+
+== Next steps
+
+* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — config schemas and per-type tuning surface.
+* xref:governance:guardrails/cost-tracking.adoc[Cost tracking] — what each evaluator costs and where the cost surfaces.
+* xref:observability:transcripts.adoc[Read a transcript] — the full transcript walkthrough for finding a specific violation.