12 changes: 7 additions & 5 deletions modules/ROOT/nav.adoc
@@ -54,11 +54,13 @@
*** For Builders
**** xref:ai-gateway:builders/discover-gateways.adoc[Discover gateways]
* Trust & Governance
-** Governance dashboard
-*** xref:governance:dashboard/index.adoc[Read the overview]
-*** xref:governance:dashboard/agent-network.adoc[Agent Network]
-*** xref:governance:dashboard/violations.adoc[Authorization denials and violations]
-** xref:governance:guardrails.adoc[Configure guardrails]
+** xref:governance:dashboard.adoc[Governance dashboard]
+** Guardrails
+*** xref:governance:guardrails/index.adoc[Overview]
+*** xref:governance:guardrails/create-guardrail.adoc[Create a guardrail]
+*** xref:governance:guardrails/types-reference.adoc[Evaluator types]
+*** xref:governance:guardrails/violations.adoc[Read violations]
+*** xref:governance:guardrails/cost-tracking.adoc[Cost tracking]
** xref:governance:budgets.adoc[Token budgets and limits]
** xref:governance:kill-switch.adoc[Kill switch]
* Observability
4 changes: 0 additions & 4 deletions modules/governance/pages/guardrails.adoc

This file was deleted.

72 changes: 72 additions & 0 deletions modules/governance/pages/guardrails/cost-tracking.adoc
@@ -0,0 +1,72 @@
= Guardrail Cost Tracking
:description: See what each evaluator costs, where the cost surfaces in transcripts and dashboards, and how guardrail spend interacts with token budgets.
:page-topic-type: reference
:personas: platform_admin
// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.

include::ROOT:partial$adp-la.adoc[]

// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a should-ship deliverable; the cost-pool integration with Budgets fills in once eng confirms whether evaluator cost flows into the user-facing budget pool, a separate guardrail-evaluator pool, or both. Open Qs C2, C3 in the companion plan.

Use this reference to:

* [ ] Recognize the cost shape of each evaluator type (PII, Toxicity, Custom webhook)
* [ ] Locate guardrail-attributed cost in transcripts, metrics, and the governance dashboard
* [ ] Understand how guardrail spend interacts with token budgets and what knobs you have to cap it

== Per-evaluator cost shape

Each evaluator type has a different cost shape:

[cols="1,2,2"]
|===
|Type |Cost source |Where it surfaces

|*PII*
|No per-call LLM cost. Compute time only — negligible for regex; non-trivial for entity-recognition if a NER model ships at GA.
|No transcript cost line. Compute time absorbed into gateway latency metrics.

|*Toxicity*
|Per-call LLM cost. Counts against the *evaluator's configured upstream provider* — typically a small classifier model, separate from the user-facing LLM.
|Per-call cost line in the transcript, alongside the user-facing LLM call. Aggregated into provider-breakdown views in the governance dashboard.

|*Custom webhook*
|Gateway charges nothing per call. Your webhook's compute cost is your own infrastructure expense.
|Not captured in transcripts. Track in your webhook's own observability surface.
|===

== Where guardrail cost shows up

Guardrail-attributed cost surfaces in three places, ordered from most granular to most aggregated:

* *Transcripts* — per-call cost line per fired evaluator, recorded alongside the user-facing LLM call. See xref:observability:transcripts.adoc[Read a transcript].
* *Metrics* — aggregate cost per guardrail per provider per time window. See xref:observability:metrics.adoc[Metrics].
* *Governance dashboard* — guardrail-attributed spend appears in the spend view, broken down by provider. See xref:governance:dashboard/index.adoc[Read the governance overview].

// TODO: confirm whether the dashboard's spend view distinguishes guardrail-evaluator spend from user-facing LLM spend. Open Q C3 in the companion plan.

== Capping guardrail cost

Guardrail spend can grow unexpectedly when traffic spikes or when a Toxicity guardrail runs at `BOTH` phases on a high-throughput provider. Three knobs control it:

* *Per-guardrail toggle* — disable a guardrail to short-circuit its evaluator. The guardrail config is preserved; re-enable when ready. Useful as a kill switch when an evaluator's cost runs away.
* *Phase scoping* — running a Toxicity evaluator at `OUTPUT` only (instead of `BOTH`) halves the per-request cost.
* *Token budgets* — see xref:governance:budgets.adoc[Token budgets and limits]. Guardrail evaluator cost flows into the same spending-event pipeline as user-facing LLM cost; per-provider breakdowns separate the two.

// TODO: confirm whether evaluator cost flows into the same budget pool as user-facing LLM cost, or a separate guardrail-evaluator pool. The master plan calls for "guardrail-cost separation documented" in the Budgets workflow GA scope. Open Q C2 in the companion plan.
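
If you script the per-guardrail toggle rather than click it, the call shape depends on the management API, which this page doesn't document. The sketch below assumes a hypothetical REST route and an `enabled` field, both invented for illustration; confirm the real surface (UI or `aigwctl`) before relying on it.

[source,python]
----
# Hypothetical sketch: disable a runaway guardrail from a script.
# The endpoint path, resource ID, and `enabled` field are assumptions,
# not a documented API; confirm the real management surface before use.
import os
import requests

ADP_API = os.environ["ADP_API_URL"]   # for example, https://your-adp-environment
TOKEN = os.environ["ADP_TOKEN"]

def set_guardrail_enabled(guardrail_id: str, enabled: bool) -> None:
    """Flip the per-guardrail toggle; the config is preserved either way."""
    resp = requests.patch(
        f"{ADP_API}/v1/guardrails/{guardrail_id}",   # hypothetical route
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"enabled": enabled},
        timeout=10,
    )
    resp.raise_for_status()

# Kill switch: short-circuit the evaluator without losing its config.
set_guardrail_enabled("toxicity-prod", enabled=False)
----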

== Cost versus latency tradeoff

Each evaluator type has a different cost-versus-latency profile:

* *PII* is cheap and fast — regex-based detection adds milliseconds, no LLM call.
* *Toxicity* is expensive and slow — the classifier call adds tokens and latency.
* *Custom webhook* is whatever your webhook makes it; you control your own infrastructure spend and latency profile.

A typical optimization: disable Toxicity on `INPUT` and run it only on `OUTPUT`. Most policy violations are about what the model generates, not what the user asks; cutting the `INPUT` phase halves both the cost and the latency of the Toxicity guardrail without losing meaningful coverage.
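
To put numbers on that halving claim, here is a back-of-envelope sketch. The request volume, token count, and per-token price are placeholder assumptions, not measured values; substitute your classifier's real pricing.

[source,python]
----
# Back-of-envelope cost model for a Toxicity guardrail.
# All numbers below are illustrative assumptions, not real prices.
requests_per_day = 1_000_000
tokens_per_check = 500          # tokens sent to the classifier per check
price_per_1k_tokens = 0.0005    # hypothetical classifier price, USD

def daily_cost(phases: int) -> float:
    """Evaluator spend per day; `phases` is 1 for OUTPUT-only, 2 for BOTH."""
    checks = requests_per_day * phases
    return checks * tokens_per_check / 1000 * price_per_1k_tokens

print(f"BOTH:        ${daily_cost(2):,.2f}/day")   # $500.00/day
print(f"OUTPUT only: ${daily_cost(1):,.2f}/day")   # $250.00/day (half)
----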

== Next steps

* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — config schemas per evaluator type.
* xref:governance:budgets.adoc[Token budgets and limits] — the spending-event pipeline that aggregates guardrail and user-facing LLM cost.
* xref:governance:dashboard/index.adoc[Read the governance overview] — provider-breakdown view that shows guardrail-attributed spend.
111 changes: 111 additions & 0 deletions modules/governance/pages/guardrails/create-guardrail.adoc
@@ -0,0 +1,111 @@
= Create a Guardrail
:description: Configure a guardrail, pick an evaluator type and phase, attach it to one or more LLM providers, and verify that it fires.
:page-topic-type: how-to
:personas: platform_admin
// TODO: confirm persona vocabulary against docs-team-standards. If a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
:learning-objective-1: Create and configure a guardrail of a chosen evaluator type
:learning-objective-2: Attach the guardrail to one or more LLM providers and enable it
:learning-objective-3: Verify the guardrail fires and trace the violation through the transcript

include::ROOT:partial$adp-la.adoc[]

// TODO: this page lands at GA. The Guardrails plan (https://redpandadata.atlassian.net/wiki/spaces/DOC/pages/1881702438) lists this page as a must-ship deliverable; the live walkthrough fills in once eng confirms the post-pivot Guardrail resource shape and `aigwctl` (or the ADP UI) is reachable from a sandbox cluster.

This page walks through configuring your first guardrail end-to-end: pick an evaluator type, choose a phase, fill in the per-type config, attach it to LLM providers, and confirm it fires.

After reading this page, you will be able to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}
* [ ] {learning-objective-3}

== Prerequisites

* An ADP environment with at least one LLM provider configured. See xref:ai-gateway:configure-provider.adoc[Configure your LLM provider].
* For a *Custom webhook* evaluator, a publicly reachable HTTPS endpoint that implements the gateway's webhook contract. See xref:governance:guardrails/types-reference.adoc[Evaluator types reference].
* For evaluators that need their own credentials (for example, a hosted PII service), the credential stored in the ADP secret store under an `UPPER_SNAKE_CASE` name.

// TODO: standalone-ADP wording. Replace with the concrete sign-in URL, IAM role, and OIDC audience once the standalone product surface ships. Open Q D1 in the companion plan.

== Open the Guardrails surface

// TODO: finalize this section once the ADP UI ships a Guardrails surface. As of 2026-04-28, `apps/adp-ui/src/routes/` has no `guardrails/` route. The walkthrough may need to lead with `aigwctl` instead of the UI and add the UI flow in a later refresh. Open Qs C4 and C5 in the companion plan.

In the ADP UI, open *Trust & Governance* → *Guardrails* → *Create guardrail*.

== Pick an evaluator type

Choose one of the supported evaluator types:

* *PII* — detects personally identifiable information using regex and entity-recognition rules. No per-call LLM cost.
* *Toxicity* — runs content through a toxicity classifier. Per-call LLM cost.
* *Custom webhook* — delegates the decision to your HTTPS endpoint. No gateway charge per call.

For each type's full config schema and behavior, see xref:governance:guardrails/types-reference.adoc[Evaluator types reference].

// TODO: confirm the evaluator type set at GA. Open Q A5 in the companion plan.

== Pick a phase

Pick the phase or phases at which the evaluator runs:

* `INPUT` — runs against the user's prompt before the gateway forwards it upstream.
* `OUTPUT` — runs against the model's response before the gateway returns it to the caller.
* `BOTH` — runs at both phases.

Decision rule:

* PII guardrails typically run at `BOTH` (defend against data exfiltration in both directions).
* Toxicity guardrails typically run at `OUTPUT` only (filter what the model generates; `INPUT`-side toxicity filtering rarely improves outcomes).
* Custom webhook depends on what your webhook does — start with `INPUT` for prompt-injection heuristics, `OUTPUT` for brand-safety lists, `BOTH` for either-direction checks.

== Configure the evaluator

Fill in the per-type config block. The form fields differ per evaluator type; see xref:governance:guardrails/types-reference.adoc[Evaluator types reference] for the full schema of each type.

// TODO: walk through the PII form as the exemplar (most common starting case) once the post-pivot field set is confirmed. Lift exact field names and labels from the proto. Open Qs A1, A2 in the companion plan.
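
Until the form ships, the sketch below shows the general shape of a per-type config as plain data. Every field name in it (`type`, `phase`, `rules`, `action`, `enabled`) is invented for illustration; lift the real names from the xref:governance:guardrails/types-reference.adoc[Evaluator types reference] once confirmed.

[source,python]
----
# Hypothetical PII guardrail config, expressed as a plain dict.
# Field names and values are illustrative assumptions only; the real
# schema lands in the Evaluator types reference once the form ships.
pii_guardrail = {
    "name": "block-pii-prod",
    "type": "PII",                         # PII | Toxicity | Custom webhook
    "phase": "BOTH",                       # INPUT | OUTPUT | BOTH
    "rules": ["EMAIL", "PHONE", "SSN"],    # entity classes to match
    "action": "BLOCK",                     # what to do on match
    "enabled": False,                      # stage disabled, verify, then enable
}
----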

== Attach to LLM providers

Select one or more LLM providers to attach the guardrail to. Multi-attach is supported — one guardrail can apply to many providers.

// TODO: confirm whether guardrails also attach at other scopes (agents, MCP servers, organizations). The pre-pivot proto attached via `provider_ids[]` and `route_ids[]`; routes were removed in cloudv2 commit `7eff2ecbbf`. Open Qs A3, A4 in the companion plan.

== Enable the guardrail

Toggle the guardrail to *Enabled*. Disabled guardrails skip evaluation entirely — useful when staging a new policy before turning it on, or when troubleshooting whether a guardrail is responsible for unexpected blocks.

== Verify the guardrail fires

Send a request through one of the attached providers that should trigger the guardrail. For example, with a PII guardrail attached on `INPUT`:

// TODO: replace with a working `curl` one-liner against the proxy URL once the live walkthrough resolves authentication. Standalone-ADP wording until then.

[source,bash]
----
curl -X POST https://your-adp-environment/v1/chat/completions \
  -H "Authorization: Bearer $ADP_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"My SSN is 123-45-6789"}]}'
----

The request should return an error. Open the request's transcript and confirm a violation entry appears for the guardrail. See xref:observability:transcripts.adoc[Read a transcript] for the transcript walkthrough and xref:governance:guardrails/violations.adoc[Read violations] for what to do when a violation surprises you.
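
If you prefer scripting the check, the following Python sketch mirrors the `curl` placeholder above and asserts that the gateway blocked the request. The endpoint URL, the `ADP_TOKEN` variable, and the assumption that a block surfaces as a non-2xx status are all placeholders until the live walkthrough lands.

[source,python]
----
# Send a request that should trip a PII guardrail on INPUT, then
# assert the gateway blocked it. Endpoint and token are placeholders,
# and the non-2xx assumption is unconfirmed until the walkthrough lands.
import os
import requests

resp = requests.post(
    "https://your-adp-environment/v1/chat/completions",   # placeholder URL
    headers={"Authorization": f"Bearer {os.environ['ADP_TOKEN']}"},
    json={"messages": [{"role": "user", "content": "My SSN is 123-45-6789"}]},
    timeout=30,
)
assert not resp.ok, "guardrail did not fire; check attachment, phase, enabled state"
print("blocked as expected:", resp.status_code)
----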

== Edit, disable, or delete

* *Edit* — change the per-type config or the attached providers. Changes apply on the next request.
* *Disable* — short-circuit the middleware without losing the config. Useful when staging or troubleshooting.
* *Delete* — permanently remove the guardrail. If the guardrail is currently firing on production traffic, the UI requires confirmation.

// TODO: confirm exact UI labels and the delete-confirmation copy once the UI ships. Open Q C4 in the companion plan.

== Troubleshooting

* *Evaluator returns false positives* — see xref:governance:guardrails/violations.adoc[Read violations] for tuning patterns per evaluator type.
* *Evaluator times out or is unavailable* — see xref:governance:guardrails/violations.adoc[Read violations] for the evaluator-down section.
* *Attached provider doesn't fire the guardrail* — confirm the attachment (right provider, right phase), confirm the guardrail is enabled, and confirm requests are actually reaching the gateway rather than bypassing it via a direct provider URL.

== Next steps

* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — config schemas and gotchas per evaluator type.
* xref:governance:guardrails/violations.adoc[Read violations] — investigate fired guardrails and tune false-positive rates.
* xref:governance:guardrails/cost-tracking.adoc[Cost tracking] — see what each evaluator costs and where it shows up.
77 changes: 77 additions & 0 deletions modules/governance/pages/guardrails/index.adoc
@@ -0,0 +1,77 @@
= Guardrails Overview
:description: Learn what guardrails are, the evaluator types you can choose from, the INPUT and OUTPUT phase model, and where violations show up.
:page-topic-type: overview
:personas: platform_admin, evaluator, app_developer
// TODO: confirm persona vocabulary against docs-team-standards. The Guardrails plan uses canonical personas; if a Guardrails-specific persona exists (e.g., security_admin), apply it here. Open Q D4 in the companion plan.
:learning-objective-1: Describe what a guardrail does and why you would attach one to an LLM provider
:learning-objective-2: Distinguish between the three evaluator types — PII, Toxicity, and Custom webhook — and the situations each fits
:learning-objective-3: Recognize where a guardrail violation surfaces and which page to read next

include::ROOT:partial$adp-la.adoc[]

A *guardrail* is a configurable safety or policy filter that runs on the request or response side of every LLM call routed through AI Gateway. Use a guardrail to prevent personally identifiable information (PII) from leaving your organization, filter toxic or off-policy responses before they reach end users, or delegate the decision to a custom webhook that enforces policy your way.

After reading this page, you will be able to:

* [ ] {learning-objective-1}
* [ ] {learning-objective-2}
* [ ] {learning-objective-3}

== Where a guardrail runs

Every guardrail runs at one or both of two phases:

* *INPUT* — the gateway evaluates the user's prompt before forwarding it upstream. Use INPUT to stop sensitive content from reaching a third-party model in the first place.
* *OUTPUT* — the gateway evaluates the model's response before returning it to the caller. Use OUTPUT to filter what the model generates.
* *BOTH* — runs the evaluator at both phases. Common for PII (to defend in both directions); rare for Toxicity (where INPUT-side filtering is usually less useful).

Streaming responses change the timing slightly: where async evaluation is supported, OUTPUT evaluators run alongside the stream rather than blocking it. Sync evaluators (and all INPUT evaluators) run before the request continues.

// TODO: confirm shipping async-vs-sync behavior at GA. Open Q B1 in the companion plan.

== The three evaluator types

[cols="1,2,2,2"]
|===
|Type |What it does |Where it fits |Cost shape

|*PII*
|Detects personally identifiable information (names, emails, phone numbers, SSNs, etc.) in text using regex and entity-recognition rules.
|Defending against data exfiltration to third-party models. Typically runs at `BOTH` phases.
|No per-call LLM cost. Compute time only.

|*Toxicity*
|Runs the input or output through a toxicity classifier and flags content above a configurable threshold.
|Filtering what the model generates. Typically runs at `OUTPUT` only.
|Per-call LLM cost — counts against the evaluator's configured upstream provider, not the user-facing LLM.

|*Custom webhook*
|Delegates the decision to a user-provided HTTPS endpoint. The gateway POSTs the content to your endpoint and acts on the pass/block response.
|Enforcing org-specific policy that doesn't fit PII or Toxicity (for example, prompt-injection heuristics, jailbreak detection, brand-safety lists).
|No gateway charge per call. Your webhook's compute cost is your own.
|===

For per-type config schemas, supported phases, and behavior on match, see xref:governance:guardrails/types-reference.adoc[Evaluator types reference].
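
The Custom-webhook row implies a small contract: the gateway POSTs the content to your endpoint and acts on the pass/block answer. The request and response field names in the sketch below (`content`, `decision`) are assumptions for illustration; the real contract lives in the xref:governance:guardrails/types-reference.adoc[Evaluator types reference].

[source,python]
----
# Minimal sketch of a custom-webhook evaluator endpoint.
# The request/response field names ("content", "decision") are
# assumptions for illustration; the gateway's real webhook contract
# is documented in the Evaluator types reference.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

BLOCKLIST = {"internal-codename", "do-not-ship"}   # toy brand-safety list

class Evaluator(BaseHTTPRequestHandler):
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        text = body.get("content", "")              # assumed field name
        blocked = any(term in text.lower() for term in BLOCKLIST)
        reply = json.dumps({"decision": "block" if blocked else "pass"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)

if __name__ == "__main__":
    # Front with HTTPS in production; the gateway requires a reachable HTTPS endpoint.
    HTTPServer(("0.0.0.0", 8443), Evaluator).serve_forever()
----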

// TODO: confirm the evaluator type set shipping at GA. RFC 0002 specifies PII + Toxicity + Custom webhook. The phase5-aigw-guardrails branch in cloudv2 ships PII + a "keyword" evaluator that may rename to Toxicity, stay as a fourth type, or be dropped. Open Q A5 in the companion plan.

== What happens when a guardrail fires

When an evaluator decides to block a request, the gateway stops forwarding it (or stops returning the response, on OUTPUT) and returns an error to the caller. Every fired guardrail records a *violation* entry on the request's transcript, captured in the same observability pipeline that records the LLM call itself. Read the transcript to see which guardrail fired, at which phase, and what content matched. See xref:observability:transcripts.adoc[Read a transcript].

A different scenario — the evaluator itself errored out (for example, a custom webhook timed out or a classifier model is unavailable) — is handled separately. See xref:governance:guardrails/violations.adoc[Read violations] for evaluator-down behavior, fail-closed versus fail-open defaults, and per-guardrail overrides.

// TODO: confirm fail-closed vs. fail-open default at GA, and whether it's configurable per guardrail. Open Qs B2 and B5 in the companion plan.

== Where you attach a guardrail

A guardrail attaches to one or more LLM providers. Each provider can carry many guardrails — a typical setup pairs one PII guardrail with one Toxicity guardrail on the same provider, then layers a Custom-webhook guardrail on top for org-specific policy.

// TODO: confirm whether guardrails also attach at other scopes — agents, MCP servers, organizations — once team-ai answers the post-pivot resource-shape question. The pre-pivot proto attached via `provider_ids[]` and `route_ids[]`; routes were removed in cloudv2 commit `7eff2ecbbf`. Open Qs A1, A3, A4 in the companion plan.

== Where to go next

* xref:governance:guardrails/create-guardrail.adoc[Create a guardrail] — walk through configuring and attaching your first guardrail.
* xref:governance:guardrails/types-reference.adoc[Evaluator types reference] — full config schemas for PII, Toxicity, and Custom-webhook evaluators.
* xref:governance:guardrails/violations.adoc[Read violations] — investigate why a guardrail fired and tune false-positive rates.
* xref:governance:guardrails/cost-tracking.adoc[Cost tracking] — see what each evaluator costs and where the cost surfaces.