Commit bfc9724

Merge pull request #192 from ovindumandith/feat/nemoguard-content-safety
Feat/nemoguard content safety
2 parents ba092bd + 4c9eeda commit bfc9724

11 files changed

Lines changed: 1954 additions & 0 deletions

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ All available policies, sorted alphabetically.
 | [MCP Rewrite](./mcp-rewrite/v1.0/docs/mcp-rewrite.md) | MCP, AI | MCP Rewrite policy defines user-facing tools, resources, and prompts and maps them to backend capability names using optional "target" fields. |
 | [Model Round Robin](./model-round-robin/v1.0/docs/model-round-robin.md) | AI | Implements round-robin load balancing for AI models. |
 | [Model Weighted Round Robin](./model-weighted-round-robin/v1.0/docs/model-weighted-round-robin.md) | AI | Implements weighted round-robin load balancing for AI models. |
+| [NeMo Guard Content Safety](./nvidia-nemoguard-content-safety/v0.1/docs/nvidia-nemoguard-content-safety.md) | Guardrails, AI | Validates request and/or response content using NVIDIA NeMo Guard (llama-3.1-nemoguard-8b-content-safety). |
 | [PII Masking Regex](./pii-masking-regex/v1.0/docs/pii-masking-regex.md) | Guardrails, AI | Masks or redacts Personally Identifiable Information (PII) from request/response bodies using regex patterns. |
 | [Prompt Compressor](./prompt-compressor/v0.9/docs/prompt-compressor.md) | AI | Compresses selected prompt text in JSON request bodies before upstream LLM calls. |
 | [Prompt Decorator](./prompt-decorator/v1.0/docs/prompt-decorator.md) | AI | Dynamically modifies the prompt by applying custom decorations using a configured strategy. |
Lines changed: 292 additions & 0 deletions
@@ -0,0 +1,292 @@
---
title: "Overview"
---
# NeMo Guard Content Safety

## Overview

The NeMo Guard Content Safety policy validates request and/or response content using NVIDIA NeMo Guard (llama-3.1-nemoguard-8b-content-safety). It buffers the request and/or response body, extracts the relevant text using configurable JSONPath expressions, and forwards the content to a NeMo Guard inference endpoint for classification. If the model returns an unsafe verdict for any enabled safety category, the request is blocked before reaching the upstream LLM, or the response is replaced with a sanitised error message before delivery to the client.

The model is a LoRA adapter on meta-llama/Llama-3.1-8B-Instruct served via vLLM with `--enable-lora`. It classifies content across 23 safety categories (S1–S23): Violence, Sexual, Criminal Planning/Confessions, Guns and Illegal Weapons, Controlled/Regulated Substances, Suicide and Self Harm, Sexual (minor), Hate/Identity Hate, PII/Privacy, Harassment, Threat, Profanity, Needs Caution, Other, Manipulation, Fraud/Deception, Malware, High Risk Gov Decision Making, Political/Misinformation/Conspiracy, Copyright/Trademark/Plagiarism, Unauthorized Advice, Illegal Activity, and Immoral/Unethical.

Use this policy when you need to screen both the user input and the LLM output for unsafe content across a broad range of harm categories, without modifying the upstream service.

## Features

- Checks request bodies before they reach the upstream LLM (enabled by default)
- Checks response bodies before they are delivered to the client (opt-in)
- Classifies content across 23 safety categories (S1–S23)
- Per-category blocking toggles: enable or disable individual categories independently
- Blocks all categories by default when no category filter is configured
- Unsafe requests are rejected with a configurable HTTP status code (400–599 range)
- Unsafe responses are replaced with a sanitised 200 error body (preserves HTTP contract with the client)
- When checking responses, includes the original user message as conversation context for the model
- Optional assessment details in the block response (detected safety category codes)
- Fail-closed by default on inference service errors; configurable to fail-open
- Passes through requests unchanged when the body is not JSON, the JSONPath target is missing, or the body is absent
- Targets any string field in the JSON request or response body via configurable JSONPath expressions
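The pass-through rules in the list above can be illustrated with a small extraction helper. This is a minimal sketch, not the policy's actual code: `extract_text` is a hypothetical name, and the hand-written resolver only understands `.key` and `[index]` steps, which is enough for the default paths `$.messages[-1].content` and `$.choices[0].message.content`. The real policy's JSONPath engine is not shown in the source.

```python
import json
import re
from typing import Any, Optional

# One path step: ".key" or "[index]" (the index may be negative).
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(-?\d+)\]")


def extract_text(body: Optional[bytes], json_path: str) -> Optional[str]:
    """Return the string at json_path, or None to signal pass-through."""
    if not body:
        return None  # absent body
    try:
        node: Any = json.loads(body)
    except ValueError:
        return None  # body is not JSON (or not valid UTF-8)
    if not json_path.startswith("$"):
        return None
    pos = 1
    while pos < len(json_path):
        step = _STEP.match(json_path, pos)
        if step is None:
            return None  # syntax this sketch does not support
        key, index = step.group(1), step.group(2)
        if key is not None and isinstance(node, dict) and key in node:
            node = node[key]
        elif (index is not None and isinstance(node, list)
              and -len(node) <= int(index) < len(node)):
            node = node[int(index)]
        else:
            return None  # path does not resolve
        pos = step.end()
    # Only string targets are checked; anything else passes through.
    return node if isinstance(node, str) else None
```

A caller would treat `None` as "skip the safety check and forward the message unchanged", and any returned string as the text to classify.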

## Configuration

The NeMo Guard Content Safety policy uses a two-level configuration: system parameters that identify the NeMo Guard inference endpoint, and per-route user parameters that control detection behaviour for each phase (request and response).

### System Parameters (From config.toml)

These parameters are set at the gateway level and identify the NeMo Guard inference endpoint. Default values can be configured in `config.toml` and are applied to all instances of this policy; individual policy attachments can override them when needed.

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `endpoint` | string (URI) | Yes | - | Base URL of the OpenAI-compatible inference endpoint serving the NeMo Guard model (e.g., `http://nemoguard:8101`). The policy appends `/v1/chat/completions` automatically. |
| `apiKey` | string | No | - | Bearer token used to authenticate with the inference endpoint. Leave empty if the endpoint does not require authentication. |
| `model` | string | No | `nemoguard` | Model identifier forwarded in the API request. Must match the `--lora-modules` alias used when serving via vLLM. |
| `timeout` | integer | No | `30` | Per-request timeout in seconds for calls to the NeMo Guard endpoint. Must be between `1` and `120`. |
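To show how these four parameters fit together, here is a rough sketch of assembling the call to the inference endpoint with the Python standard library. The function name `build_inference_request` and the exact message layout are assumptions for illustration; the source only states that the policy appends `/v1/chat/completions`, sends the model identifier, uses a bearer token when one is configured, and includes the original user message as context when checking responses.

```python
import json
import urllib.request
from typing import Optional


def build_inference_request(endpoint: str, model: str, api_key: str,
                            user_text: str,
                            assistant_text: Optional[str] = None
                            ) -> urllib.request.Request:
    """Build (but do not send) a chat-completions request for classification."""
    # The base URL is configured without the path; the path is appended here.
    url = endpoint.rstrip("/") + "/v1/chat/completions"
    messages = [{"role": "user", "content": user_text}]
    if assistant_text is not None:
        # Response phase: the assistant reply is checked with the user
        # message as conversation context (assumed message layout).
        messages.append({"role": "assistant", "content": assistant_text})
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(url, data=payload, headers=headers, method="POST")
```

The request could then be sent with `urllib.request.urlopen(req, timeout=timeout)`, where `timeout` is the configured value in seconds.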

#### Sample System Configuration

Add the following entries to your `config.toml` file:

```toml
nemoguard_endpoint = "http://nemoguard:8101"
nemoguard_api_key = ""
nemoguard_model = "nemoguard"
nemoguard_timeout = 30
```

### User Parameters (API Definition)

Parameters are nested under `request` and `response` objects to configure each phase independently.

#### Request Phase (`request`)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `request.enabled` | boolean | No | `true` | Enables content safety checks on incoming requests. |
| `request.jsonPath` | string | No | `$.messages[-1].content` | JSONPath expression used to extract the user message from the JSON request body. Non-JSON bodies and requests where the path does not resolve to a string are passed through unchanged. |
| `request.blockStatusCode` | integer | No | `400` | HTTP status code returned when a request is blocked. Must be in the range `400` to `599`. |
| `request.categories` | object | No | all enabled | Per-category boolean toggles. When omitted, all 23 categories are blocked. When provided, only categories set to `true` are blocked; categories set to `false` are passed through even if the model flags them. |
| `request.passthroughOnError` | boolean | No | `false` | When `true`, allows the request to proceed if the NeMo Guard API call fails (fail-open). When `false`, a `503` is returned on API errors (fail-closed). |
| `request.showAssessment` | boolean | No | `false` | When `true`, includes the detected safety category codes in the blocked-request error response body. |

#### Response Phase (`response`)

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `response.enabled` | boolean | No | `false` | Enables content safety checks on upstream responses before they are delivered to the client. |
| `response.jsonPath` | string | No | `$.choices[0].message.content` | JSONPath expression used to extract the assistant reply from the response body. |
| `response.categories` | object | No | all enabled | Per-category boolean toggles, with the same semantics as the request-phase `categories` object. |
| `response.passthroughOnError` | boolean | No | `false` | When `true`, allows the response to proceed if the NeMo Guard API call fails (fail-open). When `false`, a `503` is returned on API errors (fail-closed). |
| `response.showAssessment` | boolean | No | `false` | When `true`, includes the detected safety category codes in the replaced-response error body. |

#### Safety Categories

The `categories` object supports the following boolean keys. When the object is omitted, every category is blocked; when it is provided, only keys set to `true` are blocked:

| Key | Code | Category |
|-----|------|----------|
| `violence` | S1 | Violence |
| `sexual` | S2 | Sexual |
| `criminal_planning` | S3 | Criminal Planning/Confessions |
| `guns_weapons` | S4 | Guns and Illegal Weapons |
| `regulated_substances` | S5 | Controlled/Regulated Substances |
| `suicide_self_harm` | S6 | Suicide and Self Harm |
| `sexual_minor` | S7 | Sexual (minor) |
| `hate_identity` | S8 | Hate/Identity Hate |
| `pii_privacy` | S9 | PII/Privacy |
| `harassment` | S10 | Harassment |
| `threat` | S11 | Threat |
| `profanity` | S12 | Profanity |
| `needs_caution` | S13 | Needs Caution |
| `other` | S14 | Other |
| `manipulation` | S15 | Manipulation |
| `fraud_deception` | S16 | Fraud/Deception |
| `malware` | S17 | Malware |
| `high_risk_gov` | S18 | High Risk Gov Decision Making |
| `misinformation` | S19 | Political/Misinformation/Conspiracy |
| `copyright` | S20 | Copyright/Trademark/Plagiarism |
| `unauthorized_advice` | S21 | Unauthorized Advice |
| `illegal_activity` | S22 | Illegal Activity |
| `immoral_unethical` | S23 | Immoral/Unethical |
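The key-to-code mapping in this table, combined with the toggle semantics described for `request.categories` (omitted means all categories block; provided means only keys set to `true` block), can be sketched as a small filter. `CATEGORY_CODES` and `blocked_codes` are hypothetical names chosen for this illustration, and the way the policy obtains the detected codes from the model is not covered here.

```python
from typing import Dict, Iterable, List, Optional

# Configuration key -> safety category code, as listed in the table.
CATEGORY_CODES: Dict[str, str] = {
    "violence": "S1", "sexual": "S2", "criminal_planning": "S3",
    "guns_weapons": "S4", "regulated_substances": "S5",
    "suicide_self_harm": "S6", "sexual_minor": "S7", "hate_identity": "S8",
    "pii_privacy": "S9", "harassment": "S10", "threat": "S11",
    "profanity": "S12", "needs_caution": "S13", "other": "S14",
    "manipulation": "S15", "fraud_deception": "S16", "malware": "S17",
    "high_risk_gov": "S18", "misinformation": "S19", "copyright": "S20",
    "unauthorized_advice": "S21", "illegal_activity": "S22",
    "immoral_unethical": "S23",
}


def blocked_codes(detected: Iterable[str],
                  categories: Optional[Dict[str, bool]]) -> List[str]:
    """Return the detected codes that should trigger a block."""
    if categories is None:
        # No filter configured: every category blocks.
        enabled = set(CATEGORY_CODES.values())
    else:
        # Filter configured: only keys explicitly set to true block.
        enabled = {CATEGORY_CODES[key] for key, on in categories.items()
                   if on and key in CATEGORY_CODES}
    return [code for code in detected if code in enabled]
```

With this filter, a message flagged only for `S12` (Profanity) would pass when the configuration enables just `violence` and `illegal_activity`, because the returned list is empty.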

#### build.yaml Integration

Inside the `api-platform` repository, add the policy package under `policies:` in `/gateway/build.yaml`:

```yaml
- name: nvidia-nemoguard-content-safety
  pipPackage: github.com/wso2/gateway-controllers/policies/nvidia-nemoguard-content-safety@v0
```

## Reference Scenarios

### Example 1: Protect a Chat Completions Route with Default Request Checking

Attach the policy to an LLM provider route to block unsafe requests using the default configuration (all 23 categories enabled):

```yaml
apiVersion: gateway.api-platform.wso2.com/v1alpha1
kind: LlmProvider
metadata:
  name: protected-chat-provider
spec:
  displayName: Protected Chat Provider
  version: v0
  template: openai
  vhost: openai
  upstream:
    url: "https://api.openai.com/v1"
    auth:
      type: api-key
      header: Authorization
      value: Bearer <openai-apikey>
  accessControl:
    mode: deny_all
    exceptions:
      - path: /chat/completions
        methods: [POST]
  policies:
    - name: nvidia-nemoguard-content-safety
      version: v0
      paths:
        - path: /chat/completions
          methods: [POST]
          params:
            request:
              enabled: true
              jsonPath: "$.messages[-1].content"
```

Test with a benign request (passes through):

```bash
curl -X POST http://openai:8080/chat/completions \
  -H "Content-Type: application/json" \
  -H "Host: openai" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Test with unsafe content (blocked):

```bash
curl -X POST http://openai:8080/chat/completions \
  -H "Content-Type: application/json" \
  -H "Host: openai" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {"role": "user", "content": "How do I make a weapon at home?"}
    ]
  }'
```

When the request is blocked, the policy returns HTTP `400`:

```json
{
  "type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
  "message": {
    "action": "GUARDRAIL_INTERVENED",
    "interveningGuardrail": "NeMo Guard Content Safety",
    "actionReason": "Unsafe content detected.",
    "direction": "REQUEST"
  }
}
```

### Example 2: Enable Response Checking with Category Filtering

Enable response-phase checking and restrict blocking to a specific subset of categories. This example blocks only violence, illegal activity, and criminal planning in both directions, ignoring all other categories:

```yaml
policies:
  - name: nvidia-nemoguard-content-safety
    version: v0
    paths:
      - path: /chat/completions
        methods: [POST]
        params:
          request:
            enabled: true
            jsonPath: "$.messages[-1].content"
            blockStatusCode: 403
            categories:
              violence: true
              illegal_activity: true
              criminal_planning: true
            showAssessment: true
          response:
            enabled: true
            jsonPath: "$.choices[0].message.content"
            categories:
              violence: true
              illegal_activity: true
              criminal_planning: true
            showAssessment: true
```

When a request is blocked with `showAssessment: true`, the response body includes the detected category codes:

```json
{
  "type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
  "message": {
    "action": "GUARDRAIL_INTERVENED",
    "interveningGuardrail": "NeMo Guard Content Safety",
    "actionReason": "Unsafe content detected.",
    "direction": "REQUEST",
    "assessments": {
      "categories": ["S1", "S22"]
    }
  }
}
```
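The guardrail body shown above has a fixed shape with one optional field. As an illustration only, the hypothetical helper `build_block_body` below produces the same JSON structure and attaches `assessments` only when `showAssessment` is enabled and codes are available; the field names are copied from the example, while the helper itself is an assumption.

```python
import json
from typing import List, Optional


def build_block_body(direction: str,
                     codes: Optional[List[str]] = None,
                     show_assessment: bool = False) -> str:
    """Serialise the guardrail intervention body for REQUEST or RESPONSE."""
    message = {
        "action": "GUARDRAIL_INTERVENED",
        "interveningGuardrail": "NeMo Guard Content Safety",
        "actionReason": "Unsafe content detected.",
        "direction": direction,
    }
    if show_assessment and codes:
        # Category codes are exposed only when explicitly requested.
        message["assessments"] = {"categories": list(codes)}
    return json.dumps({"type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
                       "message": message})
```

Keeping assessments off by default avoids telling a caller which category their input tripped unless the operator opts in.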

When a response is replaced due to unsafe content, the policy returns HTTP `200` with the guardrail body (preserving the HTTP contract with the client):

```json
{
  "type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
  "message": {
    "action": "GUARDRAIL_INTERVENED",
    "interveningGuardrail": "NeMo Guard Content Safety",
    "actionReason": "Unsafe content detected.",
    "direction": "RESPONSE"
  }
}
```
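The difference in status codes between the two phases can be summarised in a few lines. This sketch assumes a hypothetical helper `block_status`; it encodes only what the documentation states: blocked requests use the configured `blockStatusCode` (400 to 599, default 400), while replaced responses always use 200.

```python
def block_status(direction: str, block_status_code: int = 400) -> int:
    """Pick the HTTP status for a guardrail intervention."""
    if direction == "RESPONSE":
        # The upstream call succeeded; only the body is replaced.
        return 200
    if not 400 <= block_status_code <= 599:
        raise ValueError("blockStatusCode must be between 400 and 599")
    return block_status_code
```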

### Example 3: Fail-Open for High Availability

When the NeMo Guard service is unavailable, allow traffic to proceed rather than returning an error. Use this configuration only when availability takes priority over strict safety enforcement:

```yaml
policies:
  - name: nvidia-nemoguard-content-safety
    version: v0
    paths:
      - path: /chat/completions
        methods: [POST]
        params:
          request:
            enabled: true
            passthroughOnError: true
          response:
            enabled: true
            passthroughOnError: true
```

When the NeMo Guard endpoint is unreachable and `passthroughOnError` is `false` (the default), the policy returns HTTP `503`:

```json
{
  "type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
  "message": {
    "action": "SERVICE_UNAVAILABLE",
    "actionReason": "Content safety service unavailable."
  }
}
```
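The fail-open versus fail-closed choice comes down to what happens when the classification call raises an error. The sketch below is an assumption-laden illustration: `guard` and its `(action, status, body)` return tuple are invented for this example, and `classify` stands in for whatever function calls the NeMo Guard endpoint and returns `True` for unsafe content.

```python
import json
from typing import Callable, Optional, Tuple

SERVICE_UNAVAILABLE_BODY = json.dumps({
    "type": "NVIDIA_NEMOGUARD_CONTENT_SAFETY",
    "message": {
        "action": "SERVICE_UNAVAILABLE",
        "actionReason": "Content safety service unavailable.",
    },
})

Decision = Tuple[str, Optional[int], Optional[str]]


def guard(classify: Callable[[str], bool], text: str,
          passthrough_on_error: bool = False) -> Decision:
    """Run the classifier and decide what to do if it fails."""
    try:
        unsafe = classify(text)
    except Exception:  # broad on purpose: timeouts, HTTP errors, bad output
        if passthrough_on_error:
            return ("continue", None, None)  # fail-open
        return ("error", 503, SERVICE_UNAVAILABLE_BODY)  # fail-closed
    return ("unsafe", None, None) if unsafe else ("continue", None, None)
```

With the default `passthrough_on_error=False`, an outage of the safety service stops traffic rather than letting unchecked content through.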
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
{
  "name": "nvidia-nemoguard-content-safety",
  "displayName": "NeMo Guard Content Safety",
  "version": "0.1",
  "provider": "WSO2",
  "categories": [
    "Guardrails",
    "AI"
  ],
  "description": "Validates request and/or response content using NVIDIA NeMo Guard (llama-3.1-nemoguard-8b-content-safety). Buffers the request and/or response body, extracts the relevant text via configurable JSONPath expressions, and forwards it to the NeMo Guard inference endpoint for classification across 23 safety categories. Unsafe requests are blocked before reaching the upstream LLM; unsafe responses are replaced with a sanitised error message before delivery to the client."
}
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
# NeMo Guard Content Safety Policy

`nvidia-nemoguard-content-safety` is a Python policy package that validates request and response content using NVIDIA NeMo Guard (llama-3.1-nemoguard-8b-content-safety).
