docs: document release notes for 0.21 and additional details (#1726)

miyoungc · web-flow · commit f2e7bebd7ce4 · 2026-03-12T16:29:15.000-07:00
diff --git a/docs/about/release-notes.md b/docs/about/release-notes.md
@@ -25,104 +25,132 @@ For a complete record of changes in a release, refer to the
 
 ---
 
-(v0-20-0)=
+(v0-21-0)=
 
-## 0.20.0
+## 0.21.0
 
-(v0-20-0-features)=
+(v0-21-0-features)=
 
 ### Key Features
 
-- Added support for multilingual content safety models such as [NVIDIA Nemotron Safety Guard 8B v3](https://build.nvidia.com/nvidia/llama-3_1-nemotron-safety-guard-8b-v3). This feature uses the [fast-langdetect package](https://github.com/LlmKira/fast-langdetect) to detect the user's input language and return refusal messages in the appropriate language. To use this feature, install the NeMo Guardrails library with the `multilingual` extra.
+- Added the `IORails` class, a new optimized execution engine that runs NemoGuard input and output rails, such as
+  content-safety, topic-safety, and jailbreak detection, in parallel. The engine is opt-in:
+  set `NEMO_GUARDRAILS_IORAILS_ENGINE=1` to enable it. When enabled, the library validates the
+  configuration for compatibility and falls back to LLMRails if it detects unsupported flows.
+  For more information, refer to [](../configure-rails/yaml-schema/guardrails-configuration/parallel-rails.md#iorails-engine).
 
-  ```bash
-  pip install nemoguardrails[multilingual]
-  ```
+- Added the `check_async()` and `check()` methods on `LLMRails` to enable validating messages against input and output rails without triggering full LLM generation.
+  Both methods return a `RailsResult` with a `PASSED`, `MODIFIED`, or `BLOCKED` status.
+  For more information, refer to [](../run-rails/using-python-apis/check-messages.md).
 
-- Added support for configuring custom refusal messages per language to complement multilingual content safety models. You can enable multilingual refusal messages and specify custom refusal messages in the `rails.config.content_safety` section of the `config.yml` file.
-
-  ```yaml
-  rails:
-    config:
-      content_safety:
-        multilingual:
-          enabled: true
-          refusal_messages:
-            en: "Sorry, I cannot help with that request."
-            es: "Lo siento, no puedo ayudar con esa solicitud."
-            zh: "抱歉，我无法处理该请求。"
-            # Add other languages as needed
-  ```
+- The guardrails server now exposes a fully OpenAI-compatible
+  REST API. The `/v1/chat/completions` endpoint accepts standard `ChatCompletion` requests with a
+  `guardrails` field for config selection. A new `/v1/models` endpoint lists available models from the
+  configured provider. The `openai` package is now a required component of the optional `server` extra ([#1623](https://github.com/NVIDIA-NeMo/Guardrails/pull/1623)).
+  For more information, refer to [](../run-rails/using-fastapi-server/overview.md).
+
+- Added the `GuardrailsMiddleware` class, a new middleware that integrates with
+  LangChain's Agent Middleware protocol, applying input and output rail checks before and after
+  every model call in the agent loop. It includes the `InputRailsMiddleware` and `OutputRailsMiddleware`
+  convenience subclasses.
+  For more information, refer to [](../integration/langchain/agent-middleware.md).
+
+- Added three new community rails:
+  [PolicyAI](../configure-rails/guardrail-catalog/community/policyai.md) for policy-based content moderation,
+  [CrowdStrike AIDR](../configure-rails/guardrail-catalog/community/crowdstrike-aidr.md) for AI-powered detection and response, and
+  [Regex Detection](../configure-rails/guardrail-catalog/community/regex.md) for pattern-based content filtering on input, output, and retrieval.
+
+- Jailbreak detection configuration is now validated when the configuration is created.
+  Invalid thresholds and malformed URLs raise errors immediately.
+  For more information, refer to [](../configure-rails/guardrail-catalog/jailbreak-protection.md#configuration-validation).
 
-  For more information, refer to [](../configure-rails/guardrail-catalog/content-safety.md#multilingual-refusal-messages).
-- Added support for [NVIDIA GLiNER-PII](https://huggingface.co/nvidia/gliner-PII) for detecting entities such as names, email addresses, phone numbers, social security numbers, and more. For more information, refer to [](../configure-rails/guardrail-catalog/community/gliner.md).
+- Embedding indexes are now initialized lazily.
+  FastEmbed models are only downloaded when semantic search is needed, reducing startup time for
+  configurations that use only input and output rails.
+
+(v0-21-0-breaking-changes)=
 
 ### Breaking Changes
 
-- A breaking change removes redundant streaming configuration for output rails. Prior to the change, streaming had to be enabled in two places: `streaming` and `rails.output.streaming.enabled`. This change removes the top-level `streaming` configuration.
-  - Example `config.yml` before the change:
-
-    ```{code-block} yaml
-    :emphasize-lines: 21
-
-    models:
-      - type: main
-        engine: nvidia_ai_endpoints
-        model: meta/llama-3.3-70b-instruct
-      - type: content_safety
-        engine: nvidia_ai_endpoints
-        model: nvidia/llama-3.1-nemoguard-8b-content-safety
-
-    rails:
-      input:
-        flows:
-          - content safety check input $model=content_safety
-      output:
-        flows:
-          - content safety check output $model=content_safety
-        streaming:
-          enabled: True
-          chunk_size: 200
-          context_size: 50
-
-    streaming: True # No longer needed starting from v0.20.0
-    ```
-
-  - Example `config.yml` after the change:
-
-    ```yaml
-    models:
-      - type: main
-        engine: nvidia_ai_endpoints
-        model: meta/llama-3.3-70b-instruct
-
-      - type: content_safety
-        engine: nvidia_ai_endpoints
-        model: nvidia/llama-3.1-nemoguard-8b-content-safety
-
-    rails:
-      input:
-        flows:
-          - content safety check input $model=content_safety
-      output:
-        flows:
-          - content safety check output $model=content_safety
-        streaming:
-          enabled: True
-          chunk_size: 200
-          context_size: 50
-    ```
-
-  For more information, refer to [](../run-rails/using-python-apis/streaming.md).
+- Streaming metadata parameter renamed. The `include_generation_metadata` parameter on
+  `LLMRails.stream_async()` and `StreamingHandler` is deprecated in favor of `include_metadata`.
+  The `generation_info` field in streaming chunk dicts is renamed to `metadata`.
+  The deprecated parameter still works and emits a `DeprecationWarning`.
+
+  ```python
+  # Before (deprecated)
+  async for chunk in rails.stream_async(messages=messages, include_generation_metadata=True):
+      info = chunk["generation_info"]
+
+  # After
+  async for chunk in rails.stream_async(messages=messages, include_metadata=True):
+      info = chunk["metadata"]
+  ```
+
+- `StreamingHandler` no longer inherits from LangChain `AsyncCallbackHandler`.
+  Streaming now uses `llm.astream()` with direct `push_chunk()` calls.
+  If your code depends on `StreamingHandler` as a LangChain callback, update it to use the
+  new `push_chunk()` interface.
+
+- Removed the `stream_usage` parameter. The `stream_usage=True` parameter is no longer
+  automatically added to LLM call kwargs. Streaming metadata is now captured through
+  `response_metadata` and `usage_metadata` on final chunks.
+
+- Server request and response format changed. The `/v1/chat/completions` endpoint now uses
+  OpenAI-compatible request and response schemas. The previous `RequestBody` and `ResponseBody`
+  classes are removed. For the new format, refer to
+  [](../run-rails/using-fastapi-server/overview.md).
+
+- ChatNVIDIA streaming patch removed. The custom
+  `_langchain_nvidia_ai_endpoints_patch.py` module is removed.
+  The standard `ChatNVIDIA` from `langchain_nvidia_ai_endpoints` is used directly.
+
+(v0-21-0-bug-fixes)=
+
+### Bug Fixes
+
+- Fixed a naming mismatch where the `generate_next_step` action did not match the
+  `generate_next_steps` task enum value, which prevented task-specific LLM configuration
+  from working correctly ([#1603](https://github.com/NVIDIA-NeMo/Guardrails/pull/1603)).
+- Added the `valid` alias to action results in the GuardrailsAI integration so that
+  Colang flows checking `$result["valid"]` work as expected ([#1611](https://github.com/NVIDIA-NeMo/Guardrails/pull/1611)).
+- Filtered the `stop` parameter for OpenAI reasoning models (such as GPT-5) that do not
+  accept it, preventing `400` errors during dialogue rail execution ([#1653](https://github.com/NVIDIA-NeMo/Guardrails/pull/1653)).
+- Fixed GLiNER PII detection to use "bot refuse to respond" instead of
+  "bot inform answer unknown", which returned a misleading "I don't know" message ([#1671](https://github.com/NVIDIA-NeMo/Guardrails/pull/1671)).
+- Fixed a `TypeError` when `stop=None` is passed to `StreamingHandler` by coercing
+  `None` to an empty list ([#1685](https://github.com/NVIDIA-NeMo/Guardrails/pull/1685)).
+- Fixed a `TypeError` in `RollingBuffer.format_chunks` when `include_metadata=True` is used
+  with output rail streaming enabled. Dict chunks are now normalized to strings at the
+  input boundary ([#1687](https://github.com/NVIDIA-NeMo/Guardrails/pull/1687)).
+- Fixed `GuardrailsMiddleware` silently dropping content when rails return `MODIFIED` status.
+  Input rails now replace the last user message and output rails replace the last AI
+  message with the sanitized content ([#1714](https://github.com/NVIDIA-NeMo/Guardrails/pull/1714)).
+- Cache hit statistics are now visible in the Stats log line. Cache stats are also
+  visible in verbose mode ([#1666](https://github.com/NVIDIA-NeMo/Guardrails/pull/1666), [#1667](https://github.com/NVIDIA-NeMo/Guardrails/pull/1667)).
+
+(v0-21-0-other-changes)=
 
 ### Other Changes
 
-- Restructured the documentation with improved navigation, clearer content organization, and updated configuration reference and user guides.
+- Updated the Fiddler Guardrails API to match the new specification: the `prompt` field is
+  renamed to `input`, faithfulness uses strings instead of lists, and a new `fdl_roleplaying`
+  category is added ([#1619](https://github.com/NVIDIA-NeMo/Guardrails/pull/1619)).
+- Updated the Trend Micro Vision One AI Guard integration from the beta endpoint to the
+  officially released GA endpoint. A required `TMV1-Application-Name` header is added and the
+  request key is changed from `guard` to `prompt` ([#1546](https://github.com/NVIDIA-NeMo/Guardrails/pull/1546)).
+- Added a Locust stress-test benchmark for load testing ([#1629](https://github.com/NVIDIA-NeMo/Guardrails/pull/1629)).
+- Removed the `multi_kb` example ([#1673](https://github.com/NVIDIA-NeMo/Guardrails/pull/1673)).
+- Removed the AI Virtual Assistant Blueprint notebook ([#1682](https://github.com/NVIDIA-NeMo/Guardrails/pull/1682)).
+- Updated the Pangea User-Agent repo URL ([#1610](https://github.com/NVIDIA-NeMo/Guardrails/pull/1610)).
+- Updated dependencies for the jailbreak detection Docker container ([#1596](https://github.com/NVIDIA-NeMo/Guardrails/pull/1596)).
+- Revamped the documentation with improved structure and navigation.
 
 ---
 
 ## Previous Release Notes
 
+- [0.20.0](https://docs.nvidia.com/nemo/guardrails/0.20.0/release-notes.html)
 - [0.19.0](https://docs.nvidia.com/nemo/guardrails/0.19.0/release-notes.html)
 - [0.18.0](https://docs.nvidia.com/nemo/guardrails/0.18.0/release-notes.html)
 - [0.17.0](https://docs.nvidia.com/nemo/guardrails/0.17.0/release-notes.html)
diff --git a/docs/configure-rails/guardrail-catalog/jailbreak-protection.md b/docs/configure-rails/guardrail-catalog/jailbreak-protection.md
@@ -48,6 +48,18 @@ rails:
 If the `server_endpoint` parameter is not set, the checks will run in-process. This is useful for TESTING PURPOSES ONLY and **IS NOT RECOMMENDED FOR PRODUCTION DEPLOYMENTS**.
 ```
 
+### Configuration Validation
+
+The jailbreak detection configuration is validated when it is created. Invalid values raise errors
+immediately instead of failing silently at runtime. The following validation rules apply:
+
+| Parameter | Rule |
+|-----------|------|
+| `length_per_perplexity_threshold` | Must be greater than 0 |
+| `prefix_suffix_perplexity_threshold` | Must be greater than 0 |
+| `nim_base_url` | Must start with `http://` or `https://` |
+| `server_endpoint` | Must start with `http://` or `https://` |
+
 ### Heuristics
 
 #### Length per Perplexity
diff --git a/docs/configure-rails/yaml-schema/guardrails-configuration/parallel-rails.md b/docs/configure-rails/yaml-schema/guardrails-configuration/parallel-rails.md
@@ -7,24 +7,70 @@ description: Configure input and output rails to run in parallel for improved la
 
 You can configure input and output rails to run in parallel. This can improve latency and throughput.
 
-## When to Use Parallel Rails Execution
+## IORails Engine
 
-Use parallel execution:
+The IORails engine is an optimized execution engine that runs NemoGuard input and output rails in
+parallel with dedicated model management. The IORails engine is an opt-in feature. By default, the
+NeMo Guardrails library uses the LLMRails engine.
+
+:::{note}
+IORails is an early-release feature. Unlike LLMRails, it does not currently support streaming, reasoning models, or telemetry.
+:::
+
+### Supported Flows
+
+The IORails engine supports the following flows:
+
+- `content safety check input` / `content safety check output`
+- `topic safety check input`
+- `jailbreak detection model`
+
+When IORails is enabled and the configuration uses only these flows, the engine runs them in parallel.
+Configurations that include custom flows, dialog rails, or other unsupported flows
+raise an error at initialization.
+
+### Enabling IORails
+
+To enable the IORails engine, set the `NEMO_GUARDRAILS_IORAILS_ENGINE` environment variable to `1`:
+
+```bash
+NEMO_GUARDRAILS_IORAILS_ENGINE=1 nemoguardrails chat --config examples/configs/content_safety
+```
+
+When using the Python API, import the `Guardrails` class directly and pass `use_iorails=True`:
+
+```python
+from nemoguardrails import RailsConfig
+from nemoguardrails.guardrails.guardrails import Guardrails
+
+config = RailsConfig.from_path("./config")
+guardrails = Guardrails(config, use_iorails=True)
+```
+
+## YAML-Based Parallel Execution
+
+You can also configure existing LLMRails flows to run in parallel using the `parallel: True`
+option in the `config.yml` file. This approach works with any flow type and does not require
+the IORails engine.
+
+### When to Use
+
+Use YAML-based parallel execution:
 
 - For I/O-bound rails such as external API calls to LLMs or third-party integrations.
 - If you have two or more independent input or output rails without shared state dependencies.
 - In production environments where response latency affects user experience and business metrics.
 
-## When Not to Use Parallel Rails Execution
+### When Not to Use
 
 Avoid parallel execution:
 
 - For CPU-bound rails; it might not improve performance and can introduce overhead.
 - During development and testing for debugging and simpler workflows.
 
-## Configuration Example
+### Configuration Example
 
-To enable parallel execution, set `parallel: True` in the `rails.input` and `rails.output` sections in the `config.yml` file. The following configuration example is tested by NVIDIA and shows how to enable parallel execution for input and output rails.
+To enable parallel execution, set `parallel: True` in the `rails.input` and `rails.output` sections in the `config.yml` file.
 
 ```{note}
 Input rail mutations can lead to erroneous results during parallel execution because of race conditions arising from the execution order and timing of parallel operations. This can result in output divergence compared to sequential execution. For such cases, use sequential mode.
@@ -60,5 +106,4 @@ rails:
       chunk_size: 200
       context_size: 50
       stream_first: True
-streaming: True
 ```
diff --git a/docs/getting-started/installation-guide.md b/docs/getting-started/installation-guide.md
@@ -122,6 +122,7 @@ You can install the NeMo Guardrails library with optional extra packages to add
 |-------|-------------|
 | `nvidia` | NVIDIA-hosted model integration through [build.nvidia.com](https://build.nvidia.com/) |
 | `openai` | OpenAI-hosted model integration |
+| `server` | [Guardrails API server](../run-rails/using-fastapi-server/overview.md) dependencies: `aiofiles` for async file handling and `openai` for API schemas. FastAPI is a core dependency. Required to run `nemoguardrails server`. |
 | `sdd` | [Sensitive data detection](../configure-rails/guardrail-catalog/pii-detection.md#presidio-based-sensitive-data-detection) using Presidio |
 | `eval` | [Evaluation tools](../evaluation/evaluate-guardrails.md) for testing guardrails |
 | `tracing` | OpenTelemetry tracing support |
diff --git a/docs/project.json b/docs/project.json
@@ -1 +1 @@
-{ "name": "nemo-guardrails-toolkit", "version": "0.20.0" }
+{ "name": "nemo-guardrails-toolkit", "version": "0.21.0" }
diff --git a/docs/versions1.json b/docs/versions1.json
@@ -1,31 +1,35 @@
 [
     {
         "preferred": true,
+        "version": "0.21.0",
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.21.0/"
+    },
+    {
         "version": "0.20.0",
-        "url": "../0.20.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.20.0/"
     },
     {
         "version": "0.19.0",
-        "url": "../0.19.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.19.0/"
     },
     {
         "version": "0.18.0",
-        "url": "../0.18.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.18.0/"
     },
     {
         "version": "0.17.0",
-        "url": "../0.17.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.17.0/"
     },
     {
         "version": "0.16.0",
-        "url": "../0.16.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.16.0/"
     },
     {
         "version": "0.15.0",
-        "url": "../0.15.0/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.15.0/"
     },
     {
         "version": "0.14.1",
-        "url": "../0.14.1/"
+        "url": "https://docs.nvidia.com/nemo/guardrails/0.14.1/"
     }
 ]
