fix(output_parsers): use correct JSON key "Violated Categories" in nemoguard parsers #2011

Open
nac7 wants to merge 2 commits into NVIDIA-NeMo:develop from nac7:fix/nemoguard-violated-categories-key

Conversation

@nac7 nac7 commented Jun 8, 2026

Fixes #2010.

Problem

nemoguard_parse_prompt_safety and nemoguard_parse_response_safety in nemoguardrails/llm/output_parsers.py looked for key "Safety Categories" when extracting violation categories from the NemoGuard ContentSafety model response. However, the NemoGuard model returns key "Violated Categories" — as documented in each function's own docstring.

This caused violation categories to be silently dropped on every unsafe response:

from nemoguardrails.llm.output_parsers import nemoguard_parse_prompt_safety

response = '{"User Safety": "unsafe", "Violated Categories": "violence, hate_speech"}'
result = nemoguard_parse_prompt_safety(response)
# Was:    [False]                           ← categories silently dropped
# Now:    [False, 'violence', 'hate_speech']

Impact:

  • Audit logs that record which policy categories were violated always received an empty list
  • Downstream guardrail logic that dispatches on violation type never received category information
  • Compliance reporting showed "unsafe" with no details

Fix

Change "Safety Categories" → "Violated Categories" on lines 163 and 202 of output_parsers.py to match the key the NemoGuard model actually emits (and the key documented in the function docstrings).
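
To make the one-key change concrete, here is a minimal sketch of the parsing logic, reconstructed from the behaviour described in this PR rather than copied from output_parsers.py. The function name `parse_user_safety_sketch` and its `categories_key` parameter are hypothetical; the parameter exists only so the old and new keys can be compared side by side. The `[True]` return for safe input and the unsafe fallback on malformed JSON are assumptions.

```python
import json
from typing import List, Union


def parse_user_safety_sketch(
    response: str, categories_key: str = "Violated Categories"
) -> List[Union[bool, str]]:
    """Simplified stand-in for the NemoGuard prompt-safety parser (hypothetical)."""
    try:
        parsed = json.loads(response)
        verdict = parsed["User Safety"].lower()
        # Categories arrive as one comma-separated string; a missing key means none.
        raw = parsed.get(categories_key, "")
        categories = [c.strip() for c in raw.split(",") if c.strip()]
    except Exception:
        # Assumption: unparseable model output is treated as unsafe.
        verdict = "unsafe"
        categories = ["JSON parsing failed"]
    if verdict == "safe":
        return [True]
    return [False] + categories


response = '{"User Safety": "unsafe", "Violated Categories": "violence, hate_speech"}'
old_result = parse_user_safety_sketch(response, categories_key="Safety Categories")
new_result = parse_user_safety_sketch(response)
print(old_result)  # [False]
print(new_result)  # [False, 'violence', 'hate_speech']
```

With the old key the lookup simply misses, so the caller still sees an unsafe verdict and has no signal that anything was lost.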

Tests

Updated tests/test_content_safety_output_parsers.py:

  • Renamed test cases using old wrong key to use correct "Violated Categories" key
  • Added test_wrong_key_safety_categories_yields_no_categories regression tests for both parsers to confirm the old wrong key no longer extracts categories
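
The shape of the two kinds of test listed above might look roughly like the following. This is a sketch, not the contents of tests/test_content_safety_output_parsers.py: `parse_stub` is a hypothetical local stand-in so the snippet runs on its own, whereas the real tests import the parsers from nemoguardrails.llm.output_parsers.

```python
import json


def parse_stub(response: str) -> list:
    """Hypothetical stand-in for the fixed parser; reads only "Violated Categories"."""
    data = json.loads(response)
    if data["User Safety"].lower() == "safe":
        return [True]
    raw = data.get("Violated Categories", "")
    return [False] + [c.strip() for c in raw.split(",") if c.strip()]


def test_violated_categories_are_extracted():
    response = '{"User Safety": "unsafe", "Violated Categories": "violence, hate_speech"}'
    assert parse_stub(response) == [False, "violence", "hate_speech"]


def test_wrong_key_safety_categories_yields_no_categories():
    # The old key must no longer be read, so only the verdict survives.
    response = '{"User Safety": "unsafe", "Safety Categories": "violence"}'
    assert parse_stub(response) == [False]


# Run directly so the sketch does not depend on a test runner being installed.
test_violated_categories_are_extracted()
test_wrong_key_safety_categories_yields_no_categories()
```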

Summary by CodeRabbit

  • Bug Fixes

    • Updated the content safety system to correctly parse and identify violated policy categories during prompt and response safety screening.
  • Tests

    • Expanded test suite with comprehensive coverage for safety violation detection, category extraction, and handling of edge cases.

@nac7 nac7 changed the base branch from main to develop June 8, 2026 21:37
nac7 added 2 commits June 8, 2026 16:38
…moguard parsers

Both nemoguard_parse_prompt_safety and nemoguard_parse_response_safety
checked for key "Safety Categories" when extracting violation categories
from NemoGuard ContentSafety model output, but the model actually returns
key "Violated Categories" (as documented in each function's own docstring).

This caused violation categories to be silently dropped on every unsafe
response, breaking audit logging, granular guardrail policies, and
compliance reporting that depend on knowing which policy categories
were flagged.

Fixes NVIDIA-NeMo#2010

Existing tests provided mock NemoGuard JSON responses with the wrong key
Safety Categories. Now that the parser correctly reads Violated Categories,
update all mock response fixtures to match what the real model emits.

The intentional regression tests in test_content_safety_output_parsers.py
that verify Safety Categories no longer extracts data are left unchanged.
@nac7 nac7 force-pushed the fix/nemoguard-violated-categories-key branch from 7ede184 to 542ab9e Compare June 8, 2026 21:38
greptile-apps Bot commented Jun 8, 2026

Contributor

Greptile Summary

This PR fixes a key mismatch in nemoguard_parse_prompt_safety and nemoguard_parse_response_safety: both functions were reading "Safety Categories" from the NemoGuard model response, but the model actually emits "Violated Categories" (which also matches the functions' own docstrings). The result was that violation categories were silently dropped on every unsafe response.

  • output_parsers.py: Two-line fix replacing "Safety Categories" with "Violated Categories" in both parsers.
  • All changed test files correctly update JSON fixtures and add regression tests confirming the old key no longer extracts categories.
  • Not updated: benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env still emits the old key in its UNSAFE_TEXT, and at least 8 example prompts.yml files still instruct the model to output "Safety Categories", creating a parser/prompt mismatch for users who copy those configs.

Confidence Score: 3/5

The parser fix itself is correct, but the change is incomplete: the benchmark mock server and at least 8 example prompt configs were not updated and still reference the old key.

The two-line change to output_parsers.py is correct and well-tested. However, the benchmark mock server (nvidia-llama-3.1-nemoguard-8b-content-safety.env) still emits "Safety Categories" in its UNSAFE_TEXT, so benchmarks will now silently drop categories, which is precisely the defect this PR aims to fix. Additionally, at least eight example prompts.yml files still tell the model to output "Safety Categories", meaning users who copy those configs with instruction-following models will reproduce the original bug.

benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env and all example prompts.yml/prompts.yaml files under examples/configs/ still reference the old key and were not included in the PR changeset.
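
One way to locate leftovers like the files flagged above is a small scan for the old key. The helper below is a hypothetical sketch, not part of the repository; the function name `find_stale_key` and the default suffix list are assumptions chosen to cover the .env, .yml and .yaml files mentioned in this review.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

STALE_KEY = "Safety Categories"


def find_stale_key(
    root: Path, suffixes: Iterable[str] = (".yml", ".yaml", ".env")
) -> List[Tuple[Path, int]]:
    """Return (file, line number) pairs where the old key still appears."""
    wanted = tuple(suffixes)
    hits: List[Tuple[Path, int]] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or not path.name.endswith(wanted):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if STALE_KEY in line:
                hits.append((path, lineno))
    return hits


# Example use from a repository checkout (paths taken from the review above):
#   for path, lineno in find_stale_key(Path("examples/configs")):
#       print(f"{path}:{lineno}")
```

The scan is purely textual, so it also reports intentional uses of the old key, such as the regression tests, if pointed at the tests directory.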

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemoguardrails/llm/output_parsers.py | Correct fix: both nemoguard_parse_prompt_safety and nemoguard_parse_response_safety now look for "Violated Categories" matching the actual NemoGuard model output and their own docstrings. |
| tests/test_content_safety_output_parsers.py | Tests updated to use "Violated Categories" throughout; two new regression tests added that confirm the old wrong key no longer extracts categories. |
| tests/guardrails/test_data.py | Expected prompt strings in test fixtures updated from "Safety Categories" to "Violated Categories" to match the corrected parser. |
| tests/guardrails/test_content_safety_iorails_actions.py | Test JSON stubs updated to use "Violated Categories" key for both prompt and response safety test cases. |
| tests/test_content_safety_integration.py | Integration test JSON responses updated to use "Violated Categories", now correctly asserting that categories are extracted. |
| benchmark/mock_llm_server/configs/nvidia-llama-3.1-nemoguard-8b-content-safety.env | Not updated in this PR: mock UNSAFE_TEXT still uses old "Safety Categories" key, causing violation categories to be silently dropped during benchmark runs. |
| tests/guardrails/test_iorails_telemetry.py | Telemetry test fixture UNSAFE_INPUT_JSON updated to use "Violated Categories" key. |
| tests/guardrails/test_rails_manager.py | Rails manager test stubs updated to use "Violated Categories" for both input and output unsafe JSON. |

Sequence Diagram

sequenceDiagram
    participant App
    participant Parser as output_parsers.py
    participant NemoGuard as NemoGuard Model

    App->>NemoGuard: Safety check request
    NemoGuard-->>Parser: "{"User Safety": "unsafe", "Violated Categories": "S1, S8"}"
    
    Note over Parser: Before fix: looked for "Safety Categories" → not found → []
    Note over Parser: After fix: looks for "Violated Categories" → found → ["S1", "S8"]
    
    Parser-->>App: [False, "S1", "S8"]
    App->>App: Dispatch on violation type ✓
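
The final "Dispatch on violation type" step in the diagram could be sketched as follows. Everything here is illustrative: the `ACTIONS` mapping, the action labels and the `dispatch` function are hypothetical and do not come from NeMo Guardrails; only the result shape `[False, "S1", "S8"]` is taken from the diagram.

```python
from typing import Dict, List, Sequence, Union

# Hypothetical mapping from category code to an action label.
ACTIONS: Dict[str, str] = {"S1": "escalate", "S8": "block"}


def dispatch(result: Sequence[Union[bool, str]]) -> List[str]:
    """Map a parser result such as [False, "S1", "S8"] to action labels."""
    is_safe, categories = result[0], result[1:]
    if is_safe:
        return []
    if not categories:
        # What callers saw before the fix: unsafe, but nothing to dispatch on.
        return ["block_generic"]
    return [ACTIONS.get(str(c), "block_generic") for c in categories]


print(dispatch([False, "S1", "S8"]))  # ['escalate', 'block']
print(dispatch([False]))              # ['block_generic']
```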

Reviews (1): Last reviewed commit: "test: update mock responses to use corre..." | Re-trigger Greptile

coderabbitai Bot commented Jun 8, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: ab61cde9-1100-49e2-950b-523f1d846993

📥 Commits

Reviewing files that changed from the base of the PR and between 7285f2c and 542ab9e.

📒 Files selected for processing (7)
  • nemoguardrails/llm/output_parsers.py
  • tests/guardrails/test_content_safety_iorails_actions.py
  • tests/guardrails/test_data.py
  • tests/guardrails/test_iorails_telemetry.py
  • tests/guardrails/test_rails_manager.py
  • tests/test_content_safety_integration.py
  • tests/test_content_safety_output_parsers.py

📝 Walkthrough

This PR fixes a silent data-loss bug in NemoGuard content-safety response parsing. The parser functions were reading from the wrong JSON field name ("Safety Categories" instead of "Violated Categories"), causing violated policy categories to be discarded. The fix updates both parser functions and all corresponding test fixtures and test cases across the test suite.

Changes

NemoGuard JSON Key Fix

| Layer / File(s) | Summary |
| --- | --- |
| Core parsing logic update (nemoguardrails/llm/output_parsers.py) | nemoguard_parse_prompt_safety and nemoguard_parse_response_safety now read violated categories from "Violated Categories" instead of the incorrect "Safety Categories" key. Missing field returns empty list; JSON parse errors still return ["JSON parsing failed"]. |
| Test data schemas and prompt templates (tests/guardrails/test_data.py) | CONTENT_SAFETY_INPUT_PROMPT and CONTENT_SAFETY_OUTPUT_PROMPT updated to document the correct "Violated Categories" field in the JSON schema for unsafe content. |
| Comprehensive output parser test coverage (tests/test_content_safety_output_parsers.py) | Regression tests added confirming "Safety Categories" yields no categories; expanded coverage for "Violated Categories" parsing (single, complex, whitespace-trimmed, empty). Real-world scenario tests updated for both prompt and response safety with colon-delimited category formats (e.g., "S1: Violence"). |
| Integration and system test fixtures (tests/guardrails/test_content_safety_iorails_actions.py, tests/guardrails/test_iorails_telemetry.py, tests/guardrails/test_rails_manager.py, tests/test_content_safety_integration.py) | Updated UNSAFE_INPUT_JSON and UNSAFE_OUTPUT_JSON fixtures to use "Violated Categories" instead of "Safety Categories". All integration test scenarios for prompt and response safety now use the corrected JSON key. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main fix: correcting the JSON key from "Safety Categories" to "Violated Categories" in nemoguard parsers. |
| Linked Issues check | ✅ Passed | All code changes address the requirements in #2010: parser functions now check for "Violated Categories" key, ensuring violation categories are extracted from NemoGuard responses and delivered to callers for audit logging and compliance reporting. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the JSON key mismatch: output_parsers.py lines updated, test fixtures and constants synchronized to use "Violated Categories", and regression tests added per #2010 requirements. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Test Results For Major Changes | ✅ Passed | This is a minor bug fix (2 lines changed) not a major feature/refactor. PR includes comprehensive test coverage with new positive and regression tests validating the fix works correctly. |


codecov Bot commented Jun 8, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.

@nac7 nac7 force-pushed the fix/nemoguard-violated-categories-key branch from 542ab9e to 2727e79 Compare June 18, 2026 00:51
Development

Successfully merging this pull request may close these issues.

nemoguard_parse_prompt_safety and nemoguard_parse_response_safety silently drop violation categories due to wrong JSON key

1 participant