📈 Add OTel configuration by christinaexyou · Pull Request #10 · trustyai-explainability/NeMo-Guardrails

christinaexyou · 2025-10-23T20:04:52Z

This PR configures NeMO to allow OpenTelemetry (OTel) to export traces to Jaegar. It adds scripts/setup_otel.py to configure OTel and updates scripts/entrypoint.sh to access this script and setup OTel before starting NeMo server.

m-misiura · 2025-10-30T10:00:56Z

@@ -1,4 +1,4 @@
-# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
+# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.


what is the benefit of moving from Poetry 1.8.2 to Poetry 1.8.4?

m-misiura · 2025-10-30T10:07:44Z

+api.app.disable_chat_ui = True
+api.set_default_config_id('${CONFIG_ID}')
+
+# Start the server using uvicorn


why replace the cli path with a direct uvicorn? Cli was used previously as composition over replacement was deemed preferable

Direct uvicorn setups OpenTelemetry before starting the server, which is required, whereas the CLI starts the server immediately (line 155 in cli/init.py)

m-misiura · 2025-10-30T10:14:21Z

+    service_version = os.getenv("OTEL_SERVICE_VERSION", "1.0.0")
+    environment = os.getenv("OTEL_ENVIRONMENT", "production")
+    otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
+    enable_console = os.getenv("OTEL_ENABLE_CONSOLE", "true").lower() == "true"


does the default value set on line 17 conflict with entrypoint.sh?

m-misiura · 2025-10-30T10:25:31Z

+  export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
+  export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
+  export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"
+  export OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT:-http://jaeger:4317}"


this setup seems to be tightly coupled to http://jaeger:4317 and presumably assumed same-namespace deployment; is this always the case? if it isn't, should the user have some kind of control over this configuration?

Yes, generally servers and tracing/metrics services are deployed in the same namespace

m-misiura · 2025-10-30T10:29:52Z

+        print(f'❌ OpenTelemetry setup failed: {e}')
+        import traceback
+        traceback.print_exc()
+        sys.exit(1)


so if any error occurs during OTel setup, the server refuses to start? I'd say that observability should not break core functionality

You're right - fixed this and added messages to verify setup

m-misiura · 2025-10-30T10:33:07Z

+        from setup_otel import setup_opentelemetry
+        setup_opentelemetry()
+        print('✅ OpenTelemetry configured in server process')
+    except Exception as e:


perhaps this is a too broad exception handling?

Added different exception handling according to the error

m-misiura · 2025-10-30T10:38:24Z


 RUN poetry install --no-ansi --extras="sdd jailbreak openai nvidia tracing" && \
-    poetry run pip install "spacy>=3.4.4,<4.0.0" && \
+    poetry run pip install \


we should probably move all these deps inside pyproject.toml

I did not want to initially touch pyproject.toml, but if we are adding more or more deps, they should be managed via pyproject.toml

also I think some of the opentelemetry packages are already inside .toml

I think it might be best to place opentelemetry packages inside tracing

I initially had them inside pyproject.toml but the NeMo team wants the server to be independent of opentelemetry API packages

m-misiura · 2025-10-30T10:42:18Z

 fi

+# Setup OpenTelemetry
+ENABLE_OTEL="${ENABLE_OTEL:-${AUTO_SETUP_OTEL:-false}}"


what is the advantage of having the dual variable fallback here?

100% a typo - removed AUTO_SETUP_OTEL

m-misiura · 2025-10-30T10:47:36Z

+  export OTEL_SERVICE_VERSION="${OTEL_SERVICE_VERSION:-1.0.0}"
+  export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
+  export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
+  export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"


are there any scenarios when this can be secure?

Yes, if we configure OpenTelemetry/Jaegar to use secure connections but that configuration is independent from the NeMo server (a user would have to specify it on their OTelCollector CR)

m-misiura

Thanks for the contribution! I think it could be beneficial to consider the following:

tweak how the server starts up in case there are OTel errors
add more informative logging so that debugging becomes easier
adding some documentation / example deployment manifests to the PR

I left some additional comments, highlighting specific lines that could be worth revisiting :)

christinaexyou · 2025-11-18T01:51:01Z

@m-misiura thanks for your review and comments ! i updated my PR with the following:

Better error handling in case the OpenTelemetry configuration fails
Logging throughout
Documentation on how to locally test OTel integration

* start restructuring and add local build automation * refresh poetry lock * WIP new IA implementation * more restructure and script to auto-update cards * more work * improve * nit to stage * details * nit got to be staged * pull out use case page and add table * more improvements and extend update cards script * big improvements * polish intro a bit * small edit * incorporate devs feedback * remaining pages * more fixes * small edits * more updated and incorporate feedback * some edits * accidentally dropped file

…VIDIA-NeMo#1523) * chore(test): reduce default pytest log level from DEBUG to WARNING The DEBUG level was flooding test output with verbose nemoguardrails logs,making it difficult to read test results. Users can still enable verbose logging when needed via `--log-level=DEBUG`. * Updated test that relies on INFO-level log capture to explicitly set caplog level.

Switch base image from python:3.10 to python:3.12-slim for improved performance and smaller image size. Use --no-install-recommends for apt package installation and clean up apt cache to reduce final image size.

… handling (NVIDIA-NeMo#1502)

* Use temporary path as default value in AIperf tests * Revert type fix that's needed locally but breaks on the server

…IDIA-NeMo#1516)

Add extensive test coverage for all model initialization paths and edge cases.Tests cover success scenarios, error handling, exception priority, mode filtering, and E2E integration through the full initialization chain.

… bot messages (NVIDIA-NeMo#1530) * feat(content_safety): add auto selected multilingual refusal bot message support Detect user input language and return refusal messages in the same language when content safety rails block unsafe content. Supports 9 languages: English, Spanish, Chinese, German, French, Hindi, Japanese, Arabic, and Thai.

…-NeMo#1542) * docs(streaming): update streaming configuration documentation feature: NVIDIA-NeMo#1538 Revise streaming docs to clarify usage of stream_async(), remove outdated global streaming config, and add CLI usage instructions. Explain output rails streaming requirements and deprecation of StreamingHandler. Improve examples and guidance for token usage tracking. * not sure about these 2

* docs: add multilingual refusal messages documentation Add documentation for the multilingual refusal messages feature in content safety rails, including: - Supported languages table with default messages - Basic and advanced configuration examples - Language detection behavior and accuracy benchmarks - Cold start behavior and caching details - Production considerations for container environments * docs: add installation guide for multilingual feature * remove configuration options

* docs: small update to test * fixing weird bug * updating overview page * updating use cases page * updating how it works section * updating supported llms page * updating getting started section * updating based on meeting discussion * adjusting rail types order * updating first configure rails docs * Update docs/about/rail-types.md Co-authored-by: Miyoung Choi <miyoungc@nvidia.com> Signed-off-by: alexahaushalter <alexahaushalter@hotmail.com> * updating first overall config files * updating guardrails config docs * finishing Core Configuration updates * updating actions section * updating Colang 2.0 getting started section * a few more updates * trying to get a Verified commit * trying to get a Verified commit attempt 2 --------- Signed-off-by: alexahaushalter <alexahaushalter@hotmail.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

* add frontmatter and card updates * more fixes and also the library mentions * fix capitalization * Update docs/run-rails/index.md Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/run-rails/index.md Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * save --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* more doc updates * rm duplicates * Update docs/about/overview.md Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* continuing with colang - working with actions * updating defining flows page * working with variables page

* continuing with colang - working with actions * updating defining flows page * working with variables page * through CSL section * finished language reference section * finishing colang 2 section

)

…rivateAI) (NVIDIA-NeMo#1545) * Adding GLiNER Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com> * addressing comments pt1 * revamping GLiNER server * addressing comments pt2 * fixing path on integration test after restructure * addressing comments pt3 * rolling back changes to project, not needed * mocking request on test, rolling back vscode change committed by mistake * updating docs * addressing final comments * rollback vscode * updating licenses * making extract function non-blocking by removing async --------- Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com>

* Move mock LLMs into top-level benchmark dir, local content_safety under examples, and benchmark-specific tests * Initial checkin of validation script * Remove un-needed files under nemoguardrails/benchmark * Move unit-tests under benchamrk top-level dir * Update unit-tests with new code location * Add requirements to keep benchmark dependencies separate from Guardrails itself * Update server run script and Procfile with new file locations * Return np.array with size (1,) from mock function calls in tests * Remove langchain_nvidia_ai_endpoints from requirements, Guardrails already has this in the poetry env. It's not used in the mocks * Update README to match new file locations and include venv instructions * Cleanups to the README * README.md cleanup

…d align with cards and nav (NVIDIA-NeMo#1554) * another round of frontmatter and titles * some improvements for get started * update * polish content safety tutorial and add use case diagrams to resources * improve tutorials, add config reference based on the source code * incorporate feedback

…VIDIA-NeMo#1526) OpenAI reasoning models (o1, o3, gpt-5 series excluding gpt-5-chat) only support `temperature=1`. When NeMo Guardrails uses `.bind(temperature=0.001)` for deterministic tasks like self-check input/output, the API returns an error: ``` Unsupported value: 'temperature' does not support 0.001 with this model. Only the default (1) value is supported. ``` This happens because LangChain's `ChatOpenAI` handles temperature restrictions at initialization time (setting `temperature=None` for reasoning models), but `.bind()` bypasses this protection and passes the temperature directly to the API. - Added `_filter_params_for_openai_reasoning_models()` function that: - Detects reasoning models by name (o1*, o3*, gpt-5* excluding gpt-5-chat) - Removes the `temperature` parameter before binding for these models - Follows the same pattern as LangChain's `validate_temperature` validator

…ing guides (NVIDIA-NeMo#1563) * refactor(docs): move alignscore and safeguarding guides to advanced section * docs(langchain): update and reorganize integration docs * fix typo --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

* Add streaming to mock LLMs, TTFT and ITL to config files, rename LATENCY_* to E2E_LATENCY_* to distinguish between the two

* improve doc * nit * Update docs/configure-rails/overview.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/about/overview.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/configure-rails/actions/index.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/configure-rails/overview.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/configure-rails/guardrail-catalog.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/user-guides/community/gliner.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/user-guides/community/gliner.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/user-guides/community/gliner.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/configure-rails/guardrail-catalog.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/configure-rails/guardrail-catalog.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * Update docs/user-guides/community/gliner.md Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> * resolve feedback * nit --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>

* First pass over About NeMo Guardrails Library * Address feedback. Remove extra space, move diagrams into Resource section, fix README examples link * Fixed how-it-works top-evel card * Update docs/about/supported-llms.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com> * Update docs/about/supported-llms.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com> * Update docs/about/supported-llms.md Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com> * Add Integrate NeMo Guardrails Library into Your Application dropdown back to overview * Add accidentally deleted line back --------- Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com> Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

* docs(run-rails): improve documentation structure and clarity minor fix * docs(generation-options): reorder sections based on importance/usage * Apply suggestions from code review Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

) * refactor(streaming): simplify streaming support validation - Remove `streaming` and `streaming_supported` properties from RailsConfig - Add `StreamingNotSupportedError` exception for clearer error handling - Move streaming validation from config-time to runtime in LLMRails - Remove `main_llm_supports_streaming` flag and related logic - Update CLI to catch StreamingNotSupportedError instead of pre-checking - Simplify server API streaming logic to use stream_async directly - Configure LLM streaming only when stream_async is called - Remove redundant streaming warnings and fallback logic BREAKING CHANGE: `RailsConfig.streaming` and `RailsConfig.streaming_supported` properties have been removed. Streaming support is now validated at runtime when `stream_async()` is called. * test: update tests for streaming refactor - Remove RailsConfig.streaming references from all tests - Update streaming tests to use StreamingNotSupportedError - Remove tests for deprecated streaming_supported property - Remove tests for main_llm_supports_streaming flag - Update CLI tests to verify StreamingNotSupportedError handling - Simplify test fixtures to remove streaming config parameters - Update token usage tests to reflect stream_usage always enabled - Fix test mocks to align with new streaming validation approach

…urces (NVIDIA-NeMo#1535)

* add reasoning guardrail connector Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> * update reasoning guardrail prompt Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> * feat(content_safety): add reasoning_enabled to prompt ctx Include the `reasoning_enabled` flag from the content safety config in the context passed to both input and output content safety check prompts. This enables prompt templates to conditionally adjust behavior based on the reasoning setting. Removes redundant config exposure in LLMTaskManager's default context. * revert streamers.py change * test(content_safety): add Nemotron reasoning parser tests * Revert "revert streamers.py change" This reverts commit e916dd4. * fix: add type ignore for streamer import * minor updates to docs, example Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> --------- Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com> Co-authored-by: Pouyanpi <13303554+Pouyanpi@users.noreply.github.com>

christinaexyou mentioned this pull request Oct 23, 2025

Add OpenTelemetry tracing configuration RobGeada/trustyai-service-operator#2

Open

m-misiura reviewed Oct 30, 2025

View reviewed changes

Comment thread scripts/setup_otel.py Outdated

m-misiura reviewed Oct 30, 2025

View reviewed changes

Comment thread scripts/setup_otel.py Outdated

m-misiura reviewed Oct 30, 2025

View reviewed changes

Comment thread scripts/setup_otel.py Outdated

m-misiura reviewed Oct 30, 2025

View reviewed changes

m-misiura requested changes Oct 30, 2025

View reviewed changes

m-misiura force-pushed the develop branch 2 times, most recently from b7d6750 to 603bf72 Compare November 10, 2025 16:54

christinaexyou force-pushed the develop branch from 0d1efdb to 3bd76e9 Compare November 18, 2025 01:34

github-actions Bot and others added 12 commits December 3, 2025 18:07

chore: prepare for release v0.19.0 (NVIDIA-NeMo#1519)

ccaeb19

chore(docker): upgrade to Python 3.12-slim base image (NVIDIA-NeMo#1522)

f2e77ac

Switch base image from python:3.10 to python:3.12-slim for improved performance and smaller image size. Use --no-install-recommends for apt package installation and clean up apt cache to reduce final image size.

chore(deps): restrict pytest version to <9.0.0 (NVIDIA-NeMo#1536)

9b59488

feat: propagate model and base URL in LLMCallException; improve error…

58f6e57

… handling (NVIDIA-NeMo#1502)

test: remove temp files created in aiperf unit-tests (NVIDIA-NeMo#1539)

5c0c461

* Use temporary path as default value in AIperf tests * Revert type fix that's needed locally but breaks on the server

fix: surface relevant exception when initializing langchain model (NV…

2f74cb6

…IDIA-NeMo#1516)

christinaexyou force-pushed the develop branch from 8ed88db to c20ac11 Compare December 17, 2025 21:28

miyoungc and others added 18 commits December 18, 2025 10:09

docs: more updates in the top pages (NVIDIA-NeMo#1552)

79da249

* more doc updates * rm duplicates * Update docs/about/overview.md Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> --------- Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

More PM doc updates (NVIDIA-NeMo#1553)

89300b8

* continuing with colang - working with actions * updating defining flows page * working with variables page

Colang 2.0 PM doc updates (NVIDIA-NeMo#1555)

4f6c05f

* continuing with colang - working with actions * updating defining flows page * working with variables page * through CSL section * finished language reference section * finishing colang 2 section

chore: run pre-commits to update license date for 2026 (NVIDIA-NeMo#1562

72e2e6e

)

docs(observability): fix broken links (NVIDIA-NeMo#1561)

46906f1

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com> Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>

feat(benchmark): Implement Mock LLM streaming (NVIDIA-NeMo#1564)

79101d2

* Add streaming to mock LLMs, TTFT and ITL to config files, rename LATENCY_* to E2E_LATENCY_* to distinguish between the two

docs(readme): update badges to use dynamic PyPI and GitHub Actions so…

8e8c053

…urces (NVIDIA-NeMo#1535)

m-misiura merged commit b6f4e6c into trustyai-explainability:develop Feb 10, 2026

		@@ -1,4 +1,4 @@
		# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
		# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.

Conversation

christinaexyou commented Oct 23, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

christinaexyou Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

m-misiura Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

m-misiura Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

m-misiura Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

m-misiura left a comment

Choose a reason for hiding this comment

Uh oh!

christinaexyou commented Nov 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

christinaexyou Nov 17, 2025 •

edited

Loading

m-misiura Oct 30, 2025 •

edited

Loading

m-misiura Oct 30, 2025 •

edited

Loading

m-misiura Oct 30, 2025 •

edited

Loading