Skip to content

📈 Add OTel configuration#10

Merged
m-misiura merged 31 commits into
trustyai-explainability:developfrom
christinaexyou:develop
Feb 10, 2026
Merged

📈 Add OTel configuration#10
m-misiura merged 31 commits into
trustyai-explainability:developfrom
christinaexyou:develop

Conversation

@christinaexyou
Copy link
Copy Markdown

This PR configures NeMO to allow OpenTelemetry (OTel) to export traces to Jaegar. It adds scripts/setup_otel.py to configure OTel and updates scripts/entrypoint.sh to access this script and setup OTel before starting NeMo server.

Screenshot 2025-10-23 at 2 43 00 PM

Comment thread poetry.lock Outdated
@@ -1,4 +1,4 @@
# This file is automatically @generated by Poetry 1.8.2 and should not be changed by hand.
# This file is automatically @generated by Poetry 1.8.4 and should not be changed by hand.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the benefit of moving from Poetry 1.8.2 to Poetry 1.8.4?

Comment thread scripts/entrypoint.sh Outdated
api.app.disable_chat_ui = True
api.set_default_config_id('${CONFIG_ID}')

# Start the server using uvicorn
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why replace the cli path with a direct uvicorn? Cli was used previously as composition over replacement was deemed preferable

Copy link
Copy Markdown
Author

@christinaexyou christinaexyou Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Direct uvicorn setups OpenTelemetry before starting the server, which is required, whereas the CLI starts the server immediately (line 155 in cli/init.py)

Comment thread scripts/setup_otel.py Outdated
service_version = os.getenv("OTEL_SERVICE_VERSION", "1.0.0")
environment = os.getenv("OTEL_ENVIRONMENT", "production")
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
enable_console = os.getenv("OTEL_ENABLE_CONSOLE", "true").lower() == "true"
Copy link
Copy Markdown
Collaborator

@m-misiura m-misiura Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the default value set on line 17 conflict with entrypoint.sh?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

Comment thread scripts/setup_otel.py Outdated
Comment thread scripts/setup_otel.py Outdated
Comment thread scripts/setup_otel.py Outdated
Comment thread scripts/entrypoint.sh Outdated
export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"
export OTEL_EXPORTER_OTLP_ENDPOINT="${OTEL_EXPORTER_OTLP_ENDPOINT:-http://jaeger:4317}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this setup seems to be tightly coupled to http://jaeger:4317 and presumably assumed same-namespace deployment; is this always the case? if it isn't, should the user have some kind of control over this configuration?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, generally servers and tracing/metrics services are deployed in the same namespace

Comment thread scripts/entrypoint.sh Outdated
print(f'❌ OpenTelemetry setup failed: {e}')
import traceback
traceback.print_exc()
sys.exit(1)
Copy link
Copy Markdown
Collaborator

@m-misiura m-misiura Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so if any error occurs during OTel setup, the server refuses to start? I'd say that observability should not break core functionality

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right - fixed this and added messages to verify setup

Comment thread scripts/entrypoint.sh Outdated
from setup_otel import setup_opentelemetry
setup_opentelemetry()
print('✅ OpenTelemetry configured in server process')
except Exception as e:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this is a too broad exception handling?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added different exception handling according to the error

Comment thread Dockerfile.server Outdated

RUN poetry install --no-ansi --extras="sdd jailbreak openai nvidia tracing" && \
poetry run pip install "spacy>=3.4.4,<4.0.0" && \
poetry run pip install \
Copy link
Copy Markdown
Collaborator

@m-misiura m-misiura Oct 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably move all these deps inside pyproject.toml

I did not want to initially touch pyproject.toml, but if we are adding more or more deps, they should be managed via pyproject.toml

also I think some of the opentelemetry packages are already inside .toml

I think it might be best to place opentelemetry packages inside tracing

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially had them inside pyproject.toml but the NeMo team wants the server to be independent of opentelemetry API packages

Comment thread scripts/entrypoint.sh Outdated
fi

# Setup OpenTelemetry
ENABLE_OTEL="${ENABLE_OTEL:-${AUTO_SETUP_OTEL:-false}}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the advantage of having the dual variable fallback here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% a typo - removed AUTO_SETUP_OTEL

Comment thread scripts/entrypoint.sh Outdated
export OTEL_SERVICE_VERSION="${OTEL_SERVICE_VERSION:-1.0.0}"
export OTEL_ENVIRONMENT="${OTEL_ENVIRONMENT:-production}"
export OTEL_ENABLE_CONSOLE="${OTEL_ENABLE_CONSOLE:-false}"
export OTEL_EXPORTER_OTLP_INSECURE="${OTEL_EXPORTER_OTLP_INSECURE:-true}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are there any scenarios when this can be secure?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if we configure OpenTelemetry/Jaegar to use secure connections but that configuration is independent from the NeMo server (a user would have to specify it on their OTelCollector CR)

Copy link
Copy Markdown
Collaborator

@m-misiura m-misiura left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I think it could be beneficial to consider the following:

  • tweak how the server starts up in case there are OTel errors
  • add more informative logging so that debugging becomes easier
  • adding some documentation / example deployment manifests to the PR

I left some additional comments, highlighting specific lines that could be worth revisiting :)

@christinaexyou
Copy link
Copy Markdown
Author

@m-misiura thanks for your review and comments ! i updated my PR with the following:

  • Better error handling in case the OpenTelemetry configuration fails
  • Logging throughout
  • Documentation on how to locally test OTel integration

github-actions Bot and others added 12 commits December 3, 2025 18:07
* start restructuring and add local build automation

* refresh poetry lock

* WIP new IA implementation

* more restructure and script to auto-update cards

* more work

* improve

* nit to stage

* details

* nit got to be staged

* pull out use case page and add table

* more improvements and extend update cards script

* big improvements

* polish intro a bit

* small edit

* incorporate devs feedback

* remaining pages

* more fixes

* small edits

* more updated and incorporate feedback

* some edits

* accidentally dropped file
…VIDIA-NeMo#1523)

* chore(test): reduce default pytest log level from DEBUG to WARNING

The DEBUG level was flooding test output with verbose nemoguardrails
logs,making it difficult to read test results. Users can still enable
verbose logging when needed via `--log-level=DEBUG`.

* Updated test that relies on INFO-level log capture to explicitly set caplog level.
Switch base image from python:3.10 to python:3.12-slim for improved
performance and smaller image size. Use --no-install-recommends for apt
package installation and clean up apt cache to reduce final image size.
* Use temporary path as default value in AIperf tests

* Revert type fix that's needed locally but breaks on the server
Add extensive test coverage for all model initialization paths and edge
cases.Tests cover success scenarios, error handling, exception priority,
mode filtering, and E2E integration through the full initialization
chain.
… bot messages (NVIDIA-NeMo#1530)

* feat(content_safety): add auto selected multilingual refusal bot message support

Detect user input language and return refusal messages in the same
language when content safety rails block unsafe content. Supports 9
languages: English, Spanish, Chinese, German, French, Hindi, Japanese,
Arabic, and Thai.
…-NeMo#1542)

* docs(streaming): update streaming configuration documentation

feature: NVIDIA-NeMo#1538
Revise streaming docs to clarify usage of stream_async(), remove
outdated global streaming config, and add CLI usage instructions.
Explain output rails streaming requirements and deprecation of
StreamingHandler. Improve examples and guidance for token usage
tracking.

* not sure about these 2
* docs: add multilingual refusal messages documentation

Add documentation for the multilingual refusal messages feature in
content safety rails, including:

- Supported languages table with default messages
- Basic and advanced configuration examples
- Language detection behavior and accuracy benchmarks
- Cold start behavior and caching details
- Production considerations for container environments

* docs: add installation guide for multilingual feature

* remove configuration options
* docs: small update to test

* fixing weird bug

* updating overview page

* updating use cases page

* updating how it works section

* updating supported llms page

* updating getting started section

* updating based on meeting discussion

* adjusting rail types order

* updating first configure rails docs

* Update docs/about/rail-types.md

Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
Signed-off-by: alexahaushalter <alexahaushalter@hotmail.com>

* updating first overall config files

* updating guardrails config docs

* finishing Core Configuration updates

* updating actions section

* updating Colang 2.0 getting started section

* a few more updates

* trying to get a Verified commit

* trying to get a Verified commit attempt 2

---------

Signed-off-by: alexahaushalter <alexahaushalter@hotmail.com>
Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
miyoungc and others added 18 commits December 18, 2025 10:09
* add frontmatter and card updates

* more fixes and also the library mentions

* fix capitalization

* Update docs/run-rails/index.md

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/run-rails/index.md

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* save

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
* more doc updates

* rm duplicates

* Update docs/about/overview.md

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
* continuing with colang - working with actions

* updating defining flows page

* working with variables page
* continuing with colang - working with actions

* updating defining flows page

* working with variables page

* through CSL section

* finished language reference section

* finishing colang 2 section
…rivateAI) (NVIDIA-NeMo#1545)

* Adding GLiNER

Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com>

* addressing comments pt1

* revamping GLiNER server

* addressing comments pt2

* fixing path on integration test after restructure

* addressing comments pt3

* rolling back changes to project, not needed

* mocking request on test, rolling back vscode change committed by mistake

* updating docs

* addressing final comments

* rollback vscode

* updating licenses

* making extract function non-blocking by removing async

---------

Co-authored-by: Lipika Ramaswamy <lramaswamy@nvidia.com>
* Move mock LLMs into top-level benchmark dir, local content_safety under examples, and benchmark-specific tests

* Initial checkin of validation script

* Remove un-needed files under nemoguardrails/benchmark

* Move unit-tests under benchamrk top-level dir

* Update unit-tests with new code location

* Add requirements to keep benchmark dependencies separate from Guardrails itself

* Update server run script and Procfile with new file locations

* Return np.array with size (1,) from mock function calls in tests

* Remove langchain_nvidia_ai_endpoints from requirements, Guardrails already has this in the poetry env. It's not used in the mocks

* Update README to match new file locations and include venv instructions

* Cleanups to the README

* README.md cleanup
…d align with cards and nav (NVIDIA-NeMo#1554)

* another round of frontmatter and titles

* some improvements for get started

* update

* polish content safety tutorial and add use case diagrams to resources

* improve tutorials, add config reference based on the source code

* incorporate feedback
…VIDIA-NeMo#1526)

OpenAI reasoning models (o1, o3, gpt-5 series excluding gpt-5-chat) only
support `temperature=1`. When NeMo Guardrails uses
`.bind(temperature=0.001)` for deterministic tasks like self-check
input/output, the API returns an error:

```
Unsupported value: 'temperature' does not support 0.001 with this model.
Only the default (1) value is supported.
```

This happens because LangChain's `ChatOpenAI` handles temperature
restrictions at initialization time (setting `temperature=None` for
reasoning models), but `.bind()` bypasses this protection and passes the
temperature directly to the API.

- Added `_filter_params_for_openai_reasoning_models()` function that:
- Detects reasoning models by name (o1*, o3*, gpt-5* excluding
gpt-5-chat)
- Removes the `temperature` parameter before binding for these models
- Follows the same pattern as LangChain's `validate_temperature`
validator
…ing guides (NVIDIA-NeMo#1563)

* refactor(docs): move alignscore and safeguarding guides to advanced section

* docs(langchain): update and reorganize integration docs

* fix typo

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
* Add streaming to mock LLMs, TTFT and ITL to config files, rename LATENCY_* to E2E_LATENCY_* to distinguish between the two
* improve doc

* nit

* Update docs/configure-rails/overview.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/about/overview.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/configure-rails/actions/index.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/configure-rails/overview.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/configure-rails/guardrail-catalog.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/user-guides/community/gliner.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/user-guides/community/gliner.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/user-guides/community/gliner.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/configure-rails/guardrail-catalog.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/configure-rails/guardrail-catalog.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* Update docs/user-guides/community/gliner.md

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

* resolve feedback

* nit

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
* First pass over About NeMo Guardrails Library

* Address feedback. Remove extra space, move diagrams into Resource section, fix README examples link

* Fixed how-it-works top-evel card

* Update docs/about/supported-llms.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com>

* Update docs/about/supported-llms.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com>

* Update docs/about/supported-llms.md

Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com>

* Add Integrate NeMo Guardrails Library into Your Application dropdown back to overview

* Add accidentally deleted line back

---------

Signed-off-by: Tim Gasser <200644301+tgasser-nv@users.noreply.github.com>
Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Co-authored-by: Chris Parisien <64271260+cparisien@users.noreply.github.com>
Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
* docs(run-rails): improve documentation structure and clarity

minor fix

* docs(generation-options): reorder sections based on importance/usage

* Apply suggestions from code review

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>

---------

Signed-off-by: Miyoung Choi <miyoungc@nvidia.com>
Co-authored-by: Miyoung Choi <miyoungc@nvidia.com>
)

* refactor(streaming): simplify streaming support validation

- Remove `streaming` and `streaming_supported` properties from
RailsConfig
- Add `StreamingNotSupportedError` exception for clearer error handling
- Move streaming validation from config-time to runtime in LLMRails
- Remove `main_llm_supports_streaming` flag and related logic
- Update CLI to catch StreamingNotSupportedError instead of pre-checking
- Simplify server API streaming logic to use stream_async directly
- Configure LLM streaming only when stream_async is called
- Remove redundant streaming warnings and fallback logic

BREAKING CHANGE: `RailsConfig.streaming` and
`RailsConfig.streaming_supported` properties have been removed.
Streaming support is now validated at runtime when `stream_async()`
is called.

* test: update tests for streaming refactor

- Remove RailsConfig.streaming references from all tests
- Update streaming tests to use StreamingNotSupportedError
- Remove tests for deprecated streaming_supported property
- Remove tests for main_llm_supports_streaming flag
- Update CLI tests to verify StreamingNotSupportedError handling
- Simplify test fixtures to remove streaming config parameters
- Update token usage tests to reflect stream_usage always enabled
- Fix test mocks to align with new streaming validation approach
* add reasoning guardrail connector

Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com>

* update reasoning guardrail prompt

Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com>

* feat(content_safety): add reasoning_enabled to prompt ctx

Include the `reasoning_enabled` flag from the content safety config in
the context passed to both input and output content safety check
prompts.
This enables prompt templates to conditionally adjust behavior based on
the reasoning setting. Removes redundant config exposure in
LLMTaskManager's default context.

* revert streamers.py change

* test(content_safety): add Nemotron reasoning parser tests

* Revert "revert streamers.py change"

This reverts commit e916dd4.

* fix: add type ignore for streamer import

* minor updates to docs, example

Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com>

---------

Signed-off-by: Makesh Sreedhar <makeshn@nvidia.com>
Co-authored-by: Pouyanpi <13303554+Pouyanpi@users.noreply.github.com>
@m-misiura m-misiura merged commit b6f4e6c into trustyai-explainability:develop Feb 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants