Commit acafd6f

docs: add trace incident runbook (#149)
#### Overview

Adds a production incident runbook for diagnosing NeMo Relay trace and telemetry issues.

- [x] I confirm this contribution is my own work, or I have the right to submit it under this project's license.
- [x] I searched existing issues and open pull requests, and this does not duplicate existing work.

#### Details

- Adds a symptom-first runbook for missing traces, partial traces, wrong scope parentage, exporter failures, duplicate events, and sensitive telemetry.
- Links the runbook from the troubleshooting guide, observability docs, and docs index.
- Keeps production escalation guidance limited to sanitized event samples.

#### Where should the reviewer start?

`docs/troubleshooting/production-incident-runbook.md`

#### Related Issues

## Summary by CodeRabbit

* **Documentation**
  * Added comprehensive Trace Incident Runbook for troubleshooting missing traces, scope attachment issues, export failures, duplicate events, and sensitive telemetry concerns
  * Enhanced navigation and guidance directing users to trace troubleshooting resources
  * Includes symptom diagnosis tables, step-by-step verification checks, and escalation procedures with security-focused data collection guidelines

[![Review Change Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/NVIDIA/NeMo-Relay/pull/149?utm_source=github_walkthrough&utm_medium=github&utm_campaign=change_stack)

Authors:
- https://github.com/mnajafian-nv

Approvers:
- Will Killian (https://github.com/willkill07)

URL: #149
1 parent 8a04939 commit acafd6f

4 files changed

Lines changed: 221 additions & 1 deletion

docs/index.md

Lines changed: 2 additions & 0 deletions
@@ -68,6 +68,7 @@ Use the reading path that matches your task:
 | Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
 | Package reusable behavior | [Build Plugins](build-plugins/about.md) |
 | Export traces or trajectories | [Observability](plugins/observability/about.md) |
+| Debug trace incidents | [Trace Incident Runbook](troubleshooting/trace-incident-runbook.md) |
 | Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
 | Look up symbols | [APIs](reference/api/index.md) |

@@ -270,6 +271,7 @@ reference/performance
 :maxdepth: 2

 Troubleshooting Guide <troubleshooting/troubleshooting-guide>
+Trace Incident Runbook <troubleshooting/trace-incident-runbook>
 ```

 ```{toctree}

docs/plugins/observability/about.md

Lines changed: 5 additions & 1 deletion
@@ -57,9 +57,13 @@ Choose the exporter based on the downstream system:
 | Generic OTLP traces | [OpenTelemetry](opentelemetry.md) |
 | OpenInference-oriented agent and LLM spans | [OpenInference](openinference.md) |

-Start with local event inspection before production export. Add sanitize
+Start with in-process event inspection before exporting externally. Add sanitize
 guardrails before exporters receive sensitive payloads.

+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, use the
+[Trace Incident Runbook](../../troubleshooting/trace-incident-runbook.md).
+
 ## Correlating Trajectories And Traces

 When ATIF and trace exporters observe the same NeMo Relay events, they share

docs/troubleshooting/trace-incident-runbook.md

Lines changed: 210 additions & 0 deletions
@@ -0,0 +1,210 @@
<!--
SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->

# Trace Incident Runbook

Use this runbook when a NeMo Relay application has missing traces, partial
traces, incorrect scope parentage, exporter failures, duplicate events, or
sensitive data in telemetry. It assumes that the application already has a
baseline scope and call instrumentation path.

For first-time setup problems, start with the
[Troubleshooting Guide](troubleshooting-guide.md). For conceptual grounding,
refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
[Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
and [Subscribers](../about/concepts/subscribers.md).

## Protect Sensitive Data First

Do not collect raw prompts, model responses, authorization headers, tokens,
customer records, tool arguments, or provider payloads while triaging an
incident. Capture the smallest sanitized event sample that proves the failure.

Before exporting incident artifacts outside the current trust boundary, verify
that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
guardrails change emitted telemetry payloads only; they do not change the live
request or response passed to the tool, model provider, or application. Refer to
[Middleware](../about/concepts/middleware.md) and
[Add Middleware](../instrument-applications/advanced-guide.md) for the
guardrail boundary.
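
When preparing an incident artifact by hand, a small offline redaction pass can help confirm that a captured sample is safe to share. The sketch below is not a NeMo Relay sanitize guardrail and does not use any NeMo Relay API; it assumes events have already been captured as plain dictionaries, and the key names in `SENSITIVE_KEYS` and the sample event are hypothetical.

```python
from typing import Any

# Hypothetical key names; adjust to the fields your captured events carry.
SENSITIVE_KEYS = {"prompt", "response", "authorization", "token", "arguments", "payload"}
REDACTED = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a copy of `value` with sensitive mapping entries replaced."""
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical captured event, used only to exercise the helper.
event = {
    "uuid": "a1",
    "parent_uuid": None,
    "category": "tool",
    "name": "lookup_customer",
    "metadata": {"arguments": {"customer_id": 42}, "attempt": 1},
}
sanitized = redact(event)
print(sanitized)
```

The helper builds a new structure and leaves the input untouched, so the original capture can stay inside the trust boundary while only the redacted copy is attached to a ticket. A key-name deny list is a coarse filter; sensitive values stored under unexpected keys would still need manual review.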

## Triage By Symptom

Use this table to choose the first check for the symptom you see.

| Symptom | Likely Area | Start With |
|---|---|---|
| No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
| Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
| Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
| Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
| Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |

## Run The Ordered Checks

Run these checks in order before changing exporter or application code.

1. Confirm the instrumentation boundary.
2. Confirm the active scope and root scope ownership.
3. Confirm managed tool and LLM calls.
4. Confirm subscriber or exporter registration timing.
5. Confirm exporter endpoint, environment, and flush behavior.
6. Confirm sanitization before export.

## Confirm Instrumentation Boundary

Start with the code path that owns the real work.

- If application code calls the tool or model provider directly, verify that the
  call path uses [Instrument Applications](../instrument-applications/about.md)
  guidance.
- If a framework owns scheduling, retries, callbacks, or provider payloads,
  verify that the integration uses
  [Integrate into Frameworks](../integrate-frameworks/about.md) guidance.
- If a plugin installs runtime behavior, verify that the plugin is activated
  before the request path starts.

Do not debug an exporter first if no in-process subscriber sees events. Add or
enable a sanitized in-process subscriber at the same boundary and confirm that
scope, tool, or LLM events exist before investigating external export.

## Confirm Active Scope

Trace gaps and wrong parent-child relationships usually start with scope
ownership. Verify these conditions:

- Each request, agent run, or workflow starts under the intended top-level scope.
- Detached tasks, worker threads, callbacks, and async jobs receive the intended
  scope stack when they should remain part of the same logical run.
- Independent requests receive fresh isolated scope stacks.
- Scope-local middleware and subscribers are registered on the owning scope or
  an ancestor scope.

Use [Adding Scopes and Marks](../instrument-applications/adding-scopes-and-marks.md)
and [Scopes](../about/concepts/scopes.md) to compare the intended root scope
with the emitted event `uuid` and `parent_uuid` values.
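
The `uuid` and `parent_uuid` comparison can be done offline on a sanitized sample. The following is a minimal sketch, assuming events have been captured as dictionaries that carry `uuid` and `parent_uuid` keys and that a root event has a `parent_uuid` of `None`; the helper name `check_parentage` and the sample events are hypothetical, and no NeMo Relay API is used.

```python
def check_parentage(events):
    """Return (roots, orphans) for event dicts that carry uuid and parent_uuid.

    roots:   uuids whose parent_uuid is None.
    orphans: uuids whose parent_uuid does not match any uuid in the sample.
    """
    known = {event["uuid"] for event in events}
    roots = sorted(
        {event["uuid"] for event in events if event.get("parent_uuid") is None}
    )
    orphans = sorted(
        {
            event["uuid"]
            for event in events
            if event.get("parent_uuid") is not None
            and event["parent_uuid"] not in known
        }
    )
    return roots, orphans


# Hypothetical sanitized sample: one root, one attached child, one orphan.
events = [
    {"uuid": "r1", "parent_uuid": None, "name": "agent_run"},
    {"uuid": "t1", "parent_uuid": "r1", "name": "search_tool"},
    {"uuid": "l1", "parent_uuid": "gone", "name": "llm_call"},
]
roots, orphans = check_parentage(events)
print("roots:", roots)
print("orphans:", orphans)
```

More than one root inside what should be a single logical run may point to a detached task that started a fresh scope stack. An orphan may mean either a propagation problem or simply that the parent event was not included in the sample, so widen the capture window before drawing conclusions.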

## Confirm Managed Calls

Partial traces often mean some work bypasses the runtime helpers. Check these
areas:

- Tool calls that should emit tool start and end events use the managed tool
  call path.
- Model calls that should emit LLM start and end events use the managed LLM call
  path or an integration wrapper that emits equivalent lifecycle events.
- Manual lifecycle calls emit matched start and end events with the same
  lifecycle UUID.
- Streaming LLM responses are drained until completion so final events,
  collectors, and subscribers can observe the completed output.

Refer to [Instrument a Tool Call](../instrument-applications/instrument-tool-call.md),
[Instrument an LLM Call](../instrument-applications/instrument-llm-call.md),
[Wrap Tool Calls](../integrate-frameworks/wrap-tool-calls.md), and
[Wrap LLM Calls](../integrate-frameworks/wrap-llm-calls.md).
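
The matched start and end condition above can be checked against a captured sample. This sketch assumes each event dictionary carries the lifecycle `uuid` plus a `phase` field with the value `"start"` or `"end"`; the `phase` field name is an assumption made for illustration, so substitute whatever distinguishes start from end events in your capture.

```python
from collections import Counter


def unmatched_lifecycles(events):
    """Return lifecycle uuids whose start and end event counts differ."""
    starts = Counter(e["uuid"] for e in events if e["phase"] == "start")
    ends = Counter(e["uuid"] for e in events if e["phase"] == "end")
    # A Counter returns 0 for missing keys, so one-sided lifecycles are caught.
    return sorted(u for u in starts.keys() | ends.keys() if starts[u] != ends[u])


# Hypothetical sample: the tool call completes, the LLM call never ends.
lifecycle_events = [
    {"uuid": "t1", "phase": "start", "name": "search_tool"},
    {"uuid": "t1", "phase": "end", "name": "search_tool"},
    {"uuid": "l1", "phase": "start", "name": "llm_call"},
]
print(unmatched_lifecycles(lifecycle_events))
```

A start without an end is consistent with a streaming response that was not drained or a manual lifecycle call that skipped its end event on an error path, but it can also mean the sample was cut off while the call was still in flight.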

## Confirm Subscriber And Exporter Registration

Events are not buffered for subscribers that register after the event has
already been emitted. Verify these conditions:

- Plugin-managed observability components are loaded before the request path.
- Manual subscribers are registered before the scope, tool, or LLM events they
  need to observe.
- Scope-local subscribers are registered on a scope that is active for the work
  they should observe.
- Exporter filters match the intended root scope or event category.
- Shutdown, teardown, or request completion calls flush owned exporters before
  the process exits or the container stops.

Use [Observability](../plugins/observability/about.md),
[Observability Configuration](../plugins/observability/configuration.md), and
[Subscribers](../about/concepts/subscribers.md) to verify the registration
lifecycle.
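
The effect of late registration can be illustrated with a toy dispatcher. `ToyBus` below is a stand-in written for this example, not the NeMo Relay subscriber API; it only models the stated behavior that events are delivered to the subscribers registered at emit time and are not replayed later.

```python
class ToyBus:
    """Toy dispatcher: delivers each event only to current subscribers."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def emit(self, event):
        # No buffering: subscribers added after this call never see `event`.
        for callback in list(self._subscribers):
            callback(event)


bus = ToyBus()
early, late = [], []

bus.subscribe(early.append)        # registered before any work starts
bus.emit({"name": "scope_start"})
bus.subscribe(late.append)         # registered after the scope already started
bus.emit({"name": "tool_start"})

print(len(early), len(late))
```

The late subscriber misses `scope_start`, which is the same shape as a partial trace where child events arrive without their enclosing scope.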

## Confirm Exporter Setup

If in-process event inspection works but export fails elsewhere, isolate
exporter transport and configuration from runtime instrumentation.

For file or trajectory export, confirm these settings:

- Output paths are writable by the running process.
- The application shuts down or clears the exporter in a path that flushes
  partial output.
- ATIF export is scoped to the intended agent root and does not mix concurrent
  root scopes.
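
The writable-path check can be run from inside the same container or service account as the application, since permissions often differ from a developer shell. This is a minimal standard-library sketch; the helper name `can_write_output` and the example path are hypothetical.

```python
import tempfile
from pathlib import Path


def can_write_output(path):
    """Return True if this process can create a file in the parent of `path`."""
    directory = Path(path).resolve().parent
    if not directory.is_dir():
        return False
    try:
        # Create and immediately remove a probe file in the target directory.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False


# Hypothetical output path; replace with the configured exporter path.
print(can_write_output("trace-output/events.jsonl"))
```

Creating a probe file tests the effective permissions, read-only mounts, and missing directories in one step, which a mode-bit check alone would not cover.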

For OpenTelemetry or OpenInference export, confirm these settings:

- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
  egress are available in the target environment.
- The exporter is enabled in the active configuration file or plugin document.
- The backend receives spans with `nemo_relay.uuid` and
  `nemo_relay.parent_uuid` attributes.
- The application flushes and shuts down the subscriber during graceful
  termination.

Refer to [Agent Trajectory Observability Format (ATOF)](../plugins/observability/atof.md),
[Agent Trajectory Interchange Format (ATIF)](../plugins/observability/atif.md),
[OpenTelemetry](../plugins/observability/opentelemetry.md), and
[OpenInference](../plugins/observability/openinference.md).

## Check For Duplicate Event Sources

Duplicate events usually mean the same boundary is instrumented more than once.
Check these areas:

- The application does not wrap a call that a framework integration already
  wraps.
- Manual lifecycle calls are not emitted around the same call that already uses
  managed tool or LLM helpers.
- Plugin-managed exporters and manually registered exporters are not both
  active for the same output path or backend.
- Retry logic belongs to the framework or application and is not being counted
  as duplicate telemetry for the same real call.

If duplicate events are expected because a retry or fallback actually executed
more than once, preserve the events and add stable names or metadata that let
the downstream backend distinguish attempts.
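
One way to separate duplicate delivery from genuine repeated work is to count how often the same event identity appears in a sanitized sample. The sketch below assumes event dictionaries with a `uuid` and a `phase` field (`"start"` or `"end"`); `phase` is an assumed field name used for illustration, and the helper is not part of NeMo Relay.

```python
from collections import Counter


def duplicate_deliveries(events):
    """Return {(uuid, phase): count} for identities seen more than once."""
    counts = Counter((event["uuid"], event["phase"]) for event in events)
    return {key: count for key, count in counts.items() if count > 1}


# Hypothetical sample: the same start event was recorded twice.
captured = [
    {"uuid": "t1", "phase": "start", "name": "search_tool"},
    {"uuid": "t1", "phase": "start", "name": "search_tool"},
    {"uuid": "t1", "phase": "end", "name": "search_tool"},
    {"uuid": "t2", "phase": "start", "name": "search_tool"},
]
print(duplicate_deliveries(captured))
```

The same `uuid` and phase appearing twice suggests the event was delivered or exported more than once, for example through two subscribers. Distinct `uuid` values under the same name, like `t1` and `t2` here, are more consistent with a second wrapper or a real retry, and this helper deliberately does not flag them.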

## Confirm Sanitization Before Export

Sensitive data in telemetry is an incident. Use this order:

1. Stop or disable the affected exporter if sensitive data is leaving the
   intended trust boundary.
2. Keep the application path stable unless the live request itself is unsafe.
3. Add or fix sanitize-request and sanitize-response guardrails before
   subscribers and exporters receive events.
4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
   before re-enabling external export.
5. Re-enable one exporter at a time and confirm the downstream backend no
   longer receives sensitive fields.

Use a request intercept only when the real request to the tool or provider must
change. Use a sanitize guardrail when only the recorded telemetry should change.

## Escalation Capture Checklist

Collect this information before escalating an incident:

- NeMo Relay version and binding package version.
- Language binding and runtime version.
- Whether instrumentation is direct application code, a framework integration,
  or plugin-managed behavior.
- Exporter type, configuration source, and activation path.
- Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
  `scope_category`, name, and redacted metadata.
- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
  queue, or container orchestration.
- Reproduction scope, including whether the failure occurs for one request, one
  tenant, one service, or all requests.
- Recent changes to instrumentation, plugin configuration, exporter endpoints,
  runtime environment, or tracing backend configuration.

Do not attach raw prompts, model responses, credentials, customer records,
authorization headers, or unredacted tool arguments to escalation artifacts.

docs/troubleshooting/troubleshooting-guide.md

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ SPDX-License-Identifier: Apache-2.0

 Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.

+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, start with the
+[Trace Incident Runbook](trace-incident-runbook.md).
+
 ## Package Or Build Setup Fails

 Confirm that your environment matches [Prerequisites](../getting-started/prerequisites.md), then rerun the binding-specific setup command from [Installation](../getting-started/installation.md).
