| 1 | +<!-- |
| 2 | +SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 3 | +SPDX-License-Identifier: Apache-2.0 |
| 4 | +--> |
| 5 | + |
| 6 | +# Trace Incident Runbook |
| 7 | + |
| 8 | +Use this runbook when a NeMo Relay application has missing traces, partial |
| 9 | +traces, incorrect scope parentage, exporter failures, duplicate events, or |
| 10 | +sensitive data in telemetry. It assumes that the application already has a |
| 11 | +baseline scope and call instrumentation path. |
| 12 | + |
| 13 | +For first-time setup problems, start with the |
| 14 | +[Troubleshooting Guide](troubleshooting-guide.md). For conceptual grounding, |
| 15 | +refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md), |
| 16 | +[Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md), |
| 17 | +and [Subscribers](../about/concepts/subscribers.md). |
| 18 | + |
| 19 | +## Protect Sensitive Data First |
| 20 | + |
| 21 | +Do not collect raw prompts, model responses, authorization headers, tokens, |
| 22 | +customer records, tool arguments, or provider payloads while triaging an |
| 23 | +incident. Capture the smallest sanitized event sample that proves the failure. |
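One way to produce such a sample is to redact sensitive-looking fields before an event leaves the process. The following is a minimal sketch, not a NeMo Relay API: the `redact` helper, the `SENSITIVE_KEY_PARTS` list, and the sample event shape are assumptions for illustration. Adjust the key list to the payloads in your own application.

```python
from typing import Any

# Hypothetical key fragments that mark a field as sensitive (assumption).
SENSITIVE_KEY_PARTS = (
    "prompt", "response", "authorization", "token",
    "password", "api_key", "arguments", "payload",
)
REDACTED = "[REDACTED]"


def _is_sensitive(key: Any) -> bool:
    """Return True when the key name contains a sensitive fragment."""
    lowered = str(key).lower()
    return any(part in lowered for part in SENSITIVE_KEY_PARTS)


def redact(value: Any) -> Any:
    """Return a redacted copy of value; the input is left unchanged."""
    if isinstance(value, dict):
        return {
            key: REDACTED if _is_sensitive(key) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


# Hypothetical event shape built from the fields named in this runbook.
sample = {
    "uuid": "a1",
    "parent_uuid": None,
    "category": "tool",
    "name": "search",
    "metadata": {"Authorization": "Bearer abc", "attempt": 1},
}
print(redact(sample))
```

Matching on key names is deliberately coarse. It cannot catch sensitive values stored under neutral keys, so review the sample by hand before sharing it.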
| 24 | + |
| 25 | +Before exporting incident artifacts outside the current trust boundary, verify |
| 26 | +that sanitize guardrails or exporter filters remove sensitive fields. Sanitize |
| 27 | +guardrails change emitted telemetry payloads only; they do not change the live |
| 28 | +request or response passed to the tool, model provider, or application. Refer to |
| 29 | +[Middleware](../about/concepts/middleware.md) and |
| 30 | +[Add Middleware](../instrument-applications/advanced-guide.md) for the |
| 31 | +guardrail boundary. |
| 32 | + |
| 33 | +## Triage By Symptom |
| 34 | + |
| 35 | +Use this table to choose the first check for the symptom you see. |
| 36 | + |
| 37 | +| Symptom | Likely Area | Start With | |
| 38 | +|---|---|---| |
| 39 | +| No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) | |
| 40 | +| Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) | |
| 41 | +| Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) | |
| 42 | +| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) | |
| 43 | +| Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) | |
| 44 | +| Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) | |
| 45 | + |
| 46 | +## Run The Ordered Checks |
| 47 | + |
| 48 | +Run these checks in order before changing exporter or application code. |
| 49 | + |
| 50 | +1. Confirm the instrumentation boundary. |
| 51 | +2. Confirm the active scope and root scope ownership. |
| 52 | +3. Confirm managed tool and LLM calls. |
| 53 | +4. Confirm subscriber or exporter registration timing. |
| 54 | +5. Confirm exporter endpoint, environment, and flush behavior. |
| 55 | +6. Confirm sanitization before export. |
| 56 | + |
| 57 | +## Confirm Instrumentation Boundary |
| 58 | + |
| 59 | +Start with the code path that owns the real work. |
| 60 | + |
| 61 | +- If application code calls the tool or model provider directly, verify that the |
| 62 | + call path uses [Instrument Applications](../instrument-applications/about.md) |
| 63 | + guidance. |
| 64 | +- If a framework owns scheduling, retries, callbacks, or provider payloads, |
| 65 | + verify that the integration uses |
| 66 | + [Integrate into Frameworks](../integrate-frameworks/about.md) guidance. |
| 67 | +- If a plugin installs runtime behavior, verify that the plugin is activated |
| 68 | + before the request path starts. |
| 69 | + |
| 70 | +Do not debug an exporter first if no in-process subscriber sees events. Add or |
| 71 | +enable a sanitized in-process subscriber at the same boundary and confirm that |
| 72 | +scope, tool, or LLM events exist before investigating external export. |
| 73 | + |
| 74 | +## Confirm Active Scope |
| 75 | + |
| 76 | +Trace gaps and wrong parent-child relationships usually start with scope |
| 77 | +ownership. Verify these conditions: |
| 78 | + |
| 79 | +- Each request, agent run, or workflow starts under the intended top-level scope. |
| 80 | +- Detached tasks, worker threads, callbacks, and async jobs receive the intended |
| 81 | + scope stack when they should remain part of the same logical run. |
| 82 | +- Independent requests receive fresh isolated scope stacks. |
| 83 | +- Scope-local middleware and subscribers are registered on the owning scope or |
| 84 | + an ancestor scope. |
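The propagation condition for worker threads can be illustrated with Python's standard `contextvars` module. This is an analogy only: `current_scope` is a hypothetical stand-in for a scope stack, and whether a given NeMo Relay binding propagates scope automatically is something to confirm in its own documentation.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a runtime scope stack; not a NeMo Relay API.
current_scope = contextvars.ContextVar("current_scope", default="<no scope>")


def do_work() -> str:
    """Report which scope the running code observes."""
    return current_scope.get()


def run_in_worker(scope_name: str) -> str:
    """Run do_work on a worker thread with the caller's context copied over."""
    token = current_scope.set(scope_name)
    try:
        # Snapshot the caller's context so the worker sees the same scope.
        ctx = contextvars.copy_context()
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(ctx.run, do_work).result()
    finally:
        # Restore the previous value so independent requests stay isolated.
        current_scope.reset(token)


print(run_in_worker("request-42"))
```

The `finally` block mirrors the isolation condition above: once the logical run ends, the caller no longer observes the scope it set.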
| 85 | + |
| 86 | +Use [Adding Scopes and Marks](../instrument-applications/adding-scopes-and-marks.md) |
| 87 | +and [Scopes](../about/concepts/scopes.md) to compare the intended root scope |
| 88 | +with the emitted event `uuid` and `parent_uuid` values. |
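A quick offline comparison can list root events and events whose parent never appears in the sample. This sketch assumes each captured event is a dictionary with `uuid` and `parent_uuid` keys and that a root event has a `parent_uuid` of `None`. The `check_parentage` helper is hypothetical.

```python
from typing import Iterable


def check_parentage(events: Iterable[dict]) -> dict:
    """Return root uuids and uuids whose parent is missing from the sample."""
    events = list(events)
    known = {event["uuid"] for event in events}
    roots = sorted(
        {event["uuid"] for event in events if event.get("parent_uuid") is None}
    )
    orphans = sorted(
        {
            event["uuid"]
            for event in events
            if event.get("parent_uuid") is not None
            and event["parent_uuid"] not in known
        }
    )
    return {"roots": roots, "orphans": orphans}


# Hypothetical sanitized sample: tool-2 points at a parent that was never seen.
sample = [
    {"uuid": "root", "parent_uuid": None},
    {"uuid": "tool-1", "parent_uuid": "root"},
    {"uuid": "tool-2", "parent_uuid": "missing-scope"},
]
print(check_parentage(sample))
```

More than one root for a single logical run, or any orphan, suggests that some work ran on a different scope stack than intended. An orphan can also mean the sample was captured after the parent event was emitted.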
| 89 | + |
| 90 | +## Confirm Managed Calls |
| 91 | + |
| 92 | +Partial traces often mean some work bypasses the runtime helpers. Check these |
| 93 | +areas: |
| 94 | + |
| 95 | +- Tool calls that should emit tool start and end events use the managed tool |
| 96 | + call path. |
| 97 | +- Model calls that should emit LLM start and end events use the managed LLM call |
| 98 | + path or an integration wrapper that emits equivalent lifecycle events. |
| 99 | +- Manual lifecycle calls emit matched start and end events with the same |
| 100 | + lifecycle UUID. |
| 101 | +- Streaming LLM responses are drained until completion so final events, |
| 102 | + collectors, and subscribers can observe the completed output. |
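The matched start and end condition can be checked against a captured sample. This sketch assumes each event carries the lifecycle `uuid` plus a hypothetical `phase` field set to `"start"` or `"end"`. The real event schema may express the lifecycle phase differently, so adapt the field name before use.

```python
from typing import Iterable


def unmatched_lifecycles(events: Iterable[dict]) -> dict:
    """Return lifecycle uuids that have a start without an end, or the reverse."""
    starts, ends = set(), set()
    for event in events:
        # "phase" is an assumed field name, not confirmed by the event schema.
        if event["phase"] == "start":
            starts.add(event["uuid"])
        elif event["phase"] == "end":
            ends.add(event["uuid"])
    return {
        "missing_end": sorted(starts - ends),
        "missing_start": sorted(ends - starts),
    }


sample = [
    {"uuid": "llm-1", "phase": "start"},
    {"uuid": "llm-1", "phase": "end"},
    {"uuid": "llm-2", "phase": "start"},  # stream abandoned before completion
]
print(unmatched_lifecycles(sample))
```

A start without an end often points at a stream that was not drained or an exception path that skipped the end call.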
| 103 | + |
| 104 | +Refer to [Instrument a Tool Call](../instrument-applications/instrument-tool-call.md), |
| 105 | +[Instrument an LLM Call](../instrument-applications/instrument-llm-call.md), |
| 106 | +[Wrap Tool Calls](../integrate-frameworks/wrap-tool-calls.md), and |
| 107 | +[Wrap LLM Calls](../integrate-frameworks/wrap-llm-calls.md). |
| 108 | + |
| 109 | +## Confirm Subscriber And Exporter Registration |
| 110 | + |
| 111 | +Events are not buffered for subscribers that register after the event has |
| 112 | +already been emitted. Verify these conditions: |
| 113 | + |
| 114 | +- Plugin-managed observability components are loaded before the request path. |
| 115 | +- Manual subscribers are registered before the scope, tool, or LLM events they |
| 116 | + need to observe. |
| 117 | +- Scope-local subscribers are registered on a scope that is active for the work |
| 118 | + they should observe. |
| 119 | +- Exporter filters match the intended root scope or event category. |
| 120 | +- Shutdown, teardown, or request completion calls flush owned exporters before |
| 121 | + the process exits or the container stops. |
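The effect of late registration can be reproduced with a toy event bus that, like the behavior described above, keeps no replay buffer. `TinyEventBus` and the event names are hypothetical and exist only to illustrate the timing problem.

```python
from typing import Callable, List


class TinyEventBus:
    """Toy in-process bus with no replay buffer; not a NeMo Relay API."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, event: dict) -> None:
        # Deliver only to subscribers registered at the time of emission.
        for callback in list(self._subscribers):
            callback(event)


bus = TinyEventBus()
early: List[dict] = []
late: List[dict] = []

bus.subscribe(early.append)
bus.emit({"name": "scope.start"})
bus.subscribe(late.append)  # registered after the first event
bus.emit({"name": "scope.end"})

print([event["name"] for event in early])
print([event["name"] for event in late])
```

The late subscriber sees an end event without its start, which is the same shape a partial trace takes in a backend.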
| 122 | + |
| 123 | +Use [Observability](../plugins/observability/about.md), |
| 124 | +[Observability Configuration](../plugins/observability/configuration.md), and |
| 125 | +[Subscribers](../about/concepts/subscribers.md) to verify the registration |
| 126 | +lifecycle. |
| 127 | + |
| 128 | +## Confirm Exporter Setup |
| 129 | + |
| 130 | +If in-process event inspection works but export fails elsewhere, isolate |
| 131 | +exporter transport and configuration from runtime instrumentation. |
| 132 | + |
| 133 | +For file or trajectory export, confirm these settings: |
| 134 | + |
| 135 | +- Output paths are writable by the running process. |
| 136 | +- The application shuts down or clears the exporter in a path that flushes |
| 137 | + partial output. |
| 138 | +- Agent Trajectory Interchange Format (ATIF) export is scoped to the intended |
| 139 | +  agent root and does not mix concurrent root scopes. |
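The writable-path check can be run from the same user and container as the application. This is a minimal sketch using only the Python standard library. The `can_write_to` helper is hypothetical, and the directory you pass should be the one your exporter configuration points at.

```python
import tempfile
from pathlib import Path


def can_write_to(directory: str) -> bool:
    """Return True when the current process can create a file in directory."""
    path = Path(directory)
    if not path.is_dir():
        return False
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=path, prefix=".write-probe-"):
            pass
    except OSError:
        return False
    return True


print(can_write_to(tempfile.gettempdir()))
```

Creating a probe file is more reliable than inspecting permission bits, because read-only mounts and container volume settings also affect the result.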
| 140 | + |
| 141 | +For OpenTelemetry or OpenInference export, confirm these settings: |
| 142 | + |
| 143 | +- The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network |
| 144 | + egress are available in the target environment. |
| 145 | +- The exporter is enabled in the active configuration file or plugin document. |
| 146 | +- The backend receives spans with `nemo_relay.uuid` and |
| 147 | + `nemo_relay.parent_uuid` attributes. |
| 148 | +- The application flushes and shuts down the subscriber during graceful |
| 149 | + termination. |
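Network egress to the OTLP endpoint can be probed separately from the exporter. The sketch below only tests whether a TCP connection opens. It does not validate headers, credentials, TLS, or the OTLP protocol itself. The `endpoint_reachable` helper and the example URL are assumptions for illustration.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True when a TCP connection to the URL's host and port opens."""
    parts = urlsplit(url)
    if not parts.hostname:
        raise ValueError(f"URL needs a scheme and host, got: {url!r}")
    # Fall back to the scheme default when the URL has no explicit port.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# Hypothetical endpoint; 4318 is the conventional OTLP/HTTP port.
print(endpoint_reachable("http://localhost:4318"))
```

Run the probe from inside the target environment, because egress rules often differ between a developer machine and the deployed container.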
| 150 | + |
| 151 | +Refer to [Agent Trajectory Observability Format (ATOF)](../plugins/observability/atof.md), |
| 152 | +[Agent Trajectory Interchange Format (ATIF)](../plugins/observability/atif.md), |
| 153 | +[OpenTelemetry](../plugins/observability/opentelemetry.md), and |
| 154 | +[OpenInference](../plugins/observability/openinference.md). |
| 155 | + |
| 156 | +## Check For Duplicate Event Sources |
| 157 | + |
| 158 | +Duplicate events usually mean the same boundary is instrumented more than once. |
| 159 | +Check these areas: |
| 160 | + |
| 161 | +- The application does not wrap a call that a framework integration already |
| 162 | + wraps. |
| 163 | +- Manual lifecycle calls are not emitted around the same call that already uses |
| 164 | + managed tool or LLM helpers. |
| 165 | +- Plugin-managed exporters and manually registered exporters are not both |
| 166 | + active for the same output path or backend. |
| 167 | +- Retries issued by the framework or application are separate real attempts and |
| 168 | +  are not mistaken for duplicate telemetry of a single call. |
| 169 | + |
| 170 | +If duplicate events are expected because a retry or fallback actually executed |
| 171 | +more than once, preserve the events and add stable names or metadata that let |
| 172 | +the downstream backend distinguish attempts. |
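To tell true duplicates from distinct attempts, count how often the same lifecycle event appears in a captured sample. This sketch assumes events are dictionaries with a `uuid` key and a hypothetical `phase` field (`"start"` or `"end"`). Adapt the field name to the real event schema.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def duplicate_events(events: Iterable[dict]) -> Dict[Tuple[str, str], int]:
    """Return (uuid, phase) pairs that occur more than once, with counts."""
    # "phase" is an assumed field name, not confirmed by the event schema.
    counts = Counter((event["uuid"], event["phase"]) for event in events)
    return {key: count for key, count in counts.items() if count > 1}


sample = [
    {"uuid": "tool-1", "phase": "start"},
    {"uuid": "tool-1", "phase": "start"},  # same event delivered twice
    {"uuid": "tool-1", "phase": "end"},
    {"uuid": "tool-2", "phase": "start"},  # a retry with its own uuid
    {"uuid": "tool-2", "phase": "end"},
]
print(duplicate_events(sample))
```

The same `uuid` and phase appearing twice suggests duplicate subscribers or exporters. Two calls with different `uuid` values for the same work suggest duplicate wrappers or a real retry.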
| 173 | + |
| 174 | +## Confirm Sanitization Before Export |
| 175 | + |
| 176 | +Sensitive data in telemetry is an incident. Use this order: |
| 177 | + |
| 178 | +1. Stop or disable the affected exporter if sensitive data is leaving the |
| 179 | + intended trust boundary. |
| 180 | +2. Keep the application path stable unless the live request itself is unsafe. |
| 181 | +3. Add or fix sanitize-request and sanitize-response guardrails before |
| 182 | + subscribers and exporters receive events. |
| 183 | +4. Validate the sanitized event with ATOF JSONL or an in-process subscriber |
| 184 | + before re-enabling external export. |
| 185 | +5. Re-enable one exporter at a time and confirm the downstream backend no |
| 186 | + longer receives sensitive fields. |
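The validation in step 4 can be partly automated by scanning a JSONL capture for fields that still look sensitive. This sketch assumes only that each non-empty line is a JSON object. It does not depend on the exact ATOF schema, and the `SUSPECT_KEY_PARTS` list and `[REDACTED]` marker are hypothetical choices to adapt to your guardrails.

```python
import json
from typing import Any, Iterable, List, Tuple

# Hypothetical key fragments and redaction marker (assumptions).
SUSPECT_KEY_PARTS = ("prompt", "authorization", "token", "password", "api_key")
REDACTED = "[REDACTED]"


def find_sensitive_paths(value: Any, path: str = "") -> List[str]:
    """Return dotted paths of suspect keys whose values are not redacted."""
    hits: List[str] = []
    if isinstance(value, dict):
        for key, item in value.items():
            child = f"{path}.{key}" if path else str(key)
            suspect = any(part in str(key).lower() for part in SUSPECT_KEY_PARTS)
            if suspect and item != REDACTED:
                hits.append(child)
            hits.extend(find_sensitive_paths(item, child))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            hits.extend(find_sensitive_paths(item, f"{path}[{index}]"))
    return hits


def scan_jsonl(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, path) for every unredacted suspect field."""
    findings: List[Tuple[int, str]] = []
    for number, line in enumerate(lines, start=1):
        if line.strip():
            for hit in find_sensitive_paths(json.loads(line)):
                findings.append((number, hit))
    return findings


capture = [
    '{"uuid": "a", "metadata": {"authorization": "Bearer abc"}}',
    '{"uuid": "b", "metadata": {"authorization": "[REDACTED]"}}',
]
print(scan_jsonl(capture))
```

An empty result is necessary but not sufficient. The scan reports paths rather than values so that its own output stays safe to share, and a manual review is still needed for sensitive values stored under neutral key names.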
| 187 | + |
| 188 | +Use a request intercept only when the real request to the tool or provider must |
| 189 | +change. Use a sanitize guardrail when only the recorded telemetry should change. |
| 190 | + |
| 191 | +## Escalation Capture Checklist |
| 192 | + |
| 193 | +Collect this information before escalating an incident: |
| 194 | + |
| 195 | +- NeMo Relay version and binding package version. |
| 196 | +- Language binding and runtime version. |
| 197 | +- Whether instrumentation is direct application code, a framework integration, |
| 198 | + or plugin-managed behavior. |
| 199 | +- Exporter type, configuration source, and activation path. |
| 200 | +- Sanitized event sample that shows `uuid`, `parent_uuid`, `category`, |
| 201 | + `scope_category`, name, and redacted metadata. |
| 202 | +- Runtime shape, such as single process, worker pool, async tasks, sidecar, job |
| 203 | + queue, or container orchestration. |
| 204 | +- Reproduction scope, including whether the failure occurs for one request, one |
| 205 | + tenant, one service, or all requests. |
| 206 | +- Recent changes to instrumentation, plugin configuration, exporter endpoints, |
| 207 | + runtime environment, or tracing backend configuration. |
| 208 | + |
| 209 | +Do not attach raw prompts, model responses, credentials, customer records, |
| 210 | +authorization headers, or unredacted tool arguments to escalation artifacts. |