Commit bcf1d44

docs: generalize trace incident runbook
Signed-off-by: mnajafian-nv <mnajafian@nvidia.com>
1 parent a47926d commit bcf1d44

4 files changed

Lines changed: 32 additions & 32 deletions

docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@ Use the reading path that matches your task:
 | Observe a local coding-agent CLI | [NeMo Relay CLI](nemo-relay-cli/about.md) |
 | Package reusable behavior | [Build Plugins](build-plugins/about.md) |
 | Export traces or trajectories | [Observability](plugins/observability/about.md) |
-| Debug production trace incidents | [Production Incident Runbook](troubleshooting/production-incident-runbook.md) |
+| Debug trace incidents | [Trace Incident Runbook](troubleshooting/trace-incident-runbook.md) |
 | Tune performance with adaptive behavior | [Adaptive](plugins/adaptive/about.md) |
 | Look up symbols | [APIs](reference/api/index.md) |

@@ -271,7 +271,7 @@ reference/performance
 :maxdepth: 2

 Troubleshooting Guide <troubleshooting/troubleshooting-guide>
-Production Incident Runbook <troubleshooting/production-incident-runbook>
+Trace Incident Runbook <troubleshooting/trace-incident-runbook>
 ```

 ```{toctree}

docs/plugins/observability/about.md

Lines changed: 4 additions & 4 deletions
@@ -57,12 +57,12 @@ Choose the exporter based on the downstream system:
 | Generic OTLP traces | [OpenTelemetry](opentelemetry.md) |
 | OpenInference-oriented agent and LLM spans | [OpenInference](openinference.md) |

-Start with local event inspection before production export. Add sanitize
+Start with in-process event inspection before exporting externally. Add sanitize
 guardrails before exporters receive sensitive payloads.

-For production incidents involving missing traces, wrong scope attachment,
-export failures, duplicate events, or sensitive telemetry, use the
-[Production Incident Runbook](../../troubleshooting/production-incident-runbook.md).
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, use the
+[Trace Incident Runbook](../../troubleshooting/trace-incident-runbook.md).

 ## Correlating Trajectories And Traces

docs/troubleshooting/production-incident-runbook.md renamed to docs/troubleshooting/trace-incident-runbook.md

Lines changed: 23 additions & 23 deletions
@@ -3,11 +3,11 @@ SPDX-FileCopyrightText: Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All
 SPDX-License-Identifier: Apache-2.0
 -->

-# Production Incident Runbook
+# Trace Incident Runbook

-Use this runbook when a production NeMo Relay deployment has missing traces,
-partial traces, incorrect scope parentage, exporter failures, duplicate events,
-or sensitive data in telemetry. It assumes that the application already has a
+Use this runbook when a NeMo Relay application has missing traces, partial
+traces, incorrect scope parentage, exporter failures, duplicate events, or
+sensitive data in telemetry. It assumes that the application already has a
 baseline scope and call instrumentation path.

 For first-time setup problems, start with the
@@ -16,13 +16,13 @@ refer to [Agent Runtime Primer](../getting-started/agent-runtime-primer.md),
 [Scopes](../about/concepts/scopes.md), [Events](../about/concepts/events.md),
 and [Subscribers](../about/concepts/subscribers.md).

-## Protect Production Data First
+## Protect Sensitive Data First

 Do not collect raw prompts, model responses, authorization headers, tokens,
 customer records, tool arguments, or provider payloads while triaging an
 incident. Capture the smallest sanitized event sample that proves the failure.

-Before exporting incident artifacts outside the production environment, verify
+Before exporting incident artifacts outside the current trust boundary, verify
 that sanitize guardrails or exporter filters remove sensitive fields. Sanitize
 guardrails change emitted telemetry payloads only; they do not change the live
 request or response passed to the tool, model provider, or application. Refer to
@@ -39,7 +39,7 @@ Use this table to choose the first check for the symptom you see.
 | No traces | Missing instrumentation boundary or inactive exporter | [Confirm Instrumentation Boundary](#confirm-instrumentation-boundary) |
 | Partial traces | Unwrapped calls, dropped streams, or late subscriber registration | [Confirm Managed Calls](#confirm-managed-calls) |
 | Wrong parent or child scope | Scope propagation or shared scope stack issue | [Confirm Active Scope](#confirm-active-scope) |
-| Export works locally but not in production | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
+| Events appear in process but export fails elsewhere | Exporter config, endpoint, environment, or flush path | [Confirm Exporter Setup](#confirm-exporter-setup) |
 | Duplicate events | Duplicate subscribers, duplicate wrappers, or mixed manual and managed lifecycle calls | [Check For Duplicate Event Sources](#check-for-duplicate-event-sources) |
 | Sensitive data appears in telemetry | Missing sanitize guardrails before subscribers or exporters | [Confirm Sanitization Before Export](#confirm-sanitization-before-export) |

@@ -67,9 +67,9 @@ Start with the code path that owns the real work.
 - If a plugin installs runtime behavior, verify that the plugin is activated
 before the request path starts.

-Do not debug an exporter first if no local subscriber sees events. Add or enable
-a local, sanitized subscriber at the same boundary and confirm that scope, tool,
-or LLM events exist before investigating production export.
+Do not debug an exporter first if no in-process subscriber sees events. Add or
+enable a sanitized in-process subscriber at the same boundary and confirm that
+scope, tool, or LLM events exist before investigating external export.

 ## Confirm Active Scope

@@ -127,12 +127,12 @@ lifecycle.

 ## Confirm Exporter Setup

-If local event inspection works but production export fails, isolate exporter
-transport and configuration from runtime instrumentation.
+If in-process event inspection works but export fails elsewhere, isolate
+exporter transport and configuration from runtime instrumentation.

 For file or trajectory export, confirm these settings:

-- Output paths are writable by the production process.
+- Output paths are writable by the running process.
 - The application shuts down or clears the exporter in a path that flushes
 partial output.
 - ATIF export is scoped to the intended agent root and does not mix concurrent
@@ -141,7 +141,7 @@ For file or trajectory export, confirm these settings:
 For OpenTelemetry or OpenInference export, confirm these settings:

 - The OpenTelemetry Protocol (OTLP) endpoint, headers, credentials, and network
-egress are available in the production environment.
+egress are available in the target environment.
 - The exporter is enabled in the active configuration file or plugin document.
 - The backend receives spans with `nemo_relay.uuid` and
 `nemo_relay.parent_uuid` attributes.
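
The last check in this hunk (spans arriving with `nemo_relay.uuid` and `nemo_relay.parent_uuid`) can be spot-checked offline against a small sanitized sample pulled from the backend. The following is a minimal Python sketch, not NeMo Relay API: it assumes each span is a plain mapping with an `attributes` dict, and the helper name `find_orphan_spans` is hypothetical.

```python
from typing import Iterable, Mapping

UUID_KEY = "nemo_relay.uuid"
PARENT_KEY = "nemo_relay.parent_uuid"


def find_orphan_spans(spans: Iterable[Mapping]) -> dict:
    """Report spans without a uuid attribute and parents absent from the sample.

    Assumed shape (hypothetical): each span is a mapping with an
    "attributes" dict, as a backend's JSON export might provide.
    """
    spans = list(spans)
    seen = set()
    missing_uuid = []
    for span in spans:
        uuid = span.get("attributes", {}).get(UUID_KEY)
        if uuid is None:
            missing_uuid.append(span.get("name", "<unnamed>"))
        else:
            seen.add(uuid)

    orphans = []
    for span in spans:
        attrs = span.get("attributes", {})
        parent = attrs.get(PARENT_KEY)
        # A root span has no parent; only flag parents that cannot be resolved.
        if parent is not None and parent not in seen:
            orphans.append((attrs.get(UUID_KEY), parent))
    return {"missing_uuid": missing_uuid, "orphans": orphans}
```

An unresolved parent in a partial sample may only mean that the parent span was not included, so widen the sample before treating it as a propagation failure.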
@@ -173,15 +173,15 @@ the downstream backend distinguish attempts.

 ## Confirm Sanitization Before Export

-Sensitive data in telemetry is a production incident. Use this order:
+Sensitive data in telemetry is an incident. Use this order:

 1. Stop or disable the affected exporter if sensitive data is leaving the
-production trust boundary.
+intended trust boundary.
 2. Keep the application path stable unless the live request itself is unsafe.
 3. Add or fix sanitize-request and sanitize-response guardrails before
-production subscribers and exporters receive events.
-4. Validate the sanitized event locally with ATOF JSONL or an in-process
-subscriber before re-enabling external export.
+subscribers and exporters receive events.
+4. Validate the sanitized event with ATOF JSONL or an in-process subscriber
+before re-enabling external export.
 5. Re-enable one exporter at a time and confirm the downstream backend no
 longer receives sensitive fields.
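
Step 4 above can be partly automated with a first-pass scan of a JSONL capture for field names that look sensitive. This is a hedged Python sketch: the deny-list, the `[REDACTED]` placeholder, and the function names are illustrative assumptions, and nothing here reflects the actual ATOF schema.

```python
import json
from typing import Iterable, Iterator

# Illustrative deny-list (assumption); extend it to match your payload schema.
SENSITIVE_KEYS = {"authorization", "api_key", "token", "password", "prompt"}
# Assumed placeholder written by the sanitize guardrail.
REDACTED = "[REDACTED]"


def sensitive_paths(value, path: str = "$") -> Iterator[str]:
    """Yield JSON paths whose key looks sensitive and whose value is not redacted."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}"
            if key.lower() in SENSITIVE_KEYS and child != REDACTED:
                yield child_path
            else:
                yield from sensitive_paths(child, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from sensitive_paths(child, f"{path}[{index}]")


def scan_jsonl(lines: Iterable[str]) -> list:
    """Return (line_number, json_path) pairs for every suspicious field."""
    findings = []
    for number, line in enumerate(lines, start=1):
        if line.strip():  # skip blank lines between records
            event = json.loads(line)
            findings.extend((number, p) for p in sensitive_paths(event))
    return findings
```

Passing an open file handle works because files iterate line by line, for example `scan_jsonl(open(path, encoding="utf-8"))`. A key-name scan does not catch secrets embedded in free-text values, so treat an empty result as necessary rather than sufficient.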

@@ -199,12 +199,12 @@ Collect this information before escalating an incident:
 - Exporter type, configuration source, and activation path.
 - Sanitized event sample that shows `uuid`, `parent_uuid`, `category`,
 `scope_category`, name, and redacted metadata.
-- Deployment shape, such as single process, worker pool, async tasks, sidecar,
-job queue, or container orchestration.
+- Runtime shape, such as single process, worker pool, async tasks, sidecar, job
+queue, or container orchestration.
 - Reproduction scope, including whether the failure occurs for one request, one
-tenant, one service, or all production traffic.
+tenant, one service, or all requests.
 - Recent changes to instrumentation, plugin configuration, exporter endpoints,
-deployment environment, or tracing backend configuration.
+runtime environment, or tracing backend configuration.

 Do not attach raw prompts, model responses, credentials, customer records,
 authorization headers, or unredacted tool arguments to escalation artifacts.
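
The sanitized event sample described in this checklist can be produced with an allow-list rather than by deleting known-bad fields, so anything unexpected is dropped by default. The following is a minimal Python sketch that assumes an event is a flat dict with a top-level `metadata` mapping; that shape, the `[REDACTED]` placeholder, and the helper name `escalation_sample` are assumptions for illustration.

```python
# Fields named in the checklist above; everything else is dropped.
KEEP_FIELDS = ("uuid", "parent_uuid", "category", "scope_category", "name")


def escalation_sample(event: dict) -> dict:
    """Copy only the identifying fields and blank out every metadata value.

    Assumes (hypothetically) that the event is a flat dict with an
    optional top-level "metadata" mapping.
    """
    sample = {field: event.get(field) for field in KEEP_FIELDS}
    metadata = event.get("metadata") or {}
    # Keep metadata keys so reviewers can see the shape, but never the values.
    sample["metadata"] = {key: "[REDACTED]" for key in metadata}
    return sample
```

If metadata key names can themselves carry sensitive data, such as customer identifiers used as keys, replace the key-preserving line with a count of keys instead.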

docs/troubleshooting/troubleshooting-guide.md

Lines changed: 3 additions & 3 deletions
@@ -7,9 +7,9 @@ SPDX-License-Identifier: Apache-2.0

 Use this page when a NeMo Relay setup, build, or runtime workflow does not behave as expected.

-For production incidents involving missing traces, wrong scope attachment,
-export failures, duplicate events, or sensitive telemetry, start with the
-[Production Incident Runbook](production-incident-runbook.md).
+For trace incidents involving missing traces, wrong scope attachment, export
+failures, duplicate events, or sensitive telemetry, start with the
+[Trace Incident Runbook](trace-incident-runbook.md).

 ## Package Or Build Setup Fails
