A runbook is an executable troubleshooting contract. It tells the agent:
- what problem pattern it covers
- which checks must happen first
- what evidence is required
- how to branch
- when to stop
Prefer a generic pattern plus team-specific specialization over one oversized runbook.
Recommended layering:
- investigation pattern: request_not_effective
- domain specialization: order_created_but_task_missing
Steps are not prompts. They are controlled investigation actions.
A runbook should declare which evidence types are sufficient for each conclusion.
Avoid unbounded retries or broad log scans.
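One way to keep a run bounded is to charge every step and tool call against the limits the runbook declares, and to stop as soon as either is exhausted. The sketch below is illustrative only: `Budget` and `BudgetExceeded` are hypothetical names, and only `max_steps` and `max_tool_calls` come from the example runbook in this document.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run would exceed a runbook limit (hypothetical helper)."""


@dataclass
class Budget:
    max_steps: int
    max_tool_calls: int
    steps_used: int = 0
    tool_calls_used: int = 0

    def charge_step(self) -> None:
        # Refuse to start a step once the step limit has been reached.
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded("max_steps reached")
        self.steps_used += 1

    def charge_tool_call(self) -> None:
        # Refuse to issue a tool call once the call limit has been reached.
        if self.tool_calls_used >= self.max_tool_calls:
            raise BudgetExceeded("max_tool_calls reached")
        self.tool_calls_used += 1


budget = Budget(max_steps=8, max_tool_calls=12)
budget.charge_step()
budget.charge_tool_call()
```

A runner built this way would treat `BudgetExceeded` as a normal stop condition and report what it found so far, rather than retrying.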
Runbook selection is part of the contract. If selection rules live somewhere else, the runbook and selector will drift.
If operation order stays hardcoded in the runner, the repository still has two sources of truth. The runbook should own both selection metadata and execution metadata.
If evidence interpretation stays hardcoded in the runner, the repository still has a hidden source of truth. The runbook should own the logic that maps evidence to conclusions.
If root-cause wording and next actions stay in code, the semantic layer is still split. The runbook should own not just which conclusion fires, but also how that conclusion is explained.
The list of "confirmed facts" is not raw adapter output. It is a reporting choice about which evidence should be surfaced to humans. That choice belongs to the runbook, not to generic runner code.
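To illustrate what it means for the runbook to own evidence interpretation, wording, and confirmed facts, here is a minimal sketch of a generic evaluator that contains no runbook-specific logic. It assumes decision metadata shaped like the decision sidecar in this document (`rules`, `fallback`, `confirmed_fact_templates`). The function name `evaluate` and the first-match-wins ordering are assumptions, not confirmed runner behavior, and the sample fact text is abbreviated.

```python
from typing import Any


def evaluate(decision: dict[str, Any], finding_types: set[str]) -> dict[str, Any]:
    """Return the first rule whose required finding types are all present.

    Assumption: rules are tried in file order and the first match wins.
    Falls back to decision["fallback"] when no rule matches.
    """
    # Confirmed facts are selected by the runbook's templates, not by the runner.
    facts = [
        template["text"]
        for template in decision.get("confirmed_fact_templates", [])
        if template["finding_type"] in finding_types
    ]
    for rule in decision.get("rules", []):
        if set(rule["all"]) <= finding_types:
            return {**rule, "confirmed_facts": facts}
    return {**decision["fallback"], "confirmed_facts": facts}


decision = {
    "confirmed_fact_templates": [
        {"finding_type": "trace_found", "text": "Trace data confirms an observable flow."}
    ],
    "rules": [
        {
            "id": "cache_short_circuit",
            "all": ["db_row_missing", "cache_key_exists", "cache_ttl_positive"],
            "conclusion": "request_short_circuited_by_cache_or_idempotency",
            "confidence": 0.88,
        }
    ],
    "fallback": {"conclusion": "investigation_inconclusive", "confidence": 0.4},
}

findings = {"trace_found", "db_row_missing", "cache_key_exists", "cache_ttl_positive"}
result = evaluate(decision, findings)
print(result["conclusion"], result["confirmed_facts"])
```

Because the rule dictionary is returned as-is, any wording fields the runbook attaches to a rule (such as `root_cause`) travel with the conclusion without the evaluator knowing about them.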
In the current MVP, runbook metadata is stored as sidecar files:
- runbooks/request_not_effective.selector.json
- runbooks/request_not_effective.execution.json
- runbooks/request_not_effective.decision.json
- runbooks/cache_stale.selector.json
- runbooks/cache_stale.execution.json
- runbooks/cache_stale.decision.json
- runbooks/state_abnormal.selector.json
- runbooks/state_abnormal.execution.json
- runbooks/state_abnormal.decision.json
This is a practical compromise for a zero-dependency demo. A later version can move the same metadata into the runbook definition once YAML parsing is formalized.
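A loader for this layout can stay within the standard library. The following is a sketch under stated assumptions: `load_sidecars` and `SIDECAR_KINDS` are hypothetical names, and the check that each file's `name` field matches its file name is an assumed convention, not something this document specifies.

```python
import json
from pathlib import Path

# The three sidecar kinds used by the file names in this document.
SIDECAR_KINDS = ("selector", "execution", "decision")


def load_sidecars(runbook_dir: Path, name: str) -> dict[str, dict]:
    """Load <runbook_dir>/<name>.<kind>.json for each sidecar kind.

    Raises FileNotFoundError if a sidecar is missing, and ValueError if a
    sidecar's "name" field disagrees with the file name (assumed convention).
    """
    loaded = {}
    for kind in SIDECAR_KINDS:
        path = runbook_dir / f"{name}.{kind}.json"
        data = json.loads(path.read_text(encoding="utf-8"))
        if data.get("name") != name:
            raise ValueError(f"{path} declares name {data.get('name')!r}, expected {name!r}")
        loaded[kind] = data
    return loaded
```

Usage would look like `load_sidecars(Path("runbooks"), "request_not_effective")`, which returns one dictionary per sidecar kind. Failing loudly on a missing or mislabeled file keeps the three files from drifting apart silently.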
Example runbook definition (YAML):

    name: request_not_effective
    description: Investigate why a request did not produce the expected backend effect.
    match:
      context_types:
        - trace_id
        - request_id
        - order_id
      symptoms:
        - request succeeded but expected side effect is missing
    inputs:
      required:
        - context_id
        - context_type
        - symptom
        - expected
    limits:
      max_steps: 8
      max_tool_calls: 12
    steps:
      - id: locate_trace
        tool: trace.lookup
        required: false
        purpose: confirm whether the flow can be mapped to a trace
        params:
          trace_ref_column: trace_id
          trace_ref_table_by_context_type:
            request_id: orders
            order_id: orders
      - id: inspect_core_path
        tool: trace.inspect_spans
        required: false
        purpose: identify where the main flow stopped
      - id: verify_persistence
        tool: db.readonly_query
        required: true
        purpose: check whether expected state was persisted
        params:
          table_by_context_type:
            request_id: orders
            order_id: orders
          match_column_by_context_type:
            request_id: request_id
            order_id: order_id
      - id: inspect_cache
        tool: redis.inspect
        required: false
        purpose: detect stale cache or idempotency short-circuit
        params:
          key_template_by_context_type:
            request_id: "idempotency:{{context_id}}"
            order_id: "task:idempotent:{{context_id}}"
    decision_rules:
      - id: trace_missing
        when:
          all:
            - finding_type: trace_missing
        conclusion: request_not_observable_via_trace
        confidence: low
      - id: persistence_missing
        when:
          all:
            - finding_type: db_row_missing
        conclusion: request_did_not_reach_persistence
        confidence: medium
      - id: stale_idempotency
        when:
          all:
            - finding_type: db_row_missing
            - finding_type: cache_key_exists
            - finding_type: cache_ttl_positive
        conclusion: request_short_circuited_by_cache_or_idempotency
        confidence: high
    output:
      schema: incident_report
      require_evidence: true

Selector sidecar:

    {
      "name": "request_not_effective",
      "priority": 2,
      "context_types": ["trace_id", "request_id", "order_id"],
      "positive_signals": [
        { "pattern": "not generated", "weight": 2.5 },
        { "pattern": "should be created", "weight": 1.5 },
        { "pattern": "(not generated|did not|missing)", "weight": 2, "mode": "regex" }
      ],
      "negative_signals": [
        { "pattern": "status is", "weight": -2 }
      ]
    }

Execution sidecar:

    {
      "name": "request_not_effective",
      "operations": [
        "trace.lookup",
        "trace.inspect_spans",
        "db.readonly_query",
        "redis.inspect"
      ]
    }

Decision sidecar:

    {
      "name": "request_not_effective",
      "confirmed_fact_templates": [
        { "finding_type": "trace_found", "text": "Trace data confirms the request or entity entered an observable service flow." }
      ],
      "rules": [
        {
          "id": "cache_short_circuit",
          "all": ["db_row_missing", "cache_key_exists", "cache_ttl_positive"],
          "conclusion": "request_short_circuited_by_cache_or_idempotency",
          "confidence": 0.88,
          "root_cause": "The request was most likely short-circuited by cache or idempotency state before the expected side effect executed.",
          "recommended_next_actions": [
            "Check which workflow wrote the idempotency or cache key.",
            "Verify whether retry or guard logic pre-created the key."
          ]
        }
      ],
      "fallback": {
        "conclusion": "investigation_inconclusive",
        "confidence": 0.4,
        "root_cause": "The available evidence is insufficient to isolate a single root cause."
      }
    }

Each runbook should define:
- match
- inputs
- limits
- steps
- decision_rules
- output
- selector metadata with context types and positive or negative signals
- execution metadata with ordered operations
- decision metadata with ordered evidence rules and fallback behavior
- response wording for root cause, alternative hypotheses, and next actions
- confirmed-fact templates for report rendering
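As an illustration of how selector metadata could drive selection without any selection logic living in code, the sketch below scores one selector against a symptom string. The matching semantics are assumptions: patterns are treated as case-insensitive substrings unless `mode` is `regex`, weights are summed, and `priority` is left to the caller for tie-breaking. `score_selector` is a hypothetical name.

```python
import re


def score_selector(selector: dict, context_type: str, symptom: str) -> float:
    """Sum the weights of all signals that match the symptom text.

    Returns 0.0 when the selector does not support the context type.
    Negative signals already carry negative weights in the metadata,
    so both lists are summed the same way.
    """
    if context_type not in selector["context_types"]:
        return 0.0
    text = symptom.lower()
    signals = selector.get("positive_signals", []) + selector.get("negative_signals", [])
    total = 0.0
    for signal in signals:
        if signal.get("mode") == "regex":
            hit = re.search(signal["pattern"], text, re.IGNORECASE) is not None
        else:
            hit = signal["pattern"].lower() in text
        if hit:
            total += signal["weight"]
    return total


selector = {
    "name": "request_not_effective",
    "context_types": ["trace_id", "request_id", "order_id"],
    "positive_signals": [
        {"pattern": "not generated", "weight": 2.5},
        {"pattern": "(not generated|did not|missing)", "weight": 2, "mode": "regex"},
    ],
    "negative_signals": [{"pattern": "status is", "weight": -2}],
}

print(score_selector(selector, "order_id", "Task was not generated after order creation"))
```

A selector would then pick the runbook with the highest score across all loaded selector files, which keeps the ranking inputs reviewable alongside the runbook itself.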
When one runbook supports multiple context_type values, prefer explicit parameter maps over hidden code assumptions.
Supported step param variants today:
- table_by_context_type
- match_column_by_context_type
- key_template_by_context_type
- trace_ref_table_by_context_type
These maps let one runbook stay honest about different lookup paths such as:
- order_id -> orders.order_id -> order:view:{{context_id}}
- task_id -> tasks.task_id -> task:view:{{context_id}}
- request_id -> orders.request_id -> idempotency:{{context_id}}
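A runner can resolve these maps generically, without knowing which tables or keys a team uses. The sketch below assumes a hypothetical `resolve_step_params` helper that collapses every `*_by_context_type` map to the entry for the current context type and substitutes `{{context_id}}` in string values. The identifier `req-123` is made up for the example.

```python
def resolve_step_params(params: dict, context_type: str, context_id: str) -> dict:
    """Collapse *_by_context_type maps to the entry for one context type.

    Keys without the suffix pass through unchanged. A map that lacks the
    context type raises KeyError instead of guessing a default.
    """
    suffix = "_by_context_type"
    resolved = {}
    for key, value in params.items():
        if key.endswith(suffix):
            chosen = value[context_type]
            if isinstance(chosen, str):
                chosen = chosen.replace("{{context_id}}", context_id)
            resolved[key[: -len(suffix)]] = chosen
        else:
            resolved[key] = value
    return resolved


params = {
    "table_by_context_type": {"request_id": "orders", "order_id": "orders"},
    "match_column_by_context_type": {"request_id": "request_id", "order_id": "order_id"},
    "key_template_by_context_type": {"request_id": "idempotency:{{context_id}}"},
}

print(resolve_step_params(params, "request_id", "req-123"))
```

Raising on a missing entry is a deliberate choice in this sketch: a silent default would reintroduce exactly the hidden code assumption the explicit maps are meant to remove.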

Anti-patterns to avoid:
- steps that instruct the model to "think harder"
- conclusions without evidence types
- hidden business assumptions in generic runbooks
- selection logic hardcoded in code but absent from runbook metadata
- execution order hardcoded in code but absent from runbook metadata
- decision logic hardcoded in code but absent from runbook metadata
- response wording hardcoded in code but absent from runbook metadata
- confirmed-fact rendering hardcoded in code but absent from runbook metadata
- write operations in the MVP
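
Some of these anti-patterns can be caught mechanically before a runbook is loaded. The following is a rough sketch, not an existing check in the repository: `lint_runbook` is a hypothetical name, the allowlist assumes that the four operations from the example runbook are read-only, and `db.write` is an invented operation used only to trigger a finding.

```python
# Assumption: these four operations, taken from the example runbook, are read-only.
READ_ONLY_OPERATIONS = {
    "trace.lookup",
    "trace.inspect_spans",
    "db.readonly_query",
    "redis.inspect",
}


def lint_runbook(execution: dict, decision: dict) -> list[str]:
    """Return human-readable problems; an empty list means none were found."""
    problems = []
    for op in execution.get("operations", []):
        if op not in READ_ONLY_OPERATIONS:
            problems.append(f"operation {op!r} is not on the read-only allowlist")
    for rule in decision.get("rules", []):
        rule_id = rule.get("id", "<unnamed>")
        if not rule.get("all"):
            problems.append(f"rule {rule_id!r} draws a conclusion without evidence types")
        if "root_cause" not in rule:
            problems.append(f"rule {rule_id!r} has no root_cause wording")
    if "fallback" not in decision:
        problems.append("decision metadata has no fallback")
    return problems


execution = {"name": "demo", "operations": ["trace.lookup", "db.write"]}  # "db.write" is hypothetical
decision = {"name": "demo", "rules": [{"id": "r1", "all": [], "conclusion": "x"}]}
for problem in lint_runbook(execution, decision):
    print(problem)
```

Checks like these cover only the structural anti-patterns. Items such as hidden business assumptions in generic runbooks still need human review.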