Runbook Spec

Purpose

A runbook is an executable troubleshooting contract. It tells the agent:

what problem pattern it covers
which checks must happen first
what evidence is required
how to branch
when to stop

Design Rules

1. Runbooks should be narrow enough to stay stable

Prefer a generic pattern plus team-specific specialization over one oversized runbook.

Recommended layering:

investigation pattern: request_not_effective
domain specialization: order_created_but_task_missing

2. Each step should have a clear purpose

Steps are not prompts. They are controlled investigation actions.

3. Conclusions require evidence

A runbook should declare which evidence types are sufficient for each conclusion.

4. Tool budgets should be explicit

Avoid unbounded retries or broad log scans.

5. Selector signals should live with the runbook

Runbook selection is part of the contract. If selection rules live somewhere else, the runbook and selector will drift.

6. Execution plans should live with the runbook

If operation order stays hardcoded in the runner, the repository still has two sources of truth. The runbook should own both selection metadata and execution metadata.

7. Decision rules should live with the runbook

If evidence interpretation stays hardcoded in the runner, the repository still has a hidden source of truth. The runbook should own the logic that maps evidence to conclusions.

8. Response wording should live with the runbook

If root-cause wording and next actions stay in code, the semantic layer is still split. The runbook should own not just which conclusion fires, but also how that conclusion is explained.

9. Confirmed-fact rendering should live with the runbook

The list of "confirmed facts" is not raw adapter output. It is a reporting choice about which evidence should be surfaced to humans. That choice belongs to the runbook, not to generic runner code.

In the current MVP, runbook metadata is stored as sidecar files:

runbooks/request_not_effective.selector.json
runbooks/request_not_effective.execution.json
runbooks/request_not_effective.decision.json
runbooks/cache_stale.selector.json
runbooks/cache_stale.execution.json
runbooks/cache_stale.decision.json
runbooks/state_abnormal.selector.json
runbooks/state_abnormal.execution.json
runbooks/state_abnormal.decision.json

This is a practical compromise for a zero-dependency demo. A later version can move the same metadata into the runbook definition once YAML parsing is formalized.

Suggested YAML Shape

name: request_not_effective
description: Investigate why a request did not produce the expected backend effect.

match:
  context_types:
    - trace_id
    - request_id
    - order_id
  symptoms:
    - request succeeded but expected side effect is missing

inputs:
  required:
    - context_id
    - context_type
    - symptom
    - expected

limits:
  max_steps: 8
  max_tool_calls: 12

steps:
  - id: locate_trace
    tool: trace.lookup
    required: false
    purpose: confirm whether the flow can be mapped to a trace
    params:
      trace_ref_column: trace_id
      trace_ref_table_by_context_type:
        request_id: orders
        order_id: orders

  - id: inspect_core_path
    tool: trace.inspect_spans
    required: false
    purpose: identify where the main flow stopped

  - id: verify_persistence
    tool: db.readonly_query
    required: true
    purpose: check whether expected state was persisted
    params:
      table_by_context_type:
        request_id: orders
        order_id: orders
      match_column_by_context_type:
        request_id: request_id
        order_id: order_id

  - id: inspect_cache
    tool: redis.inspect
    required: false
    purpose: detect stale cache or idempotency short-circuit
    params:
      key_template_by_context_type:
        request_id: "idempotency:{{context_id}}"
        order_id: "task:idempotent:{{context_id}}"

decision_rules:
  - id: trace_missing
    when:
      all:
        - finding_type: trace_missing
    conclusion: request_not_observable_via_trace
    confidence: low

  - id: persistence_missing
    when:
      all:
        - finding_type: db_row_missing
    conclusion: request_did_not_reach_persistence
    confidence: medium

  - id: stale_idempotency
    when:
      all:
        - finding_type: db_row_missing
        - finding_type: cache_key_exists
        - finding_type: cache_ttl_positive
    conclusion: request_short_circuited_by_cache_or_idempotency
    confidence: high

output:
  schema: incident_report
  require_evidence: true

Selector Metadata Shape

{
  "name": "request_not_effective",
  "priority": 2,
  "context_types": ["trace_id", "request_id", "order_id"],
  "positive_signals": [
    { "pattern": "not generated", "weight": 2.5 },
    { "pattern": "should be created", "weight": 1.5 },
    { "pattern": "(not generated|did not|missing)", "weight": 2, "mode": "regex" }
  ],
  "negative_signals": [
    { "pattern": "status is", "weight": -2 }
  ]
}

Execution Metadata Shape

{
  "name": "request_not_effective",
  "operations": [
    "trace.lookup",
    "trace.inspect_spans",
    "db.readonly_query",
    "redis.inspect"
  ]
}

Decision Metadata Shape

{
  "name": "request_not_effective",
  "confirmed_fact_templates": [
    { "finding_type": "trace_found", "text": "Trace data confirms the request or entity entered an observable service flow." }
  ],
  "rules": [
    {
      "id": "cache_short_circuit",
      "all": ["db_row_missing", "cache_key_exists", "cache_ttl_positive"],
      "conclusion": "request_short_circuited_by_cache_or_idempotency",
      "confidence": 0.88,
      "root_cause": "The request was most likely short-circuited by cache or idempotency state before the expected side effect executed.",
      "recommended_next_actions": [
        "Check which workflow wrote the idempotency or cache key.",
        "Verify whether retry or guard logic pre-created the key."
      ]
    }
  ],
  "fallback": {
    "conclusion": "investigation_inconclusive",
    "confidence": 0.4,
    "root_cause": "The available evidence is insufficient to isolate a single root cause."
  }
}

Required Semantics

Each runbook should define:

match
inputs
limits
steps
decision_rules
output
selector metadata with context types and positive or negative signals
execution metadata with ordered operations
decision metadata with ordered evidence rules and fallback behavior
response wording for root cause, alternative hypotheses, and next actions
confirmed-fact templates for report rendering

Optional Context-Specific Params

When one runbook supports multiple context_type values, prefer explicit parameter maps over hidden code assumptions.

Supported step param variants today:

table_by_context_type
match_column_by_context_type
key_template_by_context_type
trace_ref_table_by_context_type

These maps let one runbook stay honest about different lookup paths such as:

order_id -> orders.order_id -> order:view:{{context_id}}
task_id -> tasks.task_id -> task:view:{{context_id}}
request_id -> orders.request_id -> idempotency:{{context_id}}

Anti-Patterns

steps that instruct the model to "think harder"
conclusions without evidence types
hidden business assumptions in generic runbooks
selection logic hardcoded in code but absent from runbook metadata
execution order hardcoded in code but absent from runbook metadata
decision logic hardcoded in code but absent from runbook metadata
response wording hardcoded in code but absent from runbook metadata
confirmed-fact rendering hardcoded in code but absent from runbook metadata
write operations in the MVP

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Runbook Spec

Purpose

Design Rules

1. Runbooks should be narrow enough to stay stable

2. Each step should have a clear purpose

3. Conclusions require evidence

4. Tool budgets should be explicit

5. Selector signals should live with the runbook

6. Execution plans should live with the runbook

7. Decision rules should live with the runbook

8. Response wording should live with the runbook

9. Confirmed-fact rendering should live with the runbook

Suggested YAML Shape

Selector Metadata Shape

Execution Metadata Shape

Decision Metadata Shape

Required Semantics

Optional Context-Specific Params

Anti-Patterns

FilesExpand file tree

runbook-spec.md

Latest commit

History

runbook-spec.md

File metadata and controls

Runbook Spec

Purpose

Design Rules

1. Runbooks should be narrow enough to stay stable

2. Each step should have a clear purpose

3. Conclusions require evidence

4. Tool budgets should be explicit

5. Selector signals should live with the runbook

6. Execution plans should live with the runbook

7. Decision rules should live with the runbook

8. Response wording should live with the runbook

9. Confirmed-fact rendering should live with the runbook

Suggested YAML Shape

Selector Metadata Shape

Execution Metadata Shape

Decision Metadata Shape

Required Semantics

Optional Context-Specific Params

Anti-Patterns