---
name: escalation
description: Generate a systematic, proof-style diagnostic framework for a customer escalation. Use when the user describes escalation symptoms and needs to determine root cause by triangulating metrics and eliminating noise.
---

# Escalation Diagnosis: Proof-Style Metric Framework Generator

Generate a comprehensive diagnostic framework given escalation symptoms. The output is a structured document that allows an on-call engineer to systematically determine root cause by triangulating metrics across independent measurement points and eliminating noise.

## When to Use

Activate when the user says:
- "I'm investigating an escalation with these symptoms: ..."
- "Can you build a diagnostic framework for ..."
- "What metrics should I look at to diagnose ..."
- "Customer is seeing X, help me figure out why"
- Any description of a production issue requiring metric-based root cause analysis

## Input

The user provides free-form escalation symptoms as the argument. Examples:
- `throughput tanks periodically`
- `replicate queue failures from snapshot reservation timeouts`
- `high p99 latency on SELECT queries`
- `OOM kills on node 3 every 6 hours`
- `cross-region replication lag increasing`

## Workflow

### Phase 1: Research (read the code, don't guess)

Before writing any framework, you MUST read the CockroachDB source code to find the exact metric names relevant to the symptoms. **Never hardcode or guess metric names — always verify by reading the source.**

#### Step 1a: Identify research areas and launch parallel Explore agents

From the symptom, identify 3-6 independent research areas that together cover:
- The metrics that directly measure the reported symptom
- What could cause this symptom (resource bottlenecks, contention, configuration, upstream systems)
- What this symptom causes downstream (secondary failures, latency, under-replication)
- Resource baselines (CPU, memory, disk, network) that could explain the symptom

Launch one Explore agent per area **in parallel** (in a single message). Each agent should:
1. Read the relevant source files to find **exact metric name strings** and their types (gauge/counter/histogram)
2. Find **cluster settings** with their default values
3. Understand the **code path** — how the system works, what triggers errors, and what the retry/timeout semantics are
4. Identify the **measurement point** for each metric (pgwire vs gateway vs leaseholder vs store)
5. **For every metric, explain what it means in the codebase context**: Look at the metric's `Help` description in its `metric.Metadata` definition, then follow how and where the metric is used in the code. The goal is to produce a description that explains the system behavior — when and why this number changes, what a non-zero or elevated value implies for the operator, and what it does NOT capture. For reference, see the metric descriptions in `pkg/kv/kvserver/allocator/mmaprototype/mma_metrics.go` — they explain the operational meaning rather than paraphrasing the metric name. For example:
   - BAD: "`range.snapshots.recv-failed` — snapshot receive failures"
   - GOOD: "`range.snapshots.recv-failed` — Number of incoming snapshot init messages that errored on the recipient store before data transfer begins. This includes rejections from `canAcceptSnapshotLocked()` (e.g., overlapping range descriptor, store draining, replica too old) and header validation failures. Does NOT count reservation queue timeouts or errors during data transmission/application — those are tracked by different error paths. A sustained non-zero rate indicates the receiver is consistently rejecting snapshots, which may point to descriptor conflicts or a draining node."

Agents should search beyond the index below — grep for metric name patterns, follow code paths, find settings files. The index is a starting point, not a boundary.
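
The grep step can be sketched as follows. The one-file tree, metric name, and `Help` text below are fabricated stand-ins so the commands run anywhere; in a real checkout, point `grep` at the actual `pkg/` directory instead:

```shell
# Fake a minimal "checkout" so the sketch is self-contained (illustrative only).
mkdir -p /tmp/fake-crdb/pkg/kv/kvserver
cat > /tmp/fake-crdb/pkg/kv/kvserver/metrics.go <<'EOF'
var metaRangeSnapshotRecvFailed = metric.Metadata{
	Name: "range.snapshots.recv-failed",
	Help: "Incoming snapshot init messages that errored on the recipient store",
}
EOF

# Step 1: find the exact Metadata definition for a metric name fragment.
grep -rn 'Name: "range.snapshots' /tmp/fake-crdb/pkg/

# Step 2: read the adjacent Help field as the seed of the operational description.
grep -A1 'recv-failed' /tmp/fake-crdb/pkg/kv/kvserver/metrics.go
```

From there the agent follows the `meta...` variable to its `.Inc()` call sites to understand when the number actually changes.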

**Key source file index:**

| Area | File |
|------|------|
| Go runtime, RSS, CPU, disk, network | `pkg/server/status/runtime.go` |
| KV store metrics (Pebble, replication, snapshots, queues) | `pkg/kv/kvserver/metrics.go` |
| KV method enumeration (50 RPC methods) | `pkg/kv/kvpb/method.go` |
| DistSender per-method counters | `pkg/kv/kvclient/kvcoord/dist_sender.go` |
| Node per-method recv counters | `pkg/server/node.go` |
| RPC connection metrics (TCP RTT, heartbeats) | `pkg/rpc/metrics.go` |
| Clock offset, round-trip-latency | `pkg/rpc/clock_offset.go` |
| SQL memory accounting | `pkg/sql/mem_metrics.go` |
| Admission control | `pkg/util/admission/granter.go` |
| Snapshot settings & reservation logic | `pkg/kv/kvserver/snapshot_settings.go`, `store_snapshot.go` |
| Replicate queue metrics | `pkg/kv/kvserver/replicate_queue.go` |
| Raft entry cache | `pkg/kv/kvserver/raftentry/metrics.go` |
| Rangefeed memory | `pkg/kv/kvserver/rangefeed/metrics.go` |
| Lock table concurrency | `pkg/kv/kvserver/concurrency/metrics.go` |
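
Since file paths drift across releases, it can help to sanity-check the index against the checkout before dispatching agents. A minimal sketch, assuming the checkout root is in `$CRDB` (an assumed convention, not part of the workflow above):

```shell
# Report any index paths that have moved so agent prompts can be corrected
# before launch. Defaults to the current directory if $CRDB is unset.
CRDB="${CRDB:-.}"
for f in \
  pkg/server/status/runtime.go \
  pkg/kv/kvserver/metrics.go \
  pkg/kv/kvpb/method.go \
  pkg/rpc/metrics.go; do
  [ -f "$CRDB/$f" ] || echo "moved or missing: $f"
done
```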

#### Step 1b: Compile results and identify noise

From the agent results, collect ALL metrics and settings found. Then identify **noise** — metrics or signals in the cluster that could look like the symptom but have a different root cause. For each noise source, note which metric disambiguates it from the real symptom. Include these noise metrics in the framework so the engineer can rule them out.

**Rules:**
- Always verify metric name strings by reading the code. Never guess.
- Check both gateway-side (`distsender.rpc.{method}.sent`) and leaseholder-side (`rpc.method.{method}.recv`) variants when relevant.
- Record the type (gauge/counter/histogram) for every metric — this determines the Datadog query pattern.
- For every metric, read its `metric.Metadata` `Help` field and trace how it is used in the surrounding code. Even if you cannot find the exact `.Inc()` call site, always include the metric — use the `Help` description and the code context where the metric struct is referenced to explain what it means operationally.

### Phase 2: Build the Framework

Write the framework to `/tmp/<topic-slug>-framework.md` following this exact structure:

```markdown
# <Topic>: <Subtitle>

## The Question
Given <symptom description>:
1. **<Primary question>** (what is the root cause?)
2. **<Secondary question>** (what is noise vs signal?)

---

## How <Relevant System> Works: The Pipeline

This section is the backbone of the framework. It must give the reader a
comprehensive mental model of the system — not just a linear list of steps,
but enough architectural context to reason about failures independently.

### Pipeline overview

Start with a high-level ASCII diagram showing all the stages and how they
connect. Then describe each stage in its own subsection.

### Stage N: <Stage Name>

For EACH stage in the pipeline, write a dedicated subsection that covers:

1. **What this component does and why it exists** — a 2-4 sentence overview
   that explains the component's role in the system. Write this for an
   engineer who is familiar with CockroachDB at a high level but has never
   looked at this subsystem. Explain the design motivation (e.g., "The
   receiver apply queue exists because applying a snapshot is an expensive
   disk-bound operation — it rewrites the entire range's state. The
   concurrency limit prevents multiple concurrent applies from saturating
   disk I/O and causing write stalls across all ranges on the store.").

2. **How it works internally** — describe the key data structures, queues,
   state machines, or algorithms that govern this stage. For queues: what
   is the ordering, what are the capacity limits, what happens when the
   queue is full. For state machines: what are the states and transitions.
   For algorithms: what is the decision logic. Include the key types and
   functions by name so the reader can find them in the code.

3. **Metrics at this stage** — the exact metric names that observe this
   stage, annotated with what each one means operationally at this point
   in the pipeline.

4. **Concurrency limits, timeouts, and settings** — at each stage where
   there is a queue, rate limit, or timeout, note the cluster setting
   name, its default value, and what happens when it's exceeded.

5. **Error behavior** — what errors can this stage produce, how are they
   classified (transient vs permanent), and how do they propagate to the
   next stage. Include retry counts, backoff parameters, and error markers
   (e.g. `errMarkSnapshotError`) that control retry behavior.

6. **Competing consumers** — if multiple subsystems share the same resource
   at this stage (e.g., snapshots and regular writes both consume disk
   bandwidth), note the contention point and how to distinguish them in
   metrics.

Write each stage so that it stands alone — the reader should be able to
jump to "Stage 4: Receiver Reservation" and understand what it does
without reading the previous stages. Cross-reference other stages by name
when needed.

---

## The Metrics, Grouped by What They Prove

### Group 1: <What this group answers>
<Table: Metric | Type (gauge/counter/histogram) | What it measures | Codebase Context>

The **Codebase Context** column MUST explain each metric the way MMA metrics
(`mma_metrics.go`) explain theirs — in terms of the system behavior, not as a
code-level trace. Specifically:
- **What system event it represents**: The operational situation that causes the number to change (e.g., "an external replica change completed successfully", not "incremented in function X")
- **What a non-zero or elevated value implies for the operator**: What should the engineer conclude?
- **What it does NOT capture**: Important exclusions that prevent misinterpretation
- **Measurement point**: Whether this is measured at the sender, receiver, gateway, leaseholder, or store level

Example row:
| `range.snapshots.recv-failed` | counter | Snapshot init errors on receiver | Number of incoming snapshot init messages that errored on the recipient store before data transfer begins. This includes rejections due to overlapping range descriptors, store draining, or stale replicas. Does NOT count reservation queue timeouts or errors during data transmission/application — those are tracked separately. A sustained non-zero rate indicates the receiver is consistently rejecting snapshots, pointing to descriptor conflicts or a draining node. Measured at the receiver store. |

**Key ratios:**
<Derived metrics and what they imply>

### Group 2: ...
(continue for all relevant groups)

### Group N: Noise / Context (NOT the symptom, but needed to rule out)
<Metrics that look like the symptom but have different root causes>

---

## The Decision Procedure: Diagnosing <Symptom>

### Step 1: <First elimination>
<Table: Check | Metric | Signal | Inference>

### Step 2: ...
(continue for all steps)

---

## Key Equations
<Named equations with metric formulas>

---

## Quick Reference: Complete Metric List

### <Category>
- `metric.name` (type, unit) — Operational description: what system event this tracks, what a non-zero/elevated value implies for the operator, and what it does NOT capture (if non-obvious). Include measurement point (sender/receiver/gateway/store).
(for every metric mentioned in the framework)
```
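
The `<topic-slug>` in the output path can be derived mechanically from the symptom text; a minimal sketch (the exact slugification rule is a suggestion, not prescribed above):

```shell
# Derive <topic-slug> for the /tmp/<topic-slug>-framework.md path:
# lowercase, collapse runs of non-alphanumerics to single hyphens, trim edges.
symptom="Replicate queue failures from snapshot reservation timeouts"
slug=$(printf '%s' "$symptom" \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs 'a-z0-9' '-' \
  | sed 's/^-*//; s/-*$//')
echo "/tmp/${slug}-framework.md"
# → /tmp/replicate-queue-failures-from-snapshot-reservation-timeouts-framework.md
```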

### Phase 3: Output

1. Display the framework to the user
2. Ask if they want it pushed to a Google Doc
3. If yes, use the `roachdev gdoc` toolchain:
   - Create new doc: `roachdev gdoc api -X POST '/documents' --body '{"title": "..."}'`
   - Convert markdown: `roachdev gdoc md2json --file /tmp/<file>.md`
   - Push: pipe md2json output to `roachdev gdoc api '/documents/<ID>:batchUpdate'`
   - Or add as a tab to an existing doc if specified
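
Wired together, the push steps look roughly like this. `roachdev` is an internal CLI, so a stub shell function stands in for it here to make the sketch runnable; the flags are taken verbatim from the steps above, but the `documentId` response field is an assumption about the API response shape — verify it against the real tool:

```shell
# Stub standing in for the real internal roachdev binary (illustrative only).
roachdev() {
  if [ "$2" = "api" ]; then
    echo '{"documentId":"abc123"}'   # canned create/batchUpdate response
  else
    echo '{"requests":[]}'          # canned md2json output
  fi
}

# 1. Create the doc and capture its ID (response shape is an assumption).
DOC_ID=$(roachdev gdoc api -X POST '/documents' --body '{"title": "Snapshot Framework"}' \
  | sed -n 's/.*"documentId":"\([^"]*\)".*/\1/p')

# 2. Convert the markdown framework and push it as a batch update.
roachdev gdoc md2json --file /tmp/snapshot-framework.md \
  | roachdev gdoc api "/documents/${DOC_ID}:batchUpdate"

echo "pushed to /documents/${DOC_ID}"
```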

---

## The Proof Structure: Core Principles

Every framework MUST follow these principles:

### 1. Metrics are grouped by what they prove, not alphabetically
Each group answers a specific diagnostic question. The first group always addresses the most direct signal. Later groups address noise and competing explanations.

### 2. Noise elimination is explicit
For every symptom, there are metrics that LOOK like the symptom but have a different root cause. These must be listed, and the framework must explain how to distinguish them. Examples:
- `sql.query.count` drops → could be a workload drop (app stopped sending) OR closed-loop feedback (pool exhaustion). Disambiguate with `sql.bytesin`.
- `sys.gc.pause.percent` high → could be memory pressure OR CPU overload starving GC workers. Disambiguate with `sys.runnable.goroutines.per.cpu`.
- `queue.replicate.process.failure` high → could be a snapshot timeout OR an allocation failure. Disambiguate with `addreplica.error` vs `purgatory`.

### 3. Cross-layer triangulation is the proof
No single metric proves anything, because each is contaminated by at least one system effect. Proof = agreement across independent measurement points:
- **Layer 1 (pgwire)**: `sql.conns`, `sql.bytesin` — app-controlled, purest signal
- **Layer 2 (SQL + DistSender)**: `sql.*.count`, `distsender.*` — gateway-side, mixed signal (retries inflate)
- **Layer 3 (Store)**: `rpc.method.*.recv`, `rebalancing.*` — leaseholder-side, execution signal
- **Corroboration only**: Latency metrics (`sql.exec.latency`, `sql.service.latency`) — never proof alone

### 4. Decision procedures are elimination-based
Each step either proves or eliminates a hypothesis. The procedure must be ordered from purest signal to most contaminated. Typical ordering:
1. Is the app even sending? (pgwire layer)
2. Did the workload change? (SQL ratios)
3. Did CRDB internals change? (KV layer ratios, AC, contention)
4. Is it a resource bottleneck? (CPU, memory, disk, network)
5. Is it a background job? (export, addsstable, gc)

### 5. Every metric has an operational explanation grounded in the codebase
For every metric in the framework, the reader must be able to answer: "What system event does this track, and what should I conclude from it?" Each metric entry must include:
- **What system event it represents** — the operational situation, not just a restatement of the metric name. Read the metric's `Help` field in its `metric.Metadata` definition and the surrounding code context.
- **What a non-zero or elevated value implies** — what should the engineer conclude or investigate next?
- **What it does NOT capture** — important exclusions that prevent misinterpretation (if non-obvious)
- **Measurement point** — sender vs receiver, gateway vs leaseholder, etc.

Follow the style of `mma_metrics.go` — descriptions that explain the operational meaning, when the event occurs, and what it implies for the operator. A metric without operational context is a label — it tells you what to search in Datadog but not what to conclude from it.

### 6. Every metric has a type annotation
Always indicate whether each metric is a gauge, counter, or histogram.

---

## Reference Frameworks

Existing frameworks built with this methodology (for reference and pattern matching):

| Framework | Topic | Key decision procedure |
|-----------|-------|------------------------|
| Workload Change Detection | Did the workload change? | 6-step: pgwire → query mix → query weight → key distribution → noise filter → AC vs workload |
| Network Latency Diagnosis | Is the network the problem? | 10-step: TCP RTT → gRPC RTT → client-server decomposition → noise elimination |
| RPC Deep Dive | What operations dominate? | 4-step: workload profile → before/during/after comparison → system-generated filter → SQL cross-validation |
| Memory Deep Dive | Where is memory going? | 6-step: is there pressure? → Go vs CGo → Go heap breakdown → Pebble breakdown → GOGC tuning → noise |
| Snapshot & Replication | Why is replicate queue failing? | 7-step: confirm snapshot cause → sender vs receiver → slow receiver → slow sender → saturation → network → periodicity |

Local framework files: `/tmp/*-framework.md`