
Commit 499cdc5

claude: add /escalation skill for systematic escalation diagnosis
This skill generates proof-style diagnostic frameworks for customer escalations. Given a set of symptoms (e.g. "throughput tanks periodically", "replicate queue failures from snapshot reservation timeouts"), the skill produces a structured document that allows an on-call engineer to systematically determine root cause.

Usage: /escalation <symptoms>

The skill produces the following output:

1. **The Question** — what the framework is trying to determine.
2. **How the System Works** — a pipeline or architecture diagram showing where in the request path the symptom can originate.
3. **Metrics Grouped by What They Prove** — each group answers a specific diagnostic question with metric tables, key ratios, and Datadog queries.
4. **Noise Sources to Rule Out** — metrics that look like the symptom but have a different root cause, with instructions on how to distinguish.
5. **Decision Procedure** — numbered, elimination-based steps that proceed from purest signal (pgwire layer) to most contaminated.
6. **Key Equations** — named formulas for derived metrics.
7. **Complete Metric List** — every metric referenced, with type (gauge/counter/histogram) and Datadog query syntax.

The skill includes a comprehensive reference of ~130 CockroachDB metrics organized into 16 categories (client-side, SQL, DistSender, transactions, contention, store, network, memory, CPU, Pebble, admission control, replication, leases, background workload, disk I/O, SQL memory pools). Each metric is annotated with its type to ensure correct Datadog query construction.

The core methodology is noise elimination through cross-layer triangulation: no single metric proves a root cause because each is contaminated by at least one system effect. The framework requires agreement across independent measurement points (pgwire vs gateway vs leaseholder) to constitute proof.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
1 parent de52b5c commit 499cdc5

1 file changed

Lines changed: 267 additions & 0 deletions

File tree

.claude/skills/escalation/SKILL.md

---
name: escalation
description: Generate a systematic, proof-style diagnostic framework for a customer escalation. Use when the user describes escalation symptoms and needs to determine root cause by triangulating metrics and eliminating noise.
---

# Escalation Diagnosis: Proof-Style Metric Framework Generator

Generate a comprehensive diagnostic framework given escalation symptoms. The output is a structured document that allows an on-call engineer to systematically determine root cause by triangulating metrics across independent measurement points and eliminating noise.

## When to Use

Activate when the user says:
- "I'm investigating an escalation with these symptoms: ..."
- "Can you build a diagnostic framework for ..."
- "What metrics should I look at to diagnose ..."
- "Customer is seeing X, help me figure out why"
- Any description of a production issue requiring metric-based root cause analysis

## Input

The user provides free-form escalation symptoms as the argument. Examples:
- `throughput tanks periodically`
- `replicate queue failures from snapshot reservation timeouts`
- `high p99 latency on SELECT queries`
- `OOM kills on node 3 every 6 hours`
- `cross-region replication lag increasing`

## Workflow

### Phase 1: Research (read the code, don't guess)

Before writing any framework, you MUST read the CockroachDB source code to find the exact metric names relevant to the symptoms. **Never hardcode or guess metric names — always verify by reading the source.**

#### Step 1a: Identify research areas and launch parallel Explore agents

From the symptom, identify 3-6 independent research areas that together cover:
- The metrics that directly measure the reported symptom
- What could cause this symptom (resource bottlenecks, contention, configuration, upstream systems)
- What this symptom causes downstream (secondary failures, latency, under-replication)
- Resource baselines (CPU, memory, disk, network) that could explain the symptom

Launch one Explore agent per area **in parallel** (in a single message). Each agent should:
1. Read the relevant source files to find **exact metric name strings** and their types (gauge/counter/histogram)
2. Find **cluster settings** with their default values
3. Understand the **code path** — how does the system work, what triggers errors, what are the retry/timeout semantics
4. Identify the **measurement point** for each metric (pgwire vs gateway vs leaseholder vs store)
5. **For every metric, explain what it means in the codebase context**: Look at the metric's `Help` description in its `metric.Metadata` definition, then follow how and where the metric is used in the code. The goal is to produce a description that explains the system behavior — when and why this number changes, what a non-zero or elevated value implies for the operator, and what it does NOT capture. For reference, see the metric descriptions in `pkg/kv/kvserver/allocator/mmaprototype/mma_metrics.go` — they explain the operational meaning, not just paraphrase the metric name. For example (see also the `metric.Metadata` sketch after this list):
   - BAD: "`range.snapshots.recv-failed` — snapshot receive failures"
   - GOOD: "`range.snapshots.recv-failed` — Number of incoming snapshot init messages that errored on the recipient store before data transfer begins. This includes rejections from `canAcceptSnapshotLocked()` (e.g., overlapping range descriptor, store draining, replica too old) and header validation failures. Does NOT count reservation queue timeouts or errors during data transmission/application — those are tracked by different error paths. A sustained non-zero rate indicates the receiver is consistently rejecting snapshots, which may point to descriptor conflicts or a draining node."

Agents should search beyond the index below — grep for metric name patterns, follow code paths, find settings files. The index is a starting point, not a boundary.
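
As orientation, metric definitions in the codebase generally have the following shape. This is a minimal, illustrative sketch — the variable name and `Help` text below are made up for the example; the real definitions live in files such as `pkg/kv/kvserver/metrics.go`:

```go
package kvserver // illustrative placement only

import "github.com/cockroachdb/cockroach/pkg/util/metric"

// Illustrative shape of a metric definition. The Help string is what Step 1a
// item 5 asks agents to read and then expand into an operational description
// by tracing how the metric is used in the surrounding code.
var metaRangeSnapshotsRecvFailed = metric.Metadata{
	Name:        "range.snapshots.recv-failed",
	Help:        "Number of failed snapshot receptions on this store", // starting point, not the final description
	Measurement: "Snapshots",
	Unit:        metric.Unit_COUNT,
}
```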
**Key source file index:**

| Area | File |
|------|------|
| Go runtime, RSS, CPU, disk, network | `pkg/server/status/runtime.go` |
| KV store metrics (Pebble, replication, snapshots, queues) | `pkg/kv/kvserver/metrics.go` |
| KV method enumeration (50 RPC methods) | `pkg/kv/kvpb/method.go` |
| DistSender per-method counters | `pkg/kv/kvclient/kvcoord/dist_sender.go` |
| Node per-method recv counters | `pkg/server/node.go` |
| RPC connection metrics (TCP RTT, heartbeats) | `pkg/rpc/metrics.go` |
| Clock offset, round-trip-latency | `pkg/rpc/clock_offset.go` |
| SQL memory accounting | `pkg/sql/mem_metrics.go` |
| Admission control | `pkg/util/admission/granter.go` |
| Snapshot settings & reservation logic | `pkg/kv/kvserver/snapshot_settings.go`, `store_snapshot.go` |
| Replicate queue metrics | `pkg/kv/kvserver/replicate_queue.go` |
| Raft entry cache | `pkg/kv/kvserver/raftentry/metrics.go` |
| Rangefeed memory | `pkg/kv/kvserver/rangefeed/metrics.go` |
| Lock table concurrency | `pkg/kv/kvserver/concurrency/metrics.go` |

#### Step 1b: Compile results and identify noise

From the agent results, collect ALL metrics and settings found. Then identify **noise** — metrics or signals in the cluster that could look like the symptom but have a different root cause. For each noise source, note which metric disambiguates it from the real symptom. Include these noise metrics in the framework so the engineer can rule them out.

**Rules:**
- Always verify metric name strings by reading the code. Never guess.
- Check both gateway-side (`distsender.rpc.{method}.sent`) and leaseholder-side (`rpc.method.{method}.recv`) variants when relevant.
- Record the type (gauge/counter/histogram) for every metric — this determines the Datadog query pattern (see the illustrative query shapes below).
- For every metric, read its `metric.Metadata` `Help` field and trace how it is used in the surrounding code. Even if you cannot find the exact `.Inc()` call site, always include the metric — use the `Help` description and the code context where the metric struct is referenced to explain what it means operationally.
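
As a rough illustration of how the type maps to a query shape — these are placeholders, not verified queries; `<ns>` and `<scope>` stand for whatever metric namespace and tag filter the cluster's Datadog ingestion actually uses, so confirm against a real dashboard:

```
counter   → query its rate, not the raw cumulative value:
            sum:<ns>.distsender.rpc.get.sent{<scope>} by {node}.as_rate()
gauge     → query the value directly:
            avg:<ns>.sql.conns{<scope>} by {node}
histogram → query an exported quantile series (e.g. a p99 timeseries) rather than the raw histogram
```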

### Phase 2: Build the Framework

Write the framework to `/tmp/<topic-slug>-framework.md` following this exact structure:

```markdown
# <Topic>: <Subtitle>

## The Question
Given <symptom description>:
1. **<Primary question>** (what is the root cause?)
2. **<Secondary question>** (what is noise vs signal?)

---

## How <Relevant System> Works: The Pipeline

This section is the backbone of the framework. It must give the reader a
comprehensive mental model of the system — not just a linear list of steps,
but enough architectural context to reason about failures independently.

### Pipeline overview

Start with a high-level ASCII diagram showing all the stages and how they
connect. Then describe each stage in its own subsection.

### Stage N: <Stage Name>

For EACH stage in the pipeline, write a dedicated subsection that covers:

1. **What this component does and why it exists** — a 2-4 sentence overview
   that explains the component's role in the system. Write this for an
   engineer who is familiar with CockroachDB at a high level but has never
   looked at this subsystem. Explain the design motivation (e.g., "The
   receiver apply queue exists because applying a snapshot is an expensive
   disk-bound operation — it rewrites the entire range's state. The
   concurrency limit prevents multiple concurrent applies from saturating
   disk I/O and causing write stalls across all ranges on the store.").

2. **How it works internally** — describe the key data structures, queues,
   state machines, or algorithms that govern this stage. For queues: what
   is the ordering, what are the capacity limits, what happens when the
   queue is full. For state machines: what are the states and transitions.
   For algorithms: what is the decision logic. Include the key types and
   functions by name so the reader can find them in the code.

3. **Metrics at this stage** — the exact metric names that observe this
   stage, annotated with what each one means operationally at this point
   in the pipeline.

4. **Concurrency limits, timeouts, and settings** — at each stage where
   there is a queue, rate limit, or timeout, note the cluster setting
   name, its default value, and what happens when it's exceeded.

5. **Error behavior** — what errors can this stage produce, how are they
   classified (transient vs permanent), and how do they propagate to the
   next stage. Include retry counts, backoff parameters, and error markers
   (e.g. `errMarkSnapshotError`) that control retry behavior.

6. **Competing consumers** — if multiple subsystems share the same resource
   at this stage (e.g., snapshots and regular writes both consume disk
   bandwidth), note the contention point and how to distinguish them in
   metrics.

Write each stage so that it stands alone — the reader should be able to
jump to "Stage 4: Receiver Reservation" and understand what it does
without reading the previous stages. Cross-reference other stages by name
when needed.

---

## The Metrics, Grouped by What They Prove

### Group 1: <What this group answers>
<Table: Metric | Type (gauge/counter/histogram) | What it measures | Codebase Context>

The **Codebase Context** column MUST explain each metric the way MMA metrics
(`mma_metrics.go`) explain theirs — in terms of the system behavior, not as a
code-level trace. Specifically:
- **What system event it represents**: The operational situation that causes the number to change (e.g., "an external replica change completed successfully", not "incremented in function X")
- **What a non-zero or elevated value implies for the operator**: What should the engineer conclude?
- **What it does NOT capture**: Important exclusions that prevent misinterpretation
- **Measurement point**: Whether this is measured at the sender, receiver, gateway, leaseholder, or store level

Example row:
| `range.snapshots.recv-failed` | counter | Snapshot init errors on receiver | Number of incoming snapshot init messages that errored on the recipient store before data transfer begins. This includes rejections due to overlapping range descriptors, store draining, or stale replicas. Does NOT count reservation queue timeouts or errors during data transmission/application — those are tracked separately. A sustained non-zero rate indicates the receiver is consistently rejecting snapshots, pointing to descriptor conflicts or a draining node. Measured at the receiver store. |

**Key ratios:**
<Derived metrics and what they infer>

### Group 2: ...
(continue for all relevant groups)

### Group N: Noise / Context (NOT the symptom, but needed to rule out)
<Metrics that look like the symptom but have different root causes>

---

## The Decision Procedure: Diagnosing <Symptom>

### Step 1: <First elimination>
<Table: Check | Metric | Signal | Inference>

### Step 2: ...
(continue for all steps)

---

## Key Equations
<Named equations with metric formulas>
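Example (illustrative only — pick ratios that fit the symptom, built from metrics already listed in the groups above):
- **KV ops per SQL statement** = `distsender.rpc.sent` / `sql.query.count` — rises when retries or heavier statements inflate KV traffic while SQL volume stays flat.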

---

## Quick Reference: Complete Metric List

### <Category>
- `metric.name` (type, unit) — Operational description: what system event this tracks, what a non-zero/elevated value implies for the operator, and what it does NOT capture (if non-obvious). Include measurement point (sender/receiver/gateway/store).
(for every metric mentioned in the framework)
```

### Phase 3: Output

1. Display the framework to the user
2. Ask if they want it pushed to a Google Doc
3. If yes, use the `roachdev gdoc` toolchain:
   - Create new doc: `roachdev gdoc api -X POST '/documents' --body '{"title": "..."}'`
   - Convert markdown: `roachdev gdoc md2json --file /tmp/<file>.md`
   - Push: pipe the md2json output to `roachdev gdoc api '/documents/<ID>:batchUpdate'`
   - Or add as a tab to an existing doc if specified

---

## The Proof Structure: Core Principles

Every framework MUST follow these principles:

### 1. Metrics are grouped by what they prove, not alphabetically
Each group answers a specific diagnostic question. The first group always addresses the most direct signal. Later groups address noise and competing explanations.

### 2. Noise elimination is explicit
For every symptom, there are metrics that LOOK like the symptom but have a different root cause. These must be listed and the framework must explain how to distinguish them. Examples:
- `sql.query.count` drops → could be workload drop (app stopped sending) OR closed-loop feedback (pool exhaustion). Disambiguate with `sql.bytesin`.
- `sys.gc.pause.percent` high → could be memory pressure OR CPU overload starving GC workers. Disambiguate with `sys.runnable.goroutines.per.cpu`.
- `queue.replicate.process.failure` high → could be snapshot timeout OR allocation failure. Disambiguate with `addreplica.error` vs `purgatory`.

### 3. Cross-layer triangulation is the proof
No single metric proves anything because each is contaminated by at least one system effect. Proof = agreement across independent measurement points:
- **Layer 1 (pgwire)**: `sql.conns`, `sql.bytesin` — app-controlled, purest signal
- **Layer 2 (SQL + DistSender)**: `sql.*.count`, `distsender.*` — gateway-side, mixed signal (retries inflate)
- **Layer 3 (Store)**: `rpc.method.*.recv`, `rebalancing.*` — leaseholder-side, execution signal
- **Corroboration only**: Latency metrics (`sql.exec.latency`, `sql.service.latency`) — never proof alone

### 4. Decision procedures are elimination-based
Each step either proves or eliminates a hypothesis. The procedure must be ordered from purest signal to most contaminated. Typical ordering:
1. Is the app even sending? (pgwire layer)
2. Did the workload change? (SQL ratios)
3. Did CRDB internals change? (KV layer ratios, AC, contention)
4. Is it a resource bottleneck? (CPU, memory, disk, network)
5. Is it a background job? (export, addsstable, gc)

### 5. Every metric has an operational explanation grounded in the codebase
For every metric in the framework, the reader must be able to answer: "What system event does this track, and what should I conclude from it?" Each metric entry must include:
- **What system event it represents** — the operational situation, not just a restatement of the metric name. Read the metric's `Help` field in its `metric.Metadata` definition and the surrounding code context.
- **What a non-zero or elevated value implies** — what should the engineer conclude or investigate next?
- **What it does NOT capture** — important exclusions that prevent misinterpretation (if non-obvious)
- **Measurement point** — sender vs receiver, gateway vs leaseholder, etc.

Follow the style of `mma_metrics.go` — descriptions that explain the operational meaning, when the event occurs, and what it implies for the operator. A metric without operational context is a label — it tells you what to search in Datadog but not what to conclude from it.

### 6. Every metric has a type annotation
Always indicate whether each metric is a gauge, counter, or histogram.

---

## Reference Frameworks

Existing frameworks built with this methodology (for reference and pattern matching):

| Framework | Topic | Key decision procedure |
|-----------|-------|----------------------|
| Workload Change Detection | Did the workload change? | 6-step: pgwire → query mix → query weight → key distribution → noise filter → AC vs workload |
| Network Latency Diagnosis | Is the network the problem? | 10-step: TCP RTT → gRPC RTT → client-server decomposition → noise elimination |
| RPC Deep Dive | What operations dominate? | 4-step: workload profile → before/during/after comparison → system-generated filter → SQL cross-validation |
| Memory Deep Dive | Where is memory going? | 6-step: is there pressure? → Go vs CGo → Go heap breakdown → Pebble breakdown → GOGC tuning → noise |
| Snapshot & Replication | Why is replicate queue failing? | 7-step: confirm snapshot cause → sender vs receiver → slow receiver → slow sender → saturation → network → periodicity |

Local framework files: `/tmp/*-framework.md`
