Skip to content

claude: add /escalation skill for systematic escalation diagnosis#167470

Open
wenyihu6 wants to merge 1 commit intocockroachdb:masterfrom
wenyihu6:wenyihu/escalation-skill
Open

claude: add /escalation skill for systematic escalation diagnosis#167470
wenyihu6 wants to merge 1 commit intocockroachdb:masterfrom
wenyihu6:wenyihu/escalation-skill

Conversation

@wenyihu6
Copy link
Copy Markdown
Contributor

@wenyihu6 wenyihu6 commented Apr 3, 2026

Summary

  • Add a new /escalation Claude Code skill that generates proof-style diagnostic frameworks for customer escalations
  • Given symptoms, the skill reads the CockroachDB codebase to find exact metric names, then produces a structured document with metrics grouped by what they prove, noise elimination, a numbered decision procedure, key equations, and Datadog queries

Usage

/escalation <symptoms>

Examples:

  • /escalation throughput tanks periodically
  • /escalation replicate queue failures from snapshot reservation timeouts
  • /escalation high p99 latency on SELECT queries

Output

The skill produces a markdown framework with:

  1. The Question — what we're trying to determine
  2. How the System Works — pipeline diagram with measurement points
  3. Metrics Grouped by What They Prove — tables with types, ratios, Datadog queries
  4. Noise Sources to Rule Out — metrics that mimic the symptom but have different causes
  5. Decision Procedure — elimination-based steps from purest signal to most contaminated
  6. Key Equations — derived metrics and formulas
  7. Complete Metric List — every metric with type and Datadog query syntax

Epic: none

@trunk-io
Copy link
Copy Markdown
Contributor

trunk-io bot commented Apr 3, 2026

Merging to master in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@cockroach-teamcity
Copy link
Copy Markdown
Member

This change is Reviewable

@wenyihu6 wenyihu6 requested a review from tbg April 3, 2026 11:35
@wenyihu6 wenyihu6 force-pushed the wenyihu/escalation-skill branch 2 times, most recently from d24ee9f to 499cdc5 Compare April 3, 2026 13:27
This skill generates proof-style diagnostic frameworks for customer
escalations. Given a set of symptoms (e.g. "throughput tanks periodically",
"replicate queue failures from snapshot reservation timeouts"), the skill
produces a structured document that allows an on-call engineer to
systematically determine root cause.

Usage:

    /escalation <symptoms>

The skill produces the following output:

1. **The Question** — what the framework is trying to determine.
2. **How the System Works** — a pipeline or architecture diagram showing
   where in the request path the symptom can originate.
3. **Metrics Grouped by What They Prove** — each group answers a specific
   diagnostic question with metric tables, key ratios, and Datadog
   queries.
4. **Noise Sources to Rule Out** — metrics that look like the symptom but
   have a different root cause, with instructions on how to distinguish.
5. **Decision Procedure** — numbered, elimination-based steps that
   proceed from purest signal (pgwire layer) to most contaminated.
6. **Key Equations** — named formulas for derived metrics.
7. **Complete Metric List** — every metric referenced, with type
   (gauge/counter/histogram) and Datadog query syntax.

The skill includes a comprehensive reference of ~130 CockroachDB metrics
organized into 16 categories (client-side, SQL, DistSender, transactions,
contention, store, network, memory, CPU, Pebble, admission control,
replication, leases, background workload, disk I/O, SQL memory pools).
Each metric is annotated with its type to ensure correct Datadog query
construction.

The core methodology is noise elimination through cross-layer
triangulation: no single metric proves a root cause because each is
contaminated by at least one system effect. The framework requires
agreement across independent measurement points (pgwire vs gateway vs
leaseholder) to constitute proof.

Epic: none
Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@wenyihu6 wenyihu6 force-pushed the wenyihu/escalation-skill branch from 499cdc5 to 59f5b48 Compare April 3, 2026 18:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants