
Commit 8e33741

Morlej and claude committed
feat(dsql): extract query plan workflow, add rewrite evals
- Extract Workflow 8 (query plan explainability) from SKILL.md into references/query-plan/workflow.md to stay under the 300 LOC limit
- Wire query-rewrites-generic.md and query-rewrites-dsql-specific.md into the workflow (Phase 0 load list + Phase 2 evidence gathering)
- Add behavioral evals (query_plan_rewrite_evals.json) covering type coercion detection, subquery unnesting, OR-to-IN, GROUP BY pushdown, large join splitting, and reltuples estimation
- Add eval results (query_plan_rewrite_eval_results.md) with with-skill vs baseline comparison

Validation:
- validate-size.py: 275 lines (good)
- validate-references.py: 0 broken links, 0 new orphans

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent c5457a4 commit 8e33741

4 files changed

Lines changed: 343 additions & 62 deletions


plugins/databases-on-aws/skills/dsql/SKILL.md

Lines changed: 3 additions & 62 deletions
@@ -108,8 +108,8 @@ sampled in [mcp/.mcp.json](mcp/.mcp.json)
108108

109109
### Query Plan Explainability (modular):
110110

111-
**When:** MUST load all four at Workflow 8 Phase 0 — [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md), [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md), [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md), [query-plan/report-format.md](references/query-plan/report-format.md)
112-
**Contains:** DSQL node types + Node Duration math + estimation-error bands, pg_class/pg_stats/pg_indexes SQL + correlated-predicate verification, GUC experiment procedures + 30-second skip protocol, required report structure + element checklist + support request template
111+
**When:** MUST load [query-plan/workflow.md](references/query-plan/workflow.md) at Workflow 8 entry — it gates the remaining files
112+
**Contains:** Trigger criteria, context disambiguation, routing, phased workflow, and references to: [plan-interpretation.md](references/query-plan/plan-interpretation.md), [catalog-queries.md](references/query-plan/catalog-queries.md), [guc-experiments.md](references/query-plan/guc-experiments.md), [report-format.md](references/query-plan/report-format.md), [query-rewrites-generic.md](references/query-plan/query-rewrites-generic.md), [query-rewrites-dsql-specific.md](references/query-plan/query-rewrites-dsql-specific.md)
113113

114114
---
115115

@@ -254,66 +254,7 @@ MUST load [mysql-migrations/type-mapping.md](references/mysql-migrations/type-ma
254254

255255
### Workflow 8: Query Plan Explainability
256256

257-
Explains why the DSQL optimizer chose a particular plan. **REQUIRES a structured Markdown diagnostic report as the deliverable** — run the workflow end-to-end before answering. Use the `aurora-dsql` MCP when connected; fall back to raw `psql` with a generated IAM token (see the fallback block below) otherwise.
258-
259-
#### Trigger Criteria
260-
261-
Enter this workflow if **ANY** of these signals are present:
262-
263-
| Signal | Examples |
264-
|--------|----------|
265-
| User provides SQL + mentions performance/speed/cost | "this query takes 8 seconds", "too slow", "optimize this", "make this faster" |
266-
| User mentions DPU cost or resource consumption | "high DPU", "query cost is too high", "read DPU seems excessive" |
267-
| User asks about a plan choice or scan type | "why is it doing a full scan?", "why not use the index?" |
268-
| User pastes EXPLAIN / EXPLAIN ANALYZE output | Raw plan text in the message |
269-
| User references a Query ID and asks about performance | "query abc-123 is slow" |
270-
| User says "reassess" / "re-run" / "I added the index" | Phase 5 re-entry for an existing report |
271-
272-
#### Context Disambiguation
273-
274-
Before entering the workflow, confirm the query targets DSQL:
275-
276-
| Condition | Action |
277-
|-----------|--------|
278-
| Only `aurora-dsql` MCP is connected (no other database MCPs) | Proceed — DSQL is the only target |
279-
| User explicitly mentions DSQL, Aurora DSQL, or a known DSQL cluster | Proceed |
280-
| Conversation already has prior DSQL interaction (earlier queries, schema ops) | Proceed |
281-
| Multiple database MCPs are connected and no DSQL signal in the message | Ask the user which database they mean before proceeding |
282-
| No database MCP is connected | Inform the user that the `aurora-dsql` MCP is required and offer the psql fallback |
283-
284-
#### Routing (sub-path selection)
285-
286-
| Condition | Path |
287-
|-----------|------|
288-
| User provides SQL but no plan output | Full workflow: Phase 0 → 1 → 2 → 3 → 4 |
289-
| User pastes plan output + asks to fix/optimize | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4 |
290-
| User pastes plan output + asks what it means (educational) | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4. The report is the explanation — do not produce a shorter conversational answer instead |
291-
| Execution time >30s detected at Phase 1 | Phase 3 skips experiments per guc-experiments.md |
292-
| User says "reassess" or equivalent | Re-run Phase 1–2, append Addendum to existing report |
293-
294-
**Phase 0 — Load reference material.** Read all four before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements):
295-
296-
1. [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md) — node types, duration math, anomalous values
297-
2. [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL
298-
3. [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md) — GUC procedures and `>30s` skip protocol
299-
4. [query-plan/report-format.md](references/query-plan/report-format.md) — required report structure
300-
301-
**Phase 1 — Capture the plan.** **ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (SELECT form) — **ALWAYS** capture a fresh plan from the cluster, even when the user describes the plan or reports an anomaly. **MAY** leverage `get_schema` or `information_schema` for schema sanity checks. When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause. Extract Query ID, Planning Time, Execution Time, DPU Estimate. **SELECT** runs as-is. **UPDATE/DELETE** rewrite to the equivalent SELECT (same join chain + WHERE) — the optimizer picks the same plan shape. **INSERT**, pl/pgsql, DO blocks, and functions **MUST** be rejected. **MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety.
302-
303-
**Phase 2 — Gather evidence.** Using SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, `pg_indexes`, `COUNT(*)`, `COUNT(DISTINCT)`. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe). Detect correlated predicates and data skew. When a Full Scan appears despite an apparently usable index, check for type coercion index bypass: retrieve indexed column types and compare against predicate literal types using the implicit cast compatibility matrix in `plan-interpretation.md`.
304-
305-
**Phase 3 — Experiment (conditional).** ≤30s: run GUC experiments per `guc-experiments.md` (default + merge-join-only) plus optional redundant-predicate test. >30s: skip experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run for redundant-predicate testing. Anomalous values (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`.
306-
307-
**Phase 4 — Produce the report, invite reassessment.** Produce the full diagnostic report per the "Required Elements Checklist" in [query-plan/report-format.md](references/query-plan/report-format.md) — structure is non-negotiable. End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation. When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report.
308-
309-
**psql fallback (MCP unavailable).** Pipe statements into `psql` via heredoc and check `$?`; report failures without proceeding on partial evidence:
310-
311-
```bash
312-
TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION")
313-
PGPASSWORD="$TOKEN" psql "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;"
314-
```
315-
316-
**Safety.** Plan capture uses `readonly_query` exclusively — it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than asking `transact --allow-writes` to run it; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or pl/pgsql.
257+
Explains why the DSQL optimizer chose a particular plan. **REQUIRES a structured Markdown diagnostic report as the deliverable.** MUST load [query-plan/workflow.md](references/query-plan/workflow.md) for trigger criteria, context disambiguation, routing, and the full phased workflow (Phase 0–4).
317258

318259
---
319260

plugins/databases-on-aws/skills/dsql/references/query-plan/workflow.md

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
# Query Plan Explainability — Workflow
2+
3+
Complete workflow for diagnosing DSQL query plan performance issues. Produces a structured Markdown diagnostic report as the deliverable.
4+
5+
## Table of Contents
6+
7+
1. [Trigger Criteria](#trigger-criteria)
8+
2. [Context Disambiguation](#context-disambiguation)
9+
3. [Routing](#routing)
10+
4. [Phase 0 — Load Reference Material](#phase-0--load-reference-material)
11+
5. [Phase 1 — Capture the Plan](#phase-1--capture-the-plan)
12+
6. [Phase 2 — Gather Evidence](#phase-2--gather-evidence)
13+
7. [Phase 3 — Experiment](#phase-3--experiment)
14+
8. [Phase 4 — Produce the Report](#phase-4--produce-the-report)
15+
9. [psql Fallback](#psql-fallback)
16+
10. [Safety](#safety)
17+
18+
---
19+
20+
## Trigger Criteria
21+
22+
Enter this workflow if **ANY** of these signals are present:
23+
24+
| Signal | Examples |
25+
|--------|----------|
26+
| User provides SQL + mentions performance/speed/cost | "this query takes 8 seconds", "too slow", "optimize this", "make this faster" |
27+
| User mentions DPU cost or resource consumption | "high DPU", "query cost is too high", "read DPU seems excessive" |
28+
| User asks about a plan choice or scan type | "why is it doing a full scan?", "why not use the index?" |
29+
| User pastes EXPLAIN / EXPLAIN ANALYZE output | Raw plan text in the message |
30+
| User references a Query ID and asks about performance | "query abc-123 is slow" |
31+
| User says "reassess" / "re-run" / "I added the index" | Re-entry for an existing report (re-run Phase 1–2, append Addendum) |
32+
33+
---
34+
35+
## Context Disambiguation
36+
37+
Before entering the workflow, confirm the query targets DSQL:
38+
39+
| Condition | Action |
40+
|-----------|--------|
41+
| Only `aurora-dsql` MCP is connected (no other database MCPs) | Proceed — DSQL is the only target |
42+
| User explicitly mentions DSQL, Aurora DSQL, or a known DSQL cluster | Proceed |
43+
| Conversation already has prior DSQL interaction (earlier queries, schema ops) | Proceed |
44+
| Multiple database MCPs are connected and no DSQL signal in the message | Ask the user which database they mean before proceeding |
45+
| No database MCP is connected | Inform the user that the `aurora-dsql` MCP is required and offer the psql fallback |
46+
47+
---
48+
49+
## Routing
50+
51+
| Condition | Path |
52+
|-----------|------|
53+
| User provides SQL but no plan output | Full workflow: Phase 0 → 1 → 2 → 3 → 4 |
54+
| User pastes plan output + asks to fix/optimize | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4 |
55+
| User pastes plan output + asks what it means (educational) | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4. The report is the explanation — do not produce a shorter conversational answer instead |
56+
| Execution time >30s detected at Phase 1 | Phase 3 skips experiments per guc-experiments.md |
57+
| User says "reassess" or equivalent | Re-run Phase 1–2, append Addendum to existing report |
58+
59+
---
60+
61+
## Phase 0 — Load Reference Material
62+
63+
Read all files before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements):
64+
65+
1. [plan-interpretation.md](plan-interpretation.md) — node types, duration math, anomalous values
66+
2. [catalog-queries.md](catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL
67+
3. [guc-experiments.md](guc-experiments.md) — GUC procedures and `>30s` skip protocol
68+
4. [report-format.md](report-format.md) — required report structure
69+
5. [query-rewrites-generic.md](query-rewrites-generic.md) — generic SQL rewrite patterns
70+
6. [query-rewrites-dsql-specific.md](query-rewrites-dsql-specific.md) — DSQL-specific rewrites
71+
72+
---
73+
74+
## Phase 1 — Capture the Plan
75+
76+
**ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (SELECT form) — **ALWAYS** capture a fresh plan from the cluster, even when the user describes the plan or reports an anomaly. **MAY** leverage `get_schema` or `information_schema` for schema sanity checks.
77+
78+
When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause.
79+
80+
Extract: Query ID, Planning Time, Execution Time, DPU Estimate.
81+
82+
| Statement type | Action |
83+
|---------------|--------|
84+
| SELECT | Run as-is |
85+
| UPDATE / DELETE | Rewrite to equivalent SELECT (same join chain + WHERE) — optimizer picks the same plan shape |
86+
| INSERT, pl/pgsql, DO blocks, functions | **MUST** reject |
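The statement-type table above can be sketched as a small dispatch helper. This is only an illustration: `plan_capture_action` is a hypothetical name, not part of the skill or the MCP, and keying on the first keyword is a simplifying assumption (a `WITH ... SELECT` query, for example, would need extra handling).

```bash
#!/usr/bin/env bash
# Sketch: map a statement's leading keyword to the Phase 1 action.
# plan_capture_action is a hypothetical helper, not an MCP or skill API.
plan_capture_action() {
  local sql="$1" keyword
  # First word of the first non-empty line, upper-cased.
  keyword=$(printf '%s\n' "$sql" | awk 'NF { print toupper($1); exit }')
  case "$keyword" in
    SELECT)        echo "run-as-is" ;;
    UPDATE|DELETE) echo "rewrite-to-select" ;;
    # Assumption: everything else (INSERT, DO, CREATE FUNCTION, ...) is rejected.
    *)             echo "reject" ;;
  esac
}
```

For the rewrite case, a statement such as `UPDATE orders SET status = 'x' WHERE customer_id = 42` (hypothetical table and columns) would be captured as `SELECT * FROM orders WHERE customer_id = 42`, keeping the same join chain and WHERE clause.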
87+
88+
**MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety.
89+
90+
---
91+
92+
## Phase 2 — Gather Evidence
93+
94+
Using SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, `pg_indexes`, `COUNT(*)`, `COUNT(DISTINCT)`.
95+
96+
1. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe).
97+
2. Detect correlated predicates and data skew.
98+
3. When a Full Scan appears despite an apparently usable index, check for **type coercion index bypass**: retrieve indexed column types and compare against predicate literal types using the implicit cast compatibility matrix in `plan-interpretation.md`.
99+
4. Check whether any query rewrite from `query-rewrites-generic.md` or `query-rewrites-dsql-specific.md` applies to the query structure (e.g., OR-to-IN, subquery unnesting, NOT IN to NOT EXISTS, split large joins).
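The estimation-error bands in step 1 can be illustrated with a short helper. This is a sketch under stated assumptions: `classify_estimation_error` is a hypothetical name, the ratio is taken symmetrically (the larger of estimated/actual and actual/estimated), exact boundary values (5x, 50x) are assigned to the higher band, and the `within-range` label for ratios under 2x is invented for illustration. `plan-interpretation.md` remains the authority on the real definitions.

```bash
#!/usr/bin/env bash
# Sketch: classify a row-count estimation error into the Phase 2 bands.
# classify_estimation_error is a hypothetical helper name.
classify_estimation_error() {
  local estimated="$1" actual="$2"
  awk -v e="$estimated" -v a="$actual" 'BEGIN {
    # A ratio is not meaningful when either side is zero or negative.
    if (e <= 0 || a <= 0) { print "undefined"; exit }
    # Assumption: over- and under-estimates are treated symmetrically.
    r = (e > a) ? e / a : a / e
    if      (r >= 50) print "severe"
    else if (r >= 5)  print "significant"
    else if (r >= 2)  print "minor"
    else              print "within-range"
  }'
}
```

For instance, a node estimated at 10 rows that actually returns 1000 rows has a 100x error and falls into the severe band.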
100+
101+
---
102+
103+
## Phase 3 — Experiment (conditional)
104+
105+
- **≤30s:** Run GUC experiments per `guc-experiments.md` (default + merge-join-only) plus optional redundant-predicate test.
106+
- **>30s:** Skip experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run for redundant-predicate testing.
107+
- **Anomalous values** (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`.
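The 30-second gate above can be sketched as a check on the captured plan text. This assumes the plan contains a PostgreSQL-style `Execution Time: <n> ms` line; `phase3_action` is a hypothetical helper name, and the exact output shape returned by `readonly_query` may differ.

```bash
#!/usr/bin/env bash
# Sketch: decide whether Phase 3 runs GUC experiments, based on plan text.
# phase3_action is a hypothetical helper name.
phase3_action() {
  local plan_text="$1" ms
  # Assumption: the line looks like "Execution Time: 1234.567 ms".
  ms=$(printf '%s\n' "$plan_text" | awk '/Execution Time:/ { print $(NF-1); exit }')
  if [ -z "$ms" ]; then
    echo "unknown"
    return 1
  fi
  # 30 seconds expressed in milliseconds.
  awk -v ms="$ms" 'BEGIN {
    if (ms > 30000) print "skip-experiments"; else print "run-experiments"
  }'
}
```

The anomalous-values branch is not covered here, since detecting impossible row counts depends on the rules in `plan-interpretation.md`.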
108+
109+
---
110+
111+
## Phase 4 — Produce the Report, Invite Reassessment
112+
113+
Produce the full diagnostic report per the "Required Elements Checklist" in [report-format.md](report-format.md) — structure is non-negotiable.
114+
115+
End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation.
116+
117+
When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report.
118+
119+
If a query rewrite was identified in Phase 2, include it as a recommendation with the original and rewritten SQL side by side.
120+
121+
---
122+
123+
## psql Fallback
124+
125+
When the MCP is unavailable, pipe statements into `psql` via a heredoc or here-string and check `$?`; report failures without proceeding on partial evidence:
126+
127+
```bash
128+
TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION")
129+
PGPASSWORD="$TOKEN" psql "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;"
130+
```
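The snippet above passes the statement as a here-string and leaves the `$?` check to the caller. A minimal sketch of the heredoc form with an explicit status check might look like the following. `run_explain` and the `PSQL_BIN` override are hypothetical names introduced for illustration, and `HOST` and `PGPASSWORD` are assumed to be set by the caller as in the snippet above.

```bash
#!/usr/bin/env bash
# Sketch: psql fallback using a heredoc plus an exit-status check.
# run_explain and PSQL_BIN are hypothetical; HOST and PGPASSWORD are
# assumed to be exported by the caller beforehand.
run_explain() {
  local sql="$1" output status
  # ON_ERROR_STOP=1 makes psql exit non-zero when the statement fails;
  # without it, psql reading from stdin can exit 0 after an SQL error.
  output=$("${PSQL_BIN:-psql}" -v ON_ERROR_STOP=1 \
    "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<SQL
EXPLAIN ANALYZE VERBOSE ${sql};
SQL
  )
  status=$?
  if [ "$status" -ne 0 ]; then
    # Report the failure and stop rather than continuing on partial evidence.
    echo "plan capture failed (psql exit status $status)" >&2
    return "$status"
  fi
  printf '%s\n' "$output"
}
```

A caller would then gate later phases on the result, for example `plan=$(run_explain "SELECT 1") || exit 1`.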
131+
132+
---
133+
134+
## Safety
135+
136+
Plan capture uses `readonly_query` exclusively — it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than asking `transact --allow-writes` to run it; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or pl/pgsql.
