awslabs
diff --git a/plugins/databases-on-aws/skills/dsql/SKILL.md b/plugins/databases-on-aws/skills/dsql/SKILL.md
Lines changed: 3 additions & 62 deletions
diff --git a/plugins/databases-on-aws/skills/dsql/references/query-plan/workflow.md b/plugins/databases-on-aws/skills/dsql/references/query-plan/workflow.md
Lines changed: 136 additions & 0 deletions
@@ -108,8 +108,8 @@ sampled in [mcp/.mcp.json](mcp/.mcp.json)
 
 ### Query Plan Explainability (modular):
 
-**When:** MUST load all four at Workflow 8 Phase 0 — [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md), [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md), [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md), [query-plan/report-format.md](references/query-plan/report-format.md)
-**Contains:** DSQL node types + Node Duration math + estimation-error bands, pg_class/pg_stats/pg_indexes SQL + correlated-predicate verification, GUC experiment procedures + 30-second skip protocol, required report structure + element checklist + support request template
+**When:** MUST load [query-plan/workflow.md](references/query-plan/workflow.md) at Workflow 8 entry — it gates the remaining files
+**Contains:** Trigger criteria, context disambiguation, routing, phased workflow, and references to: [plan-interpretation.md](references/query-plan/plan-interpretation.md), [catalog-queries.md](references/query-plan/catalog-queries.md), [guc-experiments.md](references/query-plan/guc-experiments.md), [report-format.md](references/query-plan/report-format.md), [query-rewrites-generic.md](references/query-plan/query-rewrites-generic.md), [query-rewrites-dsql-specific.md](references/query-plan/query-rewrites-dsql-specific.md)
 
 ---
 
@@ -254,66 +254,7 @@ MUST load [mysql-migrations/type-mapping.md](references/mysql-migrations/type-ma
 
 ### Workflow 8: Query Plan Explainability
 
-Explains why the DSQL optimizer chose a particular plan. **REQUIRES a structured Markdown diagnostic report as the deliverable** — run the workflow end-to-end before answering. Use the `aurora-dsql` MCP when connected; fall back to raw `psql` with a generated IAM token (see the fallback block below) otherwise.
-
-#### Trigger Criteria
-
-Enter this workflow if **ANY** of these signals are present:
-
-| Signal | Examples |
-|--------|----------|
-| User provides SQL + mentions performance/speed/cost | "this query takes 8 seconds", "too slow", "optimize this", "make this faster" |
-| User mentions DPU cost or resource consumption | "high DPU", "query cost is too high", "read DPU seems excessive" |
-| User asks about a plan choice or scan type | "why is it doing a full scan?", "why not use the index?" |
-| User pastes EXPLAIN / EXPLAIN ANALYZE output | Raw plan text in the message |
-| User references a Query ID and asks about performance | "query abc-123 is slow" |
-| User says "reassess" / "re-run" / "I added the index" | Phase 5 re-entry for an existing report |
-
-#### Context Disambiguation
-
-Before entering the workflow, confirm the query targets DSQL:
-
-| Condition | Action |
-|-----------|--------|
-| Only `aurora-dsql` MCP is connected (no other database MCPs) | Proceed — DSQL is the only target |
-| User explicitly mentions DSQL, Aurora DSQL, or a known DSQL cluster | Proceed |
-| Conversation already has prior DSQL interaction (earlier queries, schema ops) | Proceed |
-| Multiple database MCPs are connected and no DSQL signal in the message | Ask the user which database they mean before proceeding |
-| No database MCP is connected | Inform the user that the `aurora-dsql` MCP is required and offer the psql fallback |
-
-#### Routing (sub-path selection)
-
-| Condition | Path |
-|-----------|------|
-| User provides SQL but no plan output | Full workflow: Phase 0 → 1 → 2 → 3 → 4 |
-| User pastes plan output + asks to fix/optimize | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4 |
-| User pastes plan output + asks what it means (educational) | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4. The report is the explanation — do not produce a shorter conversational answer instead |
-| Execution time >30s detected at Phase 1 | Phase 3 skips experiments per guc-experiments.md |
-| User says "reassess" or equivalent | Re-run Phase 1–2, append Addendum to existing report |
-
-**Phase 0 — Load reference material.** Read all four before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements):
-
-1. [query-plan/plan-interpretation.md](references/query-plan/plan-interpretation.md) — node types, duration math, anomalous values
-2. [query-plan/catalog-queries.md](references/query-plan/catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL
-3. [query-plan/guc-experiments.md](references/query-plan/guc-experiments.md) — GUC procedures and `>30s` skip protocol
-4. [query-plan/report-format.md](references/query-plan/report-format.md) — required report structure
-
-**Phase 1 — Capture the plan.** **ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (SELECT form) — **ALWAYS** capture a fresh plan from the cluster, even when the user describes the plan or reports an anomaly. **MAY** leverage `get_schema` or `information_schema` for schema sanity checks. When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause. Extract Query ID, Planning Time, Execution Time, DPU Estimate. **SELECT** runs as-is. **UPDATE/DELETE** rewrite to the equivalent SELECT (same join chain + WHERE) — the optimizer picks the same plan shape. **INSERT**, pl/pgsql, DO blocks, and functions **MUST** be rejected. **MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety.
-
-**Phase 2 — Gather evidence.** Using SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, `pg_indexes`, `COUNT(*)`, `COUNT(DISTINCT)`. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe). Detect correlated predicates and data skew. When a Full Scan appears despite an apparently usable index, check for type coercion index bypass: retrieve indexed column types and compare against predicate literal types using the implicit cast compatibility matrix in `plan-interpretation.md`.
-
-**Phase 3 — Experiment (conditional).** ≤30s: run GUC experiments per `guc-experiments.md` (default + merge-join-only) plus optional redundant-predicate test. >30s: skip experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run for redundant-predicate testing. Anomalous values (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`.
-
-**Phase 4 — Produce the report, invite reassessment.** Produce the full diagnostic report per the "Required Elements Checklist" in [query-plan/report-format.md](references/query-plan/report-format.md) — structure is non-negotiable. End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation. When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report.
-
-**psql fallback (MCP unavailable).** Pipe statements into `psql` via heredoc and check `$?`; report failures without proceeding on partial evidence:
-
-```bash
-TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION")
-PGPASSWORD="$TOKEN" psql "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;"
-```
-
-**Safety.** Plan capture uses `readonly_query` exclusively — it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than asking `transact --allow-writes` to run it; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or pl/pgsql.
+Explains why the DSQL optimizer chose a particular plan. **REQUIRES a structured Markdown diagnostic report as the deliverable.** MUST load [query-plan/workflow.md](references/query-plan/workflow.md) for trigger criteria, context disambiguation, routing, and the full phased workflow (Phase 0–4).
 
 ---
 
 
@@ -0,0 +1,136 @@
+# Query Plan Explainability — Workflow
+
+Complete workflow for diagnosing DSQL query plan performance issues. Produces a structured Markdown diagnostic report as the deliverable.
+
+## Table of Contents
+
+1. [Trigger Criteria](#trigger-criteria)
+2. [Context Disambiguation](#context-disambiguation)
+3. [Routing](#routing)
+4. [Phase 0 — Load Reference Material](#phase-0--load-reference-material)
+5. [Phase 1 — Capture the Plan](#phase-1--capture-the-plan)
+6. [Phase 2 — Gather Evidence](#phase-2--gather-evidence)
+7. [Phase 3 — Experiment (conditional)](#phase-3--experiment-conditional)
+8. [Phase 4 — Produce the Report, Invite Reassessment](#phase-4--produce-the-report-invite-reassessment)
+9. [psql Fallback](#psql-fallback)
+10. [Safety](#safety)
+
+---
+
+## Trigger Criteria
+
+Enter this workflow if **ANY** of these signals are present:
+
+| Signal | Examples |
+|--------|----------|
+| User provides SQL + mentions performance/speed/cost | "this query takes 8 seconds", "too slow", "optimize this", "make this faster" |
+| User mentions DPU cost or resource consumption | "high DPU", "query cost is too high", "read DPU seems excessive" |
+| User asks about a plan choice or scan type | "why is it doing a full scan?", "why not use the index?" |
+| User pastes EXPLAIN / EXPLAIN ANALYZE output | Raw plan text in the message |
+| User references a Query ID and asks about performance | "query abc-123 is slow" |
+| User says "reassess" / "re-run" / "I added the index" | Reassessment re-entry for an existing report (re-run Phase 1–2, append Addendum) |
+
+---
+
+## Context Disambiguation
+
+Before entering the workflow, confirm the query targets DSQL:
+
+| Condition | Action |
+|-----------|--------|
+| Only `aurora-dsql` MCP is connected (no other database MCPs) | Proceed — DSQL is the only target |
+| User explicitly mentions DSQL, Aurora DSQL, or a known DSQL cluster | Proceed |
+| Conversation already has prior DSQL interaction (earlier queries, schema ops) | Proceed |
+| Multiple database MCPs are connected and no DSQL signal in the message | Ask the user which database they mean before proceeding |
+| No database MCP is connected | Inform the user that the `aurora-dsql` MCP is required and offer the psql fallback |
+
+---
+
+## Routing
+
+| Condition | Path |
+|-----------|------|
+| User provides SQL but no plan output | Full workflow: Phase 0 → 1 → 2 → 3 → 4 |
+| User pastes plan output + asks to fix/optimize | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4 |
+| User pastes plan output + asks what it means (educational) | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4. The report is the explanation — do not produce a shorter conversational answer instead |
+| Execution time >30s detected at Phase 1 | Phase 3 skips experiments per guc-experiments.md |
+| User says "reassess" or equivalent | Re-run Phase 1–2, append Addendum to existing report |
+
+---
+
+## Phase 0 — Load Reference Material
+
+Read all files before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements):
+
+1. [plan-interpretation.md](plan-interpretation.md) — node types, duration math, anomalous values
+2. [catalog-queries.md](catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL
+3. [guc-experiments.md](guc-experiments.md) — GUC procedures and `>30s` skip protocol
+4. [report-format.md](report-format.md) — required report structure
+5. [query-rewrites-generic.md](query-rewrites-generic.md) — generic SQL rewrite patterns
+6. [query-rewrites-dsql-specific.md](query-rewrites-dsql-specific.md) — DSQL-specific rewrites
+
+---
+
+## Phase 1 — Capture the Plan
+
+**ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (SELECT form) — **ALWAYS** capture a fresh plan from the cluster, even when the user describes the plan or reports an anomaly. **MAY** leverage `get_schema` or `information_schema` for schema sanity checks.
+
+When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause.
+
+Extract: Query ID, Planning Time, Execution Time, DPU Estimate.
+
+| Statement type | Action |
+|---------------|--------|
+| SELECT | Run as-is |
+| UPDATE / DELETE | Rewrite to equivalent SELECT (same join chain + WHERE) — optimizer picks the same plan shape |
+| INSERT, pl/pgsql, DO blocks, functions | **MUST** reject |
+
+**MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety.
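The statement-type table above can be sketched as a small shell helper. This is an illustration only: `plan_capture_action` is a hypothetical name, it looks only at the first keyword of the statement, and it makes no attempt to detect pl/pgsql or function calls embedded in a SELECT. Anything the table does not list (for example a statement starting with `WITH`) is returned as `unlisted` rather than guessed at.

```bash
# Hypothetical helper (not part of the skill): map a statement to the
# Phase 1 action, based only on its first keyword.
plan_capture_action() {
  local first
  # Collapse whitespace, drop the leading blank, take the first word, upper-case it.
  first=$(printf '%s' "$1" \
    | tr -s '[:space:]' ' ' \
    | sed 's/^ //' \
    | cut -d' ' -f1 \
    | tr '[:lower:]' '[:upper:]')
  case $first in
    SELECT)        echo "run-as-is" ;;
    UPDATE|DELETE) echo "rewrite-to-select" ;;
    INSERT|DO)     echo "reject" ;;
    *)             echo "unlisted" ;;   # not covered by the table; needs a human decision
  esac
}
```

For example, `plan_capture_action "DELETE FROM t WHERE id = 1"` prints `rewrite-to-select`.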
+
+---
+
+## Phase 2 — Gather Evidence
+
+Using SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, `pg_indexes`, `COUNT(*)`, `COUNT(DISTINCT)`.
+
+1. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe).
+2. Detect correlated predicates and data skew.
+3. When a Full Scan appears despite an apparently usable index, check for **type coercion index bypass**: retrieve indexed column types and compare against predicate literal types using the implicit cast compatibility matrix in `plan-interpretation.md`.
+4. Check whether any query rewrite from `query-rewrites-generic.md` or `query-rewrites-dsql-specific.md` applies to the query structure (e.g., OR-to-IN, subquery unnesting, NOT IN to NOT EXISTS, split large joins).
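The estimation-error bands in step 1 can be illustrated with a short shell sketch. Several things here are assumptions rather than content from `plan-interpretation.md`: the function name `classify_estimation_error` is hypothetical, the error factor is taken as the larger of estimated and actual rows divided by the smaller, row counts below 1 are clamped to 1, and a factor that sits exactly on a boundary (5x or 50x) is placed in the more severe band.

```bash
# Hypothetical helper (not part of the skill): classify the gap between
# estimated and actual row counts into the bands named in step 1.
classify_estimation_error() {
  local est=$1 act=$2 hi lo
  # Assumption: clamp to 1 so a zero-row side does not break the comparison.
  if [ "$est" -lt 1 ]; then est=1; fi
  if [ "$act" -lt 1 ]; then act=1; fi
  if [ "$act" -gt "$est" ]; then hi=$act; lo=$est; else hi=$est; lo=$act; fi
  # Compare hi against multiples of lo to stay in integer arithmetic.
  if   [ "$hi" -ge $((50 * lo)) ]; then echo "severe"
  elif [ "$hi" -ge $((5 * lo)) ];  then echo "significant"
  elif [ "$hi" -ge $((2 * lo)) ];  then echo "minor"
  else echo "within-2x"
  fi
}
```

For example, `classify_estimation_error 10 400` prints `significant` (a 40x underestimate). The direction of the error is ignored, so over- and underestimates of the same factor land in the same band.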
+
+---
+
+## Phase 3 — Experiment (conditional)
+
+- **≤30s:** Run GUC experiments per `guc-experiments.md` (default + merge-join-only) plus optional redundant-predicate test.
+- **>30s:** Skip experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run for redundant-predicate testing.
+- **Anomalous values** (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`.
+
+---
+
+## Phase 4 — Produce the Report, Invite Reassessment
+
+Produce the full diagnostic report per the "Required Elements Checklist" in [report-format.md](report-format.md) — structure is non-negotiable.
+
+End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation.
+
+When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report.
+
+If a query rewrite was identified in Phase 2, include it as a recommendation with the original and rewritten SQL side by side.
+
+---
+
+## psql Fallback
+
+When the MCP is unavailable, pipe statements into `psql` via a here-string (`<<<`) and check `$?`; report failures without proceeding on partial evidence:
+
+```bash
+TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION")
+PGPASSWORD="$TOKEN" psql "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;"
+```
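The fallback text asks for an exit-status check that the two-line snippet does not show. A minimal sketch of that check follows, assuming `HOST` and `REGION` are already set as in the snippet above. The wrapper name `capture_plan_psql` is hypothetical. One detail is worth calling out: when `psql` reads a script from standard input it exits with status 0 even if a statement fails, unless `ON_ERROR_STOP` is set, so the sketch sets it. The sketch also uses a pipe instead of a here-string purely for portability across shells.

```bash
# Hypothetical wrapper (not part of the skill): capture a plan via psql and
# stop on any failure instead of continuing with partial evidence.
capture_plan_psql() {
  local sql=$1 token plan
  token=$(aws dsql generate-db-connect-admin-auth-token \
            --hostname "$HOST" --region "$REGION") || {
    echo "token generation failed" >&2
    return 1
  }
  # ON_ERROR_STOP=1 makes psql exit non-zero when the statement itself errors.
  plan=$(printf '%s\n' "EXPLAIN ANALYZE VERBOSE $sql;" \
    | PGPASSWORD="$token" psql \
        "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" \
        -v ON_ERROR_STOP=1) || {
    echo "psql failed; not proceeding on partial evidence" >&2
    return 1
  }
  printf '%s\n' "$plan"
}
```

Any error text from `psql` goes to standard error unchanged, which keeps it available for reporting verbatim as Phase 1 requires.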
+
+---
+
+## Safety
+
+Plan capture uses `readonly_query` exclusively — it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than asking `transact --allow-writes` to run it; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or pl/pgsql.