| 1 | +# Query Plan Explainability — Workflow |
| 2 | + |
| 3 | +Complete workflow for diagnosing DSQL query plan performance issues. Produces a structured Markdown diagnostic report as the deliverable. |
| 4 | + |
| 5 | +## Table of Contents |
| 6 | + |
| 7 | +1. [Trigger Criteria](#trigger-criteria) |
| 8 | +2. [Context Disambiguation](#context-disambiguation) |
| 9 | +3. [Routing](#routing) |
| 10 | +4. [Phase 0 — Load Reference Material](#phase-0--load-reference-material) |
| 11 | +5. [Phase 1 — Capture the Plan](#phase-1--capture-the-plan) |
| 12 | +6. [Phase 2 — Gather Evidence](#phase-2--gather-evidence) |
| 13 | +7. [Phase 3 — Experiment (conditional)](#phase-3--experiment-conditional) |
| 14 | +8. [Phase 4 — Produce the Report, Invite Reassessment](#phase-4--produce-the-report-invite-reassessment) |
| 15 | +9. [psql Fallback](#psql-fallback) |
| 16 | +10. [Safety](#safety) |
| 17 | + |
| 18 | +--- |
| 19 | + |
| 20 | +## Trigger Criteria |
| 21 | + |
| 22 | +Enter this workflow if **ANY** of these signals are present: |
| 23 | + |
| 24 | +| Signal | Examples | |
| 25 | +|--------|----------| |
| 26 | +| User provides SQL + mentions performance/speed/cost | "this query takes 8 seconds", "too slow", "optimize this", "make this faster" | |
| 27 | +| User mentions DPU cost or resource consumption | "high DPU", "query cost is too high", "read DPU seems excessive" | |
| 28 | +| User asks about a plan choice or scan type | "why is it doing a full scan?", "why not use the index?" | |
| 29 | +| User pastes EXPLAIN / EXPLAIN ANALYZE output | Raw plan text in the message | |
| 30 | +| User references a Query ID and asks about performance | "query abc-123 is slow" | |
| 31 | +| User says "reassess" / "re-run" / "I added the index" | Reassessment re-entry (Phase 4) for an existing report |
| 32 | + |
| 33 | +--- |
| 34 | + |
| 35 | +## Context Disambiguation |
| 36 | + |
| 37 | +Before entering the workflow, confirm the query targets DSQL: |
| 38 | + |
| 39 | +| Condition | Action | |
| 40 | +|-----------|--------| |
| 41 | +| Only `aurora-dsql` MCP is connected (no other database MCPs) | Proceed — DSQL is the only target | |
| 42 | +| User explicitly mentions DSQL, Aurora DSQL, or a known DSQL cluster | Proceed | |
| 43 | +| Conversation already has prior DSQL interaction (earlier queries, schema ops) | Proceed | |
| 44 | +| Multiple database MCPs are connected and no DSQL signal in the message | Ask the user which database they mean before proceeding | |
| 45 | +| No database MCP is connected | Inform the user that the `aurora-dsql` MCP is required and offer the psql fallback | |
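
The table above can be read as a small decision function. The following is a minimal sketch, not part of the workflow itself: the `Context` fields, the returned labels, and the order in which the rows are checked are all assumptions made for illustration (the table does not say which row wins when several apply).

```python
from dataclasses import dataclass


@dataclass
class Context:
    # All field names are hypothetical; they mirror the table's conditions.
    connected_mcps: set             # names of connected database MCPs
    message_mentions_dsql: bool     # user named DSQL, Aurora DSQL, or a known cluster
    prior_dsql_interaction: bool    # earlier DSQL queries or schema ops in the conversation


def disambiguate(ctx: Context) -> str:
    """Return 'proceed', 'ask_user', or 'offer_psql_fallback'."""
    # Assumption: a missing MCP is checked first, since nothing can run without it.
    if not ctx.connected_mcps:
        return "offer_psql_fallback"
    # Any explicit or conversational DSQL signal is enough to proceed.
    if ctx.message_mentions_dsql or ctx.prior_dsql_interaction:
        return "proceed"
    # DSQL is the only possible target.
    if ctx.connected_mcps == {"aurora-dsql"}:
        return "proceed"
    # Other database MCPs are connected and nothing points at DSQL.
    return "ask_user"
```

For example, `disambiguate(Context({"aurora-dsql", "other-db"}, False, False))` falls through to `"ask_user"`, where `other-db` is a placeholder name.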
| 46 | + |
| 47 | +--- |
| 48 | + |
| 49 | +## Routing |
| 50 | + |
| 51 | +| Condition | Path | |
| 52 | +|-----------|------| |
| 53 | +| User provides SQL but no plan output | Full workflow: Phase 0 → 1 → 2 → 3 → 4 | |
| 54 | +| User pastes plan output + asks to fix/optimize | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4 | |
| 55 | +| User pastes plan output + asks what it means (educational) | Full workflow: Phase 0 → 1 (re-capture fresh plan) → 2 → 3 → 4. The report is the explanation — do not produce a shorter conversational answer instead | |
| 56 | +| Execution time >30s detected at Phase 1 | Phase 3 skips experiments per `guc-experiments.md` |
| 57 | +| User says "reassess" or equivalent | Re-run Phase 1–2, append Addendum to existing report | |
| 58 | + |
| 59 | +--- |
| 60 | + |
| 61 | +## Phase 0 — Load Reference Material |
| 62 | + |
| 63 | +Read all files before starting — each has content later phases need verbatim (node-type math, exact catalog SQL, the `>30s` skip protocol, required report elements): |
| 64 | + |
| 65 | +1. [plan-interpretation.md](plan-interpretation.md) — node types, duration math, anomalous values |
| 66 | +2. [catalog-queries.md](catalog-queries.md) — pg_class / pg_stats / pg_indexes SQL |
| 67 | +3. [guc-experiments.md](guc-experiments.md) — GUC procedures and `>30s` skip protocol |
| 68 | +4. [report-format.md](report-format.md) — required report structure |
| 69 | +5. [query-rewrites-generic.md](query-rewrites-generic.md) — generic SQL rewrite patterns |
| 70 | +6. [query-rewrites-dsql-specific.md](query-rewrites-dsql-specific.md) — DSQL-specific rewrites |
| 71 | + |
| 72 | +--- |
| 73 | + |
| 74 | +## Phase 1 — Capture the Plan |
| 75 | + |
| 76 | +**ALWAYS** run `readonly_query("EXPLAIN ANALYZE VERBOSE …")` on the user's query verbatim (in SELECT form) to capture a fresh plan from the cluster, even when the user pastes or describes a plan or reports an anomaly. **MAY** use `get_schema` or `information_schema` for schema sanity checks. |
| 77 | + |
| 78 | +When EXPLAIN errors (`relation does not exist`, `column does not exist`), **MUST** report the error verbatim — **MUST NOT** invent DSQL-specific semantics (e.g., case sensitivity, identifier quoting) as the root cause. |
| 79 | + |
| 80 | +Extract: Query ID, Planning Time, Execution Time, DPU Estimate. |
| 81 | + |
| 82 | +| Statement type | Action | |
| 83 | +|---------------|--------| |
| 84 | +| SELECT | Run as-is | |
| 85 | +| UPDATE / DELETE | Rewrite to equivalent SELECT (same join chain + WHERE) — optimizer picks the same plan shape | |
| 86 | +| INSERT, PL/pgSQL, DO blocks, functions | **MUST** reject |
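
As an illustration of the table above, the sketch below maps a statement to its action by looking only at the leading keyword. The function name and return labels are hypothetical. It is deliberately conservative: anything it does not recognize, including statements that begin with a comment or a `WITH` clause, is rejected rather than guessed at, which is an assumption of this sketch and not a rule stated in the workflow.

```python
import re


def plan_capture_action(sql: str) -> str:
    """Return 'run_as_is', 'rewrite_to_select', or 'reject' for a statement."""
    # Take the first alphabetic token as the statement keyword.
    match = re.match(r"\s*([A-Za-z]+)", sql)
    keyword = match.group(1).upper() if match else ""
    if keyword == "SELECT":
        return "run_as_is"
    if keyword in ("UPDATE", "DELETE"):
        # Caller rewrites to a SELECT with the same join chain and WHERE clause.
        return "rewrite_to_select"
    # INSERT, DO blocks, function definitions, and anything unrecognized.
    return "reject"
```

A real implementation would need proper SQL parsing to handle comments and CTEs; this only shows the routing idea.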
| 87 | + |
| 88 | +**MUST NOT** use `transact --allow-writes` for plan capture; it bypasses MCP safety. |
| 89 | + |
| 90 | +--- |
| 91 | + |
| 92 | +## Phase 2 — Gather Evidence |
| 93 | + |
| 94 | +Using the SQL from `catalog-queries.md`, query `pg_class`, `pg_stats`, and `pg_indexes`, and run `COUNT(*)` and `COUNT(DISTINCT)` against the tables and columns the plan touches. |
| 95 | + |
| 96 | +1. Classify estimation errors per `plan-interpretation.md` (2x–5x minor, 5x–50x significant, 50x+ severe). |
| 97 | +2. Detect correlated predicates and data skew. |
| 98 | +3. When a Full Scan appears despite an apparently usable index, check for **type coercion index bypass**: retrieve indexed column types and compare against predicate literal types using the implicit cast compatibility matrix in `plan-interpretation.md`. |
| 99 | +4. Check whether any query rewrite from `query-rewrites-generic.md` or `query-rewrites-dsql-specific.md` applies to the query structure (e.g., OR-to-IN, subquery unnesting, NOT IN to NOT EXISTS, split large joins). |
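
The severity bands in step 1 can be sketched as a simple ratio check. This is a hedged illustration only: the authoritative rules live in `plan-interpretation.md`, and the symmetric ratio, the clamping of zero row counts to 1, the handling of the shared 5x and 50x boundaries, and the "within tolerance" label for errors under 2x are all assumptions of this sketch.

```python
def estimation_error_factor(estimated_rows: float, actual_rows: float) -> float:
    """Symmetric misestimate factor: over- and under-estimates score the same."""
    # Assumption: clamp to 1 row so a zero count does not divide by zero.
    est = max(estimated_rows, 1.0)
    act = max(actual_rows, 1.0)
    return max(est / act, act / est)


def classify_estimation_error(estimated_rows: float, actual_rows: float) -> str:
    """Map a plan node's estimated vs. actual rows to a severity label."""
    factor = estimation_error_factor(estimated_rows, actual_rows)
    if factor >= 50:
        return "severe"
    if factor >= 5:
        return "significant"
    if factor >= 2:
        return "minor"
    return "within tolerance"
```

For instance, a node estimated at 10 rows that actually returns 1,000 has a factor of 100 and is classified as severe.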
| 100 | + |
| 101 | +--- |
| 102 | + |
| 103 | +## Phase 3 — Experiment (conditional) |
| 104 | + |
| 105 | +- **Execution time ≤30s:** Run the GUC experiments per `guc-experiments.md` (default + merge-join-only), plus the optional redundant-predicate test. |
| 106 | +- **Execution time >30s:** Skip the experiments, include the manual GUC testing SQL verbatim in the report, and do not re-run the query for redundant-predicate testing. |
| 107 | +- **Anomalous values** (impossible row counts): confirm query results are correct despite the anomalous EXPLAIN, flag as a potential DSQL bug, and produce the Support Request Template from `report-format.md`. |
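
The 30-second gate can be sketched as a check on the execution time captured in Phase 1. The example assumes the plan output ends with a PostgreSQL-style `Execution Time: <n> ms` line; the exact DSQL output format is not shown in this document, so treat the regular expression and the function names as hypothetical.

```python
import re
from typing import Optional

# Assumption: PostgreSQL-style summary line, e.g. "Execution Time: 812.400 ms".
EXECUTION_TIME = re.compile(r"Execution Time:\s*([0-9]+(?:\.[0-9]+)?)\s*ms")
THRESHOLD_MS = 30_000.0  # the 30s cutoff from the bullets above


def execution_time_ms(plan_text: str) -> Optional[float]:
    """Extract the execution time in milliseconds, or None if absent."""
    match = EXECUTION_TIME.search(plan_text)
    return float(match.group(1)) if match else None


def should_run_experiments(plan_text: str) -> bool:
    """True when execution time is at or under 30s, so GUC experiments may run."""
    elapsed = execution_time_ms(plan_text)
    if elapsed is None:
        # Do not guess: without a timing there is no basis for the decision.
        raise ValueError("no Execution Time line found in plan output")
    return elapsed <= THRESHOLD_MS
```

Raising on a missing timing, rather than defaulting to either branch, keeps the skip decision tied to evidence actually captured from the cluster.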
| 108 | + |
| 109 | +--- |
| 110 | + |
| 111 | +## Phase 4 — Produce the Report, Invite Reassessment |
| 112 | + |
| 113 | +Produce the full diagnostic report per the "Required Elements Checklist" in [report-format.md](report-format.md) — structure is non-negotiable. |
| 114 | + |
| 115 | +End with the "Next Steps" block from that reference so the user can ask for a reassessment after applying a recommendation. |
| 116 | + |
| 117 | +When the user says "reassess" (or equivalent), re-run Phase 1–2 and **append an "Addendum: After-Change Performance"** to the original report (before/after table, match against expected impact) rather than producing a new report. |
| 118 | + |
| 119 | +If a query rewrite was identified in Phase 2, include it as a recommendation with the original and rewritten SQL side by side. |
| 120 | + |
| 121 | +--- |
| 122 | + |
| 123 | +## psql Fallback |
| 124 | + |
| 125 | +When the MCP is unavailable, feed the statement to `psql` via a here-string with `ON_ERROR_STOP=1` set (without it, `psql` exits 0 even when a statement read from stdin fails), then check `$?`; report failures rather than proceeding on partial evidence: |
| 126 | + |
| 127 | +```bash |
| 128 | +TOKEN=$(aws dsql generate-db-connect-admin-auth-token --hostname "$HOST" --region "$REGION") |
| 129 | +PGPASSWORD="$TOKEN" psql -v ON_ERROR_STOP=1 "host=$HOST port=5432 user=admin dbname=postgres sslmode=require" <<<"EXPLAIN ANALYZE VERBOSE <sql>;" |
| 130 | +``` |
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## Safety |
| 135 | + |
| 136 | +Plan capture uses `readonly_query` exclusively; it rejects INSERT/UPDATE/DELETE/DDL at the MCP layer. Rewrite DML to SELECT (Phase 1) rather than running it through `transact --allow-writes`; write-mode `transact` bypasses all MCP safety checks. **MUST NOT** run arbitrary DDL/DML or PL/pgSQL. |