Skip to content

Latest commit

 

History

History
235 lines (153 loc) · 6.54 KB

File metadata and controls

235 lines (153 loc) · 6.54 KB

Section 09 — Analyse Queries

What you'll practice: reading query plans from EXPLAIN and EXPLAIN ANALYZE, telling a sequential scan from an index scan, comparing estimated vs actual row counts, and spotting when a join is nested loop vs hash join vs merge join.

This section is paired with Section 08. We reuse a similar 300k-row employee table in its own schema so you can add/remove indexes without affecting the indexes section.

Learning objectives

  • Read the output of EXPLAIN
  • Understand the difference between EXPLAIN (estimate) and EXPLAIN ANALYZE (actually run)
  • Read the cost numbers: (cost=start..total rows=N width=W)
  • Identify Seq Scan, Index Scan, Index Only Scan, Bitmap Heap Scan
  • Identify Nested Loop, Hash Join, Merge Join
  • Use EXPLAIN (ANALYZE, BUFFERS) to see disk/cache behaviour
  • Spot estimator mistakes (estimated rows vs actual rows)

How to run these exercises

psql -U postgres -d sql_exercise -f 04-advanced-databases/09-analyse-queries/schema.sql
psql -U postgres -d sql_exercise

Inside psql:

SET search_path TO adv_explain;
\timing on

Exercises

Exercise 1 — Your first EXPLAIN

Run:

EXPLAIN SELECT * FROM employee;

Question: what's the "node" type? What are the three numbers in (cost=... rows=... width=...)? What does each one mean?


Exercise 2 — EXPLAIN vs EXPLAIN ANALYZE

Run both of these and compare:

EXPLAIN           SELECT * FROM employee WHERE country = 'GB';
EXPLAIN ANALYZE   SELECT * FROM employee WHERE country = 'GB';

Question: what extra information does EXPLAIN ANALYZE give you? Why is it dangerous to run EXPLAIN ANALYZE on DELETE or UPDATE statements?


Exercise 3 — Seq Scan

Run:

EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';

Question: which scan type was used? Why? (There is currently no index on email.)


Exercise 4 — Add an index, re-EXPLAIN

CREATE INDEX employee_email_idx ON employee (email);
EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';

Question: what type of scan does Postgres pick now? How has the execution time changed?


Exercise 5 — Bitmap heap scan

CREATE INDEX employee_salary_idx ON employee (salary);
EXPLAIN ANALYZE SELECT * FROM employee WHERE salary BETWEEN 50000 AND 55000;

Question: the plan will probably show a "Bitmap Index Scan" and a "Bitmap Heap Scan". What is a bitmap scan, and why does Postgres prefer it over a plain Index Scan for range queries that return many rows?


Exercise 6 — Index Only Scan

Run:

EXPLAIN ANALYZE SELECT email FROM employee WHERE email = 'user100000@example.com';

Question: what's the scan type? Why is it an "Index Only Scan" and not a plain "Index Scan"?


Exercise 7 — Nested Loop join

With very small department table (8 rows) and large employee table, Postgres should pick a hash join. Force a nested loop the silly way:

SET enable_hashjoin = OFF;
SET enable_mergejoin = OFF;
EXPLAIN ANALYZE
SELECT e.first_name, e.last_name, d.head_name
FROM   employee e
JOIN   department d ON d.name = e.department
WHERE  e.id < 100;
RESET enable_hashjoin;
RESET enable_mergejoin;

Question: what does the Nested Loop node look like? Which side is the "outer" side and which is the "inner"?


Exercise 8 — Hash join

Let Postgres decide again:

EXPLAIN ANALYZE
SELECT e.first_name, e.last_name, d.head_name
FROM   employee e
JOIN   department d ON d.name = e.department;

Question: is this a Hash Join, a Merge Join, or a Nested Loop? Why did Postgres pick this type for the full-table join?


Exercise 9 — Estimated vs actual rows

Run:

EXPLAIN ANALYZE
SELECT * FROM employee WHERE department = 'engineering';

Question: look at the rows= number in the plan (estimated) and the actual rows= number (real). How close are they? If they're very different, it often means ANALYZE hasn't been run recently — try ANALYZE employee; and re-run.


Exercise 10 — EXPLAIN (ANALYZE, BUFFERS)

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM employee WHERE country = 'GB';

Question: what do shared hit, shared read, and shared dirtied mean in the BUFFERS output? (Hint: hit = came from cache; read = had to read from disk.)

Run the same query again — you should see mostly shared hit the second time.


Exercise 11 — Get the plan as JSON

EXPLAIN (FORMAT JSON)
SELECT * FROM employee WHERE id = 42;

Question: when might you want the JSON format instead of the default tree? (Think: programmatic tools.)


Exercise 12 — Force a Seq Scan to prove a point

SET enable_indexscan = OFF;
SET enable_bitmapscan = OFF;
EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';
RESET enable_indexscan;
RESET enable_bitmapscan;

Question: how much slower is the forced Seq Scan than the Index Scan on the same query? This is the gap an index pays for.


Exercise 13 — The planner changes its mind with stats

Add a few million rows to skew the table? We don't need to — instead, change the query to one that returns most of the table and watch the planner pick a Seq Scan:

EXPLAIN ANALYZE SELECT COUNT(*) FROM employee WHERE salary > 0;

Question: is Postgres using the employee_salary_idx index here? Why not? When a filter matches almost everything, an index is worse than just reading the table.


Exercise 14 — Costing a simple aggregate

EXPLAIN ANALYZE
SELECT department, COUNT(*), AVG(salary)::INT AS avg_salary
FROM   employee
GROUP  BY department
ORDER  BY avg_salary DESC;

Question: which nodes does the plan have? How does GROUP BY show up? (You'll typically see HashAggregate or GroupAggregate.)


Bonus challenges

Bonus 1 — Read a deeply nested plan

Come up with a query that joins employee to itself to find pairs in the same department but different countries. Run EXPLAIN ANALYZE. Read the plan from the bottom up — each child feeds its parent.

Bonus 2 — auto_explain

Read about the auto_explain extension. It captures plans for slow queries automatically. What problem does it solve that manual EXPLAIN ANALYZE doesn't?

Bonus 3 — pg_stat_statements

Read about pg_stat_statements. How would you use it to find the top 10 slowest queries by total time running against your database right now?


Cleanup

DROP SCHEMA IF EXISTS adv_explain CASCADE;