What you'll practice: reading query plans from
EXPLAINandEXPLAIN ANALYZE, telling a sequential scan from an index scan, comparing estimated vs actual row counts, and spotting when a join is nested loop vs hash join vs merge join.
This section is paired with Section 08. We reuse a similar 300k-row employee table in its own schema so you can add/remove indexes without affecting the indexes section.
- Read the output of
EXPLAIN - Understand the difference between
EXPLAIN(estimate) andEXPLAIN ANALYZE(actually run) - Read the cost numbers:
(cost=start..total rows=N width=W) - Identify
Seq Scan,Index Scan,Index Only Scan,Bitmap Heap Scan - Identify
Nested Loop,Hash Join,Merge Join - Use
EXPLAIN (ANALYZE, BUFFERS)to see disk/cache behaviour - Spot estimator mistakes (estimated rows vs actual rows)
psql -U postgres -d sql_exercise -f 04-advanced-databases/09-analyse-queries/schema.sql
psql -U postgres -d sql_exerciseInside psql:
SET search_path TO adv_explain;
\timing onRun:
EXPLAIN SELECT * FROM employee;Question: what's the "node" type? What are the three numbers in (cost=... rows=... width=...)? What does each one mean?
Run both of these and compare:
EXPLAIN SELECT * FROM employee WHERE country = 'GB';
EXPLAIN ANALYZE SELECT * FROM employee WHERE country = 'GB';Question: what extra information does EXPLAIN ANALYZE give you? Why is it dangerous to run EXPLAIN ANALYZE on DELETE or UPDATE statements?
Run:
EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';Question: which scan type was used? Why? (There is currently no index on email.)
CREATE INDEX employee_email_idx ON employee (email);
EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';Question: what type of scan does Postgres pick now? How has the execution time changed?
CREATE INDEX employee_salary_idx ON employee (salary);
EXPLAIN ANALYZE SELECT * FROM employee WHERE salary BETWEEN 50000 AND 55000;Question: the plan will probably show a "Bitmap Index Scan" and a "Bitmap Heap Scan". What is a bitmap scan, and why does Postgres prefer it over a plain Index Scan for range queries that return many rows?
Run:
EXPLAIN ANALYZE SELECT email FROM employee WHERE email = 'user100000@example.com';Question: what's the scan type? Why is it an "Index Only Scan" and not a plain "Index Scan"?
With very small department table (8 rows) and large employee table, Postgres should pick a hash join. Force a nested loop the silly way:
SET enable_hashjoin = OFF;
SET enable_mergejoin = OFF;
EXPLAIN ANALYZE
SELECT e.first_name, e.last_name, d.head_name
FROM employee e
JOIN department d ON d.name = e.department
WHERE e.id < 100;
RESET enable_hashjoin;
RESET enable_mergejoin;Question: what does the Nested Loop node look like? Which side is the "outer" side and which is the "inner"?
Let Postgres decide again:
EXPLAIN ANALYZE
SELECT e.first_name, e.last_name, d.head_name
FROM employee e
JOIN department d ON d.name = e.department;Question: is this a Hash Join, a Merge Join, or a Nested Loop? Why did Postgres pick this type for the full-table join?
Run:
EXPLAIN ANALYZE
SELECT * FROM employee WHERE department = 'engineering';Question: look at the rows= number in the plan (estimated) and the actual rows= number (real). How close are they? If they're very different, it often means ANALYZE hasn't been run recently — try ANALYZE employee; and re-run.
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM employee WHERE country = 'GB';Question: what do shared hit, shared read, and shared dirtied mean in the BUFFERS output? (Hint: hit = came from cache; read = had to read from disk.)
Run the same query again — you should see mostly shared hit the second time.
EXPLAIN (FORMAT JSON)
SELECT * FROM employee WHERE id = 42;Question: when might you want the JSON format instead of the default tree? (Think: programmatic tools.)
SET enable_indexscan = OFF;
SET enable_bitmapscan = OFF;
EXPLAIN ANALYZE SELECT * FROM employee WHERE email = 'user100000@example.com';
RESET enable_indexscan;
RESET enable_bitmapscan;Question: how much slower is the forced Seq Scan than the Index Scan on the same query? This is the gap an index pays for.
Add a few million rows to skew the table? We don't need to — instead, change the query to one that returns most of the table and watch the planner pick a Seq Scan:
EXPLAIN ANALYZE SELECT COUNT(*) FROM employee WHERE salary > 0;Question: is Postgres using the employee_salary_idx index here? Why not? When a filter matches almost everything, an index is worse than just reading the table.
EXPLAIN ANALYZE
SELECT department, COUNT(*), AVG(salary)::INT AS avg_salary
FROM employee
GROUP BY department
ORDER BY avg_salary DESC;Question: which nodes does the plan have? How does GROUP BY show up? (You'll typically see HashAggregate or GroupAggregate.)
Come up with a query that joins employee to itself to find pairs in the same department but different countries. Run EXPLAIN ANALYZE. Read the plan from the bottom up — each child feeds its parent.
Read about the auto_explain extension. It captures plans for slow queries automatically. What problem does it solve that manual EXPLAIN ANALYZE doesn't?
Read about pg_stat_statements. How would you use it to find the top 10 slowest queries by total time running against your database right now?
DROP SCHEMA IF EXISTS adv_explain CASCADE;