Skip to content

Latest commit

 

History

History
248 lines (155 loc) · 7.82 KB

File metadata and controls

248 lines (155 loc) · 7.82 KB

Section 14 — Window Functions

What you'll practice: the OVER clause, PARTITION BY, ORDER BY inside a window, ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, and aggregates as window functions (SUM() OVER, AVG() OVER).

Window functions are the single most powerful feature in standard SQL that most intermediate users never touch. They let you compute "for each row, something that depends on a set of other rows" — running totals, ranks, moving averages, row-over-row differences — without collapsing the result set with a GROUP BY.

Mental model

A window function is like a GROUP BY aggregate that does not reduce the row count. Instead of computing one value per group, it computes one value per row, where that value is calculated over a "window" of related rows.

The magic word is OVER. Anywhere you'd normally write an aggregate like SUM(salary), you can write SUM(salary) OVER (PARTITION BY department ORDER BY hired_on) and get a "running total by department, in hire order" attached to every employee row.

Learning objectives

  • Attach a window to an aggregate with OVER ()
  • Partition a window with PARTITION BY
  • Order a window with ORDER BY inside the OVER
  • Use the ranking functions: ROW_NUMBER, RANK, DENSE_RANK, NTILE
  • Use the value functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE
  • Use aggregates as window functions: SUM, AVG, COUNT, MIN, MAX
  • Understand frame clauses (ROWS BETWEEN ...) at a high level

How to run these exercises

psql -U postgres -d sql_exercise -f 04-advanced-databases/14-window-functions/schema.sql
psql -U postgres -d sql_exercise -f 04-advanced-databases/14-window-functions/seed.sql
psql -U postgres -d sql_exercise

Inside psql:

SET search_path TO adv_windows;

Exercises

Exercise 1 — Compare aggregate with and without a window

Run both of these and look at the output:

SELECT department, AVG(salary)::INT FROM employee GROUP BY department;

vs.

SELECT department, full_name, salary,
       AVG(salary) OVER (PARTITION BY department)::INT AS dept_avg
FROM employee;

Question: how many rows does each query return? What did OVER (PARTITION BY department) buy you over GROUP BY department?


Exercise 2 — Salary vs department average

Return full_name, department, salary, dept_avg, and salary - dept_avg. The goal: show each employee's salary next to how it compares to their department's average, in one pass.


Exercise 3 — ROW_NUMBER per department

Assign a per-department row number so the highest-paid employee in each department is 1, the next is 2, and so on.

Hint:

ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn

Exercise 4 — Top earner per department

Building on Exercise 3, return only the top earner per department. (Hint: wrap the query in a sub-query or CTE and filter WHERE rn = 1.)

Expected: 4 rows — one per department.


Exercise 5 — Top 2 earners per department

Same idea, but return the top two per department.


Exercise 6 — RANK vs DENSE_RANK vs ROW_NUMBER

Return full_name, department, salary, and three columns:

  • ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC)
  • RANK() OVER (PARTITION BY department ORDER BY salary DESC)
  • DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC)

Question: notice the difference between RANK and DENSE_RANK on data (Grace and Donald? Or wherever there's a tie in your data). When is DENSE_RANK "the right answer"?


Exercise 7 — NTILE: split into quartiles

Split all employees into four pay quartiles and show which quartile each falls into.

Hint: NTILE(4) OVER (ORDER BY salary).


Exercise 8 — Running total of sales per rep

For every sale row, return rep_name, sold_at, amount, and a running total of that rep's sales up to and including that date.

Hint:

SUM(amount) OVER (PARTITION BY rep_name ORDER BY sold_at) AS running_total

Exercise 9 — Row-over-row difference with LAG

Show each sale next to the previous sale (in time order, per rep) and the difference.

Hint:

LAG(amount)  OVER (PARTITION BY rep_name ORDER BY sold_at) AS prev_amount,
amount - LAG(amount) OVER (PARTITION BY rep_name ORDER BY sold_at) AS diff

Exercise 10 — LEAD: "next" sale

Same as Exercise 9 but showing the next sale's amount per rep.


Exercise 11 — First and last sale per rep

For every sale row, show the rep_name, sold_at, amount, and two extra columns:

  • first_sale_amount — the rep's very first sale amount
  • last_sale_amount — the rep's very last sale amount

Hint:

FIRST_VALUE(amount) OVER (PARTITION BY rep_name ORDER BY sold_at)
LAST_VALUE(amount)  OVER (PARTITION BY rep_name ORDER BY sold_at
                          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)

Question: why does LAST_VALUE need the explicit frame clause? (This is the window function "gotcha" everyone hits once.)


Exercise 12 — 3-day moving average of daily sales

First, aggregate sales to (day, total). Then attach a 3-day (current + 2 previous) moving average to every day.

Hint:

WITH daily AS (
    SELECT sold_at AS day, SUM(amount) AS total
    FROM   sale
    GROUP  BY sold_at
)
SELECT day, total,
       AVG(total) OVER (
           ORDER BY day
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       )::INT AS moving_avg
FROM daily
ORDER BY day;

Exercise 13 — Percent of department total

For every employee, return full_name, department, salary, and pct_of_dept_total — their salary as a percentage of their department's total salary spend.

Hint: 100.0 * salary / SUM(salary) OVER (PARTITION BY department).


Exercise 14 — Cumulative distribution

For every employee, return full_name, salary, and a cume_dist column using CUME_DIST() OVER (ORDER BY salary). Also add PERCENT_RANK() OVER (ORDER BY salary).

Question: what does CUME_DIST mean, and how is it different from PERCENT_RANK?


Exercise 15 — Largest gap between hires

For every employee, return full_name, hired_on, and the number of days between their hire date and the next hire date (chronologically). Hint: LEAD(hired_on) OVER (ORDER BY hired_on) - hired_on.

Then wrap the query in a CTE and return the single largest gap.


Exercise 16 — Months with above-average sales

For each month, compute the total sales. Then use a window function to attach the overall average monthly sales as a column, and return only months whose total exceeds that overall average.

Hint:

WITH monthly AS (
    SELECT DATE_TRUNC('month', sold_at)::DATE AS month, SUM(amount) AS total
    FROM   sale GROUP BY 1
),
with_avg AS (
    SELECT month, total, AVG(total) OVER () AS overall_avg FROM monthly
)
SELECT month, total, overall_avg::INT
FROM   with_avg
WHERE  total > overall_avg;

Bonus challenges

Bonus 1 — WINDOW clause

Several window functions in one query can share a named window. Rewrite Exercise 6 using WINDOW w AS (PARTITION BY department ORDER BY salary DESC) and reference it with ROW_NUMBER() OVER w.

Bonus 2 — Gaps and islands

Classic interview question: given the sale dates, find the start and end of each contiguous "island" of dates where the same rep made sales on consecutive days (if any). Hint: the "gaps and islands" trick uses ROW_NUMBER() minus a date sequence number.

Bonus 3 — Percent change vs previous row

Add a pct_change column to your Exercise 9 result that shows the percent change from the previous sale. Handle the first row (no previous) by returning NULL.


Cleanup

DROP SCHEMA IF EXISTS adv_windows CASCADE;