Query Compilation 101

Prerequisite videos

Materialized performance 101 (skip forward to 5:13)
Materialized internals 101.
Introduction to one-off queries.
Materialize Decorrelation explained in Jamie Brandon’s Blog.

Current Compiler Pipeline

%%{init: {"flowchart": {"defaultRenderer": "elk"} } }%%
flowchart LR
    SQL@{ shape: doc, label="SQL" } --> IR
    IR --> dataflow@{ shape: docs }

    subgraph IR["intermediate languages"]
      direction LR
      HIR@{ shape: doc } --> MIR@{ shape: docs } --> LIR@{ shape: docs }
      MIR -. optimizations .-> MIR
    end

    classDef purple fill:#472F85
    class SQL,HIR,MIR,LIR,dataflow purple

Representations:

SQL — source language
AST — a parsed version of a SQL query.
HIR — high-level intermediate representation.
MIR — mid-level intermediate representation.
LIR — low-level intermediate representation.
TDO — target language (timely & differential operators).

Transformations in the compile-time lifecycle of a dataflow.

SQL ⇒ AST.
- Parsing the SQL query.
AST ⇒ AST
- Resolving names against the catalog.
  - CatalogItemType lists the kinds of objects that can be resolved against the catalog.
AST ⇒ HIR.
- Resolving column references, column aliases, and table aliases
- If the SQL query is a one-off, the outermost TopK is converted to a RowSetFinishing at this point.
- EXPLAIN RAW PLAN returns the result of transformations up to this point.
HIR ⇒ HIR
- Predecorrelation rewrites:
  - Split out subquery conditions out as a separate predicate.
  - Try to rewrite other types of subqueries into EXISTS subqueries.
HIR ⇒ MIR.
- Decorrelation:
  - Correlated queries are rewritten as graphs with join and distinct.
- Lowering — express SQL-specific concepts as dataflow sub-graphs:
  - Outer joins are decomposed into multiple inner joins (see README.md).
  - Machinery for introducing defaults in empty global aggregates.
  - Machinery for introducing errors for SELECT subqueries with more than one return value.
- EXPLAIN DECORRELATED PLAN returns the result of transformations up to this point.
MIR ⇒ MIR.
- If the query is a view definition, run per-view logical optimizations against the SQL query. The catalog stores the result of transformations up to this point.
- Construct a dataflow for the query:
  - If the query depends on not-materialized views, the definitions of the not-materialized views get inlined.
  - For each materialized view that a query depends on, import all of its materializations. (This corresponds to all indexes on that view, which you can see if you call SHOW INDEXES IN <view>).
- Run optimizations against the dataflow:
  - Per-view logical.
  - Cross-view logical.
    - Propagating source information up: optimize_dataflow_monotonic
    - Pushing optimizations down to sources: LinearOperators
      - CSV decoding
      - Upsert
    - View inlining.
    - Theoretically supports producing more than one index/sink in the same dataflow.
  - Per-view logical (second round).
  - Per-view physical.
- EXPLAIN OPTIMIZED PLAN returns the result of transformations up to this point.
MIR ⇒ LIR.
- Decisions are made regarding rendering.
  - All aggregations are created equal in MIR, but from the rendering perspective, aggregations are evaluated differently according to what data needs to be kept to recalculate the aggregation after receiving a diff. A pictorial version can be found here.
  - Joins are broken down into multiple stages, and filters + projects run between each stage to shrink the intermediate result.
- RelationTypes (column types + unique keys) are discarded since we do no key or type of validation at render time.
- EXPLAIN PHYSICAL PLAN returns the result of transformations up to this point.
LIR ⇒ TDO.

For a one-off query, we run all the transformations until the MIR stage. Then we determine whether we need to serve the query on the "slow path", that is, creating a temporary dataflow and then deleting it. If we don't need to serve the query on the "slow path", then we can skip the MIR ⇒ LIR and the LIR ⇒ TDO steps. Existing "fast paths" include:

Currently, the optimization team is mostly concerned with the HIR ⇒ MIR and MIR ⇒ MIR stages.

Testing

Integration tests

Sqllogictest
- Philip’s RQG tests will be in this format.
  - Add Philip to any PR where query plans may change.
- A PR can be merged if it passes Fast SLT.
- A PR does not need to pass Full SLT tests (test/sqllogictest/sqlite) to be merged.
  - Full SLT tests take 2-3 hours.
  - You can manually initiate full SLT tests on your branch here.
Testdrive
- We generally do not use testdrive except to see linear operators in action.

Unit tests

Datadriven
- Transform unit tests currently allow:
  - testing each transformation independently of the others.
  - Printing out which block of transformations change the plan and how.
- Unit tests in the mz-expr crate currently allow:
  - Testing the simplifying MirScalarExpr, predicates, join equivalences.
  - Testing MapFilterProject.
- There is a DSL to specifying arbitrary MIRs.
- DSL to specify arbitrary enums and structs.

Performance tests

TPCH

Tooling

mzt — can be used to create repositories of plans and write up a markdown that explains something based on those plans (see Alexander’s mzt-repos for example).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query Compilation 101

Prerequisite videos

Current Compiler Pipeline

Testing

Integration tests

Unit tests

Performance tests

Tooling

FilesExpand file tree

101-query-compilation.md

Latest commit

History

101-query-compilation.md

File metadata and controls

Query Compilation 101

Prerequisite videos

Current Compiler Pipeline

Testing

Integration tests

Unit tests

Performance tests

Tooling