- Materialized performance 101 (skip forward to 5:13)
- Materialized internals 101.
- Introduction to one-off queries.
- Materialize Decorrelation explained in Jamie Brandon’s Blog.
%%{init: {"flowchart": {"defaultRenderer": "elk"} } }%%
flowchart LR
SQL@{ shape: doc, label="SQL" } --> IR
IR --> dataflow@{ shape: docs }
subgraph IR["intermediate languages"]
direction LR
HIR@{ shape: doc } --> MIR@{ shape: docs } --> LIR@{ shape: docs }
MIR -. optimizations .-> MIR
end
classDef purple fill:#472F85
class SQL,HIR,MIR,LIR,dataflow purple
Representations:
SQL— source languageAST— a parsed version of a SQL query.HIR— high-level intermediate representation.MIR— mid-level intermediate representation.LIR— low-level intermediate representation.TDO— target language (timely & differential operators).
Transformations in the compile-time lifecycle of a dataflow.
SQL ⇒ AST.- Parsing the SQL query.
AST ⇒ AST- Resolving names against the catalog.
CatalogItemTypelists the kinds of objects that can be resolved against the catalog.
- Resolving names against the catalog.
AST ⇒ HIR.- Resolving column references, column aliases, and table aliases
- If the SQL query is a one-off, the outermost
TopKis converted to a RowSetFinishing at this point. EXPLAIN RAW PLANreturns the result of transformations up to this point.
HIR ⇒ HIRHIR ⇒ MIR.- Decorrelation:
- Correlated queries are rewritten as graphs with join and distinct.
- Lowering — express SQL-specific concepts as dataflow sub-graphs:
- Outer joins are decomposed into multiple inner joins (see README.md).
- Machinery for introducing defaults in empty global aggregates.
- Machinery for introducing errors for
SELECTsubqueries with more than one return value.
EXPLAIN DECORRELATED PLANreturns the result of transformations up to this point.
- Decorrelation:
MIR ⇒ MIR.- If the query is a view definition, run per-view logical optimizations against the SQL query. The catalog stores the result of transformations up to this point.
- Construct a dataflow for the query:
- If the query depends on not-materialized views, the definitions of the not-materialized views get inlined.
- For each materialized view that a query depends on, import all of its
materializations. (This corresponds to all indexes on that view, which
you can see if you call
SHOW INDEXES IN <view>).
- Run optimizations against the dataflow:
- Per-view logical.
- Cross-view logical.
- Propagating source information up: optimize_dataflow_monotonic
- Pushing optimizations down to sources:
LinearOperators - View inlining.
- Theoretically supports producing more than one index/sink in the same dataflow.
- Per-view logical (second round).
- Per-view physical.
EXPLAIN OPTIMIZED PLANreturns the result of transformations up to this point.
MIR ⇒ LIR.- Decisions are made regarding rendering.
- All aggregations are created equal in MIR, but from the rendering perspective, aggregations are evaluated differently according to what data needs to be kept to recalculate the aggregation after receiving a diff. A pictorial version can be found here.
- Joins are broken down into multiple stages, and filters + projects run between each stage to shrink the intermediate result.
- RelationTypes (column types + unique keys) are discarded since we do no key or type of validation at render time.
EXPLAIN PHYSICAL PLANreturns the result of transformations up to this point.
- Decisions are made regarding rendering.
LIR ⇒ TDO.
For a one-off query, we run all the transformations until the MIR stage. Then we
determine whether we need to serve the query on the "slow path", that is,
creating a temporary dataflow and then deleting it. If we don't need to serve
the query on the "slow path", then we can skip the MIR ⇒ LIR and the LIR ⇒ TDO steps.
Existing "fast paths" include:
Currently, the optimization team is mostly concerned with the HIR ⇒ MIR and MIR ⇒ MIR stages.
- Sqllogictest
- Philip’s RQG tests will be in this format.
- Add Philip to any PR where query plans may change.
- A PR can be merged if it passes Fast SLT.
- A PR does not need to pass Full SLT tests (
test/sqllogictest/sqlite) to be merged.- Full SLT tests take 2-3 hours.
- You can manually initiate full SLT tests on your branch here.
- Philip’s RQG tests will be in this format.
- Testdrive
- We generally do not use testdrive except to see linear operators in action.
- Datadriven
- Transform unit tests currently allow:
- testing each transformation independently of the others.
- Printing out which block of transformations change the plan and how.
- Unit tests in the mz-expr crate currently allow:
- Testing the simplifying MirScalarExpr, predicates, join equivalences.
- Testing MapFilterProject.
- There is a DSL to specifying arbitrary MIRs.
- DSL to specify arbitrary enums and structs.
- Transform unit tests currently allow: