11# Architecture
22
3- This document describes how sqllineage is structured and how to navigate
4- the codebase. It is intended for contributors and anyone reading the source.
3+ How sqllineage is structured and how to navigate the codebase.
54
65## Overview
76
@@ -22,82 +21,85 @@ Each statement is processed independently with its own graph and scope tree.
2221
2322## Build phase
2423
25- ` build/statement.rs ` is the entry point. It pattern-matches on the sqlparser
26- ` Statement ` enum to identify the output table and delegate to the appropriate
27- handler. INSERT, CTAS, UPDATE, DELETE, and MERGE each have specific logic for
28- extracting column-level assignments.
24+ ` build/statement.rs ` dispatches on the sqlparser ` Statement ` enum. Six
25+ variants carry lineage (Query, Insert, CreateTable, Update, Delete, Merge);
26+ all others produce ` StatementType::Other `.
27+
28+ ` build/query.rs ` handles CTEs and set operations. CTEs are registered as
29+ scope bindings in the parent scope; the query body runs in a child scope
30+ to prevent CTE names from interfering with FROM bindings. UNION sides each
31+ get their own scope; outputs are merged positionally. Recursive CTE
32+ self-references are detected via ` recursive_cte_name `, and edges from the
33+ recursive step are marked with ` is_recursive_back_edge `. CTE-wrapped DML
34+ (` WITH ... UPDATE `, ` WITH ... DELETE `, etc.) is handled via
35+ ` SetExpr::Update/Insert/Delete/Merge ` delegation.
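
The positional UNION merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in `build/query.rs`: `OutputColumn` and `merge_union_outputs` are hypothetical names, sources are plain strings, and taking the column name from the left side is an assumption based on standard SQL behaviour.

```rust
/// Hypothetical stand-in for an output column and the sources feeding it.
#[derive(Debug, Clone, PartialEq)]
struct OutputColumn {
    name: String,
    sources: Vec<String>,
}

/// Merge the outputs of two UNION sides by position.
/// The column name comes from the left side (assumption: standard SQL naming);
/// sources from both sides are combined without duplicates.
/// `zip` stops at the shorter side, so mismatched arity is silently truncated here.
fn merge_union_outputs(left: &[OutputColumn], right: &[OutputColumn]) -> Vec<OutputColumn> {
    left.iter()
        .zip(right.iter())
        .map(|(l, r)| {
            let mut sources = l.sources.clone();
            for s in &r.sources {
                if !sources.contains(s) {
                    sources.push(s.clone());
                }
            }
            OutputColumn { name: l.name.clone(), sources }
        })
        .collect()
}
```

For `SELECT id FROM a UNION SELECT user_id FROM b`, this sketch would yield one output column named `id` with both `a.id` and `b.user_id` as sources.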
36+
37+ ` build/select.rs ` processes FROM first (registering bindings), then
38+ projection (creating Output nodes). When a FROM item references a CTE,
39+ the existing binding is reused — this requires a ` lookup() ` call during
40+ Phase 1 to avoid adding CTE names to ` tables.inputs ` . Alias-less derived
41+ tables are tracked separately on the scope because they have no name to
42+ use as a binding key.
43+
44+ ` build/expr.rs ` matches all ` Expr ` variants exhaustively — no catch-all.
45+ New variants added by sqlparser will cause a compile error, ensuring
46+ lineage is never silently lost.
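
To illustrate why the absence of a catch-all arm matters, here is a small sketch using a toy enum. `ToyExpr` and `collect_columns` are hypothetical and far smaller than sqlparser's real `Expr`; the point is only that a `match` without `_ =>` stops compiling when a variant is added.

```rust
/// Toy expression type standing in for sqlparser's much larger `Expr` enum.
#[derive(Debug)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    BinaryOp(Box<ToyExpr>, Box<ToyExpr>),
    Case {
        condition: Box<ToyExpr>,
        then_branch: Box<ToyExpr>,
        else_branch: Box<ToyExpr>,
    },
    Function(Vec<ToyExpr>),
}

/// Collect every column referenced anywhere in the expression.
/// There is deliberately no `_ =>` arm: adding a variant to `ToyExpr`
/// makes this function fail to compile until the new case is handled.
fn collect_columns(expr: &ToyExpr, out: &mut Vec<String>) {
    match expr {
        ToyExpr::Column(name) => out.push(name.clone()),
        ToyExpr::Literal(_) => {}
        ToyExpr::BinaryOp(lhs, rhs) => {
            collect_columns(lhs, out);
            collect_columns(rhs, out);
        }
        // This sketch walks the condition as well as both branches;
        // that choice is an assumption of the example.
        ToyExpr::Case { condition, then_branch, else_branch } => {
            collect_columns(condition, out);
            collect_columns(then_branch, out);
            collect_columns(else_branch, out);
        }
        ToyExpr::Function(args) => {
            for arg in args {
                collect_columns(arg, out);
            }
        }
    }
}
```

With a `_ => {}` arm, a newly added variant would compile cleanly and contribute no columns, which is the silent loss the exhaustive match is meant to rule out.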
2947
30- ` build/query.rs ` handles ` Query ` — first processing any CTEs (each gets its
31- own scope with bindings registered in the parent), then the query body. UNION
32- is handled by giving each side its own scope and merging output columns
33- positionally. ` SetExpr::Update ` / ` SetExpr::Insert ` (for CTE-wrapped DML
34- like ` WITH ... UPDATE ` ) delegates back to statement dispatch.
48+ ## Resolve phase
3549
36- ` build/select.rs ` processes a ` SELECT `: FROM first (registering bindings),
37- then projection (creating Output nodes). When a FROM item is a CTE reference,
38- the existing CTE binding is reused instead of creating a new Table binding.
39- Anonymous (alias-less) derived tables are tracked on the scope separately.
50+ ` resolve/mod.rs ` iterates the root scope's output columns. Named columns
51+ resolve via scope lookup: Table bindings produce ` Concrete ` origins,
52+ CTE/DerivedTable bindings follow through ` output_columns ` to the base
53+ table recursively. Star nodes go through the same chain via ` expand_star() ` .
4054
41- ` build/expr.rs ` recursively walks expressions to collect leaf column
42- references. Window function PARTITION BY / ORDER BY clauses are included
43- as ancestors.
55+ Every resolve path handles all three binding types (Table, Cte, DerivedTable)
56+ uniformly.
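
A rough sketch of what uniform handling of the three binding types could look like. The `Binding` shape, the flat `HashMap` scope, and the `resolve` function are all hypothetical simplifications: the real scope tree has a parent chain, and origins carry more than a `(table, column)` pair. The sketch also assumes the bindings do not reference each other cyclically.

```rust
use std::collections::HashMap;

/// Hypothetical binding kinds mirroring the three named in the text.
enum Binding {
    /// A base table: resolution stops here.
    Table { name: String },
    /// A CTE: maps each output column to (source binding, source column).
    Cte { output_columns: HashMap<String, (String, String)> },
    /// A derived table (subquery in FROM) with the same shape as a CTE.
    DerivedTable { output_columns: HashMap<String, (String, String)> },
}

/// Follow `binding.column` until it lands on a base table.
/// Returns `(table, column)`, or `None` if the name cannot be resolved.
fn resolve(
    scope: &HashMap<String, Binding>,
    binding: &str,
    column: &str,
) -> Option<(String, String)> {
    match scope.get(binding)? {
        Binding::Table { name } => Some((name.clone(), column.to_string())),
        // CTEs and derived tables share one arm, so neither can be forgotten.
        Binding::Cte { output_columns } | Binding::DerivedTable { output_columns } => {
            let (next_binding, next_column) = output_columns.get(column)?;
            resolve(scope, next_binding, next_column)
        }
    }
}
```

Because the `match` has no catch-all, adding a fourth binding kind would force this path to be revisited.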
4457
45- ### Graph and scope structures
58+ ` resolve/topo.rs ` validates that the graph is a DAG after removing recursive
59+ CTE back-edges. ` resolve/catalog.rs ` applies the optional ` CatalogProvider `
60+ as a post-processing step.
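
The DAG check can be pictured with a small Kahn's-algorithm sketch. The `Edge` struct and `is_dag` function below are hypothetical; only the `is_recursive_back_edge` flag name is taken from the text, and nodes are reduced to plain indices.

```rust
use std::collections::VecDeque;

/// Hypothetical edge: node indices plus the back-edge flag named in the text.
struct Edge {
    from: usize,
    to: usize,
    is_recursive_back_edge: bool,
}

/// Kahn's algorithm: returns true if the graph is acyclic once
/// recursive back-edges are ignored.
fn is_dag(node_count: usize, edges: &[Edge]) -> bool {
    let mut in_degree = vec![0usize; node_count];
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); node_count];
    for e in edges.iter().filter(|e| !e.is_recursive_back_edge) {
        in_degree[e.to] += 1;
        successors[e.from].push(e.to);
    }

    // Start from every node that has no incoming edges.
    let mut queue: VecDeque<usize> = (0..node_count).filter(|&n| in_degree[n] == 0).collect();
    let mut visited = 0;
    while let Some(n) = queue.pop_front() {
        visited += 1;
        for &next in &successors[n] {
            in_degree[next] -= 1;
            if in_degree[next] == 0 {
                queue.push_back(next);
            }
        }
    }
    // Nodes left unvisited are stuck on a cycle.
    visited == node_count
}
```

A two-node cycle passes this check only when one of its edges is flagged as a recursive back-edge.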
4661
47- ` graph/node.rs ` defines four node types: Output (projection result), Ref
48- (qualified column like ` t.col ` ), Unqualified (bare column name), and Star
49- (` SELECT * ` ).
62+ ## False positives
5063
51- ` graph/scope.rs ` models SQL name resolution. Each scope holds a map of
52- bindings (Table, Cte, or DerivedTable) and a list of output columns. Lookup
53- walks the parent chain with nearest-scope-wins shadowing. CTE body scopes
54- are separated from CTE registration scopes to prevent false ambiguity.
64+ Two cases produce sources that may not contribute at runtime:
5565
56- ## Resolve phase
66+ - **Conditional branches.** ` CASE WHEN flag THEN a ELSE b END ` includes
67+ both ` a ` and ` b `. If ` flag ` is always true, ` b ` is a false positive.
68+ - **Opaque functions.** ` my_udf(a, b) ` includes both arguments even if
69+ the function internally uses only ` a `.
5770
58- ` resolve/mod.rs ` iterates the root scope's output columns and resolves each
59- one. Named columns resolve through scope lookup: Table bindings produce
60- ` Concrete ` origins, CTE/DerivedTable bindings follow through ` output_columns `
61- to the base table recursively. Star nodes go through the same chain via
62- ` expand_star() ` .
71+ These are inherent to static analysis. Any other false positive is a bug.
6372
64- Every resolve path must handle all three binding types (Table, Cte,
65- DerivedTable) uniformly. This is the most important invariant in the
66- codebase — past bugs came from paths that only handled ` Binding::Table ` .
73+ ## Limitations
6774
68- ` resolve/topo.rs ` validates that the graph is a DAG after removing recursive
69- CTE back-edges. ` resolve/catalog.rs ` applies the optional ` CatalogProvider `
70- as a post-processing step.
75+ - ` SELECT * ` and unqualified columns in multi-table scopes require a
76+ ` CatalogProvider ` for full resolution.
77+ - ` PIVOT `/` UNPIVOT ` are not yet handled: their output columns are generated
78+ dynamically from literal values, which requires semantic interpretation
79+ beyond AST traversal.
7180
7281## Upgrading sqlparser
7382
74- When sqlparser adds new ` Statement ` or ` Expr ` variants, the build layer
75- may need updating:
76-
77- - New ` Statement ` variant → add a match arm in ` statement.rs ` (or let
78- it fall through to ` StatementType::Other ` if it carries no lineage).
79- - New ` Expr ` variant → add a match arm in ` expr.rs ` (the catch-all
80- emits ` WarningKind::UnhandledExpression ` ).
81- - Renamed/restructured fields → follow compiler errors.
82-
83- If sqlparser already parses a construct, sqllineage generally handles it
84- automatically through the existing AST traversal. Changes to sqllineage
85- itself should only be needed when sqlparser's representation changes.
83+ All sqlparser enums (` Statement ` , ` Expr ` , ` SetExpr ` , ` TableFactor ` ,
84+ ` SelectItem ` ) are matched exhaustively — no catch-all arms. New variants
85+ will cause a compile error. For each new variant, determine whether it
86+ carries column references (add traversal) or is lineage-neutral (return
87+ empty / ` StatementType::Other `).
8688
8789## Module map
8890
8991| Path | Responsibility |
9092| ------| ---------------|
9193| ` lib.rs ` | ` pub fn analyze() ` — parses SQL, runs build+resolve per statement |
92- | ` types.rs ` | All public types: ` AnalyzeResult ` , ` TableRef ` , ` ColumnMapping ` , etc. |
93- | ` dialect.rs ` | Maps ` Dialect ` enum to sqlparser dialect implementations |
94- | ` build/statement.rs ` | Statement-level dispatch and column mapping for DML |
95- | ` build/query.rs ` | CTE registration, UNION merge, CTE-wrapped DML |
96- | ` build/select.rs ` | FROM binding registration, projection processing |
97- | ` build/expr.rs ` | Recursive expression traversal for ancestor collection |
98- | ` graph/node.rs ` | ` RawNode ` enum (Output, Ref, Unqualified, Star) |
99- | ` graph/edge.rs ` | ` RawEdge ` with ` EdgeKind ` (Direct, ViaExpression, etc.) |
100- | ` graph/scope.rs ` | ` ScopeTree ` with bindings, output columns, parent chain |
101- | ` resolve/mod.rs ` | Scope-chain resolution, Star expansion, ` ColumnMapping ` assembly |
94+ | ` types.rs ` | Public types: ` AnalyzeResult ` , ` TableRef ` , ` ColumnMapping ` , etc. |
95+ | ` dialect.rs ` | ` Dialect ` enum → sqlparser dialect mapping |
96+ | ` build/statement.rs ` | Statement dispatch, UPDATE SET / MERGE WHEN column mapping |
97+ | ` build/query.rs ` | CTE scope management, UNION merge, CTE-wrapped DML |
98+ | ` build/select.rs ` | FROM binding registration, projection, CTE reference detection |
99+ | ` build/expr.rs ` | Expression traversal for ancestor collection |
100+ | ` graph/node.rs ` | ` RawNode ` enum: Output, Ref, Unqualified, Star |
101+ | ` graph/edge.rs ` | ` RawEdge ` with ` EdgeKind ` and ` is_recursive_back_edge ` flag |
102+ | ` graph/scope.rs ` | ` ScopeTree ` : bindings, output columns, anonymous derived tables |
103+ | ` resolve/mod.rs ` | Scope-chain resolution, ` expand_star() ` , ` ColumnMapping ` assembly |
102104| ` resolve/topo.rs ` | Kahn's algorithm for DAG validation |
103105| ` resolve/catalog.rs ` | ` CatalogProvider ` application (Wildcard/Ambiguous refinement) |