Commit 63a3775

feat: exhaustive Expr and SetExpr coverage
Handle all 63 Expr variants and all 9 SetExpr variants explicitly. Remove catch-all arms so new sqlparser variants cause compile errors instead of silent lineage loss. Remove unused WarningKind variants (UnhandledExpression, UnhandledStatement, AmbiguousColumn) and the warn() helper.
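
The compile-error behaviour described above follows from Rust's exhaustiveness checking. A minimal sketch, assuming a toy `ToyExpr` enum and a hypothetical `collect_columns` helper (neither is sqlparser's real `Expr` nor the function in `build/expr.rs`):

```rust
// Toy stand-in for a parser AST enum; NOT sqlparser's real `Expr`.
#[derive(Debug)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    BinaryOp(Box<ToyExpr>, Box<ToyExpr>),
    Case {
        condition: Box<ToyExpr>,
        then_branch: Box<ToyExpr>,
        else_branch: Box<ToyExpr>,
    },
}

/// Hypothetical helper: appends every column name referenced by `expr` to `out`.
fn collect_columns(expr: &ToyExpr, out: &mut Vec<String>) {
    // There is deliberately no `_ =>` arm. If a new variant is added to
    // `ToyExpr`, this match stops being exhaustive and rustc rejects it
    // (error E0004) instead of silently skipping the new variant.
    match expr {
        ToyExpr::Column(name) => out.push(name.clone()),
        // Lineage-neutral: a literal references no columns.
        ToyExpr::Literal(_) => {}
        ToyExpr::BinaryOp(lhs, rhs) => {
            collect_columns(lhs, out);
            collect_columns(rhs, out);
        }
        // Every branch is visited, whether or not it is taken at runtime.
        ToyExpr::Case { condition, then_branch, else_branch } => {
            collect_columns(condition, out);
            collect_columns(then_branch, out);
            collect_columns(else_branch, out);
        }
    }
}
```

With a wildcard arm, a newly added variant would compile cleanly and contribute no columns, which is the silent lineage loss this commit is meant to rule out.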
1 parent 7059163 commit 63a3775

13 files changed

Lines changed: 486 additions & 125 deletions

ARCHITECTURE.md

Lines changed: 62 additions & 60 deletions
@@ -1,7 +1,6 @@
 # Architecture

-This document describes how sqllineage is structured and how to navigate
-the codebase. It is intended for contributors and anyone reading the source.
+How sqllineage is structured and how to navigate the codebase.

 ## Overview

@@ -22,82 +21,85 @@ Each statement is processed independently with its own graph and scope tree.

 ## Build phase

-`build/statement.rs` is the entry point. It pattern-matches on the sqlparser
-`Statement` enum to identify the output table and delegate to the appropriate
-handler. INSERT, CTAS, UPDATE, DELETE, and MERGE each have specific logic for
-extracting column-level assignments.
+`build/statement.rs` dispatches on the sqlparser `Statement` enum. Six
+variants carry lineage (Query, Insert, CreateTable, Update, Delete, Merge);
+all others produce `StatementType::Other`.
+
+`build/query.rs` handles CTEs and set operations. CTEs are registered as
+scope bindings in the parent scope; the query body runs in a child scope
+to prevent CTE names from interfering with FROM bindings. UNION sides each
+get their own scope; outputs are merged positionally. Recursive CTE
+self-references are detected via `recursive_cte_name`, and edges from the
+recursive step are marked with `is_recursive_back_edge`. CTE-wrapped DML
+(`WITH ... UPDATE`, `WITH ... DELETE`, etc.) is handled via
+`SetExpr::Update/Insert/Delete/Merge` delegation.
+
+`build/select.rs` processes FROM first (registering bindings), then
+projection (creating Output nodes). When a FROM item references a CTE,
+the existing binding is reused — this requires a `lookup()` call during
+Phase 1 to avoid adding CTE names to `tables.inputs`. Alias-less derived
+tables are tracked separately on the scope because they have no name to
+use as a binding key.
+
+`build/expr.rs` matches all `Expr` variants exhaustively — no catch-all.
+New variants added by sqlparser will cause a compile error, ensuring
+lineage is never silently lost.

-`build/query.rs` handles `Query` — first processing any CTEs (each gets its
-own scope with bindings registered in the parent), then the query body. UNION
-is handled by giving each side its own scope and merging output columns
-positionally. `SetExpr::Update` / `SetExpr::Insert` (for CTE-wrapped DML
-like `WITH ... UPDATE`) delegates back to statement dispatch.
+## Resolve phase

-`build/select.rs` processes a `SELECT`: FROM first (registering bindings),
-then projection (creating Output nodes). When a FROM item is a CTE reference,
-the existing CTE binding is reused instead of creating a new Table binding.
-Anonymous (alias-less) derived tables are tracked on the scope separately.
+`resolve/mod.rs` iterates the root scope's output columns. Named columns
+resolve via scope lookup: Table bindings produce `Concrete` origins,
+CTE/DerivedTable bindings follow through `output_columns` to the base
+table recursively. Star nodes go through the same chain via `expand_star()`.

-`build/expr.rs` recursively walks expressions to collect leaf column
-references. Window function PARTITION BY / ORDER BY clauses are included
-as ancestors.
+Every resolve path handles all three binding types (Table, Cte, DerivedTable)
+uniformly.

-### Graph and scope structures
+`resolve/topo.rs` validates that the graph is a DAG after removing recursive
+CTE back-edges. `resolve/catalog.rs` applies the optional `CatalogProvider`
+as a post-processing step.

-`graph/node.rs` defines four node types: Output (projection result), Ref
-(qualified column like `t.col`), Unqualified (bare column name), and Star
-(`SELECT *`).
+## False positives

-`graph/scope.rs` models SQL name resolution. Each scope holds a map of
-bindings (Table, Cte, or DerivedTable) and a list of output columns. Lookup
-walks the parent chain with nearest-scope-wins shadowing. CTE body scopes
-are separated from CTE registration scopes to prevent false ambiguity.
+Two cases produce sources that may not contribute at runtime:

-## Resolve phase
+- **Conditional branches.** `CASE WHEN flag THEN a ELSE b END` includes
+both `a` and `b`. If `flag` is always true, `b` is a false positive.
+- **Opaque functions.** `my_udf(a, b)` includes both arguments even if
+the function internally uses only `a`.

-`resolve/mod.rs` iterates the root scope's output columns and resolves each
-one. Named columns resolve through scope lookup: Table bindings produce
-`Concrete` origins, CTE/DerivedTable bindings follow through `output_columns`
-to the base table recursively. Star nodes go through the same chain via
-`expand_star()`.
+These are inherent to static analysis. Any other false positive is a bug.

-Every resolve path must handle all three binding types (Table, Cte,
-DerivedTable) uniformly. This is the most important invariant in the
-codebase — past bugs came from paths that only handled `Binding::Table`.
+## Limitations

-`resolve/topo.rs` validates that the graph is a DAG after removing recursive
-CTE back-edges. `resolve/catalog.rs` applies the optional `CatalogProvider`
-as a post-processing step.
+- `SELECT *` and unqualified columns in multi-table scopes require a
+`CatalogProvider` for full resolution.
+- `PIVOT`/`UNPIVOT` are not yet handled — output columns are generated
+dynamically from literal values, requiring semantic interpretation
+beyond AST traversal.

 ## Upgrading sqlparser

-When sqlparser adds new `Statement` or `Expr` variants, the build layer
-may need updating:
-
-- New `Statement` variant → add a match arm in `statement.rs` (or let
-it fall through to `StatementType::Other` if it carries no lineage).
-- New `Expr` variant → add a match arm in `expr.rs` (the catch-all
-emits `WarningKind::UnhandledExpression`).
-- Renamed/restructured fields → follow compiler errors.
-
-If sqlparser already parses a construct, sqllineage generally handles it
-automatically through the existing AST traversal. Changes to sqllineage
-itself should only be needed when sqlparser's representation changes.
+All sqlparser enums (`Statement`, `Expr`, `SetExpr`, `TableFactor`,
+`SelectItem`) are matched exhaustively — no catch-all arms. New variants
+will cause a compile error. For each new variant, determine whether it
+carries column references (add traversal) or is lineage-neutral (return
+empty / `StatementType::Other`).

 ## Module map

 | Path | Responsibility |
 |------|---------------|
 | `lib.rs` | `pub fn analyze()` — parses SQL, runs build+resolve per statement |
-| `types.rs` | All public types: `AnalyzeResult`, `TableRef`, `ColumnMapping`, etc. |
-| `dialect.rs` | Maps `Dialect` enum to sqlparser dialect implementations |
-| `build/statement.rs` | Statement-level dispatch and column mapping for DML |
-| `build/query.rs` | CTE registration, UNION merge, CTE-wrapped DML |
-| `build/select.rs` | FROM binding registration, projection processing |
-| `build/expr.rs` | Recursive expression traversal for ancestor collection |
-| `graph/node.rs` | `RawNode` enum (Output, Ref, Unqualified, Star) |
-| `graph/edge.rs` | `RawEdge` with `EdgeKind` (Direct, ViaExpression, etc.) |
-| `graph/scope.rs` | `ScopeTree` with bindings, output columns, parent chain |
-| `resolve/mod.rs` | Scope-chain resolution, Star expansion, `ColumnMapping` assembly |
+| `types.rs` | Public types: `AnalyzeResult`, `TableRef`, `ColumnMapping`, etc. |
+| `dialect.rs` | `Dialect` enum sqlparser dialect mapping |
+| `build/statement.rs` | Statement dispatch, UPDATE SET / MERGE WHEN column mapping |
+| `build/query.rs` | CTE scope management, UNION merge, CTE-wrapped DML |
+| `build/select.rs` | FROM binding registration, projection, CTE reference detection |
+| `build/expr.rs` | Expression traversal for ancestor collection |
+| `graph/node.rs` | `RawNode` enum: Output, Ref, Unqualified, Star |
+| `graph/edge.rs` | `RawEdge` with `EdgeKind` and `is_recursive_back_edge` flag |
+| `graph/scope.rs` | `ScopeTree`: bindings, output columns, anonymous derived tables |
+| `resolve/mod.rs` | Scope-chain resolution, `expand_star()`, `ColumnMapping` assembly |
 | `resolve/topo.rs` | Kahn's algorithm for DAG validation |
 | `resolve/catalog.rs` | `CatalogProvider` application (Wildcard/Ambiguous refinement) |
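
The DAG validation that ARCHITECTURE.md attributes to `resolve/topo.rs` (Kahn's algorithm, run after dropping recursive CTE back-edges) could look roughly like the sketch below. Only the `is_recursive_back_edge` name comes from the diff; the `Edge` struct, the index-based node representation, and the function name are assumptions for illustration, not the crate's actual types.

```rust
use std::collections::VecDeque;

/// Hypothetical edge type: nodes are plain indices here, which is an
/// assumption made for brevity and not how `RawEdge` is necessarily defined.
#[derive(Clone, Copy)]
struct Edge {
    from: usize,
    to: usize,
    is_recursive_back_edge: bool,
}

/// Returns true if the graph is acyclic once flagged back-edges are ignored.
/// Assumes every `from`/`to` index is less than `node_count`.
fn is_dag_ignoring_back_edges(node_count: usize, edges: &[Edge]) -> bool {
    let mut in_degree = vec![0usize; node_count];
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); node_count];

    // Build adjacency lists from the edges that are not recursive back-edges.
    for e in edges.iter().filter(|e| !e.is_recursive_back_edge) {
        successors[e.from].push(e.to);
        in_degree[e.to] += 1;
    }

    // Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    let mut queue: VecDeque<usize> =
        (0..node_count).filter(|&n| in_degree[n] == 0).collect();
    let mut visited = 0usize;
    while let Some(n) = queue.pop_front() {
        visited += 1;
        for &m in &successors[n] {
            in_degree[m] -= 1;
            if in_degree[m] == 0 {
                queue.push_back(m);
            }
        }
    }

    // Any node left unvisited sits on a cycle.
    visited == node_count
}
```

In this sketch a recursive CTE's self-reference passes validation only when its edge carries the flag; the same cycle without the flag is reported as a non-DAG.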

sqllineage-python/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ build-backend = "maturin"

 [project]
 name = "sqllineage-rs"
-version = "0.1.3"
+version = "0.1.4"
 description = "Extract table and column-level lineage from SQL"
 license = "MIT OR Apache-2.0"
 requires-python = ">=3.9"

sqllineage/Cargo.toml

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 [package]
 name = "sqllineage"
-version = "0.1.3"
+version = "0.1.4"
 edition = "2024"
 description = "Extract table and column-level lineage from SQL"
 license = "MIT OR Apache-2.0"
