Skip to content

[ADR] dbt-aware rule pac #54

@Pawansingh3889

Description

@Pawansingh3889

What
A new opt-in rule pack inside sql-sop, activated by --dbt, that catches patterns specific to dbt projects: missing tests, raw table refs instead of ref(), incrementals without a unique_key, DDL hiding in pre_hook / post_hook, and so on. Silent without the flag; no change for users who don't run dbt.

Why now
dbt is where most analytics SQL lives in 2026. sql-sop already runs on .sql files, but a dbt model is .sql + Jinja wrapped in a project layout the linter cannot read. Several things that are obvious if you've ever debugged a dbt project are invisible to plain sql-sop:

a mart model with zero entries in schema.yml
raw FROM raw_db.orders instead of FROM {{ ref('stg_orders') }}
materialized='incremental' and no unique_key (silent duplicate inserts)
pre_hook='DROP TABLE staging_x' — somebody's day is going to be bad
SELECT * in a mart (W003 already covers this, but doesn't know "mart" is special)
Adding dbt awareness lets sql-sop sit cleanly in dbt CI and catch the patterns plain SQL rules cannot see.

Architecture
Same shape as the Contracts Pack. Opt-in via --dbt, single PyPI distribution, single rule registry, single CHANGELOG. The pack is silent unless the flag is supplied.

I considered shipping dbt-sop as a separate PyPI package. Rejected for v1 — adds a release surface, a parallel CHANGELOG, and an answer to "which one do I install" before the rule set is even stable. If the pack grows past ~15 rules and the dbt audience separates from the SQL-Server one, splitting it later is straightforward.

Plugin system was the other option. GOVERNANCE.md still rules that out: "Plugin system for third-party rules (out of scope until the core rule API is stable)."

Jinja
dbt files aren't SQL. They're Jinja templates that produce SQL after dbt compile. Some patterns the linter wants to see ({{ ref('x') }}, {% if var('flag') %}) need to survive into the rules layer; some ({% set x = ... %}, {% materialization ... %}) want to be stripped before regex matching.

Plan: a Jinja preprocessor pass that runs on dbt-marked files. Resolves the structural macros (ref, source, config) into placeholders the existing rules can see through, and strips the rest. Uses jinja2 directly. Already a dbt transitive dep, so no install burden for dbt users; gated behind a [dbt] extra so non-dbt users don't pay for the package.

Initial rule set
These are the ones I'd ship in the first cut. Severities reflect my read of "what would I actually want to block CI on" — open to argument on any of them.

DBT001 model-without-test — warning. Mart model has no tests: entry anywhere in schema.yml.
DBT002 direct-table-ref — warning. Raw table name instead of {{ ref(...) }} or {{ source(...) }}. information_schema and dbt internal refs allowlisted.
DBT003 incremental-without-unique-key — error. materialized='incremental' with no unique_key. Duplicate inserts are silent.
DBT004 hook-with-ddl — error. pre_hook / post_hook contains DROP / TRUNCATE / ALTER. If the model is actually supposed to do DDL, run it elsewhere.
DBT005 select-star-in-mart — warning. SELECT * in a file under the configured mart model-paths. Refines W003 with dbt context.
DBT006 model-without-description — warning. Model listed in schema.yml with no description:. Catches drift after a refactor.
DBT007 var-injection-risk — error. {{ var('x') }} interpolated directly into SQL without quote context.
Seven rules is a reasonable v1. Adding more is cheaper than carrying noise.

Compatibility
New ID prefix DBT. Doesn't touch any existing ID.
Default: silent unless --dbt is supplied. Zero change for current users.
Where rules overlap (DBT005 vs W003): both fire. DBT005 is more specific and its message is dbt-aware.
No new mandatory dependency. jinja2 lives behind the [dbt] extra alongside [python] and [snapshot].
What I'd build
sql_guard/dbt.py — project discovery (find dbt_project.yml, parse model-paths, walk schema.yml).
sql_guard/jinja_preprocess.py — strip / resolve Jinja before regex rules see the file.
sql_guard/rules/dbt.py — DBT001..DBT007 classes.
--dbt flag in the CLI activates discovery and the pack.
Tests per rule (fires / doesn't fire). Plus an integration test against tests/fixtures/dbt_project/, a minimal fake dbt project.
Documentation: rule table section in README, Key Numbers bump, CHANGELOG.
Realistic effort: two to four weeks. The Jinja preprocessor alone is most of week one (raw blocks, nested macros, custom user macros that aren't ref / source / config).

Open questions
jinja2 as required dep or [dbt] extra? Lean: extra. Non-dbt users shouldn't carry it.
schema.yml parsing across dbt versions. dbt 1.5 vs 1.7 changed test syntax. Lean: support both, warn-but-don't-fail when ambiguous.
DBT002 allowlist. information_schema, system catalogs, and any table prefixed with the dbt staging. schema should be safe by default. Configurable.
DBT005 path source. Hardcoded marts/ glob, or read model-paths from dbt_project.yml. Lean: read the config — projects rename marts to gold, core, etc.
Naming the IDs. DBT* reads cleanly and matches the dbt brand. Alternative was M* (model-rules) but it collides with anything else that might later want the prefix. Lean: DBT*.
What I rejected
A separate dbt-sop package. v1 doesn't need the extra release surface.
Plugin API for third-party dbt rules. Out per GOVERNANCE.md until core rule API is stable.
Compile-then-lint (run dbt compile and check the target/ output). You'd lose pre-commit feedback on the source file, which is the whole point of running a linter.
Extending dbt-checkpoint. Considered seriously. dbt-checkpoint is pre-commit-only, has no stable rule-ID contract, and doesn't produce SARIF. sql-sop has all three plus a governance doc. The path of least surprise is to build it here.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions