Skip to content

Hoist DECLAREs via sqlglot tokenizer instead of regex#2053

Open
batikankarakan wants to merge 6 commits into
mainfrom
batikankarakan/sqlglot-hoist-declares
Open

Hoist DECLAREs via sqlglot tokenizer instead of regex#2053
batikankarakan wants to merge 6 commits into
mainfrom
batikankarakan/sqlglot-hoist-declares

Conversation

@batikankarakan
Copy link
Copy Markdown
Contributor

Summary

  • Replaces the regex-based DECLARE hoister in pkg/pipeline/hooks.go with a sqlglot-driven implementation: tokenize with the dialect-aware tokenizer, track paren + BEGIN..END depth to find top-level ; positions, slice the original source by those positions, and classify each slice with parse_one(). sqlglot never regenerates user SQL — statement text is sliced verbatim, so casing, comments, escapes and dialect-specific syntax (e.g. array<STRING>) survive byte-for-byte.
  • WrapHooks / wrapHookQueriesList now take a DeclareHoister interface implemented by SQLParser, plumbed through the HookWrapperMaterializer* structs at every call site in cmd/run.go and cmd/render.go. The parser is initialized unconditionally in run.go so the hoister is always available.
  • DECLAREs nested inside BEGIN..END or stored-procedure bodies are correctly left in place; ; inside string literals no longer splits; on parse failure the input passes through unchanged.

Test plan

  • make format clean
  • make test clean (incl. pkg/pipeline + pkg/sqlparser)
  • New TestSqlParser_HoistDeclares subtests verify byte-identical preservation, BEGIN..END exemption, string-literal safety, and array<STRING> casing survival
  • New pythonsrc/parser/main_test.py cases mirror the same invariants on the Python side

🤖 Generated with Claude Code

Replaces the brittle regex-based DECLARE hoister with a sqlglot-driven
approach. We tokenize the SQL with sqlglot's dialect-aware tokenizer,
track paren and BEGIN..END depth to find top-level ';' positions, then
slice the original source by those positions and classify each slice
with parse_one(). DECLAREs nested inside stored procedures or BEGIN..END
blocks are no longer top-level and stay put. Statement text is sliced
from the original source — sqlglot never regenerates user SQL, so
casing, comments, escapes and dialect-specific syntax (e.g.
'array<STRING>') survive byte-for-byte. WrapHooks now takes a
DeclareHoister interface implemented by SQLParser, plumbed through the
HookWrapperMaterializer* structs at every call site in cmd/run.go and
cmd/render.go.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
pythonsrc/parser/main.py:558-562
**`CASE...END` leaks through `begin_end_depth` tracking**

`TokenType.END` is emitted for *both* `CASE WHEN … END` expressions and `BEGIN … END` blocks — the tokenizer has no context to distinguish them. Inside a `BEGIN … END` body, a `CASE` expression's `END` will decrement `begin_end_depth` to 0 prematurely, causing the semicolons that follow (including any `DECLARE y INT64;`) to be treated as top-level. Those DECLAREs are then incorrectly hoisted out of the block.

Concrete example: `"SET x=1;\nBEGIN\n  SELECT CASE WHEN x>0 THEN 'a' ELSE 'b' END;\n  DECLARE y INT64;\nEND;"` — the `DECLARE y INT64` would be extracted and placed at the script's top, breaking stored-procedure semantics.

The same applies to `END IF`, `END WHILE`, `END LOOP` in dialects that emit individual `END` tokens for each construct. A `case_depth` counter (incremented on `TokenType.CASE`, decremented on `TokenType.END` before `begin_end_depth`) is needed to disambiguate.

### Issue 2 of 3
pythonsrc/parser/main.py:558-559
**`BEGIN TRANSACTION` inflates `begin_end_depth` without a matching `END`**

`BEGIN TRANSACTION;` produces `TokenType.BEGIN`, incrementing `begin_end_depth` to 1. `COMMIT TRANSACTION;` and `ROLLBACK TRANSACTION;` produce no `TokenType.END`, so depth never returns to 0. Every semicolon after `BEGIN TRANSACTION;` is treated as nested and excluded from `positions`.

In the common case where all DECLAREs precede the transaction block this is accidentally correct (the whole transaction body becomes one non-DECLARE slice). However, any script where `BEGIN TRANSACTION` appears before a top-level DECLARE that still needs hoisting — or where the rejoined output must produce well-formed statements — will silently get incorrect slicing. The PR test suite does not cover this pattern; the old `BEGIN TRANSACTION` test case in `TestWrapHooks_HoistsDeclares` was removed and not ported to the Python test suite.

### Issue 3 of 3
cmd/render.go:256-267
**Parser started but `Start()` called again on each invocation**

If `sqlparser.NewSQLParser(false)` succeeds but `sp.Start()` fails, `defer sp.Close()` is still registered and will run when the action function exits — which is correct. However, `sp.Start()` is called again inside `HoistDeclares` (and `HoistDeclaresList`) for every invocation, because those methods call `s.Start()` unconditionally. If `Start()` is not fully idempotent this double-start could cause unexpected behaviour. The pattern in `run.go` passes the parser only after a successful start (via a goroutine), so the two initialization styles are inconsistent. Worth confirming `Start()` is a no-op when already running.

Reviews (1): Last reviewed commit: "Hoist DECLARE statements via sqlglot tok..." | Re-trigger Greptile

Comment thread pythonsrc/parser/main.py Outdated
Comment thread pythonsrc/parser/main.py Outdated
Comment thread cmd/render.go Outdated
batikankarakan and others added 2 commits May 14, 2026 15:31
Replaces the Python sqlglot IPC-based hoister with an in-process
implementation built on the existing polyglot-sql Rust crate. A single
CGo call replaces a JSON-over-stdin/stdout round trip per hook wrap,
removing the Python subprocess from the materialization hot path.

Adds `hoist_declares` / `hoist_declares_list` to compat.rs and lib.rs,
CGo bindings on darwin/linux, and `HoistDeclares` / `HoistDeclaresList`
methods on `RustSQLParser`. `cmd/run.go` reverts the Python parser
initialization to its original schema-prefix gating and wires a separate
RustSQLParser instance as the `DeclareHoister`. `cmd/render.go` swaps
its hoister to the Rust parser too.

The slice-and-classify algorithm matches the previous Python version
(top-level ';' positions found via dialect-aware tokenizer, original
text preserved byte-for-byte) and now also tracks CASE depth so
CASE..END inside a BEGIN body doesn't prematurely close the block.
Python hoist code and tests removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CI rust artifact cache key only hashed Cargo.lock and Cargo.toml,
so edits under pkg/sqlparser/rustffi/src/ produced a stale cache hit:
the .a was restored and touched without being rebuilt, then linking
failed because new FFI symbols defined in src/lib.rs and src/compat.rs
weren't present in the cached library. Include src/**/*.rs in the
cache key so any Rust source change forces a fresh build.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
pkg/sqlparser/rustffi/src/compat.rs:1059-1074
**`BEGIN TRANSACTION` inflates `begin_end_depth` without a matching decrement**

`BEGIN TRANSACTION;` emits `TokenType::Begin`, incrementing `begin_end_depth` to 1, but `COMMIT TRANSACTION;` and `ROLLBACK TRANSACTION;` emit no `TokenType::End`, so the depth never returns to 0. Every semicolon that follows is then treated as nested and excluded from `positions`. In the common case (DECLAREs precede the transaction block) this is accidentally correct, but any script where a top-level statement or DECLARE appears after a `COMMIT TRANSACTION;` without a matching `END` will silently have those statements merged into the transaction slice, making them invisible to the hoisting classifier.

Reviews (2): Last reviewed commit: "ci: invalidate rust cache on rustffi/src..." | Re-trigger Greptile

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Want your agent to iterate on Greptile's feedback? Try greploops.

batikankarakan and others added 3 commits May 14, 2026 16:28
`BEGIN TRANSACTION` emits a Begin token but the matching `COMMIT
TRANSACTION` / `ROLLBACK TRANSACTION` produce no End token, so naively
counting Begin/End leaves the depth stuck above zero and silently
classifies every following semicolon as nested. After a top-level
DECLARE following BEGIN TRANSACTION would never be hoisted.

Peek one token past Begin: if the next token is Transaction, treat it
as a transaction-control statement and skip the depth increment.

Adds regression tests for both shapes — already-ordered DECLAREs
preceding a BEGIN TRANSACTION block (must be byte-identical no-op) and
a DECLARE inside the transaction body (must be hoisted past
BEGIN TRANSACTION to the script's top).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make explicit that Start is a no-op verification against the linked
FFI library rather than a startup step that holds state. Methods such
as HoistDeclares don't call it internally, so the double-start hazard
that applied to the Python SQLParser is not present here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Most hook-wrapped scripts contain no DECLAREs — the typical shape is
pre-hook SET / asset SELECT / post-hook INSERT. In that hot path we
were paying a CGo round trip (tokenize + parse_one per slice) just to
have the hoister conclude that no reorder is needed.

Add a cheap case-insensitive byte-level scan for the keyword "declare"
in maybeHoist / maybeHoistList. When it returns false the input is
guaranteed to contain no DECLARE statement, so we skip the hoister
entirely. False positives ("declare" inside a string literal, or
embedded in an identifier like predeclared) are harmless: the hoister
runs and classifies them correctly. False negatives are impossible
since a real DECLARE statement must contain the keyword.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant