Hoist DECLAREs via sqlglot tokenizer instead of regex#2053
Open
batikankarakan wants to merge 6 commits into
Open
Hoist DECLAREs via sqlglot tokenizer instead of regex#2053batikankarakan wants to merge 6 commits into
batikankarakan wants to merge 6 commits into
Conversation
Replaces the brittle regex-based DECLARE hoister with a sqlglot-driven approach. We tokenize the SQL with sqlglot's dialect-aware tokenizer, track paren and BEGIN..END depth to find top-level ';' positions, then slice the original source by those positions and classify each slice with parse_one(). DECLAREs nested inside stored procedures or BEGIN..END blocks are no longer top-level and stay put. Statement text is sliced from the original source — sqlglot never regenerates user SQL, so casing, comments, escapes and dialect-specific syntax (e.g. 'array<STRING>') survive byte-for-byte. WrapHooks now takes a DeclareHoister interface implemented by SQLParser, plumbed through the HookWrapperMaterializer* structs at every call site in cmd/run.go and cmd/render.go. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
pythonsrc/parser/main.py:558-562
**`CASE...END` leaks through `begin_end_depth` tracking**
`TokenType.END` is emitted for *both* `CASE WHEN … END` expressions and `BEGIN … END` blocks — the tokenizer has no context to distinguish them. Inside a `BEGIN … END` body, a `CASE` expression's `END` will decrement `begin_end_depth` to 0 prematurely, causing the semicolons that follow (including any `DECLARE y INT64;`) to be treated as top-level. Those DECLAREs are then incorrectly hoisted out of the block.
Concrete example: `"SET x=1;\nBEGIN\n SELECT CASE WHEN x>0 THEN 'a' ELSE 'b' END;\n DECLARE y INT64;\nEND;"` — the `DECLARE y INT64` would be extracted and placed at the script's top, breaking stored-procedure semantics.
The same applies to `END IF`, `END WHILE`, `END LOOP` in dialects that emit individual `END` tokens for each construct. A `case_depth` counter (incremented on `TokenType.CASE`, decremented on `TokenType.END` before `begin_end_depth`) is needed to disambiguate.
### Issue 2 of 3
pythonsrc/parser/main.py:558-559
**`BEGIN TRANSACTION` inflates `begin_end_depth` without a matching `END`**
`BEGIN TRANSACTION;` produces `TokenType.BEGIN`, incrementing `begin_end_depth` to 1. `COMMIT TRANSACTION;` and `ROLLBACK TRANSACTION;` produce no `TokenType.END`, so depth never returns to 0. Every semicolon after `BEGIN TRANSACTION;` is treated as nested and excluded from `positions`.
In the common case where all DECLAREs precede the transaction block this is accidentally correct (the whole transaction body becomes one non-DECLARE slice). However, any script where `BEGIN TRANSACTION` appears before a top-level DECLARE that still needs hoisting — or where the rejoined output must produce well-formed statements — will silently get incorrect slicing. The PR test suite does not cover this pattern; the old `BEGIN TRANSACTION` test case in `TestWrapHooks_HoistsDeclares` was removed and not ported to the Python test suite.
### Issue 3 of 3
cmd/render.go:256-267
**Parser started but `Start()` called again on each invocation**
If `sqlparser.NewSQLParser(false)` succeeds but `sp.Start()` fails, `defer sp.Close()` is still registered and will run when the action function exits — which is correct. However, `sp.Start()` is called again inside `HoistDeclares` (and `HoistDeclaresList`) for every invocation, because those methods call `s.Start()` unconditionally. If `Start()` is not fully idempotent this double-start could cause unexpected behaviour. The pattern in `run.go` passes the parser only after a successful start (via a goroutine), so the two initialization styles are inconsistent. Worth confirming `Start()` is a no-op when already running.
Reviews (1): Last reviewed commit: "Hoist DECLARE statements via sqlglot tok..." | Re-trigger Greptile |
Replaces the Python sqlglot IPC-based hoister with an in-process implementation built on the existing polyglot-sql Rust crate. A single CGo call replaces a JSON-over-stdin/stdout round trip per hook wrap, removing the Python subprocess from the materialization hot path. Adds `hoist_declares` / `hoist_declares_list` to compat.rs and lib.rs, CGo bindings on darwin/linux, and `HoistDeclares` / `HoistDeclaresList` methods on `RustSQLParser`. `cmd/run.go` reverts the Python parser initialization to its original schema-prefix gating and wires a separate RustSQLParser instance as the `DeclareHoister`. `cmd/render.go` swaps its hoister to the Rust parser too. The slice-and-classify algorithm matches the previous Python version (top-level ';' positions found via dialect-aware tokenizer, original text preserved byte-for-byte) and now also tracks CASE depth so CASE..END inside a BEGIN body doesn't prematurely close the block. Python hoist code and tests removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The CI rust artifact cache key only hashed Cargo.lock and Cargo.toml, so edits under pkg/sqlparser/rustffi/src/ produced a stale cache hit: the .a was restored and touched without being rebuilt, then linking failed because new FFI symbols defined in src/lib.rs and src/compat.rs weren't present in the cached library. Include src/**/*.rs in the cache key so any Rust source change forces a fresh build. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
pkg/sqlparser/rustffi/src/compat.rs:1059-1074
**`BEGIN TRANSACTION` inflates `begin_end_depth` without a matching decrement**
`BEGIN TRANSACTION;` emits `TokenType::Begin`, incrementing `begin_end_depth` to 1, but `COMMIT TRANSACTION;` and `ROLLBACK TRANSACTION;` emit no `TokenType::End`, so the depth never returns to 0. Every semicolon that follows is then treated as nested and excluded from `positions`. In the common case (DECLAREs precede the transaction block) this is accidentally correct, but any script where a top-level statement or DECLARE appears after a `COMMIT TRANSACTION;` without a matching `END` will silently have those statements merged into the transaction slice, making them invisible to the hoisting classifier.
Reviews (2): Last reviewed commit: "ci: invalidate rust cache on rustffi/src..." | Re-trigger Greptile |
Contributor
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
`BEGIN TRANSACTION` emits a Begin token but the matching `COMMIT TRANSACTION` / `ROLLBACK TRANSACTION` produce no End token, so naively counting Begin/End leaves the depth stuck above zero and silently classifies every following semicolon as nested. After a top-level DECLARE following BEGIN TRANSACTION would never be hoisted. Peek one token past Begin: if the next token is Transaction, treat it as a transaction-control statement and skip the depth increment. Adds regression tests for both shapes — already-ordered DECLAREs preceding a BEGIN TRANSACTION block (must be byte-identical no-op) and a DECLARE inside the transaction body (must be hoisted past BEGIN TRANSACTION to the script's top). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Make explicit that Start is a no-op verification against the linked FFI library rather than a startup step that holds state. Methods such as HoistDeclares don't call it internally, so the double-start hazard that applied to the Python SQLParser is not present here. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Most hook-wrapped scripts contain no DECLAREs — the typical shape is
pre-hook SET / asset SELECT / post-hook INSERT. In that hot path we
were paying a CGo round trip (tokenize + parse_one per slice) just to
have the hoister conclude that no reorder is needed.
Add a cheap case-insensitive byte-level scan for the keyword "declare"
in maybeHoist / maybeHoistList. When it returns false the input is
guaranteed to contain no DECLARE statement, so we skip the hoister
entirely. False positives ("declare" inside a string literal, or
embedded in an identifier like predeclared) are harmless: the hoister
runs and classifies them correctly. False negatives are impossible
since a real DECLARE statement must contain the keyword.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
pkg/pipeline/hooks.gowith a sqlglot-driven implementation: tokenize with the dialect-aware tokenizer, track paren + BEGIN..END depth to find top-level;positions, slice the original source by those positions, and classify each slice withparse_one(). sqlglot never regenerates user SQL — statement text is sliced verbatim, so casing, comments, escapes and dialect-specific syntax (e.g.array<STRING>) survive byte-for-byte.WrapHooks/wrapHookQueriesListnow take aDeclareHoisterinterface implemented bySQLParser, plumbed through theHookWrapperMaterializer*structs at every call site incmd/run.goandcmd/render.go. The parser is initialized unconditionally inrun.goso the hoister is always available.BEGIN..ENDor stored-procedure bodies are correctly left in place;;inside string literals no longer splits; on parse failure the input passes through unchanged.Test plan
make formatcleanmake testclean (incl.pkg/pipeline+pkg/sqlparser)TestSqlParser_HoistDeclaressubtests verify byte-identical preservation, BEGIN..END exemption, string-literal safety, andarray<STRING>casing survivalpythonsrc/parser/main_test.pycases mirror the same invariants on the Python side🤖 Generated with Claude Code