fix(snowflake-profiler): Handle leading whitespace/comments when parsing DML queries#28280
Conversation
…ndle leading whitespace and comments Before this fix, _parse_query used re.match() directly on the raw query text. re.match() only matches at the very start of the string, so queries with any leading whitespace (spaces, tabs, newlines) or SQL comments (-- or /* */) before the DML keyword would fail to parse and silently return None, causing those rows to be dropped from system profiling results. A more defensive normalisation step is now applied before matching: 1. Block comments (/* ... */) are replaced with a single space, so they act as a token separator but their body is not visible to the regex. This prevents commented-out SQL snippets (e.g. "/* INSERT INTO old_table ... */") from being mistakenly identified as the active DML operation. 2. Single-line comments (-- ...) are stripped to end-of-line, leaving the newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newline intact so adjacent tokens are n newlinethe newline intact so adjacent tokens are n newline intact so adjacent tokera newline intact so adjacent tokens are n newline intact so adjacent to4 l newline intact so adjacent tokens are n newline intact so adjacent ases - 3 leading /* */ comment cases - 3 non-DML (SELECT) correctly returning N- 3 non-DML (SELECT) correctly returning N- 3 nonses (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D (D
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
1 similar comment
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
There was a problem hiding this comment.
Pull request overview
This PR improves Snowflake system-metrics DML parsing by normalizing query text before applying the existing QUERY_PATTERN, allowing leading whitespace and SQL comments to no longer cause silent parse misses and dropped DML events.
Changes:
- Added
_normalise_dml_sql()to strip block comments,--comments, and leading/trailing whitespace prior to regex matching. - Updated
_parse_query()to run the DML regex against the normalized SQL. - Added unit test coverage for leading whitespace/comments and for ensuring DML keywords inside comments don’t affect extraction.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
ingestion/src/metadata/profiler/metrics/system/snowflake/system.py |
Adds SQL normalization helper and uses it in _parse_query prior to re.match. |
ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py |
Adds parametrized tests covering whitespace/comment-prefixed DML parsing and comment-body safety cases. |
| @cache.wrap(key_func=lambda query: sha256_hash(query.strip())) | ||
| def _parse_query(query: str) -> Optional[str]: # noqa: UP045 | ||
| """Parse snowflake queries to extract the identifiers""" | ||
| match = re.match(QUERY_PATTERN, query, re.IGNORECASE) | ||
| """Parse snowflake queries to extract the identifiers. | ||
|
|
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
…sql (American spelling)
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Code Review ✅ ApprovedNormalizes DML queries by stripping leading whitespace and comments before regex parsing to prevent silent failures in the Snowflake profiler. No issues found. OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
| # Pass 1: block comments → single space (dotall so . matches \n) | ||
| query = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL) | ||
| # Pass 2: single-line comments → remove to end of line, keep the newline | ||
| query = re.sub(r"--[^\n]*", "", query) | ||
| # Pass 3: strip leading/trailing whitespace | ||
| return query.strip() |
| @cache.wrap(key_func=lambda query: sha256_hash(query.strip())) | ||
| def _parse_query(query: str) -> Optional[str]: # noqa: UP045 | ||
| """Parse snowflake queries to extract the identifiers""" | ||
| match = re.match(QUERY_PATTERN, query, re.IGNORECASE) | ||
| """Parse snowflake queries to extract the identifiers. | ||
|
|
||
| The query is first normalized (block comments, single-line comments, and | ||
| leading whitespace removed) so that re.match() can reliably find the DML | ||
| keyword at position 0 without being confused by commented-out SQL in the | ||
| query body. | ||
| """ | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) | ||
| try: |
| """Parse snowflake queries to extract the identifiers. | ||
|
|
||
| The query is first normalized (block comments, single-line comments, and | ||
| leading whitespace removed) so that re.match() can reliably find the DML | ||
| keyword at position 0 without being confused by commented-out SQL in the | ||
| query body. | ||
| """ | ||
| match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE) |
Describe your changes:
The Snowflake system-metrics profiler parses query-log entries with a regex (
QUERY_PATTERN) to identify DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) and extract the target table name. The regex usesre.match(), which anchors at the start of the string, so any leading whitespace or SQL comment caused the match to fail silently — the profiler returnedNonefor the query, discarding the DML event entirely and producing an incomplete system-metrics profile.This was fixed by adding a
_normalize_dml_sql()helper that is called inside_parse_querybefore the regex:re.sub(r"/\*.*?\*/", " ", …, flags=re.DOTALL)so their bodies (which may contain other DML keywords) are reduced to a single space and cannot confuse the pattern.re.sub(r"--[^\n]*", "", …)(preserves the trailing newline as a token separator).str.strip()sore.match()finds the DML keyword at position 0.The LRU-cache key (
sha256_hash(query.strip())) is unchanged — the cache continues to deduplicate identical raw queries efficiently.Changes
ingestion/src/metadata/profiler/metrics/system/snowflake/system.pyingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.pyTest plan
pytest ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py— all tests passpy-testsCI workflow on branchfix/snowflake-profiling-query-parse-leading-whitespacein forkType of change:
High-level design:
N/A — small pre-processing step added to one helper function.
Tests:
Use cases covered
-- …comments are correctly parsed./* … */comments are correctly parsed.SELECT) continue to returnNone.Unit tests
test_parse_query, containing test cases that passed under the old query parser, and extra cases that correct extract the table despite comments/whitespace. This also accounts for cases where just using re.search() would incorrectly extract a table found within a comment, rather than from actual query.ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.pyBackend integration tests
Ingestion integration tests
Playwright (UI) tests
Manual testing performed
_normalise_sql_for_dml_parsingcall and confirmed parametrized leading-whitespace and comment cases fail as expected.UI screen recording / screenshots:
N/A
Checklist: