Skip to content

fix(snowflake-profiler): Handle leading whitespace/comments when parsing DML queries#28280

Open
zak-nuccio wants to merge 5 commits into
open-metadata:mainfrom
zak-nuccio:fix/snowflake-profiling-query-parse-leading-whitespace
Open

fix(snowflake-profiler): Handle leading whitespace/comments when parsing DML queries#28280
zak-nuccio wants to merge 5 commits into
open-metadata:mainfrom
zak-nuccio:fix/snowflake-profiling-query-parse-leading-whitespace

Conversation

@zak-nuccio
Copy link
Copy Markdown
Contributor

@zak-nuccio zak-nuccio commented May 19, 2026

Describe your changes:

The Snowflake system-metrics profiler parses query-log entries with a regex (QUERY_PATTERN) to identify DML operations (INSERT, UPDATE, DELETE, MERGE, COPY) and extract the target table name. The regex uses re.match(), which anchors at the start of the string, so any leading whitespace or SQL comment caused the match to fail silently — the profiler returned None for the query, discarding the DML event entirely and producing an incomplete system-metrics profile.

This was fixed by adding a _normalize_dml_sql() helper that is called inside _parse_query before the regex:

  1. Block comments — replaced via re.sub(r"/\*.*?\*/", " ", …, flags=re.DOTALL) so their bodies (which may contain other DML keywords) are reduced to a single space and cannot confuse the pattern.
  2. Single-line comments — stripped with re.sub(r"--[^\n]*", "", …) (preserves the trailing newline as a token separator).
  3. Leading/trailing whitespacestr.strip() so re.match() finds the DML keyword at position 0.

The LRU-cache key (sha256_hash(query.strip())) is unchanged — the cache continues to deduplicate identical raw queries efficiently.

Changes

  • ingestion/src/metadata/profiler/metrics/system/snowflake/system.py
  • ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py

Test plan

  • pytest ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py — all tests pass
  • Manually triggered py-tests CI workflow on branch fix/snowflake-profiling-query-parse-leading-whitespace in fork

Type of change:

  • Bug fix

High-level design:

N/A — small pre-processing step added to one helper function.

Tests:

Use cases covered

  • DML queries with leading spaces, tabs, or newlines are correctly parsed.
  • DML queries prefixed with single-line -- … comments are correctly parsed.
  • DML queries prefixed with block /* … */ comments are correctly parsed.
  • Comment bodies that contain DML keywords do NOT affect which table is extracted.
  • Non-DML queries (SELECT) continue to return None.

Unit tests

  • Added test_parse_query, containing test cases that passed under the old query parser, and extra cases that correct extract the table despite comments/whitespace. This also accounts for cases where just using re.search() would incorrectly extract a table found within a comment, rather than from actual query.
  • Files updated:
    ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py

Backend integration tests

  • N/A

Ingestion integration tests

  • N/A (Snowflake profiler uses mocked query-log responses in tests).

Playwright (UI) tests

  • N/A

Manual testing performed

  1. Confirmed all existing and new unit tests pass locally.
  2. Manually reverted _normalise_sql_for_dml_parsing call and confirmed parametrized leading-whitespace and comment cases fail as expected.

UI screen recording / screenshots:

N/A

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above.

…ndle leading whitespace and comments

Before this fix, _parse_query used re.match() directly on the raw query text.
re.match() only matches at the very start of the string, so queries with any
leading whitespace (spaces, tabs, newlines) or SQL comments (-- or /* */)
before the DML keyword would fail to parse and silently return None, causing
those rows to be dropped from system profiling results.

A more defensive normalisation step is now applied before matching:
1. Block comments (/* ... */) are replaced with a single space, so they act
   as a token separator but their body is not visible to the regex. This
   prevents commented-out SQL snippets (e.g. "/* INSERT INTO old_table ... */")
   from being mistakenly identified as the active DML operation.
2. Single-line comments (-- ...) are stripped to end-of-line, leaving the
   newline intact so adjacent tokens are n   newline intact so adjacent tokens are n   newline intact so adjacent tokens are n   newline intact so adjacent tokens are n   newline intact so adjacent tokens are n   newlinethe   newline intact so adjacent tokens are n   newline intact so adjacent tokera   newline intact so adjacent tokens are n   newline intact so adjacent to4 l   newline intact so adjacent tokens are n   newline intact so adjacent ases
- 3 leading /* */ comment cases
- 3 non-DML (SELECT) correctly returning N- 3 non-DML (SELECT) correctly returning N- 3 nonses
  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D  (D
Copilot AI review requested due to automatic review settings May 19, 2026 22:21
@zak-nuccio zak-nuccio requested a review from a team as a code owner May 19, 2026 22:21
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Snowflake system-metrics DML parsing by normalizing query text before applying the existing QUERY_PATTERN, allowing leading whitespace and SQL comments to no longer cause silent parse misses and dropped DML events.

Changes:

  • Added _normalise_dml_sql() to strip block comments, -- comments, and leading/trailing whitespace prior to regex matching.
  • Updated _parse_query() to run the DML regex against the normalized SQL.
  • Added unit test coverage for leading whitespace/comments and for ensuring DML keywords inside comments don’t affect extraction.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
ingestion/src/metadata/profiler/metrics/system/snowflake/system.py Adds SQL normalization helper and uses it in _parse_query prior to re.match.
ingestion/tests/unit/metadata/ingestion/source/database/snowflake/profiler/test_system_metrics.py Adds parametrized tests covering whitespace/comment-prefixed DML parsing and comment-body safety cases.

Comment on lines 80 to +83
@cache.wrap(key_func=lambda query: sha256_hash(query.strip()))
def _parse_query(query: str) -> Optional[str]: # noqa: UP045
"""Parse snowflake queries to extract the identifiers"""
match = re.match(QUERY_PATTERN, query, re.IGNORECASE)
"""Parse snowflake queries to extract the identifiers.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 19, 2026 22:32
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread ingestion/src/metadata/profiler/metrics/system/snowflake/system.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot AI review requested due to automatic review settings May 20, 2026 04:41
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 20, 2026

Code Review ✅ Approved

Normalizes DML queries by stripping leading whitespace and comments before regex parsing to prevent silent failures in the Snowflake profiler. No issues found.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comment on lines +72 to +77
# Pass 1: block comments → single space (dotall so . matches \n)
query = re.sub(r"/\*.*?\*/", " ", query, flags=re.DOTALL)
# Pass 2: single-line comments → remove to end of line, keep the newline
query = re.sub(r"--[^\n]*", "", query)
# Pass 3: strip leading/trailing whitespace
return query.strip()
Comment on lines 80 to 90
@cache.wrap(key_func=lambda query: sha256_hash(query.strip()))
def _parse_query(query: str) -> Optional[str]: # noqa: UP045
"""Parse snowflake queries to extract the identifiers"""
match = re.match(QUERY_PATTERN, query, re.IGNORECASE)
"""Parse snowflake queries to extract the identifiers.

The query is first normalized (block comments, single-line comments, and
leading whitespace removed) so that re.match() can reliably find the DML
keyword at position 0 without being confused by commented-out SQL in the
query body.
"""
match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE)
try:
Comment on lines +82 to +89
"""Parse snowflake queries to extract the identifiers.

The query is first normalized (block comments, single-line comments, and
leading whitespace removed) so that re.match() can reliably find the DML
keyword at position 0 without being confused by commented-out SQL in the
query body.
"""
match = re.match(QUERY_PATTERN, _normalize_dml_sql(query), re.IGNORECASE)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants