Skip to content

fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273

Open
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:fix/similar-to-postgres-compat
Open

fix: translate SIMILAR TO pattern to PostgreSQL-compatible regex#22273
Dandandan wants to merge 1 commit into
apache:mainfrom
Dandandan:fix/similar-to-postgres-compat

Conversation

@Dandandan
Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

SIMILAR TO is supposed to mix SQL LIKE wildcards (%, _) with POSIX regex metacharacters and match the entire string. DataFusion was lowering SIMILAR TO directly to a RegexMatch over the pattern verbatim, so:

  • 'abc' SIMILAR TO 'a%' returned false (PG: true) because % was passed through as a literal regex character.
  • 'abc' SIMILAR TO 'b' returned true (PG: false) because the regex match isn't anchored.
  • ., ^, $ were treated as regex metacharacters instead of literals.

What changes are included in this PR?

  • New translate_similar_to_pattern() helper in datafusion/physical-expr/src/expressions/binary.rs that converts a SIMILAR TO pattern into an equivalent POSIX regex:
    • Anchors the result with ^...$.
    • %.*, _..
    • Escapes ., ^, $ (literal in SIMILAR TO, meta in regex).
    • Passes POSIX metas (|, *, +, ?, (), {m,n}, [...]) through unchanged.
    • Handles \ as an escape for the next character.
    • Passes bracket expressions through verbatim, including a leading literal ]/^].
  • Planner (datafusion/physical-expr/src/planner.rs) now translates literal Utf8 / LargeUtf8 / Utf8View patterns at planning time. NULL patterns flow through as a typed Utf8 null. Non-literal patterns return a clear not_impl_err rather than the previous silently-wrong behavior.
  • SQL layer (datafusion/sql/src/expr/mod.rs) pattern-type check widened to accept LargeUtf8 and Utf8View literals (previously rejected even though the underlying regex match supports them).

Are these changes tested?

Yes:

  • Unit test similar_to_pattern_translation in binary.rs covers wildcards, anchoring, regex metas, literal ./^/$, backslash escapes, bracket expressions (including []abc] and [^]abc]).
  • datafusion/sqllogictest/test_files/strings.slt has a new regression block exercising the bug-report case, _ wildcard, anchoring, literal ./^/$, regex metas (|, {m,n}, +), backslash-escaped wildcards, NULL pattern, and the non-literal-pattern error.
  • The existing SIMILAR TO 'p[12].*' test in strings.slt was relying on the buggy regex-passthrough behavior (p1e1 etc. matched only because . was treated as regex .); it's been changed to 'p[12]%' which expresses the same intent under correct SIMILAR TO semantics.

Are there any user-facing changes?

Yes — SIMILAR TO is now PostgreSQL-compatible:

  • Behavior changes for patterns containing %, _, or unanchored matches. Existing patterns that relied on the previous regex-passthrough behavior may need to be updated (most obviously, change .* to %).
  • Non-literal patterns now return Not yet implemented: SIMILAR TO with a non-literal pattern is not yet supported instead of silently returning a regex match. This was almost certainly broken in practice already, but it is a visible error message change.

`SIMILAR TO` was lowered directly to a regex match without translating
SQL wildcards or anchoring the pattern, so `'abc' SIMILAR TO 'a%'`
returned false instead of true and `.`/`^`/`$` were treated as regex
metacharacters instead of literals.

The planner now translates literal `SIMILAR TO` patterns into an
equivalent POSIX regex (anchored with `^...$`, `%`→`.*`, `_`→`.`,
literal `.`/`^`/`$` escaped, backslash escape and bracket expressions
preserved) before lowering to the existing regex-match operator. NULL
patterns flow through as a typed Utf8 null. Non-literal patterns now
return a clear `not_impl` error rather than a silently wrong result,
and the SQL-layer pattern-type check is widened to accept `LargeUtf8`
and `Utf8View` literals.

Closes apache#22263.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added sql SQL Planner physical-expr Changes to the physical-expr crates sqllogictest SQL Logic Tests (.slt) labels May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

physical-expr Changes to the physical-expr crates sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PostgreSQL compatibility: SIMILAR TO should treat % as a wildcard

1 participant