Optimize regexp_replace by stripping trailing .* from anchored patterns. 2.4x improvement (ClickBench Q28)#21379
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Seems to make a non trivial difference |
|
I don't think this rewrite is correct in all cases Here are some .slt test that pass on main but not this PR: # If the overall pattern matches but capture group 1 does not participate,
# regexp_replace(..., '\1') should substitute the empty string, not keep
# the original input.
query B
SELECT regexp_replace('bzzz', '^(a)?b.*$', '\1') = '';
----
true
# Stripping trailing .*$ must not change match semantics for inputs with
# newlines when the original pattern does not use the 's' flag.
query B
SELECT regexp_replace(concat('http://x/', chr(10), 'rest'), '^https?://([^/]+)/.*$', '\1')
= concat('http://x/', chr(10), 'rest');
----
true(they return false on this branch) |
FWIW I checked and duckdb also returns true. Maybe we should commit these tests? Or have some sort of regex fuzz tests? |
|
Hmm I need to spend some more time to make these cases work (but I think the optimization is still right for this case, we should mostly retract other cases). |
349496e to
5a86142
Compare
|
(I haven't added coverage for the Utf8View branch as it will be a larger change -- however, i think we can do it as a follow on PR -- I'll make a proposal) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
🤖 Criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize-regexp-replace-v2 (114eec6) to 603bfb4 (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
Added test coverage in |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
I also made a PR to avoid the duplication of loops: |
|
run benchmark clickbench_1 |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing optimize-regexp-replace-v2 (114eec6) to 603bfb4 (merge-base) diff using: clickbench_1 File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_1 — base (merge-base)
clickbench_1 — branch
File an issue against this benchmark runner |
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
woohoo 🚀 |
## Which issue does this PR close? - related to #21379 ## Rationale for this change While reviewing #21379 I noticed there was minimal Utf8View coverage of the related code. ## What changes are included in this PR? Update the regexp_replace tests to cover utf8, largeutf8, utf8view and dictionary ## Are these changes tested? Yes only tests I verified these tests also pass when run on - #21379 ## Are there any user-facing changes? No
Which issue does this PR close?
Rationale for this change
regexp_replacewith anchored patterns like^https?://(?:www\.)?([^/]+)/.*$spends time scanning the trailing.*$and usingcaptures()+expand()withStringallocation on every row.It just happens this
SELECT regexp_replace(url, '^https?://(?:www\.)?([^/]+)/.*$', '\1')query benefits from this optimization (2.4x faster)What changes are included in this PR?
.*$from the pattern string for anchored patterns where the replacement is\1captures_readwith pre-allocatedCaptureLocationsfor direct byte-slice extractionAre these changes tested?
Yes, covered by existing
regexp_replaceunit tests, ClickBench sqllogictests, and the new URL domain extraction sqllogictest.Are there any user-facing changes?
No.