Commit 37c1b75

test: scale remaining sort-merge join (SMJ) benchmark queries (#21200)
## Which issue does this PR close?

- Closes #.

## Rationale for this change

Our SMJ benchmark queries finish too quickly to demonstrate improvements that aren't massive. For example, I am working on an optimization that introduces `DynComparator` (part of #20910) and it's about a 10% improvement, but only when you actually make the queries run long enough. The new queries for #21184 are scaled enough to see improvements, but we need to scale the older queries. I am also continuing to see SMJ issues with Comet when running joins with billions (sometimes trillions) of rows. We can't do that for microbenchmarks, but we can at least start hitting millions of rows to look at more than a handful of batches.

## What changes are included in this PR?

Bring our SMJ queries into alignment with some of the newer ones (Q21-23) to demonstrate further performance wins.

## Are these changes tested?

I ran the benchmark. On my M3 Max, here's how long it takes:

| Query | Join Type | Rows | Keys | Filter | Median (ms) |
|-------|-----------|------|------|--------|-------------|
| Q1 | INNER | 1M×1M | 1:1 | — | 16.3 |
| Q2 | INNER | 1M×10M | 1:10 | — | 117.4 |
| Q3 | INNER | 1M×1M | 1:100 | — | 74.2 |
| Q4 | INNER | 1M×10M | 1:10 | 1% | 17.1 |
| Q5 | INNER | 1M×1M | 1:100 | 10% | 18.4 |
| Q6 | LEFT | 1M×10M | 1:10 | — | 129.3 |
| Q7 | LEFT | 1M×10M | 1:10 | 50% | 150.2 |
| Q8 | FULL | 1M×1M | 1:10 | — | 16.6 |
| Q9 | FULL | 1M×10M | 1:10 | 10% | 153.5 |
| Q10 | LEFT SEMI | 1M×10M | 1:10 | — | 53.1 |
| Q11 | LEFT SEMI | 1M×10M | 1:10 | 1% | 15.5 |
| Q12 | LEFT SEMI | 1M×10M | 1:10 | 50% | 65.0 |
| Q13 | LEFT SEMI | 1M×10M | 1:10 | 90% | 105.7 |
| Q14 | LEFT ANTI | 1M×10M | 1:10 | — | 54.3 |
| Q15 | LEFT ANTI | 1M×10M | 1:10 | partial | 51.5 |
| Q16 | LEFT ANTI | 1M×1M | 1:1 | — | 10.3 |
| Q17 | INNER | 1M×50M | 1:50 | 5% | 75.9 |
| Q18 | LEFT SEMI | 1M×50M | 1:50 | 2% | 50.2 |
| Q19 | LEFT ANTI | 1M×50M | 1:50 | partial | 336.4 |
| Q20 | INNER | 1M×10M | 1:100 | GROUP BY | 763.7 |
| Q21 | INNER | 10M×10M | 1:1 | 50% | 186.1 |
| Q22 | LEFT | 10M×10M | 1:1 | 50% | 10,193.8 |
| Q23 | FULL | 10M×10M | 1:1 | 50% | 10,194.7 |

Note that Q22 and Q23 will be about 20x faster when #21184 merges, so taking 10 seconds to run is just a short-term issue.

## Are there any user-facing changes?

No.
1 parent 580b0ab commit 37c1b75

1 file changed

Lines changed: 83 additions & 83 deletions

benchmarks/src/smj.rs

@@ -60,27 +60,27 @@ pub struct RunOpt {
  /// - Key cardinality (rows per key)
  /// - Filter selectivity (if applicable)
  const SMJ_QUERIES: &[&str] = &[
- // Q1: INNER 100K x 100K | 1:1
+ // Q1: INNER 1M x 1M | 1:1
  r#"
  WITH t1_sorted AS (
- SELECT value as key FROM range(100000) ORDER BY value
+ SELECT value as key FROM range(1000000) ORDER BY value
  ),
  t2_sorted AS (
- SELECT value as key FROM range(100000) ORDER BY value
+ SELECT value as key FROM range(1000000) ORDER BY value
  )
  SELECT t1_sorted.key as k1, t2_sorted.key as k2
  FROM t1_sorted JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  "#,
- // Q2: INNER 100K x 1M | 1:10
+ // Q2: INNER 1M x 10M | 1:10
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
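
As a rough aid for reading the scaled row counts: assuming each input is built as `range(N)` with `value % M` as the join key, as in the queries in this hunk, each key appears about N / M times per side, and an unfiltered inner join emits the product of the two per-key counts for every shared key. The sketch below estimates that output size. The names `Side`, `rows_for_key`, and `inner_join_rows` are hypothetical helpers for illustration and are not part of `smj.rs`.

```rust
/// One join input, assumed to be `SELECT value % modulus AS key FROM range(rows)`.
/// (Hypothetical helper type, not part of the benchmark.)
#[derive(Clone, Copy)]
struct Side {
    rows: u64,
    modulus: u64,
}

/// Number of rows in `range(rows)` whose `value % modulus` equals `key`.
fn rows_for_key(rows: u64, modulus: u64, key: u64) -> u64 {
    if key >= modulus {
        return 0;
    }
    // Every key gets `rows / modulus` rows; the first `rows % modulus`
    // keys get one extra row from the final partial cycle.
    rows / modulus + u64::from(key < rows % modulus)
}

/// Output rows of an equi-join on `key` with no additional filter.
fn inner_join_rows(left: Side, right: Side) -> u64 {
    // Only keys below both moduli exist on both sides.
    let shared_keys = left.modulus.min(right.modulus);
    (0..shared_keys)
        .map(|key| {
            rows_for_key(left.rows, left.modulus, key)
                * rows_for_key(right.rows, right.modulus, key)
        })
        .sum()
}
```

Under these assumptions, Q2 as scaled here (1M rows with `% 100000` joined to 10M rows with `% 100000`) has 10 and 100 rows per key, so `inner_join_rows` returns 100,000,000 joined rows before any filter is applied.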
@@ -101,16 +101,16 @@ const SMJ_QUERIES: &[&str] = &[
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
  FROM t1_sorted JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  "#,
- // Q4: INNER 100K x 1M | 1:10 | 1%
+ // Q4: INNER 1M x 10M | 1:10 | 1%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
@@ -133,63 +133,63 @@ const SMJ_QUERIES: &[&str] = &[
  FROM t1_sorted JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  WHERE t1_sorted.data <> t2_sorted.data AND t2_sorted.data % 10 = 0
  "#,
- // Q6: LEFT 100K x 1M | 1:10
+ // Q6: LEFT 1M x 10M | 1:10
  r#"
  WITH t1_sorted AS (
- SELECT value % 10500 as key, value as data
- FROM range(100000)
+ SELECT value % 105000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
  FROM t1_sorted LEFT JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  "#,
- // Q7: LEFT 100K x 1M | 1:10 | 50%
+ // Q7: LEFT 1M x 10M | 1:10 | 50%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
  FROM t1_sorted LEFT JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  WHERE t2_sorted.data IS NULL OR t2_sorted.data % 2 = 0
  "#,
- // Q8: FULL 100K x 100K | 1:10
+ // Q8: FULL 1M x 1M | 1:10
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 12500 as key, value as data
- FROM range(100000)
+ SELECT value % 125000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key as k1, t1_sorted.data as d1,
  t2_sorted.key as k2, t2_sorted.data as d2
  FROM t1_sorted FULL JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  "#,
- // Q9: FULL 100K x 1M | 1:10 | 10%
+ // Q9: FULL 1M x 10M | 1:10 | 10%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key as k1, t1_sorted.data as d1,
@@ -199,16 +199,16 @@ const SMJ_QUERIES: &[&str] = &[
  OR t1_sorted.data <> t2_sorted.data)
  AND (t1_sorted.data IS NULL OR t1_sorted.data % 10 = 0)
  "#,
- // Q10: LEFT SEMI 100K x 1M | 1:10
+ // Q10: LEFT SEMI 1M x 10M | 1:10
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key
- FROM range(1000000)
+ SELECT value % 100000 as key
+ FROM range(10000000)
  ORDER BY key
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -218,16 +218,16 @@ const SMJ_QUERIES: &[&str] = &[
  WHERE t2_sorted.key = t1_sorted.key
  )
  "#,
- // Q11: LEFT SEMI 100K x 1M | 1:10 | 1%
+ // Q11: LEFT SEMI 1M x 10M | 1:10 | 1%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -239,16 +239,16 @@ const SMJ_QUERIES: &[&str] = &[
  AND t2_sorted.data % 100 = 0
  )
  "#,
- // Q12: LEFT SEMI 100K x 1M | 1:10 | 50%
+ // Q12: LEFT SEMI 1M x 10M | 1:10 | 50%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -260,16 +260,16 @@ const SMJ_QUERIES: &[&str] = &[
  AND t2_sorted.data % 2 = 0
  )
  "#,
- // Q13: LEFT SEMI 100K x 1M | 1:10 | 90%
+ // Q13: LEFT SEMI 1M x 10M | 1:10 | 90%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(1000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(10000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -281,16 +281,16 @@ const SMJ_QUERIES: &[&str] = &[
  AND t2_sorted.data % 10 <> 0
  )
  "#,
- // Q14: LEFT ANTI 100K x 1M | 1:10
+ // Q14: LEFT ANTI 1M x 10M | 1:10
  r#"
  WITH t1_sorted AS (
- SELECT value % 10500 as key, value as data
- FROM range(100000)
+ SELECT value % 105000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key
- FROM range(1000000)
+ SELECT value % 100000 as key
+ FROM range(10000000)
  ORDER BY key
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -300,16 +300,16 @@ const SMJ_QUERIES: &[&str] = &[
  WHERE t2_sorted.key = t1_sorted.key
  )
  "#,
- // Q15: LEFT ANTI 100K x 1M | 1:10 | partial match
+ // Q15: LEFT ANTI 1M x 10M | 1:10 | partial match
  r#"
  WITH t1_sorted AS (
- SELECT value % 12000 as key, value as data
- FROM range(100000)
+ SELECT value % 120000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key
- FROM range(1000000)
+ SELECT value % 100000 as key
+ FROM range(10000000)
  ORDER BY key
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -319,16 +319,16 @@ const SMJ_QUERIES: &[&str] = &[
  WHERE t2_sorted.key = t1_sorted.key
  )
  "#,
- // Q16: LEFT ANTI 100K x 100K | 1:1 | stress
+ // Q16: LEFT ANTI 1M x 1M | 1:1 | stress
  r#"
  WITH t1_sorted AS (
- SELECT value % 11000 as key, value as data
- FROM range(100000)
+ SELECT value % 110000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key
- FROM range(100000)
+ SELECT value % 100000 as key
+ FROM range(1000000)
  ORDER BY key
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -338,32 +338,32 @@ const SMJ_QUERIES: &[&str] = &[
  WHERE t2_sorted.key = t1_sorted.key
  )
  "#,
- // Q17: INNER 100K x 5M | 1:50 | 5%
+ // Q17: INNER 1M x 50M | 1:50 | 5%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(5000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(50000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data as d1, t2_sorted.data as d2
  FROM t1_sorted JOIN t2_sorted ON t1_sorted.key = t2_sorted.key
  WHERE t2_sorted.data <> t1_sorted.data AND t2_sorted.data % 20 = 0
  "#,
- // Q18: LEFT SEMI 100K x 5M | 1:50 | 2%
+ // Q18: LEFT SEMI 1M x 50M | 1:50 | 2%
  r#"
  WITH t1_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(100000)
+ SELECT value % 100000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key, value as data
- FROM range(5000000)
+ SELECT value % 100000 as key, value as data
+ FROM range(50000000)
  ORDER BY key, data
  )
  SELECT t1_sorted.key, t1_sorted.data
@@ -375,16 +375,16 @@ const SMJ_QUERIES: &[&str] = &[
  AND t2_sorted.data % 50 = 0
  )
  "#,
- // Q19: LEFT ANTI 100K x 5M | 1:50 | partial match
+ // Q19: LEFT ANTI 1M x 50M | 1:50 | partial match
  r#"
  WITH t1_sorted AS (
- SELECT value % 15000 as key, value as data
- FROM range(100000)
+ SELECT value % 150000 as key, value as data
+ FROM range(1000000)
  ORDER BY key, data
  ),
  t2_sorted AS (
- SELECT value % 10000 as key
- FROM range(5000000)
+ SELECT value % 100000 as key
+ FROM range(50000000)
  ORDER BY key
  )
  SELECT t1_sorted.key, t1_sorted.data
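
In the LEFT and LEFT ANTI queries in this diff, the left modulus is larger than the right one, so left keys at or above the right modulus have no partner on the right. A minimal sketch of how many left rows that leaves unmatched, assuming the `range(N)` / `value % M` construction used above and a right side large enough to contain every key below its modulus. The function names are hypothetical and not part of `smj.rs`.

```rust
/// Number of rows in `range(rows)` whose `value % modulus` equals `key`.
fn rows_for_key(rows: u64, modulus: u64, key: u64) -> u64 {
    if key >= modulus {
        return 0;
    }
    // Base count per key, plus one for keys covered by the final partial cycle.
    rows / modulus + u64::from(key < rows % modulus)
}

/// Left rows whose key never appears on the right side.
/// Assumes the right side holds every key in `0..right_modulus`.
fn unmatched_left_rows(left_rows: u64, left_modulus: u64, right_modulus: u64) -> u64 {
    // Keys in `right_modulus..left_modulus` exist only on the left.
    // The range is empty when the left modulus is not larger.
    (right_modulus..left_modulus)
        .map(|key| rows_for_key(left_rows, left_modulus, key))
        .sum()
}
```

With Q19's parameters (1M left rows, left modulus 150000, right modulus 100000), `unmatched_left_rows` returns 300,000. That would be the anti-join output size if the part of the query not shown in this hunk adds no further filter. The same calculation gives 45,000 for the `% 105000` left side used in Q6 and Q14.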
