Commit 5e14164
fix(sampler): Respect randomizedSample flag at 100% percentage sampling (#26966)
* fix(sampler): respect randomizedSample flag at 100% percentage sampling
When profileSample is 100% with PERCENTAGE type, the sampler
short-circuits and returns the raw dataset without any randomization,
even when randomizedSample is True (the default).
Split the combined condition so:
- No profileSample set -> return raw dataset (no sampling configured)
- 100% PERCENTAGE + randomizedSample=False -> return raw dataset (optimization)
- 100% PERCENTAGE + randomizedSample=True -> go through normal sampling path
which applies RandomNumFn/df.sample for proper row shuffling
Fixes #21304
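The three cases listed in this change can be sketched as a small decision function. This is only an illustration, not the actual SQASampler or DatalakeSampler code: the `SampleConfig` dataclass below is a simplified stand-in, and `should_return_raw` is a hypothetical helper name. It uses a plain truthiness check on randomizedSample, as this first revision did; later bullets in this message revisit how None is treated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleConfig:
    """Simplified stand-in for the real sample configuration (assumption)."""

    profileSample: Optional[float] = None
    profileSampleType: str = "PERCENTAGE"
    randomizedSample: Optional[bool] = True


def should_return_raw(cfg: SampleConfig) -> bool:
    """Return True when the sampler can skip sampling and use the raw dataset."""
    # Case 1: no sampling configured at all.
    if cfg.profileSample is None:
        return True
    # Case 2: 100% PERCENTAGE without randomization is a no-op, so take the fast path.
    if (
        cfg.profileSampleType == "PERCENTAGE"
        and cfg.profileSample == 100
        and not cfg.randomizedSample
    ):
        return True
    # Case 3: everything else, including 100% PERCENTAGE with randomization,
    # goes through the normal sampling path.
    return False
```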
* Address review: use 'is False' for Optional[bool] and add unit tests
- Fix randomizedSample check from 'not' to 'is False' in both SQASampler
and DatalakeSampler to correctly handle None (Optional[bool] default=True)
- Add unit tests verifying 100% PERCENTAGE behavior for randomizedSample
values True, False, and None
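The difference between 'not' and 'is False' only shows up for None, which is why it matters for an Optional[bool] field. A minimal plain-Python sketch (the two helper names are hypothetical and exist only for this comparison):

```python
from typing import Optional


def disabled_by_truthiness(flag: Optional[bool]) -> bool:
    # `not flag` treats both False and None as "disabled".
    return not flag


def disabled_by_identity(flag: Optional[bool]) -> bool:
    # `flag is False` treats only an explicit False as "disabled".
    return flag is False


for value in (True, False, None):
    print(value, disabled_by_truthiness(value), disabled_by_identity(value))
# True False False
# False True True
# None True False
```

Which of the two is correct depends on what an unset value is supposed to mean, a question this commit message returns to in later bullets.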
* Add ORDER BY on random column in fetch_sample_data for true randomization
The get_dataset() fix ensures 100% PERCENTAGE + randomizedSample routes
through get_sample_query() which produces a CTE with a random column.
Now fetch_sample_data() detects that column and applies ORDER BY before
LIMIT, so each call returns a different subset of rows.
Also add real-DB integration tests using SQLite for the 100% PERCENTAGE
edge case (True, False, None).
* Address review: remove stale comment, unused import, add return assertions
* Apply suggestion from @Copilot
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Address review: move ORDER BY to get_sample_query, clean up fetch_sample_data
- Move ORDER BY rnd.c.random into get_sample_query() PERCENTAGE branch,
gated on randomizedSample is not False (mirrors ABSOLUTE branch pattern)
- Revert fetch_sample_data() to original: remove ds_columns variable,
random_column detection, and ORDER BY logic (ordering now handled in CTE)
- Remove duplicate assertions in DatalakeSampler100Pct tests
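The idea of ordering by the random column inside the sample query, so that a later LIMIT picks a different subset on each call, can be sketched with the standard-library sqlite3 module. This is not the SQLAlchemy code from the commit: `build_sample_query`, the `users` table, and the SQL shape are assumptions made for illustration. The gate shown ('is not False') mirrors this bullet only; later bullets in this message change how None is handled.

```python
import sqlite3
from typing import Optional


def build_sample_query(table: str, percentage: float, randomized: Optional[bool]) -> str:
    """Build an illustrative PERCENTAGE sample query with a random column."""
    query = (
        "WITH rnd AS ("
        # Map RANDOM() (a signed 64-bit integer) into the range 0..99.
        f"SELECT *, (RANDOM() % 100 + 100) % 100 AS random FROM {table}"
        f") SELECT id FROM rnd WHERE random <= {percentage}"
    )
    # Only pay for the sort when randomization is wanted.
    if randomized is not False:
        query += " ORDER BY random"
    return query


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO users (id) VALUES (?)", [(i,) for i in range(1, 51)])

# At 100% every row passes the filter; ORDER BY random decides which rows
# survive the LIMIT that the caller appends.
query = build_sample_query("users", 100, randomized=True) + " LIMIT 5"
rows = [row[0] for row in conn.execute(query)]
print(rows)
```

Without the ORDER BY, the LIMIT would typically return the same leading rows on every call, which is the behavior the original issue describes.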
* Address review: None defaults to False for randomizedSample
Per TeddyCr's feedback, randomization is computationally heavy and
should not be the default. Changed from 'is False'/'is not False' to
truthiness checks so None (unset) behaves the same as False.
Only explicit randomizedSample=True triggers ORDER BY and skips the
100% fast path. This is consistent with the ABSOLUTE branch which
already uses truthiness checks.
* Fix integration test: None should skip sample_query (matches truthiness semantics)
* fix(tests): update BigQuery view sampling expected queries with ORDER BY
BigQuery views fall through to SQASampler.get_sample_query() which now
adds ORDER BY rnd.random when randomizedSample is enabled. Update the
expected SQL strings in test_sampling_for_views and
test_sampling_view_with_partition to match.
* refactor: use explicit is False for randomizedSample checks
Address review comments: SampleConfig.randomizedSample defaults to True,
so only an explicit False should disable randomization. Using is False
/ is not False instead of truthiness ensures None follows the model
default (enabled) rather than being incorrectly treated as disabled.
* ci: re-trigger checks after SIGSEGV flake
* refactor: only explicit True randomizes, add non-determinism tests
* test: increase non-determinism iterations to reduce flakiness
* chore: added randomize as false
* fix: align randomizedSample defaults with schema (false)
* fix: remove ORDER BY from BigQuery test expectations
BigQuery sampling tests create SampleConfig without setting
randomizedSample, which now defaults to False. Since ORDER BY
is only added when randomizedSample is True, the expected query
strings should not include ORDER BY.
Also fix inaccurate docstring in test_sample.py.
* test: increase non-determinism test iterations to reduce flakiness
Increase fetch_sample_data loop from 10 to 20 iterations to further
reduce the theoretical probability of a false failure in the
randomized ordering test.
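To see why more iterations lower the chance of a false failure, consider a simplified model. It is an assumption for illustration, not a description of the actual test: each call returns a uniformly random ordering of k distinct rows, calls are independent, and the test fails only if every call returns the same ordering. Under that model the false-failure probability is (1/k!)^(n-1) for n iterations.

```python
from fractions import Fraction
from math import factorial


def false_failure_probability(rows: int, iterations: int) -> Fraction:
    """Probability that `iterations` independent uniform shuffles all coincide."""
    orderings = factorial(rows)
    # The first call fixes an ordering; each later call must match it.
    return Fraction(1, orderings) ** (iterations - 1)


# With 3 rows there are 6 possible orderings.
p10 = false_failure_probability(3, 10)  # (1/6)^9
p20 = false_failure_probability(3, 20)  # (1/6)^19
print(float(p10), float(p20))
```

Under these assumptions, doubling the iterations from 10 to 20 shrinks the probability by a further factor of (k!)^10.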
---------
Co-authored-by: Teddy <teddy.crepineau@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
1 parent 8018f9a commit 5e14164
11 files changed
Lines changed: 298 additions & 15 deletions
File tree
- bootstrap/sql/migrations/native/1.13.0
- mysql
- postgres
- ingestion
- src/metadata/sampler
- pandas
- sqlalchemy
- tests/unit
- observability/profiler/sqlalchemy
- sampler
- openmetadata-spec/src/main/resources/json/schema
- entity/data
- metadataIngestion
Lines changed: 18 additions & 0 deletions
Lines changed: 18 additions & 0 deletions
Lines changed: 114 additions & 1 deletion