Commit 2b2aa6b
fix(parquet): align dictionary fallback with parquet-mr (#786)
### Rationale for this change
On dictionary overflow, arrow-go always flushed the dictionary page and
any buffered dict-encoded data pages before switching to PLAIN, even
when no dict-encoded data page had been cut. On mid-cardinality columns
the result was a 4-encoding chunk layout (PLAIN_DICTIONARY, PLAIN, RLE,
PLAIN) that bloated output by 20-30% versus parquet-mr.
This was noticed while testing iceberg-go's recently added compaction
feature, where some tables with particularly high-cardinality columns
grew by about 30% after compaction.
### What changes are included in this PR?
Mirror parquet-mr's FallbackValuesWriter:
- Discard the dictionary and re-encode buffered indices as PLAIN when no
dict-encoded data page has been flushed yet; only emit the dictionary
page once a dict-encoded page is committed.
- Before the first dict-encoded page, fall back to PLAIN if the combined
size of the dictionary and its encoded indices is >= the raw input bytes.
- Size dict-encoded pages by raw input bytes (not the RLE indices'
encoded size) so the page cadence matches PLAIN.
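
The fallback rules above can be sketched as a small decision function. This is an illustrative model only, not the arrow-go implementation: `chunkState`, `fallbackAction`, and `decide` are hypothetical names, and the sketch assumes the check runs when the dictionary overflows or a page is about to be cut.

```go
package main

import "fmt"

// fallbackAction is a hypothetical label for the writer's next step.
type fallbackAction int

const (
	keepDictionary     fallbackAction = iota // stay dictionary-encoded
	discardAndReencode                       // drop the dictionary, rewrite buffered indices as PLAIN
	flushThenPlain                           // emit the dictionary page, then continue as PLAIN
)

func (a fallbackAction) String() string {
	return [...]string{"keepDictionary", "discardAndReencode", "flushThenPlain"}[a]
}

// chunkState is a hypothetical summary of a column chunk in progress.
type chunkState struct {
	dictBytes        int64 // encoded size of the dictionary so far
	indexBytes       int64 // encoded size of the buffered indices
	rawBytes         int64 // raw input bytes observed so far
	dictPagesFlushed int   // dict-encoded data pages already committed
	dictOverflowed   bool  // dictionary exceeded its size limit
}

// decide models the rules from the list above.
func decide(s chunkState) fallbackAction {
	if s.dictPagesFlushed > 0 {
		// Committed pages reference the dictionary, so it must be kept.
		if s.dictOverflowed {
			return flushThenPlain
		}
		return keepDictionary
	}
	// Nothing committed yet: the dictionary can be dropped without loss.
	if s.dictOverflowed || s.dictBytes+s.indexBytes >= s.rawBytes {
		return discardAndReencode
	}
	return keepDictionary
}

func main() {
	s := chunkState{dictBytes: 80, indexBytes: 30, rawBytes: 100}
	fmt.Println(decide(s))
}
```

The point of separating the two "leave the dictionary" outcomes is that only `flushThenPlain` produces a chunk with mixed encodings; `discardAndReencode` yields a PLAIN-only chunk.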
Adds DictEncoder.FallBackTo / ObservedRawSize and exposes
BinaryMemoTable.Value for the fallback translation.
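
To illustrate what the fallback translation involves, here is a minimal sketch that rewrites buffered dictionary indices as PLAIN-encoded BYTE_ARRAY values (a 4-byte little-endian length followed by the bytes, per the Parquet format). `plainFromIndices` is a hypothetical helper, and the `dict` slice stands in for a lookup by index such as the one `BinaryMemoTable.Value` is described as providing; the real signatures are not shown in this commit message and are not assumed here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// plainFromIndices is a hypothetical helper: it resolves each dictionary
// index to its value and appends that value in PLAIN BYTE_ARRAY form.
func plainFromIndices(dict [][]byte, indices []int32) ([]byte, error) {
	var out []byte
	var lenBuf [4]byte
	for _, idx := range indices {
		if idx < 0 || int(idx) >= len(dict) {
			return nil, fmt.Errorf("index %d out of range for dictionary of size %d", idx, len(dict))
		}
		v := dict[idx]
		// PLAIN BYTE_ARRAY: 4-byte little-endian length prefix, then the bytes.
		binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(v)))
		out = append(out, lenBuf[:]...)
		out = append(out, v...)
	}
	return out, nil
}

func main() {
	dict := [][]byte{[]byte("a"), []byte("bc")}
	out, err := plainFromIndices(dict, []int32{1, 0, 1})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: %v\n", len(out), out)
}
```

Because every buffered index maps back to exactly one dictionary entry, the re-encoded output carries the same values in the same order, which is what allows the dictionary to be discarded.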
### Are these changes tested?
Yes, with tests included in this PR and end-to-end while testing
compaction in iceberg-go.
### Are there any user-facing changes?
No public API changes; the only observable difference should be the
dropped double encoding.
1 parent cb314d6 commit 2b2aa6b
12 files changed
Lines changed: 1003 additions & 126 deletions