Fix row index miscalculation in ParquetLoader #810
index miscalculated
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #810 +/- ##
===================================
Coverage 81% 81%
===================================
Files 54 54
Lines 7617 7630 +13
===================================
+ Hits 6143 6157 +14
+ Misses 1474 1473 -1
|
|
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #810 +/- ##
===================================
- Coverage 81% 81% -0%
===================================
Files 54 54
Lines 7617 7631 +14
===================================
+ Hits 6144 6155 +11
- Misses 1473 1476 +3
|
5eddfe6 to bac1292
|
@Borda I had to fix three unrelated issues that were causing CI problems:
|
Thank you for reaching out; however, you should ping the maintainers ⚡ |
|
@tchaton @justusschock Any feedback here? |
Before submitting
What does this PR do?
Fixes #809.
This replaces the uniform-size arithmetic with a per-chunk prefix sum of row-group sizes, computed once when the ParquetFile is first opened, and then uses `bisect.bisect_right(offsets, row_index) - 1` to locate the group and `row_index - offsets[group]` for the offset inside it.

The same `num_rows_per_row_group` value is also used in the cache-eviction check at litData/src/litdata/streaming/item_loader.py, line 749 in 1fdfad7. But that check needs the actual size of the current group (`offsets[g+1] - offsets[g]`); otherwise, with uneven groups, memory either leaks (the threshold is never hit) or is freed too early (forcing re-reads).
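
For reviewers, here is a minimal, self-contained sketch of the lookup described above. It is an illustration under stated assumptions, not the code in this PR: the helper names `build_offsets`, `locate_row`, and `row_group_size` are hypothetical, and the row-group sizes are hard-coded rather than read from Parquet metadata (with pyarrow they could presumably be taken from `ParquetFile.metadata.row_group(i).num_rows`).

```python
import bisect
from itertools import accumulate


def build_offsets(row_group_sizes):
    """Prefix sums: offsets[g] is the global index of the first row of group g.

    The final entry is the total number of rows in the chunk.
    """
    return [0, *accumulate(row_group_sizes)]


def locate_row(offsets, row_index):
    """Map a global row index to (row group, offset inside that group)."""
    if not 0 <= row_index < offsets[-1]:
        raise IndexError(f"row_index {row_index} out of range")
    # bisect_right returns the first position whose offset is > row_index,
    # so subtracting 1 gives the group that contains the row.
    group = bisect.bisect_right(offsets, row_index) - 1
    return group, row_index - offsets[group]


def row_group_size(offsets, group):
    """Actual number of rows in `group`, for use in an eviction threshold."""
    return offsets[group + 1] - offsets[group]


# Hypothetical chunk with uneven row groups (sizes are made up for the example).
sizes = [3, 5, 2]
offsets = build_offsets(sizes)

group, inner = locate_row(offsets, 7)
print(offsets, group, inner, row_group_size(offsets, group))
```

With these sizes, uniform arithmetic based on the first group's size (`7 // 3`, `7 % 3`) would place row 7 in group 2 at offset 1, whereas the prefix-sum lookup places it in group 1 at offset 4. The same `offsets` list also gives the per-group size that the eviction check needs.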