Fix row index miscalculation in ParquetLoader by vini-fda · Pull Request #810 · Lightning-AI/litData

vini-fda · 2026-04-24T03:32:34Z

Before submitting

Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?

What does this PR do?

Fixes #809.

This replaces the uniform-size arithmetic with a per-chunk prefix sum of row-group sizes, computed once when the ParquetFile is first opened. It then uses bisect.bisect_right(offsets, row_index) - 1 to locate the row group and row_index - offsets[group] to get the row's offset inside it.
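The lookup described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: build_offsets and locate_row are hypothetical helper names, and the row-group sizes are hard-coded here, whereas in the loader they would come from the Parquet file's row-group metadata.

```python
import bisect
from itertools import accumulate


def build_offsets(row_group_sizes):
    """Prefix sums: offsets[g] is the global index of the first row of group g.

    The final entry is the total number of rows in the chunk.
    """
    return [0, *accumulate(row_group_sizes)]


def locate_row(offsets, row_index):
    """Map a global row index to (row_group, row_within_group)."""
    if not 0 <= row_index < offsets[-1]:
        raise IndexError(f"row_index {row_index} out of range")
    # bisect_right returns the first position whose offset is > row_index,
    # so the group containing the row is the one just before it.
    group = bisect.bisect_right(offsets, row_index) - 1
    return group, row_index - offsets[group]


# Hypothetical chunk with uneven row groups of 3, 5 and 2 rows.
offsets = build_offsets([3, 5, 2])  # [0, 3, 8, 10]
group, row_in_group = locate_row(offsets, 7)  # row 7 is the 5th row of group 1
```

Because bisect_right skips past equal offsets, an empty row group (two identical consecutive offsets) is never selected, which a fixed-size division cannot guarantee.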

The same num_rows_per_row_group value is also used in the cache-eviction check at litData/src/litdata/streaming/item_loader.py (line 749 in 1fdfad7):

    if read_count >= num_rows_per_row_group:

But that check needs the actual size of the current group (offsets[g+1] - offsets[g]). Otherwise, with uneven groups, memory either leaks (the read count never reaches the threshold) or is freed too early (forcing re-reads).
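A small sketch of the two failure modes, under stated assumptions: row_group_size and should_evict are hypothetical names, the offsets are made up, and assumed_uniform_size stands in for a single size applied to every group. It is not the loader's real eviction code.

```python
def row_group_size(offsets, group):
    """Actual number of rows in `group`, derived from the prefix-sum offsets."""
    return offsets[group + 1] - offsets[group]


def should_evict(read_count, offsets, group):
    """True once every row of `group` has been read."""
    return read_count >= row_group_size(offsets, group)


offsets = [0, 3, 8, 10]    # hypothetical row groups of 3, 5 and 2 rows
assumed_uniform_size = 3   # hypothetical: one size assumed for all groups

# Group 1 holds 5 rows: the uniform threshold fires after only 3 reads,
# so the group is freed too early and must be read again.
freed_too_early = 3 >= assumed_uniform_size
per_group_after_3 = should_evict(3, offsets, 1)

# Group 2 holds 2 rows: the uniform threshold is never reached,
# so the group stays in memory after it has been fully read.
never_freed = not (2 >= assumed_uniform_size)
per_group_after_2 = should_evict(2, offsets, 2)
```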

codecov · 2026-04-24T09:48:21Z

Codecov Report

❌ Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (1fdfad7) to head (3bdc6b5).

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
  Coverage    81%    81%           
===================================
  Files        54     54           
  Lines      7617   7630   +13     
===================================
+ Hits       6143   6157   +14     
+ Misses     1474   1473    -1

codecov-commenter · 2026-05-26T15:40:28Z

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (5213544) to head (ea2d8b8).
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
- Coverage    81%    81%   -0%     
===================================
  Files        54     54           
  Lines      7617   7631   +14     
===================================
+ Hits       6144   6155   +11     
- Misses     1473   1476    +3

vini-fda · 2026-05-26T17:22:36Z

@Borda I had to fix three unrelated issues that were causing CI problems:

1. Example on macos-14, python 3.11: https://github.com/Lightning-AI/litData/actions/runs/26457907073/job/77897116939?pr=810

   Tests using clean_pq_index_cache all share ~/.lightning/chunks and the fixture shutil.rmtrees it on setup/teardown. Under pytest -n 2 two of them landed on different xdist workers and deleted each other's chunks mid-iteration. Fixed in bac1292 by auto-tagging those tests with xdist_group("hf_default_cache") and adding --dist=loadgroup to the CI invocation so they serialize onto one worker. The first attempt still scattered them across workers because my pytest_collection_modifyitems hook ran after xdist's own hook had already finalized nodeids; 8e3ffed adds @pytest.hookimpl(tryfirst=True) so the marker is in place before xdist appends the @group suffix that LoadGroupScheduling uses for routing.

2. Example on windows-2022, python 3.11: https://github.com/Lightning-AI/litData/actions/runs/26457907073/job/77897116885?pr=810

   get_parquet_indexer_cls runs urlparse on the input path. On Windows a path like C:\Users\... parses with scheme='c', so it failed the scheme in ("local", "") check and raised ValueError. Fixed in bcb13c9 by treating any single-letter alphabetic scheme as a Windows drive letter and dispatching to LocalParquetDir.

3. Example on ubuntu-22.04, python 3.11:

   optimize() writes intermediate chunks to tempfile.gettempdir() + "/chunks" (i.e. /tmp/chunks on Linux), configurable via DATA_OPTIMIZER_CACHE_FOLDER. Two streaming tests calling optimize() on different xdist workers both wrote to the same /tmp/chunks and the upload worker raced its sibling's cleanup, producing FileNotFoundError: '/tmp/chunks/chunk-0-0.bin'. Every optimize()-using test in tests/processing/test_data_processor.py already monkeypatches that env var to a per-test tmp dir, and the two streaming tests just didn't. Fixed in ea2d8b8 by adding the same monkeypatch.
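For the macOS issue, the marker hook could look roughly like the conftest.py sketch below. It is an approximation written from the description above, not the code in bac1292 or 8e3ffed; in particular, detecting the tests through item.fixturenames is an assumption about how the auto-tagging is done.

```python
import pytest

SHARED_CACHE_FIXTURE = "clean_pq_index_cache"
GROUP_NAME = "hf_default_cache"


# tryfirst=True asks pytest to run this implementation before other plugins'
# pytest_collection_modifyitems hooks, so the marker already exists when
# pytest-xdist derives its "@group" routing suffix from it.
@pytest.hookimpl(tryfirst=True)
def pytest_collection_modifyitems(config, items):
    for item in items:
        # Assumption: every affected test requests the shared-cache fixture.
        if SHARED_CACHE_FIXTURE in getattr(item, "fixturenames", ()):
            item.add_marker(pytest.mark.xdist_group(GROUP_NAME))
```

The marker only takes effect when the run uses group-aware scheduling, for example pytest -n 2 --dist=loadgroup as in the CI change described above.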
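The Windows drive-letter check from the second item can be illustrated with the standard library alone. The helper name is_local_path is hypothetical, and the sketch only shows the scheme test, not how get_parquet_indexer_cls dispatches to LocalParquetDir.

```python
from urllib.parse import urlparse


def is_local_path(path: str) -> bool:
    """Return True if `path` should be treated as a local directory.

    urlparse splits at the first colon, so "C:\\Users\\me" yields scheme "c".
    A single alphabetic character is therefore read as a drive letter,
    not as a URL scheme.
    """
    scheme = urlparse(path).scheme
    if scheme in ("local", ""):
        return True
    return len(scheme) == 1 and scheme.isalpha()


windows_path_is_local = is_local_path(r"C:\Users\me\dataset")
remote_path_is_local = is_local_path("s3://bucket/dataset")
```

Real URL schemes are longer than one character in practice, which is what makes the single-letter rule a reasonable heuristic.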

Borda · 2026-05-26T19:11:37Z

> @Borda I had to fix three unrelated issues that were causing CI problems

Thank you for reaching out; however, you should ping the maintainers ⚡

vinicius-freitas-nubank · 2026-05-26T21:13:09Z

@tchaton @justusschock Any feedback here?

fix bug where parquet tables with non-uniform row group sizes get their index miscalculated

b1068e7

vini-fda requested review from justusschock and tchaton as code owners April 24, 2026 03:32

remove test with shuffle due to ruff lint

3bdc6b5

Borda approved these changes Apr 24, 2026

Merge branch 'main' into fix/parquet-loader-outofbounds

f77234f

force sequential execution on tests sharing the global default cache dir

bac1292

vini-fda force-pushed the fix/parquet-loader-outofbounds branch from 5eddfe6 to bac1292 on May 26, 2026 16:29

vini-fda added 3 commits May 26, 2026 13:38

fix: handle Windows drive letters in get_parquet_indexer_cls

bcb13c9

fix: run xdist_group marker hook before xdist processes nodeids

8e3ffed

fix: isolate optimize() cache dir per test to avoid /tmp/chunks race

ea2d8b8

vini-fda wants to merge 7 commits into Lightning-AI:main from vini-fda:fix/parquet-loader-outofbounds


4 participants
